Large language models (LLMs) are increasingly being integrated into clinical systems, creating a need for effective evaluation methods.
Current static benchmarks may not accurately reflect the practical utility of these models in real-world scenarios.
The study suggests that new evaluation approaches are needed to better predict query-level rejection risk in clinical applications.