Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes
CoRR(2024)
摘要
The field of healthcare has increasingly turned its focus towards Large
Language Models (LLMs) due to their remarkable performance. However, their
performance in actual clinical applications has been underexplored. Traditional
evaluations based on question-answering tasks don't fully capture the nuanced
contexts. This gap highlights the need for more in-depth and practical
assessments of LLMs in real-world healthcare settings. Objective: We sought to
evaluate the performance of LLMs in the complex clinical context of adult
critical care medicine using systematic and comprehensible analytic methods,
including clinician annotation and adjudication. Methods: We investigated the
performance of three general LLMs in understanding and processing real-world
clinical notes. Concepts from 150 clinical notes were identified by MetaMap and
then labeled by 9 clinicians. Each LLM's proficiency was evaluated by
identifying the temporality and negation of these concepts using different
prompts for an in-depth analysis. Results: GPT-4 showed overall superior
performance compared to other LLMs. In contrast, both GPT-3.5 and
text-davinci-003 exhibit enhanced performance when the appropriate prompting
strategies are employed. The GPT family models have demonstrated considerable
efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
Conclusion: A comprehensive qualitative performance evaluation framework for
LLMs is developed and operationalized. This framework goes beyond singular
performance aspects. With expert annotations, this methodology not only
validates LLMs' capabilities in processing complex medical data but also
establishes a benchmark for future LLM evaluations across specialized domains.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要