EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings
CoRR (2024)
Abstract
This study introduces EHRNoteQA, a novel patient-specific question answering
benchmark tailored for evaluating Large Language Models (LLMs) in clinical
environments. Based on the MIMIC-IV Electronic Health Record (EHR) database, a
team of three medical professionals curated a dataset comprising 962 unique
questions, each linked to a specific patient's EHR clinical notes. EHRNoteQA
differs from existing EHR-based benchmarks in two ways. First, it is the first
such dataset to adopt a multiple-choice question answering format, a design
choice that yields reliable scores under automatic evaluation compared to
other formats. Second, it requires analyzing multiple clinical notes to answer
a single question, reflecting the complexity of real-world clinical
decision-making, where clinicians review extensive patient histories. Our
comprehensive evaluation of various large language models showed that their
scores on EHRNoteQA correlate more closely with their clinician-evaluated
performance on real-world medical questions than their scores on other LLM
benchmarks do. This underscores the significance of EHRNoteQA for evaluating
LLMs in medical applications and highlights its role in facilitating the
integration of LLMs into healthcare systems. The dataset will be made
available to the public under PhysioNet credentialed access, promoting further
research in this vital field.