大语言模型与人类在医患沟通中任务意图理解差异分析
Analysis of Differences in Task Intent Understanding between LLMs and Humans in Doctor-Patient Communication
DOI: 10.12677/airr.2025.146125    国家社会科学基金支持
作者: 何明净, 陈 梅*, 刘小英:中央民族大学信息工程学院,北京;程怡凡, 郭婉蓉, 赵天宇:中央民族大学生命与环境科学学院,北京
关键词: 大模型,意图理解,医患沟通
Keywords: LLM, Intent Understanding, Doctor-Patient Communication
摘要: 大语言模型(LLMs)已广泛应用于医患沟通场景,能够生成符合常识和语言规范的回答。然而,它们是否具备与人类相当的任务意图理解能力,仍是一个关键问题。为探讨这一差异,本文设计了双盲实验,对比多个主流模型与人类在面对相同医疗问题时的任务理解方式。我们将模型生成的评分标准视为其对任务意图的显性表达,并通过语义熵与加权置信度等指标,评估其在评分任务中的执行表现。结果显示,模型在执行自身生成的评分标准时,其加权置信度显著高于执行人类评分标准时的水平。进一步分析表明,模型在构建评分标准时倾向于将临床表达中的复杂语义拆解为表层片段,聚焦于局部信息点,难以还原人类评分标准中所体现的临床推理链、语境敏感性和整体性判断。为验证这一理解偏差是否可通过输入调整加以缓解,我们设计了表达方式干预实验,发现通过引入动词、形容词、副词等语言结构约束对医疗指令进行微调,能够显著提升模型在执行人工评分标准时的表现。
Abstract: Large Language Models (LLMs) have been widely applied in doctor-patient communication scenarios, capable of generating responses that align with common sense and linguistic norms. However, whether they possess task intent understanding abilities comparable to those of humans remains a critical question. To explore this discrepancy, this paper designed a double-blind experiment to compare the task comprehension approaches of multiple mainstream models and humans when confronted with identical medical questions. We regarded the scoring criteria generated by the models as their explicit expressions of task intent and evaluated their performance in scoring tasks using metrics such as semantic entropy and weighted confidence. The results revealed that the models exhibited significantly higher weighted confidence when applying their own generated scoring criteria compared to when adhering to human-derived scoring standards. Further analysis indicated that when constructing scoring criteria, models tended to decompose complex semantics in clinical expressions into superficial fragments, focusing on localized information points and struggling to reconstruct the clinical reasoning chains, contextual sensitivity, and holistic judgment embodied in human scoring standards. To verify whether this comprehension bias could be mitigated through input adjustments, we designed an intervention experiment on expression styles and found that fine-tuning medical instructions by introducing linguistic structural constraints, such as verbs, adjectives, and adverbs, could significantly enhance the models’ performance when executing human-derived scoring standards.
文章引用:何明净, 陈梅, 刘小英, 程怡凡, 郭婉蓉, 赵天宇. 大语言模型与人类在医患沟通中任务意图理解差异分析[J]. 人工智能与机器人研究, 2025, 14(6): 1339-1350. https://doi.org/10.12677/airr.2025.146125

1. 引言

大语言模型(LLMs)在辅助诊断、患者教育、健康咨询和心理咨询等医疗领域展现出变革性潜力[1]。其强大的指令跟随能力和上下文生成能力,使其成为解决医疗资源短缺、减轻医护人员负担的有力工具[2] [3]。ChatGPT、DeepSeek和Gemini等模型在医学知识问答、医师资格考试等标准化评估中,已达到甚至在部分情况下超越人类专家的水平[4] [5]。

现有研究大多聚焦于事实准确性、医学幻觉发生率、推理链(思维链)连贯性以及安全对齐等指标[6]-[10]。然而,这些评估主要停留在输出层面,极少探讨大语言模型在处理输入时是否具备与人类相当的任务意图理解能力。本研究将任务意图理解定义为:模型在接收用户输入后,准确解析目标类型(如诊断、治疗建议、预防咨询)以及生成恰当回应所需逻辑框架的能力。它涵盖情境判断、未明示需求及现实约束。在真实的医患互动中,患者提问极少是中性信息请求,而是融合多重目标。例如,有胰腺炎病史的患者可能不仅会询问腹痛是否为复发迹象,还会询问超声检查是否足以进行随访。同样,糖尿病患者可能在询问胰岛素剂量调整的同时,表达对安全性及日常生活质量的担忧。人类医生能够解析此类多层次意图,但大语言模型是否具备类似、与人类对齐的任务意图理解能力,仍是一个待解问题[11]。

我们认为,仅通过输出导向的评估无法捕捉这一差距,也不能简单归因于医学知识或表达能力的不足[12]。我们的假设是,大语言模型与人类在任务意图理解上的差异,体现在它们所生成的评估标准中。为验证这一点,我们引入了一种新型“反向评估”范式,即要求模型和人类专家针对相同的医学问题,分别生成评分标准和问题[13] [14]。该方法将评估者的优先级和推理过程外化,从而能够直接比较任务解读方式。

为检验这一假设,我们设计了一项包含四个阶段的双盲对比实验(图1)。首先,在评分标准生成阶段,模型和专家针对相同的医学问题独立生成评估问题。其次,在回应生成与评估阶段,使用纳入语义熵的加权置信度指标(详见方法),依据模型和人类定义的双重标准对模型回应进行评估。第三,在差异分析阶段,我们比较了诊断、治疗、预防和咨询任务中的评分维度和关注点,以识别模型中可能存在的基于模板的评分倾向。最后,通过在指令中嵌入明确的任务导向语言约束(如动词、形容词、副词)并生成新的回应,我们重新评估了在优化条件下任务理解与适应性的变化。

Figure 1. Experimental workflow. Four-phase double-blind study: (1) Scoring Criteria Generation, where models and experts independently generated evaluation questions for identical medical queries; (2) Response Generation and Evaluation, where model responses were assessed against both model- and human-defined criteria using a weighted confidence metric that incorporates semantic entropy; (3) Difference Analysis, where the scoring dimensions and focal points of model and human evaluators were compared across diagnosis, treatment, prevention, and consultation tasks to identify template-based scoring tendencies; (4) Linguistic Refinement, where explicit task-oriented linguistic constraints (e.g., verbs, adjectives, adverbs) were embedded into the instructions, new responses were generated, and evaluation results were reassessed under the optimized conditions

图1. 实验流程。本双盲研究包含四个阶段:(1) 评分标准生成阶段,模型和专家针对相同的医学问题独立生成评估问题;(2) 回应生成与评估阶段,采用纳入语义熵的加权置信度指标,依据模型和人类定义的双重标准对模型回应进行评估;(3) 比较模型与人类评估者在诊断、治疗、预防和咨询任务中的评分维度与关注点,评估模型是否存在基于模板的评分倾向;(4) 通过在指令中嵌入明确的任务导向语言约束(如动词、形容词、副词)并生成新的回应,重新评估在优化条件下评估结果的变化

2. 指标说明

2.1. 加权置信度指标

在评分执行阶段,本研究引入加权置信度指标,以更精准地评估LLMs在不同评分标准下对任务意图的理解能力及判断稳定性。我们采用Farquhar等人(2024年) [15]提出的语义熵框架来估算模型判断的不确定性。该框架将多个采样输出聚类为语义等价组,并基于语义分布计算熵,而非基于词级(token-level)的变异。通过将语义层面的不确定性纳入置信度指标,我们对不稳定或模糊的概率分布进行惩罚,从而为任务理解和判断稳定性提供更稳健的评估依据。

针对每个评分问题,模型会输出各选项(如“是”或“否”)的概率分布,以此表示其在语义层面的置信度。语义熵用于量化判断的模糊性:接近均匀分布的概率(如0.51对0.49)会产生高熵值,表明判断不稳定;而偏态分布(如0.95对0.05)则熵值较低,反映判断更果断且一致。加权置信度指标将基于归一化熵的惩罚项融入原始置信度得分,降低高语义模糊性回答的影响。与单纯基于概率的评分方法相比,该方法能更有效地区分以下情况:模型在不同评分标准下生成表面相似但语义理解深度不同的输出,从而为任务执行过程中认知偏差与不确定性的评估提供更具可解释性和可靠性的度量。
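下面给出一个最简示意(Python),说明如何由模型对各选项的概率分布计算语义熵,并以归一化熵作为惩罚项构造加权置信度;其中熵惩罚与原始置信度的乘性组合方式为示意性假设,原文并未给出具体的封闭形式公式。

```python
import numpy as np

def semantic_entropy(probs):
    """对语义等价选项上的概率分布计算香农熵(以2为底)。"""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def weighted_confidence(probs):
    """加权置信度:以归一化熵为惩罚项,对原始置信度(最大选项概率)做乘性折减。
    注意:此组合方式仅为示意性假设。"""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    h = semantic_entropy(p)
    h_max = np.log2(len(p))            # 该分布熵的最大可能值,用于归一化
    penalty = h / h_max if h_max > 0 else 0.0
    return float(p.max() * (1.0 - penalty))

# 果断判断(0.95 对 0.05)的置信度几乎不受惩罚;
# 接近均匀的判断(0.51 对 0.49)熵接近 1,置信度被大幅折减
print(weighted_confidence([0.95, 0.05]))
print(weighted_confidence([0.51, 0.49]))
```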

2.2. 任务适应度

(1) 卡方独立性检验(Chi-square test of independence)

目的:检验评分分布是否因任务类型不同而存在显著差异

方法:

针对每个模型,构建一个4 (任务类型) × 10 (维度)的列联表:

$O_{ij}$:第 $i$ 类任务在第 $j$ 个维度上的观测频数

$N = \sum_{i,j} O_{ij}$:总样本量

在独立性假设下,期望频数的计算公式为:$E_{ij} = \dfrac{\left(\sum_{j} O_{ij}\right)\left(\sum_{i} O_{ij}\right)}{N}$

卡方统计量的计算公式为:$\chi^2 = \sum_{i=1}^{4}\sum_{j=1}^{10} \dfrac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}$

自由度:df = (4 − 1) × (10 − 1) = 27

根据卡方分布计算p值,以判断分布差异是否具有统计学意义。
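作为示意,以下Python代码演示如何利用scipy对一个4 (任务类型) × 10 (维度)的列联表执行卡方独立性检验;表中频数为随机生成的虚构数据,仅用于说明计算流程。

```python
import numpy as np
from scipy.stats import chi2_contingency

# 虚构的 4(任务类型) x 10(维度 A-J) 观测频数列联表
rng = np.random.default_rng(0)
observed = rng.integers(1, 20, size=(4, 10))

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, df = {dof}")  # df 应为 (4-1)*(10-1) = 27
```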

(2) Cramer’s V

目的:用于衡量任务类型与评分分布之间关联强度(效应量),作为卡方检验的补充分析。

计算方法:$V = \sqrt{\dfrac{\chi^2}{N(k-1)}}$,其中 $k = \min(\text{行数}, \text{列数}) = \min(4, 10) = 4$。

因此,在本研究中,$V = \sqrt{\dfrac{\chi^2}{3N}}$,且 $V \in [0, 1]$。

V值越大,表明任务类型与评分分布之间的关联性越强。经验性解释标准:

V ≈ 0.10:弱关联

V ≈ 0.20:中等关联

V ≳ 0.30:强关联
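以下为Cramér's V的计算示意(Python);示例中的χ²值与总频数均为虚构数字,仅用于演示公式的使用方式。

```python
import numpy as np

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (N * (k - 1))),其中 k = min(行数, 列数)。"""
    k = min(n_rows, n_cols)
    return float(np.sqrt(chi2 / (n * (k - 1))))

# 虚构示例:chi2 = 75.0,总频数 N = 400,4 x 10 列联表(k = 4)
print(round(cramers_v(75.0, 400, 4, 10), 4))  # sqrt(75 / (3 * 400)) = 0.25
```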

(3) Jensen-Shannon (JS)距离

目的:用于衡量模型在不同任务间评分分布的差异程度,并通过热力图直观展示任务间的过渡关系。

计算方法:

针对每个模型,首先统计其在各任务下10个评分维度(A-J)的频数,记录为向量:

$c^{(t)} = \left(c_1^{(t)}, c_2^{(t)}, \ldots, c_{10}^{(t)}\right)$

将其归一化为概率分布:$p^{(t)} = \dfrac{c^{(t)}}{\sum_{j} c_j^{(t)}}$

对于任意两个任务ab,其詹森–香农散度定义为:

$\mathrm{JSD}\left(p^{(a)}, p^{(b)}\right) = \dfrac{1}{2}\sum_{j=1}^{10} p_j^{(a)} \log_2 \dfrac{p_j^{(a)}}{m_j} + \dfrac{1}{2}\sum_{j=1}^{10} p_j^{(b)} \log_2 \dfrac{p_j^{(b)}}{m_j}$,其中 $m_j = \dfrac{1}{2}\left(p_j^{(a)} + p_j^{(b)}\right)$

JS距离通过取平方根得到:$\mathrm{JS}_{\mathrm{dist}} = \sqrt{\mathrm{JSD}} \in [0, 1]$

JS距离值越大,表明两个任务间评分分布的差异越显著。将四个任务两两计算的JS距离结果绘制为4 × 4热力图。
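以下Python示意代码演示如何将两类任务下的维度频数归一化为概率分布,并利用scipy直接计算JS距离(即JS散度的平方根,base=2时取值范围为[0, 1]);示例中的频数为虚构数据。

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def task_distribution(counts):
    """将 10 个评分维度 (A-J) 的频数向量归一化为概率分布。"""
    c = np.asarray(counts, dtype=float)
    return c / c.sum()

# 虚构示例:同一模型在"诊断"与"预防"两类任务下的维度频数
p_diag = task_distribution([12, 8, 5, 3, 6, 4, 2, 1, 5, 4])
p_prev = task_distribution([4, 3, 2, 9, 2, 10, 6, 5, 3, 6])

# jensenshannon 直接返回 JS 距离(即 JS 散度的平方根)
js_dist = jensenshannon(p_diag, p_prev, base=2)
print(f"JS distance = {js_dist:.3f}")
```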

3. 实验结果

3.1. 模型和人类评分标准评估

研究使用来自iCliniq.com数据集的150个患者咨询问题,由四个模型(GPT-4o、DeepSeek-R1、LLaMA-3-70B、Gemini-1.5-Pro)和人类专家独立生成评分问题,以捕捉任务意图理解及关键评估维度[16] [17]。在评估阶段,根据模型和人类定义的双重标准对模型回答进行评估,结果显示在人类标准下,模型的加权置信度持续下降。平均而言,各模型的加权置信度下降了15至25个百分点,其中DeepSeek-R1与人类标准的契合度最高,而Gemini-1.5-Pro契合度最低(图2)。加权置信度指标结合了模型内部置信度与语义熵,揭示了任务意图理解方面存在的系统性差距:模型过度依赖语言模式和结构化评分路径,未能捕捉人类标准中强调的临床推理深度和整体性判断。这些结果凸显了大语言模型与人类在任务意图理解方面存在的持续差异,以及在临床情境中提升认知契合度的必要性[18]。

Figure 2. Evaluation comparison of weighted confidence in model responses under model-defined and human-defined scoring criteria. Weighted confidence scores for model responses under own-defined criteria (dark blue) and human-defined criteria (light blue); scores were consistently lower under human criteria, indicating systematic gaps in task intent understanding

图2. 模型与人类评分标准下模型回复加权置信度的评估对比。图中展示了模型回应在其自身定义标准(深蓝色)和人类定义标准(浅蓝色)下的加权置信度得分,结果显示在人类定义标准下得分始终更低,这表明模型在任务意图理解方面存在系统性差距

通过人工对比模型与人类生成的评分问题,发现两者在任务意图理解上存在差异。模型通常遵循语言模式,逐项剖析症状,生成碎片化的评分点。尽管这种方式强调了覆盖范围和逻辑性,但往往忽视了临床机制和整体性意图。相比之下,人类专家则优先考虑推理链条和综合病例评估,整合了病因、治疗方案、共病情况及权威资料。

表1 (孕前使用羟考酮)所示,模型生成的评分问题主要剖析表层症状或关键词,对临床因果关系或治疗复杂性关注有限。因此,其评估往往偏向孤立细节,未能捕捉所需的更广泛的临床推理。

Table 1. Example of differences between model- and human-generated scoring questions. Gemini-generated scoring questions emphasized surface-level risks but failed to capture key contextual details (husband as the user, timing during pregnancy preparation), unlike human expert questions

表1. 模型与人类生成的评分问题差异示例。与人类专家提出的问题不同,Gemini模型生成的评分问题侧重于表层风险,却未能捕捉关键情境细节(如用户为患者丈夫、处于备孕阶段)

患者问题:

“Hello, my husband is taking Oxycodone due to a broken leg/surgery. He has been taking this pain medication for one month. We are trying to conceive our second baby. Will this medication affect the fetus? Or the health of the baby? Or can it cause birth defects? Thank you.”

Gemini评分标准:

“回复是否承认了用户对羟考酮(Oxycodone)影响妊娠的担忧?”
“回复是否解释了孕期使用羟考酮可能带来的风险,包括新生儿戒断综合征(NAS)?”
“回复是否提及了孕期使用羟考酮可能导致出生缺陷的可能性?”
“回复是否建议咨询医生或专家以获取个性化建议?”
“回复是否建议与患者丈夫的医生讨论其他疼痛管理方案?”

人类专家评分标准:

“答案是否准确解释了男性在备孕期间服用奥施康定(OxyContin)对母亲受孕几率及胎儿健康的间接影响?”
“答案是否考虑了奥施康定的服用方式(用药途径)对受孕几率及胎儿健康的影响?”
“答案是否考虑了可能与奥施康定联合用于腿部损伤治疗的药物对胎儿健康的影响?”
“答案是否符合人类医学知识,且内容来源于权威且准确的参考资料?”
“答案是否针对患者关切进行回应,未偏离主题或引入无关细节?”

3.2. 人类和模型评分标准的具体差异分析

为评估模型生成评分问题的覆盖范围和结构,我们将这些问题归类到十个预先定义的类别(A-J)中,这些类别反映了诸如完整性、诊断相关性、治疗适宜性、可操作性以及解释深度等关键维度(详见附录表A1)。我们利用Claude模型将每个问题自动映射至这些类别。类别覆盖情况的结果见附录图A1
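作为示意,下述代码给出一种可能的自动归类实现方式(基于Anthropic官方Python SDK的messages接口);其中的提示词措辞与模型版本均为示意性假设,并非论文实际使用的配置。

```python
import anthropic

CATEGORIES = ("A 完整性 / B 诊断相关性 / C 治疗适宜性 / D 临床安全性 / E 鉴别性 / "
              "F 可操作性 / G 语气与共情 / H 审慎与不确定性 / I 机理解释 / J 个性化与实用性")

def classify_question(question: str) -> str:
    """调用 Claude 将单条评分问题映射到预定义类别 A-J,返回单个类别字母。"""
    client = anthropic.Anthropic()  # 需预先设置 ANTHROPIC_API_KEY 环境变量
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",   # 模型版本为示意性假设
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": f"请将以下医疗评分问题归入类别({CATEGORIES}),只回复一个类别字母。\n问题:{question}",
        }],
    )
    return msg.content[0].text.strip()
```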

为探究模型在多任务环境中的表现,我们分析了评分偏好在诊断、咨询、治疗和预防任务之间的迁移情况。我们采用卡方检验(配合Cramer’s V系数) (表2)和JS散度热力图(图3)来量化分布变化,以此评估模型的适应性、跨任务灵活性以及对输入语义的敏感度[19] [20]。

Table 2. Chi-square test and Cramer’s V results for scoring preference differences across diagnosis, consultation, treatment, and prevention tasks for four LLMs. Higher values indicate greater inter-task variation and task-specific adaptability

表2. 四个大语言模型(LLMs)在诊断、咨询、治疗和预防任务间评分偏好差异的卡方检验及Cramer’s V系数结果。数值越高,表明任务间差异越大且任务特定适应性越强

大模型 | χ² | p | Cramer’s V
GPT-4o | 75.00 | 0.0000 | 0.2503
DeepSeek-R1 | 35.75 | 0.1209 | 0.1726
LLaMA-3-70B | 83.26 | 0.0000 | 0.2634
Gemini-1.5-Pro | 44.53 | 0.0182 | 0.1924

(a) GPT-4o;(b) DeepSeek-R1;(c) Gemini-1.5-Pro;(d) LLaMA-3-70B

Figure 3. Jensen-Shannon (JS) divergence heatmaps for four LLMs across diagnosis, consultation, treatment, and prevention tasks. Higher JS divergence indicates greater shifts in scoring distributions between tasks, reflecting stronger task sensitivity and adaptability

图3. 四个大语言模型(LLMs)在诊断、咨询、治疗和预防任务中的詹森-香农(JS)散度热力图。JS散度越高,表明任务间评分分布的变化越大,反映出更强的任务敏感性和适应性

GPT-4o和LLaMA-3-70B在任务间表现出最为显著的差异(p < 0.001, V > 0.25),显示出较强的任务特定适应性。Gemini呈现出中等程度的差异(p = 0.018, V ≈ 0.19),而DeepSeek则未表现出显著差异(p = 0.121),反映出其评分模式较为稳定。JS散度的结果与上述发现一致:LLaMA的散度最高,特别是在诊断和预防任务之间;GPT-4o居中;DeepSeek和Gemini的散度较低,表明其评分存在模板化倾向。两种分析方法得出的结果一致,揭示了当前主流模型在动态调整评分策略能力上的明显差异。

3.3. 提示语修改策略

对任务间评分偏好变化的分析揭示了当前大语言模型在理解医疗咨询时存在的明显局限。模型常常将复杂的临床表述拆解为孤立、表面的要点,忽略了推理链条、上下文关联和整体评估。在需要细致医患沟通的任务中,这些模式尤为明显,凸显出模型在理解任务意图和对更深层次临床目标敏感性方面的不足[21]。

我们推测,这种局限在一定程度上源于输入指令的语言清晰度和侧重点。为验证这一推测,我们进行了精细的语言调整操作,通过修改动词、形容词和副词来增强语义重点和结构清晰度[22]。例如,将口语化或模棱两可的患者描述改写为精确、具有医学专业性的表述(具体示例见附录表A2)。

图4所示,实验评估表明,这些调整显著提升了模型表现:在人类评分标准下,大多数模型的加权置信度均有所提高,其中Gemini的增幅最大。例如,Gemini的加权置信度从低于1%大幅跃升至超过60%,GPT-4o和DeepSeek-R1也取得了适度提升,而LLaMA-3-70B的表现则基本保持稳定。这些结果说明,在任务指令中使用严谨、结构化的语言能够显著提升大语言模型的理解能力,尤其是对于那些在不同任务中原始评分偏好较为一致的模型而言。

Figure 4. Comparison of weighted confidence scores for four LLMs before (dark blue) and after (light blue) linguistic refinement under human-defined evaluation criteria. Results show consistent improvements across most models, with the most pronounced gains observed in Gemini

图4. 在人类定义的评估标准下,经语言优化前后(优化前为深蓝色,优化后为浅蓝色)四个大语言模型的加权置信度得分对比。结果显示,大多数模型的得分均有所提升,其中Gemini的提升最为显著

4. 结论和讨论

本研究系统对比了大语言模型与具有生物医学背景的人类标注者在任务理解和评分标准构建方面的差异,提出并验证了一个用于评估任务意图识别的框架。

先前的分析揭示了一种持续存在的分歧:模型在其自行生成的评分标准下取得了较高的加权置信度,但在人类定义的评分标准下却出现了显著下降。模型在不同评分标准下的表现揭示了其认知架构的固有局限性。

这种模式表明,模型更容易适应其自身构建的评价框架,但却难以理解人类评分中蕴含的任务意图、临床经验和情境判断[23]。在此,评分行为可被视为任务理解能力的一种体现,而评分指标的系统性差异凸显了模型与人类在任务意图理解路径上的根本分歧。

进一步分析表明,在生成评分标准时,模型倾向于对任务进行表层语义分解,关注可识别的语言特征,如信息是否存在或表达是否清晰。它们难以重构人类在实际临床判断中展现的推理结构和症状综合分析能力。此外,评分偏好结果显示,只有LLaMA和GPT在不同任务间表现出了一定程度的任务迁移能力,而其他模型则缺乏稳定的迁移策略。这往往表现为任务间评分偏好重叠、任务敏感性弱,以及评分问题模板化、缺乏灵活性。

为解决这一问题,我们引入了精细的语言干预。通过调整任务指令中的动词、形容词和副词,我们提高了指令的清晰度和语义重点,使模型评分与人类评分更加契合,其中Gemini的提升最为显著。这表明,大语言模型的任务感知并非固定不变,而是对输入形式敏感,通过受控的语言优化可以引导模型进行更接近人类的评价。

尽管本研究揭示了大语言模型在任务理解上的结构性偏差和行为特征,但仍存在一些局限性:其一,评分标准质量依赖人工设计,虽经有专业背景的人员验证,仍可能含主观因素;其二,现有指标无法全面捕捉认知过程;其三,研究对象以通用模型为主,未来需验证方法在医学微调模型上的性能。

基金项目

国家社会科学基金项目25CKX003。

附 录

Table A1. Predefined categories (A~J) used for classifying model-generated scoring questions. Each category represents a distinct evaluation dimension, including completeness, diagnostic relevance, treatment appropriateness, actionability, explanatory depth, and related aspects

表A1. 用于对模型生成的评分问题进行分类的预定义类别(A~J)。每个类别代表一个独特的评估维度,包括完整性、诊断相关性、治疗适宜性、可操作性、解释深度及相关方面

Category dimension | Category ID | Classification | Description
Medical Function | A | Completeness | Whether all relevant causes, symptoms, and medical history are covered.
Medical Function | B | Diagnostic Relevance | Whether the model focuses on core pathological mechanisms or clinical manifestations.
Medical Function | C | Treatment Appropriateness | Whether reasonable and safe treatments are recommended.
Medical Function | D | Clinical Safety | Whether urgent conditions are identified and safety-related advice, such as seeking medical care, is provided.
Medical Function | E | Differentiation | Whether different possible causes are distinguished.
Medical Function | F | Actionability | Whether actionable and personalized recommendations are provided.
Expression Quality | G | Tone/Empathy/Clarity | Whether a professional and reassuring tone is used to avoid causing panic.
Expression Quality | H | Caution/Uncertainty Mgmt | Whether warnings against self-diagnosis are given and boundaries of uncertainty are acknowledged.
Task Depth | I | Pathophysiology/Explanation | Whether mechanisms or pathological foundations are explained.
Task Depth | J | Personalization/Practicality | Whether practical advice is provided in light of the patient’s background.

(a) GPT-4o vs. Human;(b) DeepSeek-R1 vs. Human;(c) LLaMA-3-70B vs. Human;(d) Gemini-1.5-Pro vs. Human

Figure A1. Comparison of dimensional coverage between model- and human-generated scoring criteria across diagnosis, treatment, prevention, and consultation tasks. Models showed close alignment with human experts in the Medical Function dimension and performed comparably or better in Task Depth

图A1. 在诊断、治疗、预防和咨询任务中,模型生成与人类专家生成的评分标准在各维度覆盖范围上的对比。模型在医疗功能维度上与人类专家的契合度较高,且在任务深度维度上的表现相当或更优

Table A2. Examples of linguistic refinements applied to patient symptom descriptions. Key modifications include changes to verbs, adjectives, and adverbs to enhance clarity, emphasize critical clinical features (e.g., onset, severity, and movement-dependent symptoms), and better support accurate medical interpretation

表A2. 对患者症状描述进行语言润色的示例。主要修改包括调整动词、形容词和副词,以提升表述清晰度、突出关键临床特征(如发病时间、严重程度及与运动相关的症状),从而更有利于准确医学解读

Original:

“I woke up this morning feeling the whole room is spinning when i was sitting down. I went to the bathroom walking unsteadily, as i tried to focus i feel nauseous. I try to vomit but it wont come out. After taking panadol and sleep for few hours, i still feel the same. By the way, if i lay down or sit down, my head do not spin, only when i want to move around then i feel the whole world is spinning. And it is normal stomach discomfort at the same time? Earlier after i relieved myself, the spinning lessen so i am not sure whether its connected or coincidences. Thank you doc!”

Revised:

“I woke up this morning with a sudden, severe spinning sensation while sitting down. Upon standing and walking, I felt unsteady and nauseous, though I couldn’t vomit. The spinning persists only when I move; it stops when I sit or lie down. I also have normal stomach discomfort. After taking panadol and sleeping for a few hours, the spinning remained. Earlier, after using the bathroom, the spinning lessened briefly. Is this normal, and could the spinning and stomach discomfort be connected?”

Key Modification Points and Explanations:

“the whole room is spinning” → “a sudden, severe spinning sensation”
Effect: This change transforms a colloquial description (“the whole room is spinning”) into a precise medical chief complaint. The added adjectives “sudden” and “severe” provide crucial information for clinicians to assess the nature (acute onset) and severity of the condition.

“do not spin” → “stops”
Effect: This is a verb modification. “do not spin” is a simple state description, whereas “stops” is a strong action verb. It creates a stark contrast with “persists” from the first part of the sentence. This pattern of “onset with movement, cessation with rest” is a classic feature for diagnosing Benign Paroxysmal Positional Vertigo (BPPV). The revision makes the description more medically specific and indicative.

NOTES

*通讯作者。

参考文献

[1] Dave, T., Athaluri, S.A. and Singh, S. (2023) ChatGPT in Medicine: An Overview of Its Applications, Advantages, Limitations, Future Prospects, and Ethical Considerations. Frontiers in Artificial Intelligence, 6, Article 1169595.
[2] Baumgartner, C. (2023) The Potential Impact of ChatGPT in Clinical and Translational Medicine. Clinical and Translational Medicine, 13, e1206.
[3] Fan, C., Lu, Z. and Tian, J. (2025) Chinese-Vicuna: A Chinese Instruction-Following Llama-Based Model. arXiv: 2504.12737.
[4] Ayers, J.W., Poliak, A., Dredze, M., Leas, E.C., Zhu, Z., Kelley, J.B., et al. (2023) Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine, 183, 589-596.
[5] Gilson, A., Safranek, C.W., Huang, T., Socrates, V., Chi, L., Taylor, R.A., et al. (2023) How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education, 9, e45312.
[6] Wang, S., et al. (2025) A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains. arXiv: 2507.23486.
[7] Templin, T., Fort, S., Padmanabham, P., Seshadri, P., Rimal, R., Oliva, J., et al. (2025) Framework for Bias Evaluation in Large Language Models in Healthcare Settings. npj Digital Medicine, 8, Article No. 414.
[8] Sun, Z., Yim, W., Uzuner, Ö., Xia, F. and Yetisgen, M. (2025) A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination. Journal of Biomedical Informatics, 169, Article ID: 104866.
[9] Alessa, A., Lakshminarasimhan, A., Somane, P., Skirzynski, J., McAuley, J. and Echterhoff, J.M. (2025) How Much Content Do LLMs Generate That Induces Cognitive Bias in Users? arXiv: 2507.03194.
[10] Zhang, Z., et al. (2025) IHEval: Evaluating Language Models on Following the Instruction Hierarchy. arXiv: 2502.08745.
[11] He, Q., Zeng, J., Huang, W., Chen, L., Xiao, J., He, Q., et al. (2024) Can Large Language Models Understand Real-World Complex Instructions? Proceedings of the AAAI Conference on Artificial Intelligence, 38, 18188-18196.
[12] Zhao, W.X., et al. (2023) A Survey of Large Language Models. arXiv: 2303.18223.
[13] Wen, B., et al. (2024) Benchmarking Complex Instruction-Following with Multiple Constraints Composition. arXiv: 2407.03978.
[14] Lyu, X., Wang, Y., Hajishirzi, H. and Dasigi, P. (2024) HREF: Human Response-Guided Evaluation of Instruction Following in Language Models. arXiv: 2412.15524.
[15] Farquhar, S., Kossen, J., Kuhn, L. and Gal, Y. (2024) Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630, 625-630.
[16] Li, Y., Li, Z., Zhang, K., Dan, R., Jiang, S. and Zhang, Y. (2023) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus, 15, e40895.
[17] Deutsch, D., Bedrax-Weiss, T. and Roth, D. (2021) Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. Transactions of the Association for Computational Linguistics, 9, 774-789.
[18] Cheng, K., Li, Z., Guo, Q., Sun, Z., Wu, H. and Li, C. (2023) Emergency Surgery in the Era of Artificial Intelligence: ChatGPT Could Be the Doctor’s Right-Hand Man. International Journal of Surgery, 109, 1816-1818.
[19] Pearson, K. (1900) X. On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50, 157-175.
[20] Lin, J. (1991) Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37, 145-151.
[21] Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X. and Zhou, D. (2023) Large Language Models Cannot Self-Correct Reasoning Yet. arXiv: 2310.01798.
[22] Xu, C., et al. (2023) WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv: 2304.12244.
[23] Heo, J., et al. (2024) Do LLMs “Know” Internally When They Follow Instructions? arXiv: 2410.14516.