Analysis of Differences in Task Intent Understanding between LLMs and Humans in Doctor-Patient Communication
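DOI: 10.12677/airr.2025.146125. Supported by the National Social Science Fund of China.
Authors: He Mingjing (何明净), Chen Mei (陈梅)*, Liu Xiaoying (刘小英): School of Information Engineering, Minzu University of China, Beijing; Cheng Yifan (程怡凡), Guo Wanrong (郭婉蓉), Zhao Tianyu (赵天宇): College of Life and Environmental Sciences, Minzu University of China, Beijing
Keywords: LLM; Intent Understanding; Doctor-Patient Communication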
Abstract: Large language models (LLMs) are widely used in doctor-patient communication, where they generate responses consistent with common sense and linguistic norms. Whether they understand task intent as well as humans do remains a key open question. To examine this gap, we designed a double-blind experiment comparing how several mainstream models and humans interpret the task when given the same medical questions. We treat the scoring criteria a model generates as an explicit expression of its understanding of task intent, and we evaluate its performance on the scoring task with metrics including semantic entropy and weighted confidence. The models' weighted confidence was significantly higher when they applied their own scoring criteria than when they applied human-written criteria. Further analysis shows that, when constructing scoring criteria, models tend to break the complex semantics of clinical expressions into surface-level fragments and focus on local information points, and they struggle to recover the clinical reasoning chains, context sensitivity, and holistic judgment reflected in human criteria. To test whether this gap can be reduced by adjusting the input, we ran an intervention experiment on phrasing: refining the wording of the medical instructions with linguistic structural constraints (verbs, adjectives, and adverbs) significantly improved model performance when applying the human-written scoring criteria.
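For readers unfamiliar with semantic entropy, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes that several sampled model outputs have already been grouped into meaning clusters (in the original formulation of semantic entropy this grouping is done with bidirectional entailment); the function name `semantic_entropy` and the cluster labels are hypothetical. The weighted confidence metric is not defined in this abstract and is not reproduced here.

```python
import math
from collections import Counter


def semantic_entropy(cluster_labels):
    """Shannon entropy (in nats) over meaning clusters of sampled outputs.

    cluster_labels: one label per sampled output; outputs that share a
    meaning are assumed to share a label (clustering is done elsewhere).
    """
    if not cluster_labels:
        raise ValueError("need at least one sampled output")
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    # Sum of -p * log(p) over the empirical cluster probabilities.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster assignments for four sampled scoring judgments.
consistent = ["A", "A", "A", "A"]  # every sample means the same thing
split = ["A", "B", "A", "B"]       # samples disagree evenly

print(semantic_entropy(consistent))
print(semantic_entropy(split))
```

Lower values indicate that repeated samples agree in meaning; an even split across two clusters gives the two-cluster maximum of ln 2.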
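Cite this article: He Mingjing, Chen Mei, Liu Xiaoying, Cheng Yifan, Guo Wanrong, Zhao Tianyu. Analysis of Differences in Task Intent Understanding between LLMs and Humans in Doctor-Patient Communication [J]. Artificial Intelligence and Robotics Research (人工智能与机器人研究), 2025, 14(6): 1339-1350. https://doi.org/10.12677/airr.2025.146125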
