[1] Dave, T., Athaluri, S.A. and Singh, S. (2023) ChatGPT in Medicine: An Overview of Its Applications, Advantages, Limitations, Future Prospects, and Ethical Considerations. Frontiers in Artificial Intelligence, 6, Article 1169595.
[2] Baumgartner, C. (2023) The Potential Impact of ChatGPT in Clinical and Translational Medicine. Clinical and Translational Medicine, 13, e1206.
[3] Fan, C., Lu, Z. and Tian, J. (2025) Chinese-Vicuna: A Chinese Instruction-Following Llama-Based Model. arXiv: 2504.12737.
[4] Ayers, J.W., Poliak, A., Dredze, M., Leas, E.C., Zhu, Z., Kelley, J.B., et al. (2023) Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine, 183, 589-596.
[5] Gilson, A., Safranek, C.W., Huang, T., Socrates, V., Chi, L., Taylor, R.A., et al. (2023) How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Medical Education, 9, e45312.
[6] Wang, S., et al. (2025) A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains. arXiv: 2507.23486.
[7] Templin, T., Fort, S., Padmanabham, P., Seshadri, P., Rimal, R., Oliva, J., et al. (2025) Framework for Bias Evaluation in Large Language Models in Healthcare Settings. npj Digital Medicine, 8, Article No. 414.
[8] Sun, Z., Yim, W., Uzuner, Ö., Xia, F. and Yetisgen, M. (2025) A Scoping Review of Natural Language Processing in Addressing Medically Inaccurate Information: Errors, Misinformation, and Hallucination. Journal of Biomedical Informatics, 169, Article ID: 104866.
[9] Alessa, A., Lakshminarasimhan, A., Somane, P., Skirzynski, J., McAuley, J. and Echterhoff, J.M. (2025) How Much Content Do LLMs Generate That Induces Cognitive Bias in Users? arXiv: 2507.03194.
[10] Zhang, Z., et al. (2025) IHEval: Evaluating Language Models on Following the Instruction Hierarchy. arXiv: 2502.08745.
[11] He, Q., Zeng, J., Huang, W., Chen, L., Xiao, J., He, Q., et al. (2024) Can Large Language Models Understand Real-World Complex Instructions? Proceedings of the AAAI Conference on Artificial Intelligence, 38, 18188-18196.
[12] Zhao, W.X., et al. (2023) A Survey of Large Language Models. arXiv: 2303.18223.
[13] Wen, B., et al. (2024) Benchmarking Complex Instruction-Following with Multiple Constraints Composition. arXiv: 2407.03978.
[14] Lyu, X., Wang, Y., Hajishirzi, H. and Dasigi, P. (2024) HREF: Human Response-Guided Evaluation of Instruction Following in Language Models. arXiv: 2412.15524.
[15] Farquhar, S., Kossen, J., Kuhn, L. and Gal, Y. (2024) Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature, 630, 625-630.
[16] Li, Y., Li, Z., Zhang, K., Dan, R., Jiang, S. and Zhang, Y. (2023) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge. Cureus, 15, e40895.
[17] Deutsch, D., Bedrax-Weiss, T. and Roth, D. (2021) Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary. Transactions of the Association for Computational Linguistics, 9, 774-789.
[18] Cheng, K., Li, Z., Guo, Q., Sun, Z., Wu, H. and Li, C. (2023) Emergency Surgery in the Era of Artificial Intelligence: ChatGPT Could Be the Doctor’s Right-Hand Man. International Journal of Surgery, 109, 1816-1818.
[19] Pearson, K. (1900) X. On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50, 157-175.
[20] Lin, J. (1991) Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37, 145-151.
[21] Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X. and Zhou, D. (2023) Large Language Models Cannot Self-Correct Reasoning Yet. arXiv: 2310.01798.
[22] Xu, C., et al. (2023) WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv: 2304.12244.
[23] Heo, J., et al. (2024) Do LLMs “Know” Internally When They Follow Instructions? arXiv: 2410.14516.