Bias Assessment of Large Language Models in Gene Editing Technology through Information Extraction
DOI: 10.12677/airr.2025.146131. Supported by the National Social Science Fund of China.
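Authors: Chen Mei*, Gao Yang, Shi Lingang, He Mingjing, Liu Xiaoying: Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of the Ministry of Education, Minzu University of China, Beijing; School of Information Engineering, Minzu University of China, Beijing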
Keywords: Gene Editing; Information Extraction; Large Language Models (LLMs); Retrieval-Augmented Generation (RAG); Bias Assessment
Abstract: Gene editing is a rapidly advancing technology that combines substantial application potential with significant biosecurity risks. Large language models (LLMs) are increasingly used in related scientific research and science communication. If such models hold cognitive biases about gene editing technology, they may give rise to biosecurity risks. To assess such bias systematically, we construct an information extraction analysis framework based on event elements. The framework targets six core event elements (Person, Organization, Technology, Object, Effect, and Publish, i.e., the publishing journal) and employs a cascaded "Basic-Rethink-Multi-Query" evaluation pipeline. We validate it experimentally on gpt-3.5-turbo and gpt-4-turbo under multiple prompting strategies. Results reveal significant structural bias at the basic extraction stage: the models recognize relatively static entities (Technology, Organization, Publish) well but are weaker on dynamic elements (Person, Object, Effect). Introducing the Rethink mechanism and the Multi-Query strategy significantly improves both the balance across elements and overall extraction performance. However, extraction of certain elements, such as Object, still lags, indicating blind spots in domain-specific semantic understanding. By applying information extraction methods, this study identifies and mitigates LLM cognitive bias in the gene editing domain, providing a methodological basis and empirical reference for trustworthy information processing technologies and biosecurity governance.
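The imbalance across event elements described in the abstract can be made concrete with a per-element score. The following is a minimal sketch, assuming exact-match scoring of extracted strings and an F1 metric per element; the function names `f1_per_element` and `balance_gap` are hypothetical and are not taken from the paper, whose actual scoring procedure is not given here.

```python
from typing import Dict, Set

# The six event elements named in the abstract.
ELEMENTS = ["Person", "Organization", "Technology", "Object", "Effect", "Publish"]


def f1_per_element(gold: Dict[str, Set[str]],
                   pred: Dict[str, Set[str]]) -> Dict[str, float]:
    """Exact-match F1 for each element (hypothetical scoring, not the paper's)."""
    scores = {}
    for element in ELEMENTS:
        g = gold.get(element, set())
        p = pred.get(element, set())
        true_pos = len(g & p)
        precision = true_pos / len(p) if p else 0.0
        recall = true_pos / len(g) if g else 0.0
        denom = precision + recall
        scores[element] = 2 * precision * recall / denom if denom else 0.0
    return scores


def balance_gap(scores: Dict[str, float]) -> float:
    """Spread between the best and worst element; larger means less balanced."""
    return max(scores.values()) - min(scores.values())
```

Under these assumptions, running the same scoring after each stage (Basic, Rethink, Multi-Query) and comparing `balance_gap` would show whether a stage narrows the difference between static and dynamic elements.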
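Cite this article: Chen Mei, Gao Yang, Shi Lingang, He Mingjing, Liu Xiaoying. Bias Assessment of Large Language Models in Gene Editing Technology through Information Extraction [J]. Artificial Intelligence and Robotics Research, 2025, 14(6): 1398-1409. https://doi.org/10.12677/airr.2025.146131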
