Research on Multi-Granularity Evidence Retrieval Methods for Hybrid Long Documents
DOI: 10.12677/airr.2025.145109
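Authors: Yajing Song (宋亚静), Meiqi Zhang (张美琪), Yikai Sun (孙一凯): School of Computer Science, Beijing Information Science and Technology University, Beijing; Shibin Xiao (肖诗斌): School of Computer Science, Beijing Information Science and Technology University, Beijing; TRS Information Technology Co., Ltd., Beijing; Hongfa Huang (黄鸿发): TRS Information Technology Co., Ltd., Beijing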
Keywords: Evidence Retrieval, Large Language Model, Hybrid Long Documents
Abstract: Question answering over hybrid long documents (HLDs) suffers from low efficiency and scattered evidence. Existing methods feed the entire document to a large language model (LLM), which introduces noise and can lead to factual hallucination. To address these issues, this paper proposes a knowledge support strategy centered on relevant evidence sentences and designs a three-stage framework for retrieving them. The first stage uses a hierarchical retrieval strategy with a two-stage filtering mechanism to locate the evidence paragraphs relevant to a user claim. The second stage serializes paragraphs that contain tables. The third stage denoises the retrieved paragraphs to select evidence sentences, which are combined with the user claim to help the LLM generate an answer. In addition, to advance research in this area, we construct the first Chinese feasibility study report dataset. Experiments on this dataset show that our model outperforms traditional baseline models, reaching an accuracy of 80.4%.
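The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how relevance is scored or how tables are serialized, so this sketch assumes a simple token-overlap score in place of a real retriever, treats "section, then paragraph" as the two filtering levels, and serializes each table row as "header is value" sentences. All names here (`Section`, `Paragraph`, `overlap`, `serialize_table`, `retrieve_evidence`) and the sample data are hypothetical, and the user's statement is called `claim` in the code.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a stand-in for a real tokenizer or embedding model."""
    return set(re.findall(r"\w+", text.lower()))


def overlap(claim: str, text: str) -> float:
    """Fraction of claim tokens found in the text (assumed relevance score)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(text)) / len(claim_tokens)


@dataclass
class Paragraph:
    text: str = ""
    table: list[list[str]] = field(default_factory=list)  # first row is the header


@dataclass
class Section:
    title: str
    paragraphs: list[Paragraph]


def serialize_table(table: list[list[str]]) -> list[str]:
    """Stage 2 (assumed format): one 'header is value' sentence per data row."""
    header, *rows = table
    return [", ".join(f"{h} is {v}" for h, v in zip(header, row)) + "." for row in rows]


def paragraph_sentences(paragraph: Paragraph) -> list[str]:
    """Split prose into sentences and append serialized table rows, if any."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.text)
    sentences = [s.strip() for s in parts if s.strip()]
    if paragraph.table:
        sentences += serialize_table(paragraph.table)
    return sentences


def paragraph_text(paragraph: Paragraph) -> str:
    return " ".join(paragraph_sentences(paragraph))


def retrieve_evidence(claim, sections, top_sections=1, top_paragraphs=1, min_score=0.5):
    # Stage 1a: coarse filter, keep the sections most related to the claim.
    def section_score(section):
        body = " ".join(paragraph_text(p) for p in section.paragraphs)
        return overlap(claim, section.title + " " + body)

    kept_sections = sorted(sections, key=section_score, reverse=True)[:top_sections]

    # Stage 1b: fine filter, keep the best paragraphs inside those sections.
    candidates = [p for s in kept_sections for p in s.paragraphs]
    kept_paragraphs = sorted(
        candidates, key=lambda p: overlap(claim, paragraph_text(p)), reverse=True
    )[:top_paragraphs]

    # Stage 3: denoise, keep only sentences that score above the threshold.
    return [
        sentence
        for p in kept_paragraphs
        for sentence in paragraph_sentences(p)
        if overlap(claim, sentence) >= min_score
    ]


def build_prompt(claim: str, evidence: list[str]) -> str:
    """Combine the claim with the retrieved evidence for the LLM (assumed template)."""
    lines = "\n".join(f"- {sentence}" for sentence in evidence)
    return f"Evidence:\n{lines}\n\nClaim: {claim}\nAnswer:"


# Hypothetical toy document with one prose-only section and one section with a table.
sections = [
    Section("Project background",
            [Paragraph("The project is located in Beijing. It was proposed in 2023.")]),
    Section("Investment estimate", [
        Paragraph("The investment estimate follows national standards. "
                  "Weather was mild that year."),
        Paragraph(table=[["Item", "Cost"], ["Equipment", "500"], ["Construction", "1200"]]),
    ]),
]
claim = "The equipment cost is 500"
evidence = retrieve_evidence(claim, sections)
print(build_prompt(claim, evidence))
```

On this toy input only the serialized table row for the equipment item survives all three filters, so the prompt passed to the model contains one evidence sentence rather than the whole document. The whitespace-based tokenizer is suitable only for this English example; Chinese text would need a word segmenter or an embedding-based score.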
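Cite this article: Song, Y., Xiao, S., Huang, H., Zhang, M. and Sun, Y. (2025) Research on Multi-Granularity Evidence Retrieval Methods for Hybrid Long Documents. Artificial Intelligence and Robotics Research, 14(5), 1155-1166. https://doi.org/10.12677/airr.2025.145109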
