Research on Governance of Unstructured Epidemiological Investigation Data and Report Generation Based on Large Language Models
Abstract: Purpose: Epidemiological investigation materials come from scattered sources, use colloquial wording, and often leave temporal information implicit. The result is inefficient information collation, a heavy report-drafting workload, and weak audit traceability. To address these problems, this study proposes an integrated approach to data governance and report generation for unstructured text. Methods: With a large language model as the core engine, the approach establishes mechanisms for data intake registration, tiered de-identification, case-level primary-key linkage, version control, and quality annotation. It builds a pipeline of semantic extraction, structured validation, and template-based document generation, and combines temporal parsing, terminology mapping, and a rules engine to form a closed loop of anomaly detection and human review. Tiered human-machine collaboration is implemented through sign-off and approval workflows, role-based permissions, and audit logs. Results: The work yields a case-centered layered architecture and a deployable workflow that link raw narratives, structured elements, and standardized documents in a continuous chain. Prototype use indicates that the solution supports multi-source text intake, privacy governance, generation of to-do items for conflicting information, and traceable approval management, and that it performs stably with respect to procedural compliance, consistency of outputs, and accountability tracking. Conclusion: The study offers an implementable engineering path for disease control agencies to advance the governance of unstructured epidemiological investigation data and the automation of reporting under on-premises deployment and strict audit constraints.
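The abstract does not specify how temporal parsing, rule checks, and the human-review loop are implemented. The following is only a minimal sketch of one way such a step could look, not the author's method. All names (RELATIVE_DAYS, resolve_date, CaseRecord, validate), the two illustrative rules, and the sample record are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical lookup of relative-date phrases (assumption); a real
# system would need a far richer temporal parser.
RELATIVE_DAYS = {
    "today": 0,
    "yesterday": -1,
    "the day before yesterday": -2,
}


def resolve_date(expr: str, interview_date: date) -> Optional[date]:
    """Resolve an ISO date or a known relative phrase against the interview date.

    Returns None when the expression cannot be parsed, so that the caller
    can route the field to human review instead of guessing.
    """
    text = expr.strip().lower()
    if text in RELATIVE_DAYS:
        return interview_date + timedelta(days=RELATIVE_DAYS[text])
    try:
        return date.fromisoformat(text)
    except ValueError:
        return None


@dataclass
class CaseRecord:
    """Hypothetical structured elements extracted for one case."""
    case_id: str
    interview_date: date
    onset: str      # symptom onset as stated in the narrative
    exposure: str   # reported exposure date as stated in the narrative


def validate(record: CaseRecord) -> List[str]:
    """Apply illustrative rules and return to-do items for human review."""
    todos: List[str] = []
    resolved = {}
    for name in ("onset", "exposure"):
        value = resolve_date(getattr(record, name), record.interview_date)
        resolved[name] = value
        if value is None:
            todos.append(f"{name}: date could not be parsed")
        elif value > record.interview_date:
            todos.append(f"{name}: date is after the interview date")
    onset, exposure = resolved["onset"], resolved["exposure"]
    # Illustrative rule only: flag an exposure reported after symptom onset.
    if onset is not None and exposure is not None and exposure > onset:
        todos.append("conflict: exposure is dated after symptom onset")
    return todos


if __name__ == "__main__":
    sample = CaseRecord("C001", date(2026, 3, 10), "yesterday", "2026-03-12")
    for item in validate(sample):
        print(sample.case_id, item)
```

In this sketch the rules only produce review items and never overwrite the extracted values, which is one way to keep the raw narrative, the structured elements, and the reviewer's decision separately traceable.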
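Cite this article: Jiang Songdong (蒋松冬). Research on Governance of Unstructured Epidemiological Investigation Data and Report Generation Based on Large Language Models [J]. Computer Science and Application (计算机科学与应用), 2026, 16(5): 58-65. https://doi.org/10.12677/csa.2026.165164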
