Research on the Construction Method and Practice of Evaluation Dataset for Linguistic Ability of Large Language Models
Abstract: This study explores how to construct a dataset for evaluating the linguistic ability of large language models (LLMs) and proposes a standardized workflow: question drafting, reference answer preparation, model response generation, test point generation, test point annotation, and test-point-based scoring. Taking Zhang Yisheng's Modern Chinese as the theoretical basis, two test papers totaling 62 questions were designed with the help of Kimi, covering phonetics, syntax, semantics, and pragmatics. DeepSeek, Doubao, and Qwen answered the questions; Gemini generated the test points and scoring rubrics, which were then manually proofread to form the final dataset. The findings are threefold: LLMs are efficient at automatic question drafting, test point generation, and preliminary grading, but manual intervention is still needed to ensure precision and standardization; some models can identify errors in the questions on their own, showing potential for knowledge critique; and structured scoring based on test points is more interpretable than holistic scoring. The study offers a replicable construction method and a reference benchmark for future evaluation of LLMs' linguistic ability and for linguistics teaching.
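The test-point-based scoring step in the workflow can be illustrated with a minimal sketch. The schema, the weights, and the proportional scoring rule below are assumptions made for illustration only; the abstract does not specify how test points are weighted or combined, and the sample data is not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringPoint:
    """One scorable test point for a question (hypothetical schema)."""
    point_id: str
    description: str
    weight: float  # relative share of the question's marks; assumed


def score_response(points, annotations, full_marks):
    """Return (score, per-point breakdown) for one model response.

    `annotations` maps point_id -> True/False, as an annotator might
    record whether the response covers each test point.
    Points with no annotation count as missed.
    """
    total_weight = sum(p.weight for p in points)
    if total_weight <= 0:
        raise ValueError("test points must carry positive total weight")
    breakdown = {}
    for p in points:
        hit = bool(annotations.get(p.point_id, False))
        # Each covered point earns its weighted share of the full marks.
        breakdown[p.point_id] = full_marks * p.weight / total_weight if hit else 0.0
    return sum(breakdown.values()), breakdown


# Illustrative data only.
points = [
    ScoringPoint("p1", "identifies the sentence as ambiguous", 2.0),
    ScoringPoint("p2", "gives both structural analyses", 2.0),
    ScoringPoint("p3", "explains the semantic difference", 1.0),
]
annotations = {"p1": True, "p2": True, "p3": False}
score, breakdown = score_response(points, annotations, full_marks=10.0)
print(score, breakdown)
```

The per-point breakdown is what makes this style of scoring easier to interpret than a single holistic mark: a reader can see which test points a response covered and which it missed.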
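Cite this article: Li Chaoyang. Research on the Construction Method and Practice of Evaluation Dataset for Linguistic Ability of Large Language Models [J]. Modern Linguistics, 2026, 14(5): 868-877. https://doi.org/10.12677/ml.2026.145469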

References

[1] Hendrycks, D., Burns, C., Basart, S., et al. (2021) Measuring Massive Multitask Language Understanding. 2021 International Conference on Learning Representations, Online, 3-7 May 2021, 1-27.
[2] Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S., et al. (2020) BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8, 377-392.
[3] Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., Li, Y., et al. (2020) CLUE: A Chinese Language Understanding Evaluation Benchmark. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, December 2020, 4762-4772.
[4] Liu, C., Jin, R., Ren, Y. and Xiong, D. (2024) LHMKE: A Large-Scale Holistic Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models. Proceedings of the Language Resources and Evaluation Conference, Torino, May 2024, 10476-10487.
[5] Zhang, Y. (2013) Modern Chinese. 2nd Edition. Fudan University Press, Shanghai. (In Chinese)
[6] Chan, C.M., Chen, W., Su, Y., et al. (2023) ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6-10 December 2023, 13371-13391.
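[7] Li, D. (2020) Research on Interpretable Grading of Professional Texts Based on Knowledge Points. Master's Thesis, Shandong University, Jinan. (In Chinese)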
[8] Zhao, Q., Huang, Y., Lv, T., Cui, L., Sun, Q., Mao, S., et al. (2025) MMLU-CF: A Contamination-Free Multi-Task Language Understanding Benchmark. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, July 2025, 13371-13391.