|
[1]
|
Hendrycks, D., Burns, C., Basart, S., et al. (2021) Measuring Massive Multitask Language Understanding. 2021 International Conference on Learning Representations, Online, 3-7 May 2021, 1-27.
|
|
[2]
|
Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S., et al. (2020) BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8, 377-392.
|
|
[3]
|
Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., Li, Y., et al. (2020) CLUE: A Chinese Language Understanding Evaluation Benchmark. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, December 2020, 4762-4772.
|
|
[4]
|
Liu, C., Jin, R., Ren, Y. and Xiong, D. (2024) LHMKE: A Large-Scale Holistic Multi-Subject Knowledge Evaluation Benchmark for Chinese Large Language Models. Proceedings of the Language Resources and Evaluation Conference, Torino, May 2024, 10476-10487.
|
|
[5]
|
Zhang, Y. (2013) Modern Chinese [现代汉语]. 2nd Edition, Fudan University Press, Shanghai. (In Chinese)
|
|
[6]
|
Chan, C.M., Chen, W., Su, Y., et al. (2023) ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6-10 December 2023, 13371-13391.
|
|
[7]
|
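Li, D. (2020) Research on Interpretable Assessment of Professional Texts Based on Knowledge Points [基于知识点的专业文本可解释评阅研究]. Master's Thesis, Shandong University, Jinan. (In Chinese)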
|
|
[8]
|
Zhao, Q., Huang, Y., Lv, T., Cui, L., Sun, Q., Mao, S., et al. (2025) MMLU-CF: A Contamination-Free Multi-Task Language Understanding Benchmark. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, July 2025, 13371-13391.
|