The Impact of Genre and Translation Direction on Human and Large Language Model Scoring Behaviors in Translation Assessment
Abstract: Drawing on the Many-Facet Rasch Model (MFRM), this study investigates how genre and translation direction shape the rating behavior of human raters and large language models (LLMs) in translation assessment. The dataset comprises eight parallel translation tasks completed by 30 Business English undergraduates at a Project 211 university in Beijing, covering scientific, business, news, and argumentative texts in both the Chinese-English and English-Chinese directions. Four mainstream LLMs (ChatGPT-4o, DeepSeek, Tongyi Qwen-2.5, and Tencent Yuanbao) were selected on the basis of authoritative evaluation reports and scored the translations alongside two expert human raters, all using a unified scoring rubric. The human raters were generally more lenient, whereas the LLMs consistently tended toward stricter scoring. Genre exerted a significant influence on rating severity, indicating that raters were sensitive to textual function and information density. With respect to translation direction, the human raters were stricter on the English-Chinese tasks, Tencent Yuanbao was stricter on the Chinese-English tasks, and ChatGPT-4o and Tongyi Qwen-2.5 remained relatively consistent across both directions. Overall, the results suggest that LLMs have begun to develop initial capacities for genre recognition and direction adaptation, although some severity bias remains. The findings offer empirical support for calibrating intelligent translation assessment systems, for genre-based instructional feedback, and for improving cross-directional fairness in translation evaluation.
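Cite this article: Liu, L., Tang, Q. and Wang, Y. (2025) The Impact of Genre and Translation Direction on Human and Large Language Model Scoring Behaviors in Translation Assessment. Modern Linguistics, 13(12), 195-205. https://doi.org/10.12677/ml.2025.13121253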
