A Comparative Study on the Efficacy of Large Language Models in BEC Higher Writing Scoring: An Empirical Analysis of DeepSeek and Doubao
Abstract: To examine how practical domestically developed large language models (LLMs) are for scoring BEC Higher business English writing, this study takes DeepSeek-V3.2 and Doubao as its objects and uses 90 BEC Higher essays to run comparative experiments under three prompt scenarios: no prompt, scoring criteria only, and scoring criteria together with human-scored sample essays. The results show that DeepSeek-V3.2 outperforms Doubao in both overall scoring accuracy and stability. By genre, both models score business reports most accurately and are weaker on business letters. Providing only the scoring criteria lowers the models' scoring performance, whereas adding human-scored sample essays markedly improves scoring quality. Both models can be used for preliminary scoring of essays, but a gap remains between LLM scoring and professional human scoring, so they cannot fully replace human raters for the time being. This study offers a reference for human-machine collaborative scoring of business English writing.
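The three prompt scenarios and the two evaluation dimensions (accuracy and stability) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the abstract does not state the prompt wording or the metrics used, so the function names, the prompt layout, mean absolute error as the accuracy measure, and the standard deviation across repeated runs as the stability measure are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def build_prompt(essay, rubric=None, scored_samples=None):
    """Assemble a prompt for one scenario (hypothetical layout).

    - no rubric, no samples  -> "no prompt" scenario
    - rubric only            -> "scoring criteria only" scenario
    - rubric and samples     -> "criteria plus human-scored samples" scenario
    """
    parts = []
    if rubric:
        parts.append("Scoring criteria:\n" + rubric)
    for sample_text, sample_score in scored_samples or []:
        parts.append(f"Sample essay (human score {sample_score}):\n{sample_text}")
    parts.append("Essay to score:\n" + essay)
    return "\n\n".join(parts)


def mean_absolute_error(model_scores, human_scores):
    """Assumed accuracy measure: average gap between model and human scores."""
    return mean(abs(m - h) for m, h in zip(model_scores, human_scores))


def run_to_run_spread(repeated_scores):
    """Assumed stability measure: population standard deviation of the
    scores a model gives the same essay across repeated runs."""
    return pstdev(repeated_scores)
```

Under these assumptions, a lower mean absolute error would indicate scores closer to the human raters, and a lower run-to-run spread would indicate a more stable model.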
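Cite this article: Zhang Nanzhu, Li Huadong. A Comparative Study on the Efficacy of Large Language Models in BEC Higher Writing Scoring: An Empirical Analysis of DeepSeek and Doubao [J]. Modern Linguistics, 2026, 14(5): 687-694. https://doi.org/10.12677/ml.2026.145448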
