System Selection and Performance Evaluation of LLM-Based Python Programming Teaching Assistants
DOI: 10.12677/csa.2025.156169
Authors: 徐鼎洪, 景尔妮, 董许宏, 罗雅琴*: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai
Keywords: Large Language Models, Python Education, Evaluation Framework
Abstract: This study examines the use of large language models (LLMs) in Python programming education. It builds a multi-dimensional evaluation framework and systematically compares mainstream models, such as Qwen-Plus, Ernie Bot 8k, and SparkV3.5, in teaching scenarios. The test tasks cover factual questions, reasoning questions, code generation, and multi-turn dialogue, and each model is assessed on five dimensions: answer accuracy, completeness, linguistic fluency, contextual understanding, and code example quality. The experimental results show that Qwen-Plus achieved the highest overall score: its answers covered edge cases and stayed logically coherent across dialogue turns, and its code examples followed PEP8. Ernie Bot 8k and SparkV3.5 performed well on accuracy but produced redundant code comments, while GPT-4 scored lower because of redundant code and one-sided exception handling. The study identifies shortcomings common to the models in covering Python language details and in modeling context, and proposes improvements through knowledge base updates, reinforcement learning optimization, and a multi-modal evaluation framework. These findings provide empirical evidence for selecting intelligent teaching assistant systems and adapting them to teaching scenarios.
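The abstract does not say how the five dimension scores are combined into the overall score. The sketch below is one possible aggregation, assuming a weighted mean with equal weights by default; the function names, the dimension keys, the 1-to-5 scale, and the sample scores are all hypothetical placeholders and are not data or methods from the study.

```python
# Hypothetical dimension keys mirroring the five dimensions named in the abstract.
DIMENSIONS = (
    "accuracy",
    "completeness",
    "fluency",
    "context_understanding",
    "code_quality",
)


def overall_score(scores, weights=None):
    """Return the weighted mean of the five dimension scores.

    Equal weights are an assumption; the study may weight dimensions differently.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in DIMENSIONS}
    missing = [dim for dim in DIMENSIONS if dim not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    total_weight = sum(weights[dim] for dim in DIMENSIONS)
    weighted_sum = sum(scores[dim] * weights[dim] for dim in DIMENSIONS)
    return weighted_sum / total_weight


def rank_models(per_model_scores, weights=None):
    """Return (model_name, overall_score) pairs sorted from best to worst."""
    ranked = [
        (name, overall_score(scores, weights))
        for name, scores in per_model_scores.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder scores on an assumed 1-5 scale, for illustration only.
sample_scores = {
    "model_a": dict(zip(DIMENSIONS, (5, 4, 4, 5, 4))),
    "model_b": dict(zip(DIMENSIONS, (4, 4, 3, 4, 3))),
}

if __name__ == "__main__":
    for name, score in rank_models(sample_scores):
        print(f"{name}: {score:.2f}")
```

Passing a custom `weights` dictionary, for example one that emphasizes `code_quality`, would let a course team adapt the ranking to its own teaching priorities.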
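Cite this article: 徐鼎洪, 景尔妮, 董许宏, 罗雅琴. System Selection and Performance Evaluation of LLM-Based Python Programming Teaching Assistants [J]. Computer Science and Application, 2025, 15(6): 190-197. https://doi.org/10.12677/csa.2025.156169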
