Architecture Design and Key Technology Research of Training-Reasoning Integrated Platform

doi:10.12677/CSA.2023.139173

Authors: Liang Binghao (梁秉豪), Zhang Chuangang (张传刚): Inspur Communication Information System Co., Ltd., Jinan, Shandong
Keywords: Large-Scale Pre-Trained Model; Training-Reasoning Integrated; Task Scheduling; Computing Power Scheduling; Automatic Forms

Abstract: In recent years, large-scale pre-trained models, represented by ChatGPT, have repeatedly broken through bottlenecks in AI technology, and the fragmentation of AI application scenarios is expected to be fundamentally resolved in the short term. In the future, centralized AI application development will replace traditional workshop-style production, and this trend places higher demands on the artificial intelligence platforms that support model training, fine-tuning, and deployment. To address certain problems in mainstream artificial intelligence platforms, this paper designs a training-reasoning integrated platform, that is, a single platform covering both model training and inference. The platform uses a workflow engine to schedule machine learning pipelines efficiently, applies virtualization and containerization to solve hardware resource allocation and scheduling, and uses an automatic form tool to manage operators as components and plug-ins. The platform is intended to lower the barrier to developing AI applications, promote their centralized and large-scale production, and accelerate the adoption of large-scale pre-trained models in vertical industry scenarios.

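Cite this article: Liang, B. and Zhang, C. (2023) Architecture Design and Key Technology Research of Training-Reasoning Integrated Platform. Computer Science and Application, 13(9), 1748-1755. https://doi.org/10.12677/CSA.2023.139173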

