Machine Learning-Based MPI Simulation Performance Testing and Tuning Platform
Abstract: MPI parallel performance testing of engineering numerical simulation software suffers from three persistent problems: MPI parameters are difficult to tune, manual testing is inefficient, and performance baselines are not updated in time. Improving the efficiency of parallel performance testing and tuning has therefore become an active research topic in parallel computing. This paper uses the XGBoost algorithm to build a model that maps MPI parameters to performance, and tunes MPI parallel parameters automatically with an improved particle swarm optimization (IPSO) algorithm that adapts the inertia weight and adds a Gaussian mutation strategy. It also develops an automated regression testing technique and proposes a performance baseline generation method that combines statistical analysis with outlier removal. These components are integrated into SimPerf, a parameter tuning and automated performance testing platform for MPI parallel software development. Experiments show that the platform lowers the barrier to MPI parallel development for simulation developers and improves development and testing efficiency: compared with the traditional manual approach, performance testing and tuning of MPI parallel simulation applications is more than 70% more efficient, and test coverage increases by more than 16%.
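The tuning idea named in the abstract (particle swarm optimization with a changing inertia weight and Gaussian mutation) can be illustrated with a minimal sketch. This is not the authors' IPSO: the abstract does not specify the adaptive rule, so a linearly decreasing inertia weight stands in for it, and the function name `ipso_minimize`, all default hyperparameters, and the toy objective are assumptions made for illustration. In the platform described here, the objective would presumably be the runtime predicted by the trained parameter-to-performance model rather than a closed-form function.

```python
import random


def ipso_minimize(objective, bounds, n_particles=20, n_iters=100,
                  w_max=0.9, w_min=0.4, c1=1.5, c2=1.5,
                  mutation_prob=0.1, mutation_scale=0.1, seed=0):
    """Minimize `objective` over box `bounds` with a PSO variant.

    Hypothetical sketch: linear inertia decay and per-particle Gaussian
    mutation are assumed stand-ins for the paper's unspecified rules.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x, d):
        lo, hi = bounds[d]
        return min(max(x, lo), hi)

    # Random initial positions inside the bounds, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for t in range(n_iters):
        # Inertia weight shrinks from w_max to w_min: explore early, refine late.
        w = w_max - (w_max - w_min) * t / max(n_iters - 1, 1)
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = clip(pos[i][d] + vel[i][d], d)
            # Gaussian mutation: occasionally perturb a particle so the
            # swarm can escape a local optimum.
            if rng.random() < mutation_prob:
                for d in range(dim):
                    lo, hi = bounds[d]
                    sigma = mutation_scale * (hi - lo)
                    pos[i][d] = clip(pos[i][d] + rng.gauss(0.0, sigma), d)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


# Toy stand-in for a predicted-runtime surface with its minimum at (3, -1).
def toy_cost(x):
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


best, best_val = ipso_minimize(toy_cost, [(-10.0, 10.0), (-10.0, 10.0)])
print(best, best_val)
```

The sketch treats every parameter as continuous. Real MPI runtime parameters are often integers or categorical choices, so a practical version would need to round or encode candidate positions before evaluating them.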
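Cite this article: Zhao Yingyan, Cao Qunsheng. Machine Learning-Based MPI Simulation Performance Testing and Tuning Platform [J]. Computer Science and Application, 2026, 16(6): 252-267. https://doi.org/10.12677/csa.2026.166225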
