AI Model Inference and Deployment Optimization for Heterogeneous Intelligent Embedded Systems—A Survey and Perspective from Model Lightweighting to System-Level Acceleration
DOI: 10.12677/etis.2025.24023. Supported by research project funding.
Author: Weiguang Pang, Qilu University of Technology (Shandong Academy of Sciences), Shandong Computer Science Center (National Supercomputer Center in Jinan), Jinan, Shandong
Keywords: Heterogeneous Embedded Systems, Deep Neural Networks, Inference Acceleration, Real-Time Scheduling
Abstract: With the rapid advancement of artificial intelligence (AI) technologies and embedded hardware, embedded AI systems—such as mobile robots, autonomous vehicles, and spaceborne unmanned aerial vehicles—are becoming increasingly important in key domains including industrial automation, transportation, and aerospace. As intelligent real-time systems that integrate heterogeneous processing units such as CPUs, GPUs, and NPUs, their core mission is to execute computationally intensive deep neural networks (DNNs) for complex functions such as environmental perception and decision control, all under stringent timing constraints and resource limitations. From the perspective of accelerating and optimizing DNN inference on embedded systems, this paper analyzes the current state of research worldwide, focusing on three major aspects: DNN model lightweighting, inference acceleration optimization, and dynamic task scheduling.
Citation: Pang, W. (2025) AI Model Inference and Deployment Optimization for Heterogeneous Intelligent Embedded Systems—A Survey and Perspective from Model Lightweighting to System-Level Acceleration. Embedded Technology and Intelligent Systems, 2(4), 255-260. https://doi.org/10.12677/etis.2025.24023

References

[1] Wang, W., Chen, W., Luo, Y., Long, Y., Lin, Z., Zhang, L., et al. (2024) Model Compression and Efficient Inference for Large Language Models: A Survey. arXiv: 2402.09748.
[2] Liu, D., Kong, H., Luo, X., Liu, W. and Subramaniam, R. (2022) Bringing AI to Edge: From Deep Learning’s Perspective. Neurocomputing, 485, 297-320.
[3] Zhou, Z., Ning, X., Hong, K., et al. (2024) A Survey on Efficient Inference for Large Language Models. arXiv: 2404.14294. https://arxiv.org/abs/2404.14294
[4] Dai, D., Deng, C., Zhao, C., Xu, R.X., Gao, H., Chen, D., et al. (2024) DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, 11-16 August 2024, 1280-1297.
[5] NVIDIA (2024) TensorRT-LLM [Computer Software]. GitHub. https://github.com/NVIDIA/TensorRT-LLM
[6] Ascend (2024) AscendSpeed [Computer Software]. GitHub. https://github.com/Ascend/AscendSpeed
[7] Qiu, H., Mao, W., Patke, A., et al. (2024) Efficient Interactive LLM Serving with Proxy Model-Based Sequence Length Prediction. arXiv: 2404.08509.
[8] Nawrot, P., Łańcucki, A., Chochowski, M., et al. (2024) Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference. arXiv: 2403.09636.
[9] Yu, G.I., Jeong, J.S., Kim, G.W., et al. (2022) Orca: A Distributed Serving System for Transformer-Based Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521-538.
[10] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, 23-26 October 2023, 611-626.
[11] Xu, D., Zhang, H., Yang, L., et al. (2024) Empowering 1000 Tokens/Second On-Device LLM Prefilling with MLLM-NPU. arXiv: 2407.05858v1.
[12] Pang, W., Jiang, X., Liu, S., Qiao, L., Fu, K., Gao, L., et al. (2024) Control Flow Divergence Optimization by Exploiting Tensor Cores. Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, 23-27 June 2024, 1-6.
[13] Meng, F., Yao, Z. and Zhang, M. (2025) TransMLA: Multi-Head Latent Attention Is All You Need. arXiv: 2502.07864.
[14] Wang, Z., Shao, P. and Deng, C. (2022) A Method for Improving Caffe Inference Speed on Heterogeneous Parallel Platforms. Computer Systems & Applications, 31(2), 220-226. (in Chinese)
[15] Shang, S., Jiang, L., Li, Y., et al. (2023) Adaptive Partitioning and Scheduling Method for Convolutional Neural Network Inference Models on Heterogeneous Platforms. Journal of Computer Applications, 43(9), 2828-2835. (in Chinese)
[16] Han, Y., Huang, G., Song, S., Yang, L., Wang, H. and Wang, Y. (2022) Dynamic Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 7436-7456.
[17] Bo, Z., Guo, C., Leng, C., Qiao, Y. and Wang, H. (2024) RTDeepEnsemble: Real-Time DNN Ensemble Method for Machine Perception Systems. 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, 18-20 November 2024, 191-198.
[18] Han, Y., Liu, Z., Yuan, Z., Pu, Y., Wang, C., Song, S., et al. (2024) Latency-Aware Unified Dynamic Networks for Efficient Image Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 7760-7774.
[19] Heo, S., Jeong, S. and Kim, H. (2022) RTScale: Sensitivity-Aware Adaptive Image Scaling for Real-Time Object Detection. 34th Euromicro Conference on Real-Time Systems (ECRTS 2022), Modena, 5-8 July 2022, 1-22.
[20] Heo, S., Cho, S., Kim, Y. and Kim, H. (2020) Real-Time Object Detection System with Multi-Path Neural Networks. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Sydney, 21-24 April 2020, 174-187.
[21] Park, K., Oh, C. and Yi, Y. (2020) BPNet: Branch-Pruned Conditional Neural Network for Systematic Time-Accuracy Tradeoff. 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, 20-24 July 2020, 1-6.
[22] Wan, A., Hao, H., Patnaik, K., et al. (2023) UPSCALE: Unconstrained Channel Pruning. arXiv: 2307.08771.
[23] Zheng, Z., Ji, X., Fang, T., Zhou, F., Liu, C. and Peng, G. (2024) BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-Oriented Token Batching. arXiv: 2412.03594.
[24] Lee, S. and Nirjon, S. (2020) SubFlow: A Dynamic Induced-Subgraph Strategy toward Real-Time DNN Inference and Training. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Sydney, 21-24 April 2020, 15-29.
[25] Oh, H., Kim, K., Kim, J., Kim, S., Lee, J., Chang, D., et al. (2024) ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, La Jolla, 27 April-1 May 2024, 369-384.
[26] Cui, W., Han, Z., Ouyang, L., et al. (2023) Optimizing Dynamic Neural Networks with Brainstorm. 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), Boston, 10-12 July 2023, 797-815.
[27] Wang, H., Zhou, X., Yu, Z., Liu, S., Guo, B., Wu, Y., et al. (2020) Context-Aware Adaptation of Deep Learning Models for IoT Devices. Scientia Sinica Informationis, 50, 1629-1644.
[28] Zhao, Z., Ling, N., Guan, N. and Xing, G. (2022) Aaron: Compile-Time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, Boston, 6-9 November 2022, 802-803.
[29] Pang, W., Luo, X., Chen, K., Ji, D., Qiao, L. and Yi, W. (2023) Efficient CUDA Stream Management for Multi-DNN Real-Time Inference on Embedded GPUs. Journal of Systems Architecture, 139, Article ID: 102888.