|
[1]
|
Wang, W., Chen, W., Luo, Y., Long, Y., Lin, Z., Zhang, L., et al. (2024) Model Compression and Efficient Inference for Large Language Models: A Survey. arXiv: 2402.09748.
|
|
[2]
|
Liu, D., Kong, H., Luo, X., Liu, W. and Subramaniam, R. (2022) Bringing AI to Edge: From Deep Learning’s Perspective. Neurocomputing, 485, 297-320. [Google Scholar] [CrossRef]
|
|
[3]
|
Zhou, Z., Ning, X., Hong, K., et al. (2024) A Survey on Efficient Inference for Large Language Models. https://arxiv.org/abs/2404.14294
|
|
[4]
|
Dai, D., Deng, C., Zhao, C., Xu, R.X., Gao, H., Chen, D., et al. (2024) DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-Of-Experts Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, 11-16 August 2024, 1280-1297. [Google Scholar] [CrossRef]
|
|
[5]
|
NVIDIA. (2024). TensorRT-LLM [Computer Software]. GitHub. https://github.com/NVIDIA/TensorRT-LLM
|
|
[6]
|
Ascend. (2024). AscendSpeed [Computer Software]. GitHub. https://github.com/Ascend/AscendSpeed
|
|
[7]
|
Qiu, H., Mao, W., Patke, A., et al. (2024) Efficient Interactive LLM Serving with Proxy Model-Based Sequence Length Prediction. arXiv: 2404.08509.
|
|
[8]
|
Nawrot, P., Łańcucki, A., Chochowski, M., et al. (2024) Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference. arXiv: 2403.09636.
|
|
[9]
|
Yu, G.I., Jeong, J.S., Kim, G.W., et al. (2022) Orca: A Distributed Serving System for {Transformer-Based} Generative Models. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 521-538.
|
|
[10]
|
Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles, Koblenz, 23-26 October 2023, 611-626. [Google Scholar] [CrossRef]
|
|
[11]
|
Xu, D., Zhang, H., Yang, L., et al. (2024) Empowering 1000 Tokens/Second On-Device LLM Prefilling with MLLM-NPU. arXiv: 2407.05858v1.
|
|
[12]
|
Pang, W., Jiang, X., Liu, S., Qiao, L., Fu, K., Gao, L., et al. (2024) Control Flow Divergence Optimization by Exploiting Tensor Cores. Proceedings of the 61st ACM/IEEE Design Automation Conference, San Francisco, 23-27 June 2024, 1-6. [Google Scholar] [CrossRef]
|
|
[13]
|
Meng, F., Yao, Z. and Zhang, M. (2025) TransMLA: Multi-Head Latent Attention Is All You Need. arXiv: 2502.07864.
|
|
[14]
|
王子曦, 邵培南, 邓畅. 异构并行平台的Caffe推理速度提升方法[J]. 计算机系统应用, 2022, 31(2): 220-226.
|
|
[15]
|
尚绍法, 蒋林, 李远成, 等. 异构平台下卷积神经网络推理模型自适应划分和调度方法[J]. 计算机应用, 2023, 43(9): 2828-2835.
|
|
[16]
|
Han, Y., Huang, G., Song, S., Yang, L., Wang, H. and Wang, Y. (2022) Dynamic Neural Networks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 7436-7456. [Google Scholar] [CrossRef] [PubMed]
|
|
[17]
|
Bo, Z., Guo, C., Leng, C., Qiao, Y. and Wang, H. (2024) RTDeepEnsemble: Real-Time DNN Ensemble Method for Machine Perception Systems. 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, 18-20 November 2024, 191-198. [Google Scholar] [CrossRef]
|
|
[18]
|
Han, Y., Liu, Z., Yuan, Z., Pu, Y., Wang, C., Song, S., et al. (2024) Latency-Aware Unified Dynamic Networks for Efficient Image Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 7760-7774. [Google Scholar] [CrossRef] [PubMed]
|
|
[19]
|
Heo, S., Jeong, S. and Kim, H. (2022) RTScale: Sensitivity-Aware Adaptive Image Scaling for Real-Time Object Detection. 34th Euro-Micro Conference on Real-Time Systems, Modena, 5-8 July 2022, 1-22.
|
|
[20]
|
Heo, S., Cho, S., Kim, Y. and Kim, H. (2020) Real-Time Object Detection System with Multi-Path Neural Networks. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Sydney, 21-24 April 2020, 174-187. [Google Scholar] [CrossRef]
|
|
[21]
|
Park, K., Oh, C. and Yi, Y. (2020) BPNet: Branch-Pruned Conditional Neural Network for Systematic Time-Accuracy Tradeoff. 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, 20-24 July 2020, 1-6. [Google Scholar] [CrossRef]
|
|
[22]
|
Wan, A., Hao, H., Patnaik, K., et al. (2023) UPSCALE: Unconstrained Channel Pruning. arXiv: 2307.08771.
|
|
[23]
|
Zheng, Z., Ji, X., Fang, T., Zhou, F., Liu, C. and Peng, G. (2024) BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-Oriented Token Batching. arXiv: 2412.03594.
|
|
[24]
|
Lee, S. and Nirjon, S. (2020) SubFlow: A Dynamic Induced-Subgraph Strategy toward Real-Time DNN Inference and Training. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), Sydney, 21-24 April 2020, 15-29. [Google Scholar] [CrossRef]
|
|
[25]
|
Oh, H., Kim, K., Kim, J., Kim, S., Lee, J., Chang, D., et al. (2024) ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, La Jolla, 27 April-1 May 2024, 369-384. [Google Scholar] [CrossRef]
|
|
[26]
|
Cui, W., Han, Z., Ouyang, L., et al. (2023) Optimizing Dynamic Neural Networks with Brainstorm. 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), Boston,10-12 July 2023, 797-815.
|
|
[27]
|
Wang, H., Zhou, X., Yu, Z., Liu, S., Guo, B., Wu, Y., et al. (2020) Context-aware Adaptation of Deep Learning Models for IoT Devices. Scientia Sinica Informationis, 50, 1629-1644. [Google Scholar] [CrossRef]
|
|
[28]
|
Zhao, Z., Ling, N., Guan, N. and Xing, G. (2022) Aaron: Compile-Time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU. Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, Boston, 6-9 November 2022, 802-803. [Google Scholar] [CrossRef]
|
|
[29]
|
Pang, W., Luo, X., Chen, K., Ji, D., Qiao, L. and Yi, W. (2023) Efficient CUDA Stream Management for Multi-DNN Real-Time Inference on Embedded GPUs. Journal of Systems Architecture, 139, Article ID: 102888. [Google Scholar] [CrossRef]
|