Research on Improved Complex Image Acceleration Inference Based on Diffusion Transformer
DOI: 10.12677/csa.2025.1510255
Authors: Muchen Lan (兰慕辰), Yaoru Sun (孙杳如): School of Computer Science and Technology, Tongji University, Shanghai
Keywords: Diffusion Model, Spatial Complexity, Generative Model
Abstract: Research on text-to-image models has advanced rapidly, and diffusion models built on the Transformer architecture have become the mainstream approach. To address the bottlenecks that current text-to-image models face in handling complex images and in inference speed, this paper proposes an adaptive generation strategy based on image spatial complexity, building on the diffusion Transformer. We design a method for computing image spatial complexity that combines texture and structural information, and feed the result to the generative model as an input so that the complex structure of the generated image can be analyzed dynamically. Based on this complexity metric, we further propose a block-wise adaptive generation strategy that adjusts the depth of the inference model according to the spatial complexity of each sub-image, which significantly improves generation efficiency while preserving image quality. We validate the model on the ImageNet 256 × 256 and MSCOCO 2017 datasets, where it outperforms mainstream image generation models on the FID, sFID, and IS metrics while running inference noticeably faster. The generated results show that the method recovers a wide range of latent detail and also speeds up model iteration, demonstrating its effectiveness and feasibility for complex image generation and accelerated inference.
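The abstract does not give the complexity formula or the depth schedule, so the following is only a minimal sketch of the general idea of block-wise adaptive depth, not the authors' method. Every name and choice here is an assumption made for illustration: `block_complexity` uses histogram entropy as a stand-in for texture and mean gradient magnitude as a stand-in for structure, and `plan_depths` maps the resulting score linearly onto a hypothetical range of Transformer depths.

```python
import numpy as np


def block_complexity(block: np.ndarray, alpha: float = 0.5) -> float:
    """Hypothetical complexity score in [0, 1] for a grayscale block in [0, 1].

    Texture proxy: Shannon entropy of a 32-bin intensity histogram,
    normalized by its maximum value log2(32).
    Structure proxy: mean gradient magnitude, clipped to 1.
    Both proxies and the weight `alpha` are illustrative assumptions.
    """
    block = block.astype(np.float64)

    hist, _ = np.histogram(block, bins=32, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    texture = float(-(p * np.log2(p)).sum() / np.log2(32))

    gy, gx = np.gradient(block)  # finite differences along rows and columns
    structure = min(1.0, float(np.hypot(gx, gy).mean()))

    return alpha * texture + (1.0 - alpha) * structure


def plan_depths(image: np.ndarray, block: int = 64,
                min_depth: int = 4, max_depth: int = 12) -> np.ndarray:
    """Assign a (hypothetical) number of Transformer layers to each block.

    Low-complexity blocks get `min_depth` layers and high-complexity blocks
    approach `max_depth`; the linear mapping is an assumption.
    """
    rows, cols = image.shape[0] // block, image.shape[1] // block
    depths = np.empty((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            patch = image[i * block:(i + 1) * block, j * block:(j + 1) * block]
            score = block_complexity(patch)
            depths[i, j] = min_depth + int(round(score * (max_depth - min_depth)))
    return depths


if __name__ == "__main__":
    # Left half is flat, right half is uniform noise.
    img = np.full((128, 128), 0.5)
    img[:, 64:] = np.random.default_rng(0).random((128, 64))
    print(plan_depths(img))
```

In this sketch, flat blocks fall back to the minimum depth while noisy blocks are assigned more layers, which is the trade-off the abstract describes: spend compute where the sub-image is complex and save it elsewhere.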
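Cite this article: Lan, M. and Sun, Y. (2025) Research on Improved Complex Image Acceleration Inference Based on Diffusion Transformer. Computer Science and Application, 15(10), 126-136. https://doi.org/10.12677/csa.2025.1510255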
