AI Painting Method Based on Stable-Diffusion
Abstract: This study develops an AI-based audio visualization method that converts an audio signal into a set of images, and a video, matching the theme of the song. First, parameters such as the average frequency, average loudness (LUFS), and average phase are extracted from the audio, and their values are divided into intervals, each labeled with a natural-language description. A GPT model then turns these descriptions of the audio data into text that serves as concrete prompts for Stable Diffusion. The technical solution uses a Stable-Diffusion-based AI painting method: the audio information is processed, keywords are generated from it, and these are used to produce high-quality, realistic audio visualization artworks. The generated artworks are then converted into corresponding video works. The whole pipeline offers a high degree of freedom and creativity and may open new possibilities for music and art creation.
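The interval-labeling step described in the abstract can be illustrated with a minimal Python sketch. The thresholds, labels, and function names here are hypothetical, since the abstract does not give the actual interval boundaries. In particular, treating "average phase" as a 0-to-1 stereo correlation value is an assumption made only for illustration. The resulting sentence is the kind of text that could then be passed to a GPT model to produce an image prompt.

```python
from bisect import bisect_right

# Hypothetical interval edges and labels (assumptions, not the paper's values).
# Each entry is (sorted edges, labels); there is one more label than edges.
FREQ_BINS = ([250.0, 2000.0], ["bass-heavy", "mid-range", "bright"])        # Hz
LOUDNESS_BINS = ([-23.0, -14.0], ["quiet", "moderate", "loud"])             # LUFS
PHASE_BINS = ([0.3, 0.7], ["wide stereo", "balanced stereo", "near-mono"])  # assumed 0..1 scale


def label(value, bins):
    """Return the natural-language label of the interval containing value."""
    edges, names = bins
    # bisect_right counts how many edges are <= value, which is the interval index.
    return names[bisect_right(edges, value)]


def describe_audio(mean_freq_hz, mean_lufs, mean_phase):
    """Compose a one-sentence description from the three averaged parameters."""
    return (
        f"The track is {label(mean_freq_hz, FREQ_BINS)}, "
        f"{label(mean_lufs, LOUDNESS_BINS)}, "
        f"with a {label(mean_phase, PHASE_BINS)} image."
    )


if __name__ == "__main__":
    print(describe_audio(180.0, -10.5, 0.85))
```

Keeping the edges and labels in plain data tables means the intervals can be retuned without touching the labeling logic.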
Cite this article: 冉昕哲, 高琛, 黄小明, 梁嘉桐, 倪芊睿, 程思琪. AI Painting Method Based on Stable-Diffusion [J]. Computer Science and Application (计算机科学与应用), 2024, 14(5): 147-155. https://doi.org/10.12677/csa.2024.145123
