End-to-End Speech Recognition Based on Multimode

doi:10.12677/CSA.2021.115133

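Authors: Tan Zhenyu (谭振宇), Wu Yizhi (吴怡之): College of Information Science and Technology, Donghua University, Shanghai
Keywords: Multimode; End-to-End; Speech Recognition; Bidirectional Long Short-Term Memory Network; BiLstmCtc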

Abstract: To eliminate the complex audio segmentation and forced-alignment steps, and to exploit the visual information carried by the speaker's articulators in noisy environments, this paper proposes an end-to-end multimodal speech recognition algorithm that incorporates lip features. The speaker's video is first processed into a set of image frames, and a regression-tree-based face alignment algorithm extracts features from the main visible region involved in articulation. These visual features are aligned and fused with the speaker's acoustic features to form a new feature representation, which is then processed by an end-to-end deep bidirectional long short-term memory network model that supports variable-length input (DeepBiLstmCtc) to output the corresponding phoneme sequence. Experimental results show that the algorithm effectively recognizes the phoneme sequence in the audio-visual information and also yields some improvement in recognition rate under noisy conditions.
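The two steps on either side of the network, fusing per-frame features and turning frame-level outputs into a phoneme sequence, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two streams are aligned, so nearest-frame upsampling of the slower video stream is an assumption here, and best-path (greedy) decoding is only one common way to read a CTC output. The function names `align_visual_to_audio`, `fuse`, and `ctc_greedy_decode` are hypothetical, and the per-frame scores stand in for the output of the BiLSTM, which is not modeled.

```python
from typing import List, Sequence


def align_visual_to_audio(visual: Sequence[Sequence[float]], n_audio: int) -> List[List[float]]:
    """Repeat video frames so there is one visual vector per audio frame.

    Assumption: both streams cover the same time span, so audio frame i
    maps to the video frame at the same relative position.
    """
    n_visual = len(visual)
    if n_visual == 0 or n_audio <= 0:
        raise ValueError("both streams must contain at least one frame")
    return [list(visual[min(i * n_visual // n_audio, n_visual - 1)]) for i in range(n_audio)]


def fuse(audio: Sequence[Sequence[float]], visual: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate each acoustic vector with its aligned lip-feature vector."""
    aligned = align_visual_to_audio(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, aligned)]


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], labels: Sequence[str], blank: int = 0) -> List[str]:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    decoded: List[str] = []
    prev = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # A label is emitted only when it differs from the previous frame's
        # label and is not the blank symbol.
        if best != prev and best != blank:
            decoded.append(labels[best])
        prev = best
    return decoded


# Toy data: 4 audio frames (2 dims each) and 2 video frames (1 dim each).
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
visual = [[1.0], [2.0]]
fused = fuse(audio, visual)

# Toy network output: one score vector per frame over (blank, "a", "b").
labels = ["<blank>", "a", "b"]
frame_scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
phonemes = ctc_greedy_decode(frame_scores, labels)
```

In this toy run each video frame is repeated twice, so every fused vector has three dimensions, and the decoded sequence is `["a", "a", "b"]`: the blank in the third frame separates the two occurrences of "a". Because the blank symbol absorbs frames that carry no label, this style of output needs no pre-segmented audio or forced alignment, which is the motivation stated in the abstract.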

Cite this article: Tan Zhenyu, Wu Yizhi. End-to-End Speech Recognition Based on Multimode [J]. Computer Science and Application, 2021, 11(5): 1315-1324. https://doi.org/10.12677/CSA.2021.115133

