Improving Image Question Answering Accuracy with Text Feature Enhancement and an Attention Mechanism
Abstract:
Image question answering is one of the main directions in which deep learning has been applied successfully in computer vision, with broad applications in artificial intelligence, natural language processing, and image recognition. The accuracy of an image question answering system depends not only on the design of its feature fusion module, but also on how well the image features and the question features match at the semantic level. In this paper, the text features and visual features of the image are first fused to form enhanced image features. Text features are then extracted from the question, and an attention mechanism is used to fuse the enhanced image features with the question text features. The answer is predicted from the fused features. Experimental results show that the proposed method addresses the mismatch in semantic level between image features and text features and improves the accuracy of the image question answering system.
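The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is only a sketch under stated assumptions, not the paper's implementation: the abstract does not specify the feature extractors, the form of the attention, or the fusion operator, so concatenation for the enhancement step, additive (tanh) attention, and element-wise product fusion are all assumed here, and every name, dimension, and weight (`make_params`, `answer_scores`, `W_e`, `W_q`, and so on) is hypothetical and randomly initialised.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def make_params(d_v, d_t, d_q, hidden, n_answers, rng):
    # Hypothetical random weights; a real system would learn these.
    d_e = d_v + d_t  # size of one enhanced image feature
    return {
        "W_e": rng.normal(0, 0.1, (d_e, hidden)),   # attention: image side
        "W_q": rng.normal(0, 0.1, (d_q, hidden)),   # attention: question side
        "w_a": rng.normal(0, 0.1, (hidden,)),       # attention scoring vector
        "W_i": rng.normal(0, 0.1, (d_e, hidden)),   # fusion: image projection
        "W_t": rng.normal(0, 0.1, (d_q, hidden)),   # fusion: question projection
        "W_o": rng.normal(0, 0.1, (hidden, n_answers)),  # answer classifier
    }

def answer_scores(visual, image_text, question, params):
    """visual: (k, d_v) region features; image_text: (d_t,); question: (d_q,)."""
    k = visual.shape[0]
    # Step 1 (assumed form): enhance each region's visual feature by
    # concatenating the image-level text feature to it.
    enhanced = np.concatenate([visual, np.tile(image_text, (k, 1))], axis=1)
    # Step 2: question-guided additive attention over the k regions.
    hidden = np.tanh(enhanced @ params["W_e"] + question @ params["W_q"])
    weights = softmax(hidden @ params["w_a"])      # (k,), sums to 1
    attended = weights @ enhanced                  # weighted sum of regions
    # Step 3 (assumed form): fuse by element-wise product in a shared space.
    fused = np.tanh(attended @ params["W_i"]) * np.tanh(question @ params["W_t"])
    # Step 4: predict a distribution over candidate answers.
    return softmax(fused @ params["W_o"]), weights

rng = np.random.default_rng(0)
params = make_params(d_v=8, d_t=6, d_q=6, hidden=10, n_answers=5, rng=rng)
visual = rng.normal(size=(4, 8))      # 4 image regions (placeholder values)
image_text = rng.normal(size=6)       # text feature of the image (placeholder)
question = rng.normal(size=6)         # question text feature (placeholder)
probs, weights = answer_scores(visual, image_text, question, params)
predicted_answer = int(np.argmax(probs))
```

Because the image text feature and the question feature are both text-derived, the sketch places them in spaces of the same size (`d_t == d_q`), which is one simple way to picture the semantic-level matching the abstract refers to; the actual method may differ.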