PDID: Visual Discretization Intelligent Question Answering Model: A VQA Model Based on Image Pixel Discretization and Image Semantic Discretization
Abstract: Visual question answering (VQA) is a challenging multimodal task that bridges computer vision and natural language processing. Given an image and a related question, the model must extract the relevant information and give the correct answer. Because images and text belong to different modalities, a serious semantic gap separates them, so aligning information across modalities and narrowing that gap is a central concern in current VQA research. To address the large difference in information granularity between images and text at the multimodal alignment stage of current VQA methods, this paper proposes an intelligent question answering model based on visual discretization (PDID: Pixel Discretization and Instance Discretization), assisted by a modality attention mechanism, to achieve cross-modal information and semantic alignment. Image features take the pixel as their smallest unit, whereas text features take the word, and the two differ enormously in granularity: language can construct its entire semantic space from at most tens of thousands of words, while an image is built from RGB arrays on the order of hundreds of millions of values. Directly modeling an image at the pixel level therefore makes it hard to align with text. This paper applies several forms of image discretization. On the one hand, it discretizes image pixels in four forms (color discretization, intensity discretization, texture discretization, and spatial discretization), bringing the number of minimal primitives close, in order of magnitude, to that of text features. On the other hand, it soft-encodes image semantic features, which discretizes the deep semantic features of the image and aligns them with the word semantics of the text, approaching at the semantic level the amount of word-level semantic information carried by text features. In addition, this paper proposes a new visual relation fusion module that captures the interactions between discretized and continuous features within the same modality, providing the model with rich visual features. The model first uses self-attention to extract the correlations among intra-modal features, that is, the global visual relations, and then uses channel-spatial separated attention for cross-modal combination, giving the locally guided global features a larger representation space and more complementary information. Extensive experiments on the VQA-v2, COCO-QA, and VQA-CP v2 datasets verify the effectiveness of this discretization-based approach for visual question answering. The results also indicate that the model generalizes well to other cross-modal tasks (image-text matching and referring expressions).
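The abstract does not specify how the pixel discretization is carried out, so the following is only a minimal sketch of the general idea behind two of the four forms (color and intensity), assuming plain uniform quantization. The function names `discretize_color` and `discretize_intensity`, the bin counts, and the random test image are all illustrative assumptions, not the authors' implementation. The point it illustrates is how a pixel space of 256^3 possible RGB values can be collapsed into a small vocabulary of integer tokens, closer in size to a word vocabulary.

```python
import numpy as np


def discretize_intensity(image, levels=8):
    """Map each pixel's mean RGB intensity to one of `levels` integer bins.

    Assumption: uniform bins over [0, 256); the paper may bin differently.
    """
    gray = image.astype(np.float64).mean(axis=-1)           # H x W, values in [0, 255]
    bins = np.floor(gray / 256.0 * levels).astype(np.int64)
    return np.clip(bins, 0, levels - 1)


def discretize_color(image, levels_per_channel=4):
    """Quantize each RGB channel uniformly and fuse the three bin indices
    into a single token id in [0, levels_per_channel ** 3).

    Assumption: uniform per-channel bins, chosen only for illustration.
    """
    q = (image.astype(np.int64) * levels_per_channel) // 256  # H x W x 3
    return (q[..., 0] * levels_per_channel + q[..., 1]) * levels_per_channel + q[..., 2]


# Hypothetical 32 x 32 RGB image with random 8-bit values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

color_tokens = discretize_color(image)          # vocabulary of 4**3 = 64 color tokens
intensity_tokens = discretize_intensity(image)  # vocabulary of 8 intensity tokens
```

With these assumed settings, every pixel is represented by one of 64 color tokens and one of 8 intensity tokens instead of one of roughly 16.7 million raw RGB values, which is the kind of reduction in primitive count the abstract describes.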
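Cite this article: Chen Yeming, Zhang Siyu, Sun Yaoru. PDID: Visual Discretization Intelligent Question Answering Model: A VQA Model Based on Image Pixel Discretization and Image Semantic Discretization [J]. Computer Science and Application, 2023, 13(12): 2432-2446. https://doi.org/10.12677/CSA.2023.1312243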
