[1] Ardila, A., Bernal, B. and Rosselli, M. (2015) Language and Visual Perception Associations: Meta-Analytic Connectivity Modeling of Brodmann Area 37. Behavioural Neurology, 2015, Article ID: 565871.
[2] Antol, S., Agrawal, A., Lu, J., Mitchell, M. and Parikh, D. (2015) VQA: Visual Question Answering. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, 7-13 December 2015, 2425-2433.
[3] Hochreiter, S. and Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9, 1735-1780.
[4] Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv: 1409.1556.
[5] Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D. and Parikh, D. (2016) Yin and Yang: Balancing and Answering Binary Visual Questions. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 5014-5022.
[6] Shih, K.J., Singh, S. and Hoiem, D. (2016) Where to Look: Focus Regions for Visual Question Answering. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 4613-4621.
[7] Krizhevsky, A., Sutskever, I. and Hinton, G.E. (2017) ImageNet Classification with Deep Convolutional Neural Networks. Communications of the ACM, 60, 84-90.
[8] Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A. and Fergus, R. (2015) Simple Baseline for Visual Question Answering. arXiv: 1512.02167.
[9] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., et al. (2015) Going Deeper with Convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, 7-12 June 2015, 1-9.
[10] Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T. and Rohrbach, M. (2016) Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, November 2016, 457-468.
[11] Charikar, M., Chen, K. and Farach-Colton, M. (2002) Finding Frequent Items in Data Streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R. and Hennessy, M., Eds., Automata, Languages and Programming, Springer, Berlin, 693-703.
[12] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., et al. (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Proceedings of the 32nd International Conference on Machine Learning, Lille, 6-11 July 2015, 2048-2057.
[13] Yang, Z., He, X., Gao, J., Deng, L. and Smola, A. (2016) Stacked Attention Networks for Image Question Answering. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 21-29.
[14] Lu, P., Li, H., Zhang, W., Wang, J. and Wang, X. (2018) Co-Attending Freeform Regions and Detections with Multi-Modal Multiplicative Feature Embedding for Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 32, 7218-7225.
[15] Teney, D., Anderson, P., He, X. and Van Den Hengel, A. (2018) Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 4223-4232.
[16] Wu, C., Liu, J., Wang, X. and Li, R. (2019) Differential Networks for Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8997-9004.
[17] Zhang, L., et al. (2021) Rich Visual Knowledge-Based Augmentation Network for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 32, 4362-4373.
[18] Xie, E., et al. (2020) PolarMask: Single Shot Instance Segmentation with Polar Representation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 13-19 June 2020, 12190-12199.
[19] Chen, H., Sun, K., Tian, Z., Shen, C., Huang, Y. and Yan, Y. (2020) BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, 13-19 June 2020, 8570-8578.
[20] Gao, P., et al. (2019) Dynamic Fusion with Intra- and Inter-Modality Attention Flow for Visual Question Answering. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, 15-20 June 2019, 6632-6641.
[21] Stefanini, M., Cornia, M., Baraldi, L. and Cucchiara, R. (2021) A Novel Attention-Based Aggregation Function to Combine Vision and Language. 2020 25th International Conference on Pattern Recognition (ICPR), Milan, 10-15 January 2021, 1212-1219.
[22] Wu, C., Liu, J., Wang, X. and Dong, X. (2018) Object-Difference Attention: A Simple Relational Attention for Visual Question Answering. Proceedings of the 26th ACM International Conference on Multimedia, Seoul, 22-26 October 2018, 519-527.
[23] Peng, L., Yang, Y., Wang, Z., Wu, X. and Huang, Z. (2019) CRA-Net: Composed Relation Attention Network for Visual Question Answering. Proceedings of the 27th ACM International Conference on Multimedia, Nice, 21-25 October 2019, 1202-1210.
[24] Yang, Z., He, X., Gao, J., Deng, L. and Smola, A. (2016) Stacked Attention Networks for Image Question Answering. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 21-29.
[25] Li, L., Gan, Z., Cheng, Y. and Liu, J. (2019) Relation-Aware Graph Attention Network for Visual Question Answering. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, 27 October-2 November 2019, 10312-10321.
[26] Osman, A. and Samek, W. (2019) DRAU: Dual Recurrent Attention Units for Visual Question Answering. Computer Vision and Image Understanding, 185, 24-30.
[27] Peng, L., et al. (2019) Word-to-Region Attention Network for Visual Question Answering. Multimedia Tools and Applications, 78, 3843-3858.
[28] Liu, Y., Zhang, X., Zhao, Z., Zhang, B., Cheng, L. and Li, Z. (2022) ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering. IEEE Transactions on Cybernetics, 52, 4520-4533.
[29] Zhong, H., Chen, J., Shen, C., Zhang, H., Huang, J. and Hua, X.S. (2021) Self-Adaptive Neural Module Transformer for Visual Question Answering. IEEE Transactions on Multimedia, 23, 1264-1273.
[30] Peng, L., Yang, Y., Wang, Z., Huang, Z. and Shen, H.T. (2022) MRA-Net: Improving VQA via Multi-Modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 318-329.
[31] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K. (2019) BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv: 1810.04805. http://arxiv.org/abs/1810.04805
[32] Wang, X., Zhang, R., Kong, T., Li, L. and Shen, C. (2020) SOLOv2: Dynamic and Fast Instance Segmentation. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, 6-12 December 2020, 17721-17732.
[33] Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X. and Huang, T.S. (2018) Generative Image Inpainting with Contextual Attention. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 5505-5514.
[34] Kim, J.H., On, K.W., Lim, W., Kim, J., Ha, J.W. and Zhang, B.T. (2017) Hadamard Product for Low-Rank Bilinear Pooling. arXiv: 1610.04325. http://arxiv.org/abs/1610.04325
[35] Bai, Y., Fu, J., Zhao, T. and Mei, T. (2018) Deep Attention Neural Tensor Network for Visual Question Answering. In: Ferrari, V., Hebert, M., Sminchisescu, C. and Weiss, Y., Eds., Computer Vision – ECCV 2018, Springer, Cham, 21-37.
[36] Wu, C., Liu, J., Wang, X. and Dong, X. (2018) Chain of Reasoning for Visual Question Answering. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, 3-8 December 2018, 275-285.