[1] Bai, L., Islam, M. and Ren, H. (2023) Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery. In: Lecture Notes in Computer Science, Springer, 397-407.
[2] Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J. and Chang, K.W. (2019) VisualBERT: A Simple and Performant Baseline for Vision and Language.
[3] Seenivasan, L., Islam, M., Krishna, A.K. and Ren, H.L. (2022) Surgical-VQA: Visual Question Answering in Surgical Scenes Using Transformer. In: Lecture Notes in Computer Science, Springer, 33-43.
[4] Yu, Z., Yu, J., Cui, Y.H., Tao, D.C. and Tian, Q. (2019) Deep Modular Co-Attention Networks for Visual Question Answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, 15-20 June 2019, 6281-6290.
[5] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A. and Jegou, H. (2021) Training Data-Efficient Image Transformers & Distillation through Attention. Proceedings of the 38th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), 18-24 July 2021, 10347-10357.
[6] Ben-Younes, H., Cadene, R., Cord, M. and Thome, N. (2017) MUTAN: Multimodal Tucker Fusion for Visual Question Answering. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, 22-29 October 2017, 2631-2639.
[7] Yu, Z., Yu, J., Xiang, C.C., Fan, J.P. and Tao, D.C. (2018) Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Transactions on Neural Networks and Learning Systems, 29, 5947-5959.
[8] Ben-Younes, H., Cadene, R., Thome, N. and Cord, M. (2019) BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 8102-8109.
[9] Bai, L., Islam, M., Seenivasan, L. and Ren, H.L. (2023) Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery. 2023 IEEE International Conference on Robotics and Automation (ICRA), London, 29 May-2 June 2023, 6859-6865.