Video Retrieval Model Based on Video Text Alignment
DOI: 10.12677/jisp.2025.143032
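Authors: Yu Zhang, Tianbao Zhang: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, Anhui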
Keywords: Multimodal Data Alignment, Deep Learning, Video Comprehension, Video Retrieval
Abstract: Text-video retrieval methods based on global alignment lack fine-grained semantic matching, and the cross-modal semantic gap makes feature alignment difficult. To address these problems, an efficient global-local sequence alignment method (ETVA) is proposed. The model consists of a text encoder, a video encoder, a text-video global alignment module, and a text-video fine-grained alignment module. The text encoder adopts the ALBERT model, whose bidirectional encoding extracts text features and helps improve the temporal consistency and semantic correlation of cross-modal features. The video encoder uses a multi-expert strategy to capture video information across multiple modalities and feature types. The global alignment module achieves global semantic alignment by aggregating and transforming features; the fine-grained alignment module uses a shared cluster-center mechanism to capture the semantic associations between local details in text and video. Experiments on the MSRVTT, ActivityNet Captions, and LSMDC datasets, evaluated with Recall@K and Median Rank, show that ETVA performs well on all three datasets and improves retrieval accuracy over other methods.
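The two evaluation metrics named in the abstract, Recall@K and Median Rank, can be illustrated with a minimal sketch. It assumes a text-to-video similarity matrix in which the matching video for query i is video i; the function names, the tie-handling rule, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
from statistics import median


def rank_of_ground_truth(sim_row, gt_index):
    """Return the 1-based rank of the ground-truth video for one text query.

    Ties are handled pessimistically (an assumption of this sketch): any
    other video scoring at least as high as the ground truth is placed
    ahead of it.
    """
    gt_score = sim_row[gt_index]
    ahead = sum(1 for j, s in enumerate(sim_row) if j != gt_index and s >= gt_score)
    return ahead + 1


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K (in percent) and Median Rank for text-to-video retrieval.

    sim[i][j] is the similarity between text query i and video j; the
    matching video for query i is assumed to be video i.
    """
    ranks = [rank_of_ground_truth(row, i) for i, row in enumerate(sim)]
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = median(ranks)
    return metrics


# Toy 4x4 similarity matrix (made-up numbers, for illustration only).
toy_sim = [
    [0.9, 0.1, 0.2, 0.3],  # query 0: ground truth ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # query 1: ground truth ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # query 2: ground truth ranked 1st
    [0.9, 0.8, 0.7, 0.1],  # query 3: ground truth ranked 4th
]
print(retrieval_metrics(toy_sim, ks=(1, 2)))
# -> {'R@1': 50.0, 'R@2': 75.0, 'MdR': 1.5}
```

Higher Recall@K and lower Median Rank both indicate better retrieval; video-to-text retrieval can be scored the same way by transposing the similarity matrix.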
Cite this article: Zhang, Y. and Zhang, T. (2025) Video Retrieval Model Based on Video Text Alignment. Journal of Image and Signal Processing, 14(3), 349-361. https://doi.org/10.12677/jisp.2025.143032
