Research on Optimization of Cross-Modal Retrieval Based on Ranking Loss
DOI: 10.12677/mos.2025.141012
Author: Jiang Cailan, School of Management, University of Shanghai for Science and Technology, Shanghai
Keywords: Cross-Modal Retrieval, Ranking Loss, Similarity Measure
Abstract: Cross-modal retrieval aims to retrieve data in one modality (such as text or images) using a query from another modality. Traditional cross-modal retrieval methods rely primarily on modality alignment and similarity measures to match features across modalities. This paper proposes a ranking-based cross-modal retrieval method that optimizes the retrieval process by introducing a ranking loss, so that items most relevant to the query are ranked at the top of the results. Experimental results show that introducing a ranking loss significantly improves cross-modal retrieval performance, particularly in text-image matching, providing a new methodological perspective and a solid technical foundation for future research in the field.
Article citation: Jiang, C. (2025) Research on Optimization of Cross-Modal Retrieval Based on Ranking Loss. Modeling and Simulation, 14(1), 116-121. https://doi.org/10.12677/mos.2025.141012
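As a concrete illustration of the idea described in the abstract (not the paper's exact formulation), a max-margin pairwise ranking loss of the kind commonly used in image-text matching can be sketched in NumPy. Given a batch of matched image/text embedding pairs, it penalizes any mismatched pair whose similarity comes within a margin of the matched pair's similarity, in both retrieval directions:

```python
import numpy as np

def ranking_loss(image_emb, text_emb, margin=0.2):
    """Max-margin pairwise ranking loss for a batch of matched pairs.

    image_emb, text_emb: (n, d) arrays; row i of each is a matched pair.
    Returns the summed hinge violations over all mismatched pairs.
    """
    # L2-normalize rows so the dot product below is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = image_emb @ text_emb.T           # sim[i, j] = s(image_i, text_j)
    pos = np.diag(sim)                     # similarities of matched pairs
    # A mismatched text (image->text) or image (text->image) should score
    # at least `margin` below the matched pair's similarity.
    cost_txt = np.maximum(0, margin + sim - pos[:, None])  # image -> text
    cost_img = np.maximum(0, margin + sim - pos[None, :])  # text -> image
    off = ~np.eye(sim.shape[0], dtype=bool)  # exclude matched (diagonal) pairs
    return cost_txt[off].sum() + cost_img[off].sum()
```

Minimizing such a loss pushes relevant items above irrelevant ones in the ranked results, which is the behavior the paper attributes to its ranking-loss formulation; the margin value and the summed (rather than hardest-negative) hinge are illustrative choices here.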
