Triplet Cross-Modal Retrieval Based on TopN Pairwise Similarity Transfer
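DOI: 10.12677/CSA.2021.1110256. Supported by the National Natural Science Foundation of China.
Authors: 谭钜源 (Tan Juyuan), 何国辉 (He Guohui), 袁文聪 (Yuan Wencong): Faculty of Intelligent Manufacturing, Wuyi University, Jiangmen, Guangdong; Jiangmen Engineering Technology Research Center for Intelligent Data Analysis and Application, Jiangmen, Guangdong
Keywords: Cross-Modal Retrieval; Subspace Learning; Triplet Loss; Locality Preserving Projections; Pairwise Similarity Transfer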
Abstract: With the rapid development of technology, information on the Internet increasingly exists in multiple coexisting modalities, and how to store and retrieve multi-modal information has become a research hotspot. Cross-modal retrieval uses data of one modality to retrieve semantically related data of other modalities. Most current research focuses on bringing related samples as close together as possible, and pushing unrelated samples as far apart as possible, in a common subspace, while paying little attention to how the related samples are ranked. This paper therefore proposes a triplet cross-modal retrieval method based on TopN pairwise similarity transfer. The method uses triplet loss and Locality Preserving Projections to construct a common subspace shared by multiple modalities, and transfers the high-similarity relations between samples in the original space to the common subspace to build reasonable ranking constraints. Experiments on two classical cross-modal datasets demonstrate the effectiveness of the method.
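The abstract does not give the loss formulation, so the following is only a minimal sketch of how a TopN similarity-transfer constraint could be combined with a triplet loss. It assumes cosine similarity in the original image feature space, linear projections into the common subspace, and a squared-Euclidean hinge triplet loss. All function names (`top_n_neighbors`, `transfer_loss`, and so on), the projection matrices, and the margin are hypothetical, and the Locality Preserving Projections term is omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def top_n_neighbors(features, n):
    """For each sample, indices of its n most similar other samples
    in the ORIGINAL feature space (assumption: cosine similarity)."""
    result = []
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        scored.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first
        result.append([j for _, j in scored[:n]])
    return result


def project(W, x):
    """Linear projection W x into the common subspace (W is a list of rows)."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: the positive should be closer than the negative
    to the anchor by at least `margin`."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def transfer_loss(img_feats, txt_feats, W_img, W_txt, n=1, margin=1.0):
    """Hypothetical transfer term: for image i, the texts paired with its
    TopN original-space neighbours should sit closer to image i in the
    common subspace than the texts paired with non-neighbours."""
    neighbors = top_n_neighbors(img_feats, n)
    z_img = [project(W_img, x) for x in img_feats]
    z_txt = [project(W_txt, y) for y in txt_feats]
    total, count = 0.0, 0
    for i, nbrs in enumerate(neighbors):
        for p in nbrs:
            for q in range(len(img_feats)):
                if q == i or q in nbrs:
                    continue  # skip the anchor's own pair and other neighbours
                total += triplet_loss(z_img[i], z_txt[p], z_txt[q], margin)
                count += 1
    return total / count if count else 0.0


# Toy data: three paired (image, text) samples with 2-D features.
img_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
txt_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projections, not learned

print(top_n_neighbors(img_feats, 1))
print(transfer_loss(img_feats, txt_feats, identity, identity, n=1))
```

In a real system the projection matrices would be learned by minimising this term together with the other objectives; here they are fixed identity matrices purely so the example runs without an optimiser.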
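Cite this article: Tan, J., He, G. and Yuan, W. (2021) Triplet Cross-Modal Retrieval Based on TopN Pairwise Similarity Transfer. Computer Science and Application, 11(10), 2529-2537. https://doi.org/10.12677/CSA.2021.1110256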
