Bridging the Expression Gap: Research and Applications of Zero-Shot Composed Image Retrieval (ZS-CIR) for E-Commerce
Abstract: In e-commerce search, users often want to keep the visual features of a reference product while modifying a semantic attribute. Traditional text search cannot express this composite query, and the resulting mismatch is the “expression gap” addressed here. This paper focuses on Zero-Shot Composed Image Retrieval (ZS-CIR). It constructs a unified mathematical framework, systematically surveys the technical pathways, explains the theoretical basis of key methods such as Self-Masking Projection (SMP) and noise injection, and evaluates and compares the mainstream paradigms (textual inversion, language-only training, and synthetic-data-driven approaches) on typical benchmark datasets. The experiments show that language-only training reaches practical performance at very low training cost (Recall@10 of 38.5%, inference latency of 18 ms), which supports the feasibility of simulating visual modifications in the language space. The synthetic-data-driven approach achieves the best performance among the compared methods (Recall@10 of 46.8%), benefiting from the scale of its training data. Covering the technical landscape, the theoretical foundations, and the system architecture, this paper provides a systematic reference for research on and application of ZS-CIR in e-commerce scenarios.
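For readers unfamiliar with the metric behind the Recall@10 figures quoted in the abstract, the following is a minimal sketch of how Recall@K is commonly computed for retrieval. It assumes exactly one ground-truth target image per query, which may not match the evaluation protocol actually used in the paper. The function name `recall_at_k` and the toy image identifiers are hypothetical and serve only as illustration.

```python
from typing import Sequence


def recall_at_k(
    ranked_ids: Sequence[Sequence[str]],
    target_ids: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of queries whose target appears among the top-k results.

    Assumption (not from the paper): each query has a single target image.
    ranked_ids[i] is the retrieval result for query i, best match first.
    """
    if len(ranked_ids) != len(target_ids):
        raise ValueError("need one ranked list per target")
    if not target_ids:
        return 0.0
    hits = sum(
        1
        for ranked, target in zip(ranked_ids, target_ids)
        if target in ranked[:k]
    )
    return hits / len(target_ids)


# Toy data: three composed queries, each with a ranked list of image ids.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
targets = ["b", "f", "z"]

recall_top2 = recall_at_k(rankings, targets, k=2)  # only the first query hits
recall_top3 = recall_at_k(rankings, targets, k=3)  # first and second queries hit
```

Under this definition, a Recall@10 of 38.5% would mean that for 38.5% of the composed queries the correct target image is ranked among the first ten results.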