Text-Based Person Re-Identification via Identity and Attribute-Aware Alignment
Abstract: Text-based Person Re-Identification (TextReID) aims to retrieve the pedestrian images that match a natural language description from a large-scale gallery, and has important applications in scenarios such as intelligent surveillance. Recently, large-scale vision-language pre-trained models such as Contrastive Language-Image Pre-Training (CLIP) have been widely transferred to this task. However, CLIP's default instance-level contrastive objective is structurally mismatched with the true distribution of person re-identification data, in which multiple images and texts correspond to the same identity. Meanwhile, attribute-related semantics are insufficiently modeled on the text side. Together, these issues lead to unstable retrieval performance and limited generalization. To address them, we propose a structure-aware CLIP adaptation framework that corrects CLIP structurally at two levels, the objective space and the semantic space. Specifically, we introduce an Identity-Aware Alignment (IAA) loss that lifts instance-level image-text alignment to identity-level distribution alignment, preventing samples of the same identity from being mistakenly treated as negatives. We also propose Attribute-Aware Masked Modeling (AAMM), which focuses on the semantics of attribute-related words during text encoding to produce more discriminative textual representations. Experiments on multiple public datasets show that the proposed method significantly outperforms existing approaches on metrics such as Rank-1 and mAP, validating the effectiveness and generalization ability of the framework.
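The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the abstract gives no equations, so the KL-divergence form of the loss, the temperature value, the masking probabilities, the `[MASK]` string, and the function names `alignment_loss` and `attribute_aware_mask` are all assumptions made here for illustration. The loss compares each image's softmax distribution over the texts in a batch against a target that spreads probability uniformly over every text sharing that image's identity; passing unique labels recovers the usual instance-level target.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(sim, ids, temperature=0.07):
    """Image-to-text KL(target || prediction), averaged over the batch.

    sim[i][j] is the similarity of image i and text j; ids[i] is the
    identity label of pair i. The target for image i is uniform over all
    texts j with ids[j] == ids[i] (an assumed identity-level target).
    """
    n = len(ids)
    loss = 0.0
    for i in range(n):
        probs = softmax([s / temperature for s in sim[i]])
        matches = [j for j in range(n) if ids[j] == ids[i]]
        q = 1.0 / len(matches)  # uniform mass per same-identity text
        loss += sum(q * math.log(q / probs[j]) for j in matches)
    return loss / n


def attribute_aware_mask(tokens, attribute_words, p_attr=0.6, p_other=0.1,
                         rng=random):
    """Mask attribute words more often than other words (assumed rates)."""
    masked = []
    for tok in tokens:
        p = p_attr if tok.lower() in attribute_words else p_other
        masked.append("[MASK]" if rng.random() < p else tok)
    return masked


# Pairs 0 and 1 show the same person; pair 2 shows someone else.
sim = [[0.9, 0.9, 0.1],
       [0.9, 0.9, 0.1],
       [0.1, 0.1, 0.9]]
identity_level = alignment_loss(sim, ids=[0, 0, 1])
instance_level = alignment_loss(sim, ids=[0, 1, 2])  # every pair its own class
print(identity_level, instance_level)

caption = "a woman wearing a red jacket and black jeans".split()
attributes = {"red", "jacket", "black", "jeans"}  # hypothetical attribute list
print(attribute_aware_mask(caption, attributes, p_attr=1.0, p_other=0.0))
```

In this toy batch the similarity matrix is already ideal at the identity level, so the identity-level loss is close to zero, while the instance-level target still penalizes images 0 and 1 for matching each other's captions. A symmetric text-to-image term could be obtained by applying the same function to the transposed matrix.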
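Cite this article: Zhan, G. (2026) Text-Based Person Re-Identification via Identity and Attribute-Aware Alignment. Computer Science and Application (计算机科学与应用), 16(2), 348-357. https://doi.org/10.12677/csa.2026.162064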
