A Contrastive Learning Based Method for Chinese-English Code-Switching Text Similarity
Abstract: Computing the semantic similarity of Chinese-English code-switching (CS) text is a significant challenge in natural language processing, mainly because of its complex language structure and the scarcity of annotated data. This paper proposes Code-Switching Contrastive Learning (CSCL), a contrastive learning framework for Chinese-English code-switching text, and designs two data augmentation strategies, Code-Switching Point Shifting (CSPS) and Context-Aware Back-Translation (CABT), to generate high-quality positive and negative sample pairs that help the model learn robust semantic representations that are insensitive to language switching. The method is applied in a two-tower Siamese network with ALBERT as the shared encoder. Experimental results show that CSCL outperforms several baseline models on Chinese-English code-switching text similarity, improving Spearman's rank correlation by 4 percentage points over the comparison methods, which supports the effectiveness of the proposed approach.
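The abstract does not spell out the training objective or the evaluation code. As a rough illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives over cosine similarities between the two tower outputs, and scores predictions with Spearman's rank correlation. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE-style loss with in-batch negatives (assumed objective).

    anchors[i] and positives[i] are embeddings of a sentence and its
    augmented view; positives[j] with j != i act as negatives for anchors[i].
    The temperature of 0.05 is an assumed value, not one from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Toy 2-d "embeddings": each anchor matches the positive at the same index.
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[1.0, 0.0], [0.0, 1.0]]
    print("loss (aligned pairs):", info_nce(anchors, positives))
    print("loss (swapped pairs):", info_nce(anchors, positives[::-1]))
    print("spearman:", spearman([0.1, 0.4, 0.9], [1.0, 2.0, 5.0]))
```

In this sketch the loss is small when each sentence is closest to its own augmented view and large when the pairs are mismatched, which is the behaviour a contrastive objective is meant to reward. In an actual experiment the embeddings would come from the shared encoder rather than hand-written vectors.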
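Cite this article: 廖红虹, 赵文博, 廖海明, 郭昊淞, 刘剑波. A Contrastive Learning Based Method for Chinese-English Code-Switching Text Similarity [J]. 计算机科学与应用 (Computer Science and Application), 2025, 15(9): 93-104. https://doi.org/10.12677/csa.2025.159227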
