Deep Learning-Based Translation Laundering Plagiarism Detection Algorithm
Abstract: To address the increasingly diverse and novel forms of article-spinning plagiarism that have emerged with the rapid development of multimedia technology and the Internet, this paper proposes a deep learning-based algorithm for detecting translation laundering plagiarism. The algorithm fuses features obtained from multiple rounds of translation to strengthen the features of the translated text, yielding high-quality text representations. A contrastive learning architecture then pulls the original text closer to its translated versions in the semantic vector space while keeping it apart from negative samples. An improved contrastive loss function further strengthens the model's ability to detect laundered text. The model is trained and validated on a purpose-built multi-tuple translation laundering dataset. Experimental results show that the proposed algorithm produces higher-quality text representations and outperforms previous methods on the translation laundering plagiarism detection task; the Spearman correlation coefficient results further confirm the advantage of the constructed model.
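The pipeline summarized in the abstract (fusing multi-round translation features, contrastive training, and Spearman-based evaluation) can be illustrated with a minimal NumPy sketch. The abstract does not specify the fusion operator or the improved loss, so everything here is an assumption: `fuse_rounds` uses plain mean pooling, `info_nce_loss` is the standard InfoNCE loss with in-batch negatives rather than the paper's modified loss, and all function names and the `temperature` value are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fuse_rounds(round_embeddings):
    """Assumed fusion: average embeddings over translation rounds.

    round_embeddings has shape (rounds, batch, dim); the result is (batch, dim).
    """
    return np.mean(round_embeddings, axis=0)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Standard InfoNCE loss (not the paper's improved variant).

    Row i of `positives` is the positive for row i of `anchors`;
    every other row in the batch serves as a negative.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def spearman(x, y):
    """Spearman rank correlation; this simple version assumes no tied values."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])
```

In this sketch, originals paired with their own fused translations should give a lower loss than originals paired with mismatched translations, and `spearman` would be applied to predicted similarity scores against gold similarity labels.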
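Cite this article: He, X.L. and Zhou, Y.D. (2024) Deep Learning-Based Translation Laundering Plagiarism Detection Algorithm. Modeling and Simulation, 13(4), 4279-4288. https://doi.org/10.12677/mos.2024.134388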
