The Application of LSH Technology in Similar Question Detection
DOI: 10.12677/CSA.2020.104077. Supported by research project funding.
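Authors: Rui Chen*, Song Wang, Ying Mei: School of Economics and Management, Chuxiong Normal University, Chuxiong, Yunnan; Yunyuan Yang: School of Geographical Science and Tourism Management, Chuxiong Normal University, Chuxiong, Yunnan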
Keywords: Examination Checking, LSH Algorithm, Jaccard Similarity, K-Shingle
Abstract: The repetition rate of question content is an important indicator of the quality of a question bank and of the test papers drawn from it. To find similar questions in a question bank quickly, this paper studies a detection method that combines K-shingle-based Jaccard similarity, MinHash, and locality-sensitive hashing (LSH). The method first segments the question stem into Chinese words and, after suitable preprocessing, converts it into a K-shingle set. It then computes a MinHash signature for each set, uses LSH to quickly identify candidate pairs of similar questions, and computes the Jaccard similarity of each candidate pair. A pair whose similarity exceeds a given threshold is reported as similar. Use of the method in a question bank system confirmed its feasibility and gave good results.
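The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the seeded BLAKE2b hash family, and the parameter values (k = 2, 64 hash functions, 16 bands, threshold 0.6) are all assumptions chosen for the example, and word segmentation is assumed to have been done upstream, so each question arrives as a list of tokens.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(tokens, k=2):
    """Return the set of k-shingles (runs of k consecutive tokens)."""
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed (assumed hash family)."""
    data = f"{seed}:{'|'.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """Signature: the minimum hash value of the set under each seeded hash."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=16):
    """Split signatures into bands; items that agree on any whole band become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates


def similar_pairs(tokenized, k=2, num_hashes=64, bands=16, threshold=0.6):
    """Return {(id_a, id_b): jaccard} for candidate pairs above the threshold."""
    sets = {key: shingles(toks, k) for key, toks in tokenized.items()}
    sigs = {key: minhash_signature(s, num_hashes) for key, s in sets.items()}
    result = {}
    for a, b in lsh_candidates(sigs, bands):
        sim = jaccard(sets[a], sets[b])  # verify each candidate exactly
        if sim > threshold:
            result[(a, b)] = sim
    return result


# Hypothetical pre-segmented question stems.
questions = {
    "q1": ["what", "is", "the", "capital", "of", "france"],
    "q2": ["what", "is", "the", "capital", "of", "france"],
    "q3": ["solve", "the", "equation", "for", "x"],
}
pairs = similar_pairs(questions)
```

With 16 bands of 4 rows each, the banding step starts to favor pairs whose Jaccard similarity is roughly (1/16)^(1/4) = 0.5 or higher, so the exact Jaccard check on each candidate is what enforces the final threshold.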
Citation: Chen, R., Wang, S., Mei, Y. and Yang, Y.Y. (2020) The Application of LSH Technology in Similar Question Detection. Computer Science and Application, 10(4), 741-748. https://doi.org/10.12677/CSA.2020.104077
