具有重叠集合约束的实体解析问题研究
Entity Resolution Algorithm Based on Locality Sensitive Hash and Fuzzy Join
DOI: 10.12677/HJDM.2023.132011, PDF,   
作者: 樊沁怿, 李 贵, 李征宇:沈阳建筑大学,信息与控制工程学院,辽宁 沈阳
关键词: 相似连接重叠约束集合Set Similarity Overlap Set
摘要: 本文研究了具有重叠集合约束的实体解析集合相似性连接问题。给定两个集合内元素为集合的集合以及一个常数c,找到数据集当中至少共享了c个共同元素的所有集合对。这一问题是许多领域诸如信息检索、数据挖掘和实体解析当中的基本操作。现有的方法都受到了O(n2)的限制,其中n是数据集总的大小。本文提出了一种算法复杂度为 的集合大小感知算法。集合大小感知算法根据集合大小将所有集合分为大集合与小集合并分别进行处理。本文通过现有的方法来处理大集合,对于小集合本文提出了集中启发式算法来提高算法性能。由于大集合与小集合的大小边界对于效率至关重要,本文还提出了一种有效的大小边界的选择方法来选择合适的大小边界。本文通过在真实数据集上的实验结果证明了该方法的有效性。
Abstract: In this paper, the entity resolution set similarity connection problem with overlapping set constraints is studied. Given two sets whose elements are sets of sets and a constant c, find all sets that share at least c common elements in the data set. This problem is a fundamental operation in many fields such as information retrieval, data mining, and entity resolution. Existing methods are limited by O(n2), where n is the total size of the dataset. This paper presents a set size awareness algorithm with complexity. The set size sensing algorithm divides all sets into large sets and small sets according to the set size and processes them separately. In this paper, the existing methods are used to deal with large sets. For small sets, a centralized heuristic algorithm is pro-posed to improve the algorithm performance. Since the size boundary of large sets and small sets is very important for efficiency, this paper also proposes an effective size boundary selection method to select the appropriate size boundary. Experimental results on real data sets demonstrate the ef-fectiveness of the proposed method.
文章引用:樊沁怿, 李贵, 李征宇. 具有重叠集合约束的实体解析问题研究[J]. 数据挖掘, 2023, 13(2): 107-116. https://doi.org/10.12677/HJDM.2023.132011

参考文献

[1] Satuluri, V. and Parthasarathy, S. (2012) Bayesian Locality Sensitive Hashing for Fast Similarity Search. Proceedings of the VLDB Endowment, 5, 430-441. [Google Scholar] [CrossRef
[2] Pennington, J., Socher, R. and Manning, C. (2014) Glove: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empir-ical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, 1532-1543. [Google Scholar] [CrossRef
[3] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013) Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, 2, 3111-3119.
[4] Lund, K. and Burgess, C. (1996) Producing High-Dimensional Semantic Spaces from Lexical Co-Occurrence. Behavior Research Methods, Instruments, & Comput-ers, 28, 203-208. [Google Scholar] [CrossRef
[5] Bullinaria, J.A. and Levy, J.P. (2007) Extracting Se-mantic Representations from Word Co-Occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. [Google Scholar] [CrossRef
[6] Bayardo, R.J., Ma, Y. and Srikant, R. (2007) Scaling up All Pairs Similarity Search. In Proceedings of the 16th International Conference on World Wide Web, Association for Com-puting Machinery, New York, 131-140. [Google Scholar] [CrossRef
[7] Li, C., Lu, J. and Lu, Y. (2008) Efficient Merging and Filtering Al-gorithms for Approximate String Searches. 2008 IEEE 24th International Conference on Data Engineering, Cancun, 7-12 April 2008, 257-266. [Google Scholar] [CrossRef
[8] Wang, J., Li, G. and Feng, J. (2012) Can We Beat the Prefix Fil-tering?: An Adaptive Framework for Similarity Join and Search. In: Proceedings of the 2012 ACM SIGMOD Interna-tional Conference on Management of Data, Association for Computing Machinery, New York, 85-96. [Google Scholar] [CrossRef
[9] Malik, A.S., Boyko, O., Atkar, N. and Young, W.F. (2001) A Comparative Study of MR Imaging Profile of Titanium Pedicle Screws. Acta Radiologica, 42, 291-293. [Google Scholar] [CrossRef] [PubMed]