半结构化实体解析算法

doi:10.12677/HJDM.2020.101001

期刊菜单

半结构化实体解析算法
Semi-Structured Entity Resolution Algorithm

DOI: 10.12677/HJDM.2020.101001, PDF, 被引量国家自然科学基金支持
作者: 韦海浪^*, 李贵, 李征宇, 韩子扬, 曹科研：沈阳建筑大学信息与控制工程学院，辽宁沈阳
关键词: 实体解析；编辑相似度；集合相似度；二分图最大加权匹配；Entity Resolution； Edit Similarity； Set Similarity； Maximum Weighted Matching of Binary Graph

摘要: 实体解析是指识别一个或多个数据集中的相似或相同的记录。该文主要针对模式未知的半结构化数据，提出了一种基于字符串相似度的实体解析算法，将记录分成多个子字符串，采用编辑相似度计算子字符串之间关联度，在此基础上引入二分图最大加权匹配算法度量记录之间的关联度。由于该方法的计算时间复杂度比较高，对于Web大数据集实体解析来说，所需的时间成本较大，因此，该文还提出了一种基于集合相似度的实体解析算法，将记录看作所有属性值的集合，每个属性值作为集合中的元素，用一个标记数组来表示每个元素，根据这些标记数组为每个记录创建一个签名，找出与签名相匹配的其他相似记录。并且采用优化后的最大匹配算法来选出真正相似的记录。最后，该文采用实际数据集进行实验验证了上述方法比传统方法更有效。

Abstract: Entity resolution is the identification of similar or identical records in one or more datasets. In this paper, an entity resolution algorithm based on string similarity is proposed for semi-structured data with unknown patterns. The records are divided into several substrings, and the correlation between substrings is calculated by editing similarity. On this basis, the maximum weighted matching algorithm of binary graph is introduced to measure the correlation between records. Due to the computing time complexity of this method is higher, for Web entity resolution large data sets, the time cost is larger; therefore, this article also puts forward a kind of entity resolution algorithm based on set similarity, considering record as a collection of all the property values, each attribute value as the elements in the collection, using an array of tag to represent each element, according to these tags array for each record to create a signature, to find other similar records match the sig-nature. The optimized maximum matching algorithm is used to select the truly similar records. Fi-nally, this paper uses the actual data set to verify that the above method is more effective than the traditional method.

文章引用：韦海浪, 李贵, 李征宇, 韩子扬, 曹科研. 半结构化实体解析算法[J]. 数据挖掘, 2020, 10(1): 1-15. https://doi.org/10.12677/HJDM.2020.101001

参考文献

[1]	Galil, Z. (1986) Efficient Algorithms for Finding Maximum Matching in Graphs. ACM Computing Surveys, 18, 23-38. [Google Scholar] [CrossRef]
[2]	Elmagarmid, A.K. and Member, S. (2007) Duplicate Record Detection: A Survey. IEEE Transactions on Knowledge and Data Engineering, 19, 1-16. [Google Scholar] [CrossRef]
[3]	高广尚. 面向实体解析的无监督聚类方法综述[J]. 计算机工程与应用, 2018(7): 11-19.
[4]	高广尚, 张智雄. 关于实体解析基本方法的研究和述评[J]. 数据分析与知识发现, 2019, 3(5): 27-40.
[5]	王宁, 李杰. 大数据环境下用于实体解析的两层相关性聚类方法[J]. 计算机研究与发展, 2014, 51(9): 2108-2116.
[6]	Hernandez, M. and Stolfo, S. (1995) The Merge Purge Problem for Large Databases. ACM, New York. [Google Scholar] [CrossRef]
[7]	Monge, A.E. and Elkan, C.E. (1997) An Efficient Do-main-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proceedings of Workshop on Research Issues on Data Mining and Knowledge Discovery, Newport Beach, 14-17 August 1997, 23-29.
[8]	Gravano, L. and Ipeirotis, P.G. (2001) Using Q-Grams in a DBMS for Approximate String Processing. IEEE Data Engineering Bulletin, 24, 28-34.
[9]	Ristad, E.S. and Yianilos, P.N. (1998) Learning String-Edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 522-532. [Google Scholar] [CrossRef]
[10]	Deng, D., Fer-nandez, R.C., Abedjan, Z., Wang, S., Stonebraker, M., Elmagarmid, A., Ilyas, I.F., Madden, S., Ouzzani, M. and Tang, N. (2017) The Data Civilizer System. 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, 8-11 January 2017, 7 p.
[11]	Deng, D., Li, G. and Feng, J. (2014) A Pivotal Prefix Based Filtering Algorithm for String Similarity Search. ACM SIGMOD International Conference on Management of Data, Snowbird, 22-27 June 2014, 673-684. [Google Scholar] [CrossRef]
[12]	Fredman, M.L. and Tarjan, R.E. (1987) Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms. Journal of the ACM, 34, 596-615. [Google Scholar] [CrossRef]
[13]	Wang, J., Li, G. and Feng, J. (2011) Fast-Join: An Efficient Method for Fuzzy Token Matching Based String Similarity Join. 27th International Conference on Data Engineering, 11-16 April 2011, 458-469. [Google Scholar] [CrossRef]
[14]	Wang, J., Li, G. and Feng, J. (2014) Extending String Similarity Join to Tolerant Fuzzy Token Matching. ACM Transactions on Database Systems, 39, 7. [Google Scholar] [CrossRef]

为你推荐

友情链接