An Entity Resolution Algorithm Based on Attribute Salience

doi:10.12677/HJDM.2021.112004

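Authors: Chu Liangxu (褚良旭), Li Gui (李贵), Li Zhengyu (李征宇), Han Ziyang (韩子扬), Cao Keyan (曹科研): School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning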
Keywords: Entity Resolution; Attribute Salience; Bipartite Graph; Random Walk

Abstract: Entity resolution (ER) is an important step in data integration and data cleaning. In domain-specific data cleaning and integration, the attributes of an entity usually differ in discriminating power, and computing and exploiting that discriminating power can improve the accuracy of record similarity. Existing entity resolution methods compute record similarity with string-based similarity algorithms or machine-learning-based algorithms, but they do not take the differing importance of attributes into account. This paper therefore draws on the ideas of the SimRank and PageRank algorithms and, combining them with attribute salience obtained by random sampling, proposes a record similarity algorithm based on attribute salience. First, a weighted bipartite graph of attributes and record pairs is constructed to represent the relationship between attributes and record pairs. Second, an iterative algorithm for computing record similarity is proposed that combines attribute salience with a graph-theoretic similarity algorithm. Finally, a record graph is constructed to represent the matching probability between record pairs (the weight w(r_i, r_j) in the bipartite graph), and an improved random walk algorithm is used to estimate the probability that a record pair matches. The matching probabilities are then fed back to the weighted bipartite graph to revise the weights w(r_i, r_j) used by the salience-based record similarity algorithm, and the process repeats until convergence. An experimental evaluation on real estate data sets shows that the proposed entity resolution algorithm based on attribute salience is more accurate than mainstream methods.
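The abstract does not give the formulas for attribute salience or for the salience-weighted record similarity, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical salience proxy (the share of distinct values an attribute takes in a sample, normalized to sum to 1) and uses Python's standard `difflib.SequenceMatcher` as a stand-in string similarity. The function names and the toy records are invented for illustration, and the bipartite-graph iteration and random walk feedback are not reproduced here.

```python
from difflib import SequenceMatcher


def estimate_salience(sample, attributes):
    """Hypothetical salience proxy (an assumption, not the paper's definition):
    the share of distinct values each attribute takes in the sample,
    normalized so that the weights sum to 1."""
    raw = {}
    for attr in attributes:
        values = [record.get(attr, "") for record in sample]
        raw[attr] = len(set(values)) / len(values)
    total = sum(raw.values())
    return {attr: score / total for attr, score in raw.items()}


def record_similarity(r_i, r_j, salience):
    """Salience-weighted sum of per-attribute string similarities, in [0, 1]."""
    return sum(
        weight * SequenceMatcher(None, r_i.get(attr, ""), r_j.get(attr, "")).ratio()
        for attr, weight in salience.items()
    )


# Toy records, made up for illustration only (not from the paper's data set).
sample = [
    {"name": "Sunshine Garden", "district": "Hunnan", "type": "apartment"},
    {"name": "Sunshine Gardens", "district": "Hunnan", "type": "apartment"},
    {"name": "Riverside Plaza", "district": "Heping", "type": "apartment"},
    {"name": "Lakeview Court", "district": "Hunnan", "type": "villa"},
]

salience = estimate_salience(sample, ["name", "district", "type"])

# Near-duplicate pair versus a clearly different pair.
likely_match = record_similarity(sample[0], sample[1], salience)
likely_non_match = record_similarity(sample[0], sample[2], salience)
print(salience)
print(likely_match, likely_non_match)
```

In this sketch the `name` attribute receives the largest weight because it takes the most distinct values in the sample, so a disagreement on `name` lowers the score more than a disagreement on `district` or `type`. In the algorithm described above, such similarities would additionally be refined iteratively using the matching probabilities estimated by the random walk.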

Citation: Chu Liangxu, Li Gui, Li Zhengyu, Han Ziyang, Cao Keyan. An Entity Resolution Algorithm Based on Attribute Salience [J]. 数据挖掘 [Data Mining], 2021, 11(2): 27-37. https://doi.org/10.12677/HJDM.2021.112004

