An Instance-Based Data Transformation Method

doi:10.12677/HJDM.2022.123024


Authors: 薄凤羽, 李贵, 李征宇, 韩子扬, 曹科研: School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning
Keywords: Data Transformation; Example Transformation; Information Entropy; PBE

Abstract: The Web contains a large amount of useful information, but because that information is semi-structured, non-expert users cannot make good use of it when transforming and integrating data. This paper therefore proposes an instance-based data transformation method in which users only need to provide suitable input-output examples to obtain the required transformation. First, a pattern distance measure based on sequence alignment is used to generate representative examples from the user-provided examples. Second, a code analysis method based on information entropy is proposed and combined with the representative examples to screen candidate functions relevant to the transformation task. Finally, function ranking is used to apply the relevant functions to column transformations first and then to rows, synthesizing a data transformation program that is consistent with all the examples. An experimental evaluation on a real estate data set shows that the method can handle many common transformations that existing systems do not support and can achieve nearly 80% of the data transformations in the experimental system, with accuracy much higher than that of other systems of the same type.
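The abstract does not define the pattern distance or how representative examples are chosen, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each input string is abstracted into a sequence of token classes (`A` for letters, `D` for digits, `S` for whitespace, punctuation kept literally), uses plain Levenshtein edit distance as a stand-in for the sequence-alignment score, and keeps an example as representative only if its pattern differs from every example kept so far. The names `to_pattern`, `pattern_distance`, `representative_examples`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical token classes: A = ASCII letters, D = digits, S = whitespace,
# P = any other single character (kept literally in the pattern).
TOKEN_RE = re.compile(
    r"(?P<A>[A-Za-z]+)|(?P<D>[0-9]+)|(?P<S>\s+)|(?P<P>[^A-Za-z0-9\s])"
)


def to_pattern(text):
    """Abstract a string into a list of token classes, e.g. '120 sqm' -> D S A."""
    return [
        m.group() if m.lastgroup == "P" else m.lastgroup
        for m in TOKEN_RE.finditer(text)
    ]


def pattern_distance(p, q):
    """Levenshtein distance between two patterns (assumed alignment score)."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def representative_examples(examples, threshold=1):
    """Greedily keep (input, output) pairs whose input pattern is at least
    `threshold` edits away from every pair already kept."""
    reps = []
    for inp, out in examples:
        pat = to_pattern(inp)
        if all(pattern_distance(pat, to_pattern(r_in)) >= threshold
               for r_in, _ in reps):
            reps.append((inp, out))
    return reps


# Made-up real-estate style examples, for illustration only.
examples = [
    ("120 sqm", "120"),
    ("85 sqm", "85"),
    ("3 rooms, 2 halls", "3-2"),
]
reps = representative_examples(examples)
```

With these made-up inputs, the first two examples share the pattern `D S A`, so only one of them is kept alongside the structurally different third example. A real system would likely use a weighted alignment and a clustering step rather than this greedy filter.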

Cite this article: 薄凤羽, 李贵, 李征宇, 韩子扬, 曹科研. An Instance-Based Data Transformation Method [J]. Hans Journal of Data Mining (数据挖掘), 2022, 12(3): 235-245. https://doi.org/10.12677/HJDM.2022.123024

