The Study and Implementation of Discretization Algorithm Based on Information Entropy
Abstract:
A discretization algorithm divides the value range of a continuous attribute into many small intervals, each corresponding to its own discrete symbol; a reasonable discretization expresses the underlying information more accurately. This paper studies and implements a discretization algorithm based on information entropy, which assigns an information entropy to each breakpoint as a measure of its importance and uses that measure to partition the set S. First, the set of candidate breakpoints of the continuous attribute is computed. Second, the candidate breakpoint with the smallest information entropy is selected and added to the breakpoint set; this breakpoint divides S into two parts. Breakpoints are then determined for each resulting subset in turn, until the partition of S is sufficient to express the different information, that is, until the minimum discrimination length criterion is satisfied. Finally, experiments verify the correctness and effectiveness of the algorithm, and several groups of data are tested and compared.
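The three steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of midpoints between adjacent distinct values as candidate breakpoints, and the representation of S as a list of `(value, class_label)` pairs are all assumptions made for illustration. The abstract does not spell out its stopping criterion, so the sketch substitutes the minimum description length test of Fayyad and Irani as a stand-in.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_breakpoints(values):
    """Assumed candidate set: midpoints between adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_breakpoint(samples):
    """Return (entropy, breakpoint, left, right) with the smallest weighted
    class entropy, or None if the attribute has a single distinct value."""
    n = len(samples)
    best = None
    for t in candidate_breakpoints([v for v, _ in samples]):
        left = [s for s in samples if s[0] <= t]
        right = [s for s in samples if s[0] > t]
        e = (len(left) * entropy([c for _, c in left])
             + len(right) * entropy([c for _, c in right])) / n
        if best is None or e < best[0]:
            best = (e, t, left, right)
    return best


def accept_split(samples, left, right, split_entropy):
    """Stopping test (assumption): Fayyad and Irani's MDL criterion."""
    n = len(samples)
    labels = [c for _, c in samples]
    l_labels = [c for _, c in left]
    r_labels = [c for _, c in right]
    k, k1, k2 = len(set(labels)), len(set(l_labels)), len(set(r_labels))
    gain = entropy(labels) - split_entropy
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(l_labels) - k2 * entropy(r_labels)
    )
    return gain > (math.log2(n - 1) + delta) / n


def discretize(samples):
    """Recursively collect breakpoints for a list of (value, label) pairs."""
    if len(samples) < 2:
        return []
    best = best_breakpoint(samples)
    if best is None:
        return []
    e, t, left, right = best
    if not accept_split(samples, left, right, e):
        return []
    # Recurse into both halves; the result is sorted because left < t < right.
    return discretize(left) + [t] + discretize(right)
```

For example, with values 1 to 10 labeled `"a"` and values 11 to 20 labeled `"b"`, `discretize` returns the single breakpoint `[10.5]`: the first split separates the two classes perfectly, and neither pure subset passes the stopping test afterwards.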