Research on Information Loss of Attribute Reduction Based on Joint Granularity

DOI: 10.12677/CSA.2020.1011206

Abstract: With the rapid development of Internet technology, society has entered the era of big data. Such data is not only varied in type and structure but also changes dynamically, so quickly extracting valuable information from massive data is an urgent problem. Rough set theory is a method for handling uncertainty in data, and attribute reduction is one of its core applications. This paper focuses on the amount of information lost after attribute reduction, with the goal of finding a reduction algorithm that preserves high classification accuracy while losing little information. Starting from the concept of knowledge granularity and its reduction algorithm, we introduce the notion of joint granularity, apply it to the reduction process, and obtain an attribute reduction algorithm based on joint granularity. Applying this algorithm to decision-table systems shows that its information loss stays low while the classification accuracy remains unchanged. Finally, simulation experiments on UCI data sets verify the accuracy and effectiveness of the method.

1. Introduction

2. Preliminaries

Let $DS=(U,A,d)$ be a decision system and $B\subseteq A$ a subset of condition attributes. The information vector of an object $x\in U$ with respect to $B$ is

$Inf_B(x)=\{(a,a(x)):a\in B\}$

The $B$-indiscernibility relation (also called the indiscernibility relation) is defined as:

$IND(B)=\{(x,y):Inf_B(x)=Inf_B(y)\}$
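As a concrete illustration, the two definitions above can be computed directly. The following Python sketch (the toy table and all helper and attribute names are ours, not from the paper) groups objects by their information vectors to build the partition $U/IND(B)$:

```python
from collections import defaultdict

def inf(table, B, x):
    """Information vector Inf_B(x): the (attribute, value) pairs of x on B."""
    return frozenset((a, table[x][a]) for a in B)

def ind_partition(table, B):
    """Partition U/IND(B): objects with equal Inf_B values share one block."""
    blocks = defaultdict(set)
    for x in table:
        blocks[inf(table, B, x)].add(x)
    return list(blocks.values())

# Hypothetical toy decision table: object -> {attribute: value}
table = {
    'x1': {'a': 0, 'b': 1},
    'x2': {'a': 0, 'b': 1},
    'x3': {'a': 1, 'b': 0},
}
print(ind_partition(table, ['a', 'b']))  # x1 and x2 are indiscernible
```

Each block of the resulting partition is an equivalence class of $IND(B)$; all later granularity and entropy measures in the paper are computed from such partitions.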

2.1. Attribute Reduction

A subset $B\subseteq C$ is a positive-region reduct of the decision system if:

(1) $POS_B(d)=POS_C(d)$;

(2) for any $a\in B$, $POS_{B-\{a\}}(d)\ne POS_C(d)$.

A subset $B\subseteq A$ is an entropy-based reduct if:

(1) $H(DS,\{d\}|B)=H(DS,\{d\}|A)$;

(2) for any $S\subset B$, $H(DS,\{d\}|S)\ne H(DS,\{d\}|A)$.

2.2. Basic Concepts of Knowledge Granularity

The knowledge granularity of a partition $U/M=\{R_1,R_2,\cdots,R_n\}$ is

$GD(M)=\sum_{i=1}^{n}\frac{|R_i|^2}{|U|^2}$

and the relative knowledge granularity of $R_1$ with respect to $R_2$ is

$GD(R_1|R_2)=GD(R_2)-GD(R_1\cup R_2)$
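Both granularity measures can be evaluated directly from partitions. The sketch below (function names and example partitions are ours) computes the partition for an attribute-set union $R_1\cup R_2$ as the non-empty pairwise intersections of the two partitions' blocks:

```python
def gd(partition):
    """GD(M) = sum |R_i|^2 / |U|^2 over the blocks R_i of the partition."""
    n = sum(len(block) for block in partition)
    return sum(len(block) ** 2 for block in partition) / n ** 2

def meet(p1, p2):
    """Partition induced by R1 ∪ R2: non-empty pairwise block intersections."""
    return [b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2]

def gd_relative(p1, p2):
    """Relative granularity GD(R1|R2) = GD(R2) - GD(R1 ∪ R2)."""
    return gd(p2) - gd(meet(p1, p2))

p1 = [{1, 3}, {2, 4}]
p2 = [{1, 2}, {3, 4}]
print(gd(p2))               # (2^2 + 2^2) / 4^2 = 0.5
print(gd_relative(p1, p2))  # 0.5 - 4/16 = 0.25
```

Note that $GD$ ranges from $1/|U|$ (the finest partition, all singletons) up to $1$ (a single block), so smaller values mean finer knowledge.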

A subset $R\subseteq A$ is a knowledge-granularity attribute reduct of the decision system if:

(1) $GD(D|R)=GD(D|A)$;

(2) for any $m\in R$, $GD(D|R-\{m\})\ne GD(D|R)$.

For a single attribute $c$ and the decision partition $D$, the significance measure is

$s(c,D)=\frac{|IND(D\cup\{c\})|}{\sqrt{|IND(c)|}\cdot\sqrt{|IND(D)|}}$
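Since $|IND(B)|$ counts the object pairs related under $IND(B)$, it equals the sum of squared block sizes of $U/B$, so the measure can be evaluated from partitions alone. A sketch under that reading (helper names are ours):

```python
import math

def ind_size(partition):
    """|IND(B)|: number of pairs (x, y) with Inf_B(x) = Inf_B(y)."""
    return sum(len(block) ** 2 for block in partition)

def meet(p1, p2):
    """Partition for an attribute-set union: pairwise block intersections."""
    return [b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2]

def significance(p_c, p_d):
    """s(c, D) = |IND(D ∪ {c})| / (sqrt(|IND(c)|) * sqrt(|IND(D)|))."""
    return ind_size(meet(p_c, p_d)) / (
        math.sqrt(ind_size(p_c)) * math.sqrt(ind_size(p_d)))

# When c induces the same partition as D, the measure reaches its maximum 1.
p = [{1, 2}, {3, 4}]
print(significance(p, p))
```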

2.3. Heuristic Attribute Reduction Algorithm Based on Knowledge Granularity [8]

3. Attribute Reduction Algorithm Based on Joint Granularity

Table 1. Heuristic attribute reduction algorithm based on knowledge granularity [8]

Table 2. Attribute Reduction Algorithm Based on Joint Granularity

Let $U/A=\{X_1,X_2,\cdots,X_N\}$ and $U/\{d\}=\{Y_1,Y_2,\cdots,Y_M\}$. The entropy of the decision system with respect to $A$ is

$H(DS,A)=-\sum_{i=1}^{N}p(X_i)\,\mathrm{lb}\,p(X_i)$

The joint entropy [13] of $\{d\}$ and $A$ can then be defined as:

$H(DS,A,\{d\})=-\sum_{i=1}^{N}\sum_{j=1}^{M}p(X_i,Y_j)\,\mathrm{lb}\,p(X_i,Y_j)$

The information loss of a reduct $B\subseteq A$ is

$\Delta(B)=H(DS,A)-H(DS,B)$

and its information loss rate is defined as:

$s(B)=\frac{\Delta(B)}{H(DS,A)}\times 100\%=\left(1-\frac{H(DS,B)}{H(DS,A)}\right)\times 100\%$
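The entropy, joint-entropy, and loss-rate formulas above translate directly into code. A minimal sketch (helper names are ours; $\mathrm{lb}$ is the binary logarithm, implemented as `math.log2`):

```python
import math

def entropy(partition):
    """H(DS, B) = -sum p(X_i) lb p(X_i), with p(X_i) = |X_i| / |U|."""
    n = sum(len(b) for b in partition)
    return -sum(len(b) / n * math.log2(len(b) / n) for b in partition)

def joint_entropy(p1, p2):
    """H(DS, A, {d}): entropy of the pairwise block intersections."""
    meet = [b1 & b2 for b1 in p1 for b2 in p2 if b1 & b2]
    return entropy(meet)

def loss_rate(p_full, p_reduct):
    """s(B) = (1 - H(DS, B) / H(DS, A)) * 100%."""
    return (1 - entropy(p_reduct) / entropy(p_full)) * 100

p_full = [{1}, {2}, {3}, {4}]       # finest partition: H = lb 4 = 2
p_reduct = [{1, 2}, {3, 4}]         # coarser partition: H = 1
print(loss_rate(p_full, p_reduct))  # -> 50.0
```

A reduct whose partition is as fine as the full attribute set's gives $\Delta(B)=0$ and a loss rate of $0\%$; coarser partitions lose entropy and hence information.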

Table 3. Decision system $DS_1=(U,A,d)$

$U/C=\{\{x_1\},\{x_2\},\{x_3,x_4\},\{x_5,x_6\},\{x_7\},\{x_8\}\}$

$U/\{d\}=\{\{x_1,x_3,x_5,x_7\},\{x_2,x_4,x_6,x_8\}\}$

(1) The knowledge-granularity attribute reduction algorithm yields the reduct $R_1=RED_C=\{e_2,e_3,e_4\}$.

$H(DS_1,C)=-\sum p(X)\,\mathrm{lb}\,p(X)=2.5$

$H(DS_1,R_1)=-\sum p(X)\,\mathrm{lb}\,p(X)=2.5$

so the information loss of $R_1$ is $\Delta(R_1)=0$.
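The value $H(DS_1,C)=2.5$ can be checked numerically from the block sizes of the partition $U/C$ given above (the helper function is a sketch of ours, not from the paper):

```python
import math

def entropy(sizes):
    """H from block sizes: -sum (s/n) lb (s/n), with n = |U|."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes)

# Block sizes of U/C = {{x1}, {x2}, {x3, x4}, {x5, x6}, {x7}, {x8}}
print(entropy([1, 1, 2, 2, 1, 1]))  # -> 2.5
```

Indeed, $4\cdot\frac{1}{8}\cdot 3 + 2\cdot\frac{2}{8}\cdot 2 = 1.5 + 1 = 2.5$.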

(2) The positive-region reduction algorithm yields $R_2=\{e_2,e_3,e_4\}$ and $R_3=\{e_1,e_3,e_4\}$.

$H(DS_1,R_3)=-\sum p(X)\,\mathrm{lb}\,p(X)=2$

The information loss of $R_2$ is 0, while that of $R_3$ is $\Delta(R_3)=0.5$.

(3) The joint-granularity attribute reduction algorithm yields $R_4=\{e_2,e_3,e_4\}$, whose information loss is 0.

Table 4. Decision system $DS_2=(U,A,d)$

$U/A=\{\{e_1\},\{e_2\},\{e_3\},\{e_4\},\{e_5\},\{e_6\}\}$

$U/\{d\}=\{\{e_1,e_4,e_6\},\{e_2,e_3,e_5\}\}$

$H(DS_2,A)=-\sum p(X)\,\mathrm{lb}\,p(X)=2.585$

(1) Positive-region relative reduction gives $S_1=\{a,b\}$:

$H(DS_2,S_1)=-\sum p(X)\,\mathrm{lb}\,p(X)=1.918$

so the information loss of $S_1$ is $\Delta(S_1)=0.667$.

(2) Knowledge-granularity attribute reduction gives $S_2=RED_A=\{a,b,e\}$:

$H(DS_2,S_2)=-\sum p(X)\,\mathrm{lb}\,p(X)=2.252$

so the information loss of $S_2$ is $\Delta(S_2)=0.333$.

(3) Joint-granularity attribute reduction gives $S_3=\{a,b,e\}$ and $S_4=\{a,b,c\}$:

$H(DS_2,S_4)=-\sum p(X)\,\mathrm{lb}\,p(X)=2.252$

so the information loss of $S_3$ and $S_4$ is $\Delta(S_3)=\Delta(S_4)=0.333$.
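The loss figures for $DS_2$, and the corresponding loss rates given by the rate formula in Section 3, can be checked arithmetically (the entropy values are the ones stated above; the script itself is ours):

```python
# H(DS2, A): all six objects discernible, so H = lb 6 ≈ 2.585 (from the paper)
H_full = 2.585
for name, h in [('S1', 1.918), ('S2', 2.252), ('S4', 2.252)]:
    delta = H_full - h                # information loss Δ
    rate = delta / H_full * 100      # loss rate s(B) in percent
    print(f"{name}: loss = {delta:.3f}, loss rate = {rate:.1f}%")
```

This confirms $\Delta(S_1)=0.667$ against $\Delta(S_2)=\Delta(S_4)=0.333$, i.e. the positive-region reduct loses roughly twice as much information as the granularity-based reducts.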

4. Simulation Experiments and Analysis

Table 5. Data set description

Figure 1. Comparison of dermatology

Figure 2. Comparison of contraceptive method choice

Figure 3. Comparison of mushroom

Figure 4. Comparison of letter recognition

Figure 5. Analysis of dermatology

Figure 6. Analysis of contraceptive method choice

Figure 7. Analysis of mushroom

Figure 8. Analysis of letter recognition

5. Conclusion

[1] Pawlak, Z. (1982) Rough Sets. International Journal of Computer and Information Sciences, 11, 341-356. https://doi.org/10.1007/BF01001956
[2] Hobbs, J.R. (1985) Granularity. Proceedings of the Ninth International Joint Conference on Artificial Intelligence, Los Angeles, 432-435.
[3] Lin, T.Y. (1997) Granular Computing. Announcement of the BISC Special Interest Group on Granular Computing.
[4] Zhang, C.C. (2020) Knowledge Granularity Based Incremental Attribute Reduction for Incomplete Decision Systems. International Journal of Machine Learning and Cybernetics, 11, 1141-1157. https://doi.org/10.1007/s13042-020-01089-4
[5] Li, X., et al. (2020) Attribute Reduction for Weighted Decision Tables. Computer Engineering and Applications, 56(12), 54-59. (in Chinese)
[6] (2019) Research Progress of Rough Set Attribute Reduction in the Context of Big Data. Computer Engineering and Applications, 55(6), 31-38. (in Chinese)
[7] (2019) Incremental Attribute Reduction of Information Systems Based on Knowledge Granulation. Pattern Recognition and Artificial Intelligence, 38(8), 31-38. (in Chinese)
[8] (2012) A Heuristic Attribute Reduction Algorithm Based on Knowledge Granularity. Computer Engineering and Applications, 48(36), 31-38. (in Chinese)
[9] Deng, D., Xue, H., Miao, D. and Lu, K. (2017) Research on Attribute Reduction Criteria and Information Loss of Reduction. Acta Electronica Sinica, 45(2), 401-407. (in Chinese)
[10] Wang, G. (2001) Rough Set Theory and Knowledge Acquisition. Xi'an Jiaotong University Press, Xi'an. (in Chinese)
[11] Teng, S. (2010) Research on Uncertainty Measures and Attribute Reduction Methods Based on Rough Set Theory. Ph.D. Thesis, National University of Defense Technology, Changsha. (in Chinese)
[12] Sang, Y. and Qian, Y. (2017) A Granulation Reduction Method in Multi-Granulation Decision-Theoretic Rough Sets. Computer Science, 44(5), 199-205. (in Chinese)
[13] Sang, Y. and Qian, Y. (2012) A Granular Space Reduction Algorithm in Pessimistic Multi-Granulation Rough Sets. Pattern Recognition and Artificial Intelligence, 25(3), 361-366. (in Chinese)
[14] Deng, D. and Huang, H. (2016) Two-Level Absolute Reduction in Multi-Granulation Rough Sets. Pattern Recognition and Artificial Intelligence, 29(11), 969-975. (in Chinese)
[15] Miao, D. and Li, D. (2008) Rough Set Theory, Algorithms and Applications. Tsinghua University Press, Beijing. (in Chinese)