Research on Information Loss of Attribute Reduction Based on Joint Granularity

Abstract: With the rapid development of Internet technology, society has entered the era of big data. Data now vary widely in type and structure and also change dynamically, so extracting valuable information quickly from massive data is an urgent problem. Rough set theory is a data analysis method for handling uncertainty, and attribute reduction is one of its core applications. This paper focuses on the amount of information lost through attribute reduction, with the aim of finding a reduction algorithm that keeps classification accuracy high while losing little information. We first review the concept of knowledge granularity and the corresponding reduction algorithm, then introduce joint granularity and apply it to the reduction process, which yields an attribute reduction algorithm based on joint granularity. The algorithm is then used to reduce decision tables. The results indicate that it keeps information loss low while leaving classification accuracy unchanged. Finally, the accuracy and effectiveness of the method are verified by simulation experiments on UCI data sets.

1. Introduction

2. Preliminaries

$In{f}_{B}\left(x\right)=\left\{\left(a,a\left(x\right)\right):a\in B\right\}$

The B-indiscernibility relation (also called the indistinguishability relation) is defined as:

$IND\left(B\right)=\left\{\left(x,y\right):In{f}_{B}\left(x\right)=In{f}_{B}\left(y\right)\right\}$
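The two definitions above can be sketched in code: objects are grouped by their information vector $In{f}_{B}(x)$, and each group is one equivalence class of $IND(B)$. This is a minimal illustration only; the table `toy` and the helper names `info_vector` and `partition` are assumptions introduced here, not part of the paper.

```python
from collections import defaultdict

def info_vector(row, attrs):
    """Inf_B(x): the (attribute, value) pairs of one object restricted to B."""
    return tuple((a, row[a]) for a in attrs)

def partition(table, attrs):
    """U/IND(B): group object ids whose information vectors on B coincide."""
    blocks = defaultdict(list)
    for obj_id, row in table.items():
        blocks[info_vector(row, attrs)].append(obj_id)
    return [sorted(ids) for ids in blocks.values()]

# Hypothetical toy table (not from the paper): object -> {attribute: value}
toy = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 1},
    "x4": {"a": 1, "b": 0},
}

print(partition(toy, ["a"]))       # coarser partition
print(partition(toy, ["a", "b"]))  # finer partition
```

Adding attributes to $B$ can only split blocks, never merge them, which is why the second partition refines the first.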

2.1. Attribute Reduction

(1) $PO{S}_{B}\left(d\right)=PO{S}_{C}\left(d\right)$

(2) For any $a\in B$, $PO{S}_{B-\left\{a\right\}}\left(d\right)\ne PO{S}_{C}\left(d\right)$.
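Conditions (1) and (2) can be checked mechanically. The sketch below assumes the usual rough-set reading of the positive region, namely the union of the $B$-blocks that fall entirely inside one decision class; the toy table and the function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def partition(table, attrs):
    """Equivalence classes of IND(attrs) as sets of object ids."""
    blocks = defaultdict(set)
    for obj_id, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj_id)
    return list(blocks.values())

def positive_region(table, attrs, d):
    """POS_B(d): union of B-blocks whose objects all share one decision value."""
    pos = set()
    for block in partition(table, attrs):
        if len({table[x][d] for x in block}) == 1:
            pos |= block
    return pos

def is_positive_region_reduct(table, B, C, d):
    """Condition (1): POS_B(d) == POS_C(d); condition (2): no attribute of B is removable."""
    target = positive_region(table, C, d)
    if positive_region(table, B, d) != target:
        return False
    return all(
        positive_region(table, [b for b in B if b != a], d) != target
        for a in B
    )

# Hypothetical decision table (not from the paper); attribute c duplicates a.
toy = {
    "x1": {"a": 0, "b": 0, "c": 0, "d": 0},
    "x2": {"a": 0, "b": 1, "c": 0, "d": 1},
    "x3": {"a": 1, "b": 0, "c": 1, "d": 1},
    "x4": {"a": 1, "b": 1, "c": 1, "d": 1},
}
C = ["a", "b", "c"]
print(is_positive_region_reduct(toy, ["a", "b"], C, "d"))
print(is_positive_region_reduct(toy, ["a", "b", "c"], C, "d"))
```

In this toy table `{a, b}` satisfies both conditions, while the full set `{a, b, c}` fails condition (2) because `c` can be dropped without changing the positive region.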

(1) $H\left(DS,\left\{d\right\}|B\right)=H\left(DS,\left\{d\right\}|A\right)$

(2) For any $S\subset B$, $H\left(DS,\left\{d\right\}|S\right)\ne H\left(DS,\left\{d\right\}|A\right)$
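The entropy-based conditions compare conditional entropies of the decision attribute. As a rough sketch, assuming $H(DS,\{d\}|B)$ is the standard Shannon conditional entropy with base-2 logarithms (the defining formula does not appear in this section), it could be computed as follows. The table and function name are hypothetical.

```python
from collections import Counter
from math import log2

def conditional_entropy(table, attrs, d):
    """H(d | B) = -sum over (x, y) of p(x, y) * log2 p(y | x)."""
    n = len(table)
    keys = [tuple(row[a] for a in attrs) for row in table.values()]
    decisions = [row[d] for row in table.values()]
    marginal = Counter(keys)               # block sizes of U/IND(B)
    joint = Counter(zip(keys, decisions))  # block sizes of U/IND(B ∪ {d})
    return -sum(c / n * log2(c / marginal[k]) for (k, _), c in joint.items())

# Hypothetical decision table (not from the paper).
toy = {
    "x1": {"a": 0, "b": 0, "d": 0},
    "x2": {"a": 0, "b": 1, "d": 1},
    "x3": {"a": 1, "b": 0, "d": 1},
    "x4": {"a": 1, "b": 1, "d": 1},
}
print(conditional_entropy(toy, ["a", "b"], "d"))  # decision fully determined
print(conditional_entropy(toy, ["a"], "d"))       # residual uncertainty remains
```

A candidate $B$ meets condition (1) when its conditional entropy equals that of the full attribute set, and condition (2) when every proper subset gives a different value.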

2.2. Basic Concepts of Knowledge Granularity

$GD\left(M\right)=\underset{i=1}{\overset{n}{\sum }}\frac{|{R}_{i}{|}^{2}}{|U{|}^{2}}$

$GD\left({R}_{1}|{R}_{2}\right)=GK\left({R}_{2}\right)-GK\left({R}_{1}\cup {R}_{2}\right)$
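The two granularity formulas above can be illustrated numerically. The sketch assumes that $GK$ and $GD$ denote the same knowledge-granularity measure and that ${R}_{1}\cup {R}_{2}$ is the union of the two attribute sets; both readings are assumptions, as are the toy table and the function names.

```python
from collections import Counter
from fractions import Fraction

def granularity(table, attrs):
    """Sum of |R_i|^2 / |U|^2 over the blocks R_i of U/IND(attrs)."""
    n = len(table)
    sizes = Counter(tuple(row[a] for a in attrs) for row in table.values())
    return Fraction(sum(s * s for s in sizes.values()), n * n)

def relative_granularity(table, r1, r2):
    """Granularity of R2 minus granularity of R1 ∪ R2."""
    union = list(r2) + [a for a in r1 if a not in r2]
    return granularity(table, r2) - granularity(table, union)

# Hypothetical decision table (not from the paper).
toy = {
    "x1": {"a": 0, "b": 0, "d": 0},
    "x2": {"a": 0, "b": 1, "d": 1},
    "x3": {"a": 1, "b": 0, "d": 1},
    "x4": {"a": 1, "b": 1, "d": 1},
}
print(granularity(toy, ["a"]))                       # 1/2
print(relative_granularity(toy, ["d"], ["a"]))       # 1/8
print(relative_granularity(toy, ["d"], ["a", "b"]))  # 0
```

Exact fractions are used so that the equality tests in the reduct conditions are not affected by floating-point rounding. The relative granularity drops to 0 once the condition attributes determine the decision.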

(1) $GD\left(D|A\right)=GK\left(D|R\right)$

(2) For any $m\in R$, $GK\left(D|R-\left\{m\right\}\right)\ne GK\left(D|R\right)$

$R\subseteq A$ is then called a knowledge attribute reduct of the decision system.

$s\left(c,D\right)=\frac{|IND\left(D\cup \left\{c\right\}\right)|}{\sqrt{|IND\left(c\right)|}\cdot \sqrt{|IND\left(D\right)|}}$

2.3. Heuristic Attribute Reduction Algorithm Based on Knowledge Granularity [8]

3. Attribute Reduction Algorithm Based on Joint Granularity

Table 1. Heuristic attribute reduction algorithm based on knowledge granularity [8]

Table 2. Attribute Reduction Algorithm Based on Joint Granularity

$H\left(DS,A\right)=-\underset{i=1}{\overset{N}{\sum }}p\left({X}_{i}\right)lbp\left({X}_{i}\right)$

The joint entropy of $\left\{d\right\}$ and $A$ [13] can then be defined as:

$H\left(DS,A,\left\{d\right\}\right)=-\underset{i=1}{\overset{N}{\sum }}\underset{j=1}{\overset{M}{\sum }}p\left({X}_{i},{Y}_{j}\right)lbp\left({X}_{i},{Y}_{j}\right)$

$\Delta \left(B\right)=H\left(DS,A\right)-H\left(DS,B\right)$

The information loss rate of the attribute reduct $B\subseteq A$ can be defined as follows:

$s\left(B\right)=\frac{\Delta \left(B\right)}{H\left(DS,A\right)}\times 100\%=\left(1-\frac{H\left(DS,B\right)}{H\left(DS,A\right)}\right)\times 100\%$
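The loss and loss-rate definitions only need the block sizes of the partitions $U/A$ and $U/B$. The sketch below shows the calculation with base-2 logarithms; the block sizes are hypothetical illustration values, not data from the paper, and the function names are introduced here.

```python
from math import log2

def entropy(block_sizes):
    """H = -sum p(X_i) * log2 p(X_i), with p(X_i) = |X_i| / |U|."""
    n = sum(block_sizes)
    return -sum(s / n * log2(s / n) for s in block_sizes)

def information_loss(sizes_full, sizes_reduct):
    """Delta(B) = H(DS, A) - H(DS, B)."""
    return entropy(sizes_full) - entropy(sizes_reduct)

def loss_rate(sizes_full, sizes_reduct):
    """s(B) in percent; assumes H(DS, A) > 0, i.e. U/A has more than one block."""
    return information_loss(sizes_full, sizes_reduct) / entropy(sizes_full) * 100

# Hypothetical partitions of a 4-object universe (not from the paper):
sizes_A = [1, 1, 1, 1]  # U/A: four singletons, H = 2 bits
sizes_B = [2, 2]        # U/B: two blocks of two, H = 1 bit
print(information_loss(sizes_A, sizes_B))  # 1.0
print(loss_rate(sizes_A, sizes_B))         # 50.0
```

Because $U/A$ refines $U/B$ whenever $B\subseteq A$, the loss is never negative, and it is zero exactly when the two partitions coincide.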

Table 3. Decision system $D{S}_{1}=\left(U,A,d\right)$

$U/C=\left\{\left\{{x}_{1}\right\},\left\{{x}_{2}\right\},\left\{{x}_{3},{x}_{4}\right\},\left\{{x}_{5},{x}_{6}\right\},\left\{{x}_{7}\right\},\left\{{x}_{8}\right\}\right\}$

$U/\left\{d\right\}=\left\{\left\{{x}_{1},{x}_{3},{x}_{5},{x}_{7}\right\},\left\{{x}_{2},{x}_{4},{x}_{6},{x}_{8}\right\}\right\}$

(1) The knowledge-granularity attribute reduction algorithm yields the reduct ${R}_{1}=RE{D}_{C}=\left\{{e}_{2},{e}_{3},{e}_{4}\right\}$.

$H\left(D{S}_{1},C\right)=-\sum p\left(X\right)lbp\left(X\right)=2.5$

$H\left(D{S}_{1},{R}_{1}\right)=-\sum p\left(X\right)lbp\left(X\right)=2.5$

The information loss of ${R}_{1}$ is: $\Delta \left({R}_{1}\right)=0$
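As a check on the value 2.5, the entropy can be expanded directly from the partition $U/C$ listed earlier, which has four singleton blocks and two two-element blocks over $|U|=8$ objects. The expansion assumes that $lb$ denotes the base-2 logarithm:

```latex
H(DS_1, C)
  = -\left[\, 4 \cdot \tfrac{1}{8}\,\mathrm{lb}\,\tfrac{1}{8}
            + 2 \cdot \tfrac{2}{8}\,\mathrm{lb}\,\tfrac{2}{8} \,\right]
  = 4 \cdot \tfrac{3}{8} + 2 \cdot \tfrac{2}{4}
  = 1.5 + 1
  = 2.5
```

Since the reported $H(DS_1,R_1)$ is also 2.5, the difference $\Delta(R_1)=2.5-2.5=0$ follows.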

(2) The positive-region reduction algorithm yields: ${R}_{2}=\left\{{e}_{2},{e}_{3},{e}_{4}\right\}$ and ${R}_{3}=\left\{{e}_{1},{e}_{3},{e}_{4}\right\}$

$H\left(D{S}_{1},{R}_{3}\right)=-\sum p\left(X\right)lbp\left(X\right)=2$

The information loss of ${R}_{2}$ is 0.

The information loss of ${R}_{3}$ is $\Delta \left({R}_{3}\right)=0.5$

(3) The joint-granularity attribute reduction algorithm yields: ${R}_{4}=\left\{{e}_{2},{e}_{3},{e}_{4}\right\}$

The information loss of ${R}_{4}$ is 0.

Table 4. Decision system $D{S}_{2}=\left(U,A,d\right)$

$U/A=\left\{\left\{{e}_{1}\right\},\left\{{e}_{2}\right\},\left\{{e}_{3}\right\},\left\{{e}_{4}\right\},\left\{{e}_{5}\right\},\left\{{e}_{6}\right\}\right\}$, $U/\left\{d\right\}=\left\{\left\{{e}_{1},{e}_{4},{e}_{6}\right\},\left\{{e}_{2},{e}_{3},{e}_{5}\right\}\right\}$

$H\left(D{S}_{2},A\right)=-\sum p\left(X\right)lbp\left(X\right)=2.585$
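The value 2.585 follows from the fact that $U/A$ consists of six singleton blocks, so every block has probability $1/6$. Written out, assuming $lb$ is the base-2 logarithm:

```latex
H(DS_2, A)
  = -\sum_{i=1}^{6} \tfrac{1}{6}\,\mathrm{lb}\,\tfrac{1}{6}
  = \mathrm{lb}\,6
  \approx 2.585
```

This is the largest entropy any partition of six objects can have, so every reduct with a coarser partition has a strictly positive information loss.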

(1) Relative reduct based on the positive region: ${S}_{1}=\left\{a,b\right\}$

$H\left(D{S}_{2},{S}_{1}\right)=-\sum p\left(X\right)lbp\left(X\right)=1.918$

The information loss of ${S}_{1}$ is: $\Delta \left({S}_{1}\right)=0.667$

(2) Attribute reduct based on knowledge granularity: ${S}_{2}=RE{D}_{A}=\left\{a,b,e\right\}$

$H\left(D{S}_{2},{S}_{2}\right)=-\sum p\left(X\right)lbp\left(X\right)=2.252$

The information loss of ${S}_{2}$ is: $\Delta \left({S}_{2}\right)=0.333$

(3) Attribute reducts based on joint granularity: ${S}_{3}=\left\{a,b,e\right\},{S}_{4}=\left\{a,b,c\right\}$

$H\left(D{S}_{2},{S}_{4}\right)=-\sum p\left(X\right)lbp\left(X\right)=2.252$

The information loss of ${S}_{3}$ and ${S}_{4}$ is: $\Delta \left({S}_{3}\right)=\Delta \left({S}_{4}\right)=0.333$
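The three results for $D{S}_{2}$ can be compared on a common scale by converting each loss into the loss rate $s(B)$. The short script below uses only the entropy values reported in the text; the variable and function names are introduced here for illustration, and the entropy of ${S}_{3}$ is taken to equal that of ${S}_{2}$ because both are the attribute set $\{a,b,e\}$.

```python
# Entropy values as reported in the text for decision system DS_2.
H_A = 2.585
reported = {
    "S1": 1.918,     # positive-region reduct {a, b}
    "S2": 2.252,     # knowledge-granularity reduct {a, b, e}
    "S3,S4": 2.252,  # joint-granularity reducts {a, b, e} and {a, b, c}
}

def loss_and_rate(h_full, h_reduct):
    """Return (Delta(B), s(B) in percent), rounded for display."""
    loss = h_full - h_reduct
    return round(loss, 3), round(loss / h_full * 100, 1)

summary = {name: loss_and_rate(H_A, h) for name, h in reported.items()}
for name, (loss, rate) in summary.items():
    print(f"{name}: loss = {loss}, loss rate = {rate}%")
```

On these figures the positive-region reduct loses roughly a quarter of the original entropy, while the granularity-based reducts lose about half as much.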

4. Simulation Experiments and Analysis

Table 5. Data set description

Figure 1. Comparison of dermatology

Figure 2. Comparison of conceptual method choice

Figure 3. Comparison of mushroom

Figure 4. Comparison of letter recognition

Figure 5. Analysis of dermatology

Figure 6. Analysis of conceptual method choice

Figure 7. Analysis of mushroom

Figure 8. Analysis of letter recognition

5. Conclusion

