Improved Conditional Probability Distribution Distance Measurement Based on Feature Selection and Weighting
DOI: 10.12677/aam.2025.144207, Supported by the National Natural Science Foundation of China
Authors: 杨沛融, 胡桂开, School of Science, East China University of Technology, Nanchang, Jiangxi
Keywords: Feature Selection, Conditional Probability Distribution, Nominal Attribute, Information Gain Ratio
Abstract: To improve the accuracy with which differences between instances of nominal attributes are recognized, and thereby the accuracy of classification algorithms, this paper proposes an improved conditional probability distribution distance measure based on feature selection and weighting that takes full account of the dependencies among attributes. First, a feature selection mechanism is constructed using symmetric uncertainty. Second, on this basis, the information gain ratio between each attribute and the class is computed to obtain per-attribute weights, and the weighted distance is calculated. Finally, simulation experiments are conducted on 19 datasets with the K-Nearest Neighbors algorithm. The results indicate that the proposed distance measure effectively improves the performance of classification algorithms.
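The three-stage pipeline summarized in the abstract (symmetric-uncertainty feature selection, gain-ratio weighting, weighted conditional-probability-distribution distance) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the dissimilarity between two nominal values is the L1 distance between their class-conditional probability distributions, the SU selection threshold is arbitrary, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) of a list of nominal values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """SU(X,Y) = 2*I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2.0 * mutual_information(xs, ys) / (hx + hy)

def gain_ratio(xs, ys):
    """Information gain of the class ys from attribute xs, normalized by H(xs)."""
    hx = entropy(xs)
    if hx == 0:
        return 0.0
    return mutual_information(xs, ys) / hx

def cpd_distance(a, b, xs, ys):
    """L1 distance between class-conditional distributions P(c|a) and P(c|b)."""
    classes = sorted(set(ys))
    def conditional(v):
        matched = [y for x, y in zip(xs, ys) if x == v]
        return {c: matched.count(c) / len(matched) for c in classes}
    pa, pb = conditional(a), conditional(b)
    return sum(abs(pa[c] - pb[c]) for c in classes)

def weighted_distance(inst1, inst2, data, labels, su_threshold=0.0):
    """Gain-ratio-weighted CPD distance over SU-selected attributes.

    data: list of training instances (each a list of nominal values);
    su_threshold is an illustrative cut-off, not a value from the paper.
    """
    total = 0.0
    for j in range(len(inst1)):
        col = [row[j] for row in data]
        if symmetric_uncertainty(col, labels) <= su_threshold:
            continue  # feature selection: skip weakly class-relevant attributes
        total += gain_ratio(col, labels) * cpd_distance(inst1[j], inst2[j], col, labels)
    return total
```

A distance of this form can be plugged into any K-Nearest Neighbors classifier that accepts a custom metric over nominal instances; the experiments in the paper evaluate exactly such a KNN setup on 19 datasets.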
Citation: 杨沛融, 胡桂开. Improved Conditional Probability Distribution Distance Measurement Based on Feature Selection and Weighting [J]. Advances in Applied Mathematics, 2025, 14(4): 798-809. https://doi.org/10.12677/aam.2025.144207
