One Dependence Conditional Probability Encoding Method Based on Attribute Weighting
DOI: 10.12677/ORF.2023.131009. Supported by the National Natural Science Foundation of China.
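Authors: Liang Zupeng (梁祖鹏): School of Mathematics and Statistics, Guizhou University, Guiyang, Guizhou; Li Qiude (李秋德)*, Hu Sigui (胡思贵): School of Biology and Engineering, Guizhou Medical University, Guiyang, Guizhou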
Keywords: Mixed Data Classification; Conditional Probability Encoding; One Dependence Value Difference Metric; Attribute Weighting
Abstract: Mixed data containing both categorical and numerical attributes are common in real-world and experimental data sets. Before such data can be mined or analyzed, they typically need to be processed (transformed, embedded, represented, or encoded) into high-quality numerical data. The conditional probability encoding method, which is premised on the attribute conditional independence assumption, performs well in most cases but is unsatisfactory on data sets with strong attribute associations. Inspired by the one dependence value difference metric, we apply the idea of relaxing attribute conditional independence to conditional probability encoding. In addition, we use attribute weighting to optimize the quality of the encoded data. Combining these ideas, we propose an Attribute Weighted One Dependence Conditional Probability Encoding method for categorical encoding of mixed data. Experimental results show that our method significantly improves the quality of data transformation, thereby enhancing the performance of subsequent data analysis algorithms.
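To make the baseline idea concrete, the following is a minimal sketch of a conditional probability encoding for a single categorical attribute. It assumes that each categorical value a is mapped to the vector of estimated class probabilities P(c | a), in the spirit of value difference metric methods. The function names, the Laplace smoothing constant `alpha`, and the uniform fallback for unseen values are assumptions of this sketch and are not taken from the paper; the one-dependence relaxation and the attribute weights proposed in the paper are not shown.

```python
from collections import Counter, defaultdict


def fit_conditional_probability_encoder(values, labels, alpha=1.0):
    """Estimate P(class | value) for one categorical attribute.

    alpha is a Laplace smoothing constant (an assumption of this sketch).
    Returns (classes, table), where table[value] is a list of probabilities
    ordered like classes.
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for value, label in zip(values, labels):
        counts[value][label] += 1

    table = {}
    for value, class_counts in counts.items():
        total = sum(class_counts.values()) + alpha * len(classes)
        table[value] = [(class_counts[c] + alpha) / total for c in classes]
    return classes, table


def encode(values, table, n_classes):
    """Replace each categorical value by its probability vector.

    Values not seen during fitting fall back to a uniform vector
    (a hypothetical choice made for this sketch).
    """
    uniform = [1.0 / n_classes] * n_classes
    return [table.get(value, uniform) for value in values]


# Toy data: one categorical attribute and a binary class label.
colors = ["red", "red", "blue", "blue", "blue", "green"]
labels = ["yes", "no", "yes", "yes", "yes", "no"]

classes, table = fit_conditional_probability_encoder(colors, labels)
encoded = encode(colors, table, len(classes))
```

Each categorical value becomes a numeric vector with one component per class, so the encoded attribute can be concatenated with the numerical attributes before a downstream learner is applied. Because every attribute is encoded on its own, this sketch inherits the attribute conditional independence assumption that the paper sets out to relax.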
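Cite this article: Liang Zupeng, Li Qiude, Hu Sigui. One Dependence Conditional Probability Encoding Method Based on Attribute Weighting [J]. Operations Research and Fuzziology (运筹与模糊学), 2023, 13(1): 74-87. https://doi.org/10.12677/ORF.2023.131009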
