Exploring the Classification Performance of Game Sampling Optimization for Imbalanced Data: Taking Imbalanced Diabetes Data as an Example
DOI: 10.12677/sa.2025.145136
Author: Zeng Jie, School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong
Keywords: Diabetes Data Set, Game Sampling, Random Forest, System Correlation Analysis
Abstract: In the era of big data, imbalanced data arise in many fields, such as medical diagnosis, financial risk assessment, and industrial fault detection. The highly skewed distribution of samples across classes poses a major challenge for data analysis and mining. As industries rely more heavily on data-driven decision-making, the problem becomes more prominent: it degrades the performance of machine learning models, which tend to identify and predict minority-class samples poorly. This can have serious consequences, such as missed diagnoses of rare diseases in medicine or overlooked high-risk customers in finance. Previous approaches to imbalanced data fall mainly into two groups, single oversampling and single undersampling. Single oversampling increases the number of minority-class samples by duplicating them, but it easily leads to overfitting; single undersampling removes majority-class samples to balance the class distribution, but it may discard important information. In complex data environments, these traditional methods struggle to improve classification performance on imbalanced data. This study applies a game sampling method in an effort to overcome these limitations. Taking an imbalanced diabetes data set as the research object, the raw data are carefully preprocessed, balanced data sets are generated by single oversampling, single undersampling, and dynamic game sampling respectively, and a random forest model is built on each. The model outputs under the different sampling methods are compared, and the strengths and weaknesses of each method are analyzed.
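The two baseline strategies named in the abstract can be illustrated with a minimal sketch. The Python below is not the author's implementation: the function names `random_oversample` and `random_undersample`, the toy data, and the choice to balance every class to the largest (or smallest) class size are assumptions made for illustration only. Game sampling is not sketched, because the abstract does not specify how it works.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random (with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)      # pick an existing row of this class
            X_out.append(X[j])       # exact copy: no new information is added
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class (without replacement)
    so that every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))  # the other rows are discarded
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 8 majority-class rows (0) and 2 minority-class rows (1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(Counter(y_over))   # 8 rows per class
print(Counter(y_under))  # 2 rows per class
```

The sketch makes the trade-off described in the abstract concrete: after oversampling, the 8 minority rows are copies of only 2 distinct originals, which is where the overfitting risk comes from, while undersampling throws away 6 of the 8 majority rows.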
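Cite this article: Zeng, J. (2025) Exploring the Classification Performance of Game Sampling Optimization for Imbalanced Data: Taking Imbalanced Diabetes Data as an Example. Statistics and Application, 14(5), 182-190. https://doi.org/10.12677/sa.2025.145136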
