GA-RF: Research on Unbalanced Stroke Data Recognition Based on SHAP
Abstract: In primary disease screening, class imbalance biases classifiers toward the majority class and degrades model performance, so the choice of imbalance-handling strategy and classifier is critical. This paper analyzes an imbalanced stroke dataset under several experimental schemes: eleven imbalance-handling methods are combined with four learning algorithms, namely logistic regression (LR), support vector machine (SVM), convolutional neural network (CNN), and random forest (RF), to identify stroke patients. Across the resulting models, LR, SVM, and RF trained on data processed by random undersampling (RUS) outperform the other combinations, and principal component analysis (PCA) is additionally used to analyze noisy data. These three models are then tuned with particle swarm optimization (PSO), a genetic algorithm (GA), differential evolution (DE), and Bayesian optimization (BO). The GA-tuned random forest (GA-RF) reaches an AUC of 84.18% and a Recall of 91.06%, a significant advantage. Finally, to address the limited interpretability of the model, SHAP is used to analyze feature importance, showing that age contributes far more to stroke recognition than any other feature.
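As an illustration of the RUS step and the Recall metric named in the abstract, the following is a minimal pure-Python sketch, not the authors' implementation. The function names `random_undersample` and `recall`, the toy "age" feature, and the 95/5 class split are all assumptions made for this example; the paper's actual dataset and preprocessing are not reproduced here.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples from larger classes until every class
    has as many samples as the smallest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Size of the minority class sets the per-class sample budget.
    target = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, target))
    kept.sort()  # preserve the original record order
    return [rows[i] for i in kept], [labels[i] for i in kept]


def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN) for the positive (stroke) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Assumed toy data: 95 non-stroke (0) and 5 stroke (1) records,
# each with a single made-up "age" feature.
ages = [[20 + i % 50] for i in range(95)] + [[70 + i] for i in range(5)]
labels = [0] * 95 + [1] * 5

X_bal, y_bal = random_undersample(ages, labels, seed=42)
print(Counter(y_bal))  # class counts after undersampling
```

After undersampling, both classes have the size of the original minority class, which is why RUS removes the majority-class bias at the cost of discarding majority-class records.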
Cite this article: Hu Yidan, Gao Yang, Yin Chang, Guo Zikuan. GA-RF: Research on Unbalanced Stroke Data Recognition Based on SHAP [J]. Pure Mathematics, 2026, 16(1): 17-28. https://doi.org/10.12677/pm.2026.161003
