A Review of Data Mining and Machine Learning Applications in Breast Cancer Prognostic Prediction
Abstract: Breast cancer is one of the most common malignancies among women worldwide, and its high incidence and mortality pose a serious public health challenge. With the spread of medical information systems and advances in big data technology, prognostic prediction models based on data mining and machine learning have become a focus of precision medicine research. This paper systematically reviews research progress on data mining and machine learning techniques for breast cancer survival prediction. It first describes the key data preprocessing steps in building breast cancer prediction models, namely data cleaning, missing value imputation, and feature selection, together with the main methods for each and their strengths and weaknesses. It then surveys the machine learning models applied to breast cancer survival prediction, from traditional Bayesian networks, logistic regression, and support vector machines to more recent deep learning techniques, and analyzes the application scenarios and performance characteristics of each. Next, it examines challenges common to current studies, such as missing data, class imbalance, and limited model interpretability, and summarizes the mainstream strategies for addressing them, including SMOTE oversampling, hybrid-model imputation, and data augmentation with generative adversarial networks. Finally, it outlines future directions for the field, including multimodal data fusion, fully automated model construction, and feature selection based on causal inference. The review aims to provide a systematic theoretical reference for precision medicine and clinical decision support in breast cancer.
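To make the class-imbalance strategy named in the abstract concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is an interpolation between a real minority sample and one of its nearest minority neighbours. It is an illustration under stated assumptions, not the procedure used by any study in this review. The function name `smote`, its parameters, and the toy two-feature data are all hypothetical.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (the basic idea of SMOTE; simplified for illustration)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        # Squared Euclidean distance between two feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base sample among the other minority samples
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four minority-class patients, two numeric features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the range already covered by the minority class. In practice, oversampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data.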
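Citation: Chen Xinyue. A Review of Data Mining and Machine Learning Applications in Breast Cancer Prognostic Prediction [J]. Statistics and Application, 2026, 15(4): 90-96. https://doi.org/10.12677/sa.2026.154074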
