A Comparative Study of Imputation Methods Based on Breast Cancer Data
DOI: 10.12677/fia.2025.141002. Supported by research project funding.
Authors: Yang Dan (杨丹), Zuo Junxi (左俊希): School of Science, Chongqing University of Technology, Chongqing
Keywords: Missing Data, Multiple Imputation, KNN Imputation, Mean Imputation
Abstract: Missing data is a persistent challenge in data analysis. It degrades model performance, so methods that fill in missing values as accurately as possible are especially important. This paper compares common imputation methods on the “Wisconsin Breast Cancer Diagnosis” dataset. The original data are first made incomplete under a missing completely at random (MCAR) mechanism. Two models, a Logistic model and a support vector machine (SVM) model, are then fitted to compare three imputation methods (mean imputation, KNN imputation, and multiple imputation) under different missing rates (10%, 30%) and different numbers of covariates with missing values (3, 6). Accuracy, F1 score, and AUC serve as the quantitative measures of imputation quality. The experimental results show that the SVM model fits the breast cancer dataset markedly better than the Logistic model, and that the performance of every imputation method declines as the missing rate and the number of covariates with missing values increase. The size of the decline differs, however: multiple imputation is clearly the most stable, with the smallest decline, and it also gives the best imputation results overall. At a missing rate of 30% with 6 covariates containing missing values, the Logistic model fitted after multiple imputation reaches an accuracy, F1 score, and AUC of 0.894, 0.923, and 0.872, and the SVM model reaches 0.923, 0.94, and 0.908, respectively. Multiple imputation, which generates multiple datasets to represent the uncertainty of the missing data, is therefore markedly more robust and reliable than mean imputation and KNN imputation on the “Wisconsin Breast Cancer Diagnosis” dataset after MCAR processing.
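The experimental design described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses the copy of the Wisconsin Diagnostic Breast Cancer data bundled with scikit-learn (`load_breast_cancer`), approximates multiple imputation by running `IterativeImputer(sample_posterior=True)` under several seeds and averaging the classifiers' decision scores, and picks an 80/20 split, m = 5 imputations, and k = 5 neighbours arbitrarily. The helper names `make_mcar`, `imputed_pairs`, and `run` are hypothetical, and the numbers it prints are not expected to match those reported in the paper.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_mcar(X, n_cols, rate, rng):
    """Blank out a fraction `rate` of rows in `n_cols` randomly chosen columns (MCAR)."""
    X = X.copy()
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    n_missing = int(round(rate * X.shape[0]))
    for c in cols:
        rows = rng.choice(X.shape[0], size=n_missing, replace=False)
        X[rows, c] = np.nan
    return X


def imputed_pairs(method, X_tr, X_te, m=5):
    """Return a list of (train, test) imputed copies; one pair for single imputation, m for MI."""
    if method == "mean":
        imputers = [SimpleImputer(strategy="mean")]
    elif method == "knn":
        imputers = [KNNImputer(n_neighbors=5)]
    else:  # "mi": assumed stand-in for multiple imputation, one posterior draw per seed
        imputers = [IterativeImputer(sample_posterior=True, max_iter=10, random_state=s)
                    for s in range(m)]
    # Imputers are fitted on the training part only, then applied to the test part.
    return [(imp.fit_transform(X_tr), imp.transform(X_te)) for imp in imputers]


def run(rate=0.3, n_cols=6, seed=0):
    """Compare the three imputation methods for one (missing rate, covariate count) setting."""
    rng = np.random.default_rng(seed)
    X, y = load_breast_cancer(return_X_y=True)
    X_miss = make_mcar(X, n_cols, rate, rng)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_miss, y, test_size=0.2, stratify=y, random_state=seed)
    models = {
        "logistic": lambda: make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "svm": lambda: make_pipeline(StandardScaler(), SVC()),
    }
    results = {}
    for method in ("mean", "knn", "mi"):
        pairs = imputed_pairs(method, X_tr, X_te)
        for name, make_model in models.items():
            # Pool across imputed datasets by averaging decision scores (an assumption).
            scores = np.mean(
                [make_model().fit(a, y_tr).decision_function(b) for a, b in pairs], axis=0)
            pred = (scores > 0).astype(int)
            results[(method, name)] = (
                accuracy_score(y_te, pred),
                f1_score(y_te, pred),
                roc_auc_score(y_te, scores),
            )
    return results


results = run(rate=0.3, n_cols=6)
for (method, model), (acc, f1, auc) in results.items():
    print(f"{method:>4} + {model:<8} accuracy={acc:.3f} F1={f1:.3f} AUC={auc:.3f}")
```

Calling `run` with `rate` in (0.1, 0.3) and `n_cols` in (3, 6) would cover the four settings named in the abstract. Repeating each setting over several seeds would give a steadier picture than a single random split.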
Citation: Yang Dan, Zuo Junxi. A Comparative Study of Imputation Methods Based on Breast Cancer Data [J]. 国际会计前沿, 2025, 14(1): 10-19. https://doi.org/10.12677/fia.2025.141002
