Disease Risk Prediction Based on the Stacking Ensemble Learning Algorithm: The Case of Gestational Diabetes
Abstract:
This paper applies four missing-value handling schemes and compares their relative merits using six machine learning algorithms. For each algorithm, it gives the measures that should be taken to prevent the model from overfitting. By comparing the F1 scores of the algorithms' predictions, suitable models are selected as the primary learners of the Stacking ensemble learning algorithm, and logistic regression is then chosen as its secondary learner. Finally, tuning the parameters of the logistic regression yields a Stacking ensemble learning model for predicting the risk of gestational diabetes that has high accuracy and strong generalization ability.
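The pipeline summarized in the abstract (score candidate models by F1, keep the better ones as primary learners, stack them under a logistic regression secondary learner, then tune that learner's parameters) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The synthetic data, the three candidate models, the "keep the top two" selection rule, and the grid of `C` values are all assumptions made for the example; the abstract does not name the six algorithms or state the selection threshold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): the paper's gestational diabetes
# data set is not reproduced here.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical candidate pool. Depth limits and a moderate k are simple
# examples of the kind of overfitting control the abstract mentions.
candidates = {
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, max_depth=4,
                                     random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=7),
}

# Score each candidate by its mean cross-validated F1 on the training split.
cv_f1 = {
    name: cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    for name, model in candidates.items()
}

# Assumed selection rule: keep the two candidates with the highest F1.
selected = sorted(cv_f1, key=cv_f1.get, reverse=True)[:2]
base_learners = [(name, candidates[name]) for name in selected]

# Primary learners feed out-of-fold predictions to a logistic regression
# secondary learner.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Tune the secondary learner's regularization strength C (assumed grid).
search = GridSearchCV(
    stack,
    param_grid={"final_estimator__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the tuned stack on the held-out split.
test_f1 = f1_score(y_test, search.predict(X_test))
print(selected, search.best_params_, round(test_f1, 3))
```

Selecting the primary learners on cross-validated F1 from the training split only, and reporting the final F1 on a held-out split, keeps the model selection from leaking into the reported score.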