Medical Expenses Prediction Based on Boosting Algorithms: Using Data of Nasopharyngeal Carcinoma (NPC)
DOI: 10.12677/CSA.2019.911237. Supported by research project funding.
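Authors: Lei Cao, Yihui He, Yuelin Liu, Yushan Jiang*: School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei; Institute of Data Analysis and Intelligent Computing, Northeastern University at Qinhuangdao, Qinhuangdao, Hebei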
Keywords: CART; Nasopharyngeal Carcinoma (NPC); AdaBoost; Gradient Boosting; DBRT; Regression Evaluation Index; Feature Importance; Partial Dependency
Abstract: The data in this paper come from 2064 nasopharyngeal carcinoma (NPC) cases at a cancer hospital in Guangdong Province. We mine these records to predict each patient's medical expenses, proceeding in four steps. First, we select the patient's age, gender, TNM diagnostic stage, length of stay, and other features as predictor variables. Second, we build a cost prediction model with the regression decision tree algorithm (CART). Third, we apply two Boosting algorithms, AdaBoost and Gradient Boosting, to improve the existing model. Fourth, we compare the prediction models built by the three algorithms through visual comparison and regression evaluation indices; the best performer is the DBRT (Gradient Boosting Decision Tree) model, with a prediction accuracy of about 85%. Finally, we use feature importance and partial dependence plots to interpret the Boosting-based models in practical terms, which provides a reference for allocating medical insurance resources and for estimating the expected cost of an individual case.
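The central comparison in the abstract, a single regression tree against a gradient-boosted ensemble judged by a regression metric, can be illustrated with a minimal sketch. It uses neither the hospital records nor the authors' code. The data are synthetic, the features `days` and `stage` and the cost formula are invented for illustration, depth-1 stumps stand in for full CART trees, and every function name is hypothetical.

```python
import random

def fit_stump(X, target):
    """Depth-1 regression tree: pick the (feature, threshold) split with least squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, target) if row[j] <= t]
            right = [r for row, r in zip(X, target) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]  # (feature index, threshold, left mean, right mean)

def predict_stump(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def fit_gbrt(X, y, n_rounds=150, learning_rate=0.1):
    """Gradient boosting with squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row) for p, row in zip(pred, X)]
    return base, learning_rate, stumps

def predict_gbrt(model, row):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

def r2_score(y_true, y_pred):
    """Coefficient of determination, one common regression evaluation index."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot

# Synthetic stand-in data (assumption): cost grows with length of stay and stage.
rng = random.Random(0)
X = [[rng.randint(1, 40), rng.randint(1, 4)] for _ in range(300)]  # [days, stage]
y = [1000.0 * days + 5000.0 * stage + rng.gauss(0, 2000.0) for days, stage in X]
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

single = fit_stump(X_train, y_train)
stump_r2 = r2_score(y_test, [predict_stump(single, row) for row in X_test])

model = fit_gbrt(X_train, y_train)
gbrt_r2 = r2_score(y_test, [predict_gbrt(model, row) for row in X_test])

print(f"single stump R^2: {stump_r2:.3f}")
print(f"boosted stumps R^2: {gbrt_r2:.3f}")
```

On this toy data the boosted ensemble scores a higher held-out R² than the single stump, which mirrors the direction of the paper's comparison but says nothing about its reported figures. In practice one would more likely use a library implementation such as scikit-learn's `GradientBoostingRegressor` and `AdaBoostRegressor` rather than hand-written stumps.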
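Cite this article: Cao, L., He, Y.H., Liu, Y.L. and Jiang, Y.S. (2019) Medical Expenses Prediction Based on Boosting Algorithms: Using Data of Nasopharyngeal Carcinoma (NPC). Computer Science and Application, 9(11), 2115-2128. https://doi.org/10.12677/CSA.2019.911237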
