Construction and Clinical Application of Lung Cancer Risk Grading Prediction Model Driven by Multi-Dimensional Feature Interaction
Abstract: Lung cancer is the leading cause of cancer death worldwide, so accurate risk grading and a clearer understanding of its pathogenic mechanisms matter for clinical diagnosis, treatment, and the efficiency of early screening. This study uses a Kaggle dataset of 1000 lung cancer patients covering 24 features across multiple dimensions, including environmental exposure and past medical history. It also includes rarely explored factors such as dust allergy, and systematically examines how interactions between features relate to lung cancer risk level. This departs from earlier work, which often concentrated on a single dimension such as clinical features and overlooked other influences. Five representative models were compared using visualization analysis, data augmentation, and hyperparameter tuning, and model interpretability was strengthened with feature importance ranking and decision graphs. The aim was to select a prediction model that balances accuracy, robustness, and interpretability, and to identify its key influencing factors, in support of rapid early screening for lung cancer. The results show that, in addition to mainstream factors such as alcohol consumption (0.72) and passive smoking (0.7), rarely examined factors such as dust allergy (0.71) and rarely studied interactions such as that between dust allergy and occupational hazards (0.79) deserve attention. The random forest model performed best, with an accuracy of 98%; coughing up blood, alcohol consumption, and obesity were its three most important predictors. The resulting high-accuracy model offers a reliable tool for early lung cancer screening, and the newly identified features and interactions suggest new directions for research on lung cancer etiology and for individualized prevention and control.
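The screening of single features and pairwise interactions against risk level, as described in the abstract, can be illustrated with a minimal sketch. The abstract does not say which statistic the parenthesized values represent; the sketch assumes, purely for illustration, a Pearson correlation between each ordinal feature (or the product of two features) and an ordinal risk level. The data are synthetic, and the names `pearson`, `make_synthetic`, `rank_by_correlation`, and the column `unrelated` are hypothetical; none of this is the author's code or dataset.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def make_synthetic(n=1000, seed=0):
    """Synthetic stand-in data (an assumption, not the Kaggle dataset):
    ordinal feature scores 1..8 and a risk level 1..3 that depends on the
    product of two features plus Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        dust = rng.randint(1, 8)
        hazard = rng.randint(1, 8)
        unrelated = rng.randint(1, 8)
        score = dust * hazard + rng.gauss(0, 4)
        level = 1 if score < 12 else 2 if score < 30 else 3
        rows.append({
            "dust_allergy": dust,
            "occupational_hazard": hazard,
            "unrelated": unrelated,
            "risk_level": level,
        })
    return rows


def rank_by_correlation(rows, features, target="risk_level"):
    """Score every single feature and every pairwise product (a simple
    interaction term) by its correlation with the target, sorted by
    absolute value, strongest first."""
    ys = [r[target] for r in rows]
    scores = {f: pearson([r[f] for r in rows], ys) for f in features}
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            product = [r[a] * r[b] for r in rows]
            scores[f"{a}*{b}"] = pearson(product, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


rows = make_synthetic()
ranking = rank_by_correlation(
    rows, ["dust_allergy", "occupational_hazard", "unrelated"]
)
for name, r in ranking:
    print(f"{name:40s} {r:+.2f}")
```

Because the synthetic risk level is generated from the product of the two exposure scores, the interaction term ranks above either feature alone, which mirrors the kind of pattern the abstract reports for dust allergy and occupational hazards. On real data, a tree-based model's feature importances would be a natural complement to this kind of screening.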
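Cite this article: Ma Lingrong (马翎容). Construction and Clinical Application of Lung Cancer Risk Grading Prediction Model Driven by Multi-Dimensional Feature Interaction [J]. Statistics and Application (统计学与应用), 2025, 14(11): 67-76. https://doi.org/10.12677/sa.2025.1411311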
