Research on Telecom Customer Churn Prediction Based on Stacking Ensemble Learning
Abstract: As information infrastructure expands rapidly and the telecommunications market approaches saturation, customer churn has become an urgent problem for telecom operators. This paper predicts and analyzes customer churn using telecom user data. First, missing values are imputed, features are encoded and derived, and SMOTE combined with Tomek Links is used to address class imbalance. Next, six individual models (Random Forest, XGBoost, SVM, Logistic Regression, AdaBoost, and GBDT) are each used to predict churn. To improve the accuracy and robustness of the predictions, the models are then fused with Stacking. The model comparison shows that using SVM as the second-layer model gives the highest accuracy (0.8645), and all of its metrics exceed those of the individual models. The study demonstrates that the Stacking ensemble model is effective for churn prediction, identifies the key factors that influence churn, and offers telecom operators targeted recommendations for reducing churn and thereby improving revenue and profit.
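The two-layer Stacking structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic (`make_classification` stands in for the telecom dataset), the hyperparameters are arbitrary, and the choice of first-layer learners is an assumption, since the abstract names only the second-layer model (SVM). XGBoost is left out so that the sketch depends only on scikit-learn, and the SMOTE and Tomek Link resampling step is omitted as well.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the churn data (assumption:
# the paper's dataset and engineered features are not reproduced here).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# First layer: individual models named in the abstract, minus XGBoost.
# Which of them the authors actually stack is an assumption.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gbdt", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
]

# Second layer: an SVM trained on the out-of-fold predictions of the
# first layer (5-fold cross-validation), as in stacked generalization.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
print(f"stacking accuracy on synthetic data: {accuracy:.4f}")
```

The printed accuracy describes the synthetic data only and is not comparable to the 0.8645 reported in the abstract. To include the resampling step, one option is `SMOTETomek` from the imbalanced-learn package, applied to the training split only so that the test set keeps its original class distribution.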
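Cite this article: Yu Rongrong, Feng Yuan, Xuan Jinhu. Research on Telecom Customer Churn Prediction Based on Stacking Ensemble Learning [J]. Advances in Applied Mathematics, 2024, 13(11): 4896-4907. https://doi.org/10.12677/aam.2024.1311471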
