Machine Learning-Enabled Insurance Fraud Detection: An Empirical Study Based on a Random Forest Model
Abstract: Insurance fraud detection is a core problem in insurers' risk management and digital transformation. Using the publicly available insurance anti-fraud prediction dataset from the Alibaba Tianchi financial data analysis competition, this study builds an integrated framework covering data preprocessing, exploratory analysis, feature engineering, model comparison, and business translation, and systematically compares Random Forest and XGBoost classifiers on the fraud identification task. Fraudulent cases make up 25.86% of the training sample, and high claim amounts, accident severity, the interval between policy binding and the incident, and premium-related variables all carry substantial predictive value. After parameter tuning and threshold calibration, the Random Forest model reaches an AUC of approximately 0.723 and a recall of 0.691, a workable balance between fraud capture and the false-positive burden. Building on the model output, the paper proposes a three-tier risk stratification mechanism covering underwriting review, claim investigation, and risk-based pricing, which gives a practical path to deployment. The findings indicate that machine learning models can serve effectively as risk-ranking and resource-allocation tools in insurance anti-fraud work, while human review, sample expansion, and continuous feedback remain indispensable complements.
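The threshold calibration and three-tier stratification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the labels and scores below are toy values rather than model output from the Tianchi dataset, the function names (`recall_and_fpr`, `calibrate_threshold`, `risk_tier`) are hypothetical, and the target recall and the upper cut-off are illustrative assumptions. The sketch picks the highest score cut-off that still reaches a target recall, then maps each score to one of three risk tiers.

```python
from typing import List, Sequence, Tuple


def recall_and_fpr(y_true: Sequence[int], scores: Sequence[float],
                   threshold: float) -> Tuple[float, float]:
    """Recall and false-positive rate when cases with score >= threshold are flagged."""
    tp = fn = fp = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if label == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


def calibrate_threshold(y_true: Sequence[int], scores: Sequence[float],
                        target_recall: float) -> float:
    """Return the highest observed score whose recall meets the target.

    Falls back to the lowest score (flag everything) if the target is never met.
    """
    for candidate in sorted(set(scores), reverse=True):
        recall, _ = recall_and_fpr(y_true, scores, candidate)
        if recall >= target_recall:
            return candidate
    return min(scores)


def risk_tier(score: float, low_cut: float, high_cut: float) -> str:
    """Map a fraud score to one of three tiers (tier names are assumptions)."""
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


# Toy data for illustration only: 1 = fraud, 0 = legitimate.
y_true: List[int] = [1, 0, 1, 0, 0, 1, 0, 0]
scores: List[float] = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

# Hypothetical target recall; the paper reports 0.691 on its own data.
threshold = calibrate_threshold(y_true, scores, target_recall=0.66)
recall, fpr = recall_and_fpr(y_true, scores, threshold)
print(f"threshold={threshold}, recall={recall:.3f}, fpr={fpr:.3f}")

# The upper cut-off of 0.85 is an arbitrary illustrative value.
tiers = [risk_tier(s, low_cut=threshold, high_cut=0.85) for s in scores]
print(tiers)
```

In practice the scores would come from a fitted classifier's predicted probabilities on a held-out set, and the tier cut-offs would be set against review capacity and the cost of false positives rather than fixed in advance.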