Machine Learning-Driven Financial Fraud Detection Model
DOI: 10.12677/sa.2025.1411326
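Authors: Liang Chuwen: College of Science, North China University of Technology, Beijing; Zhang Muyao: School of Economics and Management, North China University of Technology, Beijing; Gu Siqi: School of Electrical and Control Engineering (School of Unmanned Aerial Vehicles), North China University of Technology, Beijing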
Keywords: Financial Fraud, Random Forest Model, XGBoost Model
Abstract: Financial fraud is the manipulation of financial information by enterprises or individuals for private gain. This paper uses machine learning models with strong detection performance to build a financial fraud identification model with good classification ability, with the aim of improving the efficiency of financial auditing. The modeling has three parts. First, data preprocessing: financial statements of different companies from 2017 to 2023 were collected together with data on fraudulent companies and then cleaned; six key financial indicators were selected on the basis of financial theory and independent-samples t-tests in SPSS; and the data were finally Z-score standardized. Second, the Random Forest model: the model was built in Python and evaluated with a confusion matrix, giving a recall of 58% on fraudulent samples and an AUC of 73%, which indicates good discriminative performance but leaves room to improve fraud recall. Third, the XGBoost model: the model was built in Python, with stratified sampling used to construct the datasets, the histogram algorithm used to speed up tree construction, and cross-validation with early stopping used to train the model and monitor its performance. Evaluated with a confusion matrix, the model reached a recall of 67% on fraudulent samples and an AUC of 79.06%, indicating good classification ability on both classes and a relatively high fraud recall.
Citation: Liang Chuwen, Zhang Muyao, Gu Siqi. Machine Learning-Driven Financial Fraud Detection Model[J]. Statistics and Application, 2025, 14(11): 236-248. https://doi.org/10.12677/sa.2025.1411326
