Research on Imbalanced Data Classification Based on Transformer-CatBoost Fusion Model
Abstract: In binary classification, poorly handled class imbalance and a large number of heterogeneous features make it difficult to capture the complex correlations among features. To address this, we build a Transformer-CatBoost fusion model that uses a Transformer to mine deep information from user data and CatBoost, which is resistant to overfitting, to perform efficient classification. First, a default-prediction classification head is added to the Transformer encoder to form a BaseTransformer. Next, M BaseTransformers are ensembled as the first-layer learners; their predictions, together with the original features, are fed to the second-layer learner, CatBoost, yielding a Stacking-based fusion model. Using several evaluation metrics, we compare ten methods for handling data imbalance and select the NCR method for the experiments; Optuna is then used to tune the model parameters. Finally, the model is compared with a range of baseline models, ablation experiments verify its effectiveness and feasibility, and the Lending Club dataset is used to demonstrate its generalization ability.
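The two-layer data flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: a tiny logistic regression (`TinyLogistic`, a hypothetical helper) stands in for both the BaseTransformer and CatBoost so that the example runs with the Python standard library alone. The first-layer predictions are produced out-of-fold, which is a common Stacking practice and an assumption here, since the abstract does not say how they are generated. The toy data, `m`, `k`, and all function names are likewise illustrative.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyLogistic:
    """Stand-in learner (NOT a Transformer or CatBoost): logistic regression via SGD."""

    def __init__(self, n_features, seed, epochs=200, lr=0.1):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
        self.b = 0.0
        self.epochs = epochs
        self.lr = lr

    def predict_proba_one(self, xi):
        return sigmoid(sum(w * v for w, v in zip(self.w, xi)) + self.b)

    def predict_proba(self, X):
        return [self.predict_proba_one(xi) for xi in X]

    def fit(self, X, y):
        for _ in range(self.epochs):
            for xi, yi in zip(X, y):
                err = self.predict_proba_one(xi) - yi  # gradient of log loss
                self.w = [w - self.lr * err * v for w, v in zip(self.w, xi)]
                self.b -= self.lr * err
        return self


def stacked_features(X, y, m=3, k=2):
    """Append out-of-fold predictions of m first-layer learners to the original features."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved k-fold split
    oof = [[0.0] * m for _ in range(n)]
    for j in range(m):  # one column per first-layer learner
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in range(n) if i not in held]
            model = TinyLogistic(len(X[0]), seed=j).fit(
                [X[i] for i in train_idx], [y[i] for i in train_idx]
            )
            for i in held_out:
                oof[i][j] = model.predict_proba_one(X[i])
    return [list(xi) + oof[i] for i, xi in enumerate(X)]


# Toy imbalanced data: six negatives, two positives (illustrative only).
X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4],
     [0.4, 0.2], [0.1, 0.4], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

Z = stacked_features(X, y, m=3, k=2)            # original features + 3 prediction columns
meta = TinyLogistic(len(Z[0]), seed=99).fit(Z, y)  # second-layer learner (stand-in for CatBoost)
scores = meta.predict_proba(Z)
```

In this sketch the first-layer learners differ only in their random initialization, so they contribute little diversity. In a real pipeline each stand-in would be replaced by a trained BaseTransformer and the second-layer learner by a CatBoost classifier, with resampling applied to the training folds only.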
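Cite this article: Hu Yidan, Gao Yang, Guo Zikuan. Research on Imbalanced Data Classification Based on Transformer-CatBoost Fusion Model [J]. Pure Mathematics (理论数学), 2026, 16(6): 155-169. https://doi.org/10.12677/pm.2026.166165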
