Cross Feature Engineering for Anti-Fraud Task in Insurance
DOI: 10.12677/AIRR.2024.132048. Supported by the National Natural Science Foundation of China.
Authors: Dong Jinni (董今妮), Deng Xiao (邓潇), Na Chongning (那崇宁), Yang Yao (杨耀), Chen Kui (陈奎)*: Zhejiang Lab, Hangzhou, Zhejiang
Keywords: Insurance Anti-Fraud, AI, Feature Exponential Growth, Cross Feature Engineering
Abstract: Feature engineering is a core step in applying machine learning to real-world tasks, and its quality sets the upper bound on model performance. This paper focuses on auto insurance anti-fraud and studies cross-table feature engineering, addressing table aggregation and efficient feature mining in order to support downstream anti-fraud modeling. Feature engineering algorithms for a single table are relatively mature, whereas algorithms for multiple related tables are comparatively few. Compared with single-table feature engineering, feature derivation across tables involves far more candidate features and is more prone to feature explosion. To address this problem, we propose xDFS, an optimization of DFS (Deep Feature Synthesis). xDFS introduces a groupby-based statistical analysis of each single table to obtain statistical features between different entities in the same table, which avoids the entity extraction (feature splitting) and feature aggregation that DFS requires during data preprocessing. It then uses an xgboost model to compute the optimal combinations of derived features, which prevents the number of features from growing exponentially as the synthesis depth increases. In experiments on two public datasets and one auto insurance dataset, DFS exhibited feature explosion at larger derivation depths, while xDFS did not on any of the three datasets.
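The groupby-based single-table statistics mentioned in the abstract can be sketched with pandas. This is only an illustration under stated assumptions, not the authors' implementation: the claims table, its column names, the helper `add_group_stats`, and the choice of statistics are all hypothetical, and the xgboost-based selection of feature combinations is not shown.

```python
import pandas as pd

# Hypothetical claims table: several claims per policy holder and per repair shop.
claims = pd.DataFrame({
    "claim_id": [1, 2, 3, 4, 5],
    "holder_id": ["a", "a", "b", "b", "b"],
    "shop_id": ["s1", "s2", "s1", "s1", "s2"],
    "amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})


def add_group_stats(df, key, value, stats=("mean", "max", "count")):
    """Attach per-group statistics of `value`, grouped by `key`, to every row.

    The table is never split into separate entity tables: the statistics are
    computed with groupby and broadcast back onto the original rows.
    """
    out = df.copy()
    grouped = df.groupby(key)[value]
    for stat in stats:
        out[f"{value}_{stat}_by_{key}"] = grouped.transform(stat)
    return out


# Derive statistics of the claim amount for two different entities in the same table.
features = add_group_stats(claims, key="holder_id", value="amount")
features = add_group_stats(features, key="shop_id", value="amount")
print(features.columns.tolist())
```

Each call adds one column per statistic, so the number of derived columns grows with the number of (entity, value, statistic) combinations. That growth is why some selection step over candidate features is needed once derivations are stacked to greater depth.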
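Citation: Dong Jinni, Deng Xiao, Na Chongning, Yang Yao, Chen Kui. Cross Feature Engineering for Anti-Fraud Task in Insurance [J]. Artificial Intelligence and Robotics Research, 2024, 13(2): 467-477. https://doi.org/10.12677/AIRR.2024.132048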
