Gradient Boosted Email Classification through Integration of Co-Occurrence Network Features and Knowledge-Enhanced Semantics
DOI: 10.12677/mos.2025.143217. Supported by research project funding.
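Authors: Jun Ai (艾均), Zhiyang Zou* (邹智洋), Zhan Su* (苏湛), Aiguo Geng (耿爱国), Wanyan Ma (马菀言): School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai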
Keywords: Email Classification, Text Classification, ERNIE, Complex Network, XGBoost
Abstract: Existing email classification algorithms lack knowledge network features and are costly to train. To address both problems, this paper applies complex network theory and a knowledge-enhanced semantic model to design a gradient boosting algorithm based on email knowledge co-occurrence network features and knowledge-enhanced semantics, and studies how the email knowledge network and the knowledge representations of an enhanced deep learning model can improve classification performance. First, lexical co-occurrence is used to construct a co-occurrence network of email knowledge. Second, the Vivaldi algorithm maps the nodes of the co-occurrence network into a tensor space, producing a spatial embedding for each knowledge node. Third, centrality features of the co-occurrence network are computed and combined with the Vivaldi semantic space embeddings, then fused with the text semantic features generated by the knowledge-enhanced semantic model. Finally, a gradient boosting algorithm learns the email classifier. In experiments, the proposed algorithm shows clear improvements over current leading models in accuracy, precision, and recall, which verifies its effectiveness and indicates that email knowledge network features can enhance existing models by complementing their representational capability.
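The network-side steps named in the abstract (co-occurrence network, Vivaldi-style node embedding, centrality features) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the sliding window of 2 tokens, the spring rest length of 1/weight, degree centrality as the centrality measure, averaging node features per email, and all function names and the toy emails. The semantic features from the knowledge-enhanced model and the gradient boosting classifier are not shown; the vectors produced here would be concatenated with those semantic features before training.

```python
import math
import random
from collections import Counter


def build_cooccurrence(docs, window=2):
    """Count undirected word pairs that occur within `window` tokens of each other."""
    weights = Counter()
    for tokens in docs:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if u != v:
                    weights[tuple(sorted((u, v)))] += 1
    return weights


def degree_centrality(weights):
    """Fraction of the other nodes that each node is directly linked to."""
    neighbors = {}
    for u, v in weights:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) if n > 1 else 0.0 for node, adj in neighbors.items()}


def vivaldi_embed(weights, dim=2, steps=2000, delta=0.05, seed=0):
    """Vivaldi-style spring relaxation over the co-occurrence edges.

    Assumption: an edge with co-occurrence count w has rest length 1 / w,
    so words that co-occur more often are pulled closer together.
    Simplification: both endpoints move, with a constant step size `delta`.
    """
    rng = random.Random(seed)
    nodes = sorted({n for edge in weights for n in edge})
    pos = {n: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for n in nodes}
    edges = list(weights.items())
    for _ in range(steps):
        (u, v), w = rng.choice(edges)
        diff = [a - b for a, b in zip(pos[u], pos[v])]
        dist = math.sqrt(sum(d * d for d in diff)) or 1e-9
        error = 1.0 / w - dist  # positive: too close, push apart
        step = delta * error / dist
        for k in range(dim):
            pos[u][k] += step * diff[k]
            pos[v][k] -= step * diff[k]
    return pos


def email_features(tokens, pos, centrality, dim=2):
    """Average the coordinates and centrality of the known words in one email."""
    known = [t for t in tokens if t in pos]
    if not known:
        return [0.0] * (dim + 1)
    coords = [sum(pos[t][k] for t in known) / len(known) for k in range(dim)]
    return coords + [sum(centrality[t] for t in known) / len(known)]


# Hypothetical toy corpus: two work emails and two spam-like emails.
emails = [
    ["meeting", "schedule", "project", "deadline"],
    ["project", "deadline", "budget", "report"],
    ["win", "free", "prize", "click"],
    ["free", "prize", "click", "offer"],
]
weights = build_cooccurrence(emails)
centrality = degree_centrality(weights)
coords = vivaldi_embed(weights)
features = [email_features(e, coords, centrality) for e in emails]
```

Each entry of `features` holds the averaged coordinates followed by the averaged centrality for one email. Keeping the network features as a small fixed-length vector is what makes it straightforward to append them to a sentence-level semantic vector and hand the result to a tree-based learner.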
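Citation: Ai Jun, Zou Zhiyang, Su Zhan, Geng Aiguo, Ma Wanyan. Gradient Boosted Email Classification through Integration of Co-Occurrence Network Features and Knowledge-Enhanced Semantics [J]. Modeling and Simulation (建模与仿真), 2025, 14(3): 222-237. https://doi.org/10.12677/mos.2025.143217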
