Application of Deep Learning Algorithm in News Text Classification
Abstract: News text classification is a multi-class task that divides news texts into different categories, aiming to identify the key semantic information of a text and to provide users with convenient access to target news. Based on deep learning, this paper constructs a news text classification model called textPDCNN, which mainly consists of an LSTM layer, a multi-scale dilated convolution layer, a pooling layer and a fully connected layer. The LSTM layer extracts the semantic information of the text; the multi-scale dilated convolution layer fuses the information at different scales captured by dilated convolutions with different dilation rates; the pooling layer selects the features most conducive to classification; and the fully connected layer maps the extracted features to the classification space. Experiments show that, compared with the traditional HAN (Hierarchical Attention Networks) model, textPDCNN improves the Macro_P, Macro_R and Macro_F indicators by 0.74%, 0.61% and 0.49%, respectively. Moreover, the performance of textPDCNN on the public dataset THUCNews exceeds the best results reported in the literature. In summary, the proposed model integrates phrase features of different lengths well and has clear advantages in news text classification tasks.
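The textPDCNN model itself is not reproduced in this abstract, but the core mechanism it relies on can be illustrated. The sketch below (plain NumPy, not the authors' code; the function name `dilated_conv1d` is our own) shows why the multi-scale dilated convolution layer captures phrases of different lengths: the same small kernel covers a progressively wider receptive field as the dilation rate grows, without adding parameters.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid 1-D dilated convolution (cross-correlation) of sequence x with kernel w."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field covered by the kernel
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(10, dtype=float)             # toy feature sequence
w = np.array([1.0, 1.0, 1.0])              # kernel of size 3

# Same 3-tap kernel, growing dilation rate -> receptive fields of 3, 5 and 7 positions;
# a multi-scale layer would concatenate or fuse these outputs.
for d in (1, 2, 3):
    print(d, dilated_conv1d(x, w, d))
```

In the full model, outputs computed at several dilation rates would be fused and then pooled, which is what lets the network combine short- and long-range phrase features in one layer.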
Citation: Li, W., Wang, B.S. and Lin, M.Z. (2022) Application of Deep Learning Algorithm in News Text Classification. Advances in Applied Mathematics, 11(10), 7348-7361. https://doi.org/10.12677/AAM.2022.1110781
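The Macro_P, Macro_R and Macro_F figures quoted in the abstract are macro-averaged metrics: precision, recall and F1 are computed per class and then averaged with equal weight, so small categories count as much as large ones. A minimal pure-Python sketch (the class labels below are illustrative, not from the THUCNews experiments):

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all classes present in y_true."""
    classes = sorted(set(y_true))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

y_true = ["sports", "sports", "finance", "tech", "tech", "tech"]
y_pred = ["sports", "finance", "finance", "tech", "tech", "sports"]
P, R, F = macro_prf(y_true, y_pred)
```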
