Opinion Sentences Recognition in News Comments Based on TextRank and BERT Pre-Training Model
DOI: 10.12677/CSA.2022.126148. Supported by the National Natural Science Foundation of China.
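Authors: Wang Hongbin: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan. Li Yitong: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan; Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming, Yunnan; Yunnan Key Laboratory of Computer Technology Application, Kunming University of Science and Technology, Kunming, Yunnan. Li Hui*: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan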
Keywords: Opinion Mining, Opinion Sentences Recognition, Text Classification, Deep Learning, News Comments
Abstract: Because users' opinion sentences are closely tied to the content of the news they comment on, identifying opinion sentences in news comments requires attending to the news text as additional information. News texts are usually long, however, and BERT does not handle long input sequences well. To address this problem, this paper proposes a method that combines the TextRank algorithm with the BERT pre-trained model. TextRank first extracts a summary from the news text, so that a long news article is represented by a shorter text without losing semantic information. The news summary and the comment are then passed through BERT to obtain a fused semantic representation vector, and a fully connected layer converts this vector into the probability that the comment is an opinion sentence. Compared with deep learning text classification models popular in recent years, the proposed method achieves the best accuracy, 79.80%, which demonstrates its effectiveness. On the NLPCC 2012 Weibo opinion sentence recognition dataset it also achieves the best accuracy, 80.38%, which indicates that the model has a certain generalization ability.
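The first stage of the pipeline described in the abstract, shortening a long news text with TextRank and pairing the summary with a comment, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the sentence similarity from the original TextRank formulation (shared tokens normalized by log sentence lengths) and a simple regex tokenizer; Chinese text would need a real word segmenter. The function names `textrank_summary` and `build_pair_input` and the parameter `k` are hypothetical. The BERT encoder and the fully connected classification layer are not shown.

```python
import math
import re


def split_sentences(text):
    """Split on Chinese or Western sentence-ending punctuation."""
    parts = re.findall(r"[^。!?!?.]+[。!?!?.]?", text)
    return [p.strip() for p in parts if p.strip()]


def default_tokenize(sentence):
    # Assumption: regex word tokens; swap in a word segmenter for Chinese.
    return re.findall(r"\w+", sentence.lower())


def similarity(a, b):
    """Shared tokens normalized by the log lengths of both sentences."""
    denom = math.log(len(a) or 1) + math.log(len(b) or 1)
    if denom <= 0:
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank_summary(text, k=2, damping=0.85, iterations=50,
                     tokenize=default_tokenize):
    """Return the k highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= k:
        return sentences
    tokens = [tokenize(s) for s in sentences]
    # Symmetric weighted graph over sentences, no self-loops.
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    # Weighted PageRank by power iteration.
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] * scores[j] / out_sum[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def build_pair_input(summary_sentences, comment):
    """Lay out summary and comment as a BERT-style sentence pair string."""
    summary = " ".join(summary_sentences)
    return "[CLS] " + summary + " [SEP] " + comment + " [SEP]"
```

In practice, a BERT tokenizer would build the pair encoding itself from the two strings; `build_pair_input` only makes the layout of the fused input explicit. The value of `k` trades summary length against the model's maximum sequence length.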
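Citation: Wang Hongbin, Li Yitong, Li Hui. Opinion Sentences Recognition in News Comments Based on TextRank and BERT Pre-Training Model [J]. Computer Science and Application, 2022, 12(6): 1489-1498. https://doi.org/10.12677/CSA.2022.126148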
