Text Emotional Analysis Based on Hybrid Information Gain Algorithm
Abstract: Traditional information gain feature selection is biased in the features it selects, and it ignores how a feature's term frequency differs between classes. To address both problems, this paper proposes a text sentiment analysis algorithm based on hybrid information gain. Feature selection introduces three coefficients: an inverse document frequency coefficient, an inter-class term frequency coefficient, and a chi-square statistic coefficient. Together they make effective use of term frequency information across the entire document set, term frequency information across classes, and low-frequency words that carry strong sentiment. Experimental results show that the hybrid information gain method improves the quality of feature selection and thereby raises the accuracy of text sentiment analysis by about 2% to 5%.
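The abstract names the three coefficients but does not give their formulas, so the following Python sketch is only an illustration of the general idea, not the authors' method. It computes classic information gain, a smoothed inverse document frequency, and a chi-square statistic for one term. The function `class_tf_share`, the multiplicative combination in `hybrid_score`, and the `1 + chi` scaling are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Classic IG of the presence/absence of `term` with respect to the labels."""
    classes = sorted(set(labels))
    n = len(docs)
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    h_class = entropy([labels.count(c) for c in classes])
    h_cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            h_cond += len(subset) / n * entropy([subset.count(c) for c in classes])
    return h_class - h_cond


def idf(docs, term):
    """Smoothed inverse document frequency (one common variant)."""
    df = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (df + 1)) + 1


def chi_square(docs, labels, term, cls):
    """Chi-square statistic of `term` versus class `cls` from a 2x2 table."""
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == cls)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != cls)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == cls)
    e = sum(1 for d, y in zip(docs, labels) if term not in d and y != cls)
    n = a + b + c + e
    denom = (a + c) * (b + e) * (a + b) * (c + e)
    return n * (a * e - b * c) ** 2 / denom if denom else 0.0


def class_tf_share(docs, labels, term):
    """Hypothetical inter-class coefficient: largest per-class share of the
    term's total occurrences (1.0 means the term appears in one class only)."""
    tf = Counter()
    for d, y in zip(docs, labels):
        tf[y] += d.count(term)
    total = sum(tf.values())
    return max(tf.values()) / total if total else 0.0


def hybrid_score(docs, labels, term):
    """Assumed combination: IG scaled by the three coefficients."""
    chi = max(chi_square(docs, labels, term, c) for c in set(labels))
    return (information_gain(docs, labels, term)
            * idf(docs, term)
            * class_tf_share(docs, labels, term)
            * (1 + chi))


# Toy corpus: each document is a list of tokens.
docs = [["good", "great"], ["good", "fine"], ["bad", "awful"], ["bad", "fine"]]
labels = ["pos", "pos", "neg", "neg"]
ranked = sorted({t for d in docs for t in d},
                key=lambda t: hybrid_score(docs, labels, t), reverse=True)
```

On this toy corpus, a term such as "good" that appears only in positive documents scores higher than "fine", which is spread evenly across both classes. A real experiment would substitute the coefficient definitions from the full paper.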
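Citation: Li Yuqiang, Hong Zhiyong, Chen Jinghui. Text Emotional Analysis Based on Hybrid Information Gain Algorithm [J]. Computer Science and Application, 2019, 9(12): 2314-2322. https://doi.org/10.12677/CSA.2019.912257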
