A Simple Implementation of an Efficient Naive Bayes Model for Web News Text Classification
DOI: 10.12677/SA.2014.31005
Authors: Wu Zhihui (吴致晖), Liu Hongwei (刘洪伟), Chen Li (陈丽): School of Management, Guangdong University of Technology, Guangzhou
Keywords: Text Classification; Feature Selection; Naive Bayes; TF-IDF Standard
Abstract: Naive Bayes assumes that features occur independently of one another and that every feature is equally important. When it is used as a text classification algorithm, choosing an effective feature selection method is therefore especially important. This paper uses the TF-IDF standard [1] of the jieba Chinese word segmentation module to select features from the training news texts and implements a text classification model based on Naive Bayes. The news texts to be classified are processed with the same TF-IDF standard to extract their keywords before the classification test. The experimental results show that the method achieves fairly high classification efficiency.
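Cite this article: Wu, Z.H., Liu, H.W. and Chen, L. (2014) A Simple Implementation of an Efficient Naive Bayes Model for Web News Text Classification. Statistics and Application (统计学与应用), 3(1), 30-35. http://dx.doi.org/10.12677/SA.2014.31005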
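
The pipeline summarized in the abstract (keywords extracted per document, then Naive Bayes classification over those keywords) can be illustrated with a minimal sketch. This is not the authors' code: the class name `KeywordNaiveBayes`, the Laplace smoothing constant `alpha`, and the toy keyword lists are all assumptions made for illustration. In practice the keyword lists would come from a TF-IDF extractor such as `jieba.analyse.extract_tags`; here they are hard-coded so the example runs with the standard library only.

```python
import math
from collections import Counter, defaultdict


class KeywordNaiveBayes:
    """Multinomial Naive Bayes over per-document keyword lists (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant (assumed value)
        self.class_doc_counts = Counter()       # number of training documents per class
        self.word_counts = defaultdict(Counter) # keyword counts per class
        self.vocabulary = set()                 # all keywords seen in training

    def fit(self, keyword_lists, labels):
        for keywords, label in zip(keyword_lists, labels):
            self.class_doc_counts[label] += 1
            self.word_counts[label].update(keywords)
            self.vocabulary.update(keywords)
        return self

    def log_scores(self, keywords):
        """Return log P(class) + sum of log P(keyword | class) for each class."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(doc_count / total_docs)
            for word in keywords:
                if word in self.vocabulary:  # skip keywords never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, keywords):
        scores = self.log_scores(keywords)
        return max(scores, key=scores.get)


# Toy, made-up keyword lists standing in for TF-IDF keywords of news texts.
train_keywords = [
    ["股票", "市场", "银行"],
    ["银行", "利率", "投资"],
    ["比赛", "球队", "进球"],
    ["球队", "教练", "联赛"],
]
train_labels = ["finance", "finance", "sports", "sports"]

model = KeywordNaiveBayes().fit(train_keywords, train_labels)
print(model.predict(["银行", "投资"]))
print(model.predict(["球队", "比赛"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a keyword that is absent from one class from driving that class's probability to zero.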

References

[1] Salton, G. and McGill, M.J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York.
[2] Mamitsuka, H. (2006) Selecting Features in Microarray Classification Using ROC Curves. Pattern Recognition, 39, 2393-2404.
[3] Soucy, P. and Mineau, G.W. (2005) Beyond TFIDF Weighting for Text Categorization in the Vector Space Model. Morgan Kaufmann, San Francisco, 1130-1135.
[4] Blansche, A., Gancarski, P. and Korczak, J.J. (2006) A Modular Approach for Clustering with Local Attribute Weighting. Pattern Recognition Letters, 27, 1299-1306.
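[5] Dunning, T.E. (1993) Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19, 61-74.
[6] Zhou, Q. and Zhao, M.S. (2004) Research on Feature Selection in Chinese Text Classification. Journal of Chinese Information Processing, 3, 17-23. (in Chinese)
[7] Fan, X.H. and Sun, M.S. (2006) A High-Performance Two-Class Chinese Word Segmentation Method. Chinese Journal of Computers, 1, 124-131. (in Chinese)
[8] Harrington, P. (2013) Machine Learning in Action (Chinese edition). Posts & Telecom Press, Beijing.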