Title:
A Simple Implementation of an Efficient Naive Bayes Web News Text Classification Model
Authors:
吴致晖, 刘洪伟, 陈丽
Keywords:
Text Classification; Feature Selection; Naive Bayes; TF-IDF Standard
Journal:
Statistics and Application, Vol. 3, No. 1, 2014-03-28
Abstract:
Naive Bayes assumes that features occur independently of one another and that every feature is equally important. When it is used as the text classification algorithm, these assumptions make the choice of an effective feature selection method especially important. This paper uses the TF-IDF standard of the jieba Chinese word segmentation module [1] to select features from the training news texts and implements a text classification model based on Naive Bayes. The news texts to be classified are processed with the same TF-IDF standard to extract their keywords before the classification test. The experimental results show that the method achieves high classification efficiency.
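The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the multinomial formulation, the Laplace (add-one) smoothing, the class name `KeywordNaiveBayes`, the helper `extract_keywords`, and the toy keyword lists are all assumptions made for illustration. The only external call is `jieba.analyse.extract_tags`, jieba's TF-IDF keyword extractor; it is imported lazily so that the classifier itself runs on pre-extracted keyword lists without jieba installed.

```python
import math
from collections import Counter, defaultdict


def extract_keywords(text, top_k=20):
    """Return the top_k TF-IDF keywords of a Chinese text via jieba (hypothetical helper)."""
    import jieba.analyse  # third-party; imported lazily so the classifier runs without it
    return jieba.analyse.extract_tags(text, topK=top_k)


class KeywordNaiveBayes:
    """Multinomial Naive Bayes over keyword lists (assumed variant, add-one smoothing)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # keyword counts per class
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total = len(labels)
        return self

    def predict(self, words):
        best_label, best_score = None, -math.inf
        v = len(self.vocab)
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            n_words = sum(counts.values())
            # log prior + sum of log likelihoods; features are treated as independent
            score = math.log(n_docs / self.total)
            for w in words:
                score += math.log((counts[w] + 1) / (n_words + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy keyword lists standing in for the output of extract_keywords (illustrative data only)
train_docs = [
    ["股票", "市场", "基金"],
    ["球队", "比赛", "进球"],
    ["市场", "银行", "利率"],
    ["比赛", "冠军", "球员"],
]
train_labels = ["finance", "sports", "finance", "sports"]

model = KeywordNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["市场", "利率"]))
```

With real news texts, each training and test document would first be passed through `extract_keywords`, so that the same TF-IDF selection is applied on both sides, as the abstract describes.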