Title:
An Improved CHI Text Feature Selection Method Based on Location and Word Frequency Information
Authors:
宋阿羚, 刘海峰, 刘守生
Keywords:
Feature Selection, Chi-Square Statistic, Relevance, Location Distribution, Class Skew
Journal:
《Computer Science and Application》, Vol.5 No.9, 2015-10-21
Abstract:
Feature selection is a core technique in automatic text classification. To address the shortcomings of the classical CHI model, we first prune the feature set according to whether each feature is positively or negatively correlated with a category. We then adjust the feature weights for classification under class skew. Next, taking term frequency as the basis, we refine the model step by step using the position of a feature within a text and its intra-class and inter-class distributions, which yields an optimized CHI feature selection method. Text classification experiments confirm the effectiveness of the method.
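For readers unfamiliar with the classical CHI model that the abstract takes as its starting point, the following is a minimal sketch of the standard chi-square score for a term and a category, computed from a 2x2 document-count table. It also returns the sign of the correlation, which is the quantity a positive/negative correlation filter would inspect. The function names `chi_square` and `correlation_sign` are hypothetical, and the sketch covers only the classical statistic, not the authors' position, term-frequency, or class-skew adjustments, whose formulas the abstract does not give.

```python
def chi_square(a, b, c, d):
    """Classical CHI score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def correlation_sign(a, b, c, d):
    """Return 1 for positive, -1 for negative, 0 for no correlation.

    Illustrative helper (an assumption, not taken from the paper): the
    classical score squares (a*d - c*b), so it cannot distinguish a term
    that indicates a category from one that indicates its absence.
    """
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


if __name__ == "__main__":
    # The two tables get the same CHI score but opposite correlation signs.
    print(chi_square(40, 10, 10, 40), correlation_sign(40, 10, 10, 40))
    print(chi_square(10, 40, 40, 10), correlation_sign(10, 40, 40, 10))
```

The two sample tables show why a sign check is a natural first refinement: the squared term gives both the same score even though the term favors the category in one case and disfavors it in the other.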