An Improved CHI Text Feature Selection Method Based on the Location and Word Frequency Information
DOI: 10.12677/CSA.2015.59040. Supported by the National Natural Science Foundation of China.
Authors: Song Aling, Liu Haifeng, Liu Shousheng: College of Science, PLA University of Science and Technology, Nanjing, Jiangsu
Keywords: Feature Selection; Chi-Square (χ²) Statistic; Relevance; Location; Distribution; Class Skew
Abstract: Feature selection is a core technique in automatic text classification. To address the shortcomings of the classical CHI model, we first prune the feature set according to whether each feature is positively or negatively correlated with a category. We then adjust the feature weights for classification under class skew. Next, taking word frequency as the basis, we refine the model step by step using the position of each feature within the text and its within-class and between-class distributions, which yields an optimized CHI feature selection method. Text classification experiments confirm the effectiveness of the method.
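Cite this article: Song Aling, Liu Haifeng, Liu Shousheng. An Improved CHI Text Feature Selection Method Based on the Location and Word Frequency Information [J]. Computer Science and Application, 2015, 5(9): 322-330. http://dx.doi.org/10.12677/CSA.2015.59040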
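
For orientation, the classical CHI model named in the abstract scores a term against a category from a 2x2 document-count table. The sketch below is the standard textbook χ² statistic, not the optimized formula proposed in the paper, which this page does not reproduce. The function names and the sign test on `a*d - b*c` are illustrative assumptions about how the positive and negative correlation step could be read.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def is_positively_correlated(a: int, b: int, c: int, d: int) -> bool:
    """Assumed reading of the correlation test: the sign of a*d - b*c.

    The squared term in chi_square hides this sign, so a term that is
    typical of a category and a term that is typical of its absence
    can receive the same score.
    """
    return a * d - b * c > 0


if __name__ == "__main__":
    # Hypothetical counts: the term appears mostly inside the category.
    print(chi_square(40, 10, 10, 40), is_positively_correlated(40, 10, 10, 40))
    # Mirrored counts: same score, but the term signals absence of the category.
    print(chi_square(10, 40, 40, 10), is_positively_correlated(10, 40, 40, 10))
```

Both hypothetical tables receive the same χ² score, which illustrates why a selection step based on the classical statistic alone cannot distinguish positively from negatively correlated features.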
