Improved TF-IDF Keyword Extraction Algorithm
Abstract: Building on the TF-IDF algorithm, this paper proposes a new keyword extraction method based on word frequency statistics. Using paragraph annotation, the method assigns position weights to words according to where they appear in the text. Among the high-frequency words in the segmentation result, it computes the similarity between words that share a part of speech and merges those with high similarity. Finally, it ranks the candidates by their term frequency-inverse word frequency (TF-IWF) weights to obtain the keywords. This addresses a weakness of traditional Chinese keyword extraction methods, which pay little attention to highly similar words and therefore extract keywords with low accuracy. Experimental results show that the improved algorithm achieves higher precision and recall than the original TF-IDF algorithm, and that the extracted keyword set better reflects the content of the text.
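The pipeline outlined in the abstract (position weighting, merging of similar same-part-of-speech words, TF-IWF ranking) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the position weight values, the similarity threshold, the `top_n` cutoff, the add-one smoothing, and all function names are hypothetical, and the word similarity measure is left as a caller-supplied function because the abstract does not specify one. IWF is taken here in its common form, the logarithm of the total corpus word count divided by the word's corpus count.

```python
import math
from collections import defaultdict

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 3.0, "first": 2.0, "last": 1.5, "body": 1.0}


def weighted_tf(tagged_tokens):
    """Sum position weights per word.

    tagged_tokens: iterable of (word, pos_tag, position_label) tuples,
    e.g. the output of a segmenter plus a paragraph annotation step.
    """
    tf = defaultdict(float)
    pos_of = {}
    for word, pos, where in tagged_tokens:
        tf[word] += POSITION_WEIGHTS.get(where, 1.0)
        pos_of.setdefault(word, pos)
    return dict(tf), pos_of


def merge_similar(tf, pos_of, similarity, threshold=0.8, top_n=20):
    """Fold each high-frequency word into a more frequent word that has
    the same part of speech and similarity >= threshold (assumed rule)."""
    candidates = sorted(tf, key=lambda w: (-tf[w], w))[:top_n]
    merged = dict(tf)
    kept = []
    for w in candidates:
        target = next(
            (k for k in kept
             if pos_of[k] == pos_of[w] and similarity(k, w) >= threshold),
            None,
        )
        if target is None:
            kept.append(w)
        else:
            # The merged weight is credited to the more frequent word.
            merged[target] += merged.pop(w)
    return merged


def tf_iwf(tf, corpus_counts):
    """Score = weighted TF * log(total corpus words / (word count + 1)).
    The +1 smoothing for unseen words is an assumption of this sketch."""
    total = sum(corpus_counts.values())
    return {
        w: f * math.log(total / (corpus_counts.get(w, 0) + 1))
        for w, f in tf.items()
    }


def extract_keywords(tagged_tokens, corpus_counts, similarity, k=5):
    """Return the k words with the highest TF-IWF score."""
    tf, pos_of = weighted_tf(tagged_tokens)
    merged = merge_similar(tf, pos_of, similarity)
    scores = tf_iwf(merged, corpus_counts)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]
```

In this sketch a merged group is scored with the corpus count of its most frequent member; a real system might instead pool the counts of all merged words. The `similarity` argument could be backed by any lexical resource or embedding model that returns a value in [0, 1].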