
Article citation details

Luhn, H.P. (1957) A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development, 1, 309-317.


  • Title: An Approach for Algorithm of Tobacco Enterprise Archives Text Automatic Classification Based on KNN

    Authors: 黄世反, 沈勇, 康洪炜, 王道红, 郑见琳, 郎波, 王冬, 贾丛丛

    Keywords: TFIDF, KNN, Archives of Tobacco, Automatic Text Categorization, Storage Life

    Journal: Computer Science and Application, Vol. 4 No. 9, 2014-09-24

    Abstract: Based on an analysis of the historical archive text data of a cigarette factory in Yunnan province, and taking practical conditions into account, this paper presents a detailed design of algorithms for obtaining subject headings from archive texts and for classifying those texts automatically. The TFIDF algorithm is introduced into subject-heading acquisition, so that subject headings can still be obtained automatically when an archive text lacks a title, document number, or responsible-party entry. The KNN (k-nearest neighbor) algorithm is introduced into automatic classification, which handles archive texts that cannot be classified by title and document number. The paper also considers classifying archive texts by storage life. The experimental results show that the algorithm clearly improves the efficiency of classifying archive texts in the tobacco enterprise.
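The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenized toy documents, the category labels, the smoothed IDF formula, the use of cosine similarity, and the function names (`tfidf_vectors`, `top_terms`, `knn_classify`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return ({term: weight} per tokenized document, idf table)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothing: +1 so terms found in every document keep some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def top_terms(vector, k=3):
    """Pick the k highest-weighted terms as candidate subject headings."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query_tokens, train_vectors, train_labels, idf, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    tf = Counter(query_tokens)
    # Terms unseen during training have no IDF value and are ignored.
    q = {t: (c / len(query_tokens)) * idf[t] for t, c in tf.items() if t in idf}
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: cosine(q, train_vectors[i]),
                    reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy archive texts (already tokenized) and category labels.
train_docs = [
    ["annual", "production", "report", "cigarette", "output"],
    ["production", "plan", "workshop", "output"],
    ["staff", "training", "notice", "schedule"],
    ["staff", "salary", "adjustment", "notice"],
]
train_labels = ["production", "production", "personnel", "personnel"]

vectors, idf = tfidf_vectors(train_docs)
headings = top_terms(vectors[2], k=2)
category = knn_classify(["workshop", "production", "output", "summary"],
                        vectors, train_labels, idf, k=3)
```

Here the same TF-IDF weights serve both purposes: the top-weighted terms stand in for subject headings when title and document-number fields are missing, and the weighted vectors feed the nearest-neighbor vote. Classifying by storage life would, under the same assumptions, only require swapping `train_labels` for retention-period labels.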