Research on a KNN-Based Automatic Text Classification Algorithm for Tobacco Enterprise Archives
An Approach for Algorithm of Tobacco Enterprise Archives Text Automatic Classification Based on KNN
DOI: 10.12677/CSA.2014.49029
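Authors: 黄世反, 沈勇, 康洪炜, 郑见琳, 郎波, 王冬, 贾丛丛: School of Software, Yunnan University, Kunming; 王道红: Science and Technology Settlement Center, Yunnan Rural Credit Cooperatives, Kunming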
Keywords: TFIDF; KNN; tobacco archives; automatic text classification; retention period
Abstract: Based on an analysis of historical archive text data from a cigarette factory in Yunnan province, and taking practical conditions into account, this paper presents a detailed design of two algorithms: one for extracting subject terms from archive texts and one for classifying those texts automatically. The subject-term extraction algorithm incorporates TFIDF, which solves the problem that subject terms cannot be obtained automatically when an archive text lacks the title, document number, and responsible-party fields. The classification algorithm incorporates the k-nearest-neighbor (KNN) algorithm, which solves the problem that archive texts cannot be classified automatically when classification by title and document number is not possible. Classification of archive texts by retention period is also considered. The experimental results show that the algorithm markedly improves the efficiency of classifying archive texts in tobacco enterprises.
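As a rough illustration of the two techniques the abstract names, the following Python sketch weights terms with TF-IDF, takes the highest-weighted terms as candidate subject terms, and classifies a new text by majority vote among its nearest neighbors under cosine similarity. It is not the authors' implementation: the smoothed IDF formula, the cosine measure, the pre-tokenized English toy documents, the category labels `procurement` and `finance`, and every function name are assumptions made for this example.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed variant: log(n / df) + 1, so terms found in every document keep a nonzero weight.
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Map each known term of a tokenized document to its TF-IDF weight."""
    tf = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in tf.items() if t in idf}


def top_terms(vector, k):
    """The k highest-weighted terms, used here as candidate subject terms."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, vectors, labels, k=3):
    """Majority label among the k training vectors most similar to the query."""
    nearest = sorted(range(len(vectors)), key=lambda i: -cosine(query, vectors[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Hypothetical, already-tokenized training texts and their category labels.
train_docs = [
    ["leaf", "purchase", "contract", "leaf"],
    ["leaf", "supplier", "purchase"],
    ["salary", "budget", "audit"],
    ["budget", "audit", "report", "budget"],
]
train_labels = ["procurement", "procurement", "finance", "finance"]

idf = idf_table(train_docs)
train_vectors = [tfidf(doc, idf) for doc in train_docs]
query = tfidf(["audit", "budget", "salary"], idf)

print(top_terms(train_vectors[0], 2))                           # ['leaf', 'contract']
print(knn_classify(query, train_vectors, train_labels, k=3))    # finance
```

Real Chinese archive texts would first need word segmentation, and the choice of `k` and of the similarity measure would have to be tuned on labeled archive data.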
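Citation: 黄世反, 沈勇, 康洪炜, 王道红, 郑见琳, 郎波, 王冬, 贾丛丛. 基于KNN的烟草企业档案文本自动分类算法研究 (An Approach for Algorithm of Tobacco Enterprise Archives Text Automatic Classification Based on KNN) [J]. Computer Science and Application, 2014, 4(9): 204-216. http://dx.doi.org/10.12677/CSA.2014.49029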
