Software Engineering and Applications, Vol. 5 No. 1 (February 2016)

Large-Scale Web Page Classification Algorithm Based on Spectral Hashing

DOI: 10.12677/SEA.2016.51008

Author: Tian Dandan (田郸郸): College of Computer, National University of Defense Technology, Changsha, Hunan

Keywords: Web Page Classification, Large Scale, Spectral Hashing, KNN

Abstract: Online information now reaches into every aspect of daily life, but as the web has grown, information overload has become increasingly prominent, making it difficult to locate exactly the information we need. Classifying web pages can effectively improve search efficiency and help users locate the pages they want. Current web page classification algorithms can handle small collections of pages, but their efficiency on large-scale collections is unsatisfactory. Distributed web page classification methods have recently been proposed; they improve classification efficiency but do not improve the classification algorithm itself. This paper therefore proposes a method based on hashing and KNN, designed as a classification algorithm suited to large-scale web page classification.
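The general idea of pairing binary hash codes with KNN can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes pages have already been turned into numeric feature vectors, and it substitutes random sign projections for spectral hashing so that the example stays self-contained. All function names here (`make_projections`, `hash_code`, `hamming`, `knn_classify`) are hypothetical.

```python
import random
from collections import Counter


def make_projections(dim, n_bits, seed=0):
    """Draw n_bits random directions (a stand-in for learned spectral hash functions)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vector, projections):
    """Pack the sign of each projection into one integer, one bit per projection."""
    code = 0
    for row in projections:
        dot = sum(w * x for w, x in zip(row, vector))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def knn_classify(query_code, labeled_codes, k=3):
    """Majority vote among the k labeled codes closest in Hamming distance."""
    nearest = sorted(labeled_codes, key=lambda item: hamming(query_code, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: 4-dimensional feature vectors with made-up category labels.
    projections = make_projections(dim=4, n_bits=8)
    training = [
        ([1.0, 0.9, 0.0, 0.1], "sports"),
        ([0.9, 1.0, 0.1, 0.0], "sports"),
        ([0.0, 0.1, 1.0, 0.9], "news"),
        ([0.1, 0.0, 0.9, 1.0], "news"),
    ]
    labeled = [(hash_code(vec, projections), label) for vec, label in training]
    query = hash_code([0.8, 1.0, 0.0, 0.0], projections)
    print(knn_classify(query, labeled, k=3))
```

Comparing short integer codes with XOR and a bit count is much cheaper than computing distances between high-dimensional vectors, which is the usual motivation for hashing before a nearest-neighbor search on large collections.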

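Cite this article: Tian, D. (2016) Large Scale Web Page Classification Algorithm Based on Spectral Hashing. Software Engineering and Applications, 5(1), 65-74. http://dx.doi.org/10.12677/SEA.2016.51008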
