Domain Web Pages Discovery Based on Ranking Mechanism
DOI: 10.12677/HJDM.2022.124031
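Authors: Antao Wang (王安涛), Zhengyu Li (李征宇), Gui Li (李贵): School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning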
Keywords: Focused Crawling, Page Rank, Domain Web Pages Discovery
Abstract: The ability to discover domain-specific Web pages is critical to many Web data integration applications. Existing focused crawling strategies remain effective, and many focused crawlers have been built on them. However, configuring a focused crawler is difficult and time-consuming. This paper therefore proposes a domain Web page discovery algorithm based on a ranking mechanism. Building on existing focused crawling strategies, the algorithm starts from a given set of sample Web pages and uses a ranking-based method to systematically combine multiple Web page discovery strategies, iteratively discovering and extracting new Web pages in the domain. Experiments show that the method achieves high Web page accuracy, which supports its effectiveness.
Cite this article: Wang, A., Li, Z. and Li, G. (2022) Domain Web Pages Discovery Based on Ranking Mechanism. Hans Journal of Data Mining (数据挖掘), 12(4), 320-333. https://doi.org/10.12677/HJDM.2022.124031
