Research on a Weighted Decision Genetic Algorithm in Data Acquisition
Abstract:
With the rapid growth of the Internet, quickly and accurately retrieving the information users need from massive web resources has become a key problem. General-purpose search engines serve queries by crawling and indexing web pages, but retrieval based on keyword matching often fails to identify and match the user's real query intent. Vertical search engines narrow the crawl scope to provide specialized, customized retrieval for users in a specific field or context, and are a current focus of search research. The topic crawler is the page collection module of a vertical search engine and keeps only topic-related pages on its search path. This paper examines the page analysis methods and search strategies of topic crawlers and discusses how to improve crawler performance metrics. In previous studies, crawling strategies that combine link structure evaluation with page content evaluation have achieved good results. However, these methods generally treat link evaluation as a single-objective problem, which adapts poorly to the diversity of web pages; their global search ability is also weak, so they easily fall into local optima. Based on this analysis, this paper proposes a topic crawling strategy based on a weighted decision genetic algorithm. Building on existing genetic algorithm crawling strategies, it introduces an improved TrustRank algorithm to strengthen resistance to web spam and to compute page importance, uses several kinds of page content information to judge a page's relevance to the topic, and, through the choice of genetic factors and the design of the fitness function, assigns weights to these two indicators to judge the value of each page awaiting download. While retaining the genetic algorithm's improvement in overall search performance, the strategy increases the importance and topic relevance of the crawled pages. Compared with existing genetic algorithms, the search strategy of the weighted decision genetic algorithm improves the precision and recall of the topic crawler to some extent, expands the crawler's search range, and better meets users' topic retrieval needs.
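The weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's actual fitness function: the `Candidate` structure, the weight `w_importance`, the assumption that both scores are normalized to [0, 1], and the use of simple truncation selection in place of full genetic operators are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A page waiting in the crawl frontier (hypothetical structure)."""
    url: str
    importance: float  # assumed link-based score in [0, 1], e.g. a normalized TrustRank value
    relevance: float   # assumed content-based topic relevance in [0, 1]


def fitness(page: Candidate, w_importance: float = 0.4) -> float:
    """Weighted sum of the two indicators; the two weights sum to 1.

    The default weight 0.4 is an arbitrary illustrative value.
    """
    if not 0.0 <= w_importance <= 1.0:
        raise ValueError("w_importance must lie in [0, 1]")
    return w_importance * page.importance + (1.0 - w_importance) * page.relevance


def select(frontier: list[Candidate], k: int, w_importance: float = 0.4) -> list[Candidate]:
    """Keep the k candidates with the highest fitness (truncation selection)."""
    ranked = sorted(frontier, key=lambda p: fitness(p, w_importance), reverse=True)
    return ranked[:k]
```

Under these assumptions, raising `w_importance` favors pages with strong link-based scores, while lowering it favors pages whose content matches the topic; a full genetic algorithm would add crossover and mutation on top of this selection step.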