A Web Page Denoising Method Based on Templates and SVM Working in Cooperation

doi:10.12677/CSA.2020.101007


A Web Page Cleaning Method Based on Template and SVM

DOI: 10.12677/CSA.2020.101007
Authors: Jincheng Yan (严金承), Yunfeng Wang (王运锋)*: College of Computer Science, Sichuan University, Chengdu, Sichuan
Keywords: Web Page Cleaning; Template; SVM

Abstract: This paper proposes a web page denoising method in which templates and a support vector machine (SVM) work together. The method divides web page noise into two classes: common noise and personalized noise. First, a template library is built from a collection of web pages, and the templates are used to remove the common noise. For the personalized noise that remains, features are computed for each block-level tag and used to train an SVM model; the trained model then classifies block-level tags as either noise or main text, which completes the denoising. The method effectively removes copyright notices, navigation, advertisements, and other noise from topic-type web pages. Compared with web page denoising that uses an SVM alone, it improves both precision and recall.
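
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` type, the exact-text template rule, the two features (link density and capped text length), the toy training data, and all function names are assumptions made for illustration. To stay dependency-free, a linear classifier trained by subgradient steps on the hinge loss stands in for the SVM, with no kernel and no regularization term.

```python
from collections import Counter
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical stand-in for one block-level tag of a page."""
    text: str        # visible text of the block
    link_chars: int  # number of those characters that sit inside links


def build_template(pages, min_ratio=0.8):
    """Stage 1 (assumed rule): block texts recurring on >= min_ratio of pages."""
    counts = Counter(text for page in pages for text in {b.text for b in page})
    return {text for text, n in counts.items() if n >= min_ratio * len(pages)}


def features(block):
    """Two illustrative features: (link density, text length capped at 200)."""
    length = len(block.text)
    link_density = block.link_chars / length if length else 0.0
    return (link_density, min(length, 200) / 200)


def train_hinge(X, y, epochs=500, lr=0.1):
    """Subgradient steps on the hinge loss; labels are +1 (text) / -1 (noise)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def denoise(page, template, model):
    """Drop template blocks first, then blocks the classifier scores as noise."""
    w, b = model
    kept = []
    for block in page:
        if block.text in template:  # common noise, removed by the template
            continue
        score = sum(wj * xj for wj, xj in zip(w, features(block))) + b
        if score > 0:               # personalized noise has score <= 0
            kept.append(block.text)
    return kept


def body(topic):
    """Toy article body with no links."""
    return Block(f"Article text about {topic}. " * 6, 0)


NAV = Block("Home | News | Contact", 21)
FOOTER = Block("Copyright 2020 Example Site", 0)
ad_text = "Buy now! Click here for deals"
AD = Block(ad_text, len(ad_text))  # an ad made entirely of link text

pages = [
    [NAV, body("SVM"), FOOTER],
    [NAV, AD, body("HTML"), FOOTER],
    [NAV, body("templates"), FOOTER],
]
template = build_template(pages)

# Hand-made, linearly separable training features (assumed, not real data).
X = [(0.0, 1.0), (0.0, 0.5), (0.2, 0.8), (1.0, 0.0), (1.0, 0.3), (0.7, 0.1)]
y = [1, 1, 1, -1, -1, -1]
model = train_hinge(X, y)

clean = denoise(pages[1], template, model)
print(clean)
```

On the second toy page, the navigation bar and footer are dropped by the template stage and the advertisement block by the classifier, so only the article text remains. A real system would likely derive templates from page structure and train a library SVM on labeled pages.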

Citation: Yan, J. and Wang, Y. A Web Page Cleaning Method Based on Template and SVM [J]. Computer Science and Application, 2020, 10(1): 51-59. https://doi.org/10.12677/CSA.2020.101007

