Research and Implementation of Text Information Extraction Based on WEB
DOI: 10.12677/HJDM.2015.54010
Author: Liu Sanxing (刘三星), Zhaoqing Industry and Trade School (肇庆市工业贸易学校), Zhaoqing, Guangdong
Keywords: Internet, Information Extraction, HTML, XML, Text Information Extraction
Abstract: Building on traditional information extraction theory and methods, this paper implements a web page text extraction method based on XML tag features. After studying the characteristics of typical web pages, the method first standardizes each HTML page and converts it to XML. Following XML conventions, the page's character encoding is then converted from GB to UTF and normalized. Finally, drawing on the properties of XML tags, the text of the page is extracted tag by tag.
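The pipeline summarized in the abstract (standardize HTML into XML, convert GB to UTF, extract text by tag) can be illustrated with a minimal Python sketch. This is not the author's implementation. It assumes GB18030 as the source encoding and UTF-8 as the target, uses only the Python standard library, and the names `XmlTreeBuilder`, `html_to_xml`, `extract_text`, and `visible_text` are hypothetical. It closes unbalanced tags but does not apply HTML's implicit-closing rules (for example an unclosed `<p>`) or handle namespaced tags, so it is not a full HTML tidy step.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content is not visible page text.
SKIP_TAGS = {"script", "style"}


class XmlTreeBuilder(HTMLParser):
    """Builds a well-formed element tree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root (assumption)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: value or "" for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open tag;
        # stray end tags with no match are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(raw: bytes, source_encoding: str = "gb18030") -> ET.Element:
    """Decode GB-encoded HTML bytes and return a well-formed XML tree."""
    builder = XmlTreeBuilder()
    builder.feed(raw.decode(source_encoding, errors="replace"))
    builder.close()
    return builder.root


def visible_text(elem):
    """Yield text fragments under elem, skipping script and style content."""
    if elem.tag in SKIP_TAGS:
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from visible_text(child)
        if child.tail:
            yield child.tail


def extract_text(root: ET.Element, tags=("p",)):
    """Return the whitespace-normalized text of every element named in tags."""
    results = []
    for elem in root.iter():
        if elem.tag in tags:
            text = " ".join("".join(visible_text(elem)).split())
            if text:
                results.append(text)
    return results
```

Under these assumptions, `ET.tostring(html_to_xml(raw), encoding="utf-8")` would give the UTF-8 XML form of a page, and `extract_text(root, ("title", "p"))` would collect the title and paragraph text. Which tags carry the body text depends on the site, so the `tags` argument is left to the caller.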
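Cite this article: Liu Sanxing. Research and Implementation of Text Information Extraction Based on WEB [J]. 数据挖掘 (Hans Journal of Data Mining), 2015, 5(4): 69-74. http://dx.doi.org/10.12677/HJDM.2015.54010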