Publication detail
Information Extraction from HTML Documents Based on Logical Document Structure
BURGET, R.
Original Title
Information Extraction from HTML Documents Based on Logical Document Structure
Type
dissertation
Language
English
Original Abstract
The World Wide Web presents the largest Internet source of information from a broad range of areas. The web documents are mostly written in the Hypertext Markup Language (HTML) that doesn't contain any means for semantic description of the content and thus the contained information cannot be processed directly. Current approaches for the information extraction from HTML are mostly based on wrappers that identify the desired data in the document according to some previously specified properties of the HTML code. The wrappers are limited to a narrow set of documents and they are very sensitive to any changes in the document formatting. In this thesis, we propose a novel approach to information extraction that is based on modeling the visual appearance of the document. We show that there exist some general rules for the visual presentation of the data in documents and we define formal models of the visual information contained in a document. Furthermore, we propose the way of modeling the logical structure of an HTML document based on the visual information. Finally, we propose methods for using the logical structure model for the information extraction task based on tree matching algorithms. The advantage of this approach is certain independence on the underlying HTML code and better resistance to changes in the documents.
Keywords
Information Extraction, WWW, HTML, Logical Document Structure, Visual Information, Document Modeling
Authors
BURGET, R.
Released
6. 10. 2004
Publisher
Faculty of Information Technology BUT
Location
Brno
Pages count
85
URL
BibTeX
@phdthesis{BUT66727,
author="Radek {Burget}",
title="Information Extraction from HTML Documents Based on Logical Document Structure",
publisher="Faculty of Information Technology BUT",
address="Brno",
pages="85",
year="2004",
url="https://www.fit.vut.cz/research/publication/7607/"
}