Publication detail

Visual HTML Document Modeling for Information Extraction

BURGET, R.

Original Title

Visual HTML Document Modeling for Information Extraction

Type

article in a collection not indexed in WoS or Scopus

Language

English

Original Abstract

Current methods for information extraction from HTML documents are mostly based on wrappers that read the HTML code and identify the data to be extracted by properties of the surrounding HTML tags and text. The bottleneck of this approach is the overly tight binding of the wrapper to the HTML code. HTML allows the desired document design to be achieved in various ways that can be arbitrarily combined, which limits wrappers to a narrow set of documents and a short time period. By contrast, there exist some generally accepted rules for visual data presentation in documents. Our approach uses this visual information to identify the data in the documents. We define formal models of the visual information and propose a method for information extraction based on unordered tree matching algorithms.

Keywords

HTML, Information Extraction, Document Modeling, Logical Document Structure, Visual Information

Authors

BURGET, R.

Released

14. 9. 2005

Publisher

Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava

Location

Ostrava

ISBN

80-248-0864-1

Book

RAWS 2005

Pages from

17

Pages to

24

Page count

8

BibTeX

@inproceedings{BUT18057,
  author="Radek {Burget}",
  title="Visual HTML Document Modeling for Information Extraction",
  booktitle="RAWS 2005",
  year="2005",
  pages="17--24",
  publisher="Faculty of Electrical Engineering and Computer Science, VSB-TU Ostrava",
  address="Ostrava",
  isbn="80-248-0864-1"
}