Publication detail
BURGET, R.
Original Title
Layout Based Information Extraction from HTML Documents
Type
conference paper
Language
English
Original Abstract
We propose a method of information extraction from HTML documents based on modelling the visual information in the document. A page segmentation algorithm is used for detecting the document layout; subsequently, the extraction process is based on the analysis of the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and it opens new possibilities for the extraction task specification.
Keywords
page segmentation, layout analysis, information extraction
Authors
RIV year
2007
Released
23. 9. 2007
Publisher
IEEE Computer Society
Location
Curitiba
ISBN
0-7695-2822-8
Book
9th International Conference on Document Analysis and Recognition ICDAR 2007
Pages from
624
Pages to
629
Pages count
6
BibTeX
@inproceedings{BUT28821,
  author="Radek {Burget}",
  title="Layout Based Information Extraction from HTML Documents",
  booktitle="9th International Conference on Document Analysis and Recognition ICDAR 2007",
  year="2007",
  pages="624--629",
  publisher="IEEE Computer Society",
  address="Curitiba",
  isbn="0-7695-2822-8"
}