Publication detail

Extracting Visually Presented Element Relationships from Web Documents

BURGET, R.; SMRŽ, P.

Original Title

Extracting Visually Presented Element Relationships from Web Documents

Type

journal article in Scopus

Language

English

Original Abstract

Many documents in the World Wide Web present structured information that consists of multiple pieces of data with certain relationships among them. Although it is usually not difficult to identify the individual data values in the document text, their relationships are often not explicitly described in the document content. They are expressed by visual presentation of the document content that is expected to be interpreted by a human reader. In this paper, we propose a formal generic model of logical relationships in a document based on an interpretation of visual presentation patterns in the documents. The model describes the visually expressed relationships between individual parts of the contents independently of the document format and the particular way of presentation. Therefore, it can be used as an appropriate document model in many information retrieval or extraction applications. We formally define the model, we introduce a method of extracting the relationships between the content parts based on the visual presentation analysis and we discuss the expected applications. We also present a new dataset consisting of programmes of conferences and other scientific events and we discuss its suitability for the task in hand. Finally, we use the dataset to evaluate results of the implemented system.

Keywords

logical document structure; page segmentation; document analysis; web documents

Authors

BURGET, R.; SMRŽ, P.

RIV year

2013

Released

31. 12. 2013

ISSN

1557-3958

Periodical

International Journal of Cognitive Informatics and Natural Intelligence

Volume

2013

Number

2

Country

United States of America

Pages from

13

Pages to

29

Pages count

17

BibTeX

@article{BUT105971,
  author="Radek {Burget} and Pavel {Smrž}",
  title="Extracting Visually Presented Element Relationships from Web Documents",
  journal="International Journal of Cognitive Informatics and Natural Intelligence",
  year="2013",
  volume="2013",
  number="2",
  pages="13--29",
  doi="10.4018/ijcini.2013040102",
  issn="1557-3958",
  url="https://www.fit.vut.cz/research/publication/10468/"
}
