Applied result detail

BUT Recognizer of GeoNames

OTRUSINA, L.; SMRŽ, P.; SZNAPKA, J.; ŠAFÁŘ, M.

Original title

BUT Recognizer of GeoNames

English title

BUT Recognizer of GeoNames

Type

Software

Abstract

We had several objectives in mind when designing the new named entity recognizer (NER). First, it should avoid the performance bottlenecks common to web-based APIs such as OpenCalais or AlchemyAPI. Second, it needs to achieve excellent precision and recall for geographical features, especially for places in Europe. Finally, the tool should perform disambiguation and normalization alongside the recognition process.
To meet the second objective, we used data exported from the Geonames.org database, which contains over 10 million geographical features. Efficiency comes from finite state automaton (FSA) technology, which can handle huge lists of names and searches input texts very quickly. We employed an efficient algorithm for constructing the minimal FSA described in Daciuk et al. (1998). A freely available package provided by the first author of that paper builds a minimal FSA from a list of predefined keywords. The resulting representation of all the relevant GeoNames data (originally over 1.1 GB) takes only 71 MB, and processing is extremely fast.
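
The idea of compiling a sorted name list into a minimal acyclic automaton and then scanning text with it can be illustrated by a minimal sketch. It is not the BURGeoN code and not the package mentioned above; it is a simplified incremental construction for sorted input, and the names Node, build_minimal_fsa, find_names and count_states, as well as the tiny name list, are hypothetical and chosen only for illustration. Matching here is plain longest-match with no word-boundary checks, disambiguation or normalization.

```python
class Node:
    """One automaton state: labelled outgoing edges plus a final flag."""

    def __init__(self):
        self.edges = {}
        self.final = False

    def signature(self):
        # States are equivalent when they agree on finality and point to
        # the same (already canonical) successor states.
        return (self.final,
                tuple(sorted((ch, id(n)) for ch, n in self.edges.items())))


def build_minimal_fsa(names):
    """Build a minimal acyclic automaton; names are sorted internally."""
    root = Node()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, char, child) along the previous name's path
    previous = ""

    def minimize(down_to):
        # Replace states below the shared prefix by registered equivalents.
        while len(unchecked) > down_to:
            parent, ch, child = unchecked.pop()
            parent.edges[ch] = register.setdefault(child.signature(), child)

    for name in sorted(set(names)):
        common = 0
        limit = min(len(name), len(previous))
        while common < limit and name[common] == previous[common]:
            common += 1
        minimize(common)
        node = unchecked[-1][2] if unchecked else root
        for ch in name[common:]:
            child = Node()
            node.edges[ch] = child
            unchecked.append((node, ch, child))
            node = child
        node.final = True
        previous = name
    minimize(0)
    return root


def find_names(root, text):
    """Return (start, end, surface) for the longest match at each position."""
    matches, i = [], 0
    while i < len(text):
        node, end = root, None
        for j in range(i, len(text)):
            node = node.edges.get(text[j])
            if node is None:
                break
            if node.final:
                end = j + 1
        if end is None:
            i += 1
        else:
            matches.append((i, end, text[i:end]))
            i = end
    return matches


def count_states(root):
    """Count distinct states reachable from the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


# Hypothetical mini-gazetteer standing in for the GeoNames export.
names = ["Brno", "Praha", "Praha 1", "Nové Město", "Staré Město"]
fsa = build_minimal_fsa(names)
hits = find_names(fsa, "A train from Brno to Praha 1 via Staré Město")
print([surface for _, _, surface in hits])
print(count_states(fsa))
```

Because shared suffixes such as "é Město" are stored once, the automaton needs fewer states than a plain prefix tree over the same names, which is the effect that makes a compact representation of a very large gazetteer plausible.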

Abstract in English

We had several objectives in mind when designing the new named entity recognizer (NER). First, it should avoid the performance bottlenecks common to web-based APIs such as OpenCalais or AlchemyAPI. Second, it needs to achieve excellent precision and recall for geographical features, especially for places in Europe. Finally, the tool should perform disambiguation and normalization alongside the recognition process.
To meet the second objective, we used data exported from the Geonames.org database, which contains over 10 million geographical features. Efficiency comes from finite state automaton (FSA) technology, which can handle huge lists of names and searches input texts very quickly. We employed an efficient algorithm for constructing the minimal FSA described in Daciuk et al. (1998). A freely available package provided by the first author of that paper builds a minimal FSA from a list of predefined keywords. The resulting representation of all the relevant GeoNames data (originally over 1.1 GB) takes only 71 MB, and processing is extremely fast.

Keywords

named entity recognition, geonames.org, finite state automaton

Keywords in English

named entity recognition, geonames.org, finite state automaton

Location

www.fit.vutbr.cz/~iotrusina/BURGeoN-0.1.tar.gz

License fee

A license must always be obtained before another entity may use the result
