Product detail

BUT Recognizer of GeoNames

OTRUSINA, L. SMRŽ, P. SZNAPKA, J. ŠAFÁŘ, M.

Product type

software

Abstract

We had several objectives in mind when designing the new named entity recognizer (NER). First, it should avoid the performance bottlenecks common to web-based APIs such as OpenCalais or AlchemyAPI. Second, it needs to achieve excellent precision and recall for geographical features, especially for places in Europe. Finally, the tool should perform disambiguation and normalization alongside the recognition process. To meet the second objective, we used data exported from the GeoNames.org database, which contains over 10 million geographical features. Efficiency comes from finite state automaton (FSA) technology, which can handle huge lists of names and searches input texts very quickly. We employed an efficient algorithm for constructing the minimal FSA described in Daciuk et al. (1998). A freely available package provided by the first author of that paper builds a minimal FSA from a list of pre-defined keywords. The resulting representation of all the relevant data from GeoNames (originally over 1.1 GB) takes only 71 MB, and processing is very fast.
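The lookup idea described in the abstract can be illustrated with a minimal sketch. It is not the tool's code: it uses a plain word-level trie (a deterministic automaton that is not minimized) instead of the minimal FSA of Daciuk et al., and it does no disambiguation or normalization. The names Node, build_trie and find_names are hypothetical, and the name list is a toy stand-in for the GeoNames export.

```python
from typing import Dict, Iterable, List, Tuple

PUNCTUATION = ".,;:!?()\"'"


class Node:
    """One automaton state; edges are labelled with lower-cased words."""

    __slots__ = ("edges", "final")

    def __init__(self) -> None:
        self.edges: Dict[str, "Node"] = {}
        self.final = False  # True if a complete name ends in this state


def build_trie(names: Iterable[str]) -> Node:
    """Build a word-level trie from a list of place names (assumed input)."""
    root = Node()
    for name in names:
        node = root
        for word in name.lower().split():
            node = node.edges.setdefault(word, Node())
        node.final = True
    return root


def find_names(text: str, root: Node) -> List[Tuple[int, str]]:
    """Scan the text left to right and return (token index, matched name),
    always preferring the longest name that starts at a given token."""
    tokens = [t.strip(PUNCTUATION) for t in text.split()]
    matches: List[Tuple[int, str]] = []
    i = 0
    while i < len(tokens):
        node = root
        end = -1  # index just past the longest match found so far
        j = i
        while j < len(tokens):
            nxt = node.edges.get(tokens[j].lower())
            if nxt is None:
                break
            node = nxt
            j += 1
            if node.final:
                end = j
        if end != -1:
            matches.append((i, " ".join(tokens[i:end])))
            i = end  # continue after the matched name
        else:
            i += 1
    return matches


if __name__ == "__main__":
    gazetteer = build_trie(["Brno", "New York", "New York City", "York"])
    print(find_names("She flew from New York City to Brno, then on to York.", gazetteer))
```

Each token is followed through the automaton at most once per starting position, so lookup cost depends on the length of the text and of the names, not on the size of the name list. A minimal FSA additionally merges equivalent states, which is what allows a list of millions of names to fit in a compact representation.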

Keywords

named entity recognition, geonames.org, finite state automaton

Date of creation

15 December 2011

Location

www.fit.vutbr.cz/~iotrusina/BURGeoN-0.1.tar.gz

Terms of use

Use of the result by another entity always requires acquiring a licence

Licence fee

The licensor does not require a licence fee for the result
