Publication detail

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

ZULUAGA-GOMEZ, J.; VESELÝ, K.; SZŐKE, I.; BLATT, A.; MOTLÍČEK, P.; KOCOUR, M.; RIGAULT, M.; CHOUKRI, K.; PRASAD, A.; SARFJOO, S.; NIGMATULINA, I.; CEVENINI, C.; KOLČÁREK, P.; TART, A.; ČERNOCKÝ, J.; KLAKOW, D.

Original Title

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Type

journal article - other

Language

English

Original Abstract

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-high-frequency radio channels. In order to incorporate these novel technologies into ATC, large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). However, ATC is considered a low-resource domain. In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. In addition, we also open-source a GitHub repository that contains data preparation and training scripts useful to replicate our baselines related to ASR and NLU. The ATCO2 corpus covers 1) audio and radar data collection and pre-processing, 2) pseudo-transcriptions of speech audio, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets: (i) The ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold transcriptions for named-entity recognition (callsign, command, value) and speaker role detection. (ii) The ATCO2-test-set-1h corpus is a one-hour open-sourced subset from the 4h test set (free to download at https://www.atco2.org/data). (iii) The ATCO2-PL-set corpus consists of 5,281 hours of pseudo-transcribed ATC speech enriched with contextual information (list of relevant n-gram sequences per utterance), speaker turn information, signal-to-noise ratio estimate and English language detection score per sample.
The whole ATCO2 corpus is publicly distributed through the ELDA catalog (https://catalog.elra.info/en-us/repository/browse/ELRA-S0484/). We expect the corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

Keywords

Automatic Speech Recognition, Spoken Language Understanding, Natural Language Processing, Air Traffic Control Communications

Authors

ZULUAGA-GOMEZ, J.; VESELÝ, K.; SZŐKE, I.; BLATT, A.; MOTLÍČEK, P.; KOCOUR, M.; RIGAULT, M.; CHOUKRI, K.; PRASAD, A.; SARFJOO, S.; NIGMATULINA, I.; CEVENINI, C.; KOLČÁREK, P.; TART, A.; ČERNOCKÝ, J.; KLAKOW, D.

ISSN

1533-7928

Periodical

Journal of Machine Learning Research

Volume

2

Number

1

State

United States of America

Pages from

1

Pages to

45

Pages count

45

URL

https://openreview.net/forum?id=3CiWvVQVfw

BibTeX

@article{BUT194022,
  author="ZULUAGA-GOMEZ, J. and VESELÝ, K. and SZŐKE, I. and BLATT, A. and MOTLÍČEK, P. and KOCOUR, M. and RIGAULT, M. and CHOUKRI, K. and PRASAD, A. and SARFJOO, S. and NIGMATULINA, I. and CEVENINI, C. and KOLČÁREK, P. and TART, A. and ČERNOCKÝ, J. and KLAKOW, D.",
  title="ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications",
  journal="Journal of Machine Learning Research",
  volume="2",
  number="1",
  pages="1--45",
  issn="1533-7928",
  url="https://openreview.net/forum?id=3CiWvVQVfw"
}