Publication detail

Phoneme Recognition

SCHWARZ, P., MATĚJKA, P., ČERNOCKÝ, J.

Original title

Phoneme Recognition

Type

conference proceedings (not an article)

Language

English

Original abstract

We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings. The recognizer was evaluated on the TIMIT database. The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP) and is an HMM/Neural Network (HMM/NN) hybrid. Critical-band energies are obtained in the conventional way: the speech signal is divided into 25 ms frames with a 10 ms shift, and the Mel filter bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities. A TRAP feature vector describes a segment of the temporal evolution of the critical-band spectral densities within a single critical band. The central point is the current frame, with an equal number of frames in the past and in the future; the length of the segment can vary. This vector forms the input to a classifier whose outputs are posterior probabilities of the sub-word classes we want to distinguish. In our case, these classes are context-independent phonemes or their parts (states). One such classifier is applied in each critical band. The merger is another classifier whose function is to combine the band classifier outputs into one. Both the band classifiers and the merger are neural networks. The described techniques yield phoneme probabilities for the center frame. These phoneme probabilities are then fed into a Viterbi decoder, which produces phoneme strings. The recognizer is further simplified and optimized to shorten processing time and reduce computational requirements. This simplification and optimization reduce the phoneme error rate (PER) by about 1.8% absolute. More precise modeling was achieved by splitting phonemes into 3 parts (states), which improved the system by 0.9% absolute. Separate modeling of the left and right phoneme contexts gave 0.38% in the case of one-state models. Finer modeling of these left and right contexts with three states led to an improvement of 3.76%. Bi-gram language models are also incorporated into the system and evaluated.
All modifications together lead to a faster system with about 23.6% relative, or 6.84% absolute, improvement in phoneme error rate over the baseline. Work is in progress on porting this recognizer to the meeting data domain. The recognizer will serve as one of the front-ends for acoustic event spotting (Brno's task within AMI).
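
The TRAP feature vector described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `trap_vector` and `trap_vectors`, the `context` parameter, the toy input values, and the choice to repeat edge frames at utterance boundaries are all assumptions made for illustration. The sketch only shows how one temporal trajectory per critical band is cut out around a center frame; in the system described, each such vector would then be passed to a band classifier.

```python
def trap_vector(log_energies, band, center, context):
    """Return the temporal trajectory of one critical band around `center`.

    log_energies: list of frames, each a list of per-band log energies.
    context: number of frames taken on each side of the center frame.
    Edge frames are repeated at the utterance boundaries (an assumption,
    not something the abstract specifies).
    """
    last = len(log_energies) - 1
    return [
        log_energies[min(max(t, 0), last)][band]
        for t in range(center - context, center + context + 1)
    ]


def trap_vectors(log_energies, center, context):
    """One TRAP vector per critical band, each of length 2 * context + 1."""
    num_bands = len(log_energies[0])
    return [
        trap_vector(log_energies, band, center, context)
        for band in range(num_bands)
    ]


# Toy input: 5 frames x 2 bands (values are made up for illustration).
frames = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0], [3.0, 13.0], [4.0, 14.0]]
traps = trap_vectors(frames, center=2, context=1)
print(traps)
```

With a 10 ms frame shift, a `context` of 50 frames would span roughly one second of signal; that value is only an example, since the abstract notes that the length can vary.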

Keywords

phoneme recognition, feature extraction, speech recognition

Authors

SCHWARZ, P., MATĚJKA, P., ČERNOCKÝ, J.

Published

25 June 2004

Pages from

Pages to

Number of pages

URL

http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf

BibTeX

@proceedings{BUT64157,
  editor="Petr {Schwarz} and Pavel {Matějka} and Jan {Černocký}",
  title="Phoneme Recognition",
  year="2004",
  pages="1",
  url="http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf"
}
