Publication detail

Phoneme Recognition from a Long Temporal Context

SCHWARZ, P., MATĚJKA, P., ČERNOCKÝ, J.

Original Title

Phoneme Recognition from a Long Temporal Context

Type

conference paper

Language

English

Original Abstract

We investigate techniques for acoustic modeling in automatic recognition of context-independent phoneme strings. The recognizer was evaluated on the TIMIT database.
The baseline phoneme recognizer is based on TempoRAl Patterns (TRAP).
It is an HMM/Neural Network (HMM/NN) hybrid.
Critical-band energies are obtained in the conventional way. The speech signal is divided into 25 ms frames with a 10 ms shift. The Mel filter bank is emulated by triangular weighting of the FFT-derived short-term spectrum to obtain short-term critical-band logarithmic spectral densities.
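The framing and triangular weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 16 kHz sampling rate matches TIMIT recordings, but the mel-scale formula, the number of bands, the number of FFT bins, and all function names are assumptions chosen for the example. The FFT itself is left out; the sketch expects a power spectrum per frame.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split samples into overlapping frames (no padding at the end)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

def hz_to_mel(f):
    # A commonly used mel-scale approximation (an assumption here).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(num_bands, num_bins, sample_rate=16000):
    """Triangular weights over FFT bins spanning 0 Hz to the Nyquist frequency."""
    nyquist = sample_rate / 2.0
    top_mel = hz_to_mel(nyquist)
    # num_bands + 2 edge points, equally spaced on the mel scale.
    hz_edges = [mel_to_hz(top_mel * i / (num_bands + 1)) for i in range(num_bands + 2)]
    bin_hz = [nyquist * k / (num_bins - 1) for k in range(num_bins)]
    bank = []
    for b in range(num_bands):
        lo, mid, hi = hz_edges[b], hz_edges[b + 1], hz_edges[b + 2]
        weights = []
        for f in bin_hz:
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def log_band_energies(power_spectrum, bank, floor=1e-10):
    """Weighted sum of the power spectrum in each band, then the logarithm."""
    return [math.log(max(sum(w * p for w, p in zip(weights, power_spectrum)), floor))
            for weights in bank]
```

Applying `log_band_energies` to the power spectrum of every frame gives one vector of log band energies per 10 ms step.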
A TRAP feature vector describes a segment of the temporal evolution of critical-band spectral densities within a single critical band. The central point is the current frame, with an equal number of frames in the past and in the future.
The length of the segment can vary. This vector forms the input to a classifier.
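As a rough sketch of how such a vector could be cut out of a sequence of per-frame band energies: the function name `trap_vector`, the `context` parameter, and the edge handling (repeating the first or last frame near utterance boundaries) are assumptions made for illustration and are not taken from the paper.

```python
def trap_vector(band_energies, band, center, context):
    """Temporal trajectory of one critical band around a center frame.

    band_energies: list of frames, each a list of per-band log energies.
    Returns 2 * context + 1 values: `context` frames in the past, the
    center frame, and `context` frames in the future.
    """
    last = len(band_energies) - 1
    vector = []
    for t in range(center - context, center + context + 1):
        t_clamped = min(max(t, 0), last)   # assumed edge handling: repeat edge frames
        vector.append(band_energies[t_clamped][band])
    return vector
```

One such vector is built per critical band and per frame, so a single frame position yields as many TRAP vectors as there are bands.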
The outputs of the classifier are posterior probabilities of the sub-word classes that we want to distinguish. In our case, these classes are context-independent phonemes or their parts (states). One such classifier is applied in each critical band. The merger is another classifier, whose function is to combine the band classifier outputs into one.
Both the band classifiers and the merger are neural nets.
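The two-stage structure (one classifier per band, followed by a merger) might look like the sketch below. Everything concrete in it is hypothetical: the layer sizes, the tanh hidden units, the absence of bias terms, and the random weights that stand in for trained parameters. It only shows how the band outputs are concatenated and passed to the merger.

```python
import math
import random

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(rng, n_in, n_out):
    """Random weight matrix; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def mlp_posteriors(inputs, hidden_w, output_w):
    """One hidden layer with tanh units, softmax output (biases omitted)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in hidden_w]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_w]
    return softmax(scores)

def trap_system_posteriors(trap_vectors, band_nets, merger_net):
    """Run one classifier per band, then merge the concatenated outputs."""
    band_outputs = []
    for vector, (hidden_w, output_w) in zip(trap_vectors, band_nets):
        band_outputs.extend(mlp_posteriors(vector, hidden_w, output_w))
    return mlp_posteriors(band_outputs, *merger_net)

# Toy dimensions, chosen only to keep the example small.
rng = random.Random(0)
NUM_BANDS, TRAP_LEN, HIDDEN, NUM_CLASSES = 3, 11, 8, 4
band_nets = [(make_layer(rng, TRAP_LEN, HIDDEN), make_layer(rng, HIDDEN, NUM_CLASSES))
             for _ in range(NUM_BANDS)]
merger_net = (make_layer(rng, NUM_BANDS * NUM_CLASSES, HIDDEN),
              make_layer(rng, HIDDEN, NUM_CLASSES))
trap_vectors = [[rng.random() for _ in range(TRAP_LEN)] for _ in range(NUM_BANDS)]
posteriors = trap_system_posteriors(trap_vectors, band_nets, merger_net)
```

The result is one posterior distribution over the sub-word classes for the frame at the center of the TRAP vectors.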
The described techniques yield phoneme probabilities for the center frame. These phoneme probabilities are then fed into a Viterbi decoder, which produces phoneme strings.
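A minimal sketch of the decoding step, under simplifying assumptions that are not from the paper: each phoneme is a single state, every switch between phonemes is equally likely, the `stay_prob` value is arbitrary, and posteriors are used directly as emission scores (hybrid systems often divide them by class priors first, which is omitted here).

```python
import math

def viterbi_decode(posteriors, labels, stay_prob=0.5, floor=1e-12):
    """Most likely label sequence for per-frame posteriors, with repeats collapsed."""
    n = len(labels)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))
    # score[j]: best log score of any path ending in class j at the current frame
    score = [math.log(max(p, floor)) for p in posteriors[0]]
    backptr = []
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n),
                         key=lambda i: score[i] + (log_stay if i == j else log_switch))
            trans = log_stay if best_i == j else log_switch
            new_score.append(score[best_i] + trans + math.log(max(frame[j], floor)))
            pointers.append(best_i)
        score = new_score
        backptr.append(pointers)
    # Trace back from the best final class.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of identical frame labels into a phoneme string.
    phonemes = [labels[path[0]]]
    for s in path[1:]:
        if labels[s] != phonemes[-1]:
            phonemes.append(labels[s])
    return phonemes
```

With a high `stay_prob`, a single frame that weakly favors another class does not produce a spurious phoneme, which is the main benefit over picking the best class frame by frame.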
This recognizer is then simplified and optimized to shorten processing time and reduce computational requirements. This simplification and optimization reduce the phoneme error rate (PER) by about 1.8% absolute.
We achieved more precise modeling by splitting phonemes into 3 parts (states). This improved the system by 0.9% absolute. Separate modeling of the left and right phoneme context gave us 0.38% in the case of one-state models. Finer modeling of these left and right contexts with three states led to an improvement of 3.76%. Bi-gram language models are also incorporated into the system and evaluated.
All modifications together lead to a faster system with about 23.6% relative, or 6.84% absolute, improvement in phoneme error rate over the baseline.
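The figures quoted in the abstract can be checked against each other with a few lines of arithmetic. Read as successive absolute gains, the four individual improvements add up to the stated total. The baseline and final error rates computed below are not given in the abstract; they are only implied by the rounded 23.6% and 6.84% figures, so they should be treated as approximate.

```python
# Absolute PER improvements as stated in the abstract (percentage points).
gains_abs = {
    "simplification and optimization": 1.8,
    "three-state phoneme models": 0.9,
    "left/right context, one-state models": 0.38,
    "left/right context, three-state models": 3.76,
}
total_abs = sum(gains_abs.values())             # about 6.84

stated_total_abs = 6.84                         # percentage points
stated_relative = 23.6                          # percent of the baseline PER

# relative = 100 * absolute / baseline  =>  baseline = 100 * absolute / relative
implied_baseline = 100.0 * stated_total_abs / stated_relative
implied_final = implied_baseline - stated_total_abs

print(f"sum of individual gains: {total_abs:.2f}")
print(f"implied baseline PER:    {implied_baseline:.1f}%")
print(f"implied final PER:       {implied_final:.1f}%")
```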
Work is in progress on porting this recognizer to the meeting data domain. The recognizer will serve as one of the front-ends for acoustic event spotting (the task of Brno within AMI).

Keywords

phoneme recognition, feature extraction, speech recognition

Authors

SCHWARZ, P., MATĚJKA, P., ČERNOCKÝ, J.

RIV year

2004

Released

15. 6. 2004

Publisher

Institute for Perceptual Artificial Intelligence

Location

Martigny

Pages from

1

Pages to

1

Pages count

1

URL

http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf

BibTex

@inproceedings{BUT17586,
  author="Petr {Schwarz} and Pavel {Matějka} and Jan {Černocký}",
  title="Phoneme Recognition from a Long Temporal Context",
  booktitle="poster at JOINT AMI/PASCAL/IM2/M4 Workshop on Multimodal Interaction and Related Machine Learning Algorithms",
  year="2004",
  pages="1--1",
  publisher="Institute for Perceptual Artificial Intelligence",
  address="Martigny",
  url="http://www.fit.vutbr.cz/~matejkap/publi/2004/ami2004.pdf"
}