Course detail
Speech Processing
FEKT-MZPR
Acad. year: 2016/2017
The course gives a comprehensive overview of current approaches to processing speech in verbal communication. It first introduces speech production and perception, the human auditory system and the process of hearing. It then discusses the segmental and suprasegmental parameters frequently used in speech analysis. The main areas of speech processing follow: pattern and isolated-word recognition, speech synthesis and coding, and text-to-speech (TTS) systems. Methods of pitch analysis, prosody modelling, emotion analysis and speech watermarking are also covered. Attention is paid to single-channel and multi-channel speech enhancement and noise suppression. Finally, subjective and objective methods for assessing the quality and intelligibility of speech are introduced.
Language of instruction
Number of ECTS credits
Mode of study
Guarantor
Department
Learning outcomes of the course unit
Prerequisites
Co-requisites
Planned learning activities and teaching methods
Assessment methods and criteria linked to learning outcomes
Course curriculum
2. Areas of speech signal processing. Overview of segmental and suprasegmental attributes. Pre-processing of speech, segmentation, windowing, pre-emphasis. Narrowband and wideband spectrograms, short-term energy. Linear predictive analysis, direct and lattice implementation structures, reflection coefficients and their calculation, normal equations and their solution. Levinson-Durbin algorithm, order selection for LPC analysis. Perceptual linear prediction (PLP) coefficients and their calculation. PLP spectral coefficients. Formant estimation using LP coefficients. Cepstral analysis, complex and real cepstra, mel-frequency spectral and cepstral coefficients, calculation example for MFCC.
3. Pitch signal and its frequency and period, jitter, shimmer. Overview of methods for the determination of pitch properties.
4. Pattern recognition, feature extraction. Dynamic Time Warping (DTW). Degree of similarity, absolute difference. Euclidean distance, Mahalanobis distance, Itakura distance. K-means algorithm. Applications: isolated word recognition, text-dependent speaker recognition. Speech therapy signals, analysis and detection of speech defects, training system for defect correction. Analysis of biological signals for the detection and treatment of diseases that are diagnosed on the basis of human speech (Parkinson's disease, etc.).
5. Bayesian classification, neural networks, Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs), Hidden Markov Models (HMMs). Word and sentence prosody, micro-prosody. Prosody parameters: pitch variations, intensity and tempo. Fujisaki's model, statistical and LPC modelling. Rule-based phonetic modelling (melodemes).
6. Illustrative audio recordings of synthesisers, history of their development. Building an inventory of speech units. Speech synthesis in the time domain and in the frequency domain. Vocal tract modelling (LPC and cepstral models, harmonic model). Approximation of the exponential function exp(x). Text-to-speech (TTS) synthesis, text pre-processing, phonetic transcription, prosody settings.
7. Waveform coding. Source coding. Basic principle of the LPC codec. Adaptive Multi-Rate Wideband (AMR-WB) system, Variable-Rate Multimode Wideband (VMR-WB) system. Speech transmission over the Internet.
8. Spectral subtraction method, RASTA method, mapping spectrogram method. Voice Activity Detector (VAD). Use of the wavelet transform and digital filter banks. Adaptive LMS filters. Digital filtering (dual-channel, multi-channel processing). Cocktail-party effect. Beamforming. Blind source separation (under-determined, determined, over-determined). Independent Component Analysis (ICA), Sparse Component Analysis (SCA).
9. System for recognition of emotion from speech. Emotion classification. System for emotion recognition from static images and videos.
10. Evaluation of quality, intelligibility, naturalness, and acceptability of speech. Nominal, ordinal, interval, and ratio scales. Sentence, word and rhyme tests, logatoms, signal-to-noise ratio measurement. Databases of speech recordings, their types and classification. PESQ and PSQM methods.
11. Data and database protection, general scheme of coder and decoder. Non-perceptibility, robustness, and coder workload. Masking in the time and the frequency domains.
12. Modulation spectrum, bi-spectrum, bi-cepstrum, methods of speech quality evaluation. Attributes derived from Empirical Mode Decomposition (EMD) and Discrete Time Wavelet Transform (DTWT) methods, etc.
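Illustration for topic 2 (normal equations and the Levinson-Durbin algorithm). The following is a minimal sketch in plain Python, not course material. The function names, the biased autocorrelation estimate and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Biased (unnormalised) autocorrelation r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations recursively.

    r     : autocorrelation values r[0..order], with r[0] > 0
    order : prediction order p
    Returns (a, refl, err), where a = [1, a1, ..., ap] are the
    coefficients of the inverse filter A(z), refl are the reflection
    coefficients and err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    refl = []
    err = float(r[0])  # order-0 error energy; a silent frame (r[0] == 0) is not handled
    for m in range(1, order + 1):
        # Correlation between the current residual and the sample m steps back.
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        refl.append(k)
        prev = a[:]
        # Order update of the predictor coefficients.
        for i in range(1, m):
            a[i] = prev[i] + k * prev[m - i]
        a[m] = k
        # The error energy shrinks by the factor (1 - k^2) at each order.
        err *= 1.0 - k * k
    return a, refl, err


if __name__ == "__main__":
    # Autocorrelation of an ideal first-order process: r[k] = 0.5 ** k.
    a, refl, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
    print(a, refl, err)
```

The reflection coefficients collected in `refl` are the quantities used by the lattice implementation structure mentioned in the same topic. In this sketch the recursion costs O(p^2) operations for order p, compared with O(p^3) for a general linear solver.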
Work placements
Aims
Specification of controlled education, way of implementation and compensation for absences
Recommended optional programme components
Prerequisites and corequisites
Basic literature
O'SHAUGHNESSY, D., DENG, L.: Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker, New York, 2003. ISBN 0-8247-4040-8
PSUTKA, J.: Komunikace s počítačem mluvenou řečí. ACADEMIA, Praha 1995. ISBN 80-200-0203-0
QUATIERI, T.F.: Discrete-Time Speech Signal Processing-Principles and Practice. Prentice Hall, NJ 2002. ISBN 0-13-242942-X
UHLÍŘ, J., SOVKA, P.: Digital Signal Processing (Číslicové zpracování signálů). ČVUT, Praha, 1995. (In Czech)
Recommended reading
Classification of course in study plans
- Programme EEKR-M Master's
branch M-TIT, 2nd year of study, summer semester, elective specialised
- Programme AUDIO-P Master's
branch P-AUD, 2nd year of study, summer semester, elective specialised
- Programme EEKR-CZV lifelong learning
branch EE-FLE, 1st year of study, summer semester, elective specialised
Type of course unit
Lecture
Teacher / Lecturer
Syllabus
Phonetic description of the Czech language.
Introduction to speech signal analysis, model of speech production.
Features used in speech signal analysis.
Homomorphic analysis (LPCC, LFCC and MFCC coefficients).
Automatic recognition of commands.
Automatic speaker recognition.
Time-domain and frequency-domain speech synthesis.
Speech encoding techniques.
Speech signal and interference.
Single-channel filtering techniques.
Multi-channel filtering techniques.
Technical tools for implementation.
Laboratory exercise
Teacher / Lecturer
Syllabus
Calculation of autocorrelation and LPC coefficients.
Spectrogram-based analysis of speech signals.
Calculation of cepstral coefficients (LPCC, LFCC and MFCC coefficients).
Calculating the AMDF function, estimating the fundamental frequency (pitch).
Selecting features for automatic command recognition.
Selecting features for automatic speaker recognition.
Detecting utterance boundaries in noisy recordings.
Speech synthesis in the time domain.
Assignment of individual projects.
Solving and consulting individual projects.
Solving and consulting individual projects.
Handing in the projects and awarding the credit.
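Illustration for the laboratory exercise on the AMDF function. The following is a minimal sketch in plain Python of how the average magnitude difference function can be used to estimate the fundamental frequency of one voiced frame. It is not the course's own lab code. The function names, the 60 to 400 Hz search range and the synthetic test tone are assumptions made for this example.

```python
import math


def amdf(frame, min_lag, max_lag):
    """Average magnitude difference for each lag in [min_lag, max_lag]."""
    n = len(frame)
    return [
        sum(abs(frame[i] - frame[i + lag]) for i in range(n - lag)) / (n - lag)
        for lag in range(min_lag, max_lag + 1)
    ]


def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame.

    The lag with the smallest AMDF value inside the search range is
    taken as the pitch period. The frame must be longer than fs / f0_min
    samples so that every candidate lag can be evaluated.
    """
    min_lag = int(fs / f0_max)
    max_lag = int(fs / f0_min)
    values = amdf(frame, min_lag, max_lag)
    best = min(range(len(values)), key=values.__getitem__)
    return fs / (min_lag + best)


if __name__ == "__main__":
    fs = 8000  # sampling rate in Hz (assumed)
    # Synthetic 100 Hz tone, 50 ms long, standing in for a voiced frame.
    frame = [math.sin(2 * math.pi * 100 * i / fs) for i in range(400)]
    print(estimate_f0(frame, fs))
```

On real speech the AMDF minimum is less sharp than for a pure tone, so a practical detector would normally add a voiced/unvoiced decision and some smoothing of the pitch track across frames.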