IV. Speech analysis and modeling

Vocal tract modeling

Defn. The formants of a speech sound are concentrations of energy at the resonant frequencies of the vocal tract.

Formant frequency (Hz) = center frequency of the resonance

Formant bandwidth (Hz) = width of the resonance. Given a formant with peak amplitude A, the formant bandwidth is the difference in frequency between the two points, one on either side of the peak, at which the amplitude is A/sqrt(2) (i.e., 3 dB below the peak)

Formants are named (numbered) in increasing order of formant frequency

In the z-plane, one resonance corresponds to either one real pole:

H(z) = G/(z-z(i))
or one complex pole pair:
H(z) = G/[(z-z(1))(z-z(2))]

where z(1) and z(2) are complex conjugates.

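For a pole at polar location (r, THETA), i.e. z = r*e^(j*THETA), the formant frequency is determined by the angle THETA and the formant bandwidth by the radius r: the closer r is to 1 (the unit circle), the narrower the bandwidth.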
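As a rough numerical check of this correspondence, the sketch below evaluates |H| on the unit circle for one complex pole pair, then measures the peak frequency and the 3 dB width directly from the sampled curve. The function names, the 0.5 Hz grid, and the choice G = 1 are illustrative assumptions and do not come from these notes.

```python
import cmath
import math

def magnitude(r, theta, omega):
    """|H(e^{j*omega})| for a conjugate pole pair at radius r, angle theta (G = 1)."""
    z = cmath.exp(1j * omega)
    z1 = cmath.rect(r, theta)      # pole at r*e^{j*theta}
    z2 = z1.conjugate()            # its complex conjugate
    return 1.0 / abs((z - z1) * (z - z2))

def peak_and_bandwidth(r, theta, fs, step_hz=0.5):
    """Return (peak frequency, 3 dB width) in Hz, measured on a frequency grid."""
    n = int((fs / 2.0) / step_hz)
    freqs = [k * step_hz for k in range(n + 1)]
    mags = [magnitude(r, theta, 2.0 * math.pi * f / fs) for f in freqs]
    k_peak = max(range(len(mags)), key=mags.__getitem__)
    half_power = mags[k_peak] / math.sqrt(2.0)   # amplitude A/sqrt(2)
    lo = k_peak
    while lo > 0 and mags[lo] > half_power:      # walk down the left skirt
        lo -= 1
    hi = k_peak
    while hi < n and mags[hi] > half_power:      # walk down the right skirt
        hi += 1
    return freqs[k_peak], freqs[hi] - freqs[lo]

peak, width = peak_and_bandwidth(0.955, 1.466, 8000.0)
print("peak (Hz):", peak, " 3 dB width (Hz):", width)
```

Moving r toward 1 in the call above should narrow the measured width while leaving the peak near the same frequency; changing THETA should move the peak and leave the width roughly unchanged.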

The physical formant frequency is computed from THETA as

F = THETA / (2*pi*T)
where T=1/(sampling frequency).

The relationship between the physical formant bandwidth and r is given by

B = -ln(r) / (pi*T)
and
r = |z(i)| = e^(-pi*T*B)
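Going in the other direction, a desired formant frequency and bandwidth can be turned into a pole location and into the denominator of H(z). The following is a minimal sketch of that mapping; the function names and the 500 Hz / 80 Hz example values are hypothetical, chosen only for illustration.

```python
import math

def formant_to_pole(F, B, fs):
    """Map formant frequency F and bandwidth B (Hz) to pole radius and angle."""
    T = 1.0 / fs
    theta = 2.0 * math.pi * F * T       # inverts F = THETA / (2*pi*T)
    r = math.exp(-math.pi * T * B)      # r = e^(-pi*T*B)
    return r, theta

def pole_pair_denominator(r, theta):
    """Coefficients of (z - z1)(z - z2) = z^2 - 2*r*cos(theta)*z + r^2."""
    return 1.0, -2.0 * r * math.cos(theta), r * r

r, theta = formant_to_pole(500.0, 80.0, 8000.0)
print("r =", r, " theta =", theta)
print("denominator coefficients:", pole_pair_denominator(r, theta))
```

Because z(1) and z(2) are conjugates, the denominator coefficients are real, which is why one complex pole pair can be realized with a real second-order filter.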

Example:

z(i) = 0.1 + j 0.95 --> |z(i)| = 0.955
THETA(i) = 1.466 radians

if f(s) = 8 kHz --> F(i) = 1866 Hz; B(i) = 117 Hz
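The arithmetic in this example can be reproduced in a few lines. This is only a sketch; pole_to_formant is a hypothetical helper name, not something defined in these notes.

```python
import cmath
import math

def pole_to_formant(z, fs):
    """Return (F, B) in Hz for a z-plane pole z at sampling frequency fs (Hz)."""
    T = 1.0 / fs
    r = abs(z)                            # pole radius |z|
    theta = cmath.phase(z)                # pole angle in radians
    F = theta / (2.0 * math.pi * T)       # formant frequency
    B = -math.log(r) / (math.pi * T)      # formant bandwidth
    return F, B

F, B = pole_to_formant(complex(0.1, 0.95), 8000.0)
print(round(F), round(B))   # -> 1866 117
```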

In practice: formant frequencies are important, formant bandwidths less so.

Relation of formants to speech sounds:
F1: related to the "open-closed" dimension in articulatory phonetics (open vowels have a higher F1)
F2: related to the "front-back" dimension (front vowels have a higher F2)

This leads to the Vowel Quadrilateral (or Vowel Triangle)

Figure: Vowel quadrilateral.

Figure: F2 frequency vs. F1 frequency for English vowels, for a typical adult male speaker.

Figure: Mean formant frequencies and relative amplitudes for 33 male speakers, for English vowels in an /h-d/ context. Relative formant amplitudes are given in dB with respect to the first formant of AO (bought). After [Peterson and Barney, reprinted in J. L. Flanagan, Speech Analysis Synthesis and Perception, Springer-Verlag, Berlin, 2nd edition, 1965].
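To make the F1/F2 picture concrete, the sketch below places four corner vowels in the (F1, F2) plane and labels a measured formant pair with the nearest one. The formant values are the male averages commonly quoted from the Peterson and Barney study and should be treated as approximate. The nearest-neighbor rule, the plain Euclidean distance in Hz, and all names are assumptions made for illustration only.

```python
import math

# Approximate mean (F1, F2) in Hz for adult male speakers, as commonly quoted
# from Peterson and Barney. Illustrative values; only four vowels are listed.
VOWELS = {
    "IY (heed)":  (270.0, 2290.0),   # closed, front
    "AE (had)":   (660.0, 1720.0),   # open, front
    "AA (hod)":   (730.0, 1090.0),   # open, back
    "UW (who'd)": (300.0,  870.0),   # closed, back
}

def nearest_vowel(f1, f2):
    """Return the label in VOWELS closest to (f1, f2) in Euclidean distance."""
    return min(
        VOWELS,
        key=lambda v: math.hypot(f1 - VOWELS[v][0], f2 - VOWELS[v][1]),
    )

print(nearest_vowel(280.0, 2250.0))
print(nearest_vowel(700.0, 1100.0))
```

In this table the two closed vowels have the lowest F1 and the two front vowels the highest F2, which matches the open-closed and front-back dimensions described above.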

Parameters for speech modeling:

  1. Parameters for vocal tract analysis and modeling

  2. Parameters for excitation analysis and modeling
       Voiced speech:
       Unvoiced speech:

Contexts for analysis methods:

    1. Analysis-synthesis systems
      s(t) --> analyze --> synthesize --> ss(t)

      The analysis produces a representation in terms of a reduced number of parameters, for efficient storage or transmission (coding, compression)

      Model-based analysis-synthesis:
      If the linear filtering production model is assumed, model-based analysis attempts to estimate parameters for the vocal tract and for the excitation

      Generic name: "vocoder" -- voice coder
      "Voder" - System demonstrated by Bell Labs at the 1939 World's Fair

    2. Speech recognition systems

    3. Speech synthesis:

      Off-line, create letter-to-sound or word-to-sound databases and rules
      Using these rules, synthesize speech from, e.g., typed text
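The model-based analysis-synthesis loop listed above can be sketched with a toy example. It assumes a single two-pole resonance as the vocal tract and one unit impulse as the excitation, and it uses order-2 linear prediction (autocorrelation method) as the analysis step. Linear prediction is one common choice and is not necessarily the method these notes have in mind; all names and numbers are illustrative.

```python
import math

def resonator(excitation, a1, a2):
    """All-pole synthesis: s[n] = e[n] + a1*s[n-1] + a2*s[n-2]."""
    s = []
    for n, e in enumerate(excitation):
        s1 = s[n - 1] if n >= 1 else 0.0
        s2 = s[n - 2] if n >= 2 else 0.0
        s.append(e + a1 * s1 + a2 * s2)
    return s

def autocorr(s, lag):
    """Autocorrelation of the frame s at the given non-negative lag."""
    return sum(s[n] * s[n - lag] for n in range(lag, len(s)))

def analyze(s):
    """Order-2 linear prediction; returns (a1, a2, gain)."""
    r0, r1, r2 = (autocorr(s, k) for k in range(3))
    det = r0 * r0 - r1 * r1              # solve the 2x2 normal equations
    a1 = r1 * (r0 - r2) / det
    a2 = (r0 * r2 - r1 * r1) / det
    gain = math.sqrt(r0 - a1 * r1 - a2 * r2)
    return a1, a2, gain

def coeffs_to_formant(a1, a2, fs):
    """Recover (F, B) in Hz from a1 = 2*r*cos(theta), a2 = -r^2."""
    r = math.sqrt(-a2)
    theta = math.acos(a1 / (2.0 * r))
    return theta * fs / (2.0 * math.pi), -math.log(r) * fs / math.pi

fs = 8000.0
F_true, B_true = 500.0, 80.0             # assumed "vocal tract" resonance
r = math.exp(-math.pi * B_true / fs)
theta = 2.0 * math.pi * F_true / fs
excitation = [1.0] + [0.0] * 1999        # one idealized pitch pulse
s = resonator(excitation, 2.0 * r * math.cos(theta), -r * r)

# Analysis: 2000 samples reduced to three parameters.
a1, a2, gain = analyze(s)
F_est, B_est = coeffs_to_formant(a1, a2, fs)

# Synthesis: rebuild the signal from the parameters alone.
ss = resonator([gain * e for e in excitation], a1, a2)
max_err = max(abs(x - y) for x, y in zip(s, ss))
print("F_est =", F_est, " B_est =", B_est, " gain =", gain, " max error =", max_err)
```

The frame of 2000 samples is represented by two filter coefficients and a gain, which is the kind of reduction that makes coding and compression possible. Real speech needs more poles, short overlapping frames, and a richer excitation model than this sketch uses.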


