II. Representations of Speech


Physical Representations

Sound waves are longitudinal waves: the displacement of the molecules is parallel to the direction in which the wave travels.

Figure: Longitudinal wave: Propagation of a wave along the particles of a medium.
[C. H. Coker, P. B. Denes, and E. N. Pinson, Speech Synthesis, Bell Telephone Laboratories, Inc., 1963.]

What characterizes the sound wave? Displacement of a particle from its rest position.

To measure the sound wave, use either:

  1. variation in sound pressure

    or

  2. power (intensity) - varies as square of pressure

Scale: decibels (dB), a logarithmic measure relative to a reference pressure

Choices of reference pressure:

  1. Reference is the smallest audible sound (2 x 10^-4 dynes/cm^2)
    Using this reference:
    Ordinary conversation: 65 dB
    Range for speech: 46 dB (whisper) --> 86 dB (shout)
  2. Reference is the maximum value of pressure across the given signal
    For the given signal, all values will be <= 0 dB
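
The two reference choices above can be sketched in Python. This sketch assumes the usual definition of level in dB, 20 log10(p / p_ref), which equals 10 log10 of the power ratio because power varies as the square of pressure. The function names are illustrative and do not come from the notes.

```python
import math

P_REF = 2e-4  # dynes/cm^2: smallest audible sound (reference choice 1)

def pressure_to_db(p, p_ref=P_REF):
    """Level of pressure p in dB relative to p_ref (20*log10 of the pressure ratio)."""
    return 20.0 * math.log10(p / p_ref)

def db_relative_to_peak(samples):
    """Reference choice 2: level of each sample relative to the signal's peak magnitude."""
    peak = max(abs(s) for s in samples)
    return [20.0 * math.log10(abs(s) / peak) if s != 0 else float("-inf")
            for s in samples]

# Pressure that corresponds to ordinary conversation at 65 dB.
conversation_pressure = P_REF * 10 ** (65.0 / 20.0)
print(pressure_to_db(conversation_pressure))

# Whisper (46 dB) to shout (86 dB) is a 40 dB range: a factor of 100 in pressure.
print(10 ** ((86.0 - 46.0) / 20.0))

# With the peak as reference, every value is <= 0 dB.
print(db_relative_to_peak([0.5, -1.0, 0.25]))
```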

Representations of Speech Signals

Rabiner & Juang, Chapter 2.3

1. Time domain representations

  1. Analog: voltage proportional to air pressure, via a microphone
  2. Digital: sample the analog signal -- A/D conversion

Two components of A/D conversion: sampling and quantization

Sampling: Key Concepts

References:
Ziemer and Tranter, Principles of Communication, pp. 68-72.
Oppenheim & Schafer, Discrete Time Signal Processing, pp. 80-87.

Let f(s) denote the sampling frequency and T denote the interval between successive samples, so f(s) = 1/T.
Let f(max) denote the frequency of the highest frequency component present in a given function.

Sampling Principle: Any band-limited continuous-time function is exactly determined by its equally spaced samples, provided f(s) is greater than 2 times f(max).

The Nyquist rate is defined to be 2 f(max).
The Nyquist rate is a strict lower bound on the acceptable sampling rate: f(s) must exceed it, not merely equal it.

Consequence of undersampling: aliasing

In aliasing, a frequency component at f > f(s)/2 is observed at a lower frequency. For f(s)/2 < f < f(s), it is observed as a frequency component at f' = | f - f(s) |. The spurious component due to f is added to the true component at f', corrupting the information about f'.

E.g.:
Assume f(s) = 6 kHz
Signal s contains a component at 4 kHz
The component at 4 kHz will be added to the true frequency component at f' = | 4 - 6 | kHz = 2 kHz.
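
A small Python check of this example. The helper alias_frequency is a hypothetical name; it folds any frequency into the range 0 to f(s)/2, which for f(s)/2 < f < f(s) reduces to | f - f(s) |. The second part confirms that, at f(s) = 6 kHz, the samples of a 4 kHz cosine cannot be distinguished from those of a 2 kHz cosine.

```python
import math

def alias_frequency(f, fs):
    """Apparent frequency, in [0, fs/2], of a component at f when sampled at fs."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)

fs = 6000.0
print(alias_frequency(4000.0, fs))  # 2000.0

# Sample both cosines at fs and compare them sample by sample.
samples_match = all(
    math.isclose(math.cos(2.0 * math.pi * 4000.0 * n / fs),
                 math.cos(2.0 * math.pi * 2000.0 * n / fs),
                 abs_tol=1e-9)
    for n in range(32)
)
print(samples_match)  # True
```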

Implications for Speech

Where is the significant energy?

Practical note for collecting speech:

Precede the A/D converter with a low-pass filter that ensures that there are no frequency components above f(s)/2.

Relation between angular and physical frequencies:

physical frequency f: cycles/second (Hz)
angular frequency omega: radians/(sample interval)
f(s) = 1/T

Figure: Relationship between physical and angular frequencies.

Using the equivalence of the ratios:
omega/pi = f/(f(s)/2)
Then, since f(s) = 1/T,
f = omega f(s)/(2 pi) = omega/(2 pi T)
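
The conversion can be written as a pair of Python helpers. The function names and the 8 kHz sampling rate are assumptions chosen for illustration.

```python
import math

def omega_to_hz(omega, T):
    """Physical frequency (Hz) for angular frequency omega (radians/sample)."""
    return omega / (2.0 * math.pi * T)

def hz_to_omega(f, T):
    """Angular frequency (radians/sample) for physical frequency f (Hz)."""
    return 2.0 * math.pi * f * T

T = 1.0 / 8000.0  # assumed sample interval: f(s) = 8 kHz

# omega = pi corresponds to f(s)/2, i.e. about 4000 Hz here.
print(omega_to_hz(math.pi, T))
# 1 kHz corresponds to omega = pi/4 at this sampling rate.
print(hz_to_omega(1000.0, T))
```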

Quantization: Key Concepts

References
Parsons (9.1, 5.7);
Oppenheim & Schafer, Discrete Time SP, pp. 114-123.

The signal-to-noise ratio (SNR) is defined to be the ratio (signal power)/(quantization error power).

Most common: uniform quantization, in which there is a fixed step size between adjacent quantization levels.

Principal Result:

Let b = # bits in the quantized representation
Then SNR(in dB) = 6b - 7.2

An increase of 1 in the number of bits in the representation results in an increase in the SNR of 6 dB
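
The 6 dB per bit rule can be checked numerically. The sketch below is an illustration under stated assumptions, not the derivation in the references: it uses a mid-rise uniform quantizer on [-1, 1) and a pseudo-random test signal that fills that range. The constant term in the SNR depends on how the signal level relates to the quantizer range, so only the slope (dB gained per added bit) is compared with the formula.

```python
import math
import random

def quantize_uniform(x, bits):
    """Mid-rise uniform quantizer over [-1, 1) with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    index = max(-levels // 2, min(levels // 2 - 1, index))  # clip to the valid range
    return (index + 0.5) * step

def measured_snr_db(signal, bits):
    """(signal power)/(quantization error power), in dB."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum((s - quantize_uniform(s, bits)) ** 2 for s in signal) / len(signal)
    return 10.0 * math.log10(signal_power / noise_power)

rng = random.Random(0)  # fixed seed so the run is repeatable
signal = [rng.uniform(-1.0, 1.0) for _ in range(20000)]

for bits in (8, 9, 10):
    print(bits, round(measured_snr_db(signal, bits), 1))

gain_per_bit = measured_snr_db(signal, 9) - measured_snr_db(signal, 8)
print(round(gain_per_bit, 1))  # close to 6 dB
```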

Implications for Speech

Practical note for collecting speech:
If you use b bits to quantize but the high order h bits are never actually used, then the effective SNR is determined by (b-h), not by b: SNR(effective) = 6(b-h) - 7.2

Assuming a fixed b, then it may be desirable to amplify the speech before quantization in order to use the full available amplitude range provided by the b bits.
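
A short Python sketch of the effective-SNR bookkeeping. The function names are illustrative. The gain helper rests on the assumption that each unused high-order bit leaves a factor of 2 of amplitude headroom.

```python
def snr_db(bits):
    """SNR in dB for a b-bit uniform quantizer, using the formula in the notes."""
    return 6.0 * bits - 7.2

def effective_snr_db(bits, unused_high_bits):
    """SNR when the high-order h bits are never used: determined by (b - h)."""
    return snr_db(bits - unused_high_bits)

def max_gain_to_fill_range(unused_high_bits):
    """Largest amplitude gain that still fits in b bits (a factor of 2 per unused bit)."""
    return 2 ** unused_high_bits

b, h = 16, 3
print(snr_db(b), effective_snr_db(b, h))
loss_db = snr_db(b) - effective_snr_db(b, h)   # 6 dB lost per unused bit
print(loss_db, max_gain_to_fill_range(h))
```

Amplifying by up to this gain before quantization recovers the lost SNR, which is the reason for the practical note above.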


2. Frequency domain representations

Rabiner & Juang, Chapter 3.1 and 3.2

z-transform

Fourier transform of a discrete time signal

Discrete signal, continuous frequency
The Fourier transform is the z-transform evaluated at z = e^(j omega) (i.e., on the unit circle in the z-plane).

Problem: speech signal x is time varying, and is not known for all time

Solution: Define the short-time Fourier transform:

Xn(e^(j omega)) =
sum from m=-infinity to +infinity of {w(n-m) x(m) e^(-j omega m)}

n = time index
Xn denotes the STFT at time index n
w(n-m) window function at time index n
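
A direct, unoptimized Python evaluation of the short-time Fourier transform at a single time index n and a single frequency omega. It is a sketch under stated assumptions: the window is a finite list holding w(0) to w(N-1), so only the samples m = n-(N-1) to n contribute, and x is treated as zero outside the list. The names stft_at and hamming, and the test tone, are illustrative.

```python
import cmath
import math

def hamming(N):
    """Hamming window, nonzero for 0 <= m <= N-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

def stft_at(x, w, n, omega):
    """X_n(e^{j omega}) = sum over m of w(n-m) x(m) e^{-j omega m}."""
    total = 0j
    for k, w_k in enumerate(w):      # k = n - m, so w_k = w(n-m)
        m = n - k
        if 0 <= m < len(x):          # x is taken to be zero outside the list
            total += w_k * x[m] * cmath.exp(-1j * omega * m)
    return total

# Test signal: a cosine at angular frequency omega0 (radians/sample).
omega0 = 0.2 * math.pi
x = [math.cos(omega0 * m) for m in range(200)]
w = hamming(64)

on_peak = abs(stft_at(x, w, 100, omega0))         # at the tone's frequency
off_peak = abs(stft_at(x, w, 100, omega0 + 1.0))  # well away from it
print(on_peak > off_peak)  # True
```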

One interpretation of the short-time Fourier transform: fix n, let omega vary:

Xn(e^(j omega)) = F[w(n-m)x(m)] = F[w(n-m)]*X(e^(j omega))

Xn denotes the STFT at time index n
F[] denotes the traditional Fourier transform,
X denotes the Fourier transform of x,
W denotes the Fourier transform of w,
* denotes convolution.

Xn(e^(j omega)) = W(e^(-j omega))e^(-j omega n)*X(e^(j omega)) for n fixed

Theoretical issue:

existence of Xn(e^(j omega)):
need x(m)w(n-m) absolutely summable over all m

OK if w(n-m) = 0 outside a finite interval around n
Think of the x(m)w(n-m) either as

Specific windows

Xn(e^(j omega)) = W(e^(-j omega))*X(e^(j omega))e^(-j omega n), n fixed

Goal: get an accurate representation of X(e^(j omega)) by looking at Xn(e^(j omega)).

Ideally W(e^(-j omega)) will look like an impulse with respect to X(e^(j omega))
For this to be the case, w(n-m) would have to be an infinite duration constant function.
But such a w(n-m) is not a finite duration window.

Common windows:

Figure: Rectangular and Hamming windows.

  1. Rectangular window
    w(m) = 1 for m from 0 to N-1
    w(m) = 0 otherwise

  2. Hamming window
    w(m) = 0.54 - 0.46 cos((2 pi m)/(N-1)) for m from 0 to N-1
    w(m) = 0 otherwise

Frequency response of windows:

Goal: impulse-like W(e^(-j omega))

Figure: Fourier transform of (a) rectangular window; (b) Hamming window.

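Two factors:

  1. Width of the main lobe (a narrower main lobe gives better frequency resolution)
  2. Attenuation of the side lobes (greater attenuation gives less leakage from distant frequencies)

So the rectangular window has better main lobe properties (a narrower main lobe) and the Hamming window has better attenuation properties (much lower side lobes). In practice, the Hamming window is more commonly used for speech analysis.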
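
The comparison can be reproduced numerically. The Python sketch below evaluates the magnitude of each window's Fourier transform on a dense frequency grid, then reports the half-width of the main lobe (up to the first local minimum) and the peak side lobe level relative to the main lobe peak. The window length N = 51, the grid size, and the function names are assumptions made for illustration.

```python
import cmath
import math

def rectangular(N):
    return [1.0] * N

def hamming(N):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

def dtft_mag(w, omega):
    """|W(e^{j omega})| for a finite window w(0)..w(N-1)."""
    return abs(sum(w_m * cmath.exp(-1j * omega * m) for m, w_m in enumerate(w)))

def lobe_stats(w, grid=2048):
    """Return (main lobe half-width in radians, peak side lobe level in dB)."""
    omegas = [math.pi * k / grid for k in range(grid + 1)]
    mags = [dtft_mag(w, om) for om in omegas]
    k = 1
    while k < grid and mags[k] < mags[k - 1]:  # walk down the main lobe
        k += 1
    first_min = k - 1                          # edge of the main lobe
    side_peak = max(mags[first_min:])
    return omegas[first_min], 20.0 * math.log10(side_peak / mags[0])

N = 51
rect_width, rect_side_db = lobe_stats(rectangular(N))
ham_width, ham_side_db = lobe_stats(hamming(N))
print("rectangular:", round(rect_width, 3), round(rect_side_db, 1))
print("hamming:    ", round(ham_width, 3), round(ham_side_db, 1))
```

The Hamming main lobe comes out roughly twice as wide as the rectangular one, while its side lobes are far lower.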

Window length

Consider a periodic speech signal. The average pitch period is typically between 4 ms (female) and 8 ms (male).

If window length is:

<= 4 ms: often won't include a full pitch period, so pitch information won't show up in the spectrum.

>= 50 ms: the signal will be changing over the course of the window

common: 10 - 20 ms
T x N = 10 x 10^-3 sec or 20 x 10^-3 sec
where T = sample interval (in sec); N = # non-zero points in window

at 10 - 20 ms:
  -- relatively stable vocal tract
  -- at least 1 full pitch period
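
The arithmetic behind these window lengths, as a short Python sketch. The 12.8 kHz sampling rate and the function names are assumptions for illustration; the 8 ms pitch period is the longer of the two typical values quoted above.

```python
def window_samples(duration_sec, fs_hz):
    """N such that T x N equals the window duration, with T = 1/fs."""
    return round(duration_sec * fs_hz)

def pitch_periods_in_window(duration_sec, pitch_period_sec):
    """How many pitch periods fit inside one window."""
    return duration_sec / pitch_period_sec

fs = 12800.0   # assumed sampling rate in Hz
pitch = 0.008  # 8 ms pitch period

for ms in (4, 10, 20, 50):
    duration = ms / 1000.0
    print(ms, "ms:", window_samples(duration, fs), "samples,",
          pitch_periods_in_window(duration, pitch), "pitch periods")
```

A 4 ms window holds less than one 8 ms pitch period, while 10 ms and 20 ms windows hold at least one.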

Different applications lead to different window lengths.

Examples:
Figure: Spectrum analysis of voiced speech using 50 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms; (b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.]

Figure: Spectrum analysis of voiced speech using 5 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms; (b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.]

Figure: Spectrum analysis of unvoiced speech using 50 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms; (b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.]

Figure: Spectrum analysis of unvoiced speech using 5 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms; (b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978.]

Window Placement

May depend on application

Consider compression, where the Hamming windowed signal is analyzed to create a compressed signal, with the analysis parameters used to reconstruct the signal after compression.

With disjoint windows, the envelope shape of the Hamming window may be audible in the reconstructed signal.

Solution: overlapped windows

To reconstruct the speech from the overlapped analysis segments,
add the outputs from the overlapped frames.
Overlapping is less critical for speech recognition applications.

A question whose answer is related to the use of windows:
Are glottal pulses assumed to be an infinite pulse train, or a unit step?
I.e., is X(e^(j omega)) assumed to have constant amplitude or decaying amplitude, where X denotes the Fourier transform of the excitation signal x(n)?

The model is consistent with assuming that the glottal pulses are part of an infinite train, with the window giving a snapshot of a short time interval. The observed rolloff in the spectrum of speech signal s(n) is more readily attributed to the 6 dB/octave rolloff in the spectral envelope corresponding to the vocal tract function, H(e^(j omega)).

Frequency domain representations (continued):

The DFT evaluates the z-transform at discrete points on the unit circle in the z-plane.

The N-point DFT performs the evaluation at the N points omega(k) = (2 pi k)/N; 0 <= k < N

Radial frequencies from -pi --> pi correspond to physical frequencies from - f(s)/2 --> f(s)/2
or
Radial frequencies from 0 --> 2 pi correspond to physical frequencies from 0 --> f(s)

Conversion between radial and physical frequencies:

0 --> f(s)/2 corresponds to 0 --> pi
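(k/N) pi <--> (k/N) (f(s)/2) = k/(2NT)
where f(s) = 1/T
So DFT point k, at angular frequency (2 pi k)/N, corresponds to the physical frequency k f(s)/N = k/(NT).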

e.g.

f(s) = 1/T = 12.8 kHz
N = 128 samples, so the window duration is N T = 128/12800 sec = 1/100 sec = 10 ms

frequency resolution = 12800 Hz / 128 = 100 Hz between adjacent points in the DFT

e.g.

f(s) = 1/T = 12.8 kHz
N = 512 samples, so the window duration is N T = 512/12800 sec = 40 ms

frequency resolution = 12800 Hz / 512 = 25 Hz between adjacent points in the DFT
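
The two examples follow from two one-line formulas, sketched here in Python with illustrative function names: the spacing between adjacent DFT points is f(s)/N, and the window duration is N T = N/f(s).

```python
def dft_resolution_hz(fs_hz, N):
    """Spacing between adjacent DFT points, in Hz."""
    return fs_hz / N

def window_duration_sec(fs_hz, N):
    """Duration N*T of an N-sample window, with T = 1/fs."""
    return N / fs_hz

def bin_to_hz(k, fs_hz, N):
    """Physical frequency of DFT point k (angular frequency 2*pi*k/N)."""
    return k * fs_hz / N

fs = 12800.0
for N in (128, 512):
    print(N, dft_resolution_hz(fs, N), "Hz,", window_duration_sec(fs, N), "sec")
```

Going from N = 128 to N = 512 makes the frequency spacing 4 times finer and the window 4 times longer.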

There's a trade-off between the frequency resolution and the time resolution: long time windows give good resolution of frequencies in the sense that the spacing between adjacent points in the DFT is small. However, during the course of the long window, the signal has probably changed.

End of Frequency domain representations


Speech Representations

  1. Time domain
  2. Frequency domain
  3. Time + frequency domain

3. Time + frequency domain representations

Rabiner & Juang section 2.3

Spectrograph: displays short-time spectra over a duration of time
horizontal axis: time
vertical axis: frequency

The point (t, f) shows the log magnitude of the spectrum at frequency f and time t, indicated by intensity

Implementation: bank of bandpass filters with different center frequencies
(or short-time DFTs)
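
A minimal Python sketch of the short-time DFT route. Each row holds the log magnitude spectrum of one Hamming-windowed frame, so rows run along the time axis and columns along the frequency axis. The frame length, hop size, sampling rate, and test tone are assumptions for illustration. The DFT is evaluated directly for clarity; an FFT would normally be used instead.

```python
import cmath
import math

def hamming(N):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

def spectrogram(x, N, hop, eps=1e-12):
    """Rows: time frames. Columns: DFT points 0..N/2. Values: log magnitude in dB."""
    w = hamming(N)
    rows = []
    for start in range(0, len(x) - N + 1, hop):
        frame = [x[start + m] * w[m] for m in range(N)]
        row = []
        for k in range(N // 2 + 1):
            X_k = sum(frame[m] * cmath.exp(-2j * math.pi * k * m / N) for m in range(N))
            row.append(20.0 * math.log10(abs(X_k) + eps))  # eps avoids log of zero
        rows.append(row)
    return rows

fs = 12800.0  # assumed sampling rate in Hz
N = 128       # 10 ms frames, 100 Hz between adjacent DFT points
x = [math.cos(2.0 * math.pi * 1600.0 * n / fs) for n in range(512)]

S = spectrogram(x, N, hop=64)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in S]
print(len(S), "frames; peak at", peak_bins[0] * fs / N, "Hz")
```

For the steady 1600 Hz test tone, every frame has its strongest point at the same frequency, which would appear as a horizontal band in the display.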

Two forms of spectrographs:

End of notes on Speech Representations


