Sound waves are longitudinal waves: displacement of molecules is parallel to the direction of motion of the wave
Figure: Longitudinal wave: Propagation of a wave along the
particles of a medium.
[C. H. Coker, P. B. Denes, and E. N. Pinson, Speech Synthesis,
Bell Telephone Laboratories, Inc., 1963.]
What characterizes the sound wave? Displacement of a particle from its rest position.
To measure
or
Scale:
L = 20 log10(R) dB (decibels)
R = ratio of (measured pressure)/(reference pressure)
[If working with pressure, use normalizer 20; if working with intensity (power), use 10]
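A minimal sketch of the decibel scale above. The function names are illustrative, and the reference value is left as a parameter because the notes do not fix one here.

```python
import math

def level_db_from_pressure(measured, reference):
    """Level in dB from a pressure ratio R = measured/reference (normalizer 20)."""
    return 20.0 * math.log10(measured / reference)

def level_db_from_intensity(measured, reference):
    """Level in dB from an intensity (power) ratio (normalizer 10)."""
    return 10.0 * math.log10(measured / reference)

# A tenfold pressure ratio and a hundredfold intensity ratio give the same level,
# because intensity is proportional to pressure squared.
pressure_level = level_db_from_pressure(10.0, 1.0)
intensity_level = level_db_from_intensity(100.0, 1.0)
```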
Reference pressure
Rabiner & Juang, Chapter 2.3
Two components of A/D conversion:
Where is the significant energy?
Practical note for collecting speech:
Relation between angular and physical frequencies:
Figure: Relationship between physical and angular frequencies.
The signal-to-noise ratio (SNR) is defined to be the ratio of (signal power)/(quantization error power)
Most common: uniform quantization, in which adjacent quantization levels are separated by a fixed step size
Principal Result:
An increase of 1 in the number of bits in the representation results in an increase in the SNR of 6 dB
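The 6 dB per bit result can be checked numerically. The sketch below is an illustration only: it assumes a test signal uniformly distributed over the full quantizer range (not real speech), and the quantizer and function names are hypothetical.

```python
import math
import random

def quantize(x, bits, x_max=1.0):
    """Uniform quantizer with 2**bits levels covering [-x_max, x_max)."""
    step = 2.0 * x_max / (2 ** bits)
    index = math.floor(x / step)
    # Clamp to the valid index range so x == x_max does not overflow.
    index = max(-(2 ** (bits - 1)), min(2 ** (bits - 1) - 1, index))
    return (index + 0.5) * step  # reconstruct at the centre of the cell

def snr_db(bits, n_samples=20000, seed=0):
    """Measured (signal power)/(quantization error power), in dB."""
    rng = random.Random(seed)
    signal_power = 0.0
    noise_power = 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        error = quantize(x, bits) - x
        signal_power += x * x
        noise_power += error * error
    return 10.0 * math.log10(signal_power / noise_power)

# Difference in SNR between 9-bit and 8-bit quantization of the same signal.
gain_per_bit = snr_db(9) - snr_db(8)
```

For this test signal the gain per added bit comes out close to 6 dB. The absolute SNR depends on how the signal amplitudes are distributed relative to the quantizer range.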
Implications for Speech
Practical note for collecting speech:
If you use b bits to quantize but the high order h bits are never actually used, then the effective SNR is determined by (b-h), not by b:
SNR(effective) = 6(b-h) - 7.2 dB
Assuming a fixed b, then it may be desirable to amplify the speech before quantization in order to use the full available amplitude range provided by the b bits.
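A small sketch of the effective-SNR formula above, showing the cost of unused high-order bits. The function name and the 16-bit example values are assumptions chosen for illustration.

```python
def effective_snr_db(b, h=0):
    """SNR(effective) = 6(b - h) - 7.2 dB, with h unused high-order bits out of b."""
    if not 0 <= h < b:
        raise ValueError("need 0 <= h < b")
    return 6.0 * (b - h) - 7.2

# Hypothetical example: a 16-bit converter recording a quiet signal
# that never exercises the top 4 bits.
full_range = effective_snr_db(16)       # all 16 bits in use
quiet_signal = effective_snr_db(16, 4)  # only 12 bits effectively in use
loss_db = full_range - quiet_signal     # 6 dB lost per unused bit
```

Under this formula, amplifying the signal so that it fills the available range recovers 6 dB for every high-order bit brought back into use.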
z-transform
Fourier transform of a discrete time signal
Problem: speech signal x is time varying, and is not known for all time
Solution: Define the short-time Fourier transform:
One interpretation of the short-time Fourier transform: fix n, let omega vary:
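A minimal sketch of this interpretation. It assumes the standard definition from Rabiner and Schafer, Xn(e^(j omega)) = sum over m of x(m) w(n-m) e^(-j omega m), which these notes appear to follow. The function name and the list-based window representation are illustrative.

```python
import cmath

def stft_at(x, w, n, omega):
    """Short-time Fourier transform at fixed time n and angular frequency omega.

    x is the signal (x[0], x[1], ...); w holds the window values w(0)..w(N-1),
    with w(k) = 0 outside that range.
    """
    N = len(w)
    total = 0j
    for m in range(len(x)):
        k = n - m                    # window argument n - m
        if 0 <= k < N:               # only samples under the window contribute
            total += x[m] * w[k] * cmath.exp(-1j * omega * m)
    return total

# Fix n and let omega vary: the ordinary Fourier transform of x(m) w(n - m).
x = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0]                       # 2-point rectangular window
value_at_dc = stft_at(x, w, n=2, omega=0.0)
```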
Theoretical issue:
Specific windows
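Xn(e^(j omega)) = [W(e^(-j omega)) e^(-j omega n)] * X(e^(j omega)), n fixed
(here * denotes convolution over omega, up to a factor of 1/(2 pi), since for fixed n, Xn is the Fourier transform of the product x(m) w(n-m))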
Goal: get an accurate representation of X(e^(j omega)) by looking at Xn(e^(j omega)).
Ideally W(e^(-j omega)) will look like an impulse with respect to X(e^(j omega))
For this to be the case, w(n-m) would have to be an infinite duration constant function. But such a w(n-m) is not a finite duration window.
Common windows:
Figure: Rectangular and Hamming windows.
Rectangular window:
  w(m) = 1 for m from 0 to N-1
       = 0 otherwise
Hamming window:
  w(m) = 0.54 - 0.46 cos((2 pi m)/(N-1)) for m from 0 to N-1
       = 0 otherwise
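The two window definitions translate directly into code. This is a sketch with illustrative function names, and it returns only the N non-zero points of each window.

```python
import math

def rectangular_window(N):
    """w(m) = 1 for m = 0..N-1."""
    return [1.0] * N

def hamming_window(N):
    """w(m) = 0.54 - 0.46 cos(2 pi m / (N - 1)) for m = 0..N-1."""
    if N == 1:
        return [1.0]  # avoid division by zero for a single-point window
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

w = hamming_window(5)  # tapers from 0.08 at the edges to 1.0 in the centre
```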
Goal: impulse-like W(e^(-j omega))
Figure: Fourier transform of (a) rectangular window; (b) Hamming window.
Two factors
So the rectangular window has better main lobe properties (a narrower main lobe, hence finer frequency resolution) and the Hamming window has better attenuation properties (much lower side lobes). In practice, the Hamming window is more commonly used for speech analysis.
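The main lobe and attenuation comparison can be checked numerically. The sketch below evaluates each window's Fourier transform magnitude on a dense grid, finds where the main lobe ends, and reports the highest side lobe. The window length of 51 points and the grid size are arbitrary assumptions, and the helper names are illustrative.

```python
import cmath
import math

def hamming_window(N):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

def magnitude_db(w, omega):
    """|W(e^(j omega))| in dB, normalised so that the value at omega = 0 is 0 dB."""
    W = sum(wm * cmath.exp(-1j * omega * m) for m, wm in enumerate(w))
    return 20.0 * math.log10(max(abs(W), 1e-12) / sum(w))

def main_lobe_edge_and_peak_side_lobe(w, n_grid=2048):
    omegas = [math.pi * k / n_grid for k in range(n_grid + 1)]
    mags = [magnitude_db(w, om) for om in omegas]
    k = 1
    # Walk down the main lobe until the magnitude starts to rise again.
    while k < n_grid and mags[k] <= mags[k - 1]:
        k += 1
    edge = omegas[k - 1]                 # approximate first null
    peak_side_lobe = max(mags[k - 1:])   # highest level outside the main lobe
    return edge, peak_side_lobe

N = 51
rect_edge, rect_side = main_lobe_edge_and_peak_side_lobe([1.0] * N)
ham_edge, ham_side = main_lobe_edge_and_peak_side_lobe(hamming_window(N))
```

With these settings the rectangular window's main lobe ends at roughly half the frequency of the Hamming window's, while the Hamming window's highest side lobe is tens of dB lower.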
Window length
Consider a periodic speech signal. The average pitch period is typically between 4 ms (female) and 8 ms (male).
If window length is:
  <= 4 ms: often won't include a full pitch period, so pitch information won't show up in the spectrum.
  >= 50 ms: the signal will be changing over the course of the window.
  common: 10 - 20 ms
    T x N = 10 x 10^-3 s or 20 x 10^-3 s,
    where T = sample interval; N = # non-zero points in window
  at 10-20 ms:
    -- relatively stable vocal tract
    -- at least 1 full pitch period
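The relation T x N = window duration gives the number of window points directly once a sampling rate is chosen. A small sketch follows. The sampling rates used are assumptions for illustration, not values from the notes.

```python
def window_length_samples(duration_ms, sample_rate_hz):
    """N = duration / T, where T = 1 / sample_rate_hz is the sample interval."""
    return round(duration_ms * 1e-3 * sample_rate_hz)

# Hypothetical sampling rates: 8 kHz and 16 kHz.
n_20ms_8k = window_length_samples(20, 8000)    # 20 ms window at 8 kHz
n_10ms_16k = window_length_samples(10, 16000)  # 10 ms window at 16 kHz
```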
Different applications lead to different window lengths.
Examples:
Figure: Spectrum analysis of voiced speech using
50 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms;
(b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer,
Digital Processing of Speech Signals,
Prentice-Hall, 1978.]
Figure: Spectrum analysis of voiced speech using
5 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms;
(b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer,
Digital Processing of Speech Signals,
Prentice-Hall, 1978.]
Figure: Spectrum analysis of unvoiced speech using
50 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms;
(b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer,
Digital Processing of Speech Signals,
Prentice-Hall, 1978.]
Figure: Spectrum analysis of unvoiced speech using
5 msec Hamming and rectangular windows.
(a) and (c) show the time domain windowed waveforms;
(b) and (d) show the corresponding spectra.
After [L. R. Rabiner and R. W. Schafer,
Digital Processing of Speech Signals,
Prentice-Hall, 1978.]
Window Placement
May depend on application
Consider compression, where the Hamming windowed signal is analyzed to create a compressed signal, with the analysis parameters used to reconstruct the signal after compression.
With disjoint windows, the envelope shape of the Hamming window may be audible in the reconstructed signal.
Solution: overlapped windows
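One way to see why overlap helps is to add up the window weights that each sample receives. The sketch below does this for disjoint and for half-overlapped Hamming windows. The 50% overlap, the 8-point window, and the function names are assumptions for illustration. The notes only say that the windows should overlap.

```python
import math

def hamming_window(N):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * m / (N - 1)) for m in range(N)]

def frame_starts(n_samples, frame_len, hop):
    """Start index of every full frame when the window advances by hop samples."""
    return list(range(0, n_samples - frame_len + 1, hop))

def summed_window_envelope(n_samples, frame_len, hop):
    """Total window weight applied to each sample across all frames."""
    w = hamming_window(frame_len)
    envelope = [0.0] * n_samples
    for start in frame_starts(n_samples, frame_len, hop):
        for m in range(frame_len):
            envelope[start + m] += w[m]
    return envelope

disjoint = summed_window_envelope(32, 8, 8)    # hop = window length
overlapped = summed_window_envelope(32, 8, 4)  # hop = half the window length
interior = overlapped[4:28]                    # samples covered by two frames
```

With disjoint windows the total weight dips to 0.08 at every frame boundary, which is the envelope shape that may become audible. With half-overlapped windows the total weight over the interior stays within about 15% of its maximum.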
A question whose answer is related to the use of windows:
Are glottal pulses assumed to be an infinite pulse train, or a unit step?
I.e., is X(e^(j omega)) assumed to have constant amplitude or decaying amplitude, where X denotes the Fourier transform of the excitation signal x(n)?
The model is consistent with assuming that the glottal pulses are part of an infinite train, with the window giving a snapshot of a short time interval. The observed rolloff in the spectrum of speech signal s(n) is more readily attributed to the 6 dB/octave rolloff in the spectral envelope corresponding to the vocal tract function, H(e^(j omega)).
Frequency domain representations (continued):
The DFT evaluates the z-transform at discrete points on the unit circle in the z-plane.
The N-point DFT performs the evaluation at the N points z = e^(j omega(k)), where omega(k) = (2 pi k)/N; 0 <= k < N
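A direct sketch of this statement: evaluate the z-transform X(z) = sum over n of x(n) z^(-n) at N equally spaced points on the unit circle. This is written for clarity rather than speed (an FFT would be used in practice), and the function names are illustrative.

```python
import cmath
import math

def z_transform(x, z):
    """X(z) = sum_n x(n) z^(-n) for a finite sequence x(0)..x(N-1)."""
    return sum(xn * z ** (-n) for n, xn in enumerate(x))

def dft(x):
    """N-point DFT: X(z) sampled at z = e^(j 2 pi k / N), k = 0..N-1."""
    N = len(x)
    return [z_transform(x, cmath.exp(1j * 2.0 * math.pi * k / N)) for k in range(N)]

impulse_spectrum = dft([1.0, 0.0, 0.0, 0.0])   # flat spectrum
constant_spectrum = dft([1.0, 1.0, 1.0, 1.0])  # all energy at k = 0
```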
Radial frequencies from -pi --> pi correspond to physical frequencies from -f(s)/2 --> f(s)/2
or
Radial frequencies from 0 --> 2 pi correspond to physical frequencies from 0 --> f(s)
Conversion between radial and physical frequencies:
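The correspondence above (2 pi in radial frequency spans f(s) in physical frequency) implies a linear mapping. A minimal sketch, with illustrative function names and a hypothetical sampling rate:

```python
import math

def angular_to_physical(omega, fs):
    """Physical frequency in Hz for radial frequency omega (radians/sample)."""
    return omega * fs / (2.0 * math.pi)

def physical_to_angular(f, fs):
    """Radial frequency (radians/sample) for physical frequency f in Hz."""
    return 2.0 * math.pi * f / fs

fs = 16000.0                                # hypothetical sampling rate in Hz
nyquist_hz = angular_to_physical(math.pi, fs)  # omega = pi maps to fs/2
```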
e.g.
frequency resolution = 12800 Hz / 128 = 100 Hz between adjacent points in the DFT
e.g.
frequency resolution = 25 Hz between adjacent points in the DFT
There's a trade-off between the frequency resolution and the time resolution: long time windows give good resolution of frequencies in the sense that the spacing between adjacent points in the DFT is small. However, during the course of the long window, the signal has probably changed.
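The trade-off can be made concrete by computing both quantities for the same window. The 12800 Hz / 128-point case is the example from these notes. The 512-point case is a hypothetical setting chosen here because it also yields a 25 Hz spacing, and the function names are illustrative.

```python
def dft_frequency_resolution(fs, n_points):
    """Spacing in Hz between adjacent DFT points: fs / N."""
    return fs / n_points

def window_duration_ms(fs, n_points):
    """Time spanned by N samples at sampling rate fs, in milliseconds."""
    return 1000.0 * n_points / fs

fs = 12800.0
short = (dft_frequency_resolution(fs, 128), window_duration_ms(fs, 128))
long_ = (dft_frequency_resolution(fs, 512), window_duration_ms(fs, 512))  # hypothetical N
```

Here the finer 25 Hz spacing costs a 40 ms window instead of a 10 ms one, and 40 ms is long enough for the speech signal to change within the window.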
End of Frequency domain representations
Speech Representations
End of notes on Speech Representations