Cepstrum Analysis

Rabiner & Juang, section 4.5.2 - 3

References

  1. A. M. Noll, "Cepstrum Pitch Determination," J. Acoustical Society of America, vol. 41, 1967, p. 293.
  2. Schafer & Rabiner, "System for Automatic Formant Analysis of Voiced Speech," J. Acoustical Society of America, vol. 47, 1970, p. 634.

Assume a linear filter model in which an excitation signal x(t) drives a vocal tract filter with impulse response h(t) and transfer function H(omega), producing the speech output s(t).

s(t) = x(t) * h(t)    (where * denotes convolution)
S(omega) = X(omega)H(omega)
Goal: separate X(omega) and H(omega)

Taking the log yields:

log |S(omega)| = log [ |X(omega)| |H(omega)|] = log |X(omega)| + log |H(omega)|
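The additivity of the log magnitudes can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the toy signals x and h are hypothetical stand-ins for the excitation and the vocal tract response, chosen only so that no spectral value is exactly zero.

```python
import numpy as np

# Hypothetical toy signals (not from the notes), chosen so no DFT value is zero.
x = np.zeros(64)
x[::16] = [1.0, 0.9, 0.8, 0.7]      # decaying pulse train standing in for x(t)
h = 0.8 ** np.arange(16)            # short decaying response standing in for h(t)
s = np.convolve(x, h)               # s(t) = x(t) * h(t), length 64 + 16 - 1

n_fft = 128                         # >= len(s), so circular wrap-around cannot occur
log_X = np.log(np.abs(np.fft.rfft(x, n_fft)))
log_H = np.log(np.abs(np.fft.rfft(h, n_fft)))
log_S = np.log(np.abs(np.fft.rfft(s, n_fft)))

# Convolution in time became a product in frequency, and the log turned it into a sum.
max_err = np.max(np.abs(log_S - (log_X + log_H)))
print(f"largest deviation from additivity: {max_err:.2e}")
```

The deviation should be at the level of floating-point rounding. The zero-padding to n_fft is what makes the DFT product correspond to linear rather than circular convolution.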

log |S(omega)| has two components:

  1. a "low frequency ripple" due to the vocal tract
  2. a "high frequency ripple" due to the excitation

Figure (top): log |S(omega)|
(The figure shows log |S(omega)|^2. This emphasizes the peak that will result later, but does not change the basic effect.)

Now think of the log |S(omega)| plot purely in terms of these "ripples": forget that it is a frequency domain plot and treat it as if it were a time domain signal. To characterize how fast the ripples oscillate, we need an operator that extracts frequency information from a signal.

The Fourier transform is such an operator. Because we are starting in the frequency domain, we use the inverse Fourier transform instead; apart from scaling, it captures the same "high" vs. "low" frequency ripple information.

So taking the inverse Fourier transform:

IDFT[log |S(omega)|] = IDFT[log |H(omega)|] + IDFT[log |X(omega)|]

Figure (bottom): Cepstrum = IDFT [ log |S(omega)| ]
(Because log |S(omega)| is real and even, this is equivalent to DFT { log |DFT[s(t)]|^2 } except for scaling.)
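The "except for scaling" remark can be made concrete. For a real signal, log |S(omega)| is real and even, so its forward and inverse DFTs differ only by the factor N, and squaring the magnitude contributes a further factor of 2. The sketch below assumes NumPy and uses an arbitrary random frame purely as test input.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal(256)        # arbitrary real test frame (assumption)
N = len(s)

log_mag = np.log(np.abs(np.fft.fft(s)))      # log |S(omega)|: real and even in the bin index
cep_idft = np.fft.ifft(log_mag).real         # IDFT[log |S(omega)|]
cep_dft = np.fft.fft(2.0 * log_mag).real     # DFT[log |S(omega)|^2], since log|S|^2 = 2 log|S|

# The two results should agree once the IDFT version is scaled by 2 * N.
max_diff = np.max(np.abs(cep_dft - 2.0 * N * cep_idft))
print(f"largest difference after scaling: {max_diff:.2e}")
```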

The units for the resulting plot are ms, but this isn't really the time domain. A new name is called for: "quefrency", which suggests that this isn't frequency, but it's related to frequency.

Consider voiced speech. From log |X(omega)| we get a concentrated peak at quefrency 1/F(0), the pitch period.

From log |H(omega)| we get a component in the quefrency domain that extends from 0 ms up to about 3.7 ms. 3.7 ms corresponds to 1/(270 Hz), with 270 Hz being the lowest expected formant frequency for an adult male talker. The low frequency ripple should therefore not have frequencies (in the ripple sense) below 270 Hz, or quefrencies above 3.7 ms.

The cepstrum is defined as c(n) = IDFT[log |S(omega)|], where the index n corresponds to quefrency, measured in ms.
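The definition can be sketched in a few lines of Python, assuming NumPy. The function names real_cepstrum and synth_voiced, the Hamming window, the 8 kHz sampling rate and the synthetic test signal (a 125 Hz impulse train through a single 500 Hz resonance) are all illustrative assumptions, not part of the notes.

```python
import numpy as np

def real_cepstrum(frame, fs):
    """Return (quefrency_ms, c) with c(n) = IDFT[log |S(omega)|] for one frame."""
    windowed = frame * np.hamming(len(frame))               # taper the frame edges (assumption)
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)  # small floor avoids log(0)
    c = np.fft.ifft(log_mag).real                           # imaginary part is rounding noise
    quefrency_ms = 1000.0 * np.arange(len(frame)) / fs      # sample index n converted to ms
    return quefrency_ms, c

def synth_voiced(f0, fs, n_samples, formant_hz=500.0, r=0.95):
    """Toy voiced speech: impulse train at f0 through one two-pole resonance."""
    period = int(round(fs / f0))
    x = np.zeros(n_samples)
    x[::period] = 1.0
    n = np.arange(200)
    theta = 2.0 * np.pi * formant_hz / fs
    h = r ** n * np.sin(theta * (n + 1)) / np.sin(theta)    # resonator impulse response
    return np.convolve(x, h)[:n_samples]

fs = 8000
s = synth_voiced(125.0, fs, 1024)
q_ms, c = real_cepstrum(s[256:768], fs)     # 64 ms frame taken after the start-up transient

in_range = (q_ms >= 4.0) & (q_ms <= 16.0)   # look above the vocal tract region
peak_ms = q_ms[in_range][np.argmax(c[in_range])]
print(f"cepstral peak at {peak_ms:.3f} ms")
```

For this synthetic input the peak should fall near 1/F(0) = 8 ms, well above the region occupied by the vocal tract component.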

Figure: Examples of cepstrum analysis for voiced and unvoiced speech.
[After Schafer & Rabiner, "System for Automatic Formant Analysis of Voiced Speech," JASA, vol. 47, 1970, p. 634. ]

(For pitch extraction applications, the cepstrum is sometimes defined as IDFT[log |S(omega)|^2], which accentuates the peak due to the excitation.)

Using the cepstrum:

If we want excitation information:

Look for the 1/F(0) peak. If there is no peak in the expected range, the speech was unvoiced.
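A minimal sketch of this decision rule follows, assuming NumPy. The function name cepstral_pitch, the 60 to 400 Hz search range and the peak threshold of 0.3 are hypothetical choices for illustration; a real system would tune them on recorded speech.

```python
import numpy as np

def cepstral_pitch(frame, fs, f0_min=60.0, f0_max=400.0, threshold=0.3):
    """Return an F(0) estimate in Hz, or None if the frame looks unvoiced."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)  # floor avoids log(0)
    c = np.fft.ifft(log_mag).real

    n_lo = int(np.ceil(fs / f0_max))                    # shortest expected pitch period (samples)
    n_hi = min(int(np.floor(fs / f0_min)), len(frame) // 2)  # longest expected pitch period
    n_peak = n_lo + int(np.argmax(c[n_lo:n_hi + 1]))

    if c[n_peak] < threshold:        # no pronounced peak in range: treat as unvoiced
        return None
    return fs / n_peak               # quefrency of the peak is the pitch period

# Synthetic test inputs (assumptions): a 125 Hz impulse train through one
# 500 Hz resonance for "voiced", white noise for "unvoiced".
fs = 8000
x = np.zeros(1024)
x[::64] = 1.0
n = np.arange(200)
theta = 2.0 * np.pi * 500.0 / fs
h = 0.95 ** n * np.sin(theta * (n + 1)) / np.sin(theta)
voiced = np.convolve(x, h)[256:768]
unvoiced = np.random.default_rng(0).standard_normal(512)

f0_voiced = cepstral_pitch(voiced, fs)
f0_unvoiced = cepstral_pitch(unvoiced, fs)
print("voiced:", f0_voiced, "unvoiced:", f0_unvoiced)
```

Restricting the search to the expected pitch range keeps the low-quefrency vocal tract component from being mistaken for the excitation peak.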

If we want vocal tract information:

c(n) = IDFT[log |H(omega)|] + IDFT[log |X(omega)|].

If we can get rid of IDFT[log|X(omega)|],

then taking the Fourier transform will yield

DFT{IDFT[log |H(omega)|]} = log |H(omega)|

To get rid of IDFT[log|X(omega)|], let

l(n) = 1 for n < tau; 0 for n >= tau for some threshold tau.

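For male speech, pick tau in the range 4 ms <= tau < 8 ms, i.e., above the 3.7 ms extent of the vocal tract component and below the pitch peak (about 7.8 ms for an F(0) of 128 Hz).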

l(n) is called a lowpass "lifter."

Then

log |H(omega)| = DFT[l(n)c(n)]
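The liftering step can be sketched as follows, assuming NumPy. The function name liftered_log_spectrum, the Hamming window and the synthetic test frame are illustrative assumptions. One implementation detail goes beyond the one-sided l(n) given above: in a length-N DFT the negative quefrencies wrap around to the end of the array, so the sketch also keeps the mirror-image samples to make the result real.

```python
import numpy as np

def liftered_log_spectrum(frame, fs, tau_ms=4.0):
    """Estimate log |H(omega)| by low-pass liftering the cepstrum at tau_ms."""
    n_fft = len(frame)
    windowed = frame * np.hamming(n_fft)
    log_mag = np.log(np.abs(np.fft.fft(windowed)) + 1e-12)  # floor avoids log(0)
    c = np.fft.ifft(log_mag).real

    n_tau = int(round(tau_ms * 1e-3 * fs))   # threshold tau in samples
    lifter = np.zeros(n_fft)
    lifter[:n_tau] = 1.0                     # l(n) = 1 for n < tau
    lifter[n_fft - n_tau + 1:] = 1.0         # mirror image of n = 1 .. tau-1 (wrapped negatives)

    smooth = np.fft.fft(lifter * c).real     # DFT[l(n) c(n)]: smoothed log spectrum
    freqs_hz = np.arange(n_fft) * fs / n_fft
    return freqs_hz, log_mag, smooth

# Synthetic test frame (assumption): 125 Hz impulse train through one 500 Hz resonance.
fs = 8000
x = np.zeros(1024)
x[::64] = 1.0
n = np.arange(200)
theta = 2.0 * np.pi * 500.0 / fs
h = 0.95 ** n * np.sin(theta * (n + 1)) / np.sin(theta)
frame = np.convolve(x, h)[256:768]

freqs_hz, log_mag, smooth = liftered_log_spectrum(frame, fs, tau_ms=4.0)
half = len(frame) // 2                        # keep 0 .. fs/2 only
peak_hz = freqs_hz[np.argmax(smooth[:half])]
print(f"smoothed envelope peaks near {peak_hz:.0f} Hz")
```

The smoothed curve should show a single broad peak in the neighborhood of the 500 Hz resonance, with the pitch harmonics removed.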

Can the cepstrum components due to X(omega) always be separated so easily?

                male       female     children
avg. F(0)       128 Hz     256 Hz     265 Hz
delta           142 Hz     54 Hz      105 Hz
min F(1)        270 Hz     310 Hz     370 Hz

(delta = min F(1) - avg. F(0))

The cepstrum peak due to the voiced excitation is more likely to overlap the vocal tract portion of the cepstrum in female speech.
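Converting the F(0) and F(1) figures listed above into quefrencies makes the overlap visible. This is a small worked calculation in plain Python; the dictionary layout and variable names are illustrative only.

```python
# Average F(0) and minimum F(1) per talker group, in Hz, as listed above.
groups = {
    "male":     {"avg_f0": 128.0, "min_f1": 270.0},
    "female":   {"avg_f0": 256.0, "min_f1": 310.0},
    "children": {"avg_f0": 265.0, "min_f1": 370.0},
}

def to_ms(freq_hz):
    """Quefrency in ms corresponding to a frequency in Hz."""
    return 1000.0 / freq_hz

results = {}
for name, g in groups.items():
    pitch_peak_ms = to_ms(g["avg_f0"])      # where the excitation peak appears
    tract_extent_ms = to_ms(g["min_f1"])    # upper extent of the vocal tract component
    gap_ms = pitch_peak_ms - tract_extent_ms
    results[name] = (pitch_peak_ms, tract_extent_ms, gap_ms)
    print(f"{name:9s} pitch peak {pitch_peak_ms:5.2f} ms, "
          f"tract extent {tract_extent_ms:5.2f} ms, gap {gap_ms:5.2f} ms")
```

For male speech the average pitch peak sits roughly 4 ms beyond the vocal tract component, while for female speech the gap is under 1 ms, and the average pitch peak (about 3.9 ms) already lies below a 4 ms lifter threshold.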

Disadvantages of cepstrum analysis:

  1. Doesn't work well on female speech.
  2. Computationally expensive!

See also: Practical Notes on Computing the IDFT and Cepstrum (available in PostScript and PDF).


End of notes on Cepstrum Analysis.


