Channel Vocoders and Filter Bank Analysis

Rabiner & Juang chapter 3.1

Let N = # bands to cover the frequency range 0 --> f(s), corresponding to 0 --> 2 pi

We're interested in the first Q frequency bands, for some Q <= N/2 (i.e., extending no higher than f(s)/2)

A filter bank can be used to filter the signal into the Q bands. The output of each band can be passed through a nonlinearity -- e.g., a full-wave rectifier followed by a low-pass filter -- to obtain the energy in each band.

The resulting output of each band is a "signal" whose values are measures of the energy in that band as a function of time.
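The rectify-then-smooth step can be sketched in a few lines. This is only an illustration, assuming a moving average as the low-pass filter; the function name `band_energy_envelope`, the smoothing length, and the test sinusoid are hypothetical and not part of the notes.

```python
import math

def band_energy_envelope(band_signal, smooth_len):
    """Full-wave rectify one band, then low-pass with a moving average.

    The moving average is an assumed stand-in for the low-pass filter.
    """
    rectified = [abs(v) for v in band_signal]        # full-wave rectifier
    envelope = []
    running = 0.0
    for n, v in enumerate(rectified):
        running += v
        if n >= smooth_len:
            running -= rectified[n - smooth_len]     # sample leaving the window
        envelope.append(running / smooth_len)
    return envelope

# Hypothetical band output: a sinusoid (period 20 samples) whose amplitude
# doubles halfway through the signal.
band = [(1.0 if n < 200 else 2.0) * math.sin(2 * math.pi * n / 20)
        for n in range(400)]
env = band_energy_envelope(band, smooth_len=40)
quiet_level, loud_level = env[199], env[399]
```

The envelope is a slowly varying "signal" that tracks the band's level: `loud_level` comes out twice `quiet_level`, and each is close to 2/pi times the sinusoid's amplitude (the mean of a rectified sine).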

Figure: Channel vocoder.
After [J. L. Flanagan, M. Schroeder, B. Atal, R. Crochiere, N. Jayant, and J. Tribolet, "Speech Coding," IEEE Trans. Communications, 1979.]

Data compression can be achieved by

FFT-based implementation:

Let h(i)(n) be the impulse response of bandpass filter i.

Assume that for each band i, h(i)(n) can be written as:

h(i)(n) = w(n)e^[j*omega(i)*n]

for some window w(n), where omega(i) is the center frequency for band i.

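Then the output of the i-th filter, x(i)(n), can be written as a convolution of s(n) with h(i)(n):

x(i)(n) = sum over m [s(m) h(i)(n-m)] = sum over m [s(m) w(n-m) e^[j*omega(i)*(n-m)]]

Substituting variables and rearranging gives an expression for each x(i)(n) in terms of the short-time Fourier transform of s(n) at frequency omega(i):

x(i)(n) = e^[j*omega(i)*n] sum over m [s(m) w(n-m) e^[-j*omega(i)*m]] = e^[j*omega(i)*n] S(n)(e^[j*omega(i)])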
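The equivalence between the convolution form and the short-time Fourier transform form can be checked numerically. The sketch below assumes a causal window w(0..L-1) and a signal that is zero outside its stored samples; the function names and test data are hypothetical.

```python
import cmath
import math

def direct_filter_output(s, w, omega, n):
    """Convolution form: sum over m of s(m) h(n-m), h(n) = w(n) e^{j omega n}."""
    total = 0j
    for m in range(len(s)):
        lag = n - m
        if 0 <= lag < len(w):                 # w is assumed zero elsewhere
            total += s[m] * w[lag] * cmath.exp(1j * omega * lag)
    return total

def stft_form_output(s, w, omega, n):
    """STFT form: e^{j omega n} times sum over m of s(m) w(n-m) e^{-j omega m}."""
    total = 0j
    for m in range(len(s)):
        lag = n - m
        if 0 <= lag < len(w):
            total += s[m] * w[lag] * cmath.exp(-1j * omega * m)
    return cmath.exp(1j * omega * n) * total

# Hypothetical test data: a deterministic signal and a 16-point Hamming window.
s = [math.sin(0.3 * m) + 0.5 * math.cos(1.1 * m) for m in range(64)]
w = [0.54 - 0.46 * math.cos(2 * math.pi * k / 15) for k in range(16)]
omega = 2 * math.pi * 3 / 16
n = 40
difference = abs(direct_filter_output(s, w, omega, n)
                 - stft_form_output(s, w, omega, n))
```

Both forms give the same complex value up to rounding, since e^[j omega (n-m)] factors into e^[j omega n] times e^[-j omega m].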

Assume we're interested in evenly spaced omega(i)'s

omega(i) = (2 pi i) / N = 2 pi [f(i) / f(s)]

where N = # of filters needed to span 0 --> f(s) or 0 --> 2 pi

and Q <= N/2
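As a small worked example of the even spacing, the sketch below lists the center frequencies in radians and in Hz. It assumes the bands are indexed i = 0 .. Q-1; the function name `band_centers` is hypothetical.

```python
import math

def band_centers(fs, N, Q):
    """Return (omega_i, f_i) pairs for the first Q of N evenly spaced bands."""
    if 2 * Q > N:
        raise ValueError("Q must not exceed N/2")
    return [(2 * math.pi * i / N, i * fs / N) for i in range(Q)]

# Example values taken from later in the notes: f(s) = 10 kHz, N = 64, Q = 32.
centers = band_centers(10000.0, 64, 32)
spacing_hz = centers[1][1] - centers[0][1]   # f(s) / N
highest_hz = centers[-1][1]                  # stays below f(s) / 2
```

With these values the bands are 156.25 Hz apart and the highest center frequency is 4843.75 Hz, just under f(s)/2 = 5 kHz.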

Let m be the time index for the summation in the Fourier transform. Then for every value of m for which s(m)w(n-m) is non-zero, m can be written as

m = Nr + k
for some integers k and r, with 0 <= k <= N-1 and - infinity < r < infinity

Let sn(m) denote s(m)w(n-m).
sn(m) = windowed signal at time index n

FFT implementation procedure

  1. Window: sn(m) = s(m) w(n-m)
  2. Form cn(k) = sum over r [sn(Nr + k)], 0 <= k <= N-1
  3. Compute the N-point DFT of cn(k) --> Cn(i), 0 <= i <= N-1
  4. Modulate Cn(i) by e^[j(2 pi/N)in] to obtain the output of filter i
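The four steps above can be sketched as follows and compared against a direct evaluation of the short-time Fourier transform. This is a minimal illustration, assuming a causal window of length L > N and a signal that is zero outside its stored samples. A plain O(N^2) DFT stands in for the FFT to keep the example dependency-free; all function names and test data are hypothetical.

```python
import cmath
import math

def dft(c):
    """Plain N-point DFT (an FFT would replace this in practice)."""
    N = len(c)
    return [sum(c[k] * cmath.exp(-2j * math.pi * i * k / N) for k in range(N))
            for i in range(N)]

def folded_filter_bank(s, w, N, n):
    """Outputs of all N filters at time n via window, fold, DFT, modulate."""
    first = max(0, n - len(w) + 1)
    last = min(n, len(s) - 1)
    c = [0.0] * N
    for m in range(first, last + 1):
        sn_m = s[m] * w[n - m]        # step 1: windowed signal sn(m)
        c[m % N] += sn_m              # step 2: m = Nr + k, add into bin k
    C = dft(c)                        # step 3: N-point DFT
    return [cmath.exp(2j * math.pi * i * n / N) * C[i]   # step 4: modulate
            for i in range(N)]

def direct_filter_bank(s, w, N, n):
    """Reference: evaluate e^{j w_i n} S_n(e^{j w_i}) term by term."""
    first = max(0, n - len(w) + 1)
    last = min(n, len(s) - 1)
    out = []
    for i in range(N):
        omega_i = 2 * math.pi * i / N
        total = sum(s[m] * w[n - m] * cmath.exp(-1j * omega_i * m)
                    for m in range(first, last + 1))
        out.append(cmath.exp(1j * omega_i * n) * total)
    return out

# Hypothetical test data: N = 8 bands and a window longer than N (L = 20).
s = [math.sin(0.3 * m) + 0.5 * math.cos(1.1 * m) for m in range(64)]
w = [0.54 - 0.46 * math.cos(2 * math.pi * k / 19) for k in range(20)]
fast = folded_filter_bank(s, w, N=8, n=40)
slow = direct_filter_bank(s, w, N=8, n=40)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

The folding in step 2 works because e^[-j(2 pi/N) i (Nr + k)] equals e^[-j(2 pi/N) i k] for every integer r, so all samples that share the same k can be summed before the transform.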

Filter bank analyses can also be done with non-uniform filters, e.g., using logarithmic spacing or "critical band" filters based on perceptual models.
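As an illustration of logarithmic spacing only, the sketch below places Q center frequencies so that adjacent centers have a constant ratio. The function name and the 100 Hz to 4000 Hz range are hypothetical, and this does not implement any particular critical-band model.

```python
def log_spaced_centers(f_low, f_high, Q):
    """Q center frequencies from f_low to f_high with a constant ratio."""
    if Q < 2 or f_low <= 0 or f_high <= f_low:
        raise ValueError("need Q >= 2 and 0 < f_low < f_high")
    ratio = (f_high / f_low) ** (1.0 / (Q - 1))
    return [f_low * ratio ** i for i in range(Q)]

# Hypothetical range: 8 bands between 100 Hz and 4000 Hz.
log_centers = log_spaced_centers(100.0, 4000.0, 8)
step_ratios = [b / a for a, b in zip(log_centers, log_centers[1:])]
```

Unlike the uniform bank, these bands get wider in Hz as frequency increases, so the single-DFT shortcut for evenly spaced omega(i) does not apply directly.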

Why implement filter banks this way?

Suppose you want

f(s) = 10 kHz
N >= 64 and you want Q = 32

Alternative 1:
Why not compute S(n)(e^[jw]) with a window length L = 64?

If L = 64 --> 6.4 ms at f(s) = 10 kHz, a short window that limits frequency resolution

Alternative 2:
Let L = 128, 12.8 ms
How do you get the 32 filter outputs you want?

Compute a 128-point FFT, and look at every other point?
or sum 2 adjacent points?

Computation: L log(2)L ~ 128*7 = 896

Using the summing method:
You can choose the window length L independently of Q, as long as L >= 2Q.
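A rough operation count makes the comparison concrete. The sketch below uses the notes' n log2(n) estimate for an n-point FFT and assumes N = 2Q = 64 for the summing method; the count of L - N additions for folding is an added assumption, and windowing costs are ignored because they are the same in both cases.

```python
import math

def fft_cost(n_points):
    """Rough FFT cost estimate: n log2(n), as used in the notes."""
    return n_points * math.log2(n_points)

L = 128          # window length in samples (12.8 ms at 10 kHz)
N = 64           # assumed N = 2Q with Q = 32

cost_long_fft = fft_cost(L)            # 128-point FFT, keep every other point
cost_fold = L - N                      # additions to fold L samples into N bins
cost_summing = cost_fold + fft_cost(N)
```

Under these assumptions the 128-point FFT costs about 896 operations, while folding followed by a 64-point FFT costs about 64 + 384 = 448.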

Applications of filter bank analyses:

  1. Compression

    Baseline transmission rates for speech signals

    1. telephone quality speech
      bandwidth ~ 3000 Hz, SNR ~ 36 dB
      --> f(s) > 6 kHz, bits/sample ~ 7
      # bits per second of speech (bps) >= 42,000

    2. high quality speech
      bandwidth ~ 10,000 Hz, SNR ~ 60 dB
      --> f(s) >= 20 kHz, bits/sample ~ 11
      --> 220,000 bps

    Typical channel vocoder

    24 channels
    6 bits/channel
    40 "frames"/sec (sampling rate on energy "signal" in filter outputs = 40 Hz)

    (24) (6) (40) = 5760 bps output of filters
    for excitation: V/UV + pitch --> 7 bits x 40 frames/sec = 280 bps

    Total 6040 bps vs. 42,000 bps for uncompressed telephone quality speech

    Can compress further:

    e.g., use fewer bits for higher frequency channels

    Channel vocoders can operate as low as 2400 bps

    Intelligibility ~ 85% in informal tests
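The bit-rate arithmetic above can be reproduced with a short calculation. The function names are hypothetical; the numbers are the ones quoted in the notes.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Uncompressed rate: samples per second times bits per sample."""
    return sample_rate_hz * bits_per_sample

def channel_vocoder_bit_rate(channels, bits_per_channel, frames_per_sec,
                             excitation_bits_per_frame):
    """Band energies plus excitation (V/UV + pitch) bits, per second."""
    filter_bits = channels * bits_per_channel * frames_per_sec
    excitation_bits = excitation_bits_per_frame * frames_per_sec
    return filter_bits + excitation_bits

telephone_bps = pcm_bit_rate(6000, 7)                    # 42,000 bps
high_quality_bps = pcm_bit_rate(20000, 11)               # 220,000 bps
vocoder_bps = channel_vocoder_bit_rate(24, 6, 40, 7)     # 5760 + 280 = 6040 bps
compression_ratio = telephone_bps / vocoder_bps          # roughly 7 to 1
```

The saving comes from sending 40 frames of band energies per second instead of thousands of waveform samples per second.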

  2. Speech recognition
    IBM "centisecond processor"
    f(s) = 20 kHz
    80 bands, 100 Hz bandwidth each
    frame rate of 100 frames per second, with each frame representing 10 ms
    (or 1 "centisecond")
    implemented using the FFT method on 20 ms Hamming windowed frames
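The framing implied by these numbers can be sketched as below. It assumes the 100 Hz bands are also spaced 100 Hz apart, so that N = f(s)/100 Hz = 200, and it uses the standard Hamming window formula; the variable and function names are hypothetical and are not taken from the IBM system.

```python
import math

fs = 20000                    # sampling rate in Hz
frames_per_sec = 100          # one frame every 10 ms
window_ms = 20                # Hamming window duration

hop = fs // frames_per_sec              # samples between frame starts
L = fs * window_ms // 1000              # window length in samples
N = fs // 100                           # assumed: band spacing of 100 Hz
Q = 80                                  # bands actually used

def hamming(length):
    """Standard Hamming window: 0.54 - 0.46 cos(2 pi n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def frame_starts(num_samples, length, step):
    """Start index of every full frame that fits in the signal."""
    return list(range(0, num_samples - length + 1, step))

window = hamming(L)
starts = frame_starts(fs, L, hop)       # frames in one second of signal
```

Under these assumptions each 400-sample window overlaps its neighbor by half, and L = 400 is longer than N = 200, which is the case the summing method is designed to handle.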