Our implementation of the binaural spatializer is quite straightforward. The HRTFs were measured using a KEMAR (Knowles Electronics Mannequin for Acoustics Research), which is a high-quality dummy-head microphone. The HRTFs were measured in 10 degree elevation increments from -40 to +90 degrees [13]. In the horizontal plane (0 degrees elevation), measurements were made every 5 degrees of azimuth. In total, 710 directions were measured. The sampling density was chosen to be roughly in accordance with the localization resolution of humans. The HRTFs were measured at a 44.1 kHz sampling rate.
The raw HRTF measurements contained not only the desired acoustical response of the dummy head, but also the response of the measurement system, including the speaker, microphones, and associated electronics. In addition, the measured HRTFs contained the response of the KEMAR ear canals. This is undesirable, because the final presentation of spatialized audio to a listener will involve the listener's own ear canals, and thus a double ear canal resonance will be heard. One way to eliminate all factors which do not vary as a function of direction is to equalize the HRTFs to a diffuse-field reference [15]. This is accomplished by first forming the diffuse-field average of all the HRTFs:
$$|H_{DF}|^2 = \frac{1}{N} \sum_{i,k} |H_{\theta_i,\phi_k}|^2$$

where $H_{\theta_i,\phi_k}$ is the measured HRTF for azimuth $\theta_i$ and elevation $\phi_k$, and $N$ is the number of measured directions. $|H_{DF}|^2$ is therefore the power spectrum which would result from a spatially diffuse soundfield of white noise excitation. This formulation assumes uniform spatial sampling around the head. The HRTFs are equalized using a minimum phase filter whose magnitude is the inverse of $|H_{DF}|$. Thus, the diffuse-field average of the equalized HRTFs is flat. Figure 11 shows the diffuse-field average of the HRTFs. It is dominated by the ear canal resonance at 2-3 kHz. The low-frequency dropoff is a result of the poor low-frequency response of the measurement speaker. The inverse equalizing filter was gain limited to prevent excessive noise amplification at extreme frequencies. In addition to the diffuse-field equalization, the HRTFs were sample rate converted to 32 kHz. This was done in order to reduce the computational requirements of the spatializer. The final HRTFs were cropped to 128 points (4 msec), which was more than sufficient to capture the entire head response including interaural delays.
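The diffuse-field equalization described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the procedure actually used for the measurements: the function names, the FFT size, the hard clip used for gain limiting (`max_gain`), and the cepstral (homomorphic) construction of the minimum-phase filter are all choices made here for the example, and random data stands in for the measured responses.

```python
import numpy as np

def diffuse_field_power(hrirs, n_fft):
    """Average power spectrum over all directions (uniform weighting).

    hrirs: array of shape (num_directions, num_taps) for one ear.
    """
    spectra = np.fft.rfft(hrirs, n=n_fft, axis=-1)
    return np.mean(np.abs(spectra) ** 2, axis=0)

def inverse_min_phase_filter(df_power, n_fft, max_gain=10.0):
    """Minimum-phase impulse response with magnitude 1/sqrt(df_power).

    The inverse magnitude is clipped at max_gain; this hard clip is an
    assumed form of gain limiting, chosen for simplicity.
    """
    inv_mag = np.minimum(1.0 / np.sqrt(np.maximum(df_power, 1e-24)), max_gain)
    # Real cepstrum of the target log magnitude.
    cep = np.fft.irfft(np.log(inv_mag), n=n_fft)
    # Fold the cepstrum onto non-negative quefrencies: this keeps the
    # magnitude and makes the phase minimum.
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    return np.fft.ifft(np.exp(np.fft.fft(cep * fold))).real

# Demo with random stand-in data (not real HRTF measurements).
rng = np.random.default_rng(0)
hrirs = rng.standard_normal((20, 128))
n_fft = 512
df = diffuse_field_power(hrirs, n_fft)
eq = inverse_min_phase_filter(df, n_fft)

# Apply the equalizer at the FFT bins and re-average over directions.
eq_spectra = np.fft.rfft(hrirs, n=n_fft, axis=-1) * np.fft.rfft(eq)
flat = np.mean(np.abs(eq_spectra) ** 2, axis=0)
```

When the gain limit is not reached, `flat` is approximately 1 at every bin, which is the statement that the diffuse-field average of the equalized responses is flat.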
The spatializer convolves a monophonic input signal with a pair of HRTFs to produce a stereophonic (binaural) output. The HRTFs that are closest to the desired azimuth and elevation are used. For efficiency, the convolution is accomplished using an overlap-save block convolver [20] based on the fast Fourier transform (FFT). Because the impulse response is 128 points long, the convolution is performed in 128-point blocks, using a 256-point real FFT. The forward transforms of all HRTFs are pre-computed. For each 128-point block of input samples (every 4 msec), the forward transform of the samples is calculated, and then two spectral multiplies and two inverse FFTs are calculated to form the two 128-point blocks of output samples. In addition to the convolution, a gain multiplication is performed to control apparent distance.
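A minimal sketch of the block convolver described above, written in NumPy. The class name, method names, and the scalar `gain` parameter are hypothetical; only the block size, FFT size, and the count of transforms per block are taken from the text.

```python
import numpy as np

BLOCK = 128        # block length, equal to the HRTF length
NFFT = 2 * BLOCK   # 256-point real FFT

class OverlapSaveSpatializer:
    """Convolves mono input blocks with a left/right pair of responses."""

    def __init__(self, hrir_left, hrir_right, gain=1.0):
        # Forward transforms of the responses are computed once, up front.
        self.H = np.stack([np.fft.rfft(hrir_left, NFFT),
                           np.fft.rfft(hrir_right, NFFT)])
        self.prev = np.zeros(BLOCK)  # previous input block (the "saved" part)
        self.gain = gain             # assumed scalar distance gain

    def process(self, block):
        """Take 128 input samples, return a (2, 128) stereo output block."""
        buf = np.concatenate([self.prev, block])
        self.prev = np.array(block, dtype=float)
        X = np.fft.rfft(buf)                      # one forward transform
        y = np.fft.irfft(X * self.H, NFFT, axis=-1)  # two multiplies, two inverses
        # The first 128 outputs are corrupted by circular wrap-around;
        # the last 128 are valid linear-convolution samples.
        return self.gain * y[:, BLOCK:]

# Usage: process four blocks of noise with random stand-in responses.
rng = np.random.default_rng(1)
h_l, h_r = rng.standard_normal(BLOCK), rng.standard_normal(BLOCK)
x = rng.standard_normal(4 * BLOCK)
spat = OverlapSaveSpatializer(h_l, h_r)
out = np.concatenate(
    [spat.process(x[i:i + BLOCK]) for i in range(0, len(x), BLOCK)], axis=1)
```

The output matches a direct time-domain convolution of the input with each response, truncated to the input length, which is a convenient check when experimenting with the block size.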
It is essential that the position of the source can be changed smoothly without introducing clicks into the output. This is easily accomplished as follows. Every 12 blocks (48 msec) the new source position is sampled and a new set of HRTFs is selected. The input block is convolved with both the previous HRTFs and the new HRTFs, and the two results are crossfaded using a linear crossfade. This assumes reasonable correlation between the two pairs of HRTFs. Subsequent blocks are processed using the new HRTFs until the next position is sampled. The sampling rate of position updates is about 20 Hz, which is quite adequate for slow-moving sources.
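The update scheme can be sketched as follows. The function `render` and the callback `hrtf_for_block` are hypothetical names introduced for this example; the callback stands in for whatever mechanism samples the source position and returns the pre-computed spectra of the selected HRTF pair.

```python
import numpy as np

BLOCK = 128
NFFT = 2 * BLOCK
UPDATE_EVERY = 12  # blocks between position updates (48 msec at 32 kHz)

def convolve_block(buf, H):
    """Overlap-save step: 256 input samples in, (2, 128) valid outputs."""
    return np.fft.irfft(np.fft.rfft(buf) * H, NFFT, axis=-1)[:, BLOCK:]

def render(x, hrtf_for_block):
    """Spatialize mono signal x (length a multiple of BLOCK).

    hrtf_for_block(k) is a hypothetical callback returning the (2, 129)
    spectra of the HRTF pair for the source position at block k.
    """
    fade_in = np.arange(BLOCK) / BLOCK   # linear ramp from 0 up to just under 1
    fade_out = 1.0 - fade_in
    prev = np.zeros(BLOCK)
    H_cur = hrtf_for_block(0)
    out = np.empty((2, len(x)))
    for k in range(len(x) // BLOCK):
        block = x[k * BLOCK:(k + 1) * BLOCK]
        buf = np.concatenate([prev, block])
        prev = block
        y = convolve_block(buf, H_cur)
        if k > 0 and k % UPDATE_EVERY == 0:
            # Position update: convolve the same input with the new pair
            # and crossfade from the old result to the new one.
            H_new = hrtf_for_block(k)
            y = fade_out * y + fade_in * convolve_block(buf, H_new)
            H_cur = H_new
        out[:, k * BLOCK:(k + 1) * BLOCK] = y
    return out

def delta_pair(g):
    """Toy 'HRTF' pair: a pure gain g in both ears (illustration only)."""
    h = np.zeros(BLOCK)
    h[0] = g
    return np.stack([np.fft.rfft(h, NFFT)] * 2)

# Usage: a constant input whose gain changes from 1 to 3 at block 12.
x = np.ones(14 * BLOCK)
out = render(x, lambda k: delta_pair(1.0) if k < 12 else delta_pair(3.0))
```

With these toy responses the output stays at 1 for the first 12 blocks, ramps linearly over block 12, and stays at 3 afterwards, so no step discontinuity appears at the update. The cost is one extra pair of spectral multiplies and inverse FFTs on each update block.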