
3 Sound Localization

This thesis concentrates on the idea of Structured Sound. Such a system requires a way of delivering sound in real time so that it appears to come from a particular location. If an image appears on the left side of the room, two meters in front of the audience, then the sound should also appear to come from the left side of the room, two meters in front of the audience. This procedure is called localization of sound.

Engineers have tried for years to design systems that synthesize directional sound. Research in this field is split between two methods of delivering it. The first uses binaural cues delivered at the ears to synthesize sound localization; the second uses the spatial separation of speakers to deliver localized sound. A binaural cue relies on the fact that a listener hears sound at two different locations, namely the ears. At low frequencies, localization cues are given by interaural phase differences: the phase difference between the signals heard at the two ears indicates the location of the sound source. At frequencies where the wavelength is shorter than the ear separation, phase cues cannot be used; interaural intensity difference cues are used instead, since the human head absorbs high frequencies. Using these cues and a model of the head, a system can be implemented that gives the illusion of sounds being produced at a certain location. Head Related Transfer Functions (HRTFs) [Wenzel 92] have been used extensively to deliver localized sound through headphones. The HRTFs depend on ear separation, the shape of the head, and the shape of the pinna. For each head description, an HRTF system produces accurate sound localization through headphones. However, because of the system's dependence on ear separation, no two people can hear the same audio stream and "feel" the objects at the same location. A different audio stream would be needed for each listener, depending on his/her head description.
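To give a sense of the magnitudes involved, the sketch below estimates these two binaural cues using the classic Woodworth spherical-head approximation. The head radius, the model itself, and all names here are illustrative assumptions; SSSound itself does not use binaural rendering.

```python
# Illustrative only: rough magnitudes of the binaural cues described above,
# using the Woodworth spherical-head approximation (an assumption; not part
# of the SSSound system).
import math

SPEED_OF_SOUND = 343.0   # m/s at room temperature
HEAD_RADIUS = 0.0875     # m, typical adult head (assumed)

def interaural_time_difference(azimuth_rad: float) -> float:
    """Woodworth ITD model: ITD = (a/c) * (theta + sin(theta))."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (azimuth_rad + math.sin(azimuth_rad))

# Phase cues become ambiguous once the wavelength drops below the ear
# separation (roughly 2 * head radius); above that frequency the head's
# shadow (interaural intensity difference) takes over.
crossover_hz = SPEED_OF_SOUND / (2 * HEAD_RADIUS)

if __name__ == "__main__":
    print(f"ITD at 90 degrees: {interaural_time_difference(math.pi / 2) * 1e6:.0f} us")
    print(f"Phase cues unreliable above ~{crossover_hz:.0f} Hz")
```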

One of the requirements of this thesis was to produce sound localization for a group of people simultaneously. Therefore, the author could not use binaural cues to localize sound sources. The second method of sound localization was used instead: delivering localization cues through the spatial distribution of speakers. Here the system does not rely on the audience having two ears; rather, it relies on intensity panning between adjacent speakers to deliver the cue. An approach developed by Bill Gardner of the Media Laboratory's Perceptual Computing Group was therefore adopted. Gardner designed a virtual acoustic room [Gardner 92] using six speakers and a Motorola 56001 digital signal processor for each speaker, on a Macintosh platform.

This work is not identical to Gardner's. Because of design constraints, the author was limited to Digital's Alpha platform, with the LoFi card [Levergood 9/93] as the primary means of delivering CD-quality sound. The LoFi card contains two 8 KHz telephone-quality CODECs and a Motorola 56001 DSP chip with 32K 24-bit words of memory shared with the host processor. The 56001 serial port supports a 44.1 KHz stereo DAC. Using only one Alpha with one sound card thus limited the number of speakers to two[1]. Those two speakers would have to be placed in front of the listener, which might limit the locations at which sound can be localized. Gardner's model assumes that the listener is in the middle of the room and that sounds can be localized anywhere around him; he uses six speakers spaced equally around the listener for this purpose. However, from a visual standpoint, people view three-dimensional movies through a window, namely the screen. The audience is always on one side of the window and can only see what happens through it. Therefore, it may not be too much of a restriction to limit sound localization to the front of the audience. The system has two real speakers and one virtual speaker behind the listener that is not processed, but is needed to ensure proper calculations.

Two major processes are executed to render sound localization from the spatial location of speakers. The first is the simulation of the early reverberation response, hereafter referred to as the Echo process. This models the direct reflections from the walls of a room. The Echo process produces an FIR filter for every speaker representing the delays of all the virtual sources in the room. The second is the simulation of the late, or diffuse, reverberation, hereafter referred to as the Reverberation process. This models the steady-state room noise from all the sounds in a room and their echoes. It is not directional as the Echo is; rather, it creates the general feel of the acoustic quality of a room.

3.1 The Echo Process

The echo process is an attempt to simulate all the virtual sound sources that result from sounds bouncing off walls. It is divided into three procedures, sketched below. The first, for every sound source in the room, calculates the virtual sources beyond the room that result from reflections off the walls. The second, for every sound source and speaker, calculates an FIR filter representing the delays of the virtual sources that need to be projected from that speaker. The third is the pruning stage for real-time purposes, where the FIR filter is pruned to reduce the number of taps, and thereby the calculations needed in the real-time rendering of the sound streams.
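The three procedures can be pictured as the following pipeline. This is only an organizational sketch; the function names and signatures are assumptions for illustration, not SSSound's actual interfaces. Each stage is detailed in the sections that follow.

```python
# An organizational sketch of the three Echo-process stages; names and
# signatures are illustrative assumptions, not SSSound's interfaces.

def compute_virtual_sources(source, room):
    """Stage 1: mirror the source across the walls (section 3.1.1)."""
    ...

def build_speaker_filter(images, speaker, listener):
    """Stage 2: intensity-pan each virtual source onto one speaker's
    FIR taps (section 3.1.2)."""
    ...

def prune_filter(taps, max_taps):
    """Stage 3: merge and discard taps for real-time use (section 3.1.3)."""
    ...

def echo_process(source, room, speakers, listener, max_taps):
    images = compute_virtual_sources(source, room)
    return {spk: prune_filter(build_speaker_filter(images, spk, listener),
                              max_taps)
            for spk in speakers}
```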

3.1.1 Virtual Sources

The first stage of the Echo process is to compute all the virtual sources in the room for every sound stream in the room. Since this system is confined to rectangular rooms (see section 4.1.2 for more detail), the procedure is simple (Figure 3-1). The program loops through the number of reflections, from zero to the user-defined max_reflections for the room. For each number of reflections, it loops through every possible reflection index on the x-axis walls and calculates the z-axis wall reflections needed to make up the total. Once the program has determined the x-axis and z-axis locations, it calculates the attenuation coefficient for each location due to the reflections off the walls. The result depends on the number rather than the order of reflections on each wall. To count the reflections on each wall, the program divides the axis location by two: the whole-number result is the number of paired reflections on both walls of that axis, and the remainder indicates any single reflection beyond the pairs. The sign of the remainder identifies the wall on which the extra reflection occurred; for example, a remainder of x = -1 would mean a bounce off wall 3 (the minus-x wall) and not wall 1 (the plus-x wall). As a general example, assume the program is calculating 11 reflections, at x-reflect-axis -5 and z-reflect-axis +6. The program computes the following:

Figure 3-1: Echo effects of sources in a rectangular room. The BOLD lines represent the actual room and source location, while the normal lines represent the echoed rooms/source locations. The axis is the number of reflections on the perpendicular walls. For example, x=3 means 2 reflections on wall 2 and 1 on wall 4. NOTE: The walls are marked in clockwise order starting from the minus-z location, and the listener is facing the minus-z direction.

For x-reflect-axis = -5, dividing |-5| by two gives 2 paired reflections (two bounces off each x-axis wall) with remainder -1, that is, one extra bounce off the minus-x wall, for a total of five x-axis reflections. For z-reflect-axis = +6, there are 3 paired reflections (three bounces off each z-axis wall) with no remainder, for a total of six. Five plus six gives the 11 reflections. The attenuation coefficient is therefore

$$a = \mathrm{ref\_coef}_{+x}^{2} \cdot \mathrm{ref\_coef}_{-x}^{3} \cdot \mathrm{ref\_coef}_{+z}^{3} \cdot \mathrm{ref\_coef}_{-z}^{3}$$

The above product is the attenuation coefficient of the virtual source, where ref_coef is the reflection coefficient of a wall. The tap amplitude of that virtual source is multiplied by this attenuation coefficient.
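This stage can be sketched as follows, assuming a room spanning [0, Lx] x [0, Lz] with one reflection coefficient per wall. All names and the coordinate convention are illustrative assumptions rather than SSSound's actual code; the wall-count logic mirrors the divide-by-two rule described above.

```python
# A sketch of the first Echo stage (image-source enumeration) under stated
# assumptions: rectangular room spanning [0, Lx] x [0, Lz], source at
# (sx, sz), and per-wall reflection coefficients.
from dataclasses import dataclass
from itertools import product

@dataclass
class VirtualSource:
    x: float
    z: float
    attenuation: float  # product of ref_coef over every wall bounce

def wall_counts(m: int) -> tuple[int, int]:
    """Reflections on the (plus, minus) wall of one axis for image index m.

    |m| // 2 paired bounces hit both walls; the sign of the remainder picks
    which wall takes the single extra bounce, as described in the text.
    """
    pairs, extra = divmod(abs(m), 2)
    plus = pairs + (extra if m > 0 else 0)
    minus = pairs + (extra if m < 0 else 0)
    return plus, minus

def virtual_sources(sx, sz, Lx, Lz, ref, max_reflections):
    """Enumerate image sources with at most max_reflections wall bounces.

    `ref` maps '+x', '-x', '+z', '-z' to each wall's reflection coefficient.
    """
    sources = []
    r = range(-max_reflections, max_reflections + 1)
    for mx, mz in product(r, r):
        if abs(mx) + abs(mz) > max_reflections:
            continue
        # Mirrored position: even indices keep the source offset,
        # odd indices flip it across the nearest wall.
        x = mx * Lx + sx if mx % 2 == 0 else (mx + 1) * Lx - sx
        z = mz * Lz + sz if mz % 2 == 0 else (mz + 1) * Lz - sz
        px, nx = wall_counts(mx)
        pz, nz = wall_counts(mz)
        a = (ref['+x'] ** px * ref['-x'] ** nx *
             ref['+z'] ** pz * ref['-z'] ** nz)
        sources.append(VirtualSource(x, z, a))
    return sources
```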

3.1.2 Speaker Sound

The next step in the process is to create an FIR filter for each speaker, representing the delays from all the sources in the room. From the list of virtual sources, the program picks out the sources that could be projected from each speaker, and uses intensity panning between adjacent speakers to achieve the desired spatial localization of the virtual sources [Theile 77]. Moreover, since the listener is not constrained to any particular orientation, it is unclear how phase information could be used to aid in the localization of sound.

Figure 3-2: Intensity panning between adjacent speakers.

Figure 3-2 depicts one of the virtual sources in the system between two speakers. This virtual source contributes a tap delay to speakers A and B, but not to any other speaker. The tap delay is proportional to the difference between the distances from the listener to the speaker and to the virtual source. The tap amplitudes depend on the same distances as well as on the angle spans.

The formula for this system is as follows:

$$\text{tap delay} = \frac{d - r}{c}, \qquad \text{tap amplitude} = \frac{r}{d}\,a, \qquad a = \prod_{j \in S} \Gamma_j$$

where:

d is the distance to the source in meters,

r is the distance to the speakers in meters,

c is the speed of sound in meters per second,

a is the amplitude of the virtual source relative to the direct sound,

S is the set of walls that sound encounters, and

Γ_j is the reflection coefficient of the jth wall.



A couple of comments are worth noting before the sketch that follows:

* The value of a was calculated when the program found each echoed source, and was stored in the sound source description.

* This result assumes that the listener, speakers, and virtual sources all lie in the same horizontal plane, and that the speakers are all equidistant from the listener.

* The speaker locations are fixed with respect to the front of the listener. Therefore, if the listener is facing a direction other than minus-z in the virtual space, the speaker locations must be rotated by the same amount and direction before any of the above calculations can be performed.
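Putting the formula and the comments above together, the per-source tap computation might look like the following sketch. The linear panning law between the two adjacent speakers is an assumption for illustration; the text only states that the amplitudes depend on the distances and the angle spans.

```python
# A sketch of the per-virtual-source tap computation, assuming a linear
# intensity-panning law between the two adjacent speakers (the exact
# panning gains used in SSSound are not reproduced here).

SPEED_OF_SOUND = 343.0  # c, meters per second

def source_tap(d, r, a, theta, theta_a, theta_b):
    """Tap delay and per-speaker amplitudes for one virtual source.

    d        distance from listener to the virtual source (m)
    r        distance from listener to the (equidistant) speakers (m)
    a        amplitude relative to the direct sound (product of wall
             reflection coefficients, stored with the source)
    theta    azimuth of the virtual source (radians)
    theta_a, theta_b
             azimuths of the two adjacent speakers, theta_a <= theta <= theta_b
    """
    delay = (d - r) / SPEED_OF_SOUND          # seconds after the direct path
    amplitude = a * r / d                     # 1/d spreading, referenced to r
    frac = (theta - theta_a) / (theta_b - theta_a)  # 0 at speaker A, 1 at B
    return delay, amplitude * (1.0 - frac), amplitude * frac
```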

3.1.3 Pruning the Filters

A typical setup of a rectangular room might have the maximum reflections set to eight. This would give 64 filter taps. While there is no hard system limit on the number of taps, the more taps the filter has, the longer the program takes to compute the result of the filter over the sound samples. In a real-time environment, every reasonable care should be taken to make the system compute its result as fast as possible. Therefore, to enhance real-time performance, the following procedure is used to intelligently reduce the number of filter taps:

* Adjacent filter taps within 1 millisecond of each other are merged to form a new tap with the same energy. If the original taps are at times t0 and t1, with amplitudes a0 and a1, the merged tap is created at time t2 with amplitude a2 as follows:

$$t_2 = \frac{a_0^2 t_0 + a_1^2 t_1}{a_0^2 + a_1^2}, \qquad a_2 = \sqrt{a_0^2 + a_1^2}$$

* Filter taps are then sorted by amplitude, and a system-defined number of the highest-amplitude taps are kept.

The pruning process tends to eliminate distant virtual sources as well as weak taps resulting from panning. It should not affect the system quality if the maximum number of filter taps is set to at least 50 or so; the higher the max-number-taps, the better the quality. However, if real-time performance is hampered, a lower max-number-taps is advised. A sketch of the pruning procedure follows.
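A minimal sketch of the two pruning steps, assuming the energy-preserving merge reconstructed above:

```python
# Merge taps closer than 1 ms (preserving energy), then keep only the
# strongest max_taps taps. Names are illustrative, not SSSound's code.
import math

def prune_taps(taps, max_taps, merge_window=0.001):
    """taps: list of (time_s, amplitude); returns at most max_taps taps."""
    taps = sorted(taps)                        # order by time
    merged = []
    for t, a in taps:
        if merged and t - merged[-1][0] < merge_window:
            t0, a0 = merged[-1]
            energy = a0 * a0 + a * a           # total energy is preserved
            merged[-1] = ((a0 * a0 * t0 + a * a * t) / energy,  # t2
                          math.sqrt(energy))                    # a2
        else:
            merged.append((t, a))
    merged.sort(key=lambda ta: abs(ta[1]), reverse=True)
    return merged[:max_taps]                   # keep the strongest taps
```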

3.2 The Reverberation Process

Rooms do not produce just direct reflections off walls; they also have a general steady-state noise level from all the sounds produced in the room. This noise gives a general feeling of the acoustic quality of the room and is referred to as Reverberation. Rendering this reverberant response is a task that has confounded engineers for a long time. The general conception has been that if the impulse response of a room is known, then the reverberation from many sound sources in that room can be computed. Moorer determined that an exponentially decaying noise sequence is an effective-sounding impulse response for a diffuse reverberator [Moorer 79]. Rendering such a reverberator requires performing large convolutions. At the time of Gardner's system development, the price/performance ratio of DSP chips was judged too high to warrant a real-time reverberator system at reasonable cost. Perhaps the ratio is now low enough to allow for real-time reverberator systems at reasonable cost; a system incorporating such chips would convolve the input with an actual impulse response of a room, or with a simulated response using noise shaping. However, no such DSP chip exists for the Alpha platform. Thus the system implements efficient reverberators for real-time performance. This requires using infinite impulse response filters, such as comb and allpass filters.

Two considerations governed the choice among the many possible combinations of filters. The first was the stability of the system at all frequencies. The second was that the system should increase the number of echoes generated in response to an impulse, since in a real room echoes, though they subside, increase in number. Nested allpass filters were therefore chosen as the basis for building the reverberator, since they satisfy both criteria. For more detail on the mathematics and the creation of different reverberators, refer to [Gardner 95] and [Gardner 92].

The design of the nested allpass filters used in the system is modeled in Figure 3-3, where X is the input, Y is the output, g is the gain, and G(z) is simply a delay. This allpass filter is the building block. The result of cascading these filters together is not a good-sounding reverberator; its response is metallic and sharp sounding. However, when some of the output of the cascaded allpass system is fed back to the input through a moderate delay, great-sounding reverberators are achieved. The harsh, metallic feel of the systems without the feedback is eliminated partly because of the increased echoes due to the feedback loop. Moreover, adding a lowpass filter to the feedback loop simulates the lowpass effect of air absorption. This newer system has the form shown in Figure 3-4.

Figure 3-3: Allpass flow diagram with samples taken from the interior of the allpass delay line.

Figure 3-4: A generalized allpass reverberator with a lowpass filter in the feedback loop and multiple weighted output taps.

The system represents a set of cascaded allpass filters with a feedback loop containing a lowpass filter. The output is taken from a linear combination of the outputs of the individual allpass filters. Each of the individual allpass filters can itself be a set of cascaded or nested allpass filters. The system as a whole is not allpass, because of the feedback loop and the lowpass filter. Stability is achieved if the lowpass filter has magnitude less than 1 at all frequencies and g (the gain) < 1.

From this general structure, many systems can be designed. The key to creating good-sounding reverberators is not mathematics but the ear: the basic criterion for a good reverberator is whether or not it sounds good. Since the ear is good at detecting patterns, the job of a good reverberator is to elude this pattern-recognition process. The reverberators used in SSSound have therefore been empirically designed to sound good. They are taken from Gardner's Master's thesis [Gardner 92]. None of them is a mathematical creation; rather, they are the result of laborious hand tweaking to produce good-sounding reverberators.

To simplify the representation of nested allpass reverberators, the schematic notation of Figure 3-5 is used. The top of the figure (3-5a) shows the procedure used to perform the allpass filtering: the feed-forward multiply-accumulate through -g occurs before the feedback calculation. Figure 3-5b shows a simple nested allpass system (for instructional purposes only). The input enters a delay line at the left side, where it is processed by a single allpass followed by a doubly nested allpass. The allpass delays are measured in milliseconds, and the gains are given in parentheses. The signal first experiences a 20 millisecond delay, then a 30 millisecond allpass with a gain of 0.5. It then passes through another 5 milliseconds of delay, followed by a 50 millisecond allpass of gain 0.7 that contains a 20 millisecond allpass of gain 0.1.

Figure 3-5: A detailed description of the allpass procedure and an example of a reverberator system. a) (top) A schematic of the allpass procedure, where the forward multiply-accumulate (of -g) happens before the feedback calculation through +g. b) (bottom) An instructional cascaded allpass system.
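The allpass procedure of Figure 3-5a, and the instructional cascade of Figure 3-5b, can be sketched in code as follows. The class names are illustrative, and one simplification is assumed: the inner allpasses are applied in series at the end of the outer delay path, rather than at an interior point of the outer delay line as in Gardner's schematics.

```python
# A minimal sketch of the allpass procedure and the Figure 3-5b cascade,
# at the 44.1 kHz rate of the LoFi DAC. The feed-forward multiply-
# accumulate through -g is computed first; its result then feeds back
# into the delay line through +g, matching the order described above.
from collections import deque

FS = 44100                                    # samples per second

def ms(t: float) -> int:
    """Convert a delay in milliseconds to whole samples."""
    return int(round(t * FS / 1000.0))

class Delay:
    def __init__(self, n: int):
        self.line = deque([0.0] * n, maxlen=n)
    def process(self, x: float) -> float:
        y = self.line[0]                      # oldest sample leaves
        self.line.append(x)
        return y

class Allpass:
    """Allpass section; `inner` filters sit inside its delay path."""
    def __init__(self, n: int, g: float, inner=()):
        self.g, self.inner = g, inner
        self.line = deque([0.0] * n, maxlen=n)
    def process(self, x: float) -> float:
        w = self.line[0]
        for f in self.inner:                  # nested allpasses, if any
            w = f.process(w)
        y = -self.g * x + w                   # feed-forward MAC through -g
        self.line.append(x + self.g * y)      # then feedback through +g
        return y

# Figure 3-5b: 20 ms delay, 30 ms allpass (0.5), 5 ms delay, then a
# 50 ms allpass (0.7) containing a 20 ms allpass (0.1).
chain = [Delay(ms(20)), Allpass(ms(30), 0.5), Delay(ms(5)),
         Allpass(ms(50), 0.7, inner=(Allpass(ms(20), 0.1),))]

def process_sample(x: float) -> float:
    for stage in chain:
        x = stage.process(x)
    return x
```

With no inner filters, each section realizes the transfer function (-g + z^-D) / (1 - g z^-D), which is allpass and stable for |g| < 1.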

In a general reverberator such as the one described in Figure 3-4, the only variable in the system is the gain of the feedback loop. Tweaking this gain gives different reverberation responses. However, this one variable is not enough to simulate all the sizes of rooms a system can encounter, and it is highly unlikely that a single such reverberator could be designed to simulate all types and sizes of rooms. Gardner suggested three reverberators, one each for small, medium, and large rooms. The acoustic size of a room can be established by its reverberation time, which is proportional to the volume of the room and inversely proportional to the average absorption of all its surfaces.

The following formula is a method of calculating the reverberation time (T) of a room:

$$T = \frac{55.2\,V}{c\,a'}, \qquad a' = -S \ln(1 - \bar{\alpha}), \qquad \bar{\alpha} = \frac{1}{S}\sum_i S_i \alpha_i, \qquad \alpha_i = 1 - \Gamma_i^2$$

where:

T is the reverb time in seconds,

c is the speed of sound in meters per second,

V is the volume of the room in meters cubed,

a' is the metric absorption in meters squared,

S is the total surface area in meters squared,

ᾱ is the average power absorption of the room,

S_i, α_i are the surface area and power absorption of wall i, and

Γ is the pressure reflection of a material.



The above formula is used to calculate the reverberation time of the room so as to know which reverberator to use. The following table shows the reverberation time range for each room:

Reverberator    Reverberation Time (sec)
small           0.38 to 0.57
medium          0.58 to 1.29
large           1.30 and above
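A sketch of this decision, using the reconstructed formula above and the table's boundaries (the Norris-Eyring form and its constant should be treated as assumptions inferred from the listed variables):

```python
# Compute the reverberation time of a room and pick the matching
# reverberator. Names are illustrative, not SSSound's code.
import math

SPEED_OF_SOUND = 343.0  # c, meters per second

def reverb_time(volume, walls):
    """walls: list of (surface_area_m2, pressure_reflection_gamma)."""
    S = sum(area for area, _ in walls)
    # Power absorption of each wall from its pressure reflection:
    # alpha = 1 - gamma^2; average is weighted by surface area.
    alpha_bar = sum(area * (1.0 - g * g) for area, g in walls) / S
    a_metric = -S * math.log(1.0 - alpha_bar)   # metric absorption a'
    return 55.2 * volume / (SPEED_OF_SOUND * a_metric)

def pick_reverberator(T):
    """Table boundaries above; rooms below 0.38 s are treated as small."""
    if T < 0.58:
        return "small"
    elif T < 1.30:
        return "medium"
    return "large"
```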

Figure 3-6 shows the three reverberators used in SSSound.

Figure 3-6: Diffuse reverberators used in SSSound for small, medium, and large rooms. See Figure 3-5 for a detailed explanation of the schematic notation. These reverberators were designed by W. Gardner [Gardner 92].




