This thesis concentrates on the idea of Structured Sound. A system such as this requires a means of delivering sound in real time so that it appears to come from a particular location. If an image appears on the left side of the room, two meters in front of the audience, then the sound should also appear to come from the left side of the room, two meters in front of the audience. This procedure is called localization of sound.
For years, engineers have tried to design a system to synthesize directional sound. Research in this field is split between two methods of delivering such directional sound. The first uses binaural cues delivered at the ears to synthesize sound localization, while the second uses the spatial separation of speakers to deliver localized sound. A binaural cue relies on the fact that a listener hears sound at two different locations, namely the two ears. At low frequencies, localization cues are given by interaural phase differences: the phase difference between the signals heard at the two ears is an indication of the location of the sound source. At frequencies where the wavelength is shorter than the ear separation, phase cues cannot be used; interaural intensity difference cues are used instead, since the human head absorbs high frequencies. Using knowledge of these cues and a model of the head, a system can be implemented to give the illusion of sounds being produced at a certain location. Head-Related Transfer Functions (HRTFs) [Wenzel 92] have been used extensively to deliver localized sound through headphones. The HRTFs depend on the ear separation, the shape of the head, and the shape of the pinna. For each head description, the HRTF system produces accurate sound localization over headphones. However, because of the system's dependence on ear separation, no two people can hear the same audio stream and "feel" the objects at the same location. There would need to be a different audio stream for each listener, depending on his/her head description.
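For illustration, the low-frequency interaural phase/time cue described above can be approximated with the Woodworth spherical-head model. This is a textbook approximation, not part of SSSound, and the head radius used below is an assumed typical value:

```python
import math

def interaural_time_difference(azimuth_rad, head_radius=0.0875, c=343.0):
    """Woodworth approximation of the interaural time difference (ITD)
    for a spherical head of the given radius (meters).
    azimuth_rad: source azimuth from straight ahead, in radians."""
    return (head_radius / c) * (azimuth_rad + math.sin(azimuth_rad))

# A source 45 degrees to one side arrives roughly 0.38 ms earlier
# at the near ear than at the far ear.
itd = interaural_time_difference(math.pi / 4)
```

At frequencies whose period is shorter than this delay, the phase relation becomes ambiguous, which is why intensity cues take over.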
One of the requirements of this thesis was to produce sound localization for a group of people simultaneously. Therefore, the author could not use binaural cues to localize sound sources. The second method of sound localization was used instead: the delivery of localization cues through the spatial distribution of speakers. Here the system does not rely on the fact that the audience has two ears; rather, it relies on intensity panning between adjacent speakers to deliver the cue. Thus an approach developed by Bill Gardner of the Media Laboratory's Perceptual Computing Group was adopted. Gardner developed a design for a virtual acoustic room [Gardner 92] using 6 speakers and a Motorola 56001 digital signal processor for each speaker, on a Macintosh platform.
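As an illustration of intensity panning between adjacent speakers, a common constant-power panning law might look like the following sketch (the exact gains used in Gardner's design may differ):

```python
import math

def constant_power_pan(pan):
    """pan in [0, 1]: 0 = left speaker only, 1 = right speaker only.
    A constant-power law keeps the summed power (and thus perceived
    loudness) steady as a source moves between two adjacent speakers."""
    left = math.cos(pan * math.pi / 2)
    right = math.sin(pan * math.pi / 2)
    return left, right

# A centered source drives both speakers at sqrt(2)/2 of full gain.
l, r = constant_power_pan(0.5)
```

Because left^2 + right^2 = 1 for every pan position, the cue moves smoothly between the speakers without a loudness dip at the center.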
This work is not identical to Gardner's. Because of design constraints, the author was limited to Digital's Alpha platform and to the LoFi card [Levergood 9/93] as the primary means of delivering CD-quality sound. The LoFi card contains two 8 KHz telephone-quality CODECs and a Motorola 56001 DSP chip with 32K 24-bit words of memory shared with the host processor. The 56001 serial port supports a 44.1 KHz stereo DAC. Thus, using only one Alpha with one sound card limited the number of speakers to two. Those two speakers would have to be placed in front of the listener, which might limit the locations at which sound can be localized. Gardner's model assumes that the listener is in the middle of the room and that sounds can be localized anywhere around him; he uses 6 speakers spaced equally around the listener for this purpose. However, from a visual standpoint, people view three-dimensional movies through a window, namely the screen. The audience is always on one side of the window and can only see what is happening through it. Therefore, it may not be too much of a restriction to limit sound localization to the front of the audience. The system would have two real speakers and one virtual speaker behind the listener that is not processed, but is needed to ensure proper calculations.
There are two major processes executed to render sound localization from the spatial location of speakers. The first is the simulation of the early reverberation response, hereafter referred to as the Echo process. This models the direct reflections from the walls of a room. The Echo process produces an FIR filter for every speaker representing the delay of all the virtual sources in the room. The second is the simulation of the late reverberation, or diffuse reverberation, hereafter referred to as the Reverberation process. This models the steady-state reverberant sound produced by all the sources in a room and their echoes. Unlike the Echo, it is not directional; rather, it creates the general feel of the acoustic quality of a room.
Figure 3-1: Echo effects of sources in a rectangular room. The BOLD lines represent the actual room and source location, while the normal lines represent the echoed rooms/source locations. The axis values give the number of reflections off the perpendicular walls. For example, x=3 means 2 reflections off wall 2 and 1 off wall 4. NOTE: The walls are numbered in clockwise order starting from the minus-z location, and the listener is facing the minus-z direction.
The attenuation coefficient of a virtual source is the product of the reflection coefficients of all the walls its path reflects off: a virtual source whose path reflects ni times off wall i is attenuated by the product over all walls of ref_coef_i^ni, where ref_coef is the reflective coefficient of a wall. The attenuation coefficient is multiplied by the tap amplitude of that virtual source.
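The enumeration of virtual (echoed) sources sketched in Figure 3-1 can be illustrated with a minimal two-dimensional image model. This is a sketch that assumes a single reflection coefficient shared by all walls; SSSound's actual geometry handling may differ:

```python
def image_sources(src, room, ref_coef, order):
    """Sketch of the image (echo) model for a rectangular room.
    src = (x, z) source position; room = (width, depth) with walls at
    x = 0, x = width, z = 0, z = depth.  Returns (position, attenuation)
    pairs; each wall bounce multiplies in one reflection coefficient."""
    sx, sz = src
    w, d = room
    images = []
    for m in range(-order, order + 1):      # reflections across the x walls
        for n in range(-order, order + 1):  # reflections across the z walls
            # Even image indices keep the source offset; odd ones mirror it.
            ix = m * w + (sx if m % 2 == 0 else w - sx)
            iz = n * d + (sz if n % 2 == 0 else d - sz)
            bounces = abs(m) + abs(n)       # number of wall reflections
            images.append(((ix, iz), ref_coef ** bounces))
    return images

# A 4 x 3 m room: order 1 yields the direct source plus 8 echoed sources.
imgs = image_sources((1.0, 1.0), (4.0, 3.0), 0.7, 1)
```

Each returned image would then contribute one FIR filter tap, with its delay set by the distance to the listener and its amplitude scaled by the attenuation shown here.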
Figure 3-2: Intensity panning between adjacent speakers.
The formula for this system is as follows: each virtual source contributes a filter tap at delay

t = d / c

with amplitude

a = (r / d) * product over j in S of [[Gamma]]j

where

d is the distance to the source in meters,
r is the distance to the speakers in meters,
c is the speed of sound in meters per second,
a is the amplitude of the virtual source relative to the direct sound,
S is the set of walls that sound encounters, and
[[Gamma]]j is the reflection coefficient of the jth wall.
There are a couple of comments worth noting:
* The value of a was calculated when the program found each echoed source, and it was stored in the sound source description.
* This result assumes that the listener, speaker and virtual sources all lie in the same horizontal plane, and the speakers are all equidistant from the listener.
* The speaker locations are fixed with respect to the front of the listener. Therefore, if the listener is facing a direction other than minus-z in the virtual space, the speaker locations must be rotated by that same amount and direction before any of the above calculations can be performed.
* Adjacent filter taps within 1 millisecond of each other are merged to form a new tap with the same energy. If the original taps are at times t0 and t1, with amplitudes a0 and a1, the merged tap is created at the energy-weighted time t2 = (a0^2 t0 + a1^2 t1) / (a0^2 + a1^2), with amplitude a2 = sqrt(a0^2 + a1^2) so that the total energy is preserved.
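The merge rule above can be sketched as follows. The energy-preserving amplitude follows directly from the same-energy requirement; placing the merged tap at the energy-weighted time is an assumption consistent with it:

```python
import math

def merge_taps(t0, a0, t1, a1):
    """Merge two nearby FIR taps into one tap with the same total
    energy (a2^2 = a0^2 + a1^2), placed at the energy-weighted time."""
    e0, e1 = a0 * a0, a1 * a1
    t2 = (e0 * t0 + e1 * t1) / (e0 + e1)
    a2 = math.sqrt(e0 + e1)
    return t2, a2

# Two equal taps 1 ms apart merge to their midpoint with sqrt(2) gain.
t2, a2 = merge_taps(0.010, 1.0, 0.011, 1.0)
```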
The pruning process tends to eliminate distant virtual sources as well as weak taps resulting from panning. This process should not affect the system quality if the maximum number of filter taps is set to at least 50 or so. The higher the maximum number of taps, the better the system quality; however, if real-time performance is hampered, a lower maximum is advised.
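A minimal sketch of such pruning, keeping only the strongest taps (the helper name is illustrative, not from SSSound):

```python
def prune_taps(taps, max_taps=50):
    """Keep only the max_taps strongest filter taps, dropping distant
    virtual sources and weak panning residue.
    taps: list of (time, amplitude) pairs."""
    kept = sorted(taps, key=lambda ta: abs(ta[1]), reverse=True)[:max_taps]
    return sorted(kept, key=lambda ta: ta[0])  # restore time order

pruned = prune_taps([(0.0, 1.0), (1.0, 0.1), (2.0, 0.5)], max_taps=2)
```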
Two considerations governed the choice among the many possible combinations of filters. The first was the stability of the system at all frequencies. The second was that the system should increase the number of echoes generated in response to an impulse, since in a real room echoes, though they subside, increase in number. Thus, nested allpass filters were chosen as the basis for building the reverberator, since they satisfy both requirements. For more detail on the mathematics and the construction of different reverberators, refer to [Gardner 95] and [Gardner 92].
The design of the nested allpass filters used in the system is modeled in Figure 3-3, where X is the input, Y is the output, g is the gain, and G(z) is simply a delay. This allpass filter is the building block. The result of cascading these filters together is not a good-sounding reverberator; its response is metallic and sharp sounding. However, when some of the output of the cascaded allpass system is fed back to the input through a moderate delay, great-sounding reverberators are achieved. The harsh, metallic feel of the system without feedback is eliminated partly because of the increased echo density due to the feedback loop. Moreover, adding a lowpass filter to the feedback loop simulates the lowpass effect of air absorption. This newer system has the form shown in Figure 3-4.
Figure 3-3: Allpass flow diagram with samples taken from the interior of the allpass delay line.
Figure 3-4: A generalized allpass reverberator with a low pass filter feedback loop, with multiple weighted output taps.
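The allpass building block of Figure 3-3 can be sketched in direct form as follows. This is a minimal illustration in Python, not the 56001 implementation:

```python
from collections import deque

def allpass(signal, delay_samples, g):
    """Allpass section with transfer function (-g + z^-k) / (1 - g z^-k),
    k = delay_samples: the feed-forward path through -g is summed with
    the delayed input and the feedback through +g."""
    x_hist = deque([0.0] * delay_samples, maxlen=delay_samples)  # x[n-k]
    y_hist = deque([0.0] * delay_samples, maxlen=delay_samples)  # y[n-k]
    out = []
    for x in signal:
        y = -g * x + x_hist[0] + g * y_hist[0]
        x_hist.append(x)
        y_hist.append(y)
        out.append(y)
    return out

# Impulse response: -g immediately, then (1 - g^2) g^(m-1) at multiples
# of the delay, so the echoes decay but never color the magnitude spectrum.
response = allpass([1.0] + [0.0] * 6, delay_samples=2, g=0.5)
```

Nesting replaces the inner delay G(z) with another allpass of this same form, which is what multiplies the echo density over time.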
From this general structure, many systems can be designed. The key to creating good-sounding reverberators is not mathematics, but rather the ear. The basic decision criterion for a good reverberator is whether or not it sounds good. Since the ear is good at detecting patterns, the job of a good reverberator is to elude this pattern recognition process. Therefore, the reverberators used in SSSound have been empirically designed to sound good. They are taken from Gardner's Master's thesis [Gardner 92]. None of them are mathematical creations; rather, they are the result of laborious hand tweaking to produce good-sounding reverberators.
In order to simplify the representation of nested allpass reverberators, a simplified schematic representation was used, as shown in Figure 3-5. The top of the figure (3-5a) shows the procedure used to perform the allpass filtering. Here the feed-forward multiply-accumulate through -g occurs before the feedback calculation. Figure 3-5b shows a simple nested allpass system (for instructional purposes only). The input enters a delay line at the left side, where it is processed with a single allpass followed by a doubly nested allpass. The allpass delays are measured in milliseconds, and the gains are shown in parentheses. The signal first experiences a 20 millisecond delay, then a 30 millisecond allpass with a gain of 0.5. It then passes through another 5 milliseconds of delay, followed by a 50 millisecond allpass of gain 0.7 that contains a 20 millisecond allpass of gain 0.1.
Figure 3-5: A detailed description of the allpass procedure and an example of a reverberator system. a) (top) a schematic of the allpass procedure, where the forward multiply-accumulate (of -g) happens before the feedback calculation through +g. b) (bottom) instructional allpass cascaded system.
The following formula (the Norris-Eyring equation) is a method of calculating the reverberation time (T) of a room:

T = 55.3 V / (c a')

where

T is the reverb time in seconds,
c is the speed of sound in meters per second,
V is the volume of the room in meters cubed,
a' = -S ln(1 - [[alpha-bar]]) is the metric absorption in meters squared,
S is the total surface area in meters squared,
[[alpha-bar]] = (1/S) * sum over i of Si [[alpha]]i is the average power absorption of the room,
Si, [[alpha]]i are the surface area and power absorption of wall i, with [[alpha]]i = 1 - [[Gamma]]i^2, and
[[Gamma]]i is the pressure reflection coefficient of the material of wall i.
The above formula is used to calculate the reverberation time of the room so as to know which reverberator to use. The following table shows the reverberation time range for each room:
Reverberator Reverberation Time (sec)
small 0.38 -> 0.57
medium 0.58 -> 1.29
large 1.30 -> infinite
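The reverberation-time calculation and the table lookup above can be sketched together as follows (the helper names are illustrative, not from SSSound, and the Norris-Eyring form with power absorption 1 - [[Gamma]]^2 per wall is assumed):

```python
import math

def reverberation_time(volume, wall_areas, wall_reflections, c=343.0):
    """Norris-Eyring estimate of the reverberation time T in seconds.
    wall_areas: surface area Si of each wall (m^2);
    wall_reflections: pressure reflection Gamma_i of each wall,
    giving power absorption alpha_i = 1 - Gamma_i^2."""
    S = sum(wall_areas)
    alpha_bar = sum(s * (1.0 - g * g)
                    for s, g in zip(wall_areas, wall_reflections)) / S
    a_metric = -S * math.log(1.0 - alpha_bar)  # metric absorption (m^2)
    return 55.3 * volume / (c * a_metric)

def pick_reverberator(t):
    """Select a diffuse reverberator from the reverberation-time table."""
    if t < 0.58:
        return "small"
    if t < 1.30:
        return "medium"
    return "large"

# A 10 x 8 x 3 m room with uniform wall reflection 0.9:
t = reverberation_time(240.0, [80, 80, 30, 30, 24, 24], [0.9] * 6)
# t is roughly 0.69 s, selecting the "medium" reverberator.
```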
Figure 3-6 shows the three reverberators used in SSSound.
Figure 3-6: Diffuse reverberators used in SSSound for small, medium and large rooms. See Figure 3-5 for a detailed explanation of the schematic. These reverberators were designed by W. Gardner [Gardner 92].