Michael A. Casey

Model-based audio is an emerging paradigm for the representation and distribution of sound for networked computer-based applications. It is an alternative to the sample- and stream-based representations of audio that are currently the prevalent modes of sound dissemination. This page summarizes my thoughts on Model-Based Audio as a practical technique for audio applications on computer networks such as the Internet.

Why Model-Based Audio?

Every footstep, door creak, or thunderclap we hear in a Hollywood film is placed there by a sound designer who carefully sorts through thousands of pre-recorded sound samples, or specially records sounds, to create what is known as the "Foley" track, named for Jack Foley, who first developed the technique of recording sound effects live in a specially equipped studio while a film was playing. But such labor-intensive methods are not practical for creating the sound effects now needed to enhance interactive environments such as virtual worlds and 3-D games, or for distributing audio over large-scale computer networks such as the Internet.

Sounds created in the traditional way require massive storage and bandwidth resources. More importantly, each is essentially "frozen" like a snapshot of a single event, such as the sound of a car traveling at a particular speed. This static quality makes it impossible to reflect changes in an interactive environment, such as the car slowing down or skidding, without storing an exhaustive set of sounds for every possible outcome.

With an ear to the future of sound specification and distribution for networked media applications, I have been researching methods for obtaining controllable, high-quality synthesis models capable of producing professional-grade sound effects and music for interactive applications and for audio distribution over networks.

Perceptual Audio Models

Ph.D. Thesis research by Michael Casey

There are many possible approaches to the problem of model-based audio; the Machine Listening Group's NetSound project, for instance, uses Algorithmic Sound Specification for music distribution. The following paragraphs outline the general principles behind Model-Based Audio, with particular reference to my research on Perceptual Audio Models.

The emerging technologies for Model-Based Audio can be divided into four basic categories:

  • Algorithmic Sound Specification
  • Feature-Based Sound Specification
  • Perceptual Control
  • Temporal Event-Stream Specification

Algorithmic Sound Specification

Sound synthesis languages, such as Csound or Music-V, have been used by the music synthesis research community for over 30 years. Such languages can be described as Algorithmic Sound Specification for audio. They work by creating audio samples from a set of signal-processing building blocks, often called unit generators, and on the newest generations of hardware platforms they can perform quite well in real time. The Machine Listening Group's NetSound project is an example of Algorithmic Sound Specification used for transmitting music over computer networks.
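
To make the unit-generator idiom concrete, here is a minimal sketch in Python (used here only for readability; a real synthesis language compiles such patches into efficient signal-processing code). The oscil and linen functions below are loose analogies to the Csound opcodes of the same names, not an actual Csound interface:

    import numpy as np

    SR = 44100  # sample rate in Hz

    def oscil(freq, dur, amp=1.0):
        # Sine oscillator unit generator (loosely analogous to Csound's oscil).
        t = np.arange(int(SR * dur)) / SR
        return amp * np.sin(2 * np.pi * freq * t)

    def linen(sig, rise=0.05, decay=0.2):
        # Linear attack/decay envelope (loosely analogous to Csound's linen).
        env = np.ones(len(sig))
        nr, nd = int(SR * rise), int(SR * decay)
        env[:nr] = np.linspace(0.0, 1.0, nr)
        env[len(sig) - nd:] = np.linspace(1.0, 0.0, nd)
        return sig * env

    # One "score" event: patch the unit generators together and render samples.
    note = linen(oscil(freq=440.0, dur=1.0, amp=0.5))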

These techniques are useful for limited applications in music synthesis and sound-effects processing, but they do not readily generalize to the problem of creating arbitrary sounds for use in high-quality media applications, such as Virtual Worlds or 3-D games.

Feature-Based Sound Specification

As an extension to the sound palettes offered by Algorithmic Sound Specification, we consider the use of Feature-Based Sound Specification. By carefully extracting the important features of a set of sounds, we can represent the perceptually salient information in an extremely compact and controllable manner. We can then combine these features in new ways to create sounds that are hybrids of the originals we analyzed. Perceptual Audio Models are a class of feature-based descriptions for audio, based on advanced statistical pattern-recognition techniques, that enable intuitive control of audio content as well as reduction of the bandwidth each sound needs for its specification.
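
As a toy illustration of feature interpolation (not the Perceptual Audio Model analysis itself, which rests on statistical pattern recognition), the following Python sketch treats short-time spectral magnitudes as stand-in features and blends them to produce a hybrid of two equal-rate recordings a and b, assumed to be NumPy sample arrays:

    import numpy as np

    def hybridize(a, b, mix=0.5, n_fft=1024, hop=256):
        # Blend the magnitude spectra of two sounds, keeping a's phase.
        # The "features" here are short-time magnitudes; 'mix' sets the
        # relative presence of each source in the hybrid.
        win = np.hanning(n_fft)
        n = min(len(a), len(b))
        out = np.zeros(n)
        for start in range(0, n - n_fft, hop):
            fa = np.fft.rfft(win * a[start:start + n_fft])
            fb = np.fft.rfft(win * b[start:start + n_fft])
            mag = (1.0 - mix) * np.abs(fa) + mix * np.abs(fb)  # interpolate features
            frame = np.fft.irfft(mag * np.exp(1j * np.angle(fa)))  # borrow a's phase
            out[start:start + n_fft] += win * frame  # overlap-add resynthesis
        return out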

Perceptual Control

It is difficult to overstate the importance of control over sound in interactive media applications. Without control over the perceptually salient aspects of a sound we have nothing more than an audio snapshot, frozen in time and not useful beyond its original specific intention. By using feature-based representations of audio we enable the possibility of controlling the exact content of a sound by adjusting the relative presence of each perceptual feature in real time. This opens up the potential for implementing intelligent sounds which change when attributes of their associated objects change.

For example, consider the Virtual Baseball game, a demonstration application for MERL's Open Community Distributed Virtual Reality standards proposal. In that game, as each player hits the ball, the resulting sound should encode meaningful information about the hitting action, a very important cue to fielders in the game. Not only should the ball know about the sound of a wooden bat hitting it, but if we change the size and materials of the ball or the bat then the resulting sound should change also. This would be impossible to achieve without control over the perceptual features of the sound.
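
One standard way to realize this kind of parametric control over impact sounds is modal synthesis, in which a hit is rendered as a sum of damped sinusoids whose frequencies and decay rates follow from object attributes. The Python sketch below is purely illustrative; its parameter names and constants are hypothetical and are not drawn from the thesis work:

    import numpy as np

    SR = 44100

    def impact(base_freq=2000.0, size=1.0, hardness=0.8, dur=0.5):
        # Parametric impact: a sum of damped sinusoids (modal synthesis).
        # 'size' lowers the modal frequencies for larger objects; 'hardness'
        # slows the decay for harder materials. All names and constants here
        # are illustrative, not a published model.
        t = np.arange(int(SR * dur)) / SR
        out = np.zeros_like(t)
        for k, ratio in enumerate([1.0, 1.5, 2.76, 3.4]):  # modal frequency ratios
            f = base_freq * ratio / size
            decay = (8.0 - 6.0 * hardness) * (k + 1)  # softer material -> faster decay
            out += np.exp(-decay * t) * np.sin(2 * np.pi * f * t) / (k + 1)
        return out

    # The same model yields different bats from different attribute settings.
    wood_bat = impact(base_freq=1800.0, size=1.0, hardness=0.7)
    aluminum_bat = impact(base_freq=2600.0, size=0.9, hardness=0.95)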

Temporal Event-Stream Specification

The specification of sounds is one component of Model-Based Audio; the second major component is the specification of sound events through time. The popular MIDI standard (Musical Instrument Digital Interface) is an example of a temporal event-stream representation, since only information pertaining to the onset and termination of events, as well as the temporal unfolding of synthesis parameters, is encoded in the protocol. Perceptual Audio Models include methods for specifying the temporal structure of a sound-track independent of the sounds that are specified.
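
A minimal sketch of the event-stream idea in Python: the score carries only onsets and control parameters, while the synthesis model is supplied separately as a rendering callback (beep is a hypothetical placeholder synthesizer, not part of any standard):

    import numpy as np

    SR = 44100

    def beep(freq, dur=0.25):
        # Placeholder synthesizer; any synthesis model could stand in here.
        t = np.arange(int(SR * dur)) / SR
        return 0.3 * np.sin(2 * np.pi * freq * t) * np.exp(-4.0 * t)

    def render(events, synth, total_dur):
        # Render a MIDI-like event stream: each event is (onset_seconds,
        # params); only timing and control data live in the stream itself.
        out = np.zeros(int(SR * total_dur))
        for onset, params in sorted(events, key=lambda e: e[0]):
            sig = synth(**params)
            i = int(SR * onset)
            j = min(len(out), i + len(sig))
            out[i:j] += sig[:j - i]
        return out

    score = [(0.0, {"freq": 440.0}), (0.5, {"freq": 660.0}), (1.0, {"freq": 220.0})]
    track = render(score, beep, total_dur=1.5)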

Practical Model-Based Audio

I believe the techniques developed for my Ph.D. thesis constitute a practical direction for the development of Model-Based Audio, with the needs of producers considered in the design of the algorithms and the representations used. As a test of these ideas, I am incorporating Perceptual Audio Models into the distributed virtual environments that I have produced for networked music entertainment.


Michael Casey