Model-based audio is an emerging paradigm for representing and distributing sound in networked computer applications. It is an alternative to the sample- and stream-based representations that are the prevalent modes of sound dissemination at this time. The following page summarizes my thoughts on Model-Based Audio as a practical technique for audio applications on computer networks such as the Internet.
Sounds created in the traditional way require massive storage and bandwidth resources. More importantly, each is essentially "frozen" like a snapshot of a single event, such as the sound of a car traveling at a particular speed. This static quality makes it impossible to reflect changes in an interactive environment, such as the car slowing down or skidding, without storing an exhaustive set of sounds for every possible outcome.
With an ear to the future of sound specification and distribution for networked media applications, I have been researching methods for building controllable, high-quality synthesis models capable of producing professional-grade sound effects and music for interactive applications and audio distribution over networks.
The emerging technologies for Model-Based Audio can be divided into four basic categories:
Algorithmic Sound Specification
These techniques are useful for limited applications in music
synthesis and sound-effects processing, but they do not
readily generalize to the problem of creating arbitrary sounds for
use in high-quality media applications, such as Virtual Worlds or
3D games.
Feature-Based Sound Specification
As an extension to the sound palettes offered by Algorithmic
Sound Specification, we consider the use of Feature-Based
Sound Specification. By carefully extracting the important
features of a set of sounds, we can represent the perceptually salient
information in an extremely compact and controllable manner. We can
then combine these features in new ways to create new sounds that are
hybrids of the original sounds that we
analyzed. Perceptual Audio Models are a class of feature-based
audio descriptions, built on statistical pattern-recognition
techniques, that enable intuitive control of audio content as well
as reduction of the bandwidth that each sound needs for its specification.
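The idea of combining extracted features into hybrid sounds can be sketched as simple interpolation between compact feature vectors. The feature names and values below are hypothetical illustrations, not part of any actual Perceptual Audio Model:

```python
import numpy as np

def blend_features(features_a, features_b, mix):
    """Linearly interpolate two feature vectors; mix=0 gives A, mix=1 gives B."""
    return (1.0 - mix) * features_a + mix * features_b

# Hypothetical compact descriptions: [brightness, attack_time, roughness]
glass = np.array([0.9, 0.01, 0.2])
wood  = np.array([0.4, 0.05, 0.6])

# A new sound halfway between the two analyzed sounds.
hybrid = blend_features(glass, wood, 0.5)
```

Because each sound is reduced to a handful of numbers, such a description is both far more compact than a sample and directly controllable: moving `mix` in real time morphs one sound into the other.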
Perceptual Control
Control over sound deserves particular emphasis in
interactive media applications. Without control over the perceptually
salient aspects of a sound we have nothing more than an audio
snapshot: frozen in time and not useful beyond its original specific
intention. By using feature-based representations of audio we enable
the possibility of controlling the exact content of a sound by
adjusting the relative presence of each perceptual feature in real-time.
This opens up the potential for implementing intelligent sounds which
change when attributes of their associated objects change.
For example, an application of Distributed Virtual Reality is the
Virtual Baseball game, an example application for MERL's Open
Community Distributed Virtual Reality standards proposal. In that game, as each
player hits the ball, the resulting sound should encode meaningful
information about the hitting action, a very important cue to fielders
in the game. Not only should the ball know about the sound of a
wooden bat hitting it, but if we change the size and materials of the
ball or the bat then the resulting sound should change also. This
would be impossible to achieve without control over the perceptual
features of the sound.
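One way to realize such an intelligent sound is to map object attributes directly onto perceptual synthesis parameters. The mapping below is a minimal sketch with invented, illustrative values; the actual Virtual Baseball sound model is not specified here:

```python
def bat_hit_parameters(bat_material, ball_size):
    """Map object attributes to perceptual parameters (hypothetical mapping)."""
    # Illustrative assumption: stiffer materials give brighter, shorter impacts.
    stiffness = {"wood": 0.5, "aluminum": 0.9}[bat_material]
    brightness = stiffness
    decay_time = 0.3 * (1.0 - stiffness) + 0.05   # seconds
    pitch_hz = 400.0 / ball_size                  # smaller ball, higher resonance
    return {"brightness": brightness,
            "decay_time": decay_time,
            "pitch_hz": pitch_hz}

wood_hit  = bat_hit_parameters("wood", ball_size=1.0)
metal_hit = bat_hit_parameters("aluminum", ball_size=1.0)
```

When the game changes the bat from wood to aluminum, the derived parameters change with it, so the rendered sound carries the cue that fielders rely on.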
Temporal Event-stream Specification
The specification of sounds for Model-Based Audio is one component of the problem;
the second major component is the specification of sound events through time. The popular MIDI standard
(Musical Instrument Digital Interface) is an example of a temporal
event-stream representation since only information pertaining to the
onset and termination of events as well as temporal unfolding of
synthesis parameters is encoded in the protocol. Perceptual Audio Models include
methods for specifying the temporal structure of a sound-track independent of the sounds
that are specified.
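A temporal event-stream can be sketched as a time-ordered list of onset and termination events, each naming a sound model and its parameters. The event fields and model names below are hypothetical, chosen only to illustrate that the temporal structure is independent of the sounds it addresses:

```python
from dataclasses import dataclass, field

@dataclass
class SoundEvent:
    time: float              # event time in seconds
    action: str              # "on" (onset) or "off" (termination)
    model: str               # which sound model the event addresses
    params: dict = field(default_factory=dict)   # synthesis parameters at onset

# A tiny event stream, MIDI-like: only timing and parameters are encoded,
# never the audio itself.
stream = [
    SoundEvent(0.0, "on",  "bat_hit", {"brightness": 0.5}),
    SoundEvent(0.2, "off", "bat_hit"),
    SoundEvent(0.5, "on",  "crowd",   {"intensity": 0.8}),
]

# A renderer would dispatch events in time order:
for ev in sorted(stream, key=lambda e: e.time):
    pass  # schedule ev.model with ev.params at ev.time
```

Swapping in a different set of sound models leaves the stream untouched, which is exactly the independence of temporal structure from sound specification described above.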