NetSound ICMC96 Paper
Published in Proceedings of the International Computer Music Conference,
Hong Kong, 1996
Abstract
We describe a sound and music specification protocol called
NetSound that is oriented towards networked
low-bandwidth, native-signal-processing sound synthesis
applications. One such application is music distribution on the
Internet. We describe the concept behind NetSound and
outline a prototype implementation that uses Csound, a
synthesis specification language designed and implemented at the
MIT Media Lab, as a client-side real-time synthesis engine.
NetSound
NetSound is a sound and music description system, currently prototyped
in Csound, in which a sound stream is decomposed into a
sound-specification component, representing arbitrarily complex
signal-processing algorithms, and an event list comprising a score or
MIDI file; as such, NetSound is an example of Model-Based Audio.
This description is analogous to the Adobe PostScript language for
text and image information, in which construction information for
fonts and images is separated from the raw ASCII text.
As a network sound transmission protocol, NetSound has the advantage
of being able to transmit a wide selection of sounds using a
descriptive format that does not require a high-bandwidth channel.
Since description-based audio represents acoustic events as
parameterized units, a great deal of control over the resulting sound
is offered. In order to time-compress a sound stream, for example, a
scalar multiplier can be applied to all event duration values, or a
synthesis algorithm such as phase-vocoder resynthesis can be specified
and appropriate time-frequency modifications made from a simple
control function. The use of complex instrument descriptions and an
appropriately parameterized score makes it possible to specify
complete sound tracks or musical pieces using a very small amount of
data.
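As a minimal sketch of this kind of control (the fragment below is our
illustration, assuming a simple instr 1 defined in the orchestra, and
is not part of any protocol definition), a single tempo statement in a
Csound score applies one scalar to every event start time and duration:

    t  0 120                ; tempo 120 bpm: all beat durations are halved
    i1 0 1 8000 261.6       ; instr 1, start beat 0, 1 beat, amp, freq (Hz)
    i1 1 1 8000 329.6
    i1 2 2 8000 392.0
    e

Under Csound's default tempo of 60 beats per minute, beats equal
seconds; the t statement rescales the whole event list without editing
individual events.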
Instruments written in other synthesis languages, such as the MUSIC-N
family, and patches for commercial synthesizers can be translated into
Csound syntax. At the note level, NetSound has its own
event-specification format but is also capable of reading and playing
MIDI files.
NetSound as a sound specification protocol
Object-based representations for sound synthesis can be thought of as
a set of audio-processing building blocks threaded into a
signal-processing network for each class of sound. Each sound event
instantiates a copy of the signal-processing template for its class.
These data structures are constructed on the client side by the Csound
compiler. Once constructed and memory resident, the
signal processing networks can be executed in real time under the
control of a score or MIDI file event list. Csound features a complex
dynamic execution environment that adjusts memory requirements as
needed and maintains efficiency by optimized allocation and
reallocation of memory.
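For illustration (the opcodes are standard Csound, but the instrument
itself is a minimal example of ours), the orchestra below threads an
envelope generator and a wavetable oscillator into one such network:

    sr     = 44100
    kr     = 4410
    ksmps  = 10
    nchnls = 1

            instr 1                    ; one signal-processing template
    k1      linen  p4, 0.05, p3, 0.2   ; envelope: amp, rise, dur, decay
    a1      oscili k1, p5, 1           ; interpolating oscillator, table 1
            out    a1
            endin

Each i-statement in the accompanying score then instantiates its own
copy of the template:

    f1 0 8192 10 1          ; function table 1: one cycle of a sine
    i1 0 2 10000 440        ; two overlapping notes: each allocates
    i1 1 2 10000 660        ; its own copy of instr 1's network
    e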
Specification and Distribution using NetSound
The process of designing a sound stream using NetSound
comprises the specification of the required sound synthesis
algorithms or selection from pre-existing synthesis units, such
as wavetable synthesis, FM synthesis, phase-vocoder or additive
synthesis. A standard sequencing program is used to construct the
temporal structure of the required sound stream as a MIDI file or in
the human-readable Csound score format.
Sound streams are computed in real time and synthesized buffer by
buffer by a network client, i.e., an executable on the network user's
computer. The resulting audio sample data is never stored or
transmitted; only the descriptions and the necessary sampled sounds or
synthesis data are stored and transmitted by the network server.
It is important to note that NetSound is not a compression protocol;
the process does not include a transcription from mixed audio to
NetSound format. We regard NetSound as a distribution tool that
reflects the manner in which music and sound tracks are constructed
for multimedia applications. That is, a small number of sounds or algorithms
are utilized for generating a large amount of audio data. NetSound renders the
data into sound without requiring large storage or throughput capacity.
NetSound and General MIDI
General MIDI comprises a fixed set of 128 a priori defined sound
wavetables. Extensions to General MIDI include a number of sound
effects as well as extensions to the basic instrumental set. While
General MIDI has been a useful tool, it is somewhat limited in its
definition of the available sound palette. The General Music
extensions offer a protocol for including user-definable wavetables
for sounds; these are encoded using MPEG compression, and a limit of
twenty seconds is suggested for the length of these tables.
In contrast, NetSound exploits a suite of synthesis algorithms
comprising the most widely used sound synthesis techniques from the
field of computer music as well as wavetable synthesis, including
user-definable wavetables. The synthesis template library includes a
version of FM synthesis, granular synthesis for sound textures, FOF
synthesis for voice, Karplus-Strong/waveguide synthesis for physical
modeling, and additive synthesis or phase-vocoder techniques for
detailed control over sound resynthesis.
In addition to sound-production algorithms, NetSound includes a set of
sound-effects algorithms such as reverberation, echo, delay, phasing,
and flanging. As with the synthesis algorithms, these can be combined
to form composite signal-processing units of arbitrary complexity.
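As a sketch of such a composite (the instrument numbers and parameter
values are our own illustrative choices), a Csound FM instrument can
feed a global bus that a separate effect instrument reverberates:

    ga1     init   0                   ; global effect-send bus

            instr 2                    ; FM synthesis (foscil opcode)
    a1      foscil p4, p5, 1, 2, 5, 1  ; carrier:mod ratio 1:2, index 5
            out    a1
    ga1     =      ga1 + a1*0.3        ; send a portion to the reverb bus
            endin

            instr 99                   ; global reverberator
    a2      reverb ga1, 1.5            ; 1.5-second reverb time
            out    a2
    ga1     =      0                   ; clear the bus for the next pass
            endin

A score event such as i99 0 60 keeps the reverberator running for the
length of the piece while instr 2 events come and go.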
Network Advantages of NetSound
Most of the existing network audio protocols rely on lossy audio
compression techniques in order to reduce the bandwidth of an audio
data stream. There are also protocols that are able to
stream and decompress buffered audio data in real time; for example,
at the time of writing, RealAudio(tm) can deliver one channel of
compressed music over a 28.8 kbit/s communications channel at a
resynthesis sampling rate of 11 kHz.
The quality of these techniques varies as a function of the
compression ratio. Real-time compressed audio streams are good for
browsing audio material but do not offer a quality that is acceptable
for high-fidelity sound reproduction. High-quality compression
schemes such as MPEG do not reduce the data enough to make
transmission of large quantities of audio data feasible in a small
amount of time. All of the existing techniques exhibit a linear
relationship between the length of the original audio stream and the
size of the compressed file.
NetSound has the advantage of requiring far less server throughput and
storage capacity than existing protocols, and it is far more
comprehensive in its sound palette than General MIDI. NetSound also
has the potential to represent sound streams with a data packet whose
size is sub-linearly related, or related only by a small scalar, to
the size of the resulting audio stream.
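As an illustrative figure (the numbers are ours, not a measurement), a
single score event of a few tens of bytes can stand in for a minute of
audio:

    i1 0 60 8000 220        ; one ~20-byte score event -> 60 s of sound;
                            ; one channel of 16-bit 44.1 kHz samples of
                            ; the same minute occupies 60 x 44100 x 2
                            ; bytes, roughly 5.3 MB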
Client-side computational efficiency versus Bandwidth
Since NetSound utilizes Csound as a real-time software synthesis
engine, issues of computational requirements must be addressed. The
decision to exploit client-side computing resources stems from the
observation that current network activity is limited by client/server
throughput rather than by available processor cycles. As long as that
is the case, a tradeoff between processor usage and bandwidth
requirements must be made.
In terms of processor usage, the most efficient method of audio
synthesis is sample playback: if the resulting sound stream comprises
a single sample stream with no rate conversion or amplitude scaling,
the minimum processor load is observed. However, an algorithmic
synthesis technique such as FOF synthesis or granular synthesis
requires far more mathematical operations per audio sample, but also
requires much less sound-specification information. Thus there is a
complex relationship between computational efficiency and the
bandwidth required for specification.
The art of network sound design therefore involves careful
consideration of computational resources and bandwidth availability.
It is possible to exploit the merits of both when specifying sound
streams using NetSound. In situations where bandwidth is plentiful,
sample-based synthesis techniques are perhaps preferable; when
processor cycles are likely to be available, other synthesis
techniques, such as additive synthesis or phase-vocoder synthesis, may
be incorporated to reduce network bandwidth requirements at the
expense of increased processor load.
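The two instruments below sketch the extremes of this tradeoff (the
parameters are illustrative): instr 10 plays back a transmitted sample
table with minimal computation, while instr 11 sums four sine
partials, costing more cycles per sample but needing only a few bytes
of orchestra text:

            instr 10                   ; sample playback: minimal CPU, but
    a1      loscil p4, p5, 1           ; table 1 (a GEN01 sample) must be
            out    a1                  ; transmitted to the client
            endin

            instr 11                   ; additive synthesis: more math per
    a2      oscili p4/4, p5,   2       ; sample, but fully specified by a
    a3      oscili p4/4, p5*2, 2       ; few bytes of orchestra text and a
    a4      oscili p4/4, p5*3, 2       ; sine table built client-side
    a5      oscili p4/4, p5*4, 2
            out    a2+a3+a4+a5
            endin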
Conclusion and Future Work
NetSound is currently well suited to synthesizing the types of sound
and music that are produced in a modern multimedia production
studio. It is the goal of NetSound to eliminate the pre-mastering
stage of multimedia sound production in favor of distributing
algorithmic synthesis descriptions, any necessary audio samples or
analysis signals, and structured event lists for the sound
stream. This information is currently only implicitly represented in a
modern multimedia studio because, as yet, there are no standards for
exporting the specification of signal-processing networks. The future
of software sound synthesis is somewhat dependent on such protocols;
NetSound is perhaps a first in this regard.
We are currently investigating the use of parametric models for
non-musical synthesis, such as Foley-type sound effects, so that
commonly required classes of sounds can be specified by a small number
of parameters and sound class information. The modular nature of
Csound affords easy inclusion of new synthesis models into the
NetSound protocol.
As software synthesis becomes embedded in multimedia technologies, we
believe that the principles outlined above will become a governing
factor in software-based sound design.
Michael Casey
Paris Smaragdis