NetSound ICMC96 Paper

Published in Proceedings of the International Computer Music Conference 1996, Hong Kong

By Michael Casey and Paris Smaragdis

MIT Media Lab Machine Listening Group

Abstract

We describe a sound and music specification protocol called NetSound that is oriented towards low-bandwidth, networked, native-signal-processing sound synthesis applications. One such application is music distribution on the Internet. We describe the concept behind NetSound and outline a prototype implementation that uses Csound, a synthesis specification language designed and implemented at the MIT Media Lab, as a client-side real-time synthesis engine.

NetSound

NetSound is a sound and music description system, currently prototyped in Csound, in which sound streams are described by decomposition into an instrument specification representing arbitrarily complex signal-processing algorithms and event lists comprising scores or MIDI files; as such, NetSound is an example of Model-Based Audio. This description is analogous to the Adobe PostScript language for image and text information, in which construction information for fonts and images is separated from raw ASCII text. As a network sound transmission protocol, NetSound has the advantage of being able to transmit a wide selection of sounds using a descriptive format that does not require a high-bandwidth channel. Since description-based audio represents acoustic events as parameterized units, a great deal of control over the resulting sound is offered. To time-compress a sound stream, for example, a scalar multiplier can be applied to all event duration values, or a synthesis algorithm such as phase-vocoder resynthesis can be specified and the appropriate time-frequency modifications made from a simple control function. The combination of complex instrument descriptions and an appropriately parameterized score makes it possible to specify complete sound tracks or musical pieces using a very small amount of data. Instruments written in other synthesis languages, such as the MUSIC-N family, and instruments from commercial synthesizers can be translated into Csound syntax. At the note level, NetSound has its own event-specification format but is also capable of reading and playing MIDI files.
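As an illustration, a minimal NetSound-style description might pair a one-instrument Csound orchestra with a short score; the instrument design and parameter values below are illustrative only, not drawn from the prototype:

    ; orchestra: one wavetable instrument (p4 = amplitude, p5 = frequency)
    sr     = 44100
    kr     = 4410
    ksmps  = 10
    nchnls = 1

            instr 1
    k1      linen  p4, 0.05, p3, 0.2   ; amplitude envelope over the note
    a1      oscil  k1, p5, 1           ; wavetable oscillator reading table 1
            out    a1
            endin

    ; score: one function table and an event list
    f1 0 4096 10 1                     ; GEN10 sine wavetable
    i1 0.0 1.0 10000 440               ; start, duration, amp, freq
    i1 1.0 0.5 10000 660
    e

Scaling every p3 (duration) field, as suggested above, time-compresses or expands the whole stream without retransmitting any audio.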

NetSound as a sound specification protocol

Object-based representations for sound synthesis can be thought of as a series of audio-processing building blocks threaded into a signal-processing network for each class of sound. Each sound event instantiates a copy of the signal-processing template for its class. These data structures are constructed on the client side by the Csound compiler. Once constructed and memory resident, the signal-processing networks can be executed in real time under the control of a score or MIDI-file event list. Csound features a dynamic execution environment that adjusts memory requirements as needed and maintains efficiency through optimized allocation and reallocation of memory.
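For example, each i statement in a Csound score instantiates a fresh copy of its instrument's network, so overlapping events coexist as independent instances. The filtered-noise instrument below is an illustrative sketch of such a template:

            instr 2
    k1      expon  1, p3, 0.001        ; per-instance decay control
    a1      rand   p4                  ; noise source at amplitude p4
    a2      tone   a1, p5 * k1         ; low-pass filter with falling cutoff
            out    a2 * k1
            endin

    ; two overlapping events, each allocating its own copy of instr 2
    i2 0.0 2.0 8000 4000
    i2 0.5 2.0 8000 2000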

Specification and Distribution using NetSound

The process of designing a sound stream using NetSound comprises the specification of the required sound synthesis algorithms, or selection from pre-existing synthesis units such as wavetable synthesis, FM synthesis, the phase vocoder, or additive synthesis. A standard sequencing program is used to construct the temporal structure of the required sound stream as a MIDI file or in the human-readable Csound score format. Sound streams are computed in real time and synthesized buffer by buffer by a network client, i.e. an executable on the network user's computer. The resulting audio sample data is neither stored nor transmitted; only the descriptions and the necessary sampled sounds or synthesis data are stored and transmitted by the network server. It is important to note that NetSound is not a compression protocol; the process does not include a transcription from mixed audio to NetSound format. We consider NetSound a distribution tool that reflects the manner in which music and sound tracks are constructed for multimedia applications: a small number of sounds or algorithms is used to generate a large amount of audio data. NetSound renders the data into sound without requiring large storage or throughput capacity.
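A score fragment standing in for a sequenced passage might look as follows; the tempo, table, and pitch values are illustrative:

    t 0 120                 ; tempo statement: 120 beats per minute
    f1 0 8192 10 1 .5 .33   ; wavetable with three harmonic partials
    i1 0 1 9000 261.6       ; C4
    i1 1 1 9000 329.6       ; E4
    i1 2 2 9000 392.0       ; G4
    e

Only these few lines, together with the orchestra and any referenced samples, cross the network; the audio itself is rendered on the client.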

NetSound and General MIDI

General MIDI comprises a fixed set of 128 a priori defined sound wavetables. Extensions to General MIDI include a number of sound effects as well as extensions to the basic instrumental set. While General MIDI has been a useful tool, it is somewhat limited in its definition of the available sound palette. The General MIDI extensions offer a protocol for including user-definable wavetables for sounds; these are encoded using MPEG compression, and a limit of twenty seconds is suggested for the length of these tables. In contrast, NetSound exploits a suite of synthesis algorithms comprising the most widely used sound synthesis techniques from the field of computer music as well as wavetable synthesis, including user-definable wavetables. The synthesis template library includes a version of FM synthesis, granular synthesis for sound textures, FOF synthesis for voice, Karplus-Strong/waveguide synthesis for physical modeling, and additive synthesis or the phase vocoder for detailed control over sound resynthesis. Beyond sound-production algorithms, NetSound also includes a set of sound-effects algorithms such as reverberation, echo, delay, phasing, and flanging. As with the synthesis algorithms, these can be combined to form composite signal-processing units of arbitrary complexity.
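The sketch below suggests how such templates combine: an FM voice and a Karplus-Strong pluck feed a shared global reverb bus. The routing, instrument numbers, and parameter values are our illustration, not a fixed NetSound configuration:

    garvb   init    0                  ; global bus feeding the reverb

            instr 10                   ; FM voice (foscil opcode)
    a1      foscil  p4, p5, 1, 2, 3, 1 ; 1:2 carrier:modulator, index 3
            out     a1
    garvb   =       garvb + a1 * 0.2   ; send a little signal to the bus
            endin

            instr 11                   ; plucked string (Karplus-Strong)
    a1      pluck   p4, p5, p5, 0, 1   ; noise-filled buffer, simple decay
            out     a1
    garvb   =       garvb + a1 * 0.1
            endin

            instr 99                   ; global reverberation effect
    a1      reverb  garvb, 1.5         ; 1.5 second reverb time
            out     a1
    garvb   =       0                  ; clear the bus each control pass
            endin

A long-running score event such as i99 0 60 keeps the effect instrument active for the duration of the piece.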

Network Advantages of NetSound

Most existing network audio protocols rely on lossy audio compression techniques to reduce the bandwidth of an audio data stream. There are also protocols that can stream and decompress buffered audio data in real time; for example, at the time of writing RealAudio(tm) is able to deliver one channel of compressed music over a 28.8 kbit/s communications channel at a resynthesis sampling rate of 11 kHz. The quality of these techniques varies as a function of the compression ratio. Real-time compressed audio streams are good for browsing audio material but do not offer a quality acceptable for high-fidelity sound reproduction. High-quality compression schemes such as MPEG do not reduce the data enough to make transmission of large quantities of audio feasible in a small amount of time. All of the existing techniques exhibit a linear relationship between the length of the original audio stream and the size of the compressed file. NetSound has the advantage of requiring far less server throughput and storage capacity than existing protocols, and it is far more comprehensive in its sound palette than General MIDI. NetSound also has the potential to represent a sound stream with a data packet whose size is sub-linearly related to, or even independent of, the size of the resulting audio stream.
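A back-of-envelope comparison makes the point; these figures are illustrative, not measurements:

    CD-quality stereo audio: 44100 samples/s x 2 bytes x 2 channels
                             = 176,400 bytes/s
    three-minute track:      176,400 bytes/s x 180 s  ~ 31 MB raw
                             (still several MB after 4:1 compression)
    NetSound description:    orchestra plus score, typically a few
                             kilobytes, growing with the number of events
                             rather than with the duration of the audio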

Client-side computational efficiency versus Bandwidth

Since NetSound utilizes Csound as a real-time software synthesis engine, issues of computational requirements must be addressed. The decision to exploit client-side computing resources stems from the observation that current network activity is limited by client/server throughput rather than by available processor cycles. As long as that is the case, a tradeoff between processor usage and bandwidth requirements must be made. In terms of processor usage, the most efficient method of audio synthesis is sample playback: if the resulting sound stream comprises a single sample stream with no rate conversion or amplitude scaling, the minimum processor load is observed. An algorithmic synthesis technique such as FOF synthesis or granular synthesis, by contrast, requires far more mathematical operations per audio sample but much less sound specification information. There is thus a complex relationship between computational efficiency and the bandwidth required for specification, and the art of network sound design involves the careful consideration of both computational resources and bandwidth availability. It is possible to exploit the merits of both when specifying sound streams using NetSound: where bandwidth is plentiful, sample-based synthesis techniques are perhaps preferable; where processor cycles are likely to be available, other techniques, such as additive synthesis or phase-vocoder synthesis, may be incorporated to reduce network bandwidth requirements at the expense of increased processor load.
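The two illustrative instruments below mark the extremes of this tradeoff; the sample filename is hypothetical:

            instr 20                   ; sample playback: minimal processor
    a1      soundin "drums.aif"        ; load, but the sample file itself
            out     a1                 ; must be transmitted and stored
            endin

            instr 21                   ; small additive bank: more multiplies
    k1      linen  p4, 0.1, p3, 0.3    ; per output sample, but only a few
    a1      oscil  k1, p5, 1           ; score parameters cross the network
    a2      oscil  k1 * 0.5, p5 * 2, 1
    a3      oscil  k1 * 0.25, p5 * 3, 1
            out    a1 + a2 + a3
            endin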

Conclusion and Future Work

NetSound is currently well suited to synthesizing the types of sound and music that are produced in a modern multimedia production studio. It is the goal of NetSound to eliminate the pre-mastering stage of multimedia sound production in favor of distributing algorithmic synthesis descriptions, any necessary audio samples or analysis signals, and structured event lists for the sound stream. This information is currently only implicitly represented in a modern multimedia studio because, as yet, there are no standards for exporting the specification of signal-processing networks. The future of software sound synthesis is somewhat dependent on such protocols; NetSound is perhaps a first in this regard. We are currently investigating the use of parametric models for non-musical synthesis, such as Foley-style sound effects, so that commonly required classes of sounds can be specified by a small number of parameters together with sound class information. The modular nature of Csound affords easy inclusion of new synthesis models into the NetSound protocol. As software synthesis becomes embedded in multimedia technologies, we believe that the principles outlined above will become a governing factor in software-based sound design.