This article appears in a modified form as ``Getting the Word'' in the October 1989 issue of Unix Review (Vol. 7, No. 10., Pages 54-62).
Chris Schmandt, M.I.T. Media Laboratory
Barry Arons, Olivetti Research Center
There has recently been a surge of interest in using voice (Throughout
this article we will use ``audio'' to refer to a medium, and ``voice'' and
``speech'' interchangeably to refer to the signal that is contained in that
medium. ) technologies in advanced workstations. The motivation for voice
is its role as the primary channel of human-to-human communication, which
ties in to current research in enhanced user interfaces, office computing,
and the use of computers to facilitate group problem solving. Taken broadly,
the use of speech as a command and data channel may require digital recording
and playback techniques, speech recognition, text-to-speech synthesis, and
telephone interface equipment. The big payoff will be to build systems,
using these technologies, to allow computers to become a part of the infrastructure
of daily human communication.
This article will focus on system integration issues and requirements of voice. The expected uses of voice and the difficulties one can expect to encounter in applying it to our everyday computing environment will be discussed. These together suggest some of the system requirements to support audio.
A large number of potential audio applications have been suggested: listening typewriters, voice annotation of text, interactive audio training systems, voice mail systems, computer conferencing, telephone access to data, speech substitutes for the mouse and keyboard, auditory icons, speed dialing tools, answering machines, and many others. It is the strong belief of the authors that although no single voice utility may be overwhelming in isolation, a synergistic collection of multi-media applications making appropriate use of voice will provide a very powerful communications environment. Such integration places further demands on the hardware and software architecture of future voice workstations.
It is important to consider the range of applications one might wish
to build using voice. For example, voice may be used to annotate text, e.g.,
as editorial comments on a manuscript or as part of an on-line tutorial.
Voice may be incorporated in a more general multi-media document, such as
a repair manual or video-based educational system. Voice, either synthesized
from text or pre-recorded, may allow remote telephone access to databases,
such as electronic mail, flight departures, or up-to-the-minute stock quotations.
Voice mail leads the pack in terms of currently successful applications; although typically implemented in a centralized architecture closely linked to a PBX, greater utility may be derived from workstation access to messages, both for improved user interfaces and to allow the sharing of voice data between applications. The ubiquitous telephone will continue to play a key role in voice communications and hence any audio workstation. In particular, the telephone is likely to be the dominant source of voice as data (in the form of recorded messages), and is a potentially powerful channel for remote interaction.
For teleconferencing, voice is of course required. Although a computer-supported teleconference could simply consist of sharing windows and setting up a separate telephone call, it is advantageous if the voice link is computer-mediated. First, both the data and voice links can be initiated using a single conference management application. Second, voice can be used to initiate changes in who has control of the ``floor''. Third, the teleconference including voice can be logged. Multi-way audio conferencing can be realized with the telephone network and conventional audio bridging equipment, or by computer networks that transmit the voice digitally.
Speech recognition is a more difficult problem. Despite significant media hype, the general-purpose large-vocabulary ``listening typewriter'' is far from being available; keyboards will be with us for the foreseeable future. Recognition would be very useful over the telephone, but reduced audio bandwidth and noise problems make this one of the more difficult feats in the field. Specific workstation applications might be amenable to voice input, especially if the mouse is being used for other functions, such as drawing lines in a paint program. Speech can be used simultaneously with mouse input for control or to drive menus. At the Media Lab, an alternative approach explores using voice to move between windows under the X Window System; speech is used as the channel for the ``meta-dialog'' of communication with the window manager, rather than just as a keyboard substitute.
In general, audio is a difficult medium, as detailed in following sections, and that suggests its use as a secondary or adjunct channel of interaction. Voice is most useful when one's hands and/or eyes are busy, for example, while using a CAD system, sorting baggage, or driving a car. The use of voice in computer applications is of course more tolerable when no other access mechanism is possible, such as when used by physically impaired users or remotely over a telephone line.
There are several limitations in the utility of audio, some inherent
in the medium and some the unfortunate artifacts of current technology,
which must be considered in designing applications and user interfaces.
Speech is slow; at perhaps 150 words per minute it is much slower than our
ability to read from screens or paper, and even worse than a 300 baud
modem. Speech is necessarily serial, being by nature a time-varying signal;
eyes can wander around menus but the ear cannot. Speech, at thousands of
bytes per second, is ``bulky'' to store, and it cannot be scanned or ``grep'ed''
like text. Listening, especially to synthetic speech, puts a cognitive load
on the user that can interfere with other types of mental activity.
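A back-of-the-envelope comparison makes the slowness concrete. The figures below are illustrative approximations (150 words per minute, roughly six characters per word including the space), not measurements:

```python
# Rough effective "text bandwidth" of speech versus a slow modem.
# All figures are illustrative approximations, not measurements.

WORDS_PER_MIN_SPEECH = 150   # typical speaking rate
CHARS_PER_WORD = 6           # ~5 letters plus a space
BITS_PER_CHAR = 8

def bits_per_second(words_per_min):
    """Effective text bandwidth of speech at a given speaking rate."""
    return words_per_min * CHARS_PER_WORD * BITS_PER_CHAR / 60

print(f"speech as text: ~{bits_per_second(WORDS_PER_MIN_SPEECH):.0f} bits/sec")
print("300 baud modem: 300 bits/sec")
```

At roughly 120 bits per second of text equivalent, listening delivers information at less than half the rate of even a 300 baud modem.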
In terms of currently available devices, there are limitations on the intelligibility of synthetic speech and the error rates of recognizers. For synthesis, text is broken down by the application of several layers of linguistic rules into sound units called phonemes, which are then realized as an acoustic waveform. Text-to-phoneme rules break down for many proper nouns, and phoneme realization problems make some sounds easily confused. Speech recognition is still in its infancy, and even small-vocabulary speaker-dependent devices have difficulty in acoustically imperfect environments without noise-canceling microphones worn on the head. Recognition is generally easier on isolated words than on continuous speech, and when used in a speaker-dependent rather than a speaker-independent manner.
These limitations suggest some requirements of user interfaces to voice data. For example, the ``bulkiness'' of voice suggests graphical interfaces to facilitate random access and give some sense of ``place'' inside a voice document or voice file being edited. Such representations translate the temporal nature of speech into a spatial dimension, possibly providing cues like time (tick marks) and speech/silence intervals (color, size), in addition to a cursor that moves as the sound is played (see figure 1).
This illustration is not available in the HTML version of the paper.
Figure 1: A graphical representation of recorded telephone messages. The bars represent sounds, and the mouse can be used to play back selected portions.
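The speech/silence bars in a display like figure 1 come from a simple segmentation of the recording's energy contour. A minimal sketch (frame energies, threshold, and names are our own illustrative assumptions):

```python
# Sketch of speech/silence segmentation for a graphical sound display:
# runs of frames at or above an energy threshold become visible "bars".
# Frame granularity and threshold are illustrative assumptions.

def segments(frames, threshold):
    """Return (start_frame, end_frame) pairs for each run of sound.
    `frames` is a sequence of per-frame energy values."""
    runs, start = [], None
    for i, energy in enumerate(frames):
        if energy >= threshold and start is None:
            start = i                      # a bar begins
        elif energy < threshold and start is not None:
            runs.append((start, i))        # a bar ends
            start = None
    if start is not None:
        runs.append((start, len(frames)))  # sound ran to the end
    return runs

frames = [0, 0, 400, 500, 450, 0, 0, 300, 350, 0]
print(segments(frames, threshold=100))   # [(2, 5), (7, 9)]
```

Each pair maps directly onto a bar's position and width, with the moving cursor drawn against the same frame coordinates during playback.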
Because of its slowness, audio playback or speech synthesis always need to be interruptible, and mechanisms should be provided to re-play, speed up, or skip ahead. Transitions between modes must be done with minimal overhead, such as when switching between play and record modes in a conversational system or stopping the playback of prompts as immediate feedback when a caller pushes a button on a telephone keypad. Speech recognition is especially difficult in the user interface, requiring careful choice of vocabulary, application, and crafting of functionality; recognition as a direct keyboard replacement almost never works.
The wide range of potential applications for voice, along with its rather
demanding requirements as a user interface, place requirements on the system
that will support it. Though some of these are ``device'' related, it is
equally important to develop a software architecture to allow sharing of
resources. This is especially true as we believe voice is more likely to
succeed as a data type that is shared across applications than when used
only within a single program.
In the early days of voice systems much work was devoted to data compression schemes, with a goal of cutting bit rates while minimizing the impact on intelligibility. As memory sizes, computer power, and network speeds have increased, the need for many of these techniques is waning, except in specialized environments such as toys or encrypted communications. Evolving applications tend to use 64 Kbits/second log PCM (pulse code modulation) or perhaps 32 or 16 Kbits/second encoding (such as ADPCM, Adaptive Differential PCM) to obtain telephone quality speech. This results in continuous data transfer rates well within the ability of most workstations and file systems.
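The arithmetic behind these rates is straightforward, and worth seeing in bytes rather than bits when planning storage:

```python
# Storage cost of telephone-quality speech at the coding rates above.
# The kbit/s figures come from the text; the byte counts follow directly.

def bytes_per_minute(kbits_per_sec):
    """Bytes of storage consumed by one minute of speech."""
    return kbits_per_sec * 1000 // 8 * 60

for name, rate in [("64 kbit/s log PCM", 64),
                   ("32 kbit/s ADPCM", 32),
                   ("16 kbit/s ADPCM", 16)]:
    print(f"{name}: {bytes_per_minute(rate):,} bytes per minute")
```

At 64 kbit/s, a minute of speech occupies 480,000 bytes, i.e. roughly half a megabyte; this is the ``bulkiness'' noted earlier, but the 8,000 bytes-per-second transfer rate is modest for a workstation disk.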
However, it is important to note the need for buffering data on both play and record operations as UNIX is notorious for high overhead in interrupt response. As opposed to, say, the process of refreshing a text screen or repainting an image, voice output needs to be continuous, without any pauses. If this is achievable only when the workstation is otherwise idle, the applicability of voice will be severely limited. The higher fidelity and data rates associated with Compact Disk quality recordings are typically not required for interactive voice applications.
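The usual remedy is to keep a bounded queue of audio blocks ahead of the output device, so that scheduling hiccups on the reader side never starve playback. A minimal producer/consumer sketch, with block size, queue depth, and all names as illustrative assumptions:

```python
# Sketch of buffered playback: a reader thread keeps a FIFO of audio
# blocks ahead of the output device, so irregular read timing does not
# cause audible gaps. Block size and depth are illustrative assumptions.
import queue
import threading

BLOCK = 4096   # bytes per transfer
DEPTH = 8      # blocks buffered ahead of the device

def play(sound, write_block):
    """Feed `sound` (bytes) to `write_block` through a bounded FIFO."""
    fifo = queue.Queue(maxsize=DEPTH)

    def reader():
        for i in range(0, len(sound), BLOCK):
            fifo.put(sound[i:i + BLOCK])   # blocks if the device is behind
        fifo.put(None)                     # end-of-sound marker

    threading.Thread(target=reader, daemon=True).start()
    while (block := fifo.get()) is not None:
        write_block(block)                 # would be a device write

# Example: "play" into a list standing in for the device.
out = []
play(b"\x00" * 10000, out.append)
print(sum(len(b) for b in out))   # 10000 bytes delivered, in order
```

In a real system the consumer side would be interrupt- or DMA-driven on the speech hardware, with the queue depth chosen to ride out the workstation's worst-case scheduling latency.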
The need for low latency operations to provide a comfortable user interface is more demanding. It may be desired to play several sound files in succession, or to start recording immediately after playing a pre-recorded or synthesized prompt. Real-time, fine-grained silence detection during recording, as a termination condition, is required in a conversational system that alternates prompts and responses; rhythm is a key component of natural sounding dialog. These kinds of operations are difficult to perform in non-real-time operating systems, including most variants of UNIX.
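The silence-detection termination condition amounts to stopping the recording once the input has stayed quiet for some trailing interval. A sketch of the logic, with frame length, threshold, and the 600 ms trailing-silence figure all as illustrative assumptions:

```python
# Minimal end-of-utterance detector: recording terminates once the
# input stays below an energy threshold for a trailing-silence interval.
# Frame length, threshold, and interval are illustrative assumptions.

FRAME_MS = 20              # analysis frame length
TRAILING_SILENCE_MS = 600  # quiet time that ends the recording

def utterance_length(frames, threshold):
    """Return the number of frames kept before termination.
    `frames` is a sequence of per-frame energy values."""
    quiet_needed = TRAILING_SILENCE_MS // FRAME_MS
    quiet = 0
    for i, energy in enumerate(frames):
        quiet = quiet + 1 if energy < threshold else 0
        if quiet >= quiet_needed:
            return i + 1               # terminate the recording here
    return len(frames)                 # input ended first

# 10 frames of speech followed by a long run of silence:
frames = [900] * 10 + [5] * 100
print(utterance_length(frames, threshold=100))  # 40: 10 speech + 30 quiet
```

The catch is timing: each decision must be made within one 20 ms frame, which is exactly the kind of deadline a non-real-time UNIX scheduler cannot promise, and why the detection belongs on dedicated hardware.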
A partial solution is to provide adequate event processing on whatever speech hardware is used, especially for digitization and playback. The relatively few pieces of required state information (such as a flag for ``terminate playback when a touch tone is detected'') can be implemented by a dedicated processor, which also can handle buffer management, flow control, and possibly DMA. In real-time audio applications, the primitives that UNIX processes use to communicate must be carefully chosen, with tradeoffs between complexity of the state logic and application programming ease.
Both recognition and synthesis are likely to require a specialized digital signal processor (DSP). This processor also may be able to handle some of the event handling configurations. The DSP approach may be most cost effective in the future, allowing rapid change between, say, synthesis and playback, by simply downloading a new algorithm. This same DSP may be also used for other communication tasks including modem or facsimile emulation. For example, the TI speech card supports recognition, recording, and text-to-speech synthesis, and the Natural Microsystems Watson card supports modem signaling in addition to recording.
In terms of the requirements for a software architecture, it is important to reiterate that we are interested in voice as a ubiquitous data type rather than a particular application using whatever specialized interface will suffice. Although as humans we use it all the time, voice is difficult enough to use that it is unlikely to ``take off'' in the marketplace until we can do many of the things we now do with text, such as edit it, forward it, and integrate it with other files into coherent documents. Similarly, remote telephone access by voice is only as useful as the range of data made accessible. Finally, we need to be able to use voice with advanced window systems, both as an interface to the window system itself and as ``selected'' data that can be moved between processes running in different windows.
This suggests that a general-purpose audio environment must allow multiple processes to access audio hardware and files; these processes themselves may be distributed. Mechanisms must be provided to detect conflicting requests for scarce resources and to arbitrate between them. We must be able to build graphical interfaces closely-coupled to sound interfaces, capable of maintaining synchronization during record and playback activities. A means of representing multi-media objects in the context of user selections must be provided; this will be most useful if it allows maximum interoperability with current text-only applications.
Many of the requirements listed above suggest the use of a server to
handle the audio hardware. Just as current window systems (such as X or
NeWS) may employ a single server to draw onto the screen for multiple clients,
an audio server must be able to handle requests from multiple application
processes. Of course, the server must deal with conflicting requests from
multiple processes and the state associated with each.
The server can provide event queues to ensure low latency between time critical operations. It can provide for synchronization, which tends to be more difficult in the audio domain as voice operations extend over time. An audio request may be de-queued and processed only after the previous request has finished, which may be long after either request was submitted; thus queue handling is asynchronous with respect to the application.
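The essential asymmetry is that submission is immediate while execution is ordered and slow. A toy model of such a per-client request queue (class and event names are our own, not any real server's interface):

```python
# Sketch of an audio server's request queue: requests are accepted
# immediately but executed strictly in order, and completion events
# are time-stamped and reported asynchronously. Names are illustrative.
from collections import deque

class AudioQueue:
    def __init__(self):
        self.pending = deque()
        self.events = []     # time-stamped completions back to the client
        self.clock = 0       # stand-in for real time, in seconds

    def submit(self, request):
        """Returns at once; the client never waits on the operation."""
        self.pending.append(request)

    def run(self):
        """Server's service loop: drain the queue in order."""
        while self.pending:
            name, duration = self.pending.popleft()
            self.clock += duration            # operations extend over time
            self.events.append((self.clock, name, "done"))

q = AudioQueue()
q.submit(("play greeting", 3))
q.submit(("record reply", 10))   # queued long before the greeting finishes
q.run()
print(q.events)
```

The second request is de-queued only when the first completes, at clock time 3, and finishes at 13; the client learns both facts from the event stream, not from blocking calls.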
A server also can provide a way to establish a device-independent interface to a variety of vendor hardware. The server does not have to reside on the same processor as the application, allowing for bus-specific hardware or the use of a real-time operating system for the server, while retaining the UNIX application development and run time environments for the client. This may allow for eventual server implementations using expensive centralized resources, such as specialized speech recognizers that can be switched between many client workstations based on traffic requirements without the need for hardware replication.
In the next section we briefly describe a number of research approaches to audio servers. The following section will focus on details of current work on one particular server, the Olivetti VOX Audio Server, with a more ambitious architecture and broader range of targeted functionality. VOX attempts to build on the experience gained from these past efforts.
Over the past 10 years, a number of architectures for integrating voice
technology into workstations have been developed. Some have run in UNIX
environments, several have been servers, and others have simply been libraries
of sound-related routines. Some have become products, while others have
never left the research labs. Many of these efforts have influenced the
design of the VOX Audio Server described in the adjoining article.
VOX's most direct antecedent was the voice server developed at the MIT Media Lab beginning in 1984. It consisted of an MS-DOS-based PC with local disk and plug-in speech cards, and it is still in use today for a variety of projects at the Media Lab. The voice server communicates with a UNIX host over a standard RS-232 serial line, providing digitization, recognition, synthesis, and telephony functions, a small set of audio-editing primitives, and specialized features for specific projects (such as analyzing and manipulating pitch of recordings). The serial communication link allows the server to be used from a variety of host workstations without regard to bus or other hardware constraints.
The MIT audio server has been most noted for its emphasis on the requirements of an audio user interface and the projects it was associated with (including Phone Slave and Conversational Desktop). Although it did not include any routing primitives, it did implement event queues and synchronization. MS-DOS provides a convenient real-time environment, and speech boards for the PC bus are plentiful and inexpensive. This voice server was designed for use by a single host process at a time, with no possibility of locking or resource-contention arbitration.
The Etherphone server developed at Xerox PARC uses a centralized server with distributed control and user interface. The Etherphone project was a contemporary of the Media Lab audio server effort, and Etherphone is also still in daily use. Audio is distributed digitally throughout the physical building site over Ethernet, with local voice nodes adjacent to workstations. Workstations communicate with centralized servers that support storage of digitized voice, telephony, speech synthesis, and audio editing in concert with the Etherphone voice nodes. Editing is implemented largely in a sophisticated filesystem based on the Cedar "ropes" construct; this filesystem includes garbage collection and "interest" registration, allowing pieces of sound to be shared by multiple users or applications. Etherphone is more sophisticated in terms of basic architecture, with somewhat less emphasis on demanding user-interface functions, than the Media Lab voice server.
When it was under way at Bell Communications Research from 1985 to 1988, the Modular Integrated Communications Environment (MICE) project provided a full range of audio and telephony functions from a UNIX-based centralized server. An application programming environment provided distributed access control to this server over Ethernet; voice was transported using conventional analog telephony. By providing software access to a powerful, computer-controllable Redcom PBX, programmers could easily build applications such as integrated voice and text mail, a paging service using speech synthesis, and graphical interfaces to telephone functions such as call forwarding and speed dialing.
The Meridian Mail system, a Northern Telecom product, gives PCs network-based interfaces to voice mail and, indirectly, to the PBX. The Meridian system supports applications such as the graphical display of voice mail and mixed voice/text mail systems (although with only one medium per message). Local PCs send commands over the network to the centralized voice storage server via a bridge, allowing for a distributed user interface and, indirectly, access to the PBX.
At the Information Sciences Institute of the University of Southern California, a small but sophisticated Phoneserver provides basic telephone interfaces between UNIX-based workstations over Ethernet to a Rolm PBX. The server itself is a PC with a special PBX interface card; workstations transmit requests via UDP datagrams, and a local user interface (Phonetool) runs under Sun Microsystems' SunView windowing environment. At the Media Lab, Phonetool has recently been ported to the X Window System using auto-dial modems as well as the Media Lab voice server.
Digital Sound Corp. offers a Voice Server product based on a real-time version of UNIX. UNIX processes control speech and telephone functions on associated digital signal processor (DSP) cards and telephone line interfaces, with a digital audio bus linking components. Application processes have access to a powerful set of voice primitives from the UNIX environment. Although equipped with Ethernet interfaces, the product is not a server in the sense the term is used in this article; the network is provided for access to external databases rather than client applications running elsewhere. Instead, the product more closely resembles a centralized resource with hardware support for multiple simultaneous applications, all running within the product. Algorithms for the DSP support a variety of speech and telephony functions.
One other product of note is Boston Technology's UNIX-based voice mail system. It uses a real-time UNIX on a PC bus, with speech cards from Dialogic. The processing power of these cards provides sufficient buffering and distributed functionality to offload the UNIX environment to such an extent that it can serve a large number of phone lines simultaneously. It supports only digitization and telephony functions, and is not sold as a development environment.
The VOX Audio Server is being developed at the Olivetti Research Center
to address the full range of requirements discussed above. VOX attempts
to provide a device-independent interface to audio functions, including
play, record, and telephony, with extensions in the wings to include speech
recognition and synthesis. In addition, it provides support for audio routing
and mixing, as these are an integral part of a realistic audio workstation.
VOX uses a network-transparent client/server architecture heavily influenced by contemporary window systems (see figure 2). As a ``server'', VOX permits multiple clients to share audio resources. Hooks are provided to allow a privileged client, akin to a window manager, to arbitrate conflicting requests. Network transparency means that the client interface is the same regardless of whether the server is on the same processor as the client or elsewhere in a network.
This illustration is not available in the HTML version of the paper.
Figure 2: Underlying VOX layered architecture. Network transparency is achieved in the interface between the client and CLAUD layers. Note that LAUDs can be hierarchically combined into CLAUDs, and that multiple LAUDs can be implemented on a single physical device.
The low level building block of the audio server is the logical audio device (or LAUD, pronounced ``loud''). A LAUD is a device-independent abstraction of an audio resource. For example, playback and recognition are represented by separate LAUDs, even though these functions may be implemented on a single piece of hardware. Client applications may request multiple instances of the same LAUD. LAUDs have ``audio ports'' that can be interconnected, and each LAUD controls a device that performs the desired audio activity.
LAUDs can be combined into a composite LAUD (or CLAUD, pronounced ``cloud''). This is primarily done to maintain synchronization between LAUDs composited together to provide a higher-level service. For example, an answering machine requires synchronization between record, playback, and telephony events. To that end VOX multiplexes input events from a CLAUD's component LAUDs into a single time-stamped stream, and similarly de-multiplexes output requests within the server. For example, the client may submit play and record requests, and receive input tokens from the telephone.
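The multiplexing of component event streams can be pictured with a small model. This is our own sketch of the idea, not the actual VOX interface; class, device, and event names are invented for illustration:

```python
# Sketch of the LAUD/CLAUD idea: each logical device produces its own
# time-stamped events; the composite merges them into one stream for
# the client. Names are our invention, not the VOX API.
import heapq

class LAUD:
    """A logical audio device with a time-stamped event log."""
    def __init__(self, name):
        self.name = name
        self.events = []            # (timestamp, event) pairs

    def post(self, t, event):
        self.events.append((t, event))

class CLAUD:
    """A composite of LAUDs presenting one merged event stream."""
    def __init__(self, *lauds):
        self.lauds = lauds

    def event_stream(self):
        """All component events, merged in time-stamp order."""
        return list(heapq.merge(*(
            [(t, laud.name, e) for t, e in sorted(laud.events)]
            for laud in self.lauds)))

phone, player = LAUD("telephone"), LAUD("playback")
phone.post(0, "ring"); phone.post(1, "off-hook"); phone.post(9, "touch-tone 3")
player.post(2, "greeting started"); player.post(8, "greeting finished")
machine = CLAUD(phone, player)   # an "answering machine" composite
print(machine.event_stream())
```

The answering-machine client sees one ordered history: the ring, the off-hook, the greeting playing, and the caller's touch tone, without having to reconcile timing across devices itself.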
To ensure real-time response, output requests can be prepared in advance of execution. Examples of this pre-processing include opening a sound file, prefetching an already recorded sound, or establishing the state of a speech recognizer. All these activities can be executed before the actual servicing of the request takes place, thereby reducing the latency between related requests--for example, to quickly switch from synthesis to recognition in a conversational application.
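This two-phase structure can be sketched as a request object whose expensive setup runs at submit time, leaving almost nothing for the moment of execution. A hypothetical illustration, not the actual VOX interface:

```python
# Two-phase requests: expensive setup (opening a file, prefetching a
# sound, loading recognizer state) happens at submit time, so the
# later "go" is nearly instantaneous. Names are hypothetical.

class PlayRequest:
    def __init__(self, path, loader):
        self.sound = loader(path)     # prepare: prefetch at submit time
    def execute(self, device):
        device.append(self.sound)     # go: nothing left but the transfer

sounds = {"greeting.au": b"hello", "beep.au": b"beep"}  # stand-in store
device = []                                             # stand-in output

queued = [PlayRequest(p, sounds.get) for p in ("greeting.au", "beep.au")]
for r in queued:          # later, under a tight latency budget:
    r.execute(device)
print(device)   # [b'hello', b'beep']
```

All the disk and setup latency was paid while the previous operation was still running; back-to-back execution then costs only the transfers themselves.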
Additional synchronization and resource allocation mechanisms are provided to allow a client to gain exclusive access to a limited resource. For example, with only one telephone line, a second call cannot be placed without disconnecting the first, or blocking until it finishes.
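The telephone-line case reduces to an exclusive lock with a choice between failing fast and blocking. A sketch with assumed names (the real server's allocation interface may differ):

```python
# Exclusive access to a scarce resource (one telephone line): a second
# caller either fails immediately or blocks until the line is free.
# Class and method names are assumptions for illustration.
import threading

class TelephoneLine:
    def __init__(self):
        self._lock = threading.Lock()

    def place_call(self, wait=False):
        """Return True if the line was acquired; if wait is False and
        the line is busy, return False instead of blocking."""
        return self._lock.acquire(blocking=wait)

    def hang_up(self):
        self._lock.release()

line = TelephoneLine()
print(line.place_call())            # True: line acquired
print(line.place_call(wait=False))  # False: busy, caller not blocked
line.hang_up()
print(line.place_call())            # True: free again
```

In the server proper the same pattern generalizes to any singleton device, with the privileged manager client deciding policy, e.g. whether a new request may preempt the current holder.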
VOX is being written to run under Berkeley UNIX and MACH environments, using AT bus-based workstations to take advantage of the variety of low cost speech boards and peripherals. The first reference implementation is on an Olivetti i386-based workstation running MACH. It supports the Natural MicroSystems VBX speech board (record, playback, and telephone functions), a Yamaha mixer, and an Akai audio crossbar switch (the latter two are controlled via a MIDI interface). A VideoTelecom full duplex echo canceler will be used to provide a high quality hands-free speakerphone using the microphones and speakers available on each desktop (see figure 3).
This illustration is not available in the HTML version of the paper.
Figure 3: Typical configuration of current audio devices supported by VOX on a single workstation. The tie lines in the upper left and lower right permit connections to other workstations.
The audio crossbar switch is used to handle the general case of routing and device interconnection within a workstation, and connects to tie lines to other workstations. All inputs and outputs of the audio peripherals are routed through the switch, providing a flexible environment for the rapid prototyping of audio applications. Applications of particular interest include telephone management, voice annotation, real-time teleconferencing, conversational answering machines, and, more generally, computer-based tools to support collaborative work.
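A crossbar's routing state is just a set of input/output connections, with one input free to feed several outputs at once. A toy connection matrix; the device names are examples, not the actual VOX configuration:

```python
# Toy model of the audio crossbar: any input may be routed to any set
# of outputs. Device names are illustrative examples only.

class Crossbar:
    def __init__(self):
        self.routes = set()            # (input, output) pairs

    def connect(self, src, dst):
        self.routes.add((src, dst))

    def disconnect(self, src, dst):
        self.routes.discard((src, dst))

    def outputs_of(self, src):
        return sorted(dst for s, dst in self.routes if s == src)

xbar = Crossbar()
xbar.connect("microphone", "coder")        # record a voice note...
xbar.connect("microphone", "tie-line-1")   # ...while feeding a conference
xbar.connect("telephone", "speaker")
print(xbar.outputs_of("microphone"))   # ['coder', 'tie-line-1']
```

Because every peripheral passes through the switch, a new application is a new routing pattern rather than new wiring, which is what makes the configuration useful for rapid prototyping.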
The current thrust of the VOX Audio Server development is for building research prototypes to demonstrate the utility of the server and, more generally, of ``desktop audio''. However, the software architecture of the server can be carried over into a product environment. Indeed, ORC is making the software interfaces to the server non-proprietary so that it can be supported across multiple hardware and software platforms--promoting interoperability between clients and servers running in a homogeneous network of machines.
Audio is a demanding medium, but clearly one that is ripe for integration into our everyday computing environment. At present there is a high demand for interactive voice applications; unfortunately the medium is difficult to work with and the currently available technology has many weaknesses. An integrated environment using voice (both as data and control) over a number of applications suggests a server-based approach. Although current technology can support such a server, its architecture is still a research topic.
Keith Lantz and Carl Binding were members of the VOX Audio Server design team along with the authors. Keith also made numerous helpful editorial comments and contributed to the section on VOX.
B. Arons, C. Binding, K. Lantz, and C. Schmandt. The VOX Audio Server.
In 2nd IEEE ComSoc International Multimedia Communications Workshop. IEEE
Communications Society, April 1989.
C. Binding, C. Schmandt, K. Lantz, and B. Arons. Workstation Audio and Window-Based Graphics: Similarities and Differences. In Engineering for Human-Computer Interaction. IFIP Working Group 2.7, August 1989.
G. L. Martin. The Utility of Speech Input in User-Computer Interfaces. International Journal of Man/Machine Systems, 30:355-375, 1989.
C. Schmandt and B. Arons. A Conversational Telephone Messaging System. IEEE Transactions on Consumer Electronics, CE-30(3):xxi-xxiv, August 1984.
C. Schmandt and M. A. McKenna. An Audio and Telephone Server for Multi-Media Workstations. In 2nd IEEE Conference on Computer Workstations, pages 150-159. IEEE Computer Society, March 1988.
P. T. Zellweger, D. B. Terry, and D. C. Swinehart. An Overview of the Etherphone System and its Applications. In 2nd IEEE Conference on Computer Workstations, pages 160-168. IEEE Computer Society, March 1988.