An Annotated Bibliography of Interactive Speech User Interfaces by Barry Arons

For a quick look at the most interesting work, see the handful of videos by searching for '(video'. Keep in mind that Conversational Desktop and Phone Slave were done in the mid-1980s.


Skimming and Browsing of Speech, Time Compression

SpeechSkimmer: A System for Interactively Skimming Recorded Speech (PDF, 40 pages)
B. Arons. ACM Transactions on Computer-Human Interaction. March 1997, Volume 4, Number 1, pages 3-38.
This is the final and best SpeechSkimmer paper. It is an expanded version of the UIST 93 paper and includes most of the material from my dissertation, including the usability test. This paper also describes the skimming interface built using an Apple Newton MessagePad, which was done after the dissertation was completed.

Interactively Skimming Recorded Speech. (PDF, 146 pages)
B. Arons. Ph.D. dissertation, MIT, Feb. 1994.
This document is superseded by the ToCHI paper above as a complete and citable reference for SpeechSkimmer. The dissertation is a prettier document and contains more background material, including chapters on Hyperspeech, adaptive speech detection, and time compression. See the ToCHI paper unless you really need more background material.
Two non-visual user interfaces (SpeechSkimmer and Hyperspeech) for interactively skimming recorded speech are presented along with a variety of background material. This document includes revised and expanded versions of other papers as noted here.

SpeechSkimmer: Interactively Skimming Recorded Speech (PDF, 10 pages)
B. Arons. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), ACM SIGGRAPH and ACM SIGCHI, ACM Press, Nov. 1993, pp. 187-196.
This paper is superseded by the ToCHI paper above, but is included here for reference.
A non-visual user interface for interactively skimming speech recordings is described. SpeechSkimmer uses simple speech processing techniques to allow a user to hear recorded sounds quickly, and at several levels of detail. User interaction through a manual input device provides continuous real-time control of speed and detail level of the audio presentation.

A Hands-on Demonstration of SpeechSkimmer (PDF, 2 pages)
B. Arons. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), Nov. 14-17, 1995, pp. 71-72.

See a video demonstrating SpeechSkimmer. [video 1:30]

Pitch-Based Emphasis Detection for Segmenting Speech Recordings (PDF, 4 pages)
B. Arons. In Proceedings of International Conference on Spoken Language Processing (September 18-22, Yokohama, Japan), vol. 4, 1994, pp. 1931-1934.
A description of the pitch-based emphasis detection algorithm used for automatically summarizing speech recordings in SpeechSkimmer. (This paper expands on material in the dissertation)

Techniques, Perception, and Applications of Time-Compressed Speech (PDF, 9 pages)
B. Arons. In Proceedings of 1992 Conference, American Voice I/O Society, Sep. 1992, pp. 169-177.
A review of time-compressed speech including the limits of perception, practical time-domain compression techniques, and an extensive bibliography. (Note: this paper, with minor revisions, appears as a chapter in the dissertation.)

Efficient Listening with Two Ears: Dichotic Time Compression and Spatialization (PDF, 7 pages)
B. Arons. In Proceedings of the International Conference on Auditory Display (Santa Fe, NM, Nov. 7-9), 1994, pp. 171-177.
An in-depth discussion of dichotic time compression, with an exploration of using dichotic time compression in a spatial audio system.

A Review of The Cocktail Party Effect (PDF, 16 pages)
B. Arons. Journal of the American Voice I/O Society 12 (Jul. 1992), 35-50.
A review of research in the area of multi-channel and spatial listening with an emphasis on techniques that could be used in speech-based systems.

Conference Scribe: Turning Conference Calls into Documents. (PDF, 9 pages)
P. Wellner, D. Weimer, and B. Arons. Proc. IEEE HICSS, Jan. 2001.
This paper describes a system for turning conference calls into archived documents that can be browsed, skimmed, displayed, hyperlinked and annotated on the World Wide Web.

Designing Auditory Interactions for PDAs. (PDF, 4 pages)
D. Hindus, B. Arons, L. Stifelman, B. Gaver, E. Mynatt, M. Back. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), ACM SIGGRAPH and ACM SIGCHI, ACM Press, 1995. pp. 143-146.


Audio Notebook

The Audio Notebook: Paper and Pen Interaction with Structured Speech (PDF, 8 pages)
L. Stifelman, B. Arons, and C. Schmandt. Proceedings of the SIGCHI conference on Human factors in computing systems. 2001. Pages 182-189.
The Audio Notebook is a combination of a digital audio recorder and paper notebook, all in one device. Audio recordings are structured using two techniques: user structuring based on notetaking activity, and acoustic structuring based on a talker's changes in pitch, pausing, and energy.

See the video shown at the conference demonstrating the Audio Notebook. [video 2:00]


Voice Notes

VoiceNotes: A Speech Interface for a Hand-Held Voice Notetaker (PDF, 8 pages)
L.J. Stifelman, B. Arons, C. Schmandt, and E.A. Hulteen. In Proceedings of INTERCHI (Amsterdam, The Netherlands, Apr. 24-29), ACM, New York, 1993, pp. 179-186.
VoiceNotes is an application for a voice-controlled hand-held computer that allows the creation, management, and retrieval of user-authored "voice notes" (small segments of digitized speech containing thoughts, ideas, reminders, or things to do). VoiceNotes explores the problem of capturing and retrieving spontaneous ideas, the use of speech as data, and the use of speech input and output in the user interface for a hand-held computer without a visual display.


Hyperspeech

Hyperspeech (video 2:30).
B. Arons. ACM SIGGRAPH Video Review 88 (1993). InterCHI '93 Technical Video Program.
This is the best introduction to the system. A short video showing the Hyperspeech system in use.

A short description of the video. (PDF, 1 page)
B. Arons. CHI '93 Technical Video Program. p. 524 of CHI proceedings.

Hyperspeech: Navigating in Speech-Only Hypermedia (PDF, 14 pages)
B. Arons. In Proceedings of Hypertext (San Antonio, TX, Dec. 15-18), ACM, New York, 1991, pp. 133-146.
Hyperspeech is a speech-only (non-visual) hypermedia application that explores issues of speech user interfaces, navigation, and system architecture in a purely audio environment without a visual display. The system uses speech recognition input and synthetic speech feedback to aid in navigating through a database of digitally recorded speech segments. (Note: this paper, with minor revisions, appears as a chapter in the dissertation)

Authoring and Transcription Tools for Speech-Based Hypermedia (PDF, 5 pages)
B. Arons. In Proceedings of 1991 Conference, American Voice I/O Society, Sep. 1991, pp. 15-20.
An exploration of issues for automatically authoring a Hyperspeech database.


Conference Reports

Future of Speech and Audio in the Interface: A CHI 94 Workshop Report (PDF, 9 pages)
B. Arons and E. Mynatt. SIGCHI Bulletin 26, 4 (Oct. 1994), 44-48.
A report on the 1.5-day workshop on "The Future of Speech and Audio in the Interface" held at CHI 94.

Future of Speech and Audio in the Interface. (PDF, 1 page)
B. Arons and E. Mynatt. CHI Conference Companion. p. 465.
A description of the workshop.

Speech and audio in window systems: when will they happen? (PDF, 18 pages)
B. Arons, C. Schmandt, M. Hawley, L. Ludwig, P. Zellweger. ACM SIGGRAPH 89 Panel Proceedings. Pages 159-176.
A transcript of the panel.


Audio Servers

Tools for Building Asynchronous Servers to Support Speech and Audio Applications (PDF, 8 pages).
B. Arons. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST), ACM SIGGRAPH and ACM SIGCHI, ACM Press, Nov. 1992, pp. 71-78.
Describes tools for rapidly prototyping and debugging multimedia servers and applications. Includes details of a SparcStation-based audio server, speech recognition server, and several interactive applications.

Speech Recognition Architectures for Multimedia Environments. (PDF, 8 pages)
E. Ly, C. Schmandt, and B. Arons. In Proceedings of 1993 Conference, American Voice I/O Society, Sept. 1993.
An object-oriented architecture and API (applications programming interface) for speech recognition servers.

Desktop Audio (a.k.a. Getting the Word) (HTML version)
C. Schmandt and B. Arons. Unix Review 7, 10 (Oct. 1989), 54-62.
An overview of "Desktop Audio" including the systems and interface requirements for the use of speech and audio in the personal workstation. Includes a summary of the VOX Audio Server, a system for managing and controlling the audio resources in a networked personal workstation.

The Design of Audio Servers and Toolkits for Supporting Speech in the User Interface (PDF, 15 pages)
B. Arons. Journal of the American Voice I/O Society 9 (Mar. 1991), 27-41.
An overview of audio servers, and design thoughts for toolkits built on top of an audio server, to provide a higher level programming interface.

A Voice and Audio Server for Multimedia Workstations. (PDF, 4 pages)
B. Arons, C. Binding, K. Lantz, and C. Schmandt. In Proceedings of Speech Tech '89, May 1989, pp. 86-89.
A description of the VOX Audio Server designed at Olivetti Research.

The VOX Audio Server. (PDF, 6 pages)
B. Arons, C. Binding, K. Lantz, and C. Schmandt. Multimedia '89, 2nd IEEE Comsoc International Multimedia Communications Workshop, Apr. 20-23, 1989, Ottawa, Ontario.

The VOX Audio Server. (PDF, 211 pages)
B. Arons, W. Yamamoto, J.D. Northcutt, C. Binding, K. Lantz, and C. Schmandt. Version 1.0. Olivetti Research Center, technical report, Aug. 1988.
Detailed internal design of the VOX Audio Server.


Conversational Desktop

Conversational Desktop (video 4:00).
C. Schmandt and B. Arons. ACM SIGGRAPH Video Review 27 (1987).
This is the best introduction to the system. A short video demonstrating many features of the Conversational Desktop.

Voice Interaction in an Integrated Office and Telecommunications Environment. (PDF, 7 pages)
C. Schmandt, B. Arons, and C. Simmons. In Proceedings of 1985 Conference, American Voice I/O Society, 1985.
The Conversational Desktop is a conversational office assistant that manages personal communications (phone calls, voice mail messages, scheduling, reminders, etc.). The system engages the user in a conversation to resolve ambiguous speech recognition input.

A Robust Parser and Dialog Generator for a Conversational Office System. (PDF, 11 pages)
C. Schmandt and B. Arons. In Proceedings of 1986 Conference, American Voice I/O Society, 1986, pp. 355-365.
Details the components of the system that handle and correct speech recognition errors through an interactive dialog.


Phone Slave

Phone Slave (video 5:30)
This is the best introduction to the system.

Phone Slave: A Conversational Telephone Messaging System. (PDF, 4 pages)
C. Schmandt and B. Arons. IEEE Transactions on Consumer Electronics CE-30, 3 (Aug. 1984), xxi-xxiv.
Phone Slave is a highly interactive conversational telephone answering machine with touch screen and speech recognition interfaces. This paper focuses on the speech interaction aspects of the system.

A Graphical Telecommunications Interface. (PDF, 4 pages)
C. Schmandt and B. Arons. Proceedings of the Society for Information Display 26, 1 (1985), 79-82.
Focuses on the graphical interaction aspects of the Phone Slave.

The Audio-Graphical Interface to a Personal Integrated Telecommunications System. (PDF, 88 pages)
B. Arons. Master's thesis, MIT, Jun. 1984.
Details the design and implementation of the Phone Slave system.


Miscellaneous

MIT's Sampler Disc of Disc Techniques. (PDF, 4 pages)
B. Arons. Educational and Industrial Television 16, 6 (June 1984), 36-40.
A detailed description of the design, production, and contents of the Discursions video disc from the Architecture Machine Group.

