Michael Casey's Research Projects

MIT Media Lab, 1992-1997

Music Cognition Group, Machine Listening Group, Perceptual Computing Group

Perceptual Audio Models

Model-based audio is a sound representation essential for large-scale networked applications such as Virtual Worlds and mobile Internet music distribution. My Ph.D. thesis explored the use of information theory and sound analysis to create general synthesis methods that offer controllable audio for networked interactive applications and for network distribution of audio.
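
To make this concrete, here is a minimal Python sketch, not the thesis system, of what a model-based representation buys: a sound is described by a handful of synthesis parameters and reconstructed by the receiver rather than transmitted sample by sample. The additive-sinusoid model, envelope, and parameter values are all illustrative choices.

    import math

    def resynthesize(partials, duration=1.0, rate=8000, decay=3.0):
        """Additive resynthesis from (frequency Hz, amplitude) pairs with
        a shared exponential decay envelope."""
        samples = []
        for i in range(int(duration * rate)):
            t = i / rate
            env = math.exp(-decay * t / duration)
            samples.append(env * sum(a * math.sin(2 * math.pi * f * t)
                                     for f, a in partials))
        return samples

    # Six numbers stand in for the 8000 samples they expand into, and the
    # parameters remain controllable (transpose, rebalance) until render time.
    bell = [(440.0, 0.6), (662.0, 0.3), (1100.0, 0.1)]
    audio = resynthesize(bell)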

NetSound (Low-Bandwidth Music for the Net)

We have built a transport mechanism for sound that requires only a signal-processing network specification and an event list to drive it, yielding an extremely low-bandwidth representation for music.
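
The sketch below illustrates the idea with an invented specification and event format; NetSound's actual syntax differs. The sender ships an instrument description (the signal-processing network) plus a time-stamped event list, and the receiving host renders the audio locally, so a few dozen bytes expand into seconds of sound.

    import math

    INSTRUMENT = {"waveform": "sine", "decay": 4.0}    # stands in for the network spec

    EVENTS = [                                         # (start s, dur s, freq Hz, amp)
        (0.0, 0.5, 261.63, 0.5),
        (0.5, 0.5, 329.63, 0.5),
        (1.0, 1.0, 392.00, 0.5),
    ]

    def render(spec, events, rate=8000):
        """Synthesize the event list through the instrument on the receiver."""
        total = max(start + dur for start, dur, _, _ in events)
        out = [0.0] * int(total * rate)
        for start, dur, freq, amp in events:
            for i in range(int(dur * rate)):
                t = i / rate
                env = math.exp(-spec["decay"] * t / dur)
                out[int(start * rate) + i] += amp * env * math.sin(2 * math.pi * freq * t)
        return out

    audio = render(INSTRUMENT, EVENTS)    # a tiny spec expands into 2 s of audio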

NetSound Page [HTML] ICMC96 NetSound Paper [HTML]

Audio for Interactive Video Environments and Smart Rooms

Current interactive media and VR applications use head-mounted stereo headphones and microphones to capture audio for speech recognition and to render 3D audio. This project explores adaptive algorithms for beamforming and for steerable 3D audio. For the Media Lab's IVE system, we have built a version that requires no body-mounted equipment.
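
As an illustration of the capture side, here is a minimal delay-and-sum beamformer in Python. This is the standard textbook technique, included only as background; it is not necessarily the adaptive algorithm described in the preprint, and the array geometry is assumed.

    import math

    SPEED_OF_SOUND = 343.0    # metres per second

    def delay_and_sum(mic_signals, mic_positions_m, angle_deg, rate=16000):
        """Steer a linear array toward angle_deg from broadside. A wavefront
        from that angle reaches a microphone at position x earlier by
        x*sin(angle)/c, so delaying each channel by that amount makes the
        desired direction add coherently while others tend to cancel."""
        n = len(mic_signals[0])
        out = [0.0] * n
        for sig, x in zip(mic_signals, mic_positions_m):
            delay = int(round(x * math.sin(math.radians(angle_deg))
                              / SPEED_OF_SOUND * rate))
            for i in range(n):
                if 0 <= i - delay < n:
                    out[i] += sig[i - delay] / len(mic_signals)
        return out

    # e.g. four microphones spaced 5 cm apart, steered 30 degrees off broadside:
    # enhanced = delay_and_sum(recordings, [0.0, 0.05, 0.10, 0.15], 30.0)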

AES99 Preprint #4052 (B-2) [PostScript 2312K] [HTML version]

Virtual Music

(for Distributed Virtual Reality Systems)

A system that lets many users interact in a shared environment, creating sound and animated graphics through a gesture interface.
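
A hypothetical sketch of such a mapping follows; the gesture parameters, message format, and broadcast transport are all illustrative, not the system's actual protocol. Each user's gesture becomes a compact note event that is broadcast to the other participants, who render the same sound and graphics locally.

    import json
    import socket

    def gesture_to_event(user, height, speed):
        """Map a simple gesture (normalized hand height and speed) to a note."""
        return {"user": user,
                "pitch": 48 + int(height * 24),        # higher hand, higher pitch
                "velocity": min(127, int(speed * 127))}

    def broadcast(event, port=9999):
        """Send the event to every host on the local network segment."""
        msg = json.dumps(event).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            s.sendto(msg, ("255.255.255.255", port))

    broadcast(gesture_to_event("alice", height=0.8, speed=0.4))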

Audio Rendering and Sound Design for Large-Scale Virtual Environments

I have been working with Mitsubishi Electric Research Laboratories on SPLINE, a scalable, multi-user virtual reality operating environment. Its first application, Diamond Park, is a large-scale virtual environment; SPLINE and Diamond Park were shown at COMDEX in Las Vegas, November 13-17, 1995.

Computational Analysis of Speech and Hand Gestures

Stand-up comedians are a rich source of human gesture data: their delivery is practiced and timed to maximize the impact on the audience. We have built a system that extracts audio and visual features from recorded video and classifies these features using unsupervised clustering techniques.

The data consists of recorded segments of Jay Leno's "Tonight Show" and David Letterman's "Late Show". The goal is to create tools for browsing video databases for segments likely to interest viewers.
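
For illustration, here is a minimal k-means clustering pass over joint audio-visual feature vectors in Python. The feature set named in the comments is an assumption; the system's actual features and clustering method may differ.

    import random

    def kmeans(points, k, iters=20):
        """Group feature vectors around k centers by alternating assignment
        and re-estimation (the classic unsupervised clustering loop)."""
        centers = random.sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(p, centers[c])))
                groups[nearest].append(p)
            for j, g in enumerate(groups):
                if g:    # keep the old center if its group emptied out
                    centers[j] = tuple(sum(col) / len(g) for col in zip(*g))
        return centers

    # One vector per video frame, e.g. (audio energy, pitch, hand x, hand y).
    frames = [tuple(random.random() for _ in range(4)) for _ in range(200)]
    print(kmeans(frames, k=3))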

[PostScript (0.9Mb)] [Frame Maker (1.9Mb)] [HTML version]

Sound Analysis with Multiple Models

We are implementing a system that uses competition among parallel sound models to classify sounds and to estimate the salient parameters under each model. Our research explores robust parameter estimation techniques for a variety of sound models.
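
A hedged sketch of the competition scheme, not the published algorithm: each candidate model fits the signal and resynthesizes it, the model with the smallest residual wins, and its fitted parameters become the estimates. The two toy models below are placeholders for the real ones.

    import math

    def fit_sine(signal, rate):
        """Grid-search a unit-amplitude, zero-phase sinusoid for the
        frequency that best explains the signal (deliberately crude)."""
        best_err, best_freq, best_resyn = float("inf"), 0, []
        for freq in range(50, 2000, 10):
            resyn = [math.sin(2 * math.pi * freq * i / rate)
                     for i in range(len(signal))]
            err = sum((a - b) ** 2 for a, b in zip(signal, resyn))
            if err < best_err:
                best_err, best_freq, best_resyn = err, freq, resyn
        return {"freq": best_freq}, best_resyn

    def fit_noise(signal, rate):
        """Model the signal as zero-mean noise; its expected waveform is 0."""
        power = sum(x * x for x in signal) / len(signal)
        return {"power": power}, [0.0] * len(signal)

    def compete(signal, rate=8000):
        """Every model fits the same signal; the lowest residual wins."""
        results = []
        for name, fit in (("sine", fit_sine), ("noise", fit_noise)):
            params, resyn = fit(signal, rate)
            err = sum((a - b) ** 2 for a, b in zip(signal, resyn))
            results.append((err, name, params))
        return min(results)

    tone = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(800)]
    print(compete(tone))    # the sinusoidal model wins, reporting freq=440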

[Connection Science 6:2&3, pp. 355-371, 1995] [HTML version]


Michael Casey