Research Overview

My doctoral research is divided into two parts. The first is scene acquisition: with a minimal amount of initial assistance from the user/designer, the computer automatically recovers the geometry and surface texture of a scene from still photos or individual frames of video. I see it as a kind of 'CAD of the future', because the interface is intuitively simple but, by using photogrammetry techniques, it is quite powerful.

This acquisition work is of interest to people in architecture and design, to people doing digital special effects for cinematography, and to people doing virtual reality and telepresence. The 3-D result can be viewed interactively with any rendering package or platform. To view it comfortably at 30 frames per second (30 fps) with the best texture resolution, you need an SGI Onyx workstation with a RealityEngine texture-rendering coprocessor. A 486 with a commercial 3-D software rendering library can render a subsampled textured room at about 2 fps, but can manage 30 fps easily if the textures are averaged into flat-shaded polygons.

The second part of my research aims to use this acquired scene model as the basis for low-bandwidth 3-D video coding. That is, you take a few images of a room, use my software to recover the 3-D room, and then film the room with live actors and a moving camera. I hope to recover the camera position and actor outlines in each frame, so that a movie, sports event, or news program can be encoded as a single static 3-D scene model, camera parameters (7 numbers for each frame), and a separate masked 2-D video sequence describing the actors. The resulting video description should not only be significantly smaller than MPEG (1.5 megabits per second); it is inherently 3-D and could be used for interactive 3-D television.
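To see why the camera track costs so little of the bit budget, here is a back-of-the-envelope comparison against the MPEG rate quoted above. This is only a sketch: the split of the 7 parameters (assumed here as 3 for position, 3 for orientation, 1 for focal length) and the 32-bit encoding are my illustrative assumptions, not something specified in the text, and the masked actor video would of course dominate the total rate.

```python
# Rough bandwidth of the per-frame camera parameters vs. MPEG (1.5 Mbit/s).
# Assumptions (illustrative, not from the original text): the 7 numbers are
# position (3) + orientation (3) + focal length (1), each a 32-bit float.

FPS = 30                 # frames per second
PARAMS_PER_FRAME = 7     # camera parameters per frame
BITS_PER_PARAM = 32      # one 32-bit float per parameter

camera_bits_per_sec = FPS * PARAMS_PER_FRAME * BITS_PER_PARAM
mpeg_bits_per_sec = 1_500_000

print(camera_bits_per_sec)                            # 6720 bits/s
print(mpeg_bits_per_sec // camera_bits_per_sec)       # MPEG is ~223x larger
```

Even uncompressed, the camera stream is under 7 kbit/s, so the coding gain hinges entirely on how compactly the static scene model and the masked actor video can be represented.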

This concept fits nicely with the latest white papers outlining the future of VRML (e.g. http://vrml.wired.com/future/scale.html). In particular, they consider the need to separate information into a static object (i.e. the static 3-D textured scene model) and a dynamic stream (i.e. the camera parameters and the overlaid actor video). Given the growing commercial interest in VRML, a viable infrastructure will certainly build up, but acquisition remains a serious technical challenge.

I have focused on the problem of convenient 3-D acquisition. It is difficult because it is basically an AI (artificial intelligence) problem. Machines fail miserably at tracking ordinary features (like the tip of a light switch on the back wall) throughout a video sequence. They fare even worse when trying to separate pixels that belong to the room from pixels that belong to an actor. My hope is that bootstrapping the problem at the scene-creation stage leaves enough information for 'dumb' computer vision techniques to solve the room feature tracking and actor segmentation problems.
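As one example of the kind of 'dumb' technique meant here, a feature like that light-switch tip can be followed frame to frame by normalized cross-correlation template matching. The sketch below is illustrative only: the function name, the grayscale-array representation of frames, and the fixed search window are my assumptions, and a real tracker would need sub-pixel refinement and handling of occlusion by actors.

```python
import numpy as np

def track_feature(frame, template, search_top_left, search_size):
    """Find the best match for `template` inside a square search window
    of `frame` using normalized cross-correlation (NCC).

    frame, template -- 2-D grayscale numpy arrays
    search_top_left -- (row, col) of the search window's top-left corner
    search_size     -- number of candidate positions per axis
    Returns ((row, col) of best match, NCC score in [-1, 1])."""
    th, tw = template.shape
    r0, c0 = search_top_left
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -2.0, (r0, c0)
    for r in range(r0, r0 + search_size):
        for c in range(c0, c0 + search_size):
            patch = frame[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom == 0:          # flat patch: correlation undefined
                continue
            score = (p * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Toy usage: plant a textured 5x5 feature in a blank frame and re-find it.
frame = np.zeros((40, 40))
frame[10:15, 20:25] = np.arange(25.0).reshape(5, 5)
template = frame[10:15, 20:25].copy()
pos, score = track_feature(frame, template, (5, 15), 12)
print(pos, round(score, 3))   # (10, 20) 1.0
```

The appeal of bootstrapping is exactly that a tracker this simple works once the 3-D scene model predicts roughly where each feature should appear, shrinking the search window and ruling out false matches.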

Naturally, if people really want 3-D TV or interactive 3-D TV, it is better to put greater constraints on the studio: that is, to require special pre-calibrated camera setups and the like. This will be more effective than trying to emulate a human's ability to gather 3-D understanding from 2-D images. However, when you want to recover unknown (but structured) geometry and surface texture only from existing images (say, from a page in a magazine), my approach is the one to take. Besides creating really useful modeling tools, it pushes the envelope for high-level scene modeling from video. Full-blown automation is secondary to photorealism and correctness.

| home | overview | research | sbecker (at) alum (dot) mit (dot) edu |