Semiautomatic scene modeling from
three views with partially known structure

My doctoral work thus far has focused on extracting view-invariant information about a static scene from one or more views, given preselected sets of parallel and coplanar edges. View-invariant scene information includes the 3-D geometry of objects in the scene, the surface texture/reflectance of those objects, the scene lighting conditions, and a description of the cameras used to create the views.

Given a sequence of partially overlapping photos of a static scene, we determine 3-D scene structure and camera parameters. Unlike pure-rotation approaches to scene mapping, e.g. Apple's QuickTime VR, which require known focal length and lens distortion and constrain camera motion to planar rotation about the camera's nodal point (its center of projection), my approach places no restriction on camera placement and recovers a full 3-D description rather than merely a 2-D one.

Since each image is individually calibrated from visible line structures, each photo can have an arbitrary camera position, orientation, focal length, and lens distortion. For example, here we process three views of a room from an exhibit shown in the Fall of '94 at the List Visual Arts Center in the Wiesner Building at MIT.
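The calibration details are not reproduced on this page, but the core idea can be sketched: each preselected set of parallel edges yields a vanishing point, and two vanishing points of scene directions assumed to be orthogonal constrain the focal length. The Python sketch below is illustrative only; the function names, the orthogonality assumption, and the use of a known principal point are mine, not the actual implementation.

    # Illustrative sketch only: estimate focal length from two edge sets that
    # are parallel in the scene and assumed mutually orthogonal (e.g. the
    # horizontal and vertical edges of a room). All names are hypothetical.
    import numpy as np

    def vanishing_point(edges):
        # edges: list of segments ((x1, y1), (x2, y2)) that are parallel in 3-D.
        lines = []
        for p1, p2 in edges:
            a = np.array([p1[0], p1[1], 1.0])
            b = np.array([p2[0], p2[1], 1.0])
            l = np.cross(a, b)                    # homogeneous line through a, b
            lines.append(l / np.linalg.norm(l[:2]))
        L = np.vstack(lines)
        _, _, vt = np.linalg.svd(L)               # p minimizing |L p|, with |p| = 1
        p = vt[-1]
        return p[:2] / p[2]

    def focal_from_orthogonal_vps(v1, v2, principal_point):
        # For orthogonal scene directions: (v1 - c) . (v2 - c) + f^2 = 0.
        c = np.asarray(principal_point, dtype=float)
        d = -np.dot(np.asarray(v1) - c, np.asarray(v2) - c)
        return np.sqrt(d) if d > 0 else None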

Original view #1 and the same view overlaid with detected edges which have been manually grouped into parallel and coplanar sets.

Original view #2 and the same view overlaid with detected edges which have been manually grouped into parallel and coplanar sets.

Original view #3 and the same view overlaid with detected edges which have been manually grouped into parallel and coplanar sets.

This set of CAD-like geometric constraints among 2-D points and lines is analyzed to produce a textured 3-D model of the scene. Here is a synthetic view of the result and a low-resolution 362K MPEG fly-through movie. The textured 3-D model is stored in OpenInventor (.iv) format, the basis of VRML 1.0. If you have an OIV viewer (such as ivview, Holodeck, or 3DView), here's a low-res textured version of the scene's .iv file, and here's a bigger .iv file.
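As one illustration of how a texture map can be pulled from a calibrated view, the sketch below warps the image region under a planar face's projected corners into a rectangular texture using a 2-D homography. The use of OpenCV, the fixed texture size, and the corner ordering are assumptions for illustration, not the pipeline described above.

    # Illustrative sketch only: warp the image region under one planar face of
    # the model into a rectangular texture. OpenCV usage, texture size, and
    # corner ordering (tl, tr, br, bl) are assumptions.
    import numpy as np
    import cv2

    def extract_face_texture(image, projected_corners, tex_w=256, tex_h=256):
        src = np.asarray(projected_corners, dtype=np.float32)     # 4 x 2, pixels
        dst = np.array([[0, 0], [tex_w - 1, 0],
                        [tex_w - 1, tex_h - 1], [0, tex_h - 1]], dtype=np.float32)
        H = cv2.getPerspectiveTransform(src, dst)  # homography from 4 point pairs
        return cv2.warpPerspective(image, H, (tex_w, tex_h))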

Structured Video

This technique has also been used to determine the 3-D positions of actors from video taken in a chromakey studio. Camera pose relative to the studio floor is found using calibration cubes. Keeping the cameras fixed, the cubes are removed and the actors are filmed in front of a chromakeyed background. An alpha-mask silhouette is found, and the head and feet are detected in each view. The image coordinates of the feet are projected onto the known floor to determine the actor's position in each frame.
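Projecting the feet onto the known floor amounts to a ray-plane intersection. A minimal sketch, assuming the cube calibration yields intrinsics K, rotation R, and camera center C (the names are mine) and that the floor is the plane z = 0:

    # Illustrative sketch only: intersect the viewing ray through the detected
    # foot pixel with the studio floor, taken here as the plane z = 0.
    # K, R, C stand in for the intrinsics, rotation, and camera center provided
    # by the calibration cubes; the names are hypothetical.
    import numpy as np

    def actor_floor_position(foot_pixel, K, R, C):
        u, v = foot_pixel
        ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera frame
        d = R.T @ ray_cam                                     # ray in world frame
        if abs(d[2]) < 1e-9:
            return None                                       # ray parallel to floor
        t = -C[2] / d[2]                                      # hit the plane z = 0
        return C + t * d                                      # actor's floor point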

The alpha masks and computed 3-D positions of the actors are then used to composite the actors' 2-D cutouts into the textured museum scene. TVOT's parallel processing platform, Cheops, renders the actor textures into views of the textured 3-D scene at about 12 frames per second, which forms the basis of our research on interactive 3-D digital video. Here's a 536K MPEG movie that shows the final result.
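Conceptually, the compositing step projects each actor's 3-D position into the synthetic camera and alpha-blends the scaled cutout over the rendered view at that point. The sketch below is a simplified software stand-in for what Cheops does; every name in it is an assumption.

    # Illustrative sketch only: "over" compositing of an actor cutout into a
    # rendered frame at the pixel where the feet project. Images are float RGB
    # arrays; the cutout is assumed already scaled for its distance, and no
    # bounds checking is shown. All names are hypothetical.
    import numpy as np

    def composite_cutout(render, cutout, alpha, feet_x, feet_y):
        h, w = cutout.shape[:2]
        x0 = int(feet_x - w / 2)                  # center the cutout on the feet
        y0 = int(feet_y - h)                      # feet sit at the cutout's bottom
        region = render[y0:y0 + h, x0:x0 + w]
        a = alpha[..., None]                      # broadcast matte over RGB
        region[:] = a * cutout + (1.0 - a) * region
        return render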


sbeck@media.mit.edu
Copyright © 1994 by MIT Media Lab. All rights reserved.