The primary goal of this project was to create a system that, given example 3D models of human heads, takes incomplete data from a head tracking system (such as that described in ) and produces a new 3D model that best approximates the user's head.
The 3D head models used as examples were obtained using CyberWare scanners (see Figure 1.1 for an example). These scanners work by revolving an arm containing a laser range finder 360 degrees around a subject's head, measuring the distance to the head, as well as the color, at many points (typically 512 measurements vertically and longitudinally). The output is a cylindrical coordinate depth map of the subject's head, together with an associated texture map; the two can be combined to reconstruct a 3D model. When viewed as 2D images instead of 3D models, these scans look like an "unwrapped" version of the head.
Figure 1.1: A computer model generated with a CyberWare scanner. The first two images are the range and texture data, respectively.
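As an illustration of how a cylindrical depth map can be turned back into 3D geometry, the following is a minimal sketch, not the scanner's actual file format or the project's code. It assumes that rows index height along the cylinder axis, that columns index the angle over a full revolution, and that each stored value is the radial distance from the central axis. The function name and the `height` parameter are hypothetical.

```python
import numpy as np

def cylindrical_to_cartesian(range_map, height=1.0):
    """Convert a cylindrical range map to a (rows, cols, 3) array of XYZ points.

    Assumptions (not taken from the source): rows index height along the
    cylinder axis (y), columns index the angle around the axis over a full
    360 degrees, and each entry is the radial distance from that axis.
    """
    range_map = np.asarray(range_map, dtype=float)
    rows, cols = range_map.shape
    # One angle per column; endpoint=False avoids duplicating 0 and 360 degrees.
    theta = np.linspace(0.0, 2.0 * np.pi, cols, endpoint=False)
    # One height per row, spread evenly over the scanned height.
    y = np.linspace(0.0, height, rows)
    theta_grid, y_grid = np.meshgrid(theta, y)  # both have shape (rows, cols)
    x = range_map * np.sin(theta_grid)
    z = range_map * np.cos(theta_grid)
    return np.stack([x, y_grid, z], axis=-1)

# A synthetic "scan" of a perfect cylinder of radius 2.
points = cylindrical_to_cartesian(np.full((4, 8), 2.0))
```

Each point can then be colored with the texture sample at the same row and column, since the range and texture maps share the same cylindrical parameterization.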
FLIRT (Face Localization for Invariant Recognition and Tracking) uses a single video camera connected to a Silicon Graphics Indy workstation to find facial features such as the eyes and the corners of the mouth. The three-dimensional alignment and (sparse) structure of the head are estimated using a Kalman filter as described in . Using this positional information, an "average head" model is aligned to the user's head, and the video image of the user is projected onto the model. The texture corresponding to the face region is then unwrapped to create a sparse image with the same "squashed" appearance as the example CyberWare scans. This data is used as input to the reconstruction system.
An eigenvector decomposition of the example heads was used as the coding mechanism for the reconstruction pipeline. This technique is efficient: encoding an incoming head requires only taking its dot products with the eigenvectors. It has also been shown to be effective for both recognition (, ) and reconstruction (). Modular eigenspaces were used to code regions of the face such as the eyes, nose, and mouth semi-independently and at higher resolution. A linear estimator based on the example heads was used to fill in the data that the tracking system could not supply.
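The coding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: example heads are flattened into vectors, the mean is subtracted before projection, and the eigenvectors are obtained from a singular value decomposition. The source does not specify the linear estimator, so an ordinary least-squares fit over the observed entries stands in for it here. All function names are hypothetical.

```python
import numpy as np

def fit_eigenspace(examples, num_components):
    """Return the mean and the leading eigenvectors of flattened examples.

    examples has shape (n_examples, n_values). The rows of the returned
    eigenvector matrix are orthonormal.
    """
    mean = examples.mean(axis=0)
    centered = examples - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of the sample covariance matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components]

def encode(head, mean, eigvecs):
    """Code a head as its dot products with the eigenvectors."""
    # Assumption: the mean is subtracted first, as is usual for this coding.
    return eigvecs @ (head - mean)

def decode(coeffs, mean, eigvecs):
    """Rebuild a full head vector from its coefficients."""
    return mean + eigvecs.T @ coeffs

def estimate_from_sparse(values, observed, mean, eigvecs):
    """Fill in missing entries with a least-squares fit (a stand-in estimator).

    observed is a boolean mask; only values[observed] are used.
    """
    basis = eigvecs[:, observed].T  # shape (n_observed, n_components)
    target = values[observed] - mean[observed]
    coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return decode(coeffs, mean, eigvecs)

# Synthetic data standing in for example heads: 20 vectors of length 50
# that vary along only 3 directions.
rng = np.random.default_rng(0)
examples = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 50)) + 5.0
mean, eigvecs = fit_eigenspace(examples, num_components=3)

head = examples[0]
coeffs = encode(head, mean, eigvecs)

# Pretend the tracker supplied only the first 10 of the 50 values.
observed = np.zeros(50, dtype=bool)
observed[:10] = True
filled = estimate_from_sparse(head, observed, mean, eigvecs)
```

With a sparse input, the coefficients can no longer be read off from plain dot products because the eigenvectors are not orthogonal when restricted to the observed entries, which is why the sketch solves a small least-squares problem instead.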
Figure 1.2: The run-time reconstruction sequence. FLIRT captures the video image, and the data corresponding to the face region is unwrapped into cylindrical coordinates, creating a sparse texture map. Two views of the model reconstructed from this data are shown.