BiDi Screen: A Thin, Depth-Sensing LCD for 3D Interaction using Light Fields

Matthew Hirsch1      Douglas Lanman2      Henry Holtzman1      Ramesh Raskar1
1MIT Media Lab         2Brown University

Supplementary Material

Paper teaser image

Figure 1 3D interaction with thin displays. We modify an LCD to allow co-located image capture and display. (Left) Mixed on-screen 2D multi-touch and off-screen 3D interactions. Virtual models are manipulated by the user's hand movement. Touching a model brings it forward from the menu, or puts it away. Once selected, free-space gestures control model rotation and scale. (Middle) Multi-view imagery recorded in real-time using a mask displayed by the LCD. (Right, Top) Image refocused at the depth of the hand on the right; the other hand, which is closer to the screen, is defocused. (Right, Bottom) Real-time depth map, with near and far objects shaded green and blue, respectively.

Abstract

We transform an LCD into a display that supports both 2D multi-touch and unencumbered 3D gestures. Our BiDirectional (BiDi) screen, capable of both image capture and display, is inspired by emerging LCDs that use embedded optical sensors to detect multiple points of contact. Our key contribution is to exploit the spatial light modulation capability of LCDs to allow lensless imaging without interfering with display functionality. We switch between a display mode showing traditional graphics and a capture mode in which the backlight is disabled and the LCD displays a pinhole array or an equivalent tiled-broadband code. A large-format image sensor is placed slightly behind the liquid crystal layer. Together, the image sensor and LCD form a mask-based light field camera, capturing an array of images equivalent to that produced by a camera array spanning the display surface. The recovered multi-view orthographic imagery is used to passively estimate the depth of scene points. Two motivating applications are described: a hybrid touch plus gesture interaction and a light-gun mode for interacting with external light-emitting widgets. We show a working prototype that simulates the image sensor with a camera and diffuser, allowing interaction up to 50 cm in front of a modified 20.1 inch LCD.

Contributions

We demonstrate that a BiDi screen can recognize on-screen as well as off-screen gestures. We also demonstrate its ability to detect light-emitting widgets, showing novel interactions between displayed images and external lighting.

Future Designs

The emphasis of this paper is on demonstrating novel techniques for optical sensing enabled when an LCD and diffuse light-sensing grid are placed proximate to each other. As such devices are currently being developed for commercial deployment, one goal is to influence the design of these displays by exploring design choices and illustrating additional benefits and applications that can be derived. We only touch upon the interaction techniques enabled, leaving additional assessment to future work.

Thin, Depth-Sensing LCDs

Earlier light-sensing displays focused on achieving touch interfaces. Our design advances the field by supporting both on-screen 2D multi-touch and off-screen, unencumbered 3D gestures. Our key contribution is that the LCD is put to double duty; it alternates between its traditional role in forming the displayed image and a new role in acting as an optical mask. We show that achieving depth- and lighting-aware interactions requires a small displacement between the sensing plane and the display plane. Furthermore, we maximize the display and capture frame rates using optimally light-efficient mask patterns.

Lensless Light Field Capture

We describe a thin, lensless light field camera composed of an optical sensor array and a spatial light modulator. We evaluate the performance of pinhole arrays and tiled-broadband masks for light field capture from primarily reflective, rather than transmissive, scenes. We describe key design issues, including: mask selection, spatio-angular resolution trade-offs, and the critical importance of angle-limiting materials.

Unencumbered 3D Interaction

We show novel interaction scenarios using a BiDi screen to recognize on- and off-screen gestures. We also demonstrate detection of light-emitting widgets, showing novel interactions between displayed images and external lighting.

Dynamic Masks

Because the mask, whether composed of pinholes or a tiled-broadband code, is formed on an LCD, we can dynamically vary the size and density of such periodic patterns.

Benefits and Limitations

The BiDi screen has several benefits over related techniques for imaging the space in front of a display. Chief among them is the ability to capture multiple orthographic images, with a potentially thin device, without blocking the backlight or portions of the display. Besides enabling lighting direction and depth measurements, these multi-view images support the creation of a true mirror, where the subject gazes into her own eyes, or a videoconferencing application in which the participants have direct eye contact [Rosenthal 1947]. At present, however, the limited resolution of the prototype does not produce imagery competitive with consumer webcams.

The BiDi screen requires separating the light-modulating and light-sensing layers, complicating the display design. In our prototype an additional 2.5 cm was added to the display thickness to allow the placement of the diffuser. In the future a large-format sensor could be accommodated within this distance; however, the current prototype uses a pair of cameras placed about 1 m behind the diffuser, significantly increasing the device dimensions. Also, because the LCD is switched between display and capture modes, the proposed design will reduce the native frame rate. Image flicker will result unless the display frame rate remains above the flicker fusion threshold [Izadi 2008]. Lastly, the BiDi screen requires external illumination, either from the room or a light-emitting widget, in order for its capture mode to function. Such external illumination reduces the displayed image contrast. This effect may be mitigated by applying an anti-reflective coating to the surface of the screen.

Designing a BiDi Screen

Comparison between ideal sensor implementation and camera/diffuser implementation

Figure 2 Image capture and display can be achieved by rearranging the optical components within an LCD. A liquid crystal spatial light modulator is used to display a mask (either a pinhole array or equivalent tiled-broadband code). (Left) The modulated light is captured on a sensor array for decoding. (Right) With no large-area sensor available, a camera images a diffuser to simulate the sensor array. In both cases, LEDs restore the backlight function.

As shown here, our BiDi screen is formed by repurposing typical LCD components such that image capture is achieved without hindering display functionality. We begin by excluding certain non-essential layers, including the CCFL/light guide/reflector components, the various brightness enhancing films, and the final diffuser between the LCD and the user. In a manner similar to [Lanman 2008], we then create a large-aperture, multi-view image capture device by using the spatial light modulator to display a pinhole array or tiled-broadband mask. Our key insight is that, for simultaneous image capture and display using an LCD, the remaining backlight diffuser must be moved away from the liquid crystal. In doing so, a coded image equivalent to an array of pinhole images is formed on the diffuser, which can be photographed by one or more cameras placed behind the diffuser. The backlight display functionality is restored by including an additional array of LEDs behind the diffuser.

We note that an angle-limiting material or other source of vignetting is critical to achieving image capture using the BiDi screen. In practice, the light reflected from objects in front of the screen will vary continuously over the full hemisphere of incidence angles. An angle-limiting film could be placed in front of the BiDi screen; however, such a film would also limit the field of view of the display.

Various masks which may be used, as well as a depiction of a basic pinhole camera

Figure 3 Design of a pinhole camera. (Left) The PSF width b, sensor-pinhole separation di, object distance do, and the aperture a. The PSF width is magnified by M = do/di in the plane at do. (Right, Top) A single pinhole comprises an opaque set of 19×19 cells, with a central transparent cell. (Right, Bottom) We increase the light transmission by replacing the pinhole with a MURA pattern composed of a 50% duty cycle arrangement of opaque and transparent cells. As described by Lanman et al. [Lanman 2008] and earlier by Fenimore and Cannon [Fenimore 1978], this pattern yields an image equivalent to that of a pinhole.

Our design goals require sufficient image resolution to estimate the 3D position of points located in front of the screen, as well as the variation in position and angle of incident illumination. As described by [Veeraraghavan 2007], the trade-off between spatial and angular resolution is governed by the pinhole spacing (or the equivalent size of a broadband tile) and by the separation between the spatial light modulator and the image plane (i.e., the diffuser). As with any imaging system, the ultimate spatial and angular resolution will be limited by the optical point spread function (PSF). In this section we analyze the optimization of a BiDi screen for both on-screen and off-screen interaction modes under these constraints for the case of a pinhole array mask. We extend this analysis to the case of tiled-broadband masks.

Multi-View Orthographic Imagery: As shown in Figure 4, a uniform array of pinhole images can be decoded to produce a set of multi-view orthographic images. Consider the orthographic image formed by the set of optical rays perpendicular to the display surface. This image can be generated by concatenating the samples directly below each pinhole on the diffuser plane. Similar orthographic views, sampling along different angular directions from the surface normal of the display, can be obtained by sampling a translated array of points in the diffuser-plane image, offset from the center pixel under each pinhole.
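To make this resampling concrete, the following Python sketch gathers one sample per pinhole at each view offset. It is a minimal sketch, assuming the pinhole array is axis-aligned with an integer pitch in sensor pixels; the names (extract_orthographic_views, pitch_px) are illustrative and not part of our implementation.

```python
import numpy as np

def extract_orthographic_views(diffuser_img, pitch_px, n_views):
    """Resample a pinhole-array image into orthographic views.

    diffuser_img: 2D array captured on the diffuser plane.
    pitch_px: pinhole spacing in sensor pixels (n_views <= pitch_px).
    Each view (u, v) gathers the pixel at one fixed offset under
    every pinhole, i.e., one ray direction per view.
    """
    h, w = diffuser_img.shape
    rows, cols = h // pitch_px, w // pitch_px
    # Crop to an integer number of pinhole tiles, then expose the
    # per-tile structure: (rows, pitch, cols, pitch).
    tiles = diffuser_img[:rows * pitch_px, :cols * pitch_px]
    tiles = tiles.reshape(rows, pitch_px, cols, pitch_px)
    c0 = (pitch_px - n_views) // 2  # centered window of offsets
    views = np.empty((n_views, n_views, rows, cols), tiles.dtype)
    for u in range(n_views):
        for v in range(n_views):
            # One sample per pinhole at offset (c0 + u, c0 + v).
            views[u, v] = tiles[:, c0 + u, :, c0 + v]
    return views
```

For a pitch of 19 sensor pixels per pinhole, for example, this yields up to 19×19 orthographic views, with the view at the central offset corresponding to rays perpendicular to the display.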

On-screen Interaction: For multi-touch applications, only the spatial resolution of the imaging device in the plane of the display is of interest. For a pinhole mask, this is simply the total number of displayed pinholes. Thus, to optimize on-screen interactions the pinhole spacing should be reduced as much as possible (in the limit displaying a fully transparent pattern) and the diffuser brought as close as possible to the spatial light modulator. This is precisely the configuration utilized by the existing optical touch sensing displays by Brown et al. [Brown 2007] and Abileah et al. [Abileah 2006].

Off-screen Interaction: To allow depth- and lighting-aware off-screen interactions, we observe that additional angular views are necessary. First, in order to passively estimate the depth of scene points, angular diversity is needed to provide a sufficient baseline for triangulation. Second, in order to facilitate interactions with an off-screen light-emitting widget, the captured imagery must sample a wide range of incident lighting directions. As a result, we conclude that spatial and angular resolution must be traded to optimize the performance for a given application. Off-screen rather than on-screen interaction is the driving factor behind our decision to separate the diffuser from the spatial light modulator, allowing increased angular resolution at the cost of decreased spatial resolution with a pinhole array mask.

Spatio-Angular Resolution Trade-off: Consider the design of a single pinhole camera shown in Figure 3, optimized for imaging at wavelength λ, with circular aperture diameter a, and sensor-pinhole separation di. The total width b of the optical point spread function, for a point located a distance do from the pinhole, can be approximated as

b(di,do,a,λ) = 2.44 λ di / a + a (do + di) / do

Note that the first and second terms correspond to the approximate blur due to diffraction and the geometric projection of the pinhole aperture onto the sensor plane, respectively. If we now assume that each pinhole camera has a limited field of view, given by α, then the minimum pinhole spacing dp is

dp(di,do,a,λ,α) = 2 di tan(α/2) + b(di,do,a,λ)

Note that a smaller spacing would cause neighboring pinhole images to overlap. As previously described, such limited fields of view could be due to vignetting or achieved by the inclusion of an angle-limiting film. Since, in our design, the number of orthographic views Nangular is determined by the resolution of each pinhole image, we conclude that the angular resolution of our system is limited to the width of an individual pinhole image (equal to the minimum pinhole spacing dp) divided by the PSF width b as follows.

Nangular(di,do,a,λ,α) = dp(di,do,a,λ,α) / b(di,do,a,λ)
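These three quantities are straightforward to evaluate numerically. The following Python sketch implements Equations 1-3 under assumed, prototype-like parameters (units of cm); the specific aperture and wavelength values are our assumptions, not calibrated measurements.

```python
import numpy as np

def psf_width(d_i, d_o, a, lam):
    """Total PSF width b (Equation 1): diffraction blur plus the
    geometric projection of the aperture onto the sensor plane."""
    return 2.44 * lam * d_i / a + a * (d_o + d_i) / d_o

def pinhole_spacing(d_i, d_o, a, lam, alpha):
    """Minimum pinhole spacing dp (Equation 2) so that neighboring
    pinhole images, each with field of view alpha, do not overlap."""
    return 2.0 * d_i * np.tan(alpha / 2.0) + psf_width(d_i, d_o, a, lam)

def n_angular(d_i, d_o, a, lam, alpha):
    """Angular resolution (Equation 3): pinhole-image width over PSF width."""
    return pinhole_spacing(d_i, d_o, a, lam, alpha) / psf_width(d_i, d_o, a, lam)

# Prototype-like values (assumed): one LCD cell as the aperture,
# green light, and the ~11 degree field of view measured later.
print(n_angular(d_i=2.5, d_o=25.0, a=0.026, lam=550e-7, alpha=np.radians(11.0)))
```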

Array of pinholes with view that eventually overlap

Figure 4 Multi-view orthographic imagery from pinhole arrays. A uniform array of pinhole images (each field of view shaded gray) is resampled to produce a set of orthographic images, each with a different viewing angle θ with respect to the surface normal of the display. The set of optical rays perpendicular to the display surface (shown in blue) is sampled underneath the center of each pinhole. A second set of parallel rays (shown in red) is imaged at a uniform grid of points offset from the center pixels under each pinhole.

Now consider an array of pinhole cameras uniformly distributed across a screen of width s and separated by a distance dp (see Figure 4). Note that a limiting field of view is necessary to prevent overlapping of neighboring images. We use a depth from focus method to estimate the separation of objects from the display surface. As a result, the system components should be placed in order to maximize the effective spatial resolution in a plane located a distance do from the camera. The total number of independent spatial samples Nspatial in this plane is determined by the total number of pinholes and by the effective PSF for objects appearing in this plane, and is given by

Nspatial(di,do,a,λ,α) = min(s/dp, (di s)/(do b))

where the first argument is the total number of pinholes and the second argument is the screen width divided by the magnified PSF evaluated in the plane at do. Thus, the effective spatial resolution is given by Nspatial/s. Note that, since our system is orthographic, we assume the object plane at do is also of width s.
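Continuing the sketch above (and reusing its psf_width and pinhole_spacing helpers), Equation 4 becomes a one-line minimum; sweeping do in a loop reproduces the trend plotted in Figure 5.

```python
def n_spatial(d_i, d_o, a, lam, alpha, s):
    """Effective spatial samples in the plane at d_o (Equation 4):
    the pinhole count s/dp, or the screen width divided by the PSF
    magnified into the object plane, whichever is smaller."""
    d_p = pinhole_spacing(d_i, d_o, a, lam, alpha)
    b = psf_width(d_i, d_o, a, lam)
    return min(s / d_p, d_i * s / (d_o * b))
```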

As shown in Figure 5, the effective spatial resolution in a plane at do varies as a function of the object distance from the pinhole array. For small values of do, the resolution monotonically increases as the object moves away from the pinholes; within this range, the spatial resolution is approximately equal to the total number of pinholes divided by the screen width. For larger values of do, the resolution monotonically decreases; intuitively, when objects are located far from the display surface, neighboring pinholes produce nearly identical images. As described in the Appendix, the sensor-mask (or diffuser-mask) separation is selected to maximize the effective spatial resolution for objects located within 50 cm of the display surface. Note that, in Figure 5, the resolution close to the pinhole array drops dramatically according to theory. However, in practice the resolution close to the display remains proportional to the number of pinholes. This is due to the fact that, in our prototype, the pinhole separation dp is held constant (as opposed to the variable spacing given in Equation 2). Practically, the vignetting introduced by the diffuser and camera's field of view prevents overlapping views even when an object is close to the screen, allowing for a fixed pinhole spacing.


Resolution theoretical limits and experimental errorbars

Figure 5 Effective spatial resolution as a function of distance do from the display. Orange error bars denote the experimentally-estimated spatial resolution. Note that, using either dynamically-shifted masks or a higher-quality image sensor, the spatial resolution could significantly increase near the display (approaching the upper limit imposed by the optical PSF).

Appendix: Optimizing the Mask Properties


In this appendix we describe how to optimize the sensor-mask (or diffuser-mask) separation for pinhole arrays and tiled-broadband codes. As with other light field cameras, the total number of samples (given by the product of the spatial and angular resolutions) cannot exceed the number of camera pixels. In our system the LCD discretization further limits the mask resolution, restricting the total number of light field samples to be approximately equal to the total number of pixels in the display (i.e., 1680×1050 pixels). Thus, as described in our paper, we achieve a spatial resolution of 73×55 samples and an angular resolution of 19×19 samples with a pinhole or MURA tile spacing of dp = 4.92 mm and a mask separation of di = 2.5 cm. However, by varying the spacing and separation, the spatio-angular resolution trade-off can be tuned.

Pinhole Array Mask Configuration

As shown in Figure 4, each pinhole must be separated by a distance dp = 2 di tan(α/2) if diffraction is negligible (otherwise Equation 2 must be used). Thus, the necessary mask separation di is given by

di = dp / (2 tan(α/2))

The field of view α, shown in Figure 3, is either determined by the vignetting of each sensor pixel (e.g., that due to the diffuser and camera's field of view) or by an angle-limiting film. Wider fields of view may be desirable for some applications. However, for a fixed field of view, one must choose the mask separation di to optimize the effective spatial resolution in front of the display. Thus, Equation 4 can be used to maximize Nspatial as a function of di. In our design we assume an average object distance of do = 25 cm. As an alternative to Figure 5, we can plot the effective spatial resolution as a function of the mask separation di (see Figure 6). Note that the selected distance di = 2.5 cm is close to the maximum, allowing slightly higher angular resolution (via Equation 3) without a significant reduction in spatial resolution.
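The optimization can be sketched numerically as follows; every parameter value here (screen width, aperture, wavelength, field of view) is an assumption chosen to resemble the prototype, so the printed optimum is only indicative.

```python
import numpy as np

# Assumed, prototype-like parameters (cm): LCD cell aperture a,
# wavelength lam, field of view alpha, screen width s, and the
# assumed average object distance d_o = 25 cm.
a, lam, alpha = 0.026, 550e-7, np.radians(11.0)
s, d_o = 43.3, 25.0

def b(d_i):    # PSF width, Equation 1
    return 2.44 * lam * d_i / a + a * (d_o + d_i) / d_o

def d_p(d_i):  # minimum pinhole spacing, Equation 2
    return 2.0 * d_i * np.tan(alpha / 2.0) + b(d_i)

d_i = np.linspace(0.5, 6.0, 500)                                 # candidate separations
n_spatial = np.minimum(s / d_p(d_i), d_i * s / (d_o * b(d_i)))   # Equation 4
print("optimal d_i ~ %.2f cm" % d_i[np.argmax(n_spatial)])
```

With these assumed values the curve peaks near di ≈ 2 cm and falls off only modestly at di = 2.5 cm, consistent with the trade of a small amount of spatial resolution for additional angular resolution described above.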


Resolution limits when the pinhole mask is placed over a range of distances from the sensor

Figure 6 Effective spatial resolution as a function of diffuser-mask separation di for a pinhole array, given by Equation 4. System parameters correspond with the prototype described in the paper.

Tiled-Broadband Mask Configuration

The placement and design of tiled-broadband masks were described in [Lanman 2008]. However, their design was for a transmission-mode system with a uniform array of LEDs placed a fixed distance in front of the sensor. Our reflection-mode system requires the mask to be placed at a different distance from the sensor to allow light field capture. In this section, the notation and derivation mirror those of that paper. We describe 2D light fields and 1D sensor arrays; however, the extension to 4D light fields and 2D sensor arrays arrives at a similar mask separation dm.


Shifted spectral copies due to mask modulation

Figure 7 Geometric derivation of tiled-broadband mask separation. (Left) The two-plane parameterization (u,s) of the optical ray shown in blue. The ray intersects the mask at ξ = u + (dm/dref) s. (Right) The received light field spectrum contains multiple spectral replicas shown in shades of red. The shield field spectrum must lie along the dashed line (i.e., fs/fu = (dm/dref) = fsR/(2 fu0)).

As shown in Figure 7, consider the two-plane parameterization [Chai 2000], where u denotes the horizontal coordinate (along the sensor or diffuser) and s denotes the horizontal position of intersection (in the local frame of u) of an incident ray with a plane that is a fixed distance dref = 1 cm away from, and parallel to, the first plane. A mask separated by dm from the sensor creates a shield field that acts as a volumetric occlusion function o(u,s) = m(u + (dm/dref) s). Thus, each ray parameterized by coordinates (u,s) is attenuated by the mask's attenuation function m(ξ) evaluated at ξ = u + (dm/dref) s. Taking the 2D Fourier transform yields the mask's shield field spectrum O(fu,fs), given by

O(fu,fs) = M(fu) δ(fs - (dm/dref) fu)

where M(fξ) is the 1D Fourier transform of m(ξ). As described in [Lanman 2008], the effect of the mask can be modeled by convolving the incident light field spectrum Lincident(fu,fs) with the shield field spectrum O(fu,fs). This implies that the mask spectrum must be composed of a uniform series of impulses (i.e., a Dirac comb), with the tiled-MURA mask being one such valid pattern when the tile dimensions are equal to the pinhole spacing dp.
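For completeness, one standard quadratic-residue construction of such a MURA tile is sketched below in Python; the tile side length must be prime, and the tiling counts in the example are illustrative rather than the exact prototype mask.

```python
import numpy as np

def mura_tile(p):
    """Binary MURA tile of prime side length p (quadratic-residue
    construction). 1 = transparent cell, 0 = opaque cell; the open
    fraction is roughly 50%, as required of a broadband tile."""
    residues = set((x * x) % p for x in range(1, p))
    c = np.array([1 if i in residues else -1 for i in range(p)])
    tile = np.zeros((p, p), dtype=np.uint8)
    tile[1:, 0] = 1                                    # first column open (i != 0)
    tile[1:, 1:] = (np.outer(c[1:], c[1:]) == 1).astype(np.uint8)
    return tile                                        # row i = 0 stays opaque

# A 19x19 tile matches the pinhole spacing dp used in the prototype;
# the 55x73 tiling count here is illustrative.
mask = np.tile(mura_tile(19), (55, 73))
```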

The mask separation dm must be adjusted so the received image can be decoded to recover the incident light field. Assume that Lincident(fu,fs) is bandlimited to fu0 and fs0, as shown in Figure 7. Since Equation 7 implies that the mask spectrum lies along the line fs = (dm/dref) fu, we conclude that

dm = (dref fsR) / (2 fu0) = dp / (2 tan(α/2))

where fsR = 1/(2 dref tan(α/2)) and fu0 = 1/(2 dp). Note that Equations 6 and 8 imply that the pinhole array and tiled-broadband codes must be the same distance from the sensor.

Validation


Validation video

Spatial/Angular/Temporal Resolution

A chart containing a linear sinusoidal chirp, over the interval [0.5,1.5] cycles/cm, was used to quantify the spatial resolution (in a plane parallel to the display) as a function of distance do. In a first experiment, three charts were placed at various depths throughout the interaction volume. Each chart was assessed by plotting the intensity variation from the top to the bottom. The spatial cut-off frequency was measured by locating the position at which fringes lost contrast. As predicted by Equation 4, the spatial resolution was ~2 cycles/cm near the display; for do > 30 cm, the pattern lost contrast halfway through (where fringes were spaced at the Nyquist rate of 1 cycle/cm). In a second experiment, a chart was moved through a series of depths do using a linear translation stage (for details see video above). The experimentally-measured spatial resolution confirms the theoretically-predicted trend in Figure 8. In a third experiment, an LED was translated parallel to the display at a fixed separation of 33 cm. The image under a single pinhole (or equivalently a single MURA tile) was used to estimate the lighting angle, confirming a field of view of ~11 degrees. In a fourth experiment, an oscilloscope connected to the GPIO camera trigger recorded a capture rate of 6 Hz and a display refresh rate of 20 Hz.

Depth Resolution

The depth resolution was quantified by plotting the focus measure operator response as a function of object distance do. For each image pixel this response corresponds to the smoothed gradient magnitude evaluated over the set of images refocused at the corresponding depths. As shown in Figure 8, the response is compared at three different image points (each located on a different chart). Note that the peak response corresponds closely with the true depth. As described by Nayar and Nakagawa [Nayar 1994], an accurate depth map can be obtained by fitting a parametric model to the response curves. However, for computational efficiency, we assign a quantized depth corresponding to the per-pixel maximum response - leading to more outliers than with a parametric model.
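A minimal Python sketch of this per-pixel scheme is given below, assuming the refocused focal stack has already been computed; the smoothing scale and function names are our own choices, not those of the prototype software.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def depth_from_focus(focal_stack, depths, sigma=2.0):
    """Assign each pixel the depth whose refocused image maximizes a
    smoothed gradient-magnitude focus measure (the per-pixel argmax
    described above; parametric fitting would reduce outliers).

    focal_stack: (n_depths, h, w) array of images refocused at `depths`.
    """
    responses = []
    for img in focal_stack:
        gx, gy = sobel(img, axis=1), sobel(img, axis=0)
        responses.append(gaussian_filter(np.hypot(gx, gy), sigma))
    responses = np.stack(responses)      # (n_depths, h, w) focus measure
    idx = np.argmax(responses, axis=0)   # per-pixel best depth index
    return np.asarray(depths)[idx]       # quantized depth map
```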

Validation of predicted depth resolution with chirp targets

Figure 8 Experimental analysis of depth and spatial resolution. (Top, Left) A linear sinusoidal chirp, over the interval [0.5,1.5] cycles/cm, with marks on the left margin indicating 0.1 cycles/cm increments in the instantaneous frequency. Three copies of the test pattern were placed parallel to the screen, at distances of do={18, 26, 34} cm (from right to left). (Top, Middle) All-in-focus image obtained by refocusing up to 55 cm from the display. As predicted by Equation 4, the spatial resolution is approximately 2 cycles/cm near the display, and falls off beyond 30 cm. Note that the colored arrows indicate the spatial cut-off frequencies predicted by Equation 4. (Top, Right) The recovered depth map, with near and far objects shaded green and blue, respectively. (Bottom) Focus measure operator response, for the inset regions in the depth map. Note that each peak corresponds to the depth of the corresponding test pattern (true depth shown with dashed lines).

Media

The video clips below show various interactions with our prototype.

Model Manipulation Demos




World Navigation Demo



Relighting Demo



Touch and Hover