Abstract: The appearance of objects consists of regions of local structure as well as dependencies between these regions. The local structure can be characterized by a vector of local features measured by local operators such as Gaussian derivatives or Gabor filters. This paper presents a technique in which the appearance of objects is represented by the joint statistics of local neighborhood operators. A probabilistic technique based on joint statistics is developed for the identification of multiple objects at arbitrary positions and orientations. Furthermore, by incorporating structural dependencies, a procedure for probabilistic localization of objects is obtained. The current recognition system runs at approximately 10Hz on a Silicon Graphics O2. Experimental results are provided and an application using a head mounted camera is described.
Abstract: DyPERS, 'Dynamic Personal Enhanced Reality System', uses augmented reality and computer vision to autonomously retrieve 'media memories' based on associations with real objects the user encounters. These are evoked as audio and video clips relevant for the user and overlaid on top of real objects the user encounters. The system utilizes an adaptive, audio-visual learning system on a tetherless wearable computer. The user's visual and auditory scene is stored in real-time by the system (upon request) and is then associated (by user input) with a snapshot of a visual object. The object acts as a key such that when the real-time vision system detects its presence in the scene again, DyPERS plays back the appropriate audio-visual sequence. The vision system is a probabilistic algorithm which is capable of discriminating between hundreds of everyday objects under varying viewing conditions (lighting, view changes, etc.). Once an audio-visual clip is stored, the vision system automatically recalls it and plays it back when it detects the object that the user wished to use to remind him of the sequence. The DyPERS interface augments the user without encumbering him and effectively mimics a form of audio-visual memory. Performance is evaluated and usability results are shown.
Introduction: For most computer systems, even virtual reality systems, sensing techniques are a means of getting input directly from the user. However, wearable sensors and computers offer a unique opportunity to re-direct sensing technology towards recovering more general user context. Wearable computers have the potential to "see" as the user sees, "hear" as the user hears, and experience the life of the user in a "first-person" sense. This increase in contextual and user information may lead to more intelligent and fluid interfaces that use the physical world as part of the interface. Wearable computers are excellent platforms for contextually aware applications, but these applications are also necessary to use wearables to their fullest. Wearables are more than just highly portable computers; they perform useful work even while the wearer isn't directly interacting with the system. In such environments the user needs to concentrate on his environment, not on the computer interface, so the wearable needs to use information from the wearer's context to be the least distracting. For example, imagine an interface which is aware of the user's location: while the user is in the subway, the system might alert him with a spoken summary of an e-mail. However, during a conversation the wearable computer may present the name of a potential caller unobtrusively in the user's head-up display, or simply forward the call to voicemail.
Abstract: Small, body-mounted video cameras enable a different style of wearable computing interface. As processing power increases, a wearable computer can spend more time observing its user to provide serendipitous information, manage interruptions and tasks, and predict future needs without being directly commanded by the user. This paper introduces an assistant for playing the real-space game Patrol. This assistant tracks the wearer's location and current task through computer vision techniques and without off-body infrastructure. In addition, this paper continues augmented reality research, started in 1995, for binding virtual data to physical locations.
Abstract: An important function for wearable computers is the recognition of places and locations. This paper proposes an image sequence matching technique for the recognition of previously visited places. Similar in spirit to single-word recognition in speech recognition, a dynamic programming algorithm is proposed for the calculation of dissimilarities between video sequences. Such video sequences represent not only the place itself but also the approaching trajectory. This algorithm allows the use of a relatively simple and robust representation of single frames without compromising the discrimination between different places. Preliminary experimental results indicate the discriminative power and the robustness of the approach with respect to the angle of the approaching trajectory.
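The dynamic programming idea in the abstract above can be illustrated with a minimal sketch in the style of dynamic time warping, as used for isolated word recognition. This is not the paper's algorithm: the function names `sequence_dissimilarity` and `l1_distance`, the per-frame feature vectors, and the allowed alignment steps are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # hypothetical per-frame feature vector


def l1_distance(u: Frame, v: Frame) -> float:
    """Sum of absolute differences between two frame descriptors."""
    return sum(abs(x - y) for x, y in zip(u, v))


def sequence_dissimilarity(
    a: List[Frame],
    b: List[Frame],
    frame_dist: Callable[[Frame, Frame], float] = l1_distance,
) -> float:
    """Minimum accumulated frame distance over monotonic alignments of a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            # Allowed steps: advance in a, advance in b, or advance in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because the alignment may repeat frames, two recordings of the same approach taken at different walking speeds can still reach a low dissimilarity, which is the property that makes a simple per-frame representation workable.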
Abstract: Besides the obvious advantage of mobility, wearable computing offers intimacy with the user for augmented realities. A model of the user is as important as a model of the physical world for creating a seamless, unobtrusive interface while avoiding "information overload." This paper summarizes some of the current projects at the MIT Media Laboratory that explore the space of user and physical environment modeling.
Abstract: The same scene viewed under two different illuminants induces two different colour images. If the two illuminants are the same colour but are placed at different positions, then corresponding rgb pixels are related by simple scale factors. In contrast, if the lighting geometry is held fixed but the colour of the light changes, then it is the individual colour channels (e.g. all the red pixel values or all the green pixel values) that differ by a scale factor. It is well known that the image dependencies due to lighting geometry and illuminant colour can be removed, respectively, by normalizing the magnitude of the rgb pixel triplets (e.g. by calculating chromaticities) and by normalizing the lengths of each colour channel (by running the `grey-world' colour constancy algorithm). However, neither normalization suffices to account for changes in both the lighting geometry and illuminant colour.
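The two normalizations named in the abstract above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, images are represented as plain lists of (r, g, b) tuples, and the grey-world step is written in one common form (dividing each channel by its mean), which is an assumption about the exact variant.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def chromaticity(pixels: List[Pixel]) -> List[Pixel]:
    """Pixel-based normalization: divide each pixel by r+g+b.

    Cancels a per-pixel scale factor (a change in lighting geometry).
    """
    out = []
    for r, g, b in pixels:
        s = r + g + b
        out.append((r / s, g / s, b / s) if s > 0 else (0.0, 0.0, 0.0))
    return out


def grey_world(pixels: List[Pixel]) -> List[Pixel]:
    """Channel-based normalization: divide each channel by its mean.

    Cancels a per-channel scale factor (a change in illuminant colour).
    """
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    return [
        tuple(p[c] / means[c] if means[c] > 0 else 0.0 for c in range(3))
        for p in pixels
    ]
```

Scaling a single pixel leaves `chromaticity` unchanged but alters the channel means used by `grey_world`, while scaling a whole channel leaves `grey_world` unchanged but alters each pixel's chromaticity, which is why neither step alone handles both kinds of change.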
Abstract: Image colour is often thought to be an intrinsic correlate of surface reflectance and so is a common feature for image indexing. In this paper we point out that image colour is actually a function of surface reflectance, imaging geometry and the colour of the viewing illuminant. Fortunately, methods exist for normalizing away these dependencies. Pixel-based and colour channel-based normalizations remove dependency on geometry and light colour respectively. Unfortunately, neither method removes both dependencies simultaneously and so a single normalization must be chosen. Common practice dictates that pixel-based normalization is the most useful. In contrast, experiments that we carried out on a variety of image databases cited in the computer vision literature favour colour channel normalization.
Abstract: This article develops an analogy between object recognition and the transmission of information through a channel based on the statistical representation of the appearances of 3D objects. This analogy provides a means to quantitatively evaluate the contribution of individual receptive field vectors, and to predict the performance of the object recognition process. Transinformation also provides a quantitative measure of the discrimination provided by each viewpoint, thus permitting the determination of the most discriminant viewpoints. As an application, the article develops an active object recognition algorithm which is able to resolve ambiguities inherent in a single-view recognition algorithm.
Abstract: The appearance of an object is composed of local structure. This local structure can be described and characterized by a vector of local features measured by local operators such as Gaussian derivatives or Gabor filters. This article presents a technique where appearances of objects are represented by the joint statistics of local neighborhood operators. As such, this represents a new class of appearance based techniques for computer vision. Based on joint statistics, the article develops techniques for the identification of multiple objects at arbitrary positions and orientations in a cluttered scene. Experiments show that this technique can identify over 100 objects in the presence of major occlusions. Most remarkably, the technique has low complexity and therefore runs in real-time.
Abstract: During the last few years, there has been a growing interest in object recognition schemes directly based on images, each corresponding to a particular appearance of the object. Representations of objects that use only information from images are called appearance based models. The interest in such representation schemes is due to their robustness, speed and success in recognizing objects.
The thesis proposes a framework for the statistical representation of appearances of 3D objects. The representation consists of a probability density function over a set of robust local shape descriptors which can be extracted reliably from images. The object representation is therefore learned automatically from sample images. Multidimensional receptive field histograms are introduced for the approximation of the probability density function. A main result of the thesis is that such a representation scheme based on local object descriptors provides a reliable means for object representation and recognition.
Different recognition algorithms are proposed and experimentally evaluated. The first recognition algorithm, based on histogram matching, can be seen as a generalization of the color indexing scheme of Swain and Ballard. The second recognition algorithm calculates probabilities for the presence of objects based only on multidimensional receptive field histograms. The most remarkable property of the algorithm is that it relies neither on correspondence nor on figure ground segmentation. Experiments show the capability of the algorithm to recognize 100 objects in cluttered scenes. The third recognition algorithm incorporates several viewpoints in an active recognition framework in order to resolve ambiguities inherent in single view recognition schemes.
The thesis also proposes visual classes as a general framework for appearance based object classification. Classification has proven difficult for arbitrary objects due to instabilities of invariant representations. The proposed concepts for extraction, representation and recognition of visual classes provide a general framework for object classification.
From an abstract point of view, the thesis aims to push the limits of the appearance based paradigm without using either figure ground segmentation or correspondence. The active object recognition allows the consistent recognition of objects in 3D and therefore overcomes the limits of single view recognition. The appearance based classification framework based on the concept of visual classes will serve for future research.
Abstract: The article introduces the concept of visual classes as a general framework for object classification. Visual classes group together appearances which are similar with respect to a set of image measurements. As defined here, visual classes are implicit in many object representation schemes (geometric as well as appearance based models). We argue that the identification of visual classes provides a powerful tool for object classification. They provide a first step to classification depending on information provided for recognition, including context dependency and relations in space and time between objects.
The article introduces a statistical object representation which can be seen as a generalization of various object representations. Based on this statistical representation, the article introduces a possible extraction and representation of visual classes. First experimental results are given in order to validate the concept.
Abstract: This paper describes a new approach to indoor mobile robot position estimation, based on principal component analysis of laser range data. The eigenspace defined by the principal components of a number of range data sets describes the symmetries in the data. Building structures offer a small number of main axes of symmetry, caused by objects such as walls. As a consequence, the dimension of the eigenspace can be reduced to a few axes which describe these symmetries.
By transforming a new data set into the low-dimensional eigenspace, every potential position at which the data were taken, as well as the probability of this position, can be derived from the surrounding training data sets, whose positions are known.
The paper describes the principal component analysis of sets of range data and discusses its characteristics in indoor environments. It compares different methods to generate position hypotheses and discusses the question of noisy measurements and scene changes. Finally, a probabilistic model is proposed to integrate sequences of observations in order to reconstruct robot trajectories.
The advantage of the approach is the transformation of high-dimensional data sets into a low-dimensional eigenspace. The reduction in complexity achieved by this transformation allows the robot to be localized independently of other sources of position estimation (such as odometry), using adjacent measurements to resolve ambiguities. It is also possible to monitor and correct an underlying position estimation technique such as odometry.
Abstract: This article develops an analogy between object recognition and the transmission of information through a channel. This analogy is based on the statistical representation of the appearance of 3D objects by several multidimensional receptive field histograms. The analogy between transmission of information and object recognition provides a means to quantitatively evaluate the contribution of individual receptive field functions, and to predict the performance of the object recognition process using receptive field histograms. Transinformation also provides a quantitative measure of the discrimination provided by each viewpoint, thus permitting the determination of the most discriminant viewpoints.
As an application, the article develops an active object recognition algorithm which is able to resolve ambiguities inherent in a single-view recognition algorithm. The algorithm incorporates 3D information about an object's appearance entirely based on 2D measurements in images of the object.
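Transinformation is the mutual information between the channel input and output. As a rough illustration of how it can rank measurements or viewpoints, the sketch below computes it from a joint probability table over objects and quantized measurements. The table layout `joint[obj][measurement]` and the function name are assumptions for illustration, not the article's notation.

```python
import math
from typing import Dict


def transinformation(joint: Dict[str, Dict[str, float]]) -> float:
    """Mutual information I(O; M) in bits.

    joint[o][m] is assumed to hold p(o, m), with all entries summing to 1.
    """
    # Marginal over objects.
    p_o = {o: sum(row.values()) for o, row in joint.items()}
    # Marginal over measurements.
    p_m: Dict[str, float] = {}
    for row in joint.values():
        for m, p in row.items():
            p_m[m] = p_m.get(m, 0.0) + p
    info = 0.0
    for o, row in joint.items():
        for m, p in row.items():
            if p > 0.0:  # zero-probability cells contribute nothing
                info += p * math.log2(p / (p_o[o] * p_m[m]))
    return info
```

A measurement that identifies the object unambiguously yields the full entropy of the object prior, while a measurement that is independent of the object yields zero; values in between give a graded score for comparing receptive field functions or viewpoints.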
Abstract: In the ICPR-paper (see below) we have introduced the use of Multidimensional Receptive Field Histograms for Probabilistic Object Recognition. In this paper we reverse the object recognition problem by asking the question, "where should we look?", when we want to verify the presence of an object, to track an object or to actively explore a scene. This paper describes the statistical framework from which we obtain a network of salient points for an object. This network of salient points may be used for fixation control in the context of active object recognition.
Abstract: This paper extends our earlier work on object recognition using matching of multidimensional receptive field histograms. In our earlier papers we have shown that multi-dimensional receptive field histograms can be matched to provide the recognition of objects which is robust in the face of changes in viewing position and independent of image plane rotation and scale. In this paper we extend this method to compute the probability of the presence of an object in an image.
The paper begins with a review of the method and previously presented experimental results. We then extend the method for histogram matching to obtain a genuine probability of the presence of an object. We present experimental results showing 100% recognition rates with the Columbia database (20 objects) as well as with our own (more difficult) database composed of 31 objects. Results show that receptive field histograms provide a technique for object recognition which is robust, has low computational cost and has a computational complexity which is linear in the number of pixels.
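One way to turn normalized histograms into a probability of object presence is Bayes' rule, treating each histogram as an estimate of p(measurement | object). The sketch below assumes that the local measurements are conditionally independent given the object; this assumption, the function name `posterior`, the `floor` smoothing value and the table layout are illustrative choices, not details confirmed by the abstract.

```python
import math
from typing import Dict, Hashable, Iterable


def posterior(
    measurements: Iterable[Hashable],
    likelihoods: Dict[str, Dict[Hashable, float]],
    priors: Dict[str, float],
    floor: float = 1e-6,
) -> Dict[str, float]:
    """Return p(object | measurements) for every object.

    likelihoods[obj][bin] is assumed to hold p(bin | obj), e.g. a normalized
    histogram cell. Empty cells are replaced by `floor` to avoid log(0).
    """
    observed = list(measurements)
    log_scores = {}
    for obj, table in likelihoods.items():
        score = math.log(priors[obj])
        for m in observed:
            score += math.log(max(table.get(m, 0.0), floor))
        log_scores[obj] = score
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_scores.values())
    unnorm = {obj: math.exp(s - peak) for obj, s in log_scores.items()}
    z = sum(unnorm.values())
    return {obj: v / z for obj, v in unnorm.items()}
```

Each measurement costs one table lookup per object, so the work grows linearly with the number of measurements taken from the image.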
Abstract: This paper presents a technique to determine the identity of objects in a scene using histograms of the responses of a vector of local linear neighborhood operators (receptive fields). This technique can be used to determine the most probable objects in a scene, independent of the object's position, image-plane orientation and scale. In this paper we describe the mathematical foundations of the technique and present the results of experiments which compare robustness and recognition rates for different local neighborhood operators and histogram similarity measurements.
The first part of the paper generalizes the Color Histogram matching technique developed by Swain and Ballard to the case of a multidimensional histogram of the responses from a vector of receptive fields. The second part of the paper shows the use of receptive field vector histograms for object recognition. Results of experiments are presented which show the robustness of the approach in the presence of changes of position, scale and image-plane rotation.
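For illustration, the histogram intersection measure of Swain and Ballard carries over directly to multidimensional histograms when cells are indexed by tuples of bin indices. The sketch below is a minimal version under stated assumptions: the helper names, the uniform `bin_width` quantization and the sparse `Counter` storage are choices made here, not the paper's implementation.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def build_histogram(
    responses: Iterable[Sequence[float]], bin_width: float
) -> Counter:
    """Count filter-response vectors in a sparse multidimensional histogram.

    Each vector is quantized to a tuple of bin indices, one per filter.
    """
    hist: Counter = Counter()
    for vec in responses:
        cell: Tuple[int, ...] = tuple(math.floor(v / bin_width) for v in vec)
        hist[cell] += 1
    return hist


def intersection(image_hist: Counter, model_hist: Counter) -> float:
    """Histogram intersection, normalized by the model histogram's mass."""
    total = sum(model_hist.values())
    if total == 0:
        return 0.0
    overlap = sum(min(image_hist.get(cell, 0), count)
                  for cell, count in model_hist.items())
    return overlap / total
```

A score of 1.0 means every model cell is fully accounted for in the image; cells that appear only in the image (for example, background clutter) do not lower the score.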
Abstract: At ECCV'96 we presented a technique to determine the identity of objects in a scene using multidimensional histograms of the responses of a vector of local linear neighborhood operators (receptive fields). This technique can be used to determine the most probable objects in a scene, independent of the object's position, image-plane orientation and scale.
The present paper describes experiments to evaluate the robustness of multidimensional receptive field histograms to view point changes, using the Columbia image database. In this experiment we examine the performance of different filter combinations, histogram matching functions and design parameters of the multidimensional histograms.
The first part of the paper summarizes the mathematical foundations of multidimensional Receptive Field Histograms. The second part of the paper shows the experimental evaluation of the robustness of the approach to view point changes (3D rotation).
Abstract: This chapter presents a technique to determine the identity of objects in a scene using multidimensional histograms of the responses of a vector of local linear neighborhood operators (receptive fields). This technique can be used to determine the most probable objects in a scene, independent of the object's position, image-plane orientation and scale.
The first part of the chapter summarizes the mathematical foundations of multidimensional Receptive Field Histograms and gives a recognition example on a database of 103 objects. The second part of the chapter describes experiments to evaluate the robustness of multidimensional receptive field histograms to rotation, using the Columbia image database. In this experiment we examine the performance of different filter combinations, histogram matching functions and design parameters of the multidimensional histograms.
Abstract: see next paper (IWAFGR'95)
Abstract: In many practical situations, a desirable user interface to a computer system should have a model of where a person is looking and what he/she is paying attention to. This is particularly important if a system is providing multi-modal communication cues (speech, gesture, lip-reading, etc.) and the system must identify whether the cues are aimed at it or at someone else in the room. This paper describes a system that identifies user focus of attention by visually determining where a person is looking. While other attempts at gaze tracking usually assume a fixed or limited location of a person's face, the approach presented here allows for complete freedom of movement in a room. The Attentionfinder system uses several connectionist modules that track a person's face using a software controlled pan-tilt camera with zoom and identify the focus of attention from the orientation and direction of the face.