# Object Recognition using Multidimensional Receptive Field Histograms

PhD thesis of Bernt Schiele, July 1997.

# Abstract

During the last few years, there has been a growing interest in object recognition techniques directly based on images, each corresponding to a particular appearance of the object. Representations of objects that use only information from images are called appearance based models. The interest in such representation schemes is due to their robustness, speed and success in recognizing objects.

The thesis proposes a framework for the statistical representation of the appearances of 3D objects. The representation consists of a probability density function over a set of robust local shape descriptors which can be extracted reliably from images. The object representation is therefore learned automatically from sample images. Multidimensional receptive field histograms are introduced for the approximation of the probability density function. A main result of the thesis is that such a representation scheme based on local object descriptors provides a reliable means for object representation and recognition.

Different recognition algorithms are proposed and experimentally evaluated. The first recognition algorithm, based on histogram matching, can be seen as a generalization of the color indexing scheme of Swain and Ballard. The second recognition algorithm calculates probabilities for the presence of objects based only on multidimensional receptive field histograms. The most remarkable property of the algorithm is that it relies on neither correspondence nor figure ground segmentation. Experiments show that this algorithm is capable of recognizing 100 objects in cluttered scenes. The third recognition algorithm incorporates several viewpoints in an active recognition framework in order to resolve ambiguities inherent in single view recognition schemes.

The thesis also proposes visual classes as a general framework for appearance based object classification. Classification has proven difficult for arbitrary objects due to instabilities of invariant representations. The proposed concepts for extraction, representation and recognition of visual classes provide a general framework for object classification.

From an abstract point of view, the thesis aims to push the limits of the appearance based paradigm without using either figure ground segmentation or correspondence. The active object recognition method allows the consistent recognition of objects in 3D and therefore overcomes the limits of single view recognition. The appearance based classification framework based on the concept of visual classes will serve as a basis for future research.

• Preface
• 1 Introduction and Motivation
• 2 State of the art of object recognition, with focus on influential references for the thesis
• 3 Local Descriptor, in particular Gaussian derivatives, Gabor filters and invariant color descriptor
• 4 Statistical Object Representation
• 5 Measurements for histogram matching
• 6 Object recognition by histogram matching
• 7 Probabilistic object recognition
• 8 Active object recognition
• 9 Object classification
• 10 Conclusion and Perspectives

# Preface

In the classical approach to image analysis, image features such as edges or image regions such as texture regions are extracted from the image. High level feature groups may be obtained by grouping these basic image features. This approach hypothesizes the identity and the pose of the object in the scene by calculating feature correspondence between the feature groups and the features of the object model.

The principal difficulty with this classical approach is that the process of determining feature correspondence has a complexity which is exponential in the number of extracted image features. Furthermore, the extraction and grouping processes which produce image features are unstable, producing broken and spurious features which compound the complexity of correspondence.

In order to make the problem tractable, the number of extracted features must be reduced. This implies the use of salient (meaning discriminant) features. Because of the exponential complexity, only a relatively small number of image features can be used so that each image feature must be highly discriminant. Due to the tradeoff between robustness of the feature extraction and the discriminant power of features, the process of feature extraction tends to be unstable. Furthermore, the saliency of image features depends on the object classes employed, making the techniques suitable only for particular object classes such as geometric objects.

The above limitations of the classical approach to image analysis require a paradigm shift in computer vision: the object's identity and the object's pose are estimated directly from measurements which can be calculated reliably from the image. The process of estimating the object's identity and the object's pose has a complexity which can be linear with the number of image measurements. This implies that a large number of image measurements may be used and therefore that robust image measurements can be chosen. In this context, the model of an object is given by a representation of image measurements which can be learned automatically from sample images. These techniques are called appearance based methods since each of the represented images corresponds to a particular appearance of the object.

The advantage of appearance based methods is that they can use robust image measurements and that they can avoid feature correspondence. From an abstract point of view, these techniques calculate object correspondences between the image and the object models. This calculation of object hypotheses might be used as a pre-step of the classical image analysis approach: the hypothesized object can serve as a priori knowledge in order to reduce the complexity of the processes of correspondence and grouping.

Different appearance based object recognition techniques have been proposed: examples include the alignment scheme of Huttenlocher and Ullman, which relies on point correspondence of a small number of salient features, the eigenpicture approach, which assumes the detection or the segmentation of the object, and the aspect graph, which is so far only applied to geometric objects.

The color indexing approach of Swain and Ballard directly uses the color distribution of objects for recognition. Their approach has been shown to be remarkably robust to changes in the object's orientation, changes of the scale of the object, partial occlusion or changes of the viewing position. This approach is an attractive method for object recognition because of its simplicity, speed and robustness. However, its reliance on object color and, to a lesser degree, light source intensity makes it inappropriate for many recognition problems.

The focus of our work has been to develop a technique similar to color indexing using local descriptions of an object's shape provided by a vector of linear neighborhood operators. The first part of the thesis is therefore concerned with the definition of a statistical object representation framework based on local neighborhood operators. The principal aim is to develop fast and robust recognition techniques using the defined statistical object representation. The applicability of the techniques is shown experimentally on different databases, each containing up to 100 objects. In order to overcome the limitations of the classical image analysis approach, the thesis examines recognition without reliance on pre-segmentation and feature correspondence.

The speed and the robustness of appearance based object recognition approaches come at a price: appearance based approaches directly use image measurements for recognition. Images, and therefore appearances, are recognized rather than objects. Due to this fact, any appearance based approach has to be evaluated with respect to the principal challenges of appearance based models. The main challenges are the recognition of objects in the presence of partial occlusion, the recognition of 3D objects and the classification of objects. The second part of the thesis extends the application of the defined statistical object representation framework to manage these three challenges.

# Chapter 1 Introduction and Motivation

During the last few years, there has been a growing interest in object recognition directly based on images, each corresponding to a particular appearance of the object. Representations of objects that use only 2D information from images are called appearance based models. The interest in such representation schemes is due to their robustness, speed and success in recognizing objects. The benefits of such representation schemes are most obvious in areas like face recognition, human-computer interfaces and content based image retrieval.

This thesis proposes an object representation scheme based on the statistics of robust local neighborhood operators. We want to show that such a representation may provide a robust and highly discriminant means for the recognition of arbitrary objects.

The initial motivation of this work has been the color histogram approach, which provides fast and robust recognition of colored objects. From an abstract point of view this approach models objects by their color statistics. Our aim has been the generalization of this approach to the modeling of objects by the statistics of their local characteristics. The generalized approach, which we call multidimensional receptive field histograms, is able to discriminate a large set of objects in a robust manner. However, its principal limitations are related to what we summarize as challenges for appearance based models. These challenges are listed below and examined throughout the thesis.

The principal interest of the thesis is therefore the development of a new appearance based representation scheme based on the statistics of vectors of receptive field functions. Different recognition algorithms are proposed. In particular, a probabilistic object recognition algorithm is defined which does not rely on correspondence between test images and database objects. Experiments show that high recognition rates can be obtained with different recognition algorithms based on the proposed representation scheme. Extensions of the approach account for the principal challenges which any appearance based technique has to face.

In the following we want to motivate the use of a statistical framework (section 1.1) and define object recognition as a case study in computer vision (section 1.2). The initial motivation of the work, the color histogram approach, is briefly discussed in section 1.3. As mentioned above, the principal challenges of appearance based models (section 1.4) will be treated throughout the thesis. Section 1.5 gives an overview of the content of each chapter.

### 1.1 Motivation for a statistical framework

The thesis proposes the use of a statistical framework for the recognition of objects. Statistics allows us to incorporate information into the recognition process such as a priori information, context information and information provided by different sensors. A statistical framework provides a means for the incorporation of such additional information without altering the recognition algorithm itself. In general, adopting a statistical framework may offer the following advantages:

• ability to incorporate uncertainty
• soft decision making
• incorporation of a priori knowledge, which may be independent of the image content

Statistics by its nature makes it possible to incorporate uncertainty. This can be done at different levels, for example at the level of sensor modeling, modeling of data incompleteness and decision making. Therefore any doubt or uncertainty can and should be incorporated into such a framework.

Soft decision making refers to the probabilistic character of decisions in the context of statistics. The results of the recognition process may be formulated, for example, as probabilities for each object rather than as a hard decision on whether each object is contained in the scene. If hard decisions are needed, a threshold may be applied at the end.

We believe that many decisions depend not only on the image content and therefore on the signal itself but also on the context or other a priori knowledge. A statistical framework is particularly suited for the incorporation of such knowledge. Additionally any other source of information may be incorporated as for example the information provided by other sensors.
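The interplay of soft decisions, a priori knowledge and a final threshold can be illustrated with a minimal sketch. It applies Bayes' rule to a single measurement; the object names, the likelihood values and the "office" prior are invented purely for illustration and are not taken from the thesis.

```python
def posteriors(likelihoods, priors):
    """Bayes' rule: p(o | m) = p(m | o) p(o) / sum_j p(m | o_j) p(o_j)."""
    joint = {obj: likelihoods[obj] * priors[obj] for obj in likelihoods}
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("measurement has zero probability under every object")
    return {obj: value / evidence for obj, value in joint.items()}


def hard_decision(posterior, threshold):
    """Names of all objects whose probability reaches the threshold."""
    return sorted(obj for obj, p in posterior.items() if p >= threshold)


# Hypothetical likelihoods p(m | o) of one image measurement m.
likelihoods = {"cup": 0.30, "phone": 0.10, "stapler": 0.10}

# Two hypothetical priors: no context, and a context favoring staplers.
uniform = {"cup": 1 / 3, "phone": 1 / 3, "stapler": 1 / 3}
office_context = {"cup": 0.2, "phone": 0.2, "stapler": 0.6}

soft_uniform = posteriors(likelihoods, uniform)
soft_context = posteriors(likelihoods, office_context)

print(soft_uniform, hard_decision(soft_uniform, 0.5))
print(soft_context, hard_decision(soft_context, 0.5))
```

The recognition step (`posteriors`) is identical in both calls; only the prior changes. With the uniform prior the threshold selects the cup, whereas with the context prior no object reaches the threshold and the decision stays soft.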

Whereas statistical modeling and recognition may provide a powerful framework, the main problem for the successful application of statistical algorithms is the fundamental requirement of sufficient data for the estimation of the underlying probability density functions. In this respect the proposed statistical representation bypasses the problem by not using positional information. In particular, chapter 4 is devoted to the motivation of the statistical object representation by multidimensional receptive field histograms.

### 1.2 Object recognition as case study

For our study we have chosen the object recognition problem, since it can be seen as a general case study for computer vision. We can identify different degrees of freedom inherent in the object recognition problem. They are:

Figure 1: Different components of rotation and translation of a 3D object

• Similarity transformation in the image plane: three translational degrees of freedom ($t_x$, $t_y$ and $t_z$) and one rotational degree of freedom ($r_z$) can be identified (see figure 1).
• 3D transformation of the object: two rotational degrees of freedom ($r_x$ and $r_y$) exist in addition to the similarity transformation (see figure 1).
• Scene changes: these include partial occlusion and background change.
• Imaging conditions: these change according to illumination changes and different types of signal disturbance such as signal noise, quantization error and blur. The principal characteristic of these changes is that they cannot be controlled and/or predicted.

Section 4.1 discusses in greater detail the statistical object representation as a function of these degrees of freedom. These degrees of freedom will be examined throughout the thesis. They form, for example, the basis of the structure of chapter 6.

We want to point out the difference between object identification and object classification, which are both part of the general problem of object recognition. On the one hand, object identification refers to the recognition of objects which have been seen by the system before. Object classification, on the other hand, refers to the capability of the system to generalize outside the database of objects learned and seen before. The second problem is therefore more challenging, even though object identification may often be sufficient. In the following we use the terms object recognition and object identification synonymously. The term object classification is used for the generalization outside the database.

### 1.3 Motivation for multidimensional receptive field histograms

Swain and Ballard have developed a technique which identifies objects in an image by matching a color histogram from a region of the image with a color histogram from a sample of the object. Their technique has been shown to be remarkably robust to changes in the object's orientation, changes of the scale of the object, partial occlusion or changes of the viewing position. Even changes in the shape of an object do not necessarily degrade the performance of their method. However, the major drawback of their method is its sensitivity to the color and intensity of the light source and color of the object to be detected. Several authors have improved the performance of the original color histogram matching technique by introducing measures which are less sensitive to illumination changes (see section 2.1).
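The matching step of color indexing can be sketched in a few lines. The sketch below uses the normalized histogram intersection $\sum_j \min(I_j, M_j) / \sum_j M_j$ between an image histogram $I$ and a model histogram $M$; the bin count, the pixel values and the function names are illustrative assumptions.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Count RGB pixels (channel values 0..255) in a coarse 3D color histogram."""
    step = 256 // bins_per_channel
    hist = {}
    for r, g, b in pixels:
        key = (r // step, g // step, b // step)
        hist[key] = hist.get(key, 0) + 1
    return hist


def intersection(image_hist, model_hist):
    """Normalized histogram intersection: sum_j min(I_j, M_j) / sum_j M_j."""
    overlap = sum(min(image_hist.get(key, 0), count)
                  for key, count in model_hist.items())
    return overlap / sum(model_hist.values())


red, green, blue = (200, 10, 10), (10, 200, 10), (10, 10, 200)

model = color_histogram([red, red, red, blue])
shuffled = color_histogram([blue, red, red, red])   # same pixels, other positions
other = color_histogram([red, red, green, green])

match_same = intersection(shuffled, model)
match_other = intersection(other, model)
print(match_same, match_other)
```

Because the histogram discards pixel positions, rearranging the pixels leaves the match value unchanged, which hints at why the method tolerates rotation and moderate viewpoint change. The same property explains the drawback noted above: any change in the measured colors moves counts to other bins.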

The color histogram approach is an attractive method for object recognition because of its simplicity, speed and robustness. Additionally, the approach does not rely on the correspondence between the object model and the test image. However, its reliance on object color and light source intensity makes it inappropriate for many recognition problems. The initial motivation of our work has been to develop a similar technique using local descriptions of an object's shape provided by a vector of linear receptive fields. For the original color histogram algorithm, it can be seen that robustness to scale and rotation is provided by the use of color. Robustness to changes in viewing angle and to partial occlusion is due to the use of histogram matching. Thus it is natural to exploit the power of histogram matching to perform recognition based on histograms of local shape properties. The most general method to measure such properties is the use of a vector of linear local neighborhood operations, or receptive fields.
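As a rough illustration of the idea, the following sketch builds a two-dimensional histogram over the responses of two local linear operators. It is only a sketch under stated assumptions: plain central differences stand in for the receptive fields (the thesis discusses Gaussian derivatives and Gabor filters in chapter 3), and the function names, bin count and value range are hypothetical choices.

```python
def derivative_responses(image):
    """Yield (dx, dy) central differences at every interior pixel.

    Assumption: central differences are a simple stand-in for a vector
    of linear receptive fields such as Gaussian derivatives.
    """
    rows, cols = len(image), len(image[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            dx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            dy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            yield dx, dy


def receptive_field_histogram(image, bins=8, max_abs=128.0):
    """Accumulate the (dx, dy) response vectors in a bins x bins histogram."""
    hist = [[0] * bins for _ in range(bins)]
    width = 2.0 * max_abs / bins
    for dx, dy in derivative_responses(image):
        # Clamp to the valid cell range in case a response lies outside.
        i = min(bins - 1, max(0, int((dx + max_abs) // width)))
        j = min(bins - 1, max(0, int((dy + max_abs) // width)))
        hist[i][j] += 1
    return hist


# A 5 x 5 gray-value image with a horizontal intensity ramp.
ramp = [[10 * x for x in range(5)] for _ in range(5)]
hist = receptive_field_histogram(ramp)
print(hist[4][4], sum(map(sum, hist)))
```

Every interior pixel of the ramp has the same response vector, so all nine measurements fall into a single cell. As with color histograms, the position of each measurement is discarded; only the statistics of the local responses are kept.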

To this extent, the first part of the thesis (chapters 3 through 6) can be seen as the generalization of color histogram matching to the matching of multidimensional receptive field histograms.

### 1.4 Challenges for appearance based models

Throughout the thesis we will use the term appearance based model for techniques which use 2D image information for the representation and recognition of objects. The most prominent candidate is based on the principal component analysis of images. Approaches based on 2D local descriptors and our approach based on the statistics of local neighborhood operators are also appearance based approaches. The main advantages of such techniques are the robustness and speed of the extraction of 2D image information. Since they incorporate only 2D information, such approaches may be called viewer centered or image based. The representation of a 3D object is therefore based on a collection of 2D appearance models.

The opposite of appearance based approaches are 3D model based techniques. In such techniques each object has a single 3D model in an object coordinate system. The object models are therefore simpler and less memory intensive than in the case of appearance based models. However, the main disadvantage is the instability of the extraction of 3D information from 2D images.

Since the advantages and disadvantages of appearance based models and 3D models are complementary, neither can be stated to be generally superior to the other. Nevertheless, we adopt, as motivated in the previous section, an appearance based approach. This choice is mainly motivated by the robustness and speed which we expect to obtain by the application of such an approach. The challenges for the application of an appearance based approach may be summarized and listed as follows:

• recognition in the presence of viewpoint changes
• partial occlusion of objects
• recognition of 3D objects from 2D views
• object classification as the generalization outside the database
• memory requirements of object representation

Throughout the thesis, each point is treated separately. The following sections and chapters can be identified:

• section 6.5 shows the robustness of the object representation by multidimensional receptive field histograms with respect to viewpoint changes
• chapter 7 proposes a probabilistic object recognition algorithm which can account for partial occlusion. Experimental results show that even a small portion of an object is sufficient for the recognition of 103 objects
• chapter 8 proposes an active object recognition algorithm based on the appearances of objects. The underlying idea is to incorporate several views of an object, resulting in 3D-consistent recognition of objects
• chapter 9 introduces the concept of visual classes as a general framework for object classification and proposes a maximum-likelihood recognition of visual classes. Experiments are described for the case of image retrieval
• the memory requirement is analyzed in section 6.7. In the future, memory requirements may be reduced by the application of the concepts for object classification developed in chapter 9 or classical dimensionality reduction techniques

As we can see from the referenced sections and chapters, the principal challenges of appearance based models are the motivation for the second part of the thesis, that is to say for chapters 7 through 9.

### 1.5 Overview of the thesis

In the following, the content of each chapter is summarized.

Chapter 2 summarizes references which have been a source of inspiration for different aspects of the thesis. In particular, we discuss color histogram based approaches. Besides the original approach proposed by Swain and Ballard, several authors have improved the performance of the approach in the presence of illumination intensity changes. We point out the popularity of these approaches in the context of image retrieval. Since we generalize color histograms to histograms of vectors of local neighborhood operators, we describe several recognition techniques based on local characteristics. Since we are interested in a general statistical framework, as stated above, we also describe some statistical object recognition frameworks.

Chapter 3 is devoted to the discussion of local characteristics. Among the most popular characteristics are Gaussian derivatives, since they are well understood, can be calculated robustly and even have a physiological justification. Gabor filters are in general more time-consuming to calculate but are widely used in the context of texture analysis. We also describe local descriptors based on color information, which offer invariance with respect to changes of the intensity and color of the light source. This chapter also discusses normalization techniques which we apply in order to obtain robustness of local descriptors with respect to noise and illumination intensity changes. The robustness of the proposed normalization techniques with respect to additive Gaussian noise is examined in section 3.2.4. Robustness of the normalization with respect to illumination intensity changes is examined in experiments described in section 5.3.

Chapter 4 develops a general statistical object representation scheme. Each degree of freedom of the object recognition task, as introduced in section 1.2, is discussed and taken into account appropriately. As an approximation of the statistical representation of objects we motivate and propose a set of multidimensional receptive field histograms, each corresponding to a particular object appearance. The developed statistical framework may be seen as a general object model for representation schemes based on local descriptors of objects. Based on this general framework we develop an analogy between object recognition and information theory. This allows us to apply information theoretic concepts to object recognition, for example the transinformation for the evaluation of the feature set used.
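Transinformation is another name for mutual information. As a hedged illustration of how it can rank feature sets, the sketch below computes $I(O;M) = \sum_{o,m} p(o,m) \log_2 \frac{p(o,m)}{p(o)\,p(m)}$ from a joint probability table of objects $O$ and quantized measurements $M$. The two tables are invented toy examples, and the exact way the thesis estimates these probabilities is developed in chapter 4.

```python
import math


def transinformation(joint):
    """Mutual information I(O; M) in bits from a joint table p(o, m).

    joint[o][m] is the probability of object o together with measurement m.
    """
    p_obj = [sum(row) for row in joint]
    p_meas = [sum(col) for col in zip(*joint)]
    total = 0.0
    for o, row in enumerate(joint):
        for m, p in enumerate(row):
            if p > 0.0:  # empty cells contribute nothing
                total += p * math.log2(p / (p_obj[o] * p_meas[m]))
    return total


# Toy tables for two equally likely objects and two measurement cells.
informative = [[0.5, 0.0], [0.0, 0.5]]     # measurement identifies the object
useless = [[0.25, 0.25], [0.25, 0.25]]     # measurement independent of object

info_bits = transinformation(informative)
useless_bits = transinformation(useless)
print(info_bits, useless_bits)
```

A feature set whose measurements determine the object yields one full bit for two equally likely objects, while measurements that are independent of the object yield zero bits.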

Chapter 5 introduces different histogram comparison measurements, since the intersection $\cap$ of histograms of the original color histogram approach has limitations in the general context of multidimensional receptive field histograms. The chapter defines and analyzes different histogram comparison measurements such as $\chi^2$-statistics, quadratic distances and modified intersection measurements. The computational complexity and characteristics of each measurement are discussed. Additionally, the second part of the chapter analyzes the stability of the measurements with respect to additive Gaussian noise, blurring, image plane rotation and illumination intensity changes.
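To give a flavor of such a measurement, the sketch below implements one common symmetric form of the $\chi^2$ statistic, $\sum_i (q_i - v_i)^2 / (q_i + v_i)$, for two histograms $Q$ and $V$. This particular form is an assumption made for illustration; the definitions actually analyzed are those given in chapter 5. A multidimensional histogram can be passed in as a flat list of its cells.

```python
def chi_square(q, v):
    """Symmetric chi-square dissimilarity of two equal-length histograms."""
    if len(q) != len(v):
        raise ValueError("histograms must have the same number of cells")
    total = 0.0
    for qi, vi in zip(q, v):
        if qi + vi > 0:  # cells empty in both histograms contribute nothing
            total += (qi - vi) ** 2 / (qi + vi)
    return total


model = [4, 0, 2, 2]
same = [4, 0, 2, 2]
different = [0, 4, 2, 2]

d_same = chi_square(model, same)
d_different = chi_square(model, different)
print(d_same, d_different)
```

Unlike the intersection, which is a similarity that grows with the overlap, this value is a dissimilarity: identical histograms give zero, and the best match is the model with the smallest value.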

Chapter 6 applies the introduced comparison measurements for object recognition by histogram matching. After a recognition example for 261 objects, different degrees of freedom of the object recognition task are taken into account. In particular, the chapter describes how to account for image plane rotation and scale changes and provides experimental results in the presence of such changes. Section 6.5 is devoted to the important aspect of object recognition in the presence of viewpoint changes and analyzes the robustness of multidimensional receptive field histograms with respect to such changes. At the end of the chapter the memory requirement for multidimensional histograms is discussed.

Chapter 7 extends the application of multidimensional receptive field histograms to probabilistic recognition of objects. The probabilistic algorithm is capable of recognizing objects from a small portion of the image, providing robustness to partial occlusion. Recognition results are given for a database of 103 objects in the presence of image plane rotation, scale changes and viewpoint changes. Based on these results we propose a "dynamic hashtable" approach using image patches as index of the hashtable. This latter algorithm is particularly suited for the recognition of objects in cluttered scenes.
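The following sketch shows how a handful of local measurements might be combined into object probabilities without any correspondence. It rests on assumptions that are not spelled out in this overview: the measurements are treated as independent, each normalized histogram is read as a likelihood table $p(m \mid o)$, the prior is uniform, and empty cells are floored at a small constant. The two toy histograms and all names are hypothetical.

```python
import math


def object_posteriors(measurements, histograms, floor=1e-6):
    """Posterior over objects given histogram cell indices of local measurements.

    Assumes independent measurements and a uniform prior, so that
    p(o | m_1..m_K) is proportional to the product of p(m_k | o).
    """
    log_prior = math.log(1.0 / len(histograms))
    log_scores = {}
    for obj, hist in histograms.items():
        total = sum(hist)
        log_score = log_prior
        for cell in measurements:
            # Normalized cell count as likelihood; floor avoids log(0).
            log_score += math.log(max(hist[cell] / total, floor))
        log_scores[obj] = log_score
    # Normalize in a numerically stable way.
    peak = max(log_scores.values())
    weights = {obj: math.exp(s - peak) for obj, s in log_scores.items()}
    norm = sum(weights.values())
    return {obj: w / norm for obj, w in weights.items()}


# Toy one-dimensional histograms with three cells for two objects.
histograms = {"A": [8, 1, 1], "B": [1, 1, 8]}

few = object_posteriors([0], histograms)
more = object_posteriors([0, 0, 1], histograms)
print(few["A"], more["A"])
```

Each measurement is only looked up in the histograms, never matched to a specific model location, and the probability of the correct object sharpens as more measurements from a small image region are accumulated.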

The following two chapters propose two further extensions and applications of multidimensional receptive field histograms for active recognition of objects and for object classification. The experimental results provided in these chapters are not complete, so these chapters may be seen as perspectives of the thesis.

Chapter 8 adopts a hypothesis-testing strategy for the active recognition of objects in a single 2D image as well as in 3D. For the case of 2D images we develop a general interest point detector which may also be applied in the context of other object recognition algorithms. The second active recognition algorithm uses the information theoretic concept of transinformation for the evaluation of the most salient viewpoints of an object. Moving the camera to these salient viewpoints makes it possible to verify an object hypothesis derived from another viewpoint. Since at each viewpoint only 2D information is used for recognition, this algorithm enables an appearance based approach to recognize 3D objects from several 2D views in a consistent way. Experimental results underline this property of the algorithm.

Chapter 9 proposes the concept of visual classes as a general framework for object classification based on results of previous chapters. Visual classes are defined on 2D and/or 3D similarities of objects which may be derived from the general statistical representation of objects. The chapter proposes a maximum-likelihood approach for the recognition of visual classes and applies the technique to the retrieval of visually similar images.

Chapter 10 summarizes the principal results and lists perspectives of the thesis.