Research Overview (somewhat outdated - to be augmented soon)
A layman's view of my work follows on this page.
The vision directing my research is easy to state:
Cooperative Situated Conversational Assistants that can communicate fluidly with humans using natural language.
Of course, any undertaking of comparable complexity requires a strong theoretical foundation: a design methodology, an architecture, and, most importantly, a theory of how language connects to the senses and action. My main theoretical commitment is to physically grounded semantic representations and processes in the form of Grounded Situation Models (GSMs). GSMs are related to Johnson-Laird's mental model work and to the psychological situation model literature (Zwaan, Radvansky, etc.).
So far, I have demonstrated the applicability of my ideas through the design and implementation of a GSM on the conversational robot Ripley (shown in the photo above). The current implementation is capable of resolving descriptions of object or temporal referents, answering questions about the present or the past, acting on objects, and updating its model of the situation through visual, proprioceptive, or linguistic evidence. Thus, Ripley can "imagine" parts of situations it has never seen before, talk about them without yet having sensory evidence, and verify/augment its knowledge through the senses. The proposed system also passes four out of the five parts of the "Token Test for children", a standard test used to assess early situated language skills in human children, and a minimal GSM design that can pass the whole test has been proposed.
In the recent past, many extensions to the system were developed: providing richer shape capabilities and spatial relations, enabling the system to embed the user's estimated GSM in order to carry out simple "mind reading" of the human partner, etc. Most importantly, theoretical work towards the systematisation of GSM design given specific behavioral goals was carried out. The result of all of the above research was the so-called 7-tiered "GSM proposal", which forms the central theme of my PhD thesis.
Finally, in parallel to the GSM work described above, I have been involved in numerous other projects at MIT. Some time ago, I worked on facial category identification (male/female, ethnicity, artifacts, etc.), and devised a method to localize the semantically salient information for a category label and to exploit the mutual information across categories that are not mutually exclusive. Later, I performed an early exploration of a system that treats speech actions and sensorimotor actions in a unified manner (Resolver), as well as a prototype automatically controlled camera that uses visual attentional salience for targeting (Spectator). Recently, I was also involved in the Human Speechome Project, which focuses on the acquisition and analysis of massive audio-video recordings of human activity in home situations; for it, I devised a front-end preprocessor for activity detection in predefined regions of interest, and a visualisation for large-timescale, multiple-location, multimodal activity (Actigrams).
Below you can find descriptions of my research in layman's terms (from the Media Lab TechNotes). Somewhat outdated! Will update soon...
1. Grounded Situation Models - The problem:
How are people able to think about things that are not directly accessible to their senses at the moment?
What is required for a machine to be able to talk about things that are out of sight, happened in the past, or to view the world through somebody else's eyes (and mind)?
What is the machinery required for the comprehension of sentences like: "Give me the green beanbag that was on my left."
"There is a red can behind you." (which is not visible to the robot)
"How many balls did you see when I moved my hand?"
How can one encode information about the world that incorporates uncertainty and variable categorical granularity, and how do these representations interact with active goal-directed sensing? (i.e., what you see, and consequently what you know, is what you need)
What is a clean and general reusable architecture that achieves the above?
My Approach: To address these and related questions, we are developing an architecture that consists of a tightly coupled pair of systems: a physical robot (Ripley) and an "internal virtual world" in Ripley's mind that reflects the current unfolding situation. Ripley is an interactive robot with vision, speech, and grasping capabilities. The "world model" is constructed according to guidelines given in my GSM paper. This framework provides a foundation for our ongoing experiments in developing new models of natural language processing in which words are grounded in terms of sensory-motor representations. The use of a world model allows us to explore new models of word meanings that move beyond purely associative approaches of connecting words to sensors and motors. This work has applications in creating flexible and natural multimodal human-machine interfaces, and can also serve as a foundation for learning in our other projects, such as verb grounding. (A heartfelt thank you goes here to my colleague Kai-yuh Hsiao for his most important contributions to the early stages of the robot Ripley (before the mental model) and to Ripley's motor control system, as well as for the time he spent during our attempts to get our modules to cooperate.)
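As a concrete (and heavily simplified) illustration of the idea, here is a hypothetical Python sketch of a GSM-style object slot: attributes are held as probability distributions, an object can be instantiated from purely linguistic evidence before it is ever seen, and both spoken assertions and visual measurements update the same belief through a Bayesian step. All names and numbers here are illustrative, not taken from the actual implementation.

```python
class ObjectSlot:
    """One object in the situation model, with a probabilistic color attribute."""

    def __init__(self, colors):
        # Start with a uniform belief over the color categories.
        n = len(colors)
        self.color_belief = {c: 1.0 / n for c in colors}

    def update(self, likelihood):
        # Bayesian update: multiply the prior belief by the evidence
        # likelihood (linguistic or visual) and renormalise.
        post = {c: self.color_belief[c] * likelihood.get(c, 0.0)
                for c in self.color_belief}
        z = sum(post.values())
        self.color_belief = {c: p / z for c, p in post.items()}

    def most_likely_color(self):
        return max(self.color_belief, key=self.color_belief.get)


class SituationModel:
    """A minimal 'internal virtual world': a collection of object slots."""

    def __init__(self):
        self.objects = []

    def imagine(self, colors):
        # Linguistic evidence can instantiate an object that has not yet
        # been seen ("there is a red can behind you").
        slot = ObjectSlot(colors)
        self.objects.append(slot)
        return slot


gsm = SituationModel()
can = gsm.imagine(["red", "green", "blue"])
# A spoken assertion "the can is red" arrives as a sharp likelihood...
can.update({"red": 0.9, "green": 0.05, "blue": 0.05})
# ...and later visual evidence can confirm (or revise) the belief.
can.update({"red": 0.8, "green": 0.1, "blue": 0.1})
print(can.most_likely_color())  # -> red
```

The point of the sketch is only that language and vision feed the same representation, which is what lets the robot "imagine" first and verify through the senses later.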
2. Resolver - The problem:
While trying to expand the dialogue system of Ripley the conversational robot, the following questions have arisen:
How does one resolve a referent given ambiguous information?
Which active sensing moves should be used in order to attend to possible referents in the situational context, which clarifying questions should be asked, and in what order?
In general, how can one deal with double uncertainty:
When you only partially know what you're looking for, and only partially know what's available, which moves will get you to find a match?
My Approach: Ripley the robot is able to physically move about in order to collect more data about its world through action, vision, and touch. Ripley is also able to gain new information linguistically by asking its human partner questions. Each kind of action, motor and speech, has associated costs and expected payoffs. We are developing planning algorithms that treat these actions in a common framework, enabling Ripley to integrate both kinds of action into coherent behavior. This work has applications in dialogue generation, active learning, and knowledge discovery. It is part of a longer-term exploration of the "language as active perception" metaphor.
The algorithm devised operates on a probabilistic representation of objects and their properties, which is updated as new information arrives. A one-step-ahead search determines the next move: each candidate move is evaluated on the basis of its possible outcomes, given the current restrictions, and their expected rewards. Rewards are calculated through a tunable cost-payoff fusion function.
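The move-selection step can be sketched as follows. The move names, costs, and outcome distributions below are invented for illustration, and the fusion function is reduced to its simplest linear form; the actual system is more elaborate.

```python
def expected_reward(move, trade_off=1.0):
    # Expected payoff over the move's possible outcomes, minus a
    # tunable cost term (a minimal "cost-payoff fusion function").
    payoff = sum(p * r for p, r in move["outcomes"])
    return payoff - trade_off * move["cost"]


def next_move(moves, trade_off=1.0):
    # One-step-ahead search: score every candidate move, pick the best.
    return max(moves, key=lambda m: expected_reward(m, trade_off))


moves = [
    # Turning to look at a candidate object: cheap, moderately informative.
    {"name": "look_at_object_2", "cost": 0.2,
     "outcomes": [(0.6, 1.0), (0.4, 0.1)]},   # (probability, payoff) pairs
    # Asking "do you mean the green one?": costlier, but often decisive.
    {"name": "ask_clarification", "cost": 0.5,
     "outcomes": [(0.9, 1.0), (0.1, 0.0)]},
]

print(next_move(moves)["name"])  # -> look_at_object_2
```

Raising or lowering `trade_off` shifts the balance between motor moves and speech moves, which is exactly the kind of knob the cost-payoff fusion is meant to expose.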
3. Spectator - The problem:
Traditionally, video cameras require human operators in many application domains, such as round-table conferences, sports events, etc.
So far, no products are available that are intelligent enough to carry out such a task. Ideally, we would like such a camera to be able to survey the environment, identify interesting objects and events in its field of view, track them, zoom in and out, and move on from one target to another.
My Approach: First, interesting regions, objects or events are detected.
Currently, this is carried out by a simple "salient point" detector, based on visual attention models. Attempts towards computational models of bottom-up human visual attention have already been made. Such a model (Itti-Koch) was implemented in the form of a module compatible with Ripley's distributed vision system (see relevant project). Given an input image, this module produces a map of salient points together with their saliency intensities. These are generated given low-level multiscale color, intensity, and edge-orientation information.
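A heavily simplified sketch of the bottom-up saliency idea (not the full Itti-Koch implementation): a single intensity channel, with the multiscale center-surround differences approximated by subtracting a coarse Gaussian blur from a fine one. All parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def saliency_map(image):
    # image: H x W x 3 float array with values in [0, 1]
    intensity = image.mean(axis=2)
    # Center-surround difference: fine-scale blur minus coarse-scale
    # blur, a crude stand-in for the model's multiscale channels.
    center = gaussian_filter(intensity, sigma=1)
    surround = gaussian_filter(intensity, sigma=8)
    sal = np.abs(center - surround)
    return sal / (sal.max() + 1e-9)  # normalise to [0, 1]


def salient_points(sal, threshold=0.5):
    # The module's output: salient points with their saliency intensities.
    ys, xs = np.where(sal > threshold)
    return [(int(y), int(x), float(sal[y, x])) for y, x in zip(ys, xs)]


# A small bright blob on a dark background should dominate the map.
img = np.zeros((64, 64, 3))
img[30:34, 30:34] = 1.0
sal = saliency_map(img)
peak = np.unravel_index(np.argmax(sal), sal.shape)
print(peak)  # peak lies at or very near the blob
```

In the real module, separate color, intensity, and edge-orientation channels are computed at several scales and fused; the sketch keeps only the center-surround core.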
Then, these "interesting points" drive a camera behavior module, which moves the camera around. This module implements tracking, zooming, mapping, curiosity/boredom functions, etc. (A warm thanks goes here to my partner Alexander Patrikalakis, an MIT undergrad working as a UROP, who has also contributed extensively to the implementation of this project.)
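The curiosity/boredom function can be illustrated with a toy sketch: interest in the current target decays the longer the camera stays on it, and the camera switches once some alternative's saliency exceeds the boredom-discounted interest. The concrete rule and numbers here are hypothetical, not the module's actual ones.

```python
class CameraBehavior:
    """Toy curiosity/boredom target switching for an automatic camera."""

    def __init__(self, boredom_rate=0.3):
        self.target = None
        self.boredom = 0.0
        self.rate = boredom_rate

    def step(self, points):
        # points: {target_id: saliency} from the salient-point detector
        if not points:
            self.target, self.boredom = None, 0.0
            return None
        if self.target in points:
            # Staying on the same target makes the camera more "bored".
            self.boredom += self.rate
        else:
            self.target, self.boredom = None, 0.0
        # Interest in the current target decays as boredom grows.
        current = points.get(self.target, float("-inf")) - self.boredom
        # Curiosity wins when the best alternative beats that interest.
        others = {k: v for k, v in points.items() if k != self.target}
        if others and max(others.values()) > current:
            self.target = max(others, key=others.get)
            self.boredom = 0.0
        return self.target


cam = CameraBehavior(boredom_rate=0.3)
points = {"ball": 0.9, "hand": 0.6}
history = [cam.step(points) for _ in range(5)]
print(history)  # -> ['ball', 'ball', 'hand', 'ball', 'ball']
```

Even with static saliencies, the camera alternates between targets instead of fixating, which is the behavioral point of the boredom term.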
4. Actigrams - The problem:
The Total Recall / Human Speechome Project of the Cognitive Machines group focuses on the acquisition and analysis of massive audio-video recordings of human activity in home situations. In this project, the Cognitive Machines group is creating a unique infrastructure for efficiently storing and managing millions of hours of audio and video, semi-automated meta-data creation, and statistical machine learning of cross-modal patterns. Given such massive quantities of audiovisual data, some of the problems that naturally arise, and for which I proposed and implemented solutions, are the following:
a) How can one record only interesting stuff, and not spend endless terabytes of recording space on empty rooms or the shadows of trees moving in the wind?
b) How can one localise activity in certain regions (for example, the kitchen table), and then be able to ask: "Show me what happened at the kitchen table yesterday", and get meaningful clips as answers?
c) How can one use these localised activity detection streams in order to detect larger-scale events? For example, we expect breakfast to contain a certain overall pattern as a sequence of movements: something like: kitchen door - fridge - cupboard - table - sink - table for longer time - sink - ...
d) How can one create a visualisation that summarises audiovisual activity across locations and rooms, and that is also zoomable and clickable, enabling playback of the corresponding video segments?
My Approach: First, as most large objects in a house remain stationary, I marked regions of interest such as the fridge, door, table, etc. Then, I implemented and tuned a slight variant of a known activity detection algorithm that can adapt to uninteresting periodic pseudo-activities as well as to image noise. Then, by masking activity detection through the predefined regions, I created a multi-stream timeseries across locations and rooms, combined it with audio-stream analysis, and found a color-coded way to display zoomable and clickable pictures of the streams, which I called "Actigrams". Thus, I provided an adequate solution to all four of the problems stated above: I saved lots of disk space and CPU time, enabled spatiotemporally-centered queries through simple linguistic descriptions, created time-series that others in my group use as prime material for automatic activity recognition and analysis, and made a visualisation that lies at the heart of the main user interface. And all this at very small computational cost: processing often runs at many times real time or faster.
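A minimal sketch of the region-masked activity detection idea, assuming plain frame differencing against a slowly adapting background (the actual method builds on an algorithm from the literature and is considerably more robust): each predefined region mask yields one activity stream, i.e. the raw material for an actigram.

```python
import numpy as np


def activity_streams(frames, regions, alpha=0.05, thresh=0.1):
    # frames: iterable of H x W grayscale float arrays in [0, 1]
    # regions: {region_name: boolean H x W mask}
    background = None
    streams = {name: [] for name in regions}
    for frame in frames:
        if background is None:
            background = frame.astype(float).copy()
        # Pixels that differ strongly from the background count as active.
        diff = np.abs(frame - background) > thresh
        # Adaptive update: the background drifts toward the current frame,
        # so slow or periodic pseudo-changes stop registering as activity.
        background = (1 - alpha) * background + alpha * frame
        for name, mask in regions.items():
            # One sample per region per frame: fraction of active pixels.
            streams[name].append(float((diff & mask).mean()))
    return streams


# Toy example: "activity" happens only inside the table region.
h, w = 32, 32
table = np.zeros((h, w), bool); table[8:16, 8:16] = True
door = np.zeros((h, w), bool); door[20:28, 20:28] = True
frames = [np.zeros((h, w)) for _ in range(5)]
frames[3][10:12, 10:12] = 1.0  # someone moves at the table in frame 3
s = activity_streams(frames, {"table": table, "door": door})
print(s["table"][3] > 0, max(s["door"]) == 0.0)  # -> True True
```

Thresholding such streams is what makes a query like "show me what happened at the kitchen table yesterday" answerable as a list of clips, and stacking the streams color-coded over time gives the actigram picture.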
(A warm thanks goes here to my partner Philip DeCamp, who built a big part of the project's recording pipeline and more, and also contributed extensively to the implementation of the existing algorithm from the literature that formed the basis for the activity detection method I built upon.)