The Media Laboratory *
Massachusetts Institute of Technology, 20 Ames Street, E15-410
Cambridge, MA 02139 kris@media.mit.edu
When people talk to each other they generally use a wealth of gesture, speech, gaze, and facial expression to communicate the intended content. Complex information is combined concisely, and representational styles are chosen dynamically, in real time, as the conversation unfolds. Both reactive and reflective behaviors are used, and they span a wide range of time scales [1]. Clearly, such interaction requires powerful mechanisms of interpretation and response generation. My work focuses on enabling human-computer interaction modeled on human face-to-face communication. For the multimodal interface agent metaphor to work, the mechanism controlling an on-screen agent has to capture elements that are critical to the structure of multimodal dialogue, such as gestural meaning, body language, and turn-taking, and integrate these in a comprehensive way.
The current work deals not with an isolated, single process or problem within face-to-face interaction, but with the larger picture of bridging between input and output to close the full loop of multimodal interaction between the human and the machine. To address the problems encountered in "full-duplex" multimodal interaction I have designed an architecture for multimodal psychosocial dialogue skills, called Ýmir, that bridges between input analysis and output generation and serves as a testbed for multimodal agents. It is a multi-layered system based mainly on a blackboard model, but it also borrows insights from behavior-based AI. Ýmir (pronounced "e-mir") gathers multimodal information (as captured by appropriate techniques such as eye tracking and speech recognition) and restructures it so that it can be used to generate appropriate dialogical responses in real time during the interaction. These responses can have timing characteristics similar to those found in face-to-face interaction, covering five orders of magnitude, from around 200 ms to minutes or hours [1, 2].
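To make the bridging idea concrete, the following is a minimal, hypothetical Python sketch of a blackboard-style arrangement of the kind described: perception modules post timestamped multimodal events to a shared blackboard, and decision layers operating at different time scales read from it to produce responses. All class and function names here are illustrative assumptions, not part of Ýmir itself.

```python
# Hypothetical sketch (not Ymir's actual implementation) of the blackboard
# idea described above: perceptors post timestamped multimodal events to a
# shared blackboard; a fast reactive layer and a slower reflective layer
# read those events and produce responses at very different latencies.
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    modality: str          # e.g. "gaze", "speech", "gesture"
    content: str           # symbolic description of what was perceived
    timestamp: float = field(default_factory=time.time)


class Blackboard:
    """Shared store that decouples input analysis from response generation."""

    def __init__(self):
        self.events = []

    def post(self, event: Event):
        self.events.append(event)

    def recent(self, window_s: float):
        """Return events posted within the last window_s seconds."""
        cutoff = time.time() - window_s
        return [e for e in self.events if e.timestamp >= cutoff]


class ReactiveLayer:
    """Fast behaviors (hundreds of milliseconds), e.g. back-channel feedback."""

    def decide(self, board: Blackboard):
        for e in board.recent(window_s=0.5):
            if e.modality == "speech" and e.content == "pause":
                return "nod"          # immediate back-channel response
        return None


class ReflectiveLayer:
    """Slower behaviors (seconds or more), e.g. content interpretation."""

    def decide(self, board: Blackboard):
        utterances = [e for e in board.recent(window_s=10.0)
                      if e.modality == "speech" and e.content != "pause"]
        if utterances:
            return f"reply_to: {utterances[-1].content}"
        return None


if __name__ == "__main__":
    board = Blackboard()
    board.post(Event("gaze", "looking_at_agent"))
    board.post(Event("speech", "pause"))
    print(ReactiveLayer().decide(board))     # -> "nod"
    print(ReflectiveLayer().decide(board))   # -> None (nothing to interpret yet)
```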
A prototype agent, Gandalf, is being developed in the Ýmir system. This cartoon-like character is endowed with real-time back-channel feedback [3, 4], as well as deictic gaze, synthesized speech, and manual gesture. A topic knowledge base will enable it to interpret limited spoken language and gesture and to respond intelligently within a domain.
[1] O'Conaill, B., Whittaker, S. & Wilbur, S. (1993). Conversations Over Video Conferences: An Evaluation of the Spoken Aspects of Video-Mediated Communication. Human-Computer Interaction, vol. 8, 389-428.
[2] Thórisson, K. R. (1995). Computational Characteristics of Multimodal Dialogue. AAAI Fall Symposium on Embodied Language and Action. Massachusetts Institute of Technology, Cambridge, MA, November 10-12.
[3] Thórisson, K. R. (1994). Face-to-Face Communication with Computer Agents. AAAI Spring Symposium on Believable Agents, Stanford University, California, March 19-20, 86-90.
[4] Thórisson, K. R. (1993). Dialogue Control in Social Interface Agents. InterCHI Adjunct Proceedings '93, Amsterdam, April, 139-140.
*Now at LEGO A/S, Kløvermarken 120, 7190 Billund, Denmark. kris@digi.lego.com