Published in: First ACM International Conference on Autonomous Agents, Marina del Rey, California, February 5-8, 1997, 536-7.

Gandalf: An Embodied Humanoid Capable of Real-Time Multimodal Dialogue with People

Kristinn R. Thórisson *

Gesture & Narrative Language Group
M.I.T. Media Laboratory, 20 Ames St., Cambridge, MA 02139
kris@digi.lego.com - http://www.media.mit.edu/~kris
Content Areas: Interaction between people and agents, face-to-face communication, action selection and planning, real-time performance, synthetic agents.

Introduction

While many humanoid robots in fiction have an impressive collection of multimodal communication skills, research focused on creating truly multimodal humanoids is somewhat more recent [cf. Thórisson 1995, Cassell et al. 1994, Brooks & Stein 1993]. Here the emphasis is on the full loop from perception to action: real-time, multimodal, face-to-face communication. This research is motivated by the many potential benefits of multimodal communication; the reader is referred to Thórisson (1996, 1995) and Bolt (1987). This paper describes a humanoid called Gandalf (see Fig. 1)--a software robot that can interact with people (see Figs. 2 & 3) using speech, gesture, gaze and facial expressions. It also includes a short description of Gandalf's architectural foundation.

Perception, Decision & Action

Human face-to-face dialogue is difficult to duplicate in computers, partly because it is a loosely organized behavior with many exceptions to its "rules", and partly because the rules--and the associated behaviors--are highly time-specific [Thórisson 1995, Goodwin 1981]: not only must behaviors be produced fast, they must also be delivered at the right time. Behaviors can be initiated and partly executed, only to become obsolete before their execution is complete.

Following the basic computational characteristics of multimodal dialogue outlined in Thórisson (1995), dialogue is treated as a layered system in which both (or all) participants run multi-timescale perception-action loops that interact in meaningful ways. In this model, reactive and reflective actions work in concert to produce situated, action-based dialogue behavior.
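To make the multi-timescale idea concrete, the following is a minimal sketch (hypothetical names and rates, in Python, not code from Ymir or Gandalf) of two perception-action loops reading the same percepts but contributing actions at different frequency bands:

```python
import time

def perceive():
    # stand-in for the real multimodal sensors (gaze, gesture, speech)
    return {"user-speaking": True, "utterance-complete": False}

def reactive(p):          # fast band, e.g. ~10 Hz: gaze, blinks, back-channels
    return "look-at-user" if p["user-speaking"] else None

def reflective(p):        # slow band, e.g. ~1 Hz: interpretation, content planning
    return "compose-answer" if p["utterance-complete"] else None

def run(duration_s=0.5):
    start = time.time()
    next_fast = next_slow = 0.0
    while (t := time.time() - start) < duration_s:
        p = perceive()
        if t >= next_fast:                    # reactive loop fires every 0.1 s
            if (a := reactive(p)):
                print(f"[{t:.2f}s] reactive: {a}")
            next_fast += 0.1
        if t >= next_slow:                    # reflective loop fires every 1.0 s
            if (a := reflective(p)):
                print(f"[{t:.2f}s] reflective: {a}")
            next_slow += 1.0
        time.sleep(0.01)

run()
```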

Gandalf is created in a computational framework for psychosocial dialogue skills called Ymir (Thórisson 1996). Ymir (pronounced e-mer) is a hybrid, modular architecture for creating situated communicative agents. A character in Ymir is defined by a set of three types of processing modules: perceptual, decision and behavior. The modules are split into four layers: a Reactive Layer (RL), a Process Control Layer (PCL), a Content Layer (CL) and an Action Scheduler (AS). Each layer contains sufficient information to make decisions about actions at a specific time scale, or frequency band. Before decisions to act are turned into visible behavior, they are sent to the AS, which composes the exact morphology ("look") of an action. The AS prioritizes decisions as follows: decisions initiated by the RL (e.g. a decision to blink) are serviced immediately, those initiated by the PCL (e.g. to utter a sentence) take second priority, and those initiated by the CL (e.g. to change the topic of the dialogue) take third priority. Timeliness of intentions to act is ensured in two ways: [1] by the priority scheduling, and [2] by a time-management system that guarantees that actions that did not get executed in time are discarded rather than executed late.
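The priority and timeliness mechanisms can be illustrated with a small sketch (hypothetical class and method names, not the actual Ymir implementation): decisions are serviced by originating layer (RL before PCL before CL), and any decision whose delivery deadline has passed is dropped rather than performed late.

```python
import heapq
import time

LAYER_PRIORITY = {"RL": 0, "PCL": 1, "CL": 2}   # lower number = serviced first

class ActionScheduler:
    def __init__(self):
        self._queue = []      # entries: (layer priority, arrival order, decision)
        self._counter = 0

    def submit(self, layer, deadline, behavior):
        """Queue a decision to act; 'deadline' is an absolute time in seconds."""
        heapq.heappush(self._queue,
                       (LAYER_PRIORITY[layer], self._counter,
                        {"deadline": deadline, "behavior": behavior}))
        self._counter += 1

    def service(self):
        """Execute the highest-priority decisions that are still timely."""
        while self._queue:
            _, _, decision = heapq.heappop(self._queue)
            if time.time() > decision["deadline"]:
                continue                      # obsolete: never executed late
            decision["behavior"]()            # compose/perform the behavior

# Example: a reactive blink preempts a queued topic shift from the CL.
sched = ActionScheduler()
sched.submit("CL", time.time() + 2.0, lambda: print("change topic"))
sched.submit("RL", time.time() + 0.2, lambda: print("blink"))
sched.service()   # prints "blink" first, then "change topic"
```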

This arrangement results in a system with, among other things, the following unique characteristics:
  • A character's behavior follows common rules of turn-taking without being rigid or lock-step.
  • Gesture and facial expression are an integrated part of the communication, with no artificial communication protocols.
  • Concurrent behaviors, such as glancing over to an object the speaker points at, happen naturally and where they are expected.
  • When speech overlaps or miscommunication occurs, it is dealt with in the same way as in human face-to-face interaction: by stopping, restarting, hesitating, etc.
Figure 1. Gandalf.

Gandalf's Implementation & Abilities

The Gandalf prototype runs on eight networked computers (five UNIX workstations, two PCs and one Macintosh), including speech recognition and graphics. It uses a microphone (the agent's "ear"), an eye tracker and a body suit (the agent's "eyes") to capture the user's multimodal actions. A 23-degree-of-freedom face and hand allow Gandalf to display its behaviors; its voice is synthesized by DECtalk. Currently, Gandalf's knowledge of the solar system consists of the ability to travel to the planets, tilt them, zoom in and out, and tell users facts about each one. Gandalf's perceptual, decision and behavior modules are based on an extensive review of the psychological, linguistic and cognitive literature on human face-to-face dialogue (cf. Goodwin 1981). They produce basic communicative behaviors such as back-channel feedback (e.g. showing subtle eyebrow motions on turn transitions), attentional cues (e.g. gazing at, and turning the head toward, the area of interest), communicative facial expressions, fillers (e.g. "aaaah", "ehhh") where appropriate, correct timing of answers, etc. Gandalf contains 26 perceptual, 35 decision and 83 behavior modules. Because the modules can be added incrementally, in the spirit of Brooks' (1986) subsumption architecture, Gandalf's modules took only 4-6 man-weeks to implement and test.
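The incremental, module-based construction can be sketched as follows (assumed names, in Python, not Gandalf's actual code): perceptual, decision and behavior modules plug into a simple registry, so new skills can be layered on without rewriting existing ones.

```python
class Module:
    def __init__(self, name, layer, condition, action):
        self.name, self.layer = name, layer
        self.condition, self.action = condition, action   # both callables

class ModuleRegistry:
    def __init__(self):
        self.modules = {"perceptual": [], "decision": [], "behavior": []}

    def add(self, kind, module):
        self.modules[kind].append(module)    # new modules plug in alongside old ones

    def tick(self, state):
        """Run every decision module whose condition holds for this state."""
        for m in self.modules["decision"]:
            if m.condition(state):
                m.action(state)

# e.g. adding a hypothetical back-channel feedback decision module:
registry = ModuleRegistry()
registry.add("decision",
             Module("eyebrow-backchannel", "RL",
                    condition=lambda s: s.get("turn-transition", False),
                    action=lambda s: print("raise eyebrows briefly")))
registry.tick({"turn-transition": True})
```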


User: (turns to Gandalf) Hello Gandalf.
Gandalf: (immediately turns to user) Hello, I am your guide to the galaxy.
U: (still looking at Gandalf) Take me to Mars.
G: (lifts his eyebrows momentarily, then says) Sure thing. (turns to the solar system; Mars appears; Gandalf turns back to the user)
U: (looking at Mars, points) Is that Mars? (looks back at Gandalf and brings down his hand)
G: (glances at Mars for about 450 ms, then looks at user and points with his hand at Mars) Yes, that is Mars.
U: Tell me about Mars, Gandalf.
G: (glances upward, then back at user) Okay. (looks at Mars, then back at user) Mars is a really cool planet.
Figure 2. Excerpt from a real-time dialogue between Gandalf and a user. Experimental analysis has shown the timing of Gandalf's multimodal acts to be very similar to a human's (Thórisson 1996). The pace of this dialogue is about the same as that of a dialogue between two people.


Summary & Future Work

Gandalf's dialogue performance has been rated surprisingly highly by users (Thórisson & Cassell 1996). The prototype provides a proof-of-concept of new solutions to many difficult issues in multimodal dialogue--such as real-time back-channel feedback, flexible turn-taking (Goodwin 1981) and multimodal perception and action--in one integrated system. We are currently extending Ymir, Gandalf's foundation, implementing an extended version of Gandalf with broader topic knowledge, and providing it with new abilities to understand and generate content-related intonational patterns (Prevost 1996).

Figure 3. A user gets ready to interact with Gandalf.

Acknowledgments

Thanks to Justine Cassell, Pattie Maes, Steve Whittaker & Richard A. Bolt for guidance; Joshua Bers, David Berger, Christopher Wren & Hannes Vilhjálmsson for technical assistance; Jennifer Glos, Scott Prevost & Marina Umachi for useful suggestions. This work was sponsored by Thomson-CSF, RANNÍS, and the M.I.T. Media Laboratory.

References

Bolt, R. A. 1987. The Integrated Multi-Modal Interface. The Transactions of the Institute of Electronics, Information and Communication Engineers (Japan), November, J79-D(11), 2017-2025.

Brooks, R. (1986). A Robust Layered Control System for a Mobile Robot. IEEE Journal of Robotics and Automation, 2(1), March.

Brooks, R. & Stein, L. A. 1993. Building Brains for Bodies. M.I.T. Artificial Intelligence Laboratory Memo No. 1439, August.

Cassell, J., Stone, M., Douville, B., Prevost, S., Achorn, B., Steedman, M., Badler, N. & Pelachaud, C. 1994. Modeling the Interaction between Speech and Gesture. Sixteenth Annual Conference of the Cognitive Science Society, Atlanta, Georgia, August 13-16, 153-158.

Goodwin, C. 1981. Conversational Organization: Interaction Between Speakers and Hearers. New York, NY: Academic Press.

Prevost, S. 1996. A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. Ph.D. Thesis, Faculty of Computer and Information Science, University of Pennsylvania.

Thórisson, K. R. 1996. Communicative Humanoids: A Computational Model of Psychosocial Dialogue Skills. Ph.D. Thesis, Massachusetts Institute of Technology.

Thórisson, K. R. 1995. Computational Characteristics of Multimodal Dialogue. AAAI Fall Symposium Series on Embodied Language and Action, Nov. 10-12, 102-108.

Thórisson, K. R. & Cassell, J. 1996. Why Put an Agent in a Human Body: The Importance of Communicative Feedback in Human-Humanoid Dialogue. Lifelike Computer Characters, October 9-12, Snowbird, Utah.


*Now at LEGO A/S, Kløvermarken 120, 7190 Billund, Denmark