Ymir: A Mind Model

for Communicative Creatures and Humanoids

Kristinn R. Thórisson

Ymir is a broad, generative model of psychosocial dialogue skills that bridges between multimodal perception, decision and multimodal action in a coherent framework. It represents a distributed, modular approach that can be used to create autonomous characters capable of full-duplex (i.e. the interaction is open-loop -- the exchange of information is not step-lock). multimodal perception and action generation (Thórisson, 1998, 1996). Features from three A.I. approaches have been adopted in Ymir: Blackboard systems (Adler, 1992, Nii, 1989, Engelmore & Morgan, 1988, Selfridge, 1959), Schema Theory (Arbib, 1992) and behavior-based systems (Maes, 1990). However, Ymir goes beyond any one of these in the number of communication modalities and performance criteria it addresses. The goals behind the architecture, all of which have been successfully addressed in the model, can be summarized as:

Multimodal input and output has to be supported, with no artificial communication protocols or rules.
All data types and data combinations found in human face-to-face communication should be accommodated (analog, spatial, symbolic).
Incremental, real-time interpretation of perceptual input has to be directly supported.
Incremental output generation has to be directly supported.
Real-time interpretation has to happen in parallel with real-time response generation, providing seamlessness in the interaction.
Time should be an explicit part of the architecture’s structure and internal data.

Given the complexities of integrating numerous multimodal capabilities in a single system, two practical features that make the architecture more useful as a research tool are:

The architecture should allow incremental expansion of a character’s abilities, and
The architecture should allow the possibility for testing various computational methods within each of its elements.

A character created in this architecture should have the following abilities:

it should be capable of fluid, dynamic, face-to-face, task-oriented dialogue with human users, accomodating planning, task knowledge and any other skills necessary for this purpose, and
a character should be able to fluently combine its communication and task skills, like humans do in task-oriented dialogue.

Two critical abilities that affect the way the whole system is designed, and allow us to meet the requirements in the last two bullets:

The character should be able to cancel any of its own actions at a moment’s notice (approx. within 100 msec), based on a verbal or other perceptual cue. [2]
While some behaviors may be highly autonomous most of the time (for example where the next fixation falls), the agent should be able to follow verbal instructions to modify that behavior (e.g. “stop staring”), requiring control links from symbolic processes to lower-level behaviors.

The six main types of elements in Ymir are (see image, above):

A set of semi-independent process collections, G (green transparent plates and white cylinder).
A set of blackboards, F (yellow panes).
A set of perceptual modules, r (white spheres & blue prisms).
A set of decision modules, P (red cubes).
A set of behaviors, b, and behavior morphologies, bm (specific motor programs -- both contained inside white cylinder).
A set of knowledge bases, k ("spotty blobs", bottom layer).

There are four sets of process collections (G) in Ymir:

Reactive Layer (RL)
Process Control Layer (PCL)
Content Layer (CL)
Action Scheduler (AS)

Each of these contains these particular element types:

G (RL) = {r, P}
G (PCL) = {r, P}

G (CL) = {P, k}
G (AS) = {b, bm}

Multimodal information streams into the processing layers from the user (big arrows on left) and are processed at three different levels, using blackboards (yellow planes) for communicating intermediate and final results. An action scheduler (cylinder) composes particular motor morphologies and sends them to the agent's animation module (see ToonFace ). The current implementation of the Ymir/Gandalf system comprises about 13.000 lines of custom-written LISP code and a few hundred lines of C code (excluding third-party code such as speech recognition and synthesis, space-tracking drivers and realtime vision-based pupil tracking).

Details on the inner workings of Ymir are given in my thesis , chapters 7, 8 & 9; proof that it really works is given in Chapter 10.

Recent papers building on the Ymir foundation

Ng-Thow-Hing, V., K. R. Thórisson, R. K. Sarvadevabhatla, J. Wormer and Thor List (2009). Cognitive Map Architecture: Facilitation of Human-Robot Interaction in Humanoid Robot. IEEE Robotics & Automation Magazine, March, 16(1):55-66. [PDF]

Ng-Thow-Hing, V., T. List, K. Thórisson, J. Lim, J. Wormer (2007). Design and Evaluation of Communication Middleware in a Humanoid Robot Architecture. IROS 2007Workshop on Measures and Procedures for the Evaluation of Robot Architectures and Middleware, Oct. 29, 2007, San Diego, CA, 2007. [PDF]

Thórisson, K. R. (2007). Integrated A.I. Systems. Minds & Machines, 17:11-25, 2007. Invited paper at The Dartmouth Artificial Intelligence Conference: The Next 50 Years — Commemorating the 1956 Founding of AI as a Research Discipline, July 13-15, 2006, Dartmouth, New Hampshire, U.S.A. [PDF]

Thórisson, K.R., T. List, C. Pennock, J. DiPirro (2005). Whiteboards: Scheduling Blackboards for Semantic Routing of Messages & Streams. In K. R. Thórisson, H. Vilhjalmsson, S. Marsella (eds.), AAAI-05 Workshop on Modular Construction of Human-Like Intelligence, Pittsburgh, Pennsylvania, July 10, 8-15. Menlo Park, CA: American Association for Artificial Intelligence. [PDF]

List, T., J. Bins, R. B. Fisher, D. Tweed, K. R. Thórisson (2005). Two Approaches to a Plug-and-Play Vision Architecture - CAVIAR and Psyclone. In K. R. Thórisson, H. Vilhjalmsson, S. Marsella (eds.), AAAI-05 Workshop on Modular Construction of Human-Like Intelligence, Pittsburgh, Pennsylvania, July 10, 16-23. Menlo Park, CA: American Association for Artificial Intelligence.[PDF]

Thórisson, K. R., H. Benko, A. Arnold, D. Abramov, S. Maskey, A. Vaseekaran (2004). Constructionist Design Methodology for Interactive Intelligences. A.I. Magazine, 25(4): 77-90. Menlo Park, CA: American Association for Artificial Intelligence. [PDF]

Older papers on Ymir

Thórisson, K. R. (1999). A Mind Model for Multimodal Communicative Creatures and Humanoids. International Journal of Applied Artificial Intelligence, 13 (4-5), 449-486. This is the main Ymir paper. It provides an overview of the Ymir architecture and gives examples of implementation and performance of the first Ymir instantiation, Ymir Alpha. [PDF]

Thórisson, K. R. (2002). Machine Perception of Multimodal Natural Dialogue. P. McKevitt (Ed.), Language, vision & music . Amsterdam: John Benjamins. Focuses on the perceptual mechanims of Ymir, and their implementation in the communicative humanoid Gandalf, the first agent created in the Ymir architecture. [PDF]

Thórisson, K. R. (1998). Real-Time Decision Making in Face to Face Communication. Second ACM International Conference on Autonomous Agents , Minneapolis, Minnesota, May 11-13, 16-23. Decision-making mechanism of Ymir and explains the real-time nature of the architecture. [PDF]

Thórisson, K. R. (1997). Layered Modular Action Control for Communicative Humanoids. Computer Animation '97, Geneva, Switzerland, June 5-6, 134-143. Details the animation and motor control systems of Ymir, and the supporting structures developed for implementing the first animated graphical agent. [PDF]

Thórisson, K. R. (2002). Natural Turn-Taking Needs No Manual: A Computational Theory and Model, from Perception to Action. In B. Granström (Ed.), Multimodality in Language and Speech Systems. Heidelberg: Springer-Verlag. In-depth details on the full-duplex turn-taking system implemented for Gandalf in the Ymir architecture.[PDF]

References

To get other references made on this page, download the references section of my thesis (pdf).

Footnote

[1] Even in highly automated tasks performed by humans, such as touch typing, people can cancel motor sequences within 90 ms of a perceptual cue (Kosslyn & Koenig, 1992). This means that the fastest perception-action loop in our model should be no longer than 90 ms, quite a strict requirement for a complex system like multimodal dialogue.

For further information, see Thórisson's selected publications and thesis .

[ Back to Thórisson's home page ]