
3. Previous Research


3.1. Social science studies of embodied communication

Multimodal conversation

A face-to-face conversation is an activity in which we participate relatively effortlessly, and in which synchronization between participants seems to occur naturally. This is facilitated by the number of channels we have at our disposal for conveying information to our partners. These channels include the words spoken, the intonation of the speech, hand gestures, facial expression, body posture, orientation and eye gaze. For example, a listener can give feedback without overlapping the speaker by sending it over a secondary channel, such as facial expression, while receiving information over the speech channel (Argyle and Cook 1976). The channels can also work together, supplementing or complementing each other by emphasizing salient points (Chovil 1992, Prevost 1996), directing the listener's attention (Goodwin 1986) or providing additional information or elaboration (McNeill 1992, Cassell forthcoming). When multiple channels are employed in a conversation, we refer to the conversation as multimodal.

We can think of this process as collaborative weaving. Each person contributes a bundle of different colored threads, each color representing a modality of communication, such as speech or gesture. Over the course of the conversation, the group weaves a continuous and seamless textile in which each band consists of multiple threads from different participants and different modalities. When looking at the finished tapestry, an emerging pattern may be observed, suggesting an ordered affair. However, unlike the skilled textile worker, the people involved in the conversation will not be able to recall the specifics of laying out the threads, since most of it happened spontaneously.

Of course the pattern observed will be unique to each encounter, given a unique situation and cast of characters, but the relationship between the different colors and bands is to some extent governed by general principles (Kendon 1990). Researchers from different disciplines, such as linguistics and sociology, have searched for these principles of multimodal communication, each from a different point of view.

Even though methods differ and approaches to explanation vary, this research makes clear that our body, be it through gesturing or facial expression, displays structured signals that are an integral part of communication with other people. These are behaviors that should be exploited in the design of autonomous and semi-autonomous characters intended to take part in, or assist with, a natural dialog.

The current work focuses on gaze and communicative facial expression, mainly because these are fundamental in establishing and maintaining a live link between participants in a conversation. The display of gesture and body posture is also very important, but the elaborate articulation of a human body that it requires is beyond the scope of this thesis and will be pursued later.

To illustrate what is meant by communicative behavior, the following section describes a scenario where two unacquainted people meet and have a conversation. Each behavior employed is referenced to background studies, with relevant page numbers included. The two subsequent sections then elaborate on some of these behaviors and serve as a theoretical foundation for the automated behaviors in BodyChat.

An analyzed conversation

Paul is standing by himself, looking out for interesting people. Susan (unacquainted with Paul) walks by, mutual glances are exchanged, Paul nods smiling, Susan looks at Paul and smiles [distance salutation] (Kendon 1990, 173; Cary 1978, 269). Susan touches the hem of her shirt [grooming] as she dips her head, ceases to smile and approaches Paul (Kendon 1990, 186, 177). She looks back up at Paul when she is within 10' [for initiating a close salutation], meeting his gaze, smiling again (Kendon 1990, 188; Argyle 1976, 113). Paul tilts his head slightly to the side and says "Paul" as he offers Susan his hand, which she shakes lightly while facing him and replying "Susan" [close salutation] (Kendon 1990, 188, 193). Then she steps a little to the side to face Paul at an angle (Kendon 1990, 193; Argyle 1976, 101). A conversation starts.

During the conversation both Paul and Susan display appropriate gaze behavior, such as looking away when starting a long utterance (Kendon 1990, 63; Argyle 1976, 115; Chovil 1992, 177; Torres 1997). They mark various syntactic events in their speech with appropriate facial expressions, such as raising their eyebrows while asking a question or nodding and raising their eyebrows on an emphasized word (Argyle 1973; Chovil 1992, 177; Cassell 1994). While listening, they give feedback in the form of nods, low "mhm"s and eyebrow action (Chovil 1992, 187; Schegloff 1968; Cassell 1994), and finally they give the floor or select the next speaker using gaze (Kendon 1990, 85; Chovil 1992, 177; Argyle 1973; Argyle 1976, 118).

Gaze and the initiation of a conversation

The eyes are a powerful channel for intimate connection between people. Not only does the "look" suggest a being with consciousness and intentions of its own, as Sartre (Sartre 1956) describes it, but it also works as a device by which people mutually establish their "openness" to one another's communication (Kendon 1990, Argyle 1976, Goffman 1963).

Merely meeting a person's gaze is an important first step, but it will not by itself initiate a conversation. In fact, what E. Goffman refers to as "civil inattention" is a fundamental social behavior of unacquainted people who happen to come into each other's proximity without any intention to converse:

One gives to another enough visual notice to demonstrate that one appreciates that the other is present (and that one admits openly to having seen him), while at the next moment withdrawing one’s attention from him so as to express that he does not constitute a target of special curiosity or design.

(Goffman 1963, 84)

If your initial glance and orientation towards the other person were not met by interest in a greeting, your behavior can pass as part of the "civil inattention" ritual, and you are thus saved the embarrassment of explicitly requesting a conversation from an unwilling person (Goffman 1963, Cary 1978, Kendon 1990).

The showing of mutual awareness ensures that the other person's subsequent actions take your approach into account. A second glance, or a sustained gaze and a smile, act as indicators of the other person's intention to greet you. A distance salutation is performed, an approach follows and finally a close salutation occurs once a comfortable conversational distance is established. A few studies have focused on the verbal aspect of opening a conversation (Schegloff 1968, Schegloff and Sacks 1973), while others have looked specifically at gaze (Kendon 1990, Cary 1975), and Adam Kendon (Kendon 1990) has done a thorough study of the role of the body in a salutation sequence.
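For concreteness, this sequence can be read as a small state machine over greeting phases, which is also how one might encode it for an avatar. The sketch below is only an illustration of that reading under stated assumptions, not a mechanism proposed in the cited studies; all phase and event names are hypothetical.

```python
# A minimal sketch of the salutation sequence (after Kendon 1990) as a
# state machine. All phase and event names are hypothetical illustrations.
from enum import Enum, auto

class Phase(Enum):
    UNAWARE = auto()              # no mutual awareness established
    MUTUAL_AWARENESS = auto()     # glances met; civil inattention still possible
    DISTANCE_SALUTATION = auto()  # nod and smile exchanged at a distance
    APPROACHING = auto()          # head dip, grooming, gaze lowered while closing in
    CLOSE_SALUTATION = auto()     # gaze re-met, handshake and verbal greeting
    CONVERSATION = auto()

# (current phase, observed event) -> next phase
TRANSITIONS = {
    (Phase.UNAWARE, "mutual_glance"): Phase.MUTUAL_AWARENESS,
    # A second glance or a sustained gaze and a smile signals intent to greet;
    # looking away instead lets the encounter pass as civil inattention.
    (Phase.MUTUAL_AWARENESS, "sustained_gaze_and_smile"): Phase.DISTANCE_SALUTATION,
    (Phase.MUTUAL_AWARENESS, "gaze_averted"): Phase.UNAWARE,
    (Phase.DISTANCE_SALUTATION, "approach_begun"): Phase.APPROACHING,
    (Phase.APPROACHING, "within_close_range"): Phase.CLOSE_SALUTATION,
    (Phase.CLOSE_SALUTATION, "greetings_exchanged"): Phase.CONVERSATION,
}

def step(phase: Phase, event: str) -> Phase:
    """Advance the salutation sequence; unrecognized events leave the phase unchanged."""
    return TRANSITIONS.get((phase, event), phase)
```

Note that the failure path matters as much as the success path: an averted gaze drops the encounter back into civil inattention, mirroring how the ritual saves face for both parties.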

The functions of the face during a conversation

Michael Argyle (Argyle and Cook 1976) argues that gaze serves three main functions during a face-to-face conversation:

Perhaps the most obvious function of gaze is Information Seeking, since the primary function of the eyes is to gather sensory input. In order to read visual signals from our environment, we have to direct our attention, and thus our gaze, towards the source. In a face-to-face conversation we rely on various kinds of gestural information given by our partner, and therefore we have to glance at them at least from time to time. Listeners spend more than half of the time looking at the speaker, supplementing the auditory information. Speakers, on the other hand, spend much less time looking at the listener, partly because they need to attend to planning and do not want to overload their senses while doing so (Argyle and Cook 1976). The speaker will at least look at the listener when feedback is expected, such as at the end of utterances, after speech repairs or a word search, and during questions (Argyle and Cook 1976, Kendon 1990).

Facial movement, including gaze, eyebrow action and mouth movement, accompanies the speech and is synchronized with it at the verbal level. The signals sent during the course of speaking have been classified into syntactic displays and semantic displays (Chovil 1992). The syntactic displays include the raising of eyebrows and a slight head nod on a stressed or accented word, raised eyebrows during an offer or a suggestion, and blinking on a pause. The semantic displays convey something about what is being said. They either emphasize a word by showing an appropriate expression or a reference to an emotion (lowering the eyebrows and wrinkling the nose when saying "not"), or stand in place of a word by acting out what is meant (showing surprise by dropping the jaw after saying "when I opened the door, I was like"). Facial movements such as nodding and brow raising are also used as listener feedback, sometimes accompanying a low verbalization like "mhm" or "yeah".
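As a rough illustration of how such a classification might be consulted when animating an utterance, the sketch below maps speech events to facial displays. This is an assumed encoding, not one drawn from Chovil's study: the event and display names are hypothetical, and the tables hold only the examples mentioned above.

```python
# A hypothetical lookup from speech events to facial displays, loosely
# following Chovil's (1992) syntactic/semantic classification.
SYNTACTIC_DISPLAYS = {
    "stressed_word": ["raise_eyebrows", "slight_nod"],
    "offer_or_suggestion": ["raise_eyebrows"],
    "pause": ["blink"],
}

SEMANTIC_DISPLAYS = {
    "negation": ["lower_eyebrows", "wrinkle_nose"],  # e.g. on the word "not"
    "enacted_surprise": ["drop_jaw"],                # stands in place of a word
}

# Listener feedback displays, sometimes paired with a low "mhm" or "yeah".
BACKCHANNEL_DISPLAYS = ["nod", "raise_eyebrows"]

def displays_for(event: str) -> list[str]:
    """Return the facial displays associated with a speech event, if any."""
    return SYNTACTIC_DISPLAYS.get(event) or SEMANTIC_DISPLAYS.get(event) or []
```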

Finally, the face serves an important function in organizing how the conversation flows between participants. This is of course related to the speaker's use of gaze to gather feedback, since looking at a listener also signals to that listener what is expected. It has been observed that the person whom the speaker last looked at before ending is more likely than other members of the group to speak next (Kendon 1990, Argyle 1976); thus, looking can serve "to coordinate group action by controlling the succession of speeches" (Weisbrod 1965).
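This observation suggests a simple heuristic for animated participants: treat the speaker's final gaze target as the default next speaker. The sketch below encodes that heuristic; the data structures are hypothetical and the rule is only the probabilistic tendency reported above, not a deterministic law.

```python
from typing import Optional

def likely_next_speaker(gaze_targets: list[str], listeners: set[str]) -> Optional[str]:
    """Heuristic after Kendon (1990): the participant the speaker looked at
    last before ending their turn is the most likely next speaker."""
    for target in reversed(gaze_targets):  # scan the gaze history backwards
        if target in listeners:
            return target
    return None  # no gaze cue was given; the floor is open to anyone
```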

 

3.2. Communicating virtual bodies

The real-time animation of 3D humanoid figures in a lifelike manner is a large research issue. The Improv system (Perlin and Goldberg 1996) demonstrates visually appealing humanoid animation and provides tools for scripting complex behaviors, ideal for agents as well as avatars. However, producing the appropriate communicative behaviors and synchronizing them with an actual conversation between users has not yet been addressed in Improv. Real-time external control of animated autonomous actors has called for methods to direct animated behavior on a number of different levels (Blumberg and Galyean 1995).

Creating fully autonomous agents capable of natural multi-modal interaction involves integrating speech, gesture and facial expression. By applying knowledge from discourse analysis and studies of social cognition, systems like The Animated Conversation (Cassell et al. 1994b) and Gandalf (Thorisson 1996) have been developed. The Animated Conversation renders a graphical representation of two autonomous agents having a conversation. The system's dialog planner generates the conversation and its accompanying communicative signals, based on the agents' initial goals and knowledge. Gandalf is an autonomous agent that can hold a conversation with a user and employs a range of communicative behaviors that help to manage the conversational flow. Both of these systems are good examples of discourse theory applied to computational environments, but neither is concerned with user embodiment and issues of avatar control.

Studies of human communicative behavior have seldom been considered in the design of believable avatars. Significant work includes Judith Donath's Collaboration-at-a-Glance (Donath 1995), where an on-screen participant's gaze direction changes to display their attention, and Microsoft's Comic Chat (Kurlander et al. 1996), where illustrative comic-style images are automatically generated from the interaction. In Collaboration-at-a-Glance the users lack a body, and the system implements only a few functions of the head. In Comic Chat, the conversation is broken into discrete still frames, excluding possibilities such as real-time backchannel feedback and subtle gaze.

 

3.3. Electronic communities

To understand the importance of addressing the issue of communicative behavior in avatars, it is enlightening to examine the literature on electronic communities. The phenomenon of electronic communities, where people gather to socialize without bringing their own physical bodies, has fascinated researchers in sociology, anthropology, ethnography and psychology. In particular, the text-based MUDs (Multi-User Domains) have been the subject of a variety of studies, due to their popularity and their strong sense of community construction (Curtis 1992). MUDs have been used to build research communities (Bruckman and Resnick 1995) and learning environments (Bruckman 1997) as well as game worlds and chat rooms. A certain conversational style has emerged in these systems, where a body is simulated in the text messages passed between users (Cherney 1995), emphasizing how the body is intimately involved in the discourse even in the absence of graphics. While some argue that purely text-based MUDs allow for a richer experience than graphical environments by engaging the user's imagination, graphical MUD-like systems are gaining popularity, partly because of their familiar video-game-like interface. Graphical electronic communities introduce the whole new field of avatar psychology, the study of how people present themselves graphically to others (Suler 1996). A recent dissertation at the Media Lab explores in depth various aspects of on-line societies and compares different conversational interfaces (Donath 1997).

 

3.4. Multi-user platforms

Implementing multi-user environments is a complex research topic, first seriously tackled by the military in the large-scale SIMNET system developed by DARPA in the mid-1980s. The goal was to create a virtual battlefield where multiple manned vehicle simulators could be present in the same environment. A scaled-down version, dubbed NPSNET, was developed at the Naval Postgraduate School in Monterey, California, and has spawned many interesting research projects in the field of distributed simulation (Falby et al. 1993; Macedonia et al. 1994; O'Byrne 1995; Waldorp 1995). Other large multi-user environment projects, not necessarily affiliated with the military, include DIVE at SICS, Sweden (Carlsson and Hagsand 1993), SPLINE at MERL (Anderson et al. 1996), MASSIVE at CRG, Nottingham University, UK (Greenhalgh and Benford 1995) and the GreenSpace project at the HITLab (Mandeville et al. 1995). These projects have mostly contributed to the development of a reliable infrastructure, but are now increasingly touching on issues of human interaction within the systems. Because of the technical focus, however, none of them has applied discourse theory to the problem.

Commercially, many companies provide low-end client software to connect Internet users to graphical multi-user environments. The first Internet-based graphical chat system to incorporate 3D graphics was WorldChat from Worlds Inc. The first system to allow voice communication and implement lip-synching was OnLive! Traveler from OnLive! Technologies, and the first to include a selection of motion-captured animations for avatars was OZ Virtual from OZ Interactive. So far most solutions have been proprietary, but they are starting to converge on the developing Virtual Reality Modeling Language (VRML), a standard language for describing interactive 3-D objects and worlds delivered across the Internet. VRML extensions dealing with avatars and multi-user issues are currently being standardized.