
2. Current systems and their shortcomings


2.1. Types of systems

The term avatar has been used to refer to many different ways of representing users graphically. As described in section 1.2, the range of applications is broad and the requirements for the user’s virtual presence differ. This work implements the type of avatar that inhabits what is technically referred to as a Distributed Virtual Environment (DVE). The ideas presented here are still applicable to other kinds of systems and should be viewed with that in mind. The next section describes an existing graphical chat system that is a good example of a DVE. This particular system was chosen because it is popular and sports a few advanced features. It is described here in detail primarily to give readers of this work some insight into the current state of the art.


2.2. An existing system: Active Worlds

The Active Worlds Browser (AWB) from Worlds Incorporated is a client program, running under Windows, that connects the user to an Active World server maintained by Worlds Inc. or one of its collaborators. The client renders a view into the Active World as seen by the avatar or, optionally, by a floating camera (see Figure 2). Other users are seen as articulated 3D models that they have chosen from a menu of available bodies. The user can move freely through the 3D scene using either the mouse or the cursor keys. To communicate, the user types a sentence into an edit field and transmits it into the world by pressing the Return key. A scrollable text window directly below the rendered view displays all transmitted sentences along with the name of the responsible user. The sentence also appears floating above the head of the user’s avatar. Only sentences from the 12 closest users are displayed.
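This proximity rule is worth noting, since it returns in the discussion of shortcomings below. The following sketch shows one way such a rule could be implemented; the class, the function names and the cutoff constant are illustrative assumptions, not Worlds Inc.’s actual code.

    import math

    MAX_LISTENERS = 12  # AWB shows chat from the 12 closest users

    class Avatar:
        def __init__(self, name, x, y, z):
            self.name = name
            self.position = (x, y, z)

    def listeners(speaker, everyone):
        # Pick the 12 avatars nearest the speaker. Orientation and
        # field of view are never consulted; only distance matters.
        others = [a for a in everyone if a is not speaker]
        others.sort(key=lambda a: math.dist(a.position, speaker.position))
        return others[:MAX_LISTENERS]

Under such a scheme, walking up to someone and facing them is purely cosmetic, a point section 2.3 returns to.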

Before using AWB, one must choose a nickname and acquire a unique user ID number from Worlds Inc.’s "immigration" service. The nickname and ID are written into the AWB configuration file, ensuring a consistent identity between sessions. After starting the browser, the user can select from a list of avatar models to represent them in the world, and is free to switch to another model at any time. The models are human figures of both sexes and various ethnicities. Each body has a set of distinctive idle motion sequences that are executed at random for an interesting visual effect. Some avatars seem to check their watches once in a while; others rock their hips or look pensive.
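The idle motions appear to follow a simple random schedule. The sketch below shows one plausible form such a scheduler could take; the sequence names, the timing and the callbacks are assumptions made for illustration.

    import random
    import time

    # Hypothetical idle sequences of the kind described above.
    IDLE_SEQUENCES = ["check_watch", "rock_hips", "look_pensive"]

    def idle_loop(play_animation, keep_running):
        # Play a randomly chosen idle sequence at random intervals.
        # The choice depends on nothing but chance; no conversational
        # state is ever consulted.
        while keep_running():
            time.sleep(random.uniform(5.0, 20.0))  # assumed interval
            play_animation(random.choice(IDLE_SEQUENCES))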

While moving through the world, the user can switch between a view of the surroundings through the eyes of the avatar and an overhead view following the avatar around. This allows the user to look at other users face-to-face or to observe themselves along with the other users. When the user wants to initiate contact with another person, three steps can be taken. First, the user can navigate up to another avatar, making sure to enter the other person’s field of view. Then the user can select from a limited set of animation sequences for the avatar to play, ‘Waving’ being the most appropriate for this situation. Lastly, the user starts a conversation by transmitting a sentence into the space, preferably addressing the person to contact. In fact, only the last step is necessary; the user’s greeting will be ‘heard’ by the 12 closest avatars, regardless of where they are or which way they face. During the conversation, the user keeps typing messages for transmission, switching between animations from a set of ‘Happy’, ‘Angry’ and ‘Wave’ as appropriate. Between the selected animation sequences, the idle motions are executed at random.

Upon entering an Active World through the AWB, one notices how lively, and in fact life-like, the world seems to be. A crowd gathered on the city square churns with activity as avatars move about and stretch their bodies. However, one soon realizes that the animation displayed does not reflect the actual events and conversations taking place, as transcribed by the scrolling text window beneath the world view.

Although the avatars allow the user to create visual formations by controlling position and orientation relative to other avatars, this has no effect on the user’s ability to communicate as long as the desired audience is among the 12 closest persons. One reason for this redundancy is that the bodies in this system convey no conversational signals. The automated motion sequences are not linked to the state of the conversation or the contents of the messages, but are initiated at random, making them irrelevant to the exchange. The manually executed motion sequences allow a few explicit (and somewhat exaggerated) emotional displays, but since they are chosen by the user via buttons on a control panel, they tend not to be used while the user is engaged in a conversation, typing away on the keyboard.


2.3. Shortcomings

Paul walks up to Susan who stands there staring blankly out into space. "Hello Susan, how are you?" Susan looks at her watch as she replies "Paul! Great to see you! I’m fine, how have you been?" Paul returns the stare and without twitching a limb he exclaims "Real Life sucks, I don’t think I’m going back there :) ". Susan looks at her watch. Paul continues "I mean, out there you can’t just walk up to a random person and start a conversation". Susan looks at her watch. Karen says "Hi". While Paul rotates a full circle looking for Karen, Susan replies "I know what you mean". Karen says "So what do you guys think about this place?". Karen is over by the fountain, waving. Susan looks blankly at Paul as she says "I think it is great to actually see the people you are talking to!". Paul is stiff. Karen is waving. Susan looks at her watch.

Two modes of operation

In most current systems (such as the popular Active Worlds and The Palace), the user has to switch between controlling the avatar and chatting with other users. While the user is composing a message for her interlocutor, her avatar stands motionless or keeps repeating a selected animation sequence. This fails to reflect the relationship between the body and the communication taking place, potentially giving misleading or even conflicting visual cues to other users. Some systems, such as the voice-based OnLive world, offer simple lip-synching, which greatly enhances the experience, but behaviors such as gaze and gesture have not been incorporated.

Explicit selection of behavior

The creators of multi-user environments realize that avatars need to be animated in order to bring them to life, but their approach does not take into account the range of different communicative functions the body serves during an encounter. They provide menus from which users can select animation sequences or switch between different emotional representations. The biggest problem with this approach is that every change in the avatar’s state is explicitly controlled by the user, whereas many of the visual cues important to the conversation are spontaneous and even involuntary, making it impossible for the user to explicitly select them from a menu. Furthermore, users are often busy producing the content of their conversation, so that simultaneous behavior control becomes a burden.
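To make the limitation concrete, the sketch below caricatures the menu-driven control model just described; all names are hypothetical. The only path by which the avatar’s state ever changes is an explicit button press, so nothing is displayed while the user is occupied with typing.

    class Avatar:
        def play(self, sequence):
            print("playing", sequence)

    # Hypothetical control-panel handler: the sole trigger for any
    # visible behavior is a deliberate user action.
    def on_button_click(avatar, sequence):
        avatar.play(sequence)

    on_button_click(Avatar(), "Wave")
    # There is no analogous hook for spontaneous or involuntary cues;
    # a gaze shift or a nod has no button the user could press in time.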

Emotional displays

When people looked at the stiff early versions of avatars and considered ways to make them more life-like, they generally came to the conclusion that the avatars were lacking emotions: users should be allowed to express emotions in order to liven up the interaction. Naturally, we associate the display of emotions with being human and with the way we relate to our environment and other people. As repeatedly emphasized in a book on Disney animation written by two of its professional animators (Thomas and Johnston 1981), rich and appropriate emotional display is essential for the illusion of life.

However, lively emotional expression in interaction is in vain if mechanisms for establishing and maintaining mutual focus and attention are not in place (Thorisson and Cassell 1996). A person standing alone on a street corner, staring fixedly at a nearby wall while sporting a broad smile, will be lucky if anyone other than a suspicious officer dares to approach. We tend to take communicative behaviors such as gaze and head movements for granted, as their spontaneous nature and non-voluntary, fluid execution make them easy to overlook when recalling a previous encounter (Cassell, forthcoming). This is a serious oversight when creating avatars or humanoid agents, since emotional displays do not account for the majority of the displays that occur in human-to-human interaction (Chovil 1992).

User tracking

Many believe that employing trackers to map certain key parts of the user’s body or face onto the graphical representation will solve the problem of having to explicitly control the avatar’s every move: as the user moves, the avatar imitates the motion. When used in a non-immersive setting, this approach shares a classical problem with video conferencing: the user’s body resides in a space that is radically different from that of the avatar. The flaw becomes particularly apparent when multiple users try to interact, because the gaze pattern and orientation information gathered from a user looking at a monitor does not map appropriately onto an avatar standing in a group of other avatars. Thus, whereas tracking may be appropriate for Virtual Reality applications where head-mounted displays are employed, it does not lend itself well to Desktop Virtual Environments.
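A small numeric illustration of the mismatch, under assumed values: a desktop user’s tracked head yaw varies only within the few degrees subtended by the monitor, so copying it directly onto the avatar makes the avatar stare almost straight ahead no matter which interlocutor the user is actually attending to.

    # Assumed tracked head yaw (degrees) as the user looks at two
    # different avatars shown on a desktop monitor.
    on_screen_yaw = {"Susan": -4.0, "Karen": 3.5}

    # Assumed bearings (degrees) from the user's avatar to those same
    # interlocutors in the virtual room.
    in_world_bearing = {"Susan": -75.0, "Karen": 60.0}

    for name, yaw in on_screen_yaw.items():
        copied = yaw  # naive copy of the tracked yaw onto the avatar
        print(f"{name}: avatar gazes at {copied:+.1f} deg, "
              f"but {name} stands at {in_world_bearing[name]:+.1f} deg")

Both mapped gazes point nearly straight ahead, although Susan and Karen stand far to either side; observers in the group cannot tell whom the user is looking at.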