Copyright © 1995 by Association for Computing Machinery, Inc. (ACM).
Defining the core methodology that governs story presentation and viewer interaction becomes a new and crucial authorial role. The foundations of our proposed model have been developed and implemented in conjunction with an evolving story about urban change in Boston. This story features a 7 billion dollar public works project to rebuild the Central Artery (I-93) and the project's impact on surrounding neighborhoods. The project will be ongoing through 2004, which makes it a practical story for an Evolving Documentary investigation.
This work has been funded by the News in the Future Consortium at the Media Lab and by the Eastman Kodak Company.
New digital technologies can support evolving collections of media elements which are stored and accessed non-linearly. A growing body of research looks at the problem of how video should be indexed. Research areas include automatic parsing and matching, as well as human-supported annotation activity. Once an author or logger has attached descriptors to a set of media elements, these materials can be retrieved according to simple or complex directed queries.
While the retrieval-by-query mechanism suits particular users, it will not satisfy users who do not know what is in the database or what they want to see. Consider the difference between the editor who is looking for certain content in an historical shot, and the consumer of news who wants an in-depth review of a public works project.
This latter circumstance suggests both a new model for content and the need for a narrative approach to browsing which incorporates some partnership between the viewer and the presentation engine. Assuming a rich content database of media materials, such a browser needs to suggest narratives by taking advantage of descriptive associations and methods for temporal progression. This paper presents ConText, a model browser.
In the next section, we discuss the "evolving documentary" model for extensible content and review previous research in the area of annotation and story modeling. Next we discuss the author's activity with explicit reference to an example data set we are in the process of developing. We then introduce a set of presentation methodologies which can take narrative advantage of a description space. Finally we present the look and feel of ConText as a functioning browser. Our conclusion summarizes the current state of the work, underscores idiosyncratic aspects of the method, and suggests future directions.
Figure 1: The Current Model
Certain ongoing news stories - including wars, elections, public works, science - fit this content model. The news story evolves as follows. A body of material is collected by a journalist for an editor who shapes the story. The journalist designs the story so that something is disclosed and if possible a cause is highlighted. The story rarely includes all relevant material; rather the "best" material for the chosen story is selected and sequenced. This selection reflects available editing time as well as available air time. Figure 1 shows an overview of this process whereby an editor filters the available content down to a rigid form for presentation to viewers.
In the case of developing or "evolving" stories, both the journalist's and the viewer's knowledge of the situation change as additional stories are filed. Eventually, the "bigger picture" begins to emerge. The first story now becomes a mere fragment of a larger whole. Different reporters will often develop very different stories around different themes on different days. This breadth provides us with context. Both editor and viewer draw on these older stories to shape their understanding of a new event.
Figure 2: The "Evolving Documentary"
In the world of digital media systems, the materials for an evolving story may be stored immediately, on collection, in a content repository. The materials grow as the story evolves. For this reason the storage and descriptive architecture must be extensible. We must be able to add new materials easily without noticeably impairing retrieval. In addition, we must be able to retrieve materials without the viewer knowing explicitly what is there. In the case of the "Evolving Documentary," the media materials and the description space in which they are mapped must be scalable, and the experience afforded the viewer must be interesting enough to encourage repeated exploration. Figure 2 shows an overview of this process.
In past work, we have developed systems for content annotation of video. Parallel development has looked at both class/keyword and knowledge-based systems. In addition, we have considered the value of stream-based and clip-based annotation methods. In most annotation schemes we have found the journalist's w's -- especially, who? what? where? when? -- to be critical annotations for editorial query; encoding the fifth w, why?, is still problematic.
We have also explored strategies for expressing and filling higher level narrative queries to a database of video materials. Such queries will inevitably add value to both the materials and the storage bank itself. In previous work, we have explored the idea of cascading multiple filters as well as the idea of creating a temporal query which relies on class/keyword content descriptors.
The limitation of these approaches is that they are explicit over a duration. The query cannot grow exponentially complex. If the story changes -- which could happen if, for instance, an interview became available that contradicted an earlier opinion -- the story model could not know to modify itself. Due in part to this limitation, neither of these strategies provides us with a dynamically steerable, progressive narrative. The program computes a narrative whole and the viewer must watch all or quit. This may be an appropriate strategy for some multivariant fiction stories, but it is insufficient in the case of an "evolving documentary" where the database contains a rich mixture of breadth and depth.
Our conclusion, based on these experiments, is that an "evolving documentary" needs a generalizable approach to description and presentation. Moreover, the audience needs to be able to dynamically steer the presentation so that it may remain relevant to their interests.
Boston's Central Artery stands as "a green monster," a relic of highway construction in the 1950s. Built to revitalize the downtown, this large, rusting iron structure visually separates the commercial Faneuil Hall area from the North End, Boston's most protected historic neighborhood.
Throughout the 1980s, a group of politicians devised a plan to replace this overhead highway with 37 lane miles of underground roadway. The estimated cost of this project is 7 billion dollars, most of which comes from Washington under the auspices of the Interstate Highway bill. This is currently the largest public works project in the United States. Several neighborhoods, including Boston's North End, will be affected by these developments in ways we do not yet know.
Currently MIT students are following a variety of stories in different media. These include portraits of particular places and people, interviews with residents and politicians, and coverage of events and meetings. Stories are developed with more or less depth depending on interest, but the stories crisscross with a rich variety of themes. In addition to original material, the project uses materials from the Boston Globe and historic photographs. Some sequences are edited with a particular intent relative to narration. Shorter segments of material, edited by individual authors, are described according to an established set of descriptors and the general editorial guidance of the principal software designer Mike Murtaugh.
Figure 3: The Author's Activities
The second and final step is to describe the content using the established set of descriptors. The knowledge representation used in ConText is a simple bidirectional mapping between units of content and units of description. That is, the author associates a set of relevant descriptors with each piece of content added to the system. As the author connects each piece of content with sets of descriptors, they also define a mapping from descriptors to sets of content. Thus, for each descriptor, we know the set of content units that it describes. Note that relationships between content and descriptors are unqualified; the links are not weighted. Thus the core representation is equivalent to a simple keyword system. Descriptor weights, or prominences, are a function of playout during the ConText browsing session, as described in the following section on Presentation Methods.
Figure 4: Annotation of Content with Descriptors
Figure 4 shows two video clips in the left column and their associated descriptors in the right. The first is of North End resident Nancy Caruso describing how the Central Artery currently serves as a "green monster," dividing the North End from the rest of the city and protecting the community's residents from outsiders. In the second clip, urban planner Homer Russell expresses a similar idea about the Artery functioning as "a sort of Chinese Wall" or barrier. In addition, he notes how the North End community was justifiably frightened based on the experience of the West End, a community once like the North End that was leveled and replaced by highrise buildings in the 1960s.
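The bidirectional mapping can be illustrated with a minimal sketch. This is not the ConText implementation: the function name is hypothetical, and the descriptor sets for the two clips are an assumption reconstructed from the later discussion of Figure 8 (five descriptors for "Green Monster," seven for "Chinese Wall"), so they may not match Figure 4 exactly.

```python
from collections import defaultdict

# Assumed descriptor sets, reconstructed from the discussion of Figure 8.
ANNOTATIONS = {
    "Green Monster": {
        "Nancy Caruso", "North End", "Central Artery", "Barrier", "Protection",
    },
    "Chinese Wall": {
        "Homer Russell", "North End", "Central Artery", "West End",
        "Barrier", "Protection", "Fear",
    },
}


def build_reverse_index(content_to_descriptors):
    """Derive the descriptor -> content mapping from the author's annotations."""
    descriptor_to_content = defaultdict(set)
    for content, descriptors in content_to_descriptors.items():
        for descriptor in descriptors:
            # Links are unqualified: a descriptor either describes a unit or not.
            descriptor_to_content[descriptor].add(content)
    return dict(descriptor_to_content)


REVERSE = build_reverse_index(ANNOTATIONS)
print(sorted(REVERSE["North End"]))  # both clips share this descriptor
print(sorted(REVERSE["Fear"]))       # only the "Chinese Wall" clip
```

Because the reverse mapping is derived entirely from the annotations, adding a new clip requires no change to existing entries, which is in keeping with the extensibility requirement.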
Authors currently use a separate tool to define these relationships and establish graphical representations for descriptors (either text or a picture file). This information is then saved out in a format that the ConText browser reads to begin a browsing session.
As its name is meant to underscore, a central property of ConText is the idea of a story context. A story context is defined to be a set of descriptors where each descriptor is qualified by a numerical value. That value, its prominence, represents the level of that descriptor's importance within the given story context. For example, a story context including the character descriptor "Nancy Caruso" with a prominence value of 100, the location descriptor the "North End" with prominence 80, and the thematic descriptor "Protection" with prominence 20, represents a moment in the story where Nancy Caruso and the North End are quite prominent while the theme of protection is only slightly prominent.
When the system is required to select a unit of content to be displayed, the choice is made based on the current story context. Excluding those already displayed, each content unit is assigned a score equal to the sum of the prominences of its associated descriptors. The unit with the highest score wins and is selected, and is later marked as having been played. If there's a tie, the choice is made at random from among the high-scorers. Thus, given the story context above, a video clip associated with the "Nancy Caruso" descriptor and the "North End" descriptor would have a score of 180, while one tied to the character "Fred Salvucci" and the theme "Protection" would have a score of just 20 (since "Fred" has a prominence of 0). Given the choice between just these two clips the system would select the first.
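The selection rule can be sketched as follows. The function and clip names are hypothetical; only the scoring rule (sum of prominences, exclusion of played units, random tie-breaking) and the example values are taken from the text.

```python
import random


def score(descriptors, context):
    """Sum the prominences of a unit's descriptors; absent descriptors count as 0."""
    return sum(context.get(d, 0) for d in descriptors)


def select_next(annotations, context, played, rng=random):
    """Pick the highest-scoring unplayed unit, breaking ties at random."""
    candidates = {
        unit: score(descriptors, context)
        for unit, descriptors in annotations.items()
        if unit not in played
    }
    if not candidates:
        return None  # everything has already been shown
    best = max(candidates.values())
    winners = sorted(unit for unit, s in candidates.items() if s == best)
    choice = rng.choice(winners)
    played.add(choice)  # mark the unit as having been played
    return choice


# The story context from the example above.
context = {"Nancy Caruso": 100, "North End": 80, "Protection": 20}
annotations = {
    "clip_a": {"Nancy Caruso", "North End"},
    "clip_b": {"Fred Salvucci", "Protection"},
}
played = set()
print(score(annotations["clip_a"], context))      # 180
print(score(annotations["clip_b"], context))      # 20
print(select_next(annotations, context, played))  # clip_a
```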
The final crucial component to this system is that as the selected unit of content is presented to the user, the prominence value for those descriptors associated with the content is increased while the prominence value of all other descriptors is decreased (unless already zero). Thus the selection of a given piece of content influences the story context in a way that makes the future selection of similarly described content more likely. This property is called description feedback because of the way the sets of descriptors associated with a selection "feedback" on the selection process by influencing the story context. In this way, content acts as a bridge between story contexts. Figure 5 shows an overview of this process.
As further clips are selected and presented, the story context continues to develop with repeated elements rising in prominence while others recede. Thus, at any point, the story context reflects the path the viewer has taken during their particular browsing session. Later, in the section, "The Interface," an actual progression of story contexts from the Artery story is shown and described.
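Description feedback can be sketched as a single update step. The step sizes `gain` and `decay` are assumptions chosen for illustration; the text states only that associated descriptors rise, all others fall, and no prominence drops below zero.

```python
def apply_feedback(context, all_descriptors, shown, gain=10, decay=5):
    """Return the story context after a unit with descriptors `shown` plays.

    `gain` and `decay` are illustrative step sizes, not values from the paper.
    """
    updated = {}
    for descriptor in all_descriptors:
        prominence = context.get(descriptor, 0)
        if descriptor in shown:
            updated[descriptor] = prominence + gain
        else:
            # Other descriptors recede, but never below zero.
            updated[descriptor] = max(0, prominence - decay)
    return updated


ALL_DESCRIPTORS = ["Nancy Caruso", "North End", "Protection", "Fred Salvucci"]
before = {"Nancy Caruso": 100, "North End": 80, "Protection": 20}
after = apply_feedback(before, ALL_DESCRIPTORS, shown={"Nancy Caruso", "North End"})
print(after)
```

Applying this update after every playout makes similarly described content more likely to be chosen next, which is the continuity effect described above.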
Figure 5: Description Feedback
Given the mechanism described above for continuity by description feedback, a progression of detail is surprisingly easy to implement when one makes the observation that: The number of descriptors associated with a unit of content is representative of the content's specificity. To favor relatively general units of content before more detailed ones, we simply need to divide the unit of content's "score" by the number of its associated descriptors. Thus we "penalize" content for having a large number of descriptors and cause less heavily described content to be presented first. The effect is relative in the sense that as the "general" content is played, it's removed from the pool of potential content. The previously penalized pieces then become relatively general to the content still available for playout that is even more heavily annotated. Thus, the movement from general to specific is truly a gradual progression.
As an example of this principle, consider a piece of content associated with only the character descriptor for "Nancy Caruso" and the location descriptor for the "North End." Such a piece of content might simply be audio narration stating that "Nancy Caruso is a resident of the North End." In contrast, a clip of Nancy Caruso talking about the North End in the 1950s would also be tied to the time descriptor representing the "1950s" and the theme "Memory." This clip is clearly more specific and has more relevance when presented after the former "establishing" material.
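A minimal sketch of this normalization, using the two units from the example. The story context values are assumed for illustration, and the function name is hypothetical.

```python
def detail_score(descriptors, context):
    """Score a unit as the sum of its prominences divided by its descriptor count."""
    if not descriptors:
        return 0.0
    return sum(context.get(d, 0) for d in descriptors) / len(descriptors)


# Assumed story context: only the character and the location are prominent.
context = {"Nancy Caruso": 100, "North End": 80}

narration = {"Nancy Caruso", "North End"}
memory_clip = {"Nancy Caruso", "North End", "1950s", "Memory"}

print(detail_score(narration, context))    # 90.0
print(detail_score(memory_clip, context))  # 45.0
```

The establishing narration outscores the more heavily described clip, so it plays first. Once it is removed from the pool, the memory clip becomes the most general unit remaining.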
Figure 6: Progression of Detail
Figure 6 shows a progression of content based on a given weighted set of descriptors representing a story context. Note how the progression builds from general to more specific content while featuring those units of content associated with the more prominent descriptors.
The relationship described above between the number of descriptors and the level of specificity is actually an equivalence. If one asks what exactly "specific" means in this context, the answer is that it means specific with respect to the description space created by the author. Thus, if we had a video clip of Nancy Caruso comparing the North End to Little Italy in New York, the clip's annotations would still consist only of "Nancy Caruso" and the "North End" if our description space didn't have descriptors for "Little Italy", "New York", or the idea of a "comparison." Thus, the apparently more general narration establishing Nancy Caruso as a resident of the North End would be considered just as general as the "Little Italy" clip with respect to our current description space. The point is, this measure of specificity is only as meaningful as the description space is complete. In this case, if Little Italy or New York were pertinent to the story, they ought to be added as descriptors; otherwise, the function of that piece of content in the current story is questionable.
We've used the term description space repeatedly, while in fact the term may rightly seem unjustified. Although we do have dimensions in terms of our categories of descriptors, directions along these axes or between any descriptors are not defined. In fact, implicit in the requirement of maintaining an extensible content base is that just as units of content must not be explicitly connected, neither may units of description. Thus, a relationship like "the North End is adjacent to the Central Artery" must not be explicitly represented in the database. However, such a fact is quite relevant, especially when one wants the story to "pick up the pace" and possibly move away from a context including the Central Artery to adjacent locations.
Just as adjacencies of "continuity" are found between content based on their common connections to descriptors, meaningful adjacencies between descriptors may be found based on common connections to content. In the example given above, the existence of a unit of content annotated with descriptors for both the "Central Artery" and the "North End" would allow such a connection to be found. The content might be a video clip panning from the Artery to the North End, a visual expression of their geographic adjacency. Thus, the relationship between the two locations is available to the system from the content. If one wonders about relationships that exist between descriptors that aren't expressed by the content in the system, the fact is that they can't be captured. A unique property of the system is that only those relationships between descriptors articulated by available content could be used. In short, if the system can't demonstrate an idea, it can't know it.
By mirroring the structure and mechanism of description feedback described above, the system is capable of producing an analogous process of "content feedback" to explore possible movements within the space of descriptors. Given the set of descriptors with adjacencies to those prominent in the current story context, the system can "increase the pace" of the story by making those adjacent descriptors more prominent than the current ones. In order to prevent immediate movement back towards the previous story context, a structure analogous to the "already shown?" tag used with units of content could be used with descriptors, capturing whether the descriptor had recently been invoked by the pacing mechanism. Figure 7 shows an overview of this process.
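One possible reading of content feedback is sketched below. The function names, the `boost` step, and the way the "recently invoked" tag is kept are all assumptions; the text specifies only that adjacent descriptors are found through shared content and are made more prominent than the current ones.

```python
def adjacent_descriptors(annotations, prominent):
    """Descriptors that share at least one unit of content with a prominent one."""
    adjacent = set()
    for descriptors in annotations.values():
        if descriptors & prominent:
            adjacent |= descriptors - prominent
    return adjacent


def increase_pace(context, annotations, recently_invoked, boost=10):
    """Raise adjacent descriptors above every currently prominent descriptor.

    `recently_invoked` plays the role of the "already shown?" tag for
    descriptors; `boost` is an illustrative step size.
    """
    prominent = {d for d, p in context.items() if p > 0}
    if not prominent:
        return dict(context)
    ceiling = max(context.values())
    targets = adjacent_descriptors(annotations, prominent) - recently_invoked
    updated = dict(context)
    for descriptor in targets:
        updated[descriptor] = ceiling + boost
    recently_invoked.update(targets)  # tag descriptors invoked by pacing
    return updated


# Hypothetical content: a pan from the Artery to the North End, plus narration.
annotations = {
    "artery_pan": {"Central Artery", "North End"},
    "caruso_narration": {"Nancy Caruso", "North End"},
}
context = {"Central Artery": 50}
recent = set()
faster = increase_pace(context, annotations, recent)
print(faster)
```

Note that "Nancy Caruso" is not reachable from "Central Artery" in one step here, since no unit of content carries both descriptors. The sketch can only find relationships that the available content demonstrates.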
Figure 7: Finding Descriptor Adjacencies with Content Feedback
Finally, it is important to note that unlike continuity and the progression of detail, which may always be active, this method for pacing requires additional input as to when it ought to be invoked. One effective approach would be to simply give this control directly to the viewer, letting them decide if they wish to "stay in place" or "push the story forward." A second approach would be to start a ConText session with the pacing set to be relatively fast and gradually decrease it. This simple model corresponds nicely with the idea of allowing the viewer to explore the full breadth of the database first, then to dive into depth as they find content of interest.
One notable property of combining the pacing mechanism with the "progression of detail" is that the control of pacing from "slow" to "fast" becomes the same as control from "depth" to "breadth" exploration of the content/description space. When the pace desired is slow, and the pacing mechanism is not invoked, the selection proceeds normally, with progression of detail tending to move from general to more specific -- exposing depth. When a faster pace is desired, however, adjacent descriptors are made more prominent than those of the current story context. The result is that by progression of detail, more general content relating to the newly invoked adjacent concepts becomes favored. Thus, movement is now decidedly upward and sideways -- exposing breadth.
By using a technique of "specializing by description", one can imagine applying the influence of the above techniques in more directed ways for specific story functions. Imagine that we add a "weight" value in addition to the prominence of each descriptor in a story context. By using the product of a descriptor's prominence and its weight instead of its prominence alone, one could make certain types of descriptors more influential. For example, we might give location descriptors a higher weight than the others to make locations the "focus" of our story.
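A minimal sketch of weighted scoring follows. The category table, the weight of 2.0 for locations, and the clip contents are assumptions for illustration; the text proposes only that prominence be multiplied by a per-descriptor weight.

```python
# Hypothetical category assignments for a few descriptors.
CATEGORY = {
    "Nancy Caruso": "character",
    "North End": "location",
    "Protection": "theme",
}

# Assumed weights: locations count double, every other category defaults to 1.0.
WEIGHTS = {"location": 2.0}


def weighted_score(descriptors, context, category_of=CATEGORY, weights=WEIGHTS):
    """Sum prominence times category weight over a unit's descriptors."""
    return sum(
        context.get(d, 0) * weights.get(category_of.get(d), 1.0)
        for d in descriptors
    )


context = {"Nancy Caruso": 100, "North End": 80, "Protection": 20}
character_clip = {"Nancy Caruso", "Protection"}
location_clip = {"North End", "Protection"}

# Unweighted sums would be 120 and 100; the location weight reverses the order.
print(weighted_score(character_clip, context))  # 120.0
print(weighted_score(location_clip, context))   # 180.0
```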
In a similar way, one can imagine applying the "pacing mechanisms" only to a certain subset of the descriptors. For instance, you could "increase the pace" of the character descriptors to cause movement across characters while other descriptors remained relatively stable. Using this technique in combination with the "location weighting" example given above would result in a story primarily about locations as told by many characters.
Units of content are displayed centered on top of the collage. The gradual influence a given content unit has over the story context is shown visually as the content appears on the screen. Thus, as a video clip plays out, the viewer sees its connected descriptors becoming brighter and more in focus while other descriptors fade away from view. An interesting property of this playout structure is that the longer the content is active, the more influence it has on the story context. Thus longer movie clips or pictures held on the screen by the user are more influential than shorter clips or materials that the user quickly dismisses (see "The Viewer's Activities" below).
Figure 8: Progression of ConTexts
Figure 8 shows a progression of three story contexts. In context (a), a picture representing the North End is the sole prominent descriptor. Recall the "Green Monster" and "Chinese Wall" clips and their associated descriptors shown in Figure 4. Given this context and assuming neither clip has been seen by the viewer, both would have a non-zero score because of their shared connection to the "North End" descriptor. In accordance with the "progression of detail" methodology, however, "Green Monster," with its five associated descriptors, has a higher score than "Chinese Wall" with its seven descriptors. Context (b) shows the effect of the "Green Monster" clip's playout, as the character "Nancy Caruso," the location "Central Artery," and the themes "Barrier" and "Protection" have become more prominent. Given this new context, the "Chinese Wall" clip has an even higher score than before and thus becomes the next clip selected for playout. Context (c) shows the resulting context after "Chinese Wall" plays out: the repeated themes of "Protection" and "Barrier," as well as the locations "Central Artery" and "North End," have become quite prominent. In addition, the descriptors for the theme "Fear," the location "West End," and the character "Homer Russell" have each become more prominent while the character "Nancy Caruso" has begun to recede.
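The progression from context (a) to context (c) can be traced numerically. In this sketch the initial prominence of 100 and the step sizes `gain` and `decay` are assumed values, and the descriptor sets are reconstructed from the description above rather than read from Figure 4.

```python
CLIPS = {
    "Green Monster": {
        "Nancy Caruso", "North End", "Central Artery", "Barrier", "Protection",
    },
    "Chinese Wall": {
        "Homer Russell", "North End", "Central Artery", "West End",
        "Barrier", "Protection", "Fear",
    },
}
ALL_DESCRIPTORS = set().union(*CLIPS.values())


def detail_score(clip, context):
    """Sum of prominences divided by the number of descriptors."""
    descriptors = CLIPS[clip]
    return sum(context.get(d, 0) for d in descriptors) / len(descriptors)


def play(clip, context, gain=10, decay=5):
    """Apply description feedback for one playout (assumed step sizes)."""
    shown = CLIPS[clip]
    return {
        d: (context.get(d, 0) + gain) if d in shown
        else max(0, context.get(d, 0) - decay)
        for d in ALL_DESCRIPTORS
    }


context_a = {"North End": 100}  # assumed starting prominence
first = max(CLIPS, key=lambda clip: detail_score(clip, context_a))
score_a = detail_score("Chinese Wall", context_a)

context_b = play(first, context_a)
score_b = detail_score("Chinese Wall", context_b)

context_c = play("Chinese Wall", context_b)

print(first)             # Green Monster
print(score_a, score_b)  # the "Chinese Wall" score rises from (a) to (b)
print(context_c["Nancy Caruso"], context_c["Homer Russell"])  # 5 10
```

Under these assumed values the trace matches the narrative: "Green Monster" is selected first, the "Chinese Wall" score rises after it plays, and in context (c) "Nancy Caruso" recedes while "Homer Russell," "West End," and "Fear" emerge.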
The final component to the interface is the existence of subtle "tick marks" running in lines around each of the four edges of the interface display. Each mark represents a different descriptor and is arranged in one of four color-coded groups along each screen edge. The four groups correspond to the four categories of descriptors: Character, Theme, Location, and Time. These marks give the viewer immediate access to any of the descriptors in the system, regardless of the current story context. Figure 9 shows an actual screen shot from the current prototype. The central image is the current frame of the "Green Monster" clip as it plays out. In the background, the clip's associated descriptors gradually change to reflect their rising prominence in the current story context.
Figure 9: Screen Shot from the Browser
Returning to the scenario described in Figure 8, after viewing the clip "Green Monster," the viewer might have chosen to move the mouse over the character "Nancy Caruso." This action would steer playout towards further content featuring that character (if any existed) and away from the "Chinese Wall" clip.
Currently, the viewer is given very coarse control over the playback. They may "pause" playout by moving the cursor over the material (or in the case of an audio clip, the center of the screen), and they may "dismiss" the material, stopping playout immediately, by clicking the mouse button.
In ConText, content is presented when the system detects idleness from the viewer. Thus, the story moves only when the viewer stops interacting. Content continues to be presented until the viewer stops it by clicking or moving the mouse over the interface to alter the story context. In this way, interaction in ConText follows the model of a one-sided conversation. As the story is told, the viewer is "passive" and attentive to the narrative. Only when the viewer wants to change the course of the presentation does he or she intervene, asking to delve into a particular aspect in more detail or perhaps urging to move on to something else.
Returning once more to the scenario depicted in Figure 8, given the final state shown as context (c), the viewer may simply not interact, most likely resulting in more content related to the Artery and the North End and the themes barrier and protection. If, however, the viewer chooses to intercede by moving the mouse over the emerging location "West End," the story would move towards further content describing that area and any related elements.
The section on presentation methodologies raises the idea that by carefully weighing the influence of each methodology, the author could invoke specific types of story structures. In order for this to occur, the author must have some way to describe the operation of these structures, as well as the means for specifying when their use is appropriate. In order to adhere to the constraints of our extensible form, all of this must be done in a generalized way. In addition, our experience with the Artery story shows that individual authors need leadership when dealing with issues of content granularity and description. Limiting clips to a length of approximately 30 seconds, for example, was found to be helpful to both the description process and the resulting playout experience. Despite our attempts at "normalizing" the description space with four axes, we find that the task of maintaining the description space still grows as content is added. A more flexible and dynamic means of annotating content would be of great assistance. These problems will be the subject of future research.
2. Houbart, Gilberte. Viewpoints on Demand: Tailoring the Presentation of Opinion, MS thesis, MIT, 1994.
3. Davenport, Glorianna, Thomas G. Aguierre Smith, Natalio Pincever. Cinematic Primitives for Multimedia, IEEE Computer Graphics and Applications, pp. 67-74, July 1991.
4. Davis, Marc Elliot. Media Streams: Representing Video for Retrieval and Repurposing, PhD Thesis, MIT, 1995.
5. Davenport, Glorianna, Ryan Evans, Mark Halliday. Orchestrating Digital Micromovies. Leonardo, Vol. 26, No. 4, pp. 282-288, 1993.
6. Davenport, Glorianna and Lee Morgenroth. Video Database Design: Convivial Storytelling Tools, Interactive Cinema Technical Report, MIT, May 1994.
7. Branigan, Edward. Narrative Comprehension and Film, Routledge, New York, 1992, especially "Chapter 1: Narrative Schema."
8. Colby, Grace, and Laura Scholl. Transparency and Blur as Selective Cue for Complex Information, in Proceedings of SPIE'92. 1992.