Prosodic Font
The Space Between the Spoken and the Written

Tara Rosenberger Shankar

Sunset.wav and Wow.wav
These are composited images created with different frames from two separate Prosodic Font animated utterances. Click on these images to see the Prosodic Font animation. You can also hear the original sound files used to make these Prosodic Font animations.

When most words are written, they become, of course, a part of the visual world. Like most of the elements of the visual world, they become static things and lose, as such, the dynamism which is so characteristic of the auditory world in general, and of the spoken word in particular. They lose much of the personal element...They lose those emotional overtones and emphases...Thus, in general, words, by becoming visible, join a world of relative indifference to the viewer – a world from which the magic ‘power’ of the word has been abstracted.
Marshall McLuhan in The Gutenberg Galaxy (1962), quoting J.C. Carothers, writing in Psychiatry (November 1959).



I am demonstrating an expressive communication form I call Prosodic Font. Prosody - the intonational and rhythmic patterns implicit in the spoken voice - is an autobiographical description of a speaker's expression, emotional state, and emphasis. Designers can use the temporal information of the voice in the real-time dynamic design of typographic forms. As such, a Prosodic Font bridges the distance between the spoken and the written.

A Prosodic Font gains form, shape and substance through a particular someone, somewhere, talking about something. The alphabetic forms, or glyphs, are given shape and motion through a coded interpretation of the speech signal. Mappings between voice and glyph design might go something like this: Vocal timbre controls the glyph rendering choices. The speech dynamic controls the dynamic glyph size. The relative pitch range controls the anti-aliasing, curvature, weighting and squash-n-stretch of the glyph figures. The rate of speech causes the words to appear for longer or shorter durations. The explosiveness of a phonetic "t" causes the letter to shake violently...


Prosody is semantically specific. For example, it is very clear whether someone is saying "I like your work" in a disparaging manner or in a complimentary one.

Prosody is emotionally specific. It is evident across telephone lines (to about 60% accuracy) when someone is angry and when they are sad. Our ability to read finer gradations of emotions - and even mixtures of emotions - from voice stimuli alone is a well-developed skill by adulthood, as is our unconscious, skillful manipulation of endless variations of spoken tunes and rhythms.

Prosody is also informationally specific. The syntactic "point" of a sentence is often disambiguated from the rest of the sentence through prosodic means. However, intonational contours and rhythms do not lend themselves to causal interpretations. There is no "angry" intonational contour or rhythm. Rather, vocal characteristics act as a gestalt - within a contextualized communication situation - to communicate fine degrees of speaker state, meaning and intention.

Modelling typography after speech requires introducing a temporal design element. I have been privileged to inherit the design legacy of the Media Lab, where many people have designed temporal typographic forms. Yin Yin Wong's and David Small's work in Temporal Typography, Suguru Ishizaki's work in Kinetic Typography, Professor John Maeda's course in Digital Typography, and Peter Cho's typographical work have all strongly influenced my design.


Optimally, a Prosodic Font would use a real-time speech and prosody recognition system. At present, no off-the-shelf speech recognition system gives other developers access to the raw or interpreted signals in addition to the products of recognition, the words. Furthermore, prosody recognizers (including F0, spectrogram, and amplitude trackers and syllable break markers) are not integrated into speech recognition systems, probably because no one has perceived a need for this integration. Without access to these real-time signals, let alone an integrated speech and prosody recognizer, I was not able to build a real-time system. Therefore, I constructed my own database of pre-recorded speech. These recordings are converted (by machine and by hand) to input text files containing textual descriptions of the speech features in which I am interested. The machine recognition is accomplished using Paul Taylor's TILT system (1998) and software from the University of Edinburgh. Using these text files as input to a collection of abstract glyphs, the Prosodic Font system constructs letters, and then syllables and words, in space and time according to algorithms of shape and motion. Development began in Java 1.2 beta 2 and finished in the Java 1.2 release.
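The text does not describe the layout of the input text files, so the Java sketch below assumes a purely hypothetical format: one syllable per line, with whitespace-separated fields for the syllable text, start time, end time, amplitude, and a single F0 value. The class names SyllableRecord and FeatureFileParser are likewise invented for illustration. Reducing pitch to one number per syllable is a simplification, since TILT describes the shape of intonational events and not a single value.

```java
import java.util.ArrayList;
import java.util.List;

// One syllable's worth of speech features (hypothetical field set).
final class SyllableRecord {
    final String text;
    final double startSec;
    final double endSec;
    final double amplitude;
    final double f0Hz;

    SyllableRecord(String text, double startSec, double endSec,
                   double amplitude, double f0Hz) {
        this.text = text;
        this.startSec = startSec;
        this.endSec = endSec;
        this.amplitude = amplitude;
        this.f0Hz = f0Hz;
    }

    double durationSec() {
        return endSec - startSec;
    }
}

final class FeatureFileParser {
    // Parses lines such as "sun 0.10 0.42 0.63 210.0" (assumed format).
    // Blank lines and lines starting with '#' are skipped.
    static List<SyllableRecord> parse(List<String> lines) {
        List<SyllableRecord> records = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] f = line.split("\\s+");
            if (f.length != 5) {
                throw new IllegalArgumentException("Expected 5 fields: " + line);
            }
            records.add(new SyllableRecord(
                    f[0],
                    Double.parseDouble(f[1]),
                    Double.parseDouble(f[2]),
                    Double.parseDouble(f[3]),
                    Double.parseDouble(f[4])));
        }
        return records;
    }
}
```

Each parsed record could then be handed to whatever code animates the glyphs for that syllable.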

Glyph Design

There are many examples of continuously parameterized fonts. Don Knuth's METAFONT project (left image) used in excess of seventy parameters to create the differences in typographical style shown. Adobe's Multiple Master font (right image) likewise can change font parameters continuously to create glyphs that look as if they belong to different font families entirely.

Prosodic Font uses an object-oriented approach to creating fonts that can move easily. The primitives (above) can be placed within a typographer's grid given two lines of constraint. They are composited with other primitives to form every letter of the alphabet. In this way, each primitive can maintain its own shape integrity as the font transforms over time.

These two Prosodic Font 'g' examples differ only in weight. Each is composited from a circle and a straight line with a curved tail 'facing left' connected to it.
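The compositing idea can be sketched in Java 2D terms as follows. This is a minimal illustration under stated assumptions, not the thesis code: the names Primitive, CirclePrimitive, TailedStemPrimitive, and Glyph are hypothetical, the proportions of the tail are arbitrary, and the two constraint lines are taken to be horizontal lines at the y coordinates top and bottom (with y growing downward, as in Java 2D).

```java
import java.awt.BasicStroke;
import java.awt.Shape;
import java.awt.geom.Ellipse2D;
import java.awt.geom.Path2D;
import java.util.ArrayList;
import java.util.List;

// A primitive places itself between two horizontal constraint lines.
interface Primitive {
    Shape place(double x, double top, double bottom);
}

// Bowl of the 'g': a circle whose diameter spans the two lines.
final class CirclePrimitive implements Primitive {
    public Shape place(double x, double top, double bottom) {
        double d = bottom - top;
        return new Ellipse2D.Double(x, top, d, d);
    }
}

// Straight stem on the right of the bowl with a tail curving to the left.
final class TailedStemPrimitive implements Primitive {
    public Shape place(double x, double top, double bottom) {
        double h = bottom - top;
        Path2D.Double p = new Path2D.Double();
        p.moveTo(x + h, top);
        p.lineTo(x + h, bottom + 0.5 * h);                       // straight stem
        p.quadTo(x + h, bottom + h, x + 0.5 * h, bottom + h);    // left-facing tail
        return p;
    }
}

// A glyph is a list of primitives plus a stroke weight.
final class Glyph {
    private final List<Primitive> parts;
    private final float weight;

    Glyph(float weight, List<Primitive> parts) {
        this.weight = weight;
        this.parts = new ArrayList<>(parts);
    }

    // Same primitives, different weight: the only difference between the two 'g's.
    Glyph withWeight(float newWeight) {
        return new Glyph(newWeight, parts);
    }

    BasicStroke stroke() {
        return new BasicStroke(weight);
    }

    // Composite all primitives into one path; each keeps its own shape.
    Path2D outline(double x, double top, double bottom) {
        Path2D.Double path = new Path2D.Double();
        for (Primitive part : parts) {
            path.append(part.place(x, top, bottom), false);
        }
        return path;
    }
}
```

Because weight lives on the glyph and not on the primitives, changing it leaves the composited outline untouched; only the stroke used to draw that outline changes.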

Mapping Relationships

My goal was to create Squash-n-Stretch syllables from prosodic input parameters. The mapping relationships I have currently implemented are:
Prosody | Font
Amplitude | Scalar size (per syllable)
Average amplitude | Background graphic rectangle to show "normal" voice size
Abstraction of fundamental frequency curve (TILT system) | Width, height (inversely related), weight (inversely related), vertical translation (not yet implemented)
Syllable duration | Duration of syllable activity
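The mapping relationships listed above might be sketched in Java roughly as follows. This is an illustration only, not the thesis implementation. The class names SyllableProsody, GlyphParams, and ProsodyMapper, the normalized pitch range of -1 to 1, and the constants 0.5 and 0.4 are all assumptions. The list also does not say in which direction the inverse relations run; the sketch assumes that a rising pitch stretches the syllable taller while making it narrower and lighter.

```java
// Hypothetical per-syllable prosodic features; names and units are assumptions.
final class SyllableProsody {
    final double amplitude;      // mean amplitude of the syllable (linear units)
    final double pitchExcursion; // normalized F0 movement in [-1, 1]; positive = rising
    final double durationSec;    // syllable duration in seconds

    SyllableProsody(double amplitude, double pitchExcursion, double durationSec) {
        this.amplitude = amplitude;
        this.pitchExcursion = pitchExcursion;
        this.durationSec = durationSec;
    }
}

// Visual parameters handed to a glyph renderer (also hypothetical).
final class GlyphParams {
    final double scale;        // overall size; 1.0 matches the "normal" voice size
    final double widthFactor;  // horizontal stretch
    final double heightFactor; // vertical stretch
    final double weightFactor; // stroke weight multiplier
    final long displayMillis;  // how long the syllable stays active

    GlyphParams(double scale, double widthFactor, double heightFactor,
                double weightFactor, long displayMillis) {
        this.scale = scale;
        this.widthFactor = widthFactor;
        this.heightFactor = heightFactor;
        this.weightFactor = weightFactor;
        this.displayMillis = displayMillis;
    }
}

final class ProsodyMapper {
    private final double averageAmplitude; // size of the background rectangle

    ProsodyMapper(double averageAmplitude) {
        if (averageAmplitude <= 0.0) {
            throw new IllegalArgumentException("averageAmplitude must be positive");
        }
        this.averageAmplitude = averageAmplitude;
    }

    GlyphParams map(SyllableProsody s) {
        // Amplitude -> scalar size, relative to the utterance average.
        double scale = s.amplitude / averageAmplitude;
        // Clamp pitch movement so the stretch factors stay in a sane range.
        double p = Math.max(-1.0, Math.min(1.0, s.pitchExcursion));
        // Squash-n-stretch: height and width move in opposite directions.
        double heightFactor = 1.0 + 0.5 * p;      // in [0.5, 1.5]
        double widthFactor = 1.0 / heightFactor;  // keeps width * height constant
        double weightFactor = 1.0 - 0.4 * p;      // in [0.6, 1.4]
        // Syllable duration -> duration of syllable activity.
        long displayMillis = Math.round(s.durationSec * 1000.0);
        return new GlyphParams(scale, widthFactor, heightFactor,
                weightFactor, displayMillis);
    }
}
```

Note that nothing in this mapping names an emotion: the speech measurements are converted directly into visual parameters, and interpretation is left to the reader.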

Neither I (the font algorithm designer) nor the speaker (the prosodic font designer) needs to label speech as exhibiting particular emotions. Rather, the speech signals are systematically "mapped" onto visual characteristics. It is the recipient's (the hearer/reader's) job to interpret the speaker's expression - just as it is in conversational settings. If prosody is a system, then we might understand nuances of expression and emotion in a systematically animated font. This method has more potential to work as a tool of communication than if we needed to sort the entire range of human vocal expressiveness into specific discrete categories. Additionally, the job of creating algorithmic spatio-temporal mappings remains a creative act of design.

To read more about this idea and project, you can download my eighty-page thesis in Adobe Acrobat format or a two-page Computer-Human Interaction paper, also in Acrobat format:

Rosenberger, Tara. Prosodic Font: Between the Spoken and the Written. Massachusetts Institute of Technology: MAS Thesis, 1998.

Rosenberger, Tara and MacNeil, Ron. 1999. "Prosodic Font: Translating Speech into Graphics." Short Paper. 1999 Computer-Human Interaction Conference of the ACM, Pittsburgh, PA.