Brian Whitman - bwhitman@media.mit.edu

When we build an 'intelligent system,' we usually mean that we have repeatedly told the computer something and we hope that it makes the connection. For example, if we wanted to make a box that could detect whether music was classical or jazz, we would train a machine learning algorithm by presenting it with audio-derived data from each type: "this is jazz," "this is classical." We then claim that the computer 'knows' the difference, when in reality it only knows how to draw a statistical line through mountains of data. Can we ever have a system actually know about something, especially something as complex as music?
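A minimal sketch of that conventional setup -- made-up two-dimensional 'audio' features and a plain least-squares decision line, purely illustrative and not the system described below:

    # Illustrative only: a toy 'statistical line' between jazz and classical,
    # trained on made-up audio-derived feature vectors.
    import numpy as np

    rng = np.random.default_rng(0)
    jazz = rng.normal(loc=[0.8, 0.2], scale=0.1, size=(20, 2))       # hypothetical features
    classical = rng.normal(loc=[0.2, 0.7], scale=0.1, size=(20, 2))  # e.g. brightness, tempo variance

    X = np.vstack([jazz, classical])
    y = np.array([1] * 20 + [-1] * 20)        # +1 = jazz, -1 = classical

    # Least-squares linear classifier: literally a line drawn through the data.
    Xb = np.hstack([X, np.ones((40, 1))])     # add bias term
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

    def predict(features):
        return "jazz" if np.dot(np.append(features, 1.0), w) > 0 else "classical"

    print(predict([0.75, 0.25]))   # the classifier 'knows' this is jazz -- statistically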

Here we formulate an understanding problem in an 'external symbol grounding' framework. We treat innate knowledge as strong connections between two models-- language and music. We feed our system only a collection of CDs and the names of the artists who created them. By analysing the music and concurrently 'reading' about each artist on the Internet, a series of intelligent algorithms creates a bimodal representation for music understanding that requires no human intervention. The system automatically builds a lexicon of descriptors from statistical analyses of freeform web text, keeping only the terms it can reliably associate with an audio feature: for example, 'young' is automatically thrown away while 'loud' remains.

The result-- a system that has autonomously learned to describe music by reading about it-- certainly provides some surprises. Peculiar terms like "Canadian" and "bitter" are apparently very useful for describing music, and most rap music is instantly branded 'awful.' And because even the lexicon is free from our own biases, some interesting terms ("80htm," for one) pop up as strong descriptors. But perhaps the computer now knows something we don't.


This work combines techniques developed over the past year in the Music, Mind and Machine Group. We first developed a text description generator, or 'cultural representation,' for music from web searches [1], which outputs a vector of lexical terms (noun phrases, adjectives, etc.) given only an artist name. We embed this vector in a kernel space for machine learning tasks.
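A minimal sketch of that first step, assuming the artist's web pages have already been retrieved as strings and reducing term extraction to counts over a toy vocabulary (the generator in [1] parses noun phrases, adjectives and n-grams from search-engine results):

    # Illustrative sketch, not the implementation from [1].
    import re
    import numpy as np

    VOCAB = ["loud", "quiet", "funky", "romantic", "electronic", "acoustic"]  # toy lexicon

    def term_vector(pages):
        # Count how often each vocabulary term appears across an artist's pages.
        counts = np.zeros(len(VOCAB))
        for page in pages:
            tokens = re.findall(r"[a-z]+", page.lower())
            for i, term in enumerate(VOCAB):
                counts[i] += tokens.count(term)
        # Normalize so artists with many pages are comparable to artists with few.
        total = counts.sum()
        return counts / total if total > 0 else counts

    def gaussian_kernel(x, y, sigma=1.0):
        # Embed term vectors in kernel space by comparing artists pairwise.
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    # Hypothetical usage: two artists with already-'retrieved' page text.
    a = term_vector(["a loud electronic act with funky basslines", "loud loud loud"])
    b = term_vector(["quiet acoustic and romantic songs"])
    print(gaussian_kernel(a, b))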

The relations between the textual feature space and the audio feature space (developed in [2]) were learned with a new 'severe multi-class' training system (RLSC, [3,4]) that trims improper output classes (cases where the training set is incorrect, as it may be since we cannot trust all of our 20,000 possible text features). Because RLSC cannot operate in real time, for this demo we have re-trained the remaining audio-to-textual relations with support vector machines.
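A rough sketch of that training step, using a synthetic Gram matrix and labels: one regularized least-squares machine is solved in closed form per text term, in the one-vs-all style of [3,4], and the training-accuracy threshold used here for trimming is only an illustrative stand-in for the actual reliability criterion.

    # Illustrative RLSC sketch; not the exact trimming rule used in the demo.
    import numpy as np

    def rlsc_train(K, y, lam=1e-2):
        # Closed-form RLSC solution for one output class: (K + lam*n*I) c = y.
        n = K.shape[0]
        return np.linalg.solve(K + lam * n * np.eye(n), y)

    def train_and_trim(K, labels, min_fit=0.7):
        # Train one machine per term; keep only terms the audio features can
        # predict reliably (here: training accuracy, an assumed proxy).
        kept = {}
        for term, y in labels.items():
            c = rlsc_train(K, y)
            fit = np.mean(np.sign(K @ c) == y)
            if fit >= min_fit:
                kept[term] = c
        return kept

    # Synthetic usage: 8 'songs', 5 random audio features, two candidate terms.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 5))
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / 10.0)                      # Gaussian Gram matrix over the songs
    labels = {"loud": np.where(X[:, 0] > 0, 1.0, -1.0),
              "young": np.sign(rng.normal(size=8))}
    print(sorted(train_and_trim(K, labels)))          # prints whichever terms survive trimming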

[1] Whitman, Brian and Steve Lawrence (NEC Research Institute). Inferring Descriptions and Similarity for Music from Community Metadata. In Proceedings of the 2002 International Computer Music Conference, Gothenburg, Sweden.
[2] Whitman, Brian and Paris Smaragdis. Combining Musical and Cultural Features for Intelligent Style Detection. In Proceedings of the 2002 International Symposium for Music Information Retrieval, pp. 47-52, Paris, France.
[3] Rifkin, Ryan, Gene Yeo, Brian Whitman and Tomaso Poggio. Regularized Least-Squares Classification. Submitted.
[4] Whitman, Brian and Ryan Rifkin. Musical Query-by-Description as a Multiclass Learning Problem. In Proceedings of the 2002 IEEE Multimedia Signal Processing Conference, St. Thomas, Virgin Islands, USA.