The classification of musical sound into different timbre groups has received much attention in the perceptual literature, but the principal components of timbre remain elusive. Existing theories of timbre are derived from perceptual experiments, such as the well-known multi-dimensional scaling experiments of Grey [2], in which the cognitive relationships amongst a group of sounds are represented geometrically. The drawback of such theories is that the perceptual spaces are non-metric and, as such, the interpretation of the dimensional axes is the source of some contention.
Our representation is cognitively based, in contrast to the psychophysical interpretation that Grey gives of his timbre space. The approach we use grounds the salient attributes of timbre in the physical world. It is our premise that timbre groups fall into categories constrained by the underlying physics of the sound-generating systems, and that it is the goal of the ear/brain system to discover such commonalities in the sounding world. Rather than seeing this approach as a challenge to existing theories of timbre, we see it as providing a strong rationale for an otherwise psychophysically described phenomenon.
We present techniques for estimating the parameters of physically based models of sound-generating systems. Our methods obviate the need for explicit inversion of a physical system. Instead, the models are acquired by minimizing errors between a feature-space representation of acoustic signals and the estimated output of a learned set of physical-model function approximators. We set out to show that it is possible to simultaneously perform an inverse mapping to different physical models. Thus, the ability to choose among different model classes is seen as the basis for timbre classification.
The physical systems that we use are not explicit; they are acquired via function-approximation techniques. Perceptually, this corresponds to the notion of envisioning the characteristic behaviour of a system. Given a model hypothesis and a set of estimated input parameters provided by an inverse model, envisioning is the process that evokes a predicted output from the physical-model estimator. This process allows the error between the feature representation of an acoustic signal and the predicted output of the model estimator to constrain the acquisition of the inverse model, or parameter estimator. A desirable attribute of our system is that it uses only data that are readily available to humans when learning such tasks. There is no need to explicitly reference the physical equations for predicting sound outcomes from models, and there is no need to explicitly measure articulator activity for estimating parameters.
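The forward/inverse loop described above can be illustrated with a minimal sketch. This is not the paper's actual system: the "physical system" here is a hypothetical one-parameter toy synthesizer, the forward model is a small polynomial regression fitted from sampled input/output pairs, and a simple grid refinement stands in for a learned inverse model. What it shows is the key idea that parameters are recovered by minimizing the error between observed features and the forward model's envisioned output, with no explicit inversion of the physical equations.

```python
import numpy as np

# Toy "physical system": maps a control parameter p to a 2-D feature
# vector (stand-ins for, say, brightness and attack time). The learner
# never inverts this function analytically; it only sees samples of it.
def synthesize(p):
    return np.array([np.sin(p), 0.5 * p])

# --- Forward model: learned approximation of synthesize() --------------
# Fit a degree-5 polynomial regression from sampled (p, feature) pairs,
# so no explicit physical equations are referenced at estimation time.
train_p = np.linspace(-2.0, 2.0, 41)
train_F = np.array([synthesize(p) for p in train_p])
design = np.vander(train_p, 6)                     # polynomial basis
coef, *_ = np.linalg.lstsq(design, train_F, rcond=None)

def forward_model(p):
    # "Envisioning": predict the features the system would produce for p.
    return np.vander(np.atleast_1d(p), 6) @ coef

# --- Inverse estimation driven by envisioning error --------------------
# Search for the parameter whose envisioned output best matches the
# observed features; the prediction error constrains the estimate.
def estimate_parameter(observed, lo=-2.0, hi=2.0, iters=20):
    for _ in range(iters):
        grid = np.linspace(lo, hi, 21)
        errs = np.linalg.norm(forward_model(grid) - observed, axis=1)
        best = grid[np.argmin(errs)]
        span = (hi - lo) / 10
        lo, hi = best - span, best + span
    return best

observed = synthesize(1.3)           # features of an "unknown" sound
p_hat = estimate_parameter(observed)
```

Under these assumptions the recovered `p_hat` lies close to the true control value 1.3, even though only the learned forward approximator, never the physical equations, is consulted during estimation.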
As a component of timbre, model-based representations allow the highly underconstrained problem of sound classification to be reduced to the problem of selection among multiple models. It is our belief that relatively few sound models are needed to describe a particular auditory scene. The extent to which a set of models adequately describes a sound world remains an open question. Our approach is to constrain the sound world to the domain of orchestral musical instruments.
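Classification as selection among models can be sketched as follows. The two model classes below are hypothetical toy systems, not the paper's instrument models; the point is only the decision rule: an observation is assigned to the model class whose best-fitting envisioned output leaves the smallest residual.

```python
import numpy as np

# Two toy model classes with different parameter-to-feature laws,
# stand-ins for, e.g., a plucked-string model and a blown-pipe model.
models = {
    "string": lambda p: np.array([p ** 2, np.exp(-p)]),
    "wind":   lambda p: np.array([np.cos(p), p]),
}

def best_fit_error(model, observed, grid=np.linspace(0.0, 2.0, 201)):
    # Residual of the best envisioned output this model class can
    # produce over a coarse sweep of its parameter.
    outputs = np.array([model(p) for p in grid])
    return np.linalg.norm(outputs - observed, axis=1).min()

def classify(observed):
    # Timbre classification as model selection: pick the class whose
    # predicted output comes closest to the observed features.
    return min(models, key=lambda name: best_fit_error(models[name], observed))

observed = models["wind"](0.7)   # a sound generated by the "wind" system
label = classify(observed)        # selects "wind"
```

Because each model class confines the search to its own parameter space, adding a model is cheap while the per-observation decision stays a small minimization, which is the reduction in search space the text refers to.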
We do this for two reasons. The first is that there is a substantial body of perceptual literature on these timbre classes, in contrast to natural sounds, which had received little or no perceptual research until very recently. The second reason for choosing musical instruments as our sound domain is that many efficient, physically based techniques exist for synthesizing musical-instrument sounds. These synthesis techniques are used to train the set of physical-model function approximators, or forward models, and the set of parameter estimators, or inverse models, that we use in our experiments.
By demonstrating that a particular sound domain can conceivably be reduced to a manageable number of models, we hope to show that the perception of timbre is dependent on world context. That is, the set of models used when listening to music is different from the set used when we walk through a city, even though the low-level perceptual atoms may have features in common. Models afford a major reduction in the search space of sound classes and estimated input parameters.
In relating our work to Grey's work on timbre, it is our goal to show how our representation, in conjunction with the established psychophysical attributes of timbre, can account for the perceptual groupings that are seen in his timbre space.