Automatic Drum Samples Classification
A final project for Pattern Recognition MAS 622J/1.126J
Eyal Shahar, MIT Media Lab
All quotes are taken from the sketch "More
Cowbell" as performed on "Saturday Night Live"
Background
"I put my pants on just like the rest of you -- one leg at
a time. Except, once my pants are on, I make gold records."
Musicians today, both professional and hobbyist, who rely
heavily on their computers to make music usually find themselves
with hard drives full of music samples of all sorts. The majority of
these are individual drum sounds, often called "hits" or
"one-shots". Arranging these samples in folders is usually done
manually, by listening to every sample and moving it into the desired
folder. While making music, retrieval of these samples is done, once
more, by tediously auditioning each and every sample. This project is
a first step towards making the life of the computer-based musician a
little easier by automatically classifying these samples and enabling
better methods of retrieval.
Objective
"Before we're done here.. y'all be wearing gold-plated
diapers."
The goal of this project is to automatically classify drum
samples, to compare classification techniques, and to find an optimal
feature set for each.
Training And Testing Sets
"I gotta have more cowbell, baby!"
The training set consists of 1000 samples, divided into six classes:
bass drums, snares, hi-hats, cymbals, tom-toms and claps.
The testing set consists of 1200 samples. The following
table describes the distribution of the two sets.
Features
"... The last time I checked, we don't have a whole lot of
songs that feature the cowbell."
Most of the feature extraction was done using the
MIRtoolbox for Matlab, developed at the University
of Jyväskylä. The features extracted with it are:
- Brightness – percentage of energy above 1500 Hz
- Rolloff – the frequency below which 85% of the energy is found
- Roughness – based on the frequency ratio of each pair of
sinusoids in the spectrum
- Irregularity – the degree of variation between successive spectrum
peaks
- MFCC – Mel-frequency cepstral coefficients
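As an illustration, a minimal sketch of extracting these descriptors
for a single sample with MIRtoolbox might look as follows. The option
values mirror the definitions above; the exact calls and parameters
used in the project are not recorded, and 'kick01.wav' is a
hypothetical file name.

% Extract the MIRtoolbox features for one drum sample.
a          = miraudio('kick01.wav');
brightness = mirgetdata(mirbrightness(a, 'CutOff', 1500)); % energy above 1500 Hz
rolloff    = mirgetdata(mirrolloff(a, 'Threshold', 0.85)); % 85% energy point
roughness  = mirgetdata(mirroughness(a));
irregular  = mirgetdata(mirregularity(a));
mfcc       = mirgetdata(mirmfcc(a));                       % cepstral coefficients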
In addition, two more features were extracted using custom
algorithms (a sketch of both appears after this list):
- Pitch – The sample is sliced into equal-length frames. In
each frame the peak of the spectrum is found; the peak bins are
averaged, and the frequency corresponding to that average FFT bin is
returned.
- Decay – The amplitude envelope of the sample is calculated
with MIRtoolbox. The peak amplitude is found, and the decay time is
calculated as the time it takes the amplitude to drop from the peak
to 50% of its value.
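The sketch below shows one way the two custom features could be
computed; the frame length and the use of a Hilbert envelope
(standing in for MIRtoolbox's mirenvelope) are assumptions, since the
text does not specify them.

% Compute the custom Pitch and Decay features for signal x at
% sample rate fs.
function [pitch, decay] = customFeatures(x, fs)
    % Pitch: average spectral-peak bin over equal-length frames.
    frameLen = 1024;                          % assumed frame length
    nFrames  = floor(length(x) / frameLen);
    peakBin  = zeros(1, nFrames);
    for n = 1:nFrames
        frame = x((n-1)*frameLen+1 : n*frameLen);
        spec  = abs(fft(frame));
        [ignored, peakBin(n)] = max(spec(1:frameLen/2));
    end
    pitch = (mean(peakBin) - 1) * fs / frameLen;   % bin index -> Hz

    % Decay: time from the envelope peak down to 50% of its value.
    env = abs(hilbert(x));                    % envelope stand-in
    [peakVal, peakIdx] = max(env);
    drop = find(env(peakIdx:end) < 0.5 * peakVal, 1);
    if isempty(drop)
        decay = (length(env) - peakIdx) / fs; % never drops below 50%
    else
        decay = (drop - 1) / fs;              % seconds
    end
end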
The Matlab GUI
"... and, Gene - Really explore the studio space this
time."
To manage the learning and testing processes, a Matlab GUI was
created. It provides quick and intuitive access to feature
extraction, loading and saving of the training and testing data sets,
saving and loading of classification models, selection of the active
model, calling of the testing and learning routines, and graphic
visualization of the feature space.
Classification methods
"Let's just do the thing."
Support vector machine
For this method, Matlab's SVM tools were used. The main drawback
of this implementation is that in the absence of the optimization
package, as on my computer, the algorithm uses a linear kernel.
Six SVMs were trained, one for each class, using a
one-versus-all approach (a sketch follows).
For validation, a leave-one-out method was used.
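A minimal sketch of the one-versus-all setup, using Matlab's
pre-R2014 svmtrain/svmclassify interface (X is an N-by-d feature
matrix, y an N-by-1 vector of class indices 1..6; all variable names
are hypothetical):

% Train one linear-kernel machine per class.
models = cell(1, 6);
for c = 1:6
    labels    = double(y == c);       % class c versus all the rest
    models{c} = svmtrain(X, labels, 'Kernel_Function', 'linear');
end

% Classify a test sample by polling each machine.
votes = zeros(1, 6);
for c = 1:6
    votes(c) = svmclassify(models{c}, xTest);
end
predicted = find(votes, 1);           % naive tie handling: first hit wins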
K Nearest Neighbors
A custom k-NN algorithm was written for this method and was
trained to find the optimal k among odd values of k with 1 < k < 15.
Validation was done using a leave-one-out approach (sketched below).
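The search over k might look like the sketch below, with Euclidean
distance assumed (X is N-by-d, y is N-by-1):

% Leave-one-out accuracy for each odd k with 1 < k < 15.
ks  = 3:2:13;
acc = zeros(size(ks));
for i = 1:length(ks)
    correct = 0;
    for n = 1:size(X, 1)
        d    = sum((X - repmat(X(n,:), size(X,1), 1)).^2, 2);
        d(n) = inf;                            % leave sample n out
        [ignored, order] = sort(d);
        correct = correct + (mode(y(order(1:ks(i)))) == y(n));
    end
    acc(i) = correct / size(X, 1);
end
[bestAcc, bestIdx] = max(acc);
bestK = ks(bestIdx);                           % k = 9 in this project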
Neural Network
Matlab's neural network tools were used for this algorithm,
testing both one and two hidden layers, with each layer tested at 5
to 10 units. Validation is part of the toolbox's features, and
therefore no additional validation was done; the MSE calculated
during the learning process was used as the measure of performance to
determine the best network and feature-set configuration.
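The layer-size search could be sketched as follows with the toolbox's
current patternnet interface (the project-era API differed; X is
d-by-N, T is a 6-by-N one-hot target matrix):

% Try one and two hidden layers of 5 to 10 units each.
bestErr = inf;
for h1 = 5:10
    for h2 = [0, 5:10]                     % 0 means a single hidden layer
        if h2 == 0, layers = h1; else layers = [h1 h2]; end
        net = patternnet(layers);          % built-in train/val/test split
        [net, tr] = train(net, X, T);
        if tr.best_vperf < bestErr         % validation error from training
            bestErr = tr.best_vperf;
            bestNet = net;
        end
    end
end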
Feature Selection
"Well, it's just that I find Gene's cowbell playing
distracting."
In the learning process of all the classification methods, forward
feature selection was implemented: at first, the algorithm was run
with each single feature as input. The feature that performed best
remained in the feature set, and the algorithm was tested again with
each of the remaining features added as a second feature. This
process repeated itself until performance improved by less than 0.5%
(a sketch of the loop follows).
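In code, the procedure might look like this, where evalAccuracy is a
hypothetical helper that trains and validates the chosen classifier
on the given feature columns:

% Forward feature selection with a 0.5% improvement threshold.
selected  = [];
remaining = 1:nFeatures;
bestAcc   = 0;
while ~isempty(remaining)
    trialAcc = zeros(size(remaining));
    for i = 1:length(remaining)
        trialAcc(i) = evalAccuracy(X(:, [selected remaining(i)]), y);
    end
    [acc, idx] = max(trialAcc);
    if acc - bestAcc < 0.005               % stop: improvement under 0.5%
        break
    end
    bestAcc  = acc;
    selected = [selected remaining(idx)];
    remaining(idx) = [];
end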
Results And Performance
"...And I'd be doing myself a disservice and every member
of this band, if I didn’t perform the hell out of this!"
The k-nearest-neighbors classifier gave the best results, with k = 9.
The selected features were Brightness, Irregularity, Decay, MFCC 1,
MFCC 2, MFCC 3 and MFCC 5.
The neural-network learning algorithm produced a network with two
hidden layers, of 9 and 7 units respectively.
The SVM learning algorithm selected a different optimal feature set
for each class; these, along with the detection accuracy on the
testing set, are shown in the accompanying table and graph.
Conclusions
"Guess what? I got a fever! And the only prescription.. is
more cowbell!"
Random Insights
- It is interesting to see how features are selected for each
of the SVM instances, and to note that these features do indeed say
a lot about the behavior of samples from each class. Hi-hats, for
example, are very short and high-pitched, and indeed "Decay" and
"Pitch" are among their selected features. Cymbal sounds, on the
other hand, are usually very long, noisy and rich in high
frequencies, so we see "Decay", "Roughness" and "Brightness"
selected.
- It is clear that classes that were less common in the
training set, namely toms and claps, were detected less accurately
during testing. The extreme case is the toms: only 12 were present
in the training set, and they had negligible detection rates in
both the k-NN and neural-network algorithms. Furthermore, the
performance of the neural network on each class seems to depend
heavily on the number of samples from that class present in the
training set. It is therefore important to have enough samples of
each class, preferably with all classes equally represented.
- The apparent reason that the SVM algorithm is an exception
is that it has the advantage of producing a different decision
machine for each class, and thus has the opportunity to isolate each
class's unique characteristics, even when a small training set is
introduced.
Possible Improvements
The following steps could be taken to improve recognition results:
- Extend and manually validate the training set - It is
very likely that significantly better results can be achieved by
enlarging the training set. Also, some samples were noticed to be
of dubious character; these, perhaps, should not be part of the
training set.
- Try temporal feature approaches, namely HMMs -
Certain features, such as pitch and MFCC, can give better results
when calculated over small time frames. Moreover, their behavior over
time can give more information about a sound's class.
- Implement multiple algorithms and majority selection
- It is clear that some algorithms are better at detecting certain
classes than others. In fact, for every class there is an algorithm
that performs very well. Applying all algorithms and taking a joint
decision might yield better performance (a sketch follows this
list).
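As an illustration, the joint decision could be as simple as the
majority vote below (the predict* helpers are hypothetical stand-ins
for the three trained classifiers):

% Majority vote over the three classifiers for one test sample.
p = [predictSVM(xTest), predictKNN(xTest), predictNet(xTest)];
decision = mode(p);       % ties resolve to the smallest class index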
Future Work
As stated earlier, this work can serve as the framework for a system
with stronger capabilities, such as:
- Melodic instrument samples - Should these algorithms
mature to a high degree of reliability, it would be interesting
to see whether they can be applied to melodic instruments.
- Naive retrieval methods - Having a database of sounds
and their features provides the user with new ways to retrieve
sounds. The user would be able to select a sound and "stroll
around" that sound's environment, knowing that the neighboring
sounds are close to the one selected (a retrieval sketch follows
this list).
- Subjective feature extraction - Having the user
define new features, and figuring out how those features are
reflected in the existing ones, could give fascinating results.
Suppose a user teaches the machine which sounds are "duller" than
others, "sharper", more "soothing", or "yellow"; musicians could then
retrieve sounds from the database based on their own subjective
definitions.
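For instance, "strolling around" a selected sound reduces to a
nearest-neighbor query in feature space, roughly as sketched below
(F is an N-by-d feature matrix, sel the index of the selected sound;
Euclidean distance assumed):

% Return the indices of the m sounds closest to the selected one.
m = 10;
d = sum((F - repmat(F(sel,:), size(F,1), 1)).^2, 2);
d(sel) = inf;                          % exclude the query itself
[ignored, order] = sort(d);
neighbors = order(1:m);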
Final project presentation (.pdf)
Project proposal presentation (.pdf)