Classifying musical compositions by artist using Hidden Markov Models


Hidden Markov Models (HMM) have long been used in speech recognition. Mel-frequency cepstral coefficients (MFCC) made it possible to discard signal components that are insignificant for recognition, which greatly reduces the dimensionality of the features. The Internet is full of simple examples that use HMM with MFCC to recognize isolated words.

After getting acquainted with these techniques, I wanted to try the same recognition approach on music. That is how the idea of classifying musical compositions by performer was born. This post covers the attempts, a bit of magic, and the results.


I had wanted to try hidden Markov models in practice for a long time, and last year I managed to tie them to a course project in my master's program.

While searching around before the project, I found an interesting article describing the use of HMM to classify folk music from Ireland, Germany and France. Using a large archive (thousands of songs), its authors try to establish whether there is a statistical difference between the compositions of different peoples.

While exploring HMM libraries, I came across code from the Python ML Cookbook in which the hmmlearn library is used to recognize several simple words. I decided to try it.

Formulation of the problem

We have songs by several musical artists. The task is to train an HMM-based classifier that correctly recognizes the artist of each song fed to it.

The songs are in .wav format. The number of songs differs from artist to artist, and the quality and duration of the compositions vary as well.


To understand how the algorithm works (which parameters are involved in training, and how), you need at least a superficial knowledge of the theory of mel-frequency cepstral coefficients and hidden Markov models. For more information, see the articles on MFCC and HMM.

MFCC is, roughly speaking, a representation of a signal as a special kind of spectrum from which components insignificant for human hearing have been removed by a series of filters and transformations. The spectrum is short-term: the signal is first divided into overlapping segments of 20-40 ms, on the assumption that the frequency content does not change much within such a segment. The coefficients are then computed on each segment.

There is a signal.
Intervals of 25 ms are taken from it.
For each of them, the cepstral coefficients are calculated.
The advantage of this representation is that for speech recognition about 16 coefficients per frame are enough, instead of the hundreds or thousands of values produced by an ordinary Fourier transform. Experiments showed that for songs it is better to take 30-40 coefficients.
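
The framing step described above can be sketched in a few lines. This is only an illustration: the 25 ms frame length comes from the text, while the 10 ms hop, the 16 kHz sampling rate and the function name `frame_bounds` are assumptions made for the example.

```python
def frame_bounds(n_samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return (start, end) sample indices of overlapping frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    bounds = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= n_samples:
        bounds.append((start, start + frame_len))
        start += hop_len
    return bounds


# One second of audio at an assumed sampling rate of 16 kHz.
frames = frame_bounds(n_samples=16000, sample_rate=16000)
frame_len = frames[0][1] - frames[0][0]       # 400 samples per 25 ms frame
overlap = frames[0][1] - frames[1][0]         # 240 samples shared by neighbours
fft_bins = frame_len // 2 + 1                 # 201 bins from a real FFT of one frame

print(len(frames), "frames;", fft_bins, "FFT bins per frame versus about 16 MFCCs")
```

Even in this toy setting each frame yields a couple of hundred spectral values, which is the dimensionality that the cepstral coefficients compress.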

For a general understanding of how hidden Markov models work, you can also see the description on the wiki.

The idea is that there is an unknown set of hidden states $x_1, x_2, x_3$. They follow one another in a sequence governed by the transition probabilities $a_1, a_2, a_3$ and, with the emission probabilities $b_1, b_2, b_3$, produce a set of observable results $y_1, y_2, y_3$.


In our case, the observed results are the MFCC vectors of each frame.
The Baum-Welch algorithm (a special case of the better-known EM algorithm) is used to find the unknown HMM parameters. It is this algorithm that trains the model.
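
To make the roles of the hidden states, transition probabilities and emission probabilities more concrete, here is a minimal sketch of the forward algorithm, which computes the log-likelihood of an observation sequence under a given HMM. It is an illustration only: it uses discrete observation symbols instead of the Gaussian emissions over MFCC vectors used in this post, and the function name and all numbers are made up for the example.

```python
import math


def forward_log_likelihood(start, trans, emit, observations):
    """Return log P(observations | model) using the scaled forward algorithm.

    start[i]   - probability of starting in hidden state i
    trans[i][j] - probability of moving from state i to state j (the "a" values)
    emit[i][k]  - probability that state i produces symbol k (the "b" values)
    """
    n_states = len(start)
    log_likelihood = 0.0
    alpha = []
    for t, symbol in enumerate(observations):
        if t == 0:
            alpha = [start[i] * emit[i][symbol] for i in range(n_states)]
        else:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
                for j in range(n_states)
            ]
        # Rescale at every step to avoid underflow on long sequences.
        scale = sum(alpha)
        log_likelihood += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_likelihood


# Hypothetical two-state model with two observable symbols (0 and 1).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

print(forward_log_likelihood(start, trans, emit, [0, 1, 0]))
```

Baum-Welch repeatedly runs this kind of pass (together with a backward pass) and re-estimates the parameters so that the likelihood of the training data grows.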


Let's finally get down to the code. The full version is available here.

The librosa library was chosen to calculate MFCC. You can also use the python_speech_features library, which, unlike librosa, implements only the functions needed to calculate the cepstral coefficients.

We will accept songs in the ".wav" format. Below is the function for calculating MFCC, which takes the name of the ".wav" file as input.

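    def getFeaturesFromWAV(self, filename):
        # sr=None keeps the file's native sampling rate; librosa mixes stereo down to mono by default.
        audio, sampling_freq = librosa.load(
            filename, sr=None, res_type=self._res_type)

        # Keyword arguments: recent librosa releases no longer accept y and sr positionally.
        features = librosa.feature.mfcc(
            y=audio, sr=sampling_freq, n_mfcc=self._nmfcc,
            n_fft=self._nfft, hop_length=self._hop_length)

        # Optional standardization of the (n_mfcc, n_frames) matrix.
        if self._scale:
            features = sklearn.preprocessing.scale(features)

        # Transpose so that rows are frames and columns are coefficients.
        return features.T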

The first call simply loads the ".wav" file; a stereo file is converted to a single channel. librosa supports several resampling methods; I settled on res_type='scipy'.

I exposed three main parameters for feature extraction: n_mfcc, the number of mel-frequency cepstral coefficients; n_fft, the number of points of the fast Fourier transform; and hop_length, the number of samples between the starts of successive frames (for example, 512 samples at 22 kHz give approximately 23 ms).
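
The hop arithmetic can be checked with a short sketch. The helper names are hypothetical, 22050 Hz is assumed as the exact value behind "22 kHz", and frame k is taken to start at sample k * hop_length, which ignores any padding the feature extractor may apply.

```python
def hop_duration_ms(hop_length, sample_rate):
    """Time step between successive frames, in milliseconds."""
    return 1000.0 * hop_length / sample_rate


def time_to_frame(seconds, hop_length, sample_rate):
    """Index of the last frame that starts at or before the given time."""
    return int(seconds * sample_rate // hop_length)


print(round(hop_duration_ms(512, 22050), 1))   # 23.2

# Rows of the feature matrix covering an excerpt from 00:20 to 00:55.
first = time_to_frame(20, 512, 22050)
last = time_to_frame(55, 512, 22050)
print(first, last)                             # 861 2368
```

A helper like this is handy for cutting an excerpt out of an already computed feature matrix instead of re-reading the audio.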

Scaling is an optional step, but with it I managed to make the classifier more stable.
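
What scaling does can be shown without any libraries. The sketch below standardizes every column of a matrix to zero mean and unit variance, which is my understanding of what sklearn.preprocessing.scale does with its default axis=0. Since the function above applies it to the (n_mfcc, n_frames) matrix before transposing, each column there would be one frame. The helper name and the toy numbers are made up.

```python
import statistics


def standardize_columns(matrix):
    """Give every column zero mean and unit (population) variance."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        column = [matrix[r][c] for r in range(n_rows)]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for r in range(n_rows):
            # A constant column has no spread, so it is only centred.
            result[r][c] = (column[r] - mean) / std if std > 0 else 0.0
    return result


# Toy matrix: 3 coefficients (rows) by 2 frames (columns).
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = standardize_columns(toy)
print(scaled)
```

Both columns end up with identical values, which shows that the overall level of a frame is removed and only the relative shape of its coefficients remains.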

Let's move on to the classifier. hmmlearn turned out to be an unstable library in which something breaks with every update. Still, its compatibility with scikit-learn is a pleasant bonus. At the time of writing (version 0.2.1), the hidden Markov model with Gaussian emissions is the one that works most reliably.

Separately, I want to note the following model parameters.

    self._hmm = hmm.GaussianHMM(
        n_components=hmmParams.n_components,
        covariance_type=hmmParams.cov_type,
        n_iter=hmmParams.n_iter,
        tol=hmmParams.tol)

The n_components parameter determines the number of hidden states. Relatively good models can be built with 6-8 hidden states. They train fairly quickly: 10 songs take about 7 minutes on my Core i5-7300HQ 2.50GHz. To get more interesting models, however, I preferred to use about 20 hidden states. I tried more, but in my tests the results did not change much, while the training time grew to several days for the same number of songs.

The remaining parameters control the convergence of the EM algorithm (n_iter limits the number of iterations, tol sets the convergence threshold) and the type of covariance matrix used for the states (covariance_type).

hmmlearn is used for unsupervised learning, so the process is structured as follows. Each class gets its own model. A test signal is then run through every model, and each model's score method returns the log-likelihood of that signal. The class whose model yields the highest log-likelihood is assigned to the test signal.
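
The "one model per class, pick the highest log-likelihood" scheme does not depend on the model being an HMM. The sketch below shows the same decision rule with a deliberately simple stand-in: a single one-dimensional Gaussian per class instead of a Gaussian HMM over MFCC vectors. The class name, labels and numbers are all hypothetical.

```python
import math


class GaussianStandIn:
    """Stand-in for a per-class HMM: one 1-D Gaussian fitted to the training values."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        variance = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.var = max(variance, 1e-6)  # guard against a zero variance
        return self

    def score(self, values):
        """Total log-likelihood of the values under the fitted Gaussian."""
        return sum(
            -0.5 * (math.log(2 * math.pi * self.var) + (v - self.mean) ** 2 / self.var)
            for v in values
        )


def classify(models, values):
    """Score the input with every class model and return the best label and all scores."""
    scores = {label: model.score(values) for label, model in models.items()}
    return max(scores, key=scores.get), scores


# Train one model per class, exactly one class per model.
training = {"artist_a": [0.1, -0.2, 0.0, 0.3], "artist_b": [4.8, 5.1, 5.3, 4.9]}
models = {label: GaussianStandIn().fit(values) for label, values in training.items()}

best, scores = classify(models, [5.0, 5.2])
print(best)  # artist_b
```

Replacing the stand-in with an object that exposes the same fit and score methods keeps the classification logic unchanged.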

Training in the code of one model looks like this:

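            # Feature matrix for one artist: rows are frames, columns are MFCCs.
            featureMatrix = np.array([])

            for filename in os.listdir(subfolder):
                if not filename.endswith('.wav'):
                    continue
                filepath = os.path.join(subfolder, filename)

                features = self.getFeaturesFromWAV(filepath)
                # The first song initialises the matrix; later songs are appended below it.
                if len(featureMatrix) == 0:
                    featureMatrix = features
                else:
                    featureMatrix = np.append(featureMatrix, features, axis=0)

            # One model per artist (class).
            hmm_trainer = HMMTrainer(hmmParams=self._hmmParams)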


The code goes through the artist's subfolder, finds all the ".wav" files and computes MFCC for each of them, appending the result to the feature matrix. In the feature matrix, a row corresponds to a frame and a column to the index of an MFCC coefficient.

Once the matrix is filled, a hidden Markov model is created for the class, and the features are passed to the EM algorithm for training.
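
The layout of the feature matrix is easy to get wrong, so here is a library-free sketch of it. Plain nested lists stand in for NumPy arrays, and the helper names and toy values are made up: each song arrives as a (coefficients x frames) matrix, is transposed, and its frames are appended below those of the previous songs.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(row) for row in zip(*matrix)]


def stack_songs(per_song_mfcc):
    """Turn per-song (n_mfcc x n_frames) matrices into one (total_frames x n_mfcc) matrix."""
    feature_matrix = []
    for mfcc in per_song_mfcc:
        feature_matrix.extend(transpose(mfcc))
    return feature_matrix


song_a = [[1, 2, 3], [10, 20, 30]]   # 2 coefficients x 3 frames
song_b = [[4, 5], [40, 50]]          # 2 coefficients x 2 frames

matrix = stack_songs([song_a, song_b])
print(len(matrix), "frames x", len(matrix[0]), "coefficients")
```

Every row of the result is one frame, so the number of rows equals the total number of frames across all songs of the artist.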

The classification looks like this.

        features = self.getFeaturesFromWAV(filepath)

        # label is the name of the class (artist) corresponding to the model
        scores = {}
        for hmm_model, label in self._models:
            score = hmm_model.get_score(features)
            scores[label] = score

        # Rank the classes from the most to the least likely.
        similarity = sorted(scores.items(), key=lambda t: t[1], reverse=True)

We iterate over all the models and compute the log-likelihoods, which gives a list of classes sorted by likelihood. The first element is the most likely performer of the song.

Results and Improvements

Songs by seven artists were selected for the training set: Anathema, Hollywood Undead, Metallica, Motorhead, Nirvana, Pink Floyd and The XX. The number of songs per artist, and the songs themselves, were chosen with specific tests in mind.

For example, the style of Anathema changed a lot over their career, from heavy doom metal to calm progressive rock. I decided to put songs from the first album into the test set and to train mostly on the softer songs.

List of songs used for training

Anathema:
Untouchable Part 1
Lost Control
One Last Goodbye
A Fine Day To Exit

Hollywood Undead:
Been To Hell
We Are
Coming Back Down

Metallica:
Enter Sandman
Nothing Else Matters
Sad But True
Of Wolf And Man
The Unforgiven
The God That Failed
Wherever I May Roam
My Friend Of Misery
Don't Tread On Me
The Struggle Within
Through The Never

Motorhead:
Victory Or Die
The Devil
Thunder & Lightning
Fire Storm Hotel
Evil Eye
Shoot Out All Of Your Lights

Nirvana:
About A Girl
Something In The Way
Come As You Are
Endless Nameless
Heart Shaped Box

Pink Floyd:
Another Brick In The Wall pt 1
Comfortably Numb
The Dogs Of War
Empty Spaces
Wish You Were Here
On The Turning Away

The XX:
Basic Space

The tests gave a relatively good result (4 errors in 16 tests). Problems appeared when trying to recognize the artist from an excerpt cut out of a song.

It suddenly turned out that even when a whole composition is classified correctly, a part of it can produce the opposite result. If the excerpt contains the beginning of the song, the model gives the correct answer. But if it starts anywhere else in the composition, the model is completely sure that the song does not belong to the right artist.

Part of the tests
Master Of Puppets to Metallica (True)

Master Of Puppets (Cut 00:00 — 00:35) to Metallica (True)

Master Of Puppets (Cut 00:20 — 00:55) to Anathema (False, Metallica)

The Unforgiven (Cut 01:10 — 01:35) to Anathema (False, Metallica)

Heart Shaped Box to Nirvana (True)

Heart Shaped Box (Cut 01:00 — 01:40) to Hollywood Undead (False, Nirvana)

A solution was sought for a long time. I tried training with 50 or more hidden states (almost three days of training) and increased the number of MFCCs into the hundreds. None of this solved the problem.

The problem was solved by a crude idea that nevertheless made sense on some intuitive level: randomly shuffle the rows of the feature matrix before training. It paid off: the training time increased slightly, but the algorithm became more stable.

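            # Feature matrix for one artist: rows are frames, columns are MFCCs.
            featureMatrix = np.array([])

            for filename in os.listdir(subfolder):
                if not filename.endswith('.wav'):
                    continue
                filepath = os.path.join(subfolder, filename)

                features = self.getFeaturesFromWAV(filepath)
                # The first song initialises the matrix; later songs are appended below it.
                if len(featureMatrix) == 0:
                    featureMatrix = features
                else:
                    featureMatrix = np.append(featureMatrix, features, axis=0)

            hmm_trainer = HMMTrainer(hmmParams=self._hmmParams)

            # Shuffle the rows (frames) in place before training.
            np.random.shuffle(featureMatrix)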
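
It helps to be precise about what this shuffle changes. np.random.shuffle reorders a two-dimensional array along its first axis only, so every frame keeps its own coefficients and only the order of the frames is lost. The sketch below mimics this with the standard library's random module on a list of rows; the function name and the toy matrix are made up for the illustration.

```python
import random


def shuffled_rows(feature_matrix, seed=None):
    """Return a copy of the matrix with its rows in random order."""
    rng = random.Random(seed)
    rows = [list(row) for row in feature_matrix]
    rng.shuffle(rows)  # whole rows are moved; their contents are untouched
    return rows


# Toy feature matrix: 5 frames (rows) with 2 coefficients each.
original = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
shuffled = shuffled_rows(original, seed=0)

print(shuffled)
print(sorted(shuffled) == sorted(original))  # True: the same frames, in a different order
```

After shuffling, the model sees frames from the beginning, middle and end of every song mixed together, which fits the observation that excerpts no longer need to start at the beginning of a song to be recognized.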

Below are the results of a model test with parameters: 20 hidden states, 40 MFCC, with component scaling and shuffle.

Test results
The Man Who Sold The World to Anathema (False, Nirvana)

We Are Motörhead to Motorhead (True)

Master Of Puppets to Metallica (True)

Empty to Anathema (True)

Keep Talking to Pink Floyd (True)

Tell Me Who To Kill to Motorhead (True)

Smells Like Teen Spirit to Nirvana (True)

Orion (Instrumental) to Metallica (True)

The Silent Enigma to Anathema (True)

School to Nirvana (True)

A Natural Disaster to Anathema (True)

Islands to The XX (True)

High Hopes to Pink Floyd (True)

Have A Cigar to Pink Floyd (True)

Lovelorn Rhapsody to Pink Floyd (False, Anathema)

Holier Than Thou to Metallica (True)

Result: 2 errors out of 16 songs. Not bad overall, although the errors themselves are alarming (Pink Floyd is clearly not that heavy).
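
For comparison, the two error counts reported in this post can be turned into accuracies. The figures are taken from the text; whether both runs used exactly the same 16 test songs is not stated, so treat the comparison as approximate.

```python
def accuracy(n_tests, n_errors):
    """Share of correctly classified test songs."""
    return (n_tests - n_errors) / n_tests


before_shuffle = accuracy(16, 4)   # first run: 4 errors in 16 tests
after_shuffle = accuracy(16, 2)    # after shuffling: 2 errors in 16 tests

print(f"{before_shuffle:.1%} -> {after_shuffle:.1%}")  # 75.0% -> 87.5%
```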

Tests on excerpts from songs now pass reliably.

Excerpts from songs
Master Of Puppets to Metallica (True)

Master Of Puppets (Cut 00:00 — 00:35) to Metallica (True)

Master Of Puppets (Cut 00:20 — 00:55) to Metallica (True)

The Unforgiven (Cut 01:10 — 01:35) to Metallica (True)

Heart Shaped Box to Nirvana (True)

Heart Shaped Box (Cut 01:00 — 01:40) to Nirvana (True)


The constructed classifier based on hidden Markov models shows satisfactory results, correctly identifying performers for most compositions.

All the code is available here. Anyone interested can try training models on their own compositions. The results can also be used to look for what the music of different groups has in common.

For a quick test of the trained models, you can try the site running on Heroku (it accepts small ".wav" files as input). The list of compositions on which the site's model was trained is given above, under the spoiler.