Near the end of each year, Spotify shows us our personal statistics over the past year. For 2020, my statistics led me to conclude that I was listening to music on Spotify roughly one third of the time I was awake. Needless to say, I’m a big music fan, and combining this passion of mine with data science seemed like a very exciting idea.
Starting my research, I quickly realized that this wasn’t an easy task, as classifying music into different genres simply isn’t a trivial problem. There isn’t a simple rule you can follow, and even music experts can disagree over which genre fits a given song best.
Firstly, I’ll give a short summary of what I found during my case study. Then, I’ll talk a little bit about the data and the overall approach used to tackle this problem. I’ll conclude the article by summarizing the results and mentioning some possible further improvements and conclusions.
One of the first steps in building a neural network is always considering your dataset and what kind of feature extraction you want to perform on it. In this case, we have to choose which features to compute from the audio samples. Two concepts that kept returning in the documentation I found were MFCCs and Mel spectrograms, which are really first cousins of each other. As Mel spectrograms in general showed slightly better results (see this paper), I decided to go with Mel spectrograms for this problem.
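To make the family relation concrete: MFCCs are essentially obtained by taking a discrete cosine transform of the log-scaled Mel spectrogram and keeping the first handful of coefficients. A minimal sketch of that relationship, using a random matrix as a stand-in for a real Mel spectrogram:

```python
import numpy as np
from scipy.fft import dct

# Stand-in for a real Mel spectrogram: 64 mel bands x 100 time frames
rng = np.random.default_rng(0)
mel_spec = rng.random((64, 100)) + 1e-6  # strictly positive power values

# MFCCs are (roughly) the DCT of the log Mel spectrogram,
# keeping only the first few coefficients per frame
log_mel = np.log(mel_spec)
mfcc = dct(log_mel, type=2, axis=0, norm='ortho')[:13]

print(mfcc.shape)  # (13, 100): 13 coefficients per time frame
```

So working with Mel spectrograms simply means stopping one step earlier in the same pipeline.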
Another choice we have to make when building the model is the type of machine learning model we want to use. Over the last couple of years, several machine learning techniques like k-nearest neighbors, Support Vector Machines and Convolutional Neural Networks (CNNs) have been compared for this task, and the use of CNNs has proven better than the other ML techniques in almost all cases (see this paper, this one or this one).
Next up: the dataset. A widely used, freely available and comprehensive dataset for this kind of problem is the GTZAN dataset, which can be downloaded here. This dataset contains 100 samples for each of the 10 included genres, with 30 seconds of audio each. The genres included are: blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock.
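Assuming the dataset is extracted with one subdirectory per genre (the standard GTZAN layout), collecting the file paths and their labels can be sketched as follows (the function name and root path are my own):

```python
from pathlib import Path

GENRES = ['blues', 'classical', 'country', 'disco', 'hiphop',
          'jazz', 'metal', 'pop', 'reggae', 'rock']

def collect_samples(root='./data/genres'):
    """Map each .wav file to an integer genre label, based on its folder."""
    samples = []
    for label, genre in enumerate(GENRES):
        for wav in sorted(Path(root, genre).glob('*.wav')):
            samples.append((str(wav), label))
    return samples

# With the full dataset this yields 10 genres x 100 clips = 1000 labelled paths
```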
Training and testing the CNN
After having investigated the state of the art in music genre classification, it was time to start implementing and testing the model. This started with investigating the data at hand in detail.
When working with audio data in Python, one of the first libraries that comes to mind is librosa. It provided me with the means to calculate and visualize the Mel spectrogram of an audio sample with just a couple of lines of code, like so:
import numpy as np
import librosa
import librosa.display

y, sr = librosa.load('./data/audio_sample.wav')  # resampled to 22050 Hz by default
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)  # power values to decibels
librosa.display.specshow(mel_spec_db, sr=sr, x_axis='time', y_axis='mel')
This provided me with interesting images like this one:
Working with a CNN, these Mel spectrograms can then be treated as 1-channel, 2D images to be classified, instead of the usual 3-channel RGB images.
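Concretely, this just means giving the 2D spectrogram a singleton channel axis so it matches the (height, width, channels) layout a CNN expects; a small numpy sketch (the dimensions here are arbitrary):

```python
import numpy as np

# Stand-in for a Mel spectrogram: 64 mel bands x 173 time frames
mel_spec_db = np.zeros((64, 173), dtype=np.float32)

# Add a trailing channel axis: one "grayscale" channel instead of RGB's three
image = mel_spec_db[..., np.newaxis]
print(image.shape)  # (64, 173, 1)
```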
We also have to consider the total amount of data being processed: 44100 Hz is a common sampling frequency, which the Nyquist-Shannon theorem tells us is needed to capture frequencies up to 22050 Hz, so every second of audio corresponds to tens of thousands of samples. Instead of using the complete 30 seconds of each audio sample, I chose to load a configurable number of subsamples of configurable length.
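A sketch of that subsampling step (the function and variable names are my own; it assumes a waveform already loaded at librosa's default rate of 22050 Hz):

```python
import numpy as np

SAMPLE_RATE = 22050          # Hz, librosa's default after resampling
SAMPLE_LENGTH = 4            # seconds per subsample
NUM_SUBSAMPLES = 6           # subsamples taken from each 30 s clip

def subsample(y, num=NUM_SUBSAMPLES, length=SAMPLE_LENGTH, sr=SAMPLE_RATE):
    """Cut `num` consecutive windows of `length` seconds out of waveform `y`."""
    window = length * sr
    return [y[i * window:(i + 1) * window] for i in range(num)]

clip = np.zeros(30 * SAMPLE_RATE)        # stand-in for a 30-second GTZAN clip
parts = subsample(clip)
print(len(parts), len(parts[0]))         # 6 windows of 88200 samples each
```

This also acts as a mild form of data multiplication: each 30-second clip yields several training examples.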
I started experimenting with different types of CNNs using my personal laptop, but as one test easily took a couple of hours, I soon realized that I wasn’t going to do a lot of fine-tuning this way. So I moved to Google Colab, where you can use (for free!) a single 12 GB NVIDIA Tesla K80 GPU for up to 12 hours continuously. An execution which took more than two hours on my personal laptop suddenly needed only two minutes in Google Colab, clearing the way for a lot of experimenting and tweaking of the CNN.
And so, my journey of experimenting with my neural network began. Using Tensorflow and Keras, I started with a very basic architecture and slowly increased its complexity until overfitting began to occur, after which countermeasures were taken. An overview of all configurable parameters I used when training is presented here, together with the values from the final iteration:
- train_test_ratio: Ratio between data used for training and data used for testing.
Final value: 0.80
- number_of_epochs: This parameter defines how many times the learning algorithm will process the dataset when training the model.
Final value: 150
- learning_rate: Learning rate used by the CNN.
Final value: 0.00005
- batch_size: This parameter defines how many samples the learning algorithm has to work through before updating the model weights.
Final value: 32
- sample_length: Number of seconds used for audio samples provided to the learning algorithm.
Final value: 4
- num_data_series_per_sample: Number of subsamples provided to the learning algorithm per original audio sample from the GTZAN dataset.
Final value: 6
- n_mels: Number of mels used when computing the Mel spectrograms.
Final value: 64
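Gathered into one place, the final values from the list above could be stored as a simple configuration dict (the dict itself is just my own convenience, not code from the project):

```python
# Final hyperparameter values from the list above, gathered into one config dict
CONFIG = {
    'train_test_ratio': 0.80,
    'number_of_epochs': 150,
    'learning_rate': 0.00005,
    'batch_size': 32,
    'sample_length': 4,               # seconds
    'num_data_series_per_sample': 6,
    'n_mels': 64,
}
```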
In the end, I settled on the following architecture for my neural network:
- Conv2D (64) + MaxPool(2,2) + BatchNormalization
- Conv2D (32) + MaxPool(2,2) + BatchNormalization
- Conv2D (32) + MaxPool(2,2) + BatchNormalization
- Dense (64) + Dropout(0.50)
- Dense (32) + Dropout(0.50)
- Dense (16) + Dropout(0.50)
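A sketch of this architecture in Keras. The layer widths, pooling, batch normalization and dropout follow the list above; the kernel sizes, activations, input shape, and the final 10-way softmax output layer (which a 10-genre classifier implies) are my assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input: one Mel spectrogram as a 1-channel image. 64 mel bands is fixed by
# n_mels; the 173-frame width is an assumed value depending on hop size.
model = models.Sequential([
    layers.Input(shape=(64, 173, 1)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.BatchNormalization(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.50),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.50),
    layers.Dense(16, activation='relu'),
    layers.Dropout(0.50),
    layers.Dense(10, activation='softmax'),  # one output per GTZAN genre
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.00005),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```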
These were the final training and testing results:
As you can see, a final accuracy of around 68% was achieved (against a baseline of 10% for random guessing). Knowing that human accuracy on genre classification tasks lies around 70%, I was pretty satisfied with this result.
Conclusions and possible improvements
When using the GTZAN dataset, the model achieved a validation accuracy of 68%. I also calculated and visualized the confusion matrix, with the following results:
I found these results very interesting, as they show which genres are most easily confused by the CNN. The pairs blues+jazz, reggae+hiphop and rock+country are interchanged most often. As these are genres where the difference is not always very clear even to humans, I consider these results a good reflection of reality.
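For reference, a confusion matrix is simple to compute by hand; here is a numpy sketch on toy labels (in the project it would be computed on the test-set predictions):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example with 3 classes: one sample of class 1 is mistaken for class 2
y_true = [0, 1, 1, 2]
y_pred = [0, 1, 2, 2]
print(confusion_matrix(y_true, y_pred, 3))
```

Off-diagonal cells are exactly the genre mix-ups discussed above.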
Afterwards, I also implemented a web interface (hosted here) for using this music classifier. There, you can provide an audio sample as a file or you can record some audio to be classified. I’ve noticed that, when recording audio, the accuracy is a lot worse than when audio files are provided. This is mainly because the model was trained exclusively on high-quality audio files, so it never learned to cope with background noise or other sound effects. As a possible improvement for the future, data augmentation (adding random noise, for example) could be explored, so that the validation accuracy reached on clean files also applies to recorded sound samples.
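That noise-injection idea is straightforward to prototype; a minimal sketch with numpy, where the function name and noise level are arbitrary choices of mine:

```python
import numpy as np

def add_noise(y, noise_factor=0.005, seed=None):
    """Augment a waveform by adding scaled Gaussian noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(y))
    return (y + noise_factor * noise).astype(y.dtype)

clip = np.zeros(22050, dtype=np.float32)   # stand-in for 1 s of audio
noisy = add_noise(clip, seed=42)
print(noisy.shape, noisy.dtype)            # (22050,) float32
```

Training on both the clean and the noisy versions of each clip would expose the model to recording-like conditions.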
You can find all code related to this project here. Thanks a lot for reading!