Methodology

We first split each audio file into 5-second segments. We chose a sequence length of 5 seconds because the songs do not occur at a constant rate, and we wanted to be sure of capturing at least one song per sequence. We then extracted the mel-spectrogram of the first three 5-second excerpts of each audio file. We limited ourselves to the first three sequences because of the class imbalance already present in the data: taking every 5-second sequence contained in each file would have made the number of images per class even more strongly imbalanced. We used 20% of the audio files as a test set and the remaining 80% as a training set, and loaded the data with the ImageFolder class when training the network.
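The segmentation and spectrogram extraction could look roughly like the sketch below. This is a minimal sketch, assuming librosa for audio loading and mel-spectrogram computation and matplotlib for saving the images; the sampling rate, number of mel bands, and output paths are illustrative assumptions, not the exact values used in the project.

```python
import os
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

SEGMENT_SECONDS = 5   # fixed excerpt length used for all files
MAX_SEGMENTS = 3      # keep only the first 3 excerpts per file

def extract_mel_spectrograms(audio_path, out_dir, sr=32000, n_mels=128):
    """Split an audio file into 5-second excerpts and save the
    mel-spectrogram of the first three as images.
    sr and n_mels are assumed values for illustration."""
    y, sr = librosa.load(audio_path, sr=sr)
    samples_per_segment = SEGMENT_SECONDS * sr
    n_segments = min(MAX_SEGMENTS, len(y) // samples_per_segment)
    base = os.path.splitext(os.path.basename(audio_path))[0]
    for i in range(n_segments):
        segment = y[i * samples_per_segment:(i + 1) * samples_per_segment]
        mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_mels=n_mels)
        mel_db = librosa.power_to_db(mel, ref=np.max)  # convert to dB scale
        fig, ax = plt.subplots()
        librosa.display.specshow(mel_db, sr=sr, ax=ax)
        ax.set_axis_off()  # save the spectrogram alone, without axes
        fig.savefig(os.path.join(out_dir, f"{base}_{i}.png"),
                    bbox_inches="tight", pad_inches=0)
        plt.close(fig)
```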
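Loading the resulting images might then look like this, assuming torchvision's ImageFolder and a directory layout with one subfolder per class under separate train/ and test/ roots (produced by the 80/20 split of audio files); the directory names, resize dimensions, and batch size are placeholders.

```python
import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

# Each class's spectrogram images live in their own subdirectory,
# e.g. data/train/<species>/<file>_0.png (layout assumed).
transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])

train_set = ImageFolder("data/train", transform=transform)
test_set = ImageFolder("data/test", transform=transform)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
```

ImageFolder infers the label of each image from its subdirectory name, which is why the one-folder-per-class layout is needed.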


We worked in a Kaggle notebook, since it was the easiest way to access the competition data at the start. The files containing the audio excerpts have a total size of 40 GB, which makes alternatives harder to use. The Kaggle notebook gave us access to a GPU for 36 hours per week, with a memory limit of 15 GB. These two constraints somewhat limited our experiments. For example, we could not use a batch size of 512 or more with most of the tested architectures, since the memory required exceeded the limit. Moreover, since training a network took 5 to 8 hours depending on the architecture, the number of possible experiments was also limited.

