Approach
We initially chose this project because we wanted to use a recurrent neural network (RNN) rather than a convolutional neural network (CNN). A priori, an RNN seemed well suited to the problem, since the recordings are inherently sequential data. However, after reading several articles on audio classification, we found that the most common technique is to transform each recording into an image and then classify the resulting images with a convolutional network.
A frequently used technique for turning an audio file into an image is the Mel spectrogram. To capture audio digitally, the signal is typically sampled 44,100 times per second (44.1 kHz), which yields a waveform representation of the signal [6]. Using the melspectrogram function from the Librosa library, we transform these waveforms into images that we then feed to our neural network.
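A minimal sketch of this step, assuming Librosa and NumPy are available; the file path and parameters such as n_mels are illustrative placeholders, not the project's actual settings:

```python
import librosa
import numpy as np

# Load the recording, resampling to 44.1 kHz (the path is a placeholder).
y, sr = librosa.load("recording.wav", sr=44100)

# Compute a Mel spectrogram: a 2-D array (n_mels x time frames)
# that can be treated as a single-channel image.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# Convert power values to decibels, a common step before feeding a CNN.
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # e.g. (128, T), where T depends on the clip length
```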
Since we are classifying the images produced from the audio files, we use a convolutional network architecture. We favor the ResNet architecture (18, 34, and 50 layers). Because we have 397 classes, we also modify the size of the network's final linear output layer. We also experimented briefly with the VGG-11 architecture, but the results obtained were not promising.
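A minimal sketch of that output-layer modification, assuming torchvision's ResNet implementation (the variable names are illustrative; only the 397-class figure comes from the text):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 397

# Build a ResNet-18 and replace its final fully connected layer so the
# output size matches the 397 classes in the dataset.
model = models.resnet18()
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```

The same substitution works for resnet34 and resnet50; their final layers have different input widths (512 vs. 2048 features), which is why the sketch reads `fc.in_features` from the model rather than hard-coding it.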