Approach
We initially chose this project because we wanted to use a recurrent neural network (RNN) rather than a convolutional neural network (CNN). A priori, an RNN seemed well suited to the problem, since the recordings are inherently sequential data. However, after reading several articles on audio classification, we found that the most common technique is to transform each recording into an image and then classify the resulting images with a convolutional network.
A frequently used technique for turning an audio file into an image is the Mel spectrogram. To capture audio digitally, the signal is typically sampled 44,100 times per second (44.1 kHz), which yields a waveform representation of the signal [6]. Using the melspectrogram function from the Librosa library, we transform these waveforms into images that we then feed to our neural network.
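A minimal sketch of this step, assuming Librosa and NumPy are available; the file path and parameters such as n_mels are illustrative placeholders, not the project's actual settings:

```python
import librosa
import numpy as np

# Load the recording, resampling to 44.1 kHz (the path is a placeholder).
y, sr = librosa.load("recording.wav", sr=44100)

# Compute a Mel spectrogram: a 2-D array (n_mels x time frames)
# that can be treated as a single-channel image.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)

# Convert power values to decibels, a common step before feeding a CNN.
mel_db = librosa.power_to_db(mel, ref=np.max)

print(mel_db.shape)  # e.g. (128, T), where T depends on the clip length
```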
Since we are classifying the images produced from the audio files, we use a convolutional network architecture. We favor the ResNet architecture (18, 34, and 50 layers). Because we have 397 classes, we also modify the size of the network's final linear output layer. We also experimented briefly with the VGG-11 architecture, but the results obtained were not promising.
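A minimal sketch of that output-layer modification, assuming torchvision's ResNet implementation (the variable names are illustrative; only the 397-class figure comes from the text):

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 397

# Build a ResNet-18 and replace its final fully connected layer so the
# output size matches the 397 classes in the dataset.
model = models.resnet18()
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
```

The same substitution works for resnet34 and resnet50; their final layers have different input widths (512 vs. 2048 features), which is why the sketch reads `fc.in_features` from the model rather than hard-coding it.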