Introduction
Speech is one of the most natural ways humans express themselves. We rely on it so heavily that we notice its absence in other communication channels, such as emails and text messages, where we frequently use emoticons to convey our feelings. In today's digital age of remote communication, detecting and analyzing emotions is therefore essential. Because emotions are subjective, detecting them is difficult, and how to measure or classify them is still a subject of debate.
A Speech Emotion Recognition (SER) system is a collection of methods that process and classify speech signals in order to detect the emotions present in them. Applications of such a system include interactive voice-based assistants and caller-agent conversation analysis, among many others.
In this article, we analyze the acoustic characteristics of recorded audio in an effort to identify the underlying emotions in the speech.
Table of contents:
- How does Speech Emotion Recognition Work?
- Commonly used algorithms for speech emotion recognition
- Top 5 Speech emotion recognition datasets for practice
- SER model pipeline
- Conclusion
1) How does Speech Emotion Recognition Work?
In speech emotion recognition, data scientists employ a variety of audio processing techniques to extract tonal and acoustic features from speech and uncover the layers of information hidden in them.
Converting audio signals into numeric or vector form is harder than doing the same for images. How much crucial information is retained once we leave the raw audio format depends on the transformation technique: if a transformation fails to capture, say, the softness and calmness in a voice, the models will struggle to recognize the mood of the sample.
One way to convert audio data into numbers is the Mel spectrogram, which represents an audio signal by its frequency content over time; the resulting spectrogram can be plotted as an image and fed to a CNN trained as an image classifier.
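As a minimal sketch of this conversion (assuming the librosa library and a hypothetical file name sample.wav), a Mel spectrogram can be computed like this:

```python
# Minimal sketch: load an audio clip and compute its Mel spectrogram with librosa.
# The file name and parameter values are illustrative, not taken from the article.
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=22050)               # 1-D waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)              # convert power to decibels

print(mel_db.shape)  # (n_mels, time_frames): a 2-D array a CNN can treat as an image
```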
Direct speech-to-text recognition is a harder task still, since it entails mapping uttered words and sentences to their text equivalents. LSTM models, and more recently Transformer models, have greatly advanced research in this area, which is why practically all video streaming services can now offer subtitles or audio transcripts.
Now that we have seen how SER works, let's look at some of the algorithms used in SER.
2) Commonly used algorithms for speech emotion recognition
Here are some of the algorithms and model architectures widely used to extract features from audio:
- RNNs/LSTMs: These models operate on a sequence of timesteps, retaining information from earlier timesteps of the same sample as they process the next one. The network receives the numerical features and outputs a logit vector over the emotion classes (a minimal sketch follows after this list).
- Attention-based models: These are now the most frequently used models for any task that maps one type of sequence to another. An attention-based model can use an encoder-decoder strategy to learn the mapping of new sequences from previously predicted ones.
- Listen-Attend-Spell (LAS): This was one of the first approaches to combine the two methods above, using an encoder that learns features with bidirectional LSTMs.
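As a rough illustration of the RNN/LSTM approach mentioned above, here is a minimal Keras sketch of an LSTM-based emotion classifier. The input shape and the number of emotion classes are illustrative assumptions, not values taken from this article:

```python
# Minimal Keras sketch of an LSTM-based emotion classifier.
# Timesteps, feature count, and class count below are assumed values.
import tensorflow as tf
from tensorflow.keras import layers, models

n_timesteps, n_features, n_classes = 100, 40, 8  # assumed dimensions

model = models.Sequential([
    layers.Input(shape=(n_timesteps, n_features)),
    layers.LSTM(128),                               # processes the sequence step by step
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),  # class probabilities over emotions
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```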
3) Top 5 Speech emotion recognition datasets for practice
- Audio MNIST:
This dataset contains 30,000 audio clips of spoken digits (0–9) from 60 different speakers.
- Flickr 8k Audio Caption Corpus:
The dataset contains 40,000 annotated speech files stored in WAVE audio format.
- LJ speech:
A dataset containing 13,100 audio clips narrating passages from seven classic books.
- MS SNSD:
It includes files of clean speech as well as other environmental noises that can be combined with clean speech to create a more substantial, enhanced speech dataset.
- Speech accent archive:
This dataset contains a diverse number of accents of the English language, with speakers coming from 177 countries to record 2140 speech samples in total.
4) Speech emotion recognition (SER) model pipeline
The raw signal serves as the input to the processing pipeline depicted in the image below. The first step was extracting 2-D features from the datasets and converting them into 1-D form by taking row means. Noise was then added to the raw audio of four of our datasets (CREMA-D was excluded, as the others are studio recordings and therefore cleaner), features were extracted from those noisy files as well, and the results were appended to the dataset. After feature extraction, we applied a variety of ML algorithms: SVM, XGBoost, CNN-1D (shallow), and CNN-1D on the 1-D data frame, and CNN-2D on the 2-D tensors. Because some of the models were overfitting, and given the large number of features (181 in the 1-D case), we also tried dimensionality reduction to curb the overfitting and retrained the models.
Source: https://www.analyticsinsight.net/speech-emotion-recognition-ser-through-machine-learning/
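The 1-D feature preparation described above can be sketched roughly as follows; the use of MFCCs, the noise factor, and the file name are illustrative assumptions rather than the exact choices of the original pipeline:

```python
# Sketch of the 1-D feature step: extract a 2-D feature matrix (MFCCs here,
# as an assumed example), reduce it to 1-D with row means, and build a
# noise-augmented copy of the same clip.
import librosa
import numpy as np

def extract_1d_features(y, sr, n_mfcc=40):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # 2-D: (n_mfcc, frames)
    return np.mean(mfcc, axis=1)                            # 1-D: row means over time

def add_noise(y, noise_factor=0.005):
    return y + noise_factor * np.random.randn(len(y))       # simple white-noise augmentation

y, sr = librosa.load("sample.wav")                          # hypothetical file name
clean_features = extract_1d_features(y, sr)
noisy_features = extract_1d_features(add_noise(y), sr)
```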
CNN model architectures:
CNN 1-D (Shallow):
This model consisted of one convolution layer with 64 channels and 'same' padding, followed by a dense layer and the output layer.
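A minimal Keras sketch of this shallow model might look like the following; the kernel size, the dense-layer width, and the number of emotion classes are assumptions, while the 181-feature input length comes from the pipeline description above:

```python
# Sketch of the shallow 1-D CNN: one Conv1D block, a dense layer, and a softmax output.
import tensorflow as tf
from tensorflow.keras import layers, models

n_features, n_classes = 181, 8  # 181 features from the pipeline; class count assumed

shallow_cnn = models.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),  # kernel size assumed
    layers.Flatten(),
    layers.Dense(64, activation="relu"),            # dense layer before the output
    layers.Dense(n_classes, activation="softmax"),
])
```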
CNN 2-D (Deep):
In this model architecture, the final two blocks of three convolution layers each were removed from VGG-16 in order to make the model less complex.
The model architecture is as follows (a code sketch is given after the list):
- 2 convolution layers of 64 channels, 3×3 kernel size, and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
- 2 convolution layers of 128 channels, 3×3 kernel size, and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
- 3 convolution layers of 256 channels, 3×3 kernel size, and same padding followed by a max-pooling layer of size 2×2 and stride 2×2.
- Each convolution layer had the ‘relu’ activation function.
- After flattening, two dense layers of 512 units each were added and dropout layers of 0.1 and 0.2 were added after each dense layer.
- Finally, the output layer was added with a 'softmax' activation function.
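Assuming Mel-spectrogram inputs of shape 128×128×1 and eight emotion classes (both assumptions, not values stated in the article), the architecture above can be sketched in Keras as:

```python
# Keras sketch of the deep 2-D CNN described above (a truncated VGG-16-style network).
import tensorflow as tf
from tensorflow.keras import layers, models

n_classes = 8  # assumed number of emotion labels

deep_cnn = models.Sequential([
    layers.Input(shape=(128, 128, 1)),              # assumed spectrogram input shape
    # Block 1: 2 x Conv(64) + 2x2 max-pooling
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Block 2: 2 x Conv(128) + 2x2 max-pooling
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Block 3: 3 x Conv(256) + 2x2 max-pooling
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    # Classifier head: two dense layers of 512 units with dropout, then softmax output
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(n_classes, activation="softmax"),
])
```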
Model results comparison:
The outcome is evaluated with accuracy measures that compare predicted values to actual values. The true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts are combined to form a confusion matrix, and accuracy is derived from these counts as shown below:
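Accuracy = (TP + TN) / (TP + TN + FP + FN)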
Each model was trained on the training data and evaluated on the test data for different numbers of epochs (50, 100, 150, and 200). Accuracies were then compared across all models, namely SVM, XGBoost, and the convolutional neural networks (shallow and deep), for both the 1-D and 2-D features.
5) Conclusion
In this article, we learned how speech emotion recognition (SER) works and which algorithms are used to build models for emotion classification. We also looked at popular datasets that can be used to practice SER, along with a machine learning pipeline built around different models.
References:
[1] https://www.analyticsinsight.net/speech-emotion-recognition-ser-through-machine-learning/