MFCC features + SVM for speech emotion classification
git clone https://github.com/Jason-Oleana/Speech-Emotion-Classification.git
cd Speech-Emotion-Classification
pip install -r requirements.txt
Ravdess-RawData (Google Drive): https://drive.google.com/drive/folders/1S3j7CkyGWDpjS6OMSGOL0ka_osKff_Vg?usp=sharing
Reference:
- Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391
my_path = "your path to Ravdess-RawData"
3 emotions = happy, sad, neutral
5 emotions = happy, angry, neutral, sad, fearful
8 emotions = happy, angry, neutral, sad, fearful, disgust, calm, surprised
The emotion sets were selected based on the two-dimensional space of arousal and valence.
‘Happy’ was selected because it falls in the highest range of both arousal and positive valence.
In contrast, ‘sad’ was selected because it falls in the lowest range of arousal and the most negative range of valence.
‘Neutral’ was selected because it sits between happy and sad in the two-dimensional arousal-valence space.
For the second set, five emotions were selected for evaluation: happy, angry, neutral, sad, and fearful.
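In RAVDESS, the emotion is encoded as the third dash-separated field of each filename (e.g. 03-01-05-01-02-01-24.wav carries emotion code 05, angry). Below is a minimal sketch of how one of the sets above could be selected from the raw data under my_path; the helper name build_dataset and the recursive directory scan are assumptions for illustration, not the repository's actual code:

```python
import glob
import os

# RAVDESS filename field 3 -> emotion label (fixed by the dataset itself).
EMOTION_CODES = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# The three emotion sets evaluated above.
EMOTION_SETS = {
    3: {"happy", "sad", "neutral"},
    5: {"happy", "angry", "neutral", "sad", "fearful"},
    8: set(EMOTION_CODES.values()),
}

def build_dataset(my_path, n_emotions=3):
    """Return (wav_path, emotion) pairs restricted to the chosen emotion set."""
    selected = EMOTION_SETS[n_emotions]
    pairs = []
    for wav in glob.glob(os.path.join(my_path, "**", "*.wav"), recursive=True):
        emotion = EMOTION_CODES[os.path.basename(wav).split("-")[2]]
        if emotion in selected:
            pairs.append((wav, emotion))
    return pairs
```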
Mel-frequency cepstral coefficients (MFCCs) are the most widely used speech feature for speech emotion recognition (SER).
They are derived from the cepstrum, the inverse spectral transform of the logarithm of the spectrum,
and concisely describe the overall shape of the spectral envelope.
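A sketch of the MFCC + SVM pipeline named in the title, reusing build_dataset from the sketch above; averaging the MFCC matrix over time, n_mfcc=40, and the RBF kernel are illustrative defaults rather than the repository's exact settings:

```python
import librosa
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def extract_mfcc(wav_path, n_mfcc=40):
    """Load a clip and average its MFCC matrix over time into one fixed-length vector."""
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

pairs = build_dataset(my_path, n_emotions=3)
X = np.array([extract_mfcc(wav) for wav, _ in pairs])
y = [emotion for _, emotion in pairs]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0)  # SVM classifier on standardized MFCC means
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```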
A noticeable result was the set of emotions with the lowest performance: happy, disgust, sadness, and anger.
These emotions were the most difficult to distinguish in both the Log-Gabor conditions and the MFCC conditions.
A possible reason for the lower performance is that disgust, anger, and sadness all fall in the unpleasant range of valence.
According to Scherer's (2003) research on the vocal communication of emotion, emotions with a similar valence are often confused with one another and can therefore be harder for a machine learning algorithm to classify.
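One way to inspect which emotions are confused with one another, assuming the fitted clf, scaler, and held-out split from the MFCC + SVM sketch above:

```python
from sklearn.metrics import confusion_matrix

label_order = sorted(set(y_test))
y_pred = clf.predict(scaler.transform(X_test))
cm = confusion_matrix(y_test, y_pred, labels=label_order)

# Rows = true emotion, columns = predicted emotion; large off-diagonal counts
# between emotions of similar valence reflect the confusions discussed above.
print(label_order)
print(cm)
```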
A limitation of this research is that the dataset consists of acted emotions recorded by
professional actors. According to Gupta & Rajput (2007), the stress on various emotions is
not significant when customers express long sentences. In another study on the relationship
between lexical and prosodic emotional cues in English, Mairano, Zovato & Quinci (2019)
indicated that speech voiced by professional actors tends to be overacted.
For this reason, acted emotions do not correspond to the natural flow of spontaneous
speech (Scherer, 2013). Furthermore, a multimodal algorithm could be proposed that handles
acoustic features and converts speech to text for sentiment analysis on real conversations.
Combining acoustic features with sentiment analysis of the transcribed text could increase the
accuracy of speech emotion classification algorithms.
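A minimal sketch of what such feature-level fusion might look like; transcribe and sentiment_score are hypothetical placeholders for an ASR system and a text sentiment model, neither of which is part of this repository:

```python
import numpy as np

def transcribe(wav_path):
    """Hypothetical placeholder for an automatic speech recognition system."""
    raise NotImplementedError("plug in an ASR system of your choice")

def sentiment_score(text):
    """Hypothetical placeholder for a text sentiment model returning a score in [-1, 1]."""
    raise NotImplementedError("plug in a text sentiment model of your choice")

def fused_features(wav_path):
    """Concatenate acoustic MFCC features with a text-based sentiment score."""
    acoustic = extract_mfcc(wav_path)                  # from the MFCC sketch above
    sentiment = sentiment_score(transcribe(wav_path))  # lexical cue from the transcript
    return np.concatenate([acoustic, [sentiment]])
```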