A semi-supervised framework for SER tasks using a clustering-based pseudo-labeling technique
This is an implementation of the semi-supervised DeepEmoCluster framework for attribute-based speech emotion recognition (SER) tasks. Parts of the code are adapted from the DeepCluster repository. The experiments and trained models in the paper are based on the MSP-Podcast corpus (v1.6).
NEW: An improved DeepEmoCluster framework is now available in this repo
Use feat_extract.py to extract 128-mel spectrogram features for every speech segment in the corpus (remember to change the I/O paths in the .py file). Then, use norm_para.py to save the normalization parameters for our framework’s pre-processing step. The parameters will be saved in the generated ‘NormTerm’ folder.
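The normalization step amounts to computing per-mel-bin statistics over the whole corpus. Below is a minimal NumPy sketch of that idea; it assumes the features are stored as `.npy` arrays of shape `(time, 128)` (the repo actually uses `.mat` files), and the file names `feat_mean.npy`/`feat_std.npy` are illustrative, not the repo’s actual output names.

```python
import os
import numpy as np

def save_norm_params(feat_dir, out_dir="NormTerm"):
    """Compute per-mel-bin mean/std over all feature files and save them.

    Sketch only: assumes each .npy file in `feat_dir` holds a
    (time, 128) mel-spectrogram array; the repo stores .mat files.
    """
    feats = [np.load(os.path.join(feat_dir, f))
             for f in sorted(os.listdir(feat_dir)) if f.endswith(".npy")]
    stacked = np.concatenate(feats, axis=0)  # (total_frames, n_mels)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0)
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "feat_mean.npy"), mean)
    np.save(os.path.join(out_dir, "feat_std.npy"), std)
    return mean, std
```

At training and inference time the saved mean/std are then used to z-normalize every input spectrogram.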
After extracting the 128-mel spectrogram features (e.g., Mel_Spec/feat_mat/*.mat) for the MSP-Podcast corpus, we use the ‘labels_concensus.csv’ file provided by the corpus as the default input label setting for the supervised emotional regressor network.
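For readers unfamiliar with the label file, the supervised regressor only needs a mapping from segment file name to the attribute score being predicted. A stdlib-only sketch is shown below; the column names `FileName` and `EmoAct`/`EmoDom`/`EmoVal` are assumptions about the CSV layout and should be checked against your copy of the corpus.

```python
import csv

def load_attribute_labels(csv_path, attribute="EmoAct"):
    """Map each segment file name to its emotional-attribute score.

    Assumption: the label CSV has a 'FileName' column and one numeric
    column per attribute (e.g., 'EmoAct'); verify against the corpus.
    """
    labels = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            labels[row["FileName"]] = float(row[attribute])
    return labels
```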
python main.py -ep 50 -batch 64 -emo Act -nc 10
python online_testing.py -ep 50 -batch 64 -emo Act -nc 10
We provide some trained models based on version 1.6 of the MSP-Podcast corpus in the ‘trained_models’ folder. The CCC performance of the models on the test set is shown in the following table. Note that the results differ slightly from the paper, since the paper reports statistical tests (i.e., results averaged over multiple trials).
| 40K unlabeled set | Act (10 clusters) | Dom (30 clusters) | Val (30 clusters) |
|---|---|---|---|
| DeepEmoClusters | 0.6732 | 0.5547 | 0.1902 |
Users can reproduce these results by running online_testing.py with the corresponding arguments.
Since the framework is end-to-end, we also provide a complete prediction pipeline that allows users to directly make emotional predictions (i.e., arousal, dominance and valence) for their own dataset or any audio files (audio spec: WAV format, 16 kHz sampling rate, mono channel) using the trained DeepEmoCluster models. Users only need to change the input folder path in prediction_process.py to run the predictions; the output results will be saved as a ‘pred_result.csv’ file in the same directory.
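Before running predictions on your own audio, it is worth verifying that each file matches the expected spec (WAV, 16 kHz, mono). A stdlib-only helper is sketched below; it is not part of the repo, and prediction_process.py may load audio differently.

```python
import wave

def meets_audio_spec(wav_path, expected_rate=16000):
    """Return True if the WAV file is mono with the expected sample rate.

    Sketch only (not part of the repo): uses the stdlib `wave` module,
    so it only handles uncompressed PCM WAV files.
    """
    with wave.open(wav_path, "rb") as w:
        return w.getframerate() == expected_rate and w.getnchannels() == 1
```

Files that fail the check can be resampled or downmixed (e.g., with sox or ffmpeg) before prediction.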
If you use this code, please cite the following paper:
Wei-Cheng Lin, Kusha Sridhar and Carlos Busso, “DeepEmoCluster: A Semi-Supervised Framework for Latent Cluster Representation of Speech Emotions”
@InProceedings{Lin_2021,
author={W.-C. Lin and K. Sridhar and C. Busso},
title={{DeepEmoCluster}: A Semi-Supervised Framework for Latent Cluster Representation of Speech Emotions},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)},
volume={},
year={2021},
month={June},
pages={7263-7267},
address = {Toronto, ON, Canada},
doi={10.1109/ICASSP39728.2021.9414035},
}