Deep Learning model for lexical stress detection in spoken English
lexical-stress-detection is a deep learning model that identifies whether a vowel phoneme
in an isolated word is stressed or unstressed.
The image below summarizes the project.
To retrain the model, follow the steps below:
The first step of feature extraction is forced phoneme alignment of audio files. Refer to the alignment
readme.
Phoneme alignment needs the files in .wav format. If you have .flac files, use
this script to convert them to wav files.
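The conversion script itself is not reproduced here; as an illustration, a batch conversion could be sketched as below. The soundfile package (which reads FLAC via libsndfile) is an assumed dependency, and the in-place directory layout is an assumption:

```python
from pathlib import Path

try:
    import soundfile as sf  # assumed dependency; reads FLAC via libsndfile
except ImportError:
    sf = None

def wav_path(flac_file):
    """Map a .flac path to its .wav counterpart (same directory, same stem)."""
    return Path(flac_file).with_suffix(".wav")

def convert_flac_to_wav(directory):
    """Convert every .flac file under `directory` to a .wav file in place."""
    if sf is None:
        raise RuntimeError("soundfile is required for the conversion")
    for flac in Path(directory).rglob("*.flac"):
        audio, sample_rate = sf.read(flac)
        sf.write(wav_path(flac), audio, sample_rate)
```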
This process extracts spectral and non-spectral features from the audio of each phoneme, stores them as numpy arrays and
writes them to disk as .npy files.
Since stress on a particular vowel phoneme is related to the other vowel phonemes within that word, the features of each
phoneme are sandwiched between the features of the preceding and succeeding phonemes.
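The sandwiching amounts to stacking three per-phoneme feature matrices (13 x 30 each, per the spectral feature description in this README) along a new channel axis. A minimal numpy sketch with random data:

```python
import numpy as np

# Toy 13 x 30 spectral feature matrices for three consecutive phonemes
# (13 MFCC coefficients over 30 frame-level values).
prev_feat = np.random.rand(13, 30)
curr_feat = np.random.rand(13, 30)
next_feat = np.random.rand(13, 30)

# Stack along a new leading axis so the neighbours become channels: 3 x 13 x 30.
sample = np.stack([prev_feat, curr_feat, next_feat], axis=0)
print(sample.shape)  # (3, 13, 30)
```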
For each phoneme, two files are generated:
*_mfcc.npy: Spectral features - 13 MFCCs for 10 frames, plus their derivatives and double derivatives, giving a 13 x 30 matrix. The features of the preceding and succeeding phonemes are added as channels, so the final shape is 3 x 13 x 30. Generated by mfcc_extraction.py.
*_other.npy: Non-spectral features - 6 non-spectral features for the phoneme, represented as a vector. Generated by non_mfcc_extraction.py.
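One way the 13 x 30 matrix can be assembled is by concatenating the 13 x 10 MFCC block with its first and second derivatives along the frame axis. In this sketch, np.gradient stands in for whatever delta computation mfcc_extraction.py actually uses:

```python
import numpy as np

# Toy MFCC block: 13 coefficients over 10 frames.
mfcc = np.random.rand(13, 10)

# First and second derivatives along the frame axis
# (a stand-in for the project's actual delta computation).
delta = np.gradient(mfcc, axis=1)
delta2 = np.gradient(delta, axis=1)

# Concatenate along the frame axis: 13 x (10 + 10 + 10) = 13 x 30.
features = np.concatenate([mfcc, delta, delta2], axis=1)
print(features.shape)  # (13, 30)
```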
Sample generation is done by running the sample_generation.py
script, which takes three inputs as command line arguments.
The sample generation script is parallelized; a CPU with 16 or more cores is recommended for running it.
After sample generation, primary stress phonemes were about twice as numerous as unstressed ones, and secondary stress
phonemes were a very small percentage of the total. For this project we ignored secondary stress entirely and randomly
sampled primary stress features so their count was approximately equal to the unstressed count.
We removed the features of 80 stop words. Since the .npy file names
contain the word, a simple script can be written for this.
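Based on the file names shown in the sample CSV later in this README (e.g. libri_5808-54425-0000_is_ih0_mfcc.npy), such a script could look roughly like this. The naming pattern and the stop-word list here are assumptions:

```python
from pathlib import Path

# A few example stop words; the project used a list of 80.
STOP_WORDS = {"is", "the", "a", "an", "but", "of"}

def extract_word(npy_name):
    """Pull the word out of names like libri_5808-54425-0000_is_ih0_mfcc.npy
    (assumed pattern: prefix_utterance_word_phoneme_kind.npy)."""
    return npy_name.split("_")[-3]

def remove_stop_word_files(directory):
    """Delete feature files whose word is a stop word."""
    for npy in Path(directory).rglob("*.npy"):
        if extract_word(npy.name) in STOP_WORDS:
            npy.unlink()
```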
Use train_test_split.py
to split the data into train and val sets.
The script needs three input parameters as command line arguments.
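The exact arguments of train_test_split.py are not listed here; purely as an illustration, a random file-level split might look like this (function name and the 80/20 ratio are assumptions):

```python
import random

def split_files(files, val_fraction=0.2, seed=42):
    """Shuffle file paths deterministically and split into train and val lists."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_val = int(len(files) * val_fraction)
    return files[n_val:], files[:n_val]

train, val = split_files([f"sample_{i}.npy" for i in range(100)])
print(len(train), len(val))  # 80 20
```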
The model
is a combination of a CNN and a DNN. Spectral features are fed into the CNN and the
non-spectral features into the DNN. The outputs from these networks are concatenated and fed into another DNN and finally, a
softmax loss layer is used.
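The two-branch architecture described above could be sketched in PyTorch roughly as follows; the layer sizes are illustrative guesses, not the values used in this project:

```python
import torch
import torch.nn as nn

class StressModel(nn.Module):
    """Two-branch network: a CNN over the 3 x 13 x 30 spectral features and an
    MLP over the non-spectral vector, merged into a final 2-way classifier."""

    def __init__(self, n_other=6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3),  # 3x13x30 -> 8x11x28
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(8 * 11 * 28, 32),
            nn.ReLU(),
        )
        self.dnn = nn.Sequential(nn.Linear(n_other, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, 2)  # stressed vs. unstressed

    def forward(self, spectral, other):
        merged = torch.cat([self.cnn(spectral), self.dnn(other)], dim=1)
        return torch.log_softmax(self.head(merged), dim=1)

model = StressModel()
out = model(torch.randn(4, 3, 13, 30), torch.randn(4, 6))
print(out.shape)  # torch.Size([4, 2])
```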
The training.py
script takes five command line arguments.
Hyperparameters like batch size can be changed in this script.
It also generates a file data_check_test.csv
which has some info about predictions on the val set. This is useful
for debugging which samples are incorrectly classified. The five columns in the file are path, label, pred, prob_0 and prob_1.
Sample csv file:
path,label,pred,prob_0,prob_1
test/0/libri_5808-54425-0000_is_ih0_mfcc.npy,0,0,0.9996665716171265,0.00033342366805300117
test/1/libri_5808-54425-0000_years_ih1_mfcc.npy,1,1,6.26739677045407e-07,0.9999994039535522
test/1/libri_5808-54425-0000_five_ay1_mfcc.npy,1,1,2.3276076888123498e-07,0.9999997615814209
test/1/libri_5808-54425-0000_but_ah1_mfcc.npy,1,1,4.2122044874304265e-07,0.9999995231628418
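Pulling the misclassified val samples out of data_check_test.csv takes only a few lines of stdlib Python. The sketch below runs on two of the sample rows above, inlined to keep it self-contained; in practice, open the real file instead:

```python
import csv
import io

# Two sample rows from this README, inlined for a self-contained demo;
# in practice, open data_check_test.csv instead.
sample_csv = """path,label,pred,prob_0,prob_1
test/0/libri_5808-54425-0000_is_ih0_mfcc.npy,0,0,0.9996665716171265,0.00033342366805300117
test/1/libri_5808-54425-0000_years_ih1_mfcc.npy,1,1,6.26739677045407e-07,0.9999994039535522
"""

with io.StringIO(sample_csv) as f:
    rows = list(csv.DictReader(f))

# A sample is misclassified when its predicted class differs from its label.
misclassified = [r["path"] for r in rows if r["label"] != r["pred"]]
print(misclassified)  # [] - every sample above is predicted correctly
```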