Low LSTM Accuracy in Speech Recognition

Hello everyone, I am applying LSTM to speech emotion recognition. I have performed feature extraction using MFCC, resulting in a matrix of dimensions 60,575 × 39. I subsequently transformed this matrix into a cell array named "AllCellTrain" with dimensions 280 × 1, containing signals of varying sizes, as illustrated in the image below. I then utilized "AllCellTrain" as input for the trainNetwork function, along with the labels YCA, network layers, and training options. However, I encountered a significant issue with accuracy, achieving only around 20%. I'm unsure where I may have made a mistake. Could someone please offer some assistance?
num_hidden_units = 1024;

% Define the network: sequence input -> LSTM (last output) -> classifier
layers = [
    sequenceInputLayer(num_features)                     % 39 MFCC features per time step
    lstmLayer(num_hidden_units, 'OutputMode', 'last')    % keep only the final hidden state
    fullyConnectedLayer(num_classes)
    softmaxLayer
    classificationLayer];

% Specify the training options
max_epochs = 36;
mini_batch_size = 28;
initial_learning_rate = 0.001;
options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ExecutionEnvironment', 'gpu', ...
    'Verbose', false, ...
    'Plots', 'training-progress');

% Train on the training cell array and evaluate on the held-out test set
net = trainNetwork(AllCellTrain, YCA, layers, options);
predicted_labels = classify(net, AllCellTest, 'ExecutionEnvironment', 'gpu');
acc = mean(predicted_labels == YCT)
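
For reference, below is a minimal sketch of the kind of matrix-to-cell conversion described above. It is not the original script: the frameCounts values and the random data are placeholders. The important detail is that trainNetwork expects each cell to hold a numFeatures-by-numTimeSteps matrix, so each per-utterance block of the 60,575 × 39 matrix is transposed to 39 × T.

% Sketch only: split the stacked MFCC matrix (frames x coefficients) into a
% 280-by-1 cell array of per-utterance sequences. frameCounts is a placeholder
% for the real frames-per-utterance vector (it must sum to 60,575).
allMFCC = rand(60575, 39);                    % placeholder for the extracted MFCC matrix
frameCounts = [repmat(216, 279, 1); 311];     % placeholder lengths, sum = 60,575
AllCellTrain = cell(numel(frameCounts), 1);
idx = 1;
for k = 1:numel(frameCounts)
    block = allMFCC(idx:idx + frameCounts(k) - 1, :);   % frames of utterance k
    AllCellTrain{k} = block.';                          % 39-by-T: features along rows
    idx = idx + frameCounts(k);
end
num_features = size(AllCellTrain{1}, 1);      % 39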
  4 Comments
Hamza on 2023-11-6
Edited: Hamza on 2023-11-6
Hi @Christopher McCausland, thanks for your answer. I am trying to classify 7 emotion classes. For your information, I used the same data with a 1D CNN and got 90% accuracy, so I don't know what the issue is with the LSTM. Also, when I shuffled the columns (the features) I got a different result, which shouldn't be the case. You can find the attached curve. Thanks in advance!
Christopher McCausland
Hi @Hamza,
To me this looks like classic overfitting: your model appears to train well and learn features, but these features are overfitted to the training data and are not representative of generalised data.
A few things to consider:
  1. Do you have multiple speakers? If so, how do you pick which speakers go into the test/train sets?
  2. You have 280 input sequences and seven classes; if the data is perfectly balanced, that is 40 observations per class. Is this enough?
  3. Can you include a validation split to prevent overfitting? (See the sketch after this comment.)
  4. These are just a few ways to prevent overfitting and ensure your data is appropriate for training; there are many others which I would suggest you take a look at.
In terms of the CNN performance, were the test/train sets the same, and how many epochs did you train the CNN for?
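
For illustration, here is a hedged sketch of how a validation split (point 3) might be added. AllCellVal, YVal and the patience value are assumptions rather than anything from the original code, and cvpartition is just one way to make a stratified split.

% Sketch: hold out a stratified 20% of the training data for validation.
% AllCellVal/YVal and the patience value of 5 are assumed, not from the question.
summary(YCA)                                   % observations per class (assumes YCA is categorical)
cv = cvpartition(YCA, 'HoldOut', 0.2);         % stratified 80/20 split
AllCellVal = AllCellTrain(test(cv));
YVal       = YCA(test(cv));
AllCellTr  = AllCellTrain(training(cv));
YTr        = YCA(training(cv));

options = trainingOptions('adam', ...
    'MaxEpochs', max_epochs, ...
    'MiniBatchSize', mini_batch_size, ...
    'InitialLearnRate', initial_learning_rate, ...
    'SequenceLength', 'shortest', ...
    'Shuffle', 'every-epoch', ...
    'ValidationData', {AllCellVal, YVal}, ...  % monitor generalisation during training
    'ValidationPatience', 5, ...               % stop when validation loss stops improving
    'Plots', 'training-progress');

net = trainNetwork(AllCellTr, YTr, layers, options);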

Answers (0)
