Speech Command detection in audio file

Question

0 个投票

Hi,

I need to remake "Speech Command Recognition Using Deep Learning" example so I can read audio from the wav file and get time intervals in which the command appears, but I don't know how to change real-time analysis from microphone into static file analysis in this example. Thank you for your help.

 %% Detect Commands Using Streaming Audio from Microphone
% Test your newly trained command detection network on streaming audio from
% your microphone. If you have not trained a network, then type
% |load('commandNet.mat')| at the command line to load a pretrained network
% and the parameters required to classify live, streaming audio. Try
% saying one of the commands, for example, _yes_, _no_, or _stop_.
% Then, try saying one of the unknown words such as _Marvin_, _Sheila_, _bed_,
% _house_, _cat_, _bird_, or any number from zero to nine.
%%
% Specify the audio sampling rate and classification rate in Hz and create
% an audio device reader that can read audio from your microphone.
fs = 16e3;
classificationRate = 20;
audioIn = audioDeviceReader('SampleRate',fs, ...
    'SamplesPerFrame',floor(fs/classificationRate));
%%
% Specify parameters for the streaming spectrogram computations and
% initialize a buffer for the audio. Extract the classification labels of
% the network. Initialize buffers of half a second for the labels and
% classification probabilities of the streaming audio. Use these buffers to
% compare the classification results over a longer period of time and by
% that build 'agreement' over when a command is detected.
frameLength = frameDuration*fs;
hopLength = hopDuration*fs;
waveBuffer = zeros([fs,1]);
labels = trainedNet.Layers(end).Classes;
YBuffer(1:classificationRate/2) = categorical("background");
probBuffer = zeros([numel(labels),classificationRate/2]);
framesNumber = audio/frameLength;
%%
% Create a figure and detect commands as long as the created figure exists.
% To stop the live detection, simply close the figure. 
h = figure('Units','normalized','Position',[0.2 0.1 0.6 0.8]);
while ishandle(h)
    
    % Extract audio samples from the audio device and add the samples to
    % the buffer.
    x = audioIn();
    waveBuffer(1:end-numel(x)) = waveBuffer(numel(x)+1:end);
    waveBuffer(end-numel(x)+1:end) = x;
    
    % Compute the spectrogram of the latest audio samples.
    spec = auditorySpectrogram(waveBuffer,fs, ...
        'WindowLength',frameLength, ...
        'OverlapLength',frameLength-hopLength, ...
        'NumBands',numBands, ...
        'Range',[50,7000], ...
        'WindowType','Hann', ...
        'WarpType','Bark', ...
        'SumExponent',2);
    spec = log10(spec + epsil);
    
    % Classify the current spectrogram, save the label to the label buffer,
    % and save the predicted probabilities to the probability buffer.
    [YPredicted,probs] = classify(trainedNet,spec,'ExecutionEnvironment','cpu');
    YBuffer(1:end-1)= YBuffer(2:end);
    YBuffer(end) = YPredicted;
    probBuffer(:,1:end-1) = probBuffer(:,2:end);
    probBuffer(:,end) = probs';
    
    % Plot the current waveform and spectrogram.
    subplot(2,1,1);
    plot(waveBuffer)
    axis tight
    ylim([-0.2,0.2])
    
    subplot(2,1,2)
    pcolor(spec)
    caxis([specMin+2 specMax])
    shading flat
    
    % Now do the actual command detection by performing a very simple
    % thresholding operation. Declare a detection and display it in the
    % figure title if all of the following hold:
    % 1) The most common label is not |background|.
    % 2) At least |countThreshold| of the latest frame labels agree.
    % 3) The maximum predicted probability of the predicted label is at
    % least |probThreshold|. Otherwise, do not declare a detection.
    [YMode,count] = mode(YBuffer);
    countThreshold = ceil(classificationRate*0.2);
    maxProb = max(probBuffer(labels == YMode,:));
    probThreshold = 0.5;
    subplot(2,1,1);
    if YMode == "background" || count<countThreshold || maxProb < probThreshold
        title(" ")
    else
        title(string(YMode),'FontSize',20)
    end
    
    drawnow
end

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Follow Question

Answer 1

jibrahim 2019-4-22

在 MATLAB Online 中打开

0 个投票

Hi Bartek,

Here is a sketch of how you can achieve this:

1) Read the contents of the audio file

[audioIn,fs] = audioread('filename.wav');

2) Split the signal into 1-second chunks, with overlap between consecutive chunks. The higher the overlap, the higher your resolution (i.e. how close you will be able to detect where the keyword occured)

y = buffer(audioIn, fs, round(3*fs/4));

3) Convert each chunk (each column of y) to a mel spectrogram, and classify it using the network. Based on the network results, you will get an estimate of where the word occured in your wav file (each chunk advances by 1 sec - overlap, so in the this example, it's ~250 ms.

HTH

Jihad

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Speech Command detection in audio file

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

更多回答（0 个）

类别

产品

版本

标签

Community Treasure Hunt

Speech Command detection in audio file

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

采纳的回答

0 个评论 显示 -2更早的评论 隐藏 -2更早的评论

更多回答（0 个）

类别

产品

版本

标签

另请参阅

Community Treasure Hunt

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论