detectspeechnn

Detect boundaries of speech in audio signal using AI

Since R2023a

    Description

    roi = detectspeechnn(audioIn,fs) returns indices corresponding to the beginning and end of speech within the audio signal.

    roi = detectspeechnn(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, detectspeechnn(audioIn,fs,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.

    [roi,probs] = detectspeechnn(___) also returns the probability of voice activity per sample in the input audio signal.

    detectspeechnn(___) with no output arguments plots the input signal and the detected speech regions.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.

    Examples

    Read in an audio signal containing speech and music and listen to the sound.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
    sound(audioIn,fs)

    Call detectspeechnn on the signal to obtain the regions of interest (ROIs), in samples, containing speech.

    roi = detectspeechnn(audioIn,fs)
    roi = 2×2

               1       63120
           83600      150000

    Convert the ROIs from samples to seconds.

    roiSeconds = (roi-1)/fs
    roiSeconds = 2×2

             0    3.9449
        5.2249    9.3749

    Plot the audio waveform with the speech regions.

    detectspeechnn(audioIn,fs)

    [Figure: plot titled "Detected Speech" showing the audio waveform, Time (s) versus Amplitude, with the detected speech regions highlighted.]

    Read in an audio signal containing a speaker repeating the phrase "volume up".

    [audioIn,fs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");

    Compare detected speech regions by calling detectspeechnn with and without the application of an energy-based voice activity detector (VAD) in postprocessing.

    tiledlayout(2,1)
    nexttile()
    detectspeechnn(audioIn,fs)
    nexttile()
    detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)

    [Figure: two plots titled "Detected Speech", Time (s) versus Amplitude, showing the regions detected without (top) and with (bottom) the energy-based VAD.]

    Read in an audio signal.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

    Call detectspeechnn with no output arguments to display a plot of the detected speech regions.

    detectspeechnn(audioIn,fs);

    [Figure: plot titled "Detected Speech" showing the audio waveform, Time (s) versus Amplitude, with the detected speech regions highlighted.]

    Modify the parameters used in the postprocessing algorithm and see how they affect the detected speech regions. For more information about the VAD postprocessing algorithm, see Postprocessing.

    mergeThreshold = 1.3; % seconds
    lengthThreshold = 0.25; % seconds
    activationThreshold = 0.5; % probability
    deactivationThreshold = 0.25; % probability
    applyEnergyVAD = false;

    detectspeechnn(audioIn,fs,MergeThreshold=mergeThreshold, ...
        LengthThreshold=lengthThreshold, ...
        ActivationThreshold=activationThreshold, ...
        DeactivationThreshold=deactivationThreshold, ...
        ApplyEnergyVAD=applyEnergyVAD)

    [Figure: plot titled "Detected Speech", Time (s) versus Amplitude, showing the speech regions detected with the modified postprocessing parameters.]

    Read in an audio signal containing speech and music.

    [audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");

    Call detectspeechnn with an additional output variable to get the probabilities of speech in each sample of the signal.

    [roi,probs] = detectspeechnn(audioIn,fs);

    Plot the audio signal along with the voice activity probability.

    t = (0:length(audioIn)-1)/fs;
    plot(t,audioIn,t,probs,"r")
    legend("Audio signal","Probability of speech",Location="best")
    xlabel("Time (s)")
    title("Voice Activity Probability")

    [Figure: plot titled "Voice Activity Probability", Time (s) on the x-axis, showing the audio signal overlaid with the probability of speech.]

    Use detectspeechnn to detect the presence of speech in a streaming audio signal.

    Create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.

    afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
    analysisDuration = 0.1; % seconds
    afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);

    The neural network architecture of detectspeechnn does not retain state between calls, and the network performs best when analyzing larger chunks of audio. When you use detectspeechnn in a streaming scenario, your application's requirements for accuracy, computational efficiency, and latency dictate the analysis duration and whether to overlap analysis chunks.

    Create a timescope object to plot the audio signal and the detected speech regions. Create an audioDeviceWriter to play the audio as you stream it.

    scope = timescope(NumInputPorts=2, ...
        SampleRate=afr.SampleRate, ...
        TimeSpanSource="property",TimeSpan=5, ...
        YLimits=[-1.2,1.2], ...
        ShowLegend=true,ChannelNames=["Audio","Detected Speech"]);
    adw = audioDeviceWriter(afr.SampleRate);

    In a streaming loop:

    1. Read in a 100 ms chunk from the audio file.

    2. Use detectspeechnn to detect any regions of speech in the frame. Use sigroi2binmask to convert the region indices to a binary mask.

    3. Plot the audio signal and the detected speech.

    4. Play the audio with the device writer.

    while ~isDone(afr)
        audioIn = afr();
        segments = detectspeechnn(audioIn,afr.SampleRate,LengthThreshold=0.01);
        mask = sigroi2binmask(segments,afr.SamplesPerFrame);
        scope(audioIn,mask)
        adw(audioIn);
    end

    Input Arguments

    audioIn

    Audio input signal, specified as a column vector (single channel).

    Data Types: single | double

    fs

    Sample rate in Hz, specified as a positive scalar.

    Data Types: single | double

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

    Example: detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)

    MergeThreshold

    Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to not merge any detected regions.
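
    For example, to keep every detected region separate, disable merging (this snippet reuses audioIn and fs from the earlier examples):

    roi = detectspeechnn(audioIn,fs,MergeThreshold=Inf); % no regions are merged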

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    LengthThreshold

    Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    ActivationThreshold

    Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].

    Data Types: single | double

    DeactivationThreshold

    Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].

    Data Types: single | double

    ApplyEnergyVAD

    Apply energy-based voice activity detector (VAD) to the speech regions detected by the neural network, specified as true or false.

    Data Types: logical

    Output Arguments

    roi

    Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.

    probs

    Probability of speech per sample of the input audio signal, returned as a column vector with the same size as the input signal.

    Algorithms

    Preprocessing

    The detectspeechnn function preprocesses the audio data using the following steps. A code sketch of the pipeline follows the list.

    1. Resample the audio to 16 kHz.

    2. Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.

    3. Convert the STFT to a power spectrogram.

    4. Apply a mel filter bank with 40 bands to obtain a mel spectrogram.

    5. Convert the mel spectrogram to a log scale.

    6. Standardize each of the mel bands to have zero mean and standard deviation of 1.
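
    The steps above map onto standard Audio Toolbox and Signal Processing Toolbox functions. The following code is a minimal sketch of the pipeline, not the exact internal implementation; the padding length, FFT length, and logarithm base are assumptions.

    [audioIn,fsIn] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
    fs = 16e3;
    audioIn = resample(audioIn,fs,fsIn); % 1. resample to 16 kHz

    winLen = round(0.025*fs); % 25 ms periodic Hamming window
    hopLen = round(0.010*fs); % 10 ms hop
    win = hamming(winLen,"periodic");

    % 2. Pad so the first window is centered at 0 s (zero padding assumed),
    % then compute the one-sided STFT.
    audioIn = [zeros(floor(winLen/2),1); audioIn];
    S = stft(audioIn,Window=win,OverlapLength=winLen-hopLen, ...
        FFTLength=winLen,FrequencyRange="onesided");

    P = abs(S).^2; % 3. power spectrogram

    % 4. Apply a 40-band mel filter bank.
    fb = designAuditoryFilterBank(fs,FFTLength=winLen,NumBands=40);
    melSpec = fb*P;

    logMelSpec = log10(melSpec + eps); % 5. log scale (base-10 assumed)

    % 6. Standardize each mel band to zero mean and unit standard deviation.
    features = (logMelSpec - mean(logMelSpec,2))./std(logMelSpec,0,2);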

    Neural Network Inference

    The preprocessed data is passed to a pretrained VAD neural network. The network outputs represent the probability of speech in each frame of audio in the input spectrogram.

    The neural network is a ported version of the vad-crdnn-libriparty pretrained model provided by SpeechBrain[1], which combines convolutional, recurrent, and fully connected layers.

    Postprocessing

    The detectspeechnn function postprocesses the VAD network output using the following steps. A code sketch follows the list.

    1. Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.

    2. Optionally, apply energy-based VAD to refine the detected speech regions.

    3. Merge speech regions that are close to each other according to the merge threshold.

    4. Remove speech regions that are shorter than or equal to the length threshold.
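
    The following is a minimal sketch of steps 1, 3, and 4 applied to the per-sample probabilities that detectspeechnn returns as its second output; the energy-based VAD of step 2 is omitted, and the threshold values are the example values used earlier on this page, not necessarily the defaults.

    [~,probs] = detectspeechnn(audioIn,fs);
    activationThreshold = 0.5;
    deactivationThreshold = 0.25;
    mergeThreshold = 0.5; % seconds
    lengthThreshold = 0.25; % seconds

    % 1. Hysteresis thresholding: open a region when the probability rises
    % above the activation threshold, close it when the probability falls
    % below the deactivation threshold.
    isSpeech = false(size(probs));
    active = false;
    for n = 1:numel(probs)
        if ~active && probs(n) >= activationThreshold
            active = true;
        elseif active && probs(n) < deactivationThreshold
            active = false;
        end
        isSpeech(n) = active;
    end
    roi = binmask2sigroi(isSpeech);

    % 3. Merge regions separated by mergeThreshold seconds or less.
    roi = mergesigroi(roi,round(mergeThreshold*fs));

    % 4. Remove regions with duration less than or equal to lengthThreshold.
    roi = removesigroi(roi,round(lengthThreshold*fs));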

    References

    [1] Ravanelli, Mirco, et al. "SpeechBrain: A General-Purpose Speech Toolkit." arXiv, 8 June 2021. http://arxiv.org/abs/2106.04624.

    Version History

    Introduced in R2023a
