detectSpeech

Detect boundaries of speech in audio signal

Since R2020a

Description

idx = detectSpeech(audioIn,fs) returns indices of audioIn that correspond to the boundaries of speech signals.

idx = detectSpeech(audioIn,fs,Name,Value) specifies options using one or more Name,Value pair arguments.

Example: detectSpeech(audioIn,fs,'Window',hann(512,'periodic'),'OverlapLength',256) detects speech using a 512-point periodic Hann window with 256-point overlap.

[idx,thresholds] = detectSpeech(___) also returns the thresholds used to compute the boundaries of speech.

detectSpeech(___) with no output arguments displays a plot of the detected speech regions in the input signal.

Examples

Read in an audio signal. Clip the audio signal to 20 seconds.

[audioIn,fs] = audioread('Rainbow-16-8-mono-114secs.wav');
audioIn = audioIn(1:20*fs);

Call detectSpeech. Specify no output arguments to display a plot of the detected speech regions.

detectSpeech(audioIn,fs);

Figure contains an axes object. The axes object with title Detected Speech, xlabel Time (s) contains 37 objects of type line, constantline, patch.

The detectSpeech function uses a thresholding algorithm based on energy and spectral spread per analysis frame. You can modify the Window, OverlapLength, and MergeDistance to fine-tune the algorithm for your specific needs.

windowDuration = 0.074; % seconds
numWindowSamples = round(windowDuration*fs);
win = hamming(numWindowSamples,'periodic');

percentOverlap = 35;
overlap = round(numWindowSamples*percentOverlap/100);

mergeDuration = 0.44;
mergeDist = round(mergeDuration*fs);

detectSpeech(audioIn,fs,"Window",win,"OverlapLength",overlap,"MergeDistance",mergeDist)

Figure contains an axes object. The axes object with title Detected Speech, xlabel Time (s) contains 19 objects of type line, constantline, patch.

Read in an audio file containing speech. Split the audio signal into a first half and a second half.

[audioIn,fs] = audioread('Counting-16-44p1-mono-15secs.wav');
firstHalf = audioIn(1:floor(numel(audioIn)/2));
secondHalf = audioIn(numel(firstHalf)+1:end); % start one sample after firstHalf so the halves do not overlap

Call detectSpeech on the first half of the audio signal. Specify two output arguments to return the indices corresponding to regions of detected speech and the thresholds used for the decision.

[speechIndices,thresholds] = detectSpeech(firstHalf,fs);

Call detectSpeech on the second half with no output arguments to plot the regions of detected speech. Specify the thresholds determined from the previous call to detectSpeech.

detectSpeech(secondHalf,fs,'Thresholds',thresholds)

Working with Large Data Sets

Reusing speech detection thresholds provides significant computational efficiency when you work with large data sets, or when you deploy a deep learning or machine learning pipeline for real-time inference. Download and extract the data set [1].

url = 'https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz';

downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,'google_speech');

if ~exist(datasetFolder,'dir')
    disp('Downloading data set (1.9 GB) ...')
    untar(url,datasetFolder)
end

Create an audio datastore to point to the recordings. Use the folder names as labels.

ads = audioDatastore(datasetFolder,'IncludeSubfolders',true,'LabelSource','foldernames');

Reduce the data set by 95% for the purposes of this example.

ads = splitEachLabel(ads,0.05,'Exclude','_background_noise');

Create two datastores: one for training and one for testing.

[adsTrain,adsTest] = splitEachLabel(ads,0.8);

Compute the average thresholds over the training data set.

thresholds = zeros(numel(adsTrain.Files),2);
for ii = 1:numel(adsTrain.Files)
    [audioIn,adsInfo] = read(adsTrain);
    [~,thresholds(ii,:)] = detectSpeech(audioIn,adsInfo.SampleRate);
end
thresholdAverage = mean(thresholds,1);

Use the precomputed thresholds to detect speech regions on files from the test data set. Plot the detected region for three files.

[audioIn,adsInfo] = read(adsTest);
detectSpeech(audioIn,adsInfo.SampleRate,'Thresholds',thresholdAverage);

[audioIn,adsInfo] = read(adsTest);
detectSpeech(audioIn,adsInfo.SampleRate,'Thresholds',thresholdAverage);

[audioIn,adsInfo] = read(adsTest);
detectSpeech(audioIn,adsInfo.SampleRate,'Thresholds',thresholdAverage);

References

[1] Warden, Pete. "Speech Commands: A Public Dataset for Single Word Speech Recognition." Distributed by TensorFlow. Creative Commons Attribution 4.0 License.

Read in an audio file and listen to it. Plot the spectrogram.

[audioIn,fs] = audioread('Counting-16-44p1-mono-15secs.wav');

sound(audioIn,fs)

spectrogram(audioIn,hann(1024,'periodic'),512,1024,fs,'yaxis')

Figure contains an axes object. The axes object with xlabel Time (s), ylabel Frequency (kHz) contains an object of type image.

For machine learning applications, you often want to extract features from an audio signal. Call the spectralEntropy function on the audio signal, then plot the histogram to display the distribution of spectral entropy.

entropy = spectralEntropy(audioIn,fs);

numBins = 40;
histogram(entropy,numBins,'Normalization','probability')
title('Spectral Entropy of Audio Signal')

Figure contains an axes object. The axes object with title Spectral Entropy of Audio Signal contains an object of type histogram.

Depending on your application, you might want to extract spectral entropy from only the regions of speech. The resulting statistics are more characteristic of the speaker and less characteristic of the channel. Call detectSpeech on the audio signal and then create a new signal that contains only the regions of detected speech.

speechIndices = detectSpeech(audioIn,fs);
speechSignal = [];
for ii = 1:size(speechIndices,1)
    speechSignal = [speechSignal;audioIn(speechIndices(ii,1):speechIndices(ii,2))];
end
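
An equivalent way to keep only the detected regions is to build a logical mask and index the signal once, which avoids growing an array inside the loop. The following is a minimal sketch on toy data: toySignal and toyIndices are hypothetical stand-ins for real audio and for the output of detectSpeech.

```matlab
% Toy stand-ins (hypothetical): a 10-sample "signal" and two detected regions.
toySignal = (1:10)';
toyIndices = [2 4; 7 9];      % each row is [startIndex endIndex]

% Mark every sample that falls inside a detected region.
isSpeech = false(size(toySignal));
for ii = 1:size(toyIndices,1)
    isSpeech(toyIndices(ii,1):toyIndices(ii,2)) = true;
end

% Index once to extract the speech-only samples.
toySpeech = toySignal(isSpeech);
```

The same mask can be reused to select the non-speech samples with toySignal(~isSpeech).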

Listen to the speech signal and plot the spectrogram.

sound(speechSignal,fs)

spectrogram(speechSignal,hann(1024,'periodic'),512,1024,fs,'yaxis')

Figure contains an axes object. The axes object with xlabel Time (s), ylabel Frequency (kHz) contains an object of type image.

Call the spectralEntropy function on the speech signal and then plot the histogram to display the distribution of spectral entropy.

entropy = spectralEntropy(speechSignal,fs);

histogram(entropy,numBins,'Normalization','probability')
title('Spectral Entropy of Speech Signal')

Figure contains an axes object. The axes object with title Spectral Entropy of Speech Signal contains an object of type histogram.

Input Arguments

audioIn: Audio input, specified as a column vector.

Data Types: single | double

fs: Sample rate in Hz, specified as a scalar.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: detectSpeech(audioIn,fs,'MergeDistance',100)

Window applied in the time domain, specified as the comma-separated pair consisting of 'Window' and a real vector. The number of elements in the vector must be in the range [2, size(audioIn,1)]. The number of elements in the vector must also be greater than OverlapLength.

Data Types: single | double

Number of samples overlapping between adjacent windows, specified as the comma-separated pair consisting of 'OverlapLength' and an integer in the range [0, size(Window,1)).

Data Types: single | double
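
The constraints on Window and OverlapLength can be checked before calling detectSpeech. This is a minimal sketch under assumed values: the signal length, the all-ones placeholder window, and the variable names are illustrative and not part of the function.

```matlab
% Hypothetical inputs: one second of audio at 16 kHz and a 480-sample window.
audioIn = zeros(16000,1);
win = ones(480,1);            % placeholder window values
overlapLength = 320;

numWin = numel(win);

% Window length must lie in [2, size(audioIn,1)] and exceed OverlapLength.
isWindowValid = numWin >= 2 && numWin <= size(audioIn,1) && numWin > overlapLength;

% OverlapLength must be an integer in [0, numel(win)).
isOverlapValid = overlapLength >= 0 && overlapLength < numWin && ...
    overlapLength == floor(overlapLength);
```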

Number of samples over which to merge positive speech detection decisions, specified as the comma-separated pair consisting of 'MergeDistance' and a nonnegative scalar.

Note

The resolution for speech detection is given by the hop length, where the hop length is equal to numel(Window) − OverlapLength.

Data Types: single | double
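
To make the note concrete, the following sketch computes the hop length and the resulting time resolution, and converts a merge duration in seconds to a MergeDistance in samples. The sample rate, the window and overlap durations, and the merge duration are assumed values chosen for illustration.

```matlab
fs = 16000;                                 % sample rate in Hz (assumed)
windowLength  = round(0.03*fs);             % 30 ms window -> 480 samples (assumed)
overlapLength = round(0.02*fs);             % 20 ms overlap -> 320 samples (assumed)

% Hop length: the number of samples between the starts of adjacent windows.
hopLength = windowLength - overlapLength;   % 160 samples

% Detected boundaries can only move in steps of one hop.
resolutionSeconds = hopLength/fs;           % 0.01 s

% Express a merge duration in samples, as MergeDistance expects.
mergeDuration = 0.15;                       % seconds (assumed)
mergeDistance = round(mergeDuration*fs);    % 2400 samples
mergeDistanceInHops = mergeDistance/hopLength;  % 15 hops
```

Increasing OverlapLength shortens the hop and gives finer boundaries, at the cost of more frames to analyze.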

Thresholds for decision, specified as the comma-separated pair consisting of 'Thresholds' and a two-element vector.

  • If you do not specify Thresholds, the detectSpeech function derives thresholds by using histograms of the features calculated over the current input frame.

  • If you specify Thresholds, the detectSpeech function skips the derivation of new decision thresholds. Reusing speech decision thresholds provides significant computational efficiency when you work with large data sets, or when you deploy a deep learning or machine learning pipeline for real-time inference.

Data Types: single | double

Output Arguments

idx: Start and end indices of speech regions, returned as an N-by-2 matrix. N corresponds to the number of individual speech regions detected. The first column corresponds to start indices of speech regions and the second column corresponds to end indices of speech regions.

Data Types: single | double
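
Because the indices are sample positions, dividing by the sample rate converts them to times. The following is a minimal sketch; the idx matrix and sample rate are hypothetical values standing in for an actual detectSpeech result.

```matlab
fs = 16000;                                 % sample rate in Hz (assumed)
idx = [1601 8000; 12001 20000];             % hypothetical detectSpeech output

% MATLAB indices are 1-based, so sample 1 corresponds to time 0.
startTimes = (idx(:,1) - 1)/fs;             % seconds
endTimes   = (idx(:,2) - 1)/fs;             % seconds

% Both boundary samples belong to the region, hence the +1.
durations  = (idx(:,2) - idx(:,1) + 1)/fs;  % seconds
```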

thresholds: Thresholds used for decision, returned as a two-element vector. The thresholds are in the order [Energy Threshold, Spectral Spread Threshold].

Data Types: single | double

Algorithms

The detectSpeech algorithm is based on [1], although modified so that the statistics to threshold are short-term energy and spectral spread, instead of short-term energy and spectral centroid. The diagram and steps provide a high-level overview of the algorithm. For details, see [1].

Sequence of stages in algorithm.

  1. The audio signal is converted to a time-frequency representation using the specified Window and OverlapLength.

  2. The short-term energy and spectral spread are calculated for each frame. The spectral spread is calculated according to spectralSpread.

  3. Histograms are created for both the short-term energy and spectral spread distributions.

  4. For each histogram, a threshold is determined according to $T = \frac{W \times M_1 + M_2}{W + 1}$, where $M_1$ and $M_2$ are the first and second local maxima, respectively. $W$ is set to 5.

  5. Both the spectral spread and the short-term energy are smoothed across time by passing through successive five-element moving median filters.

  6. Masks are created by comparing the short-term energy and spectral spread with their respective thresholds. To declare a frame as containing speech, a feature must be above its threshold.

  7. The masks are combined. For a frame to be declared as speech, both the short-term energy and the spectral spread must be above their respective thresholds.

  8. Regions declared as speech are merged if the distance between them is less than MergeDistance.
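
Steps 4 through 8 can be sketched on toy data as follows. This is an illustration under stated assumptions, not the implementation used by detectSpeech: the per-frame feature values are invented, the histogram maxima M1 and M2 are hard-coded rather than found from histograms, movmedian stands in for the median filter, and each frame is mapped to samples by its hop only.

```matlab
% Toy per-frame features for 22 frames (hypothetical values).
energy = [0.1*ones(3,1); 0.9*ones(5,1); 0.1*ones(3,1); 0.9; ...
          0.1*ones(3,1); 0.9*ones(4,1); 0.1*ones(3,1)];
spread = [1000*ones(3,1); 3000*ones(16,1); 1000*ones(3,1)];

% Step 4: T = (W*M1 + M2)/(W + 1), with assumed histogram maxima.
W = 5;
energyThreshold = (W*0.1  + 0.9 )/(W + 1);   % M1 = 0.1,  M2 = 0.9
spreadThreshold = (W*1000 + 3000)/(W + 1);   % M1 = 1000, M2 = 3000

% Step 5: two successive five-element moving median filters.
energySmooth = movmedian(movmedian(energy,5),5);
spreadSmooth = movmedian(movmedian(spread,5),5);

% Steps 6 and 7: both features must exceed their thresholds.
speechMask = (energySmooth > energyThreshold) & (spreadSmooth > spreadThreshold);

% Locate runs of consecutive speech frames.
d = diff([0; double(speechMask); 0]);
startFrames = find(d == 1);
endFrames   = find(d == -1) - 1;

% Map frames to samples (assumption: each frame spans one hop).
hopLength = 160;
regions = [(startFrames - 1)*hopLength + 1, endFrames*hopLength];

% Step 8: merge regions closer than mergeDistance samples.
% This toy loop assumes at least one region was found.
mergeDistance = 2400;
merged = regions(1,:);
for k = 2:size(regions,1)
    if regions(k,1) - merged(end,2) < mergeDistance
        merged(end,2) = regions(k,2);        % extend the previous region
    else
        merged(end+1,:) = regions(k,:);      % start a new region
    end
end
```

In this toy data, the isolated high-energy frame at position 12 is removed by the median filtering, so the mask keeps frames 4 to 8 and 16 to 19. The two resulting regions are 1121 samples apart, which is less than mergeDistance, so they merge into a single region.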

References

[1] Giannakopoulos, Theodoros. "A Method for Silence Removal and Segmentation of Speech Signals, Implemented in MATLAB." University of Athens, Athens, 2009.

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Version History

Introduced in R2020a