Use Datastores to Manage Audio Data Sets
Deep learning and machine learning models are popular for processing audio signals in a variety of tasks. Training these models requires working with large data sets that contain both audio data and labeling information. For example, when training a model to identify spoken commands, the data can be a collection of audio files, and the labels are the ground-truth commands for each file. Datastores are useful for working with large collections of data, and the audioDatastore object allows you to manage collections of audio files.
This example shows you how to use datastores to manage three different audio data sets. The first data set uses the names of the folders containing the audio files as labels, the second data set uses the file names as labels, and the third data set contains labels in a metadata file. You can then use these datastores to train machine learning or deep learning models on the audio data.
Data With Folder Name Labels
The Google Speech Commands data set [1] contains files with spoken command words stored in folders whose names are the word labels. Download and extract the data set.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
Create an audioDatastore object that points to the training data.
ads = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);
Extract the labels for each file from the folder names using the folders2labels function. Use countlabels to view the distribution of labels.
labels = folders2labels(ads);
countlabels(labels)
ans=30×3 table
    Label     Count    Percent
    ______    _____    _______

    bed        1340    2.6229
    bird       1411    2.7619
    cat        1399    2.7384
    dog        1396    2.7325
    down       1842    3.6055
    eight      1852    3.6251
    five       1844    3.6095
    four       1839    3.5997
    go         1861    3.6427
    happy      1373    2.6875
    house      1427    2.7932
    left       1839    3.5997
    marvin     1424    2.7873
    nine       1875    3.6701
    no         1853    3.6271
    off        1839    3.5997
⋮
Use combine to create a CombinedDatastore object from the audio data and the labels. Each call to read on the datastore returns one of the audio signals and its label.
lds = arrayDatastore(labels);
cds = combine(ads,lds);
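To spot-check the pairing, one option (an optional check, not part of the original steps) is preview, which returns the first audio signal and its label without affecting the read position.
% Inspect the first audio-label pair without advancing the datastore.
preview(cds)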
You can create a separate datastore for validation data by repeating the same steps after creating an audioDatastore that instead points to the validation subfolder of the data set. Alternatively, you can use splitlabels to separate an existing datastore into training and validation sets. Specify UnderlyingDatastoreIndex to indicate which of the underlying datastores in the combined datastore contains the labels.
idxs = splitlabels(cds,0.8,"randomized",UnderlyingDatastoreIndex=2);
trainDs = subset(cds,idxs{1});
valDs = subset(cds,idxs{2});
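As an optional check, you can verify that the split approximately preserves the label proportions by counting the labels in each subset.
% Each label should be split roughly 80/20 between the two subsets.
countlabels(trainDs,UnderlyingDatastoreIndex=2)
countlabels(valDs,UnderlyingDatastoreIndex=2)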
Call read on the training datastore. The function returns both the audio signal and the label in a cell array.
read(trainDs)
ans=1×2 cell array
{14861×1 double} {[bed]}
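Each call to read advances through the datastore. To start again from the first file, for example before iterating over the data a second time, call reset.
% Return the datastore to its initial read position.
reset(trainDs)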
Data With File Name Labels
The Free Spoken Digit Dataset (FSDD) [2] contains recordings of spoken digits in files whose names contain the digit labels as well as speaker labels. Download the data set and create an audioDatastore that points to the data.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD","recordings");
ads = audioDatastore(dataset);
Select a random file from the data set and display its name. The file name is formatted as digitLabel_speakerName_index.
[~,name] = fileparts(ads.Files{randi(length(ads.Files))})
name = '1_jackson_45'
Use filenames2labels to extract the digit labels from the file names. Combine the labels with the audio into a CombinedDatastore and view the label distribution of the data set.
labels = filenames2labels(ads,ExtractBefore="_");
lds = arrayDatastore(labels);
cds = combine(ads,lds);
countlabels(cds,UnderlyingDatastoreIndex=2)
ans=10×3 table
    Label    Count    Percent
    _____    _____    _______

      0       200       10
      1       200       10
      2       200       10
      3       200       10
      4       200       10
      5       200       10
      6       200       10
      7       200       10
      8       200       10
      9       200       10
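The file names also encode the speaker. As a sketch of extracting those labels instead, assuming every file name contains exactly two underscores as in digitLabel_speakerName_index, you can pull out the middle token.
% Extract the speaker name between the two underscores, e.g., "jackson".
[~,names] = fileparts(ads.Files);
speakers = categorical(extractBetween(string(names),"_","_"));
countlabels(speakers)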
Data With Metadata File
The Mozilla Common Voice data set [3] contains recordings of subjects speaking short sentences. The data set has a metadata file with various labels, including sentence transcriptions and speaker IDs. Download the data set and create an audioDatastore that points to the training data.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","commonvoice.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"commonvoice","train");
ads = audioDatastore(fullfile(dataset,"clips"));
Read the metadata file into a table.
metadata = readtable(fullfile(dataset,"train.tsv"),FileType="text");
Assert that the order of the files in the datastore matches the order of the entries in the metadata table. This ensures that you can directly associate each row of metadata with the corresponding file.
[~,adsFilenames,~] = fileparts(ads.Files);
assert(length(adsFilenames)==length(metadata.path))
assert(all(strcmp(adsFilenames,metadata.path)))
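If the assertions fail because the two orders differ, one possible fix (a sketch, assuming every datastore file appears exactly once in metadata.path) is to reorder the metadata rows to match the datastore.
% Find each datastore file in the metadata and reorder the rows to match.
[isFound,loc] = ismember(adsFilenames,metadata.path);
assert(all(isFound))
metadata = metadata(loc,:);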
Create a CombinedDatastore using the transcribed sentences as labels.
sentences = arrayDatastore(string(metadata.sentence));
transcriptDs = combine(ads,sentences);
Create another CombinedDatastore with speaker IDs as labels. For simplicity, rename the speaker ID labels to natural numbers.
speakerLabels = categorical(metadata.client_id);
speakerIDs = string(1:length(categories(speakerLabels)));
speakerLabels = renamecats(speakerLabels,speakerIDs);
labelsDs = arrayDatastore(speakerLabels);
speakerDs = combine(ads,labelsDs);
countlabels(speakerDs,UnderlyingDatastoreIndex=2)
ans=595×3 table
    Label    Count    Percent
    _____    _____    _______

     1         1      0.05
     10        1      0.05
     100       3      0.15
     101       4       0.2
     102      36       1.8
     103       4       0.2
     104       1      0.05
     105       2       0.1
     106       4       0.2
     107       1      0.05
     108       1      0.05
     109       1      0.05
     11        4       0.2
     110       1      0.05
     111       1      0.05
     112      10       0.5
⋮
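Many speakers have only a handful of recordings. If a task requires a minimum number of examples per speaker, one option (a sketch, not part of the original steps; the threshold of 10 is arbitrary) is to keep only the files from sufficiently represented speakers.
% Keep only the files whose speaker has at least 10 recordings.
counts = countlabels(speakerLabels);
commonSpeakers = counts.Label(counts.Count >= 10);
speakerDsPruned = subset(speakerDs,find(ismember(speakerLabels,commonSpeakers)));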
Next Steps
You can now use these data sets to train deep learning or machine learning models. Use read and readall to access the data and labels. You can also use transform to create a new datastore that performs feature extraction on the audio data.
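For example, here is a minimal sketch of feature extraction with transform, assuming the 16 kHz Speech Commands training datastore trainDs from earlier and using the Audio Toolbox audioFeatureExtractor object; any function that maps an audio signal to features could take its place.
% Extract MFCCs from each audio signal while passing the label through.
afe = audioFeatureExtractor(SampleRate=16e3,mfcc=true);
featureDs = transform(trainDs,@(data){extract(afe,data{1}),data{2}});
featuresAndLabel = read(featureDs);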
References
[1] Warden, P. "Speech Commands: A Public Dataset for Single-Word Speech Recognition." 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.
[2] Jackson, Zohar, César Souza, Jason Flaks, Yuxin Pan, Hereman Nicolas, and Adhish Thite. "Jakobovski/free-spoken-digit-dataset: v1.0.8." Zenodo, August 9, 2018. https://doi.org/10.5281/zenodo.1342401.
[3] Mozilla Common Voice. https://commonvoice.mozilla.org/en.