Manage Data Sets for Machine Learning and Deep Learning Workflows
Use MATLAB® and Signal Processing Toolbox™ functionality to create a successful artificial intelligence (AI) workflow from labeling to training to deployment.
Common AI Tasks
Common AI tasks are signal classification, sequence-to-sequence classification, and regression. An AI model predicts:
For signal classification — A discrete class label for each input signal
For sequence-to-sequence classification — A label for each time step of the sequence data
For regression — A continuous numeric value
Data Organization
For many machine learning and deep learning applications, data sets are large and consist of both signal and label variables. Based on how your data set is organized, you can use datastores and functions in MATLAB and Signal Processing Toolbox to manage your data.
There are various methods to collect and store data that influence how you can access it in a workflow. In the data preparation stage, you might come across one or more of these common questions:
How do I organize my data?
How do I access data for training?
How do I create labels?
How do I combine signal and label data?
The following common data organization scenarios show how to create datastores that correspond to the way your data is stored, so that you can access and prepare the data for your workflow.
Signal and label variables stored separately in memory

Consider a data set consisting of signals stored in the matrix sig and corresponding labels stored in the array lbls. Create an array datastore for each variable.

    ads1 = arrayDatastore(sig);
    ads2 = arrayDatastore(lbls);

Use the combine function to create a single datastore that contains the signal and label data.

    cds = combine(ads1,ads2);

Determine the count of each label in the data set. Specify the index of the underlying datastore that contains the labels.

    cnt = countlabels(cds,UnderlyingDatastoreIndex=2)

    cnt =
      4×3 table
        Label    Count    Percent
        _____    _____    _______
          a       20        25
          b       20        25
          c       20        25
          d       20        25

Use the splitlabels function to split the data into training, validation, and testing subsets.

    idxs = splitlabels(cds,[0.7 0.2],"randomized");
    trainDs = subset(cds,idxs{1});
    valDs = subset(cds,idxs{2});
    testDs = subset(cds,idxs{3});

Count the number of labels in the training subset datastore.

    trainCnt = countlabels(trainDs,UnderlyingDatastoreIndex=2)

    trainCnt =
      4×3 table
        Label    Count    Percent
        _____    _____    _______
          a       14        25
          b       14        25
          c       14        25
          d       14        25
Signal and label variables stored in separate MAT-files

Consider a data set consisting of two sets of MAT-files. The first set contains signal data and the second set contains the corresponding labels. All files are saved in the same folder and have either "signal" or "label" in their filenames. Create a signal datastore that points to the folder.

    sds = signalDatastore(datasetFolder);

Use the subset function to create separate datastores for the signal files and the label files.

    sigds = subset(sds,contains(sds.Files,"signal"));
    lblds = subset(sds,contains(sds.Files,"label"));

Read the label data into memory. Convert the labels to a categorical array with categories 'a', 'b', and 'c'.

    labeldata = readall(lblds);
    lblcat = categorical(labeldata,{'a' 'b' 'c'});

Create an array datastore from the categorical labels and combine it with the signal datastore.

    ads = arrayDatastore(lblcat);
    allds = combine(sigds,ads);

Preview the first signal and the corresponding label in the datastore.

    preview(allds)

    ans =
      1×2 cell array
        {1000×1 double}    {[a]}

Note: A datastore parses files in alphabetical order. To ensure that signal variables and label variables stored in separate files are paired correctly, use a matching identifier in the corresponding filenames.
Signal and label variables stored in a single MAT-file

Consider a data set consisting of MAT-files that each contain both signal (sig) and label (lbl) variables. Create a signal datastore that reads both variables from each file.

    sds = signalDatastore(datasetFolder,IncludeSubfolders=true, ...
        SignalVariableNames=["sig" "lbl"]);

Read the first pair of signal and label data.

    read(sds)

    ans =
      2×1 cell array
        {225000×1 double     }
        {225000×1 categorical}

Divide the data at random into training and testing sets. Use 80% of the data to train the network and 20% of the data to test the network.

    [trainIdx,~,testIdx] = dividerand(numel(sds.Files),0.8,0,0.2);
    trainds = subset(sds,trainIdx);
    testds = subset(sds,testIdx);
Signals stored in MAT-files and labels stored in memory

Consider a data set consisting of signals stored in MAT-files in the location folder, with the corresponding labels stored in the in-memory array lbls. Create a signal datastore for the signal files and an array datastore for the labels.

    sds = signalDatastore(folder);
    ads = arrayDatastore(lbls);

Use the combine function to create a datastore that contains both the signal and label data.

    cds = combine(sds,ads)

    cds =
      CombinedDatastore with properties:
          UnderlyingDatastores: {[1×1 signalDatastore]  [1×1 matlab.io.datastore.ArrayDatastore]}
        SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"    …]
Signals stored in MAT-files saved in folders containing label names

Consider a data set consisting of signals stored in MAT-files. The files are saved in folders, and each folder name corresponds to a label. Create a signal datastore that points to the location of the folders.

    sds = signalDatastore(location);

Use the folders2labels function to generate a label list from the folder names, and store the labels in an array datastore.

    lbls = folders2labels(location,FileExtensions=".mat");
    ads = arrayDatastore(lbls);

Combine the signal datastore and the array datastore using the combine function.

    cds = combine(sds,ads);
Signals stored in MAT-files and region-of-interest (ROI) limits stored in separate MAT-files

Consider a data set consisting of MAT-files that contain signal data and other MAT-files that contain label data. The label data is stored as region-of-interest tables that define a label value for different signal regions. Create two separate datastores to consume the data.

    sds1 = signalDatastore(FileLocation1,SampleRate=fs);
    sds2 = signalDatastore(FileLocation2, ...
        SignalVariableNames=["LabelVals";"LabelROIs"]);

Convert the ROI limits and labels to a categorical sequence that you can use to train a model.

    i = 1;
    while hasdata(sds1)
        signal = read(sds1);
        label = read(sds2);

        % Convert label values to categorical vector
        labelCats = categorical(label{2,1}.Value,{'a' 'b' 'c' 'd'});

        % Convert label values and ROI limits to table for signalMask input
        roiTable = table(label{2,1}.ROILimits,labelCats);
        m = signalMask(roiTable);

        % Obtain categorical sequence mask
        mask = catmask(m,length(signal));

        lbls{i} = mask;
        i = i+1;
    end

    % Store categorical sequence masks in an array datastore
    ads = arrayDatastore(lbls,IterationDimension=2);

Combine the signal datastore and the array datastore using the combine function.

    sds4 = combine(sds1,ads);
Labeled signal set containing signal and label data

Consider a labeled signal set lss that contains both signal data and label data. Get the names of the labels in the labeled signal set.

    lblnames = getLabelNames(lss)

    lblnames =
      3×1 string
        "WhaleType"
        "MoanRegions"
        "TrillRegions"

Use the createDatastores function to create a signal datastore and an array datastore that contain the signal data and the corresponding label data.

    [sds,ads] = createDatastores(lss,lblnames)

    sds =
      signalDatastore with properties:
        MemberNames: {
                     'Whale1';
                     'Whale2'
                     }
            Members: {2×1 cell}
           ReadSize: 1
         SampleRate: 4000

    ads =
      ArrayDatastore with properties:
                  ReadSize: 1
        IterationDimension: 1
                OutputType: "cell"
Input and output signals stored in the same MAT-file

Consider a data set consisting of MAT-files stored in the location folder, where each file contains an input signal (xIn) and an output signal (xOut). Create a signal datastore that reads both variables from each file.

    sds = signalDatastore(folder,SignalVariableNames=["xIn" "xOut"]);

You can use this datastore to train a network that maps the input signals to the output signals.

Consider a different data set consisting of MAT-files stored in the location location, where each file contains multiple input signals (a, b, and c) and multiple output signals (d and e). Create separate datastores for the input signals and the output signals.

    inDs = signalDatastore(location,SignalVariableNames=["a" "b" "c"]);
    outDs = signalDatastore(location,SignalVariableNames=["d" "e"]);
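If your training function expects a single datastore, one option, sketched here under the assumption that the input and output signals in each file belong to the same observation, is to combine the two datastores.

    % Pair the input signals with the corresponding output signals
    cds = combine(inDs,outDs);

    % Each read returns one observation: the inputs a, b, and c followed by the outputs d and e
    observation = preview(cds);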
When your data is ready, you can use the trainnet (Deep Learning Toolbox) function to train a neural network. Common functions that you can use for network training, like trainnet or minibatchqueue (Deep Learning Toolbox), accept datastores as the input for training data and responses.

    net = trainnet(ds,...)

Note

When data is stored in memory, you can input a cell array directly to the trainnet function. If you need to transform in-memory data before training, use a TransformedDatastore.
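For example, here is a minimal sketch of wrapping in-memory data in datastores and transforming it on the fly. The variable names sigs and lbls, the normalization step, and the commented training call are placeholder assumptions, not part of this page.

    % sigs: N-by-1 cell array of in-memory signals; lbls: N-by-1 categorical labels (hypothetical)
    sigDs = arrayDatastore(sigs,OutputType="same");   % keep each signal as stored in its cell
    lblDs = arrayDatastore(lbls);
    cds = combine(sigDs,lblDs);

    % Normalize each signal as it is read; the result is a TransformedDatastore
    tds = transform(cds,@(d) {normalize(d{1}),d{2}});

    % You can then pass tds to a training function, for example:
    % net = trainnet(tds,layers,"crossentropy",options);   % layers and options are placeholders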
Data Preprocessing
Some workflows require you to preprocess the data before feeding it to a network. For example, you can resample, resize, or filter signals before or during training. You can precompute features or use datastore transformations to prepare the data for training.
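For instance, here is a minimal sketch of resampling every signal in a datastore with a datastore transformation. The datastore ds and the original and target sample rates are assumptions for illustration.

    % Resample each signal from an assumed original rate to an assumed target rate
    fsOrig = 2000;   % hypothetical original sample rate
    fsNew = 1000;    % hypothetical target sample rate
    resampledDs = transform(ds,@(x) resample(x,fsNew,fsOrig));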
Example: Compute Fourier synchrosqueezed transform (FSST)
Calculate the FSST of each signal in datastore ds.

    fsstDs = transform(ds,@fsst);
The transformed data fits in memory. Use the readall function to read all of the data from the TransformedDatastore into memory so that the FSST computations are performed only once during the training step.

    transformedData = readall(fsstDs);
Example: Extract time-frequency features from signal data
Obtain the short-time Fourier transform (STFT) of each signal in datastore ds. Call the transform function to compute the STFT, and then use the writeall function to write the output to disk.

    tds = transform(ds,@stft);
    writeall(tds,outputLocation);

Create a new datastore that points to the out-of-memory features.

    ds = signalDatastore(outputLocation);
Example: Extract spectral skewness and time-frequency ridges from signal data
Create a datastore that points to a location that contains signal data files. The sample rate is 1000 Hz.
    Fs = 1000;
    sds = signalDatastore(datasetFolder,IncludeSubfolders=true);
Create a signalTimeFrequencyFeatureExtractor object, specifying the sample rate. Enable spectral skewness and time-frequency ridges as the features to extract.

    tfFE = signalTimeFrequencyFeatureExtractor(SampleRate=Fs, ...
        SpectralSkewness=true,TFRidges=true);
Call the extract function to extract the specified features from each signal.

    numDataFiles = length(sds.Files);
    M = cell(numDataFiles,1);
    for i = 1:numDataFiles
        data = read(sds);
        [M{i},infoFeatures] = extract(tfFE,data);
    end
    Features = cell2mat(M);
Example: Filter and downsample signal data and downsample label data with custom preprocessing function
Create a datastore that points to a location containing both signal data files and label data files.
    sds = signalDatastore(location,SignalVariableNames=["data" "labels"]);
Define a custom preprocessing function that bandpass-filters and downsamples the signal data and the label data.
    function dataOut = downsampleData(dataIn)
        sig = dataIn{1};
        lbls = dataIn{2};
        filtsig = bandpass(sig,[10 400],3000);
        downsig = downsample(filtsig,3);
        downlbls = downsample(lbls,3);
        dataOut = {downsig,downlbls};
    end
Call the transform function on sds to apply the custom preprocessing function to each file.

    tds = transform(sds,@downsampleData);
For more information about preprocessing in deep learning workflows, see Preprocess Data for Domain-Specific Deep Learning Applications (Deep Learning Toolbox).
Workflow Scenarios
A general workflow for any machine learning or deep learning task involves these steps:
Data preparation
Network training
Model deployment
This table shows examples and functions you can use to go from preparing data to training a network for signal classification tasks.
Example | Data | Related Functions | Highlights |
---|---|---|---|
Spoken Digit Recognition with Custom Log Spectrogram Layer and Deep Learning | | | Predict labels for audio recordings using a deep convolutional neural network (DCNN) and a custom log spectrogram layer |
Hand Gesture Classification Using Radar Signals and Deep Learning | | | Preprocess signals using custom functions and train a multiple-input, single-output convolutional neural network (CNN) |
Train Spoken Digit Recognition Network Using Out-of-Memory Features | | | Predict labels for audio recordings using a network trained on mel-frequency spectrograms |
This table shows examples and functions you can use to go from preparing data to training a network for sequence-to-sequence classification tasks.
Example | Data | Related Functions | Highlights |
---|---|---|---|
Waveform Segmentation Using Deep Learning | | | Segment regions of interest in signals |
Classify Arm Motions Using EMG Signals and Deep Learning | | | Classify signal ROIs |
This table shows examples and functions you can use to go from preparing data to training a network for regression tasks.
Example | Data | Related Functions | Highlights |
---|---|---|---|
Denoise EEG Signals Using Differentiable Signal Processing Layers | | | Denoise signals using a regression model |
Tip

Use the read, readall, and writeall functions to read data in a datastore or write data from a datastore to files.

- read: Use this function to read data iteratively from a datastore that contains file data or in-memory data.
- readall: Use this function to read all the data in a datastore at once when the data set fits in memory. If the data set is too large to fit in memory, you can transform the data at each training epoch or use the writeall function to store the transformed data, which you can then read using a signalDatastore.
- writeall: Use this function to write preprocessed data that does not fit in memory to files. You can then create a new datastore that points to the location of the output files.
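For illustration, here is a minimal sketch of the iterative read pattern, assuming sds is a signalDatastore that already points to your data files.

    % Read one member at a time until the datastore is exhausted
    while hasdata(sds)
        [sig,info] = read(sds);   % next signal and its metadata
        % ... process sig here ...
    end
    reset(sds)                    % rewind so the datastore can be read from the beginning again

    % Alternatively, when the whole data set fits in memory, read everything at once
    allData = readall(sds);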
Available Data Sets
There are several data sets readily available for use in an AI workflow:
- QT Database: 210 ECG signals with region labels. Available for download at https://www.mathworks.com/supportfiles/SPT/data/QTDatabaseECGData.zip.
- EEGdenoiseNet: 4514 clean EEG segments and 3400 ocular artifact segments. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/EEGEOGDenoisingData.zip.
- UWB-gestures: 96 multichannel UWB impulse radar signals. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/uwb-gestures.zip.
- Myoelectric Data: 720 multichannel EMG signals with region labels. Available for download at https://ssd.mathworks.com/supportfiles/SPT/data/MyoelectricData.zip.
- Mendeley Data: 327 accelerometer signals with class labels. Available for download at https://ssd.mathworks.com/supportfiles/wavelet/crackDetection/transverse_crack.zip.
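For example, here is a minimal sketch of downloading and extracting one of these data sets; the destination folder name is a placeholder.

    % Download and extract the QT Database ECG data to a temporary folder
    url = "https://www.mathworks.com/supportfiles/SPT/data/QTDatabaseECGData.zip";
    datasetFolder = fullfile(tempdir,"QTDatabaseECGData");   % hypothetical destination
    unzip(url,datasetFolder);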
For additional data sets, see Time Series and Signal Data Sets (Deep Learning Toolbox).
Related Topics
- Datastores for Deep Learning (Deep Learning Toolbox)
- Signal Processing Applications (Deep Learning Toolbox)
- Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Sequence-to-Sequence Classification Using Deep Learning (Deep Learning Toolbox)
- Sequence-to-One Regression Using Deep Learning (Deep Learning Toolbox)