GPU Acceleration of Scalograms for Deep Learning
This example shows how you can accelerate scalogram computation using GPUs. The computed scalograms are used as input features to deep convolutional neural networks (CNNs) for ECG and spoken digit classification.
Using a GPU requires Parallel Computing Toolbox™. To see which GPUs are supported, see GPU Computing Requirements (Parallel Computing Toolbox). The audio section of this example requires Audio Toolbox™ to use the audio datastore and transformed datastore.
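You can verify that MATLAB detects a supported GPU, and inspect its properties such as compute capability, by calling the gpuDevice function:

gpuDevice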
Scalogram Computation Using GPU
The most efficient way to compute scalograms on the GPU is to use cwtfilterbank. The steps to compute scalograms on the GPU are:

1. Construct cwtfilterbank with the desired property settings.
2. Move the signal to the GPU using gpuArray.
3. Use the filter bank wt method to compute the continuous wavelet transform (CWT).
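Condensed into code, the three steps look like this (a minimal sketch using a random test signal; the names fb, x, and cfs are placeholders):

fb = cwtfilterbank('SignalLength',1024);   % 1. construct the filter bank
x = gpuArray(randn(1024,1,'single'));      % 2. move the signal to the GPU
cfs = fb.wt(x);                            % 3. compute the CWT on the GPU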
The first time you use the wt method, cwtfilterbank caches the wavelet filters on the GPU. As a result, you realize substantial time savings when you obtain scalograms of multiple signals using the same filter bank and the same data type. The following demonstrates the recommended workflow. As an example, use a sample of guitar music containing 661,500 samples.
[y,fs] = audioread('guitartune.wav');
plot(y)
grid on
Because most NVIDIA® GPUs are significantly more efficient with single-precision than with double-precision data, cast the signal to single precision.
y = single(y);
Construct the filter bank to match the signal length and the sampling frequency. For deep learning, the sampling frequency is normally not necessary and therefore can be excluded.
fb = cwtfilterbank('SignalLength',length(y),'SamplingFrequency',fs);
Finally, move the signal to the GPU using gpuArray and compute the CWT of the data. Plot the resulting scalogram.
[cfs,f] = fb.wt(gpuArray(y));
t = 0:1/fs:(length(y)*1/fs)-1/fs;
imagesc(t,f,abs(cfs))
axis xy
ylabel('Hz')
xlabel('Seconds')
Use gather to bring the CWT coefficients and any other outputs back to the CPU.
cfs = gather(cfs);
f = gather(f);
To demonstrate the efficiency gained in using the GPU, time the CWT computation on the GPU and CPU for the same signal. The GPU compute times reported here are obtained using an NVIDIA Titan V with a compute capability of 7.0.
ygpu = gpuArray(y);
fgpu = @()fb.wt(ygpu);
Tgpu = gputimeit(fgpu)
Tgpu = 0.3503
Repeat the same measurement on the CPU and examine the ratio of GPU to CPU time to see the reduction in computation time.
fcpu = @()fb.wt(y);
Tcpu = timeit(fcpu)
Tcpu = 4.3036
Tcpu/Tgpu
ans = 12.2852
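Note that gputimeit calls the function several times and averages the result, so the one-time cost of caching the filters does not dominate the measurement. You can observe the caching effect directly by timing the first and second calls on a fresh filter bank; the following is a rough sketch (fb2 is a second filter bank created only for this comparison):

fb2 = cwtfilterbank('SignalLength',length(y),'SamplingFrequency',fs);
tic; cfs2 = fb2.wt(ygpu); wait(gpuDevice); tFirst = toc   % first call caches the filters on the GPU
tic; cfs2 = fb2.wt(ygpu); wait(gpuDevice); tSecond = toc  % filters already cached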
Scalograms in Deep Learning
A common application of the CWT in deep learning is to use the scalogram of a signal as the input "image" to a deep CNN. This requires computing multiple scalograms, one for each signal in the training, validation, and test sets. While GPUs are often used to speed up training and inference in the deep network, it is also beneficial to use GPUs to accelerate any feature extraction or data preprocessing needed to make the deep network more robust.
To illustrate this, the following section applies wavelet scalograms to human electrocardiogram (ECG) classification. Scalograms are used with the same data treated in Classify Time Series Using Wavelet Analysis and Deep Learning. In that example, transfer learning with GoogLeNet and SqueezeNet was used to classify ECG waveforms into one of three categories. The description of the data and how to obtain it is repeated here for convenience.
ECG Data Description and Download
The ECG data is obtained from three groups of people: persons with cardiac arrhythmia (ARR), persons with congestive heart failure (CHF), and persons with normal sinus rhythms (NSR). In total there are 162 ECG recordings from three PhysioNet databases: MIT-BIH Arrhythmia Database [2][3], MIT-BIH Normal Sinus Rhythm Database [3], and The BIDMC Congestive Heart Failure Database [1][3]. More specifically, there are 96 recordings from persons with arrhythmia, 30 recordings from persons with congestive heart failure, and 36 recordings from persons with normal sinus rhythms. The goal is to train a model to distinguish between ARR, CHF, and NSR.
You can obtain this data from the MathWorks GitHub repository. To download the data from the website, click Code and select Download ZIP. Save the file physionet_ECG_data-main.zip in a folder where you have write permission. The instructions for this example assume you have downloaded the file to your temporary directory, tempdir, in MATLAB®. Modify the subsequent instructions for unzipping and loading the data if you choose to download the data in a folder different from tempdir.
After downloading the data from GitHub, unzip the file in your temporary directory.
unzip(fullfile(tempdir,'physionet_ECG_data-main.zip'),tempdir)
Unzipping creates the folder physionet_ECG_data-main in your temporary directory. This folder contains the text file README.md and ECGData.zip. The ECGData.zip file contains

ECGData.mat
Modified_physionet_data.txt
License.txt

ECGData.mat holds the data used in this example. The text file Modified_physionet_data.txt is required by PhysioNet's copying policy and provides the source attributions for the data as well as a description of the preprocessing steps applied to each ECG recording.
Unzip ECGData.zip in physionet_ECG_data-main. Load the data file into your MATLAB workspace.
unzip(fullfile(tempdir,'physionet_ECG_data-main','ECGData.zip'),...
    fullfile(tempdir,'physionet_ECG_data-main'))
load(fullfile(tempdir,'physionet_ECG_data-main','ECGData.mat'))
ECGData is a structure array with two fields: Data and Labels. The Data field is a 162-by-65536 matrix where each row is an ECG recording sampled at 128 hertz. Labels is a 162-by-1 cell array of diagnostic labels, one for each row of Data. The three diagnostic categories are 'ARR', 'CHF', and 'NSR'. Use the helper function helperRandomSplit to split the data into training and validation sets with 80% of the data allocated for training and 20% for validation. Convert the ECG diagnostic labels into categoricals.
[trainData, validationData, trainLabels, validationLabels] = helperRandomSplit(80,ECGData);
trainLabels = categorical(trainLabels);
validationLabels = categorical(validationLabels);
There are 130 records in the trainData set and 32 records in validationData. By design, the training data contains 80.25% (130/162) of the data. Recall that the ARR class represents 59.26% of the data (96/162), the CHF class represents 18.52% (30/162), and the NSR class represents 22.22% (36/162). Examine the percentage of each class in the training and validation sets. The percentages in each are consistent with the overall class percentages in the data set.
Ctrain = countcats(trainLabels)./numel(trainLabels).*100
Ctrain = 3×1
59.2308
18.4615
22.3077
Cvalid = countcats(validationLabels)./numel(validationLabels).*100
Cvalid = 3×1
59.3750
18.7500
21.8750
Scalograms With Deep CNN — ECG Data
Scalogram Computation on the GPU
Compute the scalograms for both the training and validation sets. Set useGPU to true to use the GPU, or false to compute the scalograms on the CPU. To mitigate the effect of large input matrices on the CNN and to create more training and validation examples, helperECGScalograms splits each ECG waveform into four nonoverlapping segments of 16384 samples each and computes scalograms for all four segments, as illustrated below. Replicate the labels to match the expanded data set. In this case, also obtain an estimate of the elapsed computation time.
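To make the segmentation concrete, the following lines (an illustration only, not part of the preprocessing) reshape the first training recording into its four segments. Each 65,536-sample row yields four nonoverlapping 16,384-sample columns:

sig = single(trainData(1,:));
segments = reshape(sig,16384,4);   % each column is one nonoverlapping segment
size(segments)                     % expect 16384-by-4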
frameLength = 16384;
useGPU = true;
tic;
Xtrain = helperECGScalograms(trainData,frameLength,useGPU);
Computing scalograms...
Processed 50 files out of 130
Processed 100 files out of 130
...done
T = toc;
sprintf('Elapsed time is %1.2f seconds',T)
ans = 'Elapsed time is 4.59 seconds'
trainLabels = repelem(trainLabels,4);
With the Titan V GPU, 520 scalograms were computed in approximately 4.6 seconds. Setting useGPU to false and repeating the above computation demonstrates the speedup obtained by using the GPU. In this case, using the CPU required 33.3 seconds to compute the scalograms. The GPU computation was more than 7 times faster.
Repeat the same process for the validation data.
useGPU = true;
Xvalid = helperECGScalograms(validationData,frameLength,useGPU);
Computing scalograms...
...done
validationLabels = repelem(validationLabels,4);
Next, set up a deep CNN to process both the training and validation sets. The simple network used here is not optimized. This CNN is only used to illustrate the end-to-end workflow for cases where the scalograms fit in memory.
sz = size(Xtrain);
specSize = sz(1:2);
imageSize = [specSize 1];
dropoutProb = 0.3;
layers = [
    imageInputLayer(imageSize)
    convolution2dLayer(3,12,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,20,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(3,32,'Padding','same')
    batchNormalizationLayer
    reluLayer
    dropoutLayer(dropoutProb)
    fullyConnectedLayer(3)
    softmaxLayer
    ];
Use the following training options.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-4, ...
    'LearnRateDropPeriod',18, ...
    'MiniBatchSize',15, ...
    'MaxEpochs',25, ...
    'L2Regularization',1e-1, ...
    'Plots','training-progress', ...
    'Metrics','accuracy', ...
    'Verbose',false, ...
    'Shuffle','every-epoch', ...
    'ExecutionEnvironment','auto', ...
    'ValidationData',{Xvalid,validationLabels});
Train the network and measure the validation error.
trainnet(Xtrain,trainLabels,layers,"crossentropy",options);
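If you capture the network returned by trainnet, for example trainedNet = trainnet(Xtrain,trainLabels,layers,"crossentropy",options), you can also compute the final validation accuracy explicitly. A minimal sketch, assuming the network was captured as trainedNet:

scores = minibatchpredict(trainedNet,Xvalid);                  % class scores for the validation set
YvalPred = scores2label(scores,categories(validationLabels));  % convert scores to categorical labels
valAccuracy = mean(YvalPred == validationLabels)*100           % percent correct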
Even though the simple CNN used here is not optimized, the validation accuracy is consistently in the high 80 to low 90 percent range. This is comparable to the validation accuracy achieved with the more powerful and optimized SqueezeNet in the Classify Time Series Using Wavelet Analysis and Deep Learning example. Further, this is a much more efficient use of the scalogram, because in that example the scalograms had to be rescaled as RGB images compatible with SqueezeNet, saved to disk in an appropriate image format, and then fed to the deep network using imageDatastore.
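For reference, a minimal sketch of that image-file route, rescaling one scalogram to an RGB image (an illustration only; it assumes Image Processing Toolbox for im2uint8 and imresize, and the file name is arbitrary):

exampleScalogram = Xtrain(:,:,1,1);                           % one scalogram magnitude
img = ind2rgb(im2uint8(rescale(exampleScalogram)),jet(256));  % map magnitudes to RGB via a colormap
img = imresize(img,[227 227]);                                % SqueezeNet input size
imwrite(img,fullfile(tempdir,'scalogram_example.jpg'));       % save to disk for imageDatastore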
Spoken Digit Recognition — GPU Computing Using a Transformed Datastore
This section shows how to accelerate scalogram computation using the GPU in a transformed datastore workflow.
Data
Download the Free Spoken Digit Dataset (FSDD). FSDD consists of 2000 recordings in English of the digits 0 through 9 obtained from four speakers. Two of the speakers are native speakers of American English, one speaker is a nonnative speaker of English with a Belgian French accent, and one speaker is a nonnative speaker of English with a German accent. The data is sampled at 8000 Hz.
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"FSDD");
Use audioDatastore to manage data access.
ads = audioDatastore(dataset,IncludeSubfolders=true);
The helper function helpergenLabels creates a categorical array of labels from the FSDD files. List the classes and the number of examples in each class.
ads.Labels = helpergenLabels(ads);
summary(ads.Labels)
0      200 
1      200 
2      200 
3      200 
4      200 
5      200 
6      200 
7      200 
8      200 
9      200 
Transformed Datastore
First split the FSDD into training and test sets. Allocate 80% of the data to the training set and retain 20% for the test set.
rng default;
ads = shuffle(ads);
[adsTrain,adsTest] = splitEachLabel(ads,0.8);
countEachLabel(adsTrain)
ans=10×2 table
Label Count
_____ _____
0 160
1 160
2 160
3 160
4 160
5 160
6 160
7 160
8 160
9 160
Next, create the CWT filter bank and the transformed datastores for both the training and test data using the helper function helperDigitScalogram. The transformed datastore converts each recording into a signal of length 8192, computes the scalogram on the GPU, and gathers the data back onto the CPU.
reset(gpuDevice(1))
fb = cwtfilterbank('SignalLength',8192);
adsSCTrain = transform(adsTrain,@(audio,info)helperDigitScalogram(audio,info,fb),'IncludeInfo',true);
adsSCTest = transform(adsTest,@(audio,info)helperDigitScalogram(audio,info,fb),'IncludeInfo',true);
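As an optional sanity check (not part of the original workflow), you can read one example from the transformed datastore and confirm its dimensions match the network input defined below:

exampleOut = read(adsSCTrain);   % 1-by-2 cell array: {scalogram, label}
size(exampleOut{1})              % expect 101-by-8192
reset(adsSCTrain)                % rewind so training starts from the first file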
Deep CNN
Construct a deep CNN to train with the transformed datastore, adsSCTrain. As in the first example, the network is not optimized. The point is to show the workflow using scalograms computed on the GPU for out-of-memory data.
numClasses = 10;
dropoutProb = 0.2;
numF = 12;
layers = [
    imageInputLayer([101 8192 1])
    convolution2dLayer(5,numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')
    convolution2dLayer(3,2*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')
    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')
    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2)
    dropoutLayer(dropoutProb)
    fullyConnectedLayer(numClasses)
    softmaxLayer
    ];
Set the training options for the network.
miniBatchSize = 25;
options = trainingOptions('adam', ...
    'InitialLearnRate',1e-4, ...
    'MaxEpochs',30, ...
    'MiniBatchSize',miniBatchSize, ...
    'Shuffle','every-epoch', ...
    'Plots','training-progress', ...
    'Metrics','accuracy', ...
    'Verbose',false, ...
    'ExecutionEnvironment','gpu');
Train the network.
trainedNet = trainnet(adsSCTrain,layers,"crossentropy",options);
In this instance, training completed in around 30 minutes. If you comment out the call to gpuArray in helperDigitScalogram and use the CPU to obtain the scalograms, training time increases significantly. In this case, an increase from 30 minutes to 55 minutes was observed.
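For reference, the CPU-only variant amounts to deleting the gpuArray call; a sketch follows (helperDigitScalogramCPU is a hypothetical name for this variant):

function [dataout,info] = helperDigitScalogramCPU(audioin,info,fb)
audioin = single(audioin);            % no gpuArray call: the signal stays on the CPU
audioin = helperReadSPData(audioin);  % truncate or pad to 8192 samples and normalize
cfs = abs(fb.wt(audioin));            % CWT computed on the CPU; no gather needed
dataout = {cfs,info.Label};
end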
Use the trained network to predict the digit labels for the test set.
score = minibatchpredict(trainedNet,adsSCTest,'ExecutionEnvironment','cpu');
classNames = categories(ads.Labels);
Ypredicted = scores2label(score,classNames);
cnnAccuracy = sum(Ypredicted == adsTest.Labels)/numel(Ypredicted)*100
cnnAccuracy = 93.5000
The inference time using the GPU was approximately 22 seconds. Using the CPU, inference time roughly doubled to 45 seconds.
The accuracy of the trained network on the test data is around 94%. This is comparable to the performance in Spoken Digit Recognition with Wavelet Scattering and Deep Learning.
Summary
This example showed how to use the GPU to accelerate scalogram computation. It presented the recommended workflow for efficiently computing scalograms, both for in-memory data and for out-of-memory data read from disk using transformed datastores.
References
[1] Baim, D. S., W. S. Colucci, E. S. Monrad, H. S. Smith, R. F. Wright, A. Lanoue, D. F. Gauthier, B. J. Ransil, W. Grossman, and E. Braunwald. "Survival of patients with severe congestive heart failure treated with oral milrinone." Journal of the American College of Cardiology. Vol. 7, Number 3, 1986, pp. 661–670.

[2] Moody, G. B., and R. G. Mark. "The impact of the MIT-BIH Arrhythmia Database." IEEE Engineering in Medicine and Biology Magazine. Vol. 20, Number 3, May–June 2001, pp. 45–50. (PMID: 11446209)

[3] Goldberger, A. L., L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals." Circulation. Vol. 101, Number 23, 2000, pp. e215–e220. [Circulation Electronic Pages; http://circ.ahajournals.org/content/101/23/e215.full]. doi: 10.1161/01.CIR.101.23.e215.
Helper Functions
function Labels = helpergenLabels(ads)
% This function is only for use in Wavelet Toolbox examples. It may be
% changed or removed in a future release.
tmp = cell(numel(ads.Files),1);
expression = "[0-9]+_";
for nf = 1:numel(ads.Files)
    idx = regexp(ads.Files{nf},expression);
    tmp{nf} = ads.Files{nf}(idx);
end
Labels = categorical(tmp);
end
function X = helperECGScalograms(data,window,useGPU)
% Compute scalograms for each nonoverlapping segment of length window in
% every row of data. Returns a 4-D single array suitable for imageInputLayer.
disp("Computing scalograms...");
Nsig = size(data,1);
Nsamp = size(data,2);
Nsegment = Nsamp/window;
fb = cwtfilterbank('SignalLength',window,'Voices',10);
Ns = length(fb.Scales);
X = zeros([Ns,window,1,Nsig*Nsegment],'single');
start = 0;
if useGPU
    data = gpuArray(single(data'));
else
    data = single(data');
end
for ii = 1:Nsig
    ts = data(:,ii);
    ts = reshape(ts,window,Nsegment);
    % Normalize each segment to zero mean and unit maximum amplitude
    ts = (ts-mean(ts))./max(abs(ts));
    for kk = 1:size(ts,2)
        cfs = fb.wt(ts(:,kk));
        X(:,:,1,kk+start) = gather(abs(cfs));
    end
    start = start+Nsegment;
    if mod(ii,50) == 0
        disp("Processed " + ii + " files out of " + Nsig)
    end
end
disp("...done");
data = gather(data);
end
function [x,info] = helperReadSPData(x,info)
% This function is only for use in Wavelet Toolbox examples. It may change
% or be removed in a future release.
N = numel(x);
if N > 8192
    x = x(1:8192);
elseif N < 8192
    pad = 8192-N;
    prepad = floor(pad/2);
    postpad = ceil(pad/2);
    x = [zeros(prepad,1) ; x ; zeros(postpad,1)];
end
x = x./max(abs(x));
end
function [dataout,info] = helperDigitScalogram(audioin,info,fb)
% Compute the scalogram of one recording on the GPU and gather the result
% back to the CPU for training.
audioin = single(audioin);
audioin = gpuArray(audioin);
audioin = helperReadSPData(audioin);
cfs = gather(abs(fb.wt(audioin)));
audioin = gather(audioin);
dataout = {cfs,info.Label};
end