# 基于小波散射和深度学习的口述数字识别

### 数据

```downloadFolder = matlab.internal.examples.downloadSupportFile("audio","FSDD.zip"); dataFolder = tempdir; unzip(downloadFolder,dataFolder) dataset = fullfile(dataFolder,"FSDD");```

`ads = audioDatastore(dataset,IncludeSubfolders=true);`

```ads.Labels = helpergenLabels(ads); summary(ads.Labels)```
``` 0 200 1 200 2 200 3 200 4 200 5 200 6 200 7 200 8 200 9 200 ```

FSDD 数据集由 10 个平衡类组成，每个平衡类有 200 个录音。FSDD 的录音时间长度不等。FSDD 不是特别大，因此通读 FSDD 文件并构造信号长度直方图。

```LenSig = zeros(numel(ads.Files),1); nr = 1; while hasdata(ads) digit = read(ads); LenSig(nr) = numel(digit); nr = nr+1; end reset(ads) histogram(LenSig) grid on xlabel("Signal Length (Samples)") ylabel("Frequency")```

### 小波时间散射

```sf = waveletScattering(SignalLength=8192,InvarianceScale=0.22, ... SamplingFrequency=8000,OversamplingFactor=2);```

```rng("default") ads = shuffle(ads); [adsTrain,adsTest] = splitEachLabel(ads,0.8); countEachLabel(adsTrain)```
```ans=10×2 table Label Count _____ _____ 0 160 1 160 2 160 3 160 4 160 5 160 6 160 7 160 8 160 9 160 ```
`countEachLabel(adsTest)`
```ans=10×2 table Label Count _____ _____ 0 40 1 40 2 40 3 40 4 40 5 40 6 40 7 40 8 40 9 40 ```

```Xtrain = []; scatds_Train = transform(adsTrain,@(x)helperReadSPData(x)); while hasdata(scatds_Train) smat = read(scatds_Train); Xtrain = cat(2,Xtrain,smat); end```

```Xtest = []; scatds_Test = transform(adsTest,@(x)helperReadSPData(x)); while hasdata(scatds_Test) smat = read(scatds_Test); Xtest = cat(2,Xtest,smat); end```

```Strain = sf.featureMatrix(Xtrain); Stest = sf.featureMatrix(Xtest);```

```TrainFeatures = Strain(2:end,:,:); TrainFeatures = squeeze(mean(TrainFeatures,2))'; TestFeatures = Stest(2:end,:,:); TestFeatures = squeeze(mean(TestFeatures,2))';```

### SVM 分类器

```template = templateSVM(... KernelFunction="polynomial", ... PolynomialOrder=2, ... KernelScale="auto", ... BoxConstraint=1, ... Standardize=true); classificationSVM = fitcecoc( ... TrainFeatures, ... adsTrain.Labels, ... Learners=template, ... Coding="onevsone", ... ClassNames=categorical({'0'; '1'; '2'; '3'; '4'; '5'; '6'; '7'; '8'; '9'}));```

```partitionedModel = crossval(classificationSVM,KFold=5); [validationPredictions, validationScores] = kfoldPredict(partitionedModel); validationAccuracy = (1 - kfoldLoss(partitionedModel,LossFun="ClassifError"))*100```
```validationAccuracy = 96.9375 ```

```predLabels = predict(classificationSVM,TestFeatures); testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100 %#ok<NASGU>```
```testAccuracy = 98 ```

```figure(Units="normalized",Position=[0.2 0.2 0.5 0.5]) ccscat = confusionchart(adsTest.Labels,predLabels); ccscat.Title = "Wavelet Scattering Classification"; ccscat.ColumnSummary = "column-normalized"; ccscat.RowSummary = "row-normalized";```

### 长短期记忆 (LSTM) 网络

LSTM 网络是一种循环神经网络 (RNN)。RNN 是专门用于处理序列或时序数据（如语音数据）的神经网络。由于小波散射系数是序列，因此它们可以用作 LSTM 的输入。通过使用散射特征而不是原始数据，您可以减少网络需要学习的变异性。

```TrainFeatures = Strain(2:end,:,:); TrainFeatures = squeeze(num2cell(TrainFeatures,[1 2])); TestFeatures = Stest(2:end,:,:); TestFeatures = squeeze(num2cell(TestFeatures, [1 2]));```

```[inputSize, ~] = size(TrainFeatures{1}); YTrain = adsTrain.Labels; numHiddenUnits = 512; numClasses = numel(unique(YTrain)); layers = [ ... sequenceInputLayer(inputSize) lstmLayer(numHiddenUnits,OutputMode="last") fullyConnectedLayer(numClasses) softmaxLayer classificationLayer];```

```maxEpochs = 300; miniBatchSize = 50; options = trainingOptions("adam", ... InitialLearnRate=0.0001,... MaxEpochs=maxEpochs, ... MiniBatchSize=miniBatchSize, ... SequenceLength="shortest", ... Shuffle="every-epoch",... Verbose=false, ... Plots="training-progress");```

`net = trainNetwork(TrainFeatures,YTrain,layers,options);`

```predLabels = classify(net,TestFeatures); testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100 %#ok<NASGU>```
```testAccuracy = 96.2500 ```

### 贝叶斯优化

```YTrain = adsTrain.Labels; YTest = adsTest.Labels; if ~exist("results/",'dir') mkdir results end```

```optVars = [ optimizableVariable(InitialLearnRate=[1e-5, 1e-1],Transform="log") optimizableVariable(NumHiddenUnits=[10, 1000],Type="integer") ];```

```ObjFcn = helperBayesOptLSTM(TrainFeatures,YTrain,TestFeatures,YTest); optimizeCondition = false; if optimizeCondition BayesObject = bayesopt(ObjFcn,optVars, ... MaxObjectiveEvaluations=15, ... IsObjectiveDeterministic=false, ... UseParallel=true); %#ok<UNRCH> else url = "http://ssd.mathworks.com/supportfiles/audio/SpokenDigitRecognition.zip"; downloadNetFolder = tempdir; netFolder = fullfile(downloadNetFolder,"SpokenDigitRecognition"); if ~exist(netFolder,"dir") disp("Downloading pretrained network (1 file - 12 MB) ...") unzip(url,downloadNetFolder) end load(fullfile(netFolder,"0.02.mat")) end```
```Downloading pretrained network (1 file - 12 MB) ... ```

```numHiddenUnits = 768; numClasses = numel(unique(YTrain)); layers = [ ... sequenceInputLayer(inputSize) lstmLayer(numHiddenUnits,OutputMode="last") fullyConnectedLayer(numClasses) softmaxLayer classificationLayer]; maxEpochs = 300; miniBatchSize = 50; options = trainingOptions("adam", ... InitialLearnRate=2.198827960269379e-04, ... MaxEpochs=maxEpochs, ... MiniBatchSize=miniBatchSize, ... SequenceLength="shortest", ... Shuffle="every-epoch", ... Verbose=false, ... Plots="training-progress"); net = trainNetwork(TrainFeatures,YTrain,layers,options);```

```predLabels = classify(net,TestFeatures); testAccuracy = sum(predLabels==adsTest.Labels)/numel(predLabels)*100```
```testAccuracy = 96.7500 ```

### 使用 Mel 频谱图的深度卷积网络

```segmentDuration = 8192*(1/8000); frameDuration = 0.22; hopDuration = 0.01; numBands = 40;```

```reset(adsTrain) reset(adsTest)```

```epsil = 1e-6; XTrain = helperspeechSpectrograms(adsTrain,segmentDuration,frameDuration,hopDuration,numBands);```
```Computing speech spectrograms... Processed 500 files out of 1600 Processed 1000 files out of 1600 Processed 1500 files out of 1600 ...done ```
```XTrain = log10(XTrain + epsil); XTest = helperspeechSpectrograms(adsTest,segmentDuration,frameDuration,hopDuration,numBands);```
```Computing speech spectrograms... ...done ```
```XTest = log10(XTest + epsil); YTrain = adsTrain.Labels; YTest = adsTest.Labels;```

### 定义 DCNN 架构

```sz = size(XTrain); specSize = sz(1:2); imageSize = [specSize 1]; numClasses = numel(categories(YTrain)); dropoutProb = 0.2; numF = 12; layers = [ imageInputLayer(imageSize) convolution2dLayer(5,numF,Padding="same") batchNormalizationLayer reluLayer maxPooling2dLayer(3,Stride=2,Padding="same") convolution2dLayer(3,2*numF,Padding="same") batchNormalizationLayer reluLayer maxPooling2dLayer(3,Stride=2,Padding="same") convolution2dLayer(3,4*numF,Padding="same") batchNormalizationLayer reluLayer maxPooling2dLayer(3,Stride=2,Padding="same") convolution2dLayer(3,4*numF,Padding="same") batchNormalizationLayer reluLayer convolution2dLayer(3,4*numF,Padding="same") batchNormalizationLayer reluLayer maxPooling2dLayer(2) dropoutLayer(dropoutProb) fullyConnectedLayer(numClasses) softmaxLayer classificationLayer(Classes=categories(YTrain)); ];```

```miniBatchSize = 50; options = trainingOptions("adam", ... InitialLearnRate=1e-4, ... MaxEpochs=30, ... MiniBatchSize=miniBatchSize, ... Shuffle="every-epoch", ... Plots="training-progress", ... Verbose=false, ... ExecutionEnvironment="cpu");```

`trainedNet = trainNetwork(XTrain,YTrain,layers,options);`

```[Ypredicted,probs] = classify(trainedNet,XTest,ExecutionEnvironment="CPU"); cnnAccuracy = sum(Ypredicted==YTest)/numel(YTest)*100```
```cnnAccuracy = 98.2500 ```

```figure(Units="normalized",Position=[0.2 0.2 0.5 0.5]); ccDCNN = confusionchart(YTest,Ypredicted); ccDCNN.Title = "Confusion Chart for DCNN"; ccDCNN.ColumnSummary = "column-normalized"; ccDCNN.RowSummary = "row-normalized";```

### 附录：辅助函数

```function Labels = helpergenLabels(ads) % This function is only for use in Wavelet Toolbox examples. It may be % changed or removed in a future release. tmp = cell(numel(ads.Files),1); expression = "[0-9]+_"; for nf = 1:numel(ads.Files) idx = regexp(ads.Files{nf},expression); tmp{nf} = ads.Files{nf}(idx); end Labels = categorical(tmp); end```
```function x = helperReadSPData(x) % This function is only for use Wavelet Toolbox examples. It may change or % be removed in a future release. N = numel(x); if N > 8192 x = x(1:8192); elseif N < 8192 pad = 8192-N; prepad = floor(pad/2); postpad = ceil(pad/2); x = [zeros(prepad,1) ; x ; zeros(postpad,1)]; end x = x./max(abs(x)); end```
```function x = helperBayesOptLSTM(X_train, Y_train, X_val, Y_val) % This function is only for use in the % "Spoken Digit Recognition with Wavelet Scattering and Deep Learning" % example. It may change or be removed in a future release. x = @valErrorFun; function [valError,cons, fileName] = valErrorFun(optVars) %% LSTM Architecture [inputSize,~] = size(X_train{1}); numClasses = numel(unique(Y_train)); layers = [ ... sequenceInputLayer(inputSize) bilstmLayer(optVars.NumHiddenUnits,OutputMode="last") % Using number of hidden layers value from optimizing variable fullyConnectedLayer(numClasses) softmaxLayer classificationLayer]; % Plots not displayed during training options = trainingOptions("adam", ... InitialLearnRate=optVars.InitialLearnRate, ... % Using initial learning rate value from optimizing variable MaxEpochs=300, ... MiniBatchSize=30, ... SequenceLength="shortest", ... Shuffle="never", ... Verbose=false); %% Train the network net = trainNetwork(X_train, Y_train, layers, options); %% Training accuracy X_val_P = net.classify(X_val); accuracy_training = sum(X_val_P == Y_val)./numel(Y_val); valError = 1 - accuracy_training; %% save results of network and options in a MAT file in the results folder along with the error value fileName = fullfile("results", num2str(valError) + ".mat"); save(fileName,"net","valError","options") cons = []; end % end for inner function end % end for outer function```
```function X = helperspeechSpectrograms(ads,segmentDuration,frameDuration,hopDuration,numBands) % This function is only for use in the % "Spoken Digit Recognition with Wavelet Scattering and Deep Learning" % example. It may change or be removed in a future release. % % helperspeechSpectrograms(ads,segmentDuration,frameDuration,hopDuration,numBands) % computes speech spectrograms for the files in the datastore ads. % segmentDuration is the total duration of the speech clips (in seconds), % frameDuration the duration of each spectrogram frame, hopDuration the % time shift between each spectrogram frame, and numBands the number of % frequency bands. disp("Computing speech spectrograms..."); numHops = ceil((segmentDuration - frameDuration)/hopDuration); numFiles = length(ads.Files); X = zeros([numBands,numHops,1,numFiles],"single"); for i = 1:numFiles [x,info] = read(ads); x = normalizeAndResize(x); fs = info.SampleRate; frameLength = round(frameDuration*fs); hopLength = round(hopDuration*fs); spec = melSpectrogram(x,fs, ... Window=hamming(frameLength,"periodic"), ... OverlapLength=frameLength-hopLength, ... FFTLength=2048, ... NumBands=numBands, ... FrequencyRange=[50,4000]); % If the spectrogram is less wide than numHops, then put spectrogram in % the middle of X. w = size(spec,2); left = floor((numHops-w)/2)+1; ind = left:left+w-1; X(:,ind,1,i) = spec; if mod(i,500) == 0 disp("Processed " + i + " files out of " + numFiles) end end disp("...done"); end %-------------------------------------------------------------------------- function x = normalizeAndResize(x) % This function is only for use in the % "Spoken Digit Recognition with Wavelet Scattering and Deep Learning" % example. It may change or be removed in a future release. N = numel(x); if N > 8192 x = x(1:8192); elseif N < 8192 pad = 8192-N; prepad = floor(pad/2); postpad = ceil(pad/2); x = [zeros(prepad,1) ; x ; zeros(postpad,1)]; end x = x./max(abs(x)); end```