Speech Command Recognition Code Generation on Raspberry Pi
This example shows how to deploy feature extraction and a convolutional neural network (CNN) for speech command recognition to Raspberry Pi™. To generate the feature extraction and network code, you use MATLAB Coder™, MATLAB® Support Package for Raspberry Pi Hardware, and the ARM® Compute Library. In this example, the generated code is an executable that runs on your Raspberry Pi and communicates with a MATLAB script that displays the predicted speech command along with the signal and auditory spectrogram. Interaction between the MATLAB script and the executable on your Raspberry Pi is handled using the user datagram protocol (UDP). For details about audio preprocessing and network training, see Train Deep Learning Network for Speech Command Recognition.
Prerequisites
MATLAB Coder Interface for Deep Learning Libraries
ARM processor that supports the NEON extension
ARM Compute Library version 20.02.1 (on the target ARM hardware)
Environment variables for the compilers and libraries
For supported versions of libraries and for information about setting up environment variables, see Prerequisites for Deep Learning with MATLAB Coder (MATLAB Coder).
Streaming Demonstration in MATLAB
Use the same parameters for the feature extraction pipeline and classification as developed in Train Deep Learning Network for Speech Command Recognition.
Define the same sample rate the network was trained on (16 kHz). Define the classification rate and the number of audio samples input per frame. The feature input to the network is a Bark spectrogram that corresponds to 1 second of audio data. The Bark spectrogram is calculated for 25 ms windows with 10 ms hops. Calculate the number of individual spectra in each spectrogram.
fs = 16000;
classificationRate = 20;
samplesPerCapture = fs/classificationRate;

segmentDuration = 1;
segmentSamples = round(segmentDuration*fs);

frameDuration = 0.025;
frameSamples = round(frameDuration*fs);

hopDuration = 0.010;
hopSamples = round(hopDuration*fs);

numSpectrumPerSpectrogram = floor((segmentSamples-frameSamples)/hopSamples) + 1;
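With these parameters, each 1-second segment yields floor((16000 - 400)/160) + 1 = 98 spectra. A quick sanity check of that arithmetic (illustrative, not part of the original example):

% Sanity check: 1 s of audio at 16 kHz with 25 ms windows and 10 ms hops
% produces 98 spectra per spectrogram.
assert(numSpectrumPerSpectrogram == 98)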
Create an audioFeatureExtractor object to extract 50-band Bark spectrograms without window normalization. Calculate the number of elements in each spectrogram.
afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',512, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',frameSamples - hopSamples, ...
    'barkSpectrum',true);

numBands = 50;
setExtractorParameters(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);

numElementsPerSpectrogram = numSpectrumPerSpectrogram*numBands;
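To confirm that the extractor output matches the input size the network expects, you can extract features from one second of silence; the result should be 98-by-50 (numSpectrumPerSpectrogram-by-numBands). This check is illustrative and not part of the original example:

% Illustrative check: the feature matrix should be 98-by-50.
testFeatures = extract(afe,zeros(segmentSamples,1,'single'));
size(testFeatures)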
Load the pretrained CNN and labels.
load('commandNet.mat')
labels = trainedNet.Layers(end).Classes;
NumLabels = numel(labels);
BackGroundIdx = find(labels == 'background');
Define buffers and decision thresholds to postprocess the network predictions.
probBuffer = single(zeros([NumLabels,classificationRate/2]));
YBuffer = single(NumLabels * ones(1, classificationRate/2));

countThreshold = ceil(classificationRate*0.2);
probThreshold = single(0.7);
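The buffers hold the most recent half second of predictions (classificationRate/2 = 10 frames). A command is declared only when one label is the mode of YBuffer at least countThreshold (4) times and its peak probability in probBuffer exceeds probThreshold (0.7); otherwise the decision falls back to background. A minimal sketch of that decision rule on hypothetical buffer contents:

% Hypothetical buffer contents, illustrating the decision rule used in the loop below.
YBufferDemo = single([12 12 3 3 3 3 3 3 3 3]);  % last 10 predicted label indices
[YModeIdxDemo,countDemo] = mode(YBufferDemo);   % most frequent label and its count
countDemo >= countThreshold                     % true: label 3 appears 8 >= 4 times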
Create an audioDeviceReader object to read audio from your device. Create a dsp.AsyncBuffer object to buffer the audio into chunks.
adr = audioDeviceReader('SampleRate',fs,'SamplesPerFrame',samplesPerCapture,'OutputDataType','single');
audioBuffer = dsp.AsyncBuffer(fs);
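In the streaming loop below, each 50 ms capture is written to the buffer and the most recent second is read back with an overlap of fs - samplesPerCapture samples (950 ms), so consecutive feature windows slide by one capture. A minimal sketch of that sliding-window read, using synthetic audio instead of the device:

% Sliding-window read (synthetic audio, illustrative only). Until the buffer
% fills, the unread portion of the window is returned as zeros.
demoBuffer = dsp.AsyncBuffer(fs);
write(demoBuffer,randn(samplesPerCapture,1,'single'));  % one 50 ms capture
y = read(demoBuffer,fs,fs-samplesPerCapture);           % 1 s window, 950 ms overlap
size(y)                                                 % 16000-by-1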
Create a dsp.MatrixViewer object and a timescope object to display the results.
matrixViewer = dsp.MatrixViewer("ColorBarLabel","Power per band (dB/Band)", ...
    "XLabel","Frames", ...
    "YLabel","Bark Bands", ...
    "Position",[400 100 600 250], ...
    "ColorLimits",[-4 2.6445], ...
    "AxisOrigin","Lower left corner", ...
    "Name","Speech Command Recognition using Deep Learning");

timeScope = timescope("SampleRate",fs, ...
    "YLimits",[-1 1], ...
    "Position",[400 380 600 250], ...
    "Name","Speech Command Recognition Using Deep Learning", ...
    "TimeSpanSource","Property", ...
    "TimeSpan",1, ...
    "BufferLength",fs, ...
    "YLabel","Amplitude", ...
    "ShowGrid",true);
Show the time scope and matrix viewer. Detect commands as long as both the time scope and matrix viewer are open or until the time limit is reached. To stop the live detection before the time limit is reached, close the time scope window or matrix viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 10;

tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit

    % Capture audio
    x = adr();
    write(audioBuffer,x);
    y = read(audioBuffer,fs,fs-samplesPerCapture);

    % Compute auditory features
    features = extract(afe,y);
    auditoryFeatures = log10(features + 1e-6);

    % Perform prediction
    probs = predict(trainedNet, auditoryFeatures);
    [~, YPredicted] = max(probs);

    % Perform statistical post processing
    YBuffer = [YBuffer(2:end),YPredicted];
    probBuffer = [probBuffer(:,2:end),probs(:)];

    [YModeIdx, count] = mode(YBuffer);
    maxProb = max(probBuffer(YModeIdx,:));

    if YModeIdx == single(BackGroundIdx) || single(count) < countThreshold || maxProb < probThreshold
        speechCommandIdx = BackGroundIdx;
    else
        speechCommandIdx = YModeIdx;
    end

    % Update plots
    matrixViewer(auditoryFeatures');
    timeScope(x);

    if (speechCommandIdx == BackGroundIdx)
        timeScope.Title = ' ';
    else
        timeScope.Title = char(labels(speechCommandIdx));
    end

    drawnow limitrate
end
Hide the scopes.
hide(matrixViewer)
hide(timeScope)
Prepare MATLAB Code for Deployment
To create a feature extraction function compatible with code generation, call generateMATLABFunction on the audioFeatureExtractor object. This object function creates a standalone function that performs equivalent feature extraction and supports code generation.
generateMATLABFunction(afe,'extractSpeechFeatures')
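You can verify the generated function against the object-based extraction. This usage sketch assumes the generated signature extractSpeechFeatures(x); check it against the generated file:

% Hedged usage sketch: compare the generated function with extract(afe,y).
y = randn(segmentSamples,1,'single');
featuresGenerated = extractSpeechFeatures(y);  % generated, codegen-compatible
featuresObject = extract(afe,y);               % object-based extraction
isequal(size(featuresGenerated),size(featuresObject))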
The HelperSpeechCommandRecognitionRasPi supporting function encapsulates the feature extraction and network prediction process demonstrated previously. So that the feature extraction is compatible with code generation, feature extraction is handled by the generated extractSpeechFeatures function. So that the network is compatible with code generation, the supporting function uses the coder.loadDeepLearningNetwork (MATLAB Coder) function to load the network. The supporting function uses a dsp.UDPSender System object to send the auditory spectrogram and the index corresponding to the predicted speech command from the Raspberry Pi to MATLAB, and a dsp.UDPReceiver System object to receive the audio captured by your microphone in MATLAB.
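The supporting function ships with the example. As a rough orientation, a minimal skeleton of its structure might look like the following, assuming the ports used later in this example (26000 for receiving audio, 21000 for sending results) and omitting the statistical postprocessing. This is a hedged sketch, not the shipped implementation:

function HelperSpeechCommandRecognitionSketch(hostIPAddress) %#codegen
% Hedged structural sketch (hypothetical); statistical postprocessing omitted.
persistent net afeBuffer udpSender udpReceiver
if isempty(net)
    net = coder.loadDeepLearningNetwork('commandNet.mat');
    afeBuffer = dsp.AsyncBuffer(16000);
    udpReceiver = dsp.UDPReceiver('LocalIPPort',26000, ...
        'MessageDataType','single','MaximumMessageLength',800);
    udpSender = dsp.UDPSender('RemoteIPPort',21000, ...
        'RemoteIPAddress',hostIPAddress);
end
while true
    x = udpReceiver();                            % 50 ms audio frame from MATLAB
    if ~isempty(x)
        write(afeBuffer,x);
        y = read(afeBuffer,16000,16000-800);      % most recent 1 s of audio
        features = extractSpeechFeatures(y);      % generated feature extraction
        auditoryFeatures = log10(features + 1e-6);
        probs = predict(net,auditoryFeatures);
        [~,speechCommandIdx] = max(probs);
        % Send the spectrogram (column-major) followed by the predicted index.
        udpSender([auditoryFeatures(:); single(speechCommandIdx)]);
    end
end
end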
Generate Executable on Raspberry Pi
Replace hostIPAddress with your machine's address. Your Raspberry Pi sends auditory spectrograms and the predicted speech command to this IP address.
hostIPAddress = coder.Constant('172.18.230.30');
Create a code generation configuration object to generate an executable program. Specify the target language as C++.
cfg = coder.config('exe');
cfg.TargetLang = 'C++';
Create a configuration object for deep learning code generation with the ARM Compute Library that is on your Raspberry Pi. Specify the architecture of the Raspberry Pi and attach the deep learning configuration object to the code generation configuration object.
dlcfg = coder.DeepLearningConfig('arm-compute');
dlcfg.ArmArchitecture = 'armv7';
dlcfg.ArmComputeVersion = '20.02.1';
cfg.DeepLearningConfig = dlcfg;
Use the Raspberry Pi Support Package function raspi to create a connection to your Raspberry Pi. In the following code, replace:

raspiname with the name of your Raspberry Pi

pi with your user name

password with your password
r = raspi('raspiname','pi','password');
Create a coder.hardware (MATLAB Coder) object for the Raspberry Pi and attach it to the code generation configuration object.
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
Specify the build folder on the Raspberry Pi.
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;
Use an automatically generated C++ main file to generate a standalone executable.
cfg.GenerateExampleMain = 'GenerateCodeAndCompile';
Call codegen (MATLAB Coder) to generate C++ code and the executable on your Raspberry Pi. By default, the Raspberry Pi application name is the same as the MATLAB function name.
codegen -config cfg HelperSpeechCommandRecognitionRasPi -args {hostIPAddress} -report -v
Deploying code. This may take a few minutes. 
### Compiling function(s) HelperSpeechCommandRecognitionRasPi ...
------------------------------------------------------------------------
Location of the generated elf : /home/pi/remoteBuildDir/MATLAB_ws/R2022a/W/Ex/ExampleManager/sporwal.Bdoc22a.j1844576/deeplearning_shared-ex00376115
### Using toolchain: GNU GCC Embedded Linux
### 'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\exe\HelperSpeechCommandRecognitionRasPi\HelperSpeechCommandRecognitionRasPi_rtw.mk' is up to date
### Building 'HelperSpeechCommandRecognitionRasPi': make -j$(($(nproc)+1)) -Otarget -f HelperSpeechCommandRecognitionRasPi_rtw.mk all
------------------------------------------------------------------------
### Generating compilation report ...
Warning: Function 'HelperSpeechCommandRecognitionRasPi' does not terminate due to an infinite loop.

Warning in ==> HelperSpeechCommandRecognitionRasPi Line: 86 Column: 1
Code generation successful (with warnings): View report
Initialize Application on Raspberry Pi
Create a command to open the HelperSpeechCommandRecognitionRasPi application on the Raspberry Pi. Use system to send the command to your Raspberry Pi.
applicationName = 'HelperSpeechCommandRecognitionRasPi';

applicationDirPaths = raspi.utils.getRemoteBuildDirectory('applicationName',applicationName);
targetDirPath = applicationDirPaths{1}.directory;

exeName = strcat(applicationName,'.elf');
command = ['cd ' targetDirPath '; ./' exeName ' &> 1 &'];

system(r,command);
Create a dsp.UDPSender System object to send the audio captured in MATLAB to your Raspberry Pi. Update targetIPAddress for your Raspberry Pi. The Raspberry Pi receives the captured audio on the same port using a dsp.UDPReceiver System object.
targetIPAddress = '172.18.231.92';
UDPSend = dsp.UDPSender('RemoteIPPort',26000,'RemoteIPAddress',targetIPAddress);
Create a dsp.UDPReceiver System object to receive auditory features and the predicted speech command index from your Raspberry Pi. Each UDP packet received from the Raspberry Pi consists of auditory features in column-major order followed by the predicted speech command index. The maximum message length for the dsp.UDPReceiver object is 65507 bytes. Calculate the buffer size to accommodate the maximum number of UDP packets.
sizeOfFloatInBytes = 4;
maxUDPMessageLength = floor(65507/sizeOfFloatInBytes);
samplesPerPacket = 1 + numElementsPerSpectrogram;
numPackets = floor(maxUDPMessageLength/samplesPerPacket);
bufferSize = numPackets*samplesPerPacket*sizeOfFloatInBytes;

UDPReceive = dsp.UDPReceiver("LocalIPPort",21000, ...
    "MessageDataType","single", ...
    "MaximumMessageLength",samplesPerPacket, ...
    "ReceiveBufferSize",bufferSize);
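With 98 spectra of 50 bands, each packet carries samplesPerPacket = 1 + 98*50 = 4901 single-precision values, or 19,604 bytes, which fits within the 65,507-byte UDP limit; the receive buffer then holds numPackets = floor(16376/4901) = 3 packets. A quick check of this arithmetic:

% Worked check of the packet arithmetic defined above.
assert(samplesPerPacket == 4901)                      % 1 + 98*50
assert(samplesPerPacket*sizeOfFloatInBytes == 19604)  % bytes per packet, < 65507
assert(numPackets == 3)                               % packets per receive buffer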
Reduce initialization overhead by sending a frame of zeros to the executable running on your Raspberry Pi.
UDPSend(zeros(samplesPerCapture,1,"single"));
Perform Speech Command Recognition Using Deployed Code
Detect commands as long as both the time scope and matrix viewer are open or until the time limit is reached. To stop the live detection before the time limit is reached, close the time scope or matrix viewer window.
show(timeScope)
show(matrixViewer)
timeLimit = 20;

tic
while isVisible(timeScope) && isVisible(matrixViewer) && toc < timeLimit

    % Capture audio and send it to the Raspberry Pi
    x = adr();
    UDPSend(x);

    % Receive a data packet from the Raspberry Pi
    udpRec = UDPReceive();

    if ~isempty(udpRec)
        % Extract the predicted index, the last sample of the received UDP packet
        speechCommandIdx = udpRec(end);

        % Extract the auditory spectrogram
        spec = reshape(udpRec(1:numElementsPerSpectrogram), [numBands, numSpectrumPerSpectrogram]);

        % Display the time-domain signal and auditory spectrogram
        timeScope(x)
        matrixViewer(spec)

        if speechCommandIdx == BackGroundIdx
            timeScope.Title = ' ';
        else
            timeScope.Title = char(labels(speechCommandIdx));
        end

        drawnow limitrate
    end
end

hide(matrixViewer)
hide(timeScope)
To stop the executable on your Raspberry Pi, use stopExecutable. Release the UDP objects.
stopExecutable(codertarget.raspi.raspberrypi,exeName)

release(UDPSend)
release(UDPReceive)
Profile Using PIL Workflow
You can measure the execution time on the Raspberry Pi using a processor-in-the-loop (PIL) workflow of Embedded Coder®. The ProfileSpeechCommandRecognitionRaspi supporting function is equivalent to the HelperSpeechCommandRecognitionRasPi function, except that the former returns the speech command index and auditory spectrogram while the latter sends the same data using UDP. The time taken by the UDP calls is less than 1 ms, which is small relative to the overall execution time.
Create a PIL configuration object.
cfg = coder.config('lib','ecoder',true);
cfg.VerificationMode = 'PIL';
Set the ARM Compute Library version and architecture.
dlcfg = coder.DeepLearningConfig('arm-compute');
cfg.DeepLearningConfig = dlcfg;
cfg.DeepLearningConfig.ArmArchitecture = 'armv7';
cfg.DeepLearningConfig.ArmComputeVersion = '20.02.1';
Set up the connection with your target hardware.
if (~exist('r','var'))
    r = raspi('raspiname','pi','password');
end
hw = coder.hardware('Raspberry Pi');
cfg.Hardware = hw;
Set the build directory and target language.
buildDir = '~/remoteBuildDir';
cfg.Hardware.BuildDir = buildDir;

cfg.TargetLang = 'C++';
Enable profiling and then generate the PIL code. A MEX file named ProfileSpeechCommandRecognitionRaspi_pil is generated in your current folder.
cfg.CodeExecutionProfiling = true;
codegen -config cfg ProfileSpeechCommandRecognitionRaspi -args {rand(samplesPerCapture, 1, 'single')} -report -v
Deploying code. This may take a few minutes. 
### Compiling function(s) ProfileSpeechCommandRecognitionRaspi ...
### Connectivity configuration for function 'ProfileSpeechCommandRecognitionRaspi': 'Raspberry Pi'
### Using toolchain: GNU GCC Embedded Linux
### Creating 'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\lib\ProfileSpeechCommandRecognitionRaspi\coderassumptions\lib\ProfileSpeechCommandRecognitionRaspi_ca.mk' ...
### Building 'ProfileSpeechCommandRecognitionRaspi_ca': make -j$(($(nproc)+1)) -Otarget -f ProfileSpeechCommandRecognitionRaspi_ca.mk all
### Using toolchain: GNU GCC Embedded Linux
### Creating 'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\lib\ProfileSpeechCommandRecognitionRaspi\pil\ProfileSpeechCommandRecognitionRaspi_rtw.mk' ...
### Building 'ProfileSpeechCommandRecognitionRaspi': make -j$(($(nproc)+1)) -Otarget -f ProfileSpeechCommandRecognitionRaspi_rtw.mk all
Location of the generated elf : /home/pi/remoteBuildDir/MATLAB_ws/R2022a/W/Ex/ExampleManager/sporwal.Bdoc22a.j1844576/deeplearning_shared-ex00376115/codegen/lib/ProfileSpeechCommandRecognitionRaspi/pil
------------------------------------------------------------------------
### Using toolchain: GNU GCC Embedded Linux
### 'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\lib\ProfileSpeechCommandRecognitionRaspi\ProfileSpeechCommandRecognitionRaspi_rtw.mk' is up to date
### Building 'ProfileSpeechCommandRecognitionRaspi': make -j$(($(nproc)+1)) -Otarget -f ProfileSpeechCommandRecognitionRaspi_rtw.mk all
------------------------------------------------------------------------
### Generating compilation report ...
Code generation successful: View report
Evaluate Raspberry Pi Execution Time
Call the generated PIL function multiple times to get the average execution time.
testDur = 50e-3;
numCalls = 100;

for k = 1:numCalls
    x = pinknoise(fs*testDur,'single');
    [speechCommandIdx, auditoryFeatures] = ProfileSpeechCommandRecognitionRaspi_pil(x);
end
### Starting application: 'codegen\lib\ProfileSpeechCommandRecognitionRaspi\pil\ProfileSpeechCommandRecognitionRaspi.elf'
    To terminate execution: clear ProfileSpeechCommandRecognitionRaspi_pil
### Launching application ProfileSpeechCommandRecognitionRaspi.elf...
Execution profiling data is available for viewing. Open Simulation Data Inspector.
Execution profiling report available after termination.
Terminate the PIL execution.
clear ProfileSpeechCommandRecognitionRaspi_pil
### Host application produced the following standard output (stdout) and standard error (stderr) messages:

    Execution profiling report: report(getCoderExecutionProfile('ProfileSpeechCommandRecognitionRaspi'))
Generate an execution profile report to evaluate execution time.
executionProfile = getCoderExecutionProfile('ProfileSpeechCommandRecognitionRaspi');
report(executionProfile, ...
    'Units','Seconds', ...
    'ScaleFactor','1e-03', ...
    'NumericFormat','%0.4f')
ans = 'W:\Ex\ExampleManager\sporwal.Bdoc22a.j1844576\deeplearning_shared-ex00376115\codegen\lib\ProfileSpeechCommandRecognitionRaspi\html\orphaned\ExecutionProfiling_d82c7024f87064b9.html'
The maximum execution time of the ProfileSpeechCommandRecognitionRaspi function is nearly twice the average execution time. The execution time is highest for the first call of the PIL function because of initialization that occurs on the first call. The average execution time is approximately 20 ms, which is within the 50 ms budget (the duration of one audio capture). Performance was measured on a Raspberry Pi 4 Model B Rev 1.1.
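The 50 ms budget follows from the capture parameters defined earlier: the deployed code must finish processing one frame before the next frame of audio arrives. A quick confirmation:

% The real-time budget equals the duration of one captured frame.
budgetSeconds = samplesPerCapture/fs  % 800/16000 = 0.05 s (50 ms)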