Signal Processing Algorithm Acceleration in MATLAB
Note
The benchmarks in this example have been measured on a machine with four physical cores.
This example shows how to accelerate a signal processing algorithm in
MATLAB® using the codegen
(MATLAB Coder) and dspunfold
functions. You can generate a MATLAB executable (MEX function) from an entire MATLAB function or specific parts of the MATLAB function. When you run the MEX function
instead of the original MATLAB code,
simulation speed can increase significantly. To generate the MEX equivalent, the
algorithm must support code generation.
To use codegen
(MATLAB Coder), you must have MATLAB
Coder™ installed. To use dspunfold
, you must have MATLAB
Coder and DSP System Toolbox™ installed.
To use dspunfold
on Windows and Linux, you must
use a compiler that supports the Open Multi-Processing (OpenMP) application interface.
See Supported
Compilers.
FIR Filter Algorithm
Consider a simple FIR filter algorithm to accelerate. Copy the
firfilter
function code into the
firfilter.m
file.
function [y,z1] = firfilter(b,x) % Inputs: % b - 1xNTaps row vector of coefficients % x - A frame of noisy input % States: % z, z1 - NTapsx1 column vector of states % Output: % y - A frame of filtered output persistent z; if (isempty(z)) z = zeros(length(b),1); end Lx = size(x,1); y = zeros(size(x),'like',x); z1 = z; for m = 1:Lx % Load next input sample z1(1,:) = x(m,:); % Compute output y(m,:) = b*z1; % Update states z1(2:end,:) = z1(1:end-1,:); z = z1; end
The firfilter
function accepts a vector of filter
coefficients, b, a noisy input
signal, x, as inputs. Generate the filter coefficients
using the fir1
function.
NTaps = 250; Fp = 4e3/(44.1e3/2); b = fir1(NTaps-1,Fp);
Filter a stream of a noisy sine wave signal by using
the firfilter
function. The sine wave has a frame
size of 4000 samples and a sample rate of 192 kHz. Generate the sine wave using the
dsp.SineWave
System object™. The noise is a white Gaussian with a mean of 0 and a variance of
0.02. Name this function firfilter_sim
. The
firfilter_sim
function calls
the firfilter
function on the noisy input.
function totVal = firfilter_sim(b) % Create the signal source Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); totVal = zeros(4000,500); R = 0.02; clear firfilter; % Iteration loop. Each iteration filters a frame of the noisy signal. for i = 1 : 500 trueVal = Sig(); % Original sine wave noisyVal = trueVal + sqrt(R)*randn; % Noisy sine wave filteredVal = firfilter(b,noisyVal); % Filtered sine wave totVal(:,i) = filteredVal; % Store the entire sine wave end
Run firfilter_sim
and measure the speed of execution.
The execution speed varies depending on your machine.
tic;totVal = firfilter_sim(b);t1 = toc;
fprintf('Original Algorithm Simulation Time: %4.1f seconds\n',t1);
Original Algorithm Simulation Time: 7.8 seconds
Accelerate the FIR Filter Using codegen
Call codegen
on firfilter
, and generate
its MEX equivalent, firfilter_mex
. Generate and pass the filter
coefficients and the sine wave signal as inputs to the
firfilter
function.
Ntaps = 250; Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); % Create the Signal Source R = 0.02; trueVal = Sig(); % Original sine wave noisyVal = trueVal + sqrt(R)*randn; % Noisy sine wave Fp = 4e3/(44.1e3/2); b = fir1(Ntaps-1,Fp); % Filter coefficients codegen firfilter -args {b,noisyVal}
In the firfilter_sim
function, replace
firfilter(b,noisyVal)
function call with
firfilter_mex(b,noisyVal)
. Name this function
firfilter_codegen
.
function totVal = firfilter_codegen(b) % Create the signal source Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); totVal = zeros(4000,500); R = 0.02; clear firfilter_mex; % Iteration loop. Each iteration filters a frame of the noisy signal. for i = 1 : 500 trueVal = Sig(); % Original sine wave noisyVal = trueVal + sqrt(R)*randn; % Noisy sine wave filteredVal = firfilter_mex(b,noisyVal); % Filtered sine wave totVal(:,i) = filteredVal; % Store the entire sine wave end
Run firfilter_codegen
and measure the speed of execution. The
execution speed varies depending on your machine.
tic;totValcodegen = firfilter_codegen(b);t2 = toc; fprintf('Algorithm Simulation Time with codegen: %5f seconds\n',t2); fprintf('Speedup factor with codegen: %5f\n',(t1/t2));
Algorithm Simulation Time with codegen: 0.923683 seconds Speedup factor with codegen: 8.5531
The speedup gain is approximately 8.5
.
Accelerate the FIR Filter Using dspunfold
The dspunfold
function generates a multithreaded MEX file
which can improve the speedup gain even further.
dspunfold
also generates a single-threaded MEX file and a
self-diagnostic analyzer function. The multithreaded MEX file leverages the
multicore CPU architecture of the host computer. The single-threaded MEX file is
similar to the MEX file that the codegen
function generates.
The analyzer function measures the speedup gain of the multithreaded MEX file over
the single-threaded MEX file.
Call dspunfold
on firfilter
and generate
its multithreaded MEX equivalent, firfilter_mt
. Detect the
state length in samples by using the -f
option, which can improve
the speedup gain further. -s auto
triggers the automatic state
length detection. For more information on using the -f
and
-s
options, see dspunfold
.
dspunfold firfilter -args {b,noisyVal} -s auto -f [false,true]
State length: [autodetect] samples, Repetition: 1, Output latency: 8 frames, Threads: 4 Analyzing: firfilter.m Creating single-threaded MEX file: firfilter_st.mexw64 Searching for minimal state length (this might take a while) Checking stateless ... Insufficient Checking 4000 samples ... Sufficient Checking 2000 samples ... Sufficient Checking 1000 samples ... Sufficient Checking 500 samples ... Sufficient Checking 250 samples ... Sufficient Checking 125 samples ... Insufficient Checking 187 samples ... Insufficient Checking 218 samples ... Insufficient Checking 234 samples ... Insufficient Checking 242 samples ... Insufficient Checking 246 samples ... Insufficient Checking 248 samples ... Insufficient Checking 249 samples ... Sufficient Minimal state length is 249 samples Creating multi-threaded MEX file: firfilter_mt.mexw64 Creating analyzer file: firfilter_analyzer.p
The automatic state length detection tool detects an exact state length of
259
samples.
Call the analyzer function and measure the speedup gain of the multithreaded MEX file with respect to the single-threaded MEX file. Provide at least two different frames for each input argument of the analyzer. The frames are appended along the first dimension. The analyzer alternates between these frames while verifying that the outputs match. Failure to provide multiple frames for each input can decrease the effectiveness of the analyzer and can lead to false positive verification results.
firfilter_analyzer([b;0.5*b;0.6*b],[noisyVal;0.5*noisyVal;0.6*noisyVal]);
Analyzing multi-threaded MEX file firfilter_mt.mexw64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done. Latency = 8 frames Speedup = 3.2x
firfilter_mt
has a speedup gain factor of
3.2
when compared to the single-threaded MEX file,
firfilter_st
. To increase the speedup further, increase the
repetition factor using the -r
option. The tradeoff is that the
output latency increases. Use a repetition factor of 3
. Specify
the exact state length to reduce the overhead and increase the speedup
further.
dspunfold firfilter -args {b,noisyVal} -s 249 -f [false,true] -r 3
State length: 249 samples, Repetition: 3, Output latency: 24 frames, Threads: 4 Analyzing: firfilter.m Creating single-threaded MEX file: firfilter_st.mexw64 Creating multi-threaded MEX file: firfilter_mt.mexw64 Creating analyzer file: firfilter_analyzer.p
Call the analyzer function.
firfilter_analyzer([b;0.5*b;0.6*b],[noisyVal;0.5*noisyVal;0.6*noisyVal]);
Analyzing multi-threaded MEX file firfilter_mt.mexw64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done. Latency = 24 frames Speedup = 3.8x
The speedup gain factor is 3.8
, or approximately 32 times the
speed of execution of the original simulation.
For this particular algorithm, you can see that dspunfold
is
generating a highly optimized code, without having to write any C or C++ code. The
speedup gain scales with the number of cores on your host machine.
The FIR filter function in this example is only an illustrative algorithm that is
easy to understand. You can apply this workflow on any of your custom algorithms. If
you want to use an FIR filter, it is recommended that you use the dsp.FIRFilter
System object in DSP System Toolbox. This object runs much faster than the benchmark numbers presented in
this example, without the need for code generation.
Kalman Filter Algorithm
Consider a Kalman filter algorithm, which estimates the sine wave signal from a
noisy input. This example shows the performance of Kalman filter with
codegen
and dspunfold
.
The noisy sine wave input has a frame size of 4000 samples and a sample rate of 192 kHz. The noise is a white Gaussian with mean of 0 and a variance of 0.02.
The function filterNoisySignal
calls the
kalmanfilter
function on the noisy input.
type filterNoisySignal
function totVal = filterNoisySignal % Create the signal source Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); totVal = zeros(4000,500); R = 0.02; clear kalmanfilter; % Iteration loop to estimate the sine wave signal for i = 1 : 500 trueVal = Sig(); % Actual values noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements estVal = kalmanfilter(noisyVal); % Sine wave estimated by Kalman filter totVal(:,i) = estVal; % Store the entire sine wave end
type kalmanfilter
function [estVal,estState] = kalmanfilter(noisyVal) % This function tracks a noisy sinusoid signal using a Kalman filter % % State Transition Matrix A = 1; stateSpaceDim = size(A,1); % Measurement Matrix H = 1; measurementSpaceDim = size(H,1); numTsteps = size(noisyVal,1)/measurementSpaceDim; % Containers to store prediction and estimates for all time steps zEstContainer = noisyVal; xEstContainer = zeros(size(noisyVal)); Q = 0.0001; % Process noise covariance R = 0.02; % Measurement noise covariance persistent xhat P xPrior PPrior; % Local copies of discrete states if isempty(xhat) xhat = 5; % Initial state estimate end if isempty(P) P = 1; % Error covariance estimate end if isempty(xPrior) xPrior = 0; end if isempty(PPrior) PPrior = 0; end % Loop over all time steps for n=1:numTsteps % Gather chunks for current time step zRowIndexChunk = (n-1)*measurementSpaceDim + (1:measurementSpaceDim); stateEstsRowIndexChunk = (n-1)*stateSpaceDim + (1:stateSpaceDim); % Prediction step xPrior = A * xhat; PPrior = A * P * A' + Q; % Correction step. Compute Kalman gain. PpriorH = PPrior * H'; HPpriorHR = H * PpriorH + R; KalmanGain = (HPpriorHR \ PpriorH')'; KH = KalmanGain * H; % States and error covariance are updated in the % correction step xhat = xPrior + KalmanGain * noisyVal(zRowIndexChunk,:) - ... KH * xPrior; P = PPrior - KH * PPrior; % Append estimates xEstContainer(stateEstsRowIndexChunk, :) = xhat; zEstContainer(zRowIndexChunk,:) = H*xhat; end % Populate the outputs estVal = zEstContainer; estState = xEstContainer; end
Run filterNoisySignal.m
and measure the speed of
execution.
tic;totVal = filterNoisySignal;t1 = toc;
fprintf('Original Algorithm Simulation Time: %4.1f seconds\n',t1);
Original Algorithm Simulation Time: 21.7 seconds
Accelerate the Kalman Filter Using codegen
Call the codegen
function on
kalmanfilter
, and generate its MEX equivalent,
kalmanfilter_mex
.
The kalmanfilter
function requires the noisy sine wave as the
input.
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); % Create the Signal Source R = 0.02; % Measurement noise covariance trueVal = step(Sig); % Actual values noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements codegen -args {noisyVal} kalmanfilter.m
Replace kalmanfilter(noisyVal)
in
filterNoisySignal
function with
kalmanfilter_mex(noisyVal)
. Name this function as
filterNoisySignal_codegen
function totVal = filterNoisySignal_codegen % Create the signal source Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); totVal = zeros(4000,500); R = 0.02; clear kalmanfilter_mex; % Iteration loop to estimate the sine wave signal for i = 1 : 500 trueVal = Sig(); % Actual values noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements estVal = kalmanfilter_mex(noisyVal); % Sine wave estimated by Kalman filter totVal(:,i) = estVal; % Store the entire sine wave end
Run filterNoisySignal_codegen
and measure the speed of
execution.
tic; totValcodegen = filterNoisySignal_codegen; t2 = toc; fprintf('Algorithm Simulation Time with codegen: %5f seconds\n',t2); fprintf('Speedup with codegen is %0.1f',t1/t2);
Algorithm Simulation Time with codegen: 0.095480 seconds Speedup with codegen is 227.0
The Kalman filter algorithm implements several matrix multiplications.
codegen
uses the Basic Linear Algebra Subroutines (BLAS)
libraries to perform these multiplications. These libraries generate a highly
optimized code, hence giving a speedup gain of 227.
Accelerate the Kalman Filter Using dspunfold
Generate a multithreaded MEX file using dspunfold
and compare
its performance with codegen
.
Sig = dsp.SineWave('SamplesPerFrame',4000,'SampleRate',19200); % Create the signal source R = 0.02; % Measurement noise covariance trueVal = step(Sig); % Actual values noisyVal = trueVal + sqrt(R)*randn; % Noisy measurements dspunfold kalmanfilter -args {noisyVal} -s auto
State length: [autodetect] frames, Repetition: 1, Output latency: 8 frames, Threads: 4 Analyzing: kalmanfilter.m Creating single-threaded MEX file: kalmanfilter_st.mexw64 Searching for minimal state length (this might take a while) Checking stateless ... Insufficient Checking 1 frames ... Sufficient Minimal state length is 1 frames Creating multi-threaded MEX file: kalmanfilter_mt.mexw64 Creating analyzer file: kalmanfilter_analyzer.p
Call the analyzer function.
kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done. Latency = 8 frames Speedup = 0.7x
kalmanfilter_mt
has a speedup factor of
0.7
, which is a performance loss of 30%
when compared to the single-threaded MEX file, kalmanfilter_st
.
Increase the repetition factor to 3
to see if the performance
increases. Also, detect the state length in samples.
dspunfold kalmanfilter -args {noisyVal} -s auto -f true -r 3
State length: [autodetect] samples, Repetition: 3, Output latency: 24 frames, Threads: 4 Analyzing: kalmanfilter.m Creating single-threaded MEX file: kalmanfilter_st.mexw64 Searching for minimal state length (this might take a while) Checking stateless ... Insufficient Checking 4000 samples ... Sufficient Checking 2000 samples ... Sufficient Checking 1000 samples ... Sufficient Checking 500 samples ... Sufficient Checking 250 samples ... Insufficient Checking 375 samples ... Sufficient Checking 312 samples ... Sufficient Checking 281 samples ... Sufficient Checking 265 samples ... Sufficient Checking 257 samples ... Insufficient Checking 261 samples ... Sufficient Checking 259 samples ... Sufficient Checking 258 samples ... Insufficient Minimal state length is 259 samples Creating multi-threaded MEX file: kalmanfilter_mt.mexw64 Creating analyzer file: kalmanfilter_analyzer.p
Call the analyzer function.
kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done. Latency = 24 frames Speedup = 1.4x
dspunfold
gives a speedup gain of 40%
when
compared to the highly optimized single-threaded MEX file. Specify the exact state
length and increase the repetition factor to 4
.
dspunfold kalmanfilter -args {noisyVal} -s 259 -f true -r 4
State length: 259 samples, Repetition: 4, Output latency: 32 frames, Threads: 4 Analyzing: kalmanfilter.m Creating single-threaded MEX file: kalmanfilter_st.mexw64 Creating multi-threaded MEX file: kalmanfilter_mt.mexw64 Creating analyzer file: kalmanfilter_analyzer.p
Invoke the analyzer function to see the speedup gain.
kalmanfilter_analyzer([noisyVal;0.01*noisyVal;0.05*noisyVal;0.1*noisyVal]);
Analyzing multi-threaded MEX file kalmanfilter_mt.mexw64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done. Latency = 32 frames Speedup = 1.5x
The speedup gain factor is 50%
when compared to the
single-threaded MEX file.
The performance gain factors codegen
and
dspunfold
give depend on your algorithm.
codegen
provides sufficient acceleration for some
MATLAB constructs. dspunfold
can provide additional
performance gains using the cores available on your machine to distribute your
algorithm via unfolding. As shown in this example, the amount of speedup that
dspunfold
provides depends on the particular algorithm to
accelerate. Use dspunfold
in addition to
codegen
if your algorithm is well-suited for distributing
via unfolding, and if the resulting latency cost is in line with the constraints of
your application.
Some MATLAB constructs are highly optimized with MATLAB interpreted execution. The fft
function, for example, runs
much faster in interpreted simulation than with code generation.