Multithreaded MEX File Generation
This example shows how to use the dspunfold function to generate a multithreaded MEX file from a MATLAB® function using unfolding technology. The MATLAB function can contain an algorithm which is stateless (has no states) or stateful (has states).
NOTE: The following example assumes that the current host computer has at least two physical CPU cores. The presented screenshots, speedup, and latency values are collected using a host computer with six physical CPU cores.
Required MathWorks™ products:
DSP System Toolbox™
MATLAB Coder™
Using dspunfold with a MATLAB Function Containing a Stateless Algorithm
Consider the MATLAB function dspunfoldDCTExample. This function computes the DCT of an input signal and returns the value and index of the maximum energy point.
type dspunfoldDCTExample.m
function [peakValue,peakIndex] = dspunfoldDCTExample(x)
% Stateless MATLAB function computing the dct of a signal (e.g. audio), and
% returns the value and index of the highest energy point

% Copyright 2015 The MathWorks, Inc.

X = dct(x);
[peakValue,peakIndex] = max(abs(X));

end
To accelerate the algorithm, a common approach is to generate a MEX file using the codegen function. This example shows how to do so when using an input of 4096 doubles. The generated MEX file, dspunfoldDCTExample_mex, is single-threaded.
codegen dspunfoldDCTExample -args {(1:4096)'}
Code generation successful.
To generate a multithreaded MEX file, use the dspunfold function. The argument -s 0 indicates that the algorithm in dspunfoldDCTExample is stateless.
dspunfold dspunfoldDCTExample -args {(1:4096)'} -s 0
State length: 0 frames, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldDCTExample.m
Creating single-threaded MEX file: dspunfoldDCTExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldDCTExample_mt.mexa64
Creating analyzer file: dspunfoldDCTExample_analyzer.p
This command generates these files:
Multithreaded MEX file dspunfoldDCTExample_mt
Single-threaded MEX file dspunfoldDCTExample_st, which is identical to the MEX file obtained using the codegen function
Self-diagnostic analyzer function dspunfoldDCTExample_analyzer
Three additional MATLAB files are also generated, containing the help for each of the above files.
To measure the speedup of the multithreaded MEX file relative to the single-threaded MEX file, see the example function dspunfoldBenchmarkDCTExample.
type dspunfoldBenchmarkDCTExample
function dspunfoldBenchmarkDCTExample
% Function used to measure the speedup of the multi-threaded MEX file
% dspunfoldDCTExample_mt obtained using dspunfold vs the single-threaded MEX
% file dspunfoldDCTExample_st.

% Copyright 2015 The MathWorks, Inc.

clear dspunfoldDCTExample_mt; % for benchmark precision purpose

numFrames = 1e5;
inputFrame = (1:4096)';

% exclude first run from timing measurements
dspunfoldDCTExample_st(inputFrame);
tic; % measure execution time for the single-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_st(inputFrame);
end
timeSingleThreaded = toc;

% exclude first run from timing measurements
dspunfoldDCTExample_mt(inputFrame);
tic; % measure execution time for the multi-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_mt(inputFrame);
end
timeMultiThreaded = toc;

fprintf('Speedup = %.1fx\n',timeSingleThreaded/timeMultiThreaded);
dspunfoldBenchmarkDCTExample measures the execution time taken by dspunfoldDCTExample_st and dspunfoldDCTExample_mt to process numFrames frames. Finally, it prints the speedup, which is the ratio of the single-threaded MEX file execution time to the multithreaded MEX file execution time.
Run the example.
dspunfoldBenchmarkDCTExample;
Speedup = 2.9x
To improve the speedup even more, increase the repetition value. To modify the repetition value, use the -r flag. For more information on the repetition value, see the dspunfold function reference page. For an example of how to specify the repetition value, see the section 'Using dspunfold with a MATLAB Function Containing a Stateful Algorithm'.
dspunfold generates a multithreaded MEX file, which buffers multiple signal frames and then processes these frames simultaneously, using multiple cores. This process introduces some deterministic output latency. Executing help dspunfoldDCTExample_mt displays more information about the multithreaded MEX file, including the value of the output latency. For this example, the output of the multithreaded MEX file has a latency of 12 frames relative to its input, as reported by dspunfold above. The single-threaded MEX file has no such latency.
Run the dspunfoldShowLatencyDCTExample example. The generated plot displays the outputs of the single-threaded and multithreaded MEX files. Notice that the output of the multithreaded MEX is delayed by 12 frames relative to that of the single-threaded MEX.
dspunfoldShowLatencyDCTExample;
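The delay can also be checked numerically rather than read off a plot. The following is a minimal sketch, not the contents of dspunfoldShowLatencyDCTExample. It assumes that the dspunfoldDCTExample_st and dspunfoldDCTExample_mt files generated earlier are on the path, that the latency is the 12 frames reported by dspunfold on the six-core host, and it uses a hypothetical per-frame scaling of the input so that every frame produces a different peak value.

```matlab
latency   = 12;                 % frames, taken from the dspunfold report (assumption)
numFrames = 4*latency;          % arbitrary number of frames for the check
stOut = zeros(numFrames,1);
mtOut = zeros(numFrames,1);

clear dspunfoldDCTExample_mt;   % start from a freshly loaded MEX file
for k = 1:numFrames
    x = k*(1:4096)';            % hypothetical test input, different for each frame
    stOut(k) = dspunfoldDCTExample_st(x);   % first output only (peakValue)
    mtOut(k) = dspunfoldDCTExample_mt(x);
end

% Past the initial latency, the multithreaded output is expected to be a
% delayed copy of the single-threaded output.
delayedMatch = isequal(mtOut(latency+1:end), stOut(1:end-latency));
fprintf('Outputs match after a %d-frame delay: %d\n', latency, delayedMatch);
```

If the host has a different number of cores, replace the latency value with the one that dspunfold reports for that build.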
Using dspunfold with a MATLAB Function Containing a Stateful Algorithm
The MATLAB function dspunfoldFIRExample executes two FIR filters.
type dspunfoldFIRExample.m
function y = dspunfoldFIRExample(u,c1,c2)
% Stateful MATLAB function executing two FIR filters

% Copyright 2015 The MathWorks, Inc.

persistent FIRSTFIR SECONDFIR
if isempty(FIRSTFIR)
    FIRSTFIR = dsp.FIRFilter('NumeratorSource','Input port');
    SECONDFIR = dsp.FIRFilter('NumeratorSource','Input port');
end
t = step(FIRSTFIR,u,c1);
y = step(SECONDFIR,t,c2);
To build the multithreaded MEX file, you must provide the state length corresponding to the two FIR filters. Specify -s 1 to indicate that the state length does not exceed 1 frame.
firCoeffs1 = fir1(192,0.8);
firCoeffs2 = fir1(256,0.2,'High');
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} -s 1
State length: 1 frames, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
Executing this code generates:
Multithreaded MEX file dspunfoldFIRExample_mt
Single-threaded MEX file dspunfoldFIRExample_st
Self-diagnostic analyzer function dspunfoldFIRExample_analyzer
The corresponding MATLAB help files for these three files
The output latency of the multithreaded MEX file is 12 frames. To measure the speedup, execute dspunfoldBenchmarkFIRExample.
dspunfoldBenchmarkFIRExample;
Speedup = 1.4x
To improve the speedup of the multithreaded MEX file even more, specify the exact state length in samples. To do so, you must specify which input arguments to dspunfoldFIRExample are frames. In this example, the first input is a frame because the elements of this input are sequenced in time. Therefore, it can be further divided into subframes. The last two inputs are not frames because the FIR filter coefficients cannot be subdivided without changing the nature of the algorithm. The state length of the dspunfoldFIRExample MATLAB function is the sum of the state lengths of the two FIR filters (192 + 256 = 448). Using the -f argument, mark the first input argument as true (frame), and the last two input arguments as false (nonframes).
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} -s 448 -f [true,false,false]
State length: 448 samples, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
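The state length of 448 samples passed to -s can be derived from the filter orders. The sketch below works through that arithmetic. It relies on two facts: fir1(N,...) returns N+1 coefficients, and an FIR filter with L coefficients holds the previous L-1 input samples as state. The variable names are illustrative and are not part of the example files.

```matlab
% Filter orders as passed to fir1 in this example.
order1 = 192;
order2 = 256;

% fir1(N,...) returns N+1 coefficients (taps).
numTaps1 = order1 + 1;
numTaps2 = order2 + 1;

% An FIR filter with L taps remembers the previous L-1 input samples.
stateLen1 = numTaps1 - 1;
stateLen2 = numTaps2 - 1;

% The two filters are cascaded, so their state lengths add up.
totalStateLen = stateLen1 + stateLen2;   % value passed to -s

fprintf('State length = %d samples\n', totalStateLen);
```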
Again, measure the speedup for the resulting multithreaded MEX using the dspunfoldBenchmarkFIRExample function. Notice that the speedup increased because the exact state length was specified in samples, and dspunfold was able to subdivide the frame inputs.
dspunfoldBenchmarkFIRExample;
Speedup = 2.0x
Often, the speedup can be increased even more by increasing the repetition (-r) provided when invoking dspunfold. The default repetition value is 1. When you increase this value, the multithreaded MEX buffers more frames internally before the processing starts. Increasing the repetition factor increases the efficiency of the multithreading, but at the cost of a higher output latency.
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} ...
    -s 448 -f [true,false,false] -r 5
State length: 448 samples, Repetition: 5, Output latency: 60 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
Again, measure the speedup for the resulting multithreaded MEX, using the dspunfoldBenchmarkFIRExample function. Speedup increases again, but the output latency is now 60 frames. The general output latency formula is 2 × Threads × Repetition frames. In these examples, the number of Threads is equal to the number of physical CPU cores.
dspunfoldBenchmarkFIRExample;
Speedup = 2.2x
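The latencies reported by dspunfold in this example (12 frames at repetition 1, 60 frames at repetition 5, both with 6 threads) are consistent with an output latency of 2 × Threads × Repetition frames. The following sketch evaluates that expression for the two repetition values used above. The conversion to samples assumes the 4096-sample frame used throughout this example, and the variable names are illustrative.

```matlab
threads    = 6;          % physical CPU cores on the host used in this example
repetition = [1 5];      % the -r values used above (1 is the default)
frameSize  = 4096;       % samples per frame in this example

latencyFrames  = 2*threads*repetition;      % output latency in frames
latencySamples = latencyFrames*frameSize;   % the same latency in samples

for k = 1:numel(repetition)
    fprintf('Repetition %d: latency = %d frames (%d samples)\n', ...
        repetition(k), latencyFrames(k), latencySamples(k));
end
```

Raising the repetition scales the latency linearly, which is the cost to weigh against the additional speedup.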
Detecting State Length Automatically
To request that dspunfold autodetect the state length, specify -s auto. This option generates an efficient multithreaded MEX file, but with a significant increase in the generation time, due to the extra analysis that it requires.
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} ...
    -s auto -f [true,false,false] -r 5
State length: [autodetect] samples, Repetition: 5, Output latency: 60 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 4096 samples ... Sufficient
Checking 2048 samples ... Sufficient
Checking 1024 samples ... Sufficient
Checking 512 samples ... Sufficient
Checking 256 samples ... Insufficient
Checking 384 samples ... Insufficient
Checking 448 samples ... Sufficient
Checking 416 samples ... Insufficient
Checking 432 samples ... Insufficient
Checking 440 samples ... Insufficient
Checking 444 samples ... Insufficient
Checking 446 samples ... Insufficient
Checking 447 samples ... Insufficient
Minimal state length is 448 samples
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
dspunfold checks different state lengths, using as inputs the values provided with the -args option. The function aims to find the minimum state length for which the outputs of the multithreaded MEX and single-threaded MEX are the same. Notice that it found a minimal state length of 448 samples, which matches the value computed manually earlier.
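The sequence of candidates in the log is what a halving phase followed by a bisection would produce. The sketch below reproduces that sequence under stated assumptions and is not the actual dspunfold implementation. The isSufficient function is a hypothetical stand-in for the real check, which compares the outputs of the multithreaded and single-threaded MEX files, and the true state length of 448 is hard-coded only to drive the simulation.

```matlab
% Hypothetical stand-in for the real output-comparison check.
trueStateLen = 448;
isSufficient = @(n) n >= trueStateLen;

frameSize = 4096;
hi = frameSize;          % largest candidate, reported as sufficient in the log
lo = 0;                  % stateless (0) was reported as insufficient
checked = hi;            % candidates in the order they are tried

% Phase 1: halve the candidate until it is no longer sufficient.
while hi > 1
    candidate = floor(hi/2);
    checked(end+1) = candidate; %#ok<AGROW>
    if isSufficient(candidate)
        hi = candidate;
    else
        lo = candidate;
        break
    end
end

% Phase 2: bisect between the largest insufficient value (lo) and the
% smallest sufficient value found so far (hi).
while hi - lo > 1
    mid = floor((lo + hi)/2);
    checked(end+1) = mid; %#ok<AGROW>
    if isSufficient(mid)
        hi = mid;
    else
        lo = mid;
    end
end

fprintf('Candidates checked: %s\n', mat2str(checked));
fprintf('Minimal state length is %d samples\n', hi);
```

Each candidate in the real search requires running both MEX files and comparing their outputs, which accounts for the longer generation time.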
Verify Generated Multithreaded MEX Using the Generated Analyzer
When creating a multithreaded MEX file using dspunfold, the single-threaded MEX file is also created along with an analyzer function. For the stateful example in the previous section, the name of the analyzer is dspunfoldFIRExample_analyzer.
The goal of the analyzer is to provide a quick way to measure the speedup of the multithreaded MEX relative to the single-threaded MEX, and also to check if the outputs of the multithreaded MEX and single-threaded MEX match. Outputs usually do not match when an incorrect state length value is specified.
Execute the analyzer for the multithreaded MEX file, dspunfoldFIRExample_mt, generated previously using the -s auto option.
firCoeffs1_1 = fir1(192,0.8);
firCoeffs1_2 = fir1(192,0.7);
firCoeffs1_3 = fir1(192,0.6);
firCoeffs2_1 = fir1(256,0.2,'High');
firCoeffs2_2 = fir1(256,0.1,'High');
firCoeffs2_3 = fir1(256,0.3,'High');
dspunfoldFIRExample_analyzer((1:4096*3)',[firCoeffs1_1;firCoeffs1_2;firCoeffs1_3],...
    [firCoeffs2_1;firCoeffs2_2;firCoeffs2_3]);
Analyzing multi-threaded MEX file dspunfoldFIRExample_mt.mexa64. For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done.
Latency = 60 frames
Speedup = 2.4x
Each input to the analyzer corresponds to an input of the dspunfoldFIRExample_mt MEX file. Notice that the length (first dimension) of each input is greater than the expected length. For example, dspunfoldFIRExample_mt expects a frame of 4096 doubles for its first input, while 4096*3 samples were provided to dspunfoldFIRExample_analyzer. The analyzer interprets this input as 3 frames of 4096 samples. The analyzer alternates between these 3 input frames circularly while checking if the outputs of the multithreaded and single-threaded MEX files match.
The table shows the inputs used by the analyzer at each step of the numerical check. The total number of steps invoked by the analyzer is 180, or 3 × Latency, where Latency is 60 in this case.
| input1 | input2 | input3
------+----------------+--------------+--------------
Step1 | (1:4096)' | firCoeffs1_1 | firCoeffs2_1
Step2 | (4097:8192)' | firCoeffs1_2 | firCoeffs2_2
Step3 | (8193:12288)' | firCoeffs1_3 | firCoeffs2_3
Step4 | (1:4096)' | firCoeffs1_1 | firCoeffs2_1
... | ... | ... | ...
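The circular selection shown in the table can be expressed with modular indexing. The sketch below is an illustration of that pattern only, not the analyzer's own code. It prints which rows of the first input are used at each of the first few steps, and the variable names are illustrative. For the coefficient inputs, the same frame index would select a row of the stacked coefficient matrix.

```matlab
frameSize      = 4096;   % samples per frame expected by the MEX file
numInputFrames = 3;      % frames supplied to the analyzer for each input
input1 = (1:frameSize*numInputFrames)';

firstRows = zeros(1,5);  % first row used at each step, kept for inspection
for stepNum = 1:5
    % Cycle through frames 1,2,3,1,2,...
    frameIdx = mod(stepNum-1, numInputFrames) + 1;
    rows = (frameIdx-1)*frameSize + (1:frameSize);
    u = input1(rows);                      % frame passed at this step
    firstRows(stepNum) = u(1);
    fprintf('Step %d uses frame %d: (%d:%d)''\n', ...
        stepNum, frameIdx, rows(1), rows(end));
end
```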
NOTE: For the analyzer to correctly check for the numerical match between the multithreaded MEX and single-threaded MEX, provide at least two frames with different values for each input. For inputs that represent parameters, such as filter coefficients, the frames can have the same values for each input. In this example, you could have specified a single set of coefficients for the second and third inputs.