Kernel Analysis
For GPU code generation, the primary mechanism for creating CUDA® kernels is for-loops. The way you write loops in your MATLAB® code has a significant impact on the number of kernels created and on the performance of the generated code. When you generate GPU code, check the diagnostic report to see whether any loop segment has a Loop not parallelized notice. Calls to MATLAB functions in your code may also contain for-loops that carry these notices. For maximum performance, make sure that the compute-intensive loop segments in your code are mapped to kernels and executed in parallel. The following recommendations help you achieve this goal and generate efficient CUDA kernels.
Mapping Nested Loops to Kernels
Condition
Consider a function that has nested for-loops.
function y = foo(x)
    ...
    for i1 = 1:N1
        for i2 = 1:N2
            for i3 = 1:N3
                for i4 = 1:N4
                    ...
                end
            end
        end
    end
Assume that one of the intermediate loops, i3, is not parallelizable. When GPU Coder™ performs loop analysis to create kernels, it considers only the outermost parallel loops i1,i2 and creates a kernel with the outer loop dimensions N1,N2. The loops i3,i4 are within the kernel body and are executed sequentially. However, if the innermost loop i4 has a large number of iterations, then better performance may be achieved by creating kernels for the innermost loop.
Action
There are three ways in which you can parallelize the innermost loop:
Rewrite the code so that the innermost code segment is not within a nested loop.
If the iteration count of the outer loop is small, then wrap the loop range in a call to the coder.unroll function. This function unrolls the for-loop by making a copy of the loop body for each loop iteration. For more information, see coder.unroll.
function y = foo(x)
    ...
    for i1 = coder.unroll(1:N1)
        ...
    end
Make the outer loop bound dynamic, for example by passing it in as a function argument. Parallel loop analysis then fails on the outer loop but succeeds on the inner loops.
function y = foo(x,N1)
    ...
    for i1 = 1:N1
        ...
    end
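As a concrete illustration of the coder.unroll option, here is a minimal sketch. The function scaleChannels, its arguments, and the fixed channel count of 3 are hypothetical and not part of the original example. It assumes MATLAB Coder is installed so that coder.unroll is available. Whether each unrolled copy of the inner loop becomes its own kernel should be confirmed in the code generation report.

```matlab
function y = scaleChannels(x, gain)
% scaleChannels  Hypothetical example: x is 3-by-N, gain is 3-by-1.
% The outer loop has a small, compile-time trip count (3), so it is unrolled.
% The inner loop over N is left as the candidate for kernel creation.
y = zeros(size(x), 'like', x);
n = size(x, 2);
for c = coder.unroll(1:3)      % small outer loop: body copied once per iteration
    for k = 1:n                % large inner loop: iterations are independent
        y(c, k) = gain(c) * x(c, k);
    end
end
end
```

When run in MATLAB rather than in generated code, the function behaves like an ordinary nested loop, so results can be checked before generating code.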
For-Loops with Break
Condition
Loops with break are not supported.
while (i < N)
    ...
    ...
    if (cond2)
        ...
        ...
        break;
    end
end
Action
Remove breaks by creating a guard variable and conditional.
cond = true;
while (i < N)
    if (cond)
        ...
        ...
        if (cond2)
            cond = false;
        end
    end
end
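The same guard pattern applied to a for-loop can be sketched as follows. The function sumUntilNegative is a hypothetical example, not taken from the original text. Note that removing the break only removes one obstacle: this particular loop still carries values between iterations through s and active, so it is not guaranteed to be mapped to a kernel.

```matlab
function s = sumUntilNegative(x)
% sumUntilNegative  Hypothetical example: sum the elements of x up to, but not
% including, the first negative value, written without a break statement.
s = zeros(1, 'like', x);
active = true;                 % guard variable that replaces the break
for k = 1:numel(x)
    if active
        if x(k) < 0
            active = false;    % previously: break
        else
            s = s + x(k);
        end
    end
end
end
```

Once the guard is cleared, the remaining iterations still execute but do no work, which is the price of keeping a fixed trip count.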
Dependence Analysis Parallel Loop Check Fails
Condition
Kernel extraction uses parallel loop dependence analysis. In some cases, loop dependence analysis cannot detect that a for-loop is parallel. The coder.gpu.kernel pragma allows GPU Coder to override dependence analysis and force kernel creation. The caveat is that you must be sure that the loop is a “for-all” loop without inter-iteration dependencies.
Action
Use the coder.gpu.kernel pragma explicitly on each of your for-loops.
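A minimal sketch of the pragma placement is shown here. The function addVectors and its arguments are hypothetical, and the example assumes GPU Coder is installed so that coder.gpu.kernel resolves. The pragma is placed immediately before the loop it applies to. It affects code generation only, so running the function in MATLAB should give the ordinary loop result.

```matlab
function out = addVectors(a, b)
% addVectors  Hypothetical example: element-wise sum of two equally sized vectors.
out = zeros(size(a), 'like', a);
coder.gpu.kernel();            % request kernel creation for the loop that follows
for k = 1:numel(a)
    out(k) = a(k) + b(k);      % iteration k reads and writes only index k
end
end
```

Because each iteration touches only its own index, this loop satisfies the "for-all" requirement. A loop that accumulates into a shared variable would not, and forcing a kernel on it could produce incorrect results.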
Logical Indexing of Arrays
Condition
GPU Coder may not create kernels when logical indexing is used for accessing array elements.
i = (mag ~= 0);
vx(i) = vx(i)./mag(i);
vy(i) = vy(i)./mag(i);
Action
Rewrite the code as a loop and guard the loop body with an appropriate conditional.
for i = 1:numel(mag)
    if (mag(i) ~= 0)
        vx(i) = vx(i)./mag(i);
        vy(i) = vy(i)./mag(i);
    end
end
Unsupported Functions
Condition
Use of unsupported functions, coder pragmas, toolbox functions, and so on inside a loop prevents the loop from becoming a kernel.
Action
Try rewriting unsupported functions using pure MATLAB.
Loop Interchange
Condition
If the smaller loops in a loop nest are the outermost loops, then a kernel might be created with only a subset of the loops in the nest. If the algorithm allows it, always put the largest loops outermost.
Action
Rewrite the loop nest so that the larger loops are the outer loops.
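The interchange can be sketched as follows, using a hypothetical function scaleRows whose input x has many more rows than columns. An earlier version of such code might have iterated over the columns in the outer loop. Here the larger row loop is placed outermost. The interchange is valid in this sketch because every (r, c) iteration is independent of the others.

```matlab
function y = scaleRows(x, w)
% scaleRows  Hypothetical example: x is nBig-by-nSmall with nBig much larger
% than nSmall, and w is 1-by-nSmall.
[nBig, nSmall] = size(x);
y = zeros(nBig, nSmall, 'like', x);
for r = 1:nBig                 % larger loop placed outermost
    for c = 1:nSmall           % smaller loop nested inside
        y(r, c) = w(c) * x(r, c);
    end
end
end
```

If the loop body carried a dependence across r or c, swapping the loops could change the result, so check the algorithm before interchanging.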