Using a "CUDAKernel" type object within a parfor loop
Hello,
Thanks in advance for your help.
I'm trying to make use of a CUDAKernel object from within a parfor loop, but when I do so I am met with the message:
Warning: Cannot load an object of class 'CUDAKernel':
No matching constructor signature found.
> In parallel.internal.pool.optionallyDeserialize (line 7)
In parallel.internal.parfor.cppRemoteParallelFunction (line 25)
As a workaround, I have tried passing in a string containing the kernel name and constructing the kernel inside the loop, as follows:
cudaKernel = parallel.gpu.CUDAKernel(cudaKernelName+".ptx", cudaKernelName+".cu");
However, constructing the kernel directly in each iteration of the parfor loop greatly reduces performance.
I am wondering, is there any efficient way that I can pass an existing CUDAKernel object into a parfor loop?
Joseph
Accepted Answer
Joss Knight
2023-6-22
A CUDAKernel object cannot be serialized, as you've found, so you will need to construct it separately on each worker. However, you can do this efficiently using parallel.pool.Constant:
kernelConst = parallel.pool.Constant(@()parallel.gpu.CUDAKernel(cudaKernelName+".ptx", cudaKernelName+".cu"));
Then inside the parfor you access the object using kernelConst.Value:
parfor idx = 1:numIterations
% ...
feval(kernelConst.Value, args, etc);
end
This efficiently constructs the object on each worker just once.
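As a rough sketch of how the pieces above might fit together, the snippet below builds the kernel once per worker and also sets its launch configuration at that point. The file names vectorAdd.ptx and vectorAdd.cu, the kernel signature in the comment, and the helper buildKernel are all hypothetical and not taken from the original question.

```matlab
% Assumed (hypothetical) kernel in vectorAdd.cu, already compiled to vectorAdd.ptx:
%   __global__ void vectorAdd(const float* a, const float* b, float* c, int n)
n = 1024;
numIterations = 10;

% The function handle runs once on each worker, not once per iteration.
kernelConst = parallel.pool.Constant(@() buildKernel("vectorAdd", n));

results = cell(1, numIterations);
parfor idx = 1:numIterations
    k = kernelConst.Value;                    % kernel already built on this worker
    a = rand(n, 1, "single", "gpuArray");
    b = rand(n, 1, "single", "gpuArray");
    c = zeros(n, 1, "single", "gpuArray");
    c = feval(k, a, b, c, int32(n));          % non-const pointer arguments come back as outputs
    results{idx} = gather(c);                 % copy the result back to host memory
end

function k = buildKernel(name, n)
    % Hypothetical helper: construct the kernel and set its launch configuration.
    k = parallel.gpu.CUDAKernel(name + ".ptx", name + ".cu");
    k.ThreadBlockSize = [256, 1, 1];
    k.GridSize = [ceil(n / 256), 1, 1];
end
```

Setting ThreadBlockSize and GridSize inside the builder keeps the per-iteration work down to allocating data and calling feval.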
Note that if you don't have a multi-GPU setup, the execution of your kernel object will be serialized anyway, which means the kernels on each worker will not be running at the same time.
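One way to act on this, sketched under the assumption that the machine has at least one supported GPU, is to size the pool to the number of available GPUs and then check which device each worker selected. To my understanding, workers in a pool pick different GPUs by default when several are present, but it is worth verifying on your own system.

```matlab
% Size the pool to the number of available GPUs so workers do not share a device.
numGpus = gpuDeviceCount("available");
assert(numGpus >= 1, "No supported GPU is available.");

pool = gcp("nocreate");          % reuse an existing pool if there is one
if isempty(pool)
    pool = parpool(numGpus);     % one worker per GPU
end

% Ask every worker which GPU it is currently using.
spmd
    d = gpuDevice;
    gpuIndex = d.Index;
end

for w = 1:pool.NumWorkers
    fprintf("Worker %d uses GPU %d\n", w, gpuIndex{w});
end
```

If several workers report the same index, their kernels will take turns on that device rather than run concurrently.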
More Answers (1)
Aditya Singh
2023-6-22
Edited: Aditya Singh
2023-6-22
Hello Joseph,
As I understand it, you ran into a problem using a CUDAKernel inside a parfor loop, and your workaround constructs a new kernel instance on every iteration, which you want to avoid.
One way to avoid rebuilding the kernel each time is to use a pre-compiled kernel. This approach assumes that the kernel code changes little, if at all, between iterations.
Suppose that you have a CUDA kernel called `vectorAddKernel` that adds two vectors of size `n` and writes the result to a third vector of the same size. The kernel code is presented below:
% Kernel source code; save it as vectorAddKernel.cu.
% A templated kernel cannot be declared extern "C", so n is passed as an argument.
kernelSource = ['#include <cuda.h>' newline ...
'extern "C" __global__ void vectorAddKernel(const float* A, const float* B, float* C, int n)' newline ...
'{' newline ...
' int i = threadIdx.x + blockDim.x * blockIdx.x;' newline ...
' if (i < n) {' newline ...
' C[i] = A[i] + B[i];' newline ...
' }' newline ...
'}'];
% Define the kernel parameters
n = 1024;
numIterations = 10;
% Compile the kernel code to vectorAddKernel.ptx beforehand (for example with nvcc -ptx).
% The kernel is then built once per worker, because a CUDAKernel object
% cannot be sent from the client into the parfor loop.
cproto = 'const float*, const float*, float*, int';
kernelConst = parallel.pool.Constant(@() parallel.gpu.CUDAKernel('vectorAddKernel.ptx', cproto, 'vectorAddKernel'));
output = cell(1, numIterations);
% Inside the parfor loop, call the compiled kernel with the necessary arguments
parfor idx = 1:numIterations
k = kernelConst.Value;
% The launch configuration is set through kernel properties, not feval arguments
k.ThreadBlockSize = [256,1,1]; % block size
k.GridSize = [ceil(n/256),1,1]; % grid size
C = zeros(n, 1, 'single', 'gpuArray');
% Non-const pointer arguments are returned as outputs of feval
output{idx} = feval(k, rand(n, 1, 'single', 'gpuArray'), rand(n, 1, 'single', 'gpuArray'), C, int32(n));
end
For reference, kindly see:
- Kernel executable on GPU - MATLAB - MathWorks India
- Decide When to Use parfor - MATLAB & Simulink - MathWorks India
- Nested parfor and for-Loops and Other parfor Requirements - MATLAB & Simulink - MathWorks India
Hope this helps!
4 Comments
Joss Knight
2023-6-22
You can directly compile to PTX from MATLAB using mexcuda with the -ptx option. You should probably add that (and check that the code runs).
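A minimal sketch of that step, assuming a release where mexcuda supports the -ptx option and a working CUDA toolchain; the file name vectorAdd.cu and its kernel are hypothetical and written out here only so the example is self-contained:

```matlab
% Write a small (hypothetical) CUDA source file so the example is self-contained.
src = [ ...
    "extern ""C"" __global__ void vectorAdd(const float* a, const float* b, float* c, int n)"
    "{"
    "    int i = threadIdx.x + blockDim.x * blockIdx.x;"
    "    if (i < n) { c[i] = a[i] + b[i]; }"
    "}"];
fid = fopen("vectorAdd.cu", "w");
fprintf(fid, "%s\n", src);
fclose(fid);

% Compile to PTX from within MATLAB; this should produce vectorAdd.ptx
% in the current folder.
mexcuda('-ptx', 'vectorAdd.cu');

% Build the kernel from the generated PTX file and the source prototype.
k = parallel.gpu.CUDAKernel("vectorAdd.ptx", "vectorAdd.cu");
disp(k.EntryPoint);
```

Compiling once on the client this way leaves a PTX file that each worker can then load when it constructs its own kernel object.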