Failed to generate large CUDA kernel in GPU Coder with FFT function inside

I am trying to parallelize my code on the GPU.
I have converted the code with the "main.m" script as attached, but the MEX code on the GPU is much slower than the M-code on the CPU. I understand that the GPU is not suitable for such a small data size, but it takes much, much longer on the GPU even when a bigger data size is used.
I then checked the profiling timeline and found that many small CUDA kernels are created and the overall GPU utilization is low. After some debugging, I found that when the fft command is used, GPU Coder fails to generate a large CUDA kernel.
I think the performance could be improved significantly if the fft could be incorporated inside one CUDA kernel, as in the situation without fft. FFT is needed. I have tried searching on Google, but found nothing relevant. Can you provide any information about this or any solution? The output of gpuDevice is also provided in the attachment.
Here is the profiling timeline without fft.
Here is the profiling timeline with fft.

Answers (1)

Justin Hontz 2024-9-18
Hi He,
In your M-code for RandCopy, the for loop cannot be executed as a GPU kernel (even with the coder.gpu.kernel pragma) because of the fft / ifft calls inside the loop. This is because fft is implemented using its own specialized GPU kernel, and GPU Coder does not support nested kernel execution. Consequently, the for loop runs sequentially, which explains why you see thousands of small kernel instances in the performance analyzer timeline graph.
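For illustration, the problematic pattern is roughly the following (a hedged sketch; the actual RandCopy.m is in your attachment and may differ):
function Data = RandCopy(Data) %#codegen
% The pragma below cannot take effect: each fft / ifft call launches its
% own cuFFT kernel, nested kernel execution is unsupported, and so the
% loop runs sequentially over slices of the array.
coder.gpu.kernel();
for k = 1:size(Data,1)
    Tmp = fft(Data(k,:));      % one cuFFT launch per iteration
    Tmp = Tmp + (1 + 1i);
    Tmp = Tmp * (1564 + 798i);
    Data(k,:) = ifft(Tmp);     % another cuFFT launch per iteration
end
end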
To improve the performance of your code, you will want to perform your computation using only a single fft / ifft call that operates on the entire input array instead of individual slices. Something like this should work:
Tmp = fft(Data,[],2);    % one batched FFT along dim 2 of the whole array
Tmp = Tmp + (1 + 1i);
Tmp = Tmp * (1564 + 798i);
Data = ifft(Tmp,[],2);   % one batched inverse FFT over the whole array
After making the change on my end, the performance analyzer report shows a significant performance improvement, with the timeline graph looking similar to the original one without fft.
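For reference, generating and calling the GPU MEX for the vectorized version can look something like this (the input size below is only illustrative):
cfg = coder.gpuConfig('mex');
Data = complex(rand(256,4096), rand(256,4096));  % illustrative size
codegen -config cfg RandCopy -args {Data}
Data = RandCopy_mex(Data);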
4 Comments
He Da 2024-9-19
I fully understand the benefit of computing on the entire array, which is the way I have worked for years. However, it is inherently not suitable here. I have tried disabling cuFFT in the coder config, which results in thousands of memory copies between the host and device. Maybe it requires other optimization.
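For reference, the change I tried was along these lines (a sketch; EnableCUFFT is the GPU config flag that controls cuFFT use):
cfg = coder.gpuConfig('mex');
cfg.GpuConfig.EnableCUFFT = false;  % fft no longer lowered to cuFFT calls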
The NVIDIA documentation says: "NVIDIA cuFFT introduces cuFFTDx APIs, device side API extensions for performing FFT calculations inside your CUDA kernel. Fusing numerical operations can decrease the latency and improve the performance of your application."
It seems that cuFFT can be called from device code. Hopefully you can show me how to use cuFFTDx in RandCopy.m. Perhaps that is asking too much.
Justin Hontz 2024-9-19
GPU Coder currently does not support generating direct calls to the cuFFTDx API. That said, you may still be able to call into the API indirectly in the generated code if you are willing to write your own CUDA wrapper function that uses the API directly. This can possibly be achieved by invoking the wrapper function inside the for loop of your M-code via coder.ceval. The call would look something like this:
coder.ceval('-gpudevicefcn', 'myFFTWrapper', coder.ref(data), ...);
The -gpudevicefcn flag indicates that the wrapper function is meant to be executed by a GPU thread rather than by the CPU.
Note that I have not tried this approach on my end, so I cannot guarantee that it will work without issue.
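Fleshed out, the loop might look something like the sketch below. Note that myFFTWrapper and fft_wrapper.h are placeholders for CUDA code you would need to write yourself against cuFFTDx; this is untested.
function Data = RandCopyDx(Data) %#codegen
coder.cinclude('fft_wrapper.h');   % hand-written header wrapping cuFFTDx
coder.gpu.kernel();
for k = 1:size(Data,1)
    row = Data(k,:);
    % Run the hand-written wrapper as a device function inside the kernel
    coder.ceval('-gpudevicefcn', 'myFFTWrapper', ...
        coder.ref(row), int32(numel(row)));
    Data(k,:) = row;
end
end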
