Allocating pinned memory in MATLAB MEX with CUDA

I have an application where I call my own CUDA functions from a MEX file. However, the memory transferred can be very big (both input and output), which means that pinned memory could speed up the process quite a lot.
I have seen several posts on the internet and in here mentioning that you cannot use pinned memory (cudaMallocHost) with MATLAB variables; however, all of these are from 2017 or older. Now that we are in 2019, and the Parallel Computing Toolbox, CUDA and MATLAB have changed a lot, is this still true? Can pinned memory still not be used? For applications where memory is critical this is a big drawback.
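For readers unfamiliar with the distinction the question relies on: cudaMallocHost allocates a new page-locked buffer, whereas cudaHostRegister page-locks a buffer that already exists. The standalone sketch below shows only the second mechanism, with a std::vector standing in for memory owned by someone else. It is an illustration, not an answer to the question: whether registering MATLAB-owned data (for example a pointer obtained with mxGetData) is safe or supported is not established anywhere in this thread. The buffer size and variable names are assumptions made for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

int main() {
    const size_t n = 1 << 24;               // assumed size, 64 MiB of floats
    const size_t bytes = n * sizeof(float);

    // Stand-in for a host buffer that was allocated elsewhere
    // (hypothetically, data handed over by MATLAB).
    std::vector<float> host(n, 1.0f);

    // Page-lock the existing allocation in place; no copy is made.
    CUDA_CHECK(cudaHostRegister(host.data(), bytes, cudaHostRegisterDefault));

    float* dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, bytes));

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // With a page-locked source, the copy can run asynchronously on the stream.
    CUDA_CHECK(cudaMemcpyAsync(dev, host.data(), bytes,
                               cudaMemcpyHostToDevice, stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));

    // Undo the registration before the buffer is released by its owner.
    CUDA_CHECK(cudaHostUnregister(host.data()));
    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(dev));

    std::puts("transfer from registered host buffer completed");
    return 0;
}
```

Registration itself takes time, so a scheme like this would be expected to pay off only when the buffer is large or is transferred repeatedly.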
6 comments
Matt J 2019-8-13
Edited: Matt J 2019-8-13
"In any case, for most algorithms and uses of TIGRE, especially when the data is big, the transfer times are just a small fraction of the computational time."
I'm not sure which algorithms you had in mind here, but performance will definitely suffer for ordered subset algorithms if you have to do a transfer after every forward/back projection of a subset. The total data set may be large, but the size of a subset can be small in comparison, and the more subsets you have, the more transfers you will have to do. If I were to undertake the task of creating dedicated gpuArray versions of the forward/back projection modules only, are you saying it would be a highly challenging task?
Ander Biguri 2019-8-13
Hi Matt,
You are absolutely right. In fact, a small test that I did not long ago showed that, particularly for SART (which updates the image projection by projection), a 10x speed-up is expected if the memory transfer is removed and all the data is kept on the GPU.
For industrial/scientific image sizes, SART would still be very slow and is not recommended. For medical images, this improvement may be very welcome.
Now, about modifying TIGRE: it may be a challenging task.
Recently I updated TIGRE (https://arxiv.org/pdf/1905.03748.pdf) to work with multiple GPUs, where the transfer to the CPU may be required: TIGRE now breaks the problem up into chunks if it does not fit on the GPU, which allows bigger reconstructions than before. Modifying this version would be a huge workload, as it would require fairly big changes on the CUDA side, where there is a lot of memory management involved.
However, modifying the older single-GPU version will likely be considerably easier. Some changes in the CUDA code will be required (as it is the part that passes memory in and out of the GPU), but only a few lines do that job. If you were to modify it to have dedicated gpuArrays and succeed, we could find a way to add it to the TIGRE code, and I could add some logic for when to use each of the versions (depending on problem size, number of GPUs, etc.). If you are up for the task, please feel free to email me and we can discuss it further.
Ander
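As a rough idea of what a "dedicated gpuArray version" of a module could look like, here is a minimal MEX sketch that takes a gpuArray, runs a kernel on it, and returns a gpuArray without any host transfer. It uses the mxGPUArray C API from the Parallel Computing Toolbox. The kernel scaleKernel, the error identifiers and the file name are hypothetical placeholders; this is not TIGRE code, and a real projector would need far more memory management than is shown here.

```cuda
#include "mex.h"
#include "gpu/mxGPUArray.h"

// Hypothetical placeholder kernel standing in for a projection operator.
__global__ void scaleKernel(const float* in, float* out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = factor * in[i];
}

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[]) {
    // Initialize the MathWorks GPU API before any other mxGPU call.
    mxInitGPU();

    if (nrhs != 1 || !mxIsGPUArray(prhs[0])) {
        mexErrMsgIdAndTxt("sketch:input", "Expected one gpuArray input.");
    }

    // Wrap the input; the data already lives on the device.
    const mxGPUArray* in = mxGPUCreateFromMxArray(prhs[0]);
    if (mxGPUGetClassID(in) != mxSINGLE_CLASS ||
        mxGPUGetComplexity(in) != mxREAL) {
        mxGPUDestroyGPUArray(in);
        mexErrMsgIdAndTxt("sketch:type", "Expected a real single gpuArray.");
    }
    const float* d_in = static_cast<const float*>(mxGPUGetDataReadOnly(in));

    // Allocate an uninitialized output of the same size on the device.
    const mwSize* dims = mxGPUGetDimensions(in);
    mxGPUArray* out = mxGPUCreateGPUArray(mxGPUGetNumberOfDimensions(in), dims,
                                          mxSINGLE_CLASS, mxREAL,
                                          MX_GPU_DO_NOT_INITIALIZE);
    mxFree((void*)dims);
    float* d_out = static_cast<float*>(mxGPUGetData(out));

    const int n = static_cast<int>(mxGPUGetNumberOfElements(in));
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_in, d_out, 2.0f, n);

    // Hand the result back to MATLAB as a gpuArray; nothing is copied to the host.
    plhs[0] = mxGPUCreateMxArrayOnGPU(out);

    // Destroy the wrappers; the returned mxArray keeps the device data alive.
    mxGPUDestroyGPUArray(in);
    mxGPUDestroyGPUArray(out);
}
```

Assuming the file were saved as scale_sketch.cu (a made-up name), it would be compiled with mexcuda and called as y = scale_sketch(gpuArray(single(x))). The point of the design is that consecutive calls can pass gpuArrays to each other, so an iterative algorithm would not need a host transfer between projections.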


Answers (0)
