Why does my CUDA code run slower on Linux than on Windows?

I am doing 3D non-local means for image stack denoising.
In principle, the code works like this: 1) copy the data from host memory to device memory; 2) do all the necessary calculations on the GPU; 3) copy the results back from device memory to host memory. My code does not use any built-in MATLAB functions; it is only interfaced through mexFunction.
Exactly the same code works fine under both Windows and Linux. The final results are quite similar. I say "similar" rather than identical because I think the per-pixel differences probably come from the truncation of floating-point numbers: for example, the pixel values are about 1 to 5000, and the differences are between 1e-6 and 1e-5.
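One way to check whether per-pixel differences of this size are on the scale of single-precision rounding is to express them in units of the float spacing (ULPs) at each pixel value. The sketch below is only an illustration: `resWin` and `resLinux` are hypothetical names, and the arrays are synthetic stand-ins rather than real results.

```matlab
% Synthetic stand-ins for the two result stacks (hypothetical names and data).
rng(0);
resWin   = single(1 + 4999 * rand(64, 64, 8));   % pixel values roughly in 1..5000
resLinux = resWin + eps(resWin);                 % shift every pixel up by one float spacing

% Absolute difference and the local single-precision spacing at each pixel.
absDiff = abs(double(resWin) - double(resLinux));
spacing = double(eps(max(abs(resWin), abs(resLinux))));

% Difference in ULPs: values of a few units suggest rounding-level noise.
ulpDiff = absDiff ./ spacing;
fprintf('max abs diff: %.3g, max diff in ULPs: %.3g\n', max(absDiff(:)), max(ulpDiff(:)));

% For reference, float spacing at the two ends of the stated value range.
fprintf('eps(single(1)) = %.3g, eps(single(5000)) = %.3g\n', ...
    eps(single(1)), eps(single(5000)));
```

In single precision the spacing is about 1.2e-7 near 1 and about 4.9e-4 near 5000, so what counts as a rounding-level difference depends strongly on the pixel value.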
However, the time needed to process the same dataset differs between Windows 10 and Ubuntu Linux 20.04 LTS. On Linux, the CUDA/NVIDIA GPU calculation takes twice as long as on Windows. In contrast, the same C code executed on the CPU is a little faster on Ubuntu Linux than on Windows, by roughly 10 to 30%. I can understand the C code being faster on the CPU under Linux, because I think Windows probably adds a little more abstraction.
Does anyone have similar experience? Is it possible to fix this difference in CUDA speed?
The GPU is not in "TCC" mode. Under both Windows and Linux, the GPU is also used to display the GUI, because it is the only graphics device.
Thanks!
Qinghai
4 Comments
Hamza Butt 2021-12-17
Hi Qinghai,
Is it safe to assume that you are dual booting the machine and the hardware is the same? Can you let me know what you are using to benchmark execution times? The best way to time operations on the GPU is to use gputimeit, for example:
A = gpuArray.ones(1000);
gputimeit( @() A*A)
Otherwise, timeit can be used to benchmark CPU operations.
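If tic/toc is used instead of gputimeit, it usually helps to run one untimed warm-up call and to synchronise with the GPU before reading the timer. A minimal sketch, assuming a GPU-enabled MATLAB session; `runDenoise` is a hypothetical stand-in for the real mex call:

```matlab
% Stand-in for the MEX call; replace with the real function (hypothetical name).
A = rand(2000, 'single');
runDenoise = @() gather(gpuArray(A) * gpuArray(A));

dev = gpuDevice;          % select and initialise the GPU before timing
runDenoise();             % warm-up call, so one-time start-up cost is not timed
wait(dev);

nRuns = 5;
t = zeros(1, nRuns);
for k = 1:nRuns
    wait(dev);            % make sure no earlier GPU work is still pending
    tic;
    runDenoise();
    wait(dev);            % block until all queued GPU work has finished
    t(k) = toc;
end
fprintf('min %.3f s, median %.3f s over %d runs\n', min(t), median(t), nRuns);
```

Reporting both the minimum and the median over several runs makes it easier to tell a systematic difference between the two operating systems from run-to-run variation.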
Also, what is the absolute difference in timings for the GPU benchmarks between the two? Is it large enough to not be considered noise?
Because this is mex code, it might be slightly tricky to figure out what could be happening, but we can try a few things nonetheless. I assume the mex creation script is the same for both OSes (which would rule out differences in compilation arguments such as optimisation flags) and that you are using mexcuda. It may be that some background process is using the GPU on Linux. You can check GPU utilization from outside MATLAB by running the command:
nvidia-smi
You can then observe the output to see what processes are using the GPU. Are all of the entries in the table expected?
You could also try some profiling with Nsight Compute. Perhaps the profiler shows that a CUDA library call takes longer on Linux than Windows. I can imagine that drivers can cause such discrepancies in even seemingly simple operations like memory allocations.
Nsight Compute comes with CUDA but a quicker (albeit primitive) way of checking things could also be to first compare a benchmark of empty mex files between the OSes. Then, gradually add CUDA code to the mex file and benchmark until you start seeing measurable differences. This will help you isolate the problematic function (if there is one).
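The incremental approach described above can be sketched as a small timing loop. This is only an illustration under stated assumptions: the three stages are stand-ins built from gpuArray operations (with hypothetical names), whereas in practice each handle would call a mex build that contains one more step of the real code.

```matlab
% Small synthetic image stack (hypothetical size, single precision).
vol = rand(32, 128, 128, 'single');

% Stand-in stages; replace each handle with a call to a staged MEX build.
uploadOnly    = @() gpuArray(vol);
uploadCompute = @() sum(gpuArray(vol) .^ 2, 3);
roundTrip     = @() gather(sum(gpuArray(vol) .^ 2, 3));

stageNames = {'upload only', 'upload + compute', 'upload + compute + download'};
stageFcns  = {uploadOnly, uploadCompute, roundTrip};

% gputimeit runs each function several times and synchronises with the GPU.
t = zeros(1, numel(stageFcns));
for k = 1:numel(stageFcns)
    t(k) = gputimeit(stageFcns{k});
end

% Print the time per stage; a large jump between rows points at the added step.
for k = 1:numel(stageFcns)
    fprintf('%-30s %9.3f ms\n', stageNames{k}, 1e3 * t(k));
end
```

Running the same script under both operating systems and comparing the rows side by side would show which step accounts for most of the gap.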
Debugging performance issues can be tricky but I hope this helps.
Qinghai Tian 2021-12-17
I am sorry that I cannot test this anymore. I no longer have the Linux installation.
My computer was set up for dual booting. I was using the tic/toc commands to estimate the time taken by the mex function.
The CUDA code was compiled with the same command (mexcuda -O -v -largeArrayDims) in MATLAB.
The absolute difference between the CUDA timings is exactly a factor of two. My CUDA calculation for an image stack of 120x512x1000 pixels (my data is normally much larger than this sample image stack) took about 20 seconds on Windows, but about 40 seconds on Linux. I suspected that the calculation was being done in single precision on Windows but in double precision on Linux, but I do use the same code with float precision. I would say this difference is not "noise".
It is also not related to the CPU, because the CPU part of the code is very short, and its total running time is only about 1 or 2 seconds.
Nsight Compute may be a better way to test this.
Anyway, thanks a lot for your help. Let me finish my calculations before I do further tests.


Answers (0)
