GPU time slower than CPU time in Mandelbrot set example?

Hi, I'm following the Mandelbrot set example featured on the MathWorks blog: http://blogs.mathworks.com/loren/2011/07/18/a-mandelbrot-set-on-the-gpu/ I'm on Windows 10 with 16GB of RAM; here is my GPU information:
>> gpuDevice
ans =
CUDADevice with properties:
Name: 'Quadro M1000M'
Index: 1
ComputeCapability: '5.0'
SupportsDouble: 1
DriverVersion: 8
ToolkitVersion: 7.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 2.1475e+09
AvailableMemory: 1.5948e+09
MultiprocessorCount: 4
ClockRateKHz: 1071500
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
Here are the results:
The thing is, the GPU version takes much longer than the plain CPU version (the arrayfun version is fine). Why is that? Please help me, thank you very much :)

2 Answers

Your Quadro GPU is not intended for intensive double-precision computation (I can't find published figures, but it will be something like 50 gigaflops, as opposed to 5 teraflops for a proper compute GPU). Try converting the example to single precision. It will probably be about 30 times faster.
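For reference, a minimal single-precision version of the Mandelbrot iteration might look like this (a sketch, not the blog's exact code; the grid size, bounds, and iteration count are placeholders):

```matlab
% Sketch: Mandelbrot escape-time count in SINGLE precision on the GPU,
% so the Quadro's weak fp64 units are avoided.
maxIterations = 500;
gridSize = 1000;
x = gpuArray.linspace(-2, 1, gridSize);     % created on the device (fp64)
[xG, yG] = meshgrid(single(x), single(x));  % cast to single before iterating
z0 = complex(xG, yG);
z  = z0;
count = zeros(gridSize, 'single', 'gpuArray');
for n = 1:maxIterations
    z = z.*z + z0;                 % all arithmetic stays in fp32 on the GPU
    count = count + (abs(z) <= 2); % accumulate iterations before escape
end
count = gather(count);             % one device-to-host transfer at the end
```

Everything stays on the device until the final `gather`, so the timing reflects fp32 compute throughput rather than fp64 or transfer cost.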

7 Comments

The material I am finding suggests that Quadro double-precision throughput should be 1/3 of fp32.
However thinking about past discussions I wonder if this is one of the cases where you need to enable TCC to get good performance?
I'm sorry, I don't understand what TCC is (I googled but did not find anything). And how do I enable it? :( I'm using Windows 10.
"NVIDIA high-end GPUs (Tesla, Quadro, etc) can be configured to run in either Tesla Compute Cluster (TCC) mode or Windows Display Driver Model (WDDM) mode. The difference between the two is that in TCC mode, the cards dedicate themselves completely to compute and are not meant to have a local display. In WDDM mode, they act as both a compute card as well as a GPU for displaying local graphics."
That page also gives instructions on switching modes.
Tried but did not work:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe -L
GPU 0: Quadro M1000M (UUID: GPU-10af5042-4cf4-0ad4-a314-abc9b616b1a8)
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -g 0 -dm 1
Unable to set driver model for GPU 0000:01:00.0: Not Supported
Treating as warning and moving on.
All done
By the way, if my application required calculating k-nearest neighbors, would it be safe to use single-precision floating point? Thanks.
Single precision is usually fine for k-nearest neighbors: distances that differ by less than single-precision resolution are typically meant to be ties anyway. You might end up with a different order among neighbours that are effectively the same distance away.
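To illustrate (a sketch; assumes the Statistics and Machine Learning Toolbox's knnsearch, with random placeholder data and sizes):

```matlab
% Sketch: k-nearest-neighbor search in single precision.
X = single(rand(1000, 8));           % reference points (1000 points, 8 dims)
Q = single(rand(10, 8));             % query points
[idx, d] = knnsearch(X, Q, 'K', 5);  % indices and distances of the 5 NN
% Note: two neighbours whose distances differ by less than fp32
% resolution may come back in either order between runs/precisions.
```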
You can't put mobile GPU chips into TCC mode as far as I'm aware. The basic issue is that you're trying to do high performance computing on a laptop.
Okay, further research says that the M1000M is a Maxwell-architecture GM107-series chip, and that its double-precision performance is 1/32 of its single-precision performance.


This is not uncommon. There is communication overhead with the GPU. The GPU is most effective when you do extensive computation on it with little data transfer (which does not necessarily mean the matrices involved are small). If you do only a little computing on large matrices that have to be transferred, then even though the computation itself might be very fast, you have to wait for the data to transfer in both directions. If you are going to do further computation on the data, leave a copy of it on the GPU even if you also want a CPU copy, so that you do not need to transfer it to the GPU again.
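A sketch of that pattern (array names and sizes are placeholders):

```matlab
% Sketch: create data on the device and reuse the device copy,
% transferring to the host only once at the end.
A = gpuArray.rand(4000);   % created on the GPU: no host-to-device copy
B = A * A.';               % computation stays on the device
Ahost = gather(A);         % keep a CPU copy if you need one...
C = B + A;                 % ...but keep using the device copy A, not Ahost
result = gather(C);        % single device-to-host transfer at the end
```

Using `Ahost` in the expression for `C` would silently pull the work back through the PCIe bus; reusing `A` keeps everything on the device.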

1 Comment

But there was no data transfer from the CPU to the GPU, because the array was created directly on the GPU :( Can you explain this?

