Matrix Multiplication on GPU quite slow?

45 views (last 30 days)
Sven on 7 Dec 2017
Commented: Sven on 2 Jan 2018
Hi, I just started using the GPU in MATLAB and hoped for considerable performance gains in matrix multiplication. I did some performance tests and read quite a bit about the topic in different places, but my test results appear quite frustrating, and I found no good explanation online for these mixed results.
First, some hardware info: i5-4590 quad-core 3.30 GHz, 64-bit (Win 7, MATLAB 2016a); GeForce GT 640, 384 CUDA cores, ~1 GHz.
When running the tests, I saw some gains when multiplying two 1024x1024 matrices. But when looping over 200x200 or 500x500 matrices, multiplication on the GPU is slower by roughly the difference in clock speed, while looping over comparable matrix additions is as successful as I had hoped.
I also get different results depending on whether I time with tic/toc or (gpu)timeit.
So here are my timing results, which mostly speak for themselves. The MinExample producing this output is also attached.
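For orientation, the shape of the benchmark is roughly this (a sketch; the attached MinExample is the authoritative version):

n = 1024;
A = rand(n);  B = rand(n);            % double precision by default
gA = gpuArray(A);  gB = gpuArray(B);  % copy inputs to the GPU

% CPU timings
tic; C = A*B; toc
timeit(@() A*B)

% GPU timings
tic; gC = gA*gB; toc                  % can return before the GPU has finished
gputimeit(@() gA*gB)                  % synchronizes, so it measures the full cost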
-------------------------------------
Single Matrix Operation on 1024x1024
-------------------------------------
Standard CPU:
tictoc
Elapsed time is 0.030685 seconds.
timeit
Elapsed time is 0.035352 seconds
Let's check GPU:
tictoc
Elapsed time is 0.000323 seconds.
Elapsed time is 0.000173 seconds.
timeit
Elapsed time is 0.061935 seconds
Elapsed time is 0.061718 seconds
-------------------------------------
Now starting some loops:
-------------------------------------
-------------------------------------
Matrix Addition n=10000:
-------------------------------------
-------------------------------------
Matrix is 600x600
-------------------------------------
Standard CPU:
Elapsed time is 1.675066 seconds.
Let's check GPU:
Elapsed time is 0.123021 seconds.
-------------------------------------
Matrix is 1000x1000
-------------------------------------
Standard CPU:
Elapsed time is 20.782437 seconds.
Let's check GPU:
Elapsed time is 0.119888 seconds.
-------------------------------------
Matrix Multiplication n=1000:
-------------------------------------
-------------------------------------
Matrix is 200x200
-------------------------------------
Standard CPU:
Elapsed time is 0.190912 seconds.
Let's check GPU:
Elapsed time is 0.751289 seconds.
-------------------------------------
Matrix is 500x500
-------------------------------------
Standard CPU:
Elapsed time is 2.620033 seconds.
Let's check GPU:
Elapsed time is 7.402474 seconds.
To summarize: a single 1024x1024 operation takes the CPU around 0.031 s, while for the GPU tic/toc reports only about 0.0003 s but (gpu)timeit about 0.06 s. First confusion: does the choice of timing function really matter that much? Does the GPU really speed things up?
Next, 1000 multiplications of 500x500 matrices take: CPU 2.62 s, GPU 7.40 s. That is losing roughly the clock-speed difference.
For the 10,000 additions of 1000x1000 matrices, the GPU speeds up dramatically, from 20.78 s to 0.12 s.
So is there a consistent way to speed up matrix multiplication with the GPU? Can the exact implementation matter a lot? What slows down the multiplication loop?
Thanks in advance. Best, Sven
2 Comments
Matt J on 7 Dec 2017
Edited: Matt J on 7 Dec 2017
For timing GPU operations, you should use only gputimeit.
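For example (a minimal sketch; the sizes are illustrative):

gA = rand(1024, 'gpuArray');
gB = rand(1024, 'gpuArray');
t = gputimeit(@() gA*gB)   % gputimeit synchronizes the device, unlike tic/toc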
Sven on 11 Dec 2017
Yeah, that is why I compared tic/toc and gputimeit on the single operation at the top. My interpretation is that the CPU can continue while the GPU is still processing; hence tic/toc reports small values vs. the large ones from gputimeit (0.000323 s vs. 0.061935 s). My guess is that this is also why the looped matrix multiplications are slower on the GPU even with tic/toc.
I just wanted to make transparent what the different setups show. But the bottom line still appears to be the same: matrix multiplication does not seem to use GPU parallelization productively.
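A sketch of how one could test this asynchrony hypothesis: wait(gpuDevice) blocks the host until the device has finished, so tic/toc then measures the full cost (sizes illustrative):

gA = rand(1024, 'gpuArray');
gB = rand(1024, 'gpuArray');
tic; gC = gA*gB; toc                    % returns almost immediately: launch cost only
tic; gC = gA*gB; wait(gpuDevice); toc   % blocks until the GPU finishes: full cost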


Answers (2)

Edric Ellis on 8 Dec 2017
I would recommend using GPUBench to get an understanding of your GPU's performance.
Note that your GPU (GT 640) is primarily a display card; high performance is usually achieved by dedicated "compute" cards such as the Tesla or Quadro family. Typically display cards have much worse performance in double precision compared to single precision.
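For instance (a sketch; gpuBench is the entry point of the GPUBench File Exchange submission):

d = gpuDevice    % inspect compute capability, clock rate, and memory of the selected GPU
gpuBench         % run the benchmark suite and open its report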
1 Comment
Sven on 11 Dec 2017
Edited: Sven on 11 Dec 2017
Still, the GT 640 has 384 cores, and addition shows clear profits from parallelization.
There are simply monetary constraints here, but I still hoped 384 cores would show some result. So I am not puzzled that the GPU might be slower in general, due to its lower clock frequency and weakness at double precision. I just rechecked the multiplication loop with single precision, and then CPU and GPU are about equal. But my main point is that I don't see any payoff from parallelization in multiplication, while addition shows gain factors of ~10-200 and barely any difference between 600x600 and 1000x1000 matrices, so that seems to be mainly overhead cost.
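The single-precision recheck was roughly along these lines (a sketch, not the exact test code):

n = 500;
A  = rand(n, 'single');   B  = rand(n, 'single');
gA = gpuArray(A);         gB = gpuArray(B);
tCPU = timeit(@() A*B)        % single-precision multiply on the CPU
tGPU = gputimeit(@() gA*gB)   % single-precision multiply on the GPU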



Sven on 11 Dec 2017
Edited: Sven on 12 Dec 2017
A little addendum: in my ongoing search I found another example on MathWorks that I had overlooked so far, though it mostly comes down to the same question I phrased above.
In Measuring GPU Performance on MathWorks, they show matrix multiplication performance with their Tesla K40c (far beyond my GPU). I also reran their code on mine:
[Plot: their compute-oriented, modern Tesla K40c; all measures in GFLOPS, basically an inverse of runtime]
[Plot: my quite old, display-oriented GT 640]
Their Tesla improves matrix multiplication by a factor of up to around 20.
My GT 640 decreases performance by a fairly constant factor of around 2-4.
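For context, the GFLOPS numbers in those plots come from counting the 2*N^3 floating-point operations of an N-by-N product; the core of the measurement is roughly this (a sketch following that page's approach, not its exact code):

N = 1000;
A  = rand(N);       B  = rand(N);
gA = gpuArray(A);   gB = gpuArray(B);
tCPU = timeit(@() A*B);
tGPU = gputimeit(@() gA*gB);
gflopsCPU = 2*N^3 / tCPU / 1e9
gflopsGPU = 2*N^3 / tGPU / 1e9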
So, as in my examples above, GPU computation just makes things worse for me, supposedly due to clock speed and double precision, and it shows no significant signs of parallelization.
Their Tesla (~3000 CUDA/stream cores) has great bandwidth (288 GB/s) and clock speed (6 GHz), but still a maximum gain of around 20. That does not look like parallelization across some 3000 cores.
So quite a difference from mine, but it does not look as if serious parallelization is going on, despite matrix multiplication appearing to me to be an inherently highly parallel task.
So my question stays the same: why does matrix multiplication not gain strongly on the GPU, despite its supposedly strong parallel structure?
2 Comments
Edric Ellis on 2 Jan 2018
According to the Wikipedia description of NVIDIA Tesla cards, the K40 has a peak FLOP rate of ~1500 GFLOPS in double precision.
Therefore, the first graph you posted shows the gpuArray performance approaching ~80% of peak performance. So it's not clear why you consider this not to represent "serious parallelisation".
Also, please note that a CUDA core is really not comparable with an x86 CPU core - they are much less powerful, and they are not capable of fully independent operation.
According to wikipedia's list of other NVIDIA cards, the GT 640 has a peak processing power in double precision of ~30 GFlops. So, in your case, MATLAB is clearly taking full advantage of your GPU device.
Finally, it's worth bearing in mind that the CPU implementation of MTIMES is highly optimized and multi-threaded using the cores of your CPU.
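To make that concrete with the numbers from this thread (a sketch; the 7.4 s comes from the 1000-iteration 500x500 GPU test above, and ~30 GFLOPS is the Wikipedia figure):

N = 500;  iters = 1000;  tMeasured = 7.4;         % measured GPU loop time, seconds
achievedGFlops = iters * 2*N^3 / tMeasured / 1e9  % ~34 GFLOPS
% That is right at (even slightly above) the ~30 GFLOPS double-precision peak
% of a GT 640, so the multiply loop is essentially compute-bound on the device.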
Sven on 2 Jan 2018
My idea of 'serious parallelisation' went along these lines: ~400 CUDA cores in my GT 640, each less powerful than a CPU core (assume some loss factor of 4-8, for the reasons you named), would still leave a gain factor of around 50-100.
The plots above (GT 640) show a factor of around 0.2-0.3 for matrix multiplication (A*B), i.e. slower on the GPU. Trying the same example, I get high factors when benchmarking basic element-wise operations (the GPU achieves serious gains), while somewhat less basic operations or variations seem to decrease performance severely.
Here is a short list of operations and factors (GPU relative to CPU):
  1. A*B -- 0.2 - 0.3
  2. A+B,A-B,A.*B,A./B -- >100
  3. A.*B - (B - A) -- 1 - 2
  4. A + log(B) -- 2 - 6
I hoped to carry those high factors over from the basic operations to other parallelized operations, and I do not understand what creates this bottleneck. I can imagine the CUDA cores are less powerful than CPU cores, especially on operations like log, but I am confused by the severe drops in performance. And sure, I have no idea what difference the optimization of the CPU-implemented MTIMES makes; that was partly the point of the question.
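One thing that may partly explain items 3 and 4 in the list above: each operator in a composite gpuArray expression launches its own kernel, so the expression pays the launch overhead several times. Fusing it into a single kernel with arrayfun can recover much of the element-wise speedup (a sketch; sizes illustrative):

gA = rand(1000, 'gpuArray');
gB = rand(1000, 'gpuArray');
tUnfused = gputimeit(@() gA.*gB - (gB - gA))                  % three separate kernels
tFused = gputimeit(@() arrayfun(@(a,b) a.*b - (b-a), gA, gB)) % one fused kernel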
