Matrix multiplication optimization using GPU parallel computation

Dear all,
I have two questions.
(1) How do I monitor GPU core usage when I am running a simulation? Is there any visual tool to dynamically check GPU core usage?
(2) Mathematically the new and old approaches are the same, so why is the new approach 5-10 times faster?
%%% Code for the new approach %%%
M = gpuArray(M);
for nt = 1:STEPs
    if periodicBC          % pseudocode condition: a periodic boundary condition applies
        M = A1*M + A2*f*M;
    else
        M = A1*M;          % diffusion only
    end
end

6 Comments

Just curious: What timings do you get for:
M = (A1 + A2 * f) * M;
Are A1, A2 and f gpuArrays also?
Hi Jan,
The old approach takes about 600 seconds and the new approach about 120 seconds on a consumer GPU (a 5x speedup); on a professional GPU card the speedup is about 10x.
Do you know how we can check the GPU core and memory usage in real time (graphically, if possible)?
Thanks!
@Nick: I do not understand what the "old" and the "new" approaches are. I asked for the speed of:
M = (A1 + A2 * f) * M;
which might avoid a matrix multiplication. Are A1, A2 and f gpuArrays also?
I tried making A1, A2 and f gpuArrays as well. It doesn't improve the calculation speed.
Okay. As far as I understand, you do not want to tell me the speed difference between
M = A1 * M + A2 * f * M;
and
M = (A1 + A2 * f) * M;
and you do not want to show the complete code for the "old" implementation. Then I cannot estimate whether storing the data in "B(t_n)" is a cause of the problem.
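For reference, one fair way to time the two formulations on the GPU is gputimeit, which synchronizes the device so asynchronous kernel launches don't skew the measurement. This is only a sketch; the matrix size N and the use of dense random matrices are placeholder assumptions, not the original problem data:

```matlab
% Placeholder problem size and dense random data, for illustration only.
N  = 2000;
A1 = rand(N, N, 'gpuArray');
A2 = rand(N, N, 'gpuArray');
f  = rand(N, N, 'gpuArray');
M  = rand(N, 1, 'gpuArray');

t1 = gputimeit(@() A1*M + A2*f*M);   % two separate products, as in the question
t2 = gputimeit(@() (A1 + A2*f)*M);   % combined form suggested above
fprintf('separate: %.4f s, combined: %.4f s\n', t1, t2);
```

Note that A2*f*M evaluates left to right as (A2*f)*M, a full matrix-matrix product, so the relative timings depend strongly on the shapes involved.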
Hi Jan,
The following table summarizes the computation times for the different approaches with the GPU enabled and disabled.
The new one-step approach 1 doesn't show any improvement.


Accepted Answer

Matt J 2022-8-18
Edited: Matt J 2022-8-18
Because in your second formulation, there is no need to build a table of non-zero entries for the sparse matrix B. The table-building step requires sorting operations, which your second version avoids.
Also, if B has many columns, it will consume a lot of memory in proportion to the number of columns (independent of the sparsity). That is avoided as well by the second implementation.

10 Comments

Hi Matt,
Thanks for you insights!
(1) I am surprised that the MATLAB compiler didn't optimize this step in the old approach (by substituting it into the new approach). Is it due to MATLAB's line-by-line script execution mode? Would MATLAB optimize the old form into the new one if I compile the program into a standalone EXE?
(2) Do you know how we can check the GPU core and memory usage in real time (graphically, if possible)?
(3) With (2), can we time and monitor the GPU calculation in real time?
Thanks!
I am surprised that the MATLAB compiler didn't optimize this step in the old approach (by substituting it into the new approach). Is it due to MATLAB's line-by-line script execution mode?
If you write code instructing MATLAB to create a matrix B, MATLAB must assume you actually want to use B, even if B is never used later in the code. If, for example, you were to insert a breakpoint in the code, B needs to be available so that you can examine it.
Would MATLAB optimize the old form into the new one if I compile the program into a standalone EXE?
If you convert the code to C/C++ with MATLAB Coder, it might.
Hi Matt,
Do you have any reference about how MATLAB builds a matrix (e.g. the sorting operation you mentioned)? Is it done on the CPU?
Or how does MATLAB handle matrices in general?
I observed about 70% CPU usage after gpuArray is called. What operations does the CPU perform while the GPU is running?
Thanks!
If you build the matrix on the CPU and then transfer it to the GPU, that would explain why you see CPU activity.
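A quick way to confirm where an array actually lives is isgpuarray (a sketch; isgpuarray requires R2020b or later, and the size/density below are placeholders):

```matlab
% Build the sparse matrix on the CPU, then transfer it explicitly once.
A1  = sprand(1e4, 1e4, 1e-3);   % constructed on the CPU (sorting happens here)
A1g = gpuArray(A1);             % one host-to-device transfer

isgpuarray(A1)                  % false: still a CPU array
isgpuarray(A1g)                 % true: lives in GPU memory
```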
Hi Matt, do you have some reference reading related to MATLAB matrix handling?
Thanks!!
There is this rather general doc,
but I want to emphasize that nothing you are seeing is likely related to CPU/GPU transfers or GPU versus CPU differences. It is simply more expensive to create a sparse matrix than to do matrix/vector multiplication with that matrix, even in the plain vanilla case where all processing is done on the CPU (see below). In your case, by avoiding the creation of an additional sparse matrix B, your second version avoids very obvious overhead.
A1=sprand(1e5,1e5,0.001);
A2=sprand(1e5,1e5,0.001);
b=rand(1e5,1);
tic;
B=A1+A2;
toc
Elapsed time is 0.185736 seconds.
tic
B*b;
toc
Elapsed time is 0.039143 seconds.
tic
A1*b; A2*b;
toc
Elapsed time is 0.037451 seconds.
Hi Matt,
I am trying to better understand what you said here: "there is no need to build a table of non-zero entries for the sparse matrix B".
Do you know how MATLAB manages sparse array elements? For example, in a 1000x1000 sparse matrix with only 100 non-zero elements, will MATLAB save the non-zero elements in a table? If so, will any operation on those non-zero elements cause the sorting operations you mentioned above?
Do you have some MATLAB reference about the sparse array handling?
Thanks in advance!
Do you know how MATLAB manages sparse array elements?
Here is some detail on how sparse matrices are stored.
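As a rough illustration (a sketch): MATLAB keeps only the non-zero entries, in compressed column order, so storage scales with nnz rather than with the full dimensions, and each assignment that creates a new non-zero has to update that internal index structure:

```matlab
S = sparse(1000, 1000);   % 1000x1000 matrix with no non-zeros stored yet
S(10, 20) = 4;            % each new non-zero updates the internal index table
S(500, 3) = 5;

[i, j, v] = find(S)       % the stored (row, column, value) triplets, column-ordered
nnz(S)                    % 2 stored non-zeros out of 1e6 positions
whos S                    % memory grows with nnz, not with numel(S)
```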
If so, will any operation on those non-zero elements cause the sorting operations you mentioned above?
If a new sparsity pattern is generated, then it will. Here is another example showing how this can make sparse operations slower than full operations:
N=5000;
A=sprand(N,N,1/5);
B=sprand(N,N,1/5);
tic;
A+B;
toc; %sparse matrix addition
Elapsed time is 0.085529 seconds.
A=full(A); B=full(B);
tic
A+B;
toc %full matrix addition
Elapsed time is 0.049478 seconds.


More Answers (1)

The Windows Task Manager lets you track GPU utilization and memory graphically, and the utility nvidia-smi lets you do it in a terminal window.
Neither the CUDA driver nor the runtime provide access to which core is running what, although you might be able to hand-code something using NVML.
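nvidia-smi can also be polled from within MATLAB via system (a sketch; the --query-gpu fields shown are standard nvidia-smi options, and the call assumes the NVIDIA driver utilities are on the system path):

```matlab
% Poll compute utilization and memory use once per second, ten times.
for k = 1:10
    [status, out] = system( ...
        'nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader');
    if status == 0
        fprintf('%s', out);   % e.g. "87 %, 2600 MiB"
    end
    pause(1);
end
```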

3 Comments

Hi Joss,
Thanks for your tip.
It does consume a lot of CPU power (~70%) after executing the gpuArray command, and it drops to 12% at the end of the GPU simulation.
I observed the dedicated GPU memory usage increase from 0.2 GB to 2.6 GB, but all the GPU performance graphs (e.g. 3D, Copy, Video Encode and Decode) stay at almost 0% usage with very tiny ripples.
I am curious which GPU parameter is the key indicator for matrix multiplication. Would you please advise?
Ah, I forgot that you cannot see utilization information for GeForce cards there, sorry. Those charts are for graphics and so are not relevant for compute (except the memory one).
You'll have to use nvidia-smi.


Release: R2022a
