Preconditioning for iterative solvers on GPU - Performance issues
Dear all,
I'm experimenting with some preconditioners for iterative solvers on the GPU for a linear system [A]{x}={B}. The problem is defined by this simple command:
sol=pcg(A_gpu,B_gpu,tol,maxit,P)
where A and B are gpuArrays and P is the preconditioner.
Some simple tests show that, with P=[ ], the solution is faster than any iterative CPU solver, with speedups of up to 12x.
However, what I still can't figure out is why the performance drops whenever any type of preconditioner is selected. For instance, using incomplete Cholesky factorization:
L=ichol(A)
sol=pcg(A_gpu,B_gpu,tol,maxit,L*L')
This destroys the performance compared with using no preconditioner at all on the GPU. The solution is even slower than the CPU version, where the same preconditioner improves CPU performance by 1.5x. That's really strange.
I've also tried passing A_gpu as preconditioner, but the solution takes forever:
sol=pcg(A_gpu,B_gpu,tol,maxit,A_gpu)
This issue also affects other iterative solvers, such as BICG and SYMMLQ.
Am I doing something wrong? It appears that any preconditioner on the GPU is acting as a drawback, even when it is efficient for the CPU version.
Please share your thoughts and experiences. Thanks!
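For reference, on the CPU `pcg` also accepts the two triangular factors as separate arguments, `pcg(A,b,tol,maxit,M1,M2)`, instead of the explicit product `L*L'`. The sketch below compares the two forms on a stand-in problem. The test matrix (`delsq(numgrid('S',60))`), the tolerance, and the iteration limit are assumptions chosen for illustration, not the data from this thread, and it runs on the CPU only, since the comments below indicate that the GPU solver took a single preconditioner matrix at the time.

```matlab
% Stand-in SPD test problem (an assumption; not the poster's matrix).
A = delsq(numgrid('S', 60));      % 2-D Laplacian on a square grid
b = ones(size(A, 1), 1);
tol = 1e-8;
maxit = 500;

L = ichol(A);                     % zero-fill incomplete Cholesky, A ~ L*L'

% Variant 1: one explicit matrix. The product hides the triangular
% structure, so each application is a general sparse solve with M.
M = L * L';
tic; [x1, flag1, ~, iter1] = pcg(A, b, tol, maxit, M); t1 = toc;

% Variant 2: two triangular factors. Each application is two sparse
% triangular solves, M2 \ (M1 \ r).
tic; [x2, flag2, ~, iter2] = pcg(A, b, tol, maxit, L, L'); t2 = toc;

fprintf('M = L*L'':        flag %d, %d iterations, %.3f s\n', flag1, iter1, t1);
fprintf('M1 = L, M2 = L'': flag %d, %d iterations, %.3f s\n', flag2, iter2, t2);
```

Both variants apply the same preconditioner mathematically, so the iteration counts should agree to within rounding; any difference in elapsed time comes from how the preconditioner is applied.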
7 Comments
Walter Roberson
2019-11-14
Remember that GPU processing does not correspond exactly to CPU processing. Users make top-level calls, and MATLAB can use any GPU implementation it deems suitable, not necessarily the same one that would be used on the CPU. In particular, MATLAB can make use of third-party pre-tuned GPU libraries that might not have been designed with preconditioners in mind.
Joss Knight
2019-11-15
Does it take more iterations or is each iteration slower?
On the GPU the preconditioning method is via no-fill ILU which can be slow. What you lose in start-up overhead you are supposed to gain in convergence speed, i.e. it reduces the number of iterations. But it is problem-dependent. It would help if you could provide an example A and B for me to try.
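One way to answer the "more iterations or slower iterations" question is to request the extra outputs of `pcg` and divide the elapsed time by the iteration count. The following is a minimal CPU sketch under stated assumptions: the test matrix is a stand-in (`delsq(numgrid('S',60))`), the Jacobi preconditioner is just an example, and the variable names are hypothetical.

```matlab
% Stand-in SPD problem (an assumption; not the matrix from this thread).
A = delsq(numgrid('S', 60));
b = ones(size(A, 1), 1);
tol = 1e-8;
maxit = 2000;
P = diag(diag(A));                 % Jacobi preconditioner, stays sparse

names = {'no preconditioner', 'Jacobi'};
extra = {{}, {P}};                 % extra pcg arguments for each case
iters = zeros(1, 2);
secs  = zeros(1, 2);
flags = zeros(1, 2);

for k = 1:2
    tic;
    [~, flags(k), ~, iters(k)] = pcg(A, b, tol, maxit, extra{k}{:});
    secs(k) = toc;
    % Report total iterations and the average cost of one iteration.
    fprintf('%-18s flag %d, %4d iterations, %.2e s per iteration\n', ...
        names{k}, flags(k), iters(k), secs(k) / max(iters(k), 1));
end
```

If the iteration count drops but the time per iteration rises by a larger factor, the preconditioner application is the bottleneck rather than the convergence behaviour. The same measurement should carry over to gpuArray inputs, with a `wait(gpuDevice)` before `toc` so that pending GPU work is included in the timing.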
Paulo Ribeiro
2019-11-15
Edited: Paulo Ribeiro
2019-11-16
Joss Knight
2019-11-21
The explanation as to why certain preconditioners work or do not work is beyond my expertise. I know that preconditioning on the GPU uses a different algorithm and so would expect different behaviour than the CPU.
Your NVIDIA RTX 2080 SUPER does not have good double precision performance.
At 316 GFLOPS it is 32x slower than its single precision performance and, likely, slower than your CPU for the kinds of hybrid computations that are happening in the iterative solvers - perhaps in particular for computing the ILU. I would recommend using your CPU when you are using a preconditioner.
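The single-versus-double gap can be checked on a given card with a rough throughput measurement. This is only a sketch: it times a dense matrix product with `gputimeit`, which measures arithmetic throughput, whereas sparse iterative solvers are often limited by memory bandwidth, so the ratio is an indication rather than a prediction. The matrix size `n` is an arbitrary assumption, and the block skips the measurement when `canUseGPU` reports no usable device.

```matlab
n = 2048;                          % arbitrary problem size (assumption)
ratio = NaN;                       % double time / single time, if measured

if canUseGPU
    dev = gpuDevice;
    As = rand(n, 'single', 'gpuArray');
    Ad = rand(n, 'double', 'gpuArray');

    ts = gputimeit(@() As * As);   % seconds per single precision product
    td = gputimeit(@() Ad * Ad);   % seconds per double precision product

    flopCount = 2 * n^3;           % multiply-adds in a dense n-by-n product
    fprintf('%s: single %.0f GFLOPS, double %.0f GFLOPS\n', ...
        dev.Name, flopCount / ts / 1e9, flopCount / td / 1e9);
    ratio = td / ts;
else
    disp('No supported GPU found; skipping the measurement.');
end
```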
Joss Knight
2019-11-21
For what it's worth, these are the results I got for your data on a Titan V, which has around 7 TFLOPS in double precision. I saw similar issues when passing the reconstructed Cholesky or ILU factors. I can't explain that, but perhaps the sparsity pattern is just a really poor match for the GPU factorization algorithm. We do intend to provide a future enhancement that will allow two triangular preconditioners to be passed to the solver so that the decomposition can be done independently.
>> Ag = gpuArray(A);
>> Bg = gpuArray(B);
>> P = diag(diag(A));
>> tic; pcg(A,B,1e-5,6000); toc
pcg converged at iteration 5346 to a solution with relative residual 1e-05.
Elapsed time is 25.906744 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000); toc
pcg converged at iteration 5345 to a solution with relative residual 1e-05.
Elapsed time is 1.399854 seconds.
>> tic; pcg(A,B,1e-5,6000,P); toc
pcg converged at iteration 5501 to a solution with relative residual 1e-05.
Elapsed time is 34.181677 seconds.
>> tic; pcg(Ag,Bg,1e-5,6000,P); toc
pcg converged at iteration 5502 to a solution with relative residual 9.8e-06.
Elapsed time is 2.404074 seconds.
In other words, preconditioner or no, the GPU is giving a great performance improvement.
Paulo Ribeiro
2019-11-21
Edited: Paulo Ribeiro
2019-11-22
Joss Knight
2019-11-25
I investigated further and found that applying the preconditioner - not just decomposing it - does appear to be taking an unusually long time. This does warrant further investigation, since these two triangular solves should be fast, and your system matrix is band-diagonal. It does have quite a large bandwidth of 543 however, so that could be the issue.
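The bandwidth and the cost of one preconditioner application can be measured directly. The sketch below is a CPU-only illustration on a stand-in matrix (an assumption, not the system from this thread): it reads the bandwidth with `bandwidth`, then uses `timeit` to compare the two triangular solves of an incomplete Cholesky preconditioner against one sparse matrix-vector product.

```matlab
% Stand-in banded SPD matrix (an assumption; not the poster's system).
A = delsq(numgrid('S', 60));
r = ones(size(A, 1), 1);

[lo, up] = bandwidth(A);           % lower and upper bandwidth of A
fprintf('lower bandwidth %d, upper bandwidth %d\n', lo, up);

L = ichol(A);                      % zero-fill incomplete Cholesky factor
Lt = L';                           % form the transpose once, outside the timing

z = Lt \ (L \ r);                  % one preconditioner application
tApply = timeit(@() Lt \ (L \ r)); % two sparse triangular solves
tMult  = timeit(@() A * r);        % one sparse matrix-vector product

fprintf('apply preconditioner: %.2e s, A*r: %.2e s, ratio %.1f\n', ...
    tApply, tMult, tApply / tMult);
```

On the CPU the two solves usually cost a small multiple of a matrix-vector product. Triangular solves are inherently sequential along the dependency chain, so the same ratio can look very different on a GPU.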
Iterative solvers are always faster than direct solves for large sparse matrices (assuming they have reasonable convergence properties). Direct solves are hugely memory intensive because there is a lot of fill-in during factorization.
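The fill-in mentioned above is easy to see by counting nonzeros. The following sketch, again on a stand-in matrix that is an assumption rather than the data from this thread, compares the lower triangle of `A` with a full sparse Cholesky factor (with and without an `amd` reordering) and with the zero-fill `ichol` factor used as a preconditioner.

```matlab
% Stand-in sparse SPD matrix (an assumption; not the poster's system).
A = delsq(numgrid('S', 60));

nnzA   = nnz(tril(A));             % storage for the lower triangle of A
R      = chol(A);                  % complete factorization, natural ordering
p      = amd(A);                   % fill-reducing permutation
Rp     = chol(A(p, p));            % complete factorization after reordering
Lic    = ichol(A);                 % zero-fill incomplete factorization

fprintf('nnz(tril(A))        = %d\n', nnzA);
fprintf('nnz(chol(A))        = %d\n', nnz(R));
fprintf('nnz(chol(A(p,p)))   = %d\n', nnz(Rp));
fprintf('nnz(ichol(A))       = %d\n', nnz(Lic));
```

The complete factors hold many more nonzeros than `A` itself, while the zero-fill incomplete factor keeps the sparsity pattern of the lower triangle, which is why it is cheap to store but only approximates `A`.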
Answers (0)