A fast way to perform sparse matrix-free product with a vector

Question

Paulo Ribeiro 2018-9-4

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/417540-a-fast-way-to-perform-sparse-matrix-free-product-with-a-vector

编辑： Paulo Ribeiro 2018-9-8

Dear all,

I was wondering if it is possible to increase the performance of a matrix-free product of a sparse matrix defined by 3 vectors (rows, columns and values) with another vector {B}. That is:

A(l,m)=v and performing {C}=[A]*{B}

So far I have the following strategy:

function C = mfree_times(l,m,v,B)
     aux = B(m);
     prod = v.*aux;
     C = accumarray(l,prod);
end

My intentions are to increase this performance as much as possible. The bottleneck is majorly given by accumarray function. I am already using l, m and v as gpuArrays. Please share your thoughts. Thanks!

10 个评论
显示 8更早的评论隐藏 8更早的评论

James Tursa 2018-9-6

@Paulo: So, based on your replies, your variables appear to have nothing to do with CSR format sparse matrices ... you simply have 3-tuples of row,column,value to use. My answer below is based on that assumption.

Paulo Ribeiro 2018-9-6

James, Thanks. That’s right. The question is now updated.

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

James Tursa 2018-9-6

1
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/417540-a-fast-way-to-perform-sparse-matrix-free-product-with-a-vector#answer_335760

编辑：James Tursa 2018-9-6

在 MATLAB Online 中打开

Assuming the vectors are 3-tuples of row, column, and value, and that all of the variables involved are full double vectors, here is a naive mex routine to do the multiplication and produce a full vector result. If you are working with R2018a and compiling with the -R2018a option then replace the mxGetPr calls with mxGetDoubles calls.

/*   C = mfree_timesx(r,c,v,B) does the equivalent of:
 *
 *   C = sparse(r,c,v,numel(B),numel(B)) * B
 *
 * where
 *
 *   r,c,v are 3-tuples of values from an MxM matrix:
 *     r = row index (1-based)
 *     c = column index (1-based)
 *     v = values
 *   B = column vector of size Mx1
 *   C = column vector of size Mx1
 *
 * It is assumed that the indexing contained in the r and c vectors are
 * all within an MxM square matrix (i.e., are integers between 1 and M).
 * (The code does *not* check for this. A violation will bomb this routine)
 *
 * Duplicate indexing within the r and c vectors is allowed, and is handled
 * by simply adding the results together (same as what would happen if you
 * did C = sparse(r,c,v,numel(B),numel(B)) * B).
 *
 * All inputs must be full real double vectors.
 *
 * The output is a full real double column vector.
 *
 * Programmer: James Tursa
 *
 */
/* Includes ----------------------------------------------------------- */
#include "mex.h"
/* Gateway ------------------------------------------------------------ */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mwSize i, n, m;
    double *Ar, *Ac, *Av, *B, *C;
      if( nrhs != 4 ) {
          mexErrMsgTxt("Need exactly four double input vectors");
      }
      if( nlhs > 1 ) {
          mexErrMsgTxt("Too many outputs");
      }
      if( !(mxIsDouble(prhs[0]) && mxIsDouble(prhs[1]) && mxIsDouble(prhs[2]) && mxIsDouble(prhs[3])) ||
            mxIsComplex(prhs[0]) || mxIsComplex(prhs[1]) || mxIsComplex(prhs[2]) || mxIsComplex(prhs[3]) ||
            mxIsSparse(prhs[0]) || mxIsSparse(prhs[1]) || mxIsSparse(prhs[2]) || mxIsSparse(prhs[3]) ) {
          mexErrMsgTxt("All inputs must be full real double vectors");
      }
      n = mxGetNumberOfElements(prhs[0]);
      if( mxGetNumberOfElements(prhs[1]) != n || mxGetNumberOfElements(prhs[2]) != n ) {
          mexErrMsgTxt("The first three inputs must be the same size");
      }
      m = mxGetNumberOfElements(prhs[3]);
      if( mxGetM(prhs[3]) != m ) {
          mexErrMsgTxt("The last input must be a column vector");
      }
      Ar = mxGetPr(prhs[0]);
      Ac = mxGetPr(prhs[1]);
      Av = mxGetPr(prhs[2]);
      B = mxGetPr(prhs[3]);
      plhs[0] = mxCreateDoubleMatrix(m,1,mxREAL);
      C = mxGetPr(plhs[0]);
      for( i=0; i<n; i++ ) {
          C[(mwSize)*Ar++ - 1] += *Av++ * B[(mwSize)*Ac++ - 1];
      }
  }

Some sample test code:

N = 5000;
r = ceil(N*rand(N*N,1));
c = ceil(N*rand(N*N,1));
v = rand(N*N,1);
B = rand(N,1);
disp(' ')
disp('mex routine')
tic
x = mfree_timesx(r,c,v,B);
toc
disp('sparse')
tic
s = sparse(r,c,v,numel(B),numel(B))*B;
toc
disp('accumarray')
tic
aux = B(c);
prod = v.*aux;
C = accumarray(r,prod);
toc
disp(' ')
disp('mex vs sparse')
max(abs(x-s))
disp('mex vs accumarray')
max(abs(x-C))
disp('sparse vs accumarray')
max(abs(s-C))
disp(' ')

and results:

mex routine
Elapsed time is 0.150372 seconds.
sparse
Elapsed time is 5.806035 seconds.
accumarray
Elapsed time is 0.666030 seconds.
mex vs sparse
ans =
  1.2960e-011
mex vs accumarray
ans =
     0
sparse vs accumarray
ans =
  1.2960e-011

4 个评论
显示 2更早的评论隐藏 2更早的评论

Paulo Ribeiro 2018-9-7

Dear James,

A GPU version for the 3 tuples and B will be more than welcome. Do you think it is manageable? Many thanks for your attention and work on this thread. Glad to find an expert :)

Best regards,

Paulo

Paulo Ribeiro 2018-9-8

编辑：Paulo Ribeiro 2018-9-8

Here are some values after @James contributions:

Problem: finite element, 2D elasticity, 116.000 nodes. Setup: i7 7700HQ, NVIDIA GTX 1050, Linux (gcc compiler); Solver: matrix-free conjugate gradient, as described on this thread

Solver time for CPU with MEX compilation (mfree_timesx) = 42s;

Solver time for CPU with mfree_times (no MEX) = 317s;

Solver time for GPU with mfree_times (no MEX) = 82s.

Really impressive results with MEX on CPU

请先登录，再进行评论。