Without resorting to FEX/MEX, can the overhead of repeated calls to sparse() be avoided with operations on I, J, V directly?

3 次查看（过去 30 天）

显示更早的评论

Joel Lynch 2022-8-6

0
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1775090-without-resorting-to-fex-mex-can-the-overhead-of-repeated-calls-to-sparse-be-avoided-with-operati

评论： Bruno Luong 2022-8-7

采纳的回答： Bruno Luong

I have a time-marching problem bottlenecked by matrix multiplication

, where the sparse matrix

is the sum of many time-independent sparse matrices with time-dependent scalar coefficients:

I'm aware of FEX solutions that would enable a 3D

to be defined once prior to time-marching, though I would prefer to rule out native MATLAB first. My current thinking is to supplement the I, J, V vectors with an index vector K, and do one of the following:

Generate a cell array of sparse matrices along index K, with multiplied with bsxfun, similar to this. (Probably the slowest option)
Sparse matrix construction with B=sparse(I, J, V.*beta(K), N, N)*X
Avoid sparse matrix entirely with B=accumarray(I, V.*beta(K).*X(J), N, N)

I haven't tested option 1, but option 3 with accumarray runs about 50x slower than option 2 (with or without issparse option). This makes sense, as the multiplication with X in accumarray is done before accumulation, hence on a much larger array.

However, option 2 still appears to be wasting performance. For example, sparse(I, J, V, N, N)*X runs about 50x slower than basic A*X. My initial interpretation of this is that much of the time in sparse is spent on unavoidable summation of common terms in I,J,V. I cant profile sparse internally, but tested this by accumulating V prior to sparse(I, J, V, N, N)*X, which comes out to ~7x slower than A*X.

My conclusion is that there is still some significant & (potentially) avoidable overhead from calling sparse many times unrelated to summation of duplicate indices, and the larger piece of summation that is unavoidable due to the summation of time-dependent

terms co-located in

To boil this down, I have two questions I cannot answer:

Considering the pattern of summation of duplicate indices is always the same (static I/J), can the summation overhead in sparse be improved?
What is the source of the remaining overhead in sparse and can it be avoided?

20 个评论
显示 18更早的评论隐藏 18更早的评论

James Tursa 2022-8-7

编辑：James Tursa 2022-8-7

This is the math as I understand it, although this is not how I would physically implement it. Were I to write code for this, e.g. I would not physically replicate the

into one long vector and multiply all the elements by β as I show, but do the equivalent math virtually and only do the necessary non-zero multiplies. If speed is the issue and I knew there were no inf or NaN present, could skip those checks.

= the individual NxN matrices (static),

Abig = [

] % concatenate the

horizontally into NxNM matrix (static)

x = Nx1 column vector (time dependent)

beta = 1xM row vector (time dependent)

Xbig = x .* beta % a NxM matrix with each column

using virtual array expansion

B = Abig * Xbig(:) % the result of one big matrix*vector multiply

The above is just the notional math. The actual coding would combine the last two steps virtually, and could even be multi-threaded. The coding would also depend on how the

are actually stored (as separate matrices, as one big matrix, as sparse, or as I,J,V triplets).

Does this match the calculations you want, or am I missing something?

Joel Lynch 2022-8-7

编辑：Joel Lynch 2022-8-7

Yes, I think we've arrived at effectively the same solution. Your method hides the X multiplication in with beta, so ABig*XBig is on all elements A_k, but at the same cost as Y*beta in Bruno's solution. Bruno's solution does cut down on the overall size of Y by ignoring elements sparse for all A_k (columns in your ABig matrix), but you could perform the same optimization with your scheme.

My best guess is that any differences between these solutions would be splitting hairs compared to MEX implementation, so it's probably the end of the road for native MATLAB.

Bruno Luong 2022-8-7

@James Tursa yes you get it right. However

In the range of density/size and N_k (m in your notation) your method is still 50% slower than the accumarray solution we come up with,
The runtime scale with N_k and is not bounded by the assumption of limited "bandwidth" (density of A0) stated by @Joel Lynch. It is probably an important requirement for his application.

请先登录，再进行评论。

请先登录，再回答此问题。

采纳的回答

Bruno Luong 2022-8-7

0
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/1775090-without-resorting-to-fex-mex-can-the-overhead-of-repeated-calls-to-sparse-be-avoided-with-operati#answer_1022320

编辑：Bruno Luong 2022-8-7

在 MATLAB Online 中打开

EDIT: Here is the final code. Method used in B2 is the fatest;

M = 10000;
N = 10000;
N_k = 100;
A_k = cell(1,N_k);
for k=1:N_k
    A_k{k} = sprand(M,N,0.0005);
end
N_k = length(A_k);
[M,N] = size(A_k{1});
beta = rand(N_k,1);
X = rand(N,1);
I = cell(1,N_k);
J = cell(1,N_k);
V = cell(1,N_k);
K = cell(1,N_k);
for k=1:N_k
    [I{k},J{k},V{k}] = find(A_k{k});
    K{k} = k + zeros(size(I{k}));
end
I = cat(1,I{:});
J = cat(1,J{:});
K = cat(1,K{:});
V = cat(1,V{:});
[IJ,~,ic] = unique([I,J],'rows');
I2 = IJ(:,1);
J2 = IJ(:,2);
Y = sparse(ic, K, V, size(IJ,1), N_k);
tic
B1=sparse(I, J, V.*beta(K), M, N)*X ;
toc
Elapsed time is 0.252217 seconds.
tic
B3 = sparse(I2,J2,Y*beta,M,N)*X;
toc
Elapsed time is 0.195761 seconds.
tic
B2 = accumarray(I2,(Y*beta).*X(J2),[M,1]);
toc
Elapsed time is 0.096573 seconds.

5 个评论
显示 3更早的评论隐藏 3更早的评论

Joel Lynch 2022-8-7

编辑：Joel Lynch 2022-8-7

在 MATLAB Online 中打开

Thanks for this!

The problem I see is that your Y matrix has a row for every corresponding I/J coordinate in the MxN matrix, which is excessive when A_0 is still sparse. Also, using sprand with large N_k will probably fill up A_0 beyond what is reasonable (I got density of 63% with your script).

Nevertheless, this is what I came up with, which was about 2x faster than your first test (for B1). The basic idea here is to ensure Y's rows are limited to I/J points with at least 1 non-zero A_k value (i.e., all unique Z indicies). Then use the ia/ic pointers from unique to yo-yo the Y matrix into a smaller mtimes operation with X, and back out into the correct size for B.

Z = sub2ind([M,N],I,J);
[Zred, ia, ic] = unique(Z);
I2 = I(ia);
J2 = J(ia);
Y = sparse(ic, K, V, numel(Zred), N_k);
tic
B2 = accumarray(I2,(Y*beta).*X(J2),[N,1]);
toc

I don't have a quick way to test it yet, but I suspect the 2x improvement is still rather conservative. This is because aligning A_k sparsity in a more realistic way (for my problem anyway) where A_0 remains reasonably sparse would mean with the same number of non-zero elements in A_k, but more terms summed in Y*beta before multiplication by X.

Bruno Luong 2022-8-7

编辑：Bruno Luong 2022-8-7

It actualy entirely your method and credit. I simply put the code so as readers can test and understand better the problem and solutions.

Joel Lynch 2022-8-7

在 MATLAB Online 中打开

Yeah, at this point I'm just documenting for future viewers. Feel free to post this up top, I cleaned some things up and added comments. Also added integer type casting to reduced memory.

M = 10000; % A_0/A_k rows
N = 10000; % A_0/A_k columns
N_k = 100; % Number of A_k matricies
% Random beta/X
beta = rand(N_k,1); 
X = rand(N,1); 
% Pick smallest integer type for indices
int_type = 'uint64';
if max(N,M) < intmax("uint16")
    int_type = 'uint16';
elseif max(N,M) < intmax("uint32")
    int_type = 'uint32';
end
% Generate A_k matricies in single stream
I = []; J=[]; K=[]; V=[];
bandwith = round(M*0.25); % Controls final density of A_0
A_k_ref_density = 1e-4; % density of each A_k
for k=1:N_k
    A_k = sprand(M, N, A_k_ref_density);
    [Inew, Jnew] = find(A_k);
    ind = abs(Inew-Jnew)<bandwith; % Keep i/j's within band
    I = [I; cast(Inew(ind), int_type)];
    J = [J; cast(Jnew(ind), int_type)];
    V = [V; A_k(ind)];
    K = [K; repmat(cast(k, int_type), sum(ind), 1)];
    % spy(sparse(I,J,K, M,N)) % Spy A_k
end
%spy(sparse(I,J,K, M,N)) % Spy Final A_0 for non-zero Beta's
% Smaller vector of any non-zero i/j's in all A_k's
% numel(ic) = size([I, J], 1)
%   ic points to the location in the reduced/unique array of [I, J] pairs
%   (i.e. ic holds the row index where each nonzero I/J in A_0 resides)
[IJ, ~, ic] = unique([I, J], 'rows', 'stable');
Iu = IJ(:,1); % unique I in [I, J]
Ju = IJ(:,2); % unique J in [I, J]
Nrow_u = size(IJ, 1);
% Time-Independent Y matrix, such that Y*beta is a column vector of
% non-zero elements in A_0, with I/J defined by Iu, Ju
Y = sparse(ic, double(K), V, Nrow_u, N_k);
% Native approach, construct A_0 within time loop, then A*x
t1 = timeit(@() sparse(I, J, V.*beta(K), M, N)*X)
% Construct sparse, but with the reduced Y
t2 = timeit(@() sparse(Iu, Ju, Y*beta, M, N)*X)
% Optimal Approach, accumarray which avoids sparse()
t3 = timeit(@() accumarray(Iu, (Y*beta).*X(Ju), [M,1]))

请先登录，再进行评论。

类别

MATLAB Mathematics Linear Algebra

在 Help Center 和 File Exchange 中查找有关 Linear Algebra 的更多信息

产品

MATLAB

版本

R2022a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by

Without resorting to FEX/MEX, can the overhead of repeated calls to sparse() be avoided with operations on I, J, V directly?

20 个评论
显示 18更早的评论隐藏 18更早的评论

采纳的回答

5 个评论
显示 3更早的评论隐藏 3更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

Without resorting to FEX/MEX, can the overhead of repeated calls to sparse() be avoided with operations on I, J, V directly?

20 个评论 显示 18更早的评论隐藏 18更早的评论

采纳的回答

5 个评论 显示 3更早的评论隐藏 3更早的评论

更多回答（0 个）

另请参阅

类别

标签

产品

版本

Community Treasure Hunt

20 个评论
显示 18更早的评论隐藏 18更早的评论

5 个评论
显示 3更早的评论隐藏 3更早的评论