Vec-trick implementation (multiple times)

Dear all,
the question is related to Tensorproduct. Since the question was not answered as intended, i want to revisit the question.
Introduction:
Suppose you have a matrix vector multiplication, where a matrix C with size (np x mq) is constructed by a Kronecker product of matrices A with size (n x m) and B with size (p x q). The vector is denoted v with size (mp x 1) or its vectorized version X with size (m x p).
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations, see Wikipedia.
Expensive variant (in case of flops):
Cheap variant (in case of flops):
Main question:
I want to perform many of these operations at ones, e.g. 2500000. Example: n=m=p=q=7 with A=size(7x7), B=size(7x7), v=size(49x2500000).
In Tensorproduct i have implemented a MeX-C version of the cheap variant which is quite slower than a Matlab version of the expensive variant provided by Bruno Luong.
Is it possible to implement the cheap version in Matlab without looping?

5 个评论

"Is it possible to implement the cheap version in Matlab without looping"
The method
B = matrix_xi.';
A = matrix_eta;
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
given in this thread is cheap version in MATLAB without looping! Granted it is no faster than the expensive method, but it's what you ask for, no ?
Your completely right. I over looked the "pagetimes" version.
I am still suprised about the fact that there is no cheap version which is able to outperform the expensive vectorized Matlab version (actually BLAS version).
I mean, the cheap variant is of order O(2*7*7*7)=O(686), whereas the expensive variant is of order O(7*7*7*7)=O(2401), in case of flops.
Thank you!
By the way i tried to optimize the C-code in Tensorproduct, which gave me 20% speed-up, however still not usefull.
Because smaller flops doesn't mean necessary faster. Memory access, cache, thread management are as well important, and which is fatest method probably depends on n=m=p=q.
Yeah that's definitly the case here.
The main problem is that, if you want to perform the Vec-trick multiple times in a vectorized fashion you have to reorder the datastructure. After applying AX you cannot perform a Matrix-Matrix multiplication directly with B.
Stupid Memory access O.o!

请先登录,再进行评论。

 采纳的回答

Here is a pure intrinsic Matlab version without loops, however with two transpose operations and quite slow.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v));
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
v2 = reshape(reshape(B*reshape((A*reshape(v,7,[])).',7,[]),7*2500000,[]).',7,[]);
end
toc % Elapsed time is 3.879752 seconds
max(abs(v1(:)-v2(:)))
% 1.4211e-14

22 个评论

According to my test, your code is slightly slower than pagemtimes method that is provided in other thread.
for i=1:n
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
end
That makes sense. I think pagetime will not do much different. However, only recently (R2020b) introduced.
@Bruno: Do you know a faster way tranposing a matrix in Matlab? :)
Actually MATLAB is smarther than you though, when you do
C = AA * BB.';
Matlab does not call transpose, it call Blas to do matrix multiplication without explicit transposition (no memory copying or moving).
So if you wonder if your code is not efficient because of .', the answer is NO.
I don't know what you mean.
  1. The ().' is far the most expensive operation no matter what is being done in the background.
  2. Reshape is for free.
  3. The small 7er matrix-matrix multiplication is cheaper than the 49er big one.
  4. By the way ()' or ().' are nearly same expensive.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
tic
for i=1:n
vv = reshape(v,7,[]); %#ok<*NASGU>
end
toc % Elapsed time is 0.000186 seconds
tic
for i=1:n
vvv = A*vv;
end
toc % Elapsed time is 0.350487 seconds
tic
for i=1:n
vvvv = (vvv).';
end
toc % Elapsed time is 1.682334 seconds
tic
for i=1:n
vvvvv = reshape(vvvv,7,[]);
end
toc % Elapsed time is 0.000181 seconds
tic
for i=1:n
vvvvvvv = B*vvvvv;
end
toc % Elapsed time is 0.347840 seconds
tic
for i=1:n
vvvvvvvv = reshape(vvvvvvv,7*2500000,[]);
end
toc % Elapsed time is 0.000174 seconds
tic
for i=1:n
vvvvvvvvv = (vvvvvvvv).';
end
toc % Elapsed time is 1.470868 seconds
tic
for i=1:n
vvvvvvvvvv = reshape(vvvvvvvvv,7,[]);
end
toc % Elapsed time is 0.000148 seconds
No
CC = BB.'
alone take time. However with
CC = AA*BB.';
there is no transposition happens in the background.
As showed here (timings obtained by running the code on TMW server, R2021a)
AA = rand(49);
BB = rand(49,500000*5);
BBt = BB.';
tic
CC = AA*BB;
toc
Elapsed time is 0.631366 seconds.
tic
CC = AA*BBt.'; % MATLAB does not perform transposition then multiplication
% it does both by a Blas implementation
toc
Elapsed time is 0.631350 seconds.
tic
BBtt = BBt.';
CC = AA*BBtt;
toc
Elapsed time is 1.105448 seconds.
I understand your remarks. However, my operations are in following order:
reshape(B*reshape(A.')).'
I think the reshape between them is the problem. This makes it expensive.
Oh I see, I missread your code and the transposition is before the reshape.
In the case, yes the tranposition must be carried out explicitly by Matlab. Sorry but I don't know how one can accelerate the transposition.
Suprisingly, even two small 7er matrix-matrix multiplications are slower than one 49er matrix-matrix multiplication.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v));
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
v2 = reshape(A*reshape(v,7,[]),size(v));
v3 = reshape(B*reshape(v,7,[]),size(v));
end
toc % Elapsed time is 0.683820 seconds
According to https://github.com/MichielStock/Kronecker.jl the vec-trick becomes useful for larger sizes than n=m=p=q=100.
According to my test, it is "cheap" method is faster from size 27.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;
Unfortunately, my problem is restricted to sizes of max n=m=p=q=16. Rather smaller.
With the intrinsic method it is nearly the same.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,1000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using intrinsic operations');
xlabel('s');
ylabel('time [sec]');
grid on;
Your curves do not clearly cross and not coherent with your finding for size = 7.
Could you make a test between pagetimes and the intrinsic one? Currently, I have no pagetimes available.
Would be great!
Yes perhaps the vector size (=1000) was too small. Note that today i use different computer.
There is an mtimesx in File exchange that you can use for older matlab that does similar task as pagemtimes.
AFAIK pagemtimes is slighly faster.
Results with 3 methods
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3]');
legend('Expensive method', 'Cheap method using transposition', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;
Thanks for the effort. There is a small typo in line "v2 = ... s*1000...". But i don't think it would influence the results much.
Add benchmark with mtimesx
Conclusion
  • For version before R2020b, use expensive method for s < 44, use mtimesx otherwise;
  • For version R2020b or later, use expensive method for s < 27, use pagemtimes otherwise.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
t4 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),[],s).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v4 = mtimesx(mtimesx(A, X), 'N', B, 'T');
t4(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3; t4]');
legend('Expensive method', ...
'Cheap method using transposition', ...
'Cheap method using pagemtimes', ...
'Cheap method using mtimesx');
xlabel('s');
ylabel('time [sec]');
grid on;
Hi there! May I ask if you are aware of implementation of functions similar to "pagemtimes" but able to work with at least one sparse input? Alternatively do you see any easy workaround? More precisely I need someting like
pagemtimes(A, V)
where A is a nxnxn sparse real tensor and V is a real dense nxn matrix...
@Stefano Cipolla "sparse real tensor"
I'm not aware this native MATLAB class.
But you can put the A as diagonal block of n^2 x n^2 sparse matrix
SA = [A(:,:,1) 0 0 ... 0
0 A(:,:,2) 0 ... 0
...
9=0 0 ... A(:,:,n)]
Do the same expansion for V (with the same block) then solve it

请先登录,再进行评论。

更多回答(0 个)

类别

帮助中心File Exchange 中查找有关 Linear Algebra 的更多信息

产品

版本

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by