Vec-trick implementation (multiple times)
7 次查看(过去 30 天)
显示 更早的评论
Dear all,
the question is related to Tensorproduct. Since the question was not answered as intended, i want to revisit the question.
Introduction:
Suppose you have a matrix vector multiplication, where a matrix C with size (np x mq) is constructed by a Kronecker product of matrices A with size (n x m) and B with size (p x q). The vector is denoted v with size (mp x 1) or its vectorized version X with size (m x p).
In two dimensions this operation can be performed with O(npq+qnm) operations instead of O(mqnp) operations, see Wikipedia.
Expensive variant (in case of flops):

Cheap variant (in case of flops):

Main question:
I want to perform many of these operations at ones, e.g. 2500000. Example: n=m=p=q=7 with A=size(7x7), B=size(7x7), v=size(49x2500000).
In Tensorproduct i have implemented a MeX-C version of the cheap variant which is quite slower than a Matlab version of the expensive variant provided by Bruno Luong.
Is it possible to implement the cheap version in Matlab without looping?
5 个评论
Bruno Luong
2021-8-23
"Is it possible to implement the cheap version in Matlab without looping"
The method
B = matrix_xi.';
A = matrix_eta;
X = reshape(vector, size(A,2), size(B,1), []);
CX = pagemtimes(pagemtimes(A, X), B); % Reshape CX if needed
given in this thread is cheap version in MATLAB without looping! Granted it is no faster than the expensive method, but it's what you ask for, no ?
ConvexHull
2021-8-23
编辑:ConvexHull
2021-8-23
Your completely right. I over looked the "pagetimes" version.
I am still suprised about the fact that there is no cheap version which is able to outperform the expensive vectorized Matlab version (actually BLAS version).
I mean, the cheap variant is of order O(2*7*7*7)=O(686), whereas the expensive variant is of order O(7*7*7*7)=O(2401), in case of flops.
Thank you!
ConvexHull
2021-8-23
By the way i tried to optimize the C-code in Tensorproduct, which gave me 20% speed-up, however still not usefull.
Bruno Luong
2021-8-23
Because smaller flops doesn't mean necessary faster. Memory access, cache, thread management are as well important, and which is fatest method probably depends on n=m=p=q.
ConvexHull
2021-8-23
编辑:ConvexHull
2021-8-23
Yeah that's definitly the case here.
The main problem is that, if you want to perform the Vec-trick multiple times in a vectorized fashion you have to reorder the datastructure. After applying AX you cannot perform a Matrix-Matrix multiplication directly with B.
Stupid Memory access O.o!
采纳的回答
ConvexHull
2021-8-24
编辑:ConvexHull
2021-8-25
Here is a pure intrinsic Matlab version without loops, however with two transpose operations and quite slow.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v));
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
v2 = reshape(reshape(B*reshape((A*reshape(v,7,[])).',7,[]),7*2500000,[]).',7,[]);
end
toc % Elapsed time is 3.879752 seconds
max(abs(v1(:)-v2(:)))
% 1.4211e-14
22 个评论
Bruno Luong
2021-8-24
According to my test, your code is slightly slower than pagemtimes method that is provided in other thread.
for i=1:n
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
end
ConvexHull
2021-8-24
That makes sense. I think pagetime will not do much different. However, only recently (R2020b) introduced.
ConvexHull
2021-8-24
编辑:ConvexHull
2021-8-24
@Bruno: Do you know a faster way tranposing a matrix in Matlab? :)
Bruno Luong
2021-8-24
Actually MATLAB is smarther than you though, when you do
C = AA * BB.';
Matlab does not call transpose, it call Blas to do matrix multiplication without explicit transposition (no memory copying or moving).
So if you wonder if your code is not efficient because of .', the answer is NO.
ConvexHull
2021-8-24
编辑:ConvexHull
2021-8-24
I don't know what you mean.
- The ().' is far the most expensive operation no matter what is being done in the background.
- Reshape is for free.
- The small 7er matrix-matrix multiplication is cheaper than the 49er big one.
- By the way ()' or ().' are nearly same expensive.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
tic
for i=1:n
vv = reshape(v,7,[]); %#ok<*NASGU>
end
toc % Elapsed time is 0.000186 seconds
tic
for i=1:n
vvv = A*vv;
end
toc % Elapsed time is 0.350487 seconds
tic
for i=1:n
vvvv = (vvv).';
end
toc % Elapsed time is 1.682334 seconds
tic
for i=1:n
vvvvv = reshape(vvvv,7,[]);
end
toc % Elapsed time is 0.000181 seconds
tic
for i=1:n
vvvvvvv = B*vvvvv;
end
toc % Elapsed time is 0.347840 seconds
tic
for i=1:n
vvvvvvvv = reshape(vvvvvvv,7*2500000,[]);
end
toc % Elapsed time is 0.000174 seconds
tic
for i=1:n
vvvvvvvvv = (vvvvvvvv).';
end
toc % Elapsed time is 1.470868 seconds
tic
for i=1:n
vvvvvvvvvv = reshape(vvvvvvvvv,7,[]);
end
toc % Elapsed time is 0.000148 seconds
Bruno Luong
2021-8-24
编辑:Bruno Luong
2021-8-24
No
CC = BB.'
alone take time. However with
CC = AA*BB.';
there is no transposition happens in the background.
As showed here (timings obtained by running the code on TMW server, R2021a)
AA = rand(49);
BB = rand(49,500000*5);
BBt = BB.';
tic
CC = AA*BB;
toc
Elapsed time is 0.631366 seconds.
tic
CC = AA*BBt.'; % MATLAB does not perform transposition then multiplication
% it does both by a Blas implementation
toc
Elapsed time is 0.631350 seconds.
tic
BBtt = BBt.';
CC = AA*BBtt;
toc
Elapsed time is 1.105448 seconds.
ConvexHull
2021-8-24
编辑:ConvexHull
2021-8-24
I understand your remarks. However, my operations are in following order:
reshape(B*reshape(A.')).'
I think the reshape between them is the problem. This makes it expensive.
Bruno Luong
2021-8-24
Oh I see, I missread your code and the transposition is before the reshape.
In the case, yes the tranposition must be carried out explicitly by Matlab. Sorry but I don't know how one can accelerate the transposition.
ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
Suprisingly, even two small 7er matrix-matrix multiplications are slower than one 49er matrix-matrix multiplication.
n=7;m=7;p=7;q=7;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,500000,5);
n = 5;
C = kron(B,A);
tic
for i=1:n
v1 = reshape(C*reshape(v,49,[]),size(v));
end
toc % Elapsed time is 0.456353 seconds
tic
for i=1:n
v2 = reshape(A*reshape(v,7,[]),size(v));
v3 = reshape(B*reshape(v,7,[]),size(v));
end
toc % Elapsed time is 0.683820 seconds
ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
According to https://github.com/MichielStock/Kronecker.jl the vec-trick becomes useful for larger sizes than n=m=p=q=100.
Bruno Luong
2021-8-25
According to my test, it is "cheap" method is faster from size 27.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;

ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
Unfortunately, my problem is restricted to sizes of max n=m=p=q=16. Rather smaller.
ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
With the intrinsic method it is nearly the same.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,1000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t3(i) = toc;
end
close all
semilogy(stab, [t1; t3]');
legend('Expensive method', 'Cheap method using intrinsic operations');
xlabel('s');
ylabel('time [sec]');
grid on;

Bruno Luong
2021-8-25
编辑:Bruno Luong
2021-8-25
Your curves do not clearly cross and not coherent with your finding for size = 7.
ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
Could you make a test between pagetimes and the intrinsic one? Currently, I have no pagetimes available.
Would be great!
ConvexHull
2021-8-25
编辑:ConvexHull
2021-8-25
Yes perhaps the vector size (=1000) was too small. Note that today i use different computer.
Bruno Luong
2021-8-25
编辑:Bruno Luong
2021-8-25
There is an mtimesx in File exchange that you can use for older matlab that does similar task as pagemtimes.
AFAIK pagemtimes is slighly faster.
Bruno Luong
2021-8-25
Results with 3 methods

stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),s*1000,[]).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3]');
legend('Expensive method', 'Cheap method using transposition', 'Cheap method using pagemtimes');
xlabel('s');
ylabel('time [sec]');
grid on;
ConvexHull
2021-8-25
Thanks for the effort. There is a small typo in line "v2 = ... s*1000...". But i don't think it would influence the results much.
Bruno Luong
2021-8-26
Add benchmark with mtimesx

Conclusion
- For version before R2020b, use expensive method for s < 44, use mtimesx otherwise;
- For version R2020b or later, use expensive method for s < 27, use pagemtimes otherwise.
stab = 5:5:100;
t1 = zeros(size(stab));
t2 = zeros(size(stab));
t3 = zeros(size(stab));
t4 = zeros(size(stab));
for i = 1:length(stab)
fprintf('%d/%d\n', i, length(stab));
s = stab(i);
n=s;
m=s;
p=s;
q=s;
A = rand(n,m);
B = rand(p,q);
v = rand(m*p,100000);
tic
C = kron(B,A);
v1 = reshape(C*reshape(v,s*s,[]),size(v));
t1(i) = toc;
tic
v2 = reshape(reshape(B*reshape((A*reshape(v,s,[])).',s,[]),[],s).',s,[]);
t2(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v3 = pagemtimes(pagemtimes(A, X), 'none', B, 'transpose');
t3(i) = toc;
tic
X = reshape(v, size(A,2), size(B,1), []);
v4 = mtimesx(mtimesx(A, X), 'N', B, 'T');
t4(i) = toc;
end
close all
semilogy(stab, [t1; t2; t3; t4]');
legend('Expensive method', ...
'Cheap method using transposition', ...
'Cheap method using pagemtimes', ...
'Cheap method using mtimesx');
xlabel('s');
ylabel('time [sec]');
grid on;
Stefano Cipolla
2023-9-14
编辑:Stefano Cipolla
2023-9-14
Hi there! May I ask if you are aware of implementation of functions similar to "pagemtimes" but able to work with at least one sparse input? Alternatively do you see any easy workaround? More precisely I need someting like
pagemtimes(A, V)
where A is a nxnxn sparse real tensor and V is a real dense nxn matrix...
Bruno Luong
2023-9-14
编辑:Bruno Luong
2023-9-14
@Stefano Cipolla "sparse real tensor"
I'm not aware this native MATLAB class.
But you can put the A as diagonal block of n^2 x n^2 sparse matrix
SA = [A(:,:,1) 0 0 ... 0
0 A(:,:,2) 0 ... 0
...
9=0 0 ... A(:,:,n)]
Do the same expansion for V (with the same block) then solve it
更多回答(0 个)
另请参阅
类别
在 Help Center 和 File Exchange 中查找有关 Data Type Identification 的更多信息
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!发生错误
由于页面发生更改,无法完成操作。请重新加载页面以查看其更新后的状态。
您也可以从以下列表中选择网站:
如何获得最佳网站性能
选择中国网站(中文或英文)以获得最佳网站性能。其他 MathWorks 国家/地区网站并未针对您所在位置的访问进行优化。
美洲
- América Latina (Español)
- Canada (English)
- United States (English)
欧洲
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom(English)
亚太
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)