Major speed reduction (~50x) when doing multiple matrix multiply operations in a row. Is this a (major) bug? Any ideas for a fix?
The code I am running more or less implements the following:
% PREALLOCATE. For the current tests I'm running: N=1200, M=2049, and L=256
dataCur = <some [N x N x M] array>
dataOut = zeros(L,N,N);
% START PRIMARY / OUTER LOOP
for kkOuter = 1:L
    % COMPUTE 'dataOut' SLICE FOR CURRENT OUTER LOOP ITERATION
    % NOTE: this is the same as doing "dataOut(kkOuter,:,:) = reshape(sum(dataCur,3),[1,N,N])"
    dataOut(kkOuter,:,:) = reshape((reshape(dataCur,[],M)*ones(M,1)).',[1,N,N]);
    % UPDATE 'dataCur' BY LOOPING THROUGH IT ONE SLICE AT A TIME USING AN INNER LOOP
    for kkInner = 1:M
        W = someFunction(kkInner); % this is a [N x N] array
        dataCur(:,:,kkInner) = transpose(W) * dataCur(:,:,kkInner) * W;
    end
end
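The note in the loop claims that the reshape-and-multiply expression equals a sum along the third dimension. A minimal sketch that checks this on a small random array; the sizes and the `rand` test data are assumptions for illustration, not the poster's data:

```matlab
% Small sizes so the check runs quickly; the post uses N=1200, M=2049.
N = 4; M = 5;
rng(0);                                   % reproducible test data
dataCur = rand(N, N, M);                  % stand-in for the real [N x N x M] array

% Formulation from the post: flatten to [N*N x M], multiply by a ones vector.
viaMtimes = reshape((reshape(dataCur, [], M) * ones(M, 1)).', [1, N, N]);

% Direct formulation: sum along the third dimension.
viaSum = reshape(sum(dataCur, 3), [1, N, N]);

% The two should agree up to floating-point rounding.
maxAbsDiff = max(abs(viaMtimes(:) - viaSum(:)));
fprintf('max abs difference: %g\n', maxAbsDiff);
```

The two forms can differ in the last few bits because the summation order inside the BLAS call is not necessarily the same as in `sum`.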
The evolution of the code's efficiency is shown here (note: the x axis is actually inner loop iterations / 10, since I collected time data every 10th iteration of the inner loop).
Clearly something is wrong.
Note that there is a "shift" where it goes from a "slow mode" of ~0.5 iterations per second to a "fast mode" of ~20 iterations per second. This can be seen in the upper image. The lower image shows the overall average rate since the beginning of the current outer loop iteration, so the sudden shift is less visible there.
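For anyone wanting to reproduce the two curves, here is a rough sketch of how the rates could be collected with `tic`/`toc`, sampling every 10th inner iteration. The small sizes and the fixed orthogonal `W` are assumptions that stand in for the poster's `someFunction`:

```matlab
N = 64; M = 200; sampleEvery = 10;        % small sizes for illustration only
rng(0);
dataCur = rand(N, N, M);
[W, ~] = qr(rand(N));                     % hypothetical stand-in for someFunction(kkInner)

numSamples = floor(M / sampleEvery);
elapsed = zeros(numSamples, 1);           % cumulative seconds at each sample point
tStart = tic;
for kkInner = 1:M
    dataCur(:,:,kkInner) = W.' * dataCur(:,:,kkInner) * W;
    if mod(kkInner, sampleEvery) == 0
        elapsed(kkInner / sampleEvery) = toc(tStart);
    end
end

% Rate over each block of 10 iterations (corresponds to the upper plot).
instRate = sampleEvery ./ diff([0; elapsed]);
% Average rate since the start of the loop (corresponds to the lower plot).
avgRate = (sampleEvery * (1:numSamples).') ./ elapsed;
```

Plotting `instRate` and `avgRate` against the sample index gives an x axis in units of inner iterations / 10, as described above.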
.
WHAT I'VE FIGURED OUT
The problem seems to be related to thread scheduling somehow or another. In that link the first 2 images show it running in "fast mode" (on linux mint), and the last 3 show it running in "slow mode".
Unfortunately, that's about all I've got. I've ruled a few things out (see below), but have no idea why this consistently keeps happening.
.
WHAT I'VE RULED OUT
1) The OS. Unless Windows 10 and Linux Mint 19 both have this exact same issue.
2) MKL. Unless MKL 2017.0, 2018.0, 2018.3 and 2019.0 all have the exact same problem. 2018.3 and 2019.0 were used by launching MATLAB with the following command (this shows 2019.0 on Linux; the commands for Windows and for 2018.3 are fairly analogous):
/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/bin/mklvars.sh intel64 ilp64 && export "BLAS_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "LAPACK_VERSION=/opt/intel/compilers_and_libraries_2019.0.117/linux/mkl/lib/intel64_lin/libmkl_rt.so" && export "MKL_THREADING_LAYER=INTEL" && export "MKL_INTERFACE_LAYER=ILP64" && "/usr/local/MATLAB/$(ls /usr/local/MATLAB | sort -g | tail -n 1)/bin/matlab"
So, unless the processor is straight ignoring the requested thread scheduling for a while (which idk if that is even possible), MATLAB seems the most likely culprit.
3) Memory fragmentation. I thought this might be some weird physical vs virtual memory issue where MATLAB's virtual addresses were sequential but the corresponding physical addresses got more and more fragmented, but I've tried a) implementing the inner loop in a separate function, and b) forcing the data into new memory addresses at every outer loop iteration using the following at the start of each outer loop iteration:
clear dataOld; dataOld=zeros(size(dataCur),'like',dataCur)+ones(1,1,class(dataCur))-ones(1,1,class(dataCur)); dataOld=dataCur; clear dataCur; dataCur=zeros(size(dataOld),'like',dataOld);
If that doesn't force the data to be deep copied to new addresses then I don't know what would.
.
OTHER INFO
CPU is an i9-7940X. This has AVX-512, which MATLAB doesn't seem to understand. For example, running `version -blas` tells me the CNR branch is unknown (this usually tells you if it uses AVX/AVX2/SSE/etc.). MKL is definitely using AVX-512, but I figure this might possibly have something to do with it? (mainly because nothing else I try seems to have any effect)
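A small diagnostic sketch along these lines: it prints what MATLAB reports for its BLAS and LAPACK libraries and compares the timing of the same triple product with the default thread limit and with a single computational thread. The 500 x 500 test matrix is an arbitrary assumption, and this only probes thread sensitivity; it does not by itself explain the slow mode:

```matlab
% Report the BLAS and LAPACK libraries MATLAB has loaded.
blasInfo = version('-blas');
lapackInfo = version('-lapack');
disp(blasInfo);
disp(lapackInfo);

nDefault = maxNumCompThreads;             % current computational thread limit
rng(0);
A = rand(500);                            % arbitrary test matrix (assumption)

tMulti = timeit(@() A.' * A * A);         % timing with the default thread limit

maxNumCompThreads(1);                     % restrict to one computational thread
tSingle = timeit(@() A.' * A * A);
maxNumCompThreads(nDefault);              % restore the original setting

fprintf('threads=%d: %.4f s, threads=1: %.4f s\n', nDefault, tMulti, tSingle);
```

If the single-thread timing is stable while the multi-thread timing jumps between two regimes, that would be consistent with the thread-scheduling suspicion above.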
.
Any ideas for a fix that don't involve "waiting for an undetermined number of MATLAB versions until it works right out-of-the-box" would be much appreciated.
Thanks in advance.
Answers (1)
James Tursa
2018-10-5
Edited: James Tursa
2018-10-5
You should not store your slices as follows, since this forces each slice to be scattered in memory and does not make efficient use of the cache when accessing the slices. It also forces a deep data copy every time you access the slices:
dataOut(kkOuter,:,:)
Rather, you should be storing your slices this way, since this forces each slice to be contiguous in memory and makes efficient use of the cache when accessing the slices:
dataOut(:,:,kkOuter)
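A minimal sketch of the loop from the question rewritten with this layout. The small sizes and the fixed orthogonal `W` are assumptions standing in for the real data and for `someFunction`; the final `permute` is only needed if later code expects the original [L x N x N] arrangement:

```matlab
N = 6; M = 7; L = 3;                      % small sizes; the question uses 1200, 2049, 256
rng(0);
dataCur = rand(N, N, M);                  % stand-in for the real [N x N x M] array
[W, ~] = qr(rand(N));                     % hypothetical stand-in for someFunction(kkInner)

dataOut = zeros(N, N, L);                 % slice index last, so each slice is contiguous
for kkOuter = 1:L
    % Each output slice occupies one contiguous block of N*N elements.
    dataOut(:,:,kkOuter) = sum(dataCur, 3);
    for kkInner = 1:M
        dataCur(:,:,kkInner) = W.' * dataCur(:,:,kkInner) * W;
    end
end

% Convert once at the end if the [L x N x N] layout is required downstream.
dataOutLNN = permute(dataOut, [3, 1, 2]);
```

Doing the `permute` once after the loop costs a single full copy, instead of a strided write on every outer iteration.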
Complex data comments:
The BLAS complex matrix multiply routines require interleaved data. Since R2018a and later also use this format, there will be no data copy required for this particular reason. However, there may be a data copy required when accessing slices of a matrix (rather than the entire matrix). There are indications in other Answers threads that MATLAB may implement methods that do not require data copies in some circumstances when the slices are in the first two dimensions as recommended above, but the rules for this are not published to my knowledge. You could of course write mex code that calls the BLAS routines directly to do the matrix multiplies and avoid any data copying in this case, but I don't think any of the FEX submissions that do this are updated for R2018a and later yet (including my own submission).