Effectiveness of implicit and explicit parallelization of fft2

12 次查看(过去 30 天)
Hi,
I have many images on which I want to do some FFTs. I want to accelerate this image processing with multithreading. I am wondering why explicit parallelization with parfor as shown in the following example is not largely more efficient than the implicit one from the matlab function fft2 :
clear
%% parameters & variable declaration
N_slice = 40 ;
N_img=80;
a = rand(1024,1024,N_slice);
for ii=1:N_img
img{ii}=rand(1024,1024);
out{ii}=0;
end
%% multithreading inside the loop
maxNumCompThreads(4);
% 39 s with 4 threads
% 88 s with 1 thread
% parallelization efficiency of 88/(39*4)=56%
tic
for ii=1:N_img
out{ii}=sum(ifft2(a.*fft2(img{ii})),'all');
end
toc
%% parfor version
% 33 s with parpool(4)
% parallelization efficiency of 88/(32*4)=68%
maxNumCompThreads(1); %avoid the relative inefficiency of fft2 parallelisation
pool = parpool('Processes',4,'SpmdEnabled',false);
b=parallel.pool.Constant(a);
tic
parfor ii=1:N_img
% out{ii}=sum(ifft2(b.Value.*fft2(img{ii})),'all');
out{ii} = fun(b.Value,img{ii});
end
toc
function out=fun(x,y)
% Just in case, I read somewhere that parpool might be faster
% using functions rather than script - change nothing here.
out = sum(ifft2(x.*fft2(y)),'all');
end
FFT2 is supporting multithreading by default and on my pc with 4 cores, I get a 56% multithreading efficiency (defined a T1/(T2*N) with T1 the time to do a task with one thread, T2 the time with N thread).
I am expecting to get a larger multithreading efficiency by creating a pool of 4 workers, and making them work on my images in paralle. Because there is so little data transfer and so much computation time I am expecting to get a multithreading efficiency close to one. But I only get around 68% as you can see in my exemple. Why is that ?
Matthieu

采纳的回答

Damian Pietrus
Damian Pietrus 2024-3-19
Hello Matt,
As mentioned already, there could be a few reasons for this. When you start up a pool of 4 workers in the "Processes" profile, you are starting 4 separate MATLAB workers, each with their own process and memory space. There does end up being some overhead with parfor, as MATLAB not only needs to do some data transfer to each worker, but also needs to break the problem up in the first place. FFT is already well optimized and can avoid some of this overhead.
One thing I'd recommend as a comparison point is to use the "Threads" cluster profile instead of "Processes". Here, parallel workers run as threads within the main MATLAB process and share memory, so I'm curious to see if/how that impacts your overall results.
  2 个评论
Matt
Matt 2024-3-21
Hi,
Switching from process to threads I went from 68% to 88% efficiency ! And that is with a cpu that works at a slighty lower frequency, so it is actually probably higher.
Very good advice. So why is this much better ?
Matt
Damian Pietrus
Damian Pietrus 2024-3-21
Based on the timing of your calculations, I think the difference in effeciency comes down to the overhead required in sharing data between workers. Threaded workers can directly access the same memory space, while process workers require a bit more overhead in getting the right data copied to the right process.

请先登录,再进行评论。

更多回答(1 个)

Pratyush
Pratyush 2024-3-18
Hi Matt,
The observed discrepancy in multithreading efficiency when using MATLAB's "parfor" for FFT operations on images, achieving around 68% efficiency instead of the expected near-perfect efficiency, can be attributed to several factors:
  • Setting up parallel pools, transferring data, and scheduling tasks introduce overhead that reduces overall efficiency.
  • MATLAB's "fft2" is already optimized for multithreading, making explicit parallelization less beneficial.
  • Due to Amdahl's Law, as more workers are added, the speedup gained from parallelization tends to plateau because of the non-parallelizable portions of the task and increasing overhead.
  • Competition for CPU time and memory bandwidth among workers and the main MATLAB process can hinder performance.
  • While "parfor" is efficient for many tasks, its overhead can impact tasks that are already optimized, like FFT operations.
The gist is, achieving near-linear scaling in parallel computing is challenging, especially for already optimized operations. The 68% efficiency observed is relatively good, considering these factors. Optimizing parallel efficiency might involve minimizing data transfer, experimenting with the number of workers, and ensuring optimal hardware and MATLAB configurations.
  1 个评论
Matt
Matt 2024-3-18
Hi,
Indeed there are many factors that can lower parallelization efficiency, but here I fail to see which one is predominant :
  • the creation of the pool is not timed
  • data transfer should not be a bottleneck since the data is a light 1024x1024 array per worker, compared to an execution time of around 1s on one thread per task.
  • I asked for 4 workers on a 4 core cpu, and I have set the number of thread per worker to one to avoid resource allocation competition.
  • 100% of the code is parallelized.
I don't see where I lose 32% of efficiency here?

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 Parallel Computing Fundamentals 的更多信息

产品


版本

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by