Why does parallel mode slow down each core/task?

I did not notice this until I looked at my log data.
I am running a time-consuming optimization. Each trial run (simulation) is a large computation that takes about 140 s when only one simulation is running (non-parallel mode). However, when independent tasks run simultaneously on different cores (I have an i7 processor), I found that the more cores I use, the slower each task runs. Below are the recorded times of EACH TASK in different situations. All step times are in seconds.
Case                                      Init       IO    Runoff  WaterBudget  StateUpdate   Routing
case 1: 1 worker/task (non-parallel)    2.1372  25.5530   53.4303       7.6752       8.0341   65.8012
case 2: 2 workers/tasks (parallel)      2.2464  18.3457   50.7939       5.6472       9.0481   67.5952
case 3: 4 workers/tasks (parallel)      3.7284  25.9586   67.5640       8.6893      13.7437   88.2966
case 4: 7 workers/tasks (parallel)      6.4896  47.8143  123.9428      20.7949      25.6622  150.5722
Can anyone explain what is happening? I expected parallel computation to give a speedup equal to the number of cores. The results are clearly disappointing.
By the way, I ran these computations on my personal computer with an Intel i7-3770 CPU (8 logical cores) and sufficient RAM. My code uses one job with multiple tasks. The tasks are created with the createTask function rather than batch. I have already excluded the time spent submitting the job, waiting for it, and collecting the results from each worker.
Matrix (vectorized) computation is used extensively throughout the process. (I use no loops unless the computation is order dependent.) Could this be a reason? Thank you!
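A rough way to read the timings above is to separate per-task time from aggregate throughput. The sketch below sums the six step times for each case and computes n * T1 / Tn, i.e. how many single-worker tasks' worth of work finish in the time one task takes alone. The variable names are illustrative; the numbers are copied from the post.

```matlab
% Step times (s) per task, copied from the post: one row per case.
stepTimes = [ 2.1372 25.5530  53.4303  7.6752  8.0341  65.8012;   % 1 worker
              2.2464 18.3457  50.7939  5.6472  9.0481  67.5952;   % 2 workers
              3.7284 25.9586  67.5640  8.6893 13.7437  88.2966;   % 4 workers
              6.4896 47.8143 123.9428 20.7949 25.6622 150.5722];  % 7 workers
nWorkers = [1; 2; 4; 7];

taskTime   = sum(stepTimes, 2);                    % total time of one task per case
slowdown   = taskTime / taskTime(1);               % per-task slowdown vs. 1 worker
throughput = nWorkers .* taskTime(1) ./ taskTime;  % aggregate speedup vs. 1 worker

% fprintf walks the transposed matrix column by column, one case per line.
fprintf('%d worker(s): %.1f s per task, slowdown x%.2f, throughput x%.2f\n', ...
        [nWorkers, taskTime, slowdown, throughput]');
```

With these numbers the aggregate throughput for 4 and 7 workers comes out at roughly 3 in both cases, well below the worker count. The i7-3770 has 4 physical cores behind its 8 logical cores, so linear scaling up to 7 workers would not be expected in any case.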
1 Comment
Xinyi Shen, 2014-6-26
function MultiTaskTest(nCore)
    % Submit nCore identical, independent tasks in one job and show their timings.
    cluster = parcluster();
    job = createJob(cluster);
    for i = 1:nCore
        job.createTask(@LargeTask, 1, {1000000, 500}, 'CaptureDiary', true);
    end
    job.submit();
    DispProg(job);
    job.wait();
    results = job.fetchOutputs();
end
function res = LargeTask(n, m)
    % Repeat a vectorized accumulation 200 times and print the CPU time used.
    t1 = cputime;
    for i = 1:200
        srcArr = zeros(n, 2);
        % create a value array
        srcArr(:,1) = rand(n, 1);
        % create an index array
        srcArr(:,2) = randi(m, n, 1);
        % accumulate the values by class indices
        res = accumarray(srcArr(:,2), srcArr(:,1));
    end
    t2 = cputime;
    disp([num2str(t2 - t1), 's'])
end
function DispProg(job)
    % Poll the task diaries and echo any new output until the job finishes.
    tasks = job.Tasks;
    nTasks = length(tasks);
    indices = zeros(nTasks, 1);   % diary characters already displayed, per task
    while ~strcmp(job.State, 'finished')
        for i = 1:nTasks
            text = tasks(i).Diary;
            % An empty diary has length 0, so nothing is printed for it.
            if indices(i) ~= length(text)
                disp(text(indices(i)+1:end));
                indices(i) = length(text);
            end
        end
        pause(0.1);   % avoid a tight polling loop on the client
    end
end
I pasted the sample code above so that anyone can test it on their own computer. The only parameter of the entry function is the number of cores to use.
My test output is:
>> MultiTaskTest(1)
10.9513s
>> MultiTaskTest(2)
12.3085s
12.3709s
>> MultiTaskTest(3)
13.3225s
13.4941s
13.4161s
>> MultiTaskTest(4)
15.1945s
15.3349s
15.3037s
15.3193s
>> MultiTaskTest(5)
16.5985s
16.8169s
16.9261s
16.7545s
16.8169s
>> MultiTaskTest(6)
18.6733s
18.6265s
18.7981s
18.8137s
18.7825s
18.8137s
>> MultiTaskTest(7)
19.0477s
18.7513s
18.9541s
18.9697s
19.1257s
18.2209s
16.7077s
The time consumed by the same single task clearly increases with the number of cores I use.


Accepted Answer

Matt J, 2014-7-5
Edited: Matt J, 2014-7-5
"Does that mean that vectorization always conflicts with multi-core parallelization?"
There appears to be a bit of a clash between vectorization and caching in MATLAB. This affects both serial and parallel for loops as the following plot shows. The code that generated these plots is further below. It basically repeats the same total computation with different relative amounts of looping and vectorization. The plot shows computation time vs. the chunk size of the vectorized data.
The plot establishes that optimal performance isn't necessarily achieved by doing the most vectorization possible. Some intermediate vector size and number of loop iterations turns out to be best for both for and parfor. Notice also that parfor achieves the best speed-up factor over for when the least amount of vectorization is done (data size = 1000). I imagine this is because parfor can then cache the data on the CPU and the workers fight less for access to RAM. However, the overall speed performance in this case is the worst because diminishing the amount of vectorization too much eventually causes the speed to suffer.
The trends in the plot are very likely platform-dependent and computation-dependent. I did this with a pool of 2 workers on an i7-2640M, 2.8GHz quadcore CPU, MATLAB version R2013b.
The bottom line, though, is that it may take some experimentation to find the optimal amount of vectorization for your problem.
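N0 = 8; M0 = 1e7;
compTime_PARFOR = zeros(1, 5);   % preallocate the timing results
compTime_FOR = zeros(1, 5);
for j = 0:4
    realloc = 10^j;
    N = realloc*N0;   % number of loop iterations
    M = M0/realloc;   % length of the vector processed per iteration
    tic;
    parfor i = 1:N
        A = rand(1, M);
        polyval(1:5, A);
    end
    compTime_PARFOR(j+1) = toc;
    tic;
    for i = 1:N
        A = rand(1, M);
        polyval(1:5, A);
    end
    compTime_FOR(j+1) = toc;
end
semilogx(M0./(10.^(0:4)), compTime_FOR, '-*', ...
         M0./(10.^(0:4)), compTime_PARFOR, '--sr');
legend('For', 'Parfor')
ylabel 'Time (sec)'
xlabel 'Data Size'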
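To experiment with the amount of vectorization in the question's accumarray task, one option is to process the data in chunks and add up the partial results. This is only a sketch: the chunk size of 1e5 is an arbitrary starting value rather than a recommendation, and whether chunking helps at all would need to be timed on the machine in question.

```matlab
n = 1e6;  m = 500;          % sizes taken from the sample task in the question
chunk = 1e5;                % assumed chunk size; tune by experiment

vals = rand(n, 1);          % values to accumulate
idx  = randi(m, n, 1);      % class index of each value

res = zeros(m, 1);
for start = 1:chunk:n
    stop = min(start + chunk - 1, n);
    % Passing the size [m 1] keeps every partial result the same length,
    % even if a chunk happens not to contain the largest index.
    res = res + accumarray(idx(start:stop), vals(start:stop), [m 1]);
end

ref = accumarray(idx, vals, [m 1]);   % fully vectorized reference result
fprintf('max difference from one-shot result: %g\n', max(abs(res - ref)));
```

The chunked and one-shot results differ only by floating-point rounding, so the chunk size can be varied freely while timing the loop.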
2 Comments
Matt J, 2014-7-5
Edited: Matt J, 2014-7-5
I used a parpool of 2. The pool appears to make use of all 4 of my cores, though, judging from the Task Manager.


More Answers (1)

Edric Ellis, 2014-6-30
Parallel workers run in 'single computational thread' mode, and this can mean that using multiple workers is no faster than using a single multi-threaded desktop MATLAB. This is especially the case when your code is already well vectorised - as vectorised calls in MATLAB are often intrinsically multi-threaded. You can check this by launching your desktop MATLAB with the '-singleCompThread' argument and checking the performance of that.
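Besides restarting MATLAB with '-singleCompThread', the effect can also be probed from within a session using maxNumCompThreads, which returns the previous setting when given a new one. The sketch below times a matrix multiplication with the default thread count and with one thread. The matrix multiplication is chosen here only because it is commonly multi-threaded; it is not taken from the question's code.

```matlab
A = rand(1500);                      % test matrix; the size is arbitrary

nDefault = maxNumCompThreads;        % current limit on computational threads
tic; B = A * A; tMulti = toc;

prev = maxNumCompThreads(1);         % restrict to one thread, keep the old value
tic; C = A * A; tSingle = toc;
maxNumCompThreads(prev);             % restore the original setting

fprintf('%d thread(s): %.3f s, 1 thread: %.3f s\n', nDefault, tMulti, tSingle);
```

If the two timings are close for the functions that dominate a workload, those functions are probably not benefiting from multi-threading, and a slowdown under multiple workers is more likely to come from shared resources such as cache or memory bandwidth.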
1 Comment
Xinyi Shen, 2014-6-30
If that is the case, shouldn't the performance get worse when I use '-singleCompThread'? Unfortunately, the performance is the same. I also checked all the MATLAB built-in functions I call, and none of them appears to be intrinsically multi-threaded.
I suspect the more likely cause is that the fully vectorized style overwhelms the cache shared by all cores: each of my tasks takes up about 200 MB of RAM, while the shared cache (L3 on this CPU) is only 8 MB.
Does that mean that vectorization always conflicts with multi-core parallelization? Is there an easy way to solve this?
Thanks
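The cache argument can be checked with simple arithmetic on the sample task: srcArr holds n-by-2 doubles at 8 bytes each. The 8 MB cache size below is the figure quoted in the comment and is treated as an assumption, as are the variable names.

```matlab
n = 1e6;                               % rows in srcArr (sample task)
bytesPerDouble = 8;
srcArrBytes = n * 2 * bytesPerDouble;  % n-by-2 array of doubles
cacheBytes  = 8 * 2^20;                % assumed 8 MB shared cache

fprintf('srcArr: %.1f MB per task, cache: %.1f MB\n', ...
        srcArrBytes / 2^20, cacheBytes / 2^20);
fprintf('working set / cache = %.2f\n', srcArrBytes / cacheBytes);
```

Under these assumptions the main array of a single task already exceeds the cache, and the temporaries created by rand, randi and accumarray add to that, so several workers running at once would plausibly compete for both cache and memory bandwidth.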
