There are several reasons why parfor may not be able to do a perfect job of speeding up your calculation. These basically boil down to two broad categories:
- Overheads associated with running in parallel (dividing up the work, sending stuff to and from the workers, imperfect scheduling of the work - i.e. some workers left idle towards the end of the loop).
- Intrinsic hardware limitations - not every single-threaded program can be perfectly accelerated on a given machine. One common cause here is access to memory: for instance, the workers might collectively run out of cache, or saturate main memory bandwidth. This can be particularly challenging to diagnose. One way is to use hardware performance counters. A simpler way is to run the core piece of your computation with increasing contention on your target system (i.e. with more and more copies running at the same time), and see whether the per-copy performance degrades (it often does).
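You can investigate the data-transfer part of the overheads by using ticBytes/tocBytes, which report how many bytes are sent to and received from each worker. However, that's not always the whole picture - the cost of moving those bytes depends on how "far" away the workers are (same machine vs. across a network). It's not clear from your question whether you're running on a single system or not.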
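As a rough illustration of the ticBytes/tocBytes approach, here is a minimal sketch. The matrix A and the loop body are made-up placeholders (not your actual computation); the point is only to show where the calls go and what comes back.

```matlab
% Minimal sketch: measure data transferred by a parfor loop.
% A and the loop body are hypothetical placeholders.
pool = gcp();            % current pool, or start one with default settings
A = rand(1000);          % about 8 MB, used whole inside the loop, so it is broadcast
out = zeros(1, 100);

ticBytes(pool);
parfor k = 1:100
    out(k) = max(A(:)) + k;   % A is not indexed by k, so every worker needs all of it
end
bytes = tocBytes(pool);  % one row per worker: [bytes sent to worker, bytes received from worker]
disp(bytes)
```

If the "sent" column is large compared with the time each iteration takes, data transfer is a likely contributor to the overhead.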
Digging into the second point can be done by timing the same computation on the workers as you increase the number running concurrently. One way is to use spmd, something a bit like this:
parpool(36);
N = 100; % Placeholder value: tune so the block takes 5-10 seconds with a single worker
for ii = 1:36
    spmd (ii) % Limit the number of workers for the SPMD block
        tStart = tic();
        % Run an inner loop to ensure the timings are not dominated by spmd
        % overheads.
        for jj = 1:N
            getCoh(. . .);
        end
        tElapsed = toc(tStart);
    end
    t = tElapsed{1}; % tElapsed is a Composite on the client; retrieve the time from worker 1
    % Display (or capture) the time as contention increases.
    disp([ii, t])
end
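Once you have captured the times from a loop like the one above, a small sketch such as the following can turn them into a slowdown factor and an upper bound on the achievable speedup. The numbers in times are invented placeholders purely to make the snippet runnable; substitute your own measurements.

```matlab
% Hypothetical timings in seconds for 1..4 concurrent workers (placeholder values,
% not real measurements). Each worker runs the same N iterations.
times = [6.0, 6.1, 6.9, 8.4];
nWorkers = 1:numel(times);

slowdown = times ./ times(1);                % 1.0 means no contention penalty
speedupBound = nWorkers .* times(1) ./ times; % best-case speedup at each worker count

disp([nWorkers', times', slowdown', speedupBound'])
```

If slowdown stays close to 1 as workers are added, the hardware is not the limit and the parallel overheads are the more likely culprit; if it climbs, no amount of tuning the parfor loop will recover the lost speedup.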


