Does vectorization preserve the scaling of runtime?

Suppose I have a calculation in which several for loops are nested inside each other. For example: for n = 1:N; for j = 1:J; for m = 1:M*J; In this case the runtime scales linearly with N and M, but grows quadratically with J. If the calculation can somehow be vectorized, vectorization can speed it up. But does the scaling remain the same?
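For illustration, a minimal sketch of the loop structure described above (the loop body here is a placeholder, since the actual calculation was not given):

```matlab
% Hypothetical triple loop matching the structure in the question.
% Total inner iterations: N * J * (M*J) = N*M*J^2,
% i.e. linear in N and M, but quadratic in J.
N = 10; J = 20; M = 30;
s = 0;
tic
for n = 1:N
    for j = 1:J
        for m = 1:M*J
            s = s + n*j*m;   % placeholder work
        end
    end
end
toc
```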

Accepted Answer

Jan
Jan 2017-5-11
Edited: Jan 2017-5-11
Even with vectorization the runtime depends on the number of calculations. Sometimes vectorized code requires large temporary arrays, which do not fit into the processor cache or the RAM. Then the slower RAM or the even slower hard disk is used, which can degrade the speed massively. But as long as the same kind of memory can be used, the order of the execution time is the same for loops and vectorized code. Roughly.
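To make this concrete: one way to vectorize the hypothetical triple loop from the question (with the same placeholder body) performs the same number of elementary operations, so the N*M*J^2 scaling is unchanged, but it builds N-by-J-by-(M*J) temporaries that can exhaust the cache or RAM for large sizes:

```matlab
% Vectorized version of the hypothetical loop: same operation count,
% same asymptotic scaling, smaller constant factor -- at the price of
% three N-by-J-by-(M*J) temporary arrays.
N = 10; J = 20; M = 30;
[n, j, m] = ndgrid(1:N, 1:J, 1:M*J);
tic
s = sum(n(:) .* j(:) .* m(:));
toc
```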
But sometimes vectorized code is multi-threaded, e.g. in the sum command. Then, above a certain input size, the number of free cores matters in addition. While summing 2e5 elements needs about twice as long as summing 1e5, this does not hold true for 10e5 versus 5e5 elements.
3 Comments
Walter Roberson
Walter Roberson 2017-5-19
For sufficiently large arrays, prod() would be handed over to one of the fast mathematical libraries such as MKL or BLAS. Those libraries are highly optimized, and can take advantage of multiple cores, and can take advantage of advanced hardware instructions such as various SIMD (Single Instruction Multiple Data) that might allow higher hardware performance rates than might be obvious. Your exact primary and secondary cache sizes can turn out to matter, your exact CPU can matter, the exact array size can matter...
Jan
Jan 2017-5-19
Edited: Jan 2017-5-19
The code I've posted before did not show anything relevant.
Neither BLAS nor MKL has a function for calculating the product of the elements of an array.
SUM and PROD use multiple cores when the input has more than 88999 elements:
x = rand(1, 88999);
tic; for i = 1:50000; v = sum(x); clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
x = rand(1, 89000);
tic; for i = 1:50000; v = sum(x); clear('v'); end, toc
tic; for i = 1:50000; v = prod(x); clear('v'); end, toc
% Matlab 2016b, Core2Duo
Elapsed time is 6.926055 seconds.
Elapsed time is 10.872182 seconds.
Elapsed time is 4.241430 seconds.
Elapsed time is 6.319897 seconds.
See also the Performance tab of the task manager: for the smaller problem only 1 core is used, but 2 (or all?) for the larger one. The code is not 50% faster (on my 2-core machine), but only 40%, because starting a thread is rather expensive. That SUM and PROD share the same limit does not look "highly optimized", because the break-even point would be reached earlier for the more expensive product. In contrast, optimized BLAS libraries like ATLAS or MKL take the cache sizes and the costs of starting threads into account.
In older Matlab versions (R2009a) the multi-threading was implemented cheaply: the outputs of the threads were collected in the order in which the threads finished. Rounding errors could then cause different results for sum depending on random factors.
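The underlying reason is that floating-point addition is not associative, so accumulating partial sums in a different order can change the result. A small sketch of the effect (not the threading itself, just the order dependence):

```matlab
% Summing the same values in a different order typically changes
% the low-order bits of the result, because floating-point addition
% is not associative.
x = rand(1, 1e6);
s1 = sum(x);             % forward order
s2 = sum(x(end:-1:1));   % same values, reversed order
s1 - s2                  % usually a tiny, nonzero difference
```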
By the way: Look at the timings under R2009a on the same machine:
Elapsed time is 6.628261 seconds.
Elapsed time is 10.488722 seconds.
Elapsed time is 9.719854 seconds.
Elapsed time is 11.743312 seconds.
As in R2016b, the 88999 elements cause 50% processor load and the 89000 elements 100%, but here the multi-threaded code was not faster at all, but substantially slower. Scary.
