Speed of the abs function

22 次查看(过去 30 天)
Hello everybody!
I had a question about how matlab computed the absolute value function.
Indeed, i was working on a school project and had to optimize a code that returned the smallest euclidian distance from the origin of a 3-D cloud of points. Simple enough, my first thought was to use the basic :
d = x.^2 + y.^2 + z.^2;
dmin = sqrt(min(d));
But then, thinking that squaring was a expensive operation, especially on a data set that can be bilions of points big, i tried to implement a "pre-check". My idea was the following: the 1-norm will be way cheaper to compute and so i will be able to compute a first "estimate" of the smallest distance very quickly. I will then use this to only compute the 2-norm of the points that are "close enough" to the origin on a way smaller data set.
so i did this:
dEstimate = abs(x) + abs(y) + abs(z);
dMinEstimate = min(dEstimate);
threshold = dMinEstimate*sqrt(3); %the closest point in the 2-norm can be at most sqrt(3) times further than the closest in the 1-norm.
potentialMin = find(dEstimate < threshold);
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2
dmin = sqrt(min(d));
But to my surprise, when using the profile viewer to compare the codes, this is what i got:
As you can see, the results are averaged over 100 calls in order to reduce processing noise. Keep in mind that the data is one milion points here.
How can the square operation time be close or even faster than the absolute value?
I would assume that the absolute value is computed by setting to 0 the sign bit in the IEEE 754 double convention, which is a very simple operation while the squaring is a much more complex process.
Is matlab this optimised? How can this be?
Thanks in advance for your answer !
Arthur

采纳的回答

James Tursa
James Tursa 2023-12-5
编辑:James Tursa 2023-12-6
Although I applaud your thinking about this problem, this level of optimization is hard to achieve with a language such as MATLAB because there is a lot going on in the background that you are not considering beyond just the abs( ) vs squaring comparison. E.g.,
d = x.^2 + y.^2 + z.^2; % passes the x, y, z arrays through the CPU. Allocates d same size as x, y, and z
dmin = sqrt(min(d)); % passes d through the CPU. Then sqrt( ) of a scalar
So, four potentially large arrays get passed through the CPU along with some ops and a potentially large new allocation.
Compared to this:
dEstimate = abs(x) + abs(y) + abs(z); % passes the x, y, z arrays through the CPU. Passes abs( ) of them through CPU also. Allocates dEstimate same size as x, y, and z
dMinEstimate = min(dEstimate); % min( ) passes dEstimate through the CPU
threshold = dMinEstimate*sqrt(3); % Scalar ops
potentialMin = find(dEstimate < threshold); % passes dEstimate through the CPU again, allocates potentialMin
d = x(potentialMin).^2 + y(potentialMin).^2 + y(potentialMin).^2; % New allocations for the subscripting, passing through CPU, etc.
dmin = sqrt(min(d)); % passes d through the CPU, then scalar op
So, lots more going on in the background here with how much data is being fed through the CPU, new memory allocations, etc. I don't know how one would necessarily expect this to be either slower or faster than the original given all that is going on. And depending on how large the arrays are I'm not sure what cache hit rate you might expect either. So results may or may not scale well.
If you were coding in C/C++ then maybe you could look at something like this and try to optimize cache hits and minimize large temporary array allocations etc. For this particular problem, in C/C++ all you need is a simple loop with no large temporary array allocations needed at all. But for a language like MATLAB this type of optimization is not worth it IMO.
  1 个评论
Walter Roberson
Walter Roberson 2023-12-5
If you manage to find tables of the throughput and latency of different instructions for various generations of Intel architectures, then you will see that at various times:
  • FABS has been the same speed as FMUL
  • FABS has been the same speed as FMUL but had less latency
  • FABS has been the same speed and latency as FMUL but can only be started every 2 cycles where FMUL could be started every cycle
  • FABS has been the same speed and latency as FMUL but can only be started every 4 cycles where FMUL could be started every 2 cycles
  • FABS has been 1 cycle, latency 1, startable every cycle, but FMUL was 1 cycle, latency 5, 2 startable every cycle
  • FABS has been 1 cycle, latency 1, but FMUL was 2 cycles
... and other combinations.
So, FABS is anywhere from "always faster than FMUL" to "same speed as FMUL but you can't start as many FABS in a row" to "faster than FMUL but you can do more FMUL at the same time"... depending on exact architecture.

请先登录,再进行评论。

更多回答(2 个)

Chunru
Chunru 2023-12-5
" squaring was a expensive operation"
This is ture for VERY OLD cpu or some special purpose CPUs. In modern general purpose CPU, such as Intel one, the difference among square/multiplication/abs/addition is no longer significant (subjective here). For example, a CPU can perform 4 ADD/cycle or 4 MUL/cycle (floating point). The abs operation may not be much simpler than square.
If you consider the FP addition, abs, multiplier (of similar speed), with the overhead such memory access, loop control, the two approaches you used above may have ver similar performance. (It is also noted additional overhead for testing and comparison).
So in modern general purpose CPU, it may not worth to do the "optimization" mannually. (Some compilers may do optimisation better).

Walter Roberson
Walter Roberson 2023-12-5
We have to watch out for measurement error due to Execution Engine processing
format long g
N = 10000;
x = randn(N, 1); y = randn(N,1); z = randn(N,1);
M = 10;
t1a = zeros(1,M);
t1b = zeros(1,M);
t2a = zeros(1,M);
t2b = zeros(1,M);
for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1a(K) = toc(start); end
for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2a(K) = toc(start); end
for K = 1 : M; start = tic; d = x.^2 + y.^2 + z.^2; t1b(K) = toc(start); end
for K = 1 : M; start = tic; d = abs(x) + abs(y) + abs(z); t2b(K) = toc(start); end
t1a
t1a = 1×10
1.0e+00 * 0.000268 0.000168 1.8e-05 1.5e-05 1.8e-05 1.5e-05 6e-06 5e-06 5e-06 5e-06
t1b
t1b = 1×10
1.0e+00 * 0.00019 1.7e-05 1.3e-05 1.3e-05 1.8e-05 1.4e-05 5e-06 5e-06 5e-06 5e-06
t2a
t2a = 1×10
1.0e+00 * 0.000346 4.2e-05 0.000113 2.6e-05 9e-06 7e-06 7e-06 6e-06 6e-06 6e-06
t2b
t2b = 1×10
1.0e+00 * 0.000192 3.1e-05 2.5e-05 4.3e-05 8e-06 7e-06 6e-06 6e-06 6e-06 6e-06
plot([t1a; t1b; t2a; t2b].')
legend({'squares#1', 'squares#2', 'abs#1', 'abs#2'})
plot([t1b; t2b].')
legend({'squares#2', 'abs#2'})
For reasons currently unknown, the abs() version consistently is slower at execution 3 or 4 compared to 2, but by the end of the run the timings for the two versions are virtually identical.

类别

Help CenterFile Exchange 中查找有关 Simulink 的更多信息

产品


版本

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by