Speed of masked matrix operations in 'single' vs 'double'

9 views (last 30 days)
I've been going through a lot of my tools trying to make things faster and reduce memory use. I know that using double-precision FP as my default working datatype is part of the memory problem, but I had expected that using single precision might be faster as well.
Simple tests seem to indicate that it would be (these times are all averages of many tests):
% 105.6 ms double
% 50.4 ms single
R=imfilter(bg,fs);
% 7.6 ms double
% 4.4 ms single
R=flipdim(bg,2); % (flip in newer releases)
% 45 ms double
% 22 ms single
R=bg.*fg;
% 4.5 ms double
% 3.2 ms single
R=fg.^2 + 2*bg.*fg.*(1-bg);
but operations involving masking via multiplication were significantly slower in single:
% 6.0 ms double
% 25.7 ms single!
hi=I>0.5;
R=(1-2*(1-I).*(1-M)).*hi + (2*M.*I).*~hi;
Explicitly casting the logical mask to numeric and handling it without the NOT operator does speed things up a bit, but either class with numeric masks is still slower than using double with logical masks.
% 7.7 ms double
% 9.8 ms single
hi=single(I>0.5);
R=(1-2*(1-I).*(1-M)).*hi + (2*M.*I).*(1-hi);
You might ask why I'm masking via multiplication in the first place. Why not just use logical indexing? I used to do everything that way, but apparently overcalculation (computing both branches everywhere and masking) is faster than a bunch of logical indexing:
% 62.1 ms double
% 53.7 ms single
hi=I>0.5; lo=~hi;
R=zeros(size(I),'single'); % preallocate with the appropriate working class
R(lo)=2*I(lo).*M(lo);
R(hi)=1-2*(1-M(hi)).*(1-I(hi));
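One more multiplicative variant worth noting, untimed here: compute both branches in full and blend them with a single mask multiply.
% variant: compute both branches, then blend with one mask multiplication
hi=single(I>0.5);
A=2*M.*I;                % branch used where I<=0.5
B=1-2*(1-M).*(1-I);      % branch used where I>0.5
R=A + hi.*(B-A);         % equals A where hi=0 and B where hi=1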
Am I misguided to expect reliable speed gains from single precision for a wide range of operations across different machines (this code will be used by others)? Comments like this make me think so.
Also, if I were to pursue this flexibility for memory conservation alone, is there a better approach to masked operations than what I've described?
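For reference, timings like the ones above can be gathered with a timeit harness along these lines; the image size and kernel here are placeholder assumptions:
% sketch of a timing harness; size and kernel are placeholders
sz=[1080 1920 3];
for wclass={'double','single'}
    bg=rand(sz,wclass{1}); fg=rand(sz,wclass{1});
    fs=cast(fspecial('gaussian',15,3),wclass{1});
    t1=timeit(@() imfilter(bg,fs));
    t2=timeit(@() bg.*fg);
    fprintf('%s: imfilter %.1f ms, multiply %.1f ms\n',wclass{1},1000*t1,1000*t2);
end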

Accepted Answer

Matt J 2018-7-10
Edited: Matt J 2018-7-10
Am I misguided to expect reliable speed gains from using single-precision for a wide range of operations across different machines
Well, no, you're not misguided, assuming you're using a recent version of Matlab, and your post demonstrates that indeed you do achieve gains over a wide range of operations. Just not all operations.
I don't have a good explanation for the behavior, but I'm guessing that there are difficulties in writing multi-threaded code in a type-generic way.
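One way to probe that guess: pin MATLAB to a single computational thread with maxNumCompThreads and see whether the single-precision penalty persists. If it disappears single-threaded, the multithreaded code path is implicated. A sketch (restore the thread count afterward, since the setting affects the whole session):
% compare the slow masked expression with and without multithreading
I=rand(2000,'single'); M=rand(2000,'single');
hi=I>0.5;
f=@() (1-2*(1-I).*(1-M)).*hi + (2*M.*I).*~hi;
tMulti=timeit(f);           % default thread count
nOld=maxNumCompThreads(1);  % restrict to one thread
tOne=timeit(f);
maxNumCompThreads(nOld);    % restore the previous setting
fprintf('multithreaded %.1f ms, single thread %.1f ms\n',1000*tMulti,1000*tOne);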
5 Comments
DGM 2018-7-11
That might explain some of the CPU usage patterns observed when profiling the different operations. I was only logging average times, but when I was running the overall tests in 'double', I noticed it was occupying more CPU cores more of the time. I haven't checked which cases exhibited that behavior though (there were ~80 different cases being tested sequentially).
This might be an idea to shelve for a couple years. It seems I'm almost always running an older version than the other students, and I'd hate to optimize something for myself that makes things worse for everyone else.
Walter Roberson 2018-7-11
When I put together the information from http://www.agner.org/optimize/instruction_tables.pdf and https://www.felixcloutier.com/x86/index.html, I get the impression that for most processors in the x86 and x64 architectures, the only difference in rates between single- and double-precision signed multiplication (FMUL or IMUL) would be due entirely to whether 32 or 64 bits are being transferred from memory; for integer data there would be an additional latency for conversion to floating point.
Addition looks like it can get pretty complicated, with numerous modes related to various forms of packing and to fused instructions. There appear to be different instructions for adding scalar single precision and scalar double precision, but the latency tables give the same rates for both.
The material indicates that if NaN or Inf is part of the data, the computation can take up to 100 cycles longer.
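That penalty is easy to check empirically at the MATLAB level, with the caveat that many recent x86 processors handle NaN at full speed in SSE/AVX arithmetic, so it may not show up. A quick sketch:
% rough check of the NaN penalty on a given machine
A=rand(4000); B=rand(4000);
tClean=timeit(@() A.*B);
A(1:10:end)=NaN;            % sprinkle NaNs through one operand
tNaN=timeit(@() A.*B);
fprintf('clean %.2f ms, with NaNs %.2f ms\n',1000*tClean,1000*tNaN);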


More Answers (1)

Image Analyst 2018-7-11
According to Intel (the article is 10 years old, so it might not still be true today), double should be slower than single, but there are lots of "depends" and ways to speed it up. See the discussion for details.
