GCC-compiled MEX file taking more time than the one compiled by Microsoft Visual Studio.

Hello, I have the following loop:
spmd
    dgtilde = zeros(length(denom), d.nexp2);
    for mm = 1:d.nexp2
        dgtilde(:,mm) = sum(g{d.exp2(mm,1)}.*g{d.exp2(mm,2)}.*weight, 2) ...
            - gtilde(:,d.exp2(mm,1)).*gtilde(:,d.exp2(mm,2));
    end
end
I converted the inner loop to C code as follows:
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    const mwSize *dims;
    const mxArray *cell;
    const mxArray *cellArray1, *cellArray2;
    double *pr1, *pr2;
    double *weight, *gtilde;
    double *exp2;
    double *sum_gammaXmom;
    int mom, cellSize, nnz, mm1, mm2, sgIndex;
    bool issparse1, issparse2;
    mwIndex i, j, k, count, jcell, *ir, *jc;
    mwSize ncol, nrow;

    cell = prhs[0];
    mom = (int)mxGetScalar(prhs[1]);
    weight = mxGetPr(prhs[2]);
    exp2 = mxGetPr(prhs[3]);
    dims = mxGetDimensions(prhs[3]);
    gtilde = mxGetPr(prhs[4]);
    if (mom > dims[0]) mexErrMsgTxt("d.mom variable exceeds g cell array size.");

    jcell = 0;
    cellArray1 = mxGetCell(prhs[0], jcell);
    cellSize = mxGetNumberOfElements(prhs[0]);
    nrow = mxGetM(cellArray1);
    ncol = mxGetN(cellArray1);

    plhs[0] = mxCreateDoubleMatrix(nrow, mom, mxREAL);
    sum_gammaXmom = mxGetPr(plhs[0]);
    count = 0;
    for (j = 0; j < (mom*nrow); j++) sum_gammaXmom[j] = 0;

    for (jcell = 0; jcell < mom; jcell++) {
        mm1 = (int)exp2[jcell] - 1;
        mm2 = (int)exp2[jcell + mom] - 1;
        cellArray1 = mxGetCell(prhs[0], mm1);
        cellArray2 = mxGetCell(prhs[0], mm2);
        pr1 = mxGetPr(cellArray1);
        pr2 = mxGetPr(cellArray2);
        for (i = 0; i < nrow; i++) {
            sgIndex = i + jcell*nrow;
            for (j = 0; j < ncol; j++) {
                sum_gammaXmom[sgIndex] += pr1[i + j*nrow]*pr2[i + j*nrow]*weight[i + j*nrow];
            }
            sum_gammaXmom[sgIndex] = sum_gammaXmom[sgIndex] - gtilde[i + mm1*nrow]*gtilde[i + mm2*nrow];
        }
    }
}
When I compiled the MEX file with the Microsoft Visual Studio compiler on a Windows machine, it cut the execution time in half. On the other hand, when I compiled the file with the GCC compiler, the execution time didn't improve at all. I have three questions:
  1. Why is there this difference between the performance of the two compilers?
  2. Is there a way to improve the C code so it performs better?
  3. Should I expect an improvement in speed if I use a 3D matrix 'g' as an input, instead of a cell array of double matrices 'g'?
  • The g variable is a composite, with each lab's data containing a cell array of double matrices.
  • The weight variable is a composite, with each lab's data containing a double matrix.
  • The sum_gammaXmom variable computes dgtilde.
Addendum:
Actually, I have a client who works on a Linux/Unix-based system with GCC. When I first delivered the C files to him, he compiled them and told me they were only 2x faster than native MATLAB, whereas I was getting a 3x improvement with Microsoft Visual Studio. So I installed GCC on my computer, tested my C functions, and got the same 3x improvement I was getting with the MVS compiler. I asked him to compile with the O1, O2, and O3 options, but no luck there. I am attaching the mex_C_glnxa64.xml file he is using on his computer and the GCC mexopts.bat file that I am using on my local machine. Can you tell me whether we are using any different parameters that could be causing this difference in performance on the two machines?
thanks.
  3 comments
dpb
dpb 2015-7-4
Surprising; gcc is generally considered quite good. Do you have a recent release? What are you running it under? Is it a native installation, or under an emulation layer or something, by any chance?


Answers (2)

Ivo Houtzager
Ivo Houtzager 2015-7-4
There is a difference in the default floating-point optimization between the compilers.
The floating-point calculations from the GCC compiler follow strict IEEE compliance by default. The optional -ffast-math flag enables optimizations that can break strict IEEE compliance. You can try whether this option improves the speed, at the possible cost of accuracy.
The floating-point calculations from the VS compiler do not preserve strict IEEE compliance by default. The default option /fp:precise enables some non-strict optimizations. If you need strict floating-point calculations from the VS compiler, use the /fp:strict option. For the fastest floating-point calculations the VS compiler can offer, use the /fp:fast option.
The VS compiler also enables the use of SSE2 instructions (option /arch:SSE2) by default on x86 platforms. GCC does not enable SSE2 instructions by default on 32-bit x86; use -msse2 to enable them, and -mtune=generic to tune the generated code for the most common processors.
  4 comments
Ivo Houtzager
Ivo Houtzager 2015-7-8
The following line shows the optimization options from mexopts.bat:
set OPTIMFLAGS=-O3 -funroll-loops -DNDEBUG
The following line shows the optimization options from mex_C_glnxa64.xml:
COPTIMFLAGS="-O -DNDEBUG"
Thus the compiler on the Windows platform optimizes more than on the Linux platform (-O3 vs. plain -O). Further, loop unrolling is enabled for the Windows compiler. You can copy the compile options from mexopts.bat into mex_C_glnxa64.xml to improve the optimization. You can try to improve the optimization even further by adding the -ffast-math and/or -mtune=generic options as discussed above.
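Concretely, the COPTIMFLAGS entry in mex_C_glnxa64.xml could be edited to match the Windows flags. This is a sketch only; whether -ffast-math is acceptable depends on how much floating-point accuracy the application can give up:

```
COPTIMFLAGS="-O3 -funroll-loops -ffast-math -mtune=generic -DNDEBUG"
```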
Ubaid Ullah
Ubaid Ullah 2015-7-8
Well, my client tried O1 to O3, but he didn't see any improvement. I will ask him to use the -ffast-math and -mtune=generic options.



Jan
Jan 2015-7-4
Why is there this difference between the performance of two compilers?
Compilers translate the C code into machine instructions. There are different possible translations which lead to the same results but with different runtimes. E.g. a compiler can create MMX, SSE, SSE2 or SSE3 instructions; some will run on modern processors only, while others also support older processors. Therefore it is expected that different compilers create programs with different speeds.
Try memset instead of a loop to set sum_gammaXmom to zero. Or even better: omit the zeroing entirely, because mxCreateDoubleMatrix already fills the array with zeros.
sum_gammaXmom[sgIndex] += pr1[i+j*nrow]*pr2[i+j*nrow]*weight[i+j*nrow];
You could try storing i+j*nrow in a variable to avoid repeatedly computing the same value, although smart compilers should recognize this. A general problem remains the memory access pattern: it is much cheaper to read from and write to neighboring elements in memory. Is it possible to run the loop over i on the inside, such that [i+j*nrow] accesses contiguous memory elements?
  5 comments
Jan
Jan 2015-7-5
Accessing 25 cells costs less than a millisecond. But I do not understand what "with each cell having a 25-element array of double matrices" means.

