Editing elements of a vector inside a mex function is slow

Consider the following mex function written in C, which returns a column vector:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int x = *mxGetPr(prhs[0]);
    int length = *mxGetPr(prhs[1]);
    plhs[0] = mxCreateDoubleMatrix(x, 1, mxREAL);
    double *output_vector = mxGetPr(plhs[0]);
    int i;
    for(i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
}
If I compile this code in MATLAB and then run the function with inputs (5, 1000000000), it takes 3.228 s. Now consider the following altered code:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int x = *mxGetPr(prhs[0]);
    int length = *mxGetPr(prhs[1]);
    plhs[0] = mxCreateDoubleMatrix(x, 1, mxREAL);
    //double *output_vector = mxGetPr(plhs[0]);
    double *output_vector = malloc(x*sizeof(double));
    int i;
    for(i = 0; i < length; ++i) {
        output_vector[i] = 0.0;
    }
    for(i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
    free(output_vector);
}
If I compile and run the function with the same inputs as before, it takes only 0.627 s.
It seems that editing elements of an mxArray is much slower than editing elements of a plain double array. There should be no issue with MATLAB's column-major order versus C's row-major order, since I am only using a vector here.
Any ideas why I am seeing this time difference?
Here is some further information:
  • OS: 64-bit Windows 10.
  • Compiler: MinGW64 Compiler (C), with the additional compile flags -std=c99 and -pedantic.
  • MATLAB version: R2016b
Update: For the simple example above, updating the mxArray takes about 5 times as long. In other code that I am using for an actual application, updating an mxArray instead of a double array takes 30 times as long.
Update 2: Please see my new timings in my comment below after incorporating the helpful suggestions by Walter and James. After fixing an error in the second code above, writing to an mxArray is now 10x slower than a double array for this simple example.

4 Comments

You should consider using memset() or calloc() instead of looping to set the elements to zero.
You should also take a look at the implementation of UNINIT (https://www.mathworks.com/matlabcentral/fileexchange/31362-uninit-create-an-uninitialized-variable--like-zeros-but-faster-): it would give you a better comparison of the speed of MATLAB's allocation against malloc(), since the time spent zeroing the memory would be factored out.
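The calloc() suggestion can be sketched outside of mex as follows. This is a minimal standalone illustration, assuming a hypothetical helper name make_histogram; calloc() returns memory already set to zero, so the explicit zeroing loop from the second code disappears entirely.

```c
#include <stdlib.h>

/* Illustrative sketch (not from the original post): builds the same
   histogram as the mex code, i.e., counts of i % x for i in [0, length).
   calloc() zero-initializes the buffer, so no separate
   "output_vector[i] = 0.0" loop is needed. */
double *make_histogram(int x, long length) {
    double *output_vector = calloc((size_t)x, sizeof(double));
    if (!output_vector) return NULL;
    for (long i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
    return output_vector; /* caller must free() */
}
```

With x = 5 and length = 10, each of the five buckets ends up with the value 2.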
How does this not crash MATLAB?
double *output_vector = malloc(x*sizeof(double));
int i;
for(i = 0; i < length; ++i) {
    output_vector[i] = 0.0;
}
If x is 5 and length is 1000000000 the above code is writing off the end of the allocated memory block big time and should crash. Are you sure this is the actual code you are running and comparing?
Also, how are you calculating the timing? Average of multiple runs after mex routine is loaded? tic & toc?
@Walter Roberson: Thank you for those suggestions. I have incorporated calloc() in my code, and it will be even more useful in my other code, where I initialize a much larger array to zero. Also, thanks for the link to UNINIT.
@James Tursa: That's a good question! That's definitely a mistake; thanks for pointing it out. I'm calculating the timing using MATLAB's profiler (i.e., I run "profile on", then run the mex function, then view the results with "profile viewer"). Although I don't take an average over multiple runs, the variance in the run time over multiple runs is small compared to the much bigger difference between the two codes.
After correcting the mistake pointed out by James, and also trying out calloc() as suggested by Walter, I get the following timings:
  • Write to mxArray (first code in my original post): 2.949 s
  • Write to malloc'ed double array (corrected second code in my original post): 0.295 s
  • Write to calloc'ed double array (same as second code in original post, but with calloc() and without the initialization loop): 0.296 s
This time, I'm on a Linux machine with MATLAB R2016b, and compiling with gcc version 6.3.1-3.
Next step: try with an uninit MATLAB array followed by writing in zeros. This will give you information about the amount of time it takes to go through the MATLAB memory manager.


Accepted Answer

I can think of no reason why writing to memory from an mxArray (off the heap) should take a significantly different amount of time than writing to memory from malloc or calloc (also off the heap). I ran the following two sets of code on R2017a Win64 and see no significant differences:
/* double_write_test1.c */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int x = *mxGetPr(prhs[0]);
    int length = *mxGetPr(prhs[1]);
    double *output_vector;
    int i;
    plhs[0] = mxCreateDoubleMatrix(x, 1, mxREAL);
    output_vector = mxGetPr(plhs[0]);
    for(i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
}
and
/* double_write_test2.c */
#include <stdlib.h> /* malloc, free */
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int x = *mxGetPr(prhs[0]);
    int length = *mxGetPr(prhs[1]);
    double *output_vector;
    int i;
    // plhs[0] = mxCreateDoubleMatrix(x, 1, mxREAL);
    //double *output_vector = mxGetPr(plhs[0]);
    output_vector = malloc(x*sizeof(double));
    for(i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
    free(output_vector);
}
The timing results:
>> tic;double_write_test1(5, 1000000000);toc
Elapsed time is 3.263384 seconds.
>> tic;double_write_test1(5, 1000000000);toc
Elapsed time is 3.004800 seconds.
>> tic;double_write_test1(5, 1000000000);toc
Elapsed time is 3.098912 seconds.
>>
>>
>> tic;double_write_test2(5, 1000000000);toc
Elapsed time is 3.071897 seconds.
>> tic;double_write_test2(5, 1000000000);toc
Elapsed time is 3.091942 seconds.
>> tic;double_write_test2(5, 1000000000);toc
Elapsed time is 3.056829 seconds.
So, timing is pretty much the same. This is all as expected on my machine. I don't know what might be happening on your machine.
I would point out that MATLAB seems to keep a store of zeroed memory to the side for use in some circumstances. E.g., if you call mxCalloc, the pointer returned may be to a memory block that was already set to all zeros prior to your mxCalloc call. So you can't necessarily conclude that timings associated with the call include the time it took to zero the memory, since that might have been done before the call.
Side note: I don't know everything "mex.h" includes, but I wouldn't necessarily trust it to pull in the headers that declare the native C function prototypes. In particular, since you are using malloc etc., you should explicitly include a header such as stdlib.h to get the proper prototypes for the functions you are using.

8 Comments

When I run these two code examples, I get the following timings:
>> tic; double_write_test1(5, 1000000000); toc
Elapsed time is 2.934494 seconds.
>> tic; double_write_test1(5, 1000000000); toc
Elapsed time is 2.949074 seconds.
>> tic; double_write_test1(5, 1000000000); toc
Elapsed time is 2.939972 seconds.
>> tic; double_write_test2(5, 1000000000); toc
Elapsed time is 0.297458 seconds.
>> tic; double_write_test2(5, 1000000000); toc
Elapsed time is 0.298535 seconds.
>> tic; double_write_test2(5, 1000000000); toc
Elapsed time is 0.301099 seconds.
I wonder why we get such different results for double_write_test2.
What's more, if I change the second code so that the contents of output_vector are copied over to an mxArray at the end, i.e.,
/* double_write_test3.c */
#include <stdlib.h> /* calloc, free */
#include <stdio.h>
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
    int x = *mxGetPr(prhs[0]);
    int length = *mxGetPr(prhs[1]);
    double *output_vector;
    int i;
    plhs[0] = mxCreateDoubleMatrix(x, 1, mxREAL);
    double *output_vector_2 = mxGetPr(plhs[0]);
    output_vector = calloc(x, sizeof(double));
    for(i = 0; i < length; ++i) {
        output_vector[i % x] += 1;
    }
    for(i = 0; i < x; ++i) {
        output_vector_2[i] = output_vector[i];
    }
    free(output_vector);
}
then I get the following results:
>> tic; double_write_test3(5, 1000000000); toc
Elapsed time is 2.942094 seconds.
>> tic; double_write_test3(5, 1000000000); toc
Elapsed time is 2.945187 seconds.
>> tic; double_write_test3(5, 1000000000); toc
Elapsed time is 2.948032 seconds.
It seems odd that this would add so much to the run time, especially since only 5 elements are copied.
I get similar results when compiling and running these codes on two of the computers in my department's computer lab (Linux, gcc compiler, MATLAB R2017a).
What happens if you replace malloc by mxMalloc in test2?
If I replace malloc in test2 by mxMalloc or mxCalloc I get run times of around 3.0 seconds.
Same slow result if you go back to using malloc but uncomment the line that creates plhs[0]?
Also, what happens if you get rid of the loop entirely and just have the plhs[0] creation?
If I go back to malloc and uncomment that line, it is fast again, taking about 0.3 seconds.
However, I noticed that in the test2 code, if I add e.g.
printf("Test");
the code starts taking as long as the test1 code. Could it be that test2 runs faster on the computers I tested due to the compiler doing some sort of optimization? After all, in test2 the memory allocated to output_vector is immediately freed after updating the entries of that array.
Does that seem plausible?
Which compiler do you use?
I just tried compiling the test2 code on my Windows machine using the Microsoft Windows SDK 7.1 C compiler instead of the MinGW64 C compiler. When using the MinGW64 compiler, the function runs in about ~0.3 s as before, and when using the Microsoft compiler, the function takes slightly more than 3 s to run.
So it seems that the MinGW64 / gcc compilers do some sort of optimization that makes the code seem faster than it actually is for any practical purpose, i.e., whenever anything else is done after the time-consuming computation in the main for loop. This would also explain why the two test codes took about the same amount of time on your computer (I'm guessing you're using the Microsoft compiler?).
Thank you for your help with this issue.
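The dead-store hypothesis above can be illustrated with a small standalone sketch (not mex code; the function name write_and_discard is hypothetical). Because the malloc'ed buffer in test2 is written and then freed without ever being read, an optimizer is free to delete the entire loop; reading one element back before freeing makes the work observable and defeats that elimination.

```c
#include <stdlib.h>

/* Illustrative sketch: same access pattern as test2, but a value derived
   from the buffer is returned, so the compiler cannot discard the loop
   as a dead store the way it apparently can in test2. */
double write_and_discard(int x, long length) {
    double *v = calloc((size_t)x, sizeof(double));
    if (!v) return -1.0;
    for (long i = 0; i < length; ++i) {
        v[i % x] += 1;
    }
    double witness = v[0]; /* reading a result makes the writes observable */
    free(v);
    return witness;
}
```

Comparing the run time of this version against the original test2 under the same compiler and flags would show whether dead-store elimination accounts for the 0.3 s result.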
Yes, Microsoft SDK. This is still strange to me, however, to have that much timing difference.
I believe the dominant cost in the loop is the `i % x` operation. Since x isn't known at compile time, GCC 4.9 with -O2 can't use tricks to avoid the integer division. When the compiler optimizes out `output_vector[i % x] += 1` entirely, the run time drops dramatically, because the billion integer divisions disappear along with the stores.

