Issue with MEX file in parfor loops

To speed up some heavy calculations I wrote a C file, which I compiled with MATLAB's mex compiler. It appears to run smoothly and gives correct results when using only a single thread (no parfor loops), and I have run it more than 100 times without any error.
However, when I run several calculations in parallel, one or two of my workers usually die, which makes the parfor loop restart. After a while, though, all workers are able to finish. These calculations are submitted through SLURM, so they run on another machine in our network. Anyone got an idea? Perhaps my MEX file does something illegal that I am not aware of.
My main script has this structure:
parfor i=1:numWorkers
doWork();
end
and doWork() is basically like
function doWork()
doSomestuff();
[a,b,c,d,e,f] = initialize();
myMexFunc(a,b,c,d,e,f);
doMoreStuff();
end
and my Mex file is the following:
#include "mex.h"
#include "stdio.h"
void calcModulation(double* A, unsigned int* B, double* C, unsigned int* D, unsigned int L, double* E, unsigned int num_col, double* F)
{
// First Task
for(unsigned int n=0;n < L; ++n)
{
for(unsigned int m=0; m < 132; ++m)
{
A[D[n]+ 22*(B[n]+m)] = A[D[n] + 22*(B[n]+m)] + C[m+132*n];
}
}
// Second Task
for(unsigned int n=0;n < num_col; ++n)
{
for(unsigned int m=0; m < 22; ++m)
{
E[n] = E[n] + F[m + 22*(n)] * A[m + 22*(n)];
}
}
}
/* The gateway function */
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
// Names changed as part of the original code is secret
unsigned int num_col = mxGetN(prhs[0]);
unsigned int L = mxGetN(prhs[2]);
double* myMatrix_A = mxGetData(prhs[0]); // N x L
unsigned int *myVector_C, *myVector_D;
myVector_C = (unsigned int*) mxGetData(prhs[1]); // N x 1
double* myMatrix_B = mxGetData(prhs[2]); // N x L
myVector_D = (unsigned int*) mxGetData(prhs[3]); // N x 1
double* myVector_E = mxGetData(prhs[4]); //1 x L
double* myMatrix_D = mxGetData(prhs[5]); //N X L
calcModulation(myMatrix_A, myVector_C, myMatrix_B, myVector_D, L, myVector_E, num_col, myMatrix_D);
}
Is there something wrong with the way I set the pointers in the mex file?
The dimensions of the MATLAB variables are stated next to the "mxGetData" calls. All are double except for those cast to unsigned int*.
2 comments
James Tursa 2020-12-8
Are the unsigned int* variables actually uint32 class at the MATLAB m-file level?
There is no way for us to determine if your indexing is correct because you don't show us the inputs, and these input values are actually used as indexing into other variables.
Also, you are modifying variables inplace, which is against the rules. I.e., the A and E in calcModulation come from prhs variables which according to the official rules are const.
And you never check that the prhs inputs are actually the class and sizes you expect before you use them.
We don't really have much else to examine based on what you have posted thus far, but I would start with the above comments.
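One way to act on the indexing concern without sharing the data is to validate the index vectors before the first loop runs. The following is only a minimal sketch in plain C, not the poster's code: the helper name `indices_in_range` and its parameters are hypothetical, and it assumes zero-based offsets exactly as they are used in `calcModulation` (with `rows` = 22 and `block` = 132 in the original). In a MEX file, a failed check would presumably be reported with `mexErrMsgIdAndTxt` rather than a return code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the original post.
 * Returns 1 if every index D[n] + rows*(B[n] + m), for m in [0, block),
 * stays inside an array of numel_A elements; returns 0 otherwise. */
int indices_in_range(const uint32_t *B, const uint32_t *D, size_t L,
                     size_t rows, size_t block, size_t numel_A)
{
    if (block == 0) {
        return 1;  /* the inner loop would not touch A at all */
    }
    for (size_t n = 0; n < L; ++n) {
        /* Largest index touched for this n, computed in 64 bits so that
         * the check itself cannot wrap around like 32-bit arithmetic. */
        uint64_t last = (uint64_t)D[n]
                      + (uint64_t)rows * ((uint64_t)B[n] + (uint64_t)(block - 1));
        if (last >= (uint64_t)numel_A) {
            return 0;
        }
    }
    return 1;
}
```

Calling this with the element count of A (for example from `mxGetNumberOfElements`) before the first task would turn a silent out-of-bounds write into a reportable error.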
RH 2020-12-8
Thank you!
[Quote]
Are the unsigned int* variables actually uint32 class at the MATLAB m-file level?
[/Quote]
I cast the doubles to uint32. To be doubly safe, I changed my MEX file so that it uses uint32_T as the data type for the unsigned ints.
[Quote]
Also, you are modifying variables inplace, which is against the rules. I.e., the A and E in calcModulation come from prhs variables which according to the official rules are const.
[/Quote]
I see, I thought what I get is a pointer to the actual data that I may modify. This would allow me to avoid copying and creating large amounts of data, i.e. is it not possible to pass by address without it being a pointer to constant data?
Probably this would explain the behavior.
[Quote]
And you never check that the prhs inputs are actually the class and sizes you expect before you use them.
[/Quote]
That is correct but in my code, I can be sure the data is always in the correct format, i.e. the class and size should always fit.
[Quote]
There is no way for us to determine if your indexing is correct because you don't show us the inputs
[/Quote]
Yes, sorry about that but I cannot be sure what part I am allowed to share and what I am not allowed to share.


Accepted Answer

RH 2020-12-9
Edited: RH 2020-12-9
Alright, thank you for your thorough responses, James. I found the problem. As I initially suspected but then dismissed, I had an insufficient amount of RAM. The data size was significantly larger than I had initially calculated, and therefore the workers did not get enough RAM to satisfy the memory requests in my code.
The solution of course is simple: more RAM or smaller data sizes. We decided to split our data into several parts and process them individually.
I found this out by setting the number of workers to one but keeping the parfor loop in place. With a single worker I got the detailed error message, whereas previously I had only seen a generic "Worker has died blabla" message with nothing concrete.
edit: To avoid confusion:
The code in my opening post was apparently problematic because I changed the content of the input variables which should not be done. What I did was to change my mex file so that dynamic memory was allocated inside of it. Then I ran into the issue when a worker tried to allocate memory but it was not granted by the server it was running on and threw an exception and died.
This post responds to this issue.
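To illustrate the two ideas in this answer (check every allocation, and work on the data in parts), here is a minimal sketch in plain C. It is not the actual fix used above: the function name `accumulate_in_chunks`, the `chunk_cols` parameter, and the use of `malloc` are assumptions for illustration. It performs the "second task" from the question on a limited number of columns at a time and returns an error code instead of crashing when the scratch buffer cannot be allocated.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: accumulates
 *   E[n] += sum over m of F[m + rows*n] * A[m + rows*n]
 * while holding a private copy of at most chunk_cols columns of A.
 * Returns 0 on success, -1 on invalid arguments or failed allocation. */
int accumulate_in_chunks(const double *A, const double *F, double *E,
                         size_t rows, size_t num_col, size_t chunk_cols)
{
    if (rows == 0 || chunk_cols == 0) {
        return -1;
    }
    if (num_col == 0) {
        return 0;  /* nothing to do */
    }
    if (chunk_cols > num_col) {
        chunk_cols = num_col;
    }

    /* One reusable scratch buffer sized for a single chunk,
     * not for the whole matrix. */
    double *scratch = malloc(chunk_cols * rows * sizeof *scratch);
    if (scratch == NULL) {
        return -1;  /* report the failure to the caller */
    }

    for (size_t start = 0; start < num_col; start += chunk_cols) {
        size_t remaining = num_col - start;
        size_t cols = remaining < chunk_cols ? remaining : chunk_cols;

        /* Work on a copy so the input A is never modified. */
        memcpy(scratch, A + start * rows, cols * rows * sizeof *scratch);

        for (size_t n = 0; n < cols; ++n) {
            double sum = 0.0;
            for (size_t m = 0; m < rows; ++m) {
                sum += F[m + rows * (start + n)] * scratch[m + rows * n];
            }
            E[start + n] += sum;
        }
    }

    free(scratch);
    return 0;
}
```

The peak extra memory is `chunk_cols * rows` doubles per worker, so the chunk size can be chosen to fit whatever each SLURM job is actually granted.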

More Answers (1)

James Tursa 2020-12-8
Edited: James Tursa 2020-12-8
Regarding the inplace modification in MATLAB, here is the actual situation:
MATLAB uses a system behind the scenes that is often known as "copy-on-write". That is, multiple variables can share the same data memory. A deep copy is only made when changes are made. The actual behaviour varies a bit depending on MATLAB version, but goes something like this in a recent version:
A = 1:10; % variable A is created, but it is sharing the same data area as a background variable you know nothing about
B = A; % variable B is sharing the same data area as A and the background variable.
% at this point in the code, there are actually three variables sharing the same data area
mymexfunction(A) % suppose this mex function changes the values of A inplace
% at this point in the code, variable B and the background variable have been changed inplace, a nasty side effect
C = 1:10; % variable C maybe gets created as a shared copy of the background variable with the changed values!!!
You are screwed at this point. MATLAB saw the 1:10 pattern when creating C so it might use the background variable for this, but you had inadvertently changed the values of that background variable inplace with your mex routine. If you subsequently did the A = 1:10 line again you would definitely be screwed since the variable is the same.
What to do? You can sometimes get away with modifying variables inplace in a mex routine, but only if you really, really know what you are doing and take extra precautions to make sure the variable isn't shared with any other variable prior to calling your mex routine. Since MATLAB gives you no official tools to determine this, it can be a bit of a crap shoot to know if your code is going to work as you want or expect. See this link for a nasty example:
One method that seems to work for making sure a variable is unshared is the following:
A = something potentially shared with other variables
A(1) = A(1); % MATLAB sees the assignment so it will unshare A first.
mymexfunction(A); % modifying A inplace will *probably* work OK now.
Even so, I am not sure what to expect if you are using parfor loops and each thread is trying to write into the same workspace variable inplace.
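The officially safe alternative is to leave the input alone and write into a copy. Below is a minimal sketch of that idea in plain C for the "first task" from the question; the function name `add_blocks_copy` and its parameters are hypothetical, and the indices are assumed to be zero-based and already validated. In an actual MEX gateway, the copy would more likely be made with `plhs[0] = mxDuplicateArray(prhs[0]);` and returned to MATLAB as an output, instead of using `malloc`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical out-of-place variant of the first task.
 * Returns a newly allocated copy of A_in with the blocks of C added in,
 * or NULL if the allocation fails. The caller must free() the result.
 * Assumes zero-based indices that were validated against numel_A. */
double *add_blocks_copy(const double *A_in, size_t numel_A,
                        const uint32_t *B, const double *C,
                        const uint32_t *D, size_t L,
                        size_t rows, size_t block)
{
    double *A_out = malloc(numel_A * sizeof *A_out);
    if (A_out == NULL) {
        return NULL;
    }
    /* Deep copy: the shared input data is never written to. */
    memcpy(A_out, A_in, numel_A * sizeof *A_out);

    for (size_t n = 0; n < L; ++n) {
        for (size_t m = 0; m < block; ++m) {
            A_out[D[n] + rows * (B[n] + m)] += C[m + block * n];
        }
    }
    return A_out;
}
```

The cost is one extra copy of A per call, which is exactly the memory that has to be budgeted for each worker.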
2 comments
RH 2020-12-8
Thanks for the elaborate reply. I changed my code such that I now do not change the input of the mex functions. However, this does not affect the outcome, some workers still crash for some reason.
Is there some reasonable way to debug the workers? This is a rather difficult problem as it appears to be kind of random if and what worker crashes. Like 4 out of 20 crash.
James Tursa 2020-12-8
You can use the crude debugger (i.e., lots of print statements to make sure your indexing is not running off the end of the valid memory areas), or e.g. in Visual Studio you can compile your mex routine in debug mode and then attach the MATLAB process to your Visual Studio session and try to do the debugging there. But I don't have any experience doing this with parfor.
