mex CUDA code to calculate Coulomb field

1 次查看(过去 30 天)
Hello, I've been trying to CUDA-ify a mex code I had written for calculating the electric field produced by M point charges whose coordinates are stored in an Mx3 array of [x,y,z]'s. The field is calculated at a set of points specified by a similar Nx3 array.
I had used nested for loops, of which the outer ran over the coordinates of the charges, and the inner - those of the points of interest.
Thus for each charge, with coordinates [xc,yc,zc] the field would be calculated at each location, such that the output of the inner loop would produce Ex, Ey, Ez - so that size(Ex) = size(Ey) = size(Ez) = Nx3.
Each field component generated by the next charge would then be added up in the outer loop, so that the total field created at each point would come out of the outer loop at the end. So something like:
if true
% code
for(i = 0; i<n_charges ;i++) {
Ex=Ey=Ez=0;
for (l = 0; l < length_of_coord_list; l++) {
R[0] = vecr[l] - qxyz[i];
R[1] = vecr[l+length_of_coord_list] - qxyz[i+n_charges];
R[2] = vecr[l+length_of_coord_list*2] - qxyz[i+n_charges*2];
R2 = R[0]*R[0]+R[1]*R[1]+R[2]*R[2]; /* calculate distance */
R3 = pow(sqrt(R21),3); /*calculate R^3*/
Ex = Ex+Q[0]*(R[0]/R3);
Ey = Ey+Q[0]*(R[1]/R3);
Ez = Ez+Q[0]*(R[2]/R1);
}
vro[ i] = Ex;
vro[ l+length_of_coord_list] = Ey;
vro[l+length_of_coord_list*2] = Ez;
}
end
where "vecr" is the list of coordinates where the field is measured, "qxyz"- the coordinates of the charges, R[0], R[1] and R[2] - the x, y and z components of the vector connecting a given point to the given charge. Q[0] is the charge (a constant); "vro" is the output of the mex function.
Now it seems to me that this code would greatly benefit from running on a GPU instead of a CPU. The GPU in question is an "NVidia 650m" on a Macbook Pro Retina.
I have managed to compile and successfully run an example code mexGPUExample.cu taken from http://www.mathworks.com/help/distcomp/create-and-run-mex-files-containing-cuda-code.html#btrgjg_ and thought I would build on that. However being completely new to CUDA programming I don't have a good feel for how to proceed, especially with loops. So here are the main questions I have:
1) After some searching I came across a couple of posts where the authors advise against directly programming nested loops in CUDA, and suggest to instead run the outer loop on the CPU and the inner (the one with more iterations, and hence that which will benefit more from parallelization) on the GPU. It seems that this method would also result in a more transparent code, which is a definite plus for a CUDA beginner.
2) I am trying to set it up to run the loop over the charge coordinates on the CPU and the loop over l : for (l = 0; l < length_of_coord_list; l++) on the GPU. However I'm not sure if a loop is necessary at all - from the example and some literature it appears that the iterations of the loop are "taken care of" by the blocks, threads and such (see http://mexgpuexample.cu/).
3) In the example code there seems to be a one-to-one correspondence between B[i] and A[i], whereas I would need to something less straightforward using the three components of the arrays representing charge positions and coordinates of the field measurement, which is simpler in an explicit loop. So I'm not sure how to handle the two matrices, of dimensions Mx3 and Nx3 respectively. For instance the indexing in
vro[ i] = Ex;
vro[ l+length_of_coord_list] = Ey;
vro[l+length_of_coord_list*2] = Ez;
is such because the Nx3 is converted to 3*Nx1 for the C code, but such convention may not play well with the CUDA kernel? Perhaps I could feed the x, y and z components of the matrices to the kernel as separate 1-D arrays and go from there? 4) I end up having to invoke gpuDevice(1) to reset the device quite often. Otherwise GPU LAUNCH ERRORS pop up. Is this normal (at least for macs with 650m's)?
Anyway, I apologize for the long-winded question. I realize that the answer must be quite simple but I can't seem to find it. Thanks in advance.
Mike.
P.S. I have looked into bsxfun, arrayfun etc (who mostly choke on big matrices in my experience), but I would like to squeeze every bit of speed out of the machine, so mex+CUDA seems to be the way to go.

回答(1 个)

James Lebak
James Lebak 2013-4-9
编辑:James Lebak 2013-4-23
Mike,
I'm glad to hear that you're using the GPU MEX API. Below are my answers to your questions, which you can accept or reject as you see fit.
1-2) If I understand your problem, you can think of it in two steps: calculate the contributions of each individual charge at each point of interest, and then accumulate the field strength at each point of interest. If you have the memory available, I would be tempted to separate these steps. Do the first one on the GPU as your first CUDA project, and either do the accumulation on the CPU or use MATLAB's sum function on the GPU for the second step. After you're confident that you understand the code to compute the contributions, then you can try integrating the parallel accumulation into your MEX file. There are good references on how to do this type of operation in CUDA but it's not trivial to get the best performance. If you can make MATLAB's sum function, which is overloaded for the GPU, do the work for you, it might make your life easier.
My advice for parallelizing the first step would be to use your MEX file to create an M by N by 3 matrix A, in which the elements (i,j,:) represent the length-3 vector describing the field contribution of charge i at coordinate j. You could parallelize over a total of MN threads and have each thread calculate the three values A(i,j,:). Then I think B=squeeze(sum(A, 1)) would do the second step, producing an N by 3 matrix where the elements in row j are the Ex, Ey, Ez values calculated by your original code.
3) For the purposes of getting coalesced memory accesses, it is likely to be better to have the individual components of the matrices in different 1-D arrays as you suggest.
4) Most importantly, I want to confirm that when developing MEX files in R2013a, you can get a launch failure after a recompile or a clear mex, unless you reset the device first. This is a bug in MATLAB R2013a, and the bug report is here: http://www.mathworks.com/support/bugreports/details.html?rp=880130
Finally, when I ran GPUBench on the Retina MacBook Pro back in August, I observed much better performance from the GPU for single precision than for double. I know you square values for the distance calculation, so you may not be able to use single precision in this example, but if you can you will probably achieve better performance.
  2 个评论
Misha Dabaghyan
Misha Dabaghyan 2013-4-10
Hi James,
Thank you for the detailed answer. I like your approach. But I don't think I have enough memory to pull it off - the matrices involved in the calculations are rather large, and when I tried to prototype the code with bsxfun or arrayfun or some such - the computer hangs, unless I use small matrices as inputs.
I did attempt to write a code that for now just calculates the field values for only one charge located at a single spacial coordinate qcoor = [x0,y0,z0]. So the mex code just takes one 3x1 coordinate for the charge and an Nx3 list of coordinates for the points of interest. The idea is then to perform this calculation in an external loop over the charges (on the CPU in a mex or directly in matlab) and sum up the contributions.
I based my code on the example shown here http://www.slideshare.net/MarioGazziro/cuda-12084616#btnNext Here is my code.
if true % code
#include "cuda.h" #include "mex.h" #include "/usr/include/math.h"
/*Kernel to compute elements of the array on the GPU*/ _global_ void guilherme_kernel(float qcoor, float vecr, float* B, int N) { int i = blockIdx.x*blockDim.x+threadIdx.x; float x, y, z, R3;
if( i<N )
{
/* M[i+j*N] = g1[N+i-j]*(K[i]+k)*(K[j]+k); */
x = vecr[i] - qcoor[1];
y = vecr[i+N] - qcoor[2];
z = vecr[i+2*N] - qcoor[3];
R3 = pow(sqrt(x*x + y*y + z*z),3);
B[i] = x/R3;
B[i+N] = y/R3;
B[i+2*N] = z/R3;
}
}
/*Gateway function*/ void mexFunction(int nlhs, mxArray * plhs[], int nrhs, const mxArray *prhs[]) { int j, m_0, m_1, m_o, n_0, n_1, n_o;
double* data1, *data2, *data3;
float* data1f, *data2f, *data3f;
float* data1f_gpu, *data2f_gpu, *data3f_gpu;
mxClassID category;
if(nrhs!=(nlhs+1))
mexErrMsgTxt("The number of input and output arguments must be the same.");
/* Find the dimensions of the data */
m_0 = mxGetM(prhs[0]);
n_0 = mxGetN(prhs[0]);
/* Create an input and output data array on the GPU */
cudaMalloc((void**)&data1f_gpu,sizeof(float)*m_0*n_0);
/*Retrieve the input data*/
data1 = mxGetPr(prhs[0]);
/* Check if the input array is single or double precision */
category = mxGetClassID(prhs[0]);
if(category==mxSINGLE_CLASS)
{
/*The input array is single precision, it can be sent directly to the card*/
cudaMemcpy(data1f_gpu,data1,sizeof(float)*m_0*n_0,
cudaMemcpyHostToDevice);
}
/*Find the dimensions of the data*/
m_1 = mxGetM(prhs[1]);
n_1 = mxGetN(prhs[1]);
/*Create an input and output data array on the GPU*/
cudaMalloc((void**)&data2f_gpu,sizeof(float)*m_1*n_1);
/*Retrieve the input data*/
data2 = mxGetPr(prhs[1]);
/*Check if the input array is single or double precision*/
category = mxGetClassID(prhs[1]);
if(category==mxSINGLE_CLASS)
{
/*The input array is single precision, it can be sent directly to the card*/
cudaMemcpy(data2f_gpu,data2,sizeof(float)*m_1*n_1,
cudaMemcpyHostToDevice);
}
/*Find the dimensions of the data*/
/*
* m_o = n_0;
* n_o = n_1;
*/
m_o = m_1;
n_o = n_1;
/*Create an input and output data array on the GPU*/
/* pretty sure this is the output*/
cudaMalloc((void**) & data3f_gpu, sizeof(float)*m_o*n_o);
/* Compute execution configuration using 128 threads per block */
dim3 dimBlock(128);
dim3 dimGrid((m_o*n_o)/dimBlock.x);
if((n_o*m_o)%128!=0) dimGrid.x+=1;
/*Call function on GPU*/
guilherme_kernel<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, data3f_gpu, n_o*m_o/3);
data3f = (float*)mxMalloc(sizeof(float)*m_o*n_o);
/*Copy result back to host*/
cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o, cudaMemcpyDeviceToHost);
/*Create anmxArray for the output data */
plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);
/*Create a pointer to the output data*/
data3 = mxGetPr(plhs[0]);
}
end
The output array is supposed to be the same size as the list of the coordinates of the points of interest. I am not sure if the code is correct, but I think I mostly understand it enough to slightly modify it while following the logic of the original example.
The problem is that if I compile it using just "mex coulomb.cu" - it prints a few warnings, no errors and runs fine, but returns all zeroes. This happens both for the actual code in the example and for my modified program. If I try to compile it with
*nvmex -f nvmexopts.bat brazil_coulomb.cu -IC:cudainclude -LC:cudalib -lcufft -lcudart #include "cuda.h" #include "mex.h" #include "/usr/include/math.h" *
I get "Error using nvmex Too many input arguments." ------------------------------
if I try simple to use nvmex('coulomb.cu') I get >> nvmex('coulomb.cu') /usr/local/cuda/bin/nvcc --compile coulomb.cu -o coulomb.o --compiler-options -fPIC -I/Applications/MATLAB_R2013a.app/extern/include coulomb.cu(29): warning: variable "j" was declared but never referenced
coulomb.cu(31): warning: variable "data3" was set but never used
coulomb.cu(33): warning: variable "data1f" was declared but never referenced
coulomb.cu(33): warning: variable "data2f" was declared but never referenced
coulomb.cu(29): warning: variable "j" was declared but never referenced
coulomb.cu(31): warning: variable "data3" was set but never used
coulomb.cu(33): warning: variable "data1f" was declared but never referenced
coulomb.cu(33): warning: variable "data2f" was declared but never referenced
mex ('coulomb.o', '-L/usr/local/cuda/lib', '-lcudart')
Warning: No source files in argument list. Assuming C source
code for linking purposes. To override this
assumption use '-fortran' or '-cxx'.
*
ld: warning: ignoring file coulomb.o, file was built for i386 which is not the architecture being linked (x86_64): coulomb.o
Undefined symbols for architecture x86_64:
"_mexFunction", referenced from:
-exported_symbol[s_list] command line option
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
mex: link of ' "coulomb.mexmaci64"' failed.
Error using mex (line 206) Unable to complete successfully. Error in nvmex (line 40) eval(mexCommandLine); *
Here is what my nvmex looks like (taken from online examples as well)
function nvmex(cuFileName)
%NVMEX Compiles and links a CUDA file for MATLAB usage
% NVMEX(FILENAME) will create a MEX-File (also with the name FILENAME) by
% invoking the CUDA compiler, nvcc, and then linking with the MEX
% function in MATLAB.
% Copyright 2009 The MathWorks, Inc.
% !!! Modify the paths below to fit your own installation !!!
if ispc % Windows
CUDA_LIB_Location = 'C:\CUDA\lib';
Host_Compiler_Location = '-ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin"';
PIC_Option = '';
else % Mac and Linux (assuming gcc is on the path)
% CUDA_LIB_Location = '/usr/local/cuda/lib64';
CUDA_LIB_Location = '/usr/local/cuda/lib';
Host_Compiler_Location = '';
PIC_Option = ' --compiler-options -fPIC ';
end
% !!! End of things to modify !!!
[~, filename] = fileparts(cuFileName);
nvccCommandLine = [ ...
'/usr/local/cuda/bin/nvcc --compile ' cuFileName ' ' Host_Compiler_Location ' ' ...
' -o ' filename '.o ' ...
PIC_Option ...
' -I' matlabroot '/extern/include ' ...
];
mexCommandLine = ['mex (''' filename '.o'', ''-L' CUDA_LIB_Location ''', ''-lcudart'')'];
disp(nvccCommandLine);
status = system(nvccCommandLine);
if status < 0
error 'Error invoking nvcc';
end
disp(mexCommandLine);
eval(mexCommandLine);
end
The next thing I think I'll try is to just compile the .cu code in the command line outside matlab (with all the mex-related parts removed) just to see if the kernel does what it is supposed to do.
I would appreciate any comments or suggestions. It seems like a fairly simple thing to do, but somehow I can't get a handle on it. Thanks for your help.
James Lebak
James Lebak 2013-4-11
Looking at the last four lines of your Mex code, are you returning the pointer to the correct matrix? I would move the cudaMemcpy to the end and copy into data3 rather than data3f.
You should not need to use nvmex.

请先登录,再进行评论。

类别

Help CenterFile Exchange 中查找有关 GPU CUDA and MEX Programming 的更多信息

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by