Mex & Cuda

jason beckell 2012-1-24
Edited: Cogli 2016-3-10
Hello everybody,
I'm a student new to MATLAB and CUDA. I have to write a simple MEX file which takes a vector as input and then calls a routine from a CUDA shared library (void timestwo(float *x, float *y, int n)) which simply multiplies each element of that vector by two.
This is the code of the mex file:
#include "mex.h"
#include "matrix.h"
/*Headr file of the shared library */
#include "doppiolgcc.h"
void myExitFcn()
{
mexPrintf("MEX-file is being unloaded");
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs,
const mxArray *prhs[])
{
double *x, *y;
int i;
int mrows, ncols;
/* The input must be a noncomplex floating-point vector*/
mrows = mxGetM(prhs[0]);
ncols = mxGetN(prhs[0]);
if (!mxIsDouble(prhs[0]) || mxIsComplex(prhs[0]) ||
!(ncols == 1)) {
mexErrMsgTxt("Input must be a noncomplex floating-point vector.");
}
/* Assign pointers to each input and output. */
x = mxGetPr(prhs[0]);
plhs[0] = mxCreateDoubleMatrix(mrows, ncols, mxREAL);
y = mxGetPr(plhs[0]);
/*Call the external routine */
timestwo(x, y, mrows);
if(mexAtExit(myExitFcn))
{
mexPrintf("Error unloading function!");
}
}
The code of the header file is the following:
extern "C" void timestwo(float *x, float *y, int LEN);
And this is the simple CUDA implementation of that routine:
const int N = 256;

__global__ void vecAdd(float* A, float* B)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    B[i] = A[i] * 2.0;
}

extern "C" void timestwo(float *x, float *y, int len)
{
    /* Pointers to device memory */
    float *x_d, *y_d;

    /* Allocate arrays x_d, y_d on the device */
    cudaMalloc((void **) &x_d, sizeof(float)*len);
    cudaMalloc((void **) &y_d, sizeof(float)*len);

    /* Copy data from host memory to device memory */
    cudaMemcpy(x_d, x, sizeof(float)*len, cudaMemcpyHostToDevice);

    /* Launch the computation */
    vecAdd<<< N/len, len >>>(x_d, y_d);

    /* Copy data from device memory to host memory */
    cudaMemcpy(y, y_d, sizeof(float)*len, cudaMemcpyDeviceToHost);

    /* Free the device memory */
    cudaFree(x_d);
    cudaFree(y_d);
}
After doing so, I compile successfully and then launch my application. I initialize my input variable like this:
for i=1:256, a(i)=i; a=a'; end
This is the final output:
b = doppiom(a, i)
b =
         256
         512
         768
         ...
       32512
       32768
           0
           0
         ...
           0
(the first 128 entries are 256, 512, ..., 32768, i.e. 256*i; the remaining 128 entries are all zero)
So I obtain b(i) = 256*i for the first 128 entries and zero for the rest, instead of b(i) = 2*a(i) for every i. Why doesn't it work?
Thank you all very much!
Jason.

Answers (2)

Friedrich 2012-1-24
Hi,
I am not a CUDA expert but as far as I can tell the reason for this behavior is the way you call vecAdd:
vecAdd<<< N/len, len>>>(x_d, y_d);
You start 256/len blocks, where each block has len threads. I would rather try something like this:
vecAdd<<< 1, len>>>(x_d, y_d);
and in vecAdd do:
int i = threadIdx.x;
B[i] = A[i]*2.0;
Since there is a limit (1024) on the number of threads per block, this won't work correctly for large vectors. So if you would like blocks of N = 256 threads, I would try this:
vecAdd<<< len/N, N>>>(x_d, y_d);
That way you get len/N blocks, where each block runs N threads. And in vecAdd keep what you already have:
int i = threadIdx.x + blockDim.x * blockIdx.x;
B[i] = A[i]*2.0;
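For reference, a minimal sketch of this launch pattern (the rounded-up block count, the extra len argument, and the index guard are additions on top of the posted code, so it also works when len is not an exact multiple of N):
const int N = 256;   /* threads per block */

__global__ void vecAdd(float *A, float *B, int len)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < len)                     /* guard threads past the end of the vector */
        B[i] = A[i] * 2.0f;
}

extern "C" void timestwo(float *x, float *y, int len)
{
    float *x_d, *y_d;
    cudaMalloc((void **) &x_d, sizeof(float) * len);
    cudaMalloc((void **) &y_d, sizeof(float) * len);
    cudaMemcpy(x_d, x, sizeof(float) * len, cudaMemcpyHostToDevice);

    int blocks = (len + N - 1) / N;  /* round up so every element is covered */
    vecAdd<<< blocks, N >>>(x_d, y_d, len);

    cudaMemcpy(y, y_d, sizeof(float) * len, cudaMemcpyDeviceToHost);
    cudaFree(x_d);
    cudaFree(y_d);
}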

jason beckell 2012-1-25
Thank you very much, Friedrich, for your suggestion! It's very kind of you! In any case, the main problem was that the CUDA file expected float variables as inputs, whereas MATLAB passed it double variables. Thank you very much again, and to you all!
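To illustrate that point, here is a minimal sketch of the gateway (not the code from the thread; the input checks are omitted): since timestwo works on float arrays while mxGetPr returns pointers to double data, the gateway has to convert explicitly, for example by copying through temporary single-precision buffers around the call.
#include "mex.h"
#include "doppiolgcc.h"   /* declares: void timestwo(float *x, float *y, int n); */

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *x, *y;
    float *xf, *yf;
    int i, mrows;

    mrows = (int) mxGetM(prhs[0]);
    x = mxGetPr(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(mrows, 1, mxREAL);
    y = mxGetPr(plhs[0]);

    /* Temporary single-precision buffers for the CUDA routine */
    xf = (float *) mxMalloc(mrows * sizeof(float));
    yf = (float *) mxMalloc(mrows * sizeof(float));
    for (i = 0; i < mrows; i++)
        xf[i] = (float) x[i];        /* double -> float */

    timestwo(xf, yf, mrows);         /* the library now receives real floats */

    for (i = 0; i < mrows; i++)
        y[i] = (double) yf[i];       /* float -> double for the output */

    mxFree(xf);
    mxFree(yf);
}
Alternatively, one could pass single(a) from MATLAB and read the data with mxGetData, or change the CUDA routine itself to work on double on hardware that supports double precision.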
1 Comment
Cogli 2016-3-10
Edited: Cogli 2016-3-10
I have encountered the same situation. I used the double type in both the main MEX .cpp file and the customized .cu file, and my returned result (i.e. plhs) is always 0.
What did you mean by "the main problem was that the CUDA file expected float variables as inputs, whereas MATLAB passed it only double variables"?
