Dave, I am wondering why you didn't use the OpenCL library, as I did in my matrix mult example at the workshop. With OpenCL there is no requirement for a special compiler (nvcc) or extra makefiles, and the code is already optimized.

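For reference, here is roughly what the same kind of thing looks like in OpenCL (an untested sketch of mine; the vec_pow name is illustrative, error checking and resource releases are omitted, and it assumes one GPU that supports the cl_khr_fp64 extension). The kernel is ordinary C source that the driver compiles at runtime, which is exactly why no nvcc or extra makefile rules are needed:

    /* Sketch: vector pow via OpenCL.  The kernel source is compiled at
       runtime by clBuildProgram -- this step replaces nvcc. */
    #include <CL/cl.h>

    static const char* src =
        "#pragma OPENCL EXTENSION cl_khr_fp64 : enable    \n"
        "__kernel void vec_pow(__global const double* A,  \n"
        "                      __global const double* B,  \n"
        "                      __global double* C, int N) \n"
        "{                                                \n"
        "    int i = get_global_id(0);                    \n"
        "    if (i < N) C[i] = pow(A[i], B[i]);           \n"
        "}                                                \n";

    void vec_pow(const double* A, const double* B, double* C, int N)
    {
        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* runtime compilation of the kernel source above */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vec_pow", NULL);

        size_t bytes = N * sizeof(double);
        cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void*)A, NULL);
        cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void*)B, NULL);
        cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
        clSetKernelArg(k, 1, sizeof(cl_mem), &dB);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dC);
        clSetKernelArg(k, 3, sizeof(int), &N);

        size_t global = (size_t)N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, C, 0, NULL, NULL);
    }
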
Yes, the limiting factor is busing data to and from the GPU, and for addition that overhead outweighs the cost of the addition operations. It's the same for OpenCL; that's why I did the matrix mult example.

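A rough back-of-envelope makes the point (assuming a PCIe 2.0 x16 link at roughly 8 GB/s effective; the numbers are only indicative):

    vector add:  moves 3 x 8N bytes for N flops, about 1/24 flop per byte,
                 so at ~8 GB/s the bus caps throughput near 0.3 Gflop/s
    vector pow:  same 24N bytes of traffic, but pow = exp(b*log(a)) costs
                 tens of flops per element, so there is real work per byte
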
Also, I don't see how you are carrying the derivative information around; that has been my issue so far, since CUDA and OpenCL don't support C++ classes yet! Please let me know what you think, as this parallelization has been of ongoing interest to me.

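To make the question concrete, the only way I can see it working is to keep the kernels on plain double arrays, ship the elementwise partials back alongside the values, and let the host stitch them into ADMB's tape afterwards. A hypothetical sketch (my own, not from your attached code) for C = pow(A,B), using d(a^b)/da = b*a^(b-1) and d(a^b)/db = a^b*log(a):

    // Hypothetical kernel: returns values plus partials so the AD
    // bookkeeping can stay on the host.  Assumes A[i] > 0 for log().
    __global__ void VecPowWithDerivs(const double* A, const double* B,
                                     double* C, double* dCdA, double* dCdB,
                                     int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < N)
        {
            double v = pow(A[i], B[i]);
            C[i]    = v;                             // value
            dCdA[i] = B[i] * pow(A[i], B[i] - 1.0);  // d(a^b)/da
            dCdB[i] = v * log(A[i]);                 // d(a^b)/db
        }
    }

The host would then record those two partials against the dvariable objects, so no C++ class ever has to cross onto the device. Is that roughly what you're doing?
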
Thanks,
Chris

----- Original Message -----
From: dave fournier <davef@otter-rsch.com>
Date: Saturday, September 3, 2011 4:05 pm
Subject: Re: [ADMB Users] Does CUDA suck? answer NO!
To: users@admb-project.org

> First there is an error in the code. It should read
>
> return z;
>
> and not
>
> return x+y;
>
> However I thought that maybe the problem is that addition is too trivial
> compared to the overhead of moving things to the GPU and back. I changed
> the function to pow(x,y) and lo! the el cheapo GPU is faster (about 6
> times faster). So how hard is a vector pow? All that was necessary was
> to take the included VecAdd function and modify it to
>
> __global__ void VecPow(const double* A, const double* B, double* C, int N)
> {
>     // one thread per element: C[i] = A[i]^B[i]
>     int i = blockDim.x * blockIdx.x + threadIdx.x;
>     if (i < N)
>     {
>         C[i] = pow(A[i], B[i]);
>     }
> }
>
> Code is attached. Note I use mypow just to avoid a clash with existing
> admb libs.
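
P.S. For anyone reading along without the attachment, I imagine the host side of Dave's test looks roughly like the following (my own guess at its shape, not the attached file; the 256-thread block size is illustrative and error checks are omitted):

    // Hypothetical driver for the VecPow kernel above (same .cu file).
    #include <cuda_runtime.h>

    void gpu_vec_pow(const double* A, const double* B, double* C, int N)
    {
        size_t bytes = N * sizeof(double);
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);

        // host -> device: this is the bus traffic discussed above
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        int threads = 256;                         // illustrative choice
        int blocks  = (N + threads - 1) / threads;
        VecPow<<<blocks, threads>>>(dA, dB, dC, N);

        // device -> host
        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }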