Dave, I am wondering why you didn't use the OpenCL library, as I did in my matrix multiplication example at the workshop?  If you do, there is no need for a special compiler (nvcc) or extra makefiles, and the code is already optimized.<br>

<br>

Yes, the limiting factor is moving data to and from the GPU over the bus, and for addition that cost outweighs the cost of the addition operations themselves. It's the same for OpenCL; that's why I did the matrix multiplication example.<br>

<br>

Also, I don't see how you are carrying the derivative information around. That has been my issue thus far: CUDA and OpenCL don't support C++ classes yet!  Please let me know what you think of this, as this parallelization has been of ongoing interest to me.<br>

<br>

Thanks,<br>

Chris<br>

<br>

----- Original Message -----<br>

From: dave fournier <davef@otter-rsch.com><br>

Date: Saturday, September 3, 2011 4:05 pm<br>

Subject: Re: [ADMB Users] Does CUDA suck?  answer NO!<br>

To: users@admb-project.org<br>

<br>

> First there is an error in the code. It should read<br>

> <br>

>            <br>

> return z;<br>

> <br>

>  and not<br>

> <br>

>           return x+y;<br>

> <br>

> However I thought that maybe the problem is that  addition <br>

> is too trivial compared to the<br>

> overhead of moving things to the GPU and back. I changed the <br>

> function to pow(x,y)<br>

> and lo!  the el cheapo GPU is faster (about 6 times faster).<br>

> So how hard is a vector pow?  All that was necessary was to <br>

> take the included VecAdd<br>

> function and modify it to<br>

> <br>

> <br>

> __global__ void VecPow(const double* A, const double* B, double* C, int N)<br>

> {<br>

>     int i = blockDim.x * blockIdx.x + threadIdx.x;<br>

>     if (i < N)<br>

>     {<br>

>         C[i] = pow(A[i],B[i]);<br>

>     }<br>

> }<br>

> <br>

> Code is attached. Note I use mypow just to avoid clash with <br>

> existing admb libs.<br>

> <br>

>
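
For readers following the thread, here is a minimal host-side sketch of how a kernel like VecPow might be launched and timed, with the host-to-device copy, the kernel, and the device-to-host copy measured separately, since the transfer cost is the point under discussion. This is not the attached code: the helper name vec_pow_max_error, the CHECK macro, the input values, and the block size of 256 are all assumptions made for illustration, and it presumes a GPU with double-precision support.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Kernel adapted from the message above.
__global__ void VecPow(const double* A, const double* B, double* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        C[i] = pow(A[i], B[i]);
    }
}

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical driver: runs VecPow on n elements, prints the time spent in
// each phase, and returns the largest absolute difference from host pow().
double vec_pow_max_error(int n)
{
    size_t bytes = (size_t)n * sizeof(double);
    std::vector<double> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) {
        hA[i] = 1.0 + (double)i / n;        // bases in [1, 2)
        hB[i] = 0.5 + 2.0 * (double)i / n;  // exponents in [0.5, 2.5)
    }

    double *dA = 0, *dB = 0, *dC = 0;
    CHECK(cudaMalloc((void**)&dA, bytes));
    CHECK(cudaMalloc((void**)&dB, bytes));
    CHECK(cudaMalloc((void**)&dC, bytes));

    cudaEvent_t t0, t1, t2, t3;
    CHECK(cudaEventCreate(&t0));
    CHECK(cudaEventCreate(&t1));
    CHECK(cudaEventCreate(&t2));
    CHECK(cudaEventCreate(&t3));

    // Phase 1: host -> device transfer.
    CHECK(cudaEventRecord(t0, 0));
    CHECK(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(t1, 0));

    // Phase 2: the kernel itself, one thread per element.
    int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    VecPow<<<blocks, threads>>>(dA, dB, dC, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(t2, 0));

    // Phase 3: device -> host transfer.
    CHECK(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaEventRecord(t3, 0));
    CHECK(cudaEventSynchronize(t3));

    float h2d = 0.0f, kern = 0.0f, d2h = 0.0f;
    CHECK(cudaEventElapsedTime(&h2d, t0, t1));
    CHECK(cudaEventElapsedTime(&kern, t1, t2));
    CHECK(cudaEventElapsedTime(&d2h, t2, t3));
    printf("n=%d  copy in: %.3f ms  kernel: %.3f ms  copy out: %.3f ms\n",
           n, h2d, kern, d2h);

    double max_err = 0.0;
    for (int i = 0; i < n; ++i) {
        double diff = fabs(hC[i] - pow(hA[i], hB[i]));
        if (diff > max_err) max_err = diff;
    }

    CHECK(cudaEventDestroy(t0));
    CHECK(cudaEventDestroy(t1));
    CHECK(cudaEventDestroy(t2));
    CHECK(cudaEventDestroy(t3));
    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaFree(dC));
    return max_err;
}

int main()
{
    double err = vec_pow_max_error(1 << 20);
    printf("max abs difference from host pow: %g\n", err);
    return 0;
}
```

Built with something like `nvcc vecpow.cu -o vecpow` (the file name is arbitrary), the three printed timings give a rough way to see on a particular card whether the copies or the kernel dominate. Swapping the kernel body for a plain addition would be one way to check the addition case discussed above.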