[Developers] Big improvement in the function minimizer with GPU.

Tue May 8 10:44:39 PDT 2012

Hi Dave,

Speeding up ADMB by a factor of 32 is a staggering improvement indeed! 
It's great to see the example written in user-end TPL (sections etc.) and 
not some lowest-level hardware-specific system calls in C. Obviously, 
those calls are being made somewhere, but the level of abstraction that 
you demonstrate looks very promising.

As you suggested, it is time to get as many ADMB developers as 
possible to try this out, but how do I start? Specifically:

(1) Does this approach assume a GPU from a certain vendor?

(2) Do I need to install some GPU developer library? Hopefully there is 
no license conflict with BSD.

(3) Can we expect this to work on all operating systems?

(4) Can we do this speed comparison for other ADMB examples, like 
simple.tpl, catage.tpl, or even random effects? If not, what would need to 
be done in order to run those models on GPU?

I ask these questions in an enthusiastic, not pessimistic, tone. This is 
revolutionary stuff. After all, the shiny 48-core Linux server downstairs 
probably has no GPU, so maybe the next IT purchase should be a GPU 
cluster?

It would be great if ADMB provided a -gpu option in the near future. I 
imagine the user would pass that option at an early stage (adcomp and 
adlink) rather than at the end stage (as in mymodel -gpu)?

Arni

On Tue, 8 May 2012, dave fournier wrote:

> To get a proof of concept for any programming technique it is nice to 
> get a big result fairly easily.  Almost all ADMB users rely on the 
> function minimizer fmin in the file newfmin.cpp.  So improving the 
> performance of this function in a more or less transparent way would 
> immediately help a lot of users.
>
>
> I hacked the newfmin.cpp file to add the BFGS quasi-Newton update with 
> the (sort of) inverse Hessian kept on the GPU and the main calcs done 
> on the GPU.
>
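For readers who have not looked at the update itself, here is a minimal 
CPU sketch of the textbook BFGS inverse-Hessian update in plain C++. It is 
not the code from the hacked newfmin.cpp, and the names (bfgs_update, Vec) 
and the dense row-major storage are assumptions made for illustration. The 
point is that each update touches all n*n entries, which for n=6144 is 
roughly 300 MB if stored densely in double precision, so it is the kind 
of work that maps naturally onto a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;  // hypothetical alias, not an ADMB type

// Textbook BFGS update of a dense inverse-Hessian approximation H
// (n x n, row-major, assumed symmetric):
//   H+ = (I - rho*s*y') H (I - rho*y*s') + rho*s*s',  rho = 1/(y's)
// where s = x_new - x_old and y = g_new - g_old.
// Returns false and leaves H unchanged if the curvature condition
// y's > 0 fails.
bool bfgs_update(Vec& H, const Vec& s, const Vec& y) {
    const std::size_t n = s.size();
    assert(y.size() == n && H.size() == n * n);

    double ys = 0.0;
    for (std::size_t i = 0; i < n; ++i) ys += y[i] * s[i];
    if (ys <= 0.0) return false;
    const double rho = 1.0 / ys;

    // Hy = H * y (matrix-vector product, O(n^2))
    Vec Hy(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Hy[i] += H[i * n + j] * y[j];

    double yHy = 0.0;
    for (std::size_t i = 0; i < n; ++i) yHy += y[i] * Hy[i];

    // Expanded form of the product above (uses symmetry of H):
    //   H+ = H - rho*(Hy*s' + s*Hy') + rho*(1 + rho*y'Hy)*s*s'
    const double c = rho * (1.0 + rho * yHy);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            H[i * n + j] += c * s[i] * s[j]
                          - rho * (Hy[i] * s[j] + s[i] * Hy[j]);
    return true;
}
```

After a successful update the new matrix satisfies the secant condition 
H*y = s, which is an easy sanity check on any implementation, CPU or GPU.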
> I tested this with a modified Rosenbrock function with 6144 parameters. 
> The new setup is both much faster and more stable than the old one in 
> newfmin. It appears that newfmin uses a different quasi-Newton update 
> which is not as efficient for a large number of parameters.
>
> This is the tpl file for the example.
>
> DATA_SECTION
>  int n
> !! n=4096+2048;
> PARAMETER_SECTION
>  init_vector x(1,n);
>  objective_function_value f
> PROCEDURE_SECTION
>  for (int i=1;i<=n/2;i++)
>  {
>     f+=100.*square(square(x(2*i-1))-x(2*i))+square(x(2*i-1)-1.0);
>  }
>
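For anyone who wants to check the objective outside ADMB, the 
PROCEDURE_SECTION above can be sketched in plain C++ as follows. This is 
only an illustration: the function names are made up, indexing is 0-based 
instead of the 1-based TPL vector, and the gradient is written out by hand 
here, whereas ADMB obtains it by automatic differentiation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Objective from the TPL example, summed over consecutive pairs (a, b):
//   f = sum 100*(a^2 - b)^2 + (a - 1)^2
// The minimum is f = 0 at x = (1, 1, ..., 1).
double objective(const std::vector<double>& x) {
    assert(x.size() % 2 == 0);
    double f = 0.0;
    for (std::size_t k = 0; k + 1 < x.size(); k += 2) {
        const double a = x[k], b = x[k + 1];
        const double t = a * a - b;
        f += 100.0 * t * t + (a - 1.0) * (a - 1.0);
    }
    return f;
}

// Hand-derived gradient of the same function:
//   df/da = 400*(a^2 - b)*a + 2*(a - 1),  df/db = -200*(a^2 - b)
std::vector<double> gradient(const std::vector<double>& x) {
    assert(x.size() % 2 == 0);
    std::vector<double> g(x.size(), 0.0);
    for (std::size_t k = 0; k + 1 < x.size(); k += 2) {
        const double a = x[k], b = x[k + 1];
        const double t = a * a - b;
        g[k]     = 400.0 * t * a + 2.0 * (a - 1.0);
        g[k + 1] = -200.0 * t;
    }
    return g;
}
```

Each pair of parameters is independent of the others, so the function 
evaluation itself is cheap; the cost that grows with n is in the 
quasi-Newton bookkeeping, not in this loop.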
> The new GPU version took 36 seconds and 477 function evals to converge
> - final statistics:
> 6144 variables; iteration 277; function evaluation 477
> Function value   3.2531e-21; maximum gradient component mag   9.7979e-11
> Exit code = 1;  converg criter   1.0000e-10
>
> real    0m35.414s
> user   0m4.417s <--- most time waiting for the GPU calcs
> sys     0m0.616s
>
> Old version took 288 seconds to do 477 function evaluations but is not 
> nearly as good at this point.
>
> 6144 variables; iteration 300; function evaluation 485; phase 1
> Function value   6.6252316e+00; maximum gradient component mag  -8.4966e+00
>
> Old version converged in about 19 min 36 seconds so the new version with 
> BFGS update on the GPU is about 32 times faster than the old version and 
> probably more stable.
>
> Here is the old version final output
> - final statistics:
> 6144 variables; iteration 1212; function evaluation 2119
> Function value   1.7758e-21; maximum gradient component mag   9.7086e-11
> Exit code = 1;  converg criter   1.0000e-10
>
> real    19m36.357s
> user    19m35.848s
> sys    0m0.093s
>
> Yawn.
>