[Developers] Big improvement in the function minimizer with GPU.

Tue May 8 11:43:09 PDT 2012

On 12-05-08 10:59 AM, anders at nielsensweb.org wrote:
6144 variables; iteration 11; function evaluation 22; phase 1
Function value   3.5523132e+02; maximum gradient component mag   1.0031e+01
HF norm2(K1-K)= 9.26204e-29

I have attached the example.  As is usual with this stuff, I have forgotten
a lot of the painful details which I learned the hard way.  If you unzip
all this into a directory and have the right GPU stuff installed as well as
ADMB, you should be ready to go.

We will need to put together a help thing once we see how portable this is.

Of course this example is intended to make the GPU look good.  The idea is
a lot of parameters used in a function that itself does not involve a lot of
intensive calculations.  Finding out that the BFGS method is more stable than
the old newfmin one was an unexpected bonus.  The BFGS update takes
more time than the one in newfmin, so this adds to the advantage of the GPU.
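
To make the cost argument concrete, here is a minimal sketch in plain Python.
It assumes the textbook BFGS update of a dense inverse-Hessian estimate; the
hacked newfmin.cpp keeps only a "(sort of)" inverse Hessian on the GPU, so the
actual code may differ.  The function names are hypothetical.  The objective
is the one from the tpl file quoted further down and costs O(n) per
evaluation, while the update touches all n*n matrix entries.

```python
def modified_rosenbrock(x):
    """Objective from the quoted tpl file (0-based indexing here); O(n) work."""
    return sum(
        100.0 * (x[2 * i] ** 2 - x[2 * i + 1]) ** 2 + (x[2 * i] - 1.0) ** 2
        for i in range(len(x) // 2)
    )


def bfgs_inverse_update(H, s, y):
    """Textbook BFGS update of a symmetric inverse-Hessian estimate H.

    s is the step (x_new - x_old), y is the gradient change (g_new - g_old).
    Assumption: H is stored as a dense list of rows.
    """
    n = len(s)
    sy = sum(si * yi for si, yi in zip(s, y))
    if sy <= 0.0:
        # Curvature condition failed: return a copy of H unchanged.
        return [row[:] for row in H]
    rho = 1.0 / sy
    # Matrix-vector product H*y: O(n^2) work.
    Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
    yHy = sum(yi * hyi for yi, hyi in zip(y, Hy))
    c = rho * (1.0 + rho * yHy)
    # Rank-two correction applied to every entry of H: O(n^2) work again.
    return [
        [H[i][j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]) + c * s[i] * s[j]
         for j in range(n)]
        for i in range(n)
    ]


# Tiny usage example: after the update, H1 maps y back onto s
# (the secant condition), and H1 stays symmetric.
H1 = bfgs_inverse_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.5], [2.0, 1.5])
```

With n=6144 the matrix has 6144*6144 entries, so each update does far more
arithmetic than one evaluation of the objective, which is the situation this
example was built to show.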

For any questions, please identify the GPU and OS you are using to help us
get an idea of how portable this is.

The AMD SDK supplies a utility named clinfo, which for me produces:

Number of platforms:                             1
   Platform Profile:                              FULL_PROFILE
   Platform Version:                              OpenCL 1.2 AMD-APP (923.1)
   Platform Name:                                 AMD Accelerated Parallel Processing
   Platform Vendor:                               Advanced Micro Devices, Inc.
   Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

   Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
   Device Type:                                   CL_DEVICE_TYPE_GPU
   Device ID:                                     4098
   Board name:                                    AMD Radeon HD 6900 Series
   Device Topology:                               PCI[ B#2, D#0, F#0 ]
   Max compute units:                             22
   Max work items dimensions:                     3
     Max work items[0]:                           256
     Max work items[1]:                           256
     Max work items[2]:                           256
   Max work group size:                           256
   Preferred vector width char:                   16
   Preferred vector width short:                  8
   Preferred vector width int:                    4
   Preferred vector width long:                   2
   Preferred vector width float:                  4
   Preferred vector width double:                 2
   Native vector width char:                      16
   Native vector width short:                     8
   Native vector width int:                       4
   Native vector width long:                      2
   Native vector width float:                     4
   Native vector width double:                    2
   Max clock frequency:                           0Mhz
   Address bits:                                  32
   Max memory allocation:                         509870080
   Image support:                                 Yes
   Max number of images read arguments:           128
   Max number of images write arguments:          8
   Max image 2D width:                            8192
   Max image 2D height:                           8192
   Max image 3D width:                            2048
   Max image 3D height:                           2048
   Max image 3D depth:                            2048
   Max samplers within kernel:
   Max size of kernel argument:                   1024
   Alignment (bits) of base address:              2048
   Minimum alignment (bytes) for any datatype:    128
   Single precision floating point capability
     Denorms:                                     No
     Quiet NaNs:                                  Yes
     Round to nearest even:                       Yes
     Round to zero:                               Yes
     Round to +ve and infinity:                   Yes
     IEEE754-2008 fused multiply-add:             Yes
   Cache type:                                    None
   Cache line size:                               0
   Cache size:                                    0
   Global memory size:                            2039480320
   Constant buffer size:                          65536
   Max number of constant args:                   8
   Local memory type:                             Scratchpad
   Local memory size:                             32768
   Kernel Preferred work group size multiple:     64
   Error correction support:                      0
   Unified memory for Host and Device:            0
   Profiling timer resolution:                    1
   Device endianess:                              Little
   Available:                                     Yes
   Compiler available:                            Yes
   Execution capabilities:
     Execute OpenCL kernels:                      Yes
     Execute native function:                     No
   Queue properties:
     Out-of-Order:                                No
     Profiling :                                  Yes
   Platform ID:                                   0x7feb9b2660a0
   Name:                                          Cayman
   Vendor:                                        Advanced Micro Devices, Inc.
   Device OpenCL C version:                       OpenCL C 1.2
   Driver version:                                CAL 1.4.1720
   Profile:                                       FULL_PROFILE
   Version:                                       OpenCL 1.2 AMD-APP (923.1)
   Extensions:                                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt

An important line is

   Max memory allocation:                         509870080

This was learned with a lot of pain.  The card has 2GB of memory, but a
contiguous memory buffer on it can only be this big.  Whether you can have
two contiguous memory buffers this big I don't know yet, but will soon.

To get this much I had to set an environment variable.

GPU_MAX_HEAP_SIZE=95

It took me two weeks to figure this out.
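
As a rough check of why this limit matters, the arithmetic can be sketched in
Python.  It assumes the matrix kept on the GPU is a dense n-by-n array of
8-byte doubles held in a single buffer; the storage layout of the real code
is not described here, so treat that as an assumption.  The two byte counts
are copied from the clinfo output above and the helper names are
hypothetical.

```python
import math

MAX_ALLOC_BYTES = 509870080     # "Max memory allocation" from clinfo
GLOBAL_MEM_BYTES = 2039480320   # "Global memory size" from clinfo
BYTES_PER_DOUBLE = 8            # assumption: double-precision storage


def dense_matrix_bytes(n, bytes_per_entry=BYTES_PER_DOUBLE):
    """Bytes needed for a dense n-by-n matrix in one contiguous buffer."""
    return n * n * bytes_per_entry


def largest_square_matrix(max_bytes, bytes_per_entry=BYTES_PER_DOUBLE):
    """Largest n such that an n-by-n matrix fits in max_bytes."""
    return math.isqrt(max_bytes // bytes_per_entry)


needed = dense_matrix_bytes(6144)
print(needed)                                  # 301989888
print(needed <= MAX_ALLOC_BYTES)               # True
print(largest_square_matrix(MAX_ALLOC_BYTES))  # 7983
print(MAX_ALLOC_BYTES * 4 == GLOBAL_MEM_BYTES) # True
```

Under these assumptions the 6144-parameter example needs about 302 MB for
the matrix, which fits, and a single buffer tops out at roughly 7983
parameters.  The reported single-buffer limit is exactly one quarter of the
reported global memory size.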

As always

     enjoy   ...

> Wow, this sounds awesome!
>
> Any idea how that would work for problems with an order of
> magnitude (or two) fewer parameters?
>
> How can I try it out?
>
> thanks,
>
> Anders.
>
> On 08.05.2012 10:59, dave fournier wrote:
>> To get a proof of concept for any programming technique it is nice to
>> get a big result fairly easily.  Almost all ADMB users rely on the
>> function minimizer fmin in the file newfmin.cpp.  So improving the
>> performance of this function in a more or less transparent way
>> would immediately help a lot of users.
>>
>>
>> I hacked the newfmin.cpp file to add the BFGS quasi Newton update
>> with the (sort of) hess inverse kept on the GPU and main calcs done
>> on the GPU.
>>
>> I tested this with a modified Rosenbrock function with 6144 parameters.
>> The new setup is both much faster and more stable than the old one
>> in newfmin.  It appears that newfmin uses a different quasi-Newton
>> update which is not as efficient for a large number of parameters.
>>
>> This is the tpl file for the example.
>>
>> DATA_SECTION
>>   int n
>>  !! n=4096+2048;
>> PARAMETER_SECTION
>>   init_vector x(1,n);
>>   objective_function_value f
>> PROCEDURE_SECTION
>>   for (int i=1;i<=n/2;i++)
>>   {
>>      f+=100.*square(square(x(2*i-1))-x(2*i))+square(x(2*i-1)-1.0);
>>   }
>>
>> The new GPU version took 36 seconds and 477 function evals to converge
>>  - final statistics:
>> 6144 variables; iteration 277; function evaluation 477
>> Function value   3.2531e-21; maximum gradient component mag   9.7979e-11
>> Exit code = 1;  converg criter   1.0000e-10
>>
>> real    0m35.414s
>> user   0m4.417s <--- most time waiting for the GPU calcs
>> sys     0m0.616s
>>
>> Old version took 288 seconds to do 477 function evaluations
>> but is not nearly as good at this point.
>>
>> 6144 variables; iteration 300; function evaluation 485; phase 1
>> Function value   6.6252316e+00; maximum gradient component mag  -8.4966e+00
>>
>> Old version converged in about 19 min 36 seconds
>> so the new version with BFGS update on the GPU
>> is about 32 times faster than the old version
>> and probably more stable.
>>
>> Here is the old version final output
>>  - final statistics:
>> 6144 variables; iteration 1212; function evaluation 2119
>> Function value   1.7758e-21; maximum gradient component mag   9.7086e-11
>> Exit code = 1;  converg criter   1.0000e-10
>>
>> real    19m36.357s
>> user    19m35.848s
>> sys    0m0.093s
>>
>> Yawn.
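
The quoted timings can be broken down with a few lines of Python.  This is
only arithmetic on the numbers reported above; the helper name is
hypothetical, and the per-evaluation figures lump together the objective,
the gradient and the quasi-Newton update, since the output does not separate
them.

```python
def parse_time(text):
    """Convert a `time`-style string such as '19m36.357s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return 60.0 * float(minutes) + float(seconds)


gpu_wall = parse_time("0m35.414s")    # "real" time, new GPU version
old_wall = parse_time("19m36.357s")   # "real" time, old version
gpu_evals, old_evals = 477, 2119      # function evaluations to converge

speedup = old_wall / gpu_wall
eval_ratio = old_evals / gpu_evals
per_eval_ratio = (old_wall / old_evals) / (gpu_wall / gpu_evals)

print(f"overall speedup:        {speedup:.1f}")
print(f"fewer evaluations:      {eval_ratio:.1f}x")
print(f"faster per evaluation:  {per_eval_ratio:.1f}x")
```

The overall ratio comes out a little over 33, and it is the product of the
other two: the new version needs fewer function evaluations to converge and
also spends less wall-clock time per evaluation.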
>>
>> _______________________________________________
>> Developers mailing list
>> Developers at admb-project.org
>> http://lists.admb-project.org/mailman/listinfo/developers

-------------- next part --------------
A non-text attachment was scrubbed...
Name: opencl_example.zip
Type: application/zip
Size: 11129 bytes
Desc: not available
URL: <http://lists.admb-project.org/pipermail/developers/attachments/20120508/055e610b/attachment.zip>