[Developers] Big improvement in the function minimizer with GPU.
dave fournier
davef at otter-rsch.com
Tue May 8 11:43:09 PDT 2012
On 12-05-08 10:59 AM, anders at nielsensweb.org wrote:
6144 variables; iteration 11; function evaluation 22; phase 1
Function value 3.5523132e+02; maximum gradient component mag 1.0031e+01
HF norm2(K1-K)= 9.26204e-29
I have attached the example. As is usual with this stuff, I have forgotten
a lot of the painful details which I learned the hard way. If you unzip
all of this into a directory and have the right GPU stuff installed, as well
as ADMB, you should be ready to go.
We will need to put together a help thing once we see how portable this is.
Of course this example is intended to make the GPU look good. The idea is
a lot of parameters used in a function that itself does not involve a lot of
intensive calculations. Finding out that the BFGS method is more stable than
the old newfmin one was an unexpected bonus. The BFGS update takes
more time than the one in newfmin, so this adds to the advantage of the GPU.
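
To make the cost argument concrete, here is a minimal CPU sketch of the textbook BFGS update of a dense inverse-Hessian approximation. It is not the code in the attached zip and not the GPU kernel; the function name, the row-major storage, and the skip-on-bad-curvature rule are all assumptions made for illustration. The point is only that the update touches every one of the n*n entries (about 37.7 million for n = 6144), while the example objective is a loop over n/2 = 3072 cheap terms.

```cpp
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Hypothetical helper: textbook BFGS update of a dense inverse-Hessian
// approximation H (n x n, row-major, assumed symmetric).
//   s = x_new - x_old,  y = g_new - g_old
//   H <- H - rho*(Hy s^T + s (Hy)^T) + rho*(1 + rho*y^T H y) * s s^T,
//   with rho = 1/(y^T s).
// Returns false and leaves H untouched when y^T s <= 0 (curvature
// condition violated), which is one common way to keep H positive definite.
bool bfgs_inverse_update(Vec& H, const Vec& s, const Vec& y)
{
  const std::size_t n = s.size();

  double ys = 0.0;
  for (std::size_t i = 0; i < n; ++i) ys += y[i] * s[i];
  if (!(ys > 0.0)) return false;
  const double rho = 1.0 / ys;

  // Hy = H * y : first O(n^2) pass over the matrix.
  Vec Hy(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      Hy[i] += H[i * n + j] * y[j];

  double yHy = 0.0;
  for (std::size_t i = 0; i < n; ++i) yHy += y[i] * Hy[i];
  const double c = rho * (1.0 + rho * yHy);

  // Rank-two correction : second O(n^2) pass over the matrix.
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      H[i * n + j] += c * s[i] * s[j] - rho * (Hy[i] * s[j] + s[i] * Hy[j]);

  return true;
}
```

Both O(n^2) passes are simple, regular loops over the matrix, which is the kind of work that maps well onto a GPU when the matrix stays resident on the card.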
For any questions, please identify the GPU and OS you are using to help us
get an idea of how portable this is.
The AMD SDK supplies a routine named clinfo which for me produces
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.2 AMD-APP (923.1)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Board name: AMD Radeon HD 6900 Series
Device Topology: PCI[ B#2, D#0, F#0 ]
Max compute units: 22
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 2
Native vector width char: 16
Native vector width short: 8
Native vector width int: 4
Native vector width long: 2
Native vector width float: 4
Native vector width double: 2
Max clock frequency: 0Mhz
Address bits: 32
Max memory allocation: 509870080
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel:
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 2039480320
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7feb9b2660a0
Name: Cayman
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: CAL 1.4.1720
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (923.1)
Extensions: cl_khr_fp64
cl_amd_fp64 cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes
cl_khr_byte_addressable_store cl_khr_gl_sharing
cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3
cl_amd_printf cl_amd_media_ops cl_amd_popcnt
An important one is
Max memory allocation: 509870080
This was learned with a lot of pain. The card has 2GB of memory, but a
contiguous memory buffer on it can only be this big. Whether you can have two
contiguous memory buffers this big I don't know yet, but will soon.
To get this much I had to set an environment variable:
GPU_MAX_HEAP_SIZE=95
It took me two weeks to figure this out.
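
For a rough feel of what that limit means, here is a small arithmetic sketch. It assumes (this is not stated above) that the quasi-Newton matrix is held as one dense n x n buffer of 8-byte doubles; the helper names are made up for illustration.

```cpp
#include <cstdint>

// Bytes needed for a dense n x n matrix, assuming 8-byte doubles.
constexpr std::uint64_t dense_matrix_bytes(std::uint64_t n)
{
  return n * n * 8u;
}

// Largest n whose dense n x n double matrix fits in a single buffer
// of max_alloc bytes (max_alloc as reported by clinfo under
// "Max memory allocation").
std::uint64_t max_dense_dimension(std::uint64_t max_alloc)
{
  std::uint64_t n = 0;
  while (dense_matrix_bytes(n + 1) <= max_alloc) ++n;
  return n;
}
```

Under that assumption, dense_matrix_bytes(6144) is 301989888, which fits inside the 509870080 byte limit, and max_dense_dimension(509870080) comes out at 7983, so a single buffer would top out at a bit under 8000 parameters on this card.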
As always
enjoy ...
> Wow, this sounds awesome!
>
> Any idea how that would work for problems with an order of
> magnitude (or two) fewer parameters.
>
> How can I try it out?
>
> thanks,
>
> Anders.
>
> On 08.05.2012 10:59, dave fournier wrote:
>> To get a proof of concept for any programming technique it is nice to
>> get a big result fairly easily. Almost all ADMB users rely on the
>> function minimizer fmin in the file newfmin.cpp. So to improve the
>> performance of this function in a more or less transparent way
>> would immediately help a lot of users.
>>
>>
>> I hacked the newfmin.cpp file to add the BFGS quasi Newton update
>> with the (sort of) hess inverse kept on the GPU and main calcs done
>> on the GPU.
>>
>> I tested this with a modified Rosenbrock function with 6144 parameters.
>> The new setup is both much faster and more stable than the old one
>> on newfmin. It appears that newfmin uses a different quasi-Newton
>> update which
>> is not as efficient for a large number of parameters.
>>
>> This is the tpl file for the example.
>>
>> DATA_SECTION
>> int n
>> !! n=4096+2048;
>> PARAMETER_SECTION
>> init_vector x(1,n);
>> objective_function_value f
>> PROCEDURE_SECTION
>> for (int i=1;i<=n/2;i++)
>> {
>> f+=100.*square(square(x(2*i-1))-x(2*i))+square(x(2*i-1)-1.0);
>> }
>>
>> The new GPU version took 36 seconds and 477 function evals to converge
>> - final statistics:
>> 6144 variables; iteration 277; function evaluation 477
>> Function value 3.2531e-21; maximum gradient component mag 9.7979e-11
>> Exit code = 1; converg criter 1.0000e-10
>>
>> real 0m35.414s
>> user 0m4.417s <--- most time waiting for the GPU calcs
>> sys 0m0.616s
>>
>> Old version took 288 seconds to do 477 function evaluations
>> but is not nearly as good at this point.
>>
>> 6144 variables; iteration 300; function evaluation 485; phase 1
>> Function value 6.6252316e+00; maximum gradient component mag
>> -8.4966e+00
>>
>> Old version converged in about 19 min 36 seconds
>> so the new version with BFGS update on the GPU
>> is about 32 times faster than the old version
>> and probably more stable.
>>
>> Here is the old version final output
>> - final statistics:
>> 6144 variables; iteration 1212; function evaluation 2119
>> Function value 1.7758e-21; maximum gradient component mag 9.7086e-11
>> Exit code = 1; converg criter 1.0000e-10
>>
>> real 19m36.357s
>> user 19m35.848s
>> sys 0m0.093s
>>
>> Yawn.
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Developers mailing list
>> Developers at admb-project.org
>> http://lists.admb-project.org/mailman/listinfo/developers
-------------- next part --------------
A non-text attachment was scrubbed...
Name: opencl_example.zip
Type: application/zip
Size: 11129 bytes
Desc: not available
URL: <http://lists.admb-project.org/pipermail/developers/attachments/20120508/055e610b/attachment.zip>