[Developers] opencl newfmin example

Wed May 16 11:47:49 PDT 2012

Unfortunately I don't have time today to play with this. For the record, my
graphics card is an NVIDIA Quadro FX 3800.

After a number of amendments to the newfmin.cpp file based on Dave's
suggestions, it occurred to me that we have a "branches" directory in the
SVN repository to keep track of such changes.

There was an old gpu folder in there, which I don't know anything about. So
rather than replace the file in src/linad99, I just put the file at the top
of that folder: /branches/gpu/newfmin.cpp.
Here's a link to the modified file in case anyone else wants to try it:
http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp

-Ian

On Wed, May 16, 2012 at 6:27 AM, dave fournier <davef at otter-rsch.com> wrote:

>  On 12-05-15 02:59 PM, Ian Taylor wrote:
>
> I'm 99% sure this is not running on the GPU.  You need to get an
> error-free run, and this has one error when it tries to compile the source
> for the GPU. The error message is not correct, as it got duplicated in the
> code, but there is an error.  One could find out what the returned error
> code is and look it up in the cl.h header file.
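
For anyone who wants to follow Dave's suggestion, here is a minimal sketch of such a lookup. The function name cl_error_name is made up for illustration and is not part of newfmin.cpp or OpenCL; the numeric values are the ones I believe cl.h defines, and only a subset is listed. For instance, the "ret = -34" that John reports further down would map to CL_INVALID_CONTEXT.

```cpp
#include <string>

// Names for a subset of the error codes defined in the Khronos cl.h header.
// cl_error_name is a hypothetical helper; codes not listed here fall
// through to the default branch.
std::string cl_error_name(int code)
{
  switch (code)
  {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    case -30: return "CL_INVALID_VALUE";
    case -32: return "CL_INVALID_PLATFORM";
    case -33: return "CL_INVALID_DEVICE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -44: return "CL_INVALID_PROGRAM";
    case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
    case -46: return "CL_INVALID_KERNEL_NAME";
    case -48: return "CL_INVALID_KERNEL";
    default:  return "unknown (look it up in cl.h)";
  }
}
```

Printing cl_error_name(ret) next to each "error code = ..." message would make the log easier to read than the bare integer.
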
>
>
>
> Hi all,
> Thanks to help from Dave, I finally got his example working (perhaps) on a
> Windows computer with an Nvidia GPU, using Microsoft Visual C++. I got an
> error about "Error trying to load Kernel source GPU" (pasted at bottom of
> email along with other warnings that I don't understand), but
> using something called "GPU-Z", I was able to see that the GPU Load went
> from 1% to 99%. Nevertheless, using the GPU only cut the run time in half,
> and the majority of that was achieved with the BFGS algorithm without the
> GPU (USE_GPU_FLAG=0). So I'm thinking the GPU is not being utilized
> correctly, or my GPU is not as well suited to this problem as Dave's, or
> the VC compiler is not as well suited as GCC.
>
>  Speed comparison:
> new newfmin with GPU: 2 minutes, 19 seconds for 442 function evaluations.
> new newfmin w/o  GPU: 2 minutes, 37 seconds for 682 function evaluations.
> old newfmin time (no GPU): 5 minutes, 21 seconds for 2119 function
> evaluations.
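
To put those three timings on a common footing, here is a small sketch of the arithmetic. The helper names are made up for illustration. If I have done the sums right, the runs took 139, 157 and 321 seconds, so the new code is roughly 2.3 times faster with the GPU and roughly 2.0 times faster without it, while the average time per function evaluation actually went up (about 0.31, 0.23 and 0.15 seconds respectively). In other words the gain seems to come from needing fewer evaluations, not from cheaper ones.

```cpp
#include <cassert>

// Wall-clock seconds from a "minutes, seconds" timing.
double total_seconds(int minutes, int seconds)
{
  assert(minutes >= 0 && seconds >= 0);
  return 60.0 * minutes + seconds;
}

// Average seconds per function evaluation for one run.
double seconds_per_eval(double secs, int evals)
{
  assert(evals > 0);
  return secs / evals;
}

// How many times faster the new run is than the old one.
double speedup(double old_secs, double new_secs)
{
  assert(new_secs > 0.0);
  return old_secs / new_secs;
}
```

For example, speedup(total_seconds(5, 21), total_seconds(2, 19)) compares the old newfmin with the new GPU run.
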
>
>  I had struggles at various points along the way, including installing
> the correct OpenCL stuff for my GPU, building ADMB with or without the new
> newfmin file, and linking the bigmin model to the OpenCL libraries.
> Everything I know about C++, I learned from working with ADMB, so this was
> a valuable addition to my education.
> -Ian
>
>  ### Here are the warnings and errors ###
>
>  >bigmin -mno 10000 -crit 1.e-10 -nox -nohess
> Error trying to open data input file bigmin.dat
> command queue created successfully
> Number of devices found 1
> Error trying to load Kernel source  GPU
> All buffers created successfully
> Program creation code = 0
> Program build code = 0
> Create Kernel2 error code = 0
> Create Kernel error code = 0
> Create Kernel3 error code = 0
> Create Kernel4 error code = 0
> Create Kernel1 error code = 0
>
> Initial statistics: 6144 variables; iteration 0; function evaluation 0;
> phase 1
> ...
>
>
>
>
> On Tue, May 15, 2012 at 10:51 AM, John Sibert <sibert at hawaii.edu> wrote:
>
>> I tried to get it working, but did not succeed. In the process, I might
>> have learned a few things, so I have included a lot of stuff in this email.
>>
>> It would be really helpful if others on this list would also give it a
>> try and share the results with the rest of us.
>>
>> The main problem I encountered was ignorance of what (if anything) needed
>> to be installed on my computer. Neither the OpenCL nor the AMD websites
>> offer much guidance.
>>
>> In the end I concluded that my hardware (a Dell D series laptop with
>> Nvidia graphics processor purchased in 2009 and running Ubuntu 10.04) is
>> unsuitable, probably not supporting double precision arithmetic.
>>
>> Without installing any new software the machine comes with the executable
>> "clinfo" that provides a lot of information about the hardware. Sections to
>> note are "Platform Extensions:  cl_khr_byte_addressable_store cl_khr_icd
>> cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query
>> cl_nv_pragma_unroll"
>> and "Extensions: cl_khr_fp64 cl_amd_fp64 ..." (without the word
>> "Platform"). If the graphics card supports double precision calculations it
>> should report "cl_khr_fp64 cl_amd_fp64", but note the ambiguity of two
>> different "Extensions".
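
The check John describes can be done mechanically on whichever extensions string you have in hand, whether copied from clinfo or (I believe) returned by clGetDeviceInfo with CL_DEVICE_EXTENSIONS. Below is a minimal sketch; has_extension is a name I made up, and it only does a whole-word search of a space-separated string.

```cpp
#include <sstream>
#include <string>

// True if `name` appears as a whole space-separated token in `extensions`.
// has_extension is a hypothetical helper, not an OpenCL function.
bool has_extension(const std::string& extensions, const std::string& name)
{
  std::istringstream tokens(extensions);
  std::string token;
  while (tokens >> token)
  {
    if (token == name) return true;
  }
  return false;
}
```

Applied to the "Platform Extensions" list quoted above, a search for cl_khr_fp64 comes back false, which is consistent with that list describing the platform and not the device.
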
>>
>>
>> Emboldened, I managed to build the bigmin example without much drama and
>> $ ./bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>> produced the following
>>
>>> Error creating command queue ret = -34
>>> Number of devices found 0
>>> No GPU found
>>>
>>
>> So I disabled the Nvidia graphics driver and downloaded
>>  AMD-APP-SDK-v2.6-lnx64.tgz from
>> http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx
>> and installed it. After messing around with linker paths, bigmin
>> compiled and linked, but produced the same run-time error.
>>
>> At this point I concluded that my graphics card does not support double
>> precision floating point calculations.
>>
>> A bit of work with google turned up some more information.
>>
>> http://developer.nvidia.com/cuda-gpus
>> lists Nvidia graphics processors and their "compute capability". The
>> entry for mine is Quadro NVS 135M  compute capability 1.1
>>
>> http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html
>> offers some interpretation of compute capability:
>>
>>> To enable the use of doubles inside CUDA kernels you first need to
>>> make sure you have a CUDA Compute 1.3-capable card. These are the newer
>>> versions of the nVidia CUDA cards such as the GTX 260, GTX 280, Quadro
>>> FX 5800, and Tesla S1070 and C1060.  Thereby you have to add a command
>>> line options to the nvcc compiler: --gpu-architecture sm_13.
>>>
>> The ever-helpful Wikipedia entry for CUDA
>> http://en.wikipedia.org/wiki/CUDA agrees
>>
>>> CUDA (with compute capability 1.x) uses a recursion-free,
>>> function-pointer-free subset of the C language, plus some simple
>>> extensions. However, a single process must run spread across multiple
>>> disjoint memory spaces, unlike other C language runtime environments.
>>>
>>> CUDA (with compute capability 2.x) allows a subset of C++ class
>>> functionality, for example member functions may not be virtual (this
>>> restriction will be removed in some future release). [See CUDA C
>>> Programming Guide 3.1 - Appendix D.6]
>>>
>>> Double precision (CUDA compute capability 1.3 and above) deviate
>>> from the IEEE 754 standard: round-to-nearest-even is the only supported
>>> rounding mode for reciprocal, division, and square root. In single
>>> precision, denormals and signalling NaNs are not supported; only two
>>> IEEE rounding modes are supported (chop and round-to-nearest even), and
>>> those are specified on a per-instruction basis rather than in a control
>>> word; and the precision of division/square root is slightly lower than
>>> single precision.
>>>
>>>
>> So you need a graphics processor with compute capability 1.3 or above.
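
The rule John arrives at is easy to encode. This is just a sketch of the comparison; supports_double is a made-up name, and the two integers are assumed to be the major and minor parts of the compute capability as listed on the Nvidia page (in CUDA I believe they are the major and minor fields filled in by cudaGetDeviceProperties).

```cpp
// True if compute capability major.minor is at least 1.3, the threshold
// quoted above for double precision support. Hypothetical helper.
bool supports_double(int major, int minor)
{
  return major > 1 || (major == 1 && minor >= 3);
}
```

With John's Quadro NVS 135M (compute capability 1.1) this returns false.
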
>>
>> I would urge everyone to try to get this example running and share your
>> experiences. OpenCL looks like a promising way to parallelize some
>> applications. The overview document
>>
>> http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf
>> implies that it might be possible to tune an application to use either
>> GPU or multiple cores on a cluster. Unfortunately the learning curve is
>> steep (ask Dave) and the documentation is thin.
>>
>> Happy hacking,
>> John
>>
>>
>>
>>
>> John Sibert
>> Emeritus Researcher, SOEST
>> University of Hawaii at Manoa
>>
>> Visit the ADMB project http://admb-project.org/
>>
>>
>>
>> On 05/12/2012 05:31 AM, dave fournier wrote:
>>
>>> Has anyone else actually got this example to work?
>>>
>>> Some advice. Older GPUs (whatever that means) probably
>>> do not support double precision.
>>>
>>> WRT using the BFGS update on the CPU. It does not seem
>>> to perform as well as doing it on the GPU. I think this is
>>> due to roundoff error.  The CPU is carrying out additions in a different
>>> way. It may be that with say 4K or more parameters and this
>>> (artificial) example roundoff error becomes important.
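
A small illustration of why the way the additions are carried out can matter: floating point addition is not associative, so summing the same numbers in a different order can give a different answer. The sketch below is not taken from newfmin.cpp and I am not claiming it is what happens there; the function names are made up. Summing 1, 1e20 and -1e20 front to back gives 0, because the 1 is lost when it is added to 1e20, while summing back to front gives 1.

```cpp
#include <cstddef>
#include <vector>

// Sum the elements front to back.
double sum_forward(const std::vector<double>& v)
{
  double s = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
  return s;
}

// Sum the same elements back to front.
double sum_backward(const std::vector<double>& v)
{
  double s = 0.0;
  for (std::size_t i = v.size(); i-- > 0; ) s += v[i];
  return s;
}
```

A parallel reduction on a GPU typically adds in yet another (tree-like) order, so small differences from a sequential CPU loop are to be expected with several thousand terms.
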
>>>
>>> I stored the matrix by rows. It now appears that it should be stored
>>> by columns for the fastest matrix * vector multiplication.
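
To make the two storage schemes concrete, here is a sketch of the same matrix * vector product with the matrix stored by rows and by columns. It is plain C++, not the OpenCL kernel, and the function names are made up. My guess (an assumption, not something Dave states) is that the GPU kernel uses one work-item per output element, in which case column storage lets neighbouring work-items read neighbouring memory locations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x with the n-by-n matrix A stored by rows: A(i,j) is a[i*n + j].
std::vector<double> matvec_by_rows(const std::vector<double>& a,
                                   const std::vector<double>& x)
{
  const std::size_t n = x.size();
  assert(a.size() == n * n);
  std::vector<double> y(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      y[i] += a[i * n + j] * x[j];
  return y;
}

// Same product with A stored by columns: A(i,j) is a[j*n + i].
// For a fixed j, consecutive i touch consecutive memory locations.
std::vector<double> matvec_by_columns(const std::vector<double>& a,
                                      const std::vector<double>& x)
{
  const std::size_t n = x.size();
  assert(a.size() == n * n);
  std::vector<double> y(n, 0.0);
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i < n; ++i)
      y[i] += a[j * n + i] * x[j];
  return y;
}
```

Both functions compute the same vector; only the memory layout of the matrix and the loop order differ.
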
>>>
>>>
>>>
>>> _______________________________________________
>>> Developers mailing list
>>> Developers at admb-project.org
>>> http://lists.admb-project.org/mailman/listinfo/developers
>>>