[Developers] opencl newfmin example

Wed May 16 12:26:42 PDT 2012

http://developer.nvidia.com/cuda-gpus claims Compute Capability 1.3 for 
the Quadro FX 3800 card, but only 1.1 for the Quadro FX 3800M mobile 
version.

When you get a chance, it would be nice to know what clinfo tells you 
about the Platform Extensions.

John Sibert
Emeritus Researcher, SOEST
University of Hawaii at Manoa

Visit the ADMB project http://admb-project.org/

On 05/16/2012 08:47 AM, Ian Taylor wrote:
> Unfortunately I don't have time today to play with this. For the 
> record, my graphics card is NVIDIA Quadro FX 3800.
>
> After a number of amendments to the newfmin.cpp file based on Dave's 
> suggestions, it occurred to me that we have a "branches" directory in 
> the SVN repository to keep track of such changes.
>
> There was an old gpu folder in there, which I don't know anything 
> about. So rather than replace the file in the src/linad99, I just put 
> the file in the main directory: /branches/gpu/newfmin.cpp.
> Here's a link to the modified file in case anyone else wants to try 
> it: 
> http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp
>
> -Ian
>
> On Wed, May 16, 2012 at 6:27 AM, dave fournier <davef at otter-rsch.com 
> <mailto:davef at otter-rsch.com>> wrote:
>
>     On 12-05-15 02:59 PM, Ian Taylor wrote:
>
>     I'm 99% sure this is not running on the GPU.  You need to get an
>     error free run
>     and this has one error when it tries to compile the source for the
>     GPU.
>     The error message is not correct as it got duplicated in the code.
>     But there is an
>     error.  One could find out what the returned error code is and
>     look it up
>     in the cl.h header file.
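As a rough sketch of that lookup, the function below maps a handful of the status codes defined in the Khronos cl.h header to their names. The name cl_error_name is made up for illustration and the table is deliberately partial, so a code it does not list still has to be looked up in the header. Among the values in this thread, 0 is CL_SUCCESS and the -34 reported further down is CL_INVALID_CONTEXT.

```cpp
#include <cassert>
#include <string>

// Partial table of OpenCL status codes as defined in the Khronos cl.h header.
// cl_error_name is a hypothetical helper, not part of OpenCL or ADMB.
std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        case -44: return "CL_INVALID_PROGRAM";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "unlisted code, check cl.h";
    }
}
```

Printing cl_error_name(ret) next to each numeric return value would make the run log easier to read than the bare codes.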
>
>
>
>>     Hi all,
>>     Thanks to help from Dave, I finally got his example working
>>     (perhaps) on a Windows computer using Microsoft Visual C++ on a
>>     computer with a Nvidia GPU. I got an error about "Error trying to
>>     load Kernel source GPU" (pasted at bottom of email along with
>>     other warnings that I don't understand), but using something
>>     called "GPU-Z", I was able to see that the GPU Load went from 1%
>>     to 99%. Nevertheless, using the GPU only cut the run time in
>>     half, and the majority of that was achieved with the BFGS
>>     algorithm without the GPU (USE_GPU_FLAG=0). So I'm thinking the
>>     GPU is not being utilized correctly, or my GPU is not as well
>>     suited to this problem as Dave's, or the VC compiler is not as
>>     well suited as GCC.
>>
>>     Speed comparison:
>>     new newfmin with GPU: 2 minutes, 19 seconds for 442 function
>>     evaluations.
>>     new newfmin w/o  GPU: 2 minutes, 37 seconds for 682 function
>>     evaluations.
>>     old newfmin time (no GPU): 5 minutes, 21 seconds for 2119
>>     function evaluations.
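To put those three timings on a common footing, the sketch below converts each to wall-clock seconds per function evaluation and computes the ratio between two run times. The helper names are made up; the inputs are the figures quoted above. The division lumps the minimizer's own update cost in with the evaluations, so it is only a rough indicator.

```cpp
#include <cassert>

// Total wall-clock time divided by the number of function evaluations.
// This includes whatever time the minimizer spends on its own updates.
double seconds_per_evaluation(int minutes, int seconds, int evaluations) {
    assert(evaluations > 0);
    return (60.0 * minutes + seconds) / evaluations;
}

// Ratio of two run times, each given as minutes and seconds.
double speedup(int old_min, int old_sec, int new_min, int new_sec) {
    return (60.0 * old_min + old_sec) / (60.0 * new_min + new_sec);
}
```

With the numbers above this gives about 0.31 s per evaluation with the GPU, 0.23 s without it, and 0.15 s for the old newfmin. The shorter total run times therefore come from needing fewer evaluations, not from cheaper ones. speedup(5, 21, 2, 37) is about 2.04 and speedup(2, 37, 2, 19) is about 1.13, which matches the observation that most of the gain came from the new algorithm and little from the GPU.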
>>
>>     I had struggles at various points along the way, including
>>     installing the correct OpenCL stuff for my GPU, building ADMB
>>     with or without the new newfmin file, and linking the bigmin
>>     model to the OpenCL libraries. Everything I know about C++, I
>>     learned from working with ADMB, so this was a valuable addition
>>     to my education.
>>     -Ian
>>
>>     ### Here are the warnings and errors ###
>>
>>     >bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>>     Error trying to open data input file bigmin.dat
>>     command queue created successfully
>>     Number of devices found 1
>>     Error trying to load Kernel source  GPU
>>     All buffers created successfully
>>     Program creation code = 0
>>     Program build code = 0
>>     Create Kernel2 error code = 0
>>     Create Kernel error code = 0
>>     Create Kernel3 error code = 0
>>     Create Kernel4 error code = 0
>>     Create Kernel1 error code = 0
>>
>>     Initial statistics: 6144 variables; iteration 0; function
>>     evaluation 0; phase 1
>>     ...
>>
>>
>>
>>
>>     On Tue, May 15, 2012 at 10:51 AM, John Sibert <sibert at hawaii.edu
>>     <mailto:sibert at hawaii.edu>> wrote:
>>
>>         I tried to get it working, but did not succeed. In the
>>         process, I might have learned a few things, so I have
>>         included a lot of stuff in this email.
>>
>>         It would be really helpful if others on this list would also
>>         give it a try and share the results with the rest of us.
>>
>>         The main problem I encountered was ignorance of what (if
>>         anything) needed to be installed on my computer. Neither the
>>         OpenCL nor the AMD websites offer much guidance.
>>
>>         In the end I concluded that my hardware (a Dell D series
>>         laptop with Nvidia graphics processor purchased in 2009 and
>>         running Ubuntu 10.04) is unsuitable, probably not supporting
>>         double precision arithmetic.
>>
>>         Without installing any new software the machine comes with
>>         the executable "clinfo" that provides a lot of information
>>         about the hardware. Sections to note are "Platform
>>         Extensions:  cl_khr_byte_addressable_store cl_khr_icd
>>         cl_khr_gl_sharing cl_nv_compiler_options
>>         cl_nv_device_attribute_query cl_nv_pragma_unroll"
>>         and "Extensions: cl_khr_fp64 cl_amd_fp64 ..." (without the
>>         word "Platform"). If the graphics card supports double
>>         precision calculations it should report "cl_khr_fp64
>>         cl_amd_fp64", but note the ambiguity of two different
>>         "Extensions".
>>
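A small sketch of the check described above, assuming the extensions list has already been copied out of the clinfo output as one space-separated string. has_extension and supports_double are hypothetical helper names. The comparison is on whole tokens, so cl_khr_fp64 is not matched by accident inside a longer extension name. As far as I know cl_khr_fp64 is reported per device, so the device-level "Extensions" line (not "Platform Extensions") is probably the one to feed in.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// True if `name` appears as a whole token in a space-separated extension list.
bool has_extension(const std::string& extension_list, const std::string& name) {
    assert(!name.empty());
    std::istringstream tokens(extension_list);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}

// Double precision is advertised by the Khronos extension cl_khr_fp64 or,
// on some AMD drivers, by cl_amd_fp64.
bool supports_double(const std::string& extension_list) {
    return has_extension(extension_list, "cl_khr_fp64") ||
           has_extension(extension_list, "cl_amd_fp64");
}
```

Applied to the "Platform Extensions" string quoted above, supports_double returns false; applied to a list containing cl_khr_fp64 it returns true.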
>>
>>         Emboldened, I managed to build the bigmin example without
>>         much drama and
>>         $ ./bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>>         produced the following
>>
>>             Error creating command queue ret = -34
>>             Number of devices found 0
>>             No GPU found
>>
>>
>>         So I disabled the Nvidia graphics driver and downloaded
>>         AMD-APP-SDK-v2.6-lnx64.tgz from
>>         http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx
>>         and installed it. After messing around with linker paths,
>>         bigmin compiled and linked, but produced the same run-time
>>         error.
>>
>>         At this point I concluded that my graphics card does not
>>         support double precision floating point calculations.
>>
>>         A bit of work with google turned up some more information.
>>
>>         http://developer.nvidia.com/cuda-gpus
>>         lists Nvidia graphics processors and their "compute
>>         capability". The entry for mine is Quadro NVS 135M  compute
>>         capability 1.1
>>
>>         http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html
>>         offers some interpretation of compute capability:
>>
>>             To enable the use of doubles inside CUDA kernels you
>>             first need to
>>             make sure you have a CUDA Compute 1.3-capable card. These
>>             are the newer
>>             versions of the nVidia CUDA cards such as the GTX 260,
>>             GTX 280, Quadro
>>             FX 5800, and Tesla S1070 and C1060.  Thereby you have to
>>             add a command
>>             line options to the nvcc compiler: --gpu-architecture sm_13.
>>
>>         The ever-helpful wikipedia entry for CUDA
>>         http://en.wikipedia.org/wiki/CUDA agrees
>>
>>             CUDA (with compute capability 1.x) uses a recursion-free,
>>             function-pointer-free subset of the C language, plus some
>>             simple
>>             extensions. However, a single process must run spread
>>             across multiple
>>             disjoint memory spaces, unlike other C language runtime
>>             environments.
>>
>>             CUDA (with compute capability 2.x) allows a subset of C++
>>             class
>>             functionality, for example member functions may not be
>>             virtual (this
>>             restriction will be removed in some future release). [See
>>             CUDA C
>>             Programming Guide 3.1 - Appendix D.6]
>>
>>             Double precision (CUDA compute capability 1.3 and above)
>>             deviate
>>             from the IEEE 754 standard: round-to-nearest-even is the
>>             only supported
>>             rounding mode for reciprocal, division, and square root.
>>             In single
>>             precision, denormals and signalling NaNs are not
>>             supported; only two
>>             IEEE rounding modes are supported (chop and
>>             round-to-nearest even), and
>>             those are specified on a per-instruction basis rather
>>             than in a control
>>             word; and the precision of division/square root is
>>             slightly lower than
>>             single precision.
>>
>>
>>         So for double precision you need a graphics processor with
>>         compute capability 1.3 or above.
>>
>>         I would urge everyone to try to get this example running and
>>         share your experiences. OpenCL looks like a promising way
>>         to parallelize some applications. The overview document
>>         http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf
>>         implies that it might be possible to tune an application to
>>         use either GPU or multiple cores on a cluster. Unfortunately
>>         the learning curve is steep (ask Dave) and the documentation
>>         is thin.
>>
>>         Happy hacking,
>>         John
>>
>>
>>
>>
>>         John Sibert
>>         Emeritus Researcher, SOEST
>>         University of Hawaii at Manoa
>>
>>         Visit the ADMB project http://admb-project.org/
>>
>>
>>
>>         On 05/12/2012 05:31 AM, dave fournier wrote:
>>
>>             Has anyone else actually got this example to work?
>>
>>             Some advice. Older GPUs (whatever that means) probably
>>             do not support double precision.
>>
>>             WRT using the BFGS update on the CPU. It does not seem
>>             to perform as well as doing it on the GPU. I think this is
>>             due to roundoff error.  The CPU is carrying out additions
>>             in a different
>>             way. It may be that with say 4K or more parameters and this
>>             (artificial) example roundoff error becomes important.
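The effect of summation order can be seen in a small, artificial sketch. This is not the code from newfmin.cpp: the function names are made up, and the pairwise scheme merely stands in for the tree-style reduction a GPU kernel might use. Floating point addition is not associative, so the two orderings can give different results on the same data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right accumulation, as a simple CPU loop would do it.
double sum_sequential(const std::vector<double>& x) {
    double acc = 0.0;
    for (double v : x) acc += v;
    return acc;
}

// Pairwise (tree) accumulation over the half-open range [lo, hi).
double sum_pairwise(const std::vector<double>& x, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi <= x.size());
    if (hi - lo == 0) return 0.0;
    if (hi - lo == 1) return x[lo];
    const std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(x, lo, mid) + sum_pairwise(x, mid, hi);
}

// Convenience wrapper summing the whole vector pairwise.
double sum_pairwise(const std::vector<double>& x) {
    return sum_pairwise(x, 0, x.size());
}
```

For the four values {1.0, 1e17, -1e17, 1.0} the exact sum is 2, yet in IEEE double arithmetic the left-to-right loop returns 1 and the pairwise version returns 0. Neither ordering is "the right one"; the point is only that two correct implementations of the same update need not agree to the last bit, and small differences can accumulate over many parameters.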
>>
>>             I stored the matrix by rows. It now appears that it should
>>             be stored by columns for the fastest matrix * vector
>>             multiplication.
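For reference, the two layouts differ only in how element (i, j) is indexed. The following is a minimal CPU sketch with made-up function names, not the kernel from newfmin.cpp. One plausible reason column storage helps on a GPU (an assumption, not something stated in this thread) is that when each work-item computes one row of the product, neighbouring work-items then read neighbouring memory addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A * x with A stored by rows: element (i, j) is a[i * n + j].
std::vector<double> matvec_by_rows(const std::vector<double>& a,
                                   const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(a.size() == n * n);
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += a[i * n + j] * x[j];
    return y;
}

// Same product with A stored by columns: element (i, j) is a[j * n + i].
std::vector<double> matvec_by_columns(const std::vector<double>& a,
                                      const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(a.size() == n * n);
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a[j * n + i] * x[j];
    return y;
}
```

Both functions compute the same product when given the same matrix in their respective layouts, so switching the storage order only requires changing the index expression and the way the matrix is filled.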
>>
>>
>>
>>             _______________________________________________
>>             Developers mailing list
>>             Developers at admb-project.org
>>             <mailto:Developers at admb-project.org>
>>             http://lists.admb-project.org/mailman/listinfo/developers