[Developers] opencl newfmin example

Tue May 15 10:51:50 PDT 2012

I tried to get it working, but did not succeed. In the process, I might 
have learned a few things, so I have included a lot of stuff in this email.

It would be really helpful if others on this list would also give it a 
try and share the results with the rest of us.

The main problem I encountered was ignorance of what (if anything) needed 
to be installed on my computer. Neither the OpenCL nor the AMD websites 
offer much guidance.

In the end I concluded that my hardware (a Dell D series laptop with an 
Nvidia graphics processor, purchased in 2009 and running Ubuntu 10.04) is 
unsuitable, probably because it does not support double precision 
arithmetic.

Without installing any new software, the machine comes with the 
executable "clinfo", which provides a lot of information about the 
hardware. The sections to note are "Platform Extensions: 
cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing 
cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll"
and "Extensions: cl_khr_fp64 cl_amd_fp64 ..." (without the word 
"Platform"). If the graphics card supports double precision calculations, 
it should report "cl_khr_fp64" (or the vendor-specific "cl_amd_fp64"), 
but note the ambiguity of having two different "Extensions" entries.
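
The check can be scripted to take some of the guesswork out of reading 
the clinfo output. The following is a minimal Python sketch, assuming the 
output has one "Label: values" pair per line as in the fragments quoted 
above. The function name fp64_extensions and the sample device line are 
made up for illustration, and real clinfo output may be laid out 
differently.

```python
# Extension names that advertise double precision support.
FP64_NAMES = {"cl_khr_fp64", "cl_amd_fp64"}


def fp64_extensions(clinfo_text):
    """Map each '...Extensions' label to the fp64 extensions it lists."""
    found = {}
    for line in clinfo_text.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label.endswith("Extensions"):
            names = [n for n in value.split() if n in FP64_NAMES]
            found.setdefault(label, []).extend(names)
    return found


# Hypothetical sample: the platform line is taken from the email, the
# device line is invented to show a device that does report fp64.
SAMPLE = """\
  Platform Extensions:   cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing
  Extensions:            cl_khr_global_int32_base_atomics cl_khr_fp64
"""

result = fp64_extensions(SAMPLE)
has_fp64 = any(result.values())
print(result, has_fp64)
```

Keeping the two labels separate in the result makes it visible whether 
the fp64 entry came from the "Platform Extensions" line or from the plain 
"Extensions" line.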

Emboldened, I managed to build the bigmin example without much drama, and
$ ./bigmin -mno 10000 -crit 1.e-10 -nox -nohess
produced the following:
> Error creating command queue ret = -34
> Number of devices found 0
> No GPU found
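
The value -34 can be looked up in the OpenCL header cl.h, where it is 
defined as CL_INVALID_CONTEXT. That would be consistent with a context 
that was set up without any usable device, although that is an inference 
and not something the output states. A small Python lookup table, covering 
only a handful of the codes in cl.h (the helper name cl_error_name is made 
up for illustration):

```python
# A few OpenCL status codes as defined in the Khronos cl.h header.
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -31: "CL_INVALID_DEVICE_TYPE",
    -32: "CL_INVALID_PLATFORM",
    -33: "CL_INVALID_DEVICE",
    -34: "CL_INVALID_CONTEXT",
    -35: "CL_INVALID_QUEUE_PROPERTIES",
}


def cl_error_name(code):
    """Return the symbolic name for a status code, if it is in the table."""
    return CL_ERROR_NAMES.get(code, "unknown code %d" % code)


print(cl_error_name(-34))
```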

So I disabled the Nvidia graphics driver, downloaded
  AMD-APP-SDK-v2.6-lnx64.tgz from
http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx
and installed it. After some messing around with linker paths, bigmin 
compiled and linked, but produced the same run-time error.

At this point I concluded that my graphics card does not support double 
precision floating point calculations.

A bit of work with Google turned up some more information.

http://developer.nvidia.com/cuda-gpus
lists Nvidia graphics processors and their "compute capability". The 
entry for mine is: Quadro NVS 135M, compute capability 1.1

http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html
offers some interpretation of compute capability:
> To enable the use of doubles inside CUDA kernels you first need to
> make sure you have a CUDA Compute 1.3-capable card. These are the newer
> versions of the nVidia CUDA cards such as the GTX 260, GTX 280, Quadro
> FX 5800, and Tesla S1070 and C1060.  Thereby you have to add a command
> line options to the nvcc compiler: --gpu-architecture sm_13.
The ever-helpful Wikipedia entry for CUDA
http://en.wikipedia.org/wiki/CUDA agrees:
> CUDA (with compute capability 1.x) uses a recursion-free,
> function-pointer-free subset of the C language, plus some simple
> extensions. However, a single process must run spread across multiple
> disjoint memory spaces, unlike other C language runtime environments.
>
> CUDA (with compute capability 2.x) allows a subset of C++ class
> functionality, for example member functions may not be virtual (this
> restriction will be removed in some future release). [See CUDA C
> Programming Guide 3.1 - Appendix D.6]
>
> Double precision (CUDA compute capability 1.3 and above) deviate
> from the IEEE 754 standard: round-to-nearest-even is the only supported
> rounding mode for reciprocal, division, and square root. In single
> precision, denormals and signalling NaNs are not supported; only two
> IEEE rounding modes are supported (chop and round-to-nearest even), and
> those are specified on a per-instruction basis rather than in a control
> word; and the precision of division/square root is slightly lower than
> single precision.
>

So for double precision you need a graphics processor with compute 
capability 1.3 or above.
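
As a worked check of that rule, here is a minimal Python sketch. The 
function name supports_double is made up for illustration, and the 1.3 
threshold is the one given in the sources quoted above.

```python
def supports_double(compute_capability):
    """True if a CUDA compute capability string such as '1.1' is >= 1.3."""
    major, minor = (int(part) for part in compute_capability.split("."))
    # Compare as (major, minor) so that e.g. "1.10" is not read as 1.1.
    return (major, minor) >= (1, 3)


print(supports_double("1.1"))  # the Quadro NVS 135M entry
print(supports_double("1.3"))
```

With the Quadro NVS 135M at 1.1, the check fails, which matches the 
conclusion above.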

I would urge everyone to try to get this example running and share your 
experiences. OpenCL looks like a promising way to parallelize some 
applications. The overview document
http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf
implies that it might be possible to tune an application to use either a 
GPU or multiple cores on a cluster. Unfortunately, the learning curve is 
steep (ask Dave) and the documentation is thin.

Happy hacking,
John

John Sibert
Emeritus Researcher, SOEST
University of Hawaii at Manoa

Visit the ADMB project http://admb-project.org/

On 05/12/2012 05:31 AM, dave fournier wrote:
> Has anyone else actually got this example to work?
>
> Some advice. Older GPUs (whatever that is) probably
> do not support double precision.
>
> WRT using the BFGS update on the CPU. It does not seem
> to perform as well as doing it on the GPU. I think this is
> due to roundoff error.  The CPU is carrying out additions in a different
> way. It may be that with say 4K or more parameters and this
> (artificial) example roundoff error becomes important.
>
> I stored the matrix by rows. It now appears that it should be stored
> by columns for the fastest matrix * vector multiplication.
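
Dave's remark that the CPU carries out additions in a different way can 
be illustrated with a small Python sketch. It is not the bigmin code. The 
three summation orders and the four test values are chosen only to make 
the effect visible in double precision.

```python
import math


def sequential_sum(values):
    """Add left to right, as a simple CPU loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add in a tree-shaped order, similar to a parallel reduction."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# The exact sum is 2.0, but 1.0 is lost whenever it is added to 1e16.
values = [1.0e16, 1.0, -1.0e16, 1.0]

print(sequential_sum(values))  # 1.0
print(pairwise_sum(values))    # 0.0
print(math.fsum(values))       # 2.0 (correctly rounded reference)
```

The same four numbers give three different totals depending only on the 
order of the additions, so a CPU loop and a GPU reduction need not agree 
in the last digits even when both are correct implementations.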