[Developers] opencl newfmin example
John Sibert
sibert at hawaii.edu
Wed May 16 12:26:42 PDT 2012
http://developer.nvidia.com/cuda-gpus claims compute capability 1.3 for
the Quadro FX 3800 card, but only 1.1 for the Quadro FX 3800M mobile
version.
When you get a chance, it would be nice to know what clinfo tells you
about the Platform Extensions.
John Sibert
Emeritus Researcher, SOEST
University of Hawaii at Manoa
Visit the ADMB project http://admb-project.org/
On 05/16/2012 08:47 AM, Ian Taylor wrote:
> Unfortunately I don't have time today to play with this. For the
> record, my graphics card is NVIDIA Quadro FX 3800.
>
> After a number of amendments to the newfmin.cpp file based on Dave's
> suggestions, it occurred to me that we have a "branches" directory in
> the SVN repository to keep track of such changes.
>
> There was an old gpu folder in there, which I don't know anything
> about. So rather than replace the file in the src/linad99, I just put
> the file in the main directory: /branches/gpu/newfmin.cpp.
> Here's a link to the modified file in case anyone else wants to try
> it:
> http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp
>
> -Ian
>
> On Wed, May 16, 2012 at 6:27 AM, dave fournier <davef at otter-rsch.com
> <mailto:davef at otter-rsch.com>> wrote:
>
> On 12-05-15 02:59 PM, Ian Taylor wrote:
>
> I'm 99% sure this is not running on the GPU. You need to get an
> error-free run, and this has one error when it tries to compile the
> source for the GPU. The error message is not correct as it got
> duplicated in the code. But there is an error. One could find out
> what the returned error code is and look it up in the cl.h header
> file.
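
For anyone who wants to do that lookup in code rather than by reading the header, here is a minimal sketch. The function name cl_error_name is hypothetical (it is not part of OpenCL or ADMB), and the numeric values are a hand-copied subset of the status codes in the Khronos cl.h header, so the sketch builds without an OpenCL SDK installed.

```cpp
#include <string>

// Hypothetical helper: translate an OpenCL status code into its cl.h name.
// Only a subset of the codes is listed; values are copied from cl.h.
std::string cl_error_name(int code)
{
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "unknown code " + std::to_string(code);
    }
}
```

With this table, the "ret = -34" quoted further down this thread would read as CL_INVALID_CONTEXT.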
>
>
>
>> Hi all,
>> Thanks to help from Dave, I finally got his example working
>> (perhaps) on a Windows computer using Microsoft Visual C++ on a
>> computer with a Nvidia GPU. I got an error about "Error trying to
>> load Kernel source GPU" (pasted at bottom of email along with
>> other warnings that I don't understand), but using something
>> called "GPU-Z", I was able to see that the GPU Load went from 1%
>> to 99%. Nevertheless, using the GPU only cut the run time in
>> half, and the majority of that was achieved with the BFGS
>> algorithm without the GPU (USE_GPU_FLAG=0). So I'm thinking the
>> GPU is not being utilized correctly, or my GPU is not as well
>> suited to this problem as Dave's, or the VC compiler is not as
>> well suited as GCC.
>>
>> Speed comparison:
>> new newfmin with GPU: 2 minutes, 19 seconds for 442 function
>> evaluations.
>> new newfmin w/o GPU: 2 minutes, 37 seconds for 682 function
>> evaluations.
>> old newfmin time (no GPU): 5 minutes, 21 seconds for 2119
>> function evaluations.
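
A small sketch of the arithmetic behind these three timings, in case it helps to separate "fewer evaluations" from "faster evaluations". The struct and function names are made up for illustration; the numbers are the ones reported above, converted to seconds (2:19 = 139 s, 2:37 = 157 s, 5:21 = 321 s).

```cpp
// One row of the timing comparison: wall-clock seconds and evaluation count.
struct Run {
    const char* label;
    int seconds;
    int evaluations;
};

// The three runs reported above.
const Run reported_runs[3] = {
    {"new newfmin with GPU", 139, 442},
    {"new newfmin w/o GPU",  157, 682},
    {"old newfmin (no GPU)", 321, 2119},
};

// Average wall-clock seconds spent per function evaluation.
double seconds_per_evaluation(const Run& r)
{
    return static_cast<double>(r.seconds) / r.evaluations;
}
```

Dividing through gives roughly 0.31, 0.23 and 0.15 seconds per evaluation for the three runs, so on these figures the shorter run times come from needing fewer function evaluations, not from each evaluation being cheaper.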
>>
>> I had struggles at various points along the way, including
>> installing the correct OpenCL stuff for my GPU, building ADMB
>> with or without the new newfmin file, and linking the bigmin
>> model to the OpenCL libraries. Everything I know about C++, I
>> learned from working with ADMB, so this was a valuable addition
>> to my education.
>> -Ian
>>
>> ### Here are the warnings and errors ###
>>
>> >bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>> Error trying to open data input file bigmin.dat
>> command queue created successfully
>> Number of devices found 1
>> Error trying to load Kernel source GPU
>> All buffers created successfully
>> Program creation code = 0
>> Program build code = 0
>> Create Kernel2 error code = 0
>> Create Kernel error code = 0
>> Create Kernel3 error code = 0
>> Create Kernel4 error code = 0
>> Create Kernel1 error code = 0
>>
>> Initial statistics: 6144 variables; iteration 0; function
>> evaluation 0; phase 1
>> ...
>>
>>
>>
>>
>> On Tue, May 15, 2012 at 10:51 AM, John Sibert <sibert at hawaii.edu
>> <mailto:sibert at hawaii.edu>> wrote:
>>
>> I tried to get it working, but did not succeed. In the
>> process, I might have learned a few things, so I have
>> included a lot of stuff in this email.
>>
>> It would be really helpful if others on this list would also
>> give it a try and share the results with the rest of us.
>>
>> The main problem I encountered was ignorance of what (if
>> anything) needed to be installed on my computer. Neither the
>> OpenCL nor the AMD websites offer much guidance.
>>
>> In the end I concluded that my hardware (a Dell D series
>> laptop with Nvidia graphics processor purchased in 2009 and
>> running Ubuntu 10.04) is unsuitable, probably not supporting
>> double precision arithmetic.
>>
>> Without installing any new software the machine comes with
>> the executable "clinfo" that provides a lot of information
>> about the hardware. Sections to note are "Platform
>> Extensions: cl_khr_byte_addressable_store cl_khr_icd
>> cl_khr_gl_sharing cl_nv_compiler_options
>> cl_nv_device_attribute_query cl_nv_pragma_unroll"
>> and "Extensions: cl_khr_fp64 cl_amd_fp64 ..." (without the
>> word "Platform"). If the graphics card supports double
>> precision calculations it should report "cl_khr_fp64
>> cl_amd_fp64", but note the ambiguity of two different
>> "Extensions".
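
The same check can be done in code once the extension string is in hand. The sketch below only does the string matching; the function names are hypothetical. In a real program the device string would presumably come from clGetDeviceInfo with CL_DEVICE_EXTENSIONS and the platform string from clGetPlatformInfo with CL_PLATFORM_EXTENSIONS, which would account for the two different "Extensions" sections that clinfo prints.

```cpp
#include <sstream>
#include <string>

// Return true if the space-separated extension list contains exactly `name`.
bool has_extension(const std::string& extension_list, const std::string& name)
{
    std::istringstream tokens(extension_list);
    std::string token;
    while (tokens >> token) {
        if (token == name) return true;
    }
    return false;
}

// Double precision is advertised as cl_khr_fp64 (Khronos) or
// cl_amd_fp64 (AMD's vendor-specific variant).
bool reports_double_precision(const std::string& extension_list)
{
    return has_extension(extension_list, "cl_khr_fp64")
        || has_extension(extension_list, "cl_amd_fp64");
}
```

Whole tokens are compared instead of using a substring search, so a longer name that merely starts with cl_khr_fp64 would not count as a match.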
>>
>>
>> Emboldened, I managed to build the bigmin example without
>> much drama and
>> $ ./bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>> produced the following
>>
>> Error creating command queue ret = -34
>> Number of devices found 0
>> No GPU found
>>
>>
>> So I disabled the Nvidia graphics driver and downloaded
>> AMD-APP-SDK-v2.6-lnx64.tgz from
>> http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx
>> and installed it. After messing around with linker paths,
>> bigmin compiled and linked, but produced the same run-time
>> error.
>>
>> At this point I concluded that my graphics card does not
>> support double precision floating point calculations.
>>
>> A bit of work with google turned up some more information.
>>
>> http://developer.nvidia.com/cuda-gpus
>> lists Nvidia graphics processors and their "compute
>> capability". The entry for mine is Quadro NVS 135M compute
>> capability 1.1
>>
>> http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html
>> offers some interpretation of compute capability:
>>
>> To enable the use of doubles inside CUDA kernels you first need
>> to make sure you have a CUDA Compute 1.3-capable card. These are
>> the newer versions of the nVidia CUDA cards such as the GTX 260,
>> GTX 280, Quadro FX 5800, and Tesla S1070 and C1060. Thereby you
>> have to add a command line options to the nvcc compiler:
>> --gpu-architecture sm_13.
>>
>> The ever-helpful wikipedia entry for CUDA
>> http://en.wikipedia.org/wiki/CUDA agrees
>>
>> CUDA (with compute capability 1.x) uses a recursion-free,
>> function-pointer-free subset of the C language, plus some simple
>> extensions. However, a single process must run spread across
>> multiple disjoint memory spaces, unlike other C language runtime
>> environments.
>>
>> CUDA (with compute capability 2.x) allows a subset of C++ class
>> functionality, for example member functions may not be virtual
>> (this restriction will be removed in some future release). [See
>> CUDA C Programming Guide 3.1 - Appendix D.6]
>>
>> Double precision (CUDA compute capability 1.3 and above) deviate
>> from the IEEE 754 standard: round-to-nearest-even is the only
>> supported rounding mode for reciprocal, division, and square
>> root. In single precision, denormals and signalling NaNs are not
>> supported; only two IEEE rounding modes are supported (chop and
>> round-to-nearest even), and those are specified on a
>> per-instruction basis rather than in a control word; and the
>> precision of division/square root is slightly lower than single
>> precision.
>>
>>
>> So you need a graphics processor with compute capability 1.3
>> and above.
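
Written as a check, the rule is just a comparison of (major, minor) version pairs. This is a sketch only; the struct and function names are invented for illustration, and the 1.3 threshold is taken from the sources quoted above.

```cpp
// Compute capability as listed on the NVIDIA page, e.g. 1.1 or 1.3.
struct ComputeCapability {
    int major_version;
    int minor_version;
};

// True when the capability is 1.3 or above, the threshold quoted above
// for double precision support.
bool supports_double(const ComputeCapability& cc)
{
    return cc.major_version > 1
        || (cc.major_version == 1 && cc.minor_version >= 3);
}
```

By this rule the Quadro NVS 135M at 1.1 fails the check, while any card listed at 1.3 or 2.x passes.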
>>
>> I would urge everyone to try to get this example running and
>> share your experiences. OpenCL looks like a promising way
>> to parallelize some applications. The overview document
>> http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf
>> implies that it might be possible to tune an application to
>> use either GPU or multiple cores on a cluster. Unfortunately
>> the learning curve is steep (ask Dave) and the documentation
>> is thin.
>>
>> Happy hacking,
>> John
>>
>>
>> On 05/12/2012 05:31 AM, dave fournier wrote:
>>
>> Has anyone else actually got this example to work?
>>
>> Some advice. Older GPUs (whatever that is) probably
>> do not support double precision.
>>
>> WRT using the BFGS update on the CPU. It does not seem
>> to perform as well as doing it on the GPU. I think this is
>> due to roundoff error. The CPU is carrying out additions
>> in a different way. It may be that with say 4K or more
>> parameters and this (artificial) example roundoff error
>> becomes important.
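
To illustrate how the order of additions alone can change a floating point sum, here is a minimal sketch. It does not reproduce the BFGS update; the function names are invented, and the pairwise order is only one plausible order for a parallel reduction (the thread does not say how the GPU kernel sums).

```cpp
#include <cstddef>
#include <vector>

// Add the terms strictly left to right, as a simple serial loop would.
double sum_left_to_right(const std::vector<double>& x)
{
    double total = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) total += x[i];
    return total;
}

// Add the terms pairwise over the half-open range [lo, hi), i.e. in a
// tree-shaped order of the kind a parallel reduction might use.
double sum_pairwise(const std::vector<double>& x, std::size_t lo, std::size_t hi)
{
    if (hi - lo == 0) return 0.0;
    if (hi - lo == 1) return x[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    return sum_pairwise(x, lo, mid) + sum_pairwise(x, mid, hi);
}
```

For the four terms 1e16, 1.0, -1e16, 1.0 the exact sum is 2. Assuming plain IEEE 754 double arithmetic with round-to-nearest-even, the left-to-right loop returns 1.0 and the pairwise version returns 0.0, so neither order is exact and the two disagree with each other.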
>>
>> I stored the matrix by rows. It now appears that it should be
>> stored by columns for the fastest matrix * vector multiplication.
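
A sketch of the two storage layouts, under the assumption (not stated in the thread) that the GPU kernel assigns one work-item to each output row i. In that mapping, at column step j the work-items for rows i and i+1 read elements that are 1 apart in memory when the matrix is stored by columns, but n apart when it is stored by rows, and adjacent reads are generally what GPUs handle most efficiently. All names here are hypothetical, and the code runs serially on the host.

```cpp
#include <cstddef>
#include <vector>

// Position of element (i, j) of an n x n matrix stored by rows.
std::size_t index_by_rows(std::size_t i, std::size_t j, std::size_t n)
{
    return i * n + j;
}

// Position of element (i, j) of an n x n matrix stored by columns.
std::size_t index_by_columns(std::size_t i, std::size_t j, std::size_t n)
{
    return j * n + i;
}

// y = A * x with A stored by columns. The inner loop over i stands in
// for the work-items that would run side by side on the GPU.
std::vector<double> matvec_by_columns(const std::vector<double>& a,
                                      const std::vector<double>& x,
                                      std::size_t n)
{
    std::vector<double> y(n, 0.0);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            y[i] += a[index_by_columns(i, j, n)] * x[j];
    return y;
}
```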
>>
>>
>>
>> _______________________________________________
>> Developers mailing list
>> Developers at admb-project.org
>> <mailto:Developers at admb-project.org>
>> http://lists.admb-project.org/mailman/listinfo/developers