[Developers] opencl newfmin example

Thu May 17 16:46:45 PDT 2012

Success.

I got a fancier GPU and switched to Linux by renting some time on a GPU
cluster from Amazon, which has Nvidia Tesla M2050 GPUs with a compute
capability of 2.0.

Here are run times for the bigmin model:
new newfmin with GPU: 0 minutes 22 seconds for 442 function evaluations.
new newfmin w/o  GPU: 3 minutes, 37 seconds for 682 function evaluations.
old newfmin time (no GPU): 5 minutes, 44 seconds for 2119 function
evaluations.

So that's about 15 times faster for the new code with the GPU than for the
old newfmin (22 seconds vs. 5 minutes, 44 seconds).
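
Since the three runs used different numbers of function evaluations, the
overall ratio mixes two effects: fewer evaluations and cheaper evaluations.
Here is a small sketch that separates them, using only the times and counts
listed above. The function names are just for illustration, and "seconds per
evaluation" here is total wall-clock time divided by the evaluation count, so
it includes everything else the optimizer does.

```cpp
#include <cassert>
#include <cstdio>

// Convert a wall-clock time given as minutes and seconds to seconds.
inline double to_seconds(int minutes, int seconds) {
    return 60.0 * minutes + seconds;
}

// How many times faster the run taking fast_s is than the run taking slow_s.
inline double speedup(double slow_s, double fast_s) {
    assert(fast_s > 0.0);
    return slow_s / fast_s;
}

// Average wall-clock seconds per function evaluation.
inline double seconds_per_eval(double total_s, int evals) {
    assert(evals > 0);
    return total_s / evals;
}

// Print the bigmin comparison from the times and counts reported above.
inline void print_bigmin_summary() {
    const double gpu = to_seconds(0, 22);   // new newfmin with GPU, 442 evals
    const double cpu = to_seconds(3, 37);   // new newfmin w/o GPU, 682 evals
    const double old = to_seconds(5, 44);   // old newfmin, 2119 evals
    std::printf("old / new+GPU speedup: %.1f\n", speedup(old, gpu));
    std::printf("new / new+GPU speedup: %.1f\n", speedup(cpu, gpu));
    std::printf("s/eval new+GPU: %.3f\n", seconds_per_eval(gpu, 442));
    std::printf("s/eval new:     %.3f\n", seconds_per_eval(cpu, 682));
    std::printf("s/eval old:     %.3f\n", seconds_per_eval(old, 2119));
}
```

Calling print_bigmin_summary() from main() gives ratios of roughly 15.6 and
9.9. By these numbers the new code without the GPU spends about twice as long
per evaluation as the old code, but needs about a third as many evaluations.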

I still haven't figured out what wasn't working on the other setup, which
had the same number of function evaluations in each case, but somehow
wasn't taking advantage of the GPU. I think that card was the Quadro FX 3800
that John found listed at compute capability 1.3, not the mobile version
at 1.1.

Unfortunately, I didn't see any gain in the GPU calcs with other models.
Perhaps this is due to ignorance of what's required to take advantage of
the GPU. The "catage" model with 200 years and 40 ages took 11 seconds with
the GPU calcs and 7 seconds with the old newfmin. Also, I sometimes got
this error when switching phases, which I'm guessing is due to the number
of parameters changing from even to odd or vice-versa: "Index bounds do not
match in dvector operator - (const dvector&, const dvector&)".
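
Whatever the cause turns out to be, the message itself says that two vectors
with different index ranges were subtracted. Below is a minimal stand-in for
that kind of check. It is NOT ADMB's dvector: the class BoundedVec and its
behaviour (throwing an exception) are made up for illustration, and only the
indexmin()/indexmax() names mirror the real class. It may help when hunting
for a work vector that kept its size from the previous phase, which is only
my guess at what is going on.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a vector with explicit index bounds [lo, hi].
class BoundedVec {
public:
    BoundedVec(int lo, int hi)
        : lo_(lo), hi_(hi), data_(hi >= lo ? hi - lo + 1 : 0, 0.0) {
        assert(hi >= lo - 1);  // hi == lo - 1 means an empty vector
    }
    int indexmin() const { return lo_; }
    int indexmax() const { return hi_; }
    // Bounds-checked element access using the vector's own index range.
    double& operator()(int i) { return data_.at(i - lo_); }
    double operator()(int i) const { return data_.at(i - lo_); }
private:
    int lo_, hi_;
    std::vector<double> data_;
};

// Elementwise difference; refuses to run when the index ranges differ.
inline BoundedVec operator-(const BoundedVec& a, const BoundedVec& b) {
    if (a.indexmin() != b.indexmin() || a.indexmax() != b.indexmax())
        throw std::range_error("Index bounds do not match in operator-");
    BoundedVec r(a.indexmin(), a.indexmax());
    for (int i = a.indexmin(); i <= a.indexmax(); ++i)
        r(i) = a(i) - b(i);
    return r;
}
```

With this sketch, subtracting a vector indexed 1..5 from one indexed 1..4
throws, which is the situation that would arise if one operand were sized for
the new phase and the other for the old one.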

-Ian

On Wed, May 16, 2012 at 12:26 PM, John Sibert <sibert at hawaii.edu> wrote:

> http://developer.nvidia.com/cuda-gpus claims compute capability 1.3 for
> the Quadro FX 3800 card, but only 1.1 for the Quadro FX 3800M mobile
> version.
>
> When you get a chance, it would be nice to know what clinfo tells you
> about the Platform Extensions
>
>
> John Sibert
> Emeritus Researcher, SOEST
> University of Hawaii at Manoa
>
> Visit the ADMB project http://admb-project.org/
>
>
> On 05/16/2012 08:47 AM, Ian Taylor wrote:
>
>> Unfortunately I don't have time today to play with this. For the record,
>> my graphics card is NVIDIA Quadro FX 3800.
>>
>> After a number of amendments to the newfmin.cpp file based on Dave's
>> suggestions, it occurred to me that we have a "branches" directory in the
>> SVN repository to keep track of such changes.
>>
>> There was an old gpu folder in there, which I don't know anything about.
>> So rather than replace the file in the src/linad99, I just put the file in
>> the main directory: /branches/gpu/newfmin.cpp.
>> Here's a link to the modified file in case anyone else wants to try it:
>> http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp
>>
>> -Ian
>>
>> On Wed, May 16, 2012 at 6:27 AM, dave fournier <davef at otter-rsch.com> wrote:
>>
>>    On 12-05-15 02:59 PM, Ian Taylor wrote:
>>
>>    I'm 99% sure this is not running on the GPU.  You need to get an
>>    error free run
>>    and this has one error when it tries to compile the source for the
>>    GPU.
>>    The error message is not correct as it got duplicated in the code.
>>    But there is an
>>    error.  One could find out what the returned error code is and
>>    look it up
>>    in the cl.h header file.
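
Following up on the suggestion to look the code up in cl.h: a small lookup
like the one below saves a trip to the header. The helper cl_error_name is
hypothetical (it is not in newfmin.cpp), it covers only a subset of codes,
and the numeric values are quoted from the Khronos OpenCL 1.x cl.h as I
remember it, so check them against the cl.h that ships with your SDK.

```cpp
#include <string>

// Map a subset of OpenCL status codes to their cl.h names.
// Values are assumed from the Khronos OpenCL 1.x header; verify locally.
inline std::string cl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -3:  return "CL_COMPILER_NOT_AVAILABLE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -11: return "CL_BUILD_PROGRAM_FAILURE";
        case -30: return "CL_INVALID_VALUE";
        case -32: return "CL_INVALID_PLATFORM";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        case -44: return "CL_INVALID_PROGRAM";
        case -45: return "CL_INVALID_PROGRAM_EXECUTABLE";
        case -46: return "CL_INVALID_KERNEL_NAME";
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

For instance, the "ret = -34" that John reports further down would come back
as CL_INVALID_CONTEXT, which fits with no device having been found.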
>>
>>
>>
>>     Hi all,
>>>    Thanks to help from Dave, I finally got his example working
>>>    (perhaps) on a Windows computer using Microsoft Visual C++ on a
>>>    computer with a Nvidia GPU. I got an error about "Error trying to
>>>    load Kernel source GPU" (pasted at bottom of email along with
>>>    other warnings that I don't understand), but using something
>>>    called "GPU-Z", I was able to see that the GPU Load went from 1%
>>>    to 99%. Nevertheless, using the GPU only cut the run time in
>>>    half, and the majority of that was achieved with the BFGS
>>>    algorithm without the GPU (USE_GPU_FLAG=0). So I'm thinking the
>>>    GPU is not being utilized correctly, or my GPU is not as well
>>>    suited to this problem as Dave's, or the VC compiler is not as
>>>    well suited as GCC.
>>>
>>>    Speed comparison:
>>>    new newfmin with GPU: 2 minutes, 19 seconds for 442 function
>>>    evaluations.
>>>    new newfmin w/o  GPU: 2 minutes, 37 seconds for 682 function
>>>    evaluations.
>>>    old newfmin time (no GPU): 5 minutes, 21 seconds for 2119
>>>    function evaluations.
>>>
>>>    I had struggles at various points along the way, including
>>>    installing the correct OpenCL stuff for my GPU, building ADMB
>>>    with or without the new newfmin file, and linking the bigmin
>>>    model to the OpenCL libraries. Everything I know about C++, I
>>>    learned from working with ADMB, so this was a valuable addition
>>>    to my education.
>>>    -Ian
>>>
>>>    ### Here are the warnings and errors ###
>>>
>>>    >bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>>>    Error trying to open data input file bigmin.dat
>>>    command queue created successfully
>>>    Number of devices found 1
>>>    Error trying to load Kernel source  GPU
>>>    All buffers created successfully
>>>    Program creation code = 0
>>>    Program build code = 0
>>>    Create Kernel2 error code = 0
>>>    Create Kernel error code = 0
>>>    Create Kernel3 error code = 0
>>>    Create Kernel4 error code = 0
>>>    Create Kernel1 error code = 0
>>>
>>>    Initial statistics: 6144 variables; iteration 0; function
>>>    evaluation 0; phase 1
>>>    ...
>>>
>>>
>>>
>>>
>>>    On Tue, May 15, 2012 at 10:51 AM, John Sibert <sibert at hawaii.edu
>>>    <mailto:sibert at hawaii.edu>> wrote:
>>>
>>>        I tried to get it working, but did not succeed. In the
>>>        process, I might have learned a few things, so I have
>>>        included a lot of stuff in this email.
>>>
>>>        It would be really helpful if others on this list would also
>>>        give it a try and share the results with the rest of us.
>>>
>>>        The main problem I encountered was ignorance of what (if
>>>        anything) needed to be installed on my computer. Neither the
>>>        OpenCL nor the AMD websites offer much guidance.
>>>
>>>        In the end I concluded that my hardware (a Dell D series
>>>        laptop with Nvidia graphics processor purchased in 2009 and
>>>        running Ubuntu 10.04) is unsuitable, probably not supporting
>>>        double precision arithmetic.
>>>
>>>        Without installing any new software the machine comes with
>>>        the executable "clinfo" that provides a lot of information
>>>        about the hardware. Sections to note are "Platform
>>>        Extensions:  cl_khr_byte_addressable_store cl_khr_icd
>>>        cl_khr_gl_sharing cl_nv_compiler_options
>>>        cl_nv_device_attribute_query cl_nv_pragma_unroll"
>>>        and "Extensions: cl_khr_fp64 cl_amd_fp64 ..." (without the
>>>        word "Platform"). If the graphics card supports double
>>>        precision calculations it should report "cl_khr_fp64
>>>        cl_amd_fp64", but note the ambiguity of two different
>>>        "Extensions".
>>>
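
On the ambiguity between the two "Extensions" lists: as far as I understand,
double precision is advertised per device, so the string to check is the one
returned by clGetDeviceInfo with CL_DEVICE_EXTENSIONS, not the platform-level
list from clGetPlatformInfo with CL_PLATFORM_EXTENSIONS. The sketch below
only does the string check, so it runs without OpenCL installed; the function
names are made up for illustration.

```cpp
#include <sstream>
#include <string>

// True if `name` appears as a whole space-separated token in `extensions`.
inline bool has_extension(const std::string& extensions,
                          const std::string& name) {
    std::istringstream in(extensions);
    std::string token;
    while (in >> token)
        if (token == name) return true;
    return false;
}

// True if a device extension string advertises double precision support.
inline bool supports_double(const std::string& device_extensions) {
    return has_extension(device_extensions, "cl_khr_fp64")
        || has_extension(device_extensions, "cl_amd_fp64");
}
```

Passing the "Platform Extensions" string quoted above to supports_double()
returns false, while a device string containing cl_khr_fp64 returns true.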
>>>
>>>        Emboldened, I managed to build the bigmin example without
>>>        much drama and
>>>        $ ./bigmin -mno 10000 -crit 1.e-10 -nox -nohess
>>>        produced the following
>>>
>>>            Error creating command queue ret = -34
>>>            Number of devices found 0
>>>            No GPU found
>>>
>>>
>>>        So I disabled the Nvidia graphics driver and downloaded
>>>        AMD-APP-SDK-v2.6-lnx64.tgz from
>>>        http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx
>>>        and installed it. After messing around with linker paths,
>>>        bigmin compiled and linked, but produced the same run-time
>>>        error.
>>>
>>>        At this point I concluded that my graphics card does not
>>>        support double precision floating point calculations.
>>>
>>>        A bit of work with google turned up some more information.
>>>
>>>        http://developer.nvidia.com/cuda-gpus
>>>        lists Nvidia graphics processors and their "compute
>>>        capability". The entry for mine is Quadro NVS 135M  compute
>>>        capability 1.1
>>>
>>>        http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html
>>>        offers some interpretation of compute capability:
>>>
>>>            To enable the use of doubles inside CUDA kernels you
>>>            first need to
>>>            make sure you have a CUDA Compute 1.3-capable card. These
>>>            are the newer
>>>            versions of the nVidia CUDA cards such as the GTX 260,
>>>            GTX 280, Quadro
>>>            FX 5800, and Tesla S1070 and C1060.  Thereby you have to
>>>            add a command
>>>            line options to the nvcc compiler: --gpu-architecture sm_13.
>>>
>>>        The ever-helpful wikipedia entry for CUDA
>>>        http://en.wikipedia.org/wiki/CUDA agrees
>>>
>>>            CUDA (with compute capability 1.x) uses a recursion-free,
>>>            function-pointer-free subset of the C language, plus some
>>>            simple
>>>            extensions. However, a single process must run spread
>>>            across multiple
>>>            disjoint memory spaces, unlike other C language runtime
>>>            environments.
>>>
>>>            CUDA (with compute capability 2.x) allows a subset of C++
>>>            class
>>>            functionality, for example member functions may not be
>>>            virtual (this
>>>            restriction will be removed in some future release). [See
>>>            CUDA C
>>>            Programming Guide 3.1 - Appendix D.6]
>>>
>>>            Double precision (CUDA compute capability 1.3 and above)
>>>            deviate
>>>            from the IEEE 754 standard: round-to-nearest-even is the
>>>            only supported
>>>            rounding mode for reciprocal, division, and square root.
>>>            In single
>>>            precision, denormals and signalling NaNs are not
>>>            supported; only two
>>>            IEEE rounding modes are supported (chop and
>>>            round-to-nearest even), and
>>>            those are specified on a per-instruction basis rather
>>>            than in a control
>>>            word; and the precision of division/square root is
>>>            slightly lower than
>>>            single precision.
>>>
>>>
>>>        So for double precision you need a graphics processor with
>>>        compute capability 1.3 or above.
>>>
>>>        I would urge everyone to try to get this example running and
>>>        share your experiences. OpenCL looks like a promising way
>>>        to parallelize some applications. The overview document
>>>        http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf
>>>        implies that it might be possible to tune an application to
>>>        use either GPU or multiple cores on a cluster. Unfortunately
>>>        the learning curve is steep (ask Dave) and the documentation
>>>        is thin.
>>>
>>>        Happy hacking,
>>>        John
>>>
>>>
>>>
>>>
>>>        John Sibert
>>>        Emeritus Researcher, SOEST
>>>        University of Hawaii at Manoa
>>>
>>>        Visit the ADMB project http://admb-project.org/
>>>
>>>
>>>
>>>        On 05/12/2012 05:31 AM, dave fournier wrote:
>>>
>>>            Has anyone else actually got this example to work?
>>>
>>>            Some advice. Older GPU's (whatever that is) probably
>>>            do not support double precision.
>>>
>>>            WRT using the BFGS update on the CPU. It does not seem
>>>            to perform as well as doing it on the GPU. I think this is
>>>            due to roundoff error.  The CPU is carrying out additions
>>>            in a different
>>>            way. It may be that with say 4K or more parameters and this
>>>            (artificial) example roundoff error becomes important.
>>>
>>>            I stored the matrix by rows. It now appears that it
>>>            should be stored
>>>            by columns for the fastest matrix * vector multiplication.
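
A common explanation for this (an assumption on my part about Dave's kernel,
not something I have checked in the code) is memory access: if each GPU
work-item computes one element of y = A*x, then at every step neighbouring
work-items read A(i,j) and A(i+1,j). Stored by columns those two values are
adjacent in memory; stored by rows they are a whole row apart. The sketch
below shows the two index formulas and a plain CPU matrix * vector product
over column storage, with the outer loop standing in for the work-items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index of A(i,j) in an n x n matrix stored by rows.
inline std::size_t row_major(std::size_t i, std::size_t j, std::size_t n) {
    return i * n + j;
}

// Flat index of A(i,j) in an n x n matrix stored by columns.
inline std::size_t col_major(std::size_t i, std::size_t j, std::size_t n) {
    return j * n + i;
}

// y = A * x with A stored by columns. On a GPU the loop over i would be
// the work-items, each accumulating one element of y.
inline std::vector<double> matvec_col_major(const std::vector<double>& a,
                                            const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(a.size() == n * n);
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += a[col_major(i, j, n)] * x[j];
    return y;
}
```

For the 6144-variable bigmin case, neighbouring work-items are 1 element
apart with column storage and 6144 elements apart with row storage.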
>>>
>>>
>>>
>>>            _______________________________________________
>>>            Developers mailing list
>>>            Developers at admb-project.org
>>>            http://lists.admb-project.org/mailman/listinfo/developers
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>