<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
On 12-05-17 04:46 PM, Ian Taylor wrote:<br>
<br>
Thanks for that Ian. I really appreciate your efforts.<br>
<br>
How much did the Amazon stuff cost?<br>
<br>
Otherwise what have we learned here so far?<br>
<br>
1.) People should be told that you need to know how to compile/link C/C++ programs<br>
to play with this.<br>
<br>
2.) You really need to determine that your GPU supports double precision (or whatever<br>
else you need) before wasting time (see the sketch after this list). <br>
<br>
3.) For an artificial problem which is set up to exploit the possible superiority of<br>
GPUs, you can do a lot better with this stuff.<br>
<br>
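Here is a minimal sketch of the check in point 2 (standard OpenCL host calls;
picking the first GPU device and skipping error handling is just for illustration):<br>
<pre>
// Sketch: ask the first GPU device whether it advertises double precision.
// Standard OpenCL 1.x host API; error checking omitted for brevity.
#include <CL/cl.h>
#include <cstdio>
#include <cstring>

int main(void)
{
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  char ext[8192];
  clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);

  if (strstr(ext, "cl_khr_fp64") || strstr(ext, "cl_amd_fp64"))
    printf("double precision supported\n");
  else
    printf("no double precision support -- do not bother\n");
  return 0;
}
</pre>
(The kernel source also needs #pragma OPENCL EXTENSION cl_khr_fp64 : enable before it
can use doubles.)<br>
<br>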
I'm not surprised that catage does not do better. It has only a few variables,<br>
38 if I remember correctly (although that is considered HUGE on the R list).<br>
<br>
For the Multifan-CL model we will probably be approaching 10,000 parameters<br>
with multi-species models, and that is the kind of problem I am thinking of<br>
for this stuff.<br>
<br>
However, all this led me to revisit the limited-memory Newton version of newfmin<br>
that I built. It performs much better than the BFGS version on this admittedly<br>
artificial, useless function designed by academics. It appears that this<br>
version of the Rosenbrock function is exquisitely sensitive to roundoff error<br>
in the quasi-Newton update.<br>
<br>
I modified the BFGS update to use the dd_real type. This is stuff I added to Autodif<br>
to support things in Multifan. I don't think it is in the free version at present,<br>
although I see no reason why it cannot be ... but I digress ... The point is that<br>
the BFGS and LBFGS agree completely when the BFGS is done with dd_real<br>
precision (128 bits). Now for 10,000 variables it seems clear that one might<br>
as well use the LBFGS update. <br>
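For reference, the LBFGS direction can be built with the standard two-loop
recursion; this is a bare-bones textbook sketch (plain C++, not lifted from
newfmin), just to show that it only ever touches the m stored (s, y) pairs,
so for n = 10,000 the storage is 2*m*n doubles instead of an n x n matrix:<br>
<pre>
// Two-loop L-BFGS recursion: given gradient g and the last m correction
// pairs s[i] = x_{k+1} - x_k, y[i] = g_{k+1} - g_k (i = 0 is the oldest),
// return the search direction d = -H*g without ever forming H.
#include <vector>
#include <cstddef>

typedef std::vector<double> vec;

static double dot(const vec& a, const vec& b)
{
  double r = 0.0;
  for (std::size_t i = 0; i < a.size(); i++) r += a[i] * b[i];
  return r;
}

vec lbfgs_direction(const vec& g, const std::vector<vec>& s,
                    const std::vector<vec>& y)
{
  const std::size_t m = s.size(), n = g.size();
  vec q = g, alpha(m), rho(m);
  for (std::size_t i = 0; i < m; i++) rho[i] = 1.0 / dot(y[i], s[i]);

  for (std::size_t i = m; i-- > 0; )          // newest to oldest
  {
    alpha[i] = rho[i] * dot(s[i], q);
    for (std::size_t j = 0; j < n; j++) q[j] -= alpha[i] * y[i][j];
  }
  // Initial Hessian approximation H0 = gamma * I, scaled by the newest pair.
  double gamma = m ? dot(s[m-1], y[m-1]) / dot(y[m-1], y[m-1]) : 1.0;
  for (std::size_t j = 0; j < n; j++) q[j] *= gamma;

  for (std::size_t i = 0; i < m; i++)         // oldest to newest
  {
    double beta = rho[i] * dot(y[i], q);
    for (std::size_t j = 0; j < n; j++) q[j] += s[i][j] * (alpha[i] - beta);
  }
  for (std::size_t j = 0; j < n; j++) q[j] = -q[j];  // descent direction
  return q;
}
</pre>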
<br>
Conclusion? Maybe the BFGS update exercise is a waste of time.<br>
<br>
But it was interesting.<br>
<br>
Next question: can doing the LBFGS update on the GPU for 10K-plus parameters<br>
be worthwhile?<br>
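The per-iteration work in the recursion is just dot products and axpy-style
vector updates, so the device side would amount to kernels along these lines
(hypothetical OpenCL C, written for illustration, not anything that exists in
newfmin or bigmin; assumes a power-of-two work-group size and that the partial
sums are added up on the host):<br>
<pre>
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// y = y + a*x
__kernel void axpy(const double a,
                   __global const double* x,
                   __global double* y,
                   const int n)
{
  int i = get_global_id(0);
  if (i < n) y[i] += a * x[i];
}

// One partial dot-product sum per work-group; the host adds the partials.
__kernel void dot_partial(__global const double* x,
                          __global const double* y,
                          __global double* partial,
                          __local double* scratch,
                          const int n)
{
  int gid = get_global_id(0), lid = get_local_id(0);
  scratch[lid] = (gid < n) ? x[gid] * y[gid] : 0.0;
  barrier(CLK_LOCAL_MEM_FENCE);
  for (int s = get_local_size(0) / 2; s > 0; s >>= 1)
  {
    if (lid < s) scratch[lid] += scratch[lid + s];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (lid == 0) partial[get_group_id(0)] = scratch[0];
}
</pre>
For n around 10K each vector is only about 80 KB, so whether the kernel launch
and transfer overhead leaves any real gain is exactly the open question.<br>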
<br>
<blockquote
cite="mid:CAOUH9r4EzjHS6mSmRXPCgDy7x2CP9mmOmxadHomLzO3yZJjhWg@mail.gmail.com"
type="cite">
<div>Success.</div>
<div><br>
</div>
<div>I got a fancier GPU and switched to Linux by renting some
time on a GPU cluster from Amazon, which has Nvidia Tesla M2050
GPUs with a compute capability of 2.0.</div>
<div><br>
</div>
<div>Here are run times for the bigmin model:</div>
<div>
<div style="font-size:13px;font-family:arial,sans-serif">new
newfmin with GPU: 0 minutes 22 seconds for 442 function
evaluations.</div>
<div style="font-size:13px;font-family:arial,sans-serif">
new newfmin w/o GPU: 3 minutes, 37 seconds for 682 function
evaluations.</div>
<div style="font-size:13px;font-family:arial,sans-serif">old
newfmin time (no GPU): 5 minutes, 44 seconds for 2119 function
evaluations.</div>
<div style="font-size:13px;font-family:arial,sans-serif"><br>
</div>
<div style="font-size:13px;font-family:arial,sans-serif">So
that's 15 times faster for the new code.</div>
<div style="font-size:13px;font-family:arial,sans-serif">
<br>
</div>
<div style="font-size:13px;font-family:arial,sans-serif">I still
haven't figured out what wasn't working on the other setup,
which had the same number of function evaluations in each
case, but somehow wasn't taking advantage of the GPU. I think
it was the model that John scored at 1.3, not the mobile
version with 1.1.</div>
</div>
<div><br>
</div>
<div>Unfortunately, I didn't see any gain in the GPU calcs with
other models. Perhaps this is due to ignorance of what's
required to take advantage of the GPU. The "catage" model with
200 years and 40 ages took 11 seconds with the GPU calcs and 7
seconds with the old newfmin. Also, I sometimes got this error
when switching phases, which I'm guessing is due to the number
of parameters changing from even to odd or vice-versa: "Index
bounds do not match in dvector operator - (const dvector&,
const dvector&)".</div>
<div><br>
</div>
<div>-Ian</div>
<div><br>
<div class="gmail_quote">On Wed, May 16, 2012 at 12:26 PM, John
Sibert <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:sibert@hawaii.edu" target="_blank">sibert@hawaii.edu</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex"><a
moz-do-not-send="true"
href="http://developer.nvidia.com/cuda-gpus"
target="_blank">http://developer.nvidia.com/cuda-gpus</a>
claims compute capability 1.3 for the Quadro FX 3800 card, but
only 1.1 for the Quadro FX 3800M mobile version.<br>
<br>
When you get a chance, it would be nice to know what clinfo
tells you about the Platform Extensions.
<div><br>
<br>
John Sibert<br>
Emeritus Researcher, SOEST<br>
University of Hawaii at Manoa<br>
<br>
Visit the ADMB project <a moz-do-not-send="true"
href="http://admb-project.org/" target="_blank">http://admb-project.org/</a><br>
<br>
<br>
</div>
<div>
On 05/16/2012 08:47 AM, Ian Taylor wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
Unfortunately I don't have time today to play with this.
For the record, my graphics card is NVIDIA Quadro FX
3800.<br>
<br>
After a number of amendments to the newfmin.cpp file
based on Dave's suggestions, it occurred to me that we
have a "branches" directory in the SVN repository to
keep track of such changes.<br>
<br>
There was an old gpu folder in there, which I don't know
anything about. So rather than replace the file in the
src/linad99, I just put the file in the main directory:
/branches/gpu/newfmin.cpp.<br>
Here's a link to the modified file in case anyone else
wants to try it: <a moz-do-not-send="true"
href="http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp"
target="_blank">http://admb-project.org/redmine/projects/issues/repository/entry/branches/gpu/newfmin.cpp</a><br>
<br>
-Ian<br>
<br>
</div>
<div>
<div>
On Wed, May 16, 2012 at 6:27 AM, dave fournier <<a
moz-do-not-send="true"
href="mailto:davef@otter-rsch.com" target="_blank">davef@otter-rsch.com</a>
<mailto:<a moz-do-not-send="true"
href="mailto:davef@otter-rsch.com" target="_blank">davef@otter-rsch.com</a>>>
wrote:<br>
<br>
On 12-05-15 02:59 PM, Ian Taylor wrote:<br>
<br>
I'm 99% sure this is not running on the GPU. You
need to get an<br>
error free run<br>
and this has one error when it tries to compile the
source for the<br>
GPU.<br>
The error message is not correct as it got
duplicated in the code.<br>
But there is an<br>
error. One could find out what the returned error
code is and<br>
look it up<br>
in the cl.h header file.<br>
<br>
<br>
<br>
</div>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
<div> Hi all,<br>
Thanks to help from Dave, I finally got his
example working<br>
(perhaps) on a Windows computer with an Nvidia GPU, using<br>
Microsoft Visual C++. I got an error about
"Error trying to<br>
load Kernel source GPU" (pasted at bottom of
email along with<br>
other warnings that I don't understand), but
using something<br>
called "GPU-Z", I was able to see that the GPU
Load went from 1%<br>
to 99%. Nevertheless, using the GPU only cut the
run time in<br>
half, and the majority of that was achieved with
the BFGS<br>
algorithm without the GPU (USE_GPU_FLAG=0). So
I'm thinking the<br>
GPU is not being utilized correctly, or my GPU is
not as well<br>
suited to this problem as Dave's, or the VC
compiler is not as<br>
well suited as GCC.<br>
<br>
Speed comparison:<br>
new newfmin with GPU: 2 minutes, 19 seconds for
442 function<br>
evaluations.<br>
new newfmin w/o GPU: 2 minutes, 37 seconds for
682 function<br>
evaluations.<br>
old newfmin time (no GPU): 5 minutes, 21 seconds
for 2119<br>
function evaluations.<br>
<br>
I had struggles at various points along the way,
including<br>
installing the correct OpenCL stuff for my GPU,
building ADMB<br>
with or without the new newfmin file, and linking
the bigmin<br>
model to the OpenCL libraries. Everything I know
about C++, I<br>
learned from working with ADMB, so this was a
valuable addition<br>
to my education.<br>
-Ian<br>
<br>
### Here are the warnings and errors ###<br>
<br>
>bigmin -mno 10000 -crit 1.e-10 -nox -nohess<br>
Error trying to open data input file bigmin.dat<br>
command queue created successfully<br>
Number of devices found 1<br>
Error trying to load Kernel source GPU<br>
All buffers created successfully<br>
Program creation code = 0<br>
Program build code = 0<br>
Create Kernel2 error code = 0<br>
Create Kernel error code = 0<br>
Create Kernel3 error code = 0<br>
Create Kernel4 error code = 0<br>
Create Kernel1 error code = 0<br>
<br>
Initial statistics: 6144 variables; iteration 0;
function<br>
evaluation 0; phase 1<br>
...<br>
<br>
<br>
<br>
<br>
On Tue, May 15, 2012 at 10:51 AM, John Sibert
<<a moz-do-not-send="true"
href="mailto:sibert@hawaii.edu" target="_blank">sibert@hawaii.edu</a><br>
</div>
</div>
<div>
<div> <mailto:<a moz-do-not-send="true"
href="mailto:sibert@hawaii.edu" target="_blank">sibert@hawaii.edu</a>>>
wrote:<br>
<br>
I tried to get it working, but did not
succeed. In the<br>
process, I might have learned a few things,
so I have<br>
included a lot of stuff in this email.<br>
<br>
It would be really helpful if others on this
list would also<br>
give it a try and share the results with the
rest of us.<br>
<br>
The main problem I encountered was ignorance of
what (if<br>
anything) needed to be installed on my
computer. Neither the<br>
OpenCL nor the AMD websites offer much
guidance.<br>
<br>
In the end I concluded that my hardware (a
Dell D series<br>
laptop with Nvidia graphics processor
purchased in 2009 and<br>
running Ubuntu 10.04) is unsuitable, probably
not supporting<br>
double precision arithmetic.<br>
<br>
Without installing any new software the
machine comes with<br>
the executable "clinfo" that provides a lot
of information<br>
about the hardware. Sections to note are
"Platform<br>
Extensions: cl_khr_byte_addressable_store
cl_khr_icd<br>
cl_khr_gl_sharing cl_nv_compiler_options<br>
cl_nv_device_attribute_query
cl_nv_pragma_unroll"<br>
and "Extensions: cl_khr_fp64 cl_amd_fp64 ..."
(without the<br>
word "Platform"). If the graphics card
supports double<br>
precision calculations it should report
"cl_khr_fp64<br>
cl_amd_fp64", but note the ambiguity of two
different<br>
"Extensions".<br>
<br>
<br>
Emboldened, I managed to build the bigmin
example without<br>
much drama and<br>
$ ./bigmin -mno 10000 -crit 1.e-10 -nox
-nohess<br>
produced the following<br>
<br>
Error creating command queue ret = -34<br>
Number of devices found 0<br>
No GPU found<br>
<br>
<br>
So I disabled the Nvidia graphics driver and
downloaded<br>
AMD-APP-SDK-v2.6-lnx64.tgz from<br>
<a moz-do-not-send="true"
href="http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx"
target="_blank">http://developer.amd.com/sdks/AMDAPPSDK/downloads/Pages/default.aspx</a><br>
and installed it. After messing around with
linker paths, the<br>
bigmin compiled and linked, but produced the
same run-time<br>
error.<br>
<br>
At this point I concluded that my graphics card does not<br>
support double precision floating point calculations.<br>
<br>
A bit of work with google turned up some more
information.<br>
<br>
<a moz-do-not-send="true"
href="http://developer.nvidia.com/cuda-gpus"
target="_blank">http://developer.nvidia.com/cuda-gpus</a><br>
lists Nvidia graphics processors and their
"compute<br>
capability". The entry for mine is Quadro NVS
135M compute<br>
capability 1.1<br>
<br>
<a moz-do-not-send="true"
href="http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html"
target="_blank">http://www.herikstad.net/2009/05/cuda-and-double-precision-floating.html</a><br>
offers some interpretation of compute
capability:<br>
<br>
To enable the use of doubles inside CUDA
kernels you<br>
first need to<br>
make sure you have a CUDA Compute
1.3-capable card. These<br>
are the newer<br>
versions of the nVidia CUDA cards such as
the GTX 260,<br>
GTX 280, Quadro<br>
FX 5800, and Tesla S1070 and C1060.
Thereby you have to<br>
add a command<br>
line option to the nvcc compiler:
--gpu-architecture sm_13.<br>
<br>
The ever-helpful wikipedia entry for CUDA<br>
<a moz-do-not-send="true"
href="http://en.wikipedia.org/wiki/CUDA"
target="_blank">http://en.wikipedia.org/wiki/CUDA</a>
agrees<br>
<br>
CUDA (with compute capability 1.x) uses a
recursion-free,<br>
function-pointer-free subset of the C
language, plus some<br>
simple<br>
extensions. However, a single process
must run spread<br>
across multiple<br>
disjoint memory spaces, unlike other C
language runtime<br>
environments.<br>
<br>
CUDA (with compute capability 2.x) allows
a subset of C++<br>
class<br>
functionality, for example member
functions may not be<br>
virtual (this<br>
restriction will be removed in some
future release). [See<br>
CUDA C<br>
Programming Guide 3.1 - Appendix D.6]<br>
<br>
Double precision (CUDA compute capability
1.3 and above)<br>
deviates<br>
from the IEEE 754 standard:
round-to-nearest-even is the<br>
only supported<br>
rounding mode for reciprocal, division,
and square root.<br>
In single<br>
precision, denormals and signalling NaNs
are not<br>
supported; only two<br>
IEEE rounding modes are supported (chop
and<br>
round-to-nearest even), and<br>
those are specified on a per-instruction
basis rather<br>
than in a control<br>
word; and the precision of
division/square root is<br>
slightly lower than<br>
single precision.<br>
<br>
<br>
So you need a graphics processor with compute
capability 1.3<br>
and above.<br>
<br>
I would urge everyone to try to get this
example running and<br>
share your experiences. OpenCL looks like
a promising way<br>
to parallelize some applications. The
overview document<br>
<a moz-do-not-send="true"
href="http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf"
target="_blank">http://www.khronos.org/assets/uploads/developers/library/overview/opencl-overview.pdf</a><br>
implies that it might be possible to tune an
application to<br>
use either GPU or multiple cores on a
cluster. Unfortunately<br>
the learning curve is steep (ask Dave) and
the documentation<br>
is thin.<br>
<br>
Happy hacking,<br>
John<br>
<br>
<br>
<br>
<br>
John Sibert<br>
Emeritus Researcher, SOEST<br>
University of Hawaii at Manoa<br>
<br>
Visit the ADMB project <a
moz-do-not-send="true"
href="http://admb-project.org/" target="_blank">http://admb-project.org/</a><br>
<br>
<br>
<br>
On 05/12/2012 05:31 AM, dave fournier wrote:<br>
<br>
Has anyone else actually got this example
to work?<br>
<br>
Some advice. Older GPUs (whatever that means) probably<br>
do not support double precision.<br>
<br>
WRT using the BFGS update on the CPU. It
does not seem<br>
to perform as well as doing it on the GPU.
I think this is<br>
due to roundoff error. The CPU is
carrying out additions<br>
in a different<br>
way. It may be that with say 4K or more
parameters and this<br>
(artificial) example roundoff error
becomes important.<br>
<br>
I stored the matrix by rows. It now
appears that it<br>
should be stored<br>
by columns for the fastest matrix *
vector multiplication.<br>
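Roughly, the reason is memory coalescing: with column-major storage,
consecutive work-items read consecutive matrix elements at each step. A
hypothetical OpenCL C sketch of the idea (not the actual bigmin kernel):<br>
<pre>
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// y = A*x with A stored column-major: A[i + j*n] is row i, column j.
// Work-item i accumulates row i; at each j, neighbouring work-items read
// neighbouring addresses A[i + j*n], A[i+1 + j*n], ... so the loads coalesce.
__kernel void matvec_colmajor(__global const double* A,
                              __global const double* x,
                              __global double* y,
                              const int n)
{
  int i = get_global_id(0);
  if (i >= n) return;
  double sum = 0.0;
  for (int j = 0; j < n; j++)
    sum += A[i + j * n] * x[j];
  y[i] = sum;
}
</pre>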
<br>
<br>
<br>
_______________________________________________<br>
Developers mailing list<br>
<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a><br>
</div>
</div>
<mailto:<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>>
<div><br>
<a moz-do-not-send="true"
href="http://lists.admb-project.org/mailman/listinfo/developers"
target="_blank">http://lists.admb-project.org/mailman/listinfo/developers</a><br>
<br>
_______________________________________________<br>
Developers mailing list<br>
</div>
<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>
<mailto:<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>>
<div>
<br>
<a moz-do-not-send="true"
href="http://lists.admb-project.org/mailman/listinfo/developers"
target="_blank">http://lists.admb-project.org/mailman/listinfo/developers</a><br>
<br>
<br>
<br>
<br>
_______________________________________________<br>
Developers mailing list<br>
</div>
<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>
<mailto:<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>><br>
<a moz-do-not-send="true"
href="http://lists.admb-project.org/mailman/listinfo/developers"
target="_blank">http://lists.admb-project.org/mailman/listinfo/developers</a><br>
</blockquote>
<br>
<br>
_______________________________________________<br>
Developers mailing list<br>
<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>
<mailto:<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a>>
<div>
<br>
<a moz-do-not-send="true"
href="http://lists.admb-project.org/mailman/listinfo/developers"
target="_blank">http://lists.admb-project.org/mailman/listinfo/developers</a><br>
<br>
<br>
<br>
<br>
_______________________________________________<br>
Developers mailing list<br>
<a moz-do-not-send="true"
href="mailto:Developers@admb-project.org"
target="_blank">Developers@admb-project.org</a><br>
<a moz-do-not-send="true"
href="http://lists.admb-project.org/mailman/listinfo/developers"
target="_blank">http://lists.admb-project.org/mailman/listinfo/developers</a><br>
</div>
</blockquote>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>