From otter at otter-rsch.com Mon Sep 15 12:03:48 2014
From: otter at otter-rsch.com (otter)
Date: Mon, 15 Sep 2014 12:03:48 -0700
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
Message-ID: <54173814.3030902@otter-rsch.com>

After much pain I have produced a C++ version of the LU decomposition which
is suitable for producing adjoint code for ADMB and perhaps CppAD. (Don't
know what adjoint code for CppAD is as yet!) This code is about 25 times
faster than the current ADMB code for a 2,000 x 2,000 matrix and about 4 times
slower than the OpenBLAS code, which contains optimized assembler and Fortrash.
Any suggestions for improvement would be welcome. Per usual I will hold my
breath.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: unblocked.cpp
Type: text/x-c++src
Size: 2692 bytes

From johnoel at hawaii.edu Mon Sep 15 12:14:51 2014
From: johnoel at hawaii.edu (Johnoel Ancheta)
Date: Mon, 15 Sep 2014 09:14:51 -1000
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
In-Reply-To: <54173814.3030902@otter-rsch.com>
References: <54173814.3030902@otter-rsch.com>

Thanks Dave! I'll provide feedback after testing...
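Since unblocked.cpp was scrubbed from the archive, here is a minimal sketch of what an unblocked, in-place LU decomposition typically looks like in plain C++. It is not Dave's code: the name `lu_unblocked`, the row-major `std::vector<double>` storage and the `perm` output are assumptions for illustration, and partial pivoting is included only because it is the usual choice (the thread does not say whether the attached code pivots). The point is the simple triple-loop structure, with no calls into optimized libraries.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical sketch: in-place LU decomposition with partial pivoting of a
// row-major n x n matrix stored contiguously in `a`. On return the strictly
// lower triangle holds L (unit diagonal implied), the upper triangle holds U,
// and perm[i] is the original row that ended up in row i.
void lu_unblocked(std::vector<double>& a, std::size_t n,
                  std::vector<std::size_t>& perm)
{
  assert(a.size() == n * n);
  perm.resize(n);
  for (std::size_t i = 0; i < n; ++i) perm[i] = i;

  for (std::size_t k = 0; k < n; ++k) {
    // Pivot search: largest |a(i,k)| for i >= k.
    std::size_t p = k;
    double best = std::fabs(a[k * n + k]);
    for (std::size_t i = k + 1; i < n; ++i) {
      const double v = std::fabs(a[i * n + k]);
      if (v > best) { best = v; p = i; }
    }
    if (best == 0.0) throw std::runtime_error("matrix is singular");

    // Swap whole rows (including already computed multipliers).
    if (p != k) {
      for (std::size_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[p * n + j]);
      std::swap(perm[k], perm[p]);
    }

    const double pivot = a[k * n + k];
    for (std::size_t i = k + 1; i < n; ++i) {
      const double l = a[i * n + k] / pivot;
      a[i * n + k] = l;  // store the multiplier in L
      // Rank-1 update of row i; the inner loop has unit stride.
      for (std::size_t j = k + 1; j < n; ++j) a[i * n + j] -= l * a[k * n + j];
    }
  }
}
```

After the call, row `perm[i]` of the original matrix equals row i of L*U, which is the usual P*A = L*U convention.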

_______________________________________________
Developers mailing list
Developers at admb-project.org
http://lists.admb-project.org/mailman/listinfo/developers

From davef at otter-rsch.com Mon Sep 15 12:19:50 2014
From: davef at otter-rsch.com (dave fournier)
Date: Mon, 15 Sep 2014 12:19:50 -0700
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
References: <54173814.3030902@otter-rsch.com>
Message-ID: <54173BD6.4060307@otter-rsch.com>

If you want, I have code to call the OpenBLAS version as well as a blocked
version which stores the relevant matrices in blocks as recommended by
Dongarra et al. (but it does not seem to be worth the effort).
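The blocked version's storage scheme is not shown in the thread, so the following is only a sketch under assumptions: an n x n row-major matrix is copied into "tiled" order, where each m x m tile occupies m*m consecutive doubles and the tiles are ordered block row by block row. The names `tile_index` and `to_tiled` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tiled (block) storage: the n x n matrix is cut into
// (n/m) x (n/m) tiles of size m x m, each stored contiguously and row-major
// inside the tile. Assumes n % m == 0.
inline std::size_t tile_index(std::size_t i, std::size_t j,
                              std::size_t n, std::size_t m)
{
  const std::size_t tiles_per_row = n / m;
  const std::size_t tile = (i / m) * tiles_per_row + (j / m); // which tile
  return tile * m * m + (i % m) * m + (j % m);                // offset inside it
}

// Copy a row-major n x n matrix into tiled storage.
std::vector<double> to_tiled(const std::vector<double>& a,
                             std::size_t n, std::size_t m)
{
  assert(m > 0 && n % m == 0 && a.size() == n * n);
  std::vector<double> t(n * n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      t[tile_index(i, j, n, m)] = a[i * n + j];
  return t;
}
```

For n = 4 and m = 2, the element in row 2, column 3 lands at offset 13 (tile 3, local position (0, 1)). The idea is that each tile is one contiguous chunk, which can improve cache reuse in blocked kernels; as noted above, the gain may not justify the extra bookkeeping.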

From johnoel at hawaii.edu Mon Sep 15 12:23:02 2014
From: johnoel at hawaii.edu (Johnoel Ancheta)
Date: Mon, 15 Sep 2014 09:23:02 -1000
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
In-Reply-To: <54173BD6.4060307@otter-rsch.com>
References: <54173814.3030902@otter-rsch.com> <54173BD6.4060307@otter-rsch.com>

Sure, it would be nice to do the comparison.

Thanks in advance,

Johnoel

From davef at otter-rsch.com Mon Sep 15 12:38:22 2014
From: davef at otter-rsch.com (dave fournier)
Date: Mon, 15 Sep 2014 12:38:22 -0700
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
References: <54173814.3030902@otter-rsch.com> <54173BD6.4060307@otter-rsch.com>
Message-ID: <5417402E.9000201@otter-rsch.com>

To compile it I used something like

  CXX=g++
  ADMB_HOME=~/admodel
  echo "!!! Note using the version found in directory "
  echo ${ADMB_HOME}
  # $1 is the base name of the .cpp file to compile
  ${CXX} -march=native -ggdb -Ofast -funroll-loops -DOPT_LIB -ffast-math -pthread -DUSE_PTHREADS -W -fpermissive -DUSE_LAPLACE -Dlinux \
    avx_vecdot.o \
    -L/home/dave/opt/OpenBLAS/lib \
    -I/home/dave/opt/OpenBLAS/include \
    -D__GNUDOS__ -o$1 $1.cpp \
    -I. -I${ADMB_HOME}/include \
    -I/home/dave/include \
    -L${ADMB_HOME}/lib \
    -L/home/dave/lib \
    -ladmbo \
    -lopenblas \
    -lgfortran

Near the top of main you set the matrix size n and the block size m. At
present n must be a multiple of m.

  int main()
  {
    ad_set_new_handler();
    ad_exit=&ad_boundf;
    int n=2000;        // we will work with an nxn symmetric matrix
    const int m=100;   // block size for blocked LU code

To restrict OpenBLAS to one thread for comparison, you can set the
environment variable

  export OPENBLAS_NUM_THREADS=1
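The requirement that n be a multiple of m comes from working on whole m x m blocks. As a hedged illustration of the general technique (not the contents of the attached blocked code), here is a right-looking blocked LU in plain C++ on row-major storage. To stay short it omits pivoting, so it assumes no zero pivot arises (for example, a diagonally dominant matrix); `lu_blocked` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: right-looking blocked LU (no pivoting) of a row-major
// n x n matrix, in place, with block size m. Assumes n % m == 0 and nonzero
// pivots. On return L (unit diagonal implied) and U share the storage of `a`.
void lu_blocked(std::vector<double>& a, std::size_t n, std::size_t m)
{
  assert(m > 0 && n % m == 0 && a.size() == n * n);
  for (std::size_t kb = 0; kb < n; kb += m) {
    const std::size_t ke = kb + m;  // one past the last column of this block

    // 1. Unblocked LU of the panel: columns kb..ke-1, rows kb..n-1.
    for (std::size_t k = kb; k < ke; ++k)
      for (std::size_t i = k + 1; i < n; ++i) {
        a[i * n + k] /= a[k * n + k];
        for (std::size_t j = k + 1; j < ke; ++j)
          a[i * n + j] -= a[i * n + k] * a[k * n + j];
      }

    // 2. Block row of U: solve L11 * U12 = A12 (L11 unit lower triangular).
    for (std::size_t k = kb; k < ke; ++k)
      for (std::size_t i = k + 1; i < ke; ++i)
        for (std::size_t j = ke; j < n; ++j)
          a[i * n + j] -= a[i * n + k] * a[k * n + j];

    // 3. Trailing update A22 -= L21 * U12 (a matrix-multiply-shaped step).
    for (std::size_t i = ke; i < n; ++i)
      for (std::size_t k = kb; k < ke; ++k) {
        const double l = a[i * n + k];
        for (std::size_t j = ke; j < n; ++j)
          a[i * n + j] -= l * a[k * n + j];
      }
  }
}
```

With m = 1 this reduces to the ordinary unblocked algorithm. When m is much smaller than n, step 3 does almost all of the floating-point work, and because it has the shape of a matrix multiply, blocked LU codes commonly hand it to an optimized GEMM.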

-------------- next part --------------
A non-text attachment was scrubbed...
Name: yblocklu10.cpp
Type: text/x-c++src
Size: 18081 bytes

From sibert at hawaii.edu Mon Sep 15 14:07:39 2014
From: sibert at hawaii.edu (John Sibert)
Date: Mon, 15 Sep 2014 11:07:39 -1000
Subject: [Developers] faster ludecomp in C++ for creating adjoint code
In-Reply-To: <54173814.3030902@otter-rsch.com>
References: <54173814.3030902@otter-rsch.com>

Jolly good, old chap. (Headed to Heathrow.)

Cheers.
Sent from my phone.

From davef at otter-rsch.com Tue Sep 16 11:35:42 2014
From: davef at otter-rsch.com (dave fournier)
Date: Tue, 16 Sep 2014 11:35:42 -0700
Subject: [Developers] c++ amp with admb
Message-ID: <541882FE.1090508@otter-rsch.com>

A while back Matthew pointed out to me that development on C++ AMP for Linux
is an ongoing project which should be watched. For those who don't know, AMP
is a Microsoft development which is supposed to greatly simplify parallel
processing on GPUs/CPUs. Having done the LU decomposition code, I wondered if
it could benefit from C++ AMP. That is a bit difficult for a newbie, so I
decided to start with a simple matrix multiplication example. I have attached
the code. The zip file includes a script for compiling the example.

  class dvector
  {
    // ...
  public:
    double * data (void) const { return v+index_min; }
  };

  class dmatrix
  {
    // ...
  public:
    double * data (void) const { return m[index_min].data(); }
  };

One technicality is that the AMP library expects the matrix or vector class
to have a member function data which returns a pointer to the numbers in the
vector. I added this to the dvector and dmatrix classes. Another technicality
is that the AMP library expects the matrix to be stored in contiguous memory.
I have accomplished this for the dmatrices in the example by using the
make_dmatrix function. Otherwise all seems to work, except that the AMP code
for a 2,000 x 2,000 matrix runs less than half as fast as the ADMB matrix
multiply. I have no GPU driver on my laptop, so AMP is using the CPU. This
works, as top reports CPU usage of 792%, which indicates that 8 threads are
running.
Suggestions for improvement would be welcome.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: amp_mpy.zip
Type: application/zip
Size: 1984 bytes
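The two AMP technicalities in the last message (a data() member returning a pointer to the numbers, and contiguous storage) can be illustrated without ADMB. The sketch below is a hypothetical stand-alone class, not ADMB's dvector/dmatrix and not make_dmatrix: all entries live in one std::vector<double> buffer, data() exposes it, and m[i][j] indexing still works because operator[] returns a pointer to the start of row i. The plain i-k-j multiply gives a single-threaded CPU baseline; the C++ AMP call itself is left out because the attached code is not in the archive. The names `contiguous_matrix` and `multiply` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical contiguous matrix: all rows*cols entries live in one buffer,
// so data() can hand a single pointer to a library that needs contiguous memory.
class contiguous_matrix {
public:
  contiguous_matrix(std::size_t rows, std::size_t cols)
    : rows_(rows), cols_(cols), buf_(rows * cols, 0.0) {}

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }

  double* data() { return buf_.data(); }
  const double* data() const { return buf_.data(); }

  // m[i][j] indexing: operator[] returns a pointer to the start of row i.
  double* operator[](std::size_t i) { return buf_.data() + i * cols_; }
  const double* operator[](std::size_t i) const { return buf_.data() + i * cols_; }

private:
  std::size_t rows_, cols_;
  std::vector<double> buf_;
};

// Plain C = A * B with i-k-j loop order, so the innermost loop walks rows of
// B and C with unit stride.
contiguous_matrix multiply(const contiguous_matrix& A, const contiguous_matrix& B)
{
  assert(A.cols() == B.rows());
  contiguous_matrix C(A.rows(), B.cols());
  for (std::size_t i = 0; i < A.rows(); ++i)
    for (std::size_t k = 0; k < A.cols(); ++k) {
      const double aik = A[i][k];
      for (std::size_t j = 0; j < B.cols(); ++j)
        C[i][j] += aik * B[k][j];
    }
  return C;
}
```

Because the rows are slices of one buffer, row i+1 starts exactly cols() doubles after row i, which is the property a contiguous-memory requirement is asking for.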