Outstanding jan-japp!
I had to make the following code changes to get it to run on the Tezro. I guess you are using the (older?) CBLAS routines as opposed to the SCSL C interface.
man INTRO_BLAS3 wrote:
NOTE: SCSL supports two different C interfaces to the BLAS:
* The C interface described in this man page and in individual BLAS man
pages follows the same conventions used for the C interface to the
SCSL signal processing library.
* SCSL also supports the C interface to the legacy BLAS set forth by
the BLAS Technical Forum. This interface supports row-major storage
of multidimensional arrays; see INTRO_CBLAS(3S) for details.
Here's the diff output:
Code:
~:diff test_double.c test_double.orig.c
5,6c5,6
< /* #include <cblas.h> */
< #include <scsl_blas.h>
---
> #include <cblas.h>
>
42c42
< dgemm("N", "N", n, n, n, alpha, &a[0], n, &b[0], n, beta, &c[0], n);
---
> dgemm(NoTranspose, NoTranspose, n, n, n, alpha, &a[0], n, &b[0], n, beta, &c[0], n);
and the compile line changes (for me) to:
cc -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn
The result is:
test_double
Using default value, n = 128
Elapsed time to multiply two matrices of order 128: 0.003320
Now, to be fair, I re-ran the original test_quad in "double" mode (replacing long double with double) and got:
~:cc -fullwarn -O3 -Ofast=IP35 -IPA -o test_quad-double test_quad-double.c
~:test_quad-double
Using default value, n = 128
size of double: 8
Elapsed time to multiply two matrices of order 128: 0.002489
which was faster (!) than the BLAS routine on my machine. Naive matrix mult not so bad for general matrix types?
Another test, in 64-bit mode:
cc -64 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn
test_double
Using default value, n = 128
Elapsed time to multiply two matrices of order 128: 0.003326
no change from -n32. But switch in the multiprocessor version of the SCSL library and:
cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs_mp -lftn
test_double
Using default value, n = 128
Elapsed time to multiply two matrices of order 128: 0.045474
which is slower on the small matrix, so let's try the 1024x1024.
cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn
Using default value, n = 1024
Elapsed time to multiply two matrices of order 1024: 1.320950
cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs_mp -lftn
test_double
Using default value, n = 1024
Elapsed time to multiply two matrices of order 1024: 0.941582
As expected, on larger problems the multiprocessing starts to pay off.
The message that keeps getting shoved into my face again and again with coding for performance (math or graphics) is "Know your hardware, know your problem," and find the fast path for your situation. As an addendum I'd add that this implies high-performance cross-platform code seems to be a pipe dream.