SGI: Development

quad-precision benchmark

Hi,
This is kind of an outgrowth of the debate on R10k/Itanium architecture over in the thread about the 1GHz R16Ks. Anyway, I figured I'd make a little benchmark; the code is at the bottom of this post. I was wondering if anyone could compile it with MIPSpro and optimize for the R10000. I'm getting bummer numbers...although the Pentium's technically not doing the same thing. OK, so two requests:
1) Can someone figure out why the P4 won't do 128-bit with long double? Do I have to do it by hand? I thought this was what SSE2 was for?
2) Can someone compile test_quad.c with MIPSpro optimized for the R10k?
THANKS!

here's what I got:
nekochan's gcc 3.4.0 vs. Gentoo's 3.4.5
O2k on 1 R10k 250

Code: Select all

-bash-2.05b$ gcc -mtune=r8k -o test test_quad.c
-bash-2.05b$ ./test
Using default value, n = 128
size of long double: 16

Elapsed time to multiply two matrices of order 128: 30.121496

And Dell on 1 Xeon 3.06

Code: Select all

[email protected] ~ $ gcc -o test -mtune=pentium4 test_quad.c
[email protected] ~ $ ./test
Using default value, n = 128
size of long double: 12

Elapsed time to multiply two matrices of order 128: 0.066819


and finally, the code:
test_quad.c

times are in seconds...
Ninety-nine percent of who you are is invisible and untouchable.
-R. Buckminster Fuller
epitaxial bandgap wrote: and finally, the code:
test_quad.c

times are in seconds...


No go on the link, bub.
-ks

:Onyx: :Onyx: :Crimson: :O2000: :Onyx2: :Fuel: :Octane: :Octane2: :PI: :Indigo: :Indigo: :O2: :O2: :Indigo2: :Indigo2: :Indigo2IMP: :Indy: :320: :540: :O3x0: :1600SW: :1600SW: :hpserv:

See them all >here<
I knew I was supposed to do something before getting dinner. Fixed, thanks!
Tut-tut! Bad gcc/C++ habits!
cc-1098 cc: ERROR File = test_quad.c, Line = 7
An array cannot have elements of the indicated type.

void matrix_mult(long double A[][], long double B[][], long double C[][], int n);
^

cc-1098 cc: ERROR File = test_quad.c, Line = 7
An array cannot have elements of the indicated type.

void matrix_mult(long double A[][], long double B[][], long double C[][], int n);
^

cc-1098 cc: ERROR File = test_quad.c, Line = 7
An array cannot have elements of the indicated type.

void matrix_mult(long double A[][], long double B[][], long double C[][], int n);
^

cc-1098 cc: ERROR File = test_quad.c, Line = 8
An array cannot have elements of the indicated type.

void matrix_print(long double M[][], FILE * nout, int n);
^

cc-1098 cc: ERROR File = test_quad.c, Line = 9
An array cannot have elements of the indicated type.

void matrix_rand(long double M[][], int n);
^

cc-1241 cc: ERROR File = test_quad.c, Line = 27
A declaration cannot appear after an executable statement in a block.

long double A[MAX_DIM][MAX_DIM];
^

cc-1241 cc: ERROR File = test_quad.c, Line = 28
A declaration cannot appear after an executable statement in a block.

long double B[MAX_DIM][MAX_DIM];
^

cc-1241 cc: ERROR File = test_quad.c, Line = 29
A declaration cannot appear after an executable statement in a block.

long double C[MAX_DIM][MAX_DIM];
^

cc-1551 cc: WARNING File = test_quad.c, Line = 34
The variable "t1" is used before its value is set.

t1 += (double)tp.tv_sec+(1.e-6)*tp.tv_usec;
^

cc-1551 cc: WARNING File = test_quad.c, Line = 37
The variable "t2" is used before its value is set.

t2 += (double)tp.tv_sec+(1.e-6)*tp.tv_usec;
^

cc-1552 cc: WARNING File = test_quad.c, Line = 14
The variable "rtn" is set but never used.

int rtn, i;
^

cc-1174 cc: WARNING File = test_quad.c, Line = 14
The variable "i" was declared but never referenced.

int rtn, i;
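For the record, both classes of error are plain C constraints rather than MIPSpro pedantry: in a parameter declaration every array dimension except the outermost must be a constant, and C89 forbids declarations after the first statement in a block. A minimal illustration (names are made up):

```c
#define MAX_DIM 8

/* cc-1098: only the outermost dimension of an array parameter may
 * be omitted; the inner ones must be constants so the compiler can
 * compute row strides. This form compiles... */
long double first_elem(long double A[][MAX_DIM])
{
    return A[0][0];
}
/* ...whereas "long double A[][]" has an incomplete element type.
 *
 * cc-1241: in C89 all declarations in a block must precede the
 * first statement, so a "long double A[MAX_DIM][MAX_DIM];" that
 * appears after a printf() call must be hoisted to the top of the
 * block (or the file compiled as C99, which allows mixing). */
```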



Running the modified code (below) on one 1GHz R16K with MIPSpro C 7.4.4m, I get:

~:cc -64 -IPA -Ofast=IP30 test_quad.c -o test_quad
cc-1178 cc: WARNING File = test_quad.c, Line = 34
Argument is incompatible with the corresponding format string conversion.

printf("size of long double: %d\n", sizeof(long double));
^

~:test_quad
Using default value, n = 128
size of long double: 16

Elapsed time to multiply two matrices of order 128: 0.242758


More precision, but slower than the Dell box. Would be interesting to see how the IRIX scientific library (SCSL) would handle things.

The poor performance of your previous test on the O2K would be down to gcc?


And the more C/IRIX friendly code:

Code: Select all

#include <math.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <stdlib.h>

#define MAX_DIM 256

void matrix_mult(long double A[][MAX_DIM], long double B[][MAX_DIM], long double C[][MAX_DIM], int n);
void matrix_print(long double M[][MAX_DIM], FILE * nout, int n);
void matrix_rand(long double M[][MAX_DIM], int n);

int main(int argc, char ** argv){
    int n;
    double t1=0.0, t2=0.0, elapsed;
    timespec_t tp[2];
    FILE *nout;
    long double A[MAX_DIM][MAX_DIM];
    long double B[MAX_DIM][MAX_DIM];
    long double C[MAX_DIM][MAX_DIM];

    memset( C, 0, MAX_DIM*MAX_DIM*sizeof(long double) );
    nout = fopen("/dev/null", "r+");
    if ( !nout ) {
        fprintf(stderr,"open failed\n");
        exit(-1);
    }

    n = 128; /* default order of matrix */
    if(argc < 2)
        printf("Using default value, n = 128\n");
    else{
        n = atoi(argv[1]);
        printf("Using n = %d\n", n);
    }

    printf("size of long double: %d\n", sizeof(long double));

    matrix_rand(A, n);
    matrix_rand(B, n);
    clock_gettime(CLOCK_SGI_CYCLE, &tp[0]);
    matrix_mult(A, B, C, n);
    clock_gettime(CLOCK_SGI_CYCLE, &tp[1]);

    t1 += (double)tp[0].tv_sec+(1.e-9)*tp[0].tv_nsec;
    t2 += (double)tp[1].tv_sec+(1.e-9)*tp[1].tv_nsec;
    matrix_print(C, nout, n);

    elapsed = t2 - t1;
    printf("\nElapsed time to multiply two matrices of order %d: %f\n", n, elapsed);

    return(0);
}
/*****************************************************************************/
void matrix_mult(long double A[][MAX_DIM], long double B[][MAX_DIM], long double C[][MAX_DIM], int n){
    int i, j, k;
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            for(k = 0; k < n; k++)
                C[i][j] += A[i][k]*B[k][j];
}
/*****************************************************************************/
void matrix_print(long double M[][MAX_DIM], FILE * nout, int n){
    int i, j;
    for(i = 0; i < n; i++){
        for(j = 0; j < n; j++){
            fprintf(nout, "%5.5Lf  ", M[i][j]);
        }
    }
}
/*****************************************************************************/
void matrix_rand(long double M[][MAX_DIM], int n){
    int i, j;
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++)
            M[i][j] = (double)rand()/12.3;
}
squeen wrote: Running modified code (below) on 1 1Ghz R16K, I get

-Ofast=IP30



Shouldn't that be -Ofast=IP35 in your test case? Maybe it doesn't make much of a difference.
Twitter: @neko_no_ko
IRIX Release 4.0.5 IP12 Version 06151813 System V
Copyright 1987-1992 Silicon Graphics, Inc.
All Rights Reserved.
Yup. I originally ran it on the Octane (0.437848 sec). The Tezro run time was the same, with IP35.

Thanks neko, good catch.
OK, the same machine (R16K) that ran 0.24 sec with MIPSpro ran 7.37 sec with gcc (-O3), roughly 30 times slower.
squeen wrote: OK, the same machine (R16K) that ran 0.24 sec with MIPSpro ran 7.37 sec with gcc (-O3), roughly 30 times slower.


Is it any wonder why we use MIPSpro for Nekoware? ;)
squeen wrote: Running modified code (below) on 1 1Ghz R16K with MIPSpro C 7.4.4m, I get

Elapsed time to multiply two matrices of order 128: 0.242758



And a humble R10k Indigo2 running IRIX 6.2, MIPSpro 7.2.1.3m:
Elapsed time to multiply two matrices of order 128: 2.199671


Code: Select all

void matrix_mult(long double A[][MAX_DIM], long double B[][MAX_DIM], long double C[][MAX_DIM], int n){
int i, j, k;
for(i = 0; i < n; i++)
for(j = 0; j < n; j++)
for(k = 0; k < n; k++)
C[i][j] +=  A[i][k]*B[k][j];
}


That is the most naive matrix mult routine I've seen in a while :wink:
Expect a 9~11x speedup if you replace this with optimized routines, such as those found in SCSL or ATLAS. Or, to put it differently, my lowly Indigo2 would be faster than the 1GHz Tezro. Algorithm optimization rules -- you can't beat that with compiler switches!
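Most of that 9~11x comes from cache blocking rather than exotic math. A minimal tiled variant of the same triple loop gives the flavor (the block size of 32 is an illustrative guess; real libraries like SCSL and ATLAS tune block sizes and unrolling per CPU):

```c
#define BS 32  /* illustrative block size; real libraries tune this */

/* C += A*B on n x n matrices stored row-major in flat arrays.
 * Same arithmetic as the naive triple loop, but iterating over
 * BS x BS tiles keeps each tile of A, B and C hot in cache. */
void matrix_mult_blocked(const double *A, const double *B, double *C, int n)
{
    int i0, j0, k0, i, j, k;
    for (i0 = 0; i0 < n; i0 += BS)
        for (k0 = 0; k0 < n; k0 += BS)
            for (j0 = 0; j0 < n; j0 += BS)
                for (i = i0; i < i0 + BS && i < n; i++)
                    for (k = k0; k < k0 + BS && k < n; k++) {
                        double aik = A[i*n + k];
                        for (j = j0; j < j0 + BS && j < n; j++)
                            C[i*n + j] += aik * B[k*n + j];
                    }
}
```

Same flops, same answer; the only change is that each tile gets reused from cache instead of streaming from memory on every pass.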
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
jan-jaap wrote: Algorithm optimization rules -- you can't beat that with compiler switches!


Amen brother!

BTW what clock speed on the Indigo2?
Hmm...looks like blas3 (man INTRO_BLAS3) doesn't support quad precision, just single (SGEMM) and double (DGEMM). Still, a double-precision test might be fun.
squeen wrote:
jan-jaap wrote: Algorithm optimization rules -- you can't beat that with compiler switches!


Amen brother!

BTW what clock speed on the Indigo2?


R10k @ 195MHz.

squeen wrote: Hmm...looks like blas3 (man INTRO_BLAS3) doesn't support quad precision, just single (SGEMM) and double (DGEMM). Still, a double precision test might be fun.


Yes, and given that BLAS, LAPACK and co. are the defacto standard libraries for this sort of business should tell you something about the real world value of quad precision floating point. In my experience with numerical solvers, everything beyond the 13th decimal was noise, and you can do that easily with double precision.
Believe it or not, it does come up at times in engineering. We are looking at advanced spacecraft formations thousands of kilometers apart with nanometer precision (1e6 m resolved to 1e-9 m spans 15 decimal digits), which is right at the tail end of double precision.

Code: Select all

#include <math.h>
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#include <cblas.h>

void matrix_mult(double* a, double* b, double* c, int n);
void matrix_rand(double* d, int nn);

int main(int argc, char ** argv)
{
    int n;
    double t1=0.0, t2=0.0, elapsed;
    timespec_t tp[2];
    double *a, *b, *c;

    n = 128; /* default order of matrix */
    a = (double*)calloc(n*n, sizeof(double));
    b = (double*)calloc(n*n, sizeof(double));
    c = (double*)calloc(n*n, sizeof(double));
    printf("Using default value, n = %d\n", n);

    matrix_rand(a, n*n);
    matrix_rand(b, n*n);

    clock_gettime(CLOCK_SGI_CYCLE, &tp[0]);
    matrix_mult(a, b, c, n);
    clock_gettime(CLOCK_SGI_CYCLE, &tp[1]);

    t1 += (double)tp[0].tv_sec+(1.e-9)*tp[0].tv_nsec;
    t2 += (double)tp[1].tv_sec+(1.e-9)*tp[1].tv_nsec;
    /*    matrix_print(C, nout, n); */
    elapsed = t2 - t1;
    printf("\nElapsed time to multiply two matrices of order %d: %f\n", n, elapsed);

    return(0);
}
/*****************************************************************************/
void matrix_mult(double* a, double* b, double* c, int n)
{
    double alpha=1.0, beta=0.0;
    dgemm(NoTranspose, NoTranspose, n, n, n, alpha, &a[0], n, &b[0], n, beta, &c[0], n);
}
/*****************************************************************************/
void matrix_rand(double* d, int nn)
{
    int i;
    for(i = 0; i < nn; i++)
        d[i] = (double)rand()/12.3;
}


Then:

Code: Select all

cc -fullwarn -O3 -Ofast=IP28 -IPA -o test_double test_double.c -lblas -lftn
ld32: WARNING 85: definition of main in test_double.o preempts that definition in /usr/lib32/mips4/libftn.so.
ld32: WARNING 85: definition of main in test_double.ipaa0064e/test_double.o preempts that definition in /usr/lib32/mips4/libftn.so.

And voila:

Code: Select all

./test_double
Using default value, n = 128

Elapsed time to multiply two matrices of order 128: 0.014334

Eat dust, you ugly Dell :shock: :shock:

And finally, I'd like to see you do a problem of slightly less trivial size :twisted:

Code: Select all

Elapsed time to multiply two matrices of order 1024: 8.305315
Outstanding, jan-jaap!

I had to make the following code changes to get it to run on the Tezro. I guess you are using the (older?) CBLAS routines as opposed to the SCSL C interface.
man INTRO_BLAS3 wrote: NOTE: SCSL supports two different C interfaces to the BLAS:

* The C interface described in this man page and in individual BLAS man
pages follows the same conventions used for the C interface to the
SCSL signal processing library.

* SCSL also supports the C interface to the legacy BLAS set forth by
the BLAS Technical Forum. This interface supports row-major storage
of multidimensional arrays; see INTRO_CBLAS(3S) for details.



Here's the diff output:

Code: Select all

~:diff test_double.c test_double.orig.c
5,6c5,6
< /* #include <cblas.h> */
< #include <scsl_blas.h>
---
> #include <cblas.h>
>
42c42
<     dgemm("N", "N", n, n, n, alpha, &a[0], n, &b[0], n, beta, &c[0], n);
---
>     dgemm(NoTranspose, NoTranspose, n, n, n, alpha, &a[0], n, &b[0], n, beta, &c[0], n);


and the compile line changes (for me) to:
cc -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn


The result is
test_double
Using default value, n = 128

Elapsed time to multiply two matrices of order 128: 0.003320
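For reference, this is the operation dgemm("N", "N", ...) performs: C = alpha*A*B + beta*C on column-major arrays with explicit leading dimensions. A plain-C sketch of the semantics, not the tuned SCSL routine:

```c
/* Reference semantics of dgemm("N","N",m,n,k,alpha,A,lda,B,ldb,beta,C,ldc):
 * C = alpha*A*B + beta*C, with A (m x k), B (k x n), C (m x n) stored
 * column-major, element (i,j) at [i + j*ld]. The library version is
 * heavily blocked and unrolled; this sketch just pins down the math. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *A, int lda, const double *B, int ldb,
               double beta, double *C, int ldc)
{
    int i, j, p;
    for (j = 0; j < n; j++)
        for (i = 0; i < m; i++) {
            double acc = 0.0;
            for (p = 0; p < k; p++)
                acc += A[i + p*lda] * B[p + j*ldb];
            C[i + j*ldc] = alpha*acc + beta*C[i + j*ldc];
        }
}
```

Since the benchmark uses beta = 0.0 and only times the call, feeding it row-major flat arrays merely transposes the operands; the flop count and the measured time are unaffected.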


Now, to be fair, I re-ran the original test_quad in "double" mode (replace long double -> double) and got:

~:cc -fullwarn -O3 -Ofast=IP35 -IPA -o test_quad-double test_quad-double.c
~:test_quad-double
Using default value, n = 128
size of double: 8

Elapsed time to multiply two matrices of order 128: 0.002489


which was faster (!) than the BLAS routine on my machine. Naive matrix mult not so bad for general matrix types?

Another test in "64 bit mode"

cc -64 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn

test_double
Using default value, n = 128

Elapsed time to multiply two matrices of order 128: 0.003326


no change from -n32. But switching in the multiprocessor version of the SCSL lib:

cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs_mp -lftn
test_double
Using default value, n = 128

Elapsed time to multiply two matrices of order 128: 0.045474


which is slower on the small matrix, so let's try the 1024x1024.

cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs -lftn
Using default value, n = 1024

Elapsed time to multiply two matrices of order 1024: 1.320950

cc -n32 -fullwarn -O3 -Ofast=IP35 -IPA -o test_double test_double.c -lscs_mp -lftn

test_double
Using default value, n = 1024

Elapsed time to multiply two matrices of order 1024: 0.941582


As expected, on larger problems the multiprocessing starts to pay off.

The message that keeps getting shoved in my face again and again when coding for performance (math or graphics) is "Know your hardware, know your problem" -- and find the fast path for your situation. As an addendum, I'd add that this implies high-performance cross-platform code is something of a pipe dream.