Nekonomicon - gcc 4.7 optimization options

diegel
Who joined Nov. 17, 2009, 2:08 a.m.
and authored 153 notes

Wrote the following at Jan. 25, 2013, 8:31 a.m...

I hope some more of us have tested gcc 4.7 already. Do you have any hints for the best optimization options for Firefox 3?


ShadeOfBlue
Moderator
Who joined Nov. 25, 2003, 12:09 p.m.
and authored 726 notes

Wrote the following at Jan. 25, 2013, 9:22 a.m...

Try:

Code:

  -march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts
 

These options make some numerical code run 40% faster than MIPSpro's "-Ofast=ip35 -TARG:processor=r12000 -OPT:IEEE_arithmetic=3 -OPT:alias=typed". YMMV.

Also note that you shouldn't use "-Ofast" on gcc, since it enables "-ffast-math", which in turn enables "-ffinite-math-only". That will break your code if it uses Inf or NaN (the JavaScript interpreter probably does).
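
A rough way to guard a build against this, as a sketch only: the shell function below (a hypothetical helper, not part of gcc or of the Firefox build system) scans a flag string for the three options that switch on the finite-math assumption. It deliberately ignores the case where a later "-fno-finite-math-only" turns the option back off.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) when the given flag string
# would enable -ffinite-math-only, directly or via -Ofast / -ffast-math.
# Simplification: a later -fno-finite-math-only is not taken into account.
assumes_finite_math() {
    for flag in $1; do
        case "$flag" in
            -Ofast|-ffast-math|-ffinite-math-only) return 0 ;;
        esac
    done
    return 1
}

# Example flag set (shortened); -funsafe-math-optimizations alone is fine here.
if assumes_finite_math "-march=r12000 -O3 -funsafe-math-optimizations"; then
    echo "flags assume finite math: Inf/NaN handling may break"
else
    echo "flags keep Inf/NaN semantics"
fi
```

Running the check before configure is cheap, and it catches the case where "-Ofast" sneaks in through an inherited CFLAGS variable.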

You can also try adding:

Code:

-fgraphite-identity -floop-block

to the above options. Then you can try adjusting "--param l1-cache-size", "--param l1-cache-line-size", and "--param l2-cache-size" for better performance (the two sizes are given in kilobytes, the line size in bytes). You can also try other Graphite-based optimizations, but they're still a bit experimental and can break the compiler.

If the code doesn't use C++ exceptions, add:

Code:

-fno-exceptions -freorder-blocks-and-partition

(if it does, the compiler will stop with an error).

If the code doesn't use C++ RTTI, add:

Code:

-fno-rtti

(as above, if the code uses RTTI, the compiler will stop with an error).

Also, link everything statically (this also improves program startup time).
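
The options from this post can be collected in one place before configuring a build. This is only a sketch: the variable names BASE_FLAGS, GRAPHITE_FLAGS and CXX_ONLY_FLAGS are made up for illustration, and whether a given build system picks up CFLAGS and CXXFLAGS from the environment is an assumption you need to check for your project.

```shell
#!/bin/sh
# Base flags, copied from the post above.
BASE_FLAGS="-march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts"

# Optional: only useful with a gcc that was built with Graphite support.
GRAPHITE_FLAGS="-fgraphite-identity -floop-block"

# Only when the code uses neither C++ exceptions nor RTTI.
CXX_ONLY_FLAGS="-fno-exceptions -freorder-blocks-and-partition -fno-rtti"

CFLAGS="$BASE_FLAGS $GRAPHITE_FLAGS"
CXXFLAGS="$CFLAGS $CXX_ONLY_FLAGS"
export CFLAGS CXXFLAGS

echo "CFLAGS=$CFLAGS"
echo "CXXFLAGS=$CXXFLAGS"
```

Keeping the C++-only options out of CFLAGS avoids noise when the same tree also compiles plain C files.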

diegel

Wrote the following at Jan. 25, 2013, 9:53 a.m...

Thanks for the detailed answer. We have a lot of multiprocessor machines here. Do you think

Code:

-ftree-parallelize-loops=4

makes any sense?


ShadeOfBlue

Wrote the following at Jan. 25, 2013, 10:53 a.m...

diegel wrote:

Thanks for the detailed answer. We have a lot of multiprocessor machines here. Do you think

Code:

-ftree-parallelize-loops=4

makes any sense?

On Firefox, no, it would probably make things slower.

Autoparallelization works great on certain number-crunching algorithms, but rendering web pages and executing JavaScript are unlikely to benefit from it.

Another thing you could try is profile-guided optimization (PGO), in addition to the options from my previous post. From experience, this combination should give the biggest speed boost, but it requires a somewhat more complicated build environment.
Firefox should already support building with PGO, unless version 3 is too old to have that.
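
For a single program, the usual gcc PGO cycle is three steps: build with "-fprofile-generate", run a representative workload, rebuild with "-fprofile-use". The sketch below only prints those steps as a dry run. The file names app.c and app and the shortened OPT_FLAGS are placeholders, and this is not how the Firefox build drives PGO, which goes through its own build system.

```shell
#!/bin/sh
# Shortened flag set for illustration; use the full list in a real build.
OPT_FLAGS="-march=r12000 -O3"

# Dry run: print the three PGO steps for one source file instead of
# executing them. Arguments: source file, output binary (both hypothetical).
pgo_steps() {
    src=$1
    bin=$2
    echo "gcc $OPT_FLAGS -fprofile-generate -o $bin $src"
    echo "./$bin    # run a representative workload, writes .gcda profile data"
    echo "gcc $OPT_FLAGS -fprofile-use -o $bin $src"
}

pgo_steps app.c app
```

The training run matters more than the flags: a profile collected from an unrepresentative workload can make the final binary slower on real use.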

diegel

Wrote the following at Jan. 25, 2013, 12:18 p.m...

ShadeOfBlue wrote:

Try:

Code:

   -march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts
  

First try with these options works fine and is the fastest build so far. Thank you!


diegel

Wrote the following at Jan. 25, 2013, 1:23 p.m...

Just one more question: assuming we have an R12000 processor, --param l1-cache-size=32 seems right, but what is the correct cache line size for this architecture?


ShadeOfBlue

Wrote the following at Jan. 25, 2013, 2:58 p.m...

diegel wrote:

Just one more question: assuming we have an R12000 processor, --param l1-cache-size=32 seems right, but what is the correct cache line size for this architecture?

32 bytes for the L1 cache and 128 bytes for the L2 cache. It's the same for R10k and R5k as well, IIRC.

Adrenaline
Who joined Feb. 10, 2005, 12:37 p.m.
and authored 420 notes

Wrote the following at Jan. 25, 2013, 4:08 p.m...


That's one of the best posts I've ever seen on a message board, thank you!


ShadeOfBlue

Wrote the following at Jan. 26, 2013, 5:48 a.m...

You're welcome!

This list is the result of many years of tweaking gcc options to convince it to produce the best code it can.

It gives great results on other architectures as well.

diegel, I forgot to thank you for the Firefox tardist in that other thread, so thanks!

miod
Who joined Oct. 9, 2009, 1:44 a.m.
and authored 183 notes

Wrote the following at Jan. 26, 2013, 7:46 a.m...

ShadeOfBlue wrote:

diegel wrote:

Just one more question: assuming we have an R12000 processor, --param l1-cache-size=32 seems right, but what is the correct cache line size for this architecture?

32 bytes for the L1 cache and 128 bytes for the L2 cache, it's the same for R10k and R5k as well, IIRC

Things are a bit more complicated than this.
L1 line size differs between instruction cache (64 bytes) and data cache (32 bytes) on the R10k family. Also, the L2 line size varies between systems. R10k-based O2 systems use only 64 bytes, while all other systems (Indigo2, Origin, etc) use 128 bytes.


ShadeOfBlue

Wrote the following at Jan. 26, 2013, 8:56 a.m...

miod wrote:

Things are a bit more complicated than this.
L1 line size differs between instruction cache (64 bytes) and data cache (32 bytes) on the R10k family. Also, the L2 line size varies between systems. R10k-based O2 systems use only 64 bytes, while all other systems (Indigo2, Origin, etc) use 128 bytes.

Thanks, I didn't know about the L2 line size on the O2. GCC only seems to be interested in the data cache, so an L1 line size of 32 bytes should be good enough for all systems.
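
Put together, the cache settings discussed in this thread could be passed like this. It is a sketch under assumptions: the L1 values come from the posts above, but the L2 size of 2048 KB is only a placeholder and has to be replaced with the size of the secondary cache on your CPU module. As far as I know there is no matching "--param" for the L2 line size, so the 64 versus 128 byte difference cannot be expressed here.

```shell
#!/bin/sh
L1_SIZE_KB=32      # L1 data cache size in KB, as used earlier in the thread
L1_LINE_BYTES=32   # L1 data cache line size in bytes, from the thread
L2_SIZE_KB=2048    # ASSUMPTION: placeholder, set to your machine's L2 size in KB

CACHE_PARAMS="--param l1-cache-size=$L1_SIZE_KB --param l1-cache-line-size=$L1_LINE_BYTES --param l2-cache-size=$L2_SIZE_KB"

# Append to the flags used for the build, e.g. CFLAGS="$CFLAGS $CACHE_PARAMS".
echo "$CACHE_PARAMS"
```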

foetz
Who joined April 14, 2003, 3:34 a.m.
and authored 4683 notes

Wrote the following at Feb. 5, 2013, 1:36 p.m...

ShadeOfBlue wrote:

These options make some numerical code run 40% faster than MIPSpro's "-Ofast=ip35 -TARG:processor=r12000 -OPT:IEEE_arithmetic=3 -OPT:alias=typed". YMMV.

40%

it seems gcc improved a lot lately. guess i might have to put my prejudices behind and take it for a ride

and indeed, great tips there ShadeOfBlue!


ShadeOfBlue

Wrote the following at Feb. 5, 2013, 2:45 p.m...

foetz wrote:

40%

it seems gcc improved a lot lately. guess i might have to put my prejudices behind and take it for a ride

and indeed, great tips there ShadeOfBlue!

Thanks

I couldn't believe it at first either.

I think the biggest difference was made by the R10k/R12k instruction scheduler that somebody wrote for GCC somewhere around the 4.6 release (if that person is reading this: thank you for writing it!). The generic optimizations have also advanced significantly, since GCC now has to compete with LLVM/clang.

However, IRIX is marked as obsolete and support for it will be removed in GCC 4.8.

If anybody has a quad Origin 300 or something similar, please make it available remotely to the developers of GCC, so that they can continue supporting IRIX -- they've obsoleted it because they don't have any SGI machines which could build gcc in a decent time frame.
There was a thread about this a while ago.

In a few years, C++11 code will be much more common. MIPSpro cannot compile it and will never be able to, since it's not developed anymore. So it is not only about generating faster code, but also about having a compiler that can actually compile new software.
Therefore, it is essential that we help the GCC developers maintain the IRIX port.
