SGI: Development

gcc 4.7 optimization options

I hope some of us have already tested gcc 4.7. Do you have any hints on the best optimization options for Firefox 3?

_________________
:Tezro: :Fuel: :Octane2: :Octane: :Onyx2: :O2+: :O2: :Indy: :Indigo: :Cube:
Try:
Code:
-march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts

These options make some numerical code run 40% faster than MIPSpro's "-Ofast=ip35 -TARG:processor=r12000 -OPT:IEEE_arithmetic=3 -OPT:alias=typed". YMMV.

Also note that you shouldn't use "-Ofast" on gcc, since it enables "-ffast-math", which in turn enables "-ffinite-math-only", which will break your code if it uses Inf and NaN (the JavaScript interpreter probably uses them).
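
To catch this early in a build script, a minimal sketch could scan the flag string before the build starts. The function name check_fp_flags is hypothetical (it is not part of gcc or of the Firefox build system), and the check is deliberately conservative: it rejects the flags named above even if a later "-fno-finite-math-only" would switch the behaviour off again.

```shell
#!/bin/sh
# Hypothetical helper: reject flag sets that directly or indirectly
# enable -ffinite-math-only, which assumes Inf and NaN never occur.
check_fp_flags() {
    for flag in $1; do
        case "$flag" in
            -Ofast|-ffast-math|-ffinite-math-only)
                echo "unsafe for code that uses Inf/NaN: $flag" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Example: an -O3 based set passes, an -Ofast based set does not.
check_fp_flags "-O3 -funsafe-math-optimizations" && echo "flags ok"
check_fp_flags "-Ofast -march=r12000" 2>/dev/null || echo "flags rejected"
```

Calling it on $CFLAGS and $CXXFLAGS before running configure would be one way to use it.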

You can also try adding:
Code:
-fgraphite-identity -floop-block

to the above options. Then you can try adjusting "--param l1-cache-size", "--param l1-cache-line-size", and "--param l2-cache-size" for better performance. You can also try other Graphite-based optimizations, but they're still a bit experimental and can break the compiler :)

If the code doesn't use C++ exceptions, add:
Code:
-fno-exceptions -freorder-blocks-and-partition

(if it does, the compiler will warn you).

If the code doesn't use C++ RTTI, add:
Code:
-fno-rtti

(as above, if the code uses this, the compiler will warn you).

Also, link everything statically (this also improves program startup time).
Thanks for the detailed answer. We have a lot of multiprocessor machines here. Do you think
Code:
-ftree-parallelize-loops=4
makes any sense?

diegel wrote:
Thanks for the detailed answer. We have a lot of multiprocessor machines here. Do you think
Code:
-ftree-parallelize-loops=4
makes any sense?

On Firefox, no, it would probably make things slower :)

Autoparallelization works great on certain number-crunching algorithms, but rendering webpages and executing JavaScript aren't workloads that benefit from it.

Another thing you could try is profile-guided optimization (PGO), in addition to the options from my previous post. In my experience, this combination gives the biggest speed boost, but it requires a somewhat more complicated build environment.
Firefox should already support building with PGO, unless version 3 is too old to have that.
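
For anyone who hasn't used PGO before, here is a rough sketch of the generic two-pass gcc workflow on a throwaway program. It is not the Firefox build procedure: demo.c, the temporary directory, and the PGO_STATUS variable are made up for illustration, and -O2 stands in for whatever option set you actually use. The only gcc options involved are -fprofile-generate and -fprofile-use.

```shell
#!/bin/sh
# Two-pass PGO sketch on a toy program (illustration only).
CC=${CC:-gcc}
PGO_STATUS=failed

if command -v "$CC" >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    cd "$workdir" || exit 1

    # Hypothetical stand-in for the real application.
    cat > demo.c <<'EOF'
#include <stdio.h>
int main(void) {
    unsigned long i, sum = 0;
    for (i = 0; i < 1000000UL; i++)
        sum += (i % 3 == 0) ? i : 1;
    printf("%lu\n", sum);
    return 0;
}
EOF

    # Pass 1: instrumented build, then a training run that writes
    # profile data (.gcda) into the build directory.
    # Pass 2: rebuild with the same options plus -fprofile-use.
    $CC -O2 -fprofile-generate demo.c -o demo &&
    ./demo > /dev/null &&
    $CC -O2 -fprofile-use demo.c -o demo &&
    PGO_STATUS=done
else
    echo "no C compiler found, skipping the PGO demo"
    PGO_STATUS=skipped
fi

echo "PGO demo: $PGO_STATUS"
```

The training run should resemble real usage; for a browser that would mean loading and rendering typical pages rather than just starting and quitting.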
ShadeOfBlue wrote:
Try:
Code:
-march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts

First try with these options works fine and is the fastest build so far. Thank you!

Just one more question: assuming we have an R12000 processor, I would set --param l1-cache-size=32, but what is the correct cache line size for this architecture?

diegel wrote:
Just one more question: assuming we have an R12000 processor, I would set --param l1-cache-size=32, but what is the correct cache line size for this architecture?

32 bytes for the L1 cache and 128 bytes for the L2 cache, it's the same for R10k and R5k as well, IIRC :)
ShadeOfBlue wrote:
Try:
Code:
-march=r12000 -O3 -fomit-frame-pointer -fno-math-errno -fno-rounding-math -fno-signaling-nans -funsafe-math-optimizations -fgcse-sm -fgcse-las -fipa-pta -ftree-loop-linear -ftree-loop-im -fivopts -fno-keep-static-consts

That's one of the best posts I've ever seen on a message board, thank you!

_________________
:Indigo: 33mhz R3k/48mb/XS24 :Indy: 150mhz R4400/256mb/XL24 :Fuel: 600mhz R14kA/2gb/V10 Image 8x1.4ghz Itanium 2/8GB :O3x08R: 32x600mhz R14kA/24GB :Tezro: 4x700mhz R16k/8GB/V12/DCD/SAS/FC/DM5 :O3x0: 2x700mhz R16k/4GB
You're welcome!

This list is the result of many years of tweaking gcc options to convince it to produce the best code it can :)
It gives great results on other architectures as well.

diegel, I forgot to thank you for the Firefox tardist in that other thread, so thanks! :)
ShadeOfBlue wrote:
diegel wrote:
Just one more question: assuming we have an R12000 processor, I would set --param l1-cache-size=32, but what is the correct cache line size for this architecture?

32 bytes for the L1 cache and 128 bytes for the L2 cache, it's the same for R10k and R5k as well, IIRC :)

Things are a bit more complicated than this.
L1 line size differs between instruction cache (64 bytes) and data cache (32 bytes) on the R10k family. Also, the L2 line size varies between systems. R10k-based O2 systems use only 64 bytes, while all other systems (Indigo2, Origin, etc) use 128 bytes.

_________________
:Indigo: R4000 :Indigo2: R4400 :Indigo2IMP: R4400 :Indigo2: R8000 :Indigo2IMP: R10000 :Indy: R4000PC :Indy: R4000SC :Indy: R5000SC :O2: R5000 :O2: RM7000 :Octane: 2xR10000 :Octane: R12000 :O200: - :O200: 2x2xR10000 :Fuel: R16000 :A350:
among more than 150 machines : Apollo, Be, Data General, Digital, HP, IBM, MIPS before SGI , Motorola, NeXT, SGI, Solbourne, Sun...
miod wrote:
Things are a bit more complicated than this.
L1 line size differs between instruction cache (64 bytes) and data cache (32 bytes) on the R10k family. Also, the L2 line size varies between systems. R10k-based O2 systems use only 64 bytes, while all other systems (Indigo2, Origin, etc) use 128 bytes.

Thanks, I didn't know about the L2 line size on the O2. GCC only seems to be interested in the data cache, so an L1 line size of 32 bytes should be good enough for all systems :)
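
Putting the numbers from this thread together, the Graphite options and cache parameters could be assembled roughly like this. It is only a sketch: the variable names are arbitrary, and the L2 size of 2048 KB is an assumption that has to be replaced with the value for your CPU module (hinv reports the secondary cache size). As far as I know, gcc takes l1-cache-size and l2-cache-size in kilobytes and l1-cache-line-size in bytes, and has no parameter for the L2 line size, so the 64 vs. 128 byte difference doesn't enter here.

```shell
#!/bin/sh
# Cache geometry for an R10k-family CPU, using the figures from this thread.
L1_SIZE_KB=32       # L1 data cache size in KB
L1_LINE_BYTES=32    # L1 data cache line size in bytes
L2_SIZE_KB=2048     # ASSUMPTION: varies per CPU module, check with hinv

GRAPHITE_FLAGS="-fgraphite-identity -floop-block"
CACHE_PARAMS="--param l1-cache-size=$L1_SIZE_KB --param l1-cache-line-size=$L1_LINE_BYTES --param l2-cache-size=$L2_SIZE_KB"

# Append to whatever CFLAGS are already set.
CFLAGS="${CFLAGS:-} $GRAPHITE_FLAGS $CACHE_PARAMS"
export CFLAGS
echo "$CFLAGS"
```

Since the Graphite passes are still experimental, it's probably worth benchmarking with and without these options before keeping them.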
ShadeOfBlue wrote:
These options make some numerical code run 40% faster than MIPSpro's "-Ofast=ip35 -TARG:processor=r12000 -OPT:IEEE_arithmetic=3 -OPT:alias=typed". YMMV.


40% :o
it seems gcc improved a lot lately. guess i might have to put my prejudices behind and take it for a ride :D

and indeed, great tips there ShadeOfBlue!

_________________
r-a-c.de
foetz wrote:
40% :o
it seems gcc improved a lot lately. guess i might have to put my prejudices behind and take it for a ride :D

and indeed, great tips there ShadeOfBlue!

Thanks :)

I couldn't believe it at first either :)

I think the biggest difference was made by the R10k/R12k instruction scheduler that somebody wrote for GCC somewhere around the 4.6 release (if that person is reading this: thank you for writing it!). The generic optimizations have also advanced significantly, since GCC now has to compete with LLVM/clang :D

However, IRIX is marked as obsolete and support for it will be removed in GCC 4.8 :!:
If anybody has a quad Origin 300 or something similar, please make it available remotely to the developers of GCC, so that they can continue supporting IRIX -- they've obsoleted it because they don't have any SGI machines which could build gcc in a decent time frame.
There was a thread about this a while ago.

In a few years, C++11 code will be much more common. MIPSpro cannot compile it and never will, since it is no longer developed. So this is not only about generating faster code, but also about having a compiler that can actually compile new software.
Therefore, it is essential that we help the GCC developers maintain the IRIX port.