Nekonomicon - Altix NUMA systems apparently moving to x86

schleusel
From NRW, Germany
Who joined Oct. 20, 2003, 6:49 a.m.

Wrote the following at July 11, 2008, 2:13 p.m...

This news item (in German) talks about planned upgrades to the new northern German "HLRN-II" Altix installation - which currently consists solely of Altix ICE x86 clusters.

The interesting part is the second passage. It talks about two additional planned "Ultra Violet" SMP systems, each consisting of 2176 CPUs (136 blades with two 8-core Nehalem-Xeons each) and 8.7 TB of memory in only five large nodes - using NUMALink 5.

I wonder if this will mark the end of their IA64 lineup (wouldn't be all that surprising, I suppose :-) or if they'll offer both variants.

deBug
Donor
From Sweden
Who joined Feb. 27, 2006, 1:44 p.m.

Wrote the following at July 12, 2008, 5:54 a.m...

schleusel wrote: This news bit (german) talks about planned upgrades of the new northern german "HLRN-II" Altix installation - which currently consists solely of Altix ICE x86 clusters.

The interesting part is the second passage. It talks about two additional planned "Ultra Violet" SMP systems each consisting of 2176 CPUs (136 blades with two 8-core Nehalem-Xeons each) and 8.7 TB of memory in only five large nodes - using NUMALink 5

Wonder if this will mark the end of their IA64 lineup (wouldn't be all that surprising i suppose ) or if they'll offer both variants..

As the Nehalem is a NUMA design (Intel's NUMA implementation goes under the name "QuickPath Interconnect"), I suspect SGI and other HPC manufacturers will regard x86 as a more worthy platform for HPC.
So I think you are right, they will probably use x86 more in future designs.

//Harry

Mein Führer, I can walk!

kramlq
From IRL
Who joined Sept. 20, 2005, 5:10 p.m.

Wrote the following at July 12, 2008, 7:08 a.m...

A couple of months ago someone on here pointed out to me that some Altix systems were now shipping with x86 inside. Immediately I wondered if it was the start of a move from Itanic. They are very different chips, and I don't see how a company in SGI's state can justify the expense of engineering systems using both. When skywriter wanted to know if we had any questions for SGI I was going to suggest asking about x86 vs Itanium in the long term, but didn't, as I figured SGI would never give a straight answer.

SAQ
From Renton, WA
Who joined July 19, 2006, 8:37 a.m.

Wrote the following at July 12, 2008, 7:46 a.m...

From a technical standpoint, aren't there still potential performance benefits from the VLIW architecture (provided that Intel has a way to deal with the scalability issues - i.e. not wasting units if a future version is wider and attempts to run current software), or has current OoO technology eliminated the edge?

"Brakes??? What Brakes???"

:Indigo:

(single-CM)

kramlq
From IRL
Who joined Sept. 20, 2005, 5:10 p.m.

Wrote the following at July 12, 2008, 9:52 a.m...

SAQ wrote: From a technical standpoint, aren't there still potential performance benefits from the VLIW architecture (provided that Intel has a way to deal with the scalability issues - i.e. not wasting units if a future version is wider and attempts to run current software), or has current OoO technology eliminated the edge?

VLIW just moves a lot of complexity into software (compilers and apps), so for those willing to put in the effort in optimising code, there possibly are, but X86 speed advances so fast you have to ask is it worth the effort. Interestingly, it seems a trend is emerging for simpler CPU design as well - UltraSPARC and Cell (PPC) both reduce the amount of OOO technology in favour of adding more parallelism (Chip Multithreading in SPARC, Coprocessors with fast local memory in the case of Cell). The X86 approach is probably the safest bet though :-)

TeeTylerToe
Who joined Sept. 13, 2004, 11:56 p.m.

Wrote the following at July 12, 2008, 7:04 p.m...

you've got a 1.66GHz Itanium 2 dual core that does 4 FP ops per cycle, so that's 8 FP per cycle at about 1.5GHz, so 12Gflop per socket.

you've got core 2 xeons with 4 cores at 3GHz+ with I think 2 flop/cycle, or 24Gflop per socket.

twice the performance, the chips are cheaper, and the architecture's orders of magnitude cheaper, although Intel's moving both to QuickPath so the Itanium architecture premium would drop out to some extent, so you'd just have the excessive cost of the Itanium chip, with its anemic performance.
*edit* 24gflops, not 14

schleusel
From NRW, Germany
Who joined Oct. 20, 2003, 6:49 a.m.

Wrote the following at July 13, 2008, 7:04 a.m...

TeeTylerToe wrote: you've got core 2 xeons with 4 cores at 3GHz+ with I think 2 flop/cycle, or 14Gflop per socket.

Actually even 4 double precision FLOPs/cycle if you include the SIMD units i think (2 x87 + 2 SSE2/3), i.e. 48 GFLOPs/s theoretical peak for the 3GHz Quad Core.
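The peak figures being traded above are just sockets x cores x clock x FP ops per cycle. A minimal Python sketch of that arithmetic follows; the function name `peak_gflops` is made up for illustration, and the per-cycle figures are the ones quoted in the thread, not checked against vendor datasheets.

```python
def peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle):
    """Theoretical peak = sockets x cores x clock (GHz) x FP ops per cycle per core."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle


# Figures as quoted in the thread (assumptions, not datasheet values).
itanium2 = peak_gflops(1, 2, 1.5, 4)     # dual core, 4 FP ops/cycle, "about 1.5GHz"
xeon_2flop = peak_gflops(1, 4, 3.0, 2)   # quad core, 2 FP ops/cycle
xeon_4flop = peak_gflops(1, 4, 3.0, 4)   # quad core, 4 FP ops/cycle counting SIMD

print(itanium2, xeon_2flop, xeon_4flop)  # 12.0 24.0 48.0
```

These are upper bounds only; they say nothing about cache or memory bandwidth.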

mapesdhs
From Edinburgh, Scotland
Who joined Nov. 10, 2003, 4:17 p.m.

Wrote the following at July 15, 2008, 10:35 a.m...

schleusel wrote: Actually even 4 double precision FLOPs/cycle if you include the SIMD units i think (2 x87 + 2 SSE2/3), i.e. 48 GFLOPs/s theoretical peak for the 3GHz Quad Core.

And yet for some codes IA64 can be 100% faster than a 3GHz XEON. As always, it depends what you're doing.
Comparing based on peak fp ops is far too simplistic; way too many other issues are involved, from cache to memory
bandwidth, etc.

SPEC results show IA64 to be quite strong for fluid dynamics and QCD, easily beating XEONs with 2X higher clock speeds.
For other tests the XEON is better, still others about the same. Here's a table of results (not updated since Jan/08, must
plough through the data again soon):

   SPEC2006 CPU Superiority Table, by Ian Mapleson <[email protected]>

   Last Change: 16/Jul/2008 (reference adjustments, new CPU data not added since Jan/08)

   Core2Extreme Overclocking Reference:

   http://www.nordichardware.com/Reviews/?skrivelse=487&page=5

   NOTE: The 4.7GHz Core2Extreme numbers in these tables are
   _extrapolated_ from the published Dell 390 results, ie. linear
   extrapolation and then reduced by 5%. They are a guide only!!

   Source Refs:

   http://www.spec.org/cpu2006/results/cfp2006.html
   http://www.spec.org/cpu2006/results/cint2006.html

   FLOATING POINT SPEC2006 RESULTS (ch = chip, cr/crs = core/cores, Para = Autoparallelize option used?):

                   HP rx2660      HP ProLiant     IBM P570     Dell 390         Overclocked     Fujitsu
                   Integrity      DL360-G5 XEON   Power6       Core2Extreme     Core2Extreme    CELSIUS V840
   (SPECfp2006)    1.66GHz IA64   5460, 3.16GHz   4.7GHz       X6800, 2.93GHz   X6800, 4.7GHz   Opteron 3GHz
                   1 ch, 2 crs    2 ch, 8 crs     1 ch, 1 cr   1 ch, 2 crs      1 ch, 2 crs     2 ch, 4 crs
                   Para: No       Para: Yes       Para: No     Para: No         Para: No        Para: Yes

   410.bwaves      44.7           32.9            67.3         20.9             31.6            60.8
   416.gamess      11.5           23.7            14.8         17.3             26.4            17.1
   433.milc        21.3           10.8            20.0         11.4             17.4            13.2
   434.zeusmp      19.3           16.8            26.4         16.0             24.4            12.4
   435.gromacs     17.3           20.4            12.6         17.0             25.9            15.4
   436.cactusADM   36.5           105.0           24.8         20.2             30.8            39.1
   437.leslie3d    27.9           19.5            31.2         13.3             20.3            9.45
   444.namd        26.8           16.8            14.1         15.1             23.0            13.9
   447.dealII      21.6           30.6            23.8         18.0             27.4            25.0
   450.soplex      11.6           15.7            21.4         14.9             22.7            11.8
   453.povray      11.5           30.7            15.4         26.1             39.8            19.4
   454.calculix    16.7           24.8            18.7         13.9             21.2            8.62
   459.GemsFDTD    19.7           19.2            21.3         12.0             18.3            23.3
   465.tonto       13.1           25.3            17.5         12.6             19.2            17.5
   470.lbm         37.6           24.3            53.7         14.0             21.3            15.3
   481.wrf         16.6           22.3            12.3         17.0             25.9            19.5
   482.sphinx3     24.4           28.4            36.3         22.9             34.9            15.3

   SPECfp2006 Peak

   HP ProLiant DL360-G5 XEON X5460 3.16GHz:       23.9
   IBM p570, Power6 4.7GHz:                       22.5
   HP Integrity rx2660, Itanium2 1.66GHz:         20.4
   Fujitsu CELSIUS V840 Opteron 2222 3GHz:        17.4
   Dell 390, Core2Extreme X6800, 2.93GHz:         16.2

   Itanium2: http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071015-02285.html
   XEON:     http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071029-02381.html
   Power6:   http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071030-02420.html
   Core2Ext: http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20061226-00189.html
   Opteron:  http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070802-01587.html

   INTEGER SPEC2006 RESULTS (ch = chip, cr/crs = core/cores, Para = Autoparallelize option used?):

                    HP Integrity     Intel 3.16GHz   IBM P570     Dell 390         Overclocked     ASUS Intel    HP Proliant
                    rx2660 1.66GHz   HP ProLiant     Power6       Core2Extreme     Core2Extreme    Core2Quad     DL185 G5
   (SPECint2006)    Itanium2 IA64    XEON-X5460      4.7GHz       X6800, 2.93GHz   X6800, 4.7GHz   QX9650 3GHz   Opteron 3GHz
                    1 ch, 2 crs      2 ch, 8 crs     1 ch, 1 cr   1 ch, 2 crs      1 ch, 2 crs     1 ch, 4 crs   2 ch, 4 crs
                    Para: No         Para: Yes       Para: No     Para: No         Para: No        Para: Yes     Para: No

   400.perlbench    11.7             25.4            13.7         23.5             35.8            24.8          16.6
   401.bzip2        11.4             18.4            16.1         15.7             23.9            17.2          11.6
   403.gcc          12.9             20.1            19.8         13.3             20.1            23.3          12.6
   429.mcf          19.4             24.3            36.8         23.8             36.3            29.9          18.1
   445.gobmk        15.0             22.8            17.9         21.5             32.8            22.4          18.0
   456.hmmer        28.6             28.8            17.1         12.3             18.8            17.2          21.2
   458.sjeng        12.4             20.7            14.9         19.6             29.9            20.5          14.9
   462.libquantum   61.4             227.0           96.6         17.9             27.3            105.0         29.5
   464.h264ref      22.6             35.2            30.6         31.6             48.2            33.4          23.1
   471.omnetpp      9.1              17.0            18.5         15.8             24.1            21.3          10.9
   473.astar        15.6             17.2            13.8         14.2             21.6            17.4          10.6
   483.xalancbmk    16.6             26.9            18.3         20.7             31.6            24.6          15.3

   SPECint2006 Peak

   HP ProLiant DL360 G5, XEON-X5460 3.16GHz:      27.6
   ASUS P5E3 Deluxe, Core2Quad QX9650 3GHz:       25.5
   IBM p570, Power6 4.7GHz:                       21.6
   Dell 390, Core2Extreme X6800, 2.93GHz:         18.5
   HP Integrity rx2660, 1.66GHz Itanium2:         17.0
   HP Proliant DL185 G5 Opteron 3GHz:             16.1

   Itanium2:  http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071015-02284.html
   XEON:      http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071029-02385.html
   Power6:    http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070518-01096.html
   Core2Ext:  http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20061226-00186.html
   Core2Quad: http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071112-02554.html
   Opteron:   http://www.spec.org/cpu2006/results/res2008q1/cpu2006-20071220-02914.html

   *******************************************************************************************************************

   To aid SGI users, here is a SPEC2000 comparison between R14K/600MHz,
   the older Itanium2 1.6GHz IA64 (before the newer L2 sizes), as well as
   some other CPUs from the past, listed in order of aggregate fp
   performance (extrapolated R16K/1GHz included as a comparative guess):

   System                CPU             MHz    Cache Sizes                SPECint2000   SPECfp2000

   SGI Altix 4700/D      Itanium2 9040   1600   16/16 + 1MB/256K + 24MB    ?             3098 [b]
   SGI Altix 4700/D      Itanium2 9040   1600   16/16 + 1MB/256K + 18MB    ?             3001 [b]
   SGI Altix XE 210      P4 XEON 5160    3000   32/32 + 4MB                2937 [b]      2752 [b]
   IBM p570              Power5+         1900   64/32 + 2MB + 36MB         1453          2733
   SGI Altix 3700 Bx2    Itanium2        1600   16/16 + 256K + 9MB         1441 [b]      2647 [b]
   HP rx4640             Itanium2        1600   16/16 + 256K + 6MB         1590          2612
   SGI Altix 3700 Bx2    Itanium2        1600   16/16 + 256K + 6MB         1410 [b]      2600 [b]
   SGI Altix 3700 Bx2    Itanium2        1500   16/16 + 256K + 4MB         1242 [b]      2361 [b]
   SGI Altix 3000        Itanium2        1500   16/16 + 256K + 6MB         1077          2148
   SGI Altix 350         Itanium2        1400   16/16 + 256K + 3MB         1078          1942
   SGI Altix 3000        Itanium2        1300   16/16 + 256K + 3MB         875           1854
   AMD/ASUS              Opteron 150     2400   64/64 + 1MB                1663          1849
   Intel D925            P4-X            3466   12/8 + 512 + 2MB           1772          1724
   HP Alpha GS1280       21364           1300   64/64 + 2MB                994           1684
   SGI Altix 350         Itanium2        1400   16/16 + 256K + 1.5MB       986           1684
   Intel D925            P4              3600   12/8 + 1MB                 1575          1630
   SGI Altix 350         Itanium2        1000   16/16 + 256K + 1.5MB       743           1374
   Fujitsu               SPARC64-V       1350   128 + 128K/2MB             905           1340
   Apple                 PPC970 (G5)     2000   64/32 + 512K               800           840
   Origin 3000 [*]       R16000          1000   32/32 + 16MB               792           838
   HP rx4610             Itanium         800    16/16 + 96K + 4MB          379           701
   HP c3750              PA-8700         875    768K/1.5MB                 678           674
   HP                    Pentium-M       1000   32/32 + 1MB                687           552
   SGI Origin 3200       R14000A         600    32/32 + 8MB                500           529
   SGI Origin 300        R14000A         600    32/32 + 4MB                483           495
   SGI Origin 3200       R14000          500    32/32 + 8MB                427           463
   SGI Origin 3200       R12000          400    32/32 + 8MB                353           407
   SGI 2200              R14000          500    32/32 + 8MB                412           386
   SGI Origin 300        R14000          500    32/32 + 2MB                379           378
   SGI 2200              R12000          400    32/32 + 8MB                347           343
   SGI 2100              R12000          350    32/32 + 8MB                289           293
   SGI Origin 200        R12000          360    32/32 + 4MB                298           290
   SGI 2200              R12000          300    32/32 + 8MB                264           283

   [*] Extrapolation based on R14K/600, done as: (1000/600) * 0.95 * result. Real speed might be
   faster due to larger L2 and higher L2 speed, or might be less due to simple clock increase
   not benefiting every code in the same way. Guesstimate only!!

   REFS: http://www.cl.cam.ac.uk/teaching/2006/CompArch/mainnotes/comparch-2.pdf
   http://www.spec.org/cpu2000/results/cint2000.html
   http://www.spec.org/cpu2000/results/cfp2000.html
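For anyone wanting to reproduce the [*] guesstimate, here is a small Python sketch of the stated rule, (1000/600) * 0.95 * result, applied to the R14000A/600MHz Origin 3200 row. The helper name `extrapolate` and its `derate` parameter are made up for illustration.

```python
def extrapolate(result, base_mhz, target_mhz, derate=0.95):
    """Scale a SPEC result linearly with clock, then take 5% off (the table's stated rule)."""
    return (target_mhz / base_mhz) * derate * result


# R14000A/600MHz Origin 3200 figures from the SPEC2000 table: 500 int, 529 fp.
r16k_int = extrapolate(500, 600, 1000)
r16k_fp = extrapolate(529, 600, 1000)

print(round(r16k_int), round(r16k_fp))  # 792 838
```

That matches the R16000/1GHz row, with the same caveat as the footnote: clock scaling is a guess, not a measurement.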

In the end though, cost is likely to be the deciding factor, just as it was for SGI, which would be ironic given it was
the IA64 programme which pretty much killed off SGI's future CPUs in the 1st place (staff moving to Intel, etc.)

Ian.

[email protected]
+44 (0)131 476 0796

TeeTylerToe
Who joined Sept. 13, 2004, 11:56 p.m.

Wrote the following at July 16, 2008, 10:10 a.m...

ummm, didn't the 3.16GHz beat the itanium2 in both? not as well as I predicted. also, how many cores, and sockets.

D-EJ915
From Virginia Beach, USA
Who joined July 30, 2007, 10:07 p.m.

Wrote the following at July 16, 2008, 2:28 p.m...

TeeTylerToe wrote: ummm, didn't the 3.16GHz beat the itanium2 in both? not as well as I predicted. also, how many cores, and sockets.

usually those tests are just a single core if the cpu has multiple cores

mapesdhs
From Edinburgh, Scotland
Who joined Nov. 10, 2003, 4:17 p.m.

Wrote the following at July 16, 2008, 3:11 p.m...

D-EJ915 writes:
> usually those tests are just a single core if the cpu has multiple cores

That's generally true for the integer tests, but most of the fp tests are done in parallel, ie. cores/CPUs
enabled, Autoparallel compiler options turned on.

TeeTylerToe writes:
> ummm, didn't the 3.16GHz beat the itanium2 in both? ...

Both what? Please don't say you're judging based on the final averages!

The key point is that one
should compare those tests closest to one's application, and in that regard there are many codes
where the XEON is much slower than IA64. And the answer to your next question might surprise you.
Comparing based on final SPEC averages is dumb and tells one nothing (eg. the final SPEC averages
for O2 hide an order of magnitude difference between lowest to highest). Look at the codes where
IA64 does well (lots of Fluid Dynamics stuff), when you compare the difference in the number of
cores and clock speeds involved, the IA64's results are very impressive.
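To illustrate why the final average hides so much, here is a rough Python sketch using the SPECfp2006 columns from the table earlier in the thread. It assumes the overall score is the geometric mean of the per-benchmark figures (which reproduces the 20.4 and 23.9 listed under "SPECfp2006 Peak"); the variable and function names are made up for the example.

```python
import math

# Per-benchmark SPECfp2006 figures copied from the table in this thread.
BENCHMARKS = ["410.bwaves", "416.gamess", "433.milc", "434.zeusmp", "435.gromacs",
              "436.cactusADM", "437.leslie3d", "444.namd", "447.dealII", "450.soplex",
              "453.povray", "454.calculix", "459.GemsFDTD", "465.tonto", "470.lbm",
              "481.wrf", "482.sphinx3"]
IA64 = [44.7, 11.5, 21.3, 19.3, 17.3, 36.5, 27.9, 26.8, 21.6,
        11.6, 11.5, 16.7, 19.7, 13.1, 37.6, 16.6, 24.4]
XEON = [32.9, 23.7, 10.8, 16.8, 20.4, 105.0, 19.5, 16.8, 30.6,
        15.7, 30.7, 24.8, 19.2, 25.3, 24.3, 22.3, 28.4]


def geomean(values):
    """Geometric mean, assumed here to be how the overall score is formed."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# IA64 result divided by XEON result, per benchmark.
ratios = {name: ia / xe for name, ia, xe in zip(BENCHMARKS, IA64, XEON)}
best = max(ratios, key=ratios.get)
worst = min(ratios, key=ratios.get)

print(f"overall: IA64 {geomean(IA64):.1f}, XEON {geomean(XEON):.1f}")
print(f"IA64/XEON best:  {best} {ratios[best]:.2f}x")
print(f"IA64/XEON worst: {worst} {ratios[worst]:.2f}x")
```

The overall scores sit within about 20% of each other, while the per-benchmark ratio runs from roughly 0.35x (cactusADM) to nearly 2x (milc, the QCD code).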

> ... not as well as I predicted. also, how many cores, and sockets.

The XEON was using 2 chips, 4 cores/chip, 8 cores total, whereas the IA64 was only using 1 chip with 2
cores. The IA64 does much better than people think, ie. a single dual-core 1.66GHz Itanium2 can be 2X
faster than two quad-core XEONs. This doesn't apply to all codes (certainly not), but for those
where the two XEONs are faster one has to remember it takes two XEONs to achieve such a difference.
I've amended the table above and added URLs for the results pages. Quite a few new results are out now,
like the Dell T7400, so I'll wade through the SPEC pages and update the tables soon. It's not linked to
from my site index, but the file has always been available here. I use it as a personal reference when
researching performance issues and, if I can, update it 3 or 4 times a year.

Note that the Power6 result is interesting - it's only using 1 core on the chip. Strange that IBM didn't
submit results using both, though maybe they have now, I've not checked yet. Oh, I'm only referring to
the fp results here; for int, the XEON does much better, at least for the tests in the SPEC suite anyway
(large scale apps using lots of CPUs might behave differently).

Anyway, all this anti-IA64 sentiment is silly. Some of what went into IA64 came from ex-SGI people and
others familiar with SGI's ideas for H1/H2. I was very against IA64 early on, partly anti-Intel bias, partly
the loss of SGI's next-gen CPUs, etc., but after talking to John Mashey about it (one of the original MIPS designers)
I was satisfied the result was going to be a good design, which it was/is, alas the late release caused
other problems (never used in O2K). If Itanium fails, it will be on cost grounds and like factors, not
performance. Speed-wise, it's a very efficient design, ie. work done per clock cycle per core. It would
have been nice to have H2, etc., but it was never going to happen for cost reasons (faster than IA64
as originally planned, but not fast enough to justify the higher cost).

It was a long time ago now, so I'm sure John wouldn't mind if I quoted his 1998 email to me:

   IA64 is *very* different in almost every conceivable way from an IA32 in architecture, emphasis, and
   performance; it is very much what we might have designed had we been able to do a new ISA that
   no longer had to be upwards compatible with anything we had (MIPS has run out of some opcode
   slots, and will always have 32 each of integer and FP registers). It is publicaly known that IA64 has
   128 each of integer and FP registers, and if you care about FP you will like that. There are numerous
   other features where people have learned, and I found many features that probably first appeared in
   MIPS chips, but in some cases cleaner. I studied the manuals looking for showstoppers, and was
   heartily relieved not to find any such, and I did find features that I'd been specing for H2.

   Anyway, I can't say anything that isn't public, but I would say:

   a) This is a good architecture and a good chip, and if you liked MIPS chips, you will like IA64, even if
   you despise IA32s.

   b) In various strange ways, this architecture and implementation almost seem designed as better for
   SGI than for anybody else, even HP.

   c) As it happens, the threads of ideas that led to the R8000 and SGI compiler technologies came
   from people who'd worked on related technologies at other companies, and worked with people
   with similar ideas, who went to Intel & HP, and strangely enough, some of what's seen in the chips
   is *very* familiar.

   So, anyway, your concerns are well-taken; I had some of the identical ones; my boss (Forest Baskett,
   our CTO, and one of those who worked on the Stanford MIPS) & I have both looked carefully at
   this thing, and both feel that this will be a good chip for our customers ... it is *not* an X86, even if is
   able to run that code. By now, people understand how to do 64-bit instruction sets, so that's not
   that hard any more; the real issue is in a myriad of other details, which is why I spent hours studying
   the manuals, and I sure felt a lot better after I'd read them.

Ian.

R-ten-K
From Nor Cal
Who joined Nov. 15, 2004, 10:36 p.m.

Wrote the following at July 16, 2008, 6:05 p.m...

SPEC is not multithreaded, throwing more cores does not change the SPEC results.

"Was it a dream where you see yourself standing in sort of sun-god robes on a
pyramid with thousand naked women screaming and throwing little pickles at you?"

TeeTylerToe
Who joined Sept. 13, 2004, 11:56 p.m.

Wrote the following at July 17, 2008, 3:58 a.m...

some interesting things going on:
T2 Plus Sun T5240 '06 FP Rate 2 socket - 111/119
http://www.spec.org/cpu2006/results/res ... 04061.html
T2 Plus Sun T5240 '06 INT Rate 2 socket - 142/157
http://www.spec.org/cpu2006/results/res ... 04063.html
Xeon X7350 Poweredge R900 FP Rate 4 socket - 108/119
http://www.spec.org/cpu2006/results/res ... 02355.html
Opteron 8360 SE PowerEdge R905 FP Rate 4 socket - 152/166
http://www.spec.org/cpu2006/results/res ... 04224.html
Xeon X7350 NovaScale R480E1 Int rate 4 socket - 177/213
http://www.spec.org/cpu2006/results/res ... 03206.html

http://www.spec.org/cpu2006/results/res ... 04225.html
167/193 Int Rate 2.5GHz 4 socket Opteron

http://www.spec.org/cpu2006/results/res ... 00934.html
80.8/82.2 FP rate itanium NovaScale 3045 4 socket 1.6GHz

can someone explain why the Sun Fire 280R 750MHz USIII does about as well as a 300MHz Ultra 2 on '06 SpecFP?

mapesdhs
From Edinburgh, Scotland
Who joined Nov. 10, 2003, 4:17 p.m.

Wrote the following at July 17, 2008, 6:06 a.m...

R-ten-K wrote: SPEC is not multithreaded, throwing more cores does not change the SPEC results.

That's not true, the run rules state autoparallel optimisations are allowed, and most of the fp tests do use them.
Whether they use them efficiently is another matter, that's down to how well the compilers have been written. See
sections 4.2.2 and 4.2.3 of the run rules. The Itanium2 did not use the autoparallel option, but the XEON tests
did, though maybe this tells us more about the quality of the compilers than the hardware, but then that's the whole
focus of how Itanium2 is used anyway (I'd have liked to see the fp suite on the Itanium2 with the autoparallel turned
on, see what it did for some of the tests compared to the XEON).

In many cases when the option is not used, other cores/chips are still enabled - how these might help the final
result is hard to say, hopefully not much (background system services, etc. could use the other cores).

As for the other systems, the Power6 and Core2Extreme did not use the option, while the Opteron did for just the
fp test.

Obviously an autoparallel option will be nothing like as efficient at using multiple cores as hand-designed threaded
code, but nevertheless that's how the tests are run on many of the systems.

Btw, the above would explain why the XEON gets such high numbers for the cactusADM and libquantum
tests (likewise the Core2Quad for libquantum), whereas the Core2Extreme is much slower, or would you
really expect one thread on a 3.16GHz XEON to be ten times faster than a single thread on a 2.93GHz
Core2Extreme? If the use of the autoparallel option is not the cause, then what is? I know XEONs have
different internal settings and optimisations, but a ten-fold speed difference for a single thread? To me,
it looks much more likely the compiler on the XEON is doing exactly what it's been asked to do, namely using
all 8 cores as best it can with automatic optimisation.
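A quick way to see the size of the gap is to divide out the clock difference. The Python sketch below does that for a few rows taken from the tables above (two fp/int outliers and two ordinary integer tests for contrast); the helper `excess_speedup` is a made-up name, and assuming that performance scales linearly with clock is a simplification.

```python
# (benchmark, XEON X5460 3.16GHz result, Core2Extreme X6800 2.93GHz result)
# Values copied from the SPEC2006 tables in this thread.
ROWS = [
    ("462.libquantum", 227.0, 17.9),
    ("436.cactusADM", 105.0, 20.2),
    ("403.gcc", 20.1, 13.3),
    ("400.perlbench", 25.4, 23.5),
]
CLOCK_RATIO = 3.16 / 2.93  # about 1.08x from clock speed alone


def excess_speedup(xeon, c2e, clock_ratio=CLOCK_RATIO):
    """How much faster the XEON result is than clock scaling alone would predict."""
    return (xeon / c2e) / clock_ratio


for name, xeon, c2e in ROWS:
    print(f"{name:16s} raw {xeon / c2e:5.1f}x  beyond clock {excess_speedup(xeon, c2e):5.1f}x")
```

On these numbers perlbench lands almost exactly on the clock ratio, while libquantum is more than ten times beyond it, which is the gap the autoparallel explanation is meant to account for.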

If I'm wrong and this is all cobblers, I'll be more than happy to amend my posts of course.

Ian.

R-ten-K
From Nor Cal
Who joined Nov. 15, 2004, 10:36 p.m.

Wrote the following at July 17, 2008, 9:20 a.m...

OK, let me expand on my answer.

SPECcpu (SPECint + SPECFP) are/were designed to test single CPU, not system performance. That sounds like a quaint distinction, but it really is not.

In fact, you can actually compile SPEC with all the system calls and static dataset generation, and basically run that SPEC executable without the OS. For each benchmark in SPEC, after initialization, there is no disk access, and it is usually run in single-user mode so as to limit interference from interrupts etc. That is, SPEC is designed to isolate the performance of your core and its associated memory subsystem. Not I/O, not gfx, not the OS, and not the compiler.

Whenever someone publishes a result with flags like autoparallel et al, a lot of people tend to discard those results. Why? Because when high levels of optimization are reported, there is a belief that there is a lot of interference by the compiler. In that case, it is not the CPU that you are measuring, but the combination of CPU+compiler. Note that in the cases where you are interested in the effect of compiler optimizations on performance, it makes perfect sense to take aggressive levels of optimization into account.

Now, Intel makes the claim that the Itanium is really a CPU+compiler combination. And that is true: unlike modern out-of-order machines, the IA64 is far more dependent on the compiler, since the quality of static scheduling and compiler hints makes a huge difference to its performance (heck, it is designed that way as a matter of fact).

The problem the architectural community has with very aggressive levels of compiler optimization by people like Intel is that, for the most part, there are limited levels of instruction parallelism in the benchmarks (by design actually), and that enough ILP to warrant more than 1 core is usually scheduled by hand. There is a ton of research on automatic parallelization of SPEC that basically concluded that it is not worth the effort (at least up to SPEC06, which may change in the next SPEC iteration).

However, Intel et al actually employ teams whose only job is to manually tweak code for SPEC, the reason being that the publicity that can be harvested from a good SPEC result is well worth the investment. That fact makes some people wary of taking results with very aggressive compiler optimizations without a grain of salt. The rest of us mere mortals don't make a living from running SPEC but actual code; most of the actual code we will run on our CPUs is not going to be optimized by hand...

And thus, one has to be careful how to read SPEC results. For the most part, as I said before, the understanding in the architectural community is that SPEC CPU is meant to isolate single core/memory subsystem performance. And even though some optimizations and deviations are available, those results tend not to give you the performance of what you are trying to isolate. To isolate the effect of multiple cores/arch support for multiprocessing, there are things like SPLASH-like suites, or even SPECrate (which is basically a bunch of SPEC CPU processes launched in parallel); the same goes for IO, networking, or even the GFX subsystem. Every benchmark suite is designed to provide a good estimate of the behavior of the subsystem you are trying to analyze.
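As a rough illustration of how a SPECrate-style throughput number differs from the single-copy speed metric, here is a minimal Python sketch. It assumes the per-benchmark rate is copies x (reference time / elapsed time) and that the overall figure is the geometric mean of those; the function names and all timings are invented for the example, not taken from any published result.

```python
import math


def rate_ratio(copies, ref_seconds, elapsed_seconds):
    """Throughput ratio for one benchmark: copies x (reference time / elapsed time)."""
    return copies * ref_seconds / elapsed_seconds


def overall(ratios):
    """Geometric mean of the per-benchmark ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Made-up timings, purely illustrative: 8 concurrent copies of two benchmarks.
ratios = [
    rate_ratio(copies=8, ref_seconds=9000.0, elapsed_seconds=600.0),  # 120.0
    rate_ratio(copies=8, ref_seconds=3000.0, elapsed_seconds=800.0),  # 30.0
]
print(overall(ratios))  # about 60
```

Each copy is an independent process, so this measures how well the cores and memory system cope with many jobs at once, not how well a compiler can split one job across cores.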

Heck, there are even some groups that speed up some of the kernels in SPECFP with the GPU, but in that case it is easy to understand why that is not useful. When you are interested in knowing the FP performance of your CPU, it doesn't help if part of that benchmark is accelerated by the GPU...

I did not have enough coffee yet, but does this make sense?

"Was it a dream where you see yourself standing in sort of sun-god robes on a
pyramid with thousand naked women screaming and throwing little pickles at you?"
