SGI: Discussion

Altix NUMA systems apparently moving to x86

This news bit (German) talks about planned upgrades of the new northern German "HLRN-II" Altix installation - which currently consists solely of Altix ICE x86 clusters.

The interesting part is the second passage. It talks about two additional planned "Ultra Violet" SMP systems each consisting of 2176 CPUs (136 blades with two 8-core Nehalem-Xeons each) and 8.7 TB of memory in only five large nodes - using NUMALink 5 :)

Wonder if this will mark the end of their IA64 lineup (wouldn't be all that surprising i suppose :-) ) or if they'll offer both variants..
schleusel wrote: This news bit (German) talks about planned upgrades of the new northern German "HLRN-II" Altix installation - which currently consists solely of Altix ICE x86 clusters.

The interesting part is the second passage. It talks about two additional planned "Ultra Violet" SMP systems each consisting of 2176 CPUs (136 blades with two 8-core Nehalem-Xeons each) and 8.7 TB of memory in only five large nodes - using NUMALink 5 :)

Wonder if this will mark the end of their IA64 lineup (wouldn't be all that surprising i suppose :-) ) or if they'll offer both variants..

As the Nehalem is a NUMA design (Intel's NUMA implementation goes under the name "QuickPath Interconnect"), I suspect SGI and other HPC manufacturers will regard x86 as a more worthy platform for HPC.
So I think you are right, they will probably use x86 more in future designs.

//Harry
Mein Führer, I can walk!
A couple of months ago someone on here pointed out to me that some Altix systems were now shipping with x86 inside. Immediately I wondered if it was the start of a move from Itanic. They are very different chips, and I don't see how a company in SGI's state can justify the expense of engineering systems using both. When skywriter wanted to know if we had any questions for SGI I was going to suggest asking about x86 vs Itanium in the long term, but didn't, as I figured SGI would never give a straight answer.
From a technical standpoint, aren't there still potential performance benefits from the VLIW architecture (provided that Intel has a way to deal with the scalability issues - i.e. not wasting units if a future version is wider and attempts to run current software), or has current OoO technology eliminated the edge?
"Brakes??? What Brakes???"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
SAQ wrote: From a technical standpoint, aren't there still potential performance benefits from the VLIW architecture (provided that Intel has a way to deal with the scalability issues - i.e. not wasting units if a future version is wider and attempts to run current software), or has current OoO technology eliminated the edge?

VLIW just moves a lot of complexity into software (compilers and apps), so for those willing to put in the effort in optimising code, there possibly are, but X86 speed advances so fast you have to ask is it worth the effort. Interestingly, it seems a trend is emerging for simpler CPU design as well - UltraSPARC and Cell (PPC) both reduce the amount of OOO technology in favour of adding more parallelism (Chip Multithreading in SPARC, Coprocessors with fast local memory in the case of Cell). The X86 approach is probably the safest bet though :-)
you've got a 1.66GHz Itanium 2 dual core that does 4 FP ops per cycle, so that's 8 FP per cycle at about 1.5GHz, so 12Gflop per socket.

you've got core 2 xeons with 4 cores at 3GHz+ with I think 2 flop/cycle, or 24Gflop per socket.

twice the performance, the chips are cheaper, and the architecture's orders of magnitude cheaper, although Intel's moving both to QuickPath so the Itanium architecture premium would drop out to some extent, so you'd just have the excessive cost of the Itanium chip, with its anemic performance.
*edit* 24gflops, not 14
TeeTylerToe wrote: you've got core 2 xeons with 4 cores at 3GHz+ with I think 2 flop/cycle, or 14Gflop per socket.


Actually even 4 double precision FLOPs/cycle if you include the SIMD units i think (2 x87 + 2 SSE2/3), i.e. 48 GFLOPs/s theoretical peak for the 3GHz Quad Core.
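
For reference, the back-of-envelope peak figures in the last few posts all come from the same product: cores x clock (GHz) x FLOPs per cycle per core. A minimal Python sketch, using only the numbers quoted in this thread (the function name `peak_gflops` is made up for illustration, and these are theoretical peaks, not measurements):

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak per socket: cores * clock (GHz) * FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle

# Figures exactly as quoted in the posts above, not measured values.
itanium2 = peak_gflops(cores=2, clock_ghz=1.5, flops_per_cycle=4)    # "about 1.5GHz"
xeon_2flop = peak_gflops(cores=4, clock_ghz=3.0, flops_per_cycle=2)  # first estimate
xeon_4flop = peak_gflops(cores=4, clock_ghz=3.0, flops_per_cycle=4)  # counting the SIMD units

print(itanium2, xeon_2flop, xeon_4flop)
```

With those inputs it gives 12, 24 and 48 GFLOPs per socket, matching the numbers posted.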
schleusel wrote: Actually even 4 double precision FLOPs/cycle if you include the SIMD units i think (2 x87 + 2 SSE2/3), i.e. 48 GFLOPs/s theoretical peak for the 3GHz Quad Core.


And yet for some codes IA64 can be 100% faster than a 3GHz XEON. As always, it depends what you're doing.
Comparing based on peak fp ops is far too simplistic; there are way too many other issues involved, from cache to memory
bandwidth, etc.

SPEC results show IA64 to be quite strong for fluid dynamics and QCD, easily beating XEONs with 2X higher clock speeds.
For other tests the XEON is better, still others about the same. Here's a table of results (not updated since Jan/08, must
plough through the data again soon):


SPEC2006 CPU Superiority Table, by Ian Mapleson <[email protected]>

Last Change: 16/Jul/2008 (reference adjustments, new CPU data not added since Jan/08)


Core2Extreme Overclocking Reference:

http://www.nordichardware.com/Reviews/?skrivelse=487&page=5

NOTE: The 4.7GHz Core2Extreme numbers in these tables are
_extrapolated_ from the published Dell 390 results, ie. linear
extrapolation and then reduced by 5%. They are a guide only!!

Source Refs:

http://www.spec.org/cpu2006/results/cfp2006.html
http://www.spec.org/cpu2006/results/cint2006.html


FLOATING POINT SPEC2006 RESULTS (ch = chip, cr/crs = core/cores, Para = Autoparallelize option used?):

(SPECfp2006)       HP rx2660       HP ProLiant   IBM P570       Dell 390         Overclocked     Fujitsu
                   Integrity       DL360-G5 XEON Power6         Core2Extreme     Core2Extreme    CELSIUS V840
                   1.66GHz IA64    5460, 3.16GHz 4.7GHz         X6800, 2.93GHz   X6800, 4.7GHz   Opteron 3GHz
                   1 ch, 2 crs     2 ch, 8 crs   1 ch, 1 cr     1 ch, 2 crs      1 ch, 2 crs     2 ch, 4 crs
                   Para: No        Para: Yes     Para: No       Para: No         Para: No        Para: Yes

410.bwaves         44.7            32.9          67.3           20.9             31.6            60.8
416.gamess         11.5            23.7          14.8           17.3             26.4            17.1
433.milc           21.3            10.8          20.0           11.4             17.4            13.2
434.zeusmp         19.3            16.8          26.4           16.0             24.4            12.4
435.gromacs        17.3            20.4          12.6           17.0             25.9            15.4
436.cactusADM      36.5           105.0          24.8           20.2             30.8            39.1
437.leslie3d       27.9            19.5          31.2           13.3             20.3            9.45
444.namd           26.8            16.8          14.1           15.1             23.0            13.9
447.dealII         21.6            30.6          23.8           18.0             27.4            25.0
450.soplex         11.6            15.7          21.4           14.9             22.7            11.8
453.povray         11.5            30.7          15.4           26.1             39.8            19.4
454.calculix       16.7            24.8          18.7           13.9             21.2            8.62
459.GemsFDTD       19.7            19.2          21.3           12.0             18.3            23.3
465.tonto          13.1            25.3          17.5           12.6             19.2            17.5
470.lbm            37.6            24.3          53.7           14.0             21.3            15.3
481.wrf            16.6            22.3          12.3           17.0             25.9            19.5
482.sphinx3        24.4            28.4          36.3           22.9             34.9            15.3


SPECfp2006 Peak

HP ProLiant DL360-G5 XEON X5460 3.16GHz:       23.9
IBM p570, Power6 4.7GHz:                       22.5
HP Integrity rx2660, Itanium2 1.66GHz:         20.4
Fujitsu CELSIUS V840 Opteron 2222 3GHz:        17.4
Dell 390, Core2Extreme X6800, 2.93GHz:         16.2


Itanium2: http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071015-02285.html
XEON:     http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071029-02381.html
Power6:   http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071030-02420.html
Core2Ext: http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20061226-00189.html
Opteron:  http://www.spec.org/cpu2006/results/res2007q3/cpu2006-20070802-01587.html



INTEGER SPEC2006 RESULTS (ch = chip, cr/crs = core/cores, Para = Autoparallelize option used?):

(SPECint2006)       HP Integrity     Intel 3.16GHz  IBM P570       Dell 390         Overclocked    ASUS Intel     HP ProLiant
                    rx2660 1.66GHz   HP ProLiant    Power6         Core2Extreme     Core2Extreme   Core2Quad      DL185 G5
                    Itanium2 IA64    XEON-X5460     4.7GHz         X6800, 2.93GHz   X6800, 4.7GHz  QX9650 3GHz    Opteron 3GHz
                    1 ch, 2 crs      2 ch, 8 crs    1 ch, 1 cr     1 ch, 2 crs      1 ch, 2 crs    1 ch, 4 crs    2 ch, 4 crs
                    Para: No         Para: Yes      Para: No       Para: No         Para: No       Para: Yes      Para: No

400.perlbench       11.7             25.4           13.7           23.5             35.8           24.8           16.6
401.bzip2           11.4             18.4           16.1           15.7             23.9           17.2           11.6
403.gcc             12.9             20.1           19.8           13.3             20.1           23.3           12.6
429.mcf             19.4             24.3           36.8           23.8             36.3           29.9           18.1
445.gobmk           15.0             22.8           17.9           21.5             32.8           22.4           18.0
456.hmmer           28.6             28.8           17.1           12.3             18.8           17.2           21.2
458.sjeng           12.4             20.7           14.9           19.6             29.9           20.5           14.9
462.libquantum      61.4            227.0           96.6           17.9             27.3          105.0           29.5
464.h264ref         22.6             35.2           30.6           31.6             48.2           33.4           23.1
471.omnetpp          9.1             17.0           18.5           15.8             24.1           21.3           10.9
473.astar           15.6             17.2           13.8           14.2             21.6           17.4           10.6
483.xalancbmk       16.6             26.9           18.3           20.7             31.6           24.6           15.3


SPECint2006 Peak

HP ProLiant DL360 G5, XEON-X5460 3.16GHz:      27.6
ASUS P5E3 Deluxe, Core2Quad QX9650 3GHz:       25.5
IBM p570, Power6 4.7GHz:                       21.6
Dell 390, Core2Extreme X6800, 2.93GHz:         18.5
HP Integrity rx2660, 1.66GHz Itanium2:         17.0
HP Proliant DL185 G5 Opteron 3GHz:             16.1


Itanium2:  http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071015-02284.html
XEON:      http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071029-02385.html
Power6:    http://www.spec.org/cpu2006/results/res2007q2/cpu2006-20070518-01096.html
Core2Ext:  http://www.spec.org/cpu2006/results/res2007q1/cpu2006-20061226-00186.html
Core2Quad: http://www.spec.org/cpu2006/results/res2007q4/cpu2006-20071112-02554.html
Opteron:   http://www.spec.org/cpu2006/results/res2008q1/cpu2006-20071220-02914.html

*******************************************************************************************************************

To aid SGI users, here is a SPEC2000 comparison between R14K/600MHz,
the older Itanium2 1.6GHz IA64 (before the newer L2 sizes), as well as
some other CPUs from the past, listed in order of aggregate fp
performance (extrapolated R16K/1GHz included as a comparative guess):


System                CPU             MHz    Cache Sizes          SPECint2000     SPECfp2000

SGI Altix 4700/D      Itanium2 9040   1600   16/16 + 1MB/256K + 24MB     ?            3098 [b]
SGI Altix 4700/D      Itanium2 9040   1600   16/16 + 1MB/256K + 18MB     ?            3001 [b]
SGI Altix XE 210      P4 XEON 5160    3000   32/32 + 4MB              2937 [b]        2752 [b]
IBM p570              Power5+         1900   64/32 + 2MB + 36MB       1453            2733
SGI Altix 3700 Bx2    Itanium2        1600   16/16 + 256K + 9MB       1441 [b]        2647 [b]
HP rx4640             Itanium2        1600   16/16 + 256K + 6MB       1590            2612
SGI Altix 3700 Bx2    Itanium2        1600   16/16 + 256K + 6MB       1410 [b]        2600 [b]
SGI Altix 3700 Bx2    Itanium2        1500   16/16 + 256K + 4MB       1242 [b]        2361 [b]
SGI Altix 3000        Itanium2        1500   16/16 + 256K + 6MB       1077            2148
SGI Altix 350         Itanium2        1400   16/16 + 256K + 3MB       1078            1942
SGI Altix 3000        Itanium2        1300   16/16 + 256K + 3MB        875            1854
AMD/ASUS              Opteron 150     2400   64/64 + 1MB              1663            1849
Intel D925            P4-X            3466   12/8 + 512 + 2MB         1772            1724
HP Alpha GS1280       21364           1300   64/64 + 2MB               994            1684
SGI Altix 350         Itanium2        1400   16/16 + 256K + 1.5MB      986            1684
Intel D925            P4              3600   12/8 + 1MB               1575            1630
SGI Altix 350         Itanium2        1000   16/16 + 256K + 1.5MB      743            1374
Fujitsu               SPARC64-V       1350   128 + 128K/2MB            905            1340
Apple                 PPC970 (G5)     2000   64/32 + 512K              800             840
Origin 3000 [*]       R16000          1000   32/32 + 16MB              792             838
HP rx4610             Itanium          800   16/16 + 96K + 4MB         379             701
HP c3750              PA-8700          875   768K/1.5MB                678             674
HP                    Pentium-M       1000   32/32 + 1MB               687             552
SGI Origin 3200       R14000A          600   32/32 + 8MB               500             529
SGI Origin 300        R14000A          600   32/32 + 4MB               483             495
SGI Origin 3200       R14000           500   32/32 + 8MB               427             463
SGI Origin 3200       R12000           400   32/32 + 8MB               353             407
SGI 2200              R14000           500   32/32 + 8MB               412             386
SGI Origin 300        R14000           500   32/32 + 2MB               379             378
SGI 2200              R12000           400   32/32 + 8MB               347             343
SGI 2100              R12000           350   32/32 + 8MB               289             293
SGI Origin 200        R12000           360   32/32 + 4MB               298             290
SGI 2200              R12000           300   32/32 + 8MB               264             283


[*] Extrapolation based on R14K/600, done as: (1000/600) * 0.95 * result. Real speed might be
faster due to larger L2 and higher L2 speed, or might be less due to simple clock increase
not benefiting every code in the same way. Guesstimate only!!

REFS: http://www.cl.cam.ac.uk/teaching/2006/CompArch/mainnotes/comparch-2.pdf
http://www.spec.org/cpu2000/results/cint2000.html
http://www.spec.org/cpu2000/results/cfp2000.html
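
The [*] extrapolation used for the R16K/1GHz row can be written out as a short Python sketch. It only applies the stated rule, (1000/600) * 0.95 * result, to the R14000A/600 Origin 3200 row; the function name and the `derate` parameter are labels invented here, and the same caveats apply (guesstimate only):

```python
def extrapolate(result: float, base_mhz: float, target_mhz: float,
                derate: float = 0.95) -> float:
    """Linear clock scaling, then reduced by 5% (the rule given in the [*] note)."""
    return (target_mhz / base_mhz) * derate * result

# SGI Origin 3200, R14000A/600 row from the SPEC2000 table.
r14k_600 = {"SPECint2000": 500, "SPECfp2000": 529}

# Scale each score from 600MHz to 1000MHz and round to a whole number.
r16k_1000 = {name: round(extrapolate(score, 600, 1000))
             for name, score in r14k_600.items()}

print(r16k_1000)
```

This reproduces the 792 / 838 shown for the Origin 3000 R16000 entry.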



In the end though, cost is likely to be the deciding factor, just as it was for SGI, which would be ironic given it was
the IA64 programme which pretty much killed off SGI's future CPUs in the 1st place (staff moving to Intel, etc.)

Ian.
[email protected]
+44 (0)131 476 0796
ummm, didn't the 3.16GHz beat the itanium2 in both? not as well as I predicted. also, how many cores, and sockets.
TeeTylerToe wrote: ummm, didn't the 3.16GHz beat the itanium2 in both? not as well as I predicted. also, how many cores, and sockets.

usually those tests are just a single core if the cpu has multiple cores
:Indy: :rx2600: :Indigo2: :Indigo2: :Indy: :Indy:
D-EJ915 writes:
> usually those tests are just a single core if the cpu has multiple cores

That's generally true for the integer tests, but most of the fp tests are done in parallel, ie. cores/CPUs
enabled, Autoparallel compiler options turned on.


TeeTylerToe writes:
> ummm, didn't the 3.16GHz beat the itanium2 in both? ...

Both what? Please don't say you're judging based on the final averages! :D The key point is that one
should compare those tests closest to one's application, and in that regard there are many codes
where the XEON is much slower than IA64. And the answer to your next question might surprise you.
Comparing based on final SPEC averages is dumb and tells one nothing (eg. the final SPEC averages
for O2 hide an order of magnitude difference between lowest and highest). Look at the codes where
IA64 does well (lots of Fluid Dynamics stuff), when you compare the difference in the number of
cores and clock speeds involved, the IA64's results are very impressive.
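
To illustrate how much the final averages hide, here is a small Python sketch over the SPECfp2006 columns for the Itanium2 and the XEON, transcribed from the table earlier in the thread. SPEC's overall score is a geometric mean of the per-benchmark ratios, so assuming the figures are transcribed correctly, the computed means should land close to the 20.4 and 23.9 listed there. The variable names are just for this example:

```python
from math import prod

# SPECfp2006 per-benchmark scores, in table order (410.bwaves ... 482.sphinx3).
ia64 = [44.7, 11.5, 21.3, 19.3, 17.3, 36.5, 27.9, 26.8, 21.6,
        11.6, 11.5, 16.7, 19.7, 13.1, 37.6, 16.6, 24.4]
xeon = [32.9, 23.7, 10.8, 16.8, 20.4, 105.0, 19.5, 16.8, 30.6,
        15.7, 30.7, 24.8, 19.2, 25.3, 24.3, 22.3, 28.4]

def geomean(scores):
    """Geometric mean: n-th root of the product of n scores."""
    return prod(scores) ** (1.0 / len(scores))

for name, scores in (("IA64", ia64), ("XEON", xeon)):
    print(f"{name}: geomean {geomean(scores):.1f}, "
          f"min {min(scores)}, max {max(scores)}")

# Per-benchmark ratio: above 1.0 means the Itanium2 system scored higher.
ratios = [a / b for a, b in zip(ia64, xeon)]
print(f"IA64/XEON ratio ranges from {min(ratios):.2f} to {max(ratios):.2f}")
```

The two averages differ by under 20%, while the per-test ratio swings from roughly a third (cactusADM) to nearly 2X (milc), which is why picking the tests closest to your own application matters.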


> ... not as well as I predicted. also, how many cores, and sockets.

The XEON was using 2 chips, 4 cores/chip, 8 cores total, whereas the IA64 was only using 1 chip with 2
cores. The IA64 does much better than people think, ie. a single dual-core 1.66GHz Itanium2 can be 2X
faster than two quad-core XEONs. This doesn't apply to all codes (certainly not), but for those
where the two XEONs are faster one has to remember it takes two XEONs to achieve such a difference.
I've amended the table above, and added URLs for the page results. Quite a few new results out now,
like the Dell T7400, so I'll wade through the spec pages and update the tables soon. It's not linked to
from my site index, but the file has always been available here. I use it as a personal reference when
researching performance issues and, if I can, update it 3 or 4 times a year.

Note that the Power6 result is interesting - it's only using 1 core on the chip. Strange that IBM didn't
submit results using both, though maybe they have now, I've not checked yet. Oh, I'm only referring to
the fp results here; for int, the XEON does much better, at least for the tests in the SPEC suite anyway
(large scale apps using lots of CPUs might behave differently).

Anyway, all this anti-IA64 sentiment is silly. Some of what went into IA64 came from ex-SGI people and
others familiar with SGI's ideas for H1/H2. I was very against IA64 early on, partly anti-Intel bias, partly
the loss of SGI's next-gen CPUs, etc., but after talking to John Mashey about it (the STREAM guy)
I was satisfied the result was going to be a good design, which it was/is, alas the late release caused
other problems (never used in O2K). If Itanium fails, it will be on cost grounds and like factors, not
performance. Speed-wise, it's a very efficient design, ie. work done per clock cycle per core. It would
have been nice to have H2, etc., but it was never going to happen for cost reasons (faster than IA64
as originally planned, but not fast enough to justify the higher cost).

It was a long time ago now, so I'm sure John wouldn't mind if I quoted his 1998 email to me:


IA64 is *very* different in almost every conceivable way from an IA32 in architecture, emphasis, and
performance; it is very much what we might have designed had we been able to do a new ISA that
no longer had to be upwards compatible with anything we had (MIPS has run out of some opcode
slots, and will always have 32 each of integer and FP registers). It is publicly known that IA64 has
128 each of integer and FP registers, and if you care about FP you will like that. There are numerous
other features where people have learned, and I found many features that probably first appeared in
MIPS chips, but in some cases cleaner. I studied the manuals looking for showstoppers, and was
heartily relieved not to find any such, and I did find features that I'd been specing for H2.

Anyway, I can't say anything that isn't public, but I would say:

a) This is a good architecture and a good chip, and if you liked MIPS chips, you will like IA64, even if
you despise IA32s.

b) In various strange ways, this architecture and implementation almost seem designed as better for
SGI than for anybody else,  even HP.

c) As it happens, the threads of ideas that led to the R8000 and SGI compiler technologies came
from people who'd worked on related technologies at other companies, and worked with people
with similar ideas, who went to Intel & HP, and strangely enough, some of what's seen in the chips
is *very* familiar.

So, anyway, your concerns are well-taken; I had some of the identical ones; my boss (Forest Baskett,
our CTO, and one of those who worked on the Stanford MIPS) & I have both looked carefully at
this thing, and both feel that this will be a good chip for our customers ... it is *not* an X86, even if it is
able to run that code. By now, people understand how to do 64-bit instruction sets, so that's not
that hard any more; the real issue is in a myriad of other details, which is why I spent hours studying
the manuals, and I sure felt a lot better after I'd read them.


Ian.
SPEC is not multithreaded, throwing more cores does not change the SPEC results.
"Was it a dream where you see yourself standing in sort of sun-god robes on a
pyramid with thousand naked women screaming and throwing little pickles at you?"
some interesting things going on:
UltraSPARC T2 Plus Sun T5240 '06 FP Rate 2 socket - 111/119
http://www.spec.org/cpu2006/results/res ... 04061.html
UltraSPARC T2 Plus Sun T5240 '06 INT Rate 2 socket - 142/157
http://www.spec.org/cpu2006/results/res ... 04063.html
Xeon X7350 Poweredge R900 FP Rate 4 socket - 108/119
http://www.spec.org/cpu2006/results/res ... 02355.html
Opteron 8360 SE PowerEdge R905 FP Rate 4 socket - 152/166
http://www.spec.org/cpu2006/results/res ... 04224.html
Xeon X7350 NovaScale R480E1 Int rate 4 socket - 177/213
http://www.spec.org/cpu2006/results/res ... 03206.html

http://www.spec.org/cpu2006/results/res ... 04225.html
167/193 Int Rate 2.5GHz 4 socket Opteron

http://www.spec.org/cpu2006/results/res ... 00934.html
80.8/82.2 FP rate itanium NovaScale 3045 4 socket 1.6GHz

can someone explain why the Sun Fire 280R 750MHz USIII does about as well as a 300MHz Ultra 2 on '06 SpecFP?
R-ten-K wrote: SPEC is not multithreaded, throwing more cores does not change the SPEC results.


That's not true, the run rules state autoparallel optimisations are allowed, and most of the fp tests do use them.
Whether they use them efficiently is another matter, that's down to how well the compilers have been written. See
sections 4.2.2 and 4.2.3 of the run rules. The Itanium2 did not use the autoparallel option, but the XEON tests
did, though maybe this tells us more about the quality of the compilers than the hardware, but then that's the whole
focus of how Itanium2 is used anyway (I'd have liked to see the fp suite on the Itanium2 with the autoparallel turned
on, see what it did for some of the tests compared to the XEON).

In many cases when the option is not used, other cores/chips are still enabled - how these might help the final
result is hard to say, hopefully not much (background system services, etc. could use the other cores).

As for the other systems, the Power6 and Core2Extreme did not use the option, while the Opteron did for just the
fp test.

Obviously an autoparallel option will be nothing like as efficient at using multiple cores as hand-designed threaded
code, but nevertheless that's how the tests are run on many of the systems.

Btw, the above would explain why the XEON gets such high numbers for the cactusADM and libquantum
tests (likewise the Core2Quad for libquantum), whereas the Core2Extreme is much slower, or would you
really expect one thread on a 3.16GHz XEON to be ten times faster than a single thread on a 2.93GHz
Core2Extreme? If the use of the autoparallel option is not the cause, then what is? I know XEONs have
different internal settings and optimisations, but a ten-fold speed difference for a single thread? To me,
it looks much more likely the compiler on the XEON is doing exactly what it's been asked to do, namely using
all 8 cores as best it can with automatic optimisation.
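
The argument above can be checked with simple arithmetic. This Python sketch compares the score ratio between the XEON and the Core2Extreme with their clock ratio, using values copied from the tables earlier in the thread; 416.gamess is included only as a contrasting test where the gap is small. It shows the size of the gap, not its cause, which remains my guess:

```python
# (XEON X5460 3.16GHz score, Core2Extreme X6800 2.93GHz score), from the tables.
scores = {
    "436.cactusADM": (105.0, 20.2),
    "462.libquantum": (227.0, 17.9),
    "416.gamess": (23.7, 17.3),   # contrast case with a much smaller gap
}

clock_ratio = 3.16 / 2.93  # what clock speed alone would account for

# Ratio of XEON score to Core2Extreme score for each test.
speedups = {name: xeon / c2e for name, (xeon, c2e) in scores.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.1f}x score ratio vs {clock_ratio:.2f}x clock ratio")
```

A clock advantage of under 10% against score ratios of roughly 5X and 12X is the discrepancy in question.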

If I'm wrong and this is all cobblers, I'll be more than happy to amend my posts of course.

Ian.
OK, let me expand on my answer.

SPECcpu (SPECint + SPECFP) are/were designed to test single CPU, not system performance. That sounds like a quaint distinction, but it really is not.

In fact you can actually compile SPEC with all the system calls and static dataset generation, and basically run that spec executable without the OS. For each benchmark in SPEC, after initialization, there is no disk access and usually it is run in single user mode so as to limit the interference of interrupts etc. That is, SPEC is designed to really point at the performance of your core and its associated memory subsystem. Not IO, not gfx, not OS, and not the compiler.

Whenever someone publishes a result with flags like autoparallel et al, a lot of people normally tend to discard those results. Why? Because when high levels of optimization are reported, there is a belief that there is a lot of interference by the compiler. In that case, it is not the CPU that you are measuring, but the combination of CPU+compiler. Note that in the cases in which you may be interested in knowing the effect of the compiler optimizations on performance, it makes perfect sense to take into account aggressive levels of optimization.

Now, Intel makes the claim that the Itanium is really a CPU+compiler combination. And that is true, unlike modern out-of-order machines, the IA64 is far more dependent on the compiler, since the quality of static scheduling and compiler hints do make a huge difference in its performance (heck it is designed that way as a matter of fact).

The problem the architectural community has with very aggressive levels of compiler optimization by people like Intel is that for the most part, there are limited levels of instruction parallelism in the benchmarks (by design actually) and that enough ILP to warrant more than 1 core is usually scheduled by hand. There is tons of research on automatic parallelization of SPEC that basically concluded that it is not worth the effort (at least up to SPEC06, which may change in the next SPEC iteration).

However, Intel et al actually employ teams whose only job is to manually tweak code for SPEC. The reason being that the publicity that can be harvested from a good SPEC result is well worth the investment. That fact makes some people wary of taking results with very aggressive compiler optimizations without a grain of salt. That is because the rest of us mere mortals don't make a living from running SPEC but actual code, and most of the actual code we will run on our CPUs is not going to be optimized by hand...

And thus, one has to be careful how to read SPEC results. For the most part, as I said before, the understanding in the architectural community is that SPEC CPU is there to isolate single core/memory subsystem performance. And even though some optimizations and deviations are available, those results tend not to give you the performance of what you are trying to isolate. To isolate the effect of the multiple cores/arch support for multiprocessing, there are things like SPLASH-like suites, or even SPECRate (which is basically a bunch of SPEC CPU processes launched in parallel); same for IO, or networking, or even the GFX subsystem. Every benchmark suite is designed to try to provide a good behavior estimation of the subsystem you are trying to analyze.

Heck, there are even some groups that speed up some of the kernels in SPECFP with the GPU, but in that case it is easy to understand why that is not useful. When you are interested in knowing the FP performance of your CPU, it doesn't help if part of that benchmark is accelerated by the GPU...

I did not have enough coffee yet, but does this make sense?
"Was it a dream where you see yourself standing in sort of sun-god robes on a
pyramid with thousand naked women screaming and throwing little pickles at you?"