SGI: Hardware

Origin scheduling question

I set up my Origin 200 and, upon noticing that one of the CPUs had a 2MB cache and the other a 1MB cache, I made the node with the 2MB-cache module the master. Of course you want CPU0 to be the most powerful one; after all, it's the most used, and any Sun howto will tell you to put the most powerful CPU in the primary slot.

After getting IRIX installed and firing up gr_osview to see how things are going, I noticed that IRIX seems to default to scheduling more things on CPU1 :x. Is this the usual behavior, keeping CPU0 open for system operations, or is it just a temporary thing?
"Brakes??? What Brakes???"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
Well, I just hit a string of activities biased towards CPU0 load. Looks like it was just the luck of the draw.
"Brakes??? What Brakes???"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
The bootmaster is THE bootmaster (Monarch in HP terms), but after the OS is up and running the bootmaster/Monarch is just another CPU node, and the operating system may schedule whatever it chooses on it. In fact, every time I have to reboot/halt my Origin 2k, the reboot is initiated from a different CPU; i.e., no single node/CPU is a master anymore from the OS's perspective.
LAMMEN GORTHAUR
If something runs on a CPU, most smart schedulers will apply a weight in the scheduling algorithm to bias it toward that CPU again in the future, as the code/data is likely still cache resident (or "warm") and entries in the TLB are more likely to be useful. Moving it to a "cold" CPU means potential cache misses, cache-coherence traffic, and TLB misses.

This is in general; IRIX internals are a mystery, so I am not certain this is what is happening on your Origin.
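
Just to illustrate the idea, here's a toy sketch in C. It is not anything from the IRIX scheduler; pick_cpu, runq_len, and the AFFINITY_BONUS weight are all made up, it just shows how a "stay where your cache is warm" bonus biases CPU selection.

#include <limits.h>
#include <stdio.h>

#define NCPU           2
#define AFFINITY_BONUS 50   /* made-up weight for a presumably still-warm cache */

struct cpu { int runq_len; };

/* Pick the CPU with the lowest "effective" load; the CPU the thread last ran
   on gets a bonus, since its caches and TLB may still hold useful state. */
static int pick_cpu(const struct cpu cpus[NCPU], int last_cpu)
{
    int best = 0, best_score = INT_MAX;

    for (int i = 0; i < NCPU; i++) {
        int score = cpus[i].runq_len - (i == last_cpu ? AFFINITY_BONUS : 0);
        if (score < best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    struct cpu cpus[NCPU] = { { 5 }, { 3 } };          /* CPU0 busier than CPU1 */
    printf("schedule on CPU%d\n", pick_cpu(cpus, 0));  /* prints CPU0: affinity wins */
    return 0;
}

With the bonus in place the thread sticks to CPU0 even though CPU1's queue is shorter; without it, it would bounce to CPU1 and pay the cache and TLB misses all over again.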
It would also matter where your I/O cards are, right?

Each of the two nodeboards has a connection to each XBOW (and there are two), and the nodeboards take ownership of the I/O cards. Normally one CPU/Heart takes ownership of a given card, and all communication and ISRs for it go through that CPU/Heart.

If you look in /var/sysgen/system/irix.sm around line 573 (for IRIX 6.5.28), you'll find directives to control nodeboard ownership and interrupt routing.

This way, if you had a lot of heavy I/O on XBOW #1 and a choice between CPUs with different cache sizes (which can make a largish difference for ISRs that are constantly being triggered; having 2MB of cache could make a large difference in some cases) or a more powerful CPU, you could make sure the best CPU is servicing the hardware on that XBOW, and that CPU will handle the extra work and interrupts.

Line 573 contains the NOINTR directive, which excludes a CPU from servicing interrupts.
Line 583 contains DEVICE_ADMIN, which assigns CPU ownership for devices on that XBOW. (Picking the owning CPU makes sense: even if you assign a CPU that's on an XBOW not connected to the I/O port, you still have to go through the owning nodeboard's Heart on the way, I/O -> XBOW -> owning Heart -> router -> nodeboard XBOW -> nodeboard Heart -> CPU, compared to XBOW -> Heart -> CPU.)
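
For what it's worth, the directives look roughly like this. The line numbers above and the /hw paths below are only illustrative (from memory of a 6.5 irix.sm, and they differ per machine), so check the comments in your own copy:

* Keep CPU 0 out of the pool of interrupt-servicing CPUs:
NOINTR: 0

* Route a device's interrupts to a particular CPU; the /hw paths here are
* just examples, use hinv or browse /hw to find the real ones on your box:
DEVICE_ADMIN: /hw/module/1/slot/io1/baseio/pci/0/scsi_ctlr/0 INTR_TARGET=/hw/module/1/slot/n1/node/cpu/a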

/var/sysgen/system/numa.sm also contains some NUMA directives, which are handy (especially on a busy system, or a system low on memory running an application that isn't NUMA-aware). Migration is turned off by default; turning it on could make a large difference for processes that are memory hungry and have CPUs serially accessing large amounts of memory, though in that case oview would show a large amount of traffic through the XBOWs/routers. Migration is one of the cooler features of SGI's NUMA hardware, and they widely state it's one of the things that makes their NUMA high performance, yet it's disabled by default. I guess they have their reasons, but..... I suppose it doesn't matter much now that memory is so cheap. You can also crank down the kernel replication so it isn't using as much memory on each nodeboard.
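
If memory serves, turning migration on is just a couple of directives in numa.sm; the names and values below are from memory of the 6.5 tuning docs and may not match your release exactly, so trust the comments in the file over this:

* Enable dynamic page migration for user memory (shipped disabled):
NUMA_MIGR_DEFAULT_MODE: ON

* How heavily a remote node must be hitting a page before it gets migrated:
NUMA_MIGR_DEFAULT_THRESHOLD: 50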
Like I said, it seems to have evened out later.

I think I recall reading that (in addition to the cache bit) SGI NUMA archs try to keep things local to a node to minimize remote memory access, and since I have only one processor per node, that would necessarily mean keeping it to a single processor.

I just wasn't sure - MP SPARCstations recommend the faster processor be the first one, wasn't sure if SGI had gotten away from this (seems like they have).

On the plus side, it's noticeably faster than the old I2 ;)
"Brakes??? What Brakes???"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
SAQ wrote: Like I said, it seems to have evened out later.

I think I recall reading that (in addition to the cache bit) SGI NUMA archs try to keep things local to a node to minimize remote memory access, and since I have only one processor per node, that would necessarily mean keeping it to a single processor.

I just wasn't sure - MP SPARCstations recommend the faster processor be the first one, wasn't sure if SGI had gotten away from this (seems like they have).

On the plus side, it's noticeably faster than the old I2 ;)

Just like any other NUMA computer... The bootmaster is nothing but a normal node after the kernel is up. One can certainly "influence" the scheduling of the bottom halves of the interrupt handlers, but to what extent? There are no real interrupts in the XIO environment, there are only packets routed through the fabric.
LAMMEN GORTHAUR
chervarium wrote:
SAQ wrote: Like I said, it seems to have evened out later.

I think I recall reading that (in addition to the cache bit) SGI NUMA archs try to keep things local to a node to minimize remote memory access, and since I have only one processor per node, that would necessarily mean keeping it to a single processor.

I just wasn't sure - MP SPARCstations recommend the faster processor be the first one, wasn't sure if SGI had gotten away from this (seems like they have).

On the plus side, it's noticeably faster than the old I2 ;)

Just like any other NUMA computer... The bootmaster is nothing but a normal node after the kernel is up. One can certainly "influence" the scheduling of the bottom halves of the interrupt handlers, but to what extent? There are no real interrupts in the XIO environment, there are only packets routed through the fabric.


Well, in the classic sense, the processor isn't doing a PUSHA/IRET (on an i386 arch) and everything is still bridged by the fabric (I thought the CPUs each still had one main interrupt line on the Origins???), but nonetheless the CPU has to stop executing what it's doing, change contexts, and start hashing through a different part of memory entirely. Actually, doing a PC-style interrupt on an OS that doesn't do very good protection is likely far less overhead than in IRIX, but either way it must have a very measurable impact on the L1/L2 cache stats. I also thought IRIX (within the XBOW) let you manage both the top and bottom halves of the ISR... Hmmm, I'm going to start digging through code; I've never looked at the way IRIX handles interrupts or where (if at all) it's still tuned for low latency and good RT performance.

I mean (forgive my incorrect/incomplete asm example, it's been a while):

ISR_IS_HERE:
        push ax
        push bx
        in   al, 0x3f7              ; read the byte waiting at the device's data port
        mov  bx, [ring_head]        ; current write position in the FILO ring
        mov  [bx], al               ; store the byte
        inc  word ptr [ring_head]   ; advance the ring pointer (no wrap handling here)
        mov  al, 0x20
        out  0x20, al               ; send EOI to the 8259 PIC
        pop  bx
        pop  ax
        iret

That's a pretty short amount of work on plain old PeeCees (yes, I know, old non-protected-mode code; all the assembly I did 15 years ago was on embedded computers and was mostly just doing basic I/O for disks and some user interface. I used A86/D86, which didn't stress correct syntax either, so ....)

Does anyone have much low-level knowledge of the LINC chip that's on the more complex cards? I keep thinking about it, and having a 132MHz R4650 just to bridge the two PCI buses (PPCI, CPCI) seems like it would just add more latency than an FPGA or some custom silicon (well, other custom silicon).

Is the LINC there to translate the very large addresses used in the big SN0/1 systems into something that fits nicely into the PCI spec?

Or was the LINC more for performance, there to coprocess the DMA (the LINC seems to do a lot of scatter/gather)?

Is the LINC code standard across all LINC-bearing cards, or is it tweaked for the specific card it's on? I suppose it makes a lot more sense if each LINC uses custom code, and the point is to use commodity silicon for I/O to keep the costs low while still coprocessing to keep the impact on the CPUs lower. It just seems that it would still be cheaper and faster to do even the most complex function (DMA scatter/gather) in an (FP)gate array.