Like I said, it seems to have evened out later.
I think I recall reading that (in addition to the cache bit) SGI NUMA archs try to keep things local to a node to minimize remote memory access, and since I have only one processor per node, that would necessarily mean keeping it to a single processor.
I just wasn't sure. On MP SPARCstations the recommendation is to put the faster processor first, and I didn't know whether SGI had gotten away from this (it seems they have).
On the plus side, it's noticeably faster than the old I2.
Just like any other NUMA computer... The bootmaster is nothing but a normal node after the kernel is up. One can certainly "influence" the scheduling of the bottom halves of the interrupt handlers, but to what extent? There are no real interrupts in the XIO environment; there are only packets routed through the fabric.
Well, not in the classic sense: the processor isn't doing a PUSHA/IRET (as on an i386 arch) and everything is still bridged by the fabric (I thought the CPUs each still had one main interrupt on the Origins?), but nonetheless the CPU has to stop executing what it's doing, change contexts, and start hashing through a different part of memory entirely. Actually, doing a PC-style interrupt with an OS that doesn't do very good protection is likely far less overhead than in IRIX. But either way it must have a very measurable impact on the L1/L2 cache stats. I also thought IRIX (within the xbow) let you manage both the top and bottom part of the ISR... Hmmmm, I'm going to start thrashing through code. I've never looked at the way IRIX handles interrupts, or whether (if at all) it's still good for low-latency RT performance.
I mean (forgive my incorrect / incomplete asm example, it's been a while):
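mov bx, offset_of_filo_ring   ; bx -> next free slot in the ring
mov [bx], al                  ; store the byte already read into al
mov al, 20h                   ; non-specific EOI command
out 20h, al                   ; send EOI to the master 8259 PIC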
That's a pretty short amount of work on plain old PeeCees (yes I know, old non-protected-mode code; all the assembly I did 15 years ago was on embedded computers and was mostly just doing basic I/O for disks and some user interface. I used A86/D86, which didn't stress correct syntax either, so ....)
Does anyone have much low-level knowledge of the LINC chip that's on the more complex cards? I keep thinking about it: having a 132 MHz R4650 just to bridge the two PCI buses (PPCI, CPCI) seems like it would just add more latency than an FPGA or some custom silicon (well, other custom silicon).
Is the LINC there to translate the very large addresses used in the large SN0/1 systems to something that fits nicely into the PCI spec?
Or was the LINC more for performance, and more to co-process the DMA (the LINC seems to do a lot of scatter/gather)?
Is the LINC code standard across all LINC-bearing cards, or is the LINC code tweaked for the specific card it's on? I suppose it makes a lot more sense if each LINC uses custom code, the point being to use commodity silicon for I/O to keep costs low while still co-processing it to keep the impact on the CPUs lower. It just seems that it would still be cheaper and faster to do even the most complex function (DMA scatter/gather) in a (fp)gate array.