SGI: Development

Low-level details of SGI multiprocessing?

Are there any known details on programming the ASICs responsible for managing the CPUs in multiprocessor SGIs? I mean, details from an OS-development point of view (i.e., the ASIC commands needed to create a new process on a certain CPU, to schedule each CPU, etc.).

I don't mind if the details are for Challenge, for the Octane, or for NUMA or later MP boxes.

The reason is that I'm curious about the command protocol SGI used for controlling MP systems. I'm beginning a pet project which needs (simple) MP, and maybe I could get some inspiration from older SGI designs...
Good question. I do not have an answer, but I can say that OpenBSD and Linux have SMP for the IP30.
Some prowling the streets, looking for sweets from their Candyman , I'm Looking for a new fun with IP30/Octane2
IP30 purposes : linux (kernel development), Irix Scientific Apps { Ansys, Catia, Pro/E, FiberSIM, AutoDYN, ... }
Other Projects : { Cerberus , Woody Box , 68K-board, SWI_DBG }, discontinued Console hacks { GB, GBA, PSX1 }
Wanted Equipments : { U1732C LCR meter by Keysight } ~ ~ I am still Learning English, be patient with me ~ ~
about me , there are just a few things to know: I am exuberant , and I love the urban dictionary : is it a problem ?!?
The O2000 and Onyx2 (and by extension the Octane and Tezro) are based on an academic computer design from Stanford called DASH. There are many papers about it on citeseer. This design uses directory memories to produce coherency from a loosely-coupled mesh of processing nodes (each of which has several CPUs in a traditional SMP arrangement).
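
To make "directory memories" concrete: each directory entry remembers which nodes currently hold a copy of a given memory line, so a write only has to invalidate exactly those caches instead of snooping a shared bus. Here is a minimal C sketch of the idea; the field names and the flat 64-bit presence vector are invented for illustration, not DASH's actual encoding (see the papers for that):

Code: Select all

#include <stdint.h>
#include <stdio.h>

enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_DIRTY };

/* One directory entry per memory line, kept on the line's home node. */
struct dir_entry {
    enum dir_state state;   /* who may read/write the line            */
    uint64_t presence;      /* bit N set => node N caches the line    */
};

/* Stub: on real hardware this is a point-to-point interconnect message. */
void send_invalidate(int node)
{
    printf("invalidate -> node %d\n", node);
}

/* On a write miss from `node`: invalidate every other sharer, then
 * hand the line over in exclusive (dirty) state. No broadcast needed. */
void handle_write_miss(struct dir_entry *e, int node)
{
    for (int n = 0; n < 64; n++)
        if ((e->presence & (1ULL << n)) && n != node)
            send_invalidate(n);
    e->presence = 1ULL << node;
    e->state = DIR_DIRTY;
}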
:PI: :O2: :Indigo2IMP: :Indigo2IMP:
robespierre wrote: The O2000 and Onyx2 (and by extension the Octane and Tezro) are based on an academic computer design from Stanford called DASH. There are many papers about it on citeseer. This design uses directory memories to produce coherency from a loosely-coupled mesh of processing nodes (each of which has several CPUs in a traditional SMP arrangement).


Stanford DASH
Stanford DASH was a cache coherent multiprocessor developed in the late 1980s by a group led by Anoop Gupta, John L. Hennessy, Mark Horowitz, and Monica S. Lam at Stanford University. It was based on adding a pair of directory boards designed at Stanford to up to 16 SGI IRIS 4D Power Series machines and then cabling the systems in a mesh topology using a Stanford-modified version of the Torus Routing Chip. The boards designed at Stanford implemented a directory-based cache coherence protocol allowing Stanford DASH to support distributed shared memory for up to 64 processors. Stanford DASH was also notable for both supporting and helping to formalize weak memory consistency models, including release consistency. Because Stanford DASH was the first operational machine to include scalable cache coherence, it influenced subsequent computer science research as well as the commercially available SGI Origin 2000. Stanford DASH is included in the 25th anniversary retrospective of selected papers from the International Symposium on Computer Architecture and several computer science books, has been simulated by the University of Edinburgh, and is used as a case study in contemporary computer science classes.



Any good books or papers to learn about that?

I opened this topic, but … :D
This is the DASH book: Scalable Shared-Memory Multiprocessing. 30% of it (100 pages) is "Experience with DASH". I got it some time ago for less than $10.
:PI: :Indigo: :Indigo: :Indigo: :Indy: :Indy: :Indigo2: :Indigo2IMP: :Octane: :Fuel: :540:
GL1zdA wrote: This is the DASH book: Scalable Shared-Memory Multiprocessing. 30% of it (100 pages) is "Experience with DASH". I got it some time ago for less than $10.


Bought it within 20 milliseconds once I realized that baham_books (a UK bookseller) sells it for €9.31 including shipping :D
I think that the Octane and Desktop Tezro only have a single node, either with 2 CPUs (for Octane) or 4 CPUs (for Tezro). You would think that the Rackmount Tezro is really an O350 so it would support multiple nodes, but the second brick has graphics and I/O only, no CPUs.
robespierre wrote: I think that the Octane and Desktop Tezro only have a single node, either with 2 CPUs (for Octane) or 4 CPUs (for Tezro). You would think that the Rackmount Tezro is really an O350 so it would support multiple nodes, but the second brick has graphics and I/O only, no CPUs.

You're right about the Octane and Desktop Tezro, but the Rackmount Tezro was also available as 2 nodes with 2 CPUs each, using the Workstation Expansion Module (since you couldn't use a 4x 1 GHz CPU node board with the V12). I've tried to compile the available options here, but I don't think anyone has experimented with extreme (unsupported) Tezro configs like the ones in my questions. But there were successful attempts to change the various personalities (Origin 350/Onyx 350/Tezro) of the Chimera.
ivelegacy wrote: Good question. I do not have an answer, but I can say that OpenBSD and Linux have SMP for the IP30.

Thanks, I had forgotten about that. I'll take a look at the source. Thanks a lot as well to everybody who contributed pointers to the DASH documentation.
Most of the Linux SMP code came out of SGI, if you're of a mind to look at any of that. Probably you're not... :roll:
Project:
Temporarily lost at sea...
Plan:
World domination! Or something...

:Tezro: :Octane2:
[photos of the book]

and got my book :D :D :D :D :D
cesss wrote: Are there any known details on programming the ASICs responsible for managing the CPUs in multiprocessor SGIs? I mean, details from an OS-development point of view (i.e., the ASIC commands needed to create a new process on a certain CPU, to schedule each CPU, etc.).

I don't mind if the details are for Challenge, for the Octane, or for NUMA or later MP boxes.

The reason is that I'm curious about the command protocol SGI used for controlling MP systems. I'm beginning a pet project which needs (simple) MP, and maybe I could get some inspiration from older SGI designs...


If you are interested in experimenting a bit with multiprocessors, inter-processor interrupts, etc., then Stanford's SimOS/MIPS may also be of interest. It's related to the FLASH architecture ( http://mprc.pku.edu.cn/mentors/training ... kuskin.pdf ), which many Stanford DASH people (and future VMware people) were involved in. It's a partial implementation, though, and probably about 50% of its commands and registers are no-ops. The documentation isn't great, but you do have the source code to see how the 'hardware' works, making it possible to understand enough to do CPU startup/shutdown, interrupt setup, inter-processor interrupts, timers, etc., and it doesn't take a lot of code: maybe 10% of what an x86/Intel MPS multiprocessor from the same era would have taken. If learning basic SMP operations is your goal, it's probably a good starting point. For example, it would be a good exercise to try to port an existing OS like Linux to it.
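
For flavor: on machines of this kind, sending an inter-processor interrupt usually boils down to a single uncached store into a memory-mapped register belonging to the target CPU. A sketch of what that looks like; the base address and per-CPU register spacing here are made up for illustration (the real map is in the SimOS source):

Code: Select all

#include <stdint.h>

/* Hypothetical interconnect register map (KSEG1, i.e. uncached on MIPS). */
#define IPI_BASE    0xBF000000u   /* invented base address          */
#define IPI_STRIDE  0x100u        /* invented per-CPU register gap  */

/* Kick `cpu` with interrupt `vector`: one store does the job. */
void send_ipi(unsigned int cpu, uint32_t vector)
{
    volatile uint32_t *reg =
        (volatile uint32_t *)(uintptr_t)(IPI_BASE + cpu * IPI_STRIDE);
    *reg = vector;   /* the write itself raises the interrupt */
}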

Linux SMP isn't that hard to understand (though admittedly I haven't kept up with it for many years now, and it tends to grow in complexity each year). Even though they were obsolete, I found it easier to look at simpler early SMP architectures first (e.g. SPARC32, Alpha): PROM and hardware handled many of the really low-level details, so you can concentrate on understanding how things like CPU tracking and management, IPIs, and cache operations work. I certainly think diving straight into something like a high-end NUMA architecture is going to be a more difficult approach.

Also, I'd recommend you get a really good grounding in cache/TLB coherency, atomic locking, memory ordering, and interrupts on multiprocessor systems first. Otherwise you will probably have to redesign all your code when it doesn't work as you expect on real hardware :-)
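
To make that concrete, the smallest piece of MP code where atomicity and memory ordering both already matter is a spinlock. A sketch using portable C11 atomics rather than any particular kernel's implementation:

Code: Select all

#include <stdatomic.h>

/* Initialize with: spinlock_t lock = { ATOMIC_FLAG_INIT }; */
typedef struct { atomic_flag locked; } spinlock_t;

void spin_lock(spinlock_t *l)
{
    /* Atomic read-modify-write; acquire ordering keeps the critical
     * section's loads/stores from being reordered before the lock.
     * This matters on weakly ordered CPUs such as MIPS. */
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

void spin_unlock(spinlock_t *l)
{
    /* Release ordering keeps the critical section from leaking
     * past the unlock. */
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}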

Code: Select all

Architecture

Onyx systems can be classified as MIMD SMP systems.

MIMD: Multiple Instruction, Multiple Data. This means multiple instructions can be issued at the same time to multiple groups of CPUs and FPUs.

SMP: Symmetric MultiProcessing is a common form of MIMD system. All CPUs are treated alike and can perform any task at any time (unlike Master/Slave designs), as they share a common view of memory.

The following ASCII art diagram shows the basic concept of (Power)Challenge systems. All subsystems are implemented on dedicated boards which plug into a backplane that forms the system bus. "..." indicates that multiple boards of a given type may be installed.

The VME bus and/or graphics interface is provided by additional boards that attach to the I/O subsystem.


Code: Select all

CPU        CPU
|          |
|    ...   |
Cache      Cache
|          |
|          |
=============================================== System Bus (1.2 GB/s)
|          |          |          |
|    ...   |          |   ...    |
|          |          |          |
Memory     Memory      I/O        I/O


A very interesting architecture, the Onyx :D
The Onyx is actually a very classic SMP design: a fast bus integrating all components. Onyx2 is where it becomes interesting, with SGI trying to find the right balance between a tightly coupled SMP and a loosely coupled cluster.
Linus Torvalds, on 13 November 2012:

Code: Select all

SGI in particular worked a lot on scaling past a few hundred CPUs. Their initial patches could just not be merged. There was no way we could take the work they did and use it on a regular PC because they added all this infrastructure to work on thousands of CPUs. That was way too expensive to do when you had only a couple.

I was afraid for the longest time that we would have the high-performance kernel for the big machines, and the source code would be separate from the normal kernel. People worked a lot on just making sure that we had a clean code base where you can say at compile time that, hey, I want the kernel that works for 4,000 CPUs, and it generates the code for that, and at the same time, if you say no, I want the kernel that works on 2 CPUs, the same source code compiles.

It was something that in retrospect is really important because it actually made the source code much better. All the effort that SGI and others spent on unifying the source code, actually a lot of it was clean-up – this doesn't work for a hundred CPUs, so we need to clean it up so that it works. And it actually made the kernel more maintainable. Now on the desktop, 8 and 16 CPUs are almost common; it used to be that we had trouble scaling to 8, now it's like child's play.


Link to the original interview
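
The compile-time switch he is describing survives in today's kernels as CONFIG_NR_CPUS, which sizes the per-CPU bookkeeping structures. A stripped-down illustration of the idea (my own sketch, not the kernel's actual code):

Code: Select all

/* Build with e.g. -DNR_CPUS=4096 for a big-iron kernel;
 * the default here stands in for a small desktop build. */
#ifndef NR_CPUS
#define NR_CPUS 2
#endif

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* One bit per possible CPU: a single long on a 2-CPU build,
 * 64 longs at NR_CPUS=4096; the same source compiles either way. */
struct cpumask {
    unsigned long bits[(NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG];
};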