Onyx2 Hang..

I powered up the 22-CPU Onyx2 today and it only gets as far as:


IP27 PROM SGI Version 6.156  built 11:27:56 AM Nov 18, 2003
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             DONE
Discovering local IO ......................             DONE
Discovering NUMAlink connectivity .........             DONE
Found 17 objects (11 hubs, 6 routers) in 268724 usec
Waiting for peers to complete discovery....             DONE
Recognized 390 MHz midplane
Global master is /hw/module/1/slot/n1
Testing/Initializing all memory ...........             DONE
Initializing headless node at nasid 10
........Discovering local IO ......................             DONE
Checking partitioning information .........             DONE




All the MSCs say 3F3F3F, except the master, which says 503F3F.

Ideas anyone...? GOBI..?

R.
Gods of death eat only apples

An opened bracket must always be closed -- a programmer

:Tezro: :Tezro: :Onyx2R: :Onyx2RE: :Onyx2: :O3x04R: :O3x0: :O200: :Octane: :Octane2: :O2: :O2: :Indigo2IMP: :PI: :PI: :1600SW: :1600SW: :Indy: :Indy: :Indy: :Indy: :Indy:
:hpserv: J5600, 2 x Mac, 3 x SUN, Alpha DS20E, Alpha 800 5/550, 3 x RS/6000, Amiga 4000 VideoToaster, Amiga4000 -030, 733MHz Sam440 AmigaOS 4.1 update 1.

Sold: :Indy: :Indy: :Indy: :Indigo: Tandem Himalaya S-Series Nonstop S72000 ServerNet.

Twitter @PymbleSoftware
Current Apps (iOS) -> https://itunes.apple.com/au/artist/pymb ... d553990081
(Android) https://play.google.com/store/apps/deve ... +Ltd&hl=en
(Onyx2) Cortex ---> http://www.facebook.com/pages/Cortex-th ... 11?sk=info
(0300s) Minnie ---> http://www.facebook.com/pages/Minnie-th ... 02?sk=info
Github ---> https://github.com/pymblesoftware
Looks like one of the nodes went 'headless', which usually means a compression connector problem or a fried HUB chip on the IP31 (I've seen both happen).

If this node happens to also operate the console you get nothing.

I'd swap the nodeboard showing status 0x50 with any of the others that are doing the memory test (0x3f) and retry; you'll probably get to the PROM and at least get more verbose error messages. IIRC, the leftmost nodeboard (n4) has the console, but you'd better double-check the LEDs on the back.

If not, set switches 4 & 5 on the MSC to go straight to POD on that module. Oh, and remove as much memory as possible from the headless node: memory initialization of a headless node takes *forever* (an hour or more).
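
For anyone collecting the MSC readings by hand, here is a minimal Python sketch of the "find the odd one out" step. It assumes each pair of hex digits on the display is one status byte, which is how the 0x50 / 0x3f reading above treats it; the module names and the `odd_ones_out` helper are made up for illustration.

```python
from collections import Counter


def split_status(display):
    """Split an MSC display string such as '503F3F' into one int per hex pair."""
    if len(display) % 2:
        raise ValueError("expected an even number of hex digits")
    return [int(display[i:i + 2], 16) for i in range(0, len(display), 2)]


def odd_ones_out(displays):
    """Return (module, byte position, value) for each byte that differs
    from the most common status value across all displays."""
    by_module = {name: split_status(text) for name, text in displays.items()}
    counts = Counter(b for values in by_module.values() for b in values)
    common, _ = counts.most_common(1)[0]
    return [
        (name, pos, value)
        for name, values in by_module.items()
        for pos, value in enumerate(values)
        if value != common
    ]


# Hypothetical module names; the display strings are the ones reported above.
readings = {"master": "503F3F", "module2": "3F3F3F", "module3": "3F3F3F"}
suspects = odd_ones_out(readings)
for name, pos, value in suspects:
    print(f"{name}: byte {pos} reads 0x{value:02X}")
```

This only flags which reading differs; which physical slot a given byte corresponds to still has to be checked against the LEDs on the back.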
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
With 22 processors involved, that's a fairly complex piece of hardware to troubleshoot. I think the GOBI would suggest that minimizing the number of components tested at one time would simplify the process (something akin to the process suggested in this discussion).

To avoid later issues with the PROM having components re-installed in different locations, put a masking tape tag indicating the position of each board and CrayLink cable before disconnecting or removing them.

  • Temporarily disconnect the CrayLink connections and Graphics module and test only the compute module with the MSC that doesn't get to 3F3F3F.
  • Remove all XIO expansion boards (X-Town, FC, MSCSI, etc.).
  • After that module is powered up freestanding, examine the nodeboard LEDs for error codes, then test the system with only the problem nodeboard (in N1).
    • If you have issues with the normal console, try the diagnostic port on the MSC, and
    • if you can't get diagnostic info with only the problem nodeboard installed, move it to N2 and put a working node in N1.
  • As j-j has suggested, follow the error code(s) to try to isolate the problem.
  • If the problem can't be eliminated while only the problem nodeboard is installed, try a different nodeboard (in the same slot).

Once you get a nodeboard in the problem slot to display a post-initialization heartbeat, add the other components back, one at a time.

And please post a follow-up on your findings: details on the problem from someone with your experience and knowledge of the hardware are almost certain to be of future help to someone (who has yet to acquire either).
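
The add-back step can be written down as a simple loop, shown here as a rough Python sketch. Nothing in it talks to the hardware: `boots_ok` is a hypothetical stand-in for "power up and check whether the nodeboard reaches its heartbeat", and the component names are invented.

```python
from typing import Callable, Optional, Sequence


def first_bad_component(
    known_good: Sequence[str],
    candidates: Sequence[str],
    boots_ok: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Add candidates to a working baseline one at a time and return the
    first one whose addition makes the boot fail, or None if all are fine."""
    installed = list(known_good)
    if not boots_ok(installed):
        raise RuntimeError("baseline does not boot; fix that before adding parts")
    for part in candidates:
        installed.append(part)
        if not boots_ok(installed):
            return part
    return None


# Stand-in for a real power-up test: pretend the boot fails whenever
# the (invented) part "nodeboard n3" is installed.
def fake_boot(installed):
    return "nodeboard n3" not in installed


culprit = first_bad_component(
    ["nodeboard n1"],
    ["nodeboard n2", "nodeboard n3", "XIO card"],
    fake_boot,
)
print(culprit)
```

The loop stops at the first failure on purpose: with only one new part per power-up, the part just added is the only new suspect.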
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
recondas wrote: With 22 processors involved, that's a fairly complex piece of hardware to troubleshoot. I think the GOBI would suggest that minimizing the number of components tested at one time would simplify the process (something akin to the process suggested in this discussion).



I think we all know who the real GOBI is... ;)
I pulled everything down and found one of the 500MHz nodes seems to have failed. I got depressed and left it...

I took your text out of that forum post, reworked it slightly, made a new wiki topic called Onyx2 Diagnostics, and then included that topic in the main Onyx2 topic.
If you go to edit the diagnostics section of the Onyx2 page in the wiki you will see nothing but {{:Onyx2 Diagnostics}}, which in C/C++ programmers' terms means it is #include'd from that topic, so go edit it there.
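
For anyone who hasn't met wiki transclusion, here is a toy Python sketch of what `{{:Page name}}` expansion amounts to. It is not the wiki engine's real code: the `expand` function and the page text are invented, and only the two page titles come from this thread.

```python
import re

# Matches {{:Some Page}}; the leading colon means "a page from the main namespace".
TRANSCLUDE = re.compile(r"\{\{:([^{}|]+)\}\}")


def expand(title, pages, _stack=()):
    """Return the text of `title` with every {{:Other}} replaced by the
    (recursively expanded) text of the page called Other."""
    if title in _stack:
        raise ValueError("transclusion loop via " + title)

    def substitute(match):
        return expand(match.group(1).strip(), pages, _stack + (title,))

    return TRANSCLUDE.sub(substitute, pages[title])


# Invented page contents, for illustration only.
pages = {
    "Onyx2": "== Diagnostics ==\n{{:Onyx2 Diagnostics}}\n",
    "Onyx2 Diagnostics": "Tag every board and cable before removing it.",
}
rendered = expand("Onyx2", pages)
print(rendered)
```

As with #include, the including page stores only the reference, so edits have to be made in the included topic.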

Seems like every time I fire up the Onyx2 when it hasn't been started in a while [1], I lose 2 CPUs; maybe I should sacrifice an Indy beforehand.


jan-jaap wrote: If not, set switches 4 & 5 on the MSC to go straight to POD on that module. Oh, and remove as much memory as possible from the headless node: memory initialization of a headless node takes *forever* (an hour or more)


Any chance you could please document this in the wiki?


Edit: it's up and running on 18 CPUs... :(
I'll post a request for 400MHz and 500MHz IP27 nodes in the wanted forum.

R.

1. I don't run it over the southern hemisphere summer. If you knew how hot it gets here you'd understand.
I replaced a CPU and swapped a board that was getting cache errors, and I still get headless issues...

IP27 PROM SGI Version 6.156  built 11:27:56 AM Nov 18, 2003
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             DONE
Discovering local IO ......................             DONE
Discovering NUMAlink connectivity .........             DONE
Found 16 objects (10 hubs, 6 routers) in 258072 usec
Waiting for peers to complete discovery....             DONE
Recognized 390 MHz midplane
Global master is /hw/module/1/slot/n1
Testing/Initializing all memory ...........             DONE
Initializing headless node at nasid 10
........*** Nasid 10: CPU A was previously Present & Enabled but is now Present & Disabled
*** Nasid 10: CPU B was previously Present & Enabled but is now Present & Disabled
*** Nasid 10: Memory bank 0 was previously had 128 MB but now has 512 MB
Discovering local IO ......................             DONE
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
18 CPUs on 10 nodes found.
****************************************************************
*    PANIC: Boards in same module show different moduleids.    *
*      PANIC: Failed to automatically assign moduleid(s)       *
*    Please assign globally unique module id(s) at the MSC.   *
****************************************************************

Switching into Power-On Diagnostics mode...


1A 000: *** Software entry into POD mode from IO6 POD mode on node 0
1A 000: POD IOC3 Dex>
1A 000: POD IOC3 Cac> enable n:10
Permanently enabling NASID 10 CPU A
Permanently enabling NASID 10 CPU B
Could not unsetenv DisableMemMask: variable not set
mem banks  enabled. Reset to make it functional
Warning: reset required to take effect
1A 000: POD IOC3 Cac> reset
Resetting the system...


IP27 PROM SGI Version 6.156  built 11:27:56 AM Nov 18, 2003
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             DONE
Discovering local IO ......................             DONE
Discovering NUMAlink connectivity .........             hub_link_vector_diag: Detected 79 CB errors on local side of link
hub_link_vector_diag: Starting cb_err value was 0
DIAG hub_link_vecto HUB on Nodeboard in SLOT n1 failed test run by CPU 0.
hub_link_vector_diag failed: Rtr saw too many link errors.
RSLT hub_link_vecto FAIL                diag_rc = 206 Rtr saw too many link erro
*** hub_link_vector_diag: /hw/module/0/slot/n0: SHOWED ERRORS
DONE
Found 16 objects (10 hubs, 6 routers) in 672869 usec
Waiting for peers to complete discovery....             DONE
Recognized 390 MHz midplane
Global master is /hw/module/1/slot/n1
Testing/Initializing all memory ........
Initializing headless node at nasid 10
........Discovering local IO ......................             DONE
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
18 CPUs on 10 nodes found.
****************************************************************
*    PANIC: Boards in same module show different moduleids.    *
*      PANIC: Failed to automatically assign moduleid(s)       *
*    Please assign globally unique module id(s) at the MSC.   *
****************************************************************

SwitchingInitializing headless node at nasid 10
........Discovering local IO ......................             DONE
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
18 CPUs on 10 nodes found.
****************************************************************
*    PANIC: Boards in same module show different moduleids.    *
*      PANIC: Failed to automatically assign moduleid(s)       *
*    Please assign globally unique module id(s) at the MSC.   *
****************************************************************

SwitchingInitializing headless node at nasid 10
........Discovering local IO ......................             DONE
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
18 CPUs on 10 nodes found.
****************************************************************
*    PANIC: Boards in same module show different moduleids.    *
*      PANIC: Failed to automatically assign moduleid(s)       *
*    Please assign globally unique module id(s) at the MSC.   *
****************************************************************

Switching into Power-On Diagnostics mode...


1A 000: *** Software entry into POD mode from IO6 POD mode on node 0
1A 000: POD IOC3 Dex>  into Power-On Diagnostics mode...



I dumped the commands from the POD help into: this topic
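
To compare one boot attempt with the next without eyeballing the whole capture, a short Python sketch like the one below can pull the key numbers out of a saved console log. The regular expressions are based only on the lines pasted in this thread, so other PROM versions may word things differently; `BootSummary` and `summarize` are names made up here.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

FOUND = re.compile(r"Found (\d+) objects \((\d+) hubs, (\d+) routers\)")
HEADLESS = re.compile(r"Initializing headless node at nasid (\d+)")
CPUS = re.compile(r"(\d+) CPUs on (\d+) nodes found")


@dataclass
class BootSummary:
    hubs: Optional[int] = None
    routers: Optional[int] = None
    cpus: Optional[int] = None
    nodes: Optional[int] = None
    headless_nasids: Set[int] = field(default_factory=set)
    panics: List[str] = field(default_factory=list)


def summarize(log):
    """Collect hub/router counts, headless nasids, CPU totals and PANIC lines."""
    s = BootSummary()
    for line in log.splitlines():
        if m := FOUND.search(line):
            s.hubs, s.routers = int(m.group(2)), int(m.group(3))
        elif m := HEADLESS.search(line):
            s.headless_nasids.add(int(m.group(1)))
        elif m := CPUS.search(line):
            s.cpus, s.nodes = int(m.group(1)), int(m.group(2))
        elif "PANIC:" in line:
            s.panics.append(line.strip("* \t"))
    return s


# A few lines copied from the capture above.
sample = """\
Found 16 objects (10 hubs, 6 routers) in 258072 usec
Initializing headless node at nasid 10
18 CPUs on 10 nodes found.
*    PANIC: Boards in same module show different moduleids.    *
*      PANIC: Failed to automatically assign moduleid(s)       *
"""
summary = summarize(sample)
print(summary)
```

Run over the first capture in this thread it would report 11 hubs against 10 here, which makes the missing hub easy to see.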
R.