SGI: Hardware

Changed RMs - Now Onyx2 Deskside Won't Boot

I have replaced my two RM9 with two RM10...

and now my Onyx2 will not boot :(

System controller says "P 0 M 1C"

and the LEDs on the node boards giving the following code:

N1...................N2
CPU A..CPU B......CPU A..CPU B
:idea: ..... :idea: ......... :idea: ..... :idea:
:oops: ..... :oops: ......... :arrow: ..... :oops:
:idea: ..... :idea: ......... :idea: ..... :idea:
:idea: ..... :idea: ......... :arrow: ..... :idea:
:idea: ..... :idea: ......... :idea: ..... :idea:
:oops: ..... :oops: ......... :arrow: ..... :oops:
:idea: ..... :idea: ......... :idea: ..... :idea:
:oops: ..... :oops: ......... :arrow: ..... :oops:

:idea: = LED is ON
:arrow: = LED is OFF
:oops: = LED is blinking

What is going wrong with my Onyx2??
:1600SW: :Onyx: :Onyx2: :Indigo: :Indigo: :Indigo: :Indy: :Indy: :Indy: :Indigo2: :Indigo2: :Indigo2IMP: :Indigo2IMP: :Indigo2IMP: :Indigo2IMP: :O2: :O2: :O2: :Octane: :Octane: :Octane2: :Octane2: :Fuel: :Fuel:
Thomas - To make an evaluation of the Nodeboard LED codes meaningful, you should record the LEDs as either off or on. If they blink, record them as on. If they blink in a repeatable pattern, you could also record both the 'blink' and 'non-blink' states for each CPU <see below for explanation>.

A repeating pattern of blinking LEDs is actually a good thing. Static LEDs after CPU initialization are almost always a bad sign, whereas a repeating pattern of blinking LEDs after CPU initialization is expected. Once the CPUs have initialized <but prior to IRIX booting>, the expected behavior is that the Nodeboard LEDs will alternate <blink> between "00" and "55". I interpreted your smiley face rendition of the LEDs to read:
    N1A - 00
    N1B - 00
    N2A - 55
    N2B - 00
That and the display of POM1C on the MSC suggest to me that there's nothing wrong with the compute-side hardware in your Onyx2.
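For anyone who wants to repeat that reading, here is a minimal sketch of the conversion. It assumes a lit LED counts as a 0 bit, an unlit LED as a 1 bit, and the top LED is the most significant bit. That mapping is only inferred from the 00/55 values above, not taken from SGI documentation, and the function name is made up for illustration. Blinking LEDs are recorded as on, as suggested earlier.

```python
def led_column_to_hex(states):
    """Convert one CPU's LED column (listed top to bottom) to a two-digit hex code.

    ASSUMPTION (inferred, not documented): lit = 0, unlit = 1,
    and the top LED is the most significant bit.
    A blinking LED is treated the same as a lit one.
    """
    value = 0
    for state in states:
        if state not in ("on", "off", "blink"):
            raise ValueError(f"unknown LED state: {state!r}")
        bit = 1 if state == "off" else 0
        value = (value << 1) | bit
    return f"{value:02X}"


# Columns transcribed from the smiley table in the first post.
n1_cpu_a = ["on", "blink", "on", "on", "on", "blink", "on", "blink"]
n2_cpu_a = ["on", "off", "on", "off", "on", "off", "on", "off"]

print(led_column_to_hex(n1_cpu_a))  # 00
print(led_column_to_hex(n2_cpu_a))  # 55
```

Under that assumed mapping, the N1 columns and N2 CPU B come out as 00 and N2 CPU A as 55, which matches the list above.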

I'd suggest connecting a serial terminal to check the output of the power-on diagnostics. If you can get to the PROM, you might run hinv to make sure all looks well <running 'update' probably wouldn't be a bad idea either>. If all looks well at the PROM level, try booting the system with the serial terminal connected.

If it boots, I'd suggest testing the graphics hardware with irsaudit. http://www.nekochan.net/wiki/ ... h_irsaudit

The proximity of your newly developed problem to the addition of your newly acquired RM10s would make failure to initialize graphics a more likely suspect. Your Onyx2 might be halting at the PROM because there aren't any graphics, or might even be booting but not generating a display.
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************


As recondas mentioned: whip out a serial terminal or fire up HyperTerm and go through the usual PROM clearing and updating procedures. Saved me many hours of irritation. Also, if you wait, I can provide information on how to disable extensive memory testing, reducing the POST-to-boot time to a few minutes.

MAYA, nut-
:Octane2: :Octane2: Octane 2 R14k 600 V12 4GB, Octane2 R14K 600 V10 1GB ,
:Onyx2: :Onyx2: Onyx2 IR3 4GB Quad R14K 500 DIVO, Onyx2 IR Quad R12K 400 2GB,
:Indigo2: SGI Indigo 2 R8K75 TEAL Extreme 256MB,
:Indigo2IMP: SGI Indigo 2 R10K 195 Solid Impact 256MB, MAX Impact Pending
,
Apple G5 Quad, NV Quadro 4500 + 7800GT, 12GB RAM
Sun Blade 1000 Dual 900 XVR 1000 4GB
Sun Blade 2000 Dual 1200 XVR 1200 8GB
I have fired up HyperTerm...

IP27 PROM SGI Version 6.94  built 04:00:43 PM Dec  5, 2001
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             DONE
Discovering local IO ......................             DONE
Discovering NUMAlink connectivity .........             DONE
Found 2 objects (2 hubs, 0 routers) in 10464 usec
Waiting for peers to complete discovery....             DONE
Global master is /hw/module/1/slot/n2
Testing/Initializing all memory ...........             DONE
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
4 CPUs on 2 nodes found.
Installing PROM Device drivers ............
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0 (/hw/module/1/slot/io1), (pci id 0)
1+ 2- 3- 4- 5- 6+ 7- 8- 9- 10- 11- 12- 13- 14- 15- = 2 device(s)


Walking SCSI Adapter 1 (/hw/module/1/slot/io1), (pci id 1)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)

Initializing PROM Device drivers ..........             DONE
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc3pckm0 for input
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc3pckm0 for input
Checking hardware inventory ...............              /hw/module/1/slot/n1:MEM BANK 0 missing or disabled
Found new or re-enabled component MEM BANK 2 3 4 5 6 7
***Warning: Found a new IP27 board in module 1, slot n2, serial MHV535
Please use the 'update' command from the PROM Monitor to update the inventory
***Warning: Found a new PCI_XIO board in module 1, slot io2, serial GLY410
Please use the 'update' command from the PROM Monitor to update the inventory
***Warning: Found a new MSCSI board in module 1, slot io3, serial HPP125
Please use the 'update' command from the PROM Monitor to update the inventory
DONE

**** System Configuration and Diagnostics Summary ****
CONFIG:
No. of NODEs enabled    = 2
No. of NODEs disabled   = 0
No. of CPUs enabled     = 4
No. of CPUs disabled    = 0
Mem enabled             = 3456 MB
Mem disabled            = 768 MB
No. of RTRs enabled     = 0
No. of RTRs disabled    = 0

DIAG RESULTS:
/hw/module/1/slot/n2/node/mem: MEMBANK(S) 2  disabled
Reason:
Bank 2: Some DIMMs failed mem test.
/hw/module/1/slot/n1/node/mem: MEMBANK(S) 0  disabled
Reason:
Bank 0: Some DIMMs failed mem test.
**** End System Configuration and Diagnostics Summary ****


System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option?
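
When comparing POST logs between boots, it can help to pull the numbers out of the "System Configuration and Diagnostics Summary" instead of eyeballing them. The following is only a sketch: the function names are made up for illustration, and the regular expression assumes the `name = value` layout seen in the summary above.

```python
import re

# Matches lines such as "Mem enabled             = 3456 MB"
# or "No. of CPUs enabled     = 4" (layout assumed from the log above).
SUMMARY_LINE = re.compile(r"^(?P<key>[A-Za-z. ]+?)\s*=\s*(?P<value>\d+)(?:\s*MB)?\s*$")


def parse_config_summary(text):
    """Return {field name: integer value} for every 'name = value' line."""
    fields = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            fields[match.group("key").strip()] = int(match.group("value"))
    return fields


def disabled_memory_percent(fields):
    """Percentage of installed memory that POST has disabled."""
    enabled = fields["Mem enabled"]
    disabled = fields["Mem disabled"]
    return round(100 * disabled / (enabled + disabled), 1)


SAMPLE = """\
CONFIG:
No. of NODEs enabled    = 2
No. of NODEs disabled   = 0
No. of CPUs enabled     = 4
No. of CPUs disabled    = 0
Mem enabled             = 3456 MB
Mem disabled            = 768 MB
"""

fields = parse_config_summary(SAMPLE)
print(fields["No. of CPUs enabled"])      # 4
print(disabled_memory_percent(fields))    # 18.2
```

For the log above this reports 768 MB of 4224 MB disabled, about 18%, which corresponds to the two banks that failed the memory test.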


"hinv" from within HyperTerm...

>> hinv
System  SGI-IP27
4 400 MHz IP27 Processors
Main memory size: 3456 Mbytes, (768 Mbytes disabled)
Graphics Controller
Integral SCSI controller
Integral SCSI controller
Integral SCSI controller
Integral SCSI controller
Integral SCSI controller 0
Integral SCSI controller 1
Integral Fast Ethernet
IOC3 serial port, (adapter 0)
Integral Fast Ethernet
IOC3 serial port, (adapter 0)
Disk drive: unit 1 on SCSI Controller 0, (dksc(0,1,0))
CDROM: unit 6 on SCSI Controller 0, (cdrom(0,6,7))
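
The disk and CDROM entries in hinv can be cross-checked against the "Walking SCSI Adapter" lines in the POST output. A small sketch, assuming `+` marks an ID that responded and `-` one that did not (this fits the "= 2 device(s)" total and hinv's units 1 and 6, but is an inference, not documented behavior); the function name is hypothetical:

```python
import re


def responding_scsi_ids(walk_line):
    """Return the SCSI IDs flagged with '+' in a PROM adapter walk line.

    ASSUMPTION: '+' means a device answered at that ID, '-' means none did.
    """
    return [int(m.group(1)) for m in re.finditer(r"\b(\d+)\+", walk_line)]


adapter0 = "1+ 2- 3- 4- 5- 6+ 7- 8- 9- 10- 11- 12- 13- 14- 15- = 2 device(s)"
adapter1 = "1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)"

print(responding_scsi_ids(adapter0))  # [1, 6]
print(responding_scsi_ids(adapter1))  # []
```

IDs 1 and 6 on adapter 0 line up with the disk at unit 1 and the CDROM at unit 6, so the storage side looks consistent.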


Where are the RM10??

System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option? 1


Starting up the system...

IRIX Release 6.5 IP27 Version 10060437 System V - 64 Bit
Copyright 1987-2004 Silicon Graphics, Inc.
All Rights Reserved.

Setting rbaud to 19200
The system is coming up.

Moving /core to /usr/tmp/core.postinst_detected
An old core file has been found at /core. To reduce
the chance of running out of disk space on the root
filesystem during postinst processing, the file /core
is being moved to /usr/tmp/core.postinst_detected.
WARNING: ef0: link fail - check ethernet cable
ONYX2_IR2E: Unknown host
routed: Send bcast sendto(ef0, 192.168.2.255.520): Network is down
ESP has already been started
inst:
inst: Software installation has installed new configuration files and/or saved
inst: the previous version in some cases.  You may need to update or merge
inst: old configuration files with the newer versions.  See the "Updating
inst: Configuration Files" section in the versions(1M) manual page for details.
inst: The shell command "versions changed" will list the affected files.
inst:
inst: These directories were unable to be moved properly during the
inst: installation process.  Check for any user-modified files, then
inst: delete the directories.
inst:    /usr/include/Vk.O
IR0: ARM: Welcome to ARMLand - 0/0x0d00
IR0: ARM: running...(sherwood-root 0410060304)
IR0: ARM: ******************************************************
IR0: ARM: * InfiniteReality/Reality Software, IRIX 6.5 release *
IR0: ARM: ******************************************************



ONYX2_IR2E console login: IR0: GE0: Welcome to GELand!
IR0: GE1: Welcome to GELand!
IR0: GE2: Welcome to GELand!
IR0: GE3: Welcome to GELand!
Thomas W. wrote: Where are the RM10??


Do this:

/usr/gfx/gfxinfo -v
:Onyx2: :Fuel: :Indigo2: :Indigo2IMP: :O3x0:
Thomas W. wrote: Where are the RM10??

Thomas - SGI provided a program, "irsaudit", that will do in-depth testing of InfiniteReality graphics boards. I'd strongly suggest you run the program and report back the results section. I wrote a wiki that might prove helpful: http://www.nekochan.net/wiki/ ... h_irsaudit There's an example of an irsaudit run and a screen capture of the results section in this post: viewtopic.php?f=3&t=16922&p=132650&#p132650

You might also want to stop in the PROM, select option 5, and run "enableall" and "update" at the PROM command line.

If the disabled memory banks don't come back up, you'll probably need to drop into POD mode and clear/reset the power-on diagnostic logs. viewtopic.php?f=3&t=16720907&p=7299835&#p7299835
recondas wrote: You might also want to stop in the PROM, select option 5, and run "enableall" and "update" at the PROM command line.


done that... now 2 CPUs are gone... :(

IP27 PROM SGI Version 6.94  built 04:00:43 PM Dec  5, 2001
Testing/Initializing memory ...............             DONE
Copying PROM code to memory ...............             DONE
Discovering local IO ......................             DONE
Discovering NUMAlink connectivity .........             DONE
Found 2 objects (2 hubs, 0 routers) in 10467 usec
Waiting for peers to complete discovery....             DONE
Global master is /hw/module/1/slot/n2
Testing/Initializing all memory ...........             DONE
*** Barrier sync warning: local=5090, NIC 0x59a385=4963 (promrev mismatch?)
Checking partitioning information .........             DONE
Loading BASEIO prom .......................             DONE

BASEIO PROM Monitor SGI Version 6.156  built 11:26:28 AM Nov 18, 2003 (BE64)
2 CPUs on 1 nodes found.
Installing PROM Device drivers ............
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0 (/hw/module/1/slot/io1), (pci id 0)
1+ 2- 3- 4- 5- 6+ 7- 8- 9- 10- 11- 12- 13- 14- 15- = 2 device(s)


Walking SCSI Adapter 1 (/hw/module/1/slot/io1), (pci id 1)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)

Initializing PROM Device drivers ..........             DONE
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc3pckm0 for input
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc3pckm0 for input
Checking hardware inventory ...............
Warning: Inventory table ID value is 0. Check Midplane NIC
Writing 4 records.... DONE
Updated new configuration. Wrote 4 records.

**** System Configuration and Diagnostics Summary ****
CONFIG:
No. of NODEs enabled    = 1
No. of NODEs disabled   = 0
No. of CPUs enabled     = 2
No. of CPUs disabled    = 0
Mem enabled             = 2112 MB
Mem disabled            = 0 MB
No. of RTRs enabled     = 0
No. of RTRs disabled    = 0

DIAG RESULTS:
ALL DIAGS PASSED.
**** End System Configuration and Diagnostics Summary ****


System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option?
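
Putting the two summaries side by side makes the change easier to see. This is just an illustrative sketch with the values copied by hand from the two logs in this thread; the helper name is made up.

```python
def config_regressions(before, after):
    """Return {field: (before, after)} for every 'enabled' count that dropped."""
    return {
        key: (before[key], after[key])
        for key in before
        if key.endswith("enabled") and key in after and after[key] < before[key]
    }


# Values copied from the summary before and after running enableall/update.
before = {
    "No. of NODEs enabled": 2,
    "No. of NODEs disabled": 0,
    "No. of CPUs enabled": 4,
    "Mem enabled": 3456,
    "Mem disabled": 768,
}
after = {
    "No. of NODEs enabled": 1,
    "No. of NODEs disabled": 0,
    "No. of CPUs enabled": 2,
    "Mem enabled": 2112,
    "Mem disabled": 0,
}

for field, (old, new) in config_regressions(before, after).items():
    print(f"{field}: {old} -> {new}")
```

Note that the second summary shows 0 nodes disabled and all diags passed: the missing node is not listed as disabled, it is simply not found any more (2 CPUs on 1 node).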
Thomas W. wrote: done that... now 2 CPUs are gone... :(

Have you tried entering POD mode to clear and reinitialize the system diagnostic logs? <if you need it, there's a how-to linked in my previous post>

Thomas W. wrote:

Checking hardware inventory ...............
Warning: Inventory table ID value is 0. Check Midplane NIC
That doesn't sound good. If you pull any boards, you might want to inspect the midplane to make sure the NIC hasn't been knocked loose. I've never had the deskside version of an Onyx2, so I'm not sure of its location, but it should be similar in appearance to a coin cell battery.

If the NIC hasn't been physically disturbed <or if you just want to play it safe>, you might pull the RM10s and re-try things with your original configuration - just in case one or both of the RM10s have an electrical malfunction or some other issue that's causing your sudden rash of problems.
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
I just took a picture of an Onyx2 Deskside midplane. The NIC is marked. So if you look for it check under nodeboard N1 near the bottom.
Current inventory: :O2: :1600SW: :Octane2: :Tezro: :Onyx2:
Former members: :PI: :Indigo: :Indy: :Indigo2: :Indigo2IMP: :1600SW: :Octane: :Octane: :Octane: :Octane: :Fuel: :PrismDT: :Onyx2: :Onyx2: :O200: :O2000: :O3200: :O3x04R:
I can't believe I suggested gfxinfo when the poor guy can't get his Onyx2 to boot. That's what you get for drinking and posting! :oops:

I'm interested to see how this turns out. I dropped RM10s into my Onyx2 without any trouble. If the NIC seems fine, I'd do what recondas mentioned and roll back to the RM9s. And definitely clear the logs via POD - that'll probably get your other node back online.
bigD wrote: ... . If the NIC seems fine ...

Tangential to the problem here, but that NIC ... looks like the NIC from an Octane. Makes me wonder if it's possible to transfer NICs between an Octane and an Onyx2? Or vice versa?
Thomas: Did you make any progress? If the NIC is the problem and a replacement NIC would be helpful, let me know, because the one in the photo would be available.
At the moment I don't have the time to tinker around with the Onyx2.

Maybe next week... I hope.
So Thomas W. - is your Onyx2 still floating belly up <or did you figure out what was wrong>?