The collected works of Robyn

Hi all,

After replacing an impeller module in an Altix 350 C-brick, the brick won't power up because of a serial number mismatch.
This is not the global system serial number mismatch problem discussed in other threads.

(interesting stuff in those, by the way:
http://www.nekochan.net/wiki/Use_a_rbri ... Origin_300
viewtopic.php?f=3&t=16726363&view=unread )

In my case, the brick does know the global SSN. The problem is that the NVRAM reports a checksum error, and the board
serial number (BSN) in NVRAM does not match the BSN in EEPROM. The EEPROM is correct. Excerpts from "L1> log" :

08/29/13 18:40:16 nvram checksum error - log pointers invalid, using backup copy
08/29/13 18:40:16 L1 booting 1.62.0
08/29/13 19:01:38 Brick Serial Number mismatch
08/29/13 19:01:38 NVRAM BSN: JGL4$4 EEPROM BSN: NGN444
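
For anyone who saves their L1 logs to a file and wants to check for the same condition, here is a rough Python sketch. It assumes the log lines look exactly like the excerpt above (timestamp, then message); the function name `scan_l1_log` and the regular expressions are my own invention for illustration, not anything from SGI's tools.

```python
import re

# Assumed line layout, taken from the excerpt above: "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
# Assumed layout of the BSN report line: "NVRAM BSN: xxx EEPROM BSN: yyy"
BSN_RE = re.compile(r"NVRAM BSN:\s*(\S+)\s+EEPROM BSN:\s*(\S+)")


def scan_l1_log(text):
    """Hypothetical helper: report checksum errors and BSN mismatches in saved L1 log text."""
    findings = {"checksum_error": False, "nvram_bsn": None, "eeprom_bsn": None}
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that is not a timestamped log line
        message = match.group(2)
        if "nvram checksum error" in message:
            findings["checksum_error"] = True
        bsn = BSN_RE.search(message)
        if bsn:
            findings["nvram_bsn"], findings["eeprom_bsn"] = bsn.groups()
    # A mismatch only counts if a BSN report line was actually seen.
    findings["bsn_mismatch"] = (
        findings["nvram_bsn"] is not None
        and findings["nvram_bsn"] != findings["eeprom_bsn"]
    )
    return findings
```

Feed it the captured output of "L1> log" as a string; it only reads text and never talks to the L1 itself.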

I've been reading through the Altix 350 Service Guide, and the Altix L1 and L2 Controller Software User's Guide,
but haven't found a way to fix NVRAM problems.

The description of the L1 "nvram reset" command is merely that it resets to factory default settings.
I don't know if that will help, by clearing the checksum error and letting the NVRAM learn new settings
upon L1 reboot, or if it will be worse, for example by removing the global SSN.

Anybody know how to fix up the NVRAM's BSN?


Robyn
Thanks SMJ.

Here's another approach I think I will try first.
I logged into SGI Supportfolio and searched the knowledgebase
for "BSN mismatch". Found an article going back to Origin 3000
with this outline:
Symptom: BSN mismatch; cannot be fixed by changing the serial number
Cause: NVRAM has been corrupted
Solution: upgrade L1 firmware

In my situation NVRAM is corrupted as well, so I'll try reflashing
the L1 on this brick with the same version it already has and see if that clears things up.

What do you think of that?

BTW, by way of introduction, since I'm new here, I'll mention that I have been looking after
SGI systems since about 1989, starting with IRIS 340 server and Personal IRIS 4D,
and on up through Indigo, Octane, O2, Indy, Origin 2000, Origin 3000, Altix 350 and 3700,
and newer XE clusters. No UV yet. For storage, TP9100, IS220, IS5000.
I have loads of old manuals etc. from the IRIX days.


Robyn
I reflashed the L1 firmware with the same version using 'flashsc -f' from the L3 station,
then rebooted the L1. Alas, it did not clear the NVRAM checksum error or fix the
mismatch of BSNs between EEPROM and NVRAM.


Robyn
Thank you "recondas", but I am happy to report success by other means.

There are various other intermediate pieces to my sob story about ideas for partial recovery
(e.g. this brick is not the primary brick but does happen to have IO in it, and the home directory
disk is there, but I couldn't simply move that disk to the primary brick and boot system without
this brick because the primary brick has IO9 SCSI and this brick has IO10 SATA, so I'd have
to move all the IO pieces to another brick instead...) But I'll leave those out for now.

I decided to try the secret undocumented L1 command from that thread mentioned earlier.
To my astonishment, it did not merely turn off the requirement for global system serial number
consistency, it cleared the NVRAM checksum error and reset the scrambled NVRAM BSN
to match the correct EEPROM BSN. In other words, "fixed".

Thanks to whoever discovered that command and published it.


Robyn