SGI: Hardware

Altix 350 NVRAM and EEPROM BSN mismatch

Hi all,

After replacing an impeller module in an Altix 350 C-brick, the brick won't power up because of a serial number mismatch.
This is not the global system serial number mismatch problem as discussed in other threads.

(interesting stuff in those, by the way:
http://www.nekochan.net/wiki/Use_a_rbri ... Origin_300
viewtopic.php?f=3&t=16726363&view=unread )

In my case, the brick does know the global SSN. The problem is that the NVRAM reports a checksum error, and the brick
serial number (BSN) in NVRAM does not match the BSN in EEPROM. The EEPROM is correct. Excerpts from "L1> log":

08/29/13 18:40:16 nvram checksum error - log pointers invalid, using backup copy
08/29/13 18:40:16 L1 booting 1.62.0
08/29/13 19:01:38 Brick Serial Number mismatch
08/29/13 19:01:38 NVRAM BSN: JGL4$4 EEPROM BSN: NGN444
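
For anyone who wants to scan saved L1 logs for the same symptom, here is a minimal Python sketch. It is modeled only on the four log lines quoted above, so the message wording it matches is an assumption that may not hold for other L1 firmware versions. The names `find_bsn_mismatches`, `has_nvram_checksum_error` and `SAMPLE_LOG` are hypothetical and not part of any SGI tool.

```python
import re

# Pattern modeled only on the mismatch line quoted above; other L1
# firmware versions may word this message differently (assumption).
BSN_LINE = re.compile(r"NVRAM BSN:\s*(\S+)\s+EEPROM BSN:\s*(\S+)")


def find_bsn_mismatches(log_text):
    """Return (nvram_bsn, eeprom_bsn) pairs that differ, in log order."""
    mismatches = []
    for line in log_text.splitlines():
        match = BSN_LINE.search(line)
        if match and match.group(1) != match.group(2):
            mismatches.append((match.group(1), match.group(2)))
    return mismatches


def has_nvram_checksum_error(log_text):
    """Report whether any line mentions an NVRAM checksum error."""
    return "nvram checksum error" in log_text.lower()


# The excerpt from this post, used as sample input.
SAMPLE_LOG = """\
08/29/13 18:40:16 nvram checksum error - log pointers invalid, using backup copy
08/29/13 18:40:16 L1 booting 1.62.0
08/29/13 19:01:38 Brick Serial Number mismatch
08/29/13 19:01:38 NVRAM BSN: JGL4$4 EEPROM BSN: NGN444
"""

if __name__ == "__main__":
    print(find_bsn_mismatches(SAMPLE_LOG))
    print(has_nvram_checksum_error(SAMPLE_LOG))
```

The sketch only reads log text that has already been captured to a file or string; it does not talk to the L1 controller.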

I've been reading through the Altix 350 Service Guide, and the Altix L1 and L2 Controller Software User's Guide,
but haven't found a way to fix NVRAM problems.

The description of the L1 "nvram reset" command is merely that it resets to factory default settings.
I don't know if that will help, by clearing the checksum error and letting the NVRAM learn new settings
upon L1 reboot, or if it will be worse, for example by removing the global SSN.

Anybody know how to fix up the NVRAM's BSN?


Robyn
Welcome Robyn! I'm afraid I haven't seen this kind of problem, though threads around replacing NVRAM batteries and forcing serial number changes are/were a good place to start.

I wonder if this is something that can be tweaked from the POD - there's a list of POD commands in the wiki along with L1 commands. Nothing obvious leaps out in a quick scan, but...
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
Thanks SMJ.

Here's another approach I think I will try first.
I logged into SGI Supportfolio and searched the knowledgebase
for "BSN mismatch". Found an article going back to Origin 3000
with this outline:
Symptom: BSN mismatch that cannot be fixed by changing the serial number
Cause: NVRAM has been corrupted
Solution: upgrade L1 firmware

In my situation NVRAM is corrupted as well, so I'll try reflashing
the L1 on this brick with the same version it already has and see if that clears things up.

What do you think of that?

BTW, by way of introduction, since I'm new here, I'll mention that I have been looking after
SGI systems since about 1989, starting with IRIS 340 server and Personal IRIS 4D,
and on up through Indigo, Octane, O2, Indy, Origin 2000, Origin 3000, Altix 350 and 3700,
and newer XE clusters. No UV yet. For storage, TP9100, IS220, IS5000.
I have loads of old manuals etc. from the IRIX days.


Robyn
Robyn wrote: BTW, by way of introduction, since I'm new here, I'll mention that I have been looking after
SGI systems since about 1989, starting with IRIS 340 server and Personal IRIS 4D,
and on up through Indigo, Octane, O2, Indy, Origin 2000, Origin 3000, Altix 350 and 3700,
and newer XE clusters. No UV yet. For storage, TP9100, IS220, IS5000.
I have loads of old manuals etc. from the IRIX days.


That is quite the resume! :shock:

Thanks for signing up and don't be a stranger now you know where we are... :mrgreen:
Project:
Temporarily lost at sea...
Plan:
World domination! Or something...

:Tezro: :Octane2:
Robyn wrote: What do you think of that?

I think I'm surprised you were able to find something so useful on Supportfolio... :lol: Just kidding, and glad to hear it.

I certainly hope it works, and if so please let us know so we can write it up. Sounds like something that might apply to any of the systems featuring an L1 controller...
I reflashed the L1 firmware with the same version using 'flashsc -f' from the L3 station,
then rebooted the L1. Alas, it did not clear the NVRAM checksum error or fix the
mismatch of BSNs between EEPROM and NVRAM.


Robyn
Robyn wrote: After replacing an impeller module in an Altix 350 C-brick, the brick won't power up because of a serial number mismatch.
The problem is that the NVRAM reports a checksum error, and the brick
serial number (BSN) in NVRAM does not match the BSN in EEPROM. The EEPROM is correct. Excerpts from "L1> log":

08/29/13 18:40:16 nvram checksum error - log pointers invalid, using backup copy


Anybody know how to fix up the NVRAM's BSN?

I'll be the first to admit that I have zero experience with the Altix 350, so take the following with a grain of salt.

The nvram checksum error has cropped up before in the L1 logs of the somewhat similar Origin 350 and Origin 300.

What resolved both of those cases was replacing the PROM battery. If you'd like to take a look, here's a link to the thread - with a photo of the battery location and info on where you could obtain a replacement: viewtopic.php?f=3&t=16727448#p7357727
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
Thank you "recondas", but I am happy to report success by other means.

There are various other intermediate pieces to my sob story about ideas for partial recovery
(e.g. this brick is not the primary brick but does happen to have IO in it, and the home directory
disk is there, but I couldn't simply move that disk to the primary brick and boot system without
this brick because the primary brick has IO9 SCSI and this brick has IO10 SATA, so I'd have
to move all the IO pieces to another brick instead...) But I'll leave those out for now.

I decided to try the secret undocumented L1 command from that thread mentioned earlier.
To my astonishment, it did not merely turn off the requirement for global system serial number
consistency, it cleared the NVRAM checksum error and reset the scrambled NVRAM BSN
to match the correct EEPROM BSN. In other words, "fixed".

Thanks to whoever discovered that command and published it.


Robyn
Hmmm... The thot plickens!
For the convenience of anyone searching in the future, the undocumented command is "let the carnage begin" - relevant Nekowiki page is here: http://www.nekochan.net/wiki/Use_a_rbrick_in_place_of_a_NUMAlink_module_on_Origin_300