
How to fix an Altix 350 with mixed PROM versions

Let's say you find, buy, or inherit two or more Altix 350 modules. You connect everything up, but it just won't seem to make it to the EFI shell - in fact it dumps you into the Power On Diagnostics (POD) instead. It might look something like this:

Code:
SGI SN1 L1 Controller
Firmware Image B: Rev. 1.26.9, Built 02/12/2004 13:59:52

001c01-L1>* pwr up
001c01-L1>
entering console mode  001c01 CPU0, <CTRL_T> to escape to L1

INFO: console subchannel changed:  001c08 CPU2
001c08#0c: SGI SAL Version 3.25 rel040225 IP41 built 12:01:43 PM Feb 25, 2004
INFO: console subchannel changed:  001c08 CPU0
001c08#0a: SGI SAL Version 3.25 rel040225 IP41 built 12:01:43 PM Feb 25, 2004
001c01#0c: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
001c01#0a: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
Probing memory DIMMs ...............Found I/O brick attached to module/001c01/sl
ab/0/node
Probing memory DIMMs ........................           DONE
Initializing memory controller ............             DONE
Testing memory ..............................           DONE
Initializing memory controller ............             DONE
Testing memory ............................             DONE
Initializing memory ....................                DONE
.Switching to RAM and testing CPU ..........            DONE
Discovering NUMAlink connectivity .........             DONE
Found 2 objects (2 chipsets, 0 routers, 0 iochipsets) in 3909 usec
Waiting for peers to complete discovery....             ..              DONE
Initializing memory .......................             DONE
Switching to RAM and testing CPU ..........             DONE
Discovering NUMAlink connectivity .........             DONE
Found 2 objects (2 chipsets, 0 routers, 0 iochipsets) in 15722 usec
Waiting for peers to complete discovery....             DONE
DONE

tree barrier at module/001c01/slab/0/node timed out
POD entered via MCA, using Cac mode
INFO: console subchannel changed:  001c08 CPU2
POD entered via MCA, using Cac mode
INFO: console subchannel changed:  001c08 CPU0
0 002: POD SysCt Cac> INFO: console subchannel changed:  001c08 CPU2
2 002: POD SysCt Cac> *** module/001c08/slab/0/node has taken an exception! Cont
inuing...
Discovering local I/O on nasid 0 .........              DONE
Checking partitioning information .........             DONE
Erecting partition fences .................             DONE
POD entered via MCA, using Cac mode
0 000: POD SysCt Cac> POD entered via MCA, using Cac mode
2 000: POD SysCt Cac>

As frequently happens, you may have received these modules in pieces, and everything is suspect. Did you get the right RAM? Are the NUMAlink cables good? Was there some other step you were supposed to take? You may have some or all of those issues, but here we're going to tackle just one: the two modules have different versions of the PROM.

If you look at the snippet above you'll see "001c01#0c: SGI SAL Version 4.43", which indicates that the brick 001c01 (probably your base module) has PROM version 4.43, whereas the line "001c08#0a: SGI SAL Version 3.25" indicates that module 001c08 has PROM version 3.25. Not only is PROM 3.25 too old to work with version 4.43, it's too old to work with version 2.6 of the Linux kernel...
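
If you have captured the console output to a file, the comparison described above can be automated. What follows is only a minimal Python sketch, not an SGI tool: the names sal_versions and has_mismatch are made up for illustration, and the regular expression assumes the version lines look exactly like the ones in the log above.

```python
import re
from collections import defaultdict

# Matches boot lines such as:
#   001c08#0c: SGI SAL Version 3.25 rel040225 IP41 built ...
# (pattern is an assumption based on the sample log only)
SAL_LINE = re.compile(
    r"^(?P<module>\d{3}[a-z]\d{2})#\w+:\s+SGI SAL Version (?P<version>\d+\.\d+)"
)

def sal_versions(log_text):
    """Return {module: set of SAL/PROM versions} seen in a console log."""
    versions = defaultdict(set)
    for line in log_text.splitlines():
        m = SAL_LINE.match(line.strip())
        if m:
            versions[m.group("module")].add(m.group("version"))
    return dict(versions)

def has_mismatch(versions):
    """True if more than one distinct version appears across all modules."""
    distinct = set().union(*versions.values())
    return len(distinct) > 1

sample = """\
001c08#0c: SGI SAL Version 3.25 rel040225 IP41 built 12:01:43 PM Feb 25, 2004
001c08#0a: SGI SAL Version 3.25 rel040225 IP41 built 12:01:43 PM Feb 25, 2004
001c01#0c: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
001c01#0a: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
"""

found = sal_versions(sample)
for module, vers in sorted(found.items()):
    print(module, sorted(vers))
print("mismatch" if has_mismatch(found) else "consistent")
```

Run against the four lines above, it lists 4.43 for 001c01 and 3.25 for 001c08 and prints "mismatch". The same check can be run on any later boot log.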

Fortunately there's a way to fix this even if you can't get to the EFI shell, which is where SGI's documentation tells you to use the "flash" command to reprogram a module's PROM from a binary file. Many thanks to forum member rosmaniac for sharing the method shown below in this thread.

Though nobody has reported seeing documentation on the POD, there is a "help" command, and the help command tells you about another command: the "flash" command. It's not quite the same as the EFI Shell command of the same name - instead of flashing a PROM image stored in a file, it burns the PROM image from the master node into the node you select.

How do you tell the "flash" command which node to act on? You need something called the node's NASID, and there's another command the POD provides - "pcfg" - that will tell you what those are. So first, let's see how the "pcfg" command does that.

Code:
2 000: POD SysCt Cac> version
SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
2 000: POD SysCt Cac> pcfg
NUMAlink Topology: (node 0):
Entry 0: SHub 001c01#0 Chiprev=3 Route=0x0
Module=001c01 Slab=0 Partition=0 Space=RESET
Nasid=0 Flags=0x100000 Syssize=0 Prom=4.43
Port 1 connection: Entry 1 SHub 001c08#0 port 2
Port 1 status: UP NF
Port 2 connection: Entry 1 SHub 001c08#0 port 1
Port 2 status: UP NF
Entry 1: SHub 001c08#0 Chiprev=3 Route=0x1
Module=001c08 Slab=0 Partition=0 Space=RESET
Nasid=2 Flags=0x1110000 Syssize=0 Prom=3.25
Port 1 connection: Entry 0 SHub 001c01#0 port 2
Port 1 status: UP NF
Port 2 connection: Entry 0 SHub 001c01#0 port 1
Port 2 status: UP NF
2 000: POD SysCt Cac>

The first thing I do is run the "version" command, just to make sure that the POD I'm talking to belongs to the correct (newer) PROM version.

You can see that when I run the "pcfg" command the output contains two Entry blocks, one for each module. Find the entry for the module you want ("Module=001c08"), confirm that this module has the version of the PROM you want to replace ("Prom=3.25"), and then note the value of the NASID for this entry ("Nasid=2").
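
The lookup described above (module name, then PROM version, then NASID) can also be done mechanically on a saved copy of the "pcfg" output. This is a hedged Python sketch under stated assumptions: parse_pcfg and nodes_to_flash are hypothetical helper names, the parsing relies only on the Key=Value layout visible in the capture above, and the script merely prints the command for you to review - it does not talk to the POD.

```python
import re

ENTRY = re.compile(r"^Entry (\d+): SHub (\S+)")
FIELD = re.compile(r"(\w+)=(\S+)")

def parse_pcfg(text):
    """Return one dict per 'Entry N:' block, holding its Key=Value fields."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        m = ENTRY.match(line)
        if m:
            entries.append({"Entry": m.group(1), "SHub": m.group(2)})
        if entries:
            # Port lines contain no '=', so they add nothing here.
            entries[-1].update(FIELD.findall(line))
    return entries

def nodes_to_flash(entries, master_prom):
    """Entries whose Prom differs from the version the POD itself reports."""
    return [e for e in entries if e["Prom"] != master_prom]

sample = """\
NUMAlink Topology: (node 0):
Entry 0: SHub 001c01#0 Chiprev=3 Route=0x0
Module=001c01 Slab=0 Partition=0 Space=RESET
Nasid=0 Flags=0x100000 Syssize=0 Prom=4.43
Port 1 connection: Entry 1 SHub 001c08#0 port 2
Port 1 status: UP NF
Entry 1: SHub 001c08#0 Chiprev=3 Route=0x1
Module=001c08 Slab=0 Partition=0 Space=RESET
Nasid=2 Flags=0x1110000 Syssize=0 Prom=3.25
Port 1 connection: Entry 0 SHub 001c01#0 port 2
Port 1 status: UP NF
"""

entries = parse_pcfg(sample)
# "4.43" is what the POD's own "version" command reported in the capture.
targets = nodes_to_flash(entries, master_prom="4.43")
for e in targets:
    print(f"{e['Module']} has PROM {e['Prom']}: candidate command is 'flash {e['Nasid']}'")
```

Comparing against the version reported by the POD's own "version" command, rather than against the highest version found, matches the fact that "flash" copies the image from the master node.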

Now you're ready to flash the newer PROM to the out-of-date module, using the NASID. Make sure that the command you've entered is the one you want: the "flash" command will not ask you to confirm!

Code:
2 000: POD SysCt Cac> flash 2
Flashing node 2
...erasing sectors
................................................................................
................Done.
...copying prom
source address     : 0x80000087ffa00000
destination address: 0x8000008fffa00000
size (bytes)       : 0x0000000000600000
...programming
................................................................................
................Flash of node 2 complete.
Waiting for all flash operations to complete...DONE.
2 000: POD SysCt Cac>

That's it - the newer PROM image from module 001c01 has been written to module 001c08. But it won't take effect until you reset everything:

Code:
2 000: POD SysCt Cac> reset
001c08#0c: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
001c01#0c: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
INFO: console subchannel changed:  001c08 CPU0
001c08#0a: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
001c01#0a: SGI SAL Version 4.43 rel051202 IP41 built 02:14:24 PM Dec  2, 2005
Probing memory DIMMs ..............Found I/O brick attached to module/001c01/slab/0/node

Note that both modules are now reporting PROM version 4.43!

Code:
Probing memory DIMMs .............................             DONE
Initializing memory controller ............                             DONE
Initializing memory controller ............             DONE
DONE
Testing memory .........................Testing memory .........................
...             DONE
.Initializing memory ....................               DONE
Switching to RAM and testing CPU ..........             DONE
..Discovering NUMAlink connectivity .........           DONE
Found 2 objects (2 chipsets, 0 routers, 0 iochipsets) in 3912 usec
Waiting for peers to complete discovery....                             DONE
Initializing memory .....................               DONE
Switching to RAM and testing CPU ..........             DONE
Discovering NUMAlink connectivity .........             DONE
Found 2 objects (2 chipsets, 0 routers, 0 iochipsets) in 3910 usec
Waiting for peers to complete discovery....             DONE
DONE
Discovering local I/O on nasid 0 .........              DONE
Checking partitioning information .........             Checking partitioning
information .........             DONE
DONE
Erecting partition fences .................             Syncing EFI var. store (
module/001c01/slab/0/node->module/001c08/slab/0/node) ...DONE
4 CPUs on 2 nodes found.
...........-
.....   DONE

Decompressing SAL runtime .................             DONE
Loading SAL runtime .......................             DONE
Decompressing EFI .........................             DONE
Loading EFI ...............................             DONE

Altix IO Topology Information
*****************************


Serial Number:R200****

PCI SEGMENT PCIBUS NUMBER     BRICK  RACK:SLOT  BUS       CONNECTION TOPOLOGY
----------- -------------     ---------------------       -------------------
0x0000        0x01         OPbrick  001:01    01      001c01:slot0:slab0:widget15:bus0
0x0000        0x02         OPbrick  001:01    02      001c01:slot0:slab0:widget15:bus1



EFI version 1.10 [14.62] Build flags: EFI64 Running on Intel(R) Itanium Processor EFI_DEBUG
EFI IA-64 SDV/FDK (No BIOS ) [Dec  2 2005 14:10:20] - INTEL

Copyright (c) 2000-2005 Broadcom Corporation
Broadcom NetXtreme Gigabit Ethernet EFI driver v8.1.1

Seg: 0 Bus: 1 Dev: 1 Func: 0 - SGI IOC4 ATA detected: Firmware Rev 79

Seg: 0 Bus: 1 Dev: 3 Func: 0 - Qlogic 12160 SCSI Controller detected: Firmware Rev 6
(Pun 1,Lun 0): FUJITSU MAP3735NC        5605

Broadcom NetXtreme Gigabit Ethernet (BCM5701) is detected (PCI)



EFI Boot Manager ver 1.10 [14.62]

Partition  0:                           Enabled Disabled
CBricks          2         Nodes          2        0
RBricks          0         CPUs           4        0
IOBricks         2         Mem(GB)        6        0


Loading device drivers

Please select a boot option
SUSE Linux Enterprise Server 11 SP1
EFI Shell [Built-in]
Boot option maintenance menu

Use ^ and v to change option(s). Use Enter to select an option


EFI Shell [Built-in]

Loading.: EFI Shell [Built-in]
EFI Shell version 1.10 [14.62]
Device mapping table
fs0  : Acpi(PNP0A03,1)/Pci(3|0)/Scsi(Pun1,Lun0)/HD(Part1,Sig3B6D7BAE-C470-476E
-BC5D-F3F974DB367E)
blk0 : Acpi(PNP0A03,1)/Pci(3|0)/Scsi(Pun1,Lun0)
blk1 : Acpi(PNP0A03,1)/Pci(3|0)/Scsi(Pun1,Lun0)/HD(Part1,Sig3B6D7BAE-C470-476E
-BC5D-F3F974DB367E)
blk2 : Acpi(PNP0A03,1)/Pci(3|0)/Scsi(Pun1,Lun0)/HD(Part2,Sig5FEB84EE-6CD6-4687
-8004-D119966337C7)
blk3 : Acpi(PNP0A03,1)/Pci(3|0)/Scsi(Pun1,Lun0)/HD(Part3,SigA5D5C908-46E3-44B9
-984F-B7C3C24FF740)
Shell>


And now not only are the modules running the same version of the PROM, the system can start the EFI Shell and you can take the next steps in troubleshooting this system. :D

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
Very nicely done - Thanks!

_________________
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
Thanks for the valuable information. It may turn out to be useful if I finally get my hands on two more NUMAlink cables.
I'd like to connect all the modules together (at least once), but honestly I don't know if I will run the whole power-hungry beast very often. I never actually checked the SAL version of each module.

_________________
:Onyx2: 4xR12k 400, 8Gb, IR2E, 2x18+3x73GB HD (oxygen)
:A3504L: :A3504L: 16xItanium2 1.6, L2 9MB (neon)
:Fuel: R16k 800 V12, 2Gb, M-Audio, 36+147GB HD, 3Dconnexion SpaceMouse Classic (nitrogen)
:Octane2: Dual R14k 600 V6, 2Go, HD (173Go, 34Go) (carbon)
:Octane: R10k 400 MXE, 1280Mo (lithium) / 2xR10k 300 SSE,... (fluorine)
:O2: R10k 195, 512Mo (hydrogen) / R5k 180, 512Mo (sodium) / R5k 180->200 motherboard and PM only
:Indigo2IMP: R10k 195, HighImpact, 160Mo (helium) / R4400 125, Extreme, 160Mo (boron)
:O200: :O200: twin O200, 4xR12k 270, 2Go, HD (4x18Go) (beryllium)
:Indigo: R4k 100, 80Mo, LG1, 9GB HD, Python 25601 tape (magnesium])
:4D70G: 4D70GT... my very first one (now property of musée bolo and the foundation mémoires informatiques )
See the hinv/gfxinfo posts here .
I just dug up what rosmaniac had shared; without it I'd be unable to use more than the base module. Amusingly I bookmarked his thread when I saw it two years ago, but had forgotten all about it - I had to find his tip by searching for anything about Altixes and PROMs...

And for completeness' sake: After I did this to bring one downrev module even with the base module at v4.43, I updated both to PROM 5.04 using the documented "flash -a shub1snprom.bin" procedure in the EFI Shell. Then when I hooked up other CM modules to the base module I was successful using the same procedure described above to bring those 3.25 modules directly to 5.04.

_________________
Glad it helped you.... Sorry it was buried...... :-(

EDIT:

That reminded me that I hadn't powered up the beast in a while... it still works, and boots Scientific Linux CERN 5.4 just fine. This is my build box for rebuilding CentOS 5 from source; I'm up to the GA release of CentOS 5.9 at the moment, and am running the smaller 4 CPU Altix 3700 on my internal CentOS 5.9 rebuild:
Code:
[root@roan ~]# uname -srvm
Linux 2.6.18-348.el5 #1 SMP Sun Jan 20 00:26:37 EST 2013 ia64
[root@roan ~]# uptime
12:33:35 up 165 days, 23:58,  1 user,  load average: 0.01, 0.01, 0.00
[root@roan ~]# cat /etc/redhat-release
CentOS release 5.9 (Final)
[root@roan ~]#


Need to build the updates..... :-)

I started with SLC 5.4 IA64, the last one put out by CERN, and built stepwise through CentOS 5.5, 5.6, 5.7, 5.8, and 5.9. If there's enough demand I can probably make the unsigned package trees available read-only; I will need to check with the CentOS devs first to make sure I don't need to rebrand before distributing. You can read what I did in the archives of the CentOS-devel mailing list; just Google for 'CentOS IA64' and you'll find it. It takes a long time, even on a big box, to do the full rebuild.

It does need PROM 5.04.
Hold that thought - I'd like to move this into a dedicated thread on Altix OS choices.

(Edit: Link)

_________________