Okay, so following Reconda's advice I stripped the system down to the base configuration (MSC, one nodeboard and an IO6G, because my IO6 appears to be PROM-passworded and is otherwise useless right now).
Using one of the good nodes with good ram I went into POD, nuked everything and then restarted to let everything get rediscovered. On restart I see:
Code:
WARNING: Board with freq 100 mhz at module 1 slot 1 is attached to a back-plane of 360 mhz frequency
WARNING: Hub at 100 mhz and router at 90 mhz is an unsupported configuration
...otherwise the system runs normally.
Then powered down, removed the node and transferred the known good memory from banks 0 and 1 to one of the questionable nodes. Put it all back together and powered up.
Code:
1B 000: ***No useable ram installed. Need working and enabled memory in bank 0 or 1
1B 000: ***Add working and enabled memory present in bank 0 or 1 and reset the syst
MMSC also displays
Code:
5C0 M 1
Oh yeah, plus the system locked up and had to be powered down at the breaker.
Reinstall the ram in the good node and confirm that it is good. It is and I get to the PROM.
Okay. So can confirm two good banks worth of ram and one healthy nodeboard. Onwards.
Perform the same POD nuke and reset as before with the other good node; again the same two warnings. Can now confirm two good nodeboards and four healthy banks worth of ram.
Now repeat putting healthy ram into another questionable nodeboard.
Code:
1B 000: ***No useable ram installed. Need working and enabled memory in bank 0 or 1
1B 000: ***Add working and enabled memory present in bank 0 or 1 and reset the syst
...and again, it locks solid.
Repeat, putting known-good pairs of ram into the remaining two spare nodeboards. Exact same result.
Okay. So now we put one healthy node + ram combo into slot 1 and stuff a questionable node with good ram in the second slot.
On top of our previous memory errors we also now have:
Code:
***Warning: Found a new IP27 board in module 1, slot n2, serial KAH055
Please use the 'update' command from the PROM Monitor to update the inventory
and...
Code:
IP27 Node Board, Module 1, Slot n2
ASIC HUB Rev 5, 100 MHz, (nasid 1)
Processor A: **Disabled, Reason = Unuseable bank 0.
Processor A: 250 MHz R10000 Rev 0.0
Secondary Cache 4MB 250MHz Tap 0x9 , (cpu 2)
R10010FPC Rev 0.0
Processor B: **Disabled, Reason = Unuseable bank 0.
Processor B: 250 MHz R10000 Rev 0.0
Secondary Cache 4MB 250MHz Tap 0x9 , (cpu 3)
R10010FPC Rev 0.0
Memory on board, 0 MBytes (Standard), (256 Mbytes - Bank(s) 0 1 disabled)
Bank 0, 128 MBytes Disabled, Reason: Some DIMMs failed mem test.
Bank 1, 128 MBytes Disabled, Reason: Some DIMMs failed mem test.
Running UPDATE and a reset completely disables the node.
Code:
***Warning: Board in module 1, slot n2 is missing or disabled It previously contained a node-board, barcode KAH055 laser 3f3404
Same applies with the other three.
The one thing I have not yet done is put any of the questionable sticks into one of the good nodes, in case I am dealing with a bad stick that has killed every node it has been installed in so far. I have also swapped my two routers around in case they were acting funny.
On the other hand, if you install both healthy nodes with good ram...
Code:
*** Barrier sync warning: local=4418, NIC 0x3408b0=4400 (promrev mismatch?)
*** Barrier sync warning: local=4446, NIC 0x3408b0=4428 (promrev mismatch?)
*** Barrier sync warning: local=4908, NIC 0x3408b0=4890 (promrev mismatch?)
*** Barrier sync warning: local=4981, NIC 0x3408b0=4963 (promrev mismatch?)
....it appears they need to be flashed. I don't know offhand how to do that without booting into IRIX.
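For what it's worth, the local and remote numbers in those warnings can be compared with a few lines of Python. This is only a sketch: barrier_offsets is a made-up helper name, and the PROM output itself does not say what the numbers represent, so the script just reports the difference between them.

```python
import re

# Warning lines copied from the console output above.
LOG = """\
*** Barrier sync warning: local=4418, NIC 0x3408b0=4400 (promrev mismatch?)
*** Barrier sync warning: local=4446, NIC 0x3408b0=4428 (promrev mismatch?)
*** Barrier sync warning: local=4908, NIC 0x3408b0=4890 (promrev mismatch?)
*** Barrier sync warning: local=4981, NIC 0x3408b0=4963 (promrev mismatch?)
"""

# Captures the local value, the remote NIC id and the remote value.
PATTERN = re.compile(r"local=(\d+), NIC (0x[0-9a-fA-F]+)=(\d+)")


def barrier_offsets(text):
    """Return (nic, local, remote, local - remote) for each warning line."""
    rows = []
    for m in PATTERN.finditer(text):
        local, remote = int(m.group(1)), int(m.group(3))
        rows.append((m.group(2), local, remote, local - remote))
    return rows


offsets = barrier_offsets(LOG)
for nic, local, remote, delta in offsets:
    print(f"NIC {nic}: local={local} remote={remote} delta={delta}")
```

For the four lines above every delta comes out as 18, so the two nodes disagree by the same amount each time, which at least fits the "promrev mismatch?" hint.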
Edited: Also, if you have both healthy nodes installed and then add one of the questionable nodes with questionable ram, you get:
Code:
2A 001:Testing/Initializing all memory ........... DONE
2A 001: waiting for node with nic 3472e2 at module 1 slot 1 at global barrier...
....
2A 001: ........
2A 001: *** Barrier sync warning: local=5090, NIC 0x3472e2=5108 (promrev mismatch?)
2A 001:Checking partitioning information ......... DONE
2A 001: > Global barrier failed at line 5337 node 0
2B 001: Local slave entering slave loop
2A 001: Local master entering slave loop
1A 000:
1A 000: *** General Exception on node 0
1A 000: *** EPC: 0xc00000001fc3a2e8 (0xc00000001fc3a2e8)
1A 000: *** Press ENTER to continue.
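With several CPUs writing to the same console, it can help to split the output by the prefix at the start of each line. A rough Python sketch, assuming the prefix is slot number plus CPU letter followed by a node id (that reading is a guess based on the hinv output, where slot n2 shows up as nasid 1; group_by_source is just an illustrative name):

```python
import re
from collections import defaultdict

# A few lines copied from the console output above.
LOG = """\
2A 001: > Global barrier failed at line 5337 node 0
2B 001: Local slave entering slave loop
2A 001: Local master entering slave loop
1A 000: *** General Exception on node 0
"""

# Assumed layout: <slot digit><CPU letter> <3-digit node id>:<message>
PREFIX = re.compile(r"^(\d)([AB]) (\d{3}):\s?(.*)$")


def group_by_source(text):
    """Group console messages by their (slot, cpu, node) prefix."""
    groups = defaultdict(list)
    for line in text.splitlines():
        m = PREFIX.match(line)
        if m is None:
            continue  # skip lines that do not carry the prefix
        slot, cpu, node, msg = m.groups()
        groups[(int(slot), cpu, int(node))].append(msg)
    return dict(groups)


grouped = group_by_source(LOG)
for (slot, cpu, node), msgs in sorted(grouped.items()):
    print(f"slot {slot} cpu {cpu} node {node}:")
    for msg in msgs:
        print("   ", msg)
```

Grouped this way, it is easier to see that the barrier failure is reported by 2A while the exception comes from 1A.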