The collected works of hkurokawa

Hi all,

My TEZRO with an IP59 PIMM no longer starts the Power-On Diagnostics (POD), for the reason given in the subject.

A few days ago, my TEZRO halted and reported a "TLB refill exception" error many times.
While I was checking the status of the components from the PROM console (I issued enableall or something similar),
the system suddenly stopped with that error, and after I re-plugged the power cable, POD stopped starting up.
(The TLB refill errors may have been caused by faulty DIMMs:
when I inspected the DIMMs, I found that some ceramic capacitors had come off the component side.
That group of DIMMs has now been removed.)

When I issue "pwr u" from the L1 console, the fans and HDDs spin up, but POD does not start.
When I then issue "leds", the system reports the following messages.

001c01-L1>leds
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >

Of course, all CPUs are enabled at the L1 level.

001c01-L1>cpu
CPU Present Enabled
--- ------- -------
0A  1       1
0B  0       0
0C  1       1
0D  0       0
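
The contradiction between the two outputs above (enabled at the L1 level, yet reported as disabled by the PROM) can be checked mechanically. The following is only a minimal Python sketch, not an SGI tool: the function names are hypothetical, the sample text is copied from the outputs above, and the parsing assumes the line layout shown in this thread.

```python
import re

# Sample text copied from the "leds" and "cpu" outputs in this thread.
LEDS_OUTPUT = """\
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >
"""

CPU_OUTPUT = """\
CPU Present Enabled
--- ------- -------
0A 1 1
0B 0 0
0C 1 1
0D 0 0
"""


def parse_leds(text):
    """Map CPU letter -> status name on its 'CPU X:' line (None if absent).

    Indented continuation lines, if any, are ignored in this sketch.
    """
    status = {}
    for line in text.splitlines():
        m = re.match(r"CPU ([A-D]):\s*(.*)", line)
        if not m:
            continue
        letter, rest = m.groups()
        code = re.match(r"0x[0-9a-fA-F]+:\s*(\w+)", rest)
        status[letter] = code.group(1) if code else None
    return status


def parse_cpu_table(text):
    """Map CPU letter -> (present, enabled) from rows such as '0A 1 1'."""
    table = {}
    for line in text.splitlines():
        m = re.match(r"\s*\d+([A-D])\s+([01])\s+([01])\s*$", line)
        if m:
            table[m.group(1)] = (m.group(2) == "1", m.group(3) == "1")
    return table


def find_mismatches(leds_text, cpu_text):
    """CPUs that L1 lists as present and enabled but the LEDs show as disabled."""
    leds = parse_leds(leds_text)
    cpus = parse_cpu_table(cpu_text)
    return sorted(
        letter
        for letter, (present, enabled) in cpus.items()
        if present and enabled and leds.get(letter) == "FLED_DISABLED"
    )


print(find_mismatches(LEDS_OUTPUT, CPU_OUTPUT))  # ['A', 'C']
```

Both installed CPUs show up as mismatches, which suggests that the disable flag is stored somewhere other than the L1 setting.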

I also reflashed the L1 firmware from the software L2 emulator, but that did not re-enable the CPUs.

Here are my logs and related information.

001c01-L1>pwr u
ERROR: no response from 001c01

(wait 10 sec.)
001c01-L1>leds
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >

001c01-L1>pwr d
001c01-L1>log
04/12/13 08:58:09 ChiWS IP59
04/12/13 08:58:11 USB0: waiting on open
04/15/13 09:00:47 L1 booting 1.40.6
04/15/13 09:00:47 ChiWS IP59
04/15/13 09:00:49 USB0: waiting on open
04/15/13 09:02:24 USB0: opened
04/15/13 09:02:24 USB0: registered as remote
04/15/13 09:02:24 USB0: registered for events
04/15/13 09:07:08 power up (COMMAND)
04/15/13 09:07:13 Node 0 XTalk clock 88
04/15/13 09:07:14 reset again MIPS
04/15/13 09:07:19 Node 0 XTalk clock 88
04/15/13 09:07:46 power down (COMMAND)
04/15/13 09:12:32 USB-R: USB:connection lost
04/15/13 09:12:32 UNREG: 300054c0 0 7
04/15/13 09:12:32 USB0: unregistered
04/15/13 09:12:33 USB0-R: IRouter:read failed - read error
04/15/13 09:12:33 USB0: waiting on open
04/15/13 22:11:19 USB0: opened
04/15/13 22:11:19 USB0: registered as remote
04/15/13 22:11:19 USB0: registered for events
04/15/13 22:27:13 USB-R: USB:connection lost
04/15/13 22:27:13 UNREG: 300054c0 0 7
04/15/13 22:27:13 USB0: unregistered
04/15/13 22:27:14 USB0-R: IRouter:read failed - read error
04/15/13 22:27:14 USB0: waiting on open
04/15/13 22:45:29 USB0: opened
04/15/13 22:45:29 USB0: registered as remote
04/15/13 22:45:29 USB0: registered for events
04/15/13 23:33:30 2MB flash part!!
04/15/13 23:43:19 L1 booting 1.48.1
04/15/13 23:43:19 ChiWS IP59
04/15/13 23:43:21 USB0: waiting on open
04/15/13 23:43:21 USB0: opened
04/15/13 23:43:21 USB0: registered as remote
04/15/13 23:43:21 USB0: registered for events
04/15/13 23:50:41 power up (COMMAND)
04/15/13 23:50:45 Node 0 XTalk clock 88
04/15/13 23:50:47 reset again MIPS
04/15/13 23:50:51 Node 0 XTalk clock 88
04/15/13 23:53:00 power down (COMMAND)
04/15/13 23:53:06 L1 booting 1.48.1
04/15/13 23:53:06 ChiWS IP59
04/15/13 23:53:08 USB0: waiting on open
04/15/13 23:53:08 USB0: opened
04/15/13 23:53:08 USB0: registered as remote
04/15/13 23:53:08 USB0: registered for events
04/16/13 00:01:42 power up (COMMAND)
04/16/13 00:01:46 Node 0 XTalk clock 88
04/16/13 00:01:48 reset again MIPS
04/16/13 00:01:52 Node 0 XTalk clock 88
04/16/13 00:02:12 power down (COMMAND)
001c01-L1>serial all

Data                           Location     Value
------------------------------ ------------ --------
Local System Serial Number     NVRAM        P1003842
Reference System Serial Number NVRAM        P1003842
Local Brick Serial Number      EEPROM       NWC610
Reference Brick Serial Number  NVRAM        NWC610


EEPROM     Product Name   Serial        Part Number          Rev T/W
---------- -------------- ------------- -------------------- --- ------
INTERFACE  WS_INT_53      NWC610        030_1881_007         B   00
IO9        IO9            NWY514        030_1771_006         A   00
ODYSSEY    ODY128B1_2     NWG705        030_1884_005         B   00
SNOWBALL   no hardware detected
NODE       IP59_2CPU      NSD938        030_2059_002         C   00
IO DGHTR   CHWS_IO_DAUG   NVG234        030_1875_003         A   00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     no hardware detected
DIMM 2     no hardware detected
DIMM 4     7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 6     7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 1     no hardware detected
DIMM 3     no hardware detected
DIMM 5     7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2   00FF 8.0    N/A
DIMM 7     7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3   00FF 8.0    N/A

001c01-L1>cpu
CPU Present Enabled
--- ------- -------
0A  1       1
0B  0       0
0C  1       1
0D  0       0

001c01-L1>flash status
Flash image A currently booted

Image Status        Revision   Built
----- ------------- ---------- -----
A     user default  1.48.1     01/22/2007 11:34:34 (<- reflashed image)
B     valid         1.40.6     01/06/2006 13:16:50
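
For anyone reading a long L1 log like the one above, the firmware change and the length of each power-up attempt can be pulled out with a short script. This is a rough Python sketch under stated assumptions: the function names are hypothetical, the excerpt is a subset of the log lines above, and the timestamps are taken to be MM/DD/YY HH:MM:SS (the only reading that fits a value such as 04/15/13).

```python
from datetime import datetime

# A subset of the L1 log lines shown above.
LOG_EXCERPT = """\
04/15/13 09:00:47 L1 booting 1.40.6
04/15/13 09:07:08 power up (COMMAND)
04/15/13 09:07:46 power down (COMMAND)
04/15/13 23:33:30 2MB flash part!!
04/15/13 23:43:19 L1 booting 1.48.1
04/15/13 23:50:41 power up (COMMAND)
04/15/13 23:53:00 power down (COMMAND)
"""


def parse_log(text):
    """Return (timestamp, message) pairs; lines without a timestamp are skipped."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue
        entries.append((stamp, parts[2]))
    return entries


def firmware_versions(entries):
    """L1 firmware versions in boot order, with consecutive repeats collapsed."""
    versions = []
    prefix = "L1 booting "
    for _, message in entries:
        if message.startswith(prefix):
            version = message[len(prefix):]
            if not versions or versions[-1] != version:
                versions.append(version)
    return versions


def power_cycles(entries):
    """Seconds between each 'power up' and the following 'power down'."""
    cycles = []
    started = None
    for stamp, message in entries:
        if message.startswith("power up"):
            started = stamp
        elif message.startswith("power down") and started is not None:
            cycles.append(int((stamp - started).total_seconds()))
            started = None
    return cycles


entries = parse_log(LOG_EXCERPT)
print(firmware_versions(entries))  # ['1.40.6', '1.48.1']
print(power_cycles(entries))       # [38, 139]
```

The version list matches the "flash status" output: the L1 was running 1.40.6 before the reflash and 1.48.1 after it.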

I think the cause of this error is that some strange data was written to the PROM or NVRAM because of the broken DIMMs.
Does anyone know a way to re-enable the CPUs?
Sorry for my poor English; I am not a native speaker.
Thank you for your reply.

Unfortunately, I have not been able to try "enableall" from the PROM console,
because the PROM cannot start up due to the disabled-CPU problem.
I also tried entering the PROM console (console mode, <Ctrl + D>) from L1, but
the system reported "no response from 001c01" and no PROM POD messages were displayed.

Also, my TEZRO is a deskside model.
Thank you for your kind advice.

I tried moving the DIMMs and powering up, but the PROM still does not start.

001c01-L1>serial all

Data                           Location     Value
------------------------------ ------------ --------
Local System Serial Number     NVRAM        P1003842
Reference System Serial Number NVRAM        P1003842
Local Brick Serial Number      EEPROM       NWC610
Reference Brick Serial Number  NVRAM        NWC610


EEPROM     Product Name   Serial        Part Number          Rev T/W
---------- -------------- ------------- -------------------- --- ------
INTERFACE  WS_INT_53      NWC610        030_1881_007         B   00
IO9        IO9            NWY514        030_1771_006         A   00
ODYSSEY    ODY128B1_2     NWG705        030_1884_005         B   00
SNOWBALL   no hardware detected
NODE       IP59_2CPU      NSD938        030_2059_002         C   00
IO DGHTR   CHWS_IO_DAUG   NVG234        030_1875_003         A   00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 2     7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2   00FF 8.0    N/A
DIMM 4     no hardware detected
DIMM 6     no hardware detected
DIMM 1     7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 3     7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 5     no hardware detected
DIMM 7     no hardware detected

001c01-L1>pwr u
001c01-L1>leds (issued soon after "pwr u")
CPU A: 0x0c: PLED_INVSCACHE: Invalidate secondary cache.
       0x2a: PLED_MAKESTACK
       0x0d: PLED_INMAIN: Successfully jumped to main().
       0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU B: < CPU not present >
CPU C: 0x0c: PLED_INVSCACHE: Invalidate secondary cache.
       0x2a: PLED_MAKESTACK
       0x0d: PLED_INMAIN: Successfully jumped to main().
       0x38: PLED_BARRIEROK
CPU D: < CPU not present >

001c01-L1>leds (issued after waiting about 10 sec.)
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >
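
To confirm that the same four modules were simply moved to the lower slots, the DIMM rows of the two "serial all" outputs can be compared. This is a small Python sketch, not part of any SGI tooling: the function names are hypothetical, and it assumes that the "JEDEC-SPD Info" string is unique per module, which holds for the four DIMMs listed here.

```python
# DIMM rows copied from the first "serial all" output (before the move).
BEFORE = """\
DIMM 0 no hardware detected
DIMM 2 no hardware detected
DIMM 4 7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 6 7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 1 no hardware detected
DIMM 3 no hardware detected
DIMM 5 7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2 00FF 8.0 N/A
DIMM 7 7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3 00FF 8.0 N/A
"""

# DIMM rows copied from the second "serial all" output (after the move).
AFTER = """\
DIMM 0 7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 2 7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2 00FF 8.0 N/A
DIMM 4 no hardware detected
DIMM 6 no hardware detected
DIMM 1 7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 3 7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 5 no hardware detected
DIMM 7 no hardware detected
"""


def parse_dimms(text):
    """Map SPD info string -> slot number for every populated slot."""
    slots = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] != "DIMM" or not fields[1].isdigit():
            continue
        if fields[2:5] == ["no", "hardware", "detected"]:
            continue
        slots[fields[2]] = int(fields[1])
    return slots


def dimm_moves(before_text, after_text):
    """Map SPD info string -> (old slot, new slot); new slot is None if missing."""
    before = parse_dimms(before_text)
    after = parse_dimms(after_text)
    return {spd: (slot, after.get(spd)) for spd, slot in before.items()}


moves = dimm_moves(BEFORE, AFTER)
for spd, (old, new) in sorted(moves.items(), key=lambda item: item[1][0]):
    print(f"{spd}: slot {old} -> slot {new}")
```

All four modules appear in both listings (slots 4, 5, 6, 7 became 0, 2, 1, 3), so the L1 still detects every DIMM after the move.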

I also tried with all DIMMs removed, but the result was the same: the PROM could not start running.
(If the PROM were starting normally, I think its POD would report a "no DIMMs" error in such a case.)

Thank you for your valuable advice.
My TEZRO's CPUs now seem to be re-enabled.

I had read the article you suggested,
but I did not think that procedure applied to my environment
(because the TEZRO is based on the O3000 architecture, whereas the article covers the O2000).

I tried some of the commands that you suggested.

Here is my procedure.

001c01-L1>debug 0x10d (<-Setting DEBUG mode)
debug switches set to 0x010d
001c01-L1>pwr u
(wait about 15sec)
001c01-L1>leds
CPU A: 0xbc: POD Mode (0x80/0xBC=okay, solid 0xBC=possibly hung polling UART) (<- Seems to have bypassed the CPU env check!!)
       0x80: POD Mode (0x80/0xBC=okay, solid 0x80=possibly hung polling UART)
CPU B: < CPU not present >
CPU C: 0x45: PLED_LAUNCHLOOP: Slave loop (0x00/0x45=okay, solid 0x45=possibly hung)
       0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU D: < CPU not present >

001c01-L1> (<-Ctrl + D)
entering console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> go cac
A 000 001c01: Must be in Dex mode before switching to Cac or Unc.
A 000 001c01: POD SysCt Cac> clearalllogs
A 000 001c01: *** This must be run only after NUMAlink discovery is complete.
A 000 001c01: *** This will clear all previous log variables such as:
A 000 001c01: *** moduleids, nodeids, etc. for all nodes.
A 000 001c01: Clear all logs? [n] y
A 000 001c01: Checking 1 entries for promlogs
A 000 001c01: .DONE
A 000 001c01: All PROM logs cleared!
A 000 001c01: POD SysCt Cac> initalllogs
A 000 001c01: *** This must be run only after NUMAlink discovery is complete.
A 000 001c01: *** This will clear all previous log variables such as:
A 000 001c01: *** moduleids, nodeids, etc. for all nodes.
A 000 001c01: Clear all logs environment variables, and aliases ? [n] y
A 000 001c01: Checking 1 entries for promlogs
A 000 001c01: .DONE
A 000 001c01: All PROM logs cleared!
A 000 001c01: POD SysCt Cac> flush
A 000 001c01: POD SysCt Cac>
escaping to L1 system controller
001c01-L1>debug 0 (<- Clear DEBUG flag)
debug switches set to 0x0000

returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> reset
A 000 001c01: Resetting the system...
Starting PROM Boot process (<- Yes!, now revived!!)


IP35 PROM SGI Version 6.210 built 02:33:51 PM Aug 26, 2004
Testing/Initializing memory ............... DONE
Copying PROM code to memory ............... DONE
Discovering local IO ...................... DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 5898 usec
Waiting for peers to complete discovery.... DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
Intializing any CPUless nodes.............. DONE
Checking partitioning information ......... DONE
No other nodes present; becoming partition master
Local slave entering slave loop
Loading BASEIO prom ....................... DONE

BASEIO PROM Monitor SGI Version 6.210 built 02:30:38 PM Aug 26, 2004 (BE64)
2 CPUs on 1 nodes found.

NVRAM checksum is incorrect: reinitializing.
Automatic update of PROM environment disabled
Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
Board version 1 - Buzz revision 3B
On board sdram size: 128 Mb
Cas latency: CAS 3
4 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............
On-board (IO9) tigon3 1000BaseT interface
Base I/O Ethernet set to /dev/ethernet/tg0
Installing Graphics Console...
graphics install: searching for pipe 0
Probing IOC4 ATA adapter 2
IOC4 RevId = 83
Detected Vendor id/Product MATSHITA DVD-ROM SR-8589

Walking SCSI Adapter 0, (pci id 3)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)


Walking SCSI Adapter 1, (pci id 3)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13timeout on adapter 1 target d
tm0=0xffffabaee079d65b, tm1=0xffffabaee079d634, timeout=0xb
- 14Set Device queue Parameters failed
- 15- = 0 device(s)

Initializing PROM Device drivers ..........
Initializing Base I/O Ethernet Interface...Failed. MII Status Register = 0x7949
Done.
---------------Interface Configuration Summary----------------
ASIC|Revision|MAC Address : 5701|B5|08:00:69:11:e7:6f
Link Negotiation|Advertisement : On|<H10 F10 H100 F100 F1000>
Link|Speed|Duplex|Rx/Tx FlowCtrl: Down|10|Half|Off/Off
--------------------------------------------------------------
DONE
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc4pckm0 for input
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc4pckm0 for input
Checking hardware inventory ...............
WARNING: hardware inventory is invalid. Reinitializing...
Writing 5 records..... DONE
Updated new configuration. Wrote 5 records.

**** System Configuration and Diagnostics Summary ****
CONFIG:
No. of NODEs enabled = 1
No. of NODEs disabled = 0
No. of CPUs enabled = 2
No. of CPUs disabled = 0
Mem enabled = 2048 MB
Mem disabled = 0 MB
No. of RTRs enabled = 0
No. of RTRs disabled = 0

DIAG RESULTS:
ALL DIAGS PASSED.
**** End System Configuration and Diagnostics Summary ****


System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option?
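
A note on the value passed to the L1 "debug" command in the procedure above: the L1 echoes it as "debug switches", which suggests (this is an assumption, not something documented in this thread) that 0x10d is a bit mask. The short Python sketch below only shows which bit positions are set; what each switch does is not known here, and the function name is hypothetical.

```python
def set_bits(value):
    """Return the positions of the bits set in value, lowest bit first."""
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


DEBUG_RECOVERY = 0x10D  # value used with "debug" before "pwr u" above
DEBUG_NORMAL = 0x0      # value used to clear the switches afterwards

# Printed in the same four-digit form that the L1 echoes back.
print(f"0x{DEBUG_RECOVERY:04x} -> bits {set_bits(DEBUG_RECOVERY)}")  # 0x010d -> bits [0, 2, 3, 8]
print(f"0x{DEBUG_NORMAL:04x} -> bits {set_bits(DEBUG_NORMAL)}")      # 0x0000 -> bits []
```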


Thank you again for your kind help and for taking the time to look at my problem.