SGI: Hardware

CPU is disabled by env variable on TEZRO

Hi all,

My TEZRO with an IP59 PIMM no longer starts the Power-On Diagnostics (POD), for the reason given in the subject line.

A few days ago, my TEZRO halted and repeatedly reported a "TLB refill exception" error.
While I was checking the status of the components from the PROM console (I issued enableall or something similar),
the system suddenly stopped, and after I re-plugged the power cable, POD no longer started.
(The TLB refill error may have been caused by faulty DIMMs:
when I inspected them, I found that some ceramic capacitors had come off the component side of the DIMMs.
That group of DIMMs has now been removed.)

When I issue "pwr u" from the L1 console, the fans and HDDs spin up, but POD does not start.
When I then issue "leds", the system reports the following messages.

001c01-L1>leds
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >

Of course, all CPUs are enabled at the L1 level.

001c01-L1>cpu
CPU Present Enabled
--- ------- -------
0A 1 1
0B 0 0
0C 1 1
0D 0 0
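
For anyone who wants to spot this mismatch in a captured console log, here is a minimal Python sketch (not an SGI tool). It assumes the "leds" and "cpu" output looks exactly like the text pasted above; the function names are made up for illustration.

```python
import re

# Console output copied from the post above.
LEDS_OUTPUT = """\
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >
"""

CPU_OUTPUT = """\
CPU Present Enabled
--- ------- -------
0A 1 1
0B 0 0
0C 1 1
0D 0 0
"""


def disabled_in_leds(text):
    """Return the CPU letters whose LED code is FLED_DISABLED."""
    pattern = r"^CPU ([A-D]): 0x[0-9a-f]+: FLED_DISABLED"
    return [m.group(1) for m in re.finditer(pattern, text, re.M)]


def enabled_in_l1(text):
    """Return the CPU letters that the L1 'cpu' table marks present and enabled."""
    letters = []
    for line in text.splitlines():
        m = re.match(r"\d+([A-D])\s+(\d)\s+(\d)$", line.strip())
        if m and m.group(2) == "1" and m.group(3) == "1":
            letters.append(m.group(1))
    return letters


# CPUs that the L1 reports as enabled but the PROM reports as disabled.
conflict = sorted(set(disabled_in_leds(LEDS_OUTPUT)) & set(enabled_in_l1(CPU_OUTPUT)))
print(conflict)  # ['A', 'C']
```

A non-empty result means the L1 "cpu" table and the "leds" status disagree, which is the situation described in this post.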

I also reflashed the L1 firmware from the software L2 emulator, but that did not re-enable the CPUs.

Here are my logs and related information.

001c01-L1>pwr u
ERROR: no response from 001c01

(wait 10 sec.)
001c01-L1>leds
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >

001c01-L1>pwr d
001c01-L1>log
04/12/13 08:58:09 ChiWS IP59
04/12/13 08:58:11 USB0: waiting on open
04/15/13 09:00:47 L1 booting 1.40.6
04/15/13 09:00:47 ChiWS IP59
04/15/13 09:00:49 USB0: waiting on open
04/15/13 09:02:24 USB0: opened
04/15/13 09:02:24 USB0: registered as remote
04/15/13 09:02:24 USB0: registered for events
04/15/13 09:07:08 power up (COMMAND)
04/15/13 09:07:13 Node 0 XTalk clock 88
04/15/13 09:07:14 reset again MIPS
04/15/13 09:07:19 Node 0 XTalk clock 88
04/15/13 09:07:46 power down (COMMAND)
04/15/13 09:12:32 USB-R: USB:connection lost
04/15/13 09:12:32 UNREG: 300054c0 0 7
04/15/13 09:12:32 USB0: unregistered
04/15/13 09:12:33 USB0-R: IRouter:read failed - read error
04/15/13 09:12:33 USB0: waiting on open
04/15/13 22:11:19 USB0: opened
04/15/13 22:11:19 USB0: registered as remote
04/15/13 22:11:19 USB0: registered for events
04/15/13 22:27:13 USB-R: USB:connection lost
04/15/13 22:27:13 UNREG: 300054c0 0 7
04/15/13 22:27:13 USB0: unregistered
04/15/13 22:27:14 USB0-R: IRouter:read failed - read error
04/15/13 22:27:14 USB0: waiting on open
04/15/13 22:45:29 USB0: opened
04/15/13 22:45:29 USB0: registered as remote
04/15/13 22:45:29 USB0: registered for events
04/15/13 23:33:30 2MB flash part!!
04/15/13 23:43:19 L1 booting 1.48.1
04/15/13 23:43:19 ChiWS IP59
04/15/13 23:43:21 USB0: waiting on open
04/15/13 23:43:21 USB0: opened
04/15/13 23:43:21 USB0: registered as remote
04/15/13 23:43:21 USB0: registered for events
04/15/13 23:50:41 power up (COMMAND)
04/15/13 23:50:45 Node 0 XTalk clock 88
04/15/13 23:50:47 reset again MIPS
04/15/13 23:50:51 Node 0 XTalk clock 88
04/15/13 23:53:00 power down (COMMAND)
04/15/13 23:53:06 L1 booting 1.48.1
04/15/13 23:53:06 ChiWS IP59
04/15/13 23:53:08 USB0: waiting on open
04/15/13 23:53:08 USB0: opened
04/15/13 23:53:08 USB0: registered as remote
04/15/13 23:53:08 USB0: registered for events
04/16/13 00:01:42 power up (COMMAND)
04/16/13 00:01:46 Node 0 XTalk clock 88
04/16/13 00:01:48 reset again MIPS
04/16/13 00:01:52 Node 0 XTalk clock 88
04/16/13 00:02:12 power down (COMMAND)
001c01-L1>serial all

Data                           Location     Value
------------------------------ ------------ --------
Local System Serial Number     NVRAM        P1003842
Reference System Serial Number NVRAM        P1003842
Local Brick Serial Number      EEPROM       NWC610
Reference Brick Serial Number  NVRAM        NWC610


EEPROM     Product Name   Serial        Part Number          Rev T/W
---------- -------------- ------------- -------------------- --- ------
INTERFACE  WS_INT_53      NWC610        030_1881_007         B   00
IO9        IO9            NWY514        030_1771_006         A   00
ODYSSEY    ODY128B1_2     NWG705        030_1884_005         B   00
SNOWBALL   no hardware detected
NODE       IP59_2CPU      NSD938        030_2059_002         C   00
IO DGHTR   CHWS_IO_DAUG   NVG234        030_1875_003         A   00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     no hardware detected
DIMM 2     no hardware detected
DIMM 4     7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 6     7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 1     no hardware detected
DIMM 3     no hardware detected
DIMM 5     7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2   00FF 8.0    N/A
DIMM 7     7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3   00FF 8.0    N/A

001c01-L1>cpu
CPU Present Enabled
--- ------- -------
0A 1 1
0B 0 0
0C 1 1
0D 0 0

001c01-L1>flash status
Flash image A currently booted

Image Status        Revision   Built
----- ------------- ---------- -----
A     user default  1.48.1     01/22/2007 11:34:34 (<- reflashed image)
B     valid         1.40.6     01/06/2006 13:16:50
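
The L1 log above shows the same pattern on every attempt: a "power up (COMMAND)", then "reset again MIPS" a few seconds later, then a manual power down. Below is a small Python sketch for pulling that pattern out of a saved log. It assumes the timestamps are MM/DD/YY HH:MM:SS (a guess based on the dates shown), and the helper names are hypothetical.

```python
from datetime import datetime

# Two power cycles copied from the L1 log above.
LOG_EXCERPT = """\
04/15/13 09:07:08 power up (COMMAND)
04/15/13 09:07:13 Node 0 XTalk clock 88
04/15/13 09:07:14 reset again MIPS
04/15/13 09:07:19 Node 0 XTalk clock 88
04/15/13 09:07:46 power down (COMMAND)
04/15/13 23:50:41 power up (COMMAND)
04/15/13 23:50:45 Node 0 XTalk clock 88
04/15/13 23:50:47 reset again MIPS
04/15/13 23:50:51 Node 0 XTalk clock 88
04/15/13 23:53:00 power down (COMMAND)
"""


def parse_log(text):
    """Split each line into (timestamp, message). Assumes MM/DD/YY HH:MM:SS."""
    events = []
    for line in text.splitlines():
        date, time, message = line.split(" ", 2)
        stamp = datetime.strptime(date + " " + time, "%m/%d/%y %H:%M:%S")
        events.append((stamp, message))
    return events


def power_cycles(events):
    """Return (seconds_powered, saw_mips_reset) for each power up/down pair."""
    cycles = []
    started = None
    saw_reset = False
    for stamp, message in events:
        if message.startswith("power up"):
            started, saw_reset = stamp, False
        elif message == "reset again MIPS" and started is not None:
            saw_reset = True
        elif message.startswith("power down") and started is not None:
            cycles.append((int((stamp - started).total_seconds()), saw_reset))
            started = None
    return cycles


cycles = power_cycles(parse_log(LOG_EXCERPT))
print(cycles)  # [(38, True), (139, True)]
```

The sketch only reports what the log contains; it does not interpret what "reset again MIPS" means inside the L1 firmware.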

I think the cause of this error is that some strange data was written to the PROM or NVRAM because of the broken DIMMs.
Does anyone have any solution to re-enable the CPUs?
And sorry for my poor English; I'm not a native speaker.
hkurokawa wrote:
Does anyone have any solution to re-enable the CPUs?

You probably did this already but just in case : did you do an < enableall > from the prom ?

Out of curiosity, is this a deskside Tezro or a rackmount ?

Thank you for your reply.

Unfortunately, I have not been able to try "enableall" from the PROM console,
because the PROM cannot start due to the CPU-disabled problem.
I also tried entering the PROM console (console mode, <Ctrl + D>) from L1,
but the system reported "no response from 001c01" and no POD message from the PROM was displayed.

And my TEZRO is a deskside one.
hkurokawa wrote:
Unfortunately, I have not tried "enableall" from PROM console.
Because PROM could not started up due to CPU disabled problem.

Okay, this won't help you much but maybe give you some hope ...

I did something similar once with some bad memory. POST disabled the memory slots due to the bad memory so I switched them into the other memory slots, so the computer could disable all the memory. ( Fuel )

After that it didn't want to boot :oops:

I did manage to fix it via some hardware-switching and using the L1 so you should be able to do the same.

Sorry I can't be more exact : Oldtimer's is hell.

Quote:
And my TEZRO is a deskside one.

Cool. I was just wondering if it was the rackmount that was on yahoo auctions. I'm still kicking myself for not buying that :(

Hi!

hkurokawa wrote:
EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     no hardware detected
DIMM 2     no hardware detected
DIMM 4     7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 6     7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 1     no hardware detected
DIMM 3     no hardware detected
DIMM 5     7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2   00FF 8.0    N/A
DIMM 7     7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3   00FF 8.0    N/A


Try moving the DIMMs from slots 4,6,5,7 into slots 0,2,1,3.

Then if you can get into the PROM, type "enableall", followed by "update", and finally "reset".
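
To confirm which slots are populated before and after a move like this, the DIMM section of "serial all" can be checked with a short Python sketch. It assumes the table layout quoted above; the function name is invented for this example and is not part of any SGI tool.

```python
import re

# DIMM section of the 'serial all' output quoted above.
DIMM_TABLE = """\
DIMM 0 no hardware detected
DIMM 2 no hardware detected
DIMM 4 7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 6 7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3 00FF 8.0 N/A
DIMM 1 no hardware detected
DIMM 3 no hardware detected
DIMM 5 7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2 00FF 8.0 N/A
DIMM 7 7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3 00FF 8.0 N/A
"""


def populated_slots(table):
    """Map slot number -> part number for every slot that reports a DIMM."""
    slots = {}
    for line in table.splitlines():
        m = re.match(r"DIMM\s+(\d+)\s+(.*)$", line.strip())
        if not m:
            continue
        slot, rest = int(m.group(1)), m.group(2)
        if rest.startswith("no hardware detected"):
            continue
        # Fields after the slot number: SPD info, part number, rev, speed, SGI.
        slots[slot] = rest.split()[1]
    return slots


slots = populated_slots(DIMM_TABLE)
print(sorted(slots))  # [4, 5, 6, 7]
```

After the move suggested above, the same check on a fresh "serial all" should list slots 0 to 3 instead.
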
Thank you for your kind advice.

I moved the DIMMs and powered up, but the PROM still does not start.

001c01-L1>serial all

Data                           Location     Value
------------------------------ ------------ --------
Local System Serial Number     NVRAM        P1003842
Reference System Serial Number NVRAM        P1003842
Local Brick Serial Number      EEPROM       NWC610
Reference Brick Serial Number  NVRAM        NWC610


EEPROM     Product Name   Serial        Part Number          Rev T/W
---------- -------------- ------------- -------------------- --- ------
INTERFACE  WS_INT_53      NWC610        030_1881_007         B   00
IO9        IO9            NWY514        030_1771_006         A   00
ODYSSEY    ODY128B1_2     NWG705        030_1884_005         B   00
SNOWBALL   no hardware detected
NODE       IP59_2CPU      NSD938        030_2059_002         C   00
IO DGHTR   CHWS_IO_DAUG   NVG234        030_1875_003         A   00

EEPROM     JEDEC-SPD Info           Part Number        Rev  Speed  SGI
---------- ------------------------ ------------------ ---- ------ --------
DIMM 0     7F94FFFFFFFFFFFF937DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 2     7F94FFFFFFFFFFFF2B99700D SM57264DSGI100C2   00FF 8.0    N/A
DIMM 4     no hardware detected
DIMM 6     no hardware detected
DIMM 1     7F94FFFFFFFFFFFFD37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 3     7F94FFFFFFFFFFFFB37DE80D SM57264DSGI100C3   00FF 8.0    N/A
DIMM 5     no hardware detected
DIMM 7     no hardware detected

001c01-L1>pwr u
001c01-L1>leds (issued soon after "pwr u")
CPU A: 0x0c: PLED_INVSCACHE: Invalidate secondary cache.
0x2a: PLED_MAKESTACK
0x0d: PLED_INMAIN: Successfully jumped to main().
0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU B: < CPU not present >
CPU C: 0x0c: PLED_INVSCACHE: Invalidate secondary cache.
0x2a: PLED_MAKESTACK
0x0d: PLED_INMAIN: Successfully jumped to main().
0x38: PLED_BARRIEROK
CPU D: < CPU not present >

001c01-L1>leds (issued after waiting about 10 sec.)
CPU A: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU B: < CPU not present >
CPU C: 0x9a: FLED_DISABLED: CPU is disabled by env variable.
CPU D: < CPU not present >

I also tried with all DIMMs removed, but the result was the same: the PROM could not start running.
(If the PROM started normally, its POD should report a "no DIMMs" error in such a case, I think.)
Did you read this thread: viewtopic.php?f=3&t=7894 ?

Do you get any kind of prompt when you power up the system ('pwr u') after it prints the message about disabled CPUs?

You may have to set the virtual dip switches to boot the system into POD or CAC mode instead of booting the regular PROM. Then, from there, run the 'clearalllogs / initalllogs / flush / reset' sequence mentioned in the link above (and other places on this board).

Sorry, I don't have a link handy to the virtual debug switch documentation. If nobody can help you with those I can dig a little deeper.

jan-jaap wrote:
You may have to set the virtual dip switches to boot the system into POD or CAC mode instead of booting the regular PROM.

That's a good idea.

Type "debug 0x10d" in the L1 console to do that. This will disable power-on diagnostics, enable verbose output, enable booting directly into POD mode, and most importantly: ignore disabled CPUs and memory.

Once in POD mode, type "go cac" and the commands jan-jaap suggested. Before you enter "reset" in POD mode, type "dbg 0" to disable the debug switches (you can also escape to the L1 prompt and type "debug 0" for the same effect).

After the system restarts, you might still need to do "enableall", "update", and "reset" within the PROM.
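
As a side note on the value 0x10d: it has four bits set, which matches the four effects listed above. The thread does not say which bit produces which effect, so this small Python sketch only breaks the number into its bit masks and makes no claim about their meaning. The helper name is made up.

```python
def set_bits(value):
    """Return the individual bit masks that are set in value, lowest first."""
    return [1 << i for i in range(value.bit_length()) if value & (1 << i)]


DEBUG_SWITCHES = 0x10D  # value passed to the L1 "debug" command in this thread

masks = set_bits(DEBUG_SWITCHES)
print([hex(m) for m in masks])  # ['0x1', '0x4', '0x8', '0x100']
```

Setting the switches back to 0, as described above, clears all four bits at once.
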
Thank you for your valuable advice.
My TEZRO's CPUs now seem to be re-enabled.

I had read the thread you suggested,
but I did not think that procedure applied to my system
(because the TEZRO is based on the O3000 architecture, whereas the thread is about the O2000).

I tried the commands that you suggested.

Here is my procedure.

001c01-L1>debug 0x10d (<-Setting DEBUG mode)
debug switches set to 0x010d
001c01-L1>pwr u
(wait about 15sec)
001c01-L1>leds
CPU A: 0xbc: POD Mode (0x80/0xBC=okay, solid 0xBC=possibly hung polling UART) (<- seems to have bypassed the CPU env check!)
0x80: POD Mode (0x80/0xBC=okay, solid 0x80=possibly hung polling UART)
CPU B: < CPU not present >
CPU C: 0x45: PLED_LAUNCHLOOP: Slave loop (0x00/0x45=okay, solid 0x45=possibly hung)
0x00: PLED_RESET: Slave loop (0x00/0x45=okay, solid 0x00=possibly hung)
CPU D: < CPU not present >

001c01-L1> (<-Ctrl + D)
entering console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> go cac
A 000 001c01: Must be in Dex mode before switching to Cac or Unc.
A 000 001c01: POD SysCt Cac> clearalllogs
A 000 001c01: *** This must be run only after NUMAlink discovery is complete.
A 000 001c01: *** This will clear all previous log variables such as:
A 000 001c01: *** moduleids, nodeids, etc. for all nodes.
A 000 001c01: Clear all logs? [n] y
A 000 001c01: Checking 1 entries for promlogs
A 000 001c01: .DONE
A 000 001c01: All PROM logs cleared!
A 000 001c01: POD SysCt Cac> initalllogs
A 000 001c01: *** This must be run only after NUMAlink discovery is complete.
A 000 001c01: *** This will clear all previous log variables such as:
A 000 001c01: *** moduleids, nodeids, etc. for all nodes.
A 000 001c01: Clear all logs environment variables, and aliases ? [n] y
A 000 001c01: Checking 1 entries for promlogs
A 000 001c01: .DONE
A 000 001c01: All PROM logs cleared!
A 000 001c01: POD SysCt Cac> flush
A 000 001c01: POD SysCt Cac>
escaping to L1 system controller
001c01-L1>debug 0 (<- Clear DEBUG flag)
debug switches set to 0x0000

returning to console mode 001c01 CPU0, <CTRL_T> to escape to L1

A 000 001c01: POD SysCt Cac> reset
A 000 001c01: Resetting the system...
Starting PROM Boot process (<- Yes!, now revived!!)


IP35 PROM SGI Version 6.210 built 02:33:51 PM Aug 26, 2004
Testing/Initializing memory ............... DONE
Copying PROM code to memory ............... DONE
Discovering local IO ...................... DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 5898 usec
Waiting for peers to complete discovery.... DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
Intializing any CPUless nodes.............. DONE
Checking partitioning information ......... DONE
No other nodes present; becoming partition master
Local slave entering slave loop
Loading BASEIO prom ....................... DONE

BASEIO PROM Monitor SGI Version 6.210 built 02:30:38 PM Aug 26, 2004 (BE64)
2 CPUs on 1 nodes found.

NVRAM checksum is incorrect: reinitializing.
Automatic update of PROM environment disabled
Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
Board version 1 - Buzz revision 3B
On board sdram size: 128 Mb
Cas latency: CAS 3
4 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............
On-board (IO9) tigon3 1000BaseT interface
Base I/O Ethernet set to /dev/ethernet/tg0
Installing Graphics Console...
graphics install: searching for pipe 0
Probing IOC4 ATA adapter 2
IOC4 RevId = 83
Detected Vendor id/Product MATSHITA DVD-ROM SR-8589

Walking SCSI Adapter 0, (pci id 3)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 0 device(s)


Walking SCSI Adapter 1, (pci id 3)
1- 2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13timeout on adapter 1 target d
tm0=0xffffabaee079d65b, tm1=0xffffabaee079d634, timeout=0xb
- 14Set Device queue Parameters failed
- 15- = 0 device(s)

Initializing PROM Device drivers ..........
Initializing Base I/O Ethernet Interface...Failed. MII Status Register = 0x7949
Done.
---------------Interface Configuration Summary----------------
ASIC|Revision|MAC Address : 5701|B5|08:00:69:11:e7:6f
Link Negotiation|Advertisement : On|<H10 F10 H100 F100 F1000>
Link|Speed|Duplex|Rx/Tx FlowCtrl: Down|10|Half|Off/Off
--------------------------------------------------------------
DONE
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc4pckm0 for input
Cannot connect to keyboard -- check the cable.
Cannot open /dev/input/ioc4pckm0 for input
Checking hardware inventory ...............
WARNING: hardware inventory is invalid. Reinitializing...
Writing 5 records..... DONE
Updated new configuration. Wrote 5 records.

**** System Configuration and Diagnostics Summary ****
CONFIG:
No. of NODEs enabled = 1
No. of NODEs disabled = 0
No. of CPUs enabled = 2
No. of CPUs disabled = 0
Mem enabled = 2048 MB
Mem disabled = 0 MB
No. of RTRs enabled = 0
No. of RTRs disabled = 0

DIAG RESULTS:
ALL DIAGS PASSED.
**** End System Configuration and Diagnostics Summary ****


System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option?
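
To double-check a boot like the one above without reading the whole transcript, the CONFIG part of the summary can be parsed with a few lines of Python. This is only a sketch; it assumes the "name = value" layout printed by the PROM above, and the function name is hypothetical.

```python
import re

# CONFIG section copied from the PROM summary above.
SUMMARY = """\
No. of NODEs enabled = 1
No. of NODEs disabled = 0
No. of CPUs enabled = 2
No. of CPUs disabled = 0
Mem enabled = 2048 MB
Mem disabled = 0 MB
No. of RTRs enabled = 0
No. of RTRs disabled = 0
"""


def parse_summary(text):
    """Return {name: integer value} for each 'name = value' line."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*=\s*(\d+)", line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values


summary = parse_summary(SUMMARY)
# Anything the PROM still reports as disabled.
still_disabled = {k: v for k, v in summary.items() if k.endswith("disabled") and v != 0}
print(still_disabled)  # {}
```

An empty result matches the transcript: 2 CPUs and 2048 MB enabled, nothing disabled.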


Thank you again for your kind help and for taking the time to work through my problem.
Excellent, have fun!
