SGI: Hardware

o300 Boot Problems

Hi,

My o300 server (with two bricks) keeps shutting down. I am not an expert on SGI hardware or software; can anyone give me a hint about where I could get some help?

To start with, I get this message when the machine is powering up:

Code:

001c01-L1>* power up

returning to console mode  001c01 console, <CTRL_T> to escape to L1
INFO: console subchannel changed:  001c01 CPU0
*** WARNING : Please upgrade L1version to at least 1.12.5
INFO: console subchannel changed:  001c03 CPU0
*** WARNING : Please upgrade L1version to at least 1.12.5
Starting PROM Boot process
INFO: console subchannel changed:  001c01 CPU0
Starting PROM Boot process


Does anyone have a clue about that warning about the L1 version? Should I look into upgrading it or not?

When I look at the power status, I see a higher voltage on the Speedo2 CPU supply on one brick (the one that powers off 2-3 times a week).

Code:

001c01-L1>* power
001c01:
Supply          State Voltage    Margin  Value
--------------  ----- ---------  ------- -----
           12V     on       N/A      N/A
        12V IO     NC   12.000V      N/A
       12V DIG     NC   12.125V      N/A
            5V     NC    4.966V      N/A
          3.3V     NC    3.320V      N/A
        5V aux     NC    4.992V      N/A
      3.3V aux     NC    3.406V      N/A
          2.5V     on    2.483V   normal     3
   Speedo2 CPU     on    1.706V   normal    19
          1.5V     on    1.495V   normal     5
001c03:
Supply          State Voltage    Margin  Value
--------------  ----- ---------  ------- -----
           12V     on       N/A      N/A
        12V IO     NC   12.250V      N/A
       12V DIG     NC   12.125V      N/A
            5V     NC    4.992V      N/A
          3.3V     NC    3.337V      N/A
        5V aux     NC    4.966V      N/A
      3.3V aux     NC    3.406V      N/A
          2.5V     on    2.483V   normal     3
   Speedo2 CPU     on    1.480V   normal    19
          1.5V     on    1.480V   normal     5

returning to console mode  001c01 console, <CTRL_T> to escape to L1


Sometimes that voltage is much higher, and I have to leave the bricks off for a few hours before I try to power them up again.

Any hints will be appreciated.

Thank you,

Ionel (reads as "EE-O-NEL")
In the meantime, the time it stays up has dropped to a few minutes (sometimes not enough to finish the boot process). Can someone point me in the right direction, please? Do I simply have to buy a new transformer, or what is the issue with this "Speedo2 CPU"? The machine starts to send warnings when this voltage gets close to 2 V (above 1.8-1.9 V), and it cuts the power to that brick (not to the rest of the bricks).

Thank you for any hint.

Regards,
Ionel
OK, best to start with basics.
1. Did it ever work for you or have you just bought it?
2. Best to start one module at a time, so power the system down, remove the NUMAlink cable, and then start the boot process. Capture the full console output and we'll take it from there. Also show me the full env output from the L1 console before and after power-on.
There are no transformers to replace in an O300, just SMPSs and VRMs (OK, they have transformers in them, but...).

Also, I'd suggest that a moderator create a new thread, "O300 boot problems", as it reflects the topic more accurately.
You could also check whether it's the 1.5V VRM that is the problem by moving it from module 1 to module 2. It's the smaller VRM located in the white socket on the board, P/N 060-0127-00x. If you swap the VRMs between the modules and the problem moves to the other module, that would be a good starting point.

Regarding the L1 controller message: it's because you have updated IRIX on the system from the version originally shipped with it. It's not a big deal; there should be some extra features in later versions, but what they are I can't tell. There is a command to flash the L1 from within IRIX, but this is done at your own risk, and the normal rules apply, such as not powering off:

Code:

cd /usr/cpu/firmware/sysco

/usr/sbin/flashsc --sc ./l1.bin all


That would update all L1s in the system. For individual L1s, you can use r.s instead, where r = rack and s = slot, e.g.:

Code:

cd /usr/cpu/firmware/sysco

/usr/sbin/flashsc --sc ./l1.bin 1.1

and

Code:

/usr/sbin/flashsc --sc ./l1.bin 1.3



I have had one problem using this method: going from a very early L1 version (1.10.2) to a very new one (1.54.0), I got an "UNKNOWN BRICK TYPE" message on the L1, and the brick could not be reset to brick type C. Unfortunately, that problem is still outstanding... :(

Also, if the frequent shutdowns are causing you real pain, you could probably turn environmental monitoring off on the L1 (env off, I think). But this is not a good long-term solution, as you obviously have a problem with the system that needs fixing.

Good luck with your troubleshooting anyhow.
maxsleg wrote: OK, best to start with basics.
1. Did it ever work for you or have you just bought it?
2. Best to start one module at a time, so power the system down, remove the NUMAlink cable, and then start the boot process. Capture the full console output and we'll take it from there. Also show me the full env output from the L1 console before and after power-on.
There are no transformers to replace in an O300, just SMPSs and VRMs (OK, they have transformers in them, but...).

Also, I'd suggest that a moderator create a new thread, "O300 boot problems", as it reflects the topic more accurately.


Thank you for your message. The machine did work before without problems (3+ years). The second brick (the one without problems) was replaced by an SGI engineer (I had two CPUs there and I now have four). The problems started a year or more after this upgrade.

The machine is located far away, but I have access to the console via a serial cable. I can try to switch on just the brick with problems.

OK. env BEFORE the power on:

Code:

001c01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
        12V IO   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.00
       12V DIG   Wait Pwr  10%  10.80/ 13.20  20%   9.60/ 14.40    0.00
            5V   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    0.00
          3.3V   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    0.05
        5V aux   Wait Pwr  10%   4.50/  5.50  20%   4.00/  6.00    5.02
      3.3V aux   Wait Pwr  10%   2.97/  3.63  20%   2.64/  3.96    3.42
          2.5V   Wait Pwr  10%   2.25/  2.75  20%   2.00/  3.00    0.00
   Speedo2 CPU   Wait Pwr  10%   1.35/  1.65  20%   1.20/  1.80    0.00
          1.5V   Wait Pwr  10%   1.35/  1.65  20%   1.20/  1.80    0.00

Description    State       Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0     Left   Wait Pwr         2160            0
FAN 1   Center   Wait Pwr         2160            0
FAN 2    Right   Wait Pwr         2160            0
FAN 3       PS   Wait Pwr         2160            0

                           Advisory  Critical  Fault     Current
Description    State       Temp      Temp      Temp      Temp
-------------- ----------  --------  --------  --------  ---------
NODE    TEMP 0   Wait Pwr  30C/ 86F  35C/ 95F  40C/104F  24c/ 75F
NODE    TEMP 1   Wait Pwr  30C/ 86F  35C/ 95F  40C/104F  24c/ 75F
NODE    TEMP 2   Wait Pwr  30C/ 86F  35C/ 95F  40C/104F  24c/ 75F

WARNING: power appears off, console unavailable


And now env AFTER I boot just the node with problems (the other node is still connected via the NUMAlink cable, but it is powered off):

Code:

001c01-L1>env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.88
       12V DIG    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.19
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    4.97
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.34
        5V aux    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    4.99
      3.3V aux    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.41
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.48
   Speedo2 CPU      Fault  10%   1.35/  1.65  20%   1.20/  1.80    1.83
          1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49

Description    State       Warning RPM  Current RPM
-------------- ----------  -----------  -----------
FAN 0     Left    Enabled         2160         3659
FAN 1   Center    Enabled         2160         3619
FAN 2    Right    Enabled         2160         3619
FAN 3       PS    Enabled         2160         3088

                           Advisory  Critical  Fault     Current
Description    State       Temp      Temp      Temp      Temp
-------------- ----------  --------  --------  --------  ---------
NODE    TEMP 0    Enabled  30C/ 86F  35C/ 95F  40C/104F  25c/ 77F
NODE    TEMP 1    Enabled  30C/ 86F  35C/ 95F  40C/104F  24c/ 75F
NODE    TEMP 2    Enabled  30C/ 86F  35C/ 95F  40C/104F  24c/ 75F
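
For anyone who wants to check a captured env dump like the one above without reading every row by eye, here is a minimal Python sketch. It assumes the voltage rows have exactly the layout shown in this thread (name, state, warning percentage with low/high limits, fault percentage with low/high limits, current reading); other L1 firmware versions may print them differently. The names `parse_env_voltages`, `TAIL` and `SAMPLE` are made up for illustration and are not part of any SGI tool.

```python
import re

# Numeric tail of a voltage row as it appears in the env dump in this thread:
#   warn%  warn_lo/warn_hi  fault%  fault_lo/fault_hi  current
# This layout is inferred from the thread only (an assumption).
TAIL = re.compile(
    r"(\d+)%\s+([\d.]+)/\s*([\d.]+)\s+"   # warning band
    r"(\d+)%\s+([\d.]+)/\s*([\d.]+)\s+"   # fault band
    r"([\d.]+)\s*$"                        # current reading
)


def parse_env_voltages(text):
    """Return one dict per voltage row; fan and temperature rows are skipped."""
    rows = []
    for line in text.splitlines():
        m = TAIL.search(line)
        if m is None:
            continue  # not a voltage row (no percentage bands)
        # Name and state are separated by at least two spaces in the dump.
        parts = re.split(r"\s{2,}", line[:m.start()].strip(), maxsplit=1)
        if len(parts) != 2:
            continue
        warn_lo, warn_hi = float(m.group(2)), float(m.group(3))
        fault_lo, fault_hi = float(m.group(5)), float(m.group(6))
        current = float(m.group(7))
        if not fault_lo <= current <= fault_hi:
            status = "fault"
        elif not warn_lo <= current <= warn_hi:
            status = "warning"
        else:
            status = "ok"
        rows.append({"name": parts[0], "state": parts[1],
                     "current": current, "status": status})
    return rows


# Three voltage rows and one fan row copied from the dump above.
SAMPLE = """\
2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.48
Speedo2 CPU      Fault  10%   1.35/  1.65  20%   1.20/  1.80    1.83
1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.49
FAN 0     Left    Enabled         2160         3659
"""

if __name__ == "__main__":
    for row in parse_env_voltages(SAMPLE):
        print(f'{row["name"]:<12} {row["current"]:5.2f} V  {row["status"]}')
```

On the sample rows, only Speedo2 CPU lands outside its fault band (1.83 against a 1.80 upper limit), which matches the Fault state the L1 itself reports.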


I can't keep it on anymore. I get these messages every 5 seconds:

Code:

001c01 ATTN: brick auto power down in 30 seconds

001c01 ATTN: power down aborted, environmental monitor reset

001c01 ATTN: Speedo2 CPU high fault limit reached @  1.833V.

001c01 ATTN: brick auto power down in 30 seconds

001c01 ATTN: brick auto power down in 25 seconds

001c01 ATTN: brick auto power down in 20 seconds

001c01 ATTN: brick auto power down in 15 seconds

001c01 ATTN: brick auto power down in 10 seconds

001c01 ATTN: brick auto power down in 5 seconds

001c01 ATTN: brick is powering down now!


So the machine doesn't stay on anymore. Sometimes I get lucky: the Speedo2 CPU voltage is lower than 1.8 V (about 1.77 V) and the machine can boot, but it goes down again after a short time...
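
The limits in the env output are consistent with a 1.5 V nominal rail and bands of 10% (warning) and 20% (fault). That would explain why a reading around 1.77 V still lets the brick boot while 1.833 V triggers the auto power-down. A small Python sketch of that arithmetic follows; the 1.5 V nominal value is an assumption inferred from the 1.35/1.65 and 1.20/1.80 limits, and `limits` and `classify` are hypothetical helper names, not L1 commands.

```python
def limits(nominal, pct):
    """Return (low, high) at +/- pct percent of nominal, rounded to 2 decimals."""
    delta = nominal * pct / 100
    return round(nominal - delta, 2), round(nominal + delta, 2)


def classify(reading, nominal=1.5, warn_pct=10, fault_pct=20):
    """Classify a voltage reading against the warning and fault bands."""
    warn_lo, warn_hi = limits(nominal, warn_pct)
    fault_lo, fault_hi = limits(nominal, fault_pct)
    if reading < fault_lo or reading > fault_hi:
        return "fault"
    if reading < warn_lo or reading > warn_hi:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Speedo2 CPU readings quoted in this thread.
    for volts in (1.480, 1.706, 1.77, 1.833):
        print(f"{volts:.3f} V -> {classify(volts)}")
```

By this reckoning the 1.706 V seen in the first power listing was already above the 1.65 V warning limit, well before the brick started faulting at power-up.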

Thank you for any hint.

Best regards,
Ionel
tjsgifan wrote: You could also check whether it's the 1.5V VRM that is the problem by moving it from module 1 to module 2. It's the smaller VRM located in the white socket on the board, P/N 060-0127-00x. If you swap the VRMs between the modules and the problem moves to the other module, that would be a good starting point.


I will open it and I will have a look. Thank you for suggesting it.

tjsgifan wrote: Regarding the L1 controller message: it's because you have updated IRIX on the system from the version originally shipped with it. It's not a big deal; there should be some extra features in later versions, but what they are I can't tell.


I did not. What happened is that I turned my 6-CPU o300 system into an 8-CPU o300 system. An engineer from SGI came and swapped the 2-CPU motherboard for a 4-CPU motherboard. I can't tell what exactly he did.

tjsgifan wrote: There is a command to flash the L1 from within IRIX, but this is done at your own risk, and the normal rules apply, such as not powering off:

Code:

cd /usr/cpu/firmware/sysco

/usr/sbin/flashsc --sc ./l1.bin all


That would update all L1s in the system. For individual L1s, you can use r.s instead, where r = rack and s = slot, e.g.:

Code:

cd /usr/cpu/firmware/sysco

/usr/sbin/flashsc --sc ./l1.bin 1.1

and

Code:

/usr/sbin/flashsc --sc ./l1.bin 1.3


Do I have to do this when the machine is up and running, or when it is powered off?

tjsgifan wrote: Also, if the frequent shutdowns are causing you real pain, you could probably turn environmental monitoring off on the L1 (env off, I think). But this is not a good long-term solution, as you obviously have a problem with the system that needs fixing.


There is a problem with that voltage; it is good that the machine switches off, otherwise something else might get broken...

tjsgifan wrote: Good luck with your troubleshooting anyhow.


Thank you for your suggestions.