SGI: Hardware

Tezro V12 running too warm or is it normal?

Good day,

I finally had a chance to spend a bit more time with my two Tezro machines and I noticed that the Odyssey V12 video boards seem to run a bit warm on both of them. Once the machines have been running for a while, I frequently get messages like the following on the console and in the logs:

Code:
Dec 23 20:41:41 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 50 C/ 122 F  Fan: 87
Dec 23 20:42:41 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized
Dec 23 21:12:03 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 50 C/ 122 F  Fan: 87
Dec 23 21:13:03 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized


These occur with virtually no load on the video hardware, so on the surface it would seem to me that the V12 boards are running somewhat warmer than intended. Interestingly, both machines exhibit exactly the same behaviour, so I doubt that it is a hardware fault. For reference, both V12 boards have the DCD option and both machines are equipped with DM3 cards. Removing the DM3 card appears to mostly correct this, but I have still seen an occasional message:

Code:
Dec 26 16:05:17 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 51 C/ 123 F  Fan: 80
Dec 26 16:08:17 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized
Dec 26 23:11:55 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 50 C/ 122 F  Fan: 80
Dec 26 23:12:15 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized
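
For anyone who wants to track how long each advisory lasts, here is a minimal Python sketch that pairs every "advisory limit reached" line with the next "Cooling system stabilized" line. The regular expressions are inferred only from the sample lines above, the function names are made up for illustration, and the year is an assumption because the log timestamps carry none. The unit of the "Fan:" value is not stated in the log, so it is kept as a bare number.

```python
import re
from datetime import datetime

# Sample lines copied from the console log above.
LOG = """\
Dec 26 16:05:17 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 51 C/ 123 F  Fan: 80
Dec 26 16:08:17 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized
Dec 26 23:11:55 4A:tezro unix: |$(0x15f)WARNING: 001c01 ATTN: ODY zone advisory limit reached 50 C/ 122 F  Fan: 80
Dec 26 23:12:15 4A:tezro unix: |$(0x162)WARNING: 001c01 ATTN: Cooling system stabilized
"""

STAMP = r"(\w{3} +\d+ \d\d:\d\d:\d\d)"
ADVISORY = re.compile(
    STAMP + r" .*ATTN: (\w+) zone advisory limit reached (\d+) C/ *(\d+) F +Fan: (\d+)"
)
STABILIZED = re.compile(STAMP + r" .*ATTN: Cooling system stabilized")

ASSUMED_YEAR = 2000  # assumption: the log lines have no year, so one is supplied


def parse_stamp(stamp):
    """Turn 'Dec 26 16:05:17' into a datetime in the assumed year."""
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S")


def advisory_episodes(text):
    """Return one dict per advisory that was followed by a 'stabilized' line."""
    episodes = []
    pending = None
    for line in text.splitlines():
        m = ADVISORY.match(line)
        if m:
            pending = {
                "start": parse_stamp(m.group(1)),
                "zone": m.group(2),
                "temp_c": int(m.group(3)),
                "fan": int(m.group(5)),  # unit not stated in the log
            }
            continue
        m = STABILIZED.match(line)
        if m and pending is not None:
            elapsed = parse_stamp(m.group(1)) - pending["start"]
            pending["seconds"] = int(elapsed.total_seconds())
            episodes.append(pending)
            pending = None
    return episodes


for ep in advisory_episodes(LOG):
    print(f"{ep['start']:%b %d %H:%M:%S}  {ep['zone']}  {ep['temp_c']} C  "
          f"fan {ep['fan']}  stabilized after {ep['seconds']} s")
```

Run against a full SYSLOG, this would show whether the advisories cluster at particular times of day or simply recur at a steady rate.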


With the DM3 board removed, it seems that the temperatures are just marginally below the advisory level:

Code:
tezro 1# l1cmd env
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
          1.8V    Enabled  10%   1.62/  1.98  20%   1.44/  2.16    1.875
           12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.000
        12V #2    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.125
          3.3V    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.474
          2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.613
        12V IO    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.063
        5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.070
      3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.268
            5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.070
  XIO 12V BIAS    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.063
        XIO 5V    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.070
      XIO 2.5V    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.574
  XIO 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.285
 IP53 3.3V AUX    Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.302
   IP53 5V AUX    Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.044
      IP53 12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   11.875
     IP53 VCPU    Enabled  10%   1.13/  1.38  20%   1.00/  1.50    1.283
     IP53 SRAM    Enabled  10%   2.25/  2.75  20%   2.00/  3.00    2.574
     IP53 1.5V    Enabled  10%   1.35/  1.65  20%   1.20/  1.80    1.551

Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0   NODE 1    Enabled         1800         2109
FAN  1   NODE 2    Enabled         1800         2136
FAN  2   NODE 3    Enabled         1800         2149
FAN  3    PCI 1    Enabled         1350         1430
FAN  4    PCI 2    Enabled         1350         1493
FAN  5       HD    Enabled         1620         4218
FAN  6    ODY 1    Enabled         1300         2220
FAN  7    ODY 2    Enabled         1300         2083

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 INTERFACE 0       Enabled    [Autofan Control]    76C/168F   39C/102F
1 INTERFACE 1       Enabled    [Autofan Control]    76C/168F   35C/ 95F
2 INTERFACE 2       Enabled    [Autofan Control]    76C/168F   34C/ 93F
3 INTERFACE 3       Enabled    [Autofan Control]    76C/168F   41C/105F
4 ODYSSEY           Enabled    [Autofan Control]    76C/168F   50C/122F
5 NODE              Enabled    [Autofan Control]    76C/168F   55C/131F
6 BEDROCK           Enabled    [Autofan Control]    85C/185F   55C/131F

                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  55C/131F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  37C/ 98F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  50C/122F          6   78%/ 64%
HD          Enabled             5  40C/104F  55C/131F          5   80%/ 38%
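
To compare zone tables from different machines, a small Python sketch like the one below can pull out the target and current averages and report the headroom per zone. It assumes that the "Target Average" column is the temperature Autofan Control tries to hold, which the output itself does not spell out. The row pattern is inferred from this one table and the function name is hypothetical.

```python
import re

# Zone rows copied from the l1cmd env output above.
ZONES = """\
Node        Enabled           5,6  62C/143F  55C/131F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  37C/ 98F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  50C/122F          6   78%/ 64%
HD          Enabled             5  40C/104F  55C/131F          5   80%/ 38%
"""

# name, state, sensor list, target C, current C, fan index list, current %, minimum %
ROW = re.compile(
    r"^(\w+)\s+(\w+)\s+([\d,]+)\s+"
    r"(\d+)C/\s*\d+F\s+(\d+)C/\s*\d+F\s+"
    r"([\d,]+)\s+(\d+)%/\s*(\d+)%"
)


def parse_zones(text):
    """Map zone name to its temperatures, headroom and fan percentages."""
    zones = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, rule or blank line
        name, _state, _sensors, target, current, _fans, fan_now, fan_min = m.groups()
        zones[name] = {
            "target_c": int(target),
            "current_c": int(current),
            "headroom_c": int(target) - int(current),  # <= 0 means at or above target
            "fan_pct": int(fan_now),
            "fan_min_pct": int(fan_min),
        }
    return zones


for name, z in parse_zones(ZONES).items():
    flag = "at/above target" if z["headroom_c"] <= 0 else "ok"
    print(f"{name:5s} headroom {z['headroom_c']:+3d} C  "
          f"fan {z['fan_pct']}% (min {z['fan_min_pct']}%)  {flag}")
```

On the table above this flags ODY, which sits exactly at its 50C target, and also HD, which reads 55C against a 40C target. Both zones have their fans running above the minimum percentage.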


I looked through the forums here and came across a number of posts with sample environmental monitor output on Tezros. They all seem lower than what I see on both of my machines. Can anyone offer any thoughts on this? Should I worry?

Thank you.
Is this a rack mount tezro? I don't see the "BOOST" zone or Fans 8-10 listed.
canavan wrote:
Is this a rack mount tezro? I don't see the "BOOST" zone or Fans 8-10 listed.


Hi. Both units are "desktops", not rack mount...
I think the 'BOOST' zone refers to the three extra fans found only on the 1GHz node boards.

_________________
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
t-rexky wrote:
Code:
ODY         Enabled             4  50C/122F  50C/122F          6   78%/ 64%
HD          Enabled             5  40C/104F  55C/131F          5   80%/ 38%


I looked through the forums here and came across a number of posts with sample environmental monitor output on Tezros. They all seem lower than what I see on both of my machines.

I don't think this is normal. These are the temperatures of my Tezro running 24/7 at 20C room temperature:
Code:
                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  47C/116F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  32C/ 89F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  41C/105F          6   64%/ 64%
HD          Enabled             5  40C/104F  44C/111F          5   49%/ 38%
There is probably dust in the machines, or something is blocking the airflow.

_________________
:Tezro: :Fuel: :Octane2: :Octane: :Onyx2: :O2+: :O2: :Indy: :Indigo: :Cube:
The suggestion to make sure the fans in your Tezro are unobstructed is a good one - continued operation at those temperatures might well cause damage. The DM3 is positioned fairly close to the perforated metal shroud that encloses the V12, so it's likely the graphics in your Tezros run cooler without the DM3 because of improved airflow.

I think SGI was aware that graphics boards in certain tower Tezro configurations could run hot. Later revisions of the V12 used in the Tezro included what appears to be an (unused) 3-pin fan header - the one in the photo is an 030-1884-002 Revision D:
Attachment: Tezro_V12_Fan_Header.jpg
To my knowledge that V12 fan header was never used - along with the development of 256 and 512MB versions of the V12 it may have fallen victim to SGI's decision to scale back/drop MIPS-based systems.

You might also compare the L1 revision in your Tezros with some of those posted in the hinv forums (that have cooler graphics temperatures). I don't know for certain that the temperature threshold/fan speed levels used by the L1's "Autofan Control" changed between Tezro L1 revisions, but I did notice differences in the Autofan Control between different L1 revisions in the very similar O350.

The Tezro I have arrived with a dead V12. There have been other mentions in the forum of failed Tezro graphics boards. Concerned that those failures might be heat related I placed a low-noise fan on the perforated metal cover directly above the heat sink on the V12. The additional fan lowered the average operating temperature of the V12 in my Tezro from 50C to 40C (though the DM3s in your Tezros will make adding a fan directly above the V12 heat sink slightly more challenging).

_________________
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
diegel wrote:
I don't think this is normal. These are the temperatures of my Tezro running 24/7 at 20C room temperature:
Code:
                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  47C/116F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  32C/ 89F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  41C/105F          6   64%/ 64%
HD          Enabled             5  40C/104F  44C/111F          5   49%/ 38%
There is probably dust in the machines, or something is blocking the airflow.


Thank you for the information. Both of these machines were very clean when I picked them up, with just a trace of black deposits on some of the parts. Nonetheless, the machine I am currently using was very thoroughly cleaned and inspected before I started using it on a more regular basis. I don't believe that there is anything blocking the cooling flow either, so I am completely baffled. I wonder if perhaps the temperature sensors have somehow drifted out of calibration...
recondas wrote:
The suggestion to make sure the fans in your Tezro are unobstructed is a good one - continued operation at those temperatures might well cause damage. The DM3 is positioned fairly close to the perforated metal shroud that encloses the V12, so it's likely the graphics in your Tezros run cooler without the DM3 because of improved airflow.

I think SGI was aware that graphics boards in certain tower Tezro configurations could run hot. Later revisions of the V12 used in the Tezro included what appears to be an (unused) 3-pin fan header - the one in the photo is an 030-1884-002 Revision D: To my knowledge that V12 fan header was never used - along with the development of 256 and 512MB versions of the V12 it may have fallen victim to SGI's decision to scale back/drop MIPS-based systems.


I will have a look under the shield to see if the V12 card in my Tezro has the header. I will also look through the manuals to see how to remove the V12. I am wondering if perhaps there is some heat transfer issue between the chips and the heatsink? It looked to me like the shield has been previously removed so perhaps something went awry in the process...

recondas wrote:
You might also compare the L1 revision in your Tezros with some of those posted in the hinv forums (that have cooler graphics temperatures). I don't know for certain that the temperature threshold/fan speed levels used by the L1's "Autofan Control" changed between Tezro L1 revisions, but I did notice differences in the Autofan Control between different L1 revisions in the very similar O350.


I recently upgraded the L1 to the firmware version that is supplied with 6.5.30 and I have not seen any concrete difference. Unfortunately I cannot upgrade it any further since I do not have a maintenance contract with SGI. I exchanged some emails with their support, but it looks like enthusiasts are out of luck as far as patches are concerned :( .

recondas wrote:
The Tezro I have arrived with a dead V12. There have been other mentions in the forum of failed Tezro graphics boards. Concerned that those failures might be heat related I placed a low-noise fan on the perforated metal cover directly above the heat sink on the V12. The additional fan lowered the average operating temperature of the V12 in my Tezro from 50C to 40C (though the DM3s in your Tezros will make adding a fan directly above the V12 heat sink slightly more challenging).


I will have another good look at the internals to see if there is something I can do to enhance the baseline cooling. There are a number of things in the baseline cooling design that I do not really understand, being an engineer. For example, if the V12 is this sensitive to overheating, the positioning of the cooling fans, the primary heatsink and the perforations in the cover shield do not seem ideal...

Thank you very much for your comprehensive response!
t-rexky wrote:
Thank you very much for your comprehensive response!
You're welcome.
t-rexky wrote:
I will have a look under the shield to see if the V12 card in my Tezro has the header.
If you decide you might like to use the header, to be safe I'd suggest metering the header to confirm that it was intended for use with a fan.
t-rexky wrote:
I will have another good look at the internals to see if there is something I can do to enhance the baseline cooling.
Another possibility might be replacing the Odyssey cooling fans with some that have the same physical dimensions but with a higher RPM/CFM rating. The L1 firmware regulates the system fans using a percentage of the fan speed rather than at a specific RPM, so a fan with a higher speed/flow rating should provide at least some additional cooling:
Code:
                     Zone Temp     Target    Current   Zone Fan   Curr/Min
Zone Name  State     Sensors       Average   Average   Index      Fan %
---------  --------  ------------  --------  --------  ---------  ---------
Node        Enabled           5,6  62C/143F  55C/131F          0   46%/ 46%
PCI         Enabled       0,1,2,3  45C/113F  37C/ 98F        3,4   57%/ 57%
ODY         Enabled             4  50C/122F  50C/122F          6   78%/ 64%
HD          Enabled             5  40C/104F  55C/131F          5   80%/ 38%
If you decide to try graphics fans with a higher RPM/CFM rating, I'd suggest picking one with a minimum speed higher than the 1300 RPM Warning Level (called for by your L1 firmware), or you'll likely generate fan warning errors that cause the system to auto-power down:
Code:
Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0   NODE 1    Enabled         1800         2109
FAN  1   NODE 2    Enabled         1800         2136
FAN  2   NODE 3    Enabled         1800         2149
FAN  3    PCI 1    Enabled         1350         1430
FAN  4    PCI 2    Enabled         1350         1493
FAN  5       HD    Enabled         1620         4218
FAN  6    ODY 1    Enabled         1300         2220
FAN  7    ODY 2    Enabled         1300         2083
Interestingly enough, the Warning RPM for the graphics fans in my Tezro is given as 1350 RPM - perhaps another indication there might be some differences in the environmental parameters called for in different firmware revisions.
Code:
FAN  6    ODY 1    Enabled         1350         1555
FAN  7    ODY 2    Enabled         1350         1461
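
As a rough way to screen candidate fans against the warning threshold, here is a Python sketch under some loud assumptions: fan speed is taken to scale linearly with the L1 fan percentage (real fans generally do not behave exactly that way, and the thread does not say how the L1 drives them), and the rated speeds in the loop are hypothetical examples rather than real part numbers. The 64% minimum and the 1300 RPM warning level come from the tables above.

```python
ODY_MIN_PCT = 64        # minimum fan % for the ODY zone, from the zone table
ODY_WARNING_RPM = 1300  # Warning RPM for ODY 1 / ODY 2, from the fan table


def estimated_rpm(rated_rpm, duty_pct):
    """Estimate speed at a given fan percentage, assuming linear scaling."""
    return rated_rpm * duty_pct / 100


def clears_warning(rated_rpm, min_duty_pct, warning_rpm):
    """True if the fan should stay above the warning RPM at the minimum percentage."""
    return estimated_rpm(rated_rpm, min_duty_pct) > warning_rpm


# Hypothetical rated full-speed values for candidate replacement fans.
for rated in (2000, 2800, 3500):
    rpm = estimated_rpm(rated, ODY_MIN_PCT)
    verdict = "ok" if clears_warning(rated, ODY_MIN_PCT, ODY_WARNING_RPM) else "below warning level"
    print(f"rated {rated} RPM -> about {rpm:.0f} RPM at {ODY_MIN_PCT}%: {verdict}")
```

Under these assumptions a fan rated at 2000 RPM would land at about 1280 RPM at the 64% minimum, just under the 1300 RPM warning level, so it is worth leaving a comfortable margin and confirming the real speed with l1cmd env after fitting.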
