SGI: Hardware

Warning: Origin 350 PSU may fail during sustained powered-on operation!

A couple of days ago, my O350 had switched off overnight. It turned out that one of the PSUs had failed :(

I took the PSU out of the system (hot swap PSUs rule), opened it up, and with the help of a colleague determined that a PFC diode had failed and taken a MOSFET switcher and the 8A primary fuse with it.

We ordered replacement parts:

Parts arrived, surgery was performed and I'm happy to report that the patient lives again :)

However, the story doesn't end there. I had of course looked for a replacement unit as well. The PSU is a Delta DPS-500EB E. It didn't take long for me to figure out that this PSU has been used by a wide range of manufacturers, including Intel, Sun, Acer, Fujitsu-Siemens and several others. The Intel part number is A76009-00x, and this is where it gets interesting. There's a Technical Advisory, TA-0674-5 (dated February 20, 2004), out for this PSU:
TA-0674-5 wrote:
Description
Intel® Server Chassis SR2300 500-watt redundant power (RP) supply modules with Intel part number A76009-006 and
prior revisions have the potential to fail during sustained, powered-on operation due to a failure of the primary PFC diode
(D802) in the power supply module. If a diode failure occurs, systems operating in a non-redundant power supply
configuration (only one power supply module installed in the power supply cage) will experience an immediate system
power down. [...]

Root Cause
Inherent imperfections in the silicon carbide base material (substrate) used to fabricate the diode cause abnormal electric
fields within the diode package during normal operating conditions. These fields result in high temperatures in the
imperfection areas which cause degradation and eventual failure of the diode. The structural design of the current
supplier’s diode does not have designed-in protection from these abnormal electric fields.

Corrective Action / Resolution
Intel has identified an alternate supplier source for the primary PFC diode in the power supply module. The alternate
diode design is substantially less susceptible to substrate imperfections, because it has designed-in protection against
substrate imperfections, and is therefore more robust than the current diode design. Intel has determined that power
supply modules built with the alternate diode meet Intel’s DPM rate requirement for server system power supplies. The
alternate diode is an equivalent drop-in replacement. An Engineering Change Order (ECO) has been completed to
incorporate the alternate diode. This change is described in Product Change Notification (PCN) number 103919-00.
Power supply modules built with the alternate diode will be marked with Intel part number A76009-007 (or later revisions).
Power supply modules with Intel part number A76009-006 and prior revisions may be reworked with the alternate diode
by Intel’s factories to part number A76009-007. Reworked power supplies will be marked with a green sticker and
relabeled with part number A76009-007 (or later revisions). Power supply modules with the alternate diode will begin
shipment from the power supply supplier on February 19, 2004. All affected product codes built after February 19, 2004
will contain power supply modules built with the alternate diode.
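
The revision rule in the advisory boils down to a simple threshold on the Intel part number suffix. Here is a minimal sketch in Python, assuming the label is in the form A76009-NNN as quoted above. The function name and the returned strings are made up for illustration. It only covers the Intel numbering; the advisory says nothing about how SGI part numbers or Delta "Sx" revisions map onto the diode change, so the sketch returns None for anything else.

```python
import re
from typing import Optional

# Threshold taken from TA-0674-5 as quoted above: A76009-006 and prior
# revisions are affected, A76009-007 and later carry the alternate diode.
FIRST_REVISION_WITH_ALTERNATE_DIODE = 7


def intel_psu_status(part_number: str) -> Optional[str]:
    """Classify an Intel A76009-NNN label (hypothetical helper).

    Returns "affected", "alternate diode", or None when the string is not
    an Intel A76009 part number (e.g. an SGI or Delta label).
    """
    match = re.fullmatch(r"A76009-(\d{3})", part_number.strip().upper())
    if match is None:
        return None
    revision = int(match.group(1))
    if revision >= FIRST_REVISION_WITH_ALTERNATE_DIODE:
        return "alternate diode"
    return "affected"


if __name__ == "__main__":
    for label in ("A76009-006", "A76009-007", "060-0178-003"):
        print(label, "->", intel_psu_status(label))
```

Per the advisory, reworked modules are relabeled to A76009-007, so the printed label should reflect the rework.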

Sun issued a bulletin regarding the Sun Fire V65x servers, which use this PSU. I'm not aware of any action by SGI though (oh, how things have changed...). I can only assume that Origin 350 systems produced before Q2 2004 (and maybe later) have the same issue. Mine certainly did: the PFC diode that blew was D802, as described in Intel TA-0674-5. And it had apparently been running hot for a while, because the heat had all but erased the part number printed on it.

I also won an equivalent (?) Fujitsu-Siemens DPS-500EB PSU for a whopping 1€ (+shipping) on eBay :lol: I'm curious to find out (1) whether this PSU will work at all or the O350 will reject it (it can read part and serial numbers, so who knows), and (2) whether this PSU contains the same Infineon part or the alternate diode that TA-0674-5 mentions.

Most of the hobbyist crew probably don't run their O350s 24/7, but even if you don't, *I* wouldn't be happy knowing there's a component in there that degrades and blows like that. Unfortunately we ordered the parts before I dug up this Technical Advisory, so I probably have the same part, susceptible to degradation, unless Infineon improved it.

I will almost certainly replace D802 in the other module of my O350 as well. Until I have this sorted out I will minimize the power-on hours of the O350. If you own an O350, you may want to do the same.
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
Thanks for posting that info, jan-jaap. With the potential for that defect to sideline O350s, your offer to test a non-SGI version of the Delta DPS-500EB E PSU might also be very helpful to an O350 owner who doesn't have the skills or facilities to replace the defective parts.

jan-jaap wrote: I'm curious to find out if (1) this PSU will work at all, or that the O350 will reject it. It can read part- and serial numbers so who knows...
Like you discovered with the Fuel, it looks like some early versions of the L1 firmware don't report PSU info: I have an O350 that doesn't list the PSU in an L1 serial all (1.22.4), while the others do (1.34.8 or newer).

I also noticed that the L1 serial all output (on systems that report PSU info) includes a revision code. A quick scan of the nekochan hinv forum turned up serial all posts with revision S1 and revision S4. All of the SGI versions of the DPS-500EB E I have looked at so far, including those used in the later production Altix 350, use SGI part number 060-0178-003. No revision info is printed on the SGI part number label on any of the O350 PSUs I have here, but all of them do include the Sx revision designation on the OEM Delta label. Hopefully some of the SGI-supplied O350 PSUs with later S(number) revisions will have been manufactured with a non-defective PFC diode.

Not many of the Delta DPS-500EB examples on eBay include enough photographic detail to read the revision info on the Delta label, but at least one (a DPS-500EB G) was labeled by Delta as revision S2.

It looks like the PSUs in your O350 are reported as revision S1. I have S1 and S4 revisions, if you can tell me where to look I'm willing to open the S4 revision to check if it has the defective PFC diode. If it's possible, a photo would probably save me some time peering through a magnifying glass.
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
recondas wrote: It looks like the PSUs in your O350 are reported as revision S1. I have S1 and S4 revisions, if you can tell me where to look I'm willing to open the S4 revision to check if it has the defective PFC diode. If it's possible, a photo would probably save me some time peering through a magnifying glass.

I took a couple of pictures when the PSU was open, and the defective parts removed:




The arrow points to where D802 should be. It is normally hiding under a clamp and a thermal pad, and there's another piece of heat sink on top as well. So some disassembly will be required.

The MOSFET that died as well goes in the empty spot to the right of D802.

The datasheet of the Schottky diode refers to it as 'v2.0', '2nd generation' etc., so I have good hopes this one is a little better. We'll see in another 7 years or so ;)
An update: there are many cheap DPS-500EB power supplies out there on eBay. I bought two for less than 10€ apiece.

They are Fujitsu-Siemens units labeled "DPS-500EB C Rev1" (from an FSC Primergy RX300 server), and it appears the Origin 350 accepts them without complaining:

Since the PFC diode in my original DPS-500EB E blew out, I will probably replace the PFC diodes in these DPS-500EB C units as a precaution.
That's some very useful info. Thanks for posting the follow up!