Origin3200 IBrick voltage problems

After describing the following problem while piggybacking on somebody else's thread and getting little feedback, I decided to open this new thread, because I hope some of you might have some insight into this.

As the title says, I have an Origin3200 with 1 IBrick, 1 DBrick, 2 CBricks and a power bay, which shows the following symptoms:

The machine starts up nicely; all CPUs, RAM and disks are discovered, and I can log on at the console or via the network. After some running time the IBrick detects the -12V line to be way out of bounds: it reports -24V or even -36V. Very disturbing. Obviously the system shuts down to protect itself.

From what I read in the other thread, the faulty part could be the 12V VRM in the IBrick. That is where my questions start:

1. Do you insiders know whether the -12V rail is supplied by the 12V VRM, so that I can be sure this is the VRM to replace? It says 12V on the PCB next to the VRM. It says -12V in another location on the PCB, and there is just a big chip with a heatsink nearby (so no easily replaceable part).

2. I have a Fuel which reports a faulty voltage and shuts down, and my reading told me that this is a faulty readout and not a faulty PSU. Do you think something like this could be the case with the IBrick too?

Sorry for posing these questions again but I just hope that some of you have some experience in this field. This is not meant to say that the reply I got from pierocks was not helpful! Thanks for that. I just hope I am not already at the end of my journey with this nice machine.

Maybe I should post a photo so you can share my affection for this system?

_________________
:Indigo: :Indy: :O2: :1600SW: :1600SW: :Octane: :Octane: :Octane: :Octane: :Octane2: :Fuel: :Onyx2: :Onyx2: :Onyx2: :O2000: :O3200:
OK, let me tell you what I found so far:

When the system has been switched off for a while, everything looks perfect. I can power up and get this:

Code:
001c04-L1>cti env
001i14:
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
12V BIAS          Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.250
2.5V              Enabled  10%   2.30/  2.82  20%   2.05/  3.07    2.509
3.3V              Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.302
12V               Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.000
5V                Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.018
-12V              Enabled  10% -10.80/-13.20  20%  -9.60/-14.40  -12.195
3.3V AUX          Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.285
5V AUX            Enabled  10%   4.50/  5.50  20%   4.00/  6.00    4.966

Description     State       Warning RPM  Current RPM
--------------- ----------  -----------  -----------
FAN  0     LEFT    Enabled         2160         2481
FAN  1   CENTER    Enabled         2160         2556
FAN  2    RIGHT    Enabled         2160         2500

                              Advisory   Critical   Fault      Current
Description       State       Temp       Temp       Temp       Temp
----------------- ----------  ---------  ---------  ---------  ---------
0 POWER 0           Enabled   30C/ 86F   35C/ 95F   40C/104F   22C/ 71F
1 POWER 1           Enabled   30C/ 86F   35C/ 95F   40C/104F   22C/ 71F


And this:
Code:
001c04-L1>cti pwr vrm
001i14:
VRM Type        Location  Present  Okay
--------------  --------  -------  -------
12V                    1   passed   passed
5V                     2   passed   passed
3.3V                   3   passed   passed
3.3V                   4   passed   passed
2.5V                   7   passed   passed
-12V            on board      N/A   passed
48V                  N/A      N/A   passed


As I had feared, the -12V rail is not generated by the 12V VRM but somewhere on the main board.

I managed to get a hinv from the system because usually it takes some time till the error occurs. The hinv can be found here: http://forums.nekochan.net/viewtopic.php?f=14&t=16723604

I will try to fire her up again and see what information I can get from the L1 after it shuts her down again. Will edit the post then.

_________________
So, this is what it looks like:

Code:
001c04-L1>cti env
001i14:
Environmental monitoring is enabled and running.

Description    State       Warning Limits     Fault Limits       Current
-------------- ----------  -----------------  -----------------  -------
12V BIAS          Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.250
2.5V              Enabled  10%   2.30/  2.82  20%   2.05/  3.07    2.509
3.3V              Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.302
12V               Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.000
5V                Enabled  10%   4.50/  5.50  20%   4.00/  6.00    5.018
-12V                Fault  10% -10.80/-13.20  20%  -9.60/-14.40  -31.687
3.3V AUX          Enabled  10%   2.97/  3.63  20%   2.64/  3.96    3.285
5V AUX            Enabled  10%   4.50/  5.50  20%   4.00/  6.00    4.966


Check out the current -12V reading.
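For anyone who wants to watch these readings over time from a captured console log, here is a minimal Python sketch that checks each voltage row against its own warning and fault limits. It is only an illustration: the row layout is assumed from the two env listings in this thread, and the names `ROW` and `classify` are made up for the example, not part of any SGI tool.

```python
import re

# One voltage row of the env listing: name, state, warning pair,
# fault pair, current reading. Layout assumed from the output above.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<state>\S+)\s+"
    r"\d+%\s+(?P<w1>-?[\d.]+)/\s*(?P<w2>-?[\d.]+)\s+"
    r"\d+%\s+(?P<f1>-?[\d.]+)/\s*(?P<f2>-?[\d.]+)\s+"
    r"(?P<cur>-?[\d.]+)$"
)


def classify(line):
    """Return (rail, reading, status) for a voltage row, else None."""
    m = ROW.match(line.strip())
    if m is None:
        return None  # header, separator, fan or temperature line
    # Sort each pair so negative rails (-10.80/-13.20) compare correctly.
    warn = sorted((float(m["w1"]), float(m["w2"])))
    fault = sorted((float(m["f1"]), float(m["f2"])))
    cur = float(m["cur"])
    if not fault[0] <= cur <= fault[1]:
        status = "fault"
    elif not warn[0] <= cur <= warn[1]:
        status = "warning"
    else:
        status = "ok"
    return m["name"], cur, status


sample = """\
12V    Enabled  10%  10.80/ 13.20  20%   9.60/ 14.40   12.000
-12V      Fault  10% -10.80/-13.20  20%  -9.60/-14.40  -31.687
"""

for row in sample.splitlines():
    result = classify(row)
    if result is not None:
        print(result)
```

The limits are read from each row rather than computed from the rail name, since the listing already carries them.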

Btw, does really nobody know if the L1 of a brick tends to malfunction the way it does in the Fuel? I mean, show erroneous data / readings.

_________________
rusti wrote:
Btw, does really nobody know if the L1 of a brick tends to malfunction the way it does in the Fuel? I mean, show erroneous data / readings.

My guess would be "no, it does not." All these machines have been out for a long time. The Fuel problem is well known; if that problem were common in the big iron, it would also be well known. You most likely have a real problem ...
Damn. I feared so. Nevertheless, as I figured it cannot get any worse, I switched off environmental monitoring on that brick. It reported -32V the whole time, but I could not see or smell :-( any ill effect...

Still I have a very bad feeling about this.

_________________
-12V is most likely only used on the PCI-X slots, for backward compatibility with conventional PCI. Probably none of the PCI-X cards supported in the I-Brick actually use the -12V rail.

I wouldn't run it like this for too long, though. Once you can smell something burning, it's already too late :(
rusti wrote:
Do you insiders know whether the -12V rail is supplied by the 12V VRM, so that I can be sure this is the VRM to replace? It says 12V on the PCB next to the VRM. It says -12V in another location on the PCB, and there is just a big chip with a heatsink nearby (so no easily replaceable part). As I had feared, the -12V rail is not generated by the 12V VRM but somewhere on the main board.
There's an amazing diversity of knowledge available on nekochan - you might consider posting a few detailed photos of the main PCB, and the general area of the -12V marking on the PCB. Perhaps someone can ID the suspect component and offer repair or replacement advice.

ShadeofBlue wrote:
-12V is most likely only used on the PCI-X slots, for backward compatibility with conventional PCI.
Unlikely enough to hardly be worth the effort, but have you tried removing the fibre channel board and any other PCI cards that might be in the I-Brick? You might also check that the sliding retaining bracket in the top of the PCI carrier hasn't worked loose, allowing a PCI board to be inserted at a slight angle. The attachment screws on the carrier use a smaller-than-usual Torx bit, but in a pinch the right metric hex key will do the job.

....and testing with a different power bay supply module and cable would fit right into my list of long shots.

_________________
***********************************************************************
Welcome to ARMLand - 0/0x0d00
running...(sherwood-root 0607201829)
* InfiniteReality/Reality Software, IRIX 6.5 Release *
***********************************************************************
@ShadeofBlue: I am not sure whether I fully understand the PCI / PCI-X part (lack of background knowledge). So, what you are saying is that PCI used -12V and PCI-X does not any more? The only PCI board installed is the FC controller. Let's assume I want to install a Gigabit LAN card later: will I get in trouble because it uses -12V and will be destroyed?

I see your point about not letting it run like this for long. But what are my options? If environmental monitoring is on, it is only a question of time until the system shuts down. So it's basically a useless brick.

@recondas: The only PCI board is the FC controller for the boot disk and array. I doubt that an improperly seated card would cause an error that takes 20 minutes from starting the system until it occurs. Nevertheless, since any chance of fixing it is better than no chance, I will have a close look and reinsert it. I will also open it and take some photos like you suggested. Will post them once I have found the time to do that (hopefully later today).

_________________
rusti wrote:
@ShadeofBlue: I am not sure whether I fully understand the PCI / PCI-X part (lack of background knowledge). So, what you are saying is that PCI used -12V and PCI-X does not any more?

Sorry, I should have been more specific. Standard PCI slots supply various voltages on different pins.
There's a pin that supplies -12V, along with a few pins for 12V, 5V and 3.3V (see this page for a full pinout).
PCI-X was designed to be backward compatible with the PCI specification, so they kept the -12V pin, even though most modern cards have no use for it. In the past, the -12V rail was used for some communication cards.

Quote:
The only PCI board installed is the FC controller. Let's assume I want to install a Gigabit LAN card later: will I get in trouble because it uses -12V and will be destroyed?

If it uses -12V, then yes. But it probably doesn't use -12V :)
The documentation for the card should list how many amps it draws on the various rails; if that list doesn't include the -12V rail, then the card doesn't use it.

Quote:
I see your point about not letting it run like this for long. But what are my options? If environmental monitoring is on, it is only a question of time until the system shuts down. So it's basically a useless brick.

If anything on board used -12V, it would have fried by now, so as far as that goes, it's probably ok :)
However, it's still possible that the chip responsible for generating -12V might fail completely and cause a short circuit.
It's best to play it safe. I-bricks are still quite expensive, and replacing the broken chip could be the cheapest solution, but not if the chip takes something else with it.
No, you shouldn't apologize. It's me who lacks basic knowledge here.

Anyway. I took out the FC card, tested, reseated it, tested again, tried a different power supply and so on, all to no avail.
I took some pictures which I still have to check and post later. How many megapixels are a good compromise between sharpness and thread loading times?

After reassembly the machine now threw a memory exception, somehow set the global master to the other C-brick, and then complained that it has no I/O. So for the time being it doesn't boot at all any more. Definitely need to do some reading...

_________________
Ready to attach some pictures:

_________________
rusti wrote:
I took some pictures which I still have to check and post later. How many megapixels are a good compromise between sharpness and thread loading times?
The nekochan interface will automatically make load-time friendly thumbnails of large photos. Resize the originals by eye so that there's a balance of reasonable file size/enough detail to allow visualizing components on the board.

EDIT: I see you figured that out quicker than I type. :D

I ran the part number that's visible on your board <030-1571-002>. If all else fails, Chris at MCE <j5reward> posted one in an "available spares" list in the nekochan for sale forum a while back.
Code:
030-1571-002 POWER BOARD, I-BRICK/X-BRICK DNS

_________________
Just for fun, if you have an old PCI card you could solder a lead to the -12V trace. Then measure the voltage on that pin to see what it really is. That will tell you if it's the monitoring or an actual fault.
The only other thing -12V would be used for would be RS-232 levels. You could probably live OK with it disabled - try it before buying a new (probably expensive) board.

Another possibility would be to rig up a circuit to inject -12V where needed.

_________________
Damn the torpedoes, full speed ahead!

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O200: :ChallengeL:
hamei wrote:
Just for fun, if you have an old PCI card you could solder a lead to the -12V trace. Then measure the voltage on that pin to see what it really is. That will tell you if it's the monitoring or an actual fault.

Who knows, you just might fry the card, which then deep-fries the rest of the system. Better to solder a temporary lead to the underside of the board with the PCI-X socket on, or somewhere else where the -12V comes out. And that's only if you really want to know, but it would rule out the possibility of faulty monitoring.

_________________
:Onyx: (Aldebaran) :Octane: (Chaos) :O2: (Machop)
:hp xw9300: (Aggrocrag) :hp dv8000: (Attack)
I did what hamei suggested. As I normally only use my soldering iron on my slot cars, it was quite a thrill for me. And I had an interesting learning curve:

Board Nr. 1 I chose was obviously 5V only, so it did not have the notch near the rear of the chassis. Btw, it was a Sound Blaster Live.

Board Nr. 2 was a Realtek 10/100 network card. It did have the notches but did not have a pin going to -12V.

So I ended up using an original SGI 10/100 network card I pulled out of my O2 some time ago. Basically I did what you can see in the first picture (that is still the Sound Blaster, and the right wire is soldered to pin 4 instead of 3 for ground).
Attachment: IMGP8161.JPG


I took a photo of my voltmeter next to the output of L1 command cti pwr.
Attachment: IMGP8164.JPG


So I would say it is a false alarm. If it were a production system, turning off env monitoring would not be an option, but for a hobbyist system I think I will live with that. I do with my Fuel, so what the heck. If I get bored I could at least check whether I find ICs that are like the Fuel ones that somebody in this forum dared to replace.
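The reasoning behind the false-alarm verdict can be written down as a small Python sketch: compare the value L1 reports with an independent meter reading against the same fault limits. The fault limits (-9.60/-14.40) and the reported -31.687 come from the env output earlier in the thread. The meter value of -12.1 is a placeholder only, because the real reading is in the photo and is not stated in the text, and the function names are invented for the example.

```python
def in_band(value, limits):
    """True if value lies between the two limits, in either order."""
    lo, hi = sorted(limits)
    return lo <= value <= hi


def diagnose(reported, measured, fault_limits):
    """Compare the L1 readout with an independent meter reading."""
    reported_ok = in_band(reported, fault_limits)
    measured_ok = in_band(measured, fault_limits)
    if measured_ok and not reported_ok:
        # The rail is fine at the slot, only the readout is off.
        return "monitoring fault (false alarm)"
    if not measured_ok:
        # The meter itself sees an out-of-range rail.
        return "real rail fault"
    return "no fault"


FAULT_LIMITS = (-9.60, -14.40)  # -12V fault limits from the env listing
REPORTED = -31.687              # value L1 showed when the fault tripped
MEASURED = -12.1                # placeholder, NOT the value from the photo

print(diagnose(REPORTED, MEASURED, FAULT_LIMITS))
```

A meter reading that also sat near -32V would come out as a real rail fault, which is the case hamei's test was meant to rule out.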

_________________
sybrfreq wrote:
Who knows, you just might fry the card, which then deep-fries the rest of the system.

Not so likely - usually when things blow up they die open, but still ... you're right. I was thinking more along the lines of using an old junk board just for the leads and cutting the traces. Actually, the O2's pci standoff would have been a good choice. Nothing connects to that.

All's well that ends well, though. I ran the Fuel for four years with no environment monitoring. Didn't like that but it worked out fine. Can you turn off L1 env for just one brick or does it apply to the entire computer?
L1 works brick-wise. I am connected to a C-Brick with my "terminal", which is an O2 with cu. I should take a picture of that too. The O2 with the 1600SW looks cool on top of the short rack.

Anyway, from the console in L1 mode you can issue commands to all bricks (which is 2 C-bricks and the I-brick in my case) or to individual bricks. So
Code:
cti env off
turns off environmental monitoring in the I-brick only.

I will do that picture and an "* env" after breakfast. You can see then that env is off on the I-brick but still on on the others.

_________________
Good to see it's working :)

rusti wrote:
After reassembly the machine now threw a memory exception, somehow set the global master to the other C-brick, and then complained that it has no I/O. So for the time being it doesn't boot at all any more. Definitely need to do some reading...

Connecting the I-brick's NUMAlink cable to the other C-brick should get rid of the "no I/O" error and allow it to boot (you may need to delete /etc/ioconfig.conf and reboot).
Then you can try reseating the CPUs and RAM in the C-brick that's reporting the exception.
I am not sure whether I understood that.
Let me ask you about the NUMAlink cabling first. Currently XIO10 of the I-Brick is connected to XIO(II) on the first (lower) C-Brick, and the two C-Bricks are connected Link(NI) to Link(NI). That is what I understood from reading the manual.
I was always tempted to connect XIO11 of the I-Brick to XIO(II) of the second (upper) C-Brick for redundancy and / or better bandwidth, but since I found no indication whether this is a legal configuration, I did not do it.

If I read you correctly, what you mean is connecting the upper C-brick to the I-Brick instead of the lower one, right?

Second thing: I was always searching for a place where you can find the location of the console (which brick) and this master / slave thing. I know I read something, but given all the reading I did on Onyx2 and Origin2000 and Origin3200 all at the same time, and given my age (above 40) :-), I probably had a brain stall in the process. So /etc/ioconfig.conf is one such spot? But obviously that can only be read once the C-Brick has found the I-Brick and the boot drive. So that is operating system level. I guess there need to be some NVRAM settings, too.
This is all mind-boggling. Will try to clean up my server room today (shitty weather, so the perfect time to do that), take the photo I promised and then invest some more time in booting and exploring the system.

_________________