SGI: Hardware

What the heck is this and how do I get it to work? - Page 3

Now that I have time to work on it, it won't boot. Stuck at "Initing saio" ... after some RTFM and STFA it looks like a memory error. Going back to what I saw over the serial console from the last few times I've tried to get it to boot:

*** Disabling CPU 10/01 Error: Failed scache1 addr test.

Which slot is that? They aren't labelled. I assume the 01 is for Card Cage 01 (the front) and processor 10 would be either 10 or 11 (if we start counting at 0 or 1.)

The "Initing saio..." readout on the management display comes out as:

B+__+++++++++++++D

I think I know which board to pull, as its "blinken lights" don't match the patterns of any other boards. I went ahead and pulled it, and now no more "D" for disabled CPUs. Still hanging at "Initing saio..."

B+__++++++++++++

Any clues for further decoding?

_________________
Currently Own:
Iris Indigo R3K, Challenge XL RE, O2 R5k CRM
Indigo2 ZX R4400
Sold:
R5K Indy, ZX Graphics. MANY, MANY moons ago.
I think I have that part figured out. After pulling the board with the bad CPU and moving another board around (trying to debug) I now have

"Initing saio..." B++++++++++++

... but still hanging.

I broke out my serial console gear (USB->serial converter, some null modem cables, etc.)

... and then it boots fine. Are you kidding me? UGH! Intermittent faults are the worst!

And then it reboots again after not finding a boot device and gives me this:

Code:
Configuring memory...
Using standard interleave algorithm.
Running built-in memory test... 01 02 04
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)


Which slot does this physically translate to?

The only thing I see as "slot 2" (at least in card cage 1) is a 030-0318-004

Counting backwards (right to left) I pulled the card out of "slot 02" and tried again.

Code:
Using standard interleave algorithm.
Running built-in memory test... 01 02
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)


So that board didn't have the bad RAM.

Also, after pulling that board I saw the following error, which I hadn't seen before:

Code:
Piggyback reads enabled.
Initializing software and devices.
Error: unrecognized SCSI bus controller atslot 0 padap 4.
s1: bus 0 ignored; 93b not supported
RealityEngine2 prom Rev. 0x3000e


Then, after complaining there is no boot device I pressed enter to continue and got this...

Code:
Exception: <vector=Normal>
Status register: 0x4000082<FR,IPL=8,KX,MODE=KERNEL>
Cause register: 0x801c<CE=0,IP8,EXC=DBE>
Exception PC: 0x818261b8, Exception RA: 0x81817218
Instruction Bus error

*** Error/TimeOut Interrupt(s) Pending: 0x40 ==
A Chip MyResp drsc tout

VID #0's ARCS PDA:  &pda 0x81880278, &regs 0x81886a78, magic 0xadacab
vid 0, pid 20, init_sp 0x0, fault_sp 0x8196c210, stack_mode 1
mode_sv 0, EPC_sv 0x818261b8, AT_sv 0x0, badvaddr_sv 0xffffffff
ErrEPC_sv 0xffffffff, CacheErr_sv 0xff3fffff, cause_sv 0x801c, v0_sv 0x0
SP_sv 0x8196b0c0, SR_sv 0x4000082, exc_sv 0x1, return_addr_sv 0x81817218
notfirst 0x1, firstEPC 0x818261b8, nofault 0x0

PANIC: Unexpected exception

[Press reset or ENTER to restart.]


OY!
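
For anyone trying to read the dump above, here is a minimal sketch of how the Cause register value breaks down, assuming the standard MIPS R4000-family field layout (ExcCode in bits 6..2, interrupt-pending in bits 15..8, CE in bits 29..28). The function name and the table are just for illustration, and the pending bits are numbered from 1 only to line up with the PROM printing "IP8".

```python
# Subset of ExcCode names from the MIPS R4000-family Cause register.
EXC_CODES = {
    0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES",
    6: "IBE", 7: "DBE", 8: "Sys", 9: "Bp", 10: "RI", 11: "CpU", 12: "Ov",
}

def decode_cause(cause):
    """Split a 32-bit Cause register value into its main fields."""
    exc = (cause >> 2) & 0x1F   # ExcCode, bits 6..2
    ip = (cause >> 8) & 0xFF    # interrupt-pending bits, 15..8
    ce = (cause >> 28) & 0x3    # coprocessor number (only meaningful for CpU)
    # Numbered from 1 here (assumption) so bit 15 shows up as 8, like the PROM's "IP8".
    pending = [i + 1 for i in range(8) if ip & (1 << i)]
    return {
        "exc_code": exc,
        "exc_name": EXC_CODES.get(exc, "unknown"),
        "pending": pending,
        "ce": ce,
    }

decoded = decode_cause(0x801C)
print(decoded)
```

Fed the 0x801c from the dump, this comes back with ExcCode 7 (DBE), one pending interrupt bit, and CE=0, which agrees with the PROM's own annotation.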

So then I move over to the other card cage and pull the only thing I could think of as "slot 2" which is a combined CPU/memory board.

Code:
Running built-in memory test... 01 02 04
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)


Very frustrating.
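
If you are capturing the serial console to a file, a rough sketch like the following can tally the failure lines across runs. The regular expression is written against the exact message format quoted in this thread, and the function name and sample log are made up for illustration.

```python
import re
from collections import Counter

# Matches the PROM memory self-test failure line as quoted in this thread.
FAIL_RE = re.compile(
    r"\*\*\* Self-test FAILED on slot (\d+), leaf (\d+), bank (\d+) \((\w)\)"
)

def tally_failures(log_text):
    """Count (slot, leaf, bank, letter) tuples seen in a captured console log."""
    counts = Counter()
    for m in FAIL_RE.finditer(log_text):
        slot, leaf, bank, letter = m.groups()
        counts[(int(slot), int(leaf), int(bank), letter)] += 1
    return counts

# Sample input: the three runs posted above, pasted together.
sample_log = """\
Running built-in memory test... 01 02 04
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)
Running built-in memory test... 01 02
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)
Running built-in memory test... 01 02 04
*** Self-test FAILED on slot 02, leaf 0, bank 0 (A)
"""

failures = tally_failures(sample_log)
for location, count in failures.items():
    print(location, count)
```

All three runs report the same slot/leaf/bank no matter which board was pulled, which suggests the PROM's "slot 02" does not correspond to either of the boards removed so far.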

_________________
Also, I don't know whether the updates are annoying anyone, but I figured I'd keep everyone posted. jan-jaap, everyone, you've been fantastic resources and inspiration.

I swapped around some cards; I pulled the CPU board which had a bad CPU (last night.) I pulled one of the IO4 cards originally and wound up swapping that one with the existing one (this one is -05, the other was -03 hand-written), moving the graphics adapter over to the newer IO4 and removing some of the extra mezz cards (one had a SCSI card, the other had a remote VCAM card which wasn't used.) I also pulled out the extra Sirius board, since I don't have a break-out box for it. I also pulled an audio card which was hard to get to and buried pretty far back. It's an interesting beast, as it has a VME adapter to the audio card itself which plugs in via two plugs to the adapter. Even with that it sits back very far in the cage and you can't reach it (and nothing was hooked to it.) I'll pull part numbers later.

After all this, things started working more reliably. I still have some bad RAM that the PROM maps out during POST, but the system will now reboot on its own a few times (like when it can't find a boot disk), get back to the PROM menu, and work; it wasn't doing that consistently before.

I figured why not try and install an OS if things are more stable? I have a CD set of IRIX 6.5.3 and a ton of drives.

Booting from the CD to run fx and partition the disk (I tried 3 different disks) works fine.

Booting from the CD to copy the miniroot over to the disk works.

Booting from the miniroot fails (see attached photo.)

Not sure what all the errors mean, and if they're valid given that the situation seems "compounded" (SCSI error, then IO, then VCAM, etc.)

I tried playing with terminator jumpers, parity jumpers, etc, but none of that made a difference.

Where do I go from here?

Next major step will be using the 'pod' tool from the PROM to isolate the bad 64M of RAM and yank it, but I didn't get that far.

_________________
Remember the Hitchhiker's Guide: "Don't Panic"

Miniroot installs contain drivers for all the supported devices. If your machine doesn't have them they'll error out and go away quietly.

The error you need to pay attention to is the SCSI error (WD95*). The RealityEngine termination error may also be relevant, but that can be handled later.

Since it's just a board parity error, start by checking your SCSI cabling, sleds and termination. Firmly reseat everything from the SCSI bus adapter card on. Pull all drives/sleds that aren't vital and see what happens.

_________________
Damn the torpedoes, full speed ahead!

There are those who say I'm a bit of a curmudgeon. To them I reply: "GET OFF MY LAWN!"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
And as for the updates, keep 'em coming. First, how else will you get help and support? Second, anybody who'd get annoyed isn't being forced to read them, so don't sweat it. If they choose to read them and complain, we'll drown them out. Third, the debugging process you document here might help somebody else out years down the road.

So in short: Chillax man, it's all good. :P

_________________
Then? :IRIS3130: ... Now? :O3x02L: :A3504L: - :A3502L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
About the SCSI parity errors: could be a disk, could be wiring or incorrect termination.

You were talking about a disk on the 50-pin connector of the sled, but is it a wide disk with some adapter? I don't think that will work: controller and disk are both 'wide' and will try to negotiate wide, but half the data lines are missing because there's a 'narrow' cable in between.

Termination is on the disk backplane. I don't think jumper settings are documented anywhere, but I kept a couple of shots of my (deskside) Challenge when I converted the 2nd channel to SE. They are here: http://www.vdheijden-messerli.net/sgist ... challenge/ (high res available on request)


You also have a message about illegal RM4 termination on the screen. Onyx RM4's don't need terminators, but Crimson/PowerSeries did. I think you'll find that at least one RM4 has a bunch of resistor packs installed close to the backplane connector. You could do without them, and possibly make someone happy who's trying to upgrade a Crimson.

Oh, and is that rack always that 'naked'? I can see the RM4 from the outside, that's got to affect the airflow. If you like the naked look, mount a sheet of perspex but try to cover that hole. RM4's are fragile enough as it is ...

PS: I like the industrial looks of that rack. That's the one from sunhelp 'rescue', right?

_________________
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
The system came to me with 4 disk sleds; all 50 pin.

My disks are 68 pin and SCA; but they have jumpers to "force SE mode" on the disk that I've set. But that could be the issue.

What gets me is that labeling the disks, running fx, copying the miniroot is A-OK. It just barfs on booting the miniroot.

Yes, the haul I have is from Pat which was posted on sunhelp.

_________________
katzmandu wrote:
The system came to me with 4 disk sleds; all 50 pin.

My disks are 68 pin and SCA; but they have jumpers to "force SE mode" on the disk that I've set. But that could be the issue.


Sometimes, but you've set "force SE" on an SE bus (hopefully), so that isn't the issue.

What Jan-Jaap is bringing up has to do with negotiation. SCSI can operate in 8-bit (narrow) or 16-bit (wide) mode. Normally devices will autonegotiate, but this doesn't always work; I have some disks that need to be forced narrow for narrow busses. If you have a wide initiator and a wide target with a narrow bit of cabling (such as a drive sled) in between, that negotiation doesn't always work right, and you wind up with devices trying to send 16-bit data chunks and having the upper 8 bits stripped. If you have a wide device and a wide HBA, either force narrow mode or (better) get a wide sled/cable so you can get wide speed.
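
To make the "upper 8 bits stripped" part concrete, here is a toy model, not a simulation of real SCSI signalling. It assumes the first byte of each pair travels on the low eight data lines, and it ignores parity and handshaking entirely. Both function names are invented for the sketch.

```python
def wide_transfer(data):
    """Pack bytes into 16-bit bus words, two bytes per transfer cycle.

    Assumption for this model: the first byte of each pair rides on the
    low eight data lines and the second byte on the high eight.
    """
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]

def through_narrow_cable(words):
    """Model a 50-pin path: only the low eight data lines physically exist."""
    return bytes(w & 0xFF for w in words)

payload = b"IRIX 6.5"
received = through_narrow_cable(wide_transfer(payload))
print(payload, "->", received)
```

Half of the payload never arrives. On a real bus the parity line for the upper byte is missing as well, so parity errors are what you would expect to see when both ends believe they negotiated wide.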

_________________
I took the 50 pin drive out of an Indigo2 and threw it into the Onyx. It worked like a champ. I didn't install over it, because I want to keep 5.3 on it for the Indigo2.

I also pulled the resistors from the RM4 board. No more errors from that.

Bottom line, I need a 68-pin tray for the Onyx.

I took some photos of the other boards; if useful, they can be added to the wiki, etc.
