SGI: Hardware

Indigo2 ECC panic... FIXED

Help!

I'm a bit of an SGI newbie. I currently own an Indigo 2 and it hasn't wanted to boot for some time. I finally got around to sourcing the correct serial cable from Amazon and a USB to serial adaptor, and got it all hooked up tonight. The system powers on and gives me a little menu, and when I ask it to boot the system, after about 5 seconds of disk access it dumps the following:

Code:
ecc_panic initiated! (for cpu 0)
ECC PANIC: Uncorrectable HARDWARE ECC error, in kernel mode
CPU SysAD bus: |?!Syndrome at addr 0x10000008 normal!
CacheErr 0xe400000f<ER,EC,ED,EE,SIDX=0x882c51f8,PIDX=0x882c51f8>
Status 0xff05<IM8,IM7,IM6,IM5,IM4,IM3,IM2,IM1,IPL=0,MODE=KERNEL,ERL,IE>
ErrorEPC 0x8801a88c, Exception Frame 0xa834bde0, ECC Frame 0xa834bf88
PhysAddr 0x10000008, VirtAddr 0x7008
cpu_err_stat: 0x3a0, cpu_err_addr: 0x10000000
gio_err_stat: 0x0, gio_err_addr: 0x1fbd9023
hpc3_buserr_stat: 0xbfbb0010
ECC cbits 0x0 data_lo 0x0 data_hi 0x0
sbe dblwrds 0x0 mbe dblwrds 0x0
s_taglo 0x10003cd<paddr=0x882c5358,INVAL,vind=0x882c5358,ecc=0x882c5358>


PANIC: Uncorrectable cache ecc/parity error
Kernel/Interrupt Stack Overflow @0x881ba988 sp:0x88317e28 k1:0x23001f
ra:0x881bd2b0 stkflag:1

DOUBLE PANIC: stack underflow/overflow
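
For anyone wanting to read the raw register values in a dump like the one above, here is a minimal sketch that decodes the Status and CacheErr words into flag names. The bit positions are an assumption taken from the MIPS R4000-family CP0 layout (the thread never establishes which CPU is fitted), and the names `decode_status` and `set_flags` are made up for illustration.

```python
# Bit layouts assumed from the MIPS R4000-family CP0 registers; check the
# manual for the CPU actually fitted before relying on them.
STATUS_FLAGS = {2: "ERL", 1: "EXL", 0: "IE"}
CACHEERR_FLAGS = {
    31: "ER", 30: "EC", 29: "ED", 28: "ET", 27: "ES",
    26: "EE", 25: "EB", 24: "EI", 23: "EW",
}
KSU_MODES = {0: "KERNEL", 1: "SUPERVISOR", 2: "USER"}


def set_flags(value, layout):
    """Return the names of the flags set in value, highest bit first."""
    return [name for bit, name in sorted(layout.items(), reverse=True)
            if (value >> bit) & 1]


def decode_status(value):
    """Decode interrupt masks (bits 15..8), mode (bits 4..3) and low flags."""
    masks = [f"IM{n}" for n in range(8, 0, -1) if (value >> (7 + n)) & 1]
    mode = KSU_MODES.get((value >> 3) & 3, "RESERVED")
    return masks + [f"MODE={mode}"] + set_flags(value, STATUS_FLAGS)


if __name__ == "__main__":
    print("Status  :", ",".join(decode_status(0xFF05)))
    print("CacheErr:", ",".join(set_flags(0xE400000F, CACHEERR_FLAGS)))
```

Under that assumed layout the decoded names agree with the flags the kernel printed for both registers in the dump, apart from the IPL field, which this sketch does not decode.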


The system is fully populated with RAM and my gut is telling me this is a memory error... am I right? If so, is it time to start swapping out SIMMs one at a time until I find the culprit(s)?

Thanks in advance!

_________________
It looks like a broken PIMM. You should try blowing out/cleaning the connectors between the processor module and the mainboard, but likely it's toast. Indigo2 processor modules are still relatively common on the used market.

_________________
:PI: :O2: :Indigo2IMP: :Indigo2IMP:
If you have an SGI machine and the graphics settings (refresh rate) for the IRIX desktop are higher than a new monitor supports, can you change the settings from the maintenance command prompt?
Sorry guys, think I might have found the answer -

The best way to come out of this situation is to go to the PROM monitor by pressing the Esc key when the system shows the initial screen. Type single at the >>> prompt.
Once you enter into single user mode, goto /usr/gfx/ucode/CRM/vof or /usr/gfx/ucode/CRIME/vof.
Copy the file 1024x768_60 to 1280x1024_96.

OR

go to /etc/rc2.d

vi S99setmonx

/usr/gfx/setmon -n 1024x768_60.

write and close.

Reboot.
I have seen this type of error caused by the CPU, but it can sometimes be caused by faulty RAM too... I have seen it occur on an Indigo that was giving cache errors, and it turned out to be the RAM... so it may be worth checking that out also...
definitely clean and reseat everything before trashing it

_________________
r-a-c.de
yup single user mode and change the settings. or put the disk in another machine and change it from there.

_________________
r-a-c.de
Thanks for the feedback guys - much appreciated. I'm thinking a can of compressed air to blow dust out of the nooks and crannies, and perhaps alcohol wipes to clean connectors and so forth - or is that dangerous and likely to induce a static charge? I've not really owned equipment old enough to need this sort of cleaning before, so I'm unsure how to proceed! :) (and don't want to make the situation worse!)

_________________
I've never seen a bad contact between a CPU and mainboard.

If the CPU module goes bad, it's almost always the cache memory, and IRIX should complain about uncorrectable cache errors.

I have to clean the memory of my Indigo2s once every couple of years, especially the R10K Indigo2 with 1024MB RAM, because the 'gigaram' chips don't have gold contact fingers.

Validate your cleaning with the diagnostics tests (press ESC to interrupt the boot process when prompted and select 'run diagnostics').

_________________
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Onyx2: (2x) :O3x02L:
In the museum : almost every MIPS/IRIX system.
Wanted : GM1 board for Professional Series GT graphics (030-0076-003, 030-0076-004)
SGI_Ben wrote:
Once you enter into single user mode, goto /usr/gfx/ucode/CRM/vof or /usr/gfx/ucode/CRIME/vof.
Copy the file 1024x768_60 to 1280x1024_96.

I may be missing something, but surely not this! Looks to me (while sleep-deprived) like you're wiping out the old definition for 1280x1024_96 - that way lies madness! Use setmon instead. You don't want to be that guy posting in a few years, "I have a new monitor, but every setting I try displays 1024x768_60..." :)

Kudos on finding your own answer, though, and following up in your thread!



Edit : Yeah yeah, "setmon" not "xsetmon". Sheesh, I said I was sleep deprived... ;)

_________________
Then? :IRIS3130: ... Now? :O3x02L: :1600SW: +MLA :Fuel: :Octane2: :Octane: :Indigo2IMP: ... Other: DEC :BA213: :BA123: Sun , DG AViiON , NeXT :Cube:
smj wrote:
SGI_Ben wrote:
Once you enter into single user mode, goto /usr/gfx/ucode/CRM/vof or /usr/gfx/ucode/CRIME/vof.
Copy the file 1024x768_60 to 1280x1024_96.

I may be missing something, but surely not this! Looks to me (while sleep-deprived) like you're wiping out the old definition for 1280x1024_96 - that way lies madness! Use xsetmon instead.


Damn good eyes! I just whooshed over the top without reading, it's such a FAQ. SMJ is quadruple correct, this method is a big fat NO WAY!!!

btw - have to use < setmon >, xsetmon won't run without the graphics up.

You can also do it from the prom, < setenv gfx low > but I am not sure that all SGI kompewters will recognize that.

_________________
lemon tree very pretty and the flower very sweet ...
Thanks for the help guys and pointing out the no no !
jan-jaap wrote:
I've never seen a bad contact between a CPU and mainboard.

If the CPU module goes bad, it's almost always the cache memory, and IRIX should complain about uncorrectable cache errors.

Almost, but not always... viewtopic.php?f=3&t=16724371&p=7334288

_________________
:Onyx: (Maradona) :Octane: (DavidVilla) A1186 (Xavi) d800 (Pique) d820 (Neymar)
A1370 (Messi) dp43tf (Puyol) A1387 (Abidal) A1408 (Guardiola)
Hm - can't remember what CPU it has in it. I don't think it's particularly high-end though.

_________________
Try stripping it to one quad of memory and seeing if the problem disappears. If it's still around then try a different single quad.

The error does specify cache ECC error, so it's likely a processor module (PM) fault. You can try touching up the solder connections if you're good.

_________________
Damn the torpedoes, full speed ahead!

There are those who say I'm a bit of a curmudgeon. To them I reply: "GET OFF MY LAWN!"

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O3x0: :ChallengeL: :O2000R: (single-CM)
Just to bump this to a conclusion: I can confirm it's one bank of memory that's duff. I spent a bit of time last night pulling the system apart and reseating stuff, and was still running into problems. Then I started taking out one bank of memory at a time and, lo and behold, with one set of 4x32MB SIMMs removed the system boots and runs perfectly time after time. With these modules installed the system hangs on booting and sometimes won't even power on properly... obviously there's a very unhappy module in there.

At some point I'll rotate all four through one of the banks of good modules and see if I can isolate the faulty one.

Thanks for the tips! Was great to get it running again.
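
The rotation described above (moving each suspect SIMM, one at a time, through a bank of known-good modules) can be sketched roughly as follows. This is only an illustration of the bookkeeping: `find_faulty_simms` and the `boots` callback are hypothetical names, and `boots` stands in for the manual step of fitting the modules and seeing whether the machine comes up.

```python
def find_faulty_simms(suspects, good_bank, boots):
    """Swap each suspect SIMM into a known-good bank, one at a time.

    suspects  -- labels for the SIMMs from the bad bank
    good_bank -- labels for a full bank of modules known to work
    boots     -- hypothetical callback: returns True if the machine boots
                 cleanly with the given bank fitted
    """
    faulty = []
    for simm in suspects:
        # Replace the first module of the good bank with the suspect one,
        # keeping the bank fully populated.
        trial_bank = [simm] + list(good_bank[1:])
        if not boots(trial_bank):
            faulty.append(simm)
    return faulty


if __name__ == "__main__":
    # Simulated run: pretend module "C" is the bad one.
    suspects = ["A", "B", "C", "D"]
    good_bank = ["G1", "G2", "G3", "G4"]
    print(find_faulty_simms(suspects, good_bank, lambda bank: "C" not in bank))
```

Testing one suspect per trial keeps each result unambiguous, at the cost of one power cycle per module.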
