SGI: Hardware

Unwell Indigo2s

This is one for the IRIX gurus out there.
For some reason not one, but two of my Indigo2s have developed problems within a couple of days of each other.

One machine has a failed SolidImpact board, and the EISA 3Com 3C597 died at the same time. A swift purchase from Ian Mapleson will put this right, but why did two boards fail at the same time? Please feel free to speculate. All fans are running as they should.

The other is panicking when in production. When put on the bench, it just runs and runs. Typical! I'm using icrash to analyse the crash reports, and one thing is consistent across them: the machine is always in thread lclintr2.4 when the panic occurs. What is that thread doing? Any ideas?

This system is using default kernel parameters (unusual for us; normally we tweak syssegsz and maxdmasz to enable big DMA transfers over SCSI).

*EDIT* : Did a bit more digging with icrash, looks like a GIO problem. Perhaps the second SCSI channel IC (which is in use during production but not on the bench) is going bad?
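
For anyone wanting to check the same pattern across their own saved reports, here is a minimal Python sketch that tallies which thread each report says the CPU was running. It assumes the reports have been saved as plain text files and that the CPU SUMMARY line is worded exactly as in the output pasted in this post; other icrash versions may phrase it differently. The function name tally_threads is purely illustrative.

```python
import re
import sys
from collections import Counter

# Matches the CPU SUMMARY line as it appears in the reports in this post.
# Assumption: other icrash versions may word this line differently.
SUMMARY_RE = re.compile(
    r"CPU (\d+) was in (\w+) mode running an? (\w+) named '([^']+)'"
)


def tally_threads(report_text):
    """Count each (cpu, mode, kind, name) combination found in the text."""
    counts = Counter()
    for match in SUMMARY_RE.finditer(report_text):
        cpu, mode, kind, name = match.groups()
        counts[(int(cpu), mode, kind, name)] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally.py report1.txt report2.txt ...
    combined = []
    for path in sys.argv[1:]:
        with open(path) as handle:
            combined.append(handle.read())
    for key, count in tally_threads("\n".join(combined)).most_common():
        print(count, key)
```

If every report lands in the same bucket, as the three below do, that points at whatever that interrupt thread services rather than at a random fault.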

Crash 1:

===========
CPU SUMMARY
===========

CPU 0 was in kernel mode running an xthread named 'lclintr2.4'

=====================
NMI SUMMARY FOR CPU 0
=====================

REGISTERS FOR CPU 0:

ERREPC: 0x0
SP: 0x0
RA: 0x0

LTICKS FOR CPU 0: 2

STACK TRACE FOR CPU 0:


Could not find a valid stack trace for CPU 0

INSTRUCTIONS NEAR PC IN NMI FOR CPU 0:

No valid ERROR_EPC found in NMI registers.

Crash 2:

===========
CPU SUMMARY
===========

CPU 0 was in kernel mode running an xthread named 'lclintr2.4'

=====================
NMI SUMMARY FOR CPU 0
=====================

REGISTERS FOR CPU 0:

ERREPC: 0x0
SP: 0x0
RA: 0x0

LTICKS FOR CPU 0: 1

STACK TRACE FOR CPU 0:


Could not find a valid stack trace for CPU 0

INSTRUCTIONS NEAR PC IN NMI FOR CPU 0:

No valid ERROR_EPC found in NMI registers.

Crash 3:

===========
CPU SUMMARY
===========

CPU 0 was in kernel mode running an xthread named 'lclintr2.4'

=====================
NMI SUMMARY FOR CPU 0
=====================

REGISTERS FOR CPU 0:

ERREPC: 0x0
SP: 0x0
RA: 0x0

LTICKS FOR CPU 0: 1

STACK TRACE FOR CPU 0:


Could not find a valid stack trace for CPU 0

INSTRUCTIONS NEAR PC IN NMI FOR CPU 0:

No valid ERROR_EPC found in NMI registers.


=======================
SLEEPING PROCESS STATES
=======================
For the first machine, perhaps the failing SI board shorted into the backplane/riser board for the GIO/EISA slots.
You may want to inspect all boards and traces closely to avoid roasting the replacement boards.