Gerhard.Lenerz wrote:
A while ago I tried starting up the machine with a bare minimum setup. Actually aside from the backplane and the CPU board there was nothing. Again the machine keeps repeating the same error over and over again.
Looking at the messages now I realize something that I did miss previously. Something seems to have changed since I recorded the first error which I previously posted. Can't say what caused that without further experiments.
I removed the "code" tags to make the bold work.
Old:
EXCEPTION: <vector=NORMAL>
Exception pc: 0xbfc108a0
Cause register: 0x30001008<CE=3,IP5,EXC=RMISS>
Status register: 0x80000<CM,IPL=8>
Bad Vaddress: 0xc0000000
Error Addr register: 0x17b40
Local I/O interrupt register: 0xff <>
Parity error register: 0x0
Registers (in hex):
arg: c98cf600 ffffffff 15180 0
tmp: 0 0 0 0 0 0 0 0
sve: a0017b93 bfc268c8 bfc268ca 1 54 0 1 800
t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
gp 0 fp bfc04bd0 sp a0017b64 ra bfc10744
exit(-1) called
New:
EXCEPTION: <vector=NORMAL>
Exception pc: 0xbfc108a0
Cause register: 0x30005008<CE=3,IP7,IP5,EXC=RMISS>
Status register: 0x80000<CM,IPL=8>
Bad Vaddress: 0xc0000000
Error Addr register: 0x17b40
Local I/O interrupt register: 0xff <>
Parity error register: 0x0
Registers (in hex):
arg: c98cf600 ffffffff 15180 0
tmp: 0 0 0 0 0 0 0 0
sve: a0017af7 bfc268c8 bfc268ca 1 54 0 1 0
t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
gp 0 fp bfc04bd0 sp a0017ac8 ra bfc10744
exit(-1) called
You have to read the CAUSE register as a bitmask. In the new dump, bit 0x00004000 (named IP7; this has nothing to do with a PowerSeries CPU board, btw) is set where it wasn't before.
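To make the bitmask reading concrete, here is a rough sketch that splits the two Cause values from the dumps into their fields. The field positions (CE in bits 28:29, IP in bits 8:15, exception code in bits 2:6) are my assumption of the usual MIPS layout, and the IP names are numbered from 1 simply to match what the PROM prints (0x1000 = IP5, 0x4000 = IP7). The function name decode_cause is made up for this example.

```python
def decode_cause(cause):
    """Split a MIPS Cause register value into CE, pending IP bits and ExcCode.

    Assumed layout: CE = bits 28:29, IP = bits 8:15, ExcCode = bits 2:6.
    IP names are numbered from 1 to match the PROM output in the dumps.
    """
    ce = (cause >> 28) & 0x3
    ip_field = (cause >> 8) & 0xFF
    pending = [f"IP{n + 1}" for n in range(8) if ip_field & (1 << n)]
    exc_code = (cause >> 2) & 0x1F
    return {"CE": ce, "IP": pending, "ExcCode": exc_code}


old = decode_cause(0x30001008)
new = decode_cause(0x30005008)
changed_bits = 0x30001008 ^ 0x30005008  # bits that differ between the dumps

print(old)
print(new)
print(hex(changed_bits))
```

Exception code 2 is the TLB miss on a load, which the PROM prints as RMISS. The XOR of the two values shows that only the 0x4000 bit differs.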
The PC where it crashes is the same, in other words it crashes at exactly the same code location. Addresses of the form 0xBFCxxxxx are ROM addresses (PROM code), so it crashes during self test, but you knew that already.
The exception handler has no way to recover from this error, so it reboots the system ("exit(-1) called"), which of course results in an endless loop.
The fact that CE=3 is set is a little confusing. It claims that a coprocessor (#3) caused the crash, but AFAIK there is no CP3 in this system (there's always a CP0, and CP1 is the math coprocessor).
The CAUSE register describes the state of the CPU when an exception happens. Bits 8:15 (IP bits) describe Interrupts Pending when the exception happened. To make it more confusing, IP is subtly different from the rest of the CAUSE register fields; it doesn't indicate what happened when the exception took place, but rather shows what is happening now. In your case INT5 and INT7 are pending.
What went wrong is this: EXC=RMISS: the CPU tried to read from memory and failed. On memory exceptions, the BadVaddr Register contains the address whose reference led to the exception: 0xc0000000. This (virtual) address is the base address of the KSEG2 address space. This area is only accessible in kernel mode and it's translated through the MMU.
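For reference, a small sketch of how the 32-bit MIPS virtual address space is carved up, applied to the two addresses from the dump. The segment boundaries are the standard MIPS ones as far as I know; the function name classify_vaddr and the tuple it returns are just made up for this example.

```python
def classify_vaddr(vaddr):
    """Return (segment, goes_through_tlb, physical_address_or_None).

    Standard 32-bit MIPS map (assumed here):
      kuseg 0x00000000-0x7FFFFFFF  user, translated by the TLB
      kseg0 0x80000000-0x9FFFFFFF  kernel, unmapped, cached
      kseg1 0xA0000000-0xBFFFFFFF  kernel, unmapped, uncached
      kseg2 0xC0000000-0xFFFFFFFF  kernel, translated by the TLB
    """
    vaddr &= 0xFFFFFFFF
    if vaddr < 0x80000000:
        return ("kuseg", True, None)
    if vaddr < 0xA0000000:
        return ("kseg0", False, vaddr & 0x1FFFFFFF)
    if vaddr < 0xC0000000:
        return ("kseg1", False, vaddr & 0x1FFFFFFF)
    return ("kseg2", True, None)


pc = classify_vaddr(0xBFC108A0)      # Exception pc from the dump
bad = classify_vaddr(0xC0000000)     # Bad Vaddress from the dump

print(pc[0], hex(pc[2]))
print(bad[0], "mapped through the TLB" if bad[1] else "unmapped")
```

So the PC sits in kseg1 (uncached and unmapped, physical 0x1FC108A0, i.e. in the PROM), while the failing read targets kseg2, which needs a valid TLB entry to succeed.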
Why it would fail is a little harder to say. It could be any component in the memory subsystem:
* The main memory itself.
* The primary cache memory.
* The secondary cache memory.
* The MMU which translates between physical and virtual addresses.
* Some other part of main board logic (buffers, drivers, ...)
You've eliminated main memory, I believe. Does it crash before or after the cache memory diagnostics? Have you tried to boot it with no memory at all installed, and did that make any difference?
You could try to set the debug switches to halt in PON mode. This is an extremely crude monitor program, but it runs entirely inside the CPU + cache memory. If you get that far, you can have some confidence in the CPU and can run some cache diagnostics from inside PON, I believe. Expand the "circle of trust" starting from something that works, rather than making a guess from a situation that doesn't work.
If you can eliminate CPU, L1 and L2 cache from the list, and it's not main RAM, then it must be a main board problem, probably a driver chip for the main RAM. There's a bunch of 74AS1004A chips (drivers) and 74AS623 chips (bus transceivers) around the SIMMs. Unfortunately they are not socketed.
If only those jumpers in the same area were documented. I bet they can be used to configure/enable/disable banks of memory. Could be very helpful in your case.
Gerhard.Lenerz wrote:
As usual it complains about not being able to communicate with the graphics option,
Let me guess,
gfx: can't reset GM textport
In my system the CPU board works, and I can sometimes enable the graphics (sometimes it fails). It is good enough to get a graphics PROM console, but any attempt to boot into IRIX with the graphics installed in the system will cause bus errors, crashes, ... The GM board probably contains a memory buffer which is mapped into the address space of the CPU for communication between the two, and I have a feeling mine has some dodgy RAM chips...
Well, at least with the GM/GE boards pulled mine boots into IRIX. Slower than anything I've ever seen.