Nekonomicon - Another IRIS 4D in trouble...

Gerhard.Lenerz
Enthusiast
Who joined Jan. 15, 2004, 2:20 a.m.
and authored 280 notes

Wrote the following at Oct. 31, 2009, 8:39 a.m...

Now that the winter is close I also decided to take a look at some machines that I didn't power up for a long time. Of course trouble was expected, so it isn't a big surprise that the 4D/50G doesn't come up well. As usual it complains about not being able to communicate with the graphics option, but then there is something new further on in POST:

Code:

  EXCEPTION: <vector=NORMAL>
  Exception pc: 0xbfc108a0
  Cause register: 0x30001008<CE=3,IP5,EXC=RMISS>
  Status register: 0x80000<CM,IPL=8>
  Bad Vaddress: 0xc0000000
  Error Addr register: 0x17b40
  Local I/O interrupt register: 0xff <>
  Parity error register: 0x0
  Registers (in hex):
  arg: c98cf600 ffffffff 15180 0
  tmp: 0 0 0 0 0 0 0 0
  sve: a0017b93 bfc268c8 bfc268ca 1 54 0 1 800
  t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
  gp 0 fp bfc04bd0 sp a0017b64 ra bfc10744
  exit(-1) called

Anyone got an idea what is wrong? I believe the machine is still doing POST as the LED shows "D" while the error is shown again and again and again.

During Christmas season or whenever I have a couple of days off I'll try reseating bits and pieces. After all, the machine hasn't been fully up and running since July 2004. I hope that will fix things.

On a brighter note, good old 'boromir' (4D/420VGX) is doing just fine. Despite the bent chassis and the wrong skins.

Gerhard

_________________
www.sgistuff.net - SGI info since 2001

:4D70G:

jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at Nov. 2, 2009, 6:39 a.m...

Hi Gerhard,

The system is taking an exception, and the 'cause' register tells you the reason you're in the exception handler:

Code:

Cause register: 0x30001008<CE=3,IP5,EXC=RMISS>

RMISS means a 'read TLB miss'; usually this means there's an error reading the memory address.
In other words: time to reseat / remove / swap around the memory sticks.
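
For anyone who wants to check the decoding by hand, here is a minimal Python sketch that splits the Cause value into the same fields the PROM prints. It assumes the R2000/R3000 layout (exception code in bits 2-5, pending interrupts in bits 8-15, coprocessor number in bits 28-29). The function name `decode_cause` is made up for illustration, the interrupt numbering starting at IP1 is inferred from the dump, and only the name RMISS comes from the PROM output; the other exception names are the generic MIPS mnemonics and may not match what the PROM prints.

```python
# Standard MIPS R2000/R3000 exception mnemonics, indexed by ExcCode.
# Only code 2 is named as in the dump above (RMISS); the rest are the
# generic MIPS names and are an assumption about what the PROM prints.
EXC_NAMES = {
    0: "Int", 1: "Mod", 2: "RMISS", 3: "TLBS", 4: "AdEL", 5: "AdES",
    6: "IBE", 7: "DBE", 8: "Sys", 9: "Bp", 10: "RI", 11: "CpU", 12: "Ov",
}


def decode_cause(value):
    """Split a 32-bit Cause register value into its CE, IP and EXC fields."""
    exc_code = (value >> 2) & 0xF      # bits 2-5: exception code
    ip_bits = (value >> 8) & 0xFF      # bits 8-15: pending interrupts
    ce = (value >> 28) & 0x3           # bits 28-29: coprocessor number
    # Numbering starts at IP1 for bit 8, inferred from the dump (0x1000 -> IP5).
    pending = [f"IP{n + 1}" for n in range(7, -1, -1) if ip_bits & (1 << n)]
    return {"CE": ce, "IP": pending, "EXC": EXC_NAMES.get(exc_code, str(exc_code))}


print(decode_cause(0x30001008))
```

For the value above this gives CE=3, IP5 and RMISS, matching the text in the angle brackets.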

Gerhard.Lenerz wrote:

Code:

exit(-1) called

[...]
Anyone got an idea what is wrong? I believe the machine is still doing POST as the LED shows "D" while the error is shown again and again and again.

If the PROM calls 'exit()' it resets the system. Of course, it will hit the same error and everything will happen all over.

Gerhard.Lenerz wrote:

On a brighter note, good old 'boromir' (4D/420VGX) is doing just fine. Despite the bent chassis and the wrong skins.

Those skins used to be on my 4D/210GTX, no? My own 4D/440VGX has also started to act up. It looks like a glitch in the power resets the system after a couple of minutes. Looks like I'll have to get familiar with the PowerOne PSUs again this winter.

_________________
Now this is a deep dark secret, so everybody keep it quiet

It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi
Currently in commercial service:

(2x)

In the museum: almost every MIPS/IRIX system.

dexter1
Moderator
Who joined Feb. 20, 2003, 7:57 a.m.
and authored 2057 notes

Wrote the following at Nov. 2, 2009, 7 a.m...

jan-jaap wrote:

My own 4D/440VGX has also started to act up. It looks like a glitch in the power resets the system after a couple of minutes. Looks like I'll have to get familiar with the PowerOne PSUs again this winter.

Could it be a loose contact somewhere? Maybe the main switch? My Crimson suffered from a main switch that didn't make full contact and issued a reset every now and then. Replacing it with a new one solved the problem immediately.

_________________
:Crimson:

European nekoware mirror, updated twice a day: http://www.mechanics.citg.tudelft.nl/~everdij/nekoware
ftp://mech001.citg.tudelft.nl rsync mech001.citg.tudelft.nl::nekoware

jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at Nov. 3, 2009, 8:22 a.m...

dexter1 wrote:

Could it be a loose contact somewhere? Maybe the main switch? My Crimson suffered from a main switch that didn't make full contact and issued a reset every now and then. Replacing it with a new one solved the problem immediately.

I can try, I think I have a spare breaker and a front panel with on/off switch for a 4D deskside. It's certainly easier to try than removing the PSU.


Gerhard.Lenerz
Enthusiast
Who joined Jan. 15, 2004, 2:20 a.m.
and authored 280 notes

Wrote the following at Nov. 3, 2009, 2:21 p.m...

jan-jaap wrote:

RMISS means a 'read TLB miss'; usually this means there's an error reading the memory address.
In other words: time to reseat / remove / swap around the memory sticks.

Given the system has been sitting idle for some years, it may well be that it is just a SIMM that doesn't have a proper connection to the mainboard. After all, the core POST seems to work fine, although there is some complaint about corrupted PROM variables (not surprising).

jan-jaap wrote:

Those skins used to be on my 4D/210GTX, no?

Exactly. Nice red and brown GTX skins.

Gerhard


jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at Nov. 5, 2009, 3:15 p.m...

jan-jaap wrote:

dexter1 wrote:

Could it be a loose contact somewhere? Maybe the main switch? My Crimson suffered from a main switch that didn't make full contact and issued a reset every now and then. Replacing it with a new one solved the problem immediately.

I can try, I think I have a spare breaker and a front panel with on/off switch for a 4D deskside. It's certainly easier to try than removing the PSU.

I'll be damned. I changed both the breaker on the back and the on/off switch panel on the front, and at least for the time being it's working beautifully


dexter1
Moderator
Who joined Feb. 20, 2003, 7:57 a.m.
and authored 2057 notes

Wrote the following at Nov. 6, 2009, 3:57 a.m...

jan-jaap wrote:

I'll be damned. I changed both the breaker on the back and the on/off switch panel on the front, and at least for the time being it's working beautifully

Awesome, Jan-Jaap! This is exactly why I really love these old systems: using a soldering iron, steel file and pliers you can repair these machines very quickly.


jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at Nov. 6, 2009, 5 a.m...

dexter1 wrote:

Awesome, Jan-Jaap! This is exactly why I really love these old systems: using a soldering iron, steel file and pliers you can repair these machines very quickly.

True, but in this case I just used parts from another PowerSeries I scrapped at some point.

I suspect the problem was caused by the switch on the old front panel and will probably replace that, if only to keep my supply of spares intact. And no, that doesn't even require a soldering iron :mrgreen:

As an extra bonus, it then successfully passed all the VGX diagnostics, so it can celebrate its 18th birthday in good health.

Next I'm going to make a little adapter for my CrystalEyes 3D glasses. Performer/Inventor grab all 4 CPUs so the 4D/440 does quite well. I love messing with these old systems


SAQ
Old Salt
Who joined July 19, 2006, 8:37 a.m.
and authored 3874 notes

Wrote the following at Nov. 6, 2009, 12:25 p.m...

dexter1 wrote:

jan-jaap wrote:

I'll be damned. I changed both the breaker on the back and the on/off switch panel on the front, and at least for the time being it's working beautifully

Awesome, Jan-Jaap! This is exactly why I really love these old systems: using a soldering iron, steel file and pliers you can repair these machines very quickly.

Nah - that's VAX-11/780 and some PDP-11/PDP-8 models. SGI has always used some custom VLSIs and PALs.

_________________
Damn the torpedoes, full speed ahead!

:Indigo:

jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at Nov. 17, 2009, 5:52 a.m...

jan-jaap wrote:

I'll be damned. I changed both the breaker on the back and the on/off switch panel on the front, and at least for the time being it's working beautifully

Bah, the problem returned. :evil:

Must be in the PSU after all...


Gerhard.Lenerz
Enthusiast
Who joined Jan. 15, 2004, 2:20 a.m.
and authored 280 notes

Wrote the following at April 4, 2010, 8:51 a.m...

Gerhard.Lenerz wrote:

Given the system has been sitting idle for some years, it may well be that it is just a SIMM that doesn't have a proper connection to the mainboard. After all, the core POST seems to work fine, although there is some complaint about corrupted PROM variables (not surprising).

I tried reseating memory today and powered it up with two different sets of memory (minimum configuration each time), and I still get the same error. I also tried it without GFX installed, as this used to give me headaches from time to time.

I'm not sure but I think there is one Ethernet adapter I could pull on the off chance that it is a VMEbus related problem.


Gerhard.Lenerz
Enthusiast
Who joined Jan. 15, 2004, 2:20 a.m.
and authored 280 notes

Wrote the following at June 21, 2010, 2:22 p.m...

A while ago I tried starting up the machine with a bare minimum setup. Actually aside from the backplane and the CPU board there was nothing. Again the machine keeps repeating the same error over and over again.

Looking at the messages now I realize something that I did miss previously. Something seems to have changed since I recorded the first error which I previously posted. Can't say what caused that without further experiments.

Code:

  EXCEPTION: <vector=NORMAL>
  Exception pc: 0xbfc108a0
  Cause register: 0x3000[b]5008[/b]<CE=3,IP7,IP5,EXC=RMISS>
  Status register: 0x80000<CM,IPL=8>
  Bad Vaddress: 0xc0000000
  Error Addr register: 0x17b40
  Local I/O interrupt register: 0xff <>
  Parity error register: 0x0
  Registers (in hex):
  arg: c98cf600 ffffffff 15180 0
  tmp: 0 0 0 0 0 0 0 0
  sve: [b]a0017af7[/b] bfc268c8 bfc268ca 1 54 0 1 0
  t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
  gp 0 fp bfc04bd0 sp [b]a0017ac8[/b] ra bfc10744
  exit(-1) called

I didn't expect that, but this may be a small sign that not all is lost. Both times I saved longer parts of the session, so I can say that the error didn't change while it was repeated. A hard reset doesn't change anything either.

Gerhard


jan-jaap
Old Salt
Who joined June 17, 2004, 11:35 a.m.
and authored 2689 notes

Wrote the following at June 22, 2010, 5:38 a.m...

Gerhard.Lenerz wrote:

A while ago I tried starting up the machine with a bare minimum setup. Actually aside from the backplane and the CPU board there was nothing. Again the machine keeps repeating the same error over and over again.

Looking at the messages now I realize something that I did miss previously. Something seems to have changed since I recorded the first error which I previously posted. Can't say what caused that without further experiments.

I removed the "code" tags to make the bold work.

Old:

EXCEPTION: <vector=NORMAL>
Exception pc: 0xbfc108a0
Cause register: 0x30001008<CE=3,IP5,EXC=RMISS>
Status register: 0x80000<CM,IPL=8>
Bad Vaddress: 0xc0000000
Error Addr register: 0x17b40
Local I/O interrupt register: 0xff <>
Parity error register: 0x0
Registers (in hex):
arg: c98cf600 ffffffff 15180 0
tmp: 0 0 0 0 0 0 0 0
sve: a0017b93 bfc268c8 bfc268ca 1 54 0 1 800
t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
gp 0 fp bfc04bd0 sp a0017b64 ra bfc10744
exit(-1) called

New:

EXCEPTION: <vector=NORMAL>
Exception pc: 0xbfc108a0
Cause register: 0x30005008<CE=3,IP7,IP5,EXC=RMISS>
Status register: 0x80000<CM,IPL=8>
Bad Vaddress: 0xc0000000
Error Addr register: 0x17b40
Local I/O interrupt register: 0xff <>
Parity error register: 0x0
Registers (in hex):
arg: c98cf600 ffffffff 15180 0
tmp: 0 0 0 0 0 0 0 0
sve: a0017af7 bfc268c8 bfc268ca 1 54 0 1 0
t8 ff00 t9 502e8000 at 1 v0 c0000000 v1 f65da k1 bfc04234
gp 0 fp bfc04bd0 sp a0017ac8 ra bfc10744
exit(-1) called

You have to read the CAUSE register as a bitmask, and it seems that in the new situation bit 0x00004000 (named IP7; this has nothing to do with a PowerSeries CPU board, btw) has been set where it wasn't before.
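
A quick way to see exactly which bits differ is to XOR the two values. The sketch below is only an illustration using the numbers from the two dumps; the dictionary keys (`cause`, `sve0`, `sp`) are labels invented here, with `sve0` standing for the first value on the "sve:" line. Treating `sve0` as an address, like `sp`, is an assumption.

```python
# Values copied from the two dumps posted in this thread.
old = {"cause": 0x30001008, "sve0": 0xa0017b93, "sp": 0xa0017b64}
new = {"cause": 0x30005008, "sve0": 0xa0017af7, "sp": 0xa0017ac8}

# XOR leaves a 1 in every bit position that differs between the dumps.
changed = old["cause"] ^ new["cause"]
# Bits that differ and are set in the new value were newly set.
newly_set = changed & new["cause"]
print(f"changed bits: {changed:#010x}, newly set: {newly_set:#010x}")

# sp and the first sve value look like addresses, so compare by subtraction.
for name in ("sve0", "sp"):
    print(f"{name} is lower by {old[name] - new[name]:#x}")
```

The only Cause bit that differs is 0x00004000, and both `sp` and the first sve value are lower by the same amount (0x9c) in the new dump. What that offset means is not something the dumps alone can tell.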

The PC where it crashes is the same, in other words it crashes at exactly the same code location. Locations 0xBFCxxxxx are ROM addresses (PROM code). In other words, it crashes during self test, but you knew that already

The exception handler has no way to recover from this error, so it reboots the system ("exit(-1) called"). Which results in an endless loop of course.

The fact that CE=3 is set is a little confusing, it claims that a coprocessor (#3) caused the crash but AFAIK there is no CP3 in this system (there's always a CP0, and CP1 is the math coprocessor).

The CAUSE register describes the state of the CPU when an exception happens. Bits 8-15 (the IP bits) describe Interrupts Pending when the exception happened. To make it more confusing, IP is subtly different from the rest of the CAUSE register fields: it doesn't indicate what happened when the exception took place, but rather shows what is happening now. In your case INT5 and INT7 are pending.

What went wrong is this: EXC=RMISS, meaning the CPU tried to read from memory and failed. On memory exceptions, the BadVaddr register contains the address whose reference led to the exception: 0xc0000000. This (virtual) address is the base address of the KSEG2 address space. This area is only accessible in kernel mode and is translated through the MMU.
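
To put the addresses from the dump into context, here is a small Python sketch of the usual 32-bit MIPS address map. It assumes the standard segment boundaries (kuseg below 0x80000000, then kseg0, kseg1 and kseg2); the helper name `classify` is hypothetical.

```python
def classify(vaddr):
    """Return (segment, physical address), or None where the TLB decides."""
    if vaddr < 0x80000000:
        return ("kuseg", None)                # user space, mapped via TLB
    if vaddr < 0xA0000000:
        return ("kseg0", vaddr - 0x80000000)  # kernel, unmapped, cached
    if vaddr < 0xC0000000:
        return ("kseg1", vaddr - 0xA0000000)  # kernel, unmapped, uncached
    return ("kseg2", None)                    # kernel, mapped via TLB


# Addresses taken from the exception dump in this thread.
for label, addr in [("Exception pc", 0xbfc108a0),
                    ("Bad Vaddress", 0xc0000000),
                    ("sp", 0xa0017b64)]:
    segment, phys = classify(addr)
    where = "TLB-mapped" if phys is None else f"physical {phys:#x}"
    print(f"{label}: {addr:#010x} -> {segment} ({where})")
```

The PC and stack pointer both land in kseg1, which needs no translation, while the faulting address is the very first byte of kseg2, the only one of the three that goes through the TLB.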

Why it would fail is a little harder to say. It could be any component in the memory subsystem:
* The main memory itself.
* The primary cache memory.
* The secondary cache memory.
* The MMU which translates between physical and virtual addresses.
* Some other part of main board logic (buffers, drivers, ...)

You've eliminated main memory, I believe. Does it crash before or after the cache memory diagnostics? Have you tried to boot it with no memory installed at all, and did that make any difference?

You could try to set the debug switches to halt in PON mode. This is an extremely crude monitor program, but it runs entirely inside the CPU + cache memory. If you get that far, you can have some confidence in the CPU and can run some cache diagnostics from inside PON I believe. Expand the "circle of trust" starting from something that works, rather than making a guess from a situation that doesn't work.

If you can eliminate CPU, L1 and L2 cache from the list, and it's not main RAM, then it must be a main board problem, probably a driver chip for the main RAM. There's a bunch of 74AS1004A chips (drivers) and 74AS623 chips (bus transceivers) around the SIMMs. Unfortunately they are not socketed.

If only the jumpers in the same area were documented. I bet they can be used to configure/enable/disable banks of memory, which could be very helpful in your case.

Gerhard.Lenerz wrote:

As usual it complains about not being able to communicate with the graphics option,

Let me guess, gfx: can't reset GM textport

In my system the CPU board works, and I can sometimes enable the graphics (sometimes it fails). It is good enough to get a graphics PROM console, but any attempt to boot into IRIX with the graphics installed in the system will cause bus errors, crashes, ... The GM board probably contains a memory buffer which is mapped into the address space of the CPU for communication between the two, and I have a feeling mine has some dodgy RAM chips...

Well, at least with the GM/GE boards pulled mine boots into IRIX. Slower than anything I've ever seen. :lol:


hamei
Old Salt
Who joined Feb. 24, 2004, 5:10 p.m.
and authored 6623 notes

Wrote the following at June 22, 2010, 10:14 a.m...

jan-jaap wrote:

Slower than anything I've ever seen. :lol:

I had a K&T that required loading ten or twelve 8" reels of paper tape to boot. That took a while
