SGI: Hardware

Fuel: is my VPro V10 broken?

So, I finally got myself a Fuel. The hard drive was dead, so I replaced it. Luckily I also bought an SGI Software Library from the same fellow I bought the Fuel from, so I could install something on the new hard drive. Anyway, there is a slight problem: on boot, there is no output on my screen (LCD monitor, 1280x1024, connected via a DVI-to-VGA adapter). Sometimes when I boot there are vertical stripes on the screen. Not good. I connected a null modem cable to the first (lower) serial port, and got this output:
Code:
IP35 PROM SGI Version 6.170  built 03:59:07 PM Aug  6, 2003
Running in DDR mode
Testing/Initializing memory ...............      DONE
Copying PROM code to memory ...............      DONE
Discovering local IO ......................      DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30359 usec
Waiting for peers to complete discovery....      DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
Intializing any CPUless nodes..............      DONE
Checking partitioning information .........      DONE
No other nodes present; becoming partition master
Loading BASEIO prom .......................      DONE

BASEIO PROM Monitor SGI Version 6.170  built 03:56:02 PM Aug  6, 2003 (BE64)
1 CPUs on 1 nodes found.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
Found mouse on port 0
Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
Board version 1 - Buzz revision 3B
On board sdram size: 32 Mb
Cas latency: CAS 3
4 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1+ Device Vendor Product: IBM DPSS-318350N
2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)


Walking SCSI Adapter 1, (pci id 1)
1- 2- 3- 4- 5- 6+ Device Vendor Product: TOSHIBA DVD-ROM SD-M1401
7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)

Initializing PROM Device drivers ..........             DONE
odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.
odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.
odsy dfifo timeout
odsy dfifo timeout
odsy dfifo timeout
odsy dfifo timeout

(lots more of the "odsy dfifo timeout" lines snipped), followed by:
Code:
A 000: *** TLB Refill Exception on node 0
A 000: *** EPC: 0xc00000001fc44de4 (0xc00000001fc44de4)
A 000: *** Press ENTER to continue.
Ouch. After pressing Enter, I can do a reset, and sometimes the machine comes up like this:
Code:
A 000: POD IOC3 Unc>  reset
Resetting the system...


IP35 PROM SGI Version 6.170  built 03:59:07 PM Aug  6, 2003
Running in DDR mode
Testing/Initializing memory ...............      DONE
Copying PROM code to memory ...............      DONE
Discovering local IO ......................      DONE
Discovering NUMAlink connectivity .........
Local hub NUMAlink is down.
*** Local network link down
DONE
Found 1 objects (1 hubs, 0 routers) in 30360 usec
Waiting for peers to complete discovery....      DONE
No other nodes present; becoming global master
Global master is /hw/rack/001/bay/01
Intializing any CPUless nodes..............      DONE
Checking partitioning information .........      DONE
No other nodes present; becoming partition master
Loading BASEIO prom .......................      DONE

BASEIO PROM Monitor SGI Version 6.170  built 03:56:02 PM Aug  6, 2003 (BE64)
1 CPUs on 1 nodes found.
Automatic update of PROM environment disabled

PS/2 Keyboard & Mouse diagnostics
Found mouse on port 0
Found keyboard on port 1
PS/2 Keyboard & Mouse diagnostics passed

Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
Board version 1 - Buzz revision 3B
On board sdram size: 32 Mb
Cas latency: CAS 3
4 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics

Installing PROM Device drivers ............
Base I/O Ethernet set to /dev/ethernet/ef0
Installing Graphics Console...
graphics install: searching for pipe 0

Walking SCSI Adapter 0, (pci id 1)
1+ Device Vendor Product: IBM DPSS-318350N
2- 3- 4- 5- 6- 7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)


Walking SCSI Adapter 1, (pci id 1)
1- 2- 3- 4- 5- 6+ Device Vendor Product: TOSHIBA DVD-ROM SD-M1401
7- 8- 9- 10- 11- 12- 13- 14- 15- = 1 device(s)

Initializing PROM Device drivers ..........             DONE
odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.
odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.
ALERT: odsy_tpinit: giving up ... no tport for now.



IRIS console login:

I have taken out the VPro graphics card; the connectors look fine (no corrosion or other stuff), and I tried re-seating it a few times. The problem remains.
Is my graphics card broken?
Or is there anything else I can do?
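
For anyone comparing serial console captures like the two above, here is a minimal sketch that collapses runs of identical consecutive lines, so a long burst of "odsy dfifo timeout" shows up as a single line with a count. The function name `collapse_repeats` and the "(xN)" suffix are illustrative choices, not part of any SGI tool, and the sample lines are copied from the first log.

```python
from itertools import groupby


def collapse_repeats(lines):
    """Collapse runs of identical consecutive lines into 'line (xN)'."""
    out = []
    for line, run in groupby(lines):
        count = sum(1 for _ in run)
        out.append(line if count == 1 else f"{line} (x{count})")
    return out


# A few lines from the end of the first console log above.
sample = [
    "odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.",
    "odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.",
    "odsy dfifo timeout",
    "odsy dfifo timeout",
    "odsy dfifo timeout",
    "odsy dfifo timeout",
    "A 000: *** TLB Refill Exception on node 0",
]

for line in collapse_repeats(sample):
    print(line)
```

Only adjacent duplicates are merged, so the order of events in the log is preserved.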

Here is a hinv -v of the system:
Code:
>> hinv -v
IP35 Node Board, Module 001c01
ASIC BEDROCK Rev 2, 200 MHz, (nasid 0)
Processor A: 500 MHz R14000 Rev 2.3
Secondary Cache 2MB 250MHz Tap 0xa , (cpu 0)
R14010FPC Rev 2.3
Memory on board, 1024 MBytes (Standard)
Bank 0, 512 MBytes (Standard)  <-- (Software Bank 0)
Bank 1, 512 MBytes (Standard)
IBRICK Bridge, Module 001c01
ASIC BRIDGE Rev 2, (widget 14)
IBRICK Bridge, Module 001c01
ASIC BRIDGE Rev 2, (widget 15)
adapter PCI (SCSI interface) Rev 6
(pci id 1)
peripheral DISK, BUS 0, ID 1, IBM DPSS-318350N
peripheral CDROM, BUS 1, ID 6, TOSHIBA DVD-ROM SD-M1401
adapter IOC3 Rev 1
(pci id 4)
controller multi function SuperIO
controller Ethernet Rev 1
adapter USB (OHCI interface)
(pci id 5)
ASIC XBOW Rev 2, on CBrick, Module 001c01
ODYSSEY Graphics Board, Module 001c01
>>
Any help appreciated.

_________________
Torfinn
Welcome to Nekochan!
Quote:
Code:
Graphics diagnostics

Odyssey board #0 found on nasid 0
Running Odyssey xtalk sanity diag...
Board version 1 - Buzz revision 3B
On board sdram size: 32 Mb
Cas latency: CAS 3
4 banks by sdram module
Running Odyssey Buzz registers diag...
Device passed diagnostics


This looks encouraging!

Quote:
Code:
odsy: frame buffer/fullscreen accum config causes oversubscription of sdram.


I'd almost say the graphics board is set to a config which requires more frame buffer than the 32MB installed on a V10. Maybe the system used to have a V12 and the config is still stored somewhere other than the graphics board?

You say you can (sometimes) boot it into IRIX. Next time, try this from IRIX:
Code:
/usr/gfx/setmon -n 1280x1024_60

and reboot.

_________________
Now this is a deep dark secret, so everybody keep it quiet :)
It turns out that when reset, the WD33C93 defaults to a SCSI ID of 0, and it was simpler to leave it that way... -- Dave Olson, in comp.sys.sgi

Currently in commercial service: Image :Octane2: :Onyx2: (2x) :0300:
In the museum: almost every MIPS/IRIX system.
Thanks for the friendly welcome.
Yes, sometimes I can boot into IRIX. But then it says this:
Code:
WARNING: odsy board 0: UNCORRECTABLE_ECC_ERROR received
ALERT: odsy board 0: Graphics error
odsy flags: 0x4<UNCORRECTABLE_ECC_ERROR>
odsy status0: 0x10024080<CFIFO_ENABLED,CFIFO_LW,XRFIFO_LW,RASTER_SYNC_SRC=unset,CFIFO_SYNC_SRC=unset,DMA_SYNC_SRC=unset>
and when I try to run setmon then I get this:
Code:
# /usr/gfx/setmon -n 1280x1024_60
Cannot open display
: Error 0

Forcing it doesn't help either (not that I thought it would):
Code:
# DISPLAY=:0.0 /usr/gfx/setmon -n 1280x1024_60
Cannot open display
: Connection refused
I even tried running it on a remote display (from my FreeBSD workstation):
Code:
$ /usr/gfx/setmon -n 1280x1024_60
setmon: must run on local hardware.
Try setting the DISPLAY to :0.0. or :0.1
(of course setting DISPLAY doesn't work here either)

_________________
Torfinn
Is there a way to set / check the screen mode / video mode from the PROM?

_________________
Torfinn
tingo wrote:
Code:
WARNING: odsy board 0: UNCORRECTABLE_ECC_ERROR received
ALERT: odsy board 0: Graphics error
odsy flags: 0x4<UNCORRECTABLE_ECC_ERROR>

This looks like a hardware (memory) error to me. I've never heard of a graphics board with ECC memory though. Maybe the ECC error is in main memory. You could remove the main memory DIMMs, clean the contacts and reseat them, and see if that makes any difference.

Otherwise, you could start the hardware diagnostics from the PROM (option '3' rather than '1', which boots the OS). I'm not sure the diagnostics are part of the default IRIX install as they are for most other SGI workstations; I think they were a separate download for the Fuel.

If you've got both memory banks filled in the Fuel, you could remove half the RAM in an attempt to isolate the failure. If none of that works, I think the V10 must be broken. Fortunately they are fairly cheap (compared to a V12). I'd also contact the person who sold you the system in this condition.

_________________
jan-jaap wrote:
This looks like a hardware (memory) error to me. I've never heard of a graphics board with ECC memory though. Maybe the ECC error is in main memory. You could remove the main memory DIMMs, clean the contacts and reseat them, and see if that makes any difference.

I forgot to say, I have already taken out the DIMMs and re-seated them. The contacts looked fine, no corrosion or anything.
Quote:
Otherwise, you could start the hardware diagnostics from the PROM (option '3' rather than '1', which boots the OS). I'm not sure the diagnostics are part of the default IRIX install as they are for most other SGI workstations; I think they were a separate download for the Fuel.

It seems so:
Code:
System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option? 3


Starting diagnostic program...

Press <Esc> to return to the menu.



Autoboot failed.
dksc(0,1,0):/stand/smdk/smdk: no such file or directory.

Hit Enter to continue.

Does anyone know where I can download the diagnostics from? I don't think they are in the IRIX 6.5.22 overlays, and none of the CDs I got when I bought the Fuel have them.
Quote:
If you've got both memory banks filled in the Fuel, you could remove half the RAM in an attempt to isolate the failure.

Only one memory bank filled.
Quote:
If none of that works, I think the V10 must be broken. Fortunately they are fairly cheap (compared to a V12). I'd also contact the person who sold you the system in this condition.

Oh well, one thing at a time.

_________________
Torfinn
Ok, the diagnostics are on SupportFolio:
Customer Diagnostics 2.2 - sMDK Diagnostics for IRIX 6.5.22 (from 30-Oct-2003). However, with my free SupportFolio account, I am not "entitled" to download them. Dang!

_________________
Torfinn
tingo wrote:
Does anyone know where I can download the diagnostics from? I don't think they are in the IRIX 6.5.22 overlays, and none of the CDs I got when I bought the Fuel have them.


They should work if you put the "installation tools" disk (that supports Fuel) in the CD drive and select the diagnostics option - that's the way it works for the IDE diagnostics, anyway.

_________________
Damn the torpedoes, full speed ahead!

:Indigo: :Octane: :Indigo2: :Indigo2IMP: :Indy: :PI: :O200: :ChallengeL:
SAQ wrote:
They should work if you put the "installation tools" disk (that supports Fuel) in the CD drive and select the diagnostics option - that's the way it works for the IDE diagnostics, anyway.

SAQ - I believe the Fuel diagnostics are entirely separate. That's a pretty crappy deal and I'd love to be proven wrong but alas ....
hamei - you are correct, the diags are separate; they are not on the tools disk. I have now managed to get the diags, and testing is in progress. We'll see what happens.

_________________
Torfinn
Ok, online diags completed.
Code:
root@IRIS# cd /usr/diags/bin
root@IRIS# l
./            bridgeloc*    diagsetup*    olenet*       olrtr*        oltape*       olvst*        pandora*
../           cached*       olcmt*        olpci*        olsio*        olusb*        onlinediag*   runalldiags*
root@IRIS# ./runalldiags

Running online diagnostics at Normal level
Time: Thu Jul  8 22:40:35 CEST 2010
System Information: IRIX64 IRIS 6.5 6.5.22f 10070055 IP35
Plan on running: olcmt olpci olenet olsio olusb olrtr pandora

olcmt - Cache/Memory Test    (Check /var/adm/SYSLOG for error message)
PASS(olcmt)
olpci - PCI Config Space Dump/Decode
PASS(olpci)
PASS(olpci)
olenet - BaseIO Ethernet Diagnostic
PASS(olenet)
olsio - Serial Port Diagnostic
PASS(olsio.1)
olusb - USB Diagnostic
PASS(olusb)
olrtr - Router Diagnostic
PASS(olrtr)
pandora - System Stress Test
PASS(pandora)

Finished running at Thu Jul  8 23:17:42 CEST 2010
Ran: 8  Failed: 0

root@IRIS#
Now proceeding with standalone diags.
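
A minimal sketch for tallying results from a saved copy of output like the above, assuming every result line has the form `PASS(name)` or `FAIL(name)`. The helper name `summarize` is hypothetical, and the sample is an excerpt of the run shown above.

```python
import re

# Matches result lines such as "PASS(olcmt)" or "FAIL(name): ...".
RESULT = re.compile(r"^(PASS|FAIL)\(([^)]+)\)")


def summarize(text):
    """Return (passed, failed) lists of test names found in the output."""
    passed, failed = [], []
    for line in text.splitlines():
        m = RESULT.match(line.strip())
        if m:
            (passed if m.group(1) == "PASS" else failed).append(m.group(2))
    return passed, failed


# Excerpt from the Normal-level run above.
sample = """\
olcmt - Cache/Memory Test    (Check /var/adm/SYSLOG for error message)
PASS(olcmt)
olpci - PCI Config Space Dump/Decode
PASS(olpci)
PASS(olpci)
olsio - Serial Port Diagnostic
PASS(olsio.1)
pandora - System Stress Test
PASS(pandora)
"""

passed, failed = summarize(sample)
print("passed:", passed)
print("failed:", failed)
```

Tests that report more than once (olpci here) appear once per result line, which keeps the tally in step with what the tool printed.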

_________________
Torfinn
It appears that there might be a problem with the standalone diagnostics. Here is the output:
Code:
Starting diagnostic program...

Press <Esc> to return to the menu.

SMDK SGI Version 6.171 TEST built 04:38:14 AM Sep 24, 2003
smdk loading io discovery code...
smdk loading launcher code...
smdk>
sMDK Diagnostic Launcher: Version 2.0
Built 00:26:41 Sep 24 2003
term none
Setting up diagnostics.....
Starting diagnostics.....



Testing  PIMM............   PASSED
Testing  CACHE...............   PASSED
Testing  DIMM..........................................................................................   PASSED
Testing  IO....
And it looks like the machine hangs there. Is this part (testing IO) a part that takes a very long time to run?

_________________
Torfinn
tingo wrote:
It appears that there might be a problem with the standalone diagnostics. Here is the output:
Code:
........................................................................................   PASSED
Testing  IO....
And it looks like the machine hangs there. Is this part (testing IO) a part that takes a very long time to run?

It could very well be doing a surface scan of the hard disk. Depending on the disk, that alone could take 30 min to 1 hr.

_________________
jan-jaap wrote:
tingo wrote:
It appears that there might be a problem with the standalone diagnostics. Here is the output:
Code:
........................................................................................   PASSED
Testing  IO....
And it looks like the machine hangs there. Is this part (testing IO) a part that takes a very long time to run?

It could very well be doing a surface scan of the hard disk. Depending on the disk, that alone could take 30 min to 1 hr.

Well, today I tried again and let it run for many hours. It still stops on the same line. Is there a description of these tests somewhere, so one can find out what they really test and how they do it?

_________________
Torfinn
So I tried an extensive online diagnostic test:
Code:
root@IRIS# ./runalldiags -extensive

Running online diagnostics at Extensive level
Time: Sat Jul 10 01:49:33 CEST 2010
System Information: IRIX64 IRIS 6.5 6.5.22f 10070055 IP35
Plan on running: olcmt olpci olenet olsio olusb olrtr pandora olvst

olcmt - Cache/Memory Test    (Check /var/adm/SYSLOG for error message)
PASS(olcmt)
olpci - PCI Config Space Dump/Decode
PASS(olpci)
PASS(olpci)
olenet - BaseIO Ethernet Diagnostic
PASS(olenet)
olsio - Serial Port Diagnostic
PASS(olsio.1)
olusb - USB Diagnostic
PASS(olusb)
olrtr - Router Diagnostic
PASS(olrtr)
pandora - System Stress Test
FAIL(pandora): see /tmp/diagFailure.0.pandora for details
olvst - Generic Network test using Sockets

Finished running at Sat Jul 10 03:43:12 CEST 2010
Ran: 8  Failed: 1

root@IRIS#
and here is the output of the pandora file:
Code:
root@IRIS# more /tmp/diagFailure.0.pandora
NOTE                This diagnostic can NOT be run concurrent with any user jobs
CMDL                /usr/diags/bin/pandora -runtime 60
TEST pandora        System level stress test.              Test(1/1), Loop(1/0)
REV                 Pandora version 7.3 built on Jun 29 2006 at 23:57:23
INFO                Start time Sat Jul 10 02:38:27 2010
INFO                Running on IRIX64 6.5.22f 10070055 IP35 (IRIS)
INFO                Pandora run time: 60 minutes
INFO
INFO                Initializing...  Initialization time varies based
INFO                upon how much memory is to be tested and how many
INFO                test processes are set to run in the sysinfo file.
INFO
INFO                Testing will be performed using 1 CPU(s)
INFO                857MB of memory will be tested, if MEM Tests enabled.
INFO                Testing will consist of:
INFO                1   IO Tests
INFO                4   FPU Tests
INFO                4   MEM Tests
INFO                1   GFX Tests
INFO                0   NTWK Tests
INFO
INFO                Completed initialization, starting tests...
HRTB                Testing....................................................
HRTB                ...........................................................
HRTB                ...........................................................
HRTB                ...........................................................
HRTB                ........................

HRTB                Continuing
DIAG       000001   CPID#2498: GFX_PIXEL process may be hung on cpu 0.
HRTB                Continuing. DONE!
I wish it would just tell me in plain text what was wrong.
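
The only line in that report that is not NOTE, INFO or HRTB is the DIAG line naming the GFX_PIXEL process. A minimal sketch for pulling such lines out of a failure file, assuming the first whitespace-separated field is a record tag and that DIAG marks the diagnosis (that reading of the format is a guess from this one report, not from SGI documentation). The helper name `diag_lines` is hypothetical.

```python
def diag_lines(report):
    """Return the message text of every line whose first field is DIAG."""
    found = []
    for line in report.splitlines():
        fields = line.split(None, 1)  # tag, then the rest of the line
        if len(fields) == 2 and fields[0] == "DIAG":
            found.append(fields[1].strip())
    return found


# Tail of the pandora failure file shown above.
sample = """\
INFO                Completed initialization, starting tests...
HRTB                Testing....................................................
HRTB                Continuing
DIAG       000001   CPID#2498: GFX_PIXEL process may be hung on cpu 0.
HRTB                Continuing. DONE!
"""

for msg in diag_lines(sample):
    print(msg)
```

On a real system the same function could be fed the contents of /tmp/diagFailure.0.pandora read from disk.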

_________________
Torfinn