Nekonomicon - Octane PCI cage weirdness.

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 8, 2006, 2:30 a.m...

Hi all,

I just purchased a PCI cage. I put it in the machine in accordance with the tech pubs guide.

When I power on the machine it boots up normally, but memory consumption goes through the roof. I have /usr/nekoware on a separate drive and it cannot resolve libs for nekoware apps, and everything is noticeably s-l-o-w-e-r, almost glacial: a general malaise. If I shut it down and pull the cage out, the machine is back to normal.

Do I have a dodgy PCI cage, a stuffed xbow, power problems, or something else...?

[EDIT] I forgot to add that I can see the PCI cage in the hinv:
(It is the XTALK PCI board; I diff'ed hinv before and after.)

Location: /hw/node
PM20 Board: barcode HRJ328 part 030-1356-001 rev J
Location: /hw/node/xtalk/15
IP30 Board: barcode KJN377 part 030-1467-001 rev D
Location: /hw/node/xtalk/15/pci/2
PWR.SPPLY.ER Board: barcode AAE9460440 part 060-0035-002 rev A
FP1 Board: barcode 33094C part 030-0891-003 rev E
Location: /hw/node/xtalk/13
XTALKPCI Board: barcode LKY520 part 030-0952-005 rev E
Location: /hw/node/xtalk/12
MOT20 Board: barcode JKD288 part 030-1240-003 rev G
2 300 MHZ IP30 Processors
Heart ASIC: Revision F
CPU: MIPS R12000 Processor Chip Revision: 2.3
FPU: MIPS R12010 Floating Point Chip Revision: 0.0
Main memory size: 896 Mbytes
Xbow ASIC: Revision 1.3
Instruction cache size: 32 Kbytes
Data cache size: 32 Kbytes
Secondary unified instruction/data cache size: 2 Mbytes
Integral SCSI controller 0: Version QL1040B (rev. 2), single ended
Disk drive: unit 1 on SCSI controller 0 (unit 1)
Disk drive: unit 2 on SCSI controller 0 (unit 2)
Disk drive: unit 3 on SCSI controller 0 (unit 3)
Integral SCSI controller 1: Version QL1040B (rev. 2), single ended
IOC3/IOC4 serial port: tty1
IOC3/IOC4 serial port: tty2
IOC3 parallel port: plp1
Graphics board: EMXI
Integral Fast Ethernet: ef0, version 1, pci 2
Iris Audio Processor: version RAD revision 12.0, number 1
PCI Adapter ID (vendor 0x10a9, device 0x0003) PCI slot 2
PCI Adapter ID (vendor 0x1077, device 0x1020) PCI slot 0
PCI Adapter ID (vendor 0x1077, device 0x1020) PCI slot 1
PCI Adapter ID (vendor 0x10a9, device 0x0005) PCI slot 3

Regan

J5600, 2 x SUN, 2 x Mac, 3 x Alpha, 2 x RS/6000

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 14, 2006, 2:33 a.m...

Additional information, the syslog from the time that I put the PCI cage in:

   Nov  2 01:43:41 6B:mickey Xsession: regan: login
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during brk/sbrk - see swap(1M) [filter /usr/sbin/klogpp failed: Resource temporarily unavailable]
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during brk/sbrk - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: syslogd [79] - out of logical swap space during fork while allocating uarea - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: syslogd [79] - out of logical swap space during fork while allocating uarea - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during mmap - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during mmap - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during brk/sbrk - see swap(1M)
   Nov  2 01:44:24 1A:mickey unix: ALERT: firefox-bin [1192] - out of logical swap space during brk/sbrk - see swap(1M)
   Nov  2 01:45:34 1A:mickey unix: ALERT: configmon [1163] - out of logical swap space during brk/sbrk - see swap(1M)
   Nov  2 01:45:34 1A:mickey unix: ALERT: configmon [1163] - out of logical swap space during brk/sbrk - see swap(1M)
   Nov  2 01:46:27 1A:mickey unix: ALERT: configmon[1163] was killed to end a memory deadlock condition.
   Nov  2 01:46:27 1A:mickey unix: |$(0x70d)ALERT: Process [configmon] pid 1163 killed due to insufficient memory/swap.
   Nov  2 01:46:53 6B:mickey runpriv[1500]: Running privilege dtshutdown for user regan.
   Nov  2 01:46:55 6D:mickey tfxd[1079]: terminating

Anyone have any ideas...?

Regan


recondas
Moderator
From NC - USA
Who joined June 6, 2004, 5:55 p.m.

Wrote the following at Nov. 14, 2006, 5:56 a.m...

By any chance do you have esp turned on? <look in chkconfig>

If you do, try turning it off.
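For anyone following along, here is a minimal sketch of that check. It assumes chkconfig with no arguments prints one flag and its state per line, as it does on IRIX 6.5; `flag_state` is a hypothetical helper name, not part of IRIX.

```shell
#!/bin/sh
# flag_state NAME
# Reads "flag state" lines on stdin (the layout chkconfig prints when run
# with no arguments) and prints the state of NAME, or "unknown" if the
# flag is not listed. flag_state is a made-up helper for illustration.
flag_state() {
    state=unknown
    while read name value; do
        if [ "$name" = "$1" ]; then
            state=$value
        fi
    done
    echo "$state"
}

# On the IRIX box:
#   chkconfig | flag_state esp      # prints on, off or unknown
#   chkconfig esp off               # as root; takes effect at the next boot
```

The helper only reads text, so it can be tried against a saved copy of the chkconfig listing before anything on the machine is changed.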

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 14, 2006, 1:38 p.m...

Hi recondas,

Turning it off has fixed my problem (THANK YOU).

Yes I do (did), but sorry, you lost me there... turning off event monitoring and logging...
how and why has that fixed things..? Isn't no logging/monitoring dangerous..?

I'm running 6.5.27m btw.

WTF? I just turned esp back on in chkconfig, rebooted, and everything is mysteriously working fine again... even with logging on.

Thanks,

Regan


recondas
Moderator
From NC - USA
Who joined June 6, 2004, 5:55 p.m.

Wrote the following at Nov. 14, 2006, 2:34 p.m...

regan_russell wrote:
> Hi recondas,
>
> Turning it off has fixed my problem (THANK YOU).
>
> Yes I do (did), but sorry, you lost me there... turning off event monitoring and logging...
> how and why has that fixed things..? Isn't no logging/monitoring dangerous..?
>
> I'm running 6.5.27m btw.
>
> WTF? I just turned esp back on in chkconfig, rebooted, and everything is mysteriously working fine again... even with logging on.
>
> Thanks,
>
> Regan

I've experienced the same issues you reported and also noted the configmon errors - troubleshooting configmon eventually led to turning off ESP. Stopping ESP and leaving it off has been the only lasting solution for me <eventually the problem would come back if I re-enabled ESP>. BTW - to my knowledge turning off ESP doesn't stop configmon - just ESP.

I've seen a few posts mentioning that SGI has released a patch - but I don't have access to it, so I haven't tried or tested a patched system.

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 15, 2006, 1:16 p.m...

Cool thanks, I like to know how and why and tend to pull things apart just to know.. I really feel the need to get to the root cause of things..
I'll go look for the patch.

Thanks,

Regan


dc_v01
From Boston, MA
Who joined July 29, 2005, 3:38 p.m.

Wrote the following at Nov. 15, 2006, 9:13 p.m...

regan_russell wrote: Isn't no logging/monitoring dangerous..?

Do you click on the "Send Error Report to Microsoft" button, too?

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 15, 2006, 9:25 p.m...

dc_v01 wrote: Do you click on the "Send Error Report to Microsoft" button, too?

Yes, especially when the code that caused it was mine. ;-)


recondas
Moderator
From NC - USA
Who joined June 6, 2004, 5:55 p.m.

Wrote the following at Nov. 15, 2006, 9:59 p.m...

regan_russell wrote: I'll go look for the patch.

Try patch 7111.

   IRIX Patch 7111: ESP eventmond core fix.    28-Sep-2006

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 15, 2006, 10:05 p.m...

Thanks !!

dc_v01
From Boston, MA
Who joined July 29, 2005, 3:38 p.m.

Wrote the following at Nov. 16, 2006, 7:35 a.m...

regan_russell wrote:

dc_v01 wrote: Do you click on the "Send Error Report to Microsoft" button, too?

Yes, especially when the code that caused it was mine.

Oh, wow, that's impressive! (Ever help you out?)

Anyway, I always turn esp off 'cause it seems to speed things up so dramatically. I'm not going to get any support from sgi, and there's pretty much zero security risk on my boxes so it's not "dangerous" not to have added logging/monitoring from that perspective.

regan_russell
From Sydney, Australia
Who joined July 18, 2006, 9:32 p.m.

Wrote the following at Nov. 16, 2006, 1:37 p.m...

Helps me out if it keeps a MSFT tech support engineer busy for about a nanosecond, which will slow their march on niche markets by a small fraction of that.. Or it upsets their statistics of reported failures.

I am not so much worried about support from sgi or anyone, or even security. What worries me is something failing silently, unknown to me, while I think everything's fine, until one day it all turns around and bites me in the ... What is it they say.. known knowns, known unknowns and unknown unknowns or some gibberish like that..?

I hate surprises, especially nasty ones. I guess it is my nature to be a bit retentive... Just like a cat is a cat is a cat.. It's the nature of things.

dc_v01 wrote: Anyway, I always turn esp off 'cause it seems to speed things up so dramatically.

I will take that under advisement when considering performance issues.

Regan


mapesdhs
From Edinburgh, Scotland
Who joined Nov. 10, 2003, 4:17 p.m.

Wrote the following at Nov. 21, 2006, 6:33 p.m...

regan_russell writes:
> WTF? I just turned esp back on in chkconfig, rebooted, and everything is mysteriously
> working fine again... even with logging on.

You don't need ESP. Turn it off.

ESP is for automating support calls for those with service contracts. Ordinary error
logging to SYSLOG operates as normal without it.

Plus, if it's off, a system will boot up a full minute faster, because an ESP that
hasn't been configured waits 60 seconds before timing out.

Other things can usually be turned off too, including sendmail, pmcd, tfxd,
sesdaemon/fcagent (if you're not using FC), xlv (if you're not using disk volumes),
webface, webface_apache, cluster, and so on.
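As a rough sketch, the list above can be turned into chkconfig commands in one pass. The flag names are copied from the list; whether each one exists on a given install is an assumption, and `print_off_commands` is a hypothetical helper name. It only prints the commands, so the list can be reviewed first.

```shell
#!/bin/sh
# Flags taken from the list above. Drop fcagent/sesdaemon if you use
# Fibre Channel, and drop xlv if you use XLV disk volumes.
FLAGS="sendmail pmcd tfxd sesdaemon fcagent xlv webface webface_apache cluster"

# Dry run: print one "chkconfig <flag> off" command per flag instead of
# running it. print_off_commands is a made-up name for illustration.
print_off_commands() {
    for flag in $FLAGS; do
        echo "chkconfig $flag off"
    done
}

print_off_commands

# To apply on the IRIX box, as root, once the list looks right:
#   print_off_commands | sh
```

chkconfig may complain about a flag that has no file on that system, so trimming FLAGS to what chkconfig actually lists is the safer route.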

I also edit MSGTIME in /etc/init.d/network to be 0 (speeds bootup by 10 seconds
if the host name is still IRIS), and change the sleep time in /etc/init.d/filesystems
to 0 (if a file system doesn't mount, display an error but don't then wait 5 secs).

On older versions of the OS, check if there is a '&' symbol at the end of the
newaliases line in the mail startup script:

   # grep newal /etc/init.d/mail
   newaliases >/dev/null &

If it's there, good. If not, then add the symbol. Without it, unconfigured
sendmail will time out for 60 secs on bootup. The symbol was omitted by mistake
at some point, but is back in there now, at least it's there in 6.5.22 and later.
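A small sketch of the same check as a script, in case you want to run it across several machines. `newaliases_backgrounded` is a hypothetical helper name; it assumes the newaliases call sits on a single line, as in the grep output above.

```shell
#!/bin/sh
# newaliases_backgrounded FILE
# Succeeds (exit 0) if FILE has a newaliases line ending in '&',
# fails otherwise. Made-up helper name, for illustration only.
newaliases_backgrounded() {
    grep 'newaliases' "$1" | grep '&[[:space:]]*$' >/dev/null
}

# Usage on the IRIX box:
#   if newaliases_backgrounded /etc/init.d/mail; then
#       echo "ok: newaliases runs in the background"
#   else
#       echo "add '&' to the end of the newaliases line"
#   fi
```

It only reads the file; the edit itself is still best done by hand.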

Ian.

PS. Did you receive my last email? Feel free to email or PM.
