Nekonomicon - 2505-9050, or, eight digits of doom

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 8, 2014, 9:57 p.m...

So the POWER6 (which the Apple Network Server 500 is subbing in for) did indeed blow its system backplane. Unfortunately it appears to have taken the RAID 5 array with it. There is data still in the auxiliary battery-backed cache, but the cache directory is apparently hosed and SRN 2505-9050 popped up in the logs, indicating that the controller does not recognize the cache data as belonging to the array. The associated MAP 3131 to resolve it strongly suggests I'm more hosed than a fire truck at a gay pride parade, with ugly words like "data loss" and "delete then recreate array."

Diagnostics shows the disks are there, and recognized as belonging to the array, but the array itself is listed as Failed.

My IBM tech friends aren't sure what to do with it either, but I thought I'd ask here in case anyone is a POWER Systems god. The system was relatively quiescent the day it went bad, so there shouldn't be a LOT in that write cache, mostly log files (the 57B7/8 card pair has a modest 175MB of write cache). This is AIX, so these are all JFS2 file systems.

The way I figure it, /, /usr and /opt haven't seen much action since Obama's first term. They should be clean of writes; there shouldn't be anything in the write cache for those partitions. /tmp is expendable. /home is backed up daily, so I can restore it. The logs in /var are almost certainly toast, but the database hasn't been written to in several weeks, mail is backed up several times a day, and the web server and gopher server are backed up daily and weekly respectively. I don't care about the journal volume or the paging volume, since I already assume those are fried.
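To make that triage explicit, here is a rough sketch in Python. The entries come from the paragraph above, but the `Item` class, the `triage` function and the action wording are purely illustrative, and marking mail, web and gopher as "recently written" is an assumption (the post only gives their backup schedules).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    name: str
    recently_written: bool        # could the write cache hold data for it?
    backup: Optional[str] = None  # backup frequency, where the post states one
    expendable: bool = False      # contents can simply be recreated


def triage(item: Item) -> str:
    """Return the expected recovery action for one item."""
    if item.expendable:
        return "expendable, recreate"
    if not item.recently_written:
        return "expect clean"
    if item.backup:
        return f"restore from {item.backup} backup if damaged"
    return "likely lost"


ITEMS = [
    Item("/", recently_written=False),
    Item("/usr", recently_written=False),
    Item("/opt", recently_written=False),
    Item("/tmp", recently_written=True, expendable=True),
    Item("/home", recently_written=True, backup="daily"),
    Item("/var logs", recently_written=True),
    Item("database", recently_written=False),
    # Assumed to have seen recent writes; the post does not say either way.
    Item("mail", recently_written=True, backup="several-times-daily"),
    Item("web server", recently_written=True, backup="daily"),
    Item("gopher server", recently_written=True, backup="weekly"),
    Item("journal volume", recently_written=True, expendable=True),
    Item("paging volume", recently_written=True, expendable=True),
]

plan = {item.name: triage(item) for item in ITEMS}

for name, action in plan.items():
    print(f"{name:15} {action}")
```

The only entry that comes out as "likely lost" is the logs in /var, which matches the guess above.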

So, given this, my thought is to just reclaim the write cache and wipe it, and fsck and hope for the best. The drives still organize themselves into an array, just a failed one. The journaling should keep the file system in a sane state, even if I've lost some writes.
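For the fsck part, a minimal dry-run sketch in Python that only builds the command lines and runs nothing. The logical volume names are the stock AIX rootvg defaults and are an assumption here, since the actual layout on this box may differ; `fsck_plan` is a made-up helper name. The `-n` flag answers "no" to every prompt, so the first pass only reports, and `-y` would be the repair pass.

```python
# Assumed default AIX rootvg layout (mount point -> logical volume).
# The real system may use different names or extra volumes.
JFS2_VOLUMES = {
    "/": "hd4",
    "/usr": "hd2",
    "/var": "hd9var",
    "/tmp": "hd3",
    "/home": "hd1",
    "/opt": "hd10opt",
}


def fsck_plan(volumes, repair=False):
    """Build fsck command lines; nothing is executed here."""
    flag = "-y" if repair else "-n"  # -n: report only, -y: answer yes to repairs
    return [f"fsck {flag} /dev/{lv}" for lv in volumes.values()]


check_commands = fsck_plan(JFS2_VOLUMES)
repair_commands = fsck_plan(JFS2_VOLUMES, repair=True)

for command in check_commands:
    print(command)
```

Doing the report-only pass first shows how much damage there is before anything gets rewritten on disk.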

Or do you think I'll be rebuilding the server from scratch and backups?

smit happens.

:Fuel: bigred, 900MHz R16K, 4GB RAM, V12 DCD, 6.5.30
:Indy: indy, 150MHz R4400SC, 256MB RAM, XL24, 6.5.10
:Indigo2IMP: purplehaze, R10000, Solid IMPACT
probably posted from bruce, Quad 2.5GHz PowerPC 970MP, 16GB RAM, Mac OS X 10.4.11
plus IBM POWER6 p520 * Apple Network Server 500 * HP C8000 * BeBox * Solbourne S3000 * Commodore 128 * many more...

japes
From Lynnwood, WA
Who joined Nov. 8, 2007, 4:35 p.m.

Wrote the following at May 8, 2014, 10:45 p.m...

Seems like you're covered either way. It doesn't take long to install AIX so why not try to get the array back, but have your CDs ready if you're in for a reinstall. Or maybe this was a call for bets and we should all get those in? In which case, what's the spread?

At $WORK our POWER systems all use the volume manager only, no RAID, which threw our consultant for a loop. Then again, we don't have a RAID controller refusing to believe all the parts belong together.

Most of the POWER systems don't even have RAID controllers, but one of the 520s does. I found this out because the battery died. How does one go about managing such a thing? It's not like an LSI card in a PC server where I press a key combo at boot time, or at least I don't know what to press.

<--challenge S

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 9, 2014, 7:44 a.m...

I have a mksysb image of the OS, patched up the way I like it, though it probably got some minor modifications that I don't have in that .iso. So I'd rather not rebuild from scratch if I can avoid it, though it does look like my backup strategy is at least adequate ...

What I'm trying to find out is if anything other than write cache is in the RAID, and how often it gets written out. I'd hate for it to queue up a full 175MB of data before it tries to emit any, and I'd really hate for it to be caching things like the superblock.

On the p520 and I think p720, the planar can be either "flat" or RAID. There is an enablement card set that you can install as an option (that's the 57B7/8 in mine) that enables the planar backplane RAID and provides the cache.

SMS (and for that matter, Open Firmware) does not know how to manage RAID controllers; a crashed array will in fact be completely invisible to SMS and won't even show up when you list devices. You have to boot from the AIX diagnostics CD to view the status of the RAID array and attempt repair. This is not the same as an AIX install set. Incredibly, you can download an ISO of that from here, at least until IBM puts that under support-contract-only too: https://www-304.ibm.com/webapp/set2/sas ... /home.html

jpstewart
From Southwestern Ontario, Canada
Who joined Sept. 21, 2010, 3:31 p.m.

Wrote the following at May 9, 2014, 8:25 a.m...

What does the status of the RAID component devices look like? Is the whole array showing as failed because too many component devices are thought to be failed? If so, does the management interface let you manually/forcibly re-enable the component devices?

I have zero experience on POWER systems or their RAID controllers. But I used to have an IBM ServeRAID 6M in a PC server. Just about any unclean shutdown of the system (but particularly those that happened as a result of bad RAM) would cause the 6M to think that two devices in its RAID-5 array had died. Booting from the ServeRAID CD and using its interface to change the drives' status from "Faulty" back to "On-line" was all that was needed. The data on the disk(s) was all there so the controller didn't re-sync/re-build the array. So I had no data loss at all. In my case, it was just a confused controller but a healthy array.

Again, I have no idea how your POWER system compares to my ServeRAID 6M, but your situation (hardware failure leading to a crash leading to a hosed array) sounds just like what I experienced. It might be worth a try if the tools allow it, and if you can accept the fact that it might not work at all.

Sun SPARCstation 20, Blade 2500
HP C8000

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 9, 2014, 11:08 a.m...

The component drives show up as RWProtected in AIX diagnostics, not Failed -- only the array itself appears as Failed. This is the designation the SAS RAID uses for a drive that is still part of an array, but the array has failed "safe," so the controller will lock it down until repairs are effected. See, for example, http://pic.dhe.ibm.com/infocenter/power ... ration.htm

I'm going to test the individual drives, too, but I very much doubt that's the problem. Near as I am aware, you can't set the individual status of the drives manually, at least not in AIX or System i; the controller has to be "happy" first before it will release the protection lock.

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 9, 2014, 7:31 p.m...

Well ...

IT WORKED!

It doesn't look like I lost much of anything. MySQL is unhappy that it didn't shut down cleanly, but other than that the system looks functional. Hooray for JFS!

smj
From Berkeley, CA, USA, NA, Earth, Sol
Who joined Nov. 12, 2007, 7:54 p.m.

Wrote the following at May 9, 2014, 7:46 p.m...

Glad to hear this worked out relatively painlessly!

Then? ... Now? - +MLA ... Other: DEC :BA213:, Sun, DG AViiON, NeXT :Cube:

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 9, 2014, 9:07 p.m...

At least it's good to know that wiping/reclaiming the controller cache is a recoverable operation, if it has to be done. That's not at all clear from IBM's technical documentation. JFS' resiliency as a file system is probably a big part of that.

Oh well, guess I remain the resident AIX dweeb.

jpstewart
From Southwestern Ontario, Canada
Who joined Sept. 21, 2010, 3:31 p.m.

Wrote the following at May 10, 2014, 7:49 a.m...

Glad to hear it all worked out for you.

And sorry the info I posted wasn't useful. I had hoped that there might be some similarity between IBM's PC ServeRAID cards and the ones for their POWER systems. What I hadn't realized (and really should have :oops:) was that your system was a generation or two newer (using SAS) than the old ServeRAID 6M (U320 SCSI) that I'd used. I hope I didn't waste too much of your time with my irrelevant ramblings.

ClassicHasClass
From Sunny So Cal
Who joined July 25, 2012, 7:12 p.m.

Wrote the following at May 10, 2014, 10:22 a.m...

Well, there probably was at the time, and I imagine their current non-POWER enterprise RAID offerings are much the same. But I appreciate the kind thought.
