Wednesday, 17 June 2015

When RAID-5 becomes RAID-Fuck All

This one starts out as any normal server issue. Engineer attends site to do a routine checkup. He finds that one of the hard drives is predicting a failure. No big deal, alert the account manager to order a replacement.

Engineer goes to site again the next day, again still routine. I think he was covering for their usual IT guy there. Turns up to site, and the server is down. Engineer checks the drives: instead of one predicting a failure, there are now two drives that have totally failed.

Server has a RAID-5 so can't survive two drives failing (who needs a hot spare anyway!) and the array has been disabled.

Tried to talk the engineer into forcing the drives online, but he couldn't figure it out. Okay, so let's take a step back and review the situation.

Is the server important? Mostly no, but for some reason it was running a virtual machine that contained some ancient sales Access database. Unfortunately this ancient Access database was very important! Typical: something extremely old that should have been decommissioned years ago, but the client doesn't want to let go of it!

Okay, do we have a backup of the server? Nope. Well, looks like we REALLY need to fix it now!

He brought the server back to the offices and gave it to me. So I jumped into the RAID utility and started looking. First of all it's a Dell with the usual PERC RAID card, not a great start...

Worse still, there are 8 drives in the RAID-5! No wonder two had failed at the same time. Having 8 drives in a RAID-5 significantly increases the chances of a double failure as the drives age, because once one drive dies, any one of the other seven failing takes the whole array with it. Why not use one of the drives as a hot spare? Obviously the engineer who built the server was greedy for hard drive space.
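
To put a rough number on that, here's a back-of-the-envelope sketch. It assumes each surviving drive fails independently with the same probability while the array is degraded, which is generous for drives of the same age and batch. The 2% figure and the function name are made up for illustration, not measured from this server.

```python
def second_failure_probability(total_drives: int, p_drive: float) -> float:
    """Chance that at least one surviving drive fails while the array is degraded.

    Assumes independent failures with the same per-drive probability
    (p_drive) over the degraded/rebuild window. Both are simplifications.
    """
    survivors = total_drives - 1  # one drive has already failed
    return 1 - (1 - p_drive) ** survivors


if __name__ == "__main__":
    # 0.02 is an illustrative guess, not a real failure rate.
    for n in (4, 8):
        print(f"{n} drives: {second_failure_probability(n, 0.02):.1%}")
```

With that made-up 2% figure, going from 4 drives to 8 more than doubles the odds of a second failure killing the array. Real ageing drives from the same batch are likely worse than this model suggests.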

So, I managed to force the drives online and get the server booted. Anyone who's done this themselves will know, it's like winning the lottery. You've gone from everything is gone, to everything is back!

Okay, so open up the Dell OpenManage homepage. Shows that one drive is rebuilding and the other is predicting a failure still. Then the boss appears, producing some questionable replacement hard drives. So I thought to myself, why wait for this drive to rebuild? It's only going to put stress on the rest of the drives. Better to just swap it out now and let it start rebuilding with a good drive.

There's a reason patience is a virtue.


I checked the numbers to confirm which faulty drive to remove and yanked it out (yes they were hot swap). I can't remember if I misread the numbers or they didn't match with the Dell software. Either way, I fucked up. I removed the wrong drive, and the server rebooted.

So back into the RAID utility, force the drives online again. Now to shut down the server, and replace the CORRECT drive. This time I just leave it in the RAID utility to rebuild.

Once the rebuild is complete, I replaced the other drive and let it rebuild again.

So we finally have a healthy RAID. I boot up the server. Let's check out the damage. Uh oh, Windows wants to run chkdsk. Not a good sign. I cancel it straight away, and let it continue booting to Windows.

Okay, let's try and boot up the virtual machine. It's the only valuable thing on the server.
Boom, VMDK file is corrupt. That's not good. Shadow copy? Nope, nothing can save this one. I think I tried some VMware tools but the VM was toast.

Then I had a stroke of luck. I found the Access database, which was stored on the PDC! Someone had the foresight to move the database to a safe location at some point. Unfortunately, the database was so old it wouldn't work on the shiny new versions of Office. The virtual machine had Office 2000 on there and was running Terminal Services so people could hop on and use it as they pleased.

Then I found a REALLY old backup of the virtual machine. We're talking like 5 years old, but I figured it hadn't changed much. So I built a new virtual machine on another server, and set about rebuilding the 2000 server.

After spending an hour or two getting a Windows 2000 ISO, I installed Windows 2000 Server. When that was done, I fired up NT Backup and began the restore. This was actually the first time I'd used NT Backup to restore a server. In the past I've always used Symantec products.

Once the restore was complete, I looked over the machine to see what was what. Well, there was a local copy of the database that was older than Peter Sallis. So I needed to get a network drive mapped, and change the Terminal Services config so it would open the database from the server.

To top it off, I needed to reinstall all the printers as the ones on there were all defunct. One of them really did not want to install. Windows 2000 doesn't have the best driver support these days.

This was a very painful job, but I deserved it. It taught me a valuable lesson. When you're dealing with a broken RAID, note the serial number on the drive you want to swap. Then shut down the server and THEN change it. Especially if you don't have a backup!
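
If you want to force yourself through that check, here's a minimal sketch of the idea. Everything in it is hypothetical: the function name, the slot numbers, the serials and the state strings are made up for illustration and don't come from the Dell tools. In real life you'd copy the serials and states off whatever your RAID utility reports and compare them with the sticker on the drive in your hand.

```python
def safe_to_pull(slot: int, label_serial: str, controller_view: dict[int, dict]) -> bool:
    """Return True only if the drive in `slot` is the faulty one AND its
    serial matches the label on the physical drive you're about to pull.

    `controller_view` is a hand-typed, hypothetical mapping:
    slot number -> {"serial": ..., "state": ...}.
    """
    drive = controller_view.get(slot)
    if drive is None:
        return False  # the controller doesn't report anything in that slot
    serial_matches = drive["serial"].strip().upper() == label_serial.strip().upper()
    is_faulty = drive["state"] in {"failed", "predicted_failure"}
    return serial_matches and is_faulty


if __name__ == "__main__":
    # Made-up example data, typed in from the RAID utility by hand.
    controller_view = {
        0: {"serial": "AAA111", "state": "online"},
        1: {"serial": "BBB222", "state": "predicted_failure"},
    }
    print(safe_to_pull(1, "bbb222", controller_view))  # label matches the faulty drive
    print(safe_to_pull(0, "AAA111", controller_view))  # healthy drive, leave it alone
```

The point isn't the code, it's the habit: two independent checks (serial and state) have to agree before anything gets pulled.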

The server also had GFI MailArchiver, so I then had to reinstall the software and re-index all the databases. The databases were stored on a NAS box, so fortunately they weren't lost.

The previous IT guy who looked after the site had a lucky escape. The whole system was hanging by a thread for years and he managed to avoid a disaster.

A few days after rebuilding the server, both of the replacement drives came back as predicting failures. Seems those hard drives the boss gave me were bad, and they needed to be replaced again. I left the company shortly after that, and I don't miss them one bit! There was so much dodgy hardware in that place it was unbelievable.

Moral of the story? Do backups! Backups backups backups. Also, if a client has an unstable network, make sure everything you do for them is billed by the hour!

Luck was on my side that day, and I got some valuable experience. I hope people reading this might avoid a similar mishap. The advice here may seem obvious, but we all make mistakes, no matter how good we are.
