Sunday, 7 June 2015

Dell RAID-5 Hard Drive Mishap

We had one customer where we inherited a budget Dell PowerEdge server. It was running SBS 2008 and was underspecced. It had been running very slowly ever since we took them on board (I suspect an install gone wrong).

About a year or two before, one of the hard drives failed. No big deal: RAID-5 with 3 disks and a hot spare. Dell replaced the drive under warranty, no sweat. Everything was hunky-dory, but that's not the end of it. Now here's the first problem. RAID-5 is a nice cheap way of getting maximum storage capacity with redundancy. However, people on low budgets use it with an all-eggs-in-one-basket attitude, thinking that they're always covered.

Both the OS and the data were sitting on the RAID-5. What's wrong with that? Nothing, when it works. However, SBS runs Exchange, which loves to thrash the hard drives with its constant indexing and database maintenance. Not only that, but this company had decided to give all of its users roaming profiles and redirected documents. No wonder it was slow!

So now we have Exchange, regular data, roaming profiles and redirected documents all running on one set of disks. But hey, it's working, isn't it? For the moment, yes. However, that was about to change.

About a year or two later, they started running out of disk space on the data partition. There were only 4 drive bays in the server (cheap budget model). The engineer working on the case had a great idea: let's bring the hot spare into the array, and we'll have another 250 GB of space at no cost. Seems simple enough. We still have protection from a single disk failure, so no problem, everybody's happy.
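The capacity gain from that move is simple RAID-5 arithmetic: one disk's worth of space goes to parity, and the rest is usable. Here is a minimal sketch, assuming all four disks are the same 250 GB size (which the 250 GB gain implies). The function name is just for illustration.

```python
def raid5_usable_gb(disks: int, disk_size_gb: int) -> int:
    """Usable capacity of a RAID-5 array: (n - 1) disks of data, 1 disk of parity."""
    if disks < 3:
        raise ValueError("RAID-5 needs at least 3 disks")
    return (disks - 1) * disk_size_gb


# Assumption: every disk is 250 GB.
before = raid5_usable_gb(3, 250)  # 3 disks in the array, 4th bay is the hot spare
after = raid5_usable_gb(4, 250)   # hot spare pulled into the array

print(before, after, after - before)  # prints: 500 750 250
```

The trade-off is that the array can still survive one disk failure, but there is no longer a spare waiting to take over automatically, so a second failure before the first disk is replaced takes out the whole array.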

Then, just over a week later, came the early morning phone call: the same customer, saying their server was down. Okay, don't panic, ask them to reboot via the power button, that usually does the trick. Nope, server is still down. It can't find an operating system and it's trying to boot from a CD. Okay, so remove any CDs and USB drives from the server. Nope, no change. There's a message when booting up which says the RAID is degraded. Uh oh.

No remote fix for this one; it's a cheap server, so no nice DRAC card to help us look remotely. An engineer attends the site, the very same one who was expanding the RAID only a week ago! He happened to be in the area, and was nice enough to pop in. He confirms the RAID is degraded, takes it upon himself to decide it's not recoverable, and deletes the RAID! Wow, bold move. Hope that backup last night worked OK. He decides that he can't start the recovery onsite; I think there was no Symantec recovery disk available. He brings the server back to our office, and leaves it for one of the senior guys to try and bring it back from the dead.

I was lucky enough to be nominated to try and fix the server. So the RAID has been recreated. Okay, let's pop in the recovery CD and restore from last night's backup. Restore goes through, says it's successful, great. Okay, try to boot Windows. Nope, blue screen of death. Okay, let's run startup repair. Nope, still BSOD. Repair the MBR and run an SFC scan, all those helpful tools to revive a non-booting Windows. Nope, still blue screens; I think it was saying a file was missing or corrupt. Okay, let's try restoring just the OS from the previous backup. Nope, same thing. I tried 4 different backups, all with the same result. Something is not right here.

I looked at the backups on the USB drive more closely. The C drive backup was only 20 GB in size, which seems a bit small for SBS 2008. I opened up the recovery point using the Symantec software on my own PC, and it said there was only 10 GB used and 40 GB free. No way does SBS 2008 have only 10 GB used on a C drive; even without a page file, a fresh install would use more than that. When I opened the recovery point I couldn't actually browse any of the data! I could see the size of the backup, but none of its contents. This was the same for the other recovery points. I soon realised that these backups were simply corrupt and pretty much useless. Fortunately, however, the backup of the D drive was fully intact!
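The two warning signs above (implausibly little used space, and contents that can't be browsed) are the kind of thing a routine backup check could flag long before a restore is needed. A rough sketch of that idea follows. Everything in it is hypothetical: the `RecoveryPoint` record, the `check_recovery_point` function and the 20 GB threshold are made up for illustration, and it doesn't talk to the Symantec software at all. You would fill in the numbers by hand or from whatever reporting your backup tool offers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryPoint:
    volume: str      # e.g. "C:"
    used_gb: float   # used space reported inside the recovery point
    browsable: bool  # could the contents actually be opened and listed?


def check_recovery_point(rp: RecoveryPoint, min_expected_gb: float) -> List[str]:
    """Return a list of reasons to distrust this recovery point (empty if none)."""
    problems = []
    if rp.used_gb < min_expected_gb:
        problems.append(
            f"{rp.volume} reports only {rp.used_gb:g} GB used, "
            f"expected at least {min_expected_gb:g} GB"
        )
    if not rp.browsable:
        problems.append(f"{rp.volume} contents could not be browsed")
    return problems


# The C drive figures are from the story; the 20 GB threshold is an assumption.
c_drive = RecoveryPoint(volume="C:", used_gb=10, browsable=False)
for problem in check_recovery_point(c_drive, min_expected_gb=20):
    print("WARNING:", problem)
```

The only check that really proves a backup works is a test restore, but even a crude check like this would have shown that the C drive backups couldn't be trusted.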

So the operating system was completely gone, and the customer had already been without a server for 2 days. The weekend was coming and we needed to make a decision. I took the server home with me and installed a fresh copy of SBS 2008 on the nicely formatted C partition. I gave it the same name and domain, and recreated all the user accounts manually. I created a recovery storage group in Exchange and started recovering data. You get the idea: manually resetting all permissions. No ntds.dit or system state backup to save the day on this one. The saving grace was that all the user data and shared data was recoverable.

After spending a whole weekend rebuilding the server, I took it back to site with my colleague (this is Sunday evening now, around 6pm) and got to work rejoining all the computers to the domain. I used the dirty profile wizard software to keep the existing profiles and join them to the new domain. Then I needed to create a new Outlook profile on each machine, as the mailboxes were technically fresh ones. There were between 10 and 15 machines, and between the two of us we managed to get everything up and running within around 3 and a half hours, which was not bad. There were still a few issues (AV out of sync, mailbox delegates missing and problems with Outlook indexing), but for the most part they were up and running and able to work again.

We were actually quite lucky on this one. Although the customer was down for almost a week, it could have been much worse. In the end they had little to no data loss, which was lucky as the backups had not been working correctly for a while. So why did the RAID fail? I suspect that 2 of the hard drives were already on the brink of failure, but not badly enough to actually be marked as failed. Then when the RAID was expanded, it was simply too much and the RAID failed due to disk errors. I still think that if the RAID hadn't been wiped out by the engineer who went to site, we may have been able to force the disks online and may have had a chance of replacing them.

A month or two after this event, one of the drives actually did fail and had to be replaced. Then a month later, another one failed. This, in my mind, proves that faulty disks slowly killed the RAID and caused Symantec to produce corrupted backups of the operating system.

So could this have been avoided? Maybe, but I think it was a combination of bad luck and an underspecced, overworked server wearing out its hard drives.

If they had gone with a separate RAID for the OS and the data, then I think this never would have occurred. Or better yet, separate RAIDs for OS, data and Exchange, but this was a small company with a low budget. If you do take on a customer that has all their data on a single RAID-5, make them fully aware that they could lose everything and that they need a robust backup system in place to truly protect their data.

This is definitely one of the worst server restores I've had to do. However, it was a learning curve and I'm glad I saw it through to the end. Looking back, there are definitely things I would have done a little differently, but I think going from no server and everything lost to a server with everything recovered is not a bad result in itself.
