IT Horror Stories: 2015

Wednesday, 17 June 2015

When RAID-5 becomes RAID-Fuck All

This one starts out as any normal server issue. Engineer attends site to do a routine checkup. He finds that one of the hard drives is predicting a failure. No big deal, alert the account manager to order a replacement.

Engineer goes to site again the next day, again still routine. I think he was covering for their usual IT guy there. Turns up to site, and the server is down. Engineer checks the drives, instead of one predicting a failure, there's now two drives that have totally failed.

Server has a RAID-5 so can't survive two drives failing (who needs a hot spare anyway!) and the array has been disabled.

Tried to talk the engineer into forcing the drives online, but he couldn't figure it out. Okay, so let's take a step back and review the situation.

Is the server important? Mostly no, but for some reason it was running a virtual machine that contained some ancient sales Access database. Unfortunately this ancient Access database was very important! Typical, something extremely old that should have been decommissioned years ago but the client doesn't want to let go of it!

Okay, do we have a backup of the server? Nope. Well, looks like we REALLY need to fix it now!

He brought the server back to the offices and gave it to me. So I jumped into the RAID utility and started looking. First of all it's a Dell with the usual PERC RAID card, not a great start....

Worse still, there are 8 drives in the RAID-5! No wonder two had failed at the same time. Having 8 drives in a RAID-5 significantly increases the chances of a double failure as the drives age. Why not use one of the drives as a hot spare? Obviously the engineer who built the server was greedy for hard drive space.
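To put a rough number on that, here's a back-of-envelope sketch in Python. It assumes every drive fails independently with the same probability over some window (say, the time it takes to notice and replace a dead drive). That's a simplification, and the 5% figure and the function name are made up purely for illustration. Drives from the same batch and the same age tend to go together more often than this model suggests.

```python
def raid5_loss_probability(drives: int, p_fail: float) -> float:
    """Chance that 2 or more of `drives` fail, i.e. the RAID-5 array is lost.

    Assumes independent failures with equal probability p_fail per drive
    (an illustrative simplification, not a measured failure rate).
    """
    none_fail = (1 - p_fail) ** drives
    exactly_one = drives * p_fail * (1 - p_fail) ** (drives - 1)
    return 1 - none_fail - exactly_one


if __name__ == "__main__":
    # 0.05 is an invented per-drive failure probability for the comparison.
    for n in (3, 8):
        chance = raid5_loss_probability(n, 0.05)
        print(f"{n} drives: {chance:.2%} chance of losing the array")
```

Under those assumptions the 8-drive array comes out several times more likely to suffer a double failure than a 3-drive one, and that's before you account for the extra stress of a rebuild.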

So, I managed to force the drives online and get the server booted. Anyone who's done this themselves will know, it's like winning the lottery. You've gone from everything is gone, to everything is back!

Okay, so open up the Dell OpenManage homepage. Shows that one drive is rebuilding and the other is predicting a failure still. Then the boss appears, producing some questionable replacement hard drives. So I thought to myself, why wait for this drive to rebuild? It's only going to put stress on the rest of the drives. Better to just swap it out now and let it start rebuilding with a good drive.

There's a reason patience is a virtue.

I checked the numbers to confirm which faulty drive to remove and yanked it out (yes they were hot swap). I can't remember if I misread the numbers or they didn't match with the Dell software. Either way, I fucked up. I removed the wrong drive, and the server rebooted.

So back into the RAID utility, force the drives online again. Now to shut down the server, and replace the CORRECT drive. This time I just leave it in the RAID utility to rebuild.

Once the rebuild is complete, I replaced the other drive and let it rebuild again.

So we finally have a healthy RAID. I boot up the server, let's check out the damage. Uh oh, Windows wants to run chkdsk. Not a good sign. I cancel it straight away, and let it continue booting to Windows.

Okay, let's try and boot up the virtual machine. It's the only valuable thing on the server.
Boom, VMDK file is corrupt. That's not good. Shadow copy? Nope, nothing can save this one. I think I tried some VMware tools but the VM was toast.

Then I had a stroke of luck. I found the Access database, which was stored on the PDC! Someone had the foresight to move the database to a safe location at some point. Unfortunately, the database was so old, it wouldn't work on the shiny new versions of Office. The virtual machine had Office 2000 on there and was running terminal services so people could hop on and use it as they pleased.

Then I found a REALLY old backup of the virtual machine. We're talking like 5 years old, but I figured it hadn't changed much. So I built a new virtual machine on another server, and set about rebuilding the 2000 server.

After spending an hour or two getting a Windows 2000 ISO, I installed Windows 2000 Server. When that was done, I fired up NT Backup and began the restore. This was actually the first time I'd used NT Backup to restore a server; in the past I've always used Symantec products.

Once the restore was complete, I looked over the machine to see what was what. Well, there was a local copy of the database that was older than Peter Sallis. So I needed to get a network drive mapped, and change the terminal services config so it would open the database from the server.

To top it off, I needed to reinstall all the printers as the ones on there were all defunct. One of them really did not want to install; Windows 2000 doesn't have the best driver support these days.

This was a very painful job, but I deserved it. It taught me a valuable lesson. When you're dealing with a broken RAID, note the serial number on the drive you want to swap. Then shut down the server and THEN change it. Especially if you don't have a backup!

The server also had GFI MailArchiver, so I then had to reinstall the software and re-index all the databases. The databases were stored on a NAS box, so fortunately they weren't lost.

The previous IT guy who looked after the site made a lucky escape. The whole system was hanging by a thread for years and he managed to avoid a disaster.

A few days after rebuilding the server, both of the replacement drives came back as predicting failures. Seems those hard drives the boss gave me were bad, and they needed to be replaced again. I left the company shortly after that, and I don't miss them one bit! There was so much dodgy hardware in that place it was unbelievable.

Moral of the story? Do backups! Backups backups backups. Also, if a client has an unstable network, make sure everything you do for them is billed by the hour!

Luck was on my side that day, and I got some valuable experience. I hope people reading this might avoid a similar mishap. Advice here may seem obvious, but we all make mistakes. No matter how good you are.

Friday, 12 June 2015

Exchange 2003, where did my database go?

It's a classic story, an old SBS 2003 server is running low on space. It's 2013 so they need to buy a new server anyway. Then the usual, no IT budget means no server. Just stick an extra hard drive in and job's a good'un.

Well, at least it's better than buying a NAS and using it as a second server :)

So, hard drives are in. RAID-5 has been extended and we have an extra 300GB of space. We create a new partition for Exchange, now it's just a matter of moving the database and we're good to go. Piece of cake.

Exchange 2003 makes it fairly painless to move the database, you just edit the location, point it where you want the database to sit, and Windows does the rest. It takes a while so just leave it running and go make a coffee.

Unfortunately, the engineer who was performing the move overlooked the fact that the terminal server session times out after 30 minutes. Easy thing to miss, but it would have dire consequences.

I'm still not quite sure how it happened. The engineer told me he kicked off the move, left it running for a while, came back to it and the session had been disconnected. He logged back in, move had failed. Then I think he tried it again and it failed a second time.

After that, nothing. Exchange store wouldn't mount, engineer asked me to have a look the next morning. Checked the event logs, found the following error:

Information Store (15376) First Storage Group: An attempt to write to the file "E:\Program Files\Exchsrvr\rmdbdata\pub1.edb" at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes failed after 0 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.

That's not good. Looking at the server, there were now two databases! The original one and another copy on the new partition. The database in the new partition didn't work, but worse still, the original refused to mount as well!

Okay well, the original database should still be intact right? Surely it will have copied the database to the new location and then removed the original once the copy has completed? I honestly don't know, but it went wrong somehow.

I ran eseutil to try and repair the database. I started with a soft recovery, which replays the transaction logs to bring the database into a clean state. This failed; I can't remember the exact error, but the database was corrupt. I think I even called Microsoft and opened an emergency case. They confirmed my fears, and basically said "yeah it's fucked, and don't bother doing a hard repair, it won't be worth the trouble". So the only course of action was to restore from a backup.

Okay, don't panic. That's why we have backups. Company had Symantec System Recovery, nice full image of the server we can restore from. Time for Symantec to shine.

So open up the software. Last backup... 8 days ago. Waste of time, can't afford to lose a week's worth of emails. So I created a blank database on the new partition, and started backing up everyone's Outlook (fortunately they were all cached). Then went through the fun of importing all the emails back into the server. I used Stellar Exchange Recovery and managed to recover some of the emails that weren't cached (couple of people on leave).

Fortunately it was a small company, only around a dozen users. So by the end of the day we were pretty much there. Fortunately the customer was understanding and appreciated that we managed to recover almost all their email.

Moral of the story? Run a backup before you move the Exchange database! Also, initiate the move from the console of the server, not from terminal services.

Many more Exchange stories to come :D
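Those two checks are easy to write down before you touch anything. Here's a minimal Python sketch of the idea, not a real tool: the function names, the one-day freshness limit and the copy rate are all assumptions I've picked for illustration. You'd plug in the real last-backup time from your backup software and a copy speed you've actually measured.

```python
from datetime import datetime, timedelta


def backup_is_fresh(last_backup: datetime, now: datetime,
                    max_age: timedelta = timedelta(days=1)) -> bool:
    """True if the most recent backup is no older than max_age (assumed limit)."""
    return now - last_backup <= max_age


def move_outlasts_session(db_size_gb: float, copy_rate_gb_per_min: float,
                          session_timeout_min: float) -> bool:
    """True if the estimated copy time is longer than the remote session timeout."""
    return db_size_gb / copy_rate_gb_per_min > session_timeout_min


def pre_move_warnings(last_backup: datetime, now: datetime, db_size_gb: float,
                      copy_rate_gb_per_min: float, session_timeout_min: float) -> list:
    """Collect reasons NOT to start the database move right now."""
    warnings = []
    if not backup_is_fresh(last_backup, now):
        warnings.append("Last backup is too old: run a backup first.")
    if move_outlasts_session(db_size_gb, copy_rate_gb_per_min, session_timeout_min):
        warnings.append("Move may outlast the session: start it from the console.")
    return warnings
```

With a backup that's 8 days old and a 30 minute session timeout, both warnings would have fired before the move was ever started.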

Tuesday, 9 June 2015

Double the iOmega, double the fun!

Okay, two stories here for you guys. The first one I didn't get involved in, I only saw what happened afterwards. The second story is a situation I got lumbered with, which was the result of a greedy salesman (common factor unfortunately with these stories). Let's begin.

Story 1, Only backups?

We had a client where we sold them a new shiny server solution. I don't remember the spec of the server but they had SBS 2011 and were using Symantec System Recovery for backups. Not too shabby so far. Unfortunately, however, they were sold an iOmega ix2 home NAS for their backups! That's right, the salesperson obviously tried to find something cheap and sold the customer a NAS designed for home use!

You may be thinking, so what? I've had an iOmega for years and it hasn't let me down. That's fine for you maybe, but this is for a backup device! It's writing gigs and gigs of data every night and a full backup every week. These boxes are designed for one off storage that rarely changes, not to have the data re-written every month!

Anyway, back on topic. The backups started failing on the NAS. They were getting stuck and weren't finishing. Engineer goes to site for the monthly checkup. They're advised that the backups are failing and to have a look at the NAS. Engineer looks at the NAS, decides to do a factory reset and start again. Fair enough, that's something to try at least, and they have offsite copies on USB drives so we're all covered right?

Few days later, everything seems to be going well. Backups are working again, looks like the reset did the trick. Then we get a call from the customer. One of the users, (one of the important ones) reports that one of their network drives is disconnected. Engineer investigates, looks at the network drive and determines it's pointing to the NAS box.

Engineer calls the guy who was looking at the NAS to find out what happened. The conversation went something like this:

Eng1 - Hey, what's happening with the NAS for customer xyz?
Eng2 - The backups were failing, I had to perform a factory reset.
Eng1 - Did you check there was any data on there before you reset it?
Eng2 - It's only for backups, there's no user data on there.
Eng1 - But did YOU check before you reset it that there was no data on there?
Eng2 - No, there should only be backups on there.

The penny dropped at that point. It was clear there had been some data on that NAS, and it had been wiped. However, no one yet knew the scale of this loss.

The director took control of the case. He emailed the user explaining what had happened, and tried to shift some blame onto the contractor who was visiting the site a few months prior, as he suspected they had set up the network drive for the user.

The user was unimpressed to say the least. All they wanted was the data back. The director contacted the manufacturer and asked for advice on how to perform data recovery on the NAS, hoping that there might be a slim chance of retrieving some of the data.

The data recovery was run, but it transpired that the user data consisted of huge video files, several gigs in size. This was a massive problem in itself. Big files are much more vulnerable to damage after a format than smaller ones. Also the backups had been running again, which would have overwritten a lot of the data as well. 100% data loss in this case, all the files that were recovered came back corrupt.

Moral of the story? Never format something without checking the data first. You never know what someone has left on there. Even if they shouldn't have put it there, if the data belongs to someone important then you could be held accountable!
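The check Eng2 skipped only takes a few lines. Below is a rough Python sketch of it, under some assumptions: the extension list is what I'd expect Symantec recovery points to use and should be adjusted for your own backup product, and the function names are invented for the example. Anything it lists is a reason to stop and ask before wiping.

```python
from pathlib import Path, PurePath

# Assumed recovery-point extensions; change these to match your backup software.
BACKUP_EXTENSIONS = {".v2i", ".iv2i", ".sv2i"}


def non_backup_files(paths, backup_exts=BACKUP_EXTENSIONS):
    """Return the paths whose extension is not a known backup extension."""
    return [p for p in paths if PurePath(p).suffix.lower() not in backup_exts]


def scan_share(root):
    """List every file under `root` that does not look like a backup file."""
    files = [str(p) for p in Path(root).rglob("*") if p.is_file()]
    return non_backup_files(files)
```

Point `scan_share` at the mounted NAS share before the reset. If it comes back with someone's video files, you've just saved yourself a very awkward phone call.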

Story 2, Picture time!

This story has a happier ending. One of our clients was a private school. They had an extremely old server, with limited storage capacity. The same salesman had a lightbulb moment, server is full? Let's sell them a NAS and they can put some data on there. Much cheaper than buying server hard drives!

So that was that. They put all of their school photos on this NAS, yep another iOmega ix2. All was going well until one day, nobody could access the NAS. Phone call came in, can't access the X drive. Okay, checked the server, the X drive is stored on the NAS. I told them to reboot the NAS. Nothing happened. Can't ping it, can't browse to it, can't access over HTTP. Not good so far!

So we collect the NAS and bring it back to our offices. Plug it in and run the iOmega utility. Can see the NAS but that's about it, can't browse to it or see the data etc. Called iOmega support, that was hopeless, can't remember what their response was but it wasn't helpful! Customer wants their data back badly by this point.

So we take out one of the hard drives, and try to power it up. If one of them has failed then the device should work fine with the other. Nope, same result, both drives are unusable. Okay let's plug one of the drives into a PC. Okay, partitions are showing as RAW. They must be formatted in Linux and Windows doesn't understand the file system.

Wait, we have a Linux machine that someone built for testing. Let's hook up the hard drive to that. Boot up the machine, boom, the drive won't mount. It's part of a mirror and it's looking for its other half. Okay, let's plug in the other drive. Boom, machine won't boot. The other drive is completely shot and the PC can't even see it. Fortunately after some searching, we found a command that allows you to force Linux to mount the drive, even though the second one is missing.

I think the command was "mdadm --assemble /dev/md1 /dev/sdm1 --force"

So I looked in the file browser, woo hoo! All the data is there, copied it to a USB drive and took it back to the school. This was a lucky one, and just shows how domestic NAS boxes in a business environment are a false economy. We effectively lost a whole day's work (anyone in IT will know that a day rate for an engineer is not cheap) and of course, we sold the customer the NAS, and it was covered in the support contract. So we couldn't exactly charge them for the hours we put in recovering the data.

Moral of the story here? If you're going to use a NAS for storage, always have a backup. Also, if it's in a production environment, BUY A BUSINESS PRODUCT. QNAP, Synology, Netgear all have business grade NAS boxes that can go for years without so much as a drive failure!

Sunday, 7 June 2015

Dell RAID-5 Hard Drive Mishap

We had one customer where we inherited a budget Dell PowerEdge server. It was running SBS 2008, and underspecced. It had been running very slow ever since we took them on board (suspect an install gone wrong).

About a year or two before, one of the hard drives failed. No big deal, RAID-5 with 3 disks and a hot spare. Dell replaced the drive under warranty, no sweat. Everything is hunky dory, but that's not the end of it. Now here's the first problem. RAID-5 is a nice cheap way of getting maximum storage capacity with redundancy, however, people on low budgets use it with an all eggs in one basket attitude thinking that they're always covered.

Both the OS and the data were sitting on the RAID-5. What's wrong with that? Nothing, when it works. However, SBS runs Exchange, which loves to thrash the hard drives with its constant indexing and database maintenance etc. Not only that, but this company had decided to give all of its users roaming profiles and redirected documents, no wonder it was slow!

So now we have Exchange, regular data, roaming profiles and redirected documents running on one set of disks. But hey, it's working isn't it? For the moment yes. However, that was about to change.

About a year or two later, they started running out of disk space on the data partition. There were only 4 bays in the server for hard drives (cheap budget model). The engineer working on the case had a great idea, let's bring the hot spare into the array and we'll have another 250gig of space and no cost required. Seems simple enough, we still have protection from a disk failure so no problem, everybody's happy.

Then just over a week later. Early morning phone call, the same customer calls to say their server is down. Okay, don't panic, ask them to reboot via the power button, that usually does the trick. Nope, server is down. Server is unable to find an operating system and it's trying to boot from a CD. Okay, so remove any CDs and USB drives from the server, nope no change. There's a message when booting up which says the RAID is degraded, uh oh.

No remote fix for this one, it's a cheap server so no nice DRAC card to help us look remotely. Engineer attends the site, the very same one who was expanding the RAID only a week ago! He happened to be in the area, and was nice enough to pop in. He confirms the RAID is degraded, and he takes it upon himself to deem it's not recoverable, and deletes the RAID! Wow, bold move. Hope that backup last night worked OK. So he decides that he can't start the recovery onsite, I think there was no Symantec recovery disk available. He brings the server back to our office, and leaves it for one of the senior guys to try and bring a server back from the dead.

I was lucky enough to be nominated to try and fix the server. So the RAID has been recreated, okay let's pop in the recovery CD and restore from last night's backup. Restore goes through, says it's successful, great. Okay try to boot Windows, nope, blue screen of death. Okay, let's run startup repair, nope, still BSOD. Repair MBR and run SFC scan, all those helpful tools to revive a non-booting Windows. Nope, still blue screens, I think it was saying a file was missing or corrupt. Okay, let's try restoring just the OS from the previous backup, nope, same thing. So I tried 4 different backups and all gave the same result, something is not right here.

I looked at the backups on the USB drive more closely, it said the C drive backup was only 20gig in size. That seems a bit small for SBS 2008. Opened up the recovery point using the Symantec software on my own PC, says there's only 10gig used and 40gig free. No way does SBS 2008 have only 10gig used on a C drive, even without a page file a fresh install would still use more than that. When I opened the recovery point I was actually unable to browse any of the data! I could see the size of the backup, but couldn't see any data. This was the same for the other recovery points. I soon realised that these backups were simply corrupt and pretty much useless. Fortunately, however, the backup of the D drive was fully intact!
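That "10gig used looks wrong" gut check is the sort of thing you can automate against whatever sizes your backup software reports. A tiny Python sketch, assuming you've already pulled the used-space figure for each recovery point into a dictionary yourself; the function name, the recovery point names and the 15 GB floor are all placeholders I've chosen, not anything Symantec provides.

```python
def suspicious_recovery_points(used_gb_by_name: dict, min_expected_gb: float) -> list:
    """Return names of recovery points reporting less used space than expected.

    min_expected_gb is a floor you choose yourself for that volume
    (an assumption, e.g. roughly what a fresh OS install occupies).
    """
    return sorted(
        name for name, used_gb in used_gb_by_name.items()
        if used_gb < min_expected_gb
    )


if __name__ == "__main__":
    reported = {"C_drive_mon": 10.0, "C_drive_tue": 10.0, "D_drive_mon": 180.0}
    print(suspicious_recovery_points(reported, min_expected_gb=15.0))
```

A size check is no substitute for a test restore, but it would have flagged these backups long before anyone needed them.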

So the operating system was completely gone, and the customer had already been without a server for 2 days. The weekend was coming and we needed to make a decision. I took the server home with me and installed a fresh copy of SBS 2008 on the nicely formatted C partition. Gave it the same name and domain, recreated all the user accounts manually. Created a recovery storage group in Exchange and started recovering data. You get the idea, manually resetting all permissions. No ntds.dit or system state backup to save the day on this one. Saving grace was that all the user data and shared data was recoverable.

After spending a whole weekend rebuilding the server, I took it back to site with my colleague (this is Sunday evening now, around 6pm) and got to work rejoining all the computers to the domain. I used the dirty profile wizard software to keep the existing profiles and join them to the new domain. Then I needed to create a new Outlook profile on each machine as the mailboxes were technically fresh ones. There were between 10 and 15 machines, and between the two of us we managed to get everything up and running within around 3 and a half hours, which was not bad. There were still a few issues (AV out of sync, mailbox delegates missing and problems with Outlook indexing), but for the most part they were up and running and able to work again.

We were actually quite lucky on this one. Although the customer was down for almost a week, it could have been much worse. In the end they had little to no data loss, which was lucky as the backups had not been working correctly for a while. So why did the RAID fail? I suspect that 2 of the hard drives were already on the brink of failure, but not enough to actually fail the disk. Then when the RAID was expanded, it was simply too much and the RAID failed due to disk errors. Although I still think if the RAID hadn't been wiped out by the engineer that went to site, we may have been able to force the disks online and may have had a chance of replacing them.

A month or two after this event, one of the drives actually did fail and had to be replaced. Then a month later, another one failed. This, in my mind, suggests that faulty disks slowly killed the RAID and caused Symantec to produce corrupted backups of the operating system.

So could this have been avoided? Maybe, but I think it was a combination of bad luck and an underspecced, overworked server wearing out its hard drives.

If they had gone with a separate RAID for the OS and the data, then I think this never would have occurred. Or better yet, separate RAIDs for OS, data and Exchange, but this was a small company with a low budget. If you do take on a customer that has all their data on a single RAID-5, make them fully aware that they could lose everything and they need to have a robust backup system in place to truly protect their data.

This is definitely one of the worst server restores I've had to do. However, it was a learning curve and I'm glad I saw it through to the end. Looking back there are definitely things I would have done a little differently, but I think going from no server and everything lost, to a server with everything recovered, is not a bad result in itself.

Welcome

We've all been there, either as a user who's trying to finish before their deadline, or as the IT guy who was enjoying a quiet Tuesday afternoon, then oops, server goes down. IT guy's blood runs cold, as he knows it's only a matter of minutes before an angry mob will be outside his door demanding that he fixes it yesterday.

I've only been in IT for 7 years but I'm going to share a few of the horror stories that I have either witnessed or been right at the epicentre of. Nobody wants a disaster to happen but unfortunately it's the only way people can learn from their mistakes.

Perhaps you might avoid the same situations by reading some of the stories here.

I encourage everyone out there with similar experiences to post here so we can all share our tales of woe and wisdom.

Mr IT