IT Horror Stories

Sunday, 7 August 2016

2003 Mysterious D: Drive renegade

I have seen this happen a few times. On a 2003 server you wake up one morning and are greeted by the dreaded "This drive is not formatted, would you like to format drive D:".

You normally only see this when you have a faulty USB stick, or a new hard drive that actually does need formatting. However, when you see it on a live server, on their main data partition, any IT engineer's heart will sink.

This issue usually occurs when you have a server that is over 5 years old and due for replacement. It usually indicates that the drives are slowly failing and getting hard write errors and/or bad sectors.

I'll start with an example which actually happened after this incident. I once had one of my colleagues call me up in the late evening, in quite a panic. The backup on the server had been failing for about a week, it was getting stuck at a certain point and not finishing.

He tried rebooting the server to see if that would clear the issue. However, when the server came back on, something was different. The D drive was no longer accessible. It was present in 'My Computer', however, when you tried to open it, it came back with 'Drive D not formatted, etc etc, do you want to format?'.

This was a big problem, as the backup was a week old and this was their main application server. I managed to find a nice bit of software called 'Partition Table Doctor', which scans disks for missing or corrupt partition information. If you search for that software now, you will find it's been bought out and you need to pay for it. However, the version I found was a nice old freeware copy.

It found the partition within seconds and was able to recover it. After rebooting, it was still there and we were even able to run the backup again. A tense situation nicely defused!

However, they had lost some data. Data that they were accessing recently. I think the software maker was able to help them out in the end. However, we had done our part. We had made the best of the situation.

Now, back to the original story. This one was not a quick fix, it was a much more time consuming recovery.

Interestingly, the situations were very similar. This client also had not had a backup for about a week. The engineer that picked up the case presumed the drive was unrecoverable, so they loaded up the classic 'Get Data Back for NTFS', which was the best data recovery software at the time.

They immediately started running recovery scans on the now RAW partition and then began copying recovered data to a USB drive, which is slow on USB2!

So the client had already lost a day at this point. The D drive contained the Exchange data as well as their company files and accounting data, so they were basically just left with an operating system!

After data recovery had finished, they had to rebuild the shared folders and set up the permissions again. However, after copying the Exchange database back, it refused to mount, moaning about lost log files. So once again, the same engineer deemed it a lost cause and claimed the database would have to be created from scratch and email would need to be recovered from OST files.

So at this point, I think it was the third day. They sent me to site, to finish things off. I had the thrilling task of going round everyone's machines, opening up their cached outlook profile and backing it up to a PST file. Then creating a new profile, linking up to the new exchange database and importing the data.

I think all of their mail had been collecting in a catch all POP mailbox. So the POP3 connector was going to have a busy night bringing down 2 days of email for all the users!

Their accounting database also wasn't working. I think the .QBW database file was OK, but the .TLG (transaction log) was missing as it had not been successfully recovered, so it refused to open.

So, without any prior knowledge, I copied the .TLG file from the 7 day old backup into the same folder as the current .QBW database, which then allowed them to open the database successfully. I know now you aren't supposed to do that; however, it was still significantly better than losing a week's worth of transactions. The customer was delighted, so I would chalk it up to luck on that one.

Then there was a lot of mopping up to do. First the file permissions: users could not save or edit files as some areas still had the default permissions assigned.

Then there was restoring files from the backup, where the initial file recovery had either missed things or recovered corrupted files.

All in all, a very time consuming process, and I feel with a bit more investigation and prior knowledge, this would have been a much quicker recovery. Even if recovering the partition wasn't possible, then they still may have been able to soft repair the exchange database. Also recovering the permissions should have been possible from the backup, using ICACLS. However, the circumstances and the fact that it's 2003 may not have allowed this. So these are merely observations rather than criticisms as such.

Moral of the story? Again it comes down to backups and making sure they are monitored. Also a quick and brutal decision to format the drive can lead to a longer and more painful recovery process. Although the steps taken were time consuming, they did recover most of the data. So in the end, not a bad result.
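Since both incidents came down to a backup that had been quietly failing for a week, here is a minimal sketch of the kind of freshness check that would have raised the alarm on day one. It is only an illustration: the function name, the one-day threshold and the example dates are my own assumptions, and in practice the timestamp of the last good run would come from whatever backup software is in use.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_success, now, max_age=timedelta(days=1)):
    """Return True when the last successful backup is older than max_age."""
    return now - last_success > max_age


# Hypothetical example: the last good backup finished a week before the check.
now = datetime(2016, 8, 7, 9, 0)
last_good = now - timedelta(days=7)

if backup_is_stale(last_good, now):
    print("ALERT: no successful backup in the last day")
```

Run something like this every morning and a week of failed backups becomes a one-day problem instead.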

Saturday, 9 April 2016

No RAID + No Good Backup = Disaster

We sold a customer an entry level server (ML 150 G5 I think it was) to run SBS 2008 and Freight software with only 4GB of RAM! There was no proper hardware RAID so they ended up with 2 SATA drives in RAID-1.

It was problematic from the start. First off, SBS 2008 runs Exchange and SQL out of the box. Both memory hungry and hard drive thrashers. As soon as they tried to introduce the freight software, which is also SQL based, the whole thing just ground to a halt.

We then had to donate a server to them to run the freight software, as we (or rather the salesman) had told them that the server would be powerful enough for their needs.

The customer still complained that their systems were very slow, in particular the internet. Then the salesman decided to sell them new computers to fix the problem. Of course, it made no difference as it was down to the server being so overworked and actually being too slow to respond to DNS queries!

I was a first level engineer at the time. So in my wisdom, I added some extra DNS servers into DHCP so that the clients could run their DNS queries against the internet as well as the server, this sped up the internet nicely. Only problem was they would randomly lose connection to the server as sometimes it would resolve the internal name of the server against the wrong DNS server.

So that was scrapped and it was back to forwarders. I did manage to speed things up though, by limiting the SQL memory so the Windows Internal Database basically had no power, and also limiting the memory Exchange could steal.

We did eventually put more RAM in the server (took a couple of visits as they forgot to bring the RAM the first time!). The server was better after that, but it was still never fit for purpose.

After 2-3 years, we decided to sell them some shiny new Symantec System Recovery software and a NAS to replace their failing tape drive backup solution. The engineer (not me!) installed the NAS and got it set up. He then began installing Symantec, but it needed a reboot to finish off the installation.

I think the reboot did eventually happen a week later, but the software was never set up. Then the emails stopped working. I discovered that the Exchange SMTP queue file was corrupt, so I recreated it. Then it was fine for the rest of the day.

Then at about 5pm, the server goes offline. Customer reboots and it starts to do a diskcheck, never a good sign. Rebooted it again to bypass the diskcheck and it refused to come back on. It went into a reboot loop. Engineer goes to site and collects the server as nothing more could be done.

So we get the server, no decent backup and not booting. Plugged the hard drives into our test machine to see if they were still usable. Both of them could be read, however, one of the drives only had data from 2 years ago! So it seems the RAID had stopped working ages ago and the server had been running on one hard drive for years, now it's finally starting to fail.

We were able to get the server booted, but only in Directory Services Restore mode. So Active Directory was corrupt. By this time I was a third level engineer, but I had never had to deal with this kind of problem before. The server had been there for 5 days now. So I rolled up my sleeves, got some Maccy D's breakfast and got to work. I found an article which gave a guide on how to repair the ntds.dit file. To my surprise this actually worked and the server booted. However, Exchange was bolloxed, the database would not mount so we still weren't in great shape.

We managed to source another hard drive to get the RAID started, so at least we would have a decent hard drive in there. Then with some amazing luck, we found that there was a tape backup of exchange, as they were still changing the tapes and there was a backup a few days before the crash.

So I restored the exchange backup from the tape, and the server was finally working again. I took it back to site, plugged it in and we were back online. Fortunately they had their email cached in outlook, so that filled in the gaps where the backup was a few days behind. Also they had hosted email security which held the emails for 7 days in the cloud and they didn't lose those either (server had been gone for 6 days).

So we just about got away with that, but it was pretty poor on our part. If the backup had been set up within a reasonable time then it would have been much more helpful and wouldn't have taken so long to get the server back. Just shows a decent RAID can make all the difference and a good backup is always essential.

Wednesday, 17 June 2015

When RAID-5 becomes RAID-Fuck All

This one starts out as any normal server issue. Engineer attends site to do a routine checkup. He finds that one of the hard drives is predicting a failure. No big deal, alert the account manager to order a replacement.

Engineer goes to site again the next day, again still routine. I think he was covering for their usual IT guy there. Turns up to site, and the server is down. Engineer checks the drives, instead of one predicting a failure, there's now two drives that have totally failed.

Server has a RAID-5 so can't survive two drives failing (who needs a hot spare anyway!) and the array has been disabled.

Tried to talk the engineer into forcing the drives online, but he couldn't figure it out. Okay, so let's take a step back and review the situation.

Is the server important? Mostly no, but for some reason it was running a virtual machine that contains some ancient sales access database. Unfortunately this ancient access database was very important! Typical, something extremely old that should have been decommissioned years ago but the client doesn't want to let go of it!

Okay, do we have a backup of the server? Nope. Well, looks like we REALLY need to fix it now!

He brought the server back to the offices and gave it to me. So I jumped into the RAID utility and started looking. First of all it's a Dell with the usual PERC RAID card, not a great start....

Worse still, there are 8 drives in the RAID-5! No wonder two had failed at the same time. Having 8 drives in a RAID-5 significantly increases the chances of a double failure as the drives age. Why not use one of the drives as a hot spare? Obviously the engineer who built the server was greedy for hard drive space.
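To put a rough number on why a wide RAID-5 is riskier, here is a back-of-the-envelope sketch. It assumes that, once one drive has died, each surviving drive fails independently with the same probability before the rebuild completes. That is a simplification (drives of the same age and batch tend to fail together, which makes things worse), and the 2% figure is purely an illustrative guess, not something measured on this server.

```python
def p_second_failure(total_drives, p_drive):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes each surviving drive fails independently with probability
    p_drive during the rebuild window (a simplifying assumption).
    """
    survivors = total_drives - 1  # one drive has already failed
    return 1 - (1 - p_drive) ** survivors


# p_drive = 0.02 is an illustrative guess, not a measured figure.
for drives in (4, 8):
    print(drives, "drives:", round(p_second_failure(drives, 0.02), 3))
```

With those made-up numbers the risk of losing the whole array goes from roughly 6% on four drives to roughly 13% on eight, and RAID-5 cannot survive that second failure.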

So, I managed to force the drives online and get the server booted. Anyone who's done this themselves will know, it's like winning the lottery. You've gone from everything is gone, to everything is back!

Okay, so open up the Dell OpenManage homepage. Shows that one drive is rebuilding and the other is predicting a failure still. Then the boss appears, producing some questionable replacement hard drives. So I thought to myself, why wait for this drive to rebuild? It's only going to put stress on the rest of the drives. Better to just swap it out now and let it start rebuilding with a good drive.

There's a reason patience is a virtue.

I checked the numbers to confirm which faulty drive to remove and yanked it out (yes they were hot swap). I can't remember if I misread the numbers or they didn't match with the dell software. Either way, I fucked up. I removed the wrong drive, and the server rebooted.

So back into the RAID utility, force the drives online again. Now to shut down the server, and replace the CORRECT drive. This time I just leave it in the RAID utility to rebuild.

Once the rebuild is complete, I replaced the other drive and let it rebuild again.

So we finally have a healthy RAID. I boot up the server, let's check out the damage. Uh oh, windows wants to run chkdsk. Not a good sign. I cancel it straight away, and let it continue booting to windows.

Okay, let's try and boot up the virtual machine. It's the only valuable thing on the server.
Boom, VMDK file is corrupt. That's not good. Shadow copy? nope nothing can save this one. I think I tried some vmware tools but the VM was toast.

Then I had a stroke of luck. I found the access database which was stored on the PDC! Someone had the foresight to move the database to a safe location at some point. Unfortunately, the database was so old, it won't work on the shiny new versions of Office. The virtual machine had office 2000 on there and was running terminal services so people could hop on and use it as they please.

Then I found a REALLY old backup of the virtual machine. We're talking like 5 years old, but I figured it hasn't changed much. So I built a new virtual machine on another server, and set about rebuilding the 2000 server.

After spending an hour or two getting a Windows 2000 ISO, I installed Windows 2000 server edition. When that was done, I fired up NT Backup and began the restore. This was actually the first time I used NT Backup to restore a server, in the past I've always used Symantec products.

Once the restore was complete, I looked over the machine to see what was what. Well, there was a local copy of the database that was older than Peter Sallis. So I needed to get a network drive mapped, and change the terminal services config so it would open the database from the server.

To top it off, I needed to reinstall all the printers as the ones on there were all defunct. One of them really did not want to install, windows 2000 doesn't have the best driver support these days.

This was a very painful job, but I deserved it. It taught me a valuable lesson. When you're dealing with a broken RAID, note the serial number on the drive you want to swap. Then shut down the server and THEN change it. Especially if you don't have a backup!

The server also had GFI Mailarchiver, so I then had to reinstall the software and re-index all the databases. The databases were stored on a NAS box, so fortunately they weren't lost.

The previous IT guy who looked after the site made a lucky escape. The whole system was hanging by a thread for years and he managed to avoid a disaster.

A few days after rebuilding the server, both of the replacement drives came back as predicting failures. Seems those hard drives the boss gave me were bad, and they needed to be replaced again. I left the company shortly after that, I don't miss them one bit! There was so much dodgy hardware in that place it was unbelievable.

Moral of the story? Do backups! Backups backups backups. Also, if a client has an unstable network, make sure everything you do for them is billed by the hour!

Luck was on my side that day, and I got some valuable experience. I hope people reading this might avoid a similar mishap. Advice here may seem obvious, but we all make mistakes. No matter how good you are.

Friday, 12 June 2015

Exchange 2003, where did my database go?

It's a classic story, an old SBS 2003 server is running low on space. It's 2013 so they need to buy a new server anyway. Then the usual, no IT budget means no server. Just stick an extra hard drive in and jobs a good'n.

Well, at least it's better than buying a NAS and using it as a second server :)

So, hard drives are in. RAID-5 has been extended and we have an extra 300GB of space. We create a new partition for Exchange, now it's just a matter of moving the database and we're good to go. Piece of cake.

Exchange 2003 makes it fairly painless to move the database, you just edit the location, point it where you want the database to sit, and windows does the rest. It takes a while so just leave it running and go make a coffee.

Unfortunately, the engineer who was performing the move overlooked the fact that the terminal server session times out after 30 minutes. Easy thing to miss, but it would have dire consequences.

I'm still not quite sure how it happened. The engineer told me he kicked off the move, left it running for a while, came back to it and the session had been disconnected. He logged back in, move had failed. Then I think he tried it again and it failed a second time.

After that, nothing. Exchange store wouldn't mount, engineer asked me to have a look the next morning. Checked the event logs, found the following error:

Information Store (15376) First Storage Group: An attempt to write to the file "E:\Program Files\Exchsrvr\rmdbdata\pub1.edb" at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes failed after 0 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup.

That's not good. Looking at the server, there were now two databases! The original one and another copy on the new partition. The database in the new partition didn't work, but worse still, the original refused to mount as well!

Okay well, the original database should still be intact right? Surely it will have copied the database to the new location and then removed the original once the copy has completed? I honestly don't know, but it went wrong somehow.

I ran eseutil, to try and repair the database. I ran a soft repair, which does a sweep of the database and replays the transaction logs to bring it into a clean state. This failed, can't remember the exact error, database is corrupt. I think I even called microsoft and opened an emergency case. They confirmed my fears, and basically said "yeah it's fucked, and don't bother doing a hard repair, it won't be worth the trouble". So the only course of action is restore from a backup.

Okay, don't panic. That's why we have backups. Company had Symantec System Recovery, nice full image of the server we can restore from. Time for Symantec to shine.

So open up the software. Last backup... 8 days ago. Waste of time, can't afford to lose a week's worth of emails. So I created a blank database on the new partition, and started backing up everyone's Outlook (fortunately they were all cached). Then went through the fun of importing all the emails back into the server. I used Stellar Exchange Recovery and managed to recover some of the emails that weren't cached (couple of people on leave).

Fortunately it was a small company, only around a dozen users. So by the end of the day we were pretty much there. Fortunately the customer was understanding and appreciated that we managed to recover almost all their email.

Moral of the story? Run a backup before you move the exchange database! Also, initiate the move from the console of the server, not from terminal services.

Many more exchange stories to come :D

Tuesday, 9 June 2015

Double the iOmega, double the fun!

Okay, two stories here for you guys. The first one I didn't get involved in, I only saw what happened afterwards. The second story is a situation I got lumbered with, which was the result of a greedy salesman (common factor unfortunately with these stories). Let's begin.

Story 1, Only backups?

We had a client where we sold them a new shiny server solution. I don't remember the spec of the server but they had SBS 2011 and were using Symantec System Recovery for backups. Not too shabby so far. Unfortunately, however, they were sold an iOmega ix2 home NAS for their backups! That's right, the salesperson obviously tried to find something cheap and sold the customer a NAS designed for home use!

You may be thinking, so what? I've had an iOmega for years and it hasn't let me down. That's fine for you maybe, but this is for a backup device! It's writing gigs and gigs of data every night and a full backup every week. These boxes are designed for one off storage that rarely changes, not to have the data re-written every month!

Anyway, back on topic. The backups started failing on the NAS. They were getting stuck and weren't finishing. Engineer goes to site for the monthly checkup. They're advised that the backups are failing and to have a look at the NAS. Engineer looks at the NAS, decides to do a factory reset and start again. Fair enough, that's something to try at least, and they have offsite copies on USB drives so we're all covered right?

Few days later, everything seems to be going well. Backups are working again, looks like the reset did the trick. Then we get a call from the customer. One of the users, (one of the important ones) reports that one of their network drives is disconnected. Engineer investigates, looks at the network drive and determines it's pointing to the NAS box.

Engineer calls the guy who was looking at the NAS to find out what happened. The conversation went something like this:

Eng1 - Hey, what's happening with the NAS for customer xyz?
Eng2 - The backups were failing, I had to perform a factory reset.
Eng1 - Did you check there was any data on there before you reset it?
Eng2 - It's only for backups, there's no user data on there.
Eng1 - But did YOU check before you reset it that there was no data on there?
Eng2 - No, there should only be backups on there.

The penny dropped at that point. It was clear there had been some data on that NAS, and it had been wiped. However, no one yet knew the scale of this loss.

The director took control of the case. He emailed the user explaining what had happened, and tried to shift some blame onto the contractor who was visiting the site a few months prior, as he suspected they had setup the network drive for the user.

The user was unimpressed to say the least. All they wanted was the data back. The director contacted the manufacturer and asked for advice on how to perform data recovery on the NAS, hoping that there might be a slim chance of retrieving some of the data.

The data recovery was run, but it transpired that the user data consisted of huge video files, several gigs in size. This was a massive problem in itself. Big files are much more vulnerable to damage after a format than smaller ones. Also the backups had been running again which would have overwritten a lot of the data as well. 100% data loss in this case, all the files that were recovered came back corrupt.
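Why do big files fare so much worse? A toy model makes the point. Assume, purely for illustration, that the new backups overwrote a random fraction of the disk's clusters, each one independently, and that a file only comes back intact if none of its clusters were touched. Real overwrites are not random like this, and the 4 KiB cluster size and 0.1% overwrite figure are my assumptions rather than anything taken from the NAS, so treat it as a sketch of the reasoning only.

```python
CLUSTER_BYTES = 4096  # assumed cluster size, not taken from the NAS


def clusters_needed(file_bytes):
    """Number of clusters a file of this size occupies (rounded up)."""
    return -(-file_bytes // CLUSTER_BYTES)  # ceiling division


def p_file_intact(file_bytes, overwritten_fraction):
    """Chance that none of the file's clusters were overwritten.

    Toy model: every cluster is overwritten independently with the
    same probability, which is a simplification.
    """
    return (1 - overwritten_fraction) ** clusters_needed(file_bytes)


# Hypothetical example: 0.1% of the disk's clusters have been overwritten.
small = p_file_intact(100 * 1024, 0.001)   # a 100 KiB document
large = p_file_intact(2 * 1024**3, 0.001)  # a 2 GiB video
print(f"100 KiB document: {small:.3f}")
print(f"2 GiB video: {large:.3g}")
```

Even with a tiny fraction of the disk overwritten, the small document almost certainly survives while the chance of the multi-gigabyte video coming back clean is effectively zero.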

Moral of the story? Never format something without checking the data first. You never know what someone has left on there. Even if they shouldn't have put it there, if the data belongs to someone important then you could be held accountable!

Story 2, Picture time!

This story has a happier ending. One of our clients was a private school. They had an extremely old server, with limited storage capacity. The same salesman had a lightbulb moment, server is full? Let's sell them a NAS and they can put some data on there. Much cheaper than buying server hard drives!

So that was that. They put all of their school photos on this NAS, yep another iOmega ix2. All was going well until one day, nobody could access the NAS. Phone call came in, can't access the X drive. Okay, checked the server, the X drive is stored on the NAS. I told them to reboot the NAS. Nothing happened. Can't ping it, can't browse to it, can't access over HTTP. Not good so far!

So we collect the NAS and bring it back to our offices. Plug it in and run the iOmega utility. Can see the NAS but that's about it, can't browse to it or see the data etc. Called iOmega support, that was hopeless, can't remember what their response was but it wasn't helpful! Customer wants their data back badly by this point.

So we take out one of the hard drives, and try to power it up. If one of them has failed then the device should work fine with the other. Nope, same result, both drives are unusable. Okay let's plug one of the drives into a PC. Okay, partitions are showing as RAW. They must be formatted in Linux and Windows doesn't understand the file system.

Wait, we have a Linux machine that someone built for testing. Let's hook up the hard drive to that. Boot up the machine, boom, the drive won't mount. It's part of a mirror and it's looking for its other half. Okay, let's plug in the other drive. Boom, machine won't boot. The other drive is completely shot and the PC can't even see it. Fortunately after some searching, we found a command that allows you to force Linux to mount the drive, even though the second one is missing.

I think the command was "mdadm --assemble /dev/md1 /dev/sdm1 --force"

So I looked in the file browser, woo hoo! All the data is there. I copied it to a USB drive and took it back to the school. This was a lucky one, and just shows how domestic NASes in a business environment are a false economy. We effectively lost a whole day's work (anyone in IT will know that a day rate for an engineer is not cheap) and of course, we sold the customer the NAS, and it was covered in the support contract. So we couldn't exactly charge them for the hours we put in recovering the data.

Moral of the story here? If you're going to use a NAS for storage, always have a backup. Also, if it's in a production environment, BUY A BUSINESS PRODUCT. QNAP, Synology, Netgear all have business grade NAS boxes that can go for years without so much as a drive failure!

Sunday, 7 June 2015

Dell RAID-5 Hard Drive Mishap

We had one customer where we inherited a budget Dell PowerEdge server. It was running SBS 2008, and underspecced. It had been running very slowly ever since we took them onboard (suspect an install gone wrong).

About a year or two before, one of the hard drives failed. No big deal, RAID-5 with 3 disks and a hot spare. Dell replaced the drive under warranty, no sweat. Everything is hunky dory, but that's not the end of it. Now here's the first problem. RAID-5 is a nice cheap way of getting maximum storage capacity with redundancy, however, people on low budgets use it with an all eggs in one basket attitude thinking that they're always covered.

Both the OS and the data were sitting on the RAID-5. What's wrong with that? Nothing, when it works. However, SBS runs Exchange, which loves to thrash the hard drives with its constant indexing and database maintenance etc. Not only that, but this company had decided to give all of its users roaming profiles and redirected documents, no wonder it was slow!

So now we have, exchange, regular data, roaming profiles and redirected documents running on one set of disks. But hey, it's working isn't it? For the moment yes. However, that was about to change.

About a year or two later, they started running out of disk space on the data partition. There were only 4 bays in the server for hard drives (cheap budget model). The engineer working on the case had a great idea, let's bring the hot spare into the array and we'll have another 250gig of space and no cost required. Seems simple enough, we still have protection from a disk failure so no problem, everybody's happy.
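The trade being made here is easy to see with a bit of arithmetic. RAID-5 gives you the capacity of all the disks in the array minus one for parity. The sketch below assumes 250GB disks, which I am inferring from the 250gig gained by adding a single disk, and the function name is just for illustration.

```python
def raid5_usable_gb(disks_in_array, disk_gb):
    """Usable capacity of a RAID-5 array: one disk's worth goes to parity."""
    if disks_in_array < 3:
        raise ValueError("RAID-5 needs at least three disks")
    return (disks_in_array - 1) * disk_gb


DISK_GB = 250  # assumed disk size, inferred from the 250gig gained
BAYS = 4       # the server only had four drive bays

for in_array in (3, 4):
    spares = BAYS - in_array
    usable = raid5_usable_gb(in_array, DISK_GB)
    print(f"{in_array} disks in array, {spares} hot spare(s): {usable} GB usable")
```

So the array goes from 500GB to 750GB usable, but the hot spare count goes from one to zero. The array still survives a single disk failure either way, but without a spare the rebuild cannot start until someone physically fits a replacement.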

Then just over a week later. Early morning phone call, the same customer calls to say their server is down. Okay, don't panic, ask them to reboot via the power button, that usually does the trick. Nope, server is down. Server is unable to find an operating system and it's trying to boot from a CD. Okay, so remove any CDs and USB drives from the server, nope no change. There's a message when booting up which says the RAID is degraded, uh oh.

No remote fix for this one, it's a cheap server so no nice DRAC card to help us look remotely. Engineer attends the site, the very same one who was expanding the RAID only a week ago! He happened to be in the area, and was nice enough to pop in. He confirms the RAID is degraded, and he takes it upon himself to deem it's not recoverable, and deletes the RAID! Wow, bold move. Hope that backup last night worked OK. So he decides that he can't start the recovery onsite, I think there was no symantec recovery disk available. He brings the server back to our office, and leaves it for one of the senior guys to try and bring a server back from the dead.

I was lucky enough to be nominated to try and fix the server. So the RAID has been recreated, okay let's pop in the recovery CD and restore from last night's backup. Restore goes through, says it's successful, great. Okay try to boot Windows, nope blue screen of death. Okay, let's run startup repair, nope still BSOD. Repair MBR and run SFC scan, all those helpful tools to revive a non booting Windows. Nope still blue screens, I think it was saying a file was missing or corrupt. Okay, let's try restoring just the OS from the previous backup, nope same thing. So I tried 4 different backups, all with the same result. Something is not right here.

I looked at the backups on the USB drive more closely, it said the C drive backup was only 20gig in size. That seems a bit small for SBS 2008. Opened up the recovery point using the symantec software on my own PC, says there's only 10gig used and 40gig free. No way does SBS 2008 have only 10gig used on a C drive, even without a page file a fresh install would still use more than that. When I opened the recovery point I was actually unable to browse any of the data! Only actually being able to see the size of the backup, but couldn't see any data. This was the same for the other recovery points. I soon realised that these backups are simply corrupt and were pretty much useless, fortunately, however, the backup of the D drive was fully intact!

So the operating system was completely gone, and the customer had already been without a server for 2 days. The weekend was coming and we needed to make a decision. I took the server home with me and installed a fresh copy of SBS 2008 on the nicely formatted C partition. Gave it the same name and domain, recreated all the user accounts manually. Created a recovery storage group in Exchange and started recovering data. You get the idea, manually resetting all permissions. No ntds.dit or system state backup to save the day on this one. Saving grace was that all the user data and shared data was recoverable.

After spending a whole weekend rebuilding the server, I took it back to site with my colleague (this is Sunday evening now, around 6pm) and got to work rejoining all the computers to the domain. I used the dirty profile wizard software to keep the existing profiles and join them to the new domain. Then I needed to create a new Outlook profile on each machine as the mailboxes were technically fresh ones. There were between 10 and 15 machines and between the two of us we managed to get everything up and running within around 3 and a half hours, which was not bad. There were still a few issues (AV out of sync, mailbox delegates missing and problems with Outlook indexing), but for the most part they were up and running and able to work again.

We were actually quite lucky on this one. Although the customer was down for almost a week, it could have been much worse. In the end they had little to no data loss, which was lucky as the backups had not been working correctly for a while. So why did the RAID fail? I suspect that 2 of the hard drives were already on the brink of failure, but not enough to actually fail the disk. Then when the RAID was expanded, it was simply too much and the RAID failed due to disk errors. I still think that if the RAID hadn't been wiped out by the engineer that went to site, we may have been able to force the disks online and may have had a chance of replacing them.

A month or two after this event, one of the drives actually did fail and had to be replaced. Then a month later, another one failed. This, in my mind, proves that faulty disks slowly killed the RAID and caused Symantec to produce corrupted backups of the operating system.

So could this have been avoided? Maybe, but I think it was a combination of bad luck and an underspecced, overworked server wearing out its hard drives.

If they had gone with a separate RAID for the OS and the data, then I think this never would have occurred. Or better yet, separate RAIDs for OS, data and Exchange, but this was a small company with a low budget. If you do take on a customer that has all their data on a single RAID-5, make them fully aware that they could lose everything and they need to have a robust backup system in place to truly protect their data.

This is definitely one of the worst server restores I've had to do. However, it was a learning curve and I'm glad I saw it through to the end. Looking back there's definitely things I would have done a little differently but I think going from no server everything lost, to server with everything recovered is not a bad result in itself.

Welcome

We've all been there, either as a user who's trying to finish before their deadline. Or the IT guy who was enjoying a quiet Tuesday afternoon, then oops, server goes down. IT guy's blood runs cold, as he knows it's only a matter of minutes before an angry mob will be outside his door demanding that he fixes it yesterday.

I've only been in IT for 7 years but I'm going to share a few of my horror stories that I have either witnessed or been right in the epicentre of. Nobody wants a disaster to happen but unfortunately it's the only way people can learn from their mistakes.

Perhaps you might avoid the same situations by reading some of the stories here.

I encourage everyone out there with similar experiences to post here so we can all share our tales of woe and wisdom.

Mr IT