Collection of my own IT Horror Stories. Feel free to post your own stories in the comments.
Sunday, 7 August 2016
2003 Mysterious D: Drive renegade
I have seen this happen a few times. You wake up one morning and a 2003 server greets you with the dreaded "This drive is not formatted, would you like to format drive D:".
You normally only see this with a faulty USB stick, or a new hard drive that actually does need formatting. When you see it on a live server, on its main data partition, any IT engineer's heart will sink.
This issue usually occurs on a server that is over 5 years old and due for replacement. It usually indicates that the drives are slowly failing and getting hard write errors and/or bad sectors.
I'll start with an example which actually happened after this incident. One of my colleagues once called me late in the evening, in quite a panic. The backup on the server had been failing for about a week; it was getting stuck at a certain point and not finishing.
He tried rebooting the server to see if that would clear the issue. However, when the server came back on, something was different. The D drive was no longer accessible. It was present in 'My Computer', but when you tried to open it, it came back with 'Drive D not formatted, etc etc, do you want to format?'.
This was a big problem, as the backup was a week old and this was their main application server. I managed to find a nice bit of software called 'Partition Table Doctor', which scans disks for missing or corrupt partition information. If you search for that software now, you will find it's been bought out and you need to pay for it. However, the version I found was a nice old freeware copy.
It found the partition within seconds and was able to recover it. After rebooting, it was still there and we were even able to run the backup again. A tense situation nicely defused!
However, they had lost some data. Data that they were accessing recently. I think the software maker was able to help them out in the end. However, we had done our part. We had made the best of the situation.
Now, back to the original story. This one was not a quick fix; it was a much more time-consuming recovery.
Interestingly, the situations were very similar. This client also had not had a backup for about a week. The engineer that picked up the case presumed the drive was unrecoverable, so they loaded up the classic 'GetDataBack for NTFS', which was the best data recovery software at the time.
They immediately started running recovery scans on the now RAW partition and then began copying recovered data to a USB drive, which is slow on USB 2!
So the client had already lost a day at this point. The D drive contained the Exchange data as well as their company files and accounting data, so they were basically left with just an operating system!
After data recovery had finished, they had to rebuild the shared folders and set up the permissions again. However, after the Exchange database was copied back, it refused to mount, moaning about lost log files. So once again, the same engineer deemed it a lost cause and claimed the database would have to be created from scratch and email would need to be recovered from OST files.
So at this point (I think it was the third day), they sent me to site to finish things off. I had the thrilling task of going round everyone's machines, opening up their cached Outlook profile and backing it up to a PST file, then creating a new profile, linking it to the new Exchange database and importing the data.
I think all of their mail had been collecting in a catch-all POP mailbox, so the POP3 connector was going to have a busy night bringing down 2 days of email for all the users!
Their accounting database also wasn't working. I think the .QBW database file was OK, but the .TLG (transaction log) file was missing as it had not been successfully recovered, so the database refused to open.
So, without any prior knowledge, I copied the .TLG file from the 7-day-old backup into the same folder as the current .QBW database, which then allowed them to open the database successfully. I now know you aren't supposed to do that; however, it was still significantly better than losing a week's worth of transactions. The customer was delighted, so I would chalk that one up to luck.
Then there was a lot of mopping up to do. First, the file permissions: users could not save or edit files, as some areas still had the default permissions assigned.
Then there was restoring files from the backup, where the initial file recovery had either missed things or recovered corrupted files.
All in all, a very time-consuming process, and I feel that with a bit more investigation and prior knowledge, this would have been a much quicker recovery. Even if recovering the partition wasn't possible, they still may have been able to soft repair the Exchange database. Recovering the permissions should also have been possible from the backup, using ICACLS. However, the circumstances and the fact that it's 2003 may not have allowed this. So these are merely observations rather than criticisms as such.
Moral of the story? Again, it comes down to backups and making sure they are monitored. Also, a quick and brutal decision to format the drive can lead to a longer and more painful recovery process. That said, although the steps taken were time-consuming, they did recover most of the data. So in the end, not a bad result.
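Monitoring backups doesn't have to be fancy. As a rough sketch only (the function names, the one-day threshold and the idea of judging a backup job by its newest file are all my own assumptions for illustration, not what either client had in place), a daily check along these lines would have flagged a week of failed backups on the first day:

```python
import os
from datetime import datetime, timedelta
from typing import Optional


def newest_backup_time(backup_dir: str) -> Optional[datetime]:
    """Return the modification time of the newest file in backup_dir.

    Returns None if the folder contains no files at all.
    backup_dir is a hypothetical folder where the backup job writes its output.
    """
    times = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file()
    ]
    return datetime.fromtimestamp(max(times)) if times else None


def backup_is_stale(
    last_backup: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=1),  # assumed threshold, adjust to the schedule
) -> bool:
    """A backup is stale if there is none, or the newest one is older than max_age."""
    if last_backup is None:
        return True
    return now - last_backup > max_age
```

You would run something like `backup_is_stale(newest_backup_time(path), datetime.now())` from a scheduled task and send an alert when it returns True. It only proves that a file was written recently, not that the backup can be restored, so it complements test restores rather than replacing them.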
Saturday, 9 April 2016
No RAID + No Good Backup = Disaster
We sold a customer an entry-level server (an ML 150 G5, I think it was) to run SBS 2008 and freight software with only 4GB of RAM! There was no proper hardware RAID, so they ended up with 2 SATA drives in RAID-1.
It was problematic from the start. First off, SBS 2008 runs Exchange and SQL out of the box, both of which are memory hungry and hard drive thrashers. As soon as they tried to introduce the freight software, which is also SQL based, the whole thing just ground to a halt.
We then had to donate a server to them to run the freight software, as we (or rather, the salesman) had told them that the server would be powerful enough for their needs.
The customer still complained that their systems were very slow, in particular the internet. Then the salesman decided to sell them new computers to fix the problem. Of course, it made no difference as it was down to the server being so overworked and actually being too slow to respond to DNS queries!
I was a first-level engineer at the time. So, in my wisdom, I added some extra DNS servers into DHCP so that the clients could run their DNS queries against the internet as well as the server. This sped up the internet nicely. The only problem was that clients would randomly lose connection to the server, as sometimes they would try to resolve the internal name of the server against the wrong DNS server.
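To show why that went wrong, here is a deliberately simplified model, not a real DNS client. The host names, addresses and the `client_lookup` helper are made up for illustration. The assumption it encodes is that a client treats every DNS server it is given as equivalent, and accepts "no such name" from whichever one answers as the final word:

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical records: the public internet knows nothing about the internal name.
PUBLIC_ZONE: Dict[str, str] = {"example.com": "203.0.113.10"}
INTERNAL_ZONE: Dict[str, str] = {"sbs.company.local": "192.168.1.10"}

Resolver = Callable[[str], Optional[str]]


def public_resolver(name: str) -> Optional[str]:
    """An external DNS server: only knows public names."""
    return PUBLIC_ZONE.get(name)


def internal_resolver(name: str) -> Optional[str]:
    """The server's own DNS: answers internal names, forwards everything else."""
    return INTERNAL_ZONE.get(name) or public_resolver(name)


def client_lookup(
    name: str,
    resolvers: List[Tuple[str, Resolver]],
    is_up: Dict[str, bool],
) -> Optional[str]:
    """Ask servers in order, skipping any that time out.

    The first server that responds wins, even if its answer is 'not found'.
    """
    for label, resolver in resolvers:
        if not is_up.get(label, True):
            continue  # timed out (e.g. an overworked server), try the next one
        return resolver(name)
    return None


servers = [("internal", internal_resolver), ("public", public_resolver)]

# Server responsive: the internal name resolves.
print(client_lookup("sbs.company.local", servers, {"internal": True}))   # 192.168.1.10
# Server too slow to answer: the public server replies "not found".
print(client_lookup("sbs.company.local", servers, {"internal": False}))  # None
# Internet names still work either way, which is why it looked like a fix.
print(client_lookup("example.com", servers, {"internal": False}))        # 203.0.113.10
```

The point of the sketch is that the extra DNS servers only help for names they can actually answer, which is why putting the external servers in as forwarders on the server is the safer arrangement.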
So that was scrapped and it was back to forwarders. I did manage to speed things up though, by limiting the SQL memory so the Windows Internal Database basically had no power, and also limiting the memory Exchange could steal.
We did eventually put more RAM in the server (took a couple of visits as they forgot to bring the RAM the first time!). The server was better after that, but it was still never fit for purpose.
After 2-3 years, we decided to sell them some shiny new Symantec System Recovery software and a NAS to replace their failing tape drive backup solution. The engineer (not me!) installed the NAS and got it set up, then began installing Symantec, but it needed a reboot to finish off the installation.
I think the reboot did eventually happen a week later, but the software was never set up. Then the emails stopped working. I discovered that the Exchange SMTP queue file was corrupt, so I recreated it. Then it was fine for the rest of the day.
Then at about 5pm, the server goes offline. The customer reboots and it starts to do a disk check, never a good sign. They rebooted it again to bypass the disk check and it refused to come back on; it went into a reboot loop. An engineer goes to site and collects the server, as nothing more could be done.
So we get the server, with no decent backup and not booting. We plugged the hard drives into our test machine to see if they were still usable. Both of them could be read; however, one of the drives only had data from 2 years ago! So it seems the RAID had stopped working ages ago and the server had been running on one hard drive for years, and now that drive was finally starting to fail.
We were able to get the server booted, but only in Directory Services Restore Mode, so Active Directory was corrupt. By this time I was a third-level engineer, but I had never had to deal with this kind of problem before. The server had been there for 5 days now. So I rolled up my sleeves, got some Maccy D's breakfast and got to work. I found an article which gave a guide on how to repair the ntds.dit file. To my surprise this actually worked and the server booted. However, Exchange was bolloxed and the database would not mount, so we still weren't in great shape.
We managed to source another hard drive to get the RAID started, so at least we would have a decent hard drive in there. Then, with some amazing luck, we found that there was a tape backup of Exchange, as they were still changing the tapes, and there was a backup from a few days before the crash.
So I restored the Exchange backup from the tape, and the server was finally working again. I took it back to site, plugged it in and we were back online. Fortunately they had their email cached in Outlook, so that filled in the gaps where the backup was a few days behind. They also had hosted email security, which held the emails for 7 days in the cloud, so they didn't lose those either (the server had been gone for 6 days).
So we just about got away with that, but it was pretty poor on our part. If the backup had been set up within a reasonable time, it would have been much more helpful and it wouldn't have taken so long to get the server back. It just shows that a decent RAID can make all the difference and a good backup is always essential.