Recently, nerds across the land have been all atwitter over what is fashionably referred to these days as an “epic fail.” Google the subject if you want more details, but a relatively short form goes something like this:
The cell phone company T-Mobile put out a nice little smart phone called the Sidekick. The Sidekick is possessed of no local non-volatile storage for productivity data. (Translation: if you turn it off your contacts, appointments, emails, etc… go away.) Not a problem though; the phone connects to a bunch of servers by way of T-Mobile’s cellular data network, and these servers apply some think-balm to the Sidekick’s amnesia.
T-Mobile is not really in the business of running the kind of data center necessary to provide their smart phone customers with the “cloud” services required to make this all work. They understandably handed off the responsibility of managing the services in question to a company called “Danger.” (For those of you familiar with horror movies, this is the point at which the creepy music starts.)
Time passed and the Sidekick gained a loyal following of devoted customers. Somewhere along the line Danger was eaten by Microsoft. Then, just a little while ago as of the time of this writing, a decision was made to upgrade the network at Danger.
The upgrade, as you probably have guessed, is the epic fail to which I previously alluded. Not only did the patient not survive the surgery, but the parties performing the upgrade were unable to restore the subscribers’ data for several days.
Although Danger / Microsoft teams are “working around the clock” trying to recover the data somehow, T-Mobile has warned Sidekick users that some of them may be out of luck. It may prove to be too late for these unlucky folks, but there are lessons to be learned from mishaps such as this.
First of all, always have an easy way to keep an offsite copy of your data over which you have control. This copying method is only of any value if you employ it on a regular basis.
In my case, I subscribe to a service called “Mobile Me”. From my perspective, the relevance of Mobile Me is that it allows me to sync appointments and contacts from my iPod Touch with Outlook on my PC and iCal on my wife’s Mac. It works wonderfully… most of the time.
Twice now we’ve had some issues where one or more of the devices in the mix would not sync anymore. We called tech support both times and were able to speak to an actual human being fluent in the English language. They were helpful and tried to debug the issue, but in both cases it required going to a backup.
This might be the time to own up to a couple of things. First, I’m very apprehensive about upgrading my iPod. Every single time I’ve done an upgrade on the device from a Windows machine it has required a full wipe & restore. Not fun. This is relevant given that the problems in question might have been avoided had I been running the most recent software available for the iPod. The second bit of relevant trivia is that I make periodic backups of my own local data files, including those files used by Outlook.
The bottom line is that a simple restore of my Outlook data files fixed the whole problem, other than the fact that I lost a day’s worth of events. I don’t know how good the backups at Mobile Me might have been in my case, because I didn’t need to use them.
Control over your backups is a big deal, and keeping a copy of your data out of sync with the daily grind of operations is about the most important kind of control that you can have. The synchronization with Mobile Me is analogous to the kind of useful redundancy given on a network by the use of a storage technology called “RAID”.
“RAID” is an acronym that stands for “Redundant Array of Inexpensive Disks.” The idea is that you set up two or more disk drives that function as if they were one drive. If you are feeling really lucky and are the kind of person who would drive your new sports car in the snow at full speed, you can configure the drives to act as one big one. This is more than a little bit risky since there is no fault tolerance whatsoever.
Most people choose to use RAID to increase their level of safety, rather than using it to merely increase the amount of storage that they have on a good day. The single most common configuration is to set up a disk array (a group of disks) to do “RAID 1” storage. A much more common name for “RAID 1” is “disk mirroring.”
When you configure an array of disks for mirroring, you pair the disks together. The operating system on your computer then sees the pairs of disks as one functional unit. Data is written to both disks at the same time. This gives you a safety net should one of the drives fail.
A typical drive failure scenario where you are protected by “RAID 1” goes something like this: most users on the network don’t care. Everyone is puzzled by the fact their resident network geek looks a little stressed and is telling the boss that it is time to order a new drive for the server. Understandably, this is far preferable to the alternative scenario in which the server reverts to a drooling dysfunctional heap of lost data.
Unfortunately, there are three inherent limitations to any “fault tolerance” technology such as RAID.
- Both drives in a pair could still be healthy, but useless if the server itself dies.
- If the room or building in which the server sits is trashed, or becomes inaccessible, the server is frequently out of the picture as well regardless of whether it is still in working order.
- Faithful and accurate copies of your data made in real time are only helpful if the current state of your data is actually something you want. What happens if you just deleted a file? It was just deleted on both of your mirrored disks. Whoops.
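For the code-minded, that last limitation can be shown with a toy model. The sketch below is only an illustration, not how any real RAID controller works: the `MirroredVolume` class, the `take_backup` function and the file name are all made up for the example, and the "disks" are plain Python dictionaries.

```python
class MirroredVolume:
    """Toy model of RAID 1: every write and delete hits all healthy disks."""

    def __init__(self):
        # Two "disks", each a dict mapping file name -> contents (hypothetical).
        self.disks = [{}, {}]

    def healthy(self):
        return [disk for disk in self.disks if disk is not None]

    def write(self, name, data):
        for disk in self.healthy():
            disk[name] = data

    def delete(self, name):
        # The delete is mirrored just as faithfully as the write was.
        for disk in self.healthy():
            disk.pop(name, None)

    def fail_disk(self, index):
        self.disks[index] = None

    def read(self, name):
        survivors = self.healthy()
        if not survivors:
            raise IOError("all disks in the mirror have failed")
        return survivors[0].get(name)


def take_backup(volume):
    """Point-in-time copy that later writes and deletes cannot touch."""
    return dict(volume.healthy()[0])


volume = MirroredVolume()
volume.write("budget.xls", "Q3 numbers")
backup = take_backup(volume)

volume.fail_disk(0)
print(volume.read("budget.xls"))  # the surviving mirror still serves the file

volume.delete("budget.xls")
print(volume.read("budget.xls"))  # gone from the mirror: prints None
print(backup["budget.xls"])       # still present in the out-of-sync copy
```

The mirror shrugs off the dead drive, but it has no answer for the deleted file. Only the copy that was out of sync with day-to-day operations still has it.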
The means by which these problems are addressed is conceptually simple, though sometimes annoyingly complex in practice: you back your stuff up.
Backup copies are different from the copies created during the normal operation of a RAID array. For these backup copies to be truly useful, a few simple rules must be followed:
- The backup copy must be written onto media that is out of sync with the normal ebb and flow of file creation, deletion and modification on the network from which the original copy came. If the backup can be easily modified by users on your network during the course of normal operations, it is not safe to consider it a trustworthy copy.
- The media onto which the backups are written must be removable such that it can be taken offsite.
- Multiple copies should be made. A good rule of thumb is to have at least three copies: one offsite, one onsite and connected, and one onsite and disconnected, but available for quick connection should the need arise.
- The backup process should be automated.
- The media onto which the backup is written should be trivially accessible using commodity components that could be purchased at just about any electronics store.
- The format in which the backup data is stored should be easily decipherable. Avoid proprietary archival storage schemes whenever possible.
- The automated backup process should keep logs, and those logs should be checked periodically by an actual human being capable of reading and understanding them.
- Test restores should periodically be performed, ensuring that the backup is actually a usable copy.
- A full, trustworthy backup should be made of any critical systems prior to any major changes being made to the systems in question. The backup media should then be disconnected from the system prior to the changes being made.
- Copies should occasionally be pulled out of circulation for long term storage as a “moment in time.”
- You should have a bare metal disaster recovery strategy in place. Translated from Geek to English, this means that you have a plan for recovering from a total catastrophic failure in which you must start from scratch. “Bare Metal” is a common nerd term referring to a new computer taken out of the box.
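Several of these rules (automation, a non-proprietary format, logs, and test restores) can be sketched in a few lines of Python using only the standard library. This is a minimal sketch under stated assumptions rather than a finished backup tool: the function names `file_digests`, `make_backup` and `verify_restore` are invented for the example, the backup directory is assumed to live outside the directory being backed up, and scheduling, rotation and getting a copy offsite are left to you.

```python
import hashlib
import logging
import tarfile
import tempfile
from datetime import datetime
from pathlib import Path

log = logging.getLogger("backup")


def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def make_backup(source_dir, backup_dir):
    """Write a timestamped .tar.gz of source_dir; return (archive, manifest).

    Assumes backup_dir is NOT inside source_dir.
    """
    source_dir, backup_dir = Path(source_dir).resolve(), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    manifest = file_digests(source_dir)  # what the data looked like at backup time
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{source_dir.name}-{stamp}.tar.gz"
    # tar + gzip is a widely readable format, not a proprietary one.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_dir, arcname=".")
    log.info("wrote %s (%d files, %d bytes)", archive, len(manifest), archive.stat().st_size)
    return archive, manifest


def verify_restore(archive, manifest):
    """Test restore: extract to a scratch directory and compare digests."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        restored = file_digests(scratch)
    ok = restored == manifest
    if ok:
        log.info("test restore of %s succeeded", archive)
    else:
        log.error("test restore of %s FAILED", archive)
    return ok
```

A scheduled job could call `make_backup` followed by `verify_restore` and send the log somewhere a human will actually read it. A passing test restore only proves the archive is readable today, so the other rules, especially the ones about offsite and disconnected copies, still apply.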
If your organization is one to which special data retention rules apply then your backup strategy must reflect this. Do you need to make unalterable backups of your email every month in a format easily accessible to auditors? Might you be served with a public request for information? Make sure to know such things in advance of crafting your backup strategy.
Whatever you do, assume that you will eventually become a statistic. A good rule of thumb to follow is that virtually 100% of all computer users will someday lose their primary copy of critical data due to hardware malfunction, operating system problems, network errors, or simple human mistakes. If this rule applies to giants like T-Mobile and Microsoft, it definitely applies to you.



The summer after my freshman year of college I took a full time job working as what could best be described as a techno-grunt. I did a tiny bit of programming, but mostly I carried heavy stuff from point A to point B, helped sling cable through ceilings, crawled through tunnels, repaired countless computers and terminals and burned myself with a soldering iron more times than I could count. I also helped with backups.