Cloud reliability, revisited

Remember the recent case of how a massive cloud failure reportedly wiped out all customer data on T-Mobile Sidekicks? Detractors of cloud computing are quick to jump at the opportunity to highlight this as a perfect example of how unwise it is to trust someone else to manage your data.
Well, fast forward a week, and Microsoft has announced that it has recovered most of the lost data. As can be expected, the official statement is carefully scripted and polished to reveal as little as possible while conveying the assurance that most of the data is, in fact, not lost.
PR backpedaling aside, a closer examination of how Microsoft phrased its statement yields more clues as to what happened. According to Microsoft, it rebuilt the trashed system "component by component" in order to preserve and extract the data. My guess is that the massive PR backlash prompted Microsoft to practically reconstruct the entire infrastructure, which says a lot by itself.
You see, what strikes me here is the extraordinary cost such a move must entail. Remember, we are talking about telco-grade infrastructure serving all Sidekick users at T-Mobile, not just a dinky server room. The infrastructure could well run into the hundreds or even thousands of servers--consider the associated cost in engineering and IT expertise to put it together.
What I find even more interesting, though, is the fact that the data was recoverable at all. Some sites suggested that the entire debacle could just be one huge misunderstanding; that those at T-Mobile might simply not have been aware that there was a way to recover the lost data. However, if you are an administrator, you will know that data from practically any hard disk drive can be recovered--as long as you are willing to pay the exorbitant cost of data recovery, and can wait for it.
Microsoft obviously has a vested interest in maintaining its reputation, what with the imminent launch of its Windows Azure Cloud operating system, among other cloud-hosted applications. My take is that it decided to plunge in with the restoration work that T-Mobile might have balked at.
Of course, all of the above is merely speculation; the true sequence of events might never be known to the rest of us. Even after this incident, I don't think cloud computing is inherently faulty. There is a takeaway lesson here, though: perhaps we should examine our contracts with cloud providers more closely to see just how hard they will try to recover our data in the event of a catastrophic failure. - Paul



