Getting to the root of RIM's worldwide outage

Despite the formation of a 'SWAT Team' to investigate last month's three-day BlackBerry outage, which eventually spread across five continents, little has been heard since about any findings. At the time of the outage in October, I wrote about how the worst ever BlackBerry outage exposed the imperfections of the cloud model, and how the model's sheer complexity makes it difficult to fix problems quickly.

In a new article published on Ars Technica, Sean Gallagher explores why RIM (NASDAQ: RIMM) still hasn't found the cause of its worldwide outage. Speaking with experts familiar with the cloud, Gallagher takes a closer look at some recent outages, including the one experienced by Amazon EC2 in April this year.

One explanation for the delay lies in the complex relationships among the various components of the company's architecture. While the outage appeared to be triggered by the failure of a core hardware switch, operations teams were likely inundated with a tremendous amount of data as systems failed and backup systems became overloaded. This data would "require more effort to understand, and the relationships in it aren't clear," observed Mark Jaffe, the CEO of Prelert, a company that has developed a data center monitoring tool.

It is clear that the sheer scale of cloud data centers means that even a small failure can escalate into a much wider outage. As I wrote about the cloud previously: leverage it where it makes sense, but don't do so without preparing appropriate backup and failover. Even as better tools and methodologies start trickling in, I think that advice still stands.

For more on this story:
- check out this article at Bloomberg
- check out this article at Ars Technica

Related Articles:
- RIM needs to show more goodwill after outage fiasco
- SoundOff: What we've learned about the cloud in 2011
- Worst ever BlackBerry outage highlights imperfection of the cloud model
- Questions raised over latest Amazon EC2 outage