Getting to the root of RIM's worldwide outage
Despite the formation of a 'SWAT Team' to investigate last month's three-day BlackBerry outage, which eventually spread across five continents, little has been heard since about any findings. At the time of the outage in October, I wrote about how the worst ever BlackBerry outage made the imperfections of the cloud model clear, and how the model's sheer complexity makes it difficult to fix problems quickly.
In a new article published on Ars Technica, Sean Gallagher explores why RIM (NASDAQ: RIMM) still hasn't found the cause of its worldwide outage. Speaking with experts familiar with the cloud, Gallagher takes a closer look at some recent outages, including the one experienced by Amazon EC2 in April this year.
One explanation for the delay lies in the complex relationships among the various components of the company's architecture. While the outage appeared to be triggered by the failure of a core hardware switch, operations teams were likely inundated with a tremendous amount of data as systems failed and backup systems overloaded. This data would "require more effort to understand, and the relationships in it aren't clear," observed Mark Jaffe, the CEO of Prelert, a company that has developed a data center monitoring tool.
The sheer scale of cloud data centers means that even a small failure can escalate into a much larger one. As I wrote about the cloud previously: use it where it makes sense, but not without preparing appropriate backup and failover. Even as better tools and methodologies start trickling in, I think that advice still stands.
For more on this story:
- check out this article at Bloomberg
- check out this article at Ars Technica
Related Articles:
RIM needs to show more goodwill after outage fiasco
SoundOff: What we've learned about the cloud in 2011
Worst ever BlackBerry outage highlights imperfection of the cloud model
Questions raised over latest Amazon EC2 outage



