LONDON — Facebook says the global outage that knocked it and its other platforms offline for hours was caused by an error during routine maintenance.
The issue arose while engineers worked on Facebook’s worldwide backbone network, which includes the computers, routers, and software in its data centers across the world, as well as the fiber-optic connections that link them.
“During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity,” Santosh Janardhan, Facebook’s vice president of infrastructure, said Tuesday. “Unintentionally, all connections in our backbone network were taken down, effectively disconnecting Facebook data centers globally.”
Facebook’s systems are designed to catch such errors, but a bug in the audit tool kept the command from being stopped in this case, according to Janardhan.
The change also caused a secondary problem: Facebook’s servers could not be reached, even though they were still working.
Engineers rushed on site to fix the problem, but getting through the extra layers of security took time, according to Janardhan. The data centers are “tough to get into, and once inside, the hardware and routers are designed to be impossible to alter even if you have physical access to them,” the company said.
Once connectivity returned, services were brought back gradually to avoid traffic spikes that could cause further failures.
Although a maintenance update taking down Facebook’s backbone network was an “unexpected anomaly,” Angelique Medina of Cisco Systems’ ThousandEyes, a firm that monitors internet outages, said the company could have avoided a scenario in which its servers were taken completely offline, making it impossible to reach the tools needed to fix the problem.
“The key concern is why there could be a single point of failure for so many internal tools and systems,” Medina said. “Facebook would still be down due to the network failure, but if they had inside access, they could have fixed the problem sooner.”