Facebook said routine maintenance work caused the hours-long global outage of the social media giant’s namesake network, as well as its Instagram and WhatsApp platforms.
“This outage was triggered by the system that manages our global backbone network capacity,” said Santosh Janardhan, VP, Engineering and Infrastructure at Facebook, in a blog post Tuesday.
“The backbone is the network Facebook has built to connect all our computing facilities together, which consists of tens of thousands of miles of fibre-optic cables crossing the globe and linking all our data centres.”
Janardhan said a “command was issued with the intention to assess the availability of global backbone capacity,” which “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally.”
With the backbone out of operation, Facebook’s DNS servers became unreachable. The outage lasted for hours because the loss of DNS broke the internal tools Facebook would normally use to investigate and resolve outages.
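The chain of failures described in the post can be pictured as a dependency graph in which one failure spreads to everything that relies on it. The Python sketch below is only an illustration under stated assumptions: the component names and the dependency map are hypothetical labels chosen for this example, not Facebook’s actual architecture.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it directly relies on.
# These names are illustrative assumptions, not Facebook's real component names.
DEPENDS_ON = {
    "backbone": [],
    "dns": ["backbone"],
    "internal_tools": ["dns"],
    "user_services": ["dns"],
}


def affected_by(failed, depends_on):
    """Return every component that becomes unavailable when `failed` goes down."""
    # Invert the map so we can walk from a component to the things that rely on it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Breadth-first walk outward from the failed component.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in down:
                down.add(child)
                queue.append(child)
    return down


if __name__ == "__main__":
    print(sorted(affected_by("backbone", DEPENDS_ON)))
```

In this toy model, taking down the backbone also takes down DNS and, with it, the internal tools that would be used for recovery, which mirrors the sequence the post describes.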
“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centres to have them debug the issue and restart the systems,” according to the Facebook post. “But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
Once Facebook restored backbone network connectivity across data centre regions, it brought its services back online. The company plans to simulate global backbone outages and develop a faster recovery plan in case such an event happens again, according to the post.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making,” the post said.
The company said it continues to investigate the outages “so we can continue to make our infrastructure more resilient.” During the outage, which lasted more than five hours, employees had trouble making calls from work-issued cellphones and receiving external emails, and were unable to use an internal communications platform called Workplace, according to The New York Times.
Hundreds of thousands of users trying to access Facebook, WhatsApp, Instagram and Facebook Messenger reported outages Monday, according to Downdetector.
Facebook’s post on the cause of the outages included an apology to users.
“To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by” Monday’s outages, the company wrote in a separate blog post on Monday.
“We’ve been working as hard as we can to restore access, and our systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
Facebook’s stock dropped to US$323.13 Monday afternoon after opening at US$335.52. Tuesday morning, it was trading at about US$334 a share.