An accidental outage at Cloudflare caused major disruptions across large swaths of the Internet on Tuesday morning, reportedly hitting popular sites such as Discord, Shopify, Grindr, Fitbit, and Peloton.
The security and performance services vendor said the problem was the result of “our error” and was fixed within about an hour and 15 minutes.
In a blog post, Cloudflare said the outage in the early hours of Tuesday affected traffic in 19 of its data centres.
“Unfortunately, these 19 locations handle a significant proportion of our global traffic,” the company said. “A change to the network configuration in those locations caused an outage which started at 06:27 UTC. At 06:58 UTC the first data centre was brought back online and by 07:42 UTC all data centres were online and working correctly.”
Though it didn’t provide specific information about the extent of the disruptions, Cloudflare said: “Depending on your location in the world you may have been unable to access websites and services that rely on Cloudflare. In other locations, Cloudflare continued to operate normally.”
The company concluded in the introduction of its blog post: “We are very sorry for this outage. This was our error and not the result of an attack or malicious activity.”
A company official could not be reached for further comment.
According to Downdetector, complaints of outages began streaming in early Tuesday morning, with major sites such as Discord, Shopify, Grindr, Fitbit, and Peloton ultimately experiencing disruptions.
This isn’t the first time Cloudflare has experienced a self-inflicted outage that disrupted services.
In 2019, the company acknowledged a widespread service outage was caused by a bug in the company’s firewall software and not a cyberattack. “This was a mistake we caused ourselves,” Cloudflare CEO Matthew Prince told CRN US three years ago. “It wasn’t an issue caused by someone else.”
Now Cloudflare is acknowledging another mistake of its own, one that led to yet another widespread outage.
According to its blog post, the error occurred while the company was trying to improve its system.
“Over the last 18 months, Cloudflare has been working to convert all of our busiest locations to a more flexible and resilient architecture,” the company wrote in the post. “In this time, we’ve converted 19 of our data centres to this architecture, internally called Multi-Colo PoP (MCP).”
The company said the “new architecture has provided us with significant reliability improvements, as well as allowing us to run maintenance in these locations without disrupting customer traffic.”
But because these locations “carry a significant proportion of the Cloudflare traffic,” the company said, “any problem here can have a very wide impact, and unfortunately, that’s what happened today.”
The problem occurred while company officials were “deploying a change to our prefix advertisement policies,” the company said.
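Cloudflare’s post doesn’t reproduce the faulty configuration, but a prefix advertisement policy on a typical edge router looks something like the following purely illustrative sketch. All term names and prefix ranges here are invented for illustration and are not Cloudflare’s actual configuration:

```
# Illustrative BGP export policy (Juniper-style syntax).
# Terms are evaluated top to bottom, so reordering or editing them
# can change which routes get advertised to peers.
policy-statement ADVERTISE-PREFIXES {
    term anycast-services {
        from route-filter 203.0.113.0/24 exact;   # example public anycast range
        then accept;                              # advertise to peers
    }
    term internal-only {
        from route-filter 10.0.0.0/8 orlonger;    # example internal routes
        then reject;                              # never advertise externally
    }
    term catch-all {
        then reject;                              # drop everything unmatched
    }
}
```

Because such policies are order-sensitive, a seemingly small change can stop critical prefixes from being advertised at all, making an entire data centre unreachable from the rest of the Internet — the kind of wide-impact failure the company describes.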
In its blog post, Cloudflare wrote: “Although Cloudflare has invested significantly in our MCP design to improve service availability, we clearly fell short of our customer expectations with this very painful incident. We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage.”
The company said it’s already working on changes and “will continue our diligence to ensure this cannot happen again.”