Amazon Web Services blamed its major outage Tuesday, which downed services from ConnectWise, Netflix, Disney+, Ticketmaster, Flickr and others, on an “automated scaling activity” that triggered a “latent issue” in the “request back-off behaviors” of a code path used by networking clients.
The company also outlined some fixes to prevent the specific issue from happening again and to improve communication during major outages, according to a message published Friday. The event affected AWS’ Northern Virginia (US-East-1) region.
Because AWS’ internal operations teams could not see real-time monitoring data, they struggled to pinpoint the source of the network congestion, which prolonged the outage, according to AWS.
An AWS spokeswoman declined to comment further on the AWS post and the outage.
The internal network disruption also slowed deployment systems, further delaying recovery, according to AWS. The internal operations team took an “extremely deliberate” approach to changes to avoid disrupting customer applications and the AWS services still operating normally on the main network.
“We want to apologize for the impact this event caused for our customers,” according to the AWS message. “While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end-users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
The outage even prompted digs from AWS rivals such as Oracle, which needled the cloud giant during Oracle’s latest quarterly earnings call.
The “automated scaling activity” triggered “unexpected behavior” in a large number of clients on AWS’ internal network at 7:30 a.m. Pacific Tuesday, according to AWS. The result was a surge of connection activity that overwhelmed networking devices between the main AWS network and the AWS internal network, which is used to host monitoring, internal DNS, authorization services, parts of the EC2 control plane and other foundational services.
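AWS’ description points to retry logic that failed to back off adequately under congestion. As a rough illustration only (the function names here are hypothetical, not AWS’ actual code), client-side exponential back-off with jitter, the standard defense against this kind of synchronized connection surge, typically looks like this:

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0) -> float:
    """Exponential back-off with full jitter: the delay window doubles with
    each attempt but the actual wait is randomized, so failing clients
    spread out their retries instead of hammering the network in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(request, max_attempts: int = 5):
    """Retry a flaky request, sleeping a jittered, capped delay between
    attempts. Without a working back-off path, every failed client retries
    at once -- producing the kind of connection surge AWS described."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

The jitter matters as much as the exponential growth: without randomization, clients that failed together retry together, and each retry wave can re-congest the very devices that are trying to recover.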
Customers using the main AWS network saw “minimal impact” from the outage, according to AWS.
All network devices fully recovered by 2:22 p.m. Pacific, but some resulting issues continued into the evening. The AWS Security Token Service (STS), used for authentication by Redshift and other AWS services, didn’t fully recover until 4:28 p.m. Pacific, according to Amazon.
AWS Fargate continued to see elevated error rates and insufficient capacity errors until 5 p.m., with some customers seeing errors for certain task sizes for another “several hours” after recovery, according to AWS.
AWS EventBridge saw elevated event delivery latency until 6:40 p.m. Pacific due to backlog processing.
API Gateway errors and latencies stayed elevated until 4:37 p.m. Pacific, with errors and throttling continuing for “several hours” while Gateways stabilized, according to AWS.
To keep the specific outage event from happening again, AWS “immediately disabled the scaling activities that triggered this event and will not resume them until we have deployed all remediations,” according to the company. The company said it does not need to resume those activities “in the near-term.”
“We are developing a fix for this issue and expect to deploy this change over the next two weeks,” according to AWS. “We have also deployed additional network configuration that protects potentially impacted networking devices even in the face of a similar congestion event. These remediations give us confidence that we will not see a recurrence of this issue.”