An AWS outage affected thousands of third-party online services and dozens of AWS services for hours last week, stemming from a capacity increase on Amazon’s Kinesis server fleet.
On Nov. 25, the public cloud titan added capacity to its front-end fleet of Kinesis servers without checking whether the operating system’s configuration allowed for it. The result was a significant outage, and it took approximately 17 hours before Kinesis was fully restored. AWS services including API Gateway, Amplify, AppStream 2.0, Athena, CloudTrail, CloudWatch, Cognito, DynamoDB, EventBridge, IoT Services, Lambda, Lex, Managed Blockchain, S3, SageMaker and WorkSpaces were impacted.
Ethan Simmons, a managing partner for Pinnacle Technology Partners Inc., an AWS managed service provider with an impressive life sciences customer base, said none of his customers were impacted by the outage. A big reason for that is Pinnacle’s popular PeakPlus suite of secure managed and monitored AWS services and its adherence to AWS’ well-architected review standards.
Simmons, a 28-year IT veteran, said the outage highlights the need for well-architected AWS environments.
“It doesn’t matter whether it is on-premise or in the cloud, outages are going to happen, you always have to design for it,” said Simmons. “If you blindly think everything is going to function okay you are making a big mistake. You need to plan for it and have a partner that can help you architect the solution correctly. IT is complex and, if anything, it is getting more complex. Companies need a partner that can help them architect their environment.”
Key to a well-architected AWS environment is taking advantage of AWS’ robust redundancy services, said Simmons. “When we bring on net new customers, we use AWS’ well-architected framework to make sure they have, from day one, the right high availability and redundancy in place as part of the design.”
Amazon Kinesis is used by developers to capture data and video streams in order to process them through AWS’ machine learning platforms. After AWS added capacity to the Kinesis servers in the early morning hours on Nov. 25, the front-end fleet began to exceed the maximum number of threads allowed by its operating system configuration, according to a recent AWS blog post. This disrupted Kinesis, and the services that depend on it, in AWS’ US-EAST-1 region.
“The new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration,” said AWS. “As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.”
For the thousands of servers in the Kinesis front-end fleet to communicate with one another, each server maintains “threads” to the other servers in the fleet. When servers are added, it can take hours for these threads to be created and recognized by existing servers. Once the number of threads exceeded the limit set in the OS configuration, the servers were unable to route requests to Kinesis back-end clusters.
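The failure mode can be illustrated with a rough model. The sketch below assumes each front-end server holds one thread per peer plus a fixed baseline of other threads; the thread limit, the baseline and the fleet sizes are hypothetical values chosen for illustration, not figures from AWS.

```python
def threads_per_server(fleet_size: int, baseline_threads: int = 0) -> int:
    """Threads one front-end server holds: one per peer plus a baseline.

    Both the one-thread-per-peer model and the baseline are assumptions.
    """
    if fleet_size < 1:
        raise ValueError("fleet_size must be at least 1")
    return (fleet_size - 1) + baseline_threads


def exceeds_limit(fleet_size: int, os_thread_limit: int,
                  baseline_threads: int = 0) -> bool:
    """True when the per-server thread count is above the OS limit."""
    return threads_per_server(fleet_size, baseline_threads) > os_thread_limit


# Hypothetical numbers, chosen only to show the limit being crossed.
OS_THREAD_LIMIT = 5000
BASELINE = 200

for fleet_size in (4000, 4800, 4900):
    used = threads_per_server(fleet_size, BASELINE)
    over = exceeds_limit(fleet_size, OS_THREAD_LIMIT, BASELINE)
    status = "over limit" if over else "ok"
    print(f"{fleet_size} servers -> {used} threads per server ({status})")
```

Because every server's thread count grows with the size of the fleet, a capacity increase in this model pushes all servers past the limit at roughly the same time rather than just the new ones.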
AWS fixed the issue by restarting the Kinesis front-end fleet. The recovery took several hours because “we can only add servers at a rate of a few hundred per hour,” said AWS.
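A back-of-the-envelope calculation shows why that rate translates into a long recovery. The fleet size and hourly rate below are hypothetical stand-ins for “thousands of servers” and “a few hundred per hour,” not numbers AWS has published.

```python
import math


def hours_to_restore(fleet_size: int, servers_per_hour: int) -> int:
    """Whole hours needed to bring a fleet back at a fixed hourly rate."""
    if servers_per_hour <= 0:
        raise ValueError("servers_per_hour must be positive")
    return math.ceil(fleet_size / servers_per_hour)


# Hypothetical example: 3,000 servers restored at 300 per hour.
print(hours_to_restore(3000, 300))  # 10
```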
AWS is already making several changes to prevent a similar outage, including moving to servers with more CPU and memory, which reduces both the total number of servers and the number of threads each server needs to communicate across the fleet. “This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet. Having fewer servers means that each server maintains fewer threads,” AWS said.
Additionally, AWS is adding “fine-grained” alarming for thread consumption in the service as well as moving several large services, such as CloudWatch, to a separate front-end fleet. The company is also working on a larger project to isolate failures in one service so they don’t affect other services.
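AWS has not described how its thread alarms work. As a minimal sketch of the general idea, a check might compare the threads in use against a configured limit and raise a warning well before the limit is reached. The function name, the levels and the 70% and 90% thresholds below are all assumptions made for illustration.

```python
import threading
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARN = "warn"
    CRITICAL = "critical"


def thread_alarm(threads_in_use: int, thread_limit: int,
                 warn_ratio: float = 0.7,
                 critical_ratio: float = 0.9) -> Level:
    """Classify thread consumption against a limit (thresholds are assumed)."""
    if thread_limit <= 0:
        raise ValueError("thread_limit must be positive")
    usage = threads_in_use / thread_limit
    if usage >= critical_ratio:
        return Level.CRITICAL
    if usage >= warn_ratio:
        return Level.WARN
    return Level.OK


# Check the current Python process against a hypothetical limit of 5,000.
print(thread_alarm(threading.active_count(), 5000))
```

Alerting on the ratio rather than on outright failure is the point of such a check: it flags shrinking headroom before a capacity change exhausts it.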
“We want to apologize for the impact this event caused for our customers,” said AWS. “We know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
Steve Burke contributed to this article.