Only last week I shared an article about the recent increase in cloud outages, and now AWS have hit the headlines with one of their own. The problem only lasted a few hours, but there were 12,000 outage reports at its height. The company announced that the issue was caused by their serverless computing service, AWS Lambda.
Those affected include the Boston Globe, Southwest Airlines and New York's Metropolitan Transportation Authority. The outage also caused problems for AWS’s own website, along with Amazon Music and Alexa.
Serverless Computing
The outage appears to have been rooted in the company’s Lambda service. AWS Lambda is a serverless computing service, meaning users can run their programs without having to provision or manage servers themselves.
Perhaps surprisingly, there are still servers within serverless computing. The difference is that managing those servers is the responsibility of the provider, not the user, which takes that burden away from users. AWS Lambda is actually the pioneer in this space, having launched in 2014.
There are, of course, some great cost-saving and efficiency benefits with serverless computing, but it comes with its challenges. In a non-serverless model users have control of everything; that isn’t the case here. Effectively, you’re handing over control of your infrastructure to a third party. This comes with security risks and the possibility of outages, because a mistake by the provider affects every customer who depends on it.
A spokesperson from the company said…
“We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers and indirectly through the use by other AWS services.”
Competitor Outages
Other recent outages from AWS competitors Microsoft Azure and Google Cloud have been much more significant. Azure’s January outage affected millions of users. Shutdowns this big have a huge economic impact.
These problems aren’t always down to mismanagement by cloud providers; sometimes it’s as simple as the weather knocking out a data centre. But providers are often caught out for not properly protecting their users against attacks that can shut down the service.
AWS Outage Record
In 2017, an outage at Amazon S3, their data-hosting service, took down huge parts of the internet in the Eastern US for about four hours. It impacted hundreds of major businesses such as Coursera, Expedia, JSTOR, Kickstarter, Lonely Planet, Mailchimp and Yahoo! Mail. This is the risk of relying on just a handful of services to hold up the majority of the economy’s online life. If AWS, Google Cloud or Microsoft Azure goes down, so do hundreds if not thousands of businesses.
What’s wild about this particular outage is that, simply put, it happened because of a typo. An Amazon operator ran a command to remove a small number of servers, but it was entered incorrectly, meaning many more servers were taken out than intended.
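To make that kind of mistake more concrete, here is a minimal sketch of a guard that refuses a removal request when it looks too large. It is not Amazon's tooling: the function name `check_removal`, the `UnsafeRemovalError` exception and the thresholds (`min_capacity`, `max_fraction`) are all hypothetical, chosen only to illustrate the idea of limiting how much damage a single mistyped number can do.

```python
class UnsafeRemovalError(Exception):
    """Raised when a removal request would take out too much capacity."""


def check_removal(requested: int, fleet_size: int,
                  min_capacity: int, max_fraction: float = 0.05) -> int:
    """Return the approved number of servers to remove, or raise.

    All names and thresholds here are illustrative assumptions.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Limit how much of the fleet one command may remove in a single step.
    per_command_limit = int(fleet_size * max_fraction)
    if requested > per_command_limit:
        raise UnsafeRemovalError(
            f"{requested} servers exceeds the per-command limit "
            f"of {per_command_limit}"
        )

    # Never let the fleet drop below the capacity the service needs to run.
    if fleet_size - requested < min_capacity:
        raise UnsafeRemovalError(
            f"removing {requested} servers would leave fewer than "
            f"{min_capacity} in service"
        )

    return requested
```

With a hypothetical fleet of 1,000 servers and a floor of 900, a request to remove 10 passes, while a mistyped 100 is rejected before anything is switched off.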
AWS suffered another outage in 2021, this time very inconveniently right before Christmas. In the midst of the pandemic, despite the arrival of Omicron, Christmas 2021 offered some version of a return to normality. So it was maybe the worst possible time for Netflix, Disney+ and Amazon’s e-commerce arm to be swept up in an outage. That meant no Christmas telly and no presents, from those companies at least.
Conclusion
The main cause of this recent AWS outage, by their own admission, was their serverless computing service, AWS Lambda. Outages, as explained in this article, have all sorts of causes. But by giving all the responsibility to cloud providers, are we putting ourselves at risk?