Amazon Web Services (AWS) rarely goes down unexpectedly, but you can expect a detailed explainer when a major outage does happen.

12/15 update: AWS misfires once more, just days after a massive failure

The latest of AWS's major outages occurred at 7:30AM PST on Tuesday, December 7, lasted five hours and affected customers using certain application interfaces in the US-EAST-1 Region. In a public cloud of AWS's scale, a five-hour outage is a major incident.

According to AWS's explanation of what went wrong, the source of the outage was a glitch in its internal network that hosts "foundational services" such as application/service monitoring, the AWS internal Domain Name System (DNS), authorization, and parts of the Elastic Compute Cloud (EC2) network control plane. DNS was important in this case as it's the system used to translate human-readable domain names to numeric internet (IP) addresses.
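To make the name-to-address translation concrete, here is a minimal Python sketch that asks the operating system's resolver for the addresses behind a hostname. The `resolve` helper is an illustrative name, not anything from AWS, and the example uses `localhost`, which most systems answer from a local hosts file rather than by querying a DNS server.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the sorted, de-duplicated IP addresses the system resolver reports."""
    results = socket.getaddrinfo(hostname, None)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address as a string.
    addresses = {sockaddr[0] for _, _, _, _, sockaddr in results}
    return sorted(addresses)


if __name__ == "__main__":
    for address in resolve("localhost"):
        version = ipaddress.ip_address(address).version
        print(f"localhost -> {address} (IPv{version})")
```

When the resolver cannot answer, `socket.getaddrinfo` raises `socket.gaierror`, which is roughly what a dependent service experiences when the name service it relies on is impaired.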
AWS's internal network underpins parts of the main AWS network that most customers connect with in order to deliver their content services. Normally, when the main network scales up to meet a surge in resource demand, the internal network should scale up proportionally via networking devices that handle network address translation (NAT) between the two networks.

However, on Tuesday last week, the cross-network scaling didn't go smoothly, with AWS NAT devices on the internal network becoming "overwhelmed", blocking translation messages between the networks with severe knock-on effects for several customer-facing services that, technically, were not directly impacted.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network," AWS says in its postmortem.

"This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks."

The delays spurred latency and errors for foundational services talking between the networks, triggering even more failing connection attempts that ultimately led to "persistent congestion and performance issues" on the internal network devices.
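The feedback loop described above can be illustrated with a toy model. This is only a sketch under invented assumptions: the function name, the tick-based timing, and every number are hypothetical, and nothing here reflects the real capacity or behavior of AWS's devices. It models a device that serves a fixed number of requests per tick, with each unserved request coming back on the next tick multiplied by a retry factor.

```python
def simulate_congestion(ticks, steady_load, capacity, surge_tick, surge_size, retry_factor):
    """Return the number of unserved requests left over after each tick."""
    backlog = 0
    history = []
    for tick in range(ticks):
        # Normal traffic, plus a one-off surge on the chosen tick.
        arriving = steady_load + (surge_size if tick == surge_tick else 0)
        # Requests that failed last tick come back, amplified by retries.
        retries = int(backlog * retry_factor)
        offered = arriving + retries
        served = min(offered, capacity)
        backlog = offered - served
        history.append(backlog)
    return history


if __name__ == "__main__":
    # One retry per failure: the backlog drains because capacity exceeds steady load.
    calm = simulate_congestion(40, 80, 100, 2, 500, 1.0)
    # 1.5 retries per failure: the same surge never clears and keeps growing.
    storm = simulate_congestion(40, 80, 100, 2, 500, 1.5)
    print("backlog right after surge:", calm[2], storm[2])
    print("backlog at end of run:   ", calm[-1], storm[-1])
```

In this model the only difference between recovery and persistent congestion is how aggressively failed connections are retried, which is why a one-off surge can leave a device congested long after the original trigger has passed.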
With the connection between the two networks blocked up, the AWS internal operations team quickly lost visibility into its real-time monitoring services and was forced to rely on past-event logs to figure out the cause of the congestion. After identifying a spike in internal DNS errors, the team diverted internal DNS traffic away from blocked paths. This work was completed at 9:28AM PST, about two hours after the outage began.
This alleviated the impact on customer-facing services but didn't fully fix the affected AWS services or clear the NAT device congestion. Moreover, the AWS internal ops team still lacked real-time monitoring data, which slowed recovery and restoration.

Besides lacking real-time visibility, AWS internal deployment systems were hampered, again slowing remediation. The third major cause of its non-optimal response was concern that a fix for internal-to-main network communications would disrupt other customer-facing AWS services that weren't affected.

"Because many AWS services on the main AWS network and AWS customer applications were still operating normally, we wanted to be extremely deliberate while making changes to avoid impacting functioning workloads," AWS said.
So which AWS customer services were impacted?
First, the main AWS network was not affected, so AWS customer workloads were "not directly impacted", AWS says. Rather, customers were affected by AWS services that rely on its internal network.

However, the knock-on effects from the internal network glitch were far and wide for customer-facing AWS services, affecting everything from compute, container and content distribution services to databases, desktop virtualization and network optimization tools.

AWS control planes are used to create and manage AWS resources. These control planes were affected as they are hosted on the internal network. So, while EC2 instances were not affected, the EC2 APIs customers use to launch new EC2 instances were. Higher latency and error rates were the first impacts customers saw at 7:30AM PST.
With this capability gone, customers had trouble with Amazon RDS (Relational Database Service) and the Amazon EMR big data platform, while customers of Amazon WorkSpaces, the managed desktop virtualization service, couldn't create new resources.

Similarly, AWS's Elastic Load Balancers (ELB) were not directly affected but, since ELB APIs were, customers couldn't add new instances to existing ELBs as quickly as usual.
Route 53 (DNS) APIs were also impaired for five hours, preventing customers from changing DNS entries. There were also login failures to the AWS Console, latency affecting the AWS Security Token Service (STS) for third-party identity services, delays to CloudWatch, impaired access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints, and problems invoking serverless Lambda functions.
The December 7 incident shared at least one trait with a major outage that occurred this time last year: it stopped AWS from communicating swiftly with customers about the incident via the AWS Service Health Dashboard.
"The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region," AWS explained.
Additionally, the AWS support contact center relies on the AWS internal network, so staff couldn't create new cases at normal speed during the five-hour disruption.
AWS says it will release a new version of its Service Health Dashboard in early 2022, which will run across multiple regions to "ensure we do not have delays in communicating with customers."
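AWS has not published how the new dashboard will choose between regions, so the following is only a generic sketch of the failover idea. The `pick_region` function, the region list, and the probe are all hypothetical; the one assumption worth noting is that a probe which raises an error (for example, because the path to a region is congested) is treated as unhealthy rather than stopping the selection.

```python
def pick_region(regions, is_healthy):
    """Return the first region whose health probe succeeds, or None if all fail."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises (timeout, connection error) counts as unhealthy.
            continue
    return None


if __name__ == "__main__":
    # Hypothetical probe: the primary is unreachable, the standby answers.
    def probe(region):
        if region == "primary":
            raise TimeoutError("no response from primary")
        return region == "standby"

    print(pick_region(["primary", "standby"], probe))
```

The design point this illustrates is the one AWS's explanation raises: failover only helps if the probe and the switch-over path do not themselves depend on the network that is impaired.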

Cloud outages do happen. Google Cloud has had its fair share and Microsoft in October had to explain its eight-hour outage. While rare, the outages are a reminder that public cloud might be more reliable than conventional data centers, but things do go wrong, sometimes catastrophically, and can impact a large number of critical services.
"Finally, we want to apologize for the impact this event caused for our customers," said AWS. "While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further."

AWS: Here's what went wrong in our big cloud-computing outage