What on earth happened to Cloudflare last week?

On November 2, 2023, Cloudflare's customer-facing interfaces, including its website and APIs, along with its logging and analytics, stopped working properly. That was bad.

Over 7.5 million websites use Cloudflare, and 3,280 of the world's 10,000 most popular websites depend on its content delivery network (CDN) services. The good news is that the CDN didn't go down. The bad news is that the Cloudflare Dashboard and its related application programming interfaces (APIs) were down for almost two days.

That kind of thing just doesn't happen -- or it shouldn't, anyway -- to major internet service companies. So, the multi-million-dollar question is: 'What happened?' The answer, according to Cloudflare CEO Matthew Prince, was a power-related incident at one of the company's three primary data centers in Oregon, a facility managed by Flexential, that cascaded into one problem after another. Thirty-six hours later, Cloudflare was finally back to normal.

Prince didn't pussyfoot around the problem:

"To start, this never should have happened. We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically. And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the pain that it caused our customers and our team."

He's right -- this incident never should have happened. Cloudflare's control plane and analytics systems run on servers in three data centers around Hillsboro, Oregon. The three are independent of one another; each has multiple utility power feeds and multiple redundant, independent internet connections.

The three data centers are far enough apart that a single natural disaster shouldn't take them all down at once. At the same time, they're close enough that they can all run active-redundant data clusters. So, by design, if any one of the facilities goes offline, the remaining ones should pick up the load and keep operating.

Sounds great, doesn't it? However, that's not what happened.

What happened first was a power failure at Flexential's facility that caused an unexpected service disruption. Portland General Electric (PGE) was forced to shut down one of its independent power feeds into the building. The data center has multiple feeds, with some level of independence, that can power the facility. Even so, Flexential powered up its generators to supplement the feed that was down.

That approach, by the way, for those of you who don't know data center best practices, is a no-no. You don't use off-premises power and generators at the same time. Adding insult to injury, Flexential didn't tell Cloudflare that it had sort of, kind of, transitioned to generator power.

Then, there was a ground fault on a PGE transformer that was going into the data center. And, when I say ground fault, I don't mean a short, like the one that has you going down into the basement to fix a fuse. I mean a 12,470-volt bad boy that took down the connection and all the generators in less time than it took you to read this sentence.

In theory, a bank of UPS batteries should have kept the servers going for 10 minutes, which in turn should have been enough time to crank the generators back on. Instead, the UPSs started dying in about four minutes, and the generators never made it back on in time anyway.

Whoops.

Perhaps no one could have saved the situation, but with an onsite overnight staff that "consisted of security and an unaccompanied technician who had only been on the job for a week," it was hopeless.

In the meantime, Cloudflare discovered the hard way that some critical systems and newer services were not yet integrated into its high-availability setup. Furthermore, Cloudflare's decision to keep logging systems out of the high-availability cluster, on the grounds that analytics delays would be acceptable, turned out to be wrong. Because Cloudflare's staff couldn't get a good look at the logs to see what was going wrong, the outage lingered on.

It turned out that, while the three data centers were "mostly" redundant, they weren't completely so. The other two data centers running in the area did take over responsibility for the high-availability cluster and kept critical services online.

So far, so good. However, a subset of services that were supposed to be on the high-availability cluster had dependencies on services that were running exclusively in the dead data center.

Specifically, two critical services that process logs and power Cloudflare's analytics -- Kafka and ClickHouse -- were only available in the offline data center. So, when services in the high-availability cluster called for Kafka and ClickHouse, they failed.
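To make that failure mode concrete, here is a minimal sketch in Python of how a hidden single-site dependency can be surfaced before an outage does it for you. Everything in it is hypothetical: the data center names (dc-a, dc-b, dc-c), the service inventory, and the dependency map are illustrative stand-ins, not Cloudflare's real topology or tooling. The idea is simply to mark every service that lives only in the failed facility as down, then keep marking anything that depends on a downed service.

```python
from typing import Dict, List, Set

# Hypothetical inventory: the data centers each service runs in.
# Names and placements are illustrative, not Cloudflare's real topology.
PLACEMENT: Dict[str, Set[str]] = {
    "dashboard": {"dc-a", "dc-b", "dc-c"},
    "api": {"dc-a", "dc-b", "dc-c"},
    "kafka": {"dc-a"},        # single-site service
    "clickhouse": {"dc-a"},   # single-site service
}

# Hypothetical dependency map: service -> services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "dashboard": ["api", "clickhouse"],
    "api": ["kafka"],
    "kafka": [],
    "clickhouse": [],
}


def unavailable_services(
    failed_dc: str,
    placement: Dict[str, Set[str]],
    depends_on: Dict[str, List[str]],
) -> Set[str]:
    """Return every service that stops working if failed_dc goes offline,
    either because it runs only there or because something it needs does."""
    # Services with no copy outside the failed data center are down at once.
    down = {svc for svc, dcs in placement.items() if dcs <= {failed_dc}}

    # Propagate: a service is down if any of its dependencies is down.
    changed = True
    while changed:
        changed = False
        for svc, deps in depends_on.items():
            if svc not in down and any(dep in down for dep in deps):
                down.add(svc)
                changed = True
    return down


if __name__ == "__main__":
    for dc in ("dc-a", "dc-b"):
        print(dc, sorted(unavailable_services(dc, PLACEMENT, DEPENDS_ON)))
```

In this toy inventory, losing dc-b takes nothing down, while losing dc-a takes down the nominally redundant dashboard and API as well, because they call services that exist only there. The sketch treats every dependency as a hard one; a real audit would also need to distinguish calls that can degrade gracefully.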

Cloudflare admits it was "far too lax about requiring new products and their associated databases to integrate with the high-availability cluster." Moreover, far too many of its services depend on the availability of its core facilities.

Lots of companies do things this way, but Prince admitted that this "does not play to Cloudflare's strength. We are good at distributed systems. Throughout this incident, our global network continued to perform as expected. But far too many fail if the core is unavailable. We need to use the distributed systems products that we make available to all our customers for all our services, so they continue to function mostly as normal even if our core facilities are disrupted."

Hours later, everything was finally back up and running -- and it wasn't easy. For example, almost all the power breakers were fried, and Flexential had to go and buy more to replace them all.

Expecting that there had been multiple power surges, Cloudflare also decided the "only safe process to recover was to follow a complete bootstrap of the entire facility." That approach meant rebuilding and rebooting all the servers, which took hours.

The incident, which lasted until November 4, was eventually resolved. Looking forward, Prince concluded: "We have the right systems and procedures in place to be able to withstand even the cascading string of failures we saw at our data center provider, but we need to be more rigorous about enforcing that they are followed and tested for unknown dependencies. This will have my full attention and the attention of a large portion of our team through the balance of the year. And the pain from the last couple of days will make us better."
