Cloudflare Explains What Caused Its Worst Outage in Years

Cloudflare has shared new details on what triggered the major outage that took many websites offline on Tuesday. In a blog post published late Tuesday, Cloudflare co-founder and CEO Matthew Prince said the problem started inside the company’s Bot Management system. This system is supposed to control which automated crawlers are allowed to access websites through Cloudflare’s network.

Cloudflare says that around 20% of the internet runs through its network, which is built to help websites stay online during traffic spikes and DDoS attacks. This time, the network itself failed, and many sites went down for several hours. Platforms like X, ChatGPT, and even Downdetector were hit, in a disruption similar to the recent failures linked to Microsoft Azure and Amazon Web Services.

What Caused the Failure

Prince explained that the issue came from a change to permissions in one of Cloudflare’s database systems. It was not related to DNS, not caused by Cloudflare’s new AI tools, and not the result of a cyberattack.

The problem began with the machine learning model used by Cloudflare’s Bot Management system. This model gives each request a “bot score” to decide if the traffic is real or automated. It uses a configuration file that updates often. But after a change in how Cloudflare’s ClickHouse database handled queries, this file started filling up with many duplicate rows.

The file quickly grew past the size limit built into the software that reads it. When that happened, the core proxy system that handles customer traffic began to fail for anyone using Cloudflare’s bot features. Companies that used Cloudflare’s rules to block certain bots saw real traffic treated as automated, which cut users off. Customers who did not rely on these bot scores stayed online.
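
The failure mode described above can be illustrated with a short Python sketch. This is not Cloudflare's code: the function names, the limit of 200 entries, and the count of 120 features are all made up for illustration. It only shows how duplicate rows from a query can push a generated file past a fixed limit in the program that loads it.

```python
# Illustrative sketch only; every name and number here is hypothetical.
MAX_FEATURES = 200  # assumed hard cap in the consumer of the file


class FeatureLimitExceeded(Exception):
    """Raised when the generated file holds more entries than the cap."""


def build_feature_file(rows):
    """Naive generator: emits one entry per query row, with no deduplication."""
    return [name for name, _column_type in rows]


def load_features(feature_names, limit=MAX_FEATURES):
    """Consumer that refuses any file larger than its fixed limit."""
    if len(feature_names) > limit:
        raise FeatureLimitExceeded(
            f"{len(feature_names)} features exceeds limit of {limit}"
        )
    return list(feature_names)


# Normal case: the query returns each feature once.
base_rows = [(f"feature_{i}", "Float64") for i in range(120)]

# After the database change: the same query returns every row twice.
duplicated_rows = base_rows + base_rows

if __name__ == "__main__":
    load_features(build_feature_file(base_rows))  # loads fine
    try:
        load_features(build_feature_file(duplicated_rows))
    except FeatureLimitExceeded as err:
        print("loader failed:", err)
```

In this sketch the loader raises an error as soon as the file doubles in size. If nothing upstream catches that error, everything depending on the loader fails with it, which is the kind of cascade the article describes.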

What Cloudflare Will Do Next

Prince said Cloudflare is working on several fixes to prevent this from happening again. These include making the system stricter when handling its own configuration files, adding more global kill switches, stopping error reports from using too many system resources, and reviewing how core systems respond when something goes wrong.
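
Two of those fixes, stricter handling of configuration files and a kill switch, can be sketched in a few lines of Python. Again, this is a hypothetical illustration rather than Cloudflare's implementation: the `BotScoring` class, the `validate` function, the limit of 200, and the placeholder scoring logic are all assumptions.

```python
# Illustrative sketch only; class names, limits, and scoring are hypothetical.
from dataclasses import dataclass

MAX_FEATURES = 200  # assumed cap, matching nothing in particular


def validate(candidate, limit=MAX_FEATURES):
    """Treat an internally generated file like untrusted input.

    Returns a list of problems; an empty list means the file is acceptable.
    """
    problems = []
    if not candidate:
        problems.append("file is empty")
    if len(candidate) > limit:
        problems.append(f"{len(candidate)} entries exceeds limit of {limit}")
    if len(set(candidate)) != len(candidate):
        problems.append("file contains duplicate entries")
    return problems


@dataclass
class BotScoring:
    features: list        # last known good configuration
    enabled: bool = True  # kill switch: set to False to disable scoring

    def apply_update(self, candidate):
        """Accept a new file only if it validates; otherwise keep the old one."""
        if validate(candidate):
            return False  # rejected, last known good stays active
        self.features = list(candidate)
        return True

    def bot_score(self, request_signals):
        """Return a placeholder score, or None when scoring is switched off."""
        if not self.enabled or not self.features:
            return None
        hits = sum(1 for name in self.features if name in request_signals)
        return hits / len(self.features)


if __name__ == "__main__":
    service = BotScoring(features=["a", "b"])
    print(service.apply_update(["a", "a", "b"]))  # rejected: duplicates
    print(service.features)                        # still the old file
```

The design choice here is that a bad update is rejected and the previous file stays in place, so a faulty file degrades nothing. The `enabled` flag lets an operator turn scoring off entirely, and callers would have to treat a `None` score as "no bot decision" instead of blocking traffic.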

Cloudflare’s outage showed how much the internet depends on a few large companies. As more services rely on the same infrastructure, experts warn that outages like this will continue to affect many websites at the same time.

Published by
Afaq Wajdan Malik