
A network configuration error caused a massive Cloudflare outage.

The problem started when parts of Cloudflare's network suddenly went offline. The outage looked relatively minor at first, but it quickly became a massive issue. Cloudflare sits between websites and their visitors, providing content delivery and security services, so a failure in its network can take many sites offline at once.



A move that should have boosted network resilience, according to Cloudflare, produced a large outage that affected more than a dozen of its data centers and hundreds of important online platforms and services today.

After analyzing the event, Cloudflare stated, "Today, June 21, 2022, Cloudflare had an outage that disrupted traffic in 19 of our data centers."

"Regrettably, these 19 locations are responsible for a considerable amount of our global traffic. This disruption was triggered by a change that was implemented as part of a long-term strategy to improve resilience in our busiest sites."

According to user reports, the affected websites and services include Amazon, Twitch, Amazon Web Services, Steam, Coinbase, Telegram, Discord, DoorDash, GitLab, and more.

Cloudflare's busiest locations were affected by the outage.

After reports of disruptions to Cloudflare's network from customers and users around the world, the company began investigating the incident at roughly 06:34 UTC.

"Customers trying to access Cloudflare sites in the affected areas will receive 500 errors. All data plane services in our network are affected by the incident," according to Cloudflare.

While the incident report on Cloudflare's system status page has no data about what caused the outage, the firm provided further information about the June 21 outage on its official blog.

The Cloudflare team stated, "This interruption was caused by a change that was part of a long-running endeavor to boost resilience in our busiest areas."

"An outage began at 06:27 UTC due to a change in network configuration in specific sites. The first data center was brought back up at 06:58 UTC, and by 07:42 UTC, all data centers were up and running.

"You may have been unable to access websites and services that rely on Cloudflare depending on your location. Cloudflare continues to function normally in other areas."

Despite the fact that the affected locations account for approximately 4% of Cloudflare's total network, their outage affected roughly 50% of all HTTP requests served by Cloudflare globally.
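
To put those two percentages side by side, here is a minimal Python sketch. It assumes the 4% figure describes the share of Cloudflare locations (one possible reading of "total network"), and the function name is hypothetical, chosen only for this illustration.

```python
def per_location_load_ratio(location_share: float, request_share: float) -> float:
    """Average requests per affected location, relative to an unaffected one.

    Assumption: `location_share` is the fraction of locations affected and
    `request_share` is the fraction of all requests those locations serve.
    """
    if not (0 < location_share < 1 and 0 < request_share < 1):
        raise ValueError("shares must be strictly between 0 and 1")
    affected = request_share / location_share
    unaffected = (1 - request_share) / (1 - location_share)
    return affected / unaffected


# Figures from the article: about 4% of the network, about 50% of HTTP requests.
ratio = per_location_load_ratio(0.04, 0.50)

if __name__ == "__main__":
    print(f"An affected location served about {ratio:.0f}x the requests of an unaffected one")
```

Under that reading, each affected location handled on average about 24 times as many requests as an unaffected one, which is why taking so few sites offline had such a large effect.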

The update that caused today's outage was part of a bigger endeavor to upgrade data centers in Cloudflare's busiest locations to a more resilient and flexible architecture, internally dubbed Multi-Colo PoP (MCP).

Amsterdam, Atlanta, Ashburn, Chicago, Frankfurt, London, Los Angeles, Madrid, Manchester, Miami, Milan, Mumbai, Newark, Osaka, São Paulo, San Jose, Singapore, Sydney, and Tokyo are among the data centers hit by today's event.

Timeline of the outage, according to Cloudflare (all times UTC):

03:56: The modification is deployed to our first location. None of our locations are affected, because the locations reached so far still use the earlier architecture.

06:17: The modification has been implemented in our busiest locations, but not in the MCP architecture locations.

06:27: The modification has been deployed to our spines, and the rollout has reached MCP-enabled areas. This is when the problem began, as these 19 locations were quickly taken offline.

06:32: An internal Cloudflare incident is declared.

06:51: The first change is made on a router to verify the root cause.

06:58: The root cause has been identified and understood. Work on reversing the problematic alteration begins.

07:42: The last of the reverts is finished. This was delayed because network engineers overwrote each other's changes, undoing reverts that had already been made.
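
The durations implied by this timeline can be worked out with a short Python sketch. The timestamps are copied from the entries above; the dictionary keys and function names are hypothetical labels chosen for this illustration, and all times are assumed to fall on the same day (June 21, 2022, UTC).

```python
from datetime import datetime

# Timestamps (UTC) taken from the timeline; the key names are illustrative.
TIMELINE = {
    "first_deploy": "03:56",
    "busiest_locations": "06:17",
    "outage_start": "06:27",
    "incident_declared": "06:32",
    "first_router_change": "06:51",
    "root_cause_identified": "06:58",
    "last_revert_done": "07:42",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from `start` to `end`, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline: dict) -> dict:
    """Key intervals, all measured from the start of the outage."""
    start = timeline["outage_start"]
    return {
        "time_to_declare": minutes_between(start, timeline["incident_declared"]),
        "time_to_root_cause": minutes_between(start, timeline["root_cause_identified"]),
        "total_outage": minutes_between(start, timeline["last_revert_done"]),
    }


if __name__ == "__main__":
    for name, minutes in summarize(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

By this calculation, the incident was declared 5 minutes after the outage began, the root cause was identified after 31 minutes, and the last revert finished 75 minutes after the start.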
