Holy Smokes Linode/Akamai

SplitIce · July 2025

This is why you don't put your failover PoP with Linode/Akamai if your primary is too. Even if they are in different Datacenters on other sides of the world.

Impact to LKE services has been confirmed to have also extended to our data centers in Dallas, Fremont, Sydney, Tokyo 2, Toronto and Washington due to the interaction with our data center in Newark. We continue to work bringing our services back online, and we will provide an update as soon as progress is made.

https://status.linode.com/incidents/6yw88b0ft94g (5 hours and counting)

Fortunately we run backup services for critical elements (e.g. networking, mitigation analyzers, etc.) with 2 other providers (3 regions). Didn't expect that from such a big player. Feeling very glad right about now that we went with "Deploy less but work with more vendors" as opposed to "Deploy an exact copy in a different region with the same vendor", which were both options when the work was being scoped.
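
For anyone weighing the same two options, here is a rough sketch of the "prefer a different vendor" rule when choosing a failover target. Everything in it is hypothetical and only for illustration: the `PoP` type, the `pick_failover` helper, and the PoP and vendor names are assumptions, not anyone's real setup, and a real deployment would feed `healthy` from actual health checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoP:
    # Hypothetical description of a point of presence.
    name: str
    vendor: str
    region: str


def pick_failover(primary, candidates, healthy):
    """Return the preferred healthy failover PoP, or None if none is usable.

    Preference order: different vendor first, then different region.
    `healthy` is a set of PoP names currently passing health checks.
    """
    usable = [p for p in candidates
              if p.name in healthy and p.name != primary.name]
    # False sorts before True, so other-vendor / other-region PoPs come first.
    usable.sort(key=lambda p: (p.vendor == primary.vendor,
                               p.region == primary.region))
    return usable[0] if usable else None


# Made-up inventory: one same-vendor PoP and two on other vendors.
primary = PoP("ewr-1", "linode", "newark")
candidates = [
    PoP("syd-1", "linode", "sydney"),
    PoP("fra-1", "vendor-b", "frankfurt"),
    PoP("sgp-1", "vendor-c", "singapore"),
]

choice = pick_failover(primary, candidates, {"syd-1", "fra-1", "sgp-1"})
print(choice.name if choice else "no failover available")
```

A same-vendor PoP in another region is only used here as a last resort, which is the point of this thread: a region boundary did not contain this outage, a vendor boundary would have.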

JoshR · July 2025

In a way it's no different from outages at GCloud, AWS, or Azure. The big clouds have them just as well. And it seems like when the big clouds have them, it affects more than one place.

Arirang · July 2025

As you may know, according to the status page, there are a lot of issues affecting a lot of regions.
I'm considering moving out from them.

SplitIce · July 2025

@JoshR I am aware it's not unique to Akamai. But they do have an irritatingly sparing style of status page updates and a way of ignoring a single ticket (I raised the issue prior to the incident; it's still open 19 hours later with no response).

Glad I decided to do 3-hour sleeps between checks last night; it was no quick issue.

@Arirang I'm definitely considering it too (it won't be an overnight thing).

I happen to know there's a pretty big unpatched bug in their LKE implementation too. Acknowledged by them. Patch developed. I got months of monthly updates letting me know that it's not being put into the next release, but maybe the one after. And it's the kind of issue that anyone running a moderately sized LKE cluster may hit (especially if you run many smaller nodes with a decent number of pods).

The clusters I manage are not small accounts either. One account could easily pay an Australian salary.

We are about half up (half the LKE nodes) currently. The other half appear up but without internal networking. Not sure if it's a split-brain situation (authentication systems are on one side).

ehhthing · July 2025

EWR has been down for > 12 hours, I wonder what's going on. Based on their status updates the initial power outage was fixed, but it seems like their cooling systems are still dead?

I guess it'll take much longer than 12 hours to fix cooling...

MikeA · July 2025

Quite surprising for a company of that size and revenue ($4 billion in 2024). Seems like some lower-end commercial datacenters are set up better in terms of redundancy and backup hardware/facilities. But maybe it's something insane. Would be cool to read about it if they were to write a blog post afterwards like Cloudflare does after downtime events.

SplitIce · July 2025

The fact that they blame cooling and power is interesting considering the servers I do have access to have not been rebooted.

It will be 24 hours offline shortly I expect.

caasify · July 2025

@SplitIce said:
This is why you don't put your failover PoP with Linode/Akamai if your primary is too. Even if they are in different Datacenters on other sides of the world.

Impact to LKE services has been confirmed to have also extended to our data centers in Dallas, Fremont, Sydney, Tokyo 2, Toronto and Washington due to the interaction with our data center in Newark. We continue to work bringing our services back online, and we will provide an update as soon as progress is made.

https://status.linode.com/incidents/6yw88b0ft94g (5 hours and counting)

Fortunately we run backup services for critical elements (e.g. networking, mitigation analyzers, etc.) with 2 other providers (3 regions). Didn't expect that from such a big player. Feeling very glad right about now that we went with "Deploy less but work with more vendors" as opposed to "Deploy an exact copy in a different region with the same vendor", which were both options when the work was being scoped.

To avoid issues like the recent Linode outage affecting multiple regions, you can use Caasify, a centralized platform that lets you deploy VPS instances across 81+ data centers from providers like Linode, DigitalOcean, Vultr, Hetzner, and more, all through a single account. This way, you can easily build a multi-vendor, multi-region infrastructure without the hassle of managing separate accounts on each platform, helping you reduce the risk of vendor-wide outages affecting your services.

Rubben · July 2025

@caasify said:

@SplitIce said:
This is why you don't put your failover PoP with Linode/Akamai if your primary is too. Even if they are in different Datacenters on other sides of the world.

Impact to LKE services has been confirmed to have also extended to our data centers in Dallas, Fremont, Sydney, Tokyo 2, Toronto and Washington due to the interaction with our data center in Newark. We continue to work bringing our services back online, and we will provide an update as soon as progress is made.

https://status.linode.com/incidents/6yw88b0ft94g (5 hours and counting)

Fortunately we run backup services for critical elements (e.g. networking, mitigation analyzers, etc.) with 2 other providers (3 regions). Didn't expect that from such a big player. Feeling very glad right about now that we went with "Deploy less but work with more vendors" as opposed to "Deploy an exact copy in a different region with the same vendor", which were both options when the work was being scoped.

To avoid issues like the recent Linode outage affecting multiple regions, you can use Caasify, a centralized platform that lets you deploy VPS instances across 81+ data centers from providers like Linode, DigitalOcean, Vultr, Hetzner, and more, all through a single account. This way, you can easily build a multi-vendor, multi-region infrastructure without the hassle of managing separate accounts on each platform, helping you reduce the risk of vendor-wide outages affecting your services.

I'm sure OP wants to use some random no-name reseller 😆 What a shameless ad plug.

ehhthing · July 2025

@MikeA said:
Quite surprising for a company of that size and revenue ($4 billion in 2024). Seems like some lower-end commercial datacenters are set up better in terms of redundancy and backup hardware/facilities. But maybe it's something insane. Would be cool to read about it if they were to write a blog post afterwards like Cloudflare does after downtime events.

Linode was only recently acquired by Akamai, and Newark is one of their oldest DCs, so it dates back to way before Akamai became involved in the picture.

Socheat · July 2025

@SplitIce said:
@JoshR I am aware it's not unique to Akamai. But they do have an irritatingly sparing style of status page updates and a way of ignoring a single ticket (I raised the issue prior to the incident; it's still open 19 hours later with no response).

That doesn't sit well for a big player like Linode/Akamai. They left the ticket unanswered for 19 hours? Wow. At least some LET providers are better when it comes to support.

MikeA · July 2025

@ehhthing said:

@MikeA said:
Quite surprising for a company of that size and revenue ($4 billion in 2024). Seems like some lower-end commercial datacenters are set up better in terms of redundancy and backup hardware/facilities. But maybe it's something insane. Would be cool to read about it if they were to write a blog post afterwards like Cloudflare does after downtime events.

Linode was only recently acquired by Akamai, and Newark is one of their oldest DCs, so it dates back to way before Akamai became involved in the picture.

I'm aware, but they've owned them for a few years now.

SplitIce · July 2025

@caasify said:
To avoid issues like the recent Linode outage affecting multiple regions, you can use Caasify, a centralized platform that lets you deploy VPS instances across 81+ data centers from providers like Linode, DigitalOcean, Vultr, Hetzner, and more, all through a single account. This way, you can easily build a multi-vendor, multi-region infrastructure without the hassle of managing separate accounts on each platform, helping you reduce the risk of vendor-wide outages affecting your services.

Most people's issue with redundancy is not managing a couple of accounts, it's:

configuring software and designing architecture to support global deployment (e.g. database servers)
cost: a global / multi-vendor deployment just costs more

SplitIce · July 2025

Wow,

Mitigation efforts are continuing with our subject matter experts actively working to restore the remaining services. We will continue to provide updates as progress is made.

The last 3 updates are all the same; that's 3 times in a row over the past 7 hours.

Maelstrom36 · July 2025

@SplitIce said:
Wow,

Mitigation efforts are continuing with our subject matter experts actively working to restore the remaining services. We will continue to provide updates as progress is made.

The last 3 updates are all the same; that's 3 times in a row over the past 7 hours.

Just the usual corporate, PR-friendly kind of update

SplitIce · July 2025

Also, it's not mentioned on the status page, but someone else I know who manages a cluster on Linode got a data loss email.

So at least some people have lost data. It might be a good idea to keep your backups at hand and take the time to manually verify them. I'm definitely pulling my (remote) backups (even if it's paranoia).
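
If you want something more repeatable than eyeballing file sizes, a minimal sketch of checking pulled backups against a checksum manifest could look like the following. It assumes you recorded SHA-256 digests when the backups were taken; the `verify_backups` helper, the manifest layout, and the file names are hypothetical, not part of any Linode tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backups(backup_dir, manifest):
    """Compare files in backup_dir against a {filename: sha256 hex} manifest.

    Returns {filename: "ok" | "missing" | "corrupt"}.
    """
    results = {}
    for name, expected in manifest.items():
        path = Path(backup_dir) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha256_of(path) != expected:
            results[name] = "corrupt"
        else:
            results[name] = "ok"
    return results


if __name__ == "__main__":
    # Hypothetical manifest; in practice load it from wherever you stored it.
    manifest = {"db-dump.sql.gz": "0" * 64}
    for name, status in verify_backups(".", manifest).items():
        print(f"{name}: {status}")
```

A matching checksum only tells you the file is the one you uploaded. It does not prove the dump restores cleanly, so a test restore is still worth doing.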

sibaper · July 2025

It's impacting multiple services. It seems they make an update, then something goes wrong.

SplitIce · July 2025

Good news! 4/12 of our Kubernetes nodes are functional.

Maybe we should spin up replacement nodes? Nope, new nodes are non-functional from boot.

But don't worry, the issue is considered resolved by Akamai. Celebrate.

SplitIce · July 2025

Good news: 12/12 Kubernetes nodes are up, it seems.

No ticket update; let's not touch anything.

jsg · July 2025

But, but! ... who could have known that propagating configs throughout a global network would also propagate mistakes and errors?!

The culprits obviously are the evil fibers and routers who just don't care about what's in the packets they transport! Shame on them!

Akamai is completely totally innocent (as multi-billion corporations usually are) as their PR uhm, I mean statements (as well as absence thereof) clearly demonstrate.

(in a 3pt Arial footnote, light grey on white: "A few customers might have experienced some minor issues, which our CEO will personally investigate and fix. Out of abundance of caution we do reject any and all responsibility for what may or may not have happened. We appreciate your understanding")

Neoon · July 2025
