AWS and GCP down? Major outage?

nohavps · June 2025

@yoursunny said:
Provider with routed /48 and no-cost BGP session for $3.50/month or less is less likely to have an outage.

Only you, with your IPv6, should be recognized by ARIN, RIPE as a member superior to the others!!!

COLBYLICIOUS · June 2025

https://blog.cloudflare.com/cloudflare-service-outage-june-12-2025/

jsg · June 2025

"The cloud is safe", they said, propagandized, and preached and even countries increasingly put their admin shit shows in the cloud.

The idiots got what they deserve. Simple as that.

raindog308 · June 2025

I can't believe this was caused by a null pointer exception.

Google. In 2025. Taking down their cloud with a null pointer exception. Sigh.

Roast them, @jsg

https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

"On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.

Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first."
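For illustration only, here is a rough Python sketch of the failure mode the report describes: a blank policy field dereferenced without error handling, next to a version guarded by a feature flag and a blank-field check. Every name in it (`QuotaPolicy`, `check_quota_unsafe`, `check_quota`, `QUOTA_POLICY_CHECKS_ENABLED`, the region strings) is hypothetical and not taken from Google's code, and failing open on a malformed policy is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-region feature flag. In the report's terms, this would be
# enabled gradually, starting with internal/staging environments.
QUOTA_POLICY_CHECKS_ENABLED = {"staging": True, "us-east1": False}


@dataclass
class QuotaPolicy:
    project: str
    limit: Optional[int]  # None models an "unintended blank field"


def check_quota_unsafe(policy: QuotaPolicy, usage: int) -> bool:
    # Mirrors the failure mode: the blank field is used without any check.
    # With limit=None this raises TypeError, the Python analogue of the
    # null pointer crash.
    return usage <= policy.limit


def check_quota(policy: QuotaPolicy, usage: int, region: str) -> bool:
    # Flag off: the new code path is never exercised in this region.
    if not QUOTA_POLICY_CHECKS_ENABLED.get(region, False):
        return True
    # Blank field: report it and fall back instead of crashing the process.
    # Failing open (allowing the request) is an assumption of this sketch.
    if policy.limit is None:
        print(f"malformed quota policy for {policy.project}, skipping check")
        return True
    return usage <= policy.limit
```

With the flag on only in "staging", a policy with a blank `limit` would show up there as a logged warning rather than as a crash loop in every region at once.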

WyvernCo · June 2025

Fingers crossed for a Kevin Fang video about this

jsg · June 2025

@raindog308 said:
I can't believe this was caused by a null pointer exception.

Google. In 2025. Taking down their cloud with a null pointer exception. Sigh.

Roast them, @jsg

https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW

OK, but I won't roast them for the null pointer.

What I see as the real problem is the fact that they (self-admittedly) basically run a "do as you please, adhere to rules or don't, just as you like" software development freak-show.

Google does have at least some very capable and experienced engineers and Google does have a set of maybe not complete but reasonable rules and they do know how to do software development (the whole cycle, incl. testing) - but they obviously not only tolerate utter ignorance but even gross disregard of those rules. In other words: it's mainly a management problem.

Now to the null pointer and why I'm somewhat lenient on that.

Null pointers (sadly) still just are a fact in the field. Google's code often needs to be high-performance which boils down to certain languages, plus I guess they need lots and lots of code which boils down to not being able to use the very few safe languages and techniques.
One can create (almost) 100% safe software, and it's actually done, e.g. with railway management systems, air and aircraft control, etc. - but that's very, very expensive and also quite slow (development cycle). I happen to do a lot of work in that field and painfully know what I'm talking about.

Now, writing software for say, a nuclear reactor control system is a large project - but compared to the mega shit tons of code Google needs it looks wimpy. Read: it's reasonably doable. Very expensive, very complex development chain, lots of formal stuff, beginning with the specification and requirements and certainly not ending with static analysis, and so on. But it's doable.

One major reason for "it's doable" is that such a project needs relatively "few" developers and those usually are used to working in a very strict environment. Google however needs a large armada of developers and the vast majority of those would run away if Google went hardcore on safety. Plus, of course, the whole software development would be much, much more expensive, and even that is theoretical because they wouldn't even find that many (adequate) engineers in the first place.

So, they did what they could (as in also "economically reasonable"). Hell, they even created a quite capable (albeit not my taste) programming language suiting their needs and with halfway reasonable safety. It even invites developers to always return an error state.

But here's a "dirty" secret: I know e.g. Ada, and I even like it (a lot), but there are situations when one must squeeze out even the last bit of performance ... and then most developers turn to the compromise between Assembler and a modern programming language: C. Yes, it's dangerous, it's kind of dirty, uncool, etc. but it's the language with which you get those difficult spots done, plus, at least nowadays, you have quite direct access to e.g. SIMD and the like, which can be the difference between pushing 3 Gb/s and 50+ Gb/s through the network. The price one pays for that, if safety is paramount, is high though, and you must use six legged creatures like e.g. Frama-C, weird (often OCaml based) tools or even go really hardcore with full proof systems like Isabelle ... or simply not give a shit (like at Google it seems).

That's why I'm lenient on the null pointer per se, and rather hit hard on management, because, OK, Google doesn't run nuclear reactors (yet) but they are deeply interwoven with billions and billions of $$ going through their systems (usually for customers) and millions of people depending on them and they absolutely need to run a reasonably tight ship - but they don't, as this case very clearly and painfully shows.

If I were high up at Google I'd let the guilty developer get away with a stern warning and stricter oversight, but one or more in management would get fired.

SteveMC · June 2025

This is what happens when you rely on AI for your source code...

PineappleM · June 2025

This will all be moot because AI will eventually take the jobs of these expensive engineers right? (Give it 5 years or so.) Google is in the AI race after all, wouldn’t put it past them to displace their own headcount with it.

COLBYLICIOUS · June 2025

I think I'll make the move to use Bunny (aff / non-aff) for DNS & CDN just because they are from the EU (GDPR and all that stuff) and also not dependent on Google/Cloudflare or any other big tech company.
