Small outage @ AMS-IX

Hey Guys,

I'm posting here to ask whether any of you also noticed issues with network traffic to/from the Netherlands. AMS-IX had some big issues on one of their core routers:

"Dear member/customers,
Three 36x100G linecards at the PE router, located at Equinix AM7, rebooted unexpectedly causing many customer ports and Backbone connection to flap for several minutes. At this moment the linecards recovered their operational status, but we are still unaware of the root cause. We will investigate the issue and monitor the situation.
Our apologies for the inconvenience
Kind Regards
AMS-IX NOC"

Did you experience issues?

Comments

  • jsg Member, Resident Benchmarker

    So, one of their 9850 edge routers had a hiccup, or more precisely, three 100G line cards in that "router" did. That's extreme - but then their routers, which seem to be mainly aggregation switches, are from 'Extreme Networks'.

    Normally I would ask how there was no redundancy in place, but hey, let's not be too hard on them. After all, those cards at least rebooted into what seems to be a stable state and, most importantly, their NSA/GCHQ snorkels, err, I mean their Glimmerglass MEMS switches, continued to work properly. And no, considering their age that's not to be expected anyway. But it's really good, because Glimmerglass seems to have gone belly up years ago (or been swallowed up?).

  • SplitIce Member, Host Rep

    I think I saw that. It was pretty quick according to my monitoring server.

  • @SplitIce said:
    I think I saw that. It was pretty quick according to my monitoring server.

    Total downtime for some connections (like Vodafone NL, Ziggo (UPC), Tweak.nl and some Dutch hosting providers) was around 8 to 15 minutes, with some gaps in between. After that it was resolved, according to my SmokePing and MTR change-monitoring server.
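
    An 8-to-15-minute window like the one described above is roughly what a simple probe loop would report. Below is a minimal sketch in that spirit, written in Python; it is not the poster's actual SmokePing/MTR setup, and the target host and probe interval are made-up placeholders.

    ```python
    #!/usr/bin/env python3
    # Minimal outage-window probe (sketch only, not the poster's real monitoring).
    # Assumes Linux iputils ping; TARGET and INTERVAL are made-up placeholders.
    import subprocess
    import time
    from datetime import datetime

    TARGET = "example.net"   # hypothetical endpoint reached via AMS-IX
    INTERVAL = 10            # seconds between probes

    def ping_once(host: str) -> bool:
        """Return True if a single ICMP echo gets a reply within 2 seconds."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def main() -> None:
        down_since = None
        while True:
            up = ping_once(TARGET)
            now = datetime.now()
            if not up and down_since is None:
                down_since = now
                print(f"{now:%H:%M:%S} {TARGET} stopped responding")
            elif up and down_since is not None:
                outage = (now - down_since).total_seconds()
                print(f"{now:%H:%M:%S} {TARGET} back after ~{outage:.0f}s")
                down_since = None
            time.sleep(INTERVAL)

    if __name__ == "__main__":
        main()
    ```

    With a 10-second interval the reported window is only accurate to about that granularity, which is why tools like SmokePing probe more often and also keep per-probe latency.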

  • @jsg said:
    So, one of their 9850 edge routers had a hiccup, or more precisely, three 100G line cards in that "router" did. That's extreme - but then their routers, which seem to be mainly aggregation switches, are from 'Extreme Networks'.

    Normally I would ask how there was no redundancy in place, but hey, let's not be too hard on them. After all, those cards at least rebooted into what seems to be a stable state and, most importantly, their NSA/GCHQ snorkels, err, I mean their Glimmerglass MEMS switches, continued to work properly. And no, considering their age that's not to be expected anyway. But it's really good, because Glimmerglass seems to have gone belly up years ago (or been swallowed up?).

    Yep, that's precisely what we also heard. The line cards also went down a second time after they had been replaced, and from what we've heard they re-routed traffic and moved some cables to another core router so they could investigate this one further. I'm really impressed by how quickly they managed to resolve this.

  • SplitIce Member, Host Rep

    My monitoring server saw it; between TransIP (NL) and DigitalOcean AMS2 it reported 4 minutes. Either end may have re-routed, however.
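
    For the "either end may have re-routed" question, one low-tech check is to snapshot the forward path every few minutes and diff the hop list. The sketch below uses plain traceroute rather than MTR, and the target host and interval are made-up placeholders; note that per-flow load balancing can also change hops, so a diff is a hint rather than proof of a re-route.

    ```python
    #!/usr/bin/env python3
    # Path-change watcher (sketch only). Requires the traceroute binary.
    import subprocess
    import time

    TARGET = "example.net"   # hypothetical host, not one mentioned in the thread
    INTERVAL = 300           # seconds between snapshots

    def hops(host: str) -> list:
        """Return the per-hop addresses from one traceroute run ('*' for timeouts)."""
        out = subprocess.run(
            ["traceroute", "-n", "-q", "1", host],
            capture_output=True, text=True,
        ).stdout
        addresses = []
        for line in out.splitlines()[1:]:      # first line is the header
            fields = line.split()
            if len(fields) >= 2:
                addresses.append(fields[1])    # hop address, or '*' on timeout
        return addresses

    def main() -> None:
        previous = hops(TARGET)
        while True:
            time.sleep(INTERVAL)
            current = hops(TARGET)
            if current != previous:
                print(f"path to {TARGET} changed:")
                print("  old:", " -> ".join(previous))
                print("  new:", " -> ".join(current))
            previous = current

    if __name__ == "__main__":
        main()
    ```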

  • jsg Member, Resident Benchmarker
    edited July 2020

    @FoxelVox said:
    Yep, that's precisely what we also heard. The line cards also went down a second time after they had been replaced, and from what we've heard they re-routed traffic and moved some cables to another core router so they could investigate this one further. I'm really impressed by how quickly they managed to resolve this.

    Pardon me, but no, that's not very smart or professional. One replaces a line card if one (1) line card goes berserk - but not when 3 cards fail. Reason: the chance that it's just a line card failure is virtually zero; in fact, when 3 line cards fail at (roughly or precisely) the same time, chances are high that there is either a data plane or management failure (quite unlikely but possible) or an outside condition causing a grave problem in the line cards, where "outside condition" may be an un- or at least not malevolently intended problem (e.g. due to misconfiguration) or an intended one (e.g. someone intentionally (ab)using a weakness to trigger a failure). (A rough back-of-envelope version of this argument is sketched after this post.)

    Let's hope that the cause was just an unintentional f_ckup, because if it wasn't, AMS-IX and a lot of parties depending on them may be in for a very nasty surprise, maybe not tomorrow but at the most unfortunate moment.
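
    For what it's worth, the "three cards at once is not a coincidence" argument can be put in rough numbers. The figure below assumes, purely for illustration, that each line card independently suffers about one spontaneous reboot per year; the point is only how fast the independent-failure probability collapses.

    ```python
    # Back-of-envelope check of the argument above, with made-up numbers.
    MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
    p = 1 / MINUTES_PER_YEAR           # per-card chance of rebooting in any given minute

    independent = p ** 3               # all three reboot in the same minute by chance
    print(f"per-card per-minute probability: {p:.2e}")                    # ~1.9e-06
    print(f"three independent reboots in one minute: {independent:.2e}")  # ~6.9e-18
    # Effectively never, so a shared trigger (software bug, configuration,
    # management plane, or something external) is the far more plausible reading.
    ```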

  • SplitIce Member, Host Rep

    @jsg
    Sometimes shit happens. A management or data plane failure is not that unlikely, honestly.

    Of all the supposedly large and professional transit & IX providers that have had major breakages due to router issues in recent years, this one doesn't rank for me.

    Who here remembers Telia's New York issues?

  • jsg Member, Resident Benchmarker

    @SplitIce said:
    Sometimes shit happens. A management or data plane failure is not that unlikely, honestly.

    Of all the supposedly large and professional transit & IX providers that have had major breakages due to router issues in recent years, this one doesn't rank for me.

    Who here remembers Telia's New York issues?

    Yes, sometimes sh_t happens - but 3 100G line cards failing isn't "sh_t happens"; that's a problem, and potentially a serious one.
    Also note that AMS-IX isn't just any IX but one of the top 10 worldwide, and this is about equipment well known to the AMS-IX engineers.

    I respect your view and maybe you are right, but I think there's more to it than just "sh_t happens".
