Azure US East outage due to fiber cut
So, Azure's US East location has been having issues for the past 9 hours or so, reporting 'reduced capacity' due to a fiber cut. I've seen reports of people's services being down there entirely.
You'd think that with the $87/TB they charge for network traffic, they'd at least provide some real redundancy... but apparently a single line being cut is enough to fuck shit up.
Comments
If you sell something that enough people still buy just because it has CLOUD or Azure in the name, why bother putting up an additional fibre uplink?
Let them pay $87/TB if they are stupid enough.
The good thing about this is that I am sure some of them will switch and learn. Hopefully.
@joepie91
Yes, ridiculous.
But my guess is that they DO have multiple lines - but a single feed-in. Cheap, but bad practice. A high-class DC has dual feed-ins with a sufficiently large distance between them (typically opposite sides of the building) AND different physical routes.
With power feeds one can be a bit more sloppy if one has good enough backup infrastructure. But with fibre, a cut is a cut, and the only proper fix is redundancy over actually different physical routes.
At $87/TB there is no excuse for cutting corners.
One thing Microsoft Azure and OVH have in common. One is 90% cheaper.
Two diverse fibres going out via two opposite sides of the building that never overlap is the minimum standard I consider ‘professional’
Single homed duh. They need that sweet CC blend.
I am certain those companies that need real redundancy have implemented a high-availability architecture for their databases and virtual machines, so they can scale up and route around the affected region. Of course, that wrecks people's ability to offer snide comments.
It doesn’t excuse the extremely poor network design though.
Quite possible. Doesn't change the fact that Azure are charging $87/TB for network infrastructure that's less redundant than some of the 'one-man shows' on here.
Perhaps their clients will ask for more details after the crisis is mitigated via private conversations. Like some companies here have requested in the past.
LOL love this
I'd hazard a guess that most of the people using Azure are running it off the $150 per month of free Azure credits that people get via university or an MSDN subscription.
Or startups who participate in their BizSpark program.
Then again, most of the big spenders I know who use Azure don't rely on a single location for their workload. Also, to be fair, most of the services people are using aren't necessarily compute workloads.
Or maybe those companies chose an expensive service ("Azure") because - so they thought - it has some real redundancy.
$87/TB? Is this for real?
Yep, and they literally give you the option to either make it GeoRedundant or Datacenter level Redundant when you create your Compute resources storage.
"This issue was attributed to a fiber cut caused by construction approximately 5 km from Microsoft data centers. This resulted in multiple line breaks impacting separate splicing enclosures that reduced capacity between 2 Azure regional data centers."
"multiple line breaks" and "reduced capacity between 2 azure regional data centers". So as far as I can tell, it's not "we lost all connectivity due to a single line break".
Well, I think they've still missed the idea of a diverse feed then; it should be diverse from their core routers onwards, even including the datacentre cross connects. Unless two separate digger teams happened to dig up each of the diverse routes at the same time - and I'm not buying that.
Not saying it isn't possible for two diverse feeds to go offline at once, coincidences can happen, but their explanation seems off.
For all of the providers saying "they should have known" or they should have done better, blah blah blah: can you provide documents detailing your fibre route maps, internal and external to your building - not just to the edge of the lot, but for miles and miles away from the DC - so we can see you've truly documented that your network is so robustly designed there are no convergent paths? (Yep, that means 2 sea cables, not just one, if you are in the EU and servicing the US.) My point is, it's easy to point a finger based on one paragraph of info - but is your setup truly better, and documented?
Regarding bandwidth: inbound is free (I've seen providers here that count it) and pricing does drop for high-volume users. Of course, well-structured web pages and data pulls will minimize costs; sloppy code on a high-usage site would get severely penalized in bandwidth charges on Azure (or AWS or Google), but would be less impactful on providers with more "free" bandwidth.
Remember that time when Azure in the Netherlands was completely shut down because of high humidity?
I'm really starting to question what the money is going to here.
Hookers, cocaine, lawyers.
At ~$26000 per gigabit/mo, it's simply not feasible.
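That figure roughly checks out. Here's a quick back-of-envelope sketch (assuming a flat per-TB rate and a 30-day month; the $80/TB discounted tier used below is my assumption, not a quoted Azure price):

```python
# Sanity check on the "~$26,000 per gigabit/month" figure.
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 s in a 30-day month
GBIT_PER_GBYTE = 8                   # 8 gigabits per gigabyte

def monthly_cost_per_gbps(price_per_tb):
    """Cost of saturating 1 Gbit/s outbound for a month at a flat $/TB rate."""
    tb_per_month = (1 / GBIT_PER_GBYTE) * SECONDS_PER_MONTH / 1000  # GB -> TB
    return tb_per_month * price_per_tb

print(round(monthly_cost_per_gbps(87)))  # list price: 28188
print(round(monthly_cost_per_gbps(80)))  # assumed discounted tier: 25920
```

So 1 Gbit/s sustained moves about 324 TB/month, which at the headline rate is closer to $28k; a modest volume discount lands right around the $26k mark.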
Yes. We require detailed fibre map for our proposed route to be a part of the contract when we order a fibre. We then make sure that cross connects are run diversely to each of the diverse pair. It's quite a detailed planning process when each new location is opened.
Awesome! I guess for these things the devil is in the details and it can get quite detailed to go through.
Yes, all of our dark fibre from our various providers comes with detailed maps of the splicing points and amplification sites and routes etc. The routes are also carefully planned prior with the provider. Granted we don't have the same maps for transit providers, but that's why transit is taken at several sites and sites are interconnected diversely.
Also worth noting that on all of our metro dark fibre, at least within the UK, we have to pay tax, so at that point it's absurd if you don't request a map of your route.
Azure (and in fact the whole of Microsoft) operates via resellers and CSPs. Once you are billing 1500-2000 bucks a month with them, they will make you a CSP, reseller, authorized channel partner, etc. - whatever they call it. Then if there is a customer case/project, it gets transferred to you and you get business through Microsoft.
I know many dev teams and companies that work on this model. They know Azure is highly overpriced infra, yet they stick with it for business that eventually pays for itself and earns them good money (or not, I don't know, but that's the model they follow).
Azure support is mediocre to bad, and it is highly likely that at first you will hit incompetent third-party support staff; only when you yell at them do things get noticed. This also depends on region... the US usually has better support, the UK... I never felt very good about it, and Asia is the worst.
Multiple breaks on the same line, judging from the cause (because construction work would not simultaneously break two geographically redundant lines).
And yes, I am aware they call it 'reduced capacity'. I've spoken to people who reported straight-up outages of their services. Apparently 'reduced capacity' means "some people's services are up, some people's services are not".
Yes.
Even if you take the cheapest bulk tier and halve the price to compensate for the 'free inbound', it's still $25/TB; an order of magnitude more expensive than what providers here charge. I expect serious redundancy for that kind of cost difference.
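For a rough sense of the gap, a sketch with illustrative numbers (the LET-style plan below is hypothetical, not a quote from any provider here):

```python
# Rough comparison; all numbers are assumptions for illustration.
azure_bulk = 50.0                  # assumed cheapest Azure bulk tier, $/TB
effective_azure = azure_bulk / 2   # halved to credit the free inbound -> $25/TB

# hypothetical LET-style VPS: $7/mo including 5 TB of transfer
let_effective = 7.0 / 5            # $1.40/TB if you use the full allowance

print(effective_azure / let_effective)  # ~18x - an order of magnitude
```

Even with generous assumptions in Azure's favour, the effective per-TB price lands well over 10x what a small provider's bundled transfer works out to.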
I can't speak for the EU, but for the US and Canada, lots of providers refuse to provide network plans to customers, even to DCs like OVH. They say it's a security measure, which is not totally false when a single splice can carry your competitors, but also government and other critical traffic. I can say this because I'm a network builder myself and have done fibre installations for some high-importance enterprises, and none of them have a plan of the network beyond their own floor and the first network pole/manhole (the connection point to the network). Lots of them have a "ring" setup which, for many kilometres, isn't actually a ring, due to network limitations, fibre/splice re-use, reversed setups, or simply because the engineer didn't see a common point of failure on their plan (e.g. re-sold/leased fibre, outdoor elements, human interaction, etc.).
I simply hope they will learn, then contact their providers, check why the ring didn't kick in, and change what needs to be changed.
Regards, David
I've only ever dealt with dark fibre (and waves) in the EU, but our providers euNetworks and Zayo are both very detailed in the plans they provide before a circuit is provisioned; they are also consistent in updating them if the route changes (which happens a fair amount on long-haul waves).
Last-mile circuits from national telecoms are a different story. We have a good number of BT Openreach last-mile fibre circuits, and only on rare occasions have we managed to get plans out of them; however, they certainly do plan for diversity and keep it that way. We have had several fibre breaks on our last-mile circuits to clients, but not once in over 4 years have both legs of a diverse circuit gone down, and they always clearly state where the "pinch points" are, e.g. a building without diverse entries from different manholes.
Pretty much all backbones (~ not last mile) are done as rings (well, as quite stretched ellipses) and often have bi-directional fiber pairs. So it's usually not even particularly painful or troublesome to have proper dual feeds through different building feed ins.
As for maps my experience was mixed. Some (incl. btw very large US carriers) do provide at least "stepped maps" where you get quite precise maps for some miles from your DC and less precise maps for anything beyond that. And some just refuse it completely, usually mentioning security concerns.
And, frankly, it's not even an absolute must have, because in the end it's the contract that's relevant. And in those you virtually always CAN get something like "the 2 fibers, except for [pops, dcs,...] are at least xyz miles apart from each other" with more specifics re. pops, dcs, landings, etc. (which themselves often are not exactly geo specified). Typically those specs/info are in the contract annex. Plus, of course, you have your SLA. Putting those next to each other you have a pretty good basis to judge both the quality and risks of your feed(s).
My personal take is that Microsoft purchased dark fibre or waves and simply didn't care too much about the details. Typical very-large-player attitude. And then it just so happened that a critical portion of their fibres ended up in the same duct. The "partial failure" part is probably due to the fact that the excavator ripped into the duct and some fibres snapped while others didn't (but quite probably got stretched and/or bent). The really ugly part is that today's fibres aren't "binary"; it's not "works 100% or not at all" but a bunch of factors/grades (like, to name an important one, attenuation per wavelength).
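To illustrate the "not binary" point, here's a toy link-budget sketch (all figures are illustrative assumptions, not Azure's actual plant): a damaged fibre adds extra loss, and since attenuation differs per wavelength, some channels can drop below receiver sensitivity while others stay up - which looks exactly like "reduced capacity".

```python
# Toy link budget: launch power minus span loss minus damage-induced loss.
def received_power_dbm(tx_dbm, km, db_per_km, extra_loss_db):
    return tx_dbm - km * db_per_km - extra_loss_db

SENSITIVITY_DBM = -28.0  # assumed receiver sensitivity

# Two channels on the same damaged fibre, differing only in per-km attenuation;
# 6 dB of extra loss stands in for a stretched/bent section after the dig.
for name, atten_db_per_km in [("1310nm", 0.35), ("1550nm", 0.20)]:
    rx = received_power_dbm(tx_dbm=0.0, km=80, db_per_km=atten_db_per_km,
                            extra_loss_db=6.0)
    print(name, "up" if rx >= SENSITIVITY_DBM else "down")
```

With these made-up numbers the 1550nm channel survives the extra 6 dB while the 1310nm channel falls below sensitivity - a partial failure on a single physical fibre.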