Azure US East outage due to fiber cut
So, Azure's US East location has been having issues for the past 9 hours or so, reporting 'reduced capacity' due to a fiber cut. I've seen reports of people's services being down there entirely.
You'd think that with the $87/TB they charge for network traffic, they'd at least provide some real redundancy... but apparently a single line being cut is enough to fuck shit up.
Comments
If you sell something that enough people still buy just because it has CLOUD or Azure in the name, why bother putting up an additional fibre uplink?
Let them pay $87/TB if they are stupid enough.
The good thing about this is that I am sure some of them will switch and learn. Hopefully.
@joepie91
Yes, ridiculous.
But my guess is that they DO have multiple lines - but a single feed-in. Cheap, but bad practice. A high-class DC has dual feed-ins with a sufficiently large distance between them (typically opposite sides of the building) AND different physical routes.
With power feeds one can be a bit more sloppy if one has good enough backup infrastructure. But with fibre, a cut is a cut, and the only proper fix is redundancy over actually different physical routes.
At $87/TB there is no excuse for cutting corners.
One thing Microsoft Azure and OVH have in common. One is 90% cheaper.
Two diverse fibres going out via two opposite sides of the building that never overlap is the minimum standard I consider ‘professional’
Single homed duh. They need that sweet CC blend.
I am certain those companies that need real redundancy have implemented a high-availability architecture for their databases and virtual machines, so they can scale up and route around the affected region. Of course, that wrecks people's ability to offer snide comments.
It doesn’t excuse the extremely poor network design though.
Quite possible. Doesn't change the fact that Azure are charging $87/TB for network infrastructure that's less redundant than some of the 'one-man shows' on here.
Perhaps their clients will ask for more details after the crisis is mitigated via private conversations. Like some companies here have requested in the past.
LOL love this
I'd hazard a guess that most of the people using Azure are running it off the $150 per month of free Azure credits that people get via university or an MSDN subscription.
Or startups who participate in their BizSpark program.
Then again, most of the big spenders I know who use Azure don't rely on a single location for their workload. Also, to be fair, most of the services people are using aren't necessarily compute workloads.
Or maybe those companies chose an expensive service ("Azure") because - so they thought - it has some real redundancy.
$87/TB? Is this for real?
Yep, and they literally give you the option to either make it GeoRedundant or Datacenter level Redundant when you create your Compute resources storage.
"This issue was attributed to a fiber cut caused by construction approximately 5 km from Microsoft data centers. This resulted in multiple line breaks impacting separate splicing enclosures that reduced capacity between 2 Azure regional data centers."
"multiple line breaks" and "reduced capacity between 2 azure regional data centers". So as far as I can tell, it's not "we lost all connectivity due to a single line break".
Well, I think they've still missed the idea of a diverse feed then; it should be diverse from their core routers onwards, even including the datacentre cross connects. Unless two separate digger teams happened to dig up each of the diverse routes at the same time - and I'm not buying that.
Not saying it isn't possible for two diverse feeds to go offline at once, coincidences can happen, but their explanation seems off.
For all of the providers saying "they should have known" or they should have done better, blah blah blah: can you provide documents detailing your fibre route maps, internal and external to your building - not just to the edge of the lot, but for miles and miles away from the DC - so we can see you've truly documented that your network is so robustly designed there are no convergent paths? (Yep, that means 2 sea cables, not just one, if you are in the EU and servicing the US.) My point is, it's easy to point a finger based on one paragraph of info - but is your setup truly better, and documented?
Regarding bandwidth: inbound is free (I've seen providers here that count it) and pricing does drop for high-volume users. Of course, well-structured web pages and data pulls will minimize costs; sloppy code on a high-usage site would get severely penalized in bandwidth charges on Azure (or AWS or Google), but would be less impactful on providers with more "free" bandwidth.
Remember that time when Azure in the Netherlands was completely shut down because of high humidity?
I'm really starting to question what the money is going to here.
Hookers, cocaine, lawyers.
At ~$26000 per gigabit/mo, it's simply not feasible.
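That figure roughly checks out. Here's a quick back-of-envelope sketch (assuming a flat per-TB rate and a 30-day month; the $80/TB discounted tier used below is my assumption, not a quoted Azure price):

```python
# Sanity check on the "~$26,000 per gigabit/month" figure.
SECONDS_PER_MONTH = 30 * 24 * 3600   # 2,592,000 s in a 30-day month
GBIT_PER_GBYTE = 8                   # 8 gigabits per gigabyte

def monthly_cost_per_gbps(price_per_tb):
    """Cost of saturating 1 Gbit/s outbound for a month at a flat $/TB rate."""
    tb_per_month = (1 / GBIT_PER_GBYTE) * SECONDS_PER_MONTH / 1000  # GB -> TB
    return tb_per_month * price_per_tb

print(round(monthly_cost_per_gbps(87)))  # list price: 28188
print(round(monthly_cost_per_gbps(80)))  # assumed discounted tier: 25920
```

So 1 Gbit/s sustained moves about 324 TB/month, which at the headline rate is closer to $28k; a modest volume discount lands right around the $26k mark.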
Yes. We require detailed fibre map for our proposed route to be a part of the contract when we order a fibre. We then make sure that cross connects are run diversely to each of the diverse pair. It's quite a detailed planning process when each new location is opened.
Awesome! I guess for these things the devil is in the details and it can get quite detailed to go through.
Yes, all of our dark fibre from our various providers comes with detailed maps of the splicing points and amplification sites and routes etc. The routes are also carefully planned prior with the provider. Granted we don't have the same maps for transit providers, but that's why transit is taken at several sites and sites are interconnected diversely.
Also worth noting that on all of our metro dark fibre, at least within the UK, we have to pay tax, so at that point it's absurd if you don't request a map of your route.
Azure (and in fact the whole of Microsoft) operates via resellers and CSPs. Once you are billing 1500-2000 bucks a month with them, they will make you a CSP, reseller, authorized channel partner, etc. - whatever they call it. Then if there is a customer case/project, it gets transferred to you and you get business through Microsoft.
I know many dev teams and companies that work on this model. They know Azure is highly overpriced infra, yet they stick with it for business that eventually pays for itself and earns them good money (or not, I don't know, but that's the model they follow).
Azure support is mediocre to bad, and it is highly likely that at first you will hit incompetent third-party support staff; only when you yell at them do things get noticed. This also depends on region... the US usually has better support, the UK... I never felt very good about it, and Asia is the worst.
Multiple breaks on the same line, judging from the cause (because construction work would not simultaneously break two geographically redundant lines).
And yes, I am aware they call it 'reduced capacity'. I've spoken to people who reported straight-up outages of their services. Apparently 'reduced capacity' means "some people's services are up, some people's services are not".
Yes.
Even if you take the cheapest bulk tier and halve the price to compensate for the 'free inbound', it's still $25/TB; an order of magnitude more expensive than what providers here charge. I expect serious redundancy for that kind of cost difference.
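For a rough sense of the gap, a sketch with illustrative numbers (the LET-style plan below is hypothetical, not a quote from any provider here):

```python
# Rough comparison; all numbers are assumptions for illustration.
azure_bulk = 50.0                  # assumed cheapest Azure bulk tier, $/TB
effective_azure = azure_bulk / 2   # halved to credit the free inbound -> $25/TB

# hypothetical LET-style VPS: $7/mo including 5 TB of transfer
let_effective = 7.0 / 5            # $1.40/TB if you use the full allowance

print(effective_azure / let_effective)  # ~18x - an order of magnitude
```

Even with generous assumptions in Azure's favour, the effective per-TB price lands well over 10x what a small provider's bundled transfer works out to.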
I can't speak for the EU, but for the US and Canada, lots of providers refuse to provide network plans to customers, even to DCs like OVH. They say it's a security measure, which is not totally false when a single splice can carry your competitors, but also government and other critical traffic. I can say this because I'm a network builder myself and have done fibre installations for some high-importance enterprises, and none of them have a plan of the network beyond their own floor and the first network pole/manhole (the connection point to the network). Lots of them have a "ring" setup which, for many kilometres, isn't actually a ring, due to network limitations, fibre/splice re-use, reversed setups, or simply because the engineer didn't see a common point of failure on their plan (e.g. re-sold/leased fibre, outdoor elements, human interaction, etc.).
I simply hope they will learn, then contact their providers, check why the ring didn't kick in, and change what needs to be changed.
Regards, David
I've only ever dealt with dark fibre (and waves) in the EU, but our providers euNetworks and Zayo are both very detailed in the plans they provide before a circuit is provisioned; they are also consistent in updating them if the route changes (which happens a fair amount on long-haul waves).
Last-mile circuits from national telecoms are a different story. We have a good number of BT Openreach last-mile fibre circuits, and only on rare occasions have we managed to get plans out of them; however, they certainly do plan for diversity and keep it that way. We have had several fibre breaks on our last-mile circuits to clients, but not once in over 4 years have both legs of a diverse circuit gone down, and they always clearly state where the "pinch points" are, e.g. a building without diverse entries from different manholes.
Pretty much all backbones (~ not last mile) are done as rings (well, as quite stretched ellipses) and often have bi-directional fiber pairs. So it's usually not even particularly painful or troublesome to have proper dual feeds through different building feed ins.
As for maps my experience was mixed. Some (incl. btw very large US carriers) do provide at least "stepped maps" where you get quite precise maps for some miles from your DC and less precise maps for anything beyond that. And some just refuse it completely, usually mentioning security concerns.
And, frankly, it's not even an absolute must have, because in the end it's the contract that's relevant. And in those you virtually always CAN get something like "the 2 fibers, except for [pops, dcs,...] are at least xyz miles apart from each other" with more specifics re. pops, dcs, landings, etc. (which themselves often are not exactly geo specified). Typically those specs/info are in the contract annex. Plus, of course, you have your SLA. Putting those next to each other you have a pretty good basis to judge both the quality and risks of your feed(s).
My personal take is that Microsoft purchased dark fibre or waves and simply didn't care too much about the details. Typical very-large-player attitude. And then it just so happened that a critical portion of their fibres ended up in the same duct. The "partial failure" part is probably due to the fact that the excavator ripped into the duct and some fibres snapped while others didn't (but quite probably got stretched and/or bent). The really ugly part is that today's fibres aren't "binary"; it's not "works 100% or not at all" but a bunch of factors/grades (like, to name an important one, attenuation per wavelength).
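To illustrate the "not binary" point, here's a toy link-budget sketch (all figures are illustrative assumptions, not Azure's actual plant): a damaged fibre adds extra loss, and since attenuation differs per wavelength, some channels can drop below receiver sensitivity while others stay up - which looks exactly like "reduced capacity".

```python
# Toy link budget: launch power minus span loss minus damage-induced loss.
def received_power_dbm(tx_dbm, km, db_per_km, extra_loss_db):
    return tx_dbm - km * db_per_km - extra_loss_db

SENSITIVITY_DBM = -28.0  # assumed receiver sensitivity

# Two channels on the same damaged fibre, differing only in per-km attenuation;
# 6 dB of extra loss stands in for a stretched/bent section after the dig.
for name, atten_db_per_km in [("1310nm", 0.35), ("1550nm", 0.20)]:
    rx = received_power_dbm(tx_dbm=0.0, km=80, db_per_km=atten_db_per_km,
                            extra_loss_db=6.0)
    print(name, "up" if rx >= SENSITIVITY_DBM else "down")
```

With these made-up numbers the 1550nm channel survives the extra 6 dB while the 1310nm channel falls below sensitivity - a partial failure on a single physical fibre.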