
IWstack outage

squibssquibs Member
edited July 2014 in General

First time I've seen a service outage in the several months I've been with them. One of my servers went down; the other was unaffected.

Today at 10:01 AM CEST the iwStack orchestrator received a disconnection event from several hosts, which triggered a massive High Availability recovery procedure for more than 600 instances.
At 10:50, while most instances were back up and running, a couple of hundred were stuck in the starting state waiting for the network setup to complete.
At 11:10, in an attempt to speed up the process, we forced a network restart (including a VR rebuild), but this turned out to be the wrong solution and caused more delay.
Finally, at 13:00, all the queued instances were started.

If your instances are still in the stopped state, just start them. Please open a ticket if any instance doesn't start.

At present we have disabled the HA flag for all instances while we investigate the incident.

We are sorry for any inconvenience this issue may have caused.


Comments

  • I don't understand "the iwStack orchestrator received a disconnection event from several hosts which triggered a massive High Availability [recovery]", but whatever the reason is, they should make sure such incidents won't happen again; this gives a bad image to cloud computing, which claims 100% uptime.

  • I just received the email too. But my uptime is still at 309 days, so it seems it's not affecting all instances.

    I hope there are no more problems like this in the future :)

    Thanked by: Maounique
  • geekalotgeekalot Member
    edited July 2014

    @instatech: Dude, NOTHING ON THIS PLANET has 100% uptime.

    1) Ensure you have recent backups

    2) Redundancy, redundancy, redundancy (multiple instances, multiple providers, multiple geographies)

    3) Have "Hot spares" always ready to go (up to date servers that are NOT normally exposed to the internet that you can failover to) in multiple geographies/regions

    4) Have a good failover mechanism (whether DNS, or load balancer device)

    5) Hold providers with persistent recurring faults accountable (and dump them at YOUR convenience). Prometeus is NOT a provider with recurring/persistent issues. This is the first iwStack outage I am aware of (at least since I started using it).

    6) Did I mention redundancy?

    7) Sit back, relax, enjoy life



    This is called having a Business Continuity Plan
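    The failover mechanism in point 4 can be as simple as a health check that walks a list of backends in priority order. A minimal hypothetical sketch (the hostnames are made up; a real setup would more likely use DNS failover or a load balancer, as noted above):

```python
import socket

# Illustrative backend list, primary first, spares in other regions after.
BACKENDS = ["primary.example.com", "spare-eu.example.com", "spare-us.example.com"]

def is_alive(host, port=80, timeout=2.0):
    """Cheap TCP health check: can we open a connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend(backends, probe=is_alive):
    """Return the first healthy backend in priority order, or None if all are down."""
    for host in backends:
        if probe(host):
            return host
    return None
```

    The probe is injectable so the failover logic can be exercised without touching the network; a production version would also want retries and a damping interval so one dropped packet doesn't trigger a flap.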



    Cheers

  • InfinityInfinity Member, Host Rep
    edited July 2014

    @instatech said:
    I don't understand "the iwStack orchestrator received a disconnection event from several hosts which triggered a massive High Availability [recovery]", but whatever the reason is, they should make sure such incidents won't happen again; this gives a bad image to cloud computing, which claims 100% uptime.

    The issue was with a network card, which caused network problems on some hosts; that then caused the orchestrator to go into HA mode and restart all of those instances on other nodes. Following that, some instances were stuck in starting; it didn't affect all instances. As mentioned in the RFO, the incident is being looked into, and of course steps are being taken to avoid such issues in the future.

    Also, iwStack does not claim 100%. This is the first issue of this scale since iwStack's inception.

  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    There is no 100% uptime.

    Here there is a bit more extensive RFO:
    http://board.prometeus.net/viewtopic.php?f=15&t=1409&p=1965#p1965

    In this case, the main problem was the HA; without HA, the downtime would have been the few minutes it took us to isolate the malfunctioning NIC and solve the problem. However, those few minutes of downtime convinced the orchestrator that all the VMs on the affected nodes were down, and it proceeded to restart them on other nodes. That meant the queue was full for hours, and since the virtual routers are VMs too, on random nodes, at times the VMs started before the VR, or were up on nodes which had been online all along but, while the VM was up, the network was down, so it was a huge mess.
    You can defend against a node failure, even a few, but when the orchestrator thinks tons of nodes died at once, it cannot really be fixed fast.

    We are thinking of adding some code to check whether more than one node appears offline and, if so, to wait for human intervention, because that is highly unlikely to happen due to genuine node failure. CloudStack was conceived by people used to XenServer clusters, who added KVM to it. In hindsight it would have been better to put it on Xen at the time, but what is done is done. We do plan to make a Xen cluster soon, though, to test it and give people a choice, and maybe phase out KVM in time if it proves successful.
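    The safeguard described above can be sketched roughly like this (a hypothetical illustration, not actual CloudStack code; the function name and threshold are made up):

```python
def should_pause_ha(offline_hosts, max_auto_recover=1):
    """Return True when HA recovery should wait for human intervention.

    offline_hosts: collection of host IDs currently reporting disconnects.
    max_auto_recover: largest number of simultaneous host failures still
    plausible as genuine hardware faults; anything above it more likely
    indicates a network or switch problem, where a mass HA restart would
    do more harm than good.
    """
    return len(set(offline_hosts)) > max_auto_recover
```

    The idea is simply that one host dropping out is a plausible hardware failure, while several at once far more likely means a network fault, which is exactly the scenario that caused this incident.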

    Thanked by: Dylan, ihatetonyy
  • AmitzAmitz Member
    edited July 2014

    My instances were not affected, it seems. All 100% up, as throughout the last months. Thumbs up for that!

    Thanked by: Maounique
  • @Maounique said:
    to test it and give people a choice, maybe phase out KVM in time if it proves successful.

    @Maounique, @Prometeus: PLEASE do not phase out KVM

    "and give people a choice" --> THIS is the better solution, IMHO

    Thanked by: tux
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    geekalot said: PLEASE do not phase out KVM

    Do not understand "phase out" as closing down the KVM zones, far from it; we would just make the Xen zones the default and only re-assign nodes if KVM usage lowers and there is a need for more nodes in the Xen zone.
    We have only phased out a few products so far; I can only remember the separate Windows offer (with Proxmox, outside the cloud), and we will discontinue the KVM storage plans as well as the atomic Cloudmin ones made redundant by the XenPower L plans. Add to this the old shared hosting with a shared IP and no resource isolation, where people suffer from bad neighbours, because the new plans with a dedicated IP, dedicated IOPS and dedicated CPU cycles are far better.

  • @Maounique said:
    We only phased out a few products so far, I can only remember the windows separate offer (with proxmox, outside cloud) and we will discontinue the KVM storage plans as well as the atomic cloudmin ones made redundant by the xenpower L plans.

    ah, OK

  • OK, thanks to all for posting the detailed information; now I understand it. @geekalot, I like your Business Continuity Plan; it is useful and I will follow it.

    Thanked by: netomx
  • Maounique said: maybe phase out KVM in time if it proves successful.

    I can definitely understand the rationale behind it (support costs alone for managing multiple KVM 'variations') but I would be sad to see classic KVM go.

  • geekalotgeekalot Member
    edited July 2014

    @instatech said:
    OK, thanks to all for posting the detailed information; now I understand it. @geekalot, I like your Business Continuity Plan; it is useful and I will follow it.

    @instatech, it is basically combinations & permutations:


    • What is the likelihood of 1 server going down?
    • vs What is the likelihood of 2 servers going down from two different providers?
    • vs What is the likelihood of 3 servers going down from three different providers?
    • and so on .....
    • So even 99% uptime (per server) against 4 independent servers can lead to INCREDIBLE uptime
    • (The added bonus is if you have a mechanism to do load balancing in addition to failover, you can also use less resources per server as you split your user traffic between them!)

    I won't bore you with the actual math, but suffice it to say a complete failure is HIGHLY unlikely: 1% of 1% of 1% of 1% against 4 independent servers at 99% uptime each, and even less for 99.9% uptime.
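    The back-of-the-envelope math above can be checked in a few lines (an illustrative sketch; it assumes failures are fully independent, which is exactly why the servers should sit at different providers in different geographies):

```python
def combined_downtime(per_server_uptime, n):
    """Probability that n independent servers are all down at the same time."""
    return (1.0 - per_server_uptime) ** n

# Four independent servers at 99% uptime each: downtime probability is
# about 0.01 ** 4 = 1e-8, i.e. roughly a third of a second of total
# outage per year on average; at 99.9% per server it drops further,
# to around 1e-12.
p4 = combined_downtime(0.99, 4)
seconds_per_year = 365 * 24 * 3600
print(f"combined downtime probability: {p4:.2e}")
print(f"expected outage: {p4 * seconds_per_year:.2f} s/year")
```

    Real-world failures are of course not perfectly independent (shared upstreams, correlated DDoS), so treat the numbers as an upper bound on how good it can get, not a guarantee.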

    Just try your best to reduce any SINGLE point of failure.



    This is how you can string together cheaper 2nd (or even 3rd) tier providers and have better performance than the expensive "1st" tier providers -- all day, every day (IMHO).

    Cheers

    Thanked by: Maounique, instatech
  • seansean Member

    texteditor said: I can definitely understand the rationale behind it (support costs alone for managing multiple KVM 'variations') but I would be sad to see classic KVM go.

    Good riddance to classic KVM. I've been using a newer one for a while and I have to say it's much better!

  • MaouniqueMaounique Host Rep, Veteran

    No. As long as at least 3 nodes are populated to 2/3, to offer 2+1 redundancy, it will not happen; even then, sacrificing some space on 3 nodes will not create as many problems for us as moving would for our really long-time customers. It will be a bitch to maintain, though, but hopefully not something at the level of the individual nodes for regular VPS, which usually need more maintenance than iwStack nodes; here it is easy to do it without downtime: just put the node in maintenance, wait for the VMs to be moved without downtime or noticeable issues, and proceed.

  • @Maounique - our experience with CloudStack is that XenServer is a lot more robust than KVM. We have a 200-node CS install; the only thing that really bites with XenServer is that you need to keep your cluster size small (8-16 servers). Also, you need to be careful to have the exact same processor model/stepping/revision across your cluster, otherwise migrations start being refused.

    Thanked by: Maounique
  • MaouniqueMaounique Host Rep, Veteran

    Those problems are similar with KVM, so that is not an issue. We hit some walls when we were designing iwStack; it took way too long, and KVM happened to work much faster at that time.

  • CoudioCoudio Member

    @geekalot said:
    instatech: Dude, NOTHING ON THIS PLANET has 100% uptime.

    Your heart has a 100 percent uptime :).

    Thanked by: instatech
  • J1021J1021 Member

    Coudio said: Your heart has a 100 percent uptime :).

    It doesn't, actually. Sometimes a heart can stop and be started again.

  • geekalotgeekalot Member
    edited July 2014

    @Coudio said:
    Your heart has a 100 percent uptime :).

    @Coudio, Well, at least you HOPE so ... LOL



    Better pray to your "provider" if it does not :-)

    Thanked by: netomx
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    Nope. Plenty of people have had their hearts stop at least once, and while the majority of people may happen to have 100% uptime, on average it is not 100%.
    The only 100% sure thing is death, so far.

  • @Maounique said:
    Nope. Plenty of people have had their hearts stop at least once, and while the majority of people may happen to have 100% uptime, on average it is not 100%.
    The only 100% sure thing is death, so far.

    And taxes

    Thanked by: netomx, rds100
  • J1021J1021 Member

    Maounique said: The only 100% sure thing is death, so far.

    I am 100% sure I am wearing blue trousers at the moment.

  • MaouniqueMaounique Host Rep, Veteran

    Blue is subjective; I often say something is blue and my partner says it is green. You learn the colors as a kid, when your parents explain them, yet the spectrum is continuous and there are not just 7 colors; there are practically infinitely many.
    Taxes are not 100% sure either: many people live in tribal areas where there is no government to collect them, and not even money.

  • RaymiiRaymii Member

    @MarkTurner said:
    Maounique - our experience of Cloudstack is that XenServer is a lot more robust than KVM. We have a 200 node CS install, the only thing that really bites with Xenserver is that you need to keep your cluster size small (8-16 server). Also you need to be careful that you have the exact same processor model/stepping/revision across your cluster otherwise migrations start being refused.

    +500 ESXi hosts here with a mixture of hardware, and we've had all the damn issues with failover and migration. We're now standardizing on E5s for the next 10 clusters, but still, I didn't expect so many issues...

  • netomxnetomx Moderator, Veteran

    @Maounique said:
    Taxes are not 100% sure, many people live in tribal areas where there is no government to collect, not even money.

    Excellent point

  • @Raymii - make sure the machines are IDENTICAL: same mobo, same CPU version, same RAM type, etc. It drove me crazy about 18 months ago when we built this thing.

    Thanked by: netomx
  • Our main VM was down until I restarted it from the panel.

  • squibssquibs Member

    I have to say my IWstack uptime and performance have been better than Rackspace's, which is intended for mission-critical stuff, and IWstack's pricing is an order of magnitude better.

  • @Maounique said:
    Taxes are not 100% sure, many people live in tribal areas where there is no government to collect, not even money.

    Mind sharing the list? They will definitely be at the top of my list of places to retire ... assuming they have internet connectivity :-)

    But seriously though, even "tribal areas" have their own "taxation" system ... e.g., having to ante up 100 cows to marry the chief's daughter etc

  • MaouniqueMaounique Host Rep, Veteran

    @Ruchirablog said:
    our main VM was down until I restarted it from the panel

    Sorry about that; at this time HA is turned off until we manage to make sure something like this will not happen again.

    Thanked by: Ruchirablog