
IWstack outage

squibssquibs Member
edited July 2014 in General

First time I've seen a service outage in the several months I've been with them. One of my servers went down; the other was unaffected.

Today at 10:01 AM CEST the iwStack orchestrator received a disconnection event from several hosts, which triggered a massive High Availability recovery procedure for more than 600 instances.
At 10:50, while most instances were back up and running, a couple of hundred were stuck in the starting state waiting for the network setup to complete.
At 11:10, in an attempt to speed up the process, we forced a network restart (including a VR rebuild), but this turned out to be the wrong solution and caused more delay.
Finally, at 13:00, all the queued instances were started.

If your instances are still in the stopped state, just start them. Please open a ticket if any instance doesn't start.

At present we have disabled the HA flag for all instances while we investigate the incident.

We are sorry for any inconvenience this issue may have caused.


Comments

  • I don't understand "the iwStack orchestrator received a disconnection event from several hosts which triggered a massive High Availability [recovery]", but whatever the reason is, they should make sure such incidents won't happen again; this gives a bad image to cloud computing, which claims 100% uptime.

  • I just received the email too. But my uptime is still at 309 days, so it seems it's not affecting all instances.

    I hope there are no more problems like this in the future :)

    Thanked by: Maounique
  • geekalotgeekalot Member
    edited July 2014

    @instatech: Dude, NOTHING ON THIS PLANET has 100% uptime.

    1) Ensure you have recent backups

    2) Redundancy, redundancy, redundancy (multiple instances, multiple providers, multiple geographies)

    3) Have "Hot spares" always ready to go (up to date servers that are NOT normally exposed to the internet that you can failover to) in multiple geographies/regions

    4) Have a good failover mechanism (whether DNS, or load balancer device)

    5) Hold providers with persistent recurring faults accountable (and dump them at YOUR convenience). Prometeus is NOT a provider with recurring/persistent issues. This is the first iwStack outage I am aware of (at least since I started using it).

    6) Did I mention redundancy?

    7) Sit back, relax, enjoy life



    This is called having a Business Continuity Plan
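    The failover mechanism in point 4 can be as simple as a health check that walks a list of backends in priority order. A minimal hypothetical sketch (the hostnames are made up; a real setup would more likely use DNS failover or a load balancer, as noted above):

```python
import socket

# Illustrative backend list, primary first, spares in other regions after.
BACKENDS = ["primary.example.com", "spare-eu.example.com", "spare-us.example.com"]

def is_alive(host, port=80, timeout=2.0):
    """Cheap TCP health check: can we open a connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_backend(backends, probe=is_alive):
    """Return the first healthy backend in priority order, or None if all are down."""
    for host in backends:
        if probe(host):
            return host
    return None
```

    The probe is injectable so the failover logic can be exercised without touching the network; a production version would also want retries and a damping interval so one dropped packet doesn't trigger a flap.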



    Cheers

  • InfinityInfinity Member, Host Rep
    edited July 2014

    @instatech said:
    I don't understand "the iwStack orchestrator received a disconnection event from several hosts which triggered a massive High Availability [recovery]", but whatever the reason is, they should make sure such incidents won't happen again; this gives a bad image to cloud computing, which claims 100% uptime.

    The issue was with a network card, which caused network problems on some hosts; that then caused the orchestrator to go into HA mode and restart all of those instances on other nodes. Following that, some instances were stuck in starting; it didn't affect all instances. As mentioned in the RFO, the incident is being looked into, and of course steps are being taken to avoid such issues in the future.

    Also, iwStack does not claim 100%. This is the first issue of this scale since iwStack's inception.

  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    There is no 100% uptime.

    Here there is a bit more extensive RFO:
    http://board.prometeus.net/viewtopic.php?f=15&t=1409&p=1965#p1965

    In this case, the main problem was the HA; without HA, the downtime would have been the few minutes it took us to isolate the malfunctioning NIC and solve the problem. However, those few minutes of downtime convinced the orchestrator that all the VMs on the affected nodes were down, and it proceeded to restart them on other nodes. That meant the queue was full for hours, and since the virtual routers are VMs too, on random nodes, at times the VMs started before the VR, or were up on nodes which had been online all along but, while the VM was up, the network was down, so it was a huge mess.
    You can defend against a node failure, even a few, but when the orchestrator thinks tons of nodes died at once, it cannot really be fixed fast.

    We are thinking of adding some code to check whether more than one node appears offline and, if so, to wait for human intervention, because that is highly unlikely to happen due to genuine node failure. CloudStack was conceived by people used to XenServer clusters, who added KVM to it. In hindsight it would have been better to put it on Xen at the time, but what is done is done. We do plan to make a Xen cluster soon, though, to test it and give people a choice, and maybe phase out KVM in time if it proves successful.
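    The safeguard described above can be sketched roughly like this (a hypothetical illustration, not actual CloudStack code; the function name and threshold are made up):

```python
def should_pause_ha(offline_hosts, max_auto_recover=1):
    """Return True when HA recovery should wait for human intervention.

    offline_hosts: collection of host IDs currently reporting disconnects.
    max_auto_recover: largest number of simultaneous host failures still
    plausible as genuine hardware faults; anything above it more likely
    indicates a network or switch problem, where a mass HA restart would
    do more harm than good.
    """
    return len(set(offline_hosts)) > max_auto_recover
```

    The idea is simply that one host dropping out is a plausible hardware failure, while several at once far more likely means a network fault, which is exactly the scenario that caused this incident.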

    Thanked by: Dylan, ihatetonyy
  • AmitzAmitz Member
    edited July 2014

    My instances were not affected, it seems. All 100% up, as throughout the last months. Thumbs up for that!

    Thanked by: Maounique
  • @Maounique said:
    to test it and give people a choice, maybe phase out KVM in time if it proves successful.

    @Maounique, @Prometeus: PLEASE do not phase out KVM

    "and give people a choice" --> THIS is the better solution, IMHO

    Thanked by: tux
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    geekalot said: PLEASE do not phase out KVM

    Do not understand "phase out" as closing down the KVM zones, far from it; we would just make the Xen zones the default and only re-assign nodes if KVM usage lowers and there is a need for more nodes in the Xen zone.
    We have only phased out a few products so far; I can only remember the separate Windows offer (with Proxmox, outside the cloud), and we will discontinue the KVM storage plans as well as the atomic Cloudmin ones made redundant by the XenPower L plans. Add to this the old shared hosting with a shared IP and no resource isolation, where people suffer from bad neighbours, because the new plans with a dedicated IP, dedicated IOPS and dedicated CPU cycles are far better.

  • @Maounique said:
    We only phased out a few products so far, I can only remember the windows separate offer (with proxmox, outside cloud) and we will discontinue the KVM storage plans as well as the atomic cloudmin ones made redundant by the xenpower L plans.

    ah, OK

  • OK, thanks to all for posting the detailed information; now I understand it. @geekalot, I like your Business Continuity Plan; it is useful and I will follow it.

    Thanked by: netomx
  • Maounique said: maybe phase out KVM in time if it proves successful.

    I can definitely understand the rationale behind it (support costs alone for managing multiple KVM 'variations') but I would be sad to see classic KVM go.

  • geekalotgeekalot Member
    edited July 2014

    @instatech said:
    OK, thanks to all for posting the detailed information; now I understand it. @geekalot, I like your Business Continuity Plan; it is useful and I will follow it.

    @instatech, it is basically combinations & permutations:


    • What is the likelihood of 1 server going down?
    • vs What is the likelihood of 2 servers going down from two different providers?
    • vs What is the likelihood of 3 servers going down from three different providers?
    • and so on .....
    • So even 99% uptime (per server) against 4 independent servers can lead to INCREDIBLE uptime
    • (The added bonus is if you have a mechanism to do load balancing in addition to failover, you can also use less resources per server as you split your user traffic between them!)

    I won't bore you with the actual math, but suffice it to say a complete failure is HIGHLY unlikely: 1% of 1% of 1% of 1% against 4 independent servers at 99% uptime each, and even less for 99.9% uptime.
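    The back-of-the-envelope math above can be checked in a few lines (an illustrative sketch; it assumes failures are fully independent, which is exactly why the servers should sit at different providers in different geographies):

```python
def combined_downtime(per_server_uptime, n):
    """Probability that n independent servers are all down at the same time."""
    return (1.0 - per_server_uptime) ** n

# Four independent servers at 99% uptime each: downtime probability is
# about 0.01 ** 4 = 1e-8, i.e. roughly a third of a second of total
# outage per year on average; at 99.9% per server it drops further,
# to around 1e-12.
p4 = combined_downtime(0.99, 4)
seconds_per_year = 365 * 24 * 3600
print(f"combined downtime probability: {p4:.2e}")
print(f"expected outage: {p4 * seconds_per_year:.2f} s/year")
```

    Real-world failures are of course not perfectly independent (shared upstreams, correlated DDoS), so treat the numbers as an upper bound on how good it can get, not a guarantee.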

    Just try your best to reduce any SINGLE point of failure.



    This is how you can string together cheaper 2nd (or even 3rd) tier providers and have better performance than the expensive "1st" tier providers -- all day, every day (IMHO).

    Cheers

    Thanked by: Maounique, instatech
  • seansean Member

    texteditor said: I can definitely understand the rationale behind it (support costs alone for managing multiple KVM 'variations') but I would be sad to see classic KVM go.

    Good riddance to classic KVM. I've been using a newer one for a while and I have to say it's much better!

  • MaouniqueMaounique Host Rep, Veteran

    No. As long as at least 3 nodes are populated to 2/3, to offer 2+1 redundancy, it will not happen; even then, sacrificing some space on 3 nodes will not create as many problems for us as moving would for our really long-time customers. It will be a bitch to maintain, though, but hopefully not something at the level of the individual nodes for regular VPS, which usually need more maintenance than iwStack nodes; here it is easy to do it without downtime: just put the node in maintenance, wait for the VMs to be moved without downtime or noticeable issues, and proceed.

  • @Maounique - our experience with CloudStack is that XenServer is a lot more robust than KVM. We have a 200-node CS install; the only thing that really bites with XenServer is that you need to keep your cluster size small (8-16 servers). Also, you need to be careful to have the exact same processor model/stepping/revision across your cluster, otherwise migrations start being refused.

    Thanked by: Maounique
  • MaouniqueMaounique Host Rep, Veteran

    Those problems are similar with KVM, so that is not an issue. We hit some walls when we were designing iwStack; it took way too long, and KVM happened to work much faster at that time.

  • CoudioCoudio Member

    @geekalot said:
    instatech: Dude, NOTHING ON THIS PLANET has 100% uptime.

    Your heart has a 100 percent uptime :).

    Thanked by: instatech
  • J1021J1021 Member

    Coudio said: Your heart has a 100 percent uptime :).

    It doesn't, actually. Sometimes a heart can stop and be started again.

  • geekalotgeekalot Member
    edited July 2014

    @Coudio said:
    Your heart has a 100 percent uptime :).

    @Coudio, Well, at least you HOPE so ... LOL



    Better pray to your "provider" if it does not :-)

    Thanked by: netomx
  • MaouniqueMaounique Host Rep, Veteran
    edited July 2014

    Nope. Plenty of people have had their hearts stop at least once, and while the majority of people may happen to have 100% uptime, on average it is not 100%.
    The only 100% sure thing is death, so far.

  • @Maounique said:
    Nope. Plenty of people have had their hearts stop at least once, and while the majority of people may happen to have 100% uptime, on average it is not 100%.
    The only 100% sure thing is death, so far.

    And taxes

    Thanked by: netomx, rds100
  • J1021J1021 Member

    Maounique said: The only 100% sure thing is death, so far.

    I am 100% sure I am wearing blue trousers at the moment.

  • MaouniqueMaounique Host Rep, Veteran

    Blue is subjective; I often say something is blue and my partner says it is green. You learn the colors as a kid, when your parents explain them, yet the spectrum is continuous and there are not just 7 colors; there are practically infinitely many.
    Taxes are not 100% sure either: many people live in tribal areas where there is no government to collect them, and not even money.

  • RaymiiRaymii Member

    @MarkTurner said:
    Maounique - our experience of Cloudstack is that XenServer is a lot more robust than KVM. We have a 200 node CS install, the only thing that really bites with Xenserver is that you need to keep your cluster size small (8-16 server). Also you need to be careful that you have the exact same processor model/stepping/revision across your cluster otherwise migrations start being refused.

    +500 ESXi hosts here with a mixture of hardware, and we've had all the damn issues with failover and migration. We're now standardizing on E5s for the next 10 clusters, but still, I didn't expect so many issues...

  • netomxnetomx Moderator, Veteran

    @Maounique said:
    Taxes are not 100% sure, many people live in tribal areas where there is no government to collect, not even money.

    Excellent point

  • @Raymii - make sure the machines are IDENTICAL: same mobo, same CPU version, same RAM type, etc. It drove me crazy about 18 months ago when we built this thing.

    Thanked by: netomx
  • Our main VM was down until I restarted it from the panel.

  • squibssquibs Member

    I have to say my IWstack uptime and performance have been better than Rackspace's, which is intended for mission-critical stuff, and IWstack's pricing is an order of magnitude better.

  • @Maounique said:
    Taxes are not 100% sure, many people live in tribal areas where there is no government to collect, not even money.

    Mind sharing the list? They will definitely be at the top of my list of places to retire ... assuming they have internet connectivity :-)

    But seriously though, even "tribal areas" have their own "taxation" system ... e.g., having to ante up 100 cows to marry the chief's daughter etc

  • MaouniqueMaounique Host Rep, Veteran

    @Ruchirablog said:
    our main VM was down until I restarted it from the panel

    Sorry about that; at this time HA is turned off until we manage to make sure something like this will not happen again.

    Thanked by: Ruchirablog