IWstack outage

netman · July 2014

@Maounique said:
Sorry for that, at this time HA is turned off until we manage to make sure something like this will not happen again.

Mine was down too, until I received the mail about the incident and restarted it.

Unfortunately the mail wasn't sent until 4 hours after the situation had been resolved, so that meant another 4 hours of downtime for my instance.

I'm a new customer, and this is a new server, so I hadn't set up any monitoring yet. Guess I need to get going on that...

kaflo · July 2014

@netman said:
the mail wasn't sent until 4 hours after the situation had been resolved, so that meant another 4 hours of downtime for my instance

8 hours of downtime?

so much for high availability

you may want to come down from the clouds now and back to the "ordinary" VPSes.

Maounique · July 2014

netman said: Unfortunately the mail wasn't sent until 4 hours after the situation had been resolved,

Not really, it was sent when HA was turned off, not 4 hours after that.

netman · July 2014

@Maounique said:
Not really, it was sent when HA was turned off, not 4 hours after that.

Okay. Double-checking the mail headers, I can see now that I was confused by the many timezones, daylight saving offsets, etc. involved (the chain includes a forward from the address you sent to), and that all hops stay within the same minute:second (more or less).

So you are probably right, and I'm wrong. Sorry about that.
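The header check described above can be sketched in Python: take the timestamp at the end of each Received header, convert every one to UTC, and look at how far apart the hops are. The sample headers below are made up for illustration (the host names and times are not from the actual mail), and the function names are hypothetical.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical Received headers, one per hop, each with a different UTC offset.
SAMPLE_RECEIVED = [
    "from a.example by b.example; Sat, 05 Jul 2014 04:10:12 +0200",
    "from b.example by c.example; Sat, 05 Jul 2014 02:10:15 +0000",
    "from c.example by d.example; Fri, 04 Jul 2014 22:10:20 -0400",
]

def hop_times_utc(received_headers):
    """Return the timestamp of each hop, normalized to UTC."""
    times = []
    for header in received_headers:
        # The date of a Received header follows the last semicolon.
        date_part = header.rsplit(";", 1)[-1].strip()
        times.append(parsedate_to_datetime(date_part).astimezone(timezone.utc))
    return times

def spread_seconds(received_headers):
    """Seconds between the earliest and the latest hop."""
    times = hop_times_utc(received_headers)
    return (max(times) - min(times)).total_seconds()

print(spread_seconds(SAMPLE_RECEIVED))
```

Once everything is in UTC, hops that look hours apart in local time turn out to be seconds apart, which is the confusion described in this post.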

Crab · July 2014

@Maounique Are you still facing problems? My instance has been horribly slow today.

AuroraZ · July 2014

Bah it was just @Maounique trying to do a runner but he forgot how again.

Crab · July 2014

Could have been something on my side too. The load went to 5+ for no apparent reason, SSH connections were getting dropped, etc. Rebooted the instance and everything has been smooth so far.

0xdragon · July 2014

It's probably hundreds of VMs being booted up and set up after the downtime.

Maounique · July 2014

0xdragon said: It's probably hundreds of VMs being booted up and set up after the downtime.

Not really, they either booted in the time before the announcement or were left down afterwards, when we turned off HA. It must have been a local issue; at this time there are no known problems in the whole infrastructure apart from the disabled HA. Actually, load is lower than usual, which means either there are still some VMs down (can't be more than 20-30) or the restarts cleared internal issues with some instances.

Maounique · July 2014

Reason for stopping: Migrating to the same host.

2014-07-05 02:04:32,880 DEBUG [cloud.capacity.CapacityManagerImpl] (AgentConnectTaskPool-199:null) VM state transitted from :Running to Running with event: AgentReportRunningvm's original host id: 45 new host id: 45 host id before state transition: 45

This is a bug we are investigating.
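For anyone who wants to look for the same pattern in their own management-server log, here is a rough Python sketch that flags lines where neither the state nor the host changes. It is only a text filter written against the single line quoted above; the regular expression and the function name are assumptions, not part of CloudStack.

```python
import re

# The log line quoted above, copied verbatim.
LINE = (
    "2014-07-05 02:04:32,880 DEBUG [cloud.capacity.CapacityManagerImpl] "
    "(AgentConnectTaskPool-199:null) VM state transitted from :Running to Running "
    "with event: AgentReportRunningvm's original host id: 45 new host id: 45 "
    "host id before state transition: 45"
)

# Pattern derived from that one line only; other log formats may differ.
PATTERN = re.compile(
    r"from :(?P<old_state>\w+) to (?P<new_state>\w+) with event: .*?"
    r"original host id: (?P<orig_host>\d+) new host id: (?P<new_host>\d+)"
)

def is_noop_transition(line):
    """True if the line reports a transition with the same state and the same host."""
    m = PATTERN.search(line)
    if m is None:
        return False
    return m["old_state"] == m["new_state"] and m["orig_host"] == m["new_host"]

print(is_noop_transition(LINE))
```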

mikho · July 2014

@MarkTurner said:
Raymii - make sure the machines are IDENTICAL, same mobo, same CPU version, same RAM type, etc. Drove me crazy about 18 months ago when we built this thing.

That is one of the reasons to not build large as hell clusters.
Very expensive to upgrade.
We keep our clusters sized at 10-20 hosts.
