OVH Strasbourg Power Failure

Comments

  • @AnthonySmith said:

    Bahram0110 said: Hello, my OVH server has been down for more than 30 hours. I understand their current turbulent state, but they should understand their customers too. 30 hours of downtime is unbelievable for a datacenter like OVH, and they don't respond to tickets either.

    Hello, you should try rebooting your server from the customer panel; it has worked for many people.

    Unfortunately, rebooting does not resolve my issue.
    OVH support is very slow :|

  • Maounique Host Rep, Veteran

    Bahram0110 said: OVH support is very slow :|

    This time you could not expect anything else; nobody can keep enough support capacity on hand for incidents like this - you would just have hundreds of people twiddling their thumbs 99.5% of the time.

  • @szarka said:

    @Setsura said:
    I know that feel, especially as an early BHS user, the power issues, and then the "fiber cuts". Good times at BHS.

    Triggered.

    I forget which one, but during one of the "fiber cuts" they had all of BHS routing through something like a single 1G backup line, so everyone got to collectively stare at their terminals, unable to actually do anything because of how slow it was. BHS was fun.

  • WebDude Member
    edited November 2017

    I don't think their DR department is really up to the task. Why? Because
    "power outage" is supposedly "the worst scenario that can happen to us".

    No, it isn't. Consider these scenarios instead:

    1) Overvoltage, all hardware fried

    2) Flood washing all the hardware away, or a massive fire

    3) Direct air cooling plus a nice amount of volcanic ash, or something like a corrosive chemical leak

    4) A fertilizer ship / train passing the DC explodes

    5) Solar flare / EMP - electronics fried over a wide area. Some data centers are hardened against this

    6) A nation state (or any other competent party) gets pissed off at them and wipes their systems, hijacking all control after monitoring operations for months - and they really know what they're doing

    7) Any of these scenarios caused by internal sabotage

    In these scenarios the whole site is more or less physically wiped out. Recovering from that is far more demanding than the 'trivial' job of restoring power.

    That is why I always keep full off-site remote backups, just in case. Because you never know.
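    To make the off-site point concrete, here's a minimal sketch of what such a nightly job could look like - rsync over SSH to some other provider, with every path and host below being a made-up placeholder:

      #!/usr/bin/env python3
      """Minimal nightly off-site backup sketch.
      Archives a data directory and pushes it to a remote host with rsync.
      All paths/hosts are placeholders; assumes rsync and SSH keys are set up."""
      import subprocess
      import tarfile
      from datetime import date
      from pathlib import Path

      DATA_DIR = Path("/srv/important-data")         # hypothetical source
      STAGING = Path("/var/backups/offsite")         # local staging area
      REMOTE = "backup@backup.example.net:offsite/"  # hypothetical off-site target

      def main() -> None:
          STAGING.mkdir(parents=True, exist_ok=True)
          archive = STAGING / f"data-{date.today().isoformat()}.tar.gz"

          # 1. Snapshot the data directory into a compressed archive.
          with tarfile.open(archive, "w:gz") as tar:
              tar.add(str(DATA_DIR), arcname=DATA_DIR.name)

          # 2. Ship it to another provider/location; --partial lets an
          #    interrupted transfer resume on the next run.
          subprocess.run(["rsync", "-az", "--partial", str(archive), REMOTE],
                         check=True)

      if __name__ == "__main__":
          main()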

    Thanked by Tom, southy
  • @WebDude Holy shit - do you write copy for CNN, or did you just produce end-of-the-world movies for basic cable during the '90s?

  • Vova1234 Member, Patron Provider
    edited November 2017

    Yesterday I restored the servers that suffered in the accident. There were problems only with the CentOS 7 machines; all the Debian 7, 8 and 9 servers came back 100% and did not even notice the hiccup. That is the power of Debian. The four CentOS servers went into rescue mode. I had to transfer the data off to a backup with lftp, reinstall, configure the software and then restore it back at 1 gigabit speed. It took about 2 hours of work.

    All my other orders in SBG - VPS and others - came back up normally. Ping is up.

    The accident caused losses of about 21,000 RUB (~305 EUR).
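    For context, the lftp step above was basically mirroring the filesystem out of the rescue system and back after the reinstall. A rough sketch of the pull direction, with the host, paths and credentials all being placeholders (lftp's mirror -R does the push back):

      #!/usr/bin/env python3
      """Rough sketch: pull data off a server booted into rescue mode
      using lftp's mirror command. Host, paths and auth are placeholders."""
      import subprocess

      RESCUE_HOST = "sftp://rescue-user@203.0.113.10"  # hypothetical rescue address
      REMOTE_DIR = "/mnt/recovered-root/var/www"       # data mounted in rescue mode
      LOCAL_DIR = "/backup/sbg-restore/var-www"        # local copy for the rebuild

      # "mirror" copies remote -> local; after reinstalling the OS,
      # "mirror -R" would upload the same tree back to the server.
      commands = f"mirror --verbose {REMOTE_DIR} {LOCAL_DIR}; quit"
      subprocess.run(["lftp", "-e", commands, RESCUE_HOST], check=True)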

    Thanked by Falzo, hostdare
  • WebDude Member
    edited November 2017

    @WSS said: Holy shit - do you write copy for CNN, or did you just produce end-of-the-world movies for basic cable during the '90s?

    Yeah, you made me laugh. But all of those things have happened in the past, and will happen in the future too.

    Edit: Have you checked the SBG site's location? I don't think it's optimal. Flooding could be a very real risk. I don't have data, but it looks like it.

  • WSS Member
    edited November 2017

    I don't remember the last time someone intentionally blew up a datacenter, myself. Thankfully.

    @Vova1234 said:
    Yesterday I restored all my servers. There were problems only with the CentOS 7 machines; all the Debian 7, 8 and 9 servers came back 100% and did not even notice the hiccup. That is the power of Debian. The four CentOS servers went into rescue mode. I had to transfer the data off to a backup with lftp, reinstall, configure the software and then restore it back at 1 gigabit speed. It took about 2 hours of work.

    If it was NetBSD, you could still be fscking them (until you ran out of swap)!

  • @Vova1234 said:
    Yesterday I restored the servers that suffered in the accident. There were problems only with the CentOS 7 machines; all the Debian 7, 8 and 9 servers came back 100% and did not even notice the hiccup. That is the power of Debian. The four CentOS servers went into rescue mode. I had to transfer the data off to a backup with lftp, reinstall, configure the software and then restore it back at 1 gigabit speed. It took about 2 hours of work.

    All my other orders in SBG - VPS and others - came back up normally. Ping is up.

    The accident caused losses of about 21,000 RUB (~305 EUR).

    At least someone is talking reasonable numbers, and not the usual thousands-of-dollars-per-minute gibberish.

    Thanked by hostdare
  • WebDude Member
    edited November 2017

    @Falzo said:
    At least someone is talking reasonable numbers, and not the usual thousands-of-dollars-per-minute gibberish.

    Well, costs can be indirect, and they are hard even to estimate. If it's downtime alone, it might not be that expensive. But if the situation had been worse and systems had needed to be restored from off-site backups, it would have been several thousand euros directly - and even more indirectly, once users demand compensation, data is lost, extra data synchronisation is needed, and whatever was lost by restoring a potentially day-old backup has to be recreated, and so on.

    In that situation we're probably talking over 10k€, easily. It would have meant basically redirecting all resources to system restoration (probably at another service provider), and lots of work before everything is running again - probably a week before the most important stuff is working, and getting everything back would have taken around a month.

    Yes, of course everyone has considered these things when making their DRP. The good news: stuff can be restored. The bad news: it would take a lot of time, cost a lot, and probably cause indirect losses in the form of lost customers, tons of bad will, and so on.

    There's also data which is considered not worth backing up daily to an off-site location, because it isn't "critical". But it's still something which would essentially have to be recreated in case of total loss of the DC and its storage.

    From some of the posts, it seems that some users / clients hadn't made proper DR preparations. Providers like UpCloud clearly state that clients are required to have off-site backups of all critical data.

    Also, if uptime really is that important, then in this kind of situation a restore to another provider / location should be launched immediately, as soon as the issue is detected - preferably in a fully automated fashion. Or there should already be alternate replicated sites where your systems can fail over automatically. These are the discussions which pop up whenever there's an issue with Amazon: if the service is important, you shouldn't trust only one Availability Zone or Region, nor should you trust even one cloud provider. These are the topics I always bring up when someone says they need a system with high availability.
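    A toy version of that "detect and fail over automatically" idea, just to make it concrete - the health-check endpoint is hypothetical, and trigger_failover() is a stand-in for whatever your DNS or orchestration API actually does:

      #!/usr/bin/env python3
      """Toy health-check / failover trigger. The endpoint is hypothetical and
      trigger_failover() is a stub for a real DNS/orchestration API call."""
      import time
      import urllib.request

      PRIMARY = "https://app.primary.example.com/health"  # hypothetical endpoint
      FAILURES_BEFORE_FAILOVER = 3   # consecutive failed checks before acting
      CHECK_INTERVAL = 30            # seconds between checks

      def primary_is_healthy() -> bool:
          try:
              with urllib.request.urlopen(PRIMARY, timeout=5) as resp:
                  return resp.status == 200
          except OSError:  # covers URLError, timeouts, connection resets
              return False

      def trigger_failover() -> None:
          # Placeholder: flip DNS to the standby site, or kick off the
          # automated restore at the secondary provider.
          print("Primary unhealthy - promoting standby site")

      def main() -> None:
          failures = 0
          while True:
              failures = 0 if primary_is_healthy() else failures + 1
              if failures >= FAILURES_BEFORE_FAILOVER:
                  trigger_failover()
                  break
              time.sleep(CHECK_INTERVAL)

      if __name__ == "__main__":
          main()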

  • WebDude Member
    edited November 2017

    @WSS said:
    I don't remember the last time someone intentionally blew up a datacenter, myself. Thankfully.

    Read up on the WTC and data centers. There were serious issues for banks, partly because someone smart (?) thought that keeping the hot replica of the data / backup systems in the other tower was a good idea. Well, that failed, and they had to go through the slow and painful off-site restoration process.

    It's a good example of why the secondary system shouldn't be too close to the primary. Of course, putting it far enough away can cause latency issues - that's why we got Google Spanner and similar technologies, for data which is actually important.

    Btw, I've got $25 free credit invites to UpCloud, if anyone wants to take a look. I'm using it for everything which is too important to be handled by OVH - so I kind of agree with the others that OVH is the budget solution.

    Edit: typo fix, date -> data

  • Maounique Host Rep, Veteran
    edited November 2017

    WebDude said: Providers like UpCloud clearly state that the clients are required to have off-site backups of all critical data.

    This is a very sensible thing to say. Not just to cover your ass, but also because there are so many other things that can go wrong.
    We say we keep backups on some services, and people constantly ask us to restore them because they deleted the wrong folder or because they got hacked.
    We keep backups of the whole storage. That is meant for bare-metal recovery, not individual containers; it mostly covers hardware failure. There is no way we can know what data the customer needs backed up and, critically, when the snapshot must be taken. Also, we take those once a week or once a day and keep only one set. In the first case the backup can be too old; in the second it can already be overwritten by the time you notice the malware/hack, etc. Heck, it can be overwritten in the first case too.

    NOBODY should presume the provider will cover their ass, or that it will do so sensibly with all of their specifics taken care of. If your data is important, you take backups; if your data is critical, you sync it AND back it up in different locations; if uptime is also critical, you build some sort of redundant setup (which will also kind of cover the backup issue, apart from the cold storage of incremental backups that lets you go back in time when needed).
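    To make the retention point concrete: keeping several dated sets instead of overwriting a single one is cheap insurance. A minimal sketch, with hypothetical paths and a plain directory copy standing in for whatever snapshot mechanism is actually used:

      #!/usr/bin/env python3
      """Sketch: keep several dated backup sets instead of a single one,
      so a hack noticed late still has a clean copy. Paths are placeholders."""
      from datetime import date
      from pathlib import Path
      import shutil

      BACKUP_ROOT = Path("/var/backups/sets")  # hypothetical backup store
      SOURCE = Path("/srv/customer-data")      # hypothetical data to protect
      KEEP_SETS = 7                            # e.g. one week of daily sets

      def main() -> None:
          BACKUP_ROOT.mkdir(parents=True, exist_ok=True)

          # Write today's set alongside the previous ones instead of overwriting.
          target = BACKUP_ROOT / date.today().isoformat()
          if not target.exists():
              shutil.copytree(SOURCE, target)

          # Prune the oldest sets, keeping the newest KEEP_SETS directories
          # (ISO dates sort correctly as plain strings).
          sets = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir())
          for old in sets[:-KEEP_SETS]:
              shutil.rmtree(old)

      if __name__ == "__main__":
          main()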

    Thanked by WebDude, Shazan
  • Out of interest, has anyone still got services down?

    One of my VPSes in SBG appears to still be down. I have tried rebooting it, but it hangs at 25% and then errors out.

  • Clouvider Member, Patron Provider

    @Clouvider said:

    bitswitch said: Just minutes ago you confidently confirmed it would take days to be back up, only for the network to come slowly back up seconds after your statement.

    What are you talking about? They are recovering from the other screw-up, with almost the entire(?) network being affected. I don't see anyone claiming their servers in the DCs without power are coming back up. This is a major job, especially with many servers that will have broken arrays and drives after such an incident.

    Getting ALL clients back up will be an uphill battle. Anyone who has experienced a full DC blackout in their career will know what I'm talking about.

    D

    Since you called me out here, just to follow up, @bitswitch: some services continue to be down on the third day. I wasn't being a keyboard warrior attacking OVH in my post, I was simply being realistic.

  • @Clouvider said:
    I guess it comes down to how much their Customers are prepared to pay. Clearly, as in this example, getting down to such prices requires major cutting corners, which backfires, again as in this example.

    AWS is ~10x more expensive all things considered (traffic included) and had an identical outage which took out an availability zone. Price is a consideration, but it doesn't work quite so straightforwardly at scale.

  • Clouvider Member, Patron Provider

    @BasementCluster said:

    @Clouvider said:
    I guess it comes down to how much their Customers are prepared to pay. Clearly, as in this example, getting down to such prices requires major cutting corners, which backfires, again as in this example.

    AWS is ~10x more expensive all things considered (traffic included) and had an identical outage which took out an availability zone. Price is a consideration, but it doesn't work quite so straightforwardly at scale.

    But I was talking about the opposite situation here. You can have a large markup and still screw up, but when you need to cut corners to avoid selling at a huge loss, you have a significantly higher chance of screwing up.

    As Oles of OVH admitted here.

    Now as I said, this is a result of cutting corners.

  • @Clouvider said:
    Now as I said, this is a result of cutting corners.

    OVH claims it was a problem with power failover. Generators worked but the transfer switch didn't. It's obviously difficult to test and also hit one of the world's largest and most expensive providers.

    How is it a result of cutting corners?

  • Neoon Community Contributor, Veteran
    edited November 2017

    @BasementCluster said:

    @Clouvider said:
    Now as I said, this is a result of cutting corners.

    OVH claims it was a problem with power failover. Generators worked but the transfer switch didn't. It's obviously difficult to test and also hit one of the world's largest and most expensive providers.

    How is it a result of cutting corners?

    Sounds like propaganda to buy his services instead of OVH.

    Did you ever think about the size? SBG 4 could be twice the size of SBG 1.

    Thanked by inthecloudblog
  • Clouvider Member, Patron Provider
    edited November 2017

    @BasementCluster said:

    @Clouvider said:
    Now as I said, this is a result of cutting corners.

    OVH claims it was a problem with power failover. Generators worked but the transfer switch didn't. It's obviously difficult to test and also hit one of the world's largest and most expensive providers.

    How is it a result of cutting corners?

    You're disagreeing with me in principle? Did you even bother to read what I linked you to?

    Let me quote the relevant part:

    As per Oles

    So why this failure? Why didn’t SBG withstand a simple power failure? Why couldn’t all the intelligence that we developed at OVH, prevent this catastrophe?
    
    The quick answer: SBG's power grid inherited all the design flaws that were the result of the small ambitions initially expected for that location.
    
    Now here is the long answer: 
    
    Back in 2011, we planned the deployment of new datacenters in Europe. In order to test the appetite for each market, with new cities and new countries, we invented a new datacenter deployment technology. With the help of this internally developed technology, we were hoping to get the flexibility that comes with deploying a datacenter without the time constraints associated with building permits. Originally, we wanted the opportunity to validate our hypotheses before making substantial investments in a particular location. 
    
    This is how, at the beginning of 2012, we launched SBG1 datacenter made of shipping containers. We deployed 8 shipping containers and SBG1 was operational in less than 2 months.
    
  • Clouvider Member, Patron Provider

    Neoon said: propaganda

    Not even worth the time to discuss, then ;-).

  • Clouvider Member, Patron Provider

    @BasementCluster said:

    @Clouvider said:
    Now as I said, this is a result of cutting corners.

    OVH claims it was a problem with power failover. Generators worked but the transfer switch didn't. It's obviously difficult to test and also hit one of the world's largest and most expensive providers.

    How is it a result of cutting corners?

    You also wrote something dramatically false. As per Oles, they did not work. Please DO READ before you start contesting someone's posts.

    This morning, the motorized failover system did not work as expected. The command to start the backup generators was not given by the PLC.

  • I get it. It's part legacy, part experiment. But the issue wasn't with any of that. It's not like the containers rusted through.

    The outage was caused by a fault in the power failover. The generators did work; they just weren't started. This is a problem which affects all providers, because it's fundamentally difficult to test for.

    It's all in the official post-mortem. You don't have to wade through Oles' Frenglish feed.

    Thanked by vimalware, WebDude
  • By the way, Interxion España experienced the same type of outage a few years back, where the generators didn't start during a power outage.

    Faults happen - sometimes they kill a whole lot of hardware, other times not.

    Deal with it, and design DC/provider failover if your stuff is essential enough :-D

  • The command to start the backup generators was not given by the PLC. It is an NSM (a motorised normal/emergency switchover), provided by the supplier of the 20 kV high-voltage cells. We are in contact with the manufacturer/supplier to understand the origin of this issue.

    I'm sure that call was calm and handled easily and quickly by both parties at the time. I'd love to hear what sort of threats and such were bandied about...

    Thanked by Clouvider
  • Neoon Community Contributor, Veteran

    @Clouvider said:

    @BasementCluster said:

    @Clouvider said:
    Now as I said, this is a result of cutting corners.

    OVH claims it was a problem with power failover. Generators worked but the transfer switch didn't. It's obviously difficult to test and also hit one of the world's largest and most expensive providers.

    How is it a result of cutting corners?

    You're disagreeing with me in principle? Did you even bother to read what I linked you to?

    Let me quote the relevant part:

    Relevant for you? Is that why you also cut the line below it?

    Everywhere OVH comes up, you keep bashing it.

    He said SBG was planned as a small DC but was scaled up fast due to demand, and that because there is a design flaw they need to close the old affected parts after what happened.

    You make the decision and tell us it's because of the demand?

    It does not sound to me like it's because of demand; he is closing the old parts, yes, but rather because of the issues at SBG.

    He is going to invest another 5 million into SBG; do you think he would do that if there were no demand?

  • Clouvider Member, Patron Provider
    edited November 2017

    Relevant in that it supports my point.

    Future investment is irrelevant to what I was saying. Furthermore, you might have noticed that they have learned quite a bit over the years and are increasing their prices, and my hunch is they will continue to do so steadily now.

    OVH had the balls to admit the failure, and I respect them for that.

    I understand you are in love with them, but this love makes you blind even when they themselves have admitted it.

    Thanked by WebDude
  • Clouvider Member, Patron Provider
    edited November 2017

    BasementCluster said: The outage was caused by a fault in the power failover. The generators did work; they just weren't started. This is a problem which affects all providers, because it's fundamentally difficult to test for.

    Yeah, and clearly keeping only 8 minutes of juice didn't allow anyone to react quickly enough to prevent it. This particular outage was simple and would have been extremely easy to prevent.

    Have you ever been present during a full site failover test? You'd notice that it takes a good couple of moments for the generators to start and stabilise even under normal conditions when everything goes well. I don't work with any datacentre that has as short a margin on its batteries as there was in this case.
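    For a sense of scale (purely illustrative numbers, not OVH's actual figures): UPS ride-through is just usable battery energy divided by the load it carries, so a short margin like 8 minutes falls straight out of the sizing:

      # Rough ride-through arithmetic - illustrative numbers only, not OVH's:
      usable_kwh = 200.0   # hypothetical usable UPS battery energy
      load_kw = 1500.0     # hypothetical IT + cooling load on that UPS
      runtime_min = usable_kwh / load_kw * 60
      print(f"Ride-through at full load: {runtime_min:.1f} minutes")  # -> 8.0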

    Thanked by WebDude
  • Btw guys, if you like stories, here's the related Hacker News discussion.

  • citrix Member
    edited November 2017

    My VPS is finally live again after the outage in SBG :)

  • jetchirag Member
    edited November 2017

    @citrix said:
    My VPS is finally live again after the outage in SBG

    My VPS was also up!

    Edit: Just that it wasn't hosted at OVH
