
Prometeus ?


Comments

  • Of course not everything will be online at once.

    If the power is still down and they are on generators, the last thing you want to do is fire up everything at once. It will most likely fail. You do it rack by rack or, depending on the generator, room by room.

  • Maounique Host Rep, Veteran

    This was a power failure across the whole campus. We assumed it was a network failure because we could not connect through any means; usually that points to network issues. In fact there was a severe malfunction of the power grid, which means everything went down abruptly. We are trying to recover things in order of whatever needs the least time, to have the most services up in the least possible time.
    IWStack was particularly badly hit as the orchestrator might think nodes are dead and try to start the VMs on other ones while the nodes come up. We are mitigating this problem by keeping the main node down.

    Thanked by 3: gpapadopg, zakdr, Lee
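
    For readers unfamiliar with the orchestrator behaviour described above, here is a minimal sketch of the general host-HA pattern and of why pausing it (here, "keeping the main node down") during a whole-site recovery avoids duplicate VM starts. This is not IWStack's actual code; every name, threshold, and helper in it is a hypothetical illustration.

        # Minimal sketch of the failure mode described above: after a site-wide power
        # loss, an HA orchestrator that only checks "is the host answering?" will mark
        # still-booting hosts as dead and try to restart their VMs elsewhere, which
        # can end with the same VM running twice. All names and values are hypothetical.
        import time
        from dataclasses import dataclass, field

        HEARTBEAT_TIMEOUT = 60        # seconds of silence before a host is considered suspect
        SITE_RECOVERY_HOLDOFF = 1800  # after a site-wide outage, suspend HA restarts this long

        @dataclass
        class Host:
            name: str
            last_heartbeat: float
            fenced: bool = False      # guaranteed powered off / isolated
            vms: list = field(default_factory=list)

        def restart_elsewhere(vm: str) -> None:
            print(f"HA: restarting {vm} on another host")

        def ha_pass(hosts: list, outage_detected_at: float) -> None:
            now = time.time()
            if now - outage_detected_at < SITE_RECOVERY_HOLDOFF:
                # Mitigation analogous to keeping the main node down: after a
                # whole-site event, give hosts time to boot before HA acts at all.
                return
            for host in hosts:
                if now - host.last_heartbeat < HEARTBEAT_TIMEOUT:
                    continue          # host is reporting in, nothing to do
                if not host.fenced:
                    continue          # never double-start VMs without fencing the old host
                for vm in host.vms:
                    restart_elsewhere(vm)

        # Five minutes after a campus-wide blackout, no restarts are attempted, even
        # though node07 has been silent far longer than the heartbeat timeout.
        ha_pass([Host("node07", last_heartbeat=time.time() - 600, vms=["vm-123"])],
                outage_detected_at=time.time() - 300)
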
  • HyperSpeed Member
    edited February 2015

    I get why people are complaining, not happy, and want answers, but surely you'd rather he try to fix it than sit on this thread responding to it all?
    I understand wanting updates, but if you read up he said there was a network issue and that he was working on it; soon after he said Xen was back up, so I'd imagine KVM can't be far behind.

    @tinytunnel_tom It's not recommended that you reboot all nodes at the same time in any sense, especially not all blades at once after downtime, as you could potentially pull 30+ kW worth of power within a few seconds, which the power supply won't like, haha.
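
    To make the staggered approach concrete, here is a rough sketch of rack-by-rack power-on over IPMI, which avoids the inrush spike described above. The BMC addresses, credentials, batch layout, and delays are all made up for illustration; it assumes nodes with standard BMCs reachable via ipmitool's lanplus interface.

        # Rack-by-rack power-on sketch: bring nodes up a few at a time so the
        # generators/PSUs never see the whole load at once. All values are hypothetical.
        import subprocess
        import time

        RACKS = {
            "rack-a": ["10.0.10.1", "10.0.10.2", "10.0.10.3"],   # BMC/IPMI addresses
            "rack-b": ["10.0.11.1", "10.0.11.2"],
        }
        DELAY_BETWEEN_NODES = 10     # seconds, lets each node's inrush settle
        DELAY_BETWEEN_RACKS = 120    # seconds, keeps the generator load ramp gentle

        def power_on(bmc: str) -> None:
            subprocess.run(
                ["ipmitool", "-I", "lanplus", "-H", bmc,
                 "-U", "admin", "-P", "changeme", "chassis", "power", "on"],
                check=True,
            )

        for rack, bmcs in RACKS.items():
            print(f"powering on {rack}")
            for bmc in bmcs:
                power_on(bmc)
                time.sleep(DELAY_BETWEEN_NODES)
            time.sleep(DELAY_BETWEEN_RACKS)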

  • Wishing the best of luck to prometeus/iwstack in getting things sorted out!

    Can't understand why people are complaining anyway... but probably even if the Flood erased half the European continent, there would be some owner of a $20/year VPS complaining about when his ultra-hyper-important-yet-meaningless website will be back up, because he somehow assumed a guaranteed 1000% perfect uptime so as not to miss a single Baidu bot visiting his site...

  • HyperSpeed Member
    edited February 2015

    @Maounique said:
    IWStack was particularly badly hit as the orchestrator might think nodes are dead and try to start the VMs on other ones while the nodes come up. We are mitigating this problem by keeping the main node down.

    Typical power problem. I'd presume you had been running on UPS for a while, although most networking gear only has one PSU (especially older stuff), hence the network went down first?
    Ouch, that's a nasty one; even worse when there's nothing wrong with a node but the orchestrator wants to push its VMs all over the place! Hope you get it back to normal soon, bud.

  • Must have been a really bad issue; the campus has a fairly large-scale UPS and a fairly new genset facility...

  • @HyperSpeed yea haha.

    In my primary school we had 2 IT rooms, and everything we tried to boot off the generator failed. Thankfully I was helping out with an after-school club, so I went round, and literally every computer in the whole building had to be switched off at the wall because they had been shut down incorrectly. It took an hour to turn each machine on and then off again. And then the servers crapped out: one wouldn't load without the other and vice versa, but together they tripped the power out -.-

  • Maounique Host Rep, Veteran
    edited February 2015

    William said: Must have been a really bad issue; the campus has a fairly large-scale UPS and a fairly new genset facility...

    Hence we did not really have our own, nor contingency plans for such a situation; in the 10 years or so we have been there, this is unheard of.
    We have yet to get the latest news, but a full RFO will be put up when we have the data and the services are up. It is still possible it was a localized issue with just one branch or something. We are trying to restart services first and will check the causes of the disaster later.

    Thanked by 1: pechspilz
  • @TinyTunnel_Tom said:
    HyperSpeed yea haha.

    In my primary school we had 2 IT rooms, and everything we tried to boot off the generator failed. Thankfully I was helping out with an after-school club, so I went round, and literally every computer in the whole building had to be switched off at the wall because they had been shut down incorrectly. It took an hour to turn each machine on and then off again. And then the servers crapped out: one wouldn't load without the other and vice versa, but together they tripped the power out -.-

    We had to get a new room UPS and generator installed, which required the power to be turned off. It was such an annoyance because you're never sure what will come back on once you turn everything off. The network just went to pot, seeing as the Cisco gear only had one power supply apart from a firewall and a few switches, and the rack UPSes needed mains power back every 30 minutes to cope without anything being turned off. I don't like power issues and I don't think I ever will!

  • @Maounique said:

    Hence we did not really have our own, nor contingency plans for such a situation; in the 10 years or so we have been there, this is unheard of.

    We have yet to get the latest news, but a full RFO will be put up when we have the data and the services are up.

    Thanks Maounique!

  • Mine came back up a minute ago. Thank you @Maounique, I love you.

  • Maounique Host Rep, Veteran

    Status: about half the services are restored; this includes the full network. The timing for the other half might be longer, as those are the ones with the most problems.

    Thanked by 3: pechspilz, rds100, jcaleb
  • Hopefully things get fixed soon for all customers

  • Maounique Host Rep, Veteran

    Salvatore went on site to work on the things we cannot do remotely. Cdlan also called in all staffers. It might be worse than we thought, as some things are not coming online.

    Thanked by 1: jcaleb
  • @Maounique said:
    Salvatore went on site to work on the things we cannot do remotely. Cdlan also called in all staffers. It might be worse than we thought, as some things are not coming online.

    Just hope there's no data loss..

  • Maounique Host Rep, Veteran
    edited February 2015

    0xdragon said: Just hope there's no data loss..

    I hope so too. We invested 1.15 million EUR in state-of-the-art SANs with their own UPSes, and all RAID controllers have battery backup, even on the local storage servers. While we do not have our own separate UPSes except in a few cases, and they might not have held for the 80 minutes the blackout lasted anyway, the rest is well cared for. Still, more than one spinning-disk failure in a server cannot be ruled out, because we have tens of those local storage nodes with a few hundred disks, though the SSDs should be safe.
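
    To put "more than one spinning-disk failure ... cannot be ruled out" into rough numbers, here is a back-of-the-envelope calculation. The 1% per-disk chance of not surviving an abrupt power loss is an assumed figure for illustration, not anything Prometeus has stated.

        # Back-of-the-envelope odds of disk failures after an abrupt power loss.
        # The 1% per-disk failure chance is an assumption for illustration only.
        from math import comb

        p = 0.01      # assumed probability a given spinning disk does not come back
        n = 300       # "a few hundred disks" across the local storage nodes

        p_at_least_one = 1 - (1 - p) ** n
        # Chance that one particular 12-disk RAID group loses two or more members:
        group = 12
        p_group_two_plus = 1 - sum(comb(group, k) * p**k * (1 - p)**(group - k)
                                   for k in (0, 1))

        print(f"P(at least one failed disk out of {n}): {p_at_least_one:.0%}")     # ~95%
        print(f"P(a given 12-disk group loses 2+ disks): {p_group_two_plus:.2%}")  # ~0.62%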

  • Lee Veteran

    As I always say, it's not the disaster that bothers me, as these things can and will happen.

    It's the response to the disaster that matters. The response you are seeing is the reason Prometeus should be at the top of anyone's list of providers.

  • @Maounique said:
    I hope so too. We invested 1.15 million EUR in state-of-the-art SANs with their own UPSes, and all RAID controllers have battery backup, even on the local storage servers. While we do not have our own separate UPSes except in a few cases, and they might not have held for the 80 minutes the blackout lasted anyway, the rest is well cared for. Still, more than one spinning-disk failure in a server cannot be ruled out, because we have tens of those local storage nodes with a few hundred disks, though the SSDs should be safe.

    Don't you guys get alerts when the power switches over to the UPS? Just curious, but I know that these things happen, and I'd consider something like this the "worst case scenario".

  • You can't get alerts if the network is not available. Even if you HAVE a local UPS for routers/switches, the DWDM/CWDM equipment of your upstream probably does not. 3G/SMS also likely fails in such a large-scale outage for the same reason (the campus is fairly large and has its own 3G/4G tower).

    Thanked by 2: 0xdragon, vimalware
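
    The point above is the classic blind spot of in-band alerting: when the whole site loses power, nothing inside it can send the alert. One common workaround, not necessarily what Prometeus or the DC uses, is a dead-man's-switch check run from a vantage point outside the affected network, sketched below with made-up addresses, thresholds, and alerting hook.

        # Dead-man's-switch monitor meant to run OUTSIDE the data centre: instead of
        # waiting for the site to send an alert (impossible in a full power loss), it
        # alerts when the site stops answering. All values here are hypothetical.
        import socket
        import time

        PROBE_HOSTS = [("203.0.113.10", 443), ("203.0.113.11", 22)]  # example IPs
        CHECK_INTERVAL = 60          # seconds between sweeps
        FAILURES_BEFORE_ALERT = 3    # consecutive all-host failures before paging

        def host_reachable(addr, timeout=5.0):
            try:
                with socket.create_connection(addr, timeout=timeout):
                    return True
            except OSError:
                return False

        def page_on_call(message):
            print(f"ALERT: {message}")   # placeholder for an SMS/email/pager integration

        failures = 0
        while True:
            if any(host_reachable(h) for h in PROBE_HOSTS):
                failures = 0
            else:
                failures += 1
                if failures == FAILURES_BEFORE_ALERT:
                    page_on_call("site unreachable from external vantage point")
            time.sleep(CHECK_INTERVAL)
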
  • Maounique Host Rep, Veteran
    edited February 2015

    Well, we do not have our own UPS at most racks; only corporate services have one, since this is an event very unlikely to happen, as the DC is very well set up for power and there are draconian regulations in place which even stopped us from expanding for a while. We worried more about the network than the power.
    At this point everything that could be fixed remotely has been. I am dispatching mails to all possibly affected customers now and waiting for Salvatore and the boys to fix things on site.

  • tomle Member, LIR

    Normally the A and B power feeds should never die at the same time. Also, the diesel generators should have kicked in (if they have any).
    I've been to a few data centers during power tests where they cut off one line; there is just a quick flicker of the lights, as those are not protected by UPS. The servers all stay up on UPS until the diesel is running. Diesel generators tend to be quite noisy :)

  • The bad side is that Prometeus seems to be in a lousy colo/DC. I mean, after all, 100% power with UPS and genset is a major f-cking reason not to host servers beneath grandma's sofa.

    I don't care about their (colo/DC) excuses; a breakdown like that is a nightmare that should a) create insane damage litigation against the colo/DC and b) have their clients leave in droves.

    What I hate particularly about this is that Prometeus' reputation is tarnished, too. I mean, knowing that Prometeus is a very well designed and managed operation is nice and fine, but that's not worth a lot when the DC they're in is shitty and incompetent. Many clients will take away just one message, "In the end Prometeus isn't reliable"; that's how people tick.
    And that makes me very angry, because Prometeus seems to run a really clean, well-designed, and well-cared-for ship.

    As for Prometeus: Your reaction, your communication should go into the school books for providers! Transparent, honest, straight, direct, quick - one couldn't ask for more.

  • @bsdguy, the DC can have UPS and generators, but if it ends up being a fault on the internal power network at the DC, it could still take out racks.

    Even for dual-fed racks it's a possibility (although that reduces the likelihood somewhat).

    Unless they go for HA across DCs, it's very hard to mitigate such a scenario.

    Thanked by 1: Maounique
  • 7 days ago I had problems with two of my VPS servers. I have not had many problems with Prometeus or iwstack, but it would be remiss to say there were none. Now all of my servers are unavailable; my complaint to the company went unanswered and was closed. As of at least 7am this morning I have had no access to the iwstack client control panel, so I cannot tell if the servers are off or just need to be restarted. I must admit that I am bitterly disappointed with Prometeus, as this is the second time in only 7 days I have encountered such a huge problem, only this time it is much worse.

    The status page says everything is up and there are no network or server issues. prometeus.com is now back up and I have added to my complaint; I doubt I will get any better response this time than I did last time, which of course was none. Time for me to look elsewhere. I just hope they can restore services soon; last time it took over 8 hours.

  • Maounique Host Rep, Veteran

    It is not a shitty DC at all. Milano has been our location in Italy from the beginning and it has been almost flawless, barring some peering issues now and then; this is why we have our own direct carrier peering too, besides the MIX.
    IMO the DC is state of the art, but, alas, everything man-made can fail, and it might have been human error too. We pay top money for the colo there exactly because it is a very good DC, and it did not fail in many years, while in our short history in other DCs we had some kind of failure every few months, some lasting half a day or more.
    We are not blaming the datacenter; in fact, I do not have all the details, and Salvatore has more urgent things to attend to right now. We will know more in the evening, when we will be able to draw some conclusions and plan future mitigation.

  • @Maounique I presume that once iwstack does start coming back up, performance will be degraded for a while due to the extra load on the SANs from all the VMs having to boot up?
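
    The concern in the question above is the classic "boot storm": hundreds of VMs reading their images off the same SANs at once. A common mitigation, sketched below with entirely hypothetical names and limits, is to cap how many VMs are booting concurrently rather than launching them all at once.

        # Throttled VM restart sketch: cap concurrent boots so the shared storage sees
        # a bounded read load instead of a boot storm. All names/values are made up.
        from concurrent.futures import ThreadPoolExecutor
        import time

        MAX_CONCURRENT_BOOTS = 8    # assumed limit the shared storage can absorb

        def start_vm(vm_id: str) -> str:
            # Placeholder for the real call (e.g. an orchestrator's start-VM API).
            time.sleep(1)           # simulate the boot-heavy period
            return vm_id

        vms_to_start = [f"vm-{i:03d}" for i in range(200)]

        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_BOOTS) as pool:
            for vm in pool.map(start_vm, vms_to_start):
                print(f"{vm} started")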

  • bsdguy Member
    edited February 2015

    Inside a DC, power failures don't just happen the way rain just happens. There are cables and there are fuses, high-end ones but still basically fuses.
    Moreover, power distribution in a DC is compartmentalized, and you bet your ass that the core and back-end (feed-in, power control, routing, etc.) are handled and equipped as the sanctum. There's no f-cking way that, say, some server in some rack going amok would take down the DC's core.

    Being professional and reliable has been a major cornerstone for Prometeus. And the DC/colo they're in just broke that, badly harming and damaging Prometeus and other clients.

    It's a while ago (10, 15 years), but I actually did run a (not small) colo; I was in charge of tech. So I know what I'm talking about.
    And I can tell you that I would have had heads chopped off by now. Well, in theory. In reality it couldn't have happened anyway, because I was a paranoid asshole, not only fervently checking the design again and again but also running exercises. And I would strongly assume that many (good) DCs are quite similar in that.

    And btw, I would also have sent out our complete sales team on their knees to limit the customer bleed-out.

    No, Sir, something like that just does not happen in a well designed and run DC. Period.

    I feel for Prometeus, who end up as one of the victims of that incompetent blunder by what seems to be a joke of a DC.

    P.S. I also happen to have a VPS there. Thankfully I don't have to rely on it, but anyway I will patiently wait for it to come online again. Prometeus is not the culprit but the victim, and deserves our patience and understanding, all the more as they seem to be handling the situation very seriously and as best they can.
    Good luck, Prometeus!

  • All my services are up and running now.

  • bsdguy said: No, Sir, something like that just does not happen in a well designed and run DC. Period.

    Something like this happens all the time; I remember at least 2 InterXion outages in the last year, and they are regarded as some of the best DCs in the world.

    Thanked by 3: Maounique, chrisp, Pwner
  • alepore Member
    edited February 2015

    Thanks @maounique for the good follow-ups, as usual.
    But it would be great if Prometeus used something like https://www.statuspage.io/ to keep customers updated.
    I think 99.9% of their customers don't know this forum, which is probably the only real source of info right now.
    Having something to tell our own customers makes a huge difference...

    Anyway, thanks for the hard work, and thank god my MilanoDC2 services were restored in about an hour :)

    Thanked by 1: zakdr