Comments
That is exactly the point!
There was no notice to customers by email, Twitter, the internal forum, the website, or any other means...
I came to this forum, registered, and posted an inquiry. Only then did I get the first information, from @Maounique...
whoa
Yes, I am the PR person; that does not mean no work is done, only that I cannot be available all the time.
I was on a train, tired, with a lot of luggage from a day of skiing.
The phone was in my backpack and I didn't hear it. What can I say, s**tty day is s**tty...
Really sorry for this.
@mi5h0
I'm with you, man. Seems to be a widespread disease with providers. That's by no means specific to Prometeus.
I guess it's about time WHMCS created a "panic info mail to all clients on node(s)" module *g*
Maybe they are afraid they will lose potential customers if they advertise every problem through social media etc. A status page hosted somewhere else should be standard, though.
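A minimal sketch of what such a "panic mail" sender could look like, independent of WHMCS itself: it assumes an SMTP relay outside the affected DC and a CSV export of the affected clients' addresses. Every hostname, file name, and credential below is a placeholder, not anything Prometeus actually runs.

```python
# Rough "panic mail" sketch: send a short incident notice to every client
# listed in a CSV export (columns: name,email), via an OFF-SITE SMTP relay
# so it still works when the main DC is unreachable. Placeholders throughout.
import csv
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.offsite-relay.example"   # relay hosted outside the affected DC
SUBJECT = "[Status] Network incident in DC2 - we are working on it"
BODY = ("We are aware of the connectivity problem affecting DC2.\n"
        "Updates will be posted on our status page and forum.\n")

def panic_mail(csv_path: str = "affected_clients.csv") -> None:
    with smtplib.SMTP(SMTP_HOST, 587) as smtp, open(csv_path, newline="") as fh:
        smtp.starttls()
        smtp.login("alerts@example.com", "app-password")   # placeholder credentials
        for row in csv.DictReader(fh):
            msg = EmailMessage()
            msg["From"] = "alerts@example.com"
            msg["To"] = row["email"]
            msg["Subject"] = SUBJECT
            msg.set_content(f"Hello {row['name']},\n\n{BODY}")
            smtp.send_message(msg)

if __name__ == "__main__":
    panic_mail()
```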
By the way, it's been 5 hours since I got the first Pingdom alert. Let's see...
@Maounique I understand, I'm not looking to blame.
I just assumed that there was someone on duty 24/7 who could set up some type of notification to customers, which would help us save time and nerves...
Those types of problems are always "fun" to deal with, and it's often very hard to give an accurate ETA. Particularly with the larger cables, it will depend on when the engineers get to your CCT and where it is in the cable.
@bsdguy Yes, that might be useful, except in cases when the WHMCS database itself is out of service.
We thought it was the switches, but that does not seem to be the case upon in situ inspection. Both failing at the same time is completely unlikely anyway.
Let's be clear: the engineers already repaired all those 144 fibers and almost everyone else is back up, but a few splices were seemingly botched, probably due to the pressure, and they are now re-checking them one by one.
They should be able to perform an end-to-end test and confirm to you whether the fibre is good or not; of course they'll have to disconnect it to do it, but it's down already.
Also check the RX/TX levels at each end.
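For what it's worth, if either end terminates on a Linux box with a fibre NIC, you can read the optics' RX/TX (DOM) values locally. A rough sketch, assuming ethtool is installed and the transceiver exposes diagnostics; the interface name is just an example, and switch ports would instead be polled via the vendor's own DOM/DDM SNMP OIDs:

```python
# Minimal sketch: dump SFP/transceiver diagnostic (DOM) values via ethtool.
# Assumes Linux, root privileges, and a module that supports diagnostics;
# "eth0" is a placeholder interface name.
import subprocess

def optical_levels(interface: str = "eth0") -> None:
    out = subprocess.run(
        ["ethtool", "-m", interface],        # dump transceiver module info
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        # The interesting lines mention laser output / receiver optical power.
        if "power" in line.lower():
            print(line.strip())

if __name__ == "__main__":
    optical_levels()
```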
The end-to-end test passed. Signals look acceptable. It was online twice. The link shows up.
We are working in parallel to piggyback on some other fiber in the area. It involves getting permissions from the owners and doing the setup.
This is a nightmare and I am really sorry
If there's a fibre engineer onsite, ask them to shove an OTDR on it; a bad splice should hopefully show up.
That's about the limit of my fibre knowledge, I'm afraid. I know what an OTDR is and roughly what it's used for, but interpreting the results will need someone with more fibre training than me; I was trained to splice indoor fibre (so not the multi-core stuff) several years ago but never actually had to do any.
I am sure Salvatore does all he can. He has been there for hours and knows the facility way better than me, since he helped build it.
This looks like massive bad luck, and it might force us to take a separate carrier link to the second DC to have at least some limited connectivity over a tunnel or something, as long as it comes via another route, not the same duct.
Which is a good reminder that it wouldn't be a bad idea to have a redundant solution for client communications.
We have the forum and Twitter. Unfortunately, Twitter didn't work for me and I have to wait for uncle to fix it, and I was on a train when this happened.
Update: WHMCS is online.
@Maounique, that's all good, but I was thinking of a redundant solution for your regular client communication, i.e. the ticketing system. As you mentioned before, "setup redundancy across datacenters and providers, even countries, otherwise you will continue to be disappointed, no matter how much you pay." I think the same goes for a hoster's business site/ticketing system.
The connection goes through another cable for now. The old link still shows as up, but it is not working.
All the services are reconfigured to work through this new route, essentially piggybacking on a working cable.
Indeed, but an external copy of our database (in a DC we do not control) would send the already crazed privacy "specialists" into overdrive.
We do have 2 frontends, but only one database.
You mean everything should work now? I still can't ping my DC2 servers.
Do you know if the commands we sent to the VMs through CloudStack will be executed after the link is back online? (reset, stop, etc.)
There is a timeout, of various lengths depending on the command.
It will also affect snapshots.
The rerouting is in progress. The nodes should come back shortly.
If you got errors (like me), I would say no.
Anyway, commands seem to work now.
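For anyone who wants to check rather than guess, the CloudStack API exposes the async job queue, so you can see whether a reset/stop you issued during the outage is still pending. A rough sketch of polling it; the endpoint and keys are placeholders, and the signing follows the standard CloudStack HMAC-SHA1 scheme:

```python
# Sketch: list CloudStack async jobs to see if queued VM commands are still
# pending (jobstatus 0), finished (1), or failed (2). Endpoint and keys below
# are placeholders for your own account credentials.
import base64
import hashlib
import hmac
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://cloud.example.invalid/client/api"   # placeholder endpoint
API_KEY = "your-api-key"                                # placeholder
SECRET_KEY = "your-secret-key"                          # placeholder

def signed_url(params: dict) -> str:
    """Build a signed CloudStack API URL (HMAC-SHA1 over the sorted, lowercased query)."""
    params = dict(params, apikey=API_KEY, response="json")
    query = "&".join(
        f"{k}={urllib.parse.quote(str(v), safe='')}"
        for k, v in sorted(params.items())
    )
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(), hashlib.sha1).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
    return f"{ENDPOINT}?{query}&signature={signature}"

with urllib.request.urlopen(signed_url({"command": "listAsyncJobs"})) as resp:
    payload = json.load(resp)["listasyncjobsresponse"]
    # The list key is usually "asyncjobs"; it may differ slightly by version.
    for job in payload.get("asyncjobs", []):
        print(job["jobid"], job.get("cmd"), "status:", job["jobstatus"])
```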
"Crazed" as in "I use a fake name in my support job and I use fake names as client, too"?
Btw: Like zeitgeist, I also thought of your redundancy advice, but I didn't mention it here because it would have felt like kicking a man who is already on the floor (meaning I respected that you were under stress with the current DC/fiber problem and didn't want to laugh at you).
I know, and I would not have taken it that way. We did think of it, but we only control one datacenter. Replicating, or giving access through servers we do not fully control, would have introduced another avenue of attack on the private data. Whether you believe it or not, we are doing our best to keep it private.
Due to rerouting, other services will go down briefly.
Now my OVZ server is down too
My KVM server in IWStack just went down.
@Maounique:
Hey there,
My OVZ on pm17 is showing a bit of weirdness. Is it related?
--- google.com ping statistics ---
327 packets transmitted, 70 received, +21 errors, 78% packet loss, time 566767ms
I'm sure M will chime in with a more complete story, but rerouting and adding another temporary cable needed a short disconnect. Other services should be working fine again now.
We are reorganizing the network to be more resilient and to have a way to switch to a backup cable if needed. This required repurposing a big switch.
So dc2 will be up shortly?
Thanks for the updates!