Crunchbits VDS node outage

rjbl · September 2023

My VDS node went down at 3:29 AM PDT. I opened a technical support ticket at 5:15 AM PDT. Support responded at 12:53 PM and restored my server at 3:37 PM.
That adds up to roughly 12 hours of total downtime.
I know response times are not normal right now, given this notice: https://get.crunchbits.com/announcements/17/NOTICE-Support-times-over-the-next-48-hours.html.
Is anyone here on the same node as me? Am I the only one having this issue?
The final message I got from support was:

Hello,

Your vps should be up right now.

Thanks [Redacted]

I was hoping for a bit more of an explanation for the outage though.
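To double-check the "~12 hours" figure, here is a small Python sketch using the times from the post above. It assumes all four timestamps fall on the same calendar day and in the same time zone (PDT).

```python
from datetime import datetime, timedelta

# Times from the post (PDT, assumed to be on the same calendar day).
FMT = "%I:%M %p"
went_down = datetime.strptime("3:29 AM", FMT)
ticket_opened = datetime.strptime("5:15 AM", FMT)
first_response = datetime.strptime("12:53 PM", FMT)
restored = datetime.strptime("3:37 PM", FMT)

def hours_minutes(delta: timedelta) -> str:
    """Format a timedelta as 'Xh YYm'."""
    total_minutes = int(delta.total_seconds() // 60)
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"

print("Total outage:", hours_minutes(restored - went_down))
print("Ticket to first response:", hours_minutes(first_response - ticket_opened))
print("First response to restore:", hours_minutes(restored - first_response))
```

This works out to 12h 08m of total outage, of which 7h 38m passed between opening the ticket and the first response, and 2h 44m between that response and the restore.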

yoursunny · September 2023

Your service is up right now, so stop crying.

nick_ · September 2023

Which VDS plan is yours? My 7GB Xeon VDS has been running fine.

rjbl · September 2023

@nick_ said:
Which VDS plan is yours? My 7GB Xeon VDS has been running fine.

For me, it's a 16GB Ryzen VDS.

rjbl · September 2023

@yoursunny said:
Your service is up right now, so stop crying.

Very helpful, thanks for your contribution to the discussion.

emgh · September 2023

@rjbl said:

@yoursunny said:
Your service is up right now, so stop crying.

Very helpful, thanks for your contribution to the discussion.

I mean

One server was obviously down. What do you gain from knowing exactly who on LET was on that one server, especially now that it's up?

rjbl · September 2023

@emgh said:

@rjbl said:

@yoursunny said:
Your service is up right now, so stop crying.

Very helpful, thanks for your contribution to the discussion.

I mean

One server was obviously down. What do you gain from knowing exactly who on LET was on that one server, especially now that it's up?

I am hoping that it wasn't my configuration that caused my own outage.

emgh · September 2023

@rjbl said:

@emgh said:

@rjbl said:

@yoursunny said:
Your service is up right now, so stop crying.

Very helpful, thanks for your contribution to the discussion.

I mean

One server was obviously down. What do you gain from knowing exactly who on LET was on that one server, especially now that it's up?

I am hoping that it wasn't my configuration that caused my own outage.

Well, what did they respond with?

If it was unclear, just reply and ask whether they had an issue or whether it was only your VPS having problems.

rjbl · September 2023

Ok, I received this message:

Hey [Redacted],

We do, and I apologize for the delay and downtime on this node. Our Ryzen products all have quad-9s SLA (99.99%) and we will process service credits once we close the case out on our side and are confident no additional work/downtime is needed. There was a power supply issue which required us to unrack and complete emergency repair work (and now we're monitoring).

To give you a fully transparent answer: we also had a notification issue where we found our alerts for this server (and another provisioned on the same day) were not correctly tagged and thus did not trigger the emergency alert. This delayed our response to the downed server, which would otherwise have been under 5 minutes. It is completely on us, but I thought you were owed a full explanation.

I've assigned your ticket and added a note about the SLA, which we will process 9/4 to 9/5, assuming we are confident in the repairs and can close out the ticket. Additionally, we're rolling out a private status page where customers can see all of our node statuses in real time; it will also let us add notes and keep each of you better informed from a central source. I believe that, from a customer's point of view, seeing only the overall facility and network status was not enough.

Best,
[Redacted]
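For a sense of what a quad-9s (99.99%) SLA allows, here is a back-of-the-envelope Python sketch. It assumes availability is measured over a 30-day month; the thread does not say how Crunchbits actually measures uptime or calculates credits, so treat this only as an illustration of the arithmetic.

```python
# Assumption: availability is measured over a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def allowed_downtime_minutes(sla_percent: float) -> float:
    """Monthly downtime budget for a given SLA percentage."""
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)

def monthly_availability(downtime_minutes: float) -> float:
    """Achieved availability (%) for a given amount of downtime in the month."""
    return 100 * (MINUTES_PER_MONTH - downtime_minutes) / MINUTES_PER_MONTH

budget = allowed_downtime_minutes(99.99)
outage = 12 * 60 + 8  # the ~12h 08m outage from the first post, in minutes
print(f"99.99% budget: {budget:.2f} min/month")
print(f"Availability with a {outage}-minute outage: {monthly_availability(outage):.2f}%")
```

Under that assumption the budget is about 4.32 minutes per month, while a ~12-hour outage brings the month down to roughly 98.31%, far outside the SLA and consistent with support saying they will process service credits.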

fluffernutter · September 2023

@rjbl said:
Ok, I received this message:

Hey [Redacted],

We do, and I apologize for the delay and downtime on this node. Our Ryzen products all have quad-9s SLA (99.99%) and we will process service credits once we close the case out on our side and are confident no additional work/downtime is needed. There was a power supply issue which required us to unrack and complete emergency repair work (and now we're monitoring).

To give you a fully transparent answer: we also had a notification issue where we found our alerts for this server (and another provisioned on the same day) were not correctly tagged and thus did not trigger the emergency alert. This delayed our response to the downed server, which would otherwise have been under 5 minutes. It is completely on us, but I thought you were owed a full explanation.

I've assigned your ticket and added a note about the SLA, which we will process 9/4 to 9/5, assuming we are confident in the repairs and can close out the ticket. Additionally, we're rolling out a private status page where customers can see all of our node statuses in real time; it will also let us add notes and keep each of you better informed from a central source. I believe that, from a customer's point of view, seeing only the overall facility and network status was not enough.

Best,
[Redacted]

This is an excellent answer, good job @crunchbits on the transparency!

crunchbits · September 2023

@rjbl said:
I was hoping for a bit more of an explanation for the outage though.

This was just linked to me, but you should already have a reply from me from earlier. Admins and staff had already fixed the issue, but it warranted a proper reply from me directly. If you wish, you're welcome to share it.

Edit: I see it was already shared above before I refreshed.

@emgh said:
Well, what did they respond with?

If it was unclear, just reply and ask whether they had an issue or whether it was only your VPS having problems.

It was a node-specific hardware issue. We were admittedly delayed in responding to it.

rjbl · September 2023

My node has been down again for the past few minutes.

emgh · September 2023

@rjbl said:
My node has been down again for the past few minutes.

I just ate a burger, but I'm finished now

rjbl · September 2023

I really like their hardware and network, though. Do their dedicated servers have better uptime?

Don_Keedic · September 2023

@rjbl said:
I really like their hardware and network, though. Do their dedicated servers have better uptime?

I've had maybe 2-4 minutes of downtime on the storage box I've had with them since Feb/March, and that was due to them changing an IP range.

I've got a 4GB VPS I've had for about a month and a half and a VDS (7950X) I picked up last week - zero issues, no downtime.

yoursunny · September 2023

@emgh said:

@rjbl said:
My node has been down again for the past few minutes.

I just ate a burger, but I'm finished now

I just ate the avocado bacon burger from Shake Shack.
I'm in a food coma now.

rjbl · September 2023

@Don_Keedic said:

@rjbl said:
I really like their hardware and network, though. Do their dedicated servers have better uptime?

I've had maybe 2-4 minutes of downtime on the storage box I've had with them since Feb/March, and that was due to them changing an IP range.

I've got a 4GB VPS I've had for about a month and a half and a VDS (7950X) I picked up last week - zero issues, no downtime.

Maybe I am on a problematic node. I also bought mine in the LES sale, so about a month and a half ago as well. I logged the downtime at 12:47 PM PDT.

emgh · September 2023

Is it back up? @rjbl

rjbl · September 2023

@emgh said:
Is it back up? @rjbl

Not yet

BruhGamer12 · September 2023

@yoursunny said: I just ate the avocado bacon burger from Shake Shack.

I'm in a food coma now.

Lucky you. They don't have them where I live.

rjbl · September 2023

It's up now. I am waiting for a node migration.
12 + 2 hours is a new record of unplanned downtime from a provider for me.

bethp · September 2023

I mean, better than deadpool... Shit sometimes happens, and they seem to be handling this in a very professional way. Nothing can ever be perfect, no matter how hard we try. If they honour the SLA they offer, then in my book you have a good host.

Advice for the future: if it's critical to stay online, have a backup VM with a direct copy of your data and load balance between them.
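A minimal Python sketch of the failover half of that advice: try the primary server first and fall back to the backup VM if a TCP health check fails. Real setups would more likely use a load balancer or DNS failover; this only illustrates the priority-order health check, and any hostnames you pass in are your own.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Backend = Tuple[str, int]  # (host, port)

def tcp_healthy(backend: Backend, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the backend succeeds within the timeout."""
    host, port = backend
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False

def pick_backend(backends: Sequence[Backend],
                 is_healthy: Callable[[Backend], bool] = tcp_healthy) -> Optional[Backend]:
    """Return the first healthy backend in priority order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None
```

For example, with hypothetical hosts, `pick_backend([("primary.example.com", 443), ("backup.example.com", 443)])` returns the primary while it accepts connections and the backup otherwise. Note that the backup is only useful if its copy of the data is kept reasonably current.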

Don_Keedic · September 2023

@rjbl said:
It's up now. I am waiting for a node migration.
12 + 2 hours is a new record of unplanned downtime from a provider for me.

Well sorry to hear that. I'm certain they'll get everything back up and running for you, even on a holiday!

rjbl · September 2023

Sigh, losing many hours of work is not fun.

edrebe · September 2023

Offsite backups are important.

Four days ago there was a 13-hour 46-minute outage in Atlanta (USA) at a well-known provider. I waited out the first hour, then, using the offsite copy (5 websites and 5 databases), I migrated to another provider that also offers instant activation, and I was back online within 45 minutes ⚡

yoursunny · September 2023

@edrebe said:
Offsite backups are important.

Four days ago there was a 13-hour 46-minute outage in Atlanta (USA) at a well-known provider. I waited out the first hour, then, using the offsite copy (5 websites and 5 databases), I migrated to another provider that also offers instant activation, and I was back online within 45 minutes ⚡

The same happened to my website in 2021:
yoursunny.com Disaster Recovery Plan: 104 Minutes Downtime, No Tears
Recovery would take much longer if I were sleeping or geocaching, though.

cybertech · September 2023

@rjbl said:
Sigh, losing many hours of work is not fun.

You should go with a high-availability VPS.

rjbl · September 2023

I did duplicate my data after what happened yesterday, but the most recent data was inaccessible until the VDS came back up.

@cybertech said:

@rjbl said:
Sigh, losing many hours of work is not fun.

You should go with a high-availability VPS.

Short of setting up an HA cluster, I am not sure what you mean by a high-availability VPS.

cybertech · September 2023

@rjbl said:
I did duplicate my data after what happened yesterday, but the most recent data was inaccessible until the VDS came back up.

@cybertech said:

@rjbl said:
Sigh, losing many hours of work is not fun.

You should go with a high-availability VPS.

Short of setting up an HA cluster, I am not sure what you mean by a high-availability VPS.

https://www.clouvider.com/cloud-vps/

https://upcloud.com/products/cloud-servers

Not_Oles · September 2023

@rjbl said: I did duplicate data

Hi @rjbl!

Just duplicating your data might not be enough.

It seems too crazy, but maybe you want to copy all your data to multiple, independent locations. Maybe you want to use different backup media and formats at the multiple locations. Then maybe you want to double check that you actually can restore your backed up data and that the restored data matches the original.

For example, maybe back up to a local hard drive, to Google Drive, and to a server far away across the ocean.

Maybe you might be interested to see this article: https://lowendbox.com/blog/an-incredibly-amazing-co-incidence-of-doubled-double-disk-failures/

Best wishes!

Tom
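Tom's last step, checking that restored data actually matches the original, can be automated with checksums. A minimal Python sketch, assuming both the original data and a test restore are available as local directory trees (the paths in the usage note are hypothetical):

```python
import hashlib
from pathlib import Path
from typing import Dict, List

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def tree_checksums(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_restore(original: Path, restored: Path) -> List[str]:
    """Return a sorted list of problems; an empty list means the restore matches."""
    src, dst = tree_checksums(original), tree_checksums(restored)
    problems = [f"missing: {name}" for name in src.keys() - dst.keys()]
    problems += [f"unexpected: {name}" for name in dst.keys() - src.keys()]
    problems += [f"mismatch: {name}" for name in src.keys() & dst.keys()
                 if src[name] != dst[name]]
    return sorted(problems)
```

Run it after each test restore, e.g. `verify_restore(Path("/srv/data"), Path("/tmp/restore-test"))`; a non-empty result means that copy should not be trusted as-is.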

rjbl · September 2023

@cybertech said:

@rjbl said:
I did duplicate my data after what happened yesterday, but the most recent data was inaccessible until the VDS came back up.

@cybertech said:

@rjbl said:
Sigh, losing many hours of work is not fun.

You should go with a high-availability VPS.

Short of setting up an HA cluster, I am not sure what you mean by a high-availability VPS.

https://www.clouvider.com/cloud-vps/

https://upcloud.com/products/cloud-servers

I did not know these existed, thanks! I am a bit skeptical, though. Are they using something like Ceph to replicate live data?

@Not_Oles said:

@rjbl said: I did duplicate data

Hi @rjbl!

Just duplicating your data might not be enough.

It seems too crazy, but maybe you want to copy all your data to multiple, independent locations. Maybe you want to use different backup media and formats at the multiple locations. Then maybe you want to double check that you actually can restore your backed up data and that the restored data matches the original.

For example, maybe back up to a local hard drive, to Google Drive, and to a server far away across the ocean.

Maybe you might be interested to see this article: https://lowendbox.com/blog/an-incredibly-amazing-co-incidence-of-doubled-double-disk-failures/

Best wishes!

Tom

Hi Tom, thanks for the advice.
I already keep my backups on E2 object storage, though given its price, I'm hoping its claimed 11 nines (99.999999999%) of durability won't be undone by an OVH-style incident too.
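For a sense of scale, 11 nines of durability is commonly read as an expected annual loss rate of about 1e-11 per stored object, assuming independent failures. A back-of-the-envelope Python sketch under that reading; the object count is a made-up example, and the figure says nothing about correlated, facility-level incidents of the kind mentioned above.

```python
# Assumption: "11 nines" = 99.999999999% annual durability per object,
# read as an expected loss rate of 1e-11 per object per year.
ANNUAL_LOSS_RATE = 1e-11

def expected_losses_per_year(num_objects: int) -> float:
    """Expected number of objects lost per year, assuming independent failures."""
    return num_objects * ANNUAL_LOSS_RATE

objects = 10_000_000  # hypothetical: ten million backup objects
per_year = expected_losses_per_year(objects)
print(f"Expected losses per year: {per_year:.0e}")
print(f"Roughly one lost object every {1 / per_year:,.0f} years")
```

With ten million objects this comes to about one expected loss every 10,000 years, which is why the realistic risk is a site-wide event rather than random object loss, and why a second, independent backup location still matters.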
