Racknerd Down

dustinc · October 2024

@ralf said:
Mine is still down, so I guess I'm on one of the machines with the failed RAID. At least it's not as dramatic as it might have been... As soon as I read LAFD, I was imagining another OVH situation!

Luckily, the fire did not spread throughout the high rise - though, as a result of power being shut down, it did impact uptime, and sadly some hardware. We're doing our absolute best.

dustinc · October 2024

@tommyluo said:
I migrated some of my VPSes from CloudCone to RackNerd, but I will use Seattle, Dallas, and San Jose, as most of my VPSes with RackNerd are in Los Angeles.

That's always a good thing to do, and we commend you for it. Thank you for sticking with us through the tough times.

dustinc · October 2024

@john_sd3 said:
@dustinc unrelated to the current issue, but why do you have different payment methods for add funds and pay invoice? I want to add some funds using Indian payment methods, but your support tells me that is only available for invoices? I don't see why that should be the case.

Hi @john_sd3 -- The India NetBanking and India UPI payment options are temporarily unavailable because our third-party payment processor, Payssion, has deactivated them for now. As soon as they are made available to us again, we'll reactivate them on our end.

dzzzzz · October 2024

@dustinc said:

Hi @dzzzzz - this particular node is still in the queue. If it is confirmed to be dead, without the possibility of being saved, we will for sure update you via ticket, though I have hope. Thank you so much for your patience. If you do happen to have backups, and are interested in proceeding with your DR plans, we can help with that too.

Thanks, appreciate the reply. I do have a backup, but don't want to deploy until I know my particular VPS won't be powered up again.

ElChile · October 2024

@dustinc any update on node LAXSSD4024nerd6DC02 ?

dustinc · October 2024

@ElChile said:
@dustinc any update on node LAXSSD4024nerd6DC02 ?

We're still working on this node -- our priority is restoring service with our customers' data intact, and it is a slow process. Further updates to follow via the status page or e-mail.

114514 · October 2024

@dustinc losing my patience, that's nearly 48hrs offline. Will customers who are impacted by the extended downtime receive compensation for you guys disappointing us on the "100% Uptime Guarantee", which is bolded in the description of DC-02?

zGato · October 2024

@114514 said:
@dustinc losing my patience, that's nearly 48hrs offline. Will customers who are impacted by the extended downtime receive compensation for you guys disappointing us on the "100% Uptime Guarantee", which is bolded in the description of DC-02?

losing millions?

114514 · October 2024

@zGato said:
losing millions?

data is priceless bro

zGato · October 2024

@114514 said:
data is priceless bro

backups have been a thing for decades

114514 · October 2024

@zGato said:
backups have been a thing for decades

What if the instance happens to be literally my remote backup and now I need it because my local NAS is out of sync, and I found hosting on RN is much more affordable than other solutions?
I'm sorry, but this is the dilemma I'm facing, and your sarcastic comments in the feedback thread are not helpful to anyone here. Maybe just leave them for inexperienced newcomers not doing 3-2-1 properly, please.

zGato · October 2024

@114514 said:
and I found hosting on RN is much more affordable than other solutions?

you got the point

114514 · October 2024

@zGato said:
you got the point

fine. I'll take that

ralf · October 2024

I don't think communication has been that great, despite the detail on the status page. It's now been 52 hours since the server went down, and there's been no direct communication from RackNerd about the issue. I would have at least expected an e-mail after 24 hours of downtime explaining the situation.

Since the control panel is down, we can't even see what host our server was on - I only happen to know because you once mentioned it in a support ticket (or at least 2.5 years ago I was on LAXSSD5006nerd3DC02).

The last long list had 35 hosts still to be recovered, since then 10 have been named as fixed and then "an additional 2 hypervisors today thus far" without indicating which, and none of the updates are timestamped (since writing this, reloading the page it now lists the 2 and an additional 4). In any case, that still sounds like over 15 that still might be fixable or might not, but with a fixing velocity slowed down to somewhere around 10 per day.

There's no information on whether these are being looked at in parallel (I understand that a RAID rebuild is slow, but once it's started, it's mostly automatic) or how likely a given node is to be recovered. I know disks can fail during resilvering, but having some information on likelihood of success (i.e. n out of m drives currently working), anticipated time frame (how long it's been rebuilding the array and %age complete) or alternative options such as being given a new instance on a fresh host would help.

I do have backups and I'm not losing millions, but if it's likely that I'm going to have to reinstall from scratch, I'd rather get that process moving on a different host sooner rather than waiting another few days. Similarly, if we have to wait a couple more days just to find out we have to reinstall anyway, that'd be a far worse outcome than spending a few hours today rebuilding and being told in 3 days' time that actually we could have had our original instance restarted. OTOH, if it's only a few hours left, it makes sense to wait and see.

At the moment, we just have no idea. I think it was around the 24-hour mark that you posted "We will provide customers with available options as soon as we have more information, so that customers who have yet to activate their disaster recovery plans, can do so with our proposed solutions."

On the positive side, at least so far you've only reported successful rebuilds and no total RAID failures.
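The rough estimate in this post can be written out as arithmetic. The sketch below uses only the figures quoted above (35 hosts outstanding, then batches of 10, 2 and 4 reported fixed, at roughly 10 per day); the function names are made up for illustration, and the constant recovery rate is an assumption, not something the status page states.

```python
def remaining_nodes(outstanding: int, fixed_batches: list[int]) -> int:
    """Hosts still waiting after subtracting every batch reported as fixed."""
    return outstanding - sum(fixed_batches)


def days_to_clear(remaining: int, nodes_per_day: float) -> float:
    """Days needed to clear the queue, assuming a constant recovery rate."""
    if nodes_per_day <= 0:
        raise ValueError("nodes_per_day must be positive")
    return remaining / nodes_per_day


if __name__ == "__main__":
    # Figures taken from the post: 35 outstanding, then 10 + 2 + 4 fixed.
    left = remaining_nodes(35, [10, 2, 4])
    # Assumed rate of about 10 hosts per day, as guessed in the post.
    print(f"{left} hosts left, about {days_to_clear(left, 10):.1f} days at 10/day")
```

On those numbers, 19 hosts would still be outstanding, which matches "over 15", and clearing them would take a little under two days if the rate held.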

Milon · October 2024

Unfortunately, I'm losing my patience too :-( We have a list of nodes, but don't know which one our VPS is located on. It's been 3 days offline. I'm worried that my sites will be de-indexed from search engines, with a full loss of traffic. If I had known that it would take three days, I would have tried to migrate from backups.

We need to know the outlook so that we're not just sitting and waiting...

Milon · October 2024

@ralf said: I do have backups and I'm not losing millions, but if it's likely that I'm going to have to reinstall from scratch, I'd rather get that process moving on a different host sooner rather than waiting another few days. Similarly, if we have to wait a couple more days just to find out we have to reinstall anyway, that'd be a far worse outcome than spending a few hours today rebuilding and being told in 3 days' time that actually we could have had our original instance restarted. OTOH, if it's only a few hours left, it makes sense to wait and see.

At the moment, we just have no idea. I think it was around the 24-hour mark that you posted "We will provide customers with available options as soon as we have more information, so that customers who have yet to activate their disaster recovery plans, can do so with our proposed solutions."

The same thoughts. I was only stopped by the fact that I don't have the latest backups and would need to pay for new hosting and set everything up again, but this is better than 3 days (or a week) of downtime. :-(

dustinc · October 2024

Hi @ralf and @Milon -- I completely understand your concerns, and I agree that communication could have been improved throughout this process. As you’ve probably noticed, we are generally very proactive and responsive in addressing any issues. In this unprecedented situation, all hands have been on deck, and it’s been challenging. To provide a bit more context, as you know, the downtime stemmed from an unexpected fire on the 61st floor of the high-rise, which, thankfully, did not physically reach the servers or spread within the facility. However, the fire did result in an immediate, unforeseen power shutdown as mandated by the Los Angeles Fire Department affecting all systems, an issue that was beyond both our control and the facility’s.

Our highest priority has been to salvage and retain our customers' data and to carefully check it on a node-by-node basis. After a widespread, unexpected power outage like this, isolated issues with individual servers are not uncommon. In this case, while hundreds of physical nodes came back online without requiring manual intervention, about 30 nodes required individual attention to restore. A good number of these have since been recovered and brought back online -- some with minor repairs like PSU, RAID controller, or motherboard replacements, while others required more complex processes like RAID rebuilds.

Since the incident, our team has been working diligently through each affected node, with most of our staff members working 12-14+ hour shifts, prioritizing getting each node back online as quickly as possible. Some nodes encountered data corruption; when recovery was deemed impossible, we immediately reached out to affected customers and provisioned replacement services. Other nodes required RAID rebuilds (with no data loss), which, as you noted, is a lengthy process. Still others just required minor repairs such as a motherboard or RAID controller replacement. At the time of writing, we're working on the final four nodes remaining on our list that require individual attention. These last four are proving to be the most difficult and challenging, but we're not giving up until all possible attempts have been exhausted.

As of yesterday evening, I have also directed our team responsible for status updates to include specific node names, and we will continue to do so to ensure transparency.

If anyone is still offline and wishes to proceed with their disaster recovery plans by setting up a fresh VM to re-establish their environments, instead of waiting for our recovery efforts, please reach out to us via ticket, and we will expedite the process.

We sincerely appreciate your business and understanding as we work through this process.

dzzzzz · October 2024

I received the dreaded support ticket this morning - after all this time it turns out they were unable to recover. RackNerd have replaced my VPS with a fresh one and given a month credit as compensation. Lost the IPv6 and rDNS settings, but I've submitted a ticket and it should be easy enough to correct.

I'm not really angry about this - with such a cheap service expectations aren't super high. But RackNerd definitely has some work to do - this was essentially just a simple power outage and that should not be the cause of such a major incident. If they use the same hardware globally then any node is susceptible to the same issue.

dustinc · October 2024

@dzzzzz said:
I received the dreaded support ticket this morning - after all this time it turns out they were unable to recover. RackNerd have replaced my VPS with a fresh one and given a month credit as compensation. Lost the IPv6 and rDNS settings, but I've submitted a ticket and it should be easy enough to correct.

I'm not really angry about this - with such a cheap service expectations aren't super high. But RackNerd definitely has some work to do - this was essentially just a simple power outage and that should not be the cause of such a major incident. If they use the same hardware globally then any node is susceptible to the same issue.

Hi @dzzzzz -- Thank you for your patience throughout this process. Please do submit a ticket if you haven't already, and we'll prioritize taking care of your IPv6 and rDNS settings.

While power outages can affect any provider (or any environment, for that matter) regardless of size or tier, we understand the impact this has had on your service. In LA DC-02, the majority of our infrastructure came back online without being affected, while some nodes required additional intervention. Our hypervisors utilize an 8x SSD RAID-10 configuration for redundancy, which typically provides excellent protection against drive failures. However, RAID-10 can only withstand up to two drive failures within the same span, and sudden power loss events can, in a worst-case scenario, trigger multiple simultaneous drive failures that exceed this threshold. We also know that other customers/tenants of Multacom, with different environments/setups, were also impacted, so just for clarification, it's not confined to any particular type of setup or specific configuration.

In your specific case, despite our recovery efforts, we weren't able to recover the node, so we moved forward to reprovision your instance accordingly. While we acknowledge this is a budget-friendly service as you pointed out, we still applied great attention to detail, and we truly tried our very best here.
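To put rough numbers on how simultaneous drive failures can exceed what RAID-10 tolerates, here is a minimal sketch. It assumes an 8-drive array built from four two-way mirror pairs, with failures hitting drives uniformly at random; neither assumption is confirmed in this thread, the real layout (spans, controller behavior) may differ, and the function name is made up for illustration.

```python
from fractions import Fraction
from itertools import combinations


def raid10_loss_probability(num_drives: int, failed: int) -> Fraction:
    """Chance that `failed` simultaneous, uniformly random drive failures
    destroy a RAID-10 array built from two-way mirror pairs."""
    if num_drives % 2 != 0:
        raise ValueError("two-way mirroring needs an even number of drives")
    if not 0 <= failed <= num_drives:
        raise ValueError("failed must be between 0 and num_drives")

    pairs = num_drives // 2
    total = 0
    lost = 0
    # Assumption: drives 2*i and 2*i + 1 form mirror pair i.
    for dead in combinations(range(num_drives), failed):
        total += 1
        dead_set = set(dead)
        # The array is lost as soon as both members of any one pair are dead.
        if any(2 * i in dead_set and 2 * i + 1 in dead_set for i in range(pairs)):
            lost += 1
    return Fraction(lost, total)


if __name__ == "__main__":
    for k in range(1, 6):
        p = raid10_loss_probability(8, k)
        print(f"{k} failed drive(s): array lost with probability {p}")
```

Under these assumptions a single failure is always survivable, two simultaneous failures lose the array in 1 case out of 7, and five or more always do, because at least one mirror pair must then be completely dead.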

Milon · October 2024

Thank you for all your updates here @dustinc, and for your service anyway. I share the same thoughts and views as @dzzzzz, but it's 7+ hours later with zero updates on the status page :-) It's good that you shared here that it's still possible to recover nodes, since I still haven't received any ticket about a VPS change.

I still hope that at least my node can be brought back online, so I won't have to reinstall everything in a hurry and worry about data loss.

About RAID 10... maybe it's not a good choice? RAID 10 is intended to protect clients from data loss through redundancy, but in fact it seems to lead, with a high degree of probability, to data loss during a sudden power outage(?).

silicomnet · October 2024

You should not trust a company that does not make external backups.

zGato · October 2024

@silicomnet said:
You should not trust a company that does not make external backups.

Neither one that sells lifetime deals

tentor · October 2024

@Milon said: About RAID 10... maybe it's not a good choice? RAID 10 is intended to protect clients from data loss through redundancy, but in fact it seems to lead, with a high degree of probability, to data loss during a sudden power outage(?).

RAID-10 and power resilience are different topics. RAID-10 gives a provider some extra time to replace a disk without data loss. As for power resilience, it depends on software configuration, and if we are talking about RAID, a hardware controller with a battery is a solution for this one.

But you should not expect high availability from LET. Always do backups, and if you deem your service critical, implement a high-availability cluster yourself.

Milon · October 2024

Any progress?

ralf · October 2024

@tentor said:

@Milon said: About RAID 10... maybe it's not a good choice? RAID 10 is intended to protect clients from data loss through redundancy, but in fact it seems to lead, with a high degree of probability, to data loss during a sudden power outage(?).

RAID-10 and power resilience are different topics. RAID-10 gives a provider some extra time to replace a disk without data loss. As for power resilience, it depends on software configuration, and if we are talking about RAID, a hardware controller with a battery is a solution for this one.

I think you're thinking with your "host rep" hat on, and not actually seeing what he's asking. His question (I believe) is whether relying on RAID 10 is sufficient for purpose, if a power outage has led to such a catastrophic failure on so many nodes.

I notice that many providers now have shifted to ceph which distributes data across nodes as well as disks, and also has the advantage that storage is no longer tied to any specific host and so VMs can be migrated very easily to other hosts, for instance if a host motherboard is fried in a power outage. It'd be interesting to hear their experiences if any of them have experienced a similar wide-scale outage with many devices failing at the same time, and how they recovered from it.

ralf · October 2024

@Milon said:
Any progress?

Yeah, 72 hours into the downtime I got "the email" and a blank VM.

Milon · October 2024

@ralf said:
Yeah, 72 hours into the downtime I got "the email" and a blank VM.

You are lucky... I kept hoping that everything would be restored, so I waited 5 days, and in the end got what was predicted in this thread: complete loss of data and reinstallation.

ProHosting24 · October 2024

@ralf said:
I think you're thinking with your "host rep" hat on, and not actually seeing what he's asking. His question (I believe) is whether relying on RAID 10 is sufficient for purpose, if a power outage has led to such a catastrophic failure on so many nodes.

I notice that many providers now have shifted to ceph which distributes data across nodes as well as disks, and also has the advantage that storage is no longer tied to any specific host and so VMs can be migrated very easily to other hosts, for instance if a host motherboard is fried in a power outage. It'd be interesting to hear their experiences if any of them have experienced a similar wide-scale outage with many devices failing at the same time, and how they recovered from it.

We moved our whole Ceph infrastructure back in 2020 from firstcolo to maincubes, and everything came back up perfectly after booting the nodes.

Likewise, on a Ceph cluster that I had my hands on, a fatal power loss killed all nodes, and after restoring power nothing had to be done apart from running fsck.ext4 on some VMs.

This will always be the case because of the nature of how Ceph works (PG quorums, synchronous acks, etc.)

ralf · October 2024

@Milon said:

Yeah, 72 hours into the downtime I got "the email" and a blank VM.

You are lucky... I kept hoping that everything would be restored, so I waited 5 days, and in the end got what was predicted in this thread: complete loss of data and reinstallation.

How am I lucky? That's exactly the same!

Except, to be fair, it wasn't 5 days as it was still less than 96 hours since the outage when you replied, so you must have got your replacement VM within 4 days.

It was annoying having to spend my Saturday re-installing though when it could have been done a couple of days earlier.

yodo · October 2024

I too got the blank details email, and I've now been sent login credentials for a new node.

Any early Black Friday deals for those of us who have lost all our data??

If I'm going to have to go through the effort of starting from scratch and spend a day re-installing, I'd be keen for a deal to upgrade to a better-specced VPS. (Maybe Ryzen with a decent amount of RAM and storage?)
