Racknerd Down

Comments

  • dustinc Member, Patron Provider, Top Host

    @ralf said:
    Mine is still down, so I guess I'm on one of the machines with the failed RAID. At least it's not as dramatic as it might have been... As soon as I read LAFD, I was imagining another OVH situation!

    Luckily, the fire did not spread throughout the high rise - though, as a result of power being shut down, it did impact uptime, and sadly some hardware. We're doing our absolute best.

    Thanked by 1ralf
  • dustinc Member, Patron Provider, Top Host

    @tommyluo said:
    I'm migrating some of my VPSes from CloudCone to RackNerd, but I will use Seattle, Dallas, and San Jose, as most of my VPSes with RackNerd are in Los Angeles.

    That's always a good thing to do, so we commend you for doing so. Thank you for sticking with us through the tough times.

  • dustinc Member, Patron Provider, Top Host

    @john_sd3 said:
    @dustinc unrelated to the current issue, but why do you have different payment methods for add funds and pay invoice? I want to add some funds using Indian payment methods, but your support tells me that is only available for invoices. I don't see why that should be the case.

    Hi @john_sd3 -- Currently, the India NetBanking and India UPI payment options are temporarily unavailable because our third-party payment processor, Payssion, has temporarily deactivated them. As soon as they are made available to us again, we'll reactivate them on our end.

  • @dustinc said:

    Hi @dzzzzz - this particular node is still in the queue. If it is confirmed to be dead, without the possibility of being saved, we will for sure update you via ticket, though I have hope. Thank you so much for your patience. If you do happen to have backups, and are interested in proceeding with your DR plans, we can help with that too.

    Thanks, appreciate the reply. I do have a backup, but don't want to deploy until I know my particular VPS won't be powered up again.

    Thanked by 1dustinc
  • @dustinc any update on node LAXSSD4024nerd6DC02 ?

  • dustinc Member, Patron Provider, Top Host

    @ElChile said:
    @dustinc any update on node LAXSSD4024nerd6DC02 ?

    We're still working on this node -- our priority is restoring service with our customers' data intact, and it is a slow process. Further updates to follow via status or e-mail.

    Thanked by 1ElChile
  • @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

  • @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

  • @zGato said:

    @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

    data is priceless bro

  • @114514 said:

    @zGato said:

    @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

    data is priceless bro

    backups have been a thing for decades

  • @zGato said:

    @114514 said:

    @zGato said:

    @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

    data is priceless bro

    backups have been a thing for decades

    What if the instance happens to be literally my remote backup, and now I need it because my local NAS is out of sync, and I found hosting on RN is much more affordable than other solutions?
    I'm sorry, but this is the dilemma I'm facing, and your sarcastic comments in this feedback thread are not helpful to anyone here. Maybe save them for inexperienced newcomers who aren't doing 3-2-1 properly, please.

  • @114514 said:

    @zGato said:

    @114514 said:

    @zGato said:

    @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

    data is priceless bro

    backups have been a thing for decades

    and I found hosting on RN is much more affordable than other solutions?

    you got the point

  • @zGato said:

    @114514 said:

    @zGato said:

    @114514 said:

    @zGato said:

    @114514 said:
    @dustinc losing my patience, that's nearly 48 hours offline. Will customers impacted by the extended downtime receive compensation, given that you disappointed us on the "100% Uptime Guarantee" that is bolded in the description of DC-02?

    losing millions?

    data is priceless bro

    backups have been a thing for decades

    and I found hosting on RN is much more affordable than other solutions?

    you got the point

    fine. I'll take that

  • I don't think communication has been that great, despite the detail on the status page. It's now been 52 hours since the server went down, and there's been no direct communication from RackNerd about the issue. I would have at least expected an e-mail after 24 hours of downtime explaining the situation.

    Since the control panel is down, we can't even see what host our server was on - I only happen to know because you once mentioned it in a support ticket (or at least 2.5 years ago I was on LAXSSD5006nerd3DC02).

    The last long list had 35 hosts still to be recovered, since then 10 have been named as fixed and then "an additional 2 hypervisors today thus far" without indicating which, and none of the updates are timestamped (since writing this, reloading the page it now lists the 2 and an additional 4). In any case, that still sounds like over 15 that still might be fixable or might not, but with a fixing velocity slowed down to somewhere around 10 per day.

    There's no information on whether these are being looked at in parallel (I understand that a RAID rebuild is slow, but once it's started, it's mostly automatic) or on how likely a given node is to be recovered. I know disks can fail during resilvering, but having some information on likelihood of success (i.e. n out of m drives currently working), anticipated time frame (how long it's been rebuilding the array and %age complete) or alternative options such as being given a new instance on a fresh host would help.

    I do have backups and I'm not losing millions, but if it's likely that I'm going to have to reinstall from scratch, I'd rather get that process moving on a different host sooner rather than waiting another few days. Similarly, if we have to wait a couple more days just to find out we have to reinstall anyway, that'd be a far worse outcome than spending a few hours today rebuilding and being told in 3 days' time that actually we could have had our original instance restarted. OTOH, if it's only a few hours left, it makes sense to wait and see.

    At the moment, we just have no idea. I think it was around 24 hours you posted "We will provide customers with available options as soon as we have more information, so that customers who have yet to activate their disaster recovery plans, can do so with our proposed solutions."

    On the positive side, at least so far you've only reported successful rebuilds and no total RAID failures.

  • Unfortunately, I'm losing my patience too :-( We have a list of nodes, but don't know which one our VPS is located on. It's been 3 days offline. I'm worried that my sites will be de-indexed from search engines, with a complete loss of traffic :( If I had known that it would take three days, I would have tried to migrate from backups.

    We need to know the outlook so that we're not just sitting and waiting...

  • @ralf said: I do have backups and I'm not losing millions, but if it's likely that I'm going to have to reinstall from scratch, I'd rather get that process moving on a different host sooner rather than waiting another few days. Similarly, if we have to wait a couple more days just to find out we have to reinstall anyway, that'd be a far worse outcome than spending a few hours today rebuilding and being told in 3 days' time that actually we could have had our original instance restarted. OTOH, if it's only a few hours left, it makes sense to wait and see.

    At the moment, we just have no idea. I think it was around 24 hours you posted "We will provide customers with available options as soon as we have more information, so that customers who have yet to activate their disaster recovery plans, can do so with our proposed solutions."

    The same thoughts. I was only stopped by the fact that I don’t have the latest backups and would need to pay for new hosting and set everything up again, but this is better than 3 days (or a week) of downtime. :-(

  • dustinc Member, Patron Provider, Top Host

    Hi @ralf and @Milon -- I completely understand your concerns, and I agree that communication could have been improved throughout this process. As you’ve probably noticed, we are generally very proactive and responsive in addressing any issues. In this unprecedented situation, all hands have been on deck, and it’s been challenging. To provide a bit more context, as you know, the downtime stemmed from an unexpected fire on the 61st floor of the high-rise, which, thankfully, did not physically reach the servers or spread within the facility. However, the fire did result in an immediate, unforeseen power shutdown as mandated by the Los Angeles Fire Department affecting all systems, an issue that was beyond both our control and the facility’s.

    Our highest priority has been to salvage and retain our customers' data and carefully check it on a node-by-node basis. After a widespread, unexpected power outage like this, isolated issues with individual servers are not uncommon. In this case, while hundreds of physical nodes came back online without requiring manual intervention, about 30 nodes required individual attention to restore. A good number of these have since been recovered and brought back online -- some with minor repairs like PSU, RAID controller, or motherboard replacements, while others required more complex processes like RAID rebuilds.

    Since the incident, our team has been working diligently through each affected node, with most of our staff members working 12-14+ hour shifts, prioritizing getting each node back online as quickly as possible. Some nodes encountered data corruption; when recovery was deemed impossible, we immediately reached out to affected customers and provisioned replacement services. Other nodes required RAID rebuilds (with no data loss), which, as you noted, is a lengthy process. Others just required minor repairs such as a motherboard or RAID controller replacement. At the time of writing, we’re working on the final four nodes remaining on our list that require individual attention. These last 4 are proving to be the most difficult and challenging, but we're not giving up until all possible attempts have been exhausted.

    As of yesterday evening, I have also directed our team responsible for status updates to include specific node names, and we will continue to do so to ensure transparency.

    If anyone is still offline and wishes to proceed with their disaster recovery plans by setting up a fresh VM to re-establish their environments, instead of waiting for our recovery efforts, please reach out to us via ticket, and we will expedite the process.

    We sincerely appreciate your business and understanding as we work through this process.

    Thanked by 3edrebe ralf Milon
  • I received the dreaded support ticket this morning - after all this time it turns out they were unable to recover. RackNerd have replaced my VPS with a fresh one and given a month credit as compensation. Lost the IPv6 and rDNS settings, but I've submitted a ticket and it should be easy enough to correct.

    I'm not really angry about this - with such a cheap service expectations aren't super high. But RackNerd definitely has some work to do - this was essentially just a simple power outage and that should not be the cause of such a major incident. If they use the same hardware globally then any node is susceptible to the same issue.

    Thanked by 2darkimmortal Ganonk
  • dustinc Member, Patron Provider, Top Host

    @dzzzzz said:
    I received the dreaded support ticket this morning - after all this time it turns out they were unable to recover. RackNerd have replaced my VPS with a fresh one and given a month credit as compensation. Lost the IPv6 and rDNS settings, but I've submitted a ticket and it should be easy enough to correct.

    I'm not really angry about this - with such a cheap service expectations aren't super high. But RackNerd definitely has some work to do - this was essentially just a simple power outage and that should not be the cause of such a major incident. If they use the same hardware globally then any node is susceptible to the same issue.

    Hi @dzzzzz -- Thank you for your patience throughout this process. Please do submit a ticket if you haven't already, and we'll prioritize taking care of your IPv6 and rDNS settings.

    While power outages can affect any provider (or any environment, for that matter) regardless of size or tier, we understand the impact this has had on your service. In LA DC-02, the majority of our infrastructure came back online without being affected, while some nodes required additional intervention. Our hypervisors utilize an 8x SSD RAID-10 configuration for redundancy, which typically provides excellent protection against drive failures. However, RAID-10 can only withstand up to two drive failures within the same span, and sudden power loss events can, in a worst-case scenario, trigger multiple simultaneous drive failures that exceed this threshold. We also know that other customers/tenants of Multacom, with different environments/setups, were also impacted, so just for clarification, it’s not confined to any particular type of setup or specific configuration.
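
    To make the RAID-10 threshold more concrete, here is a rough sketch that counts how often an 8-drive array survives a given number of simultaneous drive failures. It assumes the drives are grouped into four 2-way mirror pairs and that failures hit drives uniformly at random; the actual span layout on these hypervisors is not stated in the thread, and the function names are made up for illustration.

```python
from itertools import combinations


def raid10_survives(failed, pairs):
    """The array survives as long as no mirror pair has lost both drives."""
    failed = set(failed)
    return not any(a in failed and b in failed for a, b in pairs)


def survival_fraction(num_drives, num_failures):
    """Fraction of all possible failure sets that the array survives."""
    # Assumption: drives are grouped into 2-way mirrors (0,1), (2,3), ...
    pairs = [(i, i + 1) for i in range(0, num_drives, 2)]
    combos = list(combinations(range(num_drives), num_failures))
    survived = sum(raid10_survives(c, pairs) for c in combos)
    return survived / len(combos)


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, "failed drives ->", round(survival_fraction(8, k), 3))
```

    Under those assumptions a single failure is always survivable, two failures are fatal only when both land in the same mirror pair, and five or more failures are always fatal, since at least one pair must then lose both drives.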

    In your specific case, despite our recovery efforts, we weren't able to recover the node, so we moved forward to reprovision your instance accordingly. While we acknowledge this is a budget-friendly service as you pointed out, we still applied great attention to detail here, and we truly tried our very best here.

  • Thank you for all your updates here @dustinc, and for your service anyway. I share the same thoughts and view as @dzzzzz, but it's 7+ hours later and there are zero updates on the status page :-) It's good that you shared here that it's still possible for nodes to be recovered, as I still haven't received any ticket about a VPS change.

    I still hope that at least my node can be booted back online, so that I won't have to request a reinstall of everything in a hurry and worry about data loss.

    About RAID 10... maybe it's not a good choice? RAID 10 is intended to save clients from data loss through redundancy, but in fact, with a high degree of probability, it leads to data loss during a sudden power outage(?).

  • You should not trust a company that does not make external backups.

  • zGato Member
    edited October 2024

    @silicomnet said:
    You should not trust a company that does not make external backups.

    Neither one that sells lifetime deals

    Thanked by 2tentor landnever
  • tentor Member, Host Rep

    @Milon said: About RAID 10... maybe it's not a good choice? RAID 10 is intended to save clients from data loss through redundancy, but in fact, with a high degree of probability, it leads to data loss during a sudden power outage(?).

    RAID-10 and power resilience are different topics. RAID-10 gives a provider some extra time to replace a disk without data loss. As for power resilience, it depends on software configuration, and if we are talking about RAID, a hardware controller with a battery is a solution for this one.

    But you should not expect high availability from LET. Always do backups, and if you deem your service critical, implement a high-availability cluster yourself.

    Thanked by 2zGato Milon
  • Any progress?

  • @tentor said:

    @Milon said: About RAID 10... maybe it's not a good choice? RAID 10 is intended to save clients from data loss through redundancy, but in fact, with a high degree of probability, it leads to data loss during a sudden power outage(?).

    RAID-10 and power resilience are different topics. RAID-10 gives a provider some extra time to replace a disk without data loss. As for power resilience, it depends on software configuration, and if we are talking about RAID, a hardware controller with a battery is a solution for this one.

    I think you're thinking with your "host rep" hat on, and not actually seeing what he's asking. His question (I believe) is whether relying on RAID 10 is sufficient for purpose, if a power outage has led to such a catastrophic failure on so many nodes.

    I notice that many providers now have shifted to ceph which distributes data across nodes as well as disks, and also has the advantage that storage is no longer tied to any specific host and so VMs can be migrated very easily to other hosts, for instance if a host motherboard is fried in a power outage. It'd be interesting to hear their experiences if any of them have experienced a similar wide-scale outage with many devices failing at the same time, and how they recovered from it.

    Thanked by 2tentor Milon
  • ralf Member
    edited October 2024

    @Milon said:
    Any progress?

    Yeah, 72 hours into the downtime I got "the email" and a blank VM.

  • Yeah, 72 hours into the downtime I got "the email" and a blank VM.

    You are lucky... I kept hoping that everything would be restored, so I waited 5 days, and in the end got what was predicted in this thread: complete loss of data and reinstallation.

  • ProHosting24 Member, Patron Provider

    @ralf said:
    @tentor said:

    @Milon said: About RAID 10... maybe it's not a good choice? RAID 10 is intended to save clients from data loss through redundancy, but in fact, with a high degree of probability, it leads to data loss during a sudden power outage(?).

    RAID-10 and power resilience are different topics. RAID-10 gives a provider some extra time to replace a disk without data loss. As for power resilience, it depends on software configuration, and if we are talking about RAID, a hardware controller with a battery is a solution for this one.

    I think you're thinking with your "host rep" hat on, and not actually seeing what he's asking. His question (I believe) is whether relying on RAID 10 is sufficient for purpose, if a power outage has led to such a catastrophic failure on so many nodes.

    I notice that many providers now have shifted to ceph which distributes data across nodes as well as disks, and also has the advantage that storage is no longer tied to any specific host and so VMs can be migrated very easily to other hosts, for instance if a host motherboard is fried in a power outage. It'd be interesting to hear their experiences if any of them have experienced a similar wide-scale outage with many devices failing at the same time, and how they recovered from it.

    We moved our whole CEPH infrastructure back in 2020 from firstcolo to maincubes, everything came perfectly back up again after booting nodes.

    The same goes for another CEPH cluster that I had my hands on: a fatal power loss killed all nodes, and after power was restored nothing had to be done apart from running fsck.ext4 on some VMs.

    This will always be the case because of the way Ceph works (PG quorums, sync acks, etc.)
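
    As a rough illustration of the "sync ack" idea mentioned above, here is a toy model in which a write is acknowledged to the client only after every replica has committed it. This is a simplified sketch, not Ceph's actual code or API; the class names and the `power_cut_after` knob are made up for the example.

```python
class Replica:
    """One storage node; `durable` stands in for data that survives power loss."""

    def __init__(self):
        self.durable = {}


class ReplicatedPool:
    def __init__(self, size=3):
        self.replicas = [Replica() for _ in range(size)]
        self.acked = {}  # writes the client was told are safe

    def write(self, key, value, power_cut_after=None):
        """Commit on each replica in turn; ack only if all of them committed.

        power_cut_after (hypothetical knob): number of replicas that commit
        before a simulated power cut interrupts the write.
        """
        for i, replica in enumerate(self.replicas):
            if power_cut_after is not None and i >= power_cut_after:
                return False  # power lost mid-write: the client never gets an ack
            replica.durable[key] = value
        self.acked[key] = value
        return True

    def acked_writes_intact(self):
        """After a restart, every acknowledged write must be on every replica."""
        return all(
            replica.durable.get(key) == value
            for key, value in self.acked.items()
            for replica in self.replicas
        )


pool = ReplicatedPool()
pool.write("block-1", "A")
pool.write("block-2", "B", power_cut_after=1)  # interrupted, never acknowledged
print(pool.acked_writes_intact())
```

    In this model an interrupted write can leave replicas disagreeing about unacknowledged data, but nothing the client was told is safe is ever missing, which is roughly why only a filesystem check inside some VMs would be needed afterwards.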

    Thanked by 2ralf maverick
  • @Milon said:

    Yeah, 72 hours into the downtime I got "the email" and a blank VM.

    You are lucky... I kept hoping that everything would be restored, so I waited 5 days, and in the end got what was predicted in this thread: complete loss of data and reinstallation.

    How am I lucky? That's exactly the same!

    Except, to be fair, it wasn't 5 days as it was still less than 96 hours since the outage when you replied, so you must have got your replacement VM within 4 days.

    It was annoying having to spend my Saturday re-installing though when it could have been done a couple of days earlier.

  • I too got the blank details email, I've now been sent login credentials for a new node.

    Any early Black Friday deals for those of us who have lost all our data??

    If I'm going to have to go through the effort of starting from scratch and spend a day re-installing, I'd be keen for a deal to upgrade to a better-specced VPS. (Maybe Ryzen with a decent amount of RAM and storage?)
