
HostHatch Los Angeles storage data corruption (Was: HostHatch Los Angeles storage down)


Comments

  • Nekki Veteran

    @dahartigan said:

    @stevewatson301 said:

    @angstrom said:
    Modified title based on OP's request

    @Nekki had rightly predicted that title edits would be reserved for Host Reps, and it's interesting to see it play out.

    404 - Drama not found.

    I’ve been invoked for nothing?

    There will be repercussions.

  • Daniel15 Veteran
    edited May 2022

    @joelby said:
    I spent most of the day recovering from issues similar to those others have described in the thread :( I had the xz-compression error. I ended up with 300,000 entries in lost+found, which wasn't much fun! Here's a general outline of how I handled it:

    • I use Debian, so mount a Debian ISO of the same major version and change the boot order to the ISO
    • Reboot the server. In the Debian installer, proceed to rescue mode (non-graphical)
    • Don't mount the disks - you want to fsck them first. You might need to do some exploration (e.g. lsblk, or the LVM equivalent) to find all of your partitions and fsck them.
    • Once they're fscked (which I left running overnight), you can restart the rescue installer, but this time mount the partitions and launch the shell there. Effectively, this mounts your root partition and chroots into it. You also need to mount your /boot partition and a few others, e.g. devpts
    • I had lots of corrupt files all over the place, including apt lists. I could run some commands, but others caused segfaults. My main aim was to be able to run dpkg --verify so that I could validate APT packages and reinstall them if necessary. This was a torturous iterative process which involved finding corrupt library files with md5sum and copying them from another VM running the same version of Debian (fortunately networking and wget were working). Eventually I was able to get dpkg to work, though I had to delete (or copy over) some apt info files in /var/lib/dpkg/info.
    • Then I kept running dpkg --verify, identifying the source package of corrupted files using dpkg -S, then running apt reinstall
    • One of the corrupted packages was the Linux kernel - reinstalling that also rebuilds the initramfs image, which will fix the xz boot issue
    • Once dpkg --verify only returns 'known' mismatches, such as manually modified config files, you can shut down, change the boot order, unmount the ISO, then reboot - hopefully it will work at this point!
    • Once the VM starts, deal with the corrupted data issues affecting applications. In my case my Nextcloud installation was broken, which I fixed by restoring the Postgres database from a backup (fortunately I had one), reinstalling the Nextcloud PHP files of the same version over the top, and then restoring all of the corrupted data files (fortunately I use b2 sync to copy everything to Backblaze - after deleting bad local copies I could reverse the sync to pull down all files which were missing)

    Thanks for this! Very useful.

    In my case it seems like all the 'core' files were okay, so I was able to boot it after running fsck and rebooting. I still need to verify the Debian packages, but I don't have a lot installed on this system - mainly just Borgbackup, Syncthing, and Netdata, as most usage is via NFS from another HostHatch system. I have photos on here and saw that some of my photo albums were corrupted, so I'm restoring all the data from backup just in case.
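
    For reference, a condensed sketch of the package-verification loop described in the quoted steps above (the library path and package names are placeholders - adjust for your own system):

        # From the rescue shell with the root filesystem mounted and chrooted:
        dpkg --verify                 # lists files whose checksums no longer match ('c' marks conffiles)
        dpkg -S /usr/lib/x86_64-linux-gnu/libexample.so.1   # find which package owns a corrupt file
        apt reinstall example-package                       # re-download and reinstall that package
        # Repeat until --verify only reports intentionally modified config files.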

    This reminded me of something people need to watch out for:

    fortunately I use b2 sync to copy everything to backblaze

    For everyone in this thread, if your system is in a broken state, be sure to disable all scheduled backups. You don't want the corrupted files spreading to your backups! If you use a backup system that can keep multiple backups then be sure to restore from a backup that was taken before the corruption occurred. I take backups daily so I'm using the backups from the day before everything got corrupted.
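
    To make that concrete, a minimal sketch assuming the backups run from a systemd timer and use Borg (as mentioned earlier in this post); the timer name, repository URL, and archive name are illustrative:

        # Pause scheduled backups while the system is in a broken state
        systemctl disable --now borgbackup.timer     # or comment out the relevant crontab entry

        # List archives and restore from one taken before the corruption
        borg list ssh://user@backuphost/./backups            # pick an archive dated before the corruption
        cd /restore-target && borg extract ssh://user@backuphost/./backups::daily-2022-05-10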

  • emgh Member

    Obviously, this thread is not needed since HostHatch is extremely good at communicating. If you don't believe so, it's probably due to personal issues.

  • xetsys Member
    edited May 2022

    I find it odd how users back up content to a storage VPS and then back that up to the cloud (Google/Backblaze/OneDrive/Koofr, etc.) for peace of mind. I just wonder why anyone bothers with a storage VPS when the backups can be corrupted and need to be restored from the cloud. Why not just use the commercial cloud directly? The data at those services can be encrypted with rclone, and they are much more reliable.
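
    For anyone curious what the rclone approach mentioned above looks like, a minimal sketch of a client-side-encrypted Backblaze B2 remote (remote names, bucket, and credentials are placeholders; the config stanzas are normally generated by `rclone config`):

        # ~/.config/rclone/rclone.conf
        [b2]
        type = b2
        account = <key id>
        key = <application key>

        [b2crypt]
        type = crypt
        remote = b2:my-bucket/backups
        password = <obscured password from `rclone config`>

        # Everything synced through the crypt remote is encrypted before upload
        rclone sync /data b2crypt: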

  • emgh Member

    @xetsys said:
    I just wonder why anyone bothers with a storage VPS when the backups can be corrupted and need to be restored from the cloud. Why not just use the commercial cloud directly? [...]

    I mean, there are purposes, like a relatively high amount of included bandwidth, etc., but in general you're right. I use B2 and couldn't be happier.

  • plumberg Veteran

    @Daniel15 said:
    For some reason the connection between HostHatch and Servarica (where one of my backups is stored) caps out at ~90Mbps even though it gets ~600Mbps from another server in LA (blaming Psychz for that), plus HostHatch's disks are a bit slow on this server, so this restore is taking a while. I started restoring ~1TB of files a few hours ago and it's 20% complete so far.

    Most files I've looked at seem fine though, so the restore is just in case some random files on disk got corrupted and I didn't notice.

    @plumberg said: This is what I was able to grab... Any suggestions/help to fix? Thanks

    Are you booting into the system or into a live CD?

    I'm not familiar with XFS but maybe someone else has an idea.

    So I should fsck each of the disks after booting into the live CD?
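
    For reference, roughly what that looks like from a rescue/live environment - check each partition while it is unmounted (device names here are placeholders, and XFS partitions use xfs_repair rather than fsck):

        lsblk -f                  # list block devices and filesystem types
        vgchange -ay              # activate LVM volumes, if any
        fsck -fy /dev/vda1        # ext2/3/4 partitions, one at a time, unmounted
        xfs_repair /dev/vda2      # XFS partitions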

  • Emil Member, Host Rep

    Hi,

    I'm responsible for the technical infrastructure and I just want to come in here and give a brief explanation of what we are changing on the infrastructure side of things.

    First of all I want to offer my apologies to the affected customers. We've had issues with a few of our older storage nodes. We can see it's limited to a specific combination of hard drive manufacturer and RAID card model. Some of the drives get kicked out of the array at the same time, the array turns read-only, throws a lot of IO errors, and has to be hard rebooted. This can happen after being online for more than a year without issues. Even though the drives never actually 'failed', this sometimes causes data inconsistency, and while most files are still intact, some may be corrupted, which makes it difficult to boot a functional operating system.

    In the newer storage nodes we build, we use different, newer RAID cards, and we use two of them. We also create multiple smaller RAID arrays, and in some locations we use an HCI design for new VMs. It's a lot easier to keep this hardware healthy, as rebuild and consistency-check times are much quicker.

    The older storage nodes were designed to provide as much storage (for backups, archives, etc.) at as low a cost as possible. They have one big RAID array for everything, including the host OS, which is more error-prone for a number of reasons. Over the years we have redesigned both the hardware and the software (monitoring) to offer better resiliency and reliability based on what we have learned from past issues, without having to increase pricing by a huge factor.

    These improvements are nothing new; we've worked on them for years, but due to the recent failures we're expediting the decommissioning of the older nodes. We are ordering more hardware for all storage locations and will reach out to customers on these nodes to get them running on the newer nodes in our new cloud platform.

    Note that we currently have >7 PB of capacity deployed, and are adding more. Only a small percentage of this has had data inconsistency or loss, but we will make sure that this failure mode is eliminated in the next few months.

    Once again sorry for the inconvenience and thank you for your patience and understanding.

    Emil

  • @Emil said:
    [...] We've had issues with a few of our older storage nodes. We can see it's limited to a specific combination of hard drive manufacturer and RAID card model. [...] due to the recent failures we're expediting the decommissioning of the older nodes. [...]

    Hi Emil,

    You have acknowledged the issue, which is great. You also mentioned that the older storage nodes have a known issue that could result in data loss.

    My question is: why haven't you proactively migrated customers from the faulty system to the newer one? It seems like there is a catastrophe waiting to happen, and the writing's on the wall.

  • @Daniel15 said:

    @msallak1 said:
    Anyone know how to fix this error? This happened after shutting down the system and booting it up again from the website.

    It means your kernel image (/boot/vmlinuz-*) is corrupt. See if there's an older kernel available in the GRUB menu. Otherwise you might have to boot into a live CD and copy a clean version of the kernel across. Which distro are you using?

    Debian 10, and unfortunately I can't type or do anything via the console, so I am not sure how I can choose a different kernel.
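
    Following the live-CD suggestion above, a rough sketch of what the repair could look like on Debian (the disk device and kernel package names are placeholders - your rescue ISO and partition layout will differ):

        # From a live/rescue environment, mount the VM's disk and chroot into it
        mount /dev/vda1 /mnt                  # root partition; mount /boot separately if it is its own partition
        for fs in dev proc sys; do mount --bind /$fs /mnt/$fs; done
        chroot /mnt /bin/bash

        # Inside the chroot: reinstall the installed kernel package, which rewrites
        # /boot/vmlinuz-* and regenerates the initramfs, then update GRUB
        dpkg -l 'linux-image-*'               # shows the versioned package, e.g. linux-image-4.19.0-20-amd64
        apt reinstall linux-image-4.19.0-20-amd64
        update-grub
        exit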

  • Emil Member, Host Rep

    @dahartigan said:

    [...] My question is why haven't you proactively migrated customers from the faulty system to the newer one? It seems like there is a catastrophe waiting to happen, and the writing's on the wall.

    We've always planned to deprecate these nodes, but there is/was no immediate risk of data loss, just a general design that is less reliable than our newer one. However, after the recent issues, it's possible that we have identified a pattern or a certain risk factor. It could also just be bad luck, but either way we will expedite the decommissioning process, as we want to improve general reliability for everyone.

  • @Emil said:

    [...] We've always planned to deprecate these nodes, but there is/was no immediate risk of data loss, just a general design that is less reliable than our newer one. [...]

    Probably a good idea to assume there is now an immediate risk of future data loss, given that these nodes are in production and the design is a lot less reliable than you believe and have planned for.

    Luck won't replicate across locations, good or bad.

  • @Emil said:
    [...] due to the recent failures we're expediting the decommissioning of the older nodes. We are ordering more hardware for all storage locations and will reach out to customers on these nodes to get them running on the newer nodes in our new cloud platform. [...]

    Thanks @Emil, I appreciate you taking the time to post an explanation here. I do hope that in future an email is sent out to impacted customers.

    Unfortunately, with the potential pattern identified, I cannot trust storing any data on the VPS until a migration to a new server occurs. For the expedited decommissioning, are we talking weeks or months? Is there an option to request recreation on another server in the near term, or a refund for the remaining months if we have paid upfront?

  • default Veteran

    One thing is for sure: HostHatch storage is not that safe. All these losses of data, combined with poor support and a lack of proper announcements by email or on the website, make it really difficult to do business with them. As a result, I have already marked my services for cancellation at the end of the billing period.

    I honestly do not wish to see another involucration drama, but it is what it is.

  • cybertech Member
    edited May 2022

    @Emil said:
    [...] In the newer storage nodes we build, we use different, newer RAID cards, and we use two of them. We also create multiple smaller RAID arrays, and in some locations we use an HCI design for new VMs. [...]

    I have two storage VPSes in separate locations on the new infra (EPYC). One required fsck recently, and the other has always had fluctuating IOPS, dropping below 100, and is noticeably slow to install or update any apps, however small.

    I hope this can be looked at in parallel as well.
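
    For anyone wanting to put numbers on that kind of fluctuation, a quick benchmark sketch (assuming fio and ioping are installed; the file path, size, and runtime are just example parameters):

        # Sustained 4k random-read IOPS over 60 seconds
        fio --name=randread --filename=/root/fio-test --size=1G \
            --rw=randread --bs=4k --ioengine=libaio --direct=1 \
            --iodepth=16 --runtime=60 --time_based --group_reporting

        # Per-request latency, which shows fluctuation more directly
        ioping -c 20 /root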

  • Storage is cursed. Nek minnit fran loses his slabs. Just migrate storage to a dedi or use Google Drive etc.

  • darkimmortal Member
    edited May 2022

    @Emil said:
    First of all I want to offer my apologies to the affected customers. We've had issues with a few of our older storage nodes. We can see it's limited to a specific combination of hard drives manufacturer and RAID card model. Some of the drives get kicked out of the array at the same time, it turns to read-only and throws out a lot of IO errors and has to be hard rebooted. This can happen after being online for >1 year without issues. Even if the drives are never actually 'failed', it sometimes causes inconsistency to the data and while most are still in tact, some files may be corrupted which makes it difficult to boot up a functional operating system.

    This is the post I was waiting for. I couldn't get my head around the failure mode happening any other way. This is a believable explanation.

    Are all old nodes affected by this? Can we assume all E5v2 nodes are affected and the EPYC ones are unaffected?

    Is there any interim mitigation planned to make the old nodes more trustworthy? Such as rebooting every few months, swapping to software RAID, a script to detect the first sign of data being hosed and immediately hard reboot, etc.?
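
    A hypothetical sketch of that last idea - a small guest-side check, run from cron every minute or so, that alerts (or hard-reboots) as soon as the root filesystem flips to read-only or the kernel starts logging I/O errors:

        #!/bin/bash
        # Alert if the root filesystem has been remounted read-only
        if findmnt -no OPTIONS / | grep -qE '(^|,)ro(,|$)'; then
            logger -p user.crit "root filesystem is read-only - storage may be failing"
            # systemctl reboot --force --force   # uncomment to hard-reboot immediately
        fi

        # Alert on recent kernel I/O errors
        if journalctl -k --since "5 minutes ago" | grep -qiE 'i/o error|remounting filesystem read-only'; then
            logger -p user.crit "recent kernel I/O errors detected"
        fi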

  • dosai Member

    @Emil, communication during times like these is lacking. Have you thought of setting up a community channel, such as Discord, to keep customers informed?

  • digitalwicked Member
    edited May 2022

    @darkimmortal said:
    [...] Can we assume all E5v2 are affected and Epyc are unaffected? [...]

    Unfortunately it was an 'AMD EPYC 7551P 32-Core Processor' for this data loss. My VM was bulk-migrated to it earlier in the year from an old E5v2.

  • @digitalwicked said:

    Unfortunately it was an 'AMD EPYC 7551P 32-Core Processor' for this data loss. Was bulk migrated to it earlier in the year from an old E5v2.

    Oh dear... I thought the EPYCs were the new nodes; at least that's what I gathered from the most recent Black Friday sale.

  • bdl Member
    edited May 2022

    @dahartigan said:
    Storage is cursed. Nek minnit fran loses his slabs. Just migrate storage to a dedi or use Google Drive etc.

    I'm OK, my Sydney HH instance is being backed up to HH LA and HH CHI :lol:

    Time to set up another backup to Letbox methinks

  • xetsys Member
    edited May 2022

    @digitalwicked said:

    Unfortunately it was an 'AMD EPYC 7551P 32-Core Processor' for this data loss. Was bulk migrated to it earlier in the year from an old E5v2.

    So we don't know whether this particular incident was due to the older/newer-generation processor, a failed hardware RAID card, or the disks themselves. With other hosts, we usually get figures like "two disks failed; those were replaced immediately and the RAID is rebuilding, so there shouldn't be any data loss." When the exact cause of failure isn't known, it means it can happen again. People still use old processors for their storage NAS; they are hardly the cause of data failure. It's most likely the hardware RAID card or the disks. If it's the disks, we should be told how many failed and whether all data was recovered from parity.

    It also sounds like the disk health monitoring isn't adequate. As soon as there are IO errors, the situation warrants immediate disk replacement.
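
    As an illustration of the kind of monitoring being described, a hypothetical periodic SMART check with smartmontools (device names are placeholders; behind a hardware RAID card the disks usually need a passthrough option such as -d megaraid,N, and smartd can do this continuously):

        for dev in /dev/sda /dev/sdb; do
            # Overall health verdict
            smartctl -H "$dev" | grep -q PASSED || echo "SMART health check FAILED on $dev"
            # Attributes that tend to precede failures or indicate cabling/controller trouble
            smartctl -A "$dev" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|UDMA_CRC_Error'
        done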

  • Your data corruption has been doubled.

  • darkimmortal Member
    edited May 2022

    @xetsys said:

    So we don't know whether this particular incident was due to the older/newer-generation processor, a failed hardware RAID card, or the disks themselves. [...]

    It also sounds like the disk health monitoring isn't adequate. As soon as there are IO errors, the situation warrants immediate disk replacement.

    For sure the CPU won't have any impact, but it is the only way we can tell whether we are on old or new hardware at HostHatch (or any VPS provider).

    I don't believe this is a case of failing to react to a disk failure, or even related at all to a disk failure. You can only get symptoms like that from a RAID card going haywire. Maybe it is a failure on their part to scrub, or to react to IO errors. But a disk failure would be more of an all-or-nothing affair.

  • hosthatch Patron Provider, Top Host, Veteran
    edited May 2022

    @dosai said:
    @Emil, communication during times like these is lacking. Have you thought of setting up a community channel, such as Discord, to keep customers informed?

    Unfortunately any comment we make in times of crisis is taken apart, and made into five different explanations, which is why we have to make sure we are very clear with any communication during such times. It gets to the point where people start pointing out basics to us like we're children who started doing this yesterday and have no idea how any of this works. So at some point, we have to decide how much time we should spend sharing information as compared to actually solving it.

    @xetsys said: It also sounds like the disk health monitoring isn't adequate. As soon as there are IO errors, the situation warrants immediate disk replacement.

    In both cases (last month and this week), there were zero failed or failing disks, or disks with errors. The RAID card kicked out multiple drives, which resulted in IO errors and the array becoming read-only, and it needed a hard reboot to bring it back online. We obviously keep spares in each location, and we have monitoring to detect errors and replace drives before failures can happen; otherwise we would have far more failures per week, as compared to two failures this year and another couple in the past eight or so years we've been selling storage VMs. Again, we have several PB of raw storage at the moment, and these failures represent a very small percentage of it. This thread seems to make it sound like every second node we have is at risk of failure; it is not.

    We should totally do better on communication - but that is a separate problem from data safety.

    @digitalwicked said: Unfortunately it was an 'AMD EPYC 7551P 32-Core Processor' for this data loss. Was bulk migrated to it earlier in the year from an old E5v2.

    Sorry, but this is incorrect. No EPYC node has had this issue. And we've been building newer 'versions' of our nodes with E5v2 for the last few years, long before we started using EPYC. Please verify your information before posting it here, as there is already a lot of misinformation in this thread without needing to make up more out of thin air.

    @default said:
    One thing is for sure: HostHatch storage is not that safe. As result: I already marked my services for cancellation at the end of billing period.

    I honestly do not wish to see another involucration drama, but it is what it is.

    If you actually really think your average storage VM provider is designing nodes better than us and you have less chance of data loss there, please feel free to move to them and do not let the door hit you on the way out.

    I am with you on the support on these storage plans - it is quite bad and should be better. But please ask your new provider how they would have resolved this issue any better, and what types of redundancies they have against such failures. I am all ears. No drama - I am truly interested in learning this valuable information.

    @darkimmortal said: Are all old nodes affected by this?

    No, if we go by the failures last month in Chicago and this week in LA (i.e. the same RAID and hard drive company), then we have a couple more nodes like that. We planned to migrate people to our newer 'version' of nodes anyway; we will just do it ASAP for some people. But in no way are "all" nodes, or anything even close to that, affected, and note that both of these nodes had been running for years without any issues.


    To clarify Emil's response above:

    There is no known issue that would result in data loss.

    We design our newer nodes to be better than the ones before; it does not mean the old ones are faulty, just that the new ones are better. Creating multiple smaller RAID arrays on different RAID cards mitigates risk and also speeds up rebuild times. That does not mean having a single large array is a definite potential for failure, considering that is what most providers do with their storage nodes anyway.

    I am truly sorry for the users who were affected by this; we are going to do our best to make sure it does not happen to you or anyone else again. With that said, please keep multiple copies of your backups.

  • caracal Member

    @hosthatch

    Given that I am dumping this set of data, could we ask for a migration to the new infra by means of a ticket?

  • hosthatch Patron Provider, Top Host, Veteran

    @caracal said:
    @hosthatch

    Given that I am dumping this set of data, could we ask for a migration to the new infra by means of a ticket?

    Yes you can do that, no problem.

  • SirFoxy Member

    hosthatch has been the definition of mid; this is nothing new.

    okay servers for good prices, with little to no support.

    you get what you pay for.

  • Nekki Veteran

    @SirFoxy said:
    you get what you pay for.

    This should be the LET slogan/masthead/credo.

  • default Veteran

    @hosthatch said:

    If you actually really think your average storage VM provider is designing nodes better than us and you have less chance of data loss there, please feel free to move to them and do not let the door hit you on the way out.

    Sure. Thank you.

    But please ask your new provider how they would have resolved this issue any better, and what types of redundancies they have against such failures? I am all ears. No drama - I am truly interested in learning this valuable information.

    I do not know how they do it. Take notes from Servarica [ @servarica_hani ] in Canada, or from Terrahost [ @terrahost ] in Norway. I had absolutely no issues with them over the past couple of years. No data corruption; no downtime; and guess what: their support is also much better.

  • @default said:
    I do not know how they do it. Take notes from Servarica [ @servarica_hani ] in Canada, or from Terrahost [ @terrahost ] in Norway. I had absolutely no issues with them over the past couple of years. No data corruption; no downtime; and guess what: their support is also much better.

    I love Hani to death, but I think it's a bit unfair to compare support or downtime between them and HH. Not accounting for a 2-3 day Servarica downtime last year when they moved DCs (not their fault, communication was excellent and SLA credits were provided, but there was an interruption of service), HH has a much better price per TB of storage when it comes to their promos, as well as their list price. HH has also confirmed that people paying list price get priority support, which I think is totally fair considering how low the margin must be on the LET customers.

    I have my backups with both Servarica and HH and I appreciate the different ways that each provider handles things, but I also don't think it's a fair 1:1 comparison. Both are fantastic in their own ways, just with different expectations for each. I'm still upset about this incident, not really because of the corruption but because of the poor response to it. At their scale, things are bound to go wrong; the important thing is how they address it, and I think they missed the mark on that front. No comment on Terrahost; I've never used them before.
