HostHatch Los Angeles storage data corruption (Was: HostHatch Los Angeles storage down)


Comments

  • jackb Member, Host Rep
    edited May 2022

    @hosthatch said:

    @dosai said:
    @Emil communication during times like these is lacking. Have you thought of setting up a community channel to keep customers informed, like Discord (for example)?

    Unfortunately any comment we make in times of crisis is taken apart, and made into five different explanations, which is why we have to make sure we are very clear with any communication during such times. It gets to the point where people start pointing out basics to us like we're children who started doing this yesterday and have no idea how any of this works. So at some point, we have to decide how much time we should spend sharing information as compared to actually solving it.

    It sounds like the affected customers weren't notified at all?

    I'd agree with limiting certain details until you've reached a resolution or fully know the impact; but if you need to hard bounce a server I'd recommend sending an email out to let people know there was an outage and a hard reboot. In my experience the thing that annoys people the most is being completely in the dark.

  • Daniel15 Veteran
    edited May 2022

    @default said: Take notes from Servarica [ @servarica_hani ] in Canada

    Servarica's architecture is entirely different. They have all the storage in a SAN (Storage Area Network) rather than in the servers themselves, and likely a separate dedicated network just for the storage. This can have disadvantages - in theory it's a bit slower since reads/writes have to go over a network of some sort, but in practice Servarica has unusually high fio 4k and 64k read/write speeds and IOPS, so it actually seems faster than some other storage VPSes ¯\_(ツ)_/¯

    BuyVM's storage slabs and Letbox's block storage are similar.
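
    For reference, numbers like those usually come from fio runs along these lines (an illustrative invocation only, not necessarily the exact benchmark anyone here used; the file name, size, and runtime are arbitrary):

        fio --name=rand4k --filename=fio-test.bin --size=1G --direct=1 \
            --ioengine=libaio --rw=randrw --rwmixread=75 --bs=4k --iodepth=64 \
            --runtime=60 --time_based --group_reporting
        # repeat with --bs=64k for the 64k figures

    The interesting outputs are the read/write bandwidth and IOPS lines printed at the end of the run.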

    Thanked by fluffernutter, lentro
  • plumberg Veteran

    @Daniel15 said:

    @plumberg said:

    @Daniel15 said:

    @plumberg said: This is what I was able to grab... Any suggestions/ help to fix? Thanks

    Are you booting into the system or into a live CD?

    I'm not familiar with XFS but maybe someone else has an idea.

    I think this is from the live rescue upon bootup of the VM... Running CentOS 7 with FDE.

    Booting from the OS won't work well if the partition containing /sbin and /usr/sbin (usually the root partition) is corrupted in any way, since some of the mount/fsck tools may themselves be corrupted. You might need to boot from a CentOS live CD and go into a recovery mode. That's what I did initially: since I run Debian, I mounted the latest Debian ISO, booted into the rescue mode available in the "advanced options" in the installer boot menu, chose not to mount anything, and ran fsck from the terminal it started.

    Hi, so I tried the same. Booted from the CentOS live CD, set my LUKS password, tried the rescue option, and I can't seem to find the disks to run fsck / xfs_repair... Any pointers? Thanks

    (Screenshots attached: boot rescue menu, "Start Rescue?" prompt, LUKS passphrase prompt, fdisk output)
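
    If the rescue environment does not find the disks on its own, they can usually be located from a shell before attempting any repair (a rough sketch; device names like /dev/vda are typical for KVM guests but may differ):

        lsblk -f              # list block devices with filesystem / LUKS signatures
        blkid | grep -i luks  # identify which partition holds the LUKS container

    Whatever partition shows a crypto_LUKS signature is the one to unlock, as described a few posts below.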

  • Daniel15 Veteran
    edited May 2022

    @xetsys said: I find it odd how users back up content to a storage VPS and then that to the cloud, i.e. Google/Backblaze/OneDrive/Koofr etc., for peace of mind. I just wonder why anyone bothers with a storage VPS when backups can be corrupted and require restoring from the cloud.
    Why not just use the commercial cloud directly? The data at those services can be encrypted too with rclone, and they are much more reliable.

    I see it as part of a 3-2-1 backup policy. The three copies of important files are on my PC, on a storage VPS, and in "the cloud". Essentially, instead of having two copies locally and one offsite (like traditional 3-2-1), I have one copy locally and two offsite. I have some important files in Seafile, and I've configured the Seadrive client to always have those files available locally on my PC.

    You should have backups of files "in the cloud" too - that means either having two clouds (e.g. Google and Backblaze), or keeping a copy on DVD/Blu-ray or a USB HDD stored somewhere secure, whatever.
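
    For what it's worth, the rclone encryption mentioned in the quote above boils down to wrapping a cloud remote in a "crypt" remote and syncing to that (a minimal sketch; the remote name "cloud-crypt" and the paths are hypothetical and assume the crypt remote has already been set up via rclone config):

        rclone sync /srv/backups cloud-crypt:backups --progress
        rclone cryptcheck /srv/backups cloud-crypt:backups   # verify the encrypted copy against the local files

    With filename encryption enabled, the provider only ever sees encrypted names and contents.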

    Thanked by xetsys, lentro
  • TimboJones Member

    @Emil said:
    Hi,

    I'm responsible for the technical infrastructure and I just want to come in here and give a brief explanation of what we are changing on the infrastructure side of things.

    First of all I want to offer my apologies to the affected customers. We've had issues with a few of our older storage nodes. We can see it's limited to a specific combination of hard drive manufacturer and RAID card model. Some of the drives get kicked out of the array at the same time, the array turns read-only, throws out a lot of IO errors, and has to be hard rebooted. This can happen after being online for >1 year without issues. Even though the drives never actually 'failed', this sometimes causes inconsistency in the data, and while most of it is still intact, some files may be corrupted, which makes it difficult to boot up a functional operating system.

    What hard drive models were these? Sounds like consumer drives and not RAID-supported ones. It also sounds like your RAID cards are failing, possibly from heat. It sure would be good to know why the RAID card kicked them out; it should say in the logs. Since it kicks out good drives and causes corruption despite ECC memory, your RAID cards are very much suspect. The lack of root cause analysis is an issue.
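
    For anyone wanting to dig into this kind of failure themselves: on a typical Broadcom/LSI hardware RAID setup, the controller's own event log is where the answer lives (a sketch only; it assumes storcli and smartmontools are installed, and the controller/drive numbers are examples):

        storcli /c0 show events             # controller event log: when and why drives were dropped
        storcli /c0/eall/sall show all      # per-drive state, media error and predictive-failure counters
        smartctl -a -d megaraid,0 /dev/sda  # SMART data for a disk sitting behind the controller

    Other controllers have equivalents (perccli, ssacli, etc.), but the idea is the same.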

    Thanked by risharde
  • TimboJones Member
    edited May 2022

    @jackb said:

    @hosthatch said:

    @dosai said:
    @Emil communication during times like these is lacking. Have you thought of setting up a community channel to keep customers informed, like Discord (for example)?

    Unfortunately any comment we make in times of crisis is taken apart, and made into five different explanations, which is why we have to make sure we are very clear with any communication during such times. It gets to the point where people start pointing out basics to us like we're children who started doing this yesterday and have no idea how any of this works. So at some point, we have to decide how much time we should spend sharing information as compared to actually solving it.

    It sounds like the affected customers weren't notified at all?

    I'd agree with limiting certain details until you've reached a resolution or fully know the impact; but if you need to hard bounce a server I'd recommend sending an email out to let people know there was an outage and a hard reboot. In my experience the thing that annoys people the most is being completely in the dark.

    Correct. I immediately opened a ticket on March 19 and got the response:

    Hi,

    We are investigating this at the moment.


    Ticket ID: #399183

    And no further ticket responses or status updates.

    Thanked by risharde
  • jmgcaguicla Member
    edited May 2022

    @plumberg said:
    Hi, so I tried the same. Booted from the CentOS live CD, set my LUKS password, tried the rescue option, and I can't seem to find the disks to run fsck / xfs_repair... Any pointers? Thanks

    Stop using that dogshit fucking troubleshooter, grab a shell and do it yourself.

    It's literally just cryptsetup open /dev/vda3 my-root, inspect the unlocked container, then diagnose from there.
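
    Spelled out, that workflow looks roughly like this (a sketch; the device and mapping names follow the post above and may differ on your VM):

        cryptsetup open /dev/vda3 my-root     # prompts for the LUKS passphrase
        lsblk -f /dev/mapper/my-root          # see what is inside: LVM, XFS, etc.
        xfs_repair -n /dev/mapper/my-root     # dry run: report problems, write nothing

    If there is LVM inside the container, activate it first (vgchange -ay) and point xfs_repair at the logical volume instead.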

  • TimboJones Member

    @jmgcaguicla said:

    @plumberg said:
    Hi, so I tried the same. Booted from the CentOS live CD, set my LUKS password, tried the rescue option, and I can't seem to find the disks to run fsck / xfs_repair... Any pointers? Thanks

    Stop using that dogshit fucking troubleshooter, grab a shell and do it yourself.

    It's literally just cryptsetup open /dev/vda3 my-root, then diagnose from there.

    If he was doing full disk encryption, then he'd have backups. Isn't it just easier and faster to just restore from backups?

    Who has time to check and hope there isn't hidden corruption instead of reliable restoration?

  • jmgcaguicla Member
    edited May 2022

    @TimboJones said:
    isn't it just easier and faster to just restore from backups?

    Of course, but considering he is in this thread asking, I think you already know why.

  • Daniel15 Veteran
    edited May 2022

    @TimboJones said: Sounds like consumer drives

    To be fair, at their price point I don't expect everything to be enterprise-grade hardware.

    Although I also don't expect so much data to become corrupted, even on non-enterprise hardware. This is going to take me at least a week to fully restore things, I think. I've restored the backup of files on the server itself; now I have to verify all backups that are stored on the server, and finally re-enable all the backups and cron jobs (I turned them all off as soon as this issue happened).

    I have a few other servers that back up to this server using borgbackup, and those backups are then copied elsewhere for redundancy. Eventually I want to set up two entirely separate backups using two different apps across all my servers, but I've only done that for one server so far.

    Anyways, since I have the "good" copy of the backups, I can copy those back to this server to revert them to a known good state. I'm using rsync -avc; the -c option means it computes a checksum of every file on both sides and uses that to determine whether a file is different, rather than just using size and modification time. I've done this for one backup so far, and it ended up having to transfer ~1.6GB out of ~103GB, meaning around 1.5% of the backup was corrupt (actually slightly less, since that number includes rsync's metadata including all the checksums, but the metadata would be <100MB total).
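
    For anyone doing the same, the command is roughly as follows (the host and paths are placeholders; running with -n first for a dry run is a sensible extra step):

        rsync -avcn backupbox:/srv/backups/ /srv/backups/   # dry run: list files whose checksums differ
        rsync -avc  backupbox:/srv/backups/ /srv/backups/   # transfer only the files that actually differ

    Without -c, rsync would skip corrupted files whose size and mtime still match the good copy, which is exactly the failure mode here.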

    Thanked by lentro
  • @Daniel15 said:

    @TimboJones said: Sounds like consumer drives

    To be fair, at their price point I don't expect everything to be enterprise-grade hardware.

    You don't run hardware RAID without RAID-supported drives. That's been well known for a decade, since they began differentiating desktop and NAS drives. I'm not talking about the excessively priced Dell drives locked down to PERC models, I'm talking Red-brand and NAS-designed drives.

    Although I also don't expect so much data to become corrupted, even on non-enterprise hardware. This is going to take me at least a week to fully restore things, I think. I've restored the backup of files on the server itself; now I have to verify all backups that are stored on the server, and finally re-enable all the backups and cron jobs (I turned them all off as soon as this issue happened).

    I'm in the process of upgrading my home machine, which included a 10.4TB backup, and the f'n restore started at 1.2Gbps and dropped to 200Mbps. It's mostly from the 8x10TB RAID 10 initializing while restoring the 10TB. So much for having 10GbE NICs.

    I have a few other servers that back up to this server using borgbackup, and those backups are then copied elsewhere for redundancy. Eventually I want to set up two entirely separate backups using two different apps across all my servers, but I've only done that for one server so far.

    Anyways, since I have the "good" copy of the backups, I can copy those back to this server to revert them to a known good state. I'm using rsync -avc; the -c option means it computes a checksum of every file on both sides and uses that to determine whether a file is different, rather than just using size and modification time. I've done this for one backup so far, and it ended up having to transfer ~1.6GB out of ~103GB, meaning around 1.5% of the backup was corrupt.

    Both the HostHatch disasters and my home upgrade are really pushing me towards setting up a deal with a local friend to host each other's backups, so I can drive over, pick up the server, and bring it home if needed.

    Thanked by Daniel15
  • Daniel15 Veteran
    edited May 2022

    The thing that stops me from setting up a NAS or something similar at home is that I'm in California and electricity is very expensive. For me, the monthly cost of the HostHatch 10TB VPS ($240 / 2 years = $10 / month) is cheaper than just the electricity to run a PC/NAS with decent CPU power 24/7. I'm also limited to 40Mbps upload (thanks, Comcast) and there's no other internet options in my apartment building :disappointed:

    Thanked by lentro
  • risharde Patron Provider, Veteran

    @hosthatch said:

    (Cleared to reduce the response)

    @default said:
    One thing is for sure: HostHatch storage is not that safe. As a result, I have already marked my services for cancellation at the end of the billing period.

    I honestly do not wish to see another involucration drama, but it is what it is.

    If you actually really think your average storage VM provider is designing nodes better than us and you have less chance of data loss there, please feel free to move to them and do not let the door hit you on the way out.

    Do you really care if the door hits him on the way out? The way you answered your current customer seems rude (that statement is usually used in my country when someone is trying to be rude; not sure about yours).

    You seem skewed towards thinking your statistics are good since it's only 5% of customers in that location affected, but you are missing the main point: some of your current customers are now unhappy, AND someone in your company knew that the disks were hopping off the RAID a little after a year (which sounds like something you all knew about? And you guys are responsible since you knew of the possibilities).

    Are you waiting for this kind of setup to continue to corrupt other customers' data? Hopefully you have an action plan to avoid this, right? Like getting rid of that old config and provisioning new, better-designed storage for your old loyal customers? I mean, should I wait till next year, when I might be the unlucky one to lose data from my LA VPS with you?

  • corbpie Member

    RE Hosthatch wondering why a thread was made:

    Your 'customer service' is horrid. I have a ticket over a month old now with no response, about your broken panel preventing actions on a service. So this service cannot be used at present and the ticket is over a month old; what are you doing?

  • bdl Member

    @Daniel15 said:
    The thing that stops me from setting up a NAS or something similar at home is that I'm in California and electricity is very expensive. For me, the monthly cost of the HostHatch 10TB VPS ($240 / 2 years = $10 / month) is cheaper than just the electricity to run a PC/NAS with decent CPU power 24/7. I'm also limited to 40Mbps upload (thanks, Comcast) and there's no other internet options in my apartment building :disappointed:

    But 40Mbps upload is enough - NBN said so :D

    Thanked by corbpie
  • @bdl said:

    @Daniel15 said:
    The thing that stops me from setting up a NAS or something similar at home is that I'm in California and electricity is very expensive. For me, the monthly cost of the HostHatch 10TB VPS ($240 / 2 years = $10 / month) is cheaper than just the electricity to run a PC/NAS with decent CPU power 24/7. I'm also limited to 40Mbps upload (thanks, Comcast) and there's no other internet options in my apartment building :disappointed:

    But 40Mbps upload is enough - NBN said so :D

    NBN is a laugh.

    Thanked by equalz
  • plumberg Veteran

    @jmgcaguicla said:

    @plumberg said:
    Hi, so I tried the same. Booted from the CentOS live CD, set my LUKS password, tried the rescue option, and I can't seem to find the disks to run fsck / xfs_repair... Any pointers? Thanks

    Stop using that dogshit fucking troubleshooter, grab a shell and do it yourself.

    It's literally just cryptsetup open /dev/vda3 my-root, inspect the unlocked container, then diagnose from there.

    OK, this first step worked; I was able to unlock my LUKS container.
    I tried xfs_repair and it's throwing some other errors and asking me to mount the disk... I tried that too, but I get another error: "Structure needs cleaning". I need help, please.

  • darkimmortal Member
    edited May 2022

    @plumberg said:

    @jmgcaguicla said:

    @plumberg said:
    Hi, so I tried the same. Booted from the CentOS live CD, set my LUKS password, tried the rescue option, and I can't seem to find the disks to run fsck / xfs_repair... Any pointers? Thanks

    Stop using that dogshit fucking troubleshooter, grab a shell and do it yourself.

    It's literally just cryptsetup open /dev/vda3 my-root, inspect the unlocked container, then diagnose from there.

    OK, this first step worked; I was able to unlock my LUKS container.
    I tried xfs_repair and it's throwing some other errors and asking me to mount the disk... I tried that too, but I get another error: "Structure needs cleaning". I need help, please.

    At this point there is zero chance you can recover this to a working and trustworthy system. The best you can hope for is getting some files off and using them to start over on a fresh install.

    Take a backup of the raw block device over SSH, as the next step may well destroy it further. Then pass the -L parameter to xfs_repair, as it suggests.

    If there is some reason you can't take a backup, at a pinch you could use LVM snapshots (delete the swap LV if there is no free space).
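
    A rough sketch of that sequence (the host and file names are placeholders; the mapping name follows the earlier cryptsetup example):

        # image the unlocked container to another machine before any destructive repair
        dd if=/dev/mapper/my-root bs=4M status=progress | gzip | ssh user@backuphost 'cat > my-root.img.gz'

        # only after the image is safely stored, zero the XFS log as xfs_repair suggests
        xfs_repair -L /dev/mapper/my-root

    Expect to lose whatever was in the log; treat anything recovered afterwards as suspect and copy it off to a fresh install.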

  • TimboJones Member

    @hosthatch said:

    @dosai said:
    @Emil communication during times like these is lacking. Have you thought of setting up a community channel to keep customers informed, like Discord (for example)?

    Unfortunately any comment we make in times of crisis is taken apart, and made into five different explanations, which is why we have to make sure we are very clear with any communication during such times. It gets to the point where people start pointing out basics to us like we're children who started doing this yesterday and have no idea how any of this works. So at some point, we have to decide how much time we should spend sharing information as compared to actually solving it.

    Whoosh. When you don't fix problems in any sort of reasonable time frame - when the average user would have resolved the issue by now and you haven't - it's fucking natural to try and help you fix our shit faster.

    @xetsys said: It also sounds like the disc health monitoring isn't proper. As soon as there are IO errors, the situation warrants immediate disc replacement.

    In both cases (last month and this week), there were zero failed or failing disks or disks with errors. The RAID card kicked out multiple drives, which resulted in IO errors and the array becoming read-only which needed to be hard rebooted to bring it back online. We obviously keep spares in each location, and have monitoring to detect errors and replace drives before failures can happen......otherwise we would have far more failures per week, as compared to two failures this year and another couple in the past 8? years we've been selling storage VMs. Again, we have several PB of raw storage at the moment, and these failures represent a very small % of it. This thread seems to make it sound like every second node we have is at risk of failure; it is not.

    This doesn't make sense, btw. If disks were kicked out and the array went read-only (they'd be kicked out for I/O errors and then the I/O errors stop, not the other way around as you say), the data wouldn't be corrupt AF. Data that wasn't changed in a year was corrupted. That's bad writes right there. Like 5% of data or more.

    So I'm not sure you truly understand the problem. Your monitoring clearly didn't immediately detect catastrophic I/O errors, and you definitely had extensive bad writes, contrary to your read-only belief.

    Are we going to find out someone actually set these up in RAID 0 and not RAID 10? That actually fits the symptoms better than a battery-backed hardware controller - one that isn't considered defective and in need of replacement - having momentary fuck-ups randomly after a year.

    Thanked by foitin
  • Daniel15 Veteran
    edited May 2022

    So I finally restored most of the data on my affected VPS. I had already restored a backup of the most important files on the VPS, but today I was restoring the backups from other systems that are stored on this one, from a mirror stored on another server. I went through every backup that's stored on this server and used borg check to verify they're no longer corrupted. For some of the data, I only have a second copy of it locally (not "in the cloud") so I'll have to upload those again over my relatively slow 40Mbps upload (thanks, Comcast).

    Overall, some directories had no corrupted files at all, whereas others had as many as 5-10% of files corrupted. In total, maybe 4-5% of the files were corrupted. Maybe it's based on where the files are located on the physical disks.
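
    The verification step looks roughly like this per repository (the path is a placeholder; --verify-data is much slower than a plain check because it reads and checksums every chunk):

        borg check --verify-data /srv/backups/borg/myrepo   # verify repository metadata plus all archive data
        borg list /srv/backups/borg/myrepo                  # confirm the expected archives are still present

    borg check by itself validates repository and archive metadata; --verify-data additionally verifies the actual chunk data, which is what matters after silent corruption.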

  • default Veteran

    @Daniel15 - grab more storage offers, from different providers, so you can copy the data from one server to another when a provider gets involucrated. I also have 40Mbps bandwidth at home, and I use multiple servers with encrypted data on different providers.

    I try to stay away from cloud, due to privacy concerns. On VPS I have more control over what is encrypted and how.

  • xetsys Member
    edited May 2022

    @TimboJones said:

    Are we going to find out someone actually set these up in RAID 0 and not RAID 10? That actually fits the symptoms better than a battery-backed hardware controller - one that isn't considered defective and in need of replacement - having momentary fuck-ups randomly after a year.

    This is what I fear: saving bucks by setting them up as RAID 0 and letting nonexistent RAID cards take the blame for one whole year. I really hope that's not the case.

  • Daniel15 Veteran

    @default said: grab more storage offers, from different providers, so you can copy the data from one server to another when a provider gets involucrated.

    I've got other storage, however I'm making extensive use of HostHatch's private network to access the storage from some of their fast NVMe servers in the same location. That makes it more difficult to switch to another provider in case of issues.

    @default said: I try to stay away from cloud, due to privacy concerns. On VPS I have more control over what is encrypted and how.

    The only thing I'm using the public cloud for is a mirror of encrypted Borgbackup backups.

  • LiliLabs

    @Daniel15 said:
    I've got other storage, however I'm making extensive use of HostHatch's private network to access the storage from some of their fast NVMe servers in the same location. That makes it more difficult to switch to another provider in case of issues.

    The private net bit is fantastic, really wish more providers would start doing this.

  • Daniel15 Veteran
    edited May 2022

    @LiliLabs said:

    @Daniel15 said:
    I've got other storage, however I'm making extensive use of HostHatch's private network to access the storage from some of their fast NVMe servers in the same location. That makes it more difficult to switch to another provider in case of issues.

    The private net bit is fantastic, really wish more providers would start doing this.

    Servarica can add a private network on request, they just don't advertise it. Their bandwidth billing is done on their edge routers meaning any internal traffic (between any Servarica VPSes) is free, which removes one of the use cases for a private network (avoiding fees for transferring a lot of data between VPSes at the same provider).

    As far as I know, they own the entire data center, so they have more freedom in terms of what they can do. They also have fully redundant systems, AFAIK also including a warm spare for failover of their storage.

    The only reason I don't use them more is because the speeds between me near San Francisco and their data center in Montreal are not great. HostHatch's Los Angeles location works well enough (as well as Psychz can work) with low ping times, and I've mostly been happy with it.

  • @Daniel15 said:
    Servarica can add a private network on request, they just don't advertise it. Their bandwidth billing is done on their edge routers meaning any internal traffic (between any Servarica VPSes) is free, which removes one of the use cases for a private network (avoiding fees for transferring a lot of data between VPSes at the same provider).

    Interesting to know, I'll have to ticket and ask.

    As far as I know, they own the entire data center, so they have more freedom in terms of what they can do. They also have fully redundant systems, AFAIK also including a warm spare for failover of their storage.

    They don't own the DC, but they do have a private cage. They do own all their hardware to my knowledge. Their actual infra is really good, probably the best setup I've seen out of a LET provider.

  • wpyuel Member

    @hosthatch said:

    @dosai said:
    @Emil communication during times like these is lacking. Have you thought of setting up a community channel to keep customers informed, like Discord (for example)?

    Unfortunately any comment we make in times of crisis is taken apart, and made into five different explanations, which is why we have to make sure we are very clear with any communication during such times. It gets to the point where people start pointing out basics to us like we're children who started doing this yesterday and have no idea how any of this works. So at some point, we have to decide how much time we should spend sharing information as compared to actually solving it.

    @xetsys said: It also sounds like the disc health monitoring isn't proper. As soon as there are IO errors, the situation warrants immediate disc replacement.

    In both cases (last month and this week), there were zero failed or failing disks or disks with errors. The RAID card kicked out multiple drives, which resulted in IO errors and the array becoming read-only which needed to be hard rebooted to bring it back online. We obviously keep spares in each location, and have monitoring to detect errors and replace drives before failures can happen......otherwise we would have far more failures per week, as compared to two failures this year and another couple in the past 8? years we've been selling storage VMs. Again, we have several PB of raw storage at the moment, and these failures represent a very small % of it. This thread seems to make it sound like every second node we have is at risk of failure; it is not.

    We should totally do better on communication - but that is a separate problem compared to data safety.

    @digitalwicked said: Unfortunately it was an 'AMD EPYC 7551P 32-Core Processor' for this data loss. Was bulk migrated to it earlier in the year from an old E5v2.

    Sorry, but this is incorrect. No EPYC node has had this issue. And we've been building the newer 'version' of our nodes for the last few years with E5v2, since far before we started using EPYC. Please verify your information before posting it here, as there is already a lot of misinformation in this thread without needing to make up more out of thin air.

    @default said:
    One thing is for sure: HostHatch storage is not that safe. As a result, I have already marked my services for cancellation at the end of the billing period.

    I honestly do not wish to see another involucration drama, but it is what it is.

    If you actually really think your average storage VM provider is designing nodes better than us and you have less chance of data loss there, please feel free to move to them and do not let the door hit you on the way out.

    I am with you on the support on these storage plans - it is quite bad and should be better. But please ask your new provider how they would have resolved this issue any better, and what types of redundancies they have against such failures? I am all ears. No drama - I am truly interested in learning this valuable information.

    @darkimmortal said: Are all old nodes affected by this?

    No, if we go by the failures last month in Chicago and this week in LA (i.e. the same RAID and hard drive company), then we have a couple more nodes like that. We planned to migrate people to our newer 'version' of nodes anyway; we will just do it ASAP for some people. But in no way are "all" nodes, or anything even close to that, affected, and note that both of these nodes were running for years without any issues.


    To clarify Emil's response above:

    There is no known issue that would result in data loss.

    We design our newer nodes to be better than the ones before; it does not mean the old ones are faulty, just that the new ones are better. Creating multiple smaller RAID arrays on different RAID cards mitigates risk, and it also speeds up rebuild times, etc. That does not mean having a single large array implies a definite potential for failure, considering that is what most providers do with their storage nodes anyway.

    I am truly sorry for the users who were affected by this, we are going to do our best to make sure it does not happen to you or anyone else again. With that said, please keep multiple copies of your backups.

    My LA VPS has been unavailable for more than three days.

  • @wpyuel said:
    My LA VPS has been unavailable for more than three days.

    Thanks for quoting 60 lines of text for a one-sentence reply.

    Thanked by bdl
  • Logano Member
    edited May 2022

    @hosthatch said:
    we have to decide how much time we should spend sharing information as compared to actually solving it.

    Seeing as how you spent zero time "sharing information as compared to actually solving it", I guess you see zero importance in timely customer notification because

    we have several PB of raw storage at the moment, and these failures represent a very small % of it

    I guess this part says it all:

    @hosthatch said:
    Not sure why there needs to be a thread, but sure :)

    5 users at a location? Okay, I can agree. But 5%?

    You don't need to write up a 10 paragraph status notification for these things until you've tried to solve an issue and learned that you can't, at which point the customers deserve one since they'll have to clean up after your own failures (which you ended up doing here, I'm not criticizing that). A simple one-liner in your status page saying "We are investigating issues with some storage nodes on XXX" within 15 minutes of seeing the issue helps a lot. Then another one-liner update a few hours later. It takes 30 seconds, which is probably less time than you spend trying "to decide how much time we should spend sharing information....". Simply refusing to communicate will just bring frustrated people to LET since your support seems to be a big cobwebbed black hole.

    UPDATE: Examples from RamNode
    https://twitter.com/NodeStatus
    http://status.ramnode.com

  • @Logano said:

    You don't need to write up a 10 paragraph status notification for these things until you've tried to solve an issue and learned that you can't, at which point the customers deserve one since they'll have to clean up after your own failures (which you ended up doing here, I'm not criticizing that). A simple one-liner in your status page saying "We are investigating issues with some storage nodes on XXX" within 15 minutes of seeing the issue helps a lot. Then another one-liner update a few hours later. It takes 30 seconds, which is probably less time than you spend trying "to decide how much time we should spend sharing information....". Simply refusing to communicate will just bring frustrated people to LET since your support seems to be a big cobwebbed black hole.

    UPDATE: Examples from RamNode
    https://twitter.com/NodeStatus
    http://status.ramnode.com

    Genuinely amazed there's no status page for HH yet.
