
HostHatch Los Angeles storage data corruption (Was: HostHatch Los Angeles storage down)


Comments

  • risharde Patron Provider, Veteran

    @Daniel15 said:

    @risharde said: Since op seemed to have recovered from the issue

    Working on it... there's some data corruption.

    @darkimmortal said: and use another tool to compare.

    Since the backup is encrypted, it'd have to download and decrypt all the data to compare it. At that point I may as well just restore the entire backup anyways. I realised that I actually only have ~1TB in the backup as most of the space used on the server is actually backups from other systems where I can just use borg check to check it.

    Oh man, good luck, hope it works out. The word "corruption" is my greatest nightmare! Thankfully my single project at the moment does both backups and replication - hope it works as intended!
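The `borg check` verification mentioned above can be run like this (the repository path is a placeholder; `--verify-data` decrypts and checksums every chunk, so it is slow but thorough):

```shell
# Quick structural check of repository and archive metadata.
# /path/to/repo is a placeholder for the actual Borg repository.
borg check /path/to/repo

# Full verification: reads, decrypts, and checksums all data chunks.
borg check --verify-data /path/to/repo
```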

  • plumberg Veteran

    @jmgcaguicla said:

    @plumberg said:
    How does one fix the same? I'm a noob at this, so I'd gladly appreciate some help.

    CentOS 7 with an FDE setup. I entered the LUKS password and it spat out some weird errors. Not sure how to proceed. Thanks

    What's the error?

    If it's a damaged LUKS header, you'd better kiss that data goodbye; you're not getting any of it back unless you have a good copy of the header.

    This is what I was able to grab... Any suggestions/ help to fix? Thanks
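As a precaution against exactly this failure, a copy of the LUKS header can be kept while the disk is healthy (device and backup paths below are placeholders):

```shell
# Save the LUKS header somewhere off the affected disk.
cryptsetup luksHeaderBackup /dev/vda2 \
    --header-backup-file /safe/location/vda2-luks-header.img

# If the on-disk header is later damaged, restore it:
cryptsetup luksHeaderRestore /dev/vda2 \
    --header-backup-file /safe/location/vda2-luks-header.img
```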

  • Daniel15 Veteran
    edited May 2022

    For some reason the connection between HostHatch and Servarica (where one of my backups is stored) caps out at ~90Mbps even though it gets ~600Mbps from another server in LA (blaming Psychz for that), plus HostHatch's disks are a bit slow on this server, so this restore is taking a while. I started restoring ~1TB of files a few hours ago and it's 20% complete so far.

    Most files I've looked at seem fine though, so the restore is just in case some random files on disk got corrupted and I didn't notice.

    @plumberg said: This is what I was able to grab... Any suggestions/ help to fix? Thanks

    Are you booting into the system or into a live CD?

    I'm not familiar with XFS but maybe someone else has an idea.

  • Daniel15 Veteran

    @darkimmortal said: Looks like another case of corruption of data at rest... very weird failure mode. Maybe understandable if that systemd log data was recent, but "Jul 14"?

    Yeah it's bizarre! I would have checked what's in the systemd journal at that point, but it's also corrupted

    root@la04:/# journalctl --until='2021-07-15'
    Journal file /var/log/journal/4142ede3cca34ebc81b1e6d60a08e87a/[email protected]~ corrupted, ignoring file.
    -- Journal begins at Mon 2021-09-27 23:43:15 PDT, ends at Thu 2022-05-05 18:52:12 PDT. --
    -- No entries --
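systemd renames journal files it considers damaged with a `~` suffix and skips them; they can be checked and, if unrecoverable, removed (a sketch, using the path from the error above):

```shell
# Check all journal files for internal consistency.
journalctl --verify

# Corrupted journals are renamed with a '~' suffix and ignored;
# if --verify confirms they are unrecoverable, remove them:
rm /var/log/journal/4142ede3cca34ebc81b1e6d60a08e87a/*.journal~
```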
    
  • plumberg Veteran

    @Daniel15 said:

    @plumberg said: This is what I was able to grab... Any suggestions/ help to fix? Thanks

    Are you booting into the system or into a live CD?

    I'm not familiar with XFS but maybe someone else has an idea.

    I think this is from the live rescue upon bootup of the VM... Running CentOS 7 with FDE

  • Daniel15 Veteran
    edited May 2022

    @plumberg said:

    @Daniel15 said:

    @plumberg said: This is what I was able to grab... Any suggestions/ help to fix? Thanks

    Are you booting into the system or into a live CD?

    I'm not familiar with XFS but maybe someone else has an idea.

    I think this is from the live rescue upon bootup of the VM... Running CentOS 7 with FDE

    Booting from the OS won't work well if the partition containing /sbin and /usr/sbin (usually the root partition) is corrupted in any way, since some of the mount/fsck tools may themselves be corrupted. You might need to boot from a CentOS live CD and go into a recovery mode. That's what I did initially: since I run Debian, I mounted the latest Debian ISO, booted into the rescue mode available under "advanced options" in the installer boot menu, chose not to mount anything, and ran fsck from the terminal it started.
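The rescue-mode procedure described above boils down to something like this, run from the installer's rescue shell with nothing mounted (the device name is a placeholder):

```shell
# Identify partitions and filesystem types first.
lsblk -f

# Force a full check of the root filesystem (ext4 assumed here).
fsck.ext4 -f /dev/vda1
```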

  • msallak1 Member

    Anyone know how to fix this error? This happened after shutting down the system and booting it up again from the website

  • xetsys Member

    I will probably ditch my storage VPS with HH and move over to Hetzner, considering they are far more reliable and transparent. I just wonder why this particular node went down. Was it a disk failure, a power failure, or something else? And how has that affected user data? Shouldn't the host let users know how much data is affected, whether the drives failed or were replaced, etc.? I thought the drives were in a RAID configuration, which should help in case of drive failure.

    Thanked by darkimmortal, foitin
  • @plumberg said:
    I think this is from the live rescue upon bootup of the VM... Running CentOS 7 with FDE

    You sure this is from a LiveCD/rescue? This looks like the initrd boot process (mounting root into /sysroot then switch_root-ing into it).

    Try to boot into a LiveCD, unlock the LUKS volume then fsck the XFS volume.
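From a live CD, that sequence looks roughly like this (device names are placeholders):

```shell
# Unlock the LUKS container; prompts for the passphrase.
cryptsetup open /dev/vda2 cryptroot

# xfs_repair must run against an unmounted filesystem.
xfs_repair /dev/mapper/cryptroot

# If the metadata log itself is corrupt, -L zeroes it as a last
# resort (recent metadata updates may be lost):
# xfs_repair -L /dev/mapper/cryptroot
```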

  • Daniel15 Veteran

    @msallak1 said:
    Anyone knows how to fix this error? this happened after shutting down system and booting it up again from website

    It means your kernel image (/boot/vmlinuz-*) is corrupt. See if there's an older kernel available in the GRUB menu. Otherwise you might have to boot into a live CD and copy a clean version of the kernel across. Which distro are you using?
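Copying a clean kernel across usually means chrooting into the damaged system from the live CD; a sketch for a Debian/Ubuntu guest (the device and package names are placeholders and vary by distro):

```shell
# Mount the damaged root and the pseudo-filesystems, then chroot in.
mount /dev/vda1 /mnt
mount --bind /dev /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys /mnt/sys
chroot /mnt

# Inside the chroot: reinstalling the kernel package also rewrites
# /boot/vmlinuz-* and regenerates the initramfs.
apt-get install --reinstall linux-image-amd64
```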

  • sanvit Member

    Getting this error, and googling it tells me to check the RAM. Could this be a host node issue?

    Thanked by crpatel
  • Daniel15 Veteran

    @hosthatch Given the number of people reporting corrupted data, can you please clarify what "an issue" refers to in your email (what was the actual issue and how was it resolved), why it had to be a hard reboot, and why it has resulted in data corruption?

    Received ~20 hours ago:

    Subject: [HostHatch] STOR2.LAX update

    Hello,
    Earlier today, we detected an issue with the storage node called "STOR2.LAX". A hard reboot was completed to boot the node back into the OS, however it took some time to complete this as a hard reboot on a large system generally causes forced file system checks (fsck).

    While most VMs are back online, we noticed that some VMs are stuck during the boot process requiring further troubleshooting (such as fsck, or further repairs using rescueCD). To do this, please login to the noVNC console available inside our control panel.

    If you require any assistance from us, please reply to this email and we will do our best to help you as soon as possible.

    Kindest Regards,
    Your HostHatch team

  • sanvit Member

    @xetsys said:
    I will probably ditch my storage VPS with HH and move over to Hetzner, considering they are far more reliable and transparent. I just wonder why this particular node went down. Was it a disk failure, a power failure, or something else? And how has that affected user data? Shouldn't the host let users know how much data is affected, whether the drives failed or were replaced, etc.? I thought the drives were in a RAID configuration, which should help in case of drive failure.

    I'm thinking something similar right now too. There are too many issues without proper communication.

  • This sucks, I've been a happy customer of HH for a while and have multiple storage VMs in each location. I wish they'd be more transparent about outages like this, not only does it suck for endusers but it's also terrible PR for them. I still love my services and I got them cheaply enough to where I can just replicate across locations, which is fantastic (and what they recommended doing in another thread). I'm glad the email eventually went out (a bit late), but it's really troubling how they haven't acknowledged the customers who are experiencing corruption. I'll probably still be a happy customer, but at the same time I'll also be on the lookout for storage VMs from other providers.

  • @hosthatch said:
    Not sure why there needs to be a thread, but sure :)

    We're working on it.

    Are you fucking serious?

    Because you guys are fucking defective in terms of network status updates or replying to tickets.

    (Never received a ticket update on the affected Chicago server after the initial "we're looking into it", and I've only now seen a new server provisioned on a new node, but without a private network interface.)

    So, next time you're not sure why such a thread exists, just pull up your Network Status page. Oh wait...

  • I'll crack the shits if this happens in Sydney. I've been rather content with my 2-year deal, but it's up for renewal this year, and the last time I used it, it felt very slow.

  • Daniel15 Veteran

    Something I still don't quite understand is that normally if it was just an unexpected power-off (for example if the host kernel panics or halts), in theory just the files that were in the write buffer/cache at the time should have the potential to get corrupted, if they weren't fsync'd to disk yet. Often those files can be repaired using the journal, if it's a journaling file system like ext4.

    However, I'm seeing corruption in files that haven't been modified in over a year, and other people in this thread have had their Linux kernel image corrupted, which isn't something that's updated that often.

    Seems like it's something deeper than just that? Maybe HDD / RAID failure as well...

  • angstrom Moderator

    Modified title based on OP's request

  • @angstrom said:
    Modified title based on OP's request

    Ooof.

  • lol

  • digitalwicked Member

    I've got an error on mine as well, 'the root filesystem on /dev/vda1 requires a manual fsck', when booting... this does not look promising!

  • @angstrom said:
    Modified title based on OP's request

    Interesting, @Nekki had rightly predicted that title edits would be reserved for Host Reps and it’s interesting to see it play out.

  • Daniel15 Veteran

    Restoring from backup (another storage VPS, via Borgbackup) is still network-IO-bound, going at ~90Mbps, so it's taking a loooong time. 45% done so far.

    So far so good, but now I'm thinking about what if my backups are corrupt... so I think after this I'm going to set up two different backup systems (eg. borg and duplicacy) to two different servers, as right now I just backup to another storage VPS then sync that backup to "cloud storage" (pCloud).
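A minimal sketch of that two-tool, two-destination setup (hosts, repository URLs, and paths are placeholders; both repositories are assumed to be initialised already):

```shell
# Backup 1: Borg to the first storage server.
export BORG_REPO=ssh://user@backup-a.example.com/./repo
borg create --stats ::'{hostname}-{now}' /data

# Backup 2: Duplicacy to a second, independent destination
# (configured earlier with `duplicacy init` in /data).
cd /data && duplicacy backup -stats
```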

    @digitalwicked said:
    I've got an error on mine as well 'the root filesystem on /dev/vda1 requires a manual fsck' when booting... this does not look promising!

    That's what I saw originally, and I got it into a bootable state by booting off a Debian installer CD, going to advanced → rescue mode and running fsck.ext4 (e.g. fsck.ext4 /dev/vda1) in the rescue shell. Of course the command will be different if you use something other than ext4.

    @stevewatson301 said: @Nekki had rightly predicted that title edits would be reserved for Host Reps

    but I'm not a host rep? lol
    I just clicked the flag link and asked a mod to change the title

  • angstrom Moderator

    @stevewatson301 said:

    @angstrom said:
    Modified title based on OP's request

    Interesting, @Nekki had rightly predicted that title edits would be reserved for Host Reps and it’s interesting to see it play out.

    Well, I take the OP (= @Daniel15 ) to be an upright guy, and I think that he wanted a more accurate/transparent title

    (This is not to say that any title-change request by any OP in any thread will be honored)

    Thanked by drunkendog
  • Their website is still the old one.

  • Daniel15 Veteran
    edited May 2022

    @angstrom said: I think that he wanted a more accurate/transparent title

    That's right. My original thought was that it was some "regular" downtime where everything comes back fine, e.g. the server had to be rebooted, the network went out, etc. Now that I know that it wasn't a regular outage and there's actual data corruption, I wanted the title to be more accurate :smile:

    Now the title is kinda similar to the post about a similar issue in Chicago (https://lowendtalk.com/discussion/comment/3395419).

    Thanked by angstrom
  • caracal Member

    Just gonna re-install and not trust previous data state...

  • joelby Member

    I spent most of the day recovering from issues similar to others in this thread :( I had the xz-compression error. I ended up with 300,000 entries in lost+found, which wasn't much fun! Here's a general outline of how I handled it:

    • I use Debian, so mount a Debian ISO with the same major version and change the boot order to the ISO
    • Reboot server. In Debian installer, proceed to rescue mode (non-graphical)
    • Don't mount the disks - you want to fsck them first. You might need to do some exploration (e.g. lsblk, lvm equivalent) to find all of your partitions and fsck them.
    • Once they're fscked (which I left running overnight), you can restart the rescue installer but mount the partitions and launch the shell there. Effectively, this mounts your root partition and chroots into it. You also need to mount your /boot partition, and a few others e.g. devpts
    • I had lots of corrupt files all over the place, including apt lists. I could run some commands, but others caused segfaults. My main aim was to be able to run dpkg --verify so that I could validate APT packages and reinstall them if necessary. This was a torturous iterative process which involved finding corrupt library files with md5sum and copying them from another VM running the same version of Debian (fortunately networking and wget were working). Eventually I was able to get dpkg to work, though I had to delete (or copy over) some apt info files in /var/lib/dpkg/info.
    • Then I kept running dpkg --verify, identifying the source package of corrupted files using dpkg -S , then apt reinstall
    • One of the corrupted packages was the Linux kernel - reinstalling that also rebuilds the initramfs image, which will fix the xz boot issue
    • Once dpkg --verify only returns 'known' mismatches such as manually modified config files, you can shut down, change boot order, unmount the ISO, then reboot - hopefully it will work at this point!
    • Once the VM starts, deal with corrupted data issues which are affecting applications. In my case my Nextcloud installation was broken, which I fixed by restoring the Postgres database from a backup (fortunately I had one), reinstalling the Nextcloud PHP files of the same version over the top, and then restoring all of the corrupted data files (fortunately I use b2 sync to copy everything to Backblaze - after deleting the bad local copies, I could reverse the sync to pull down all files which were missing)
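The verify/reinstall loop above can be sketched as follows (the library path and package name are placeholders):

```shell
# List files whose on-disk checksum no longer matches dpkg's database.
dpkg --verify

# Find which package owns a corrupted file...
dpkg -S /usr/lib/x86_64-linux-gnu/libfoo.so.1

# ...and reinstall it to pull down pristine copies.
apt reinstall somepackage

# Repeat until dpkg --verify reports only expected mismatches
# (e.g. hand-edited configuration files).
```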
  • @hosthatch said:
    Not sure why there needs to be a thread, but sure :)

    We're working on it.

    Are you serious?

    I've received no email or notification of the data corruption event. Had I not seen this thread, I would have continued to assume my backups were safe. SNMP continued to report as okay up until I checked that the filesystem had become read-only. On reboot I received 'the root filesystem on /dev/vda1 requires a manual fsck.'

    I've recommended people to HostHatch and have bought many services. Your responses on Chicago and now LA storage have been underwhelming, to say the least. I know this is lowend, but I've still bought $500+ worth of HostHatch VPSes in the last 12 months, which I'm equally happy to spend with other providers I use.

    Thanked by Logano
  • @stevewatson301 said:

    @angstrom said:
    Modified title based on OP's request

    Interesting, @Nekki had rightly predicted that title edits would be reserved for Host Reps and it’s interesting to see it play out.

    404 - Drama not found.

    Thanked by Not_Oles