Help with Data Recovery for a Xen Server VM

randvegeta · October 2024

Hello all.

I've got an issue with an old XenServer 6 cluster that uses iSCSI NAS storage. Only 1 VM is affected and I just can't figure out why.

We have a NAS in Hong Kong that is seemingly running fine, but has recently had VMs show I/O errors. I say "seemingly running fine" because the NAS itself reports no errors, and the SMART tests show the disks as running normally. So if you look only at the NAS there is nothing to indicate any problems. But all the virtualization nodes that use this NAS show I/O errors.

We rebooted the NAS and bam, everything came back online. Didn't know what to make of it and just let it continue running. Some VMs booted up with FS errors, which were fixed with a quick fsck. Everything seemed ok pretty shortly after.

4 days later, same issue with the NAS. OK, so now we really think there must be some disk problems, or some other problems, and we need to start migrating everything off this NAS. Reboot, and everything comes back online, same as before. We start migrating and figure that if the NAS can last 3-4 days, that should be enough time to finish the job.

8 hours later the NAS fails again. OK, the problem is getting more serious, so we shut down all the VMs and migrate more rapidly. After ~10 hours, most of the VHDs have been migrated and things are mostly going smoothly, except for 1 particular VM.

The only difference I can see between this VM and the other VMs is that this one has had a couple of snapshots. And even when all the other VMs can boot and seemingly operate normally, this one continues to have I/O issues. We cannot clone the VHD or migrate the VHD. When attempting to migrate, we get end-of-file errors reported. When we try to boot, it gets to grub and starts to boot CentOS 7 but gets stuck at mounting vg-root. If we boot the VM into sysrescue and mount vg-root, we get I/O errors when attempting to list a directory.

This is quite an important VM and the data is invaluable. There's not that much data to recover. The VHD is about 150GB, but we only really need to recover the SQL database stored, which is only a few 10s or maybe 100s of MB.

If anyone can offer assistance, I'm happy to pay a good amount of money if that can be recovered.

The NAS has a RAID 10 array. No other data seems to be lost so I don't think there's actually any data lost from the disks themselves. I think the issue is related to data corruption, and the snapshots themselves being corrupted. Ironic as the snapshots were taken to ensure we could recover the data.

Thanks in advance.

AlexBarakov · October 2024

Before attempting to do recovery - do a block level copy of whatever is left there. When we had to recover multiple failed drives in RAID arrays, we cloned HDDs block by block to a known good media. Ideally, if someone is doing recovery, they would start from the image you've made and then proceed to actually messing with the original source media.

randvegeta · October 2024

@AlexBarakov said:
Before attempting to do recovery - do a block level copy of whatever is left there. When we had to recover multiple failed drives in RAID arrays, we cloned HDDs block by block to a known good media. Ideally, if someone is doing recovery, they would start from the image you've made and then proceed to actually messing with the original source media.

That's exactly what I've tried. First, when booting into Sysrescue, I tried to dd xvda to xvdb for example, but there were IO issues when attempting that and it would stop after about 4GB.

Moving up into the virtualization node itself, for the life of me, I can't find the LV for the VHD. I can see where it's supposed to be but it's not there. If I could find it, then I would try and DD the LV and then play with the copy.
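A hedged sketch of where the LV would normally be, in case it helps anyone following along. It assumes the SR is an LVM-over-iSCSI (LVHD) SR, where the volume group is usually named VG_XenStorage-<SR UUID> and each VDI is an LV named VHD-<VDI UUID>. These LVs are normally left inactive while the VM is shut down, which can make them look like they are missing under /dev. The helper name lv_path_for_vdi and the UUID placeholders are made up for illustration.

```shell
#!/bin/sh
# Build the expected LV device path for a VDI on an LVHD SR.
# Assumption: the usual VG_XenStorage-<sr-uuid>/VHD-<vdi-uuid> naming applies.
lv_path_for_vdi() {
    sr_uuid=$1
    vdi_uuid=$2
    printf '/dev/VG_XenStorage-%s/VHD-%s\n' "$sr_uuid" "$vdi_uuid"
}

# Rough order of operations on the XenServer host (commented out on purpose;
# <...> values are placeholders that must come from your own pool):
#
#   xe vdi-list name-label=<disk name> params=uuid,sr-uuid   # find the UUIDs
#   lvs VG_XenStorage-<sr-uuid>                              # list LVs, active or not
#   lvchange -ay "$(lv_path_for_vdi <sr-uuid> <vdi-uuid>)"   # activate the LV
#   ...copy the LV somewhere safe, then work on the copy...
#   lvchange -an "$(lv_path_for_vdi <sr-uuid> <vdi-uuid>)"   # deactivate again
#
# Because this VM has snapshots, the disk is likely a chain of VHDs (leaf plus
# parents), so every LV in the chain would need copying. The chain can usually
# be inspected with something like:
#   vhd-util scan -f -m "VHD-*" -l VG_XenStorage-<sr-uuid> -p
```

The LV only shows up as a device node after activation, so `lvs` is the more reliable way to check whether it exists at all.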

jackb · October 2024

Are these spinning disks? If so do they have built in TLER/similar error recovery control?

If yes to spinning and no to TLER, you can set error recovery manually. Usually I'd expect pending sectors if this is the problem, but I'm a little suspicious it might be anyway.

To show the current error time:

smartctl -l scterc /dev/sda

And to set the error time (to 7 seconds):

smartctl -l scterc,70,70 /dev/sda

My suspicion is you've got a drive with a few bad sectors and read attempts on those sectors hang forever. I've seen it before more than once.
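If you want to apply that to every member disk in one go, something like the following sketch might do. The device names are assumptions (use whatever your NAS exposes), and the function name set_erc is just for illustration. On many drives the setting does not survive a power cycle, so it may need reapplying after a reboot.

```shell
#!/bin/sh
# Allow the smartctl binary to be overridden (useful for a dry run with echo).
SMARTCTL=${SMARTCTL:-smartctl}

# Set read/write error recovery to 7.0 seconds (units are 100 ms) on each
# disk given as an argument, then print the resulting setting.
set_erc() {
    for disk in "$@"; do
        "$SMARTCTL" -l scterc,70,70 "$disk" || echo "could not set ERC on $disk" >&2
        "$SMARTCTL" -l scterc "$disk"
    done
}

# Example (hypothetical device names):
#   set_erc /dev/sda /dev/sdb /dev/sdc /dev/sdd
```

Drives that do not support SCT ERC will report that when queried, which also tells you which ones are the likely culprits.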

rafaelscs · October 2024

Try HDD Regenerator (https://www.dposoft.net/) on all disks.

randvegeta · October 2024

@jackb said: Are these spinning disks? If so do they have built in TLER/similar error recovery control?

Indeed, these are spinning disks.

Looks like only 2 of my 4 disks have TLER (we use different brands for the RAID 10 so we don't end up with a bad batch failing simultaneously).

@jackb said: My suspicion is you've got a drive with a few bad sectors and read attempts hang forever. I've seen it before more than once.

That is my suspicion also, but I am hoping it hasn't meant the loss of all the data on the VM.

jackb · October 2024

@randvegeta said:

@jackb said: Are these spinning disks? If so do they have built in TLER/similar error recovery control?

Indeed, these are spinning disks.

Looks like only 2 of my 4 disks have TLER (we use different brands for the RAID 10 so we don't end up with a bad batch failing simultaneously).

Every time I've seen this problem before it's been one drive at a time, but I suppose there's nothing to stop two drives having the same problem on a sector and its mirror.

With software RAID I'd be tempted to try starting the array degraded and read-only. I assume not, but is that an option with your NAS?

jmgcaguicla · October 2024

@randvegeta said:
I tried to dd xvda to xvdb for example, but there were IO issues when attempting that and it would stop after about 4GB.

The best chance of getting a block-level copy is to add conv=sync,noerror to dd. That tells it to carry on after a read error and pad the unreadable sectors with nulls, so it doesn't die when it hits a bad block. Just pray the rot hasn't hit anything important (e.g. the header for a LUKS container if it's encrypted).
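A minimal sketch of what that could look like. The device paths in the usage comment are only examples taken from the earlier post, and the function name rescue_copy is made up. A small block size is used deliberately: with conv=sync,noerror each failed read is replaced by one whole block of nulls, so a big bs throws away more good data around each bad sector.

```shell
#!/bin/sh
# Copy src to dst, continuing past read errors.
#   noerror : do not stop when a read fails
#   sync    : pad every short or failed read up to bs with NUL bytes,
#             so later data stays at the correct offset in the copy
rescue_copy() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=4096 conv=sync,noerror
}

# Example from inside the rescue environment (hypothetical devices):
#   rescue_copy /dev/xvda /dev/xvdb
```

Note that sync also pads the final block, so the copy can come out slightly larger than the source if the source size is not a multiple of 4096.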

zorker · October 2024

Maybe you just have bad RAM in that NAS. Is it replaceable, or could you just move the drives to another similar unit?
