Help with Data Recovery for a Xen Server VM

randvegeta Member, Host Rep

Hello all.

I've got an issue with an old XenServer 6 cluster with iSCSI NAS storage. Only 1 VM is affected and I just can't figure out why.

We have a NAS in Hong Kong that is seemingly running fine, but has recently had VMs show I/O errors. I say "seemingly running fine" because the NAS itself reports no errors, and the SMART tests show the disks as running normally. So if you look only at the NAS there is nothing to indicate any problems. But all the virtualization nodes that use this NAS show I/O errors.

We rebooted the NAS and bam, everything came back online. Didn't know what to make of it and just let it continue running. Some VMs booted up with FS errors, which were fixed with a quick fsck. Everything seemed ok pretty shortly after.

4 days later, same issue with the NAS. OK, so now we really think there must be some disk problems, or some other problems, and we need to start migrating everything off this NAS. Reboot, and everything comes back online, same as before. We start migrating and figure that if the NAS can last 3-4 days, that should be enough time to finish the job.

8 hours later the NAS fails again. OK, the problem is getting more serious, so we shut down all the VMs and migrate more rapidly. After ~10 hours, most of the VHDs have been migrated and things are mostly going smoothly, except for 1 particular VM.

The only difference I can see between this VM and the other VMs is that this one has had a couple of snapshots. And even when all the other VMs can boot and seemingly operate normally, this one continues to have I/O issues. We cannot clone or migrate the VHD. When attempting to migrate, we get end-of-file errors reported. When we try to boot, it gets to grub and starts to boot CentOS 7 but gets stuck at mounting vg-root. If we boot the VM into sysrescue and mount vg-root, we get I/O errors when attempting to list a directory.

This is quite an important VM and the data is invaluable. There's not that much data to recover. The VHD is about 150GB, but we only really need to recover the SQL database stored, which is only a few 10s or maybe 100s of MB.

If anyone can offer assistance, I'm happy to pay a good amount of money if that can be recovered.

The NAS has a RAID 10 array. No other data seems to be lost so I don't think there's actually any data lost from the disks themselves. I think the issue is related to data corruption, and the snapshots themselves being corrupted. Ironic as the snapshots were taken to ensure we could recover the data.

Thanks in advance.

Comments

  • AlexBarakov Patron Provider, Veteran

    Before attempting any recovery, do a block-level copy of whatever is left there. When we had to recover multiple failed drives in RAID arrays, we cloned the HDDs block by block to known-good media. Ideally, whoever does the recovery would start from the image you've made and only then proceed to actually messing with the original source media.

  • randvegeta Member, Host Rep

    @AlexBarakov said:
    Before attempting any recovery, do a block-level copy of whatever is left there. When we had to recover multiple failed drives in RAID arrays, we cloned the HDDs block by block to known-good media. Ideally, whoever does the recovery would start from the image you've made and only then proceed to actually messing with the original source media.

    That's exactly what I've tried. First, after booting into sysrescue, I tried to dd xvda to xvdb, but there were I/O errors and it would stop after about 4GB.

    Moving up to the virtualization node itself, for the life of me, I can't find the LV for the VHD. I can see where it's supposed to be, but it's not there. If I could find it, I would dd the LV and then play with the copy.
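
A possible reason the LV seems to be missing: on LVM-based storage repositories the VHD volumes are typically left inactive while the VM is not running, so no device node shows up under /dev until the LV is activated. The sketch below only prints commands for review and runs nothing. It assumes the usual naming scheme (volume group VG_XenStorage-<SR UUID>, logical volume VHD-<VDI UUID>), which should be checked against the output of lvs on the host. The function names and the /mnt/rescue target are hypothetical, and the two UUIDs would come from xe vdi-list.

```shell
#!/bin/sh
# Assumption: LVM-based SRs name the volume group VG_XenStorage-<SR UUID>
# and each VHD logical volume VHD-<VDI UUID>. Verify with lvs before relying on it.
vhd_lv_path() {
    sr_uuid=$1
    vdi_uuid=$2
    echo "/dev/VG_XenStorage-${sr_uuid}/VHD-${vdi_uuid}"
}

# Print the commands to list, activate, and image the LV. Nothing is executed,
# so the steps can be reviewed before running them as root on the host.
# /mnt/rescue is a hypothetical mount point on known-good storage.
show_lv_steps() {
    lv=$(vhd_lv_path "$1" "$2")
    echo "lvs VG_XenStorage-$1"
    echo "lvchange -ay $lv"
    echo "dd if=$lv of=/mnt/rescue/VHD-$2.img bs=4096 conv=noerror,sync"
}
```

Because this VM has snapshots, its disk is likely spread over a chain of parent and child VHD volumes, so copying a single LV may not capture everything; every VHD-* volume in that chain would need the same treatment.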

  • jackb Member, Host Rep
    edited October 2024

    Are these spinning disks? If so do they have built in TLER/similar error recovery control?

    If yes to spinning and no to TLER, you can set error recovery manually. Usually I'd expect pending sectors if this is the problem, but I'm a little suspicious it might be anyway.

    To show the current error time:

    smartctl -l scterc /dev/sda

    And to set the error time (to 7 seconds):

    smartctl -l scterc,70,70 /dev/sda

    My suspicion is you've got a drive with a few bad sectors and read attempts on those sectors hang forever. I've seen it before more than once.
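
For reference, the two numbers in scterc,70,70 are the read and write time limits in tenths of a second, so 70 means 7.0 seconds. Below is a minimal sketch of a wrapper that builds that argument from whole seconds and applies it to several disks. The function names and the RUN switch are hypothetical; actually running smartctl needs root and smartmontools, and the drive must support SCT Error Recovery Control.

```shell
#!/bin/sh
# Convert whole seconds to the deciseconds that smartctl's scterc option expects,
# using the same limit for reads and writes.
scterc_arg() {
    secs=$1
    echo "scterc,$((secs * 10)),$((secs * 10))"
}

# set_erc SECONDS DISK...: print the smartctl command for each disk.
# With RUN=1 in the environment the command is executed instead of printed.
set_erc() {
    secs=$1
    shift
    for disk in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            smartctl -l "$(scterc_arg "$secs")" "$disk"
        else
            echo "smartctl -l $(scterc_arg "$secs") $disk"
        fi
    done
}
```

Calling set_erc 7 /dev/sda /dev/sdb prints one command per disk for review. On many drives the setting does not survive a power cycle, so it may need to be reapplied after each boot.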

  • Try HDD Regenerator (https://www.dposoft.net/) on all disks

  • randvegeta Member, Host Rep

    @jackb said: Are these spinning disks? If so do they have built in TLER/similar error recovery control?

    Indeed, these are spinning disks.

    Looks like only 2 of my 4 disks have TLER (we use different brands for the RAID 10 so we don't end up with a bad batch failing simultaneously).

    My suspicion is you've got a drive with a few bad sectors and read attempts hang forever. I've seen it before more than once.

    That is my suspicion also, but I am hoping it hasn't meant the loss of all the data on the VM.

  • jackb Member, Host Rep
    edited October 2024

    @randvegeta said:

    @jackb said: Are these spinning disks? If so do they have built in TLER/similar error recovery control?

    Indeed, these are spinning disks.

    Looks like only 2 of my 4 disks have TLER (we use different brands for the RAID 10 so we don't end up with a bad batch failing simultaneously).

    Every time I've seen this problem before it's been one drive at a time, but I suppose there's nothing to stop two drives having the same problem on a sector and its mirror.

    With software RAID, I'd be tempted to try starting the array degraded and read-only. I assume not, but is that an option with your NAS?

  • @randvegeta said:
    I tried to dd xvda to xvdb for example, but there were IO issues when attempting that and it would stop after about 4GB.

    Your best chance at getting a block-level copy is to add conv=sync,noerror, which tells dd to pad unreadable sectors with nulls so it doesn't die when it encounters a bad block. Then just pray the rot hasn't hit anything important (e.g. the header for a LUKS container if it's encrypted).
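
A minimal sketch of that dd invocation, wrapped in a function so it can be tried on an ordinary file first. The function name clone_padded and the device names in the usage note are hypothetical. The small block size is a deliberate choice: with conv=sync, a failed read is replaced by a whole block of nulls, so bs=4096 limits how much data is lost around each bad spot, at the cost of speed.

```shell
#!/bin/sh
# clone_padded SRC DST: copy SRC to DST block by block, continuing past read
# errors (noerror) and padding each short or failed read with zeros (sync) so
# that offsets in DST stay aligned with SRC.
clone_padded() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync
}
```

From the rescue environment this would be called as clone_padded /dev/xvda /dev/xvdb, with the target on storage that does not live on the failing NAS. Double-check which device is the source before running it, since dd overwrites the target without asking.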

  • Maybe you just have bad RAM in that NAS. Is it replaceable, or could you just move the drives to another similar unit?
