New on LowEndTalk? Please Register and read our Community Rules.
Need tutorial on how to simulate Soft RAID failure
I recently got a SYS (OVH) server that has 2 SATA HDDs. I've installed Debian/Proxmox on it, configured as RAID 1.
Since I haven't installed anything important on this server yet, I need a tutorial on how to simulate a failure on one of the HDDs (including formatting the "failed" one, as if a new HDD had been installed to replace it, followed by the RAID rebuild). I want to practice on unimportant data before I use it in production...
Comments
Impressive. Would love to see what experts here have to say
The easiest approach is to forcibly unmount the disk, format it, or do whatever it takes to mess up its partitions. Then pretend it's a new disk, re-add it, and start the array rebuild.
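On Linux software RAID (md), you don't even need to damage anything: mdadm can mark a member as faulty on demand. A minimal sketch of the fail/replace/rebuild cycle, assuming the array is /dev/md0 and the member to "fail" is /dev/sdb1 (check /proc/mdstat for your actual names; a Proxmox install typically has several md arrays, so repeat per array):

```shell
cat /proc/mdstat                     # confirm array and member names first

mdadm /dev/md0 --fail /dev/sdb1      # mark the member as faulty
mdadm /dev/md0 --remove /dev/sdb1    # pull it out of the array

# Pretend a fresh disk was installed: wipe the old RAID metadata.
mdadm --zero-superblock /dev/sdb1    # or: wipefs -a /dev/sdb1

mdadm /dev/md0 --add /dev/sdb1       # re-add it; the rebuild starts right away
watch cat /proc/mdstat               # follow the resync progress
```

Run as root, and double-check the device names before the wipe step.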
This is interesting, as this is my worst fear after OVH rebooting the server in rescue mode for an abuse claim.
echo 1 > /sys/block/sdb/device/delete
to make a disk vanish.
Additionally, you'd probably want to know whether the server is using UEFI or legacy boot, and whether the secondary drive can be booted from.
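As a sketch of that trick: deleting the device makes the kernel drop it entirely, so md sees the member disappear, much like a real dead disk. To bring it back without rebooting you can rescan its SCSI host (host0 below is an assumption; check ls /sys/class/scsi_host/ for the right one):

```shell
# Make /dev/sdb vanish from the kernel's point of view (simulated dead disk):
echo 1 > /sys/block/sdb/device/delete

# md will mark the vanished member as failed; verify with:
cat /proc/mdstat

# Bring the disk back without a reboot by rescanning its SCSI host
# (host0 is an assumption -- see ls /sys/class/scsi_host/):
echo "- - -" > /sys/class/scsi_host/host0/scan
```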
That's a common issue with software RAID setups. The second drive isn't set up to boot properly, so if the first drive fails, the server doesn't come back up.
How does one check this?
So what happens next? What to do?
True...
Check for a /sys/firmware/efi directory on the system. If it exists, then you're using UEFI. If it doesn't, then you're probably using legacy BIOS boot.
If you're using UEFI, then you will need to use efibootmgr to check the boot order AND ensure that the second drive has a proper ESP with up-to-date grub settings.
If you're not using UEFI, then you would want to ensure that grub is installed in the MBR of the second drive.
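The checks above can be sketched roughly like this (the mount point and device names are assumptions; adjust them to your layout, and be aware that for UEFI the second disk's ESP usually has to be mounted somewhere first):

```shell
# Detect the firmware type:
if [ -d /sys/firmware/efi ]; then echo UEFI; else echo "legacy BIOS"; fi

# UEFI: list boot entries and their order (efibootmgr package):
efibootmgr -v

# UEFI: install grub onto the second disk's ESP, e.g. mounted at /mnt/esp2
# (hypothetical mount point -- mount sdb's EFI partition there first):
grub-install --target=x86_64-efi --efi-directory=/mnt/esp2 /dev/sdb

# Legacy BIOS: put grub in the MBR of the second disk as well:
grub-install /dev/sdb
```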
That's insightful. Thanks for the same. I'll check.
But BTW, how do I make a disk go bad to ensure the server comes back? Like what the OP asked.
You need Chaos Monkey. https://github.com/Netflix/chaosmonkey This will keep you stimulated and on the edge every time.
Had heard and kinda forgotten about this. Will check if it supports what we wanna do... thanks
@LTniger I need to know how to simulate a RAID disk failure and subsequent recovery/rebuild. Chaos Monkey is just a tool to randomly kill app services inside instances.
I also have such doubts, and I hope you can find a solution.
Any online course or something on RAID management?
How best to detect failures (and, if possible, likely upcoming failures/bad drives)
How to understand what failed
How to make sure all drives that should be bootable are bootable
How to rebuild failed drives and make sure everything works
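For the detection side of that list, a minimal sketch with the usual md and SMART tools (md0 and sda are assumptions; list your arrays with mdadm --detail --scan, and smartctl needs the smartmontools package):

```shell
# Current array state: [UU] means healthy, [U_] means degraded:
cat /proc/mdstat
mdadm --detail /dev/md0

# SMART data often hints at an upcoming failure before md notices:
smartctl -H /dev/sda                 # overall health verdict
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'

# For ongoing alerts, run the mdmonitor service and set MAILADDR
# in /etc/mdadm/mdadm.conf so md emails you when a member fails.
```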