How does the provider know which drive to replace in a failing software RAID?

vitobotta · July 2022

If I have software RAID in a dedicated server, and a drive is failing, how can I or the provider know which one needs to be replaced? Thanks.

luckypenguin · July 2022

S.M.A.R.T. utilities. If you get a 1720 error, the hard drive is about to fail.

yoursunny · July 2022

Each hard drive slot has a green light and a yellow light.
The green light means the drive is inserted.
The yellow light means the drive has failed.
You should replace the drive in the spot with yellow light.

vitobotta · July 2022

Is there a way of getting email alerts if a drive in the array is failing so that I know when to contact support? Because otherwise in a RAID 1 how would I notice if a drive is failing?

letlover · July 2022

@vitobotta said:
Is there a way of getting email alerts if a drive in the array is failing so that I know when to contact support? Because otherwise in a RAID 1 how would I notice if a drive is failing?

smartmontools, smartctl, as someone just answered my question in another post.

imok · July 2022

https://unix.stackexchange.com/questions/28636/how-to-check-mdadm-raids-while-running

jackb · July 2022

@vitobotta said:
If I have software RAID in a dedicated server, and a drive is failing, how can I or the provider know which one needs to be replaced? Thanks.

You should be able to check the health of the array with mdadm or in /proc/mdstat, which will show failed drives. You need to remove the failed drive from the array before having it replaced.

You need to send the serial number of the drive to be replaced to your provider. If you can't get it (e.g. drive too dead to identify itself), you should send the serials of the healthy drives.

There are often health indicator lights, though these may or may not be in use depending on configuration, so assume they aren't and send serials.
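For the "remove the failed drive from the array" step, here is a minimal sketch that only builds and prints the two mdadm calls (mark the member faulty, then remove it). The array and member names (/dev/md0, /dev/sdb1) are hypothetical placeholders, and the helper name removal_commands is made up for illustration. It is a dry run: nothing is executed.

```python
import shlex


def removal_commands(array, member):
    """Build the mdadm commands that mark a member faulty and then remove it."""
    return [
        ["mdadm", "--manage", array, "--fail", member],
        ["mdadm", "--manage", array, "--remove", member],
    ]


if __name__ == "__main__":
    # /dev/md0 and /dev/sdb1 are hypothetical names; check /proc/mdstat for yours.
    for cmd in removal_commands("/dev/md0", "/dev/sdb1"):
        # Dry run: print only. Run the commands by hand as root once you are
        # sure you have picked the right member.
        print(shlex.join(cmd))
```

Printing instead of executing is deliberate, since failing the wrong member of a degraded RAID 1 would take the array down.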

MeAtExampleDotCom · July 2022

@vitobotta said:
If I have software RAID in a dedicated server, and a drive is failing, how can I or the provider know which one needs to be replaced? Thanks.

Depends. Some servers with built-in hardware RAID will have lights for each drive, or maybe even on each drive if they are hot-swap, so they can see directly which one is failing as the controller will tell the drive to flash its “I'm dying” sequence.

With software RAID you will probably learn of the problem from a SMART warning, from a drive dropping into a failed state in /proc/mdstat (possibly with a mail alert telling you this has happened), or from errors logged elsewhere. In the case of SMART, the warning messages should include the serial number of the drive that is reporting issues. Or, if you know the drive's device name from elsewhere, you can use smartctl or other tools to read the serial number of that drive and let the provider know. If they have sufficient access to your server, for instance if you have a managed server deal, they can check this themselves directly.

For hardware RAID without visible indicators on the physical machine, local hands-on support can tell by restarting the machine into the RAID BIOS or equivalent and getting the details that way. You can probably tell using whatever monitoring software comes with your hardware RAID controller too, though how you read that will depend on the controller and its specific tools.

Once they know the serial number they can pull that drive, as it will also be included on the drive's label(s).
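To illustrate reading the serial with smartctl, here is a rough sketch that pulls the "Serial Number" line out of `smartctl -i` output. The sample text, model, and serial are made up for the example, and the helper names (parse_serial, read_serial) are hypothetical. On a real server you would need smartmontools installed and root access.

```python
import re
import subprocess

# Illustrative excerpt in the style of `smartctl -i` output; values are made up.
SAMPLE_INFO = """\
Device Model:     EXAMPLE DISK 2TB
Serial Number:    ZX0EXAMPLE1
"""


def parse_serial(smartctl_info):
    """Extract the serial number from `smartctl -i` text, or None if absent."""
    # ATA/NVMe drives print "Serial Number:", SCSI/SAS drives "Serial number:".
    match = re.search(r"^Serial [Nn]umber:\s*(.+)$", smartctl_info, re.MULTILINE)
    return match.group(1).strip() if match else None


def read_serial(device):
    """Run `smartctl -i` on a device such as /dev/sda and return its serial."""
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    )
    return parse_serial(result.stdout)


if __name__ == "__main__":
    # Uses the sample text so the sketch runs without touching any hardware.
    print(parse_serial(SAMPLE_INFO))
```

Running read_serial over every member listed in /proc/mdstat gives you the list of serials to send, including the healthy ones if the dead drive no longer answers.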

vitobotta · July 2022

@imok said:
https://unix.stackexchange.com/questions/28636/how-to-check-mdadm-raids-while-running

Thanks, this is basically what I was looking for. I configured mdadm to send me email alerts when the status of the RAID changes. This is good enough for now.
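One common way to get these mails is a MAILADDR line in mdadm.conf, which the mdadm monitor uses as the alert recipient. The post does not say how the alerts were set up, so this is an assumption. The sketch below just checks whether a config text contains such a line. The sample config and the helper name alert_address are made up, and the file location varies by distribution (often /etc/mdadm/mdadm.conf or /etc/mdadm.conf).

```python
# Illustrative mdadm.conf content; the address and UUID are placeholders.
SAMPLE_CONF = """\
# mdadm.conf
MAILADDR admin@example.com
ARRAY /dev/md0 UUID=00000000:00000000:00000000:00000000
"""


def alert_address(conf_text):
    """Return the address from the first MAILADDR line, or None if there is none."""
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        parts = stripped.split()
        if len(parts) >= 2 and parts[0].upper() == "MAILADDR":
            return parts[1]
    return None


if __name__ == "__main__":
    # On a real server, pass the contents of your mdadm.conf instead.
    print(alert_address(SAMPLE_CONF))
```

If an address is configured, `mdadm --monitor --scan --oneshot --test` should send a test message for each array, which is a quick way to confirm that mail delivery actually works.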

imgmoney · July 2022

You can also use hetrixtools to monitor the health status of the drive and replace it before it fails.

dustinc · July 2022

Beyond checking with SMART, in software RAID environments the command cat /proc/mdstat will show you which drives are active; from there you can work out which drive is dead.

If, for example, you have two drives in a software RAID 1, cat /proc/mdstat would show "[2/2] [UU]", which means both drives are online and functioning. If it shows "[2/1] [U_]", only one drive is active, and the missing/dead drive needs to be looked at further.
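As a rough sketch of that check, the snippet below scans /proc/mdstat text for arrays where fewer members are active than expected and lists any members flagged (F). The kernel prints the counter as [expected/active], so a degraded two-disk mirror shows [2/1]. The sample text is illustrative, not output from a real server, and the helper name degraded_arrays is hypothetical.

```python
import re

# Illustrative /proc/mdstat content for a degraded two-disk RAID 1.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      976630464 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array to its expected/active counts and faulty members."""
    results = {}
    current = None
    failed = []
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:\s*(.*)$", line)
        if header:
            current = header.group(1)
            # Members marked (F) have been flagged faulty by the md driver.
            failed = re.findall(r"(\S+)\[\d+\]\(F\)", header.group(2))
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                results[current] = {
                    "expected": expected,
                    "active": active,
                    "failed": failed,
                }
            current = None
    return results


if __name__ == "__main__":
    # On a real server, use open("/proc/mdstat").read() instead of the sample.
    print(degraded_arrays(SAMPLE_MDSTAT))
```

A drive that has vanished entirely will not appear with (F), so an empty "failed" list on a degraded array means you identify the dead drive by elimination from the members still listed.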
