How can I identify the broken HDD in the physical server bay if the server is on software RAID/mdadm?

Calin · April 18

Hello LET, we are facing a rather stressful problem. For those who have experience with software RAID / mdadm: if an HDD breaks, how do you identify it physically so you can remove it from the server bay?

The HDD can no longer be seen in lsblk, dmesg or other tools. What would be your solution?

Until now we have marked each bay with labels, but sometimes that failed to identify the disk, so we started looking for better ideas.

So, any better ideas?

Regards,
Calin

Calin · April 18

Maybe @PulsedMedia has an idea?

gwnd1989 · April 18

Good luck Calin.. I hope the issues are fixed soon

Calin · April 18

@gwnd1989 said: Good luck Calin.. I hope the issues are fixed soon

Thanks, we fixed it a long time ago. We are just looking for an easier method to identify broken HDDs.

Regards

FlorinMarian · April 18

Every HDD/SSD has its SN printed on the front label.
If an HDD has disappeared, just check which physical HDD does not carry one of the SNs returned by: lsblk --nodeps -o name,serial

Example:

root@hp1:~# lsblk --nodeps -o name,serial
NAME  SERIAL
sda   Z1Z9ZG3L0000R616ZA3T
sdb   Y7X0A10MFEGC
sdc   Y7P0A0CSFEGC
sdd   Z1Z9NVJ40000C60819JP
sde   Y7P0A0D2FEGC
sdf   Y7X0A10QFEGC
sdg   Y7X0A05MFEGC
sdh   Y7X0A0YKFEGC
sdi   Y7N0A1QGFEGC
sdj   Y7X0A0V1FEGC
sdk   Y7X0A0TGFEGC
sdl   Y7N0A1NZFEGC
sdm   PHWA6024000F1P2JGN
sdn   PHWA6024002D1P2JGN
sdo   PHWA602400021P2JGN
sdp   PHWA6024002K1P2JGN
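
A minimal sketch of that comparison, assuming you saved a list of all serials while every disk was still healthy. The inventory path `/root/disk_serials.txt` and both function names are hypothetical; only the `lsblk` call comes from the post above.

```shell
#!/bin/sh
# Print the serials of all disks the kernel currently sees, one per line.
current_serials() {
    # -n drops the header line; awk skips devices that report no serial
    lsblk --nodeps -n -o serial | awk 'NF { print $1 }'
}

# Print every serial from the inventory file ($1) that is absent from
# the list of currently visible serials ($2).
missing_serials() {
    # -v invert match, -x whole-line match, -F fixed strings, -f patterns from file
    grep -vxFf "$2" "$1"
}

# Example (hypothetical inventory path, saved while all disks were healthy):
#   current_serials > /tmp/serials_now.txt
#   missing_serials /root/disk_serials.txt /tmp/serials_now.txt
```

Whatever serial it prints is the one to look for on the front labels.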

Radi · April 18

Label them with serial numbers and compare the serial numbers of the disks that are showing with the physical ones.
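
If you also keep the labels as a small text file, the lookup can go straight from missing serial to bay. A rough sketch, assuming a hand-maintained map file with one "BAY SERIAL" pair per line; the file format and the function name are made up for illustration.

```shell
#!/bin/sh
# $1: bay map, one "BAY SERIAL" pair per line (hypothetical file you maintain)
# $2: serials currently visible to the OS, one per line
# Prints the bay and serial of every mapped disk the OS no longer sees.
missing_bays() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: visible serials
         NF >= 2 && !($2 in seen) { print "bay " $1 ": " $2 }' "$2" "$1"
}

# Example (hypothetical paths):
#   lsblk --nodeps -n -o serial | awk 'NF { print $1 }' > /tmp/serials_now.txt
#   missing_bays /root/bay_map.txt /tmp/serials_now.txt
```

The map is only as good as its last update, so it has to be refreshed every time a disk is swapped.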

totally_not_banned · April 18

That's easy. Just shake them a little. Voila, if it falls apart it's broken!

ScreenReader · April 18

Blink them one by one in turn.
Whichever one is not blinking could be the broken one.

natvps_uk · April 18

https://linux.die.net/man/8/ledctl - turn on the UID lights of all the drives that you can see in lsblk/fdisk, and the one that isn’t lit up is the one to remove.
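
A small sketch of that approach, assuming the backplane or enclosure is one that ledctl can actually drive (not every HBA and chassis combination is). The helper name `locate_commands` is hypothetical. It only prints the `ledctl` commands, so the list can be checked before anything is run.

```shell
#!/bin/sh
# Read disk names (sda, sdb, ...) on stdin and print one ledctl command
# per disk, so the list can be reviewed before anything is executed.
# $1: "locate" to turn the locate LED on, "locate_off" to turn it off
locate_commands() {
    while read -r name; do
        if [ -n "$name" ]; then
            printf 'ledctl %s=/dev/%s\n' "$1" "$name"
        fi
    done
}

# Example: light every disk lsblk can still see (run as root):
#   lsblk --nodeps -n -o name | locate_commands locate | sh
# Turn the LEDs off again afterwards:
#   lsblk --nodeps -n -o name | locate_commands locate_off | sh
```

Note that `lsblk --nodeps` also lists loop and optical devices if there are any, so review the generated commands first.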

host_c · April 18

@Calin

Bro, I can only say.... you know what I will say......

This time, I cannot help, as mdadm is above my pay-grade.

I suspect the disk popping out of the RAID cannot keep in sync with the others.

That does not mean it is dead. Software RAID is sensitive to latency, as the CPU handles the storage plus parity calculations plus everything else.
Under high IO (20+ disks, as you have) these things might happen.

EDIT:

ZFS will be more stable, but you have to test that out....

Not_Oles · April 18

Hey @Calin! Sometimes software RAID configurations default to sending root an email when a problem is detected. If outbound email has not also been configured, sometimes the email is dumped as a text file into the /root directory. So you could take a peek in your /root directory. You might find a helpful email. It happened to me once, a long time ago. Best wishes! Tom
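
If no mail turns up, the array state can also be read directly. A minimal sketch, assuming the usual `/proc/mdstat` layout where a degraded array shows a status such as `[U_]` instead of `[UU]`; the function name is hypothetical. It only names the affected array, not the physical disk.

```shell
#!/bin/sh
# Print the name of every md array whose status line shows a missing
# member, e.g. "[U_]" instead of "[UU]".
# $1: file to read (defaults to /proc/mdstat)
degraded_arrays() {
    awk '/^md/ { name = $1 }                  # remember the current array
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}

# Example:
#   degraded_arrays
```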

Aquatis_Joseph · April 18

@Calin said:
Hello LET, we are facing a rather stressful problem. For those who have experience with software RAID / mdadm: if an HDD breaks, how do you identify it physically so you can remove it from the server bay?

The HDD can no longer be seen in lsblk, dmesg or other tools. What would be your solution?

Until now we have marked each bay with labels, but sometimes that failed to identify the disk, so we started looking for better ideas.

So, any better ideas?

Regards,
Calin

Hi there!

Just confirming: is the server management software (iDRAC/iLO) reporting a faulted drive? I know you can have the drive bay blink from there if your system supports that feature.

davide · April 18

On my regular tower case I have a sticky label on the rear side of each disk, because the label on top is not readable with the disks mounted.

host_c · April 18

@Aquatis_Joseph said: Just confirming: is the server management software (iDRAC/iLO) reporting a faulted drive? I know you can have the drive bay blink from there if your system supports that feature.

I think he will not have that, as the iLO does not read SMART from drives behind the HBA on HP Apollo systems (as I recall).

PulsedMedia · April 19

@Calin said:
Maybe @PulsedMedia has an idea?

Difference in activity. Read from the remaining drives to /dev/null and the LEDs will behave differently: the missing drive shows no activity.
Some Dell servers, especially with HW RAID, can also show it on the LEDs.
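
A rough sketch of the read-to-/dev/null idea, assuming plain `dd` reads are enough to make the activity LEDs visible on your chassis. The function name and the 512 MiB read size are arbitrary choices for illustration.

```shell
#!/bin/sh
# Read a fixed amount of data from each given device and discard it, so the
# activity LED of every working disk lights up in turn.
# Usage: exercise_disks /dev/sda /dev/sdb ...
exercise_disks() {
    for dev in "$@"; do
        echo "reading $dev"
        # Sequential read of up to 512 MiB; reading is non-destructive
        dd if="$dev" of=/dev/null bs=1M count=512 2>/dev/null \
            || echo "read failed: $dev" >&2
    done
}

# Example (run as root, one disk at a time while watching the bays):
#   exercise_disks /dev/sda /dev/sdb /dev/sdc
```

On repeated runs the data may come from the page cache and the LED stays dark; with GNU dd, adding `iflag=direct` bypasses the cache.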

Some DAS chassis have an esoteric way to change the LEDs too.

There are also ways to check from the CLI which drive bay it is.
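
One such way, sketched under the assumption that your distribution populates `/dev/disk/by-path` through udev: the link names encode the controller's PCI address and the port, which on many backplanes corresponds to a fixed bay. The function name is hypothetical, and the port-to-bay mapping still has to be worked out once per chassis.

```shell
#!/bin/sh
# List whole-disk entries under a by-path directory together with the
# kernel device each one points to.
# $1: directory to scan (defaults to /dev/disk/by-path)
disk_paths() {
    for link in "${1:-/dev/disk/by-path}"/*; do
        case "$link" in *-part[0-9]*) continue ;; esac   # skip partitions
        [ -L "$link" ] || continue                       # only symlinks
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Example: save a snapshot while all disks are healthy, then diff it later.
#   disk_paths > /root/disk_paths_healthy.txt
#   disk_paths | diff /root/disk_paths_healthy.txt -
```

A vanished disk has no entry any more, so the port that is missing from the current listing compared with the healthy snapshot is the one to pull.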

Difficulties start when you hit 40, 80 ... 120 drives on a single system.

rahulbose98 · April 19

Start working hard and disable the drives one by one to find the broken one.
