How can I identify a failed HDD in the physical server bay if the server uses software RAID (mdadm)?
Calin Member, Patron Provider
edited April 18 in Help

Hello LET, we are facing a rather stressful problem. For those with experience with software RAID / mdadm: if an HDD fails, how can you identify it physically so that it can be pulled from the server bay?

The HDD no longer shows up in lsblk, dmesg, or anywhere else. What would your solution be?

Until now we have marked each bay with a label, but that has sometimes failed to identify the right drive, so we are looking for better ideas.

So, any better ideas?

Regards,
Calin

Comments

  • Calin Member, Patron Provider

    Maybe @PulsedMedia has an idea?

  • Good luck Calin.. I hope the issues are fixed soon

  • Calin Member, Patron Provider

    @gwnd1989 said: Good luck Calin.. I hope the issues are fixed soon

    Thanks :) We fixed it a long time ago; we are just looking for an easier way to identify broken HDDs.

    Regards

    Thanked by 1gwnd1989
  • FlorinMarian Member, Host Rep

    Every HDD/SSD already has its serial number (SN) printed on the front label.
    If an HDD has disappeared, check which physical HDD does not carry any of the serial numbers in the output of: lsblk --nodeps -o name,serial

    Example:

    root@hp1:~# lsblk --nodeps -o name,serial
    NAME  SERIAL
    sda   Z1Z9ZG3L0000R616ZA3T
    sdb   Y7X0A10MFEGC
    sdc   Y7P0A0CSFEGC
    sdd   Z1Z9NVJ40000C60819JP
    sde   Y7P0A0D2FEGC
    sdf   Y7X0A10QFEGC
    sdg   Y7X0A05MFEGC
    sdh   Y7X0A0YKFEGC
    sdi   Y7N0A1QGFEGC
    sdj   Y7X0A0V1FEGC
    sdk   Y7X0A0TGFEGC
    sdl   Y7N0A1NZFEGC
    sdm   PHWA6024000F1P2JGN
    sdn   PHWA6024002D1P2JGN
    sdo   PHWA602400021P2JGN
    sdp   PHWA6024002K1P2JGN
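
A minimal sketch of that comparison, assuming you keep an inventory file with one known serial number per line, written down while all drives were healthy. The file name `known_serials.txt` and the function name `missing_serials` are hypothetical.

```shell
# missing_serials INVENTORY [LSBLK_OUTPUT]
# Prints every serial listed in INVENTORY (one serial per line) that does
# not appear in the lsblk listing. LSBLK_OUTPUT is an optional file holding
# saved `lsblk --nodeps -o name,serial` output; without it, lsblk is run live.
missing_serials() {
    local inventory="$1" listing="${2:-}"
    {
        if [ -n "$listing" ]; then
            cat "$listing"
        else
            lsblk --nodeps -o name,serial
        fi
    } | awk -v inv="$inventory" '
        NR > 1 && $2 != "" { seen[$2] = 1 }   # skip the header, remember visible serials
        END {
            while ((getline s < inv) > 0)
                if (s != "" && !(s in seen)) print s
        }'
}
```

Running `missing_serials known_serials.txt` should then print the serial to look for on the physical labels.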
    
  • Radi Host Rep, Veteran

    Label them with serial numbers and compare the serial numbers of the disks that are showing with the physical ones.

  • That's easy. Just shake them a little. Voila, if it falls apart it's broken!

  • Blink them one by one in turn.
    Whichever one does not blink could be the broken one.

    Thanked by 2host_c yongsiklee
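
One way to blink the drives in turn is to generate read activity on each visible disk, one at a time, and watch the activity LEDs. A minimal sketch, assuming the bays have per-drive activity LEDs and GNU dd is available. The function names and the `DRY_RUN` switch are hypothetical helpers, not part of any tool.

```shell
# blink_disk DEVICE [MEGABYTES]
# Reads MEGABYTES (default 512) from DEVICE and discards the data, which
# keeps that drive's activity LED busy for a moment. Read-only; needs root.
# With DRY_RUN=1 the dd command is printed instead of executed.
blink_disk() {
    local dev="$1" mb="${2:-512}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "dd if=$dev of=/dev/null bs=1M count=$mb iflag=direct"
    else
        dd if="$dev" of=/dev/null bs=1M count="$mb" iflag=direct status=none
    fi
}

# blink_all: walk through every top-level block device that lsblk still
# sees (this can include non-disk devices such as loop or optical drives),
# pausing between drives so each bay can be checked off by eye.
blink_all() {
    lsblk --nodeps -n -o name | while read -r name; do
        echo "reading /dev/$name"
        blink_disk "/dev/$name"
        sleep 2
    done
}
```

`iflag=direct` bypasses the page cache, so a repeated run still reaches the disk rather than being served from memory.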
  • https://linux.die.net/man/8/ledctl - turn on the UID lights of all the drives that you can see in lsblk/fdisk; the one that isn't lit up is the one to remove.
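
A minimal sketch of that approach, assuming the ledmon package (which provides `ledctl`) is installed and the backplane or enclosure is one that `ledctl` can drive. The function names are hypothetical.

```shell
# locate_cmds STATE
# Reads device names (sda, sdb, ...) on stdin and prints one ledctl command
# per device. STATE is "locate" (LED on) or "locate_off" (LED off).
locate_cmds() {
    local state="$1"
    while read -r name; do
        [ -n "$name" ] && echo "ledctl $state=/dev/$name"
    done
    return 0
}

# Light the locate LED on every disk the kernel still sees; the bay that
# stays dark should hold the missing drive. Run as root.
locate_visible() {
    lsblk --nodeps -n -o name | locate_cmds locate | sh
}

# Switch the LEDs off again once the drive has been swapped.
locate_visible_off() {
    lsblk --nodeps -n -o name | locate_cmds locate_off | sh
}
```

Printing the commands first (`lsblk --nodeps -n -o name | locate_cmds locate`) lets you review them before piping them to `sh`.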

  • host_c Member, Patron Provider
    edited April 18

    @Calin

    Bro, I can only say.... you know what I will say......

    This time, I cannot help, as mdadm is above my pay-grade.

    I suspect the disk popping out of the RAID cannot keep in sync with the others.

    That does not mean it is dead. Software RAID is sensitive to latency, as the CPU handles the storage plus the parity calculations plus everything else.
    Under high I/O (20+ disks, as you have) these things can happen.

    EDIT:

    ZFS will be more stable, but you have to test that out....

  • Not_Oles Moderator, Patron Provider

    Hey @Calin! Sometimes software RAID configurations default to sending root an email when a problem is detected. If outbound email has not also been configured, sometimes the email is dumped as a text file into the /root directory. So you could take a peek in your /root directory. You might find a helpful email. It happened to me once, a long time ago. Best wishes! Tom
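
If no such mail turns up, the kernel's own view in /proc/mdstat shows which arrays are degraded: a `_` in the member-status field, such as `[UU_]`, marks a missing or failed member. A minimal sketch (the function name `degraded_arrays` is hypothetical):

```shell
# degraded_arrays [FILE]
# Prints the name of every md array whose member-status field, such as
# [UU_], contains "_" (a missing or failed member). FILE defaults to
# /proc/mdstat; pass a saved copy to inspect it offline.
degraded_arrays() {
    awk '
        /^md[^ ]* :/      { name = $1 }                   # e.g. "md1 : active raid5 ..."
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }  # status field with a gap
    ' "${1:-/proc/mdstat}"
}
```

For an array reported this way, `mdadm --detail /dev/md1` (with the array name substituted) lists the state of each member slot.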

  • Aquatis_Joseph Member, Patron Provider

    @Calin said:
    Hello LET, we are facing a rather stressful problem. For those with experience with software RAID / mdadm: if an HDD fails, how can you identify it physically so that it can be pulled from the server bay?

    The HDD no longer shows up in lsblk, dmesg, or anywhere else. What would your solution be?

    Until now we have marked each bay with a label, but that has sometimes failed to identify the right drive, so we are looking for better ideas.

    So, any better ideas?

    Regards,
    Calin

    Hi there!

    Just checking: is the server management software (iDRAC/iLO) reporting a faulted drive? I know you can make the drive bay blink from there if your system supports that feature.

  • davide Member
    edited April 18

    On my regular tower case I have a sticky label on the rear side of each disk, because the label on top is not readable with the disks mounted.

  • host_c Member, Patron Provider

    @Aquatis_Joseph said: Just checking: is the server management software (iDRAC/iLO) reporting a faulted drive? I know you can make the drive bay blink from there if your system supports that feature.

    I don't think he will have that, as iLO does not read SMART data from drives behind the HBA on HP Apollo systems (as I recall).

    Thanked by 1Aquatis_Joseph
  • PulsedMedia Member, Patron Provider

    @Calin said:
    Maybe @PulsedMedia has an idea?

    Difference in activity: read from the remaining drives to /dev/null and their LEDs will behave differently from the failed one's.
    Some Dell servers, especially with hardware RAID, can also light the LEDs.

    Some DAS chassis have an esoteric way to change the LEDs too.

    There are also ways to check from the CLI which drive bay it is.

    The difficulties start when you hit 40, 80 ... 120 drives on a single system :D
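
For the CLI route, one option is /dev/disk/by-path, where udev names each disk after the controller port it is attached to. A minimal sketch, assuming the port-to-bay mapping of the chassis was noted down once while all drives were healthy (the function name `disk_ports` is hypothetical):

```shell
# disk_ports [DIR]
# Prints "port-path -> device" for every whole-disk symlink under DIR
# (default /dev/disk/by-path), skipping the -partN partition links.
disk_ports() {
    local dir="${1:-/dev/disk/by-path}" link
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        case "$link" in *-part[0-9]*) continue ;; esac
        echo "$(basename "$link") -> $(basename "$(readlink "$link")")"
    done
}
```

A port that appears in the saved healthy listing but is absent from the current one points at the bay of the missing drive.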

  • Start working hard and disable the drives one by one to find the broken one.

    Thanked by 1yoursunny