How to check if a hard drive is broken?

WHT Member

I have had a SYS server for three months, and yesterday my sites started to get slow; sometimes they load fine, sometimes they take 20 seconds.

I suspect one of the disks in the RAID1 array is broken. Are there any tools or commands I can check it with? Or do the SYS techs replace it themselves?

Comments

  • Awmusic12635 Member, Host Rep

    SMART data?

  • Is it software RAID or do you have a raid card? There are utilities for both to check the status of the raid array, as well as each individual drive.

  • doghouch Member
    edited February 2016

    Is your automatic assumption that, because your website took 20 seconds to load, it's the RAID card? Check the disk array - if you need help, just shoot me a PM. Also, what model is the RAID card?

    EDIT: If this isn't your server, you'll need the provider to help.

  • kt Member, Host Rep
    edited February 2016

    Assuming it's SW RAID, what's the output of:
    cat /proc/mdstat

    (If it shows [UU] then the array is fine)
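
    For comparison, something along these lines (device names and block count are just illustrative) is roughly what a healthy two-disk array looks like:

        Personalities : [raid1]
        md0 : active raid1 sdb1[1] sda1[0]
              975193088 blocks [2/2] [UU]

        unused devices: <none>

    A dropped disk shows up as [U_] or [_U], with the failed member flagged (F).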

    Also check SMART:

    smartctl -a /dev/sda

    smartctl -a /dev/sdb

    smartctl -a /dev/sdc (if you have a 3rd drive)

    Thanked by: raindog308, WHT
  • WHT Member
    edited February 2016

    I have software RAID. Will check those commands tomorrow. Thanks

  • Post the output on pastebin and I will check what's wrong.

  • vimalware Member
    edited February 2016

    So, I started an extended SMART test over an hour or so ago after seeing several email alerts from smartmontools

    I got an email alert about 10% of the way through the extended self-test:

    The following warning/error was logged by the smartd daemon:
    Device: /dev/sdb [SAT], Self-Test Log error count increased from 0 to 1
    

    So, I poked right away:

        smartctl -a /dev/sdb |grep fail
                                                the read element of the test failed.
          1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
          3 Spin_Up_Time            0x0027   253   246   021    Pre-fail  Always       -       6591
          5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
        # 1  Extended offline    Completed: read failure       90%     35073         648257133
    

    Is this enough to ask for a replacement?
    (edit: this is a ZFS RAID1. I've never received any email alerts for /dev/sda)
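
    For context, these alerts come from the smartd daemon; a minimal /etc/smartd.conf along these lines (device paths, schedule and address are just placeholders) is what generates them:

        /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
        /dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com

    That schedule runs a short self-test daily at 02:00 and a long one on Saturdays at 03:00.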

  • qps Member, Host Rep

    vimalware said: Is this enough to ask for a replacement?

    I would try to run the test again and see if you have the same result.

    Thanked by: Kris
  • This disk is way too old - replace it!

  • Run a few long/short tests to show it's recurring and fails at the same place; it'll have more clout.

    smartctl --test=short /dev/sdb;
    smartctl --test=long /dev/sdb;
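
    Once those finish (they run on the drive in the background), the results should show up in the self-test log, which you can read back with:

    smartctl -l selftest /dev/sdb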
    

    You could also throw stress-ng at it, and see if the disks last through that, or develop sector errors after a run.

    My opinion is that if a system lasts through stress-ng --random 32 -t 24h, the machine is production worthy.

    Adjust the random workers depending on your machine, and disable / watch out for Watchdog being enabled in the BIOS.

    Otherwise, Watchdog will restart the machine if you tax it too hard, thinking it's not responding due to the load.
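
    If you want something aimed squarely at the disks rather than random stressors, a run along these lines should do it (worker count, size and duration are just examples):

        stress-ng --hdd 4 --hdd-bytes 2G --timeout 1h --metrics-brief

    Then re-check smartctl -a afterwards for new reallocated or pending sectors.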

    Thanked by: vimalware
  • raindog308 Administrator, Veteran

    Kris said: My opinion is that if a system lasts through stress-ng --random 32 -t 24h, the machine is production worthy.

    Only the strong shall survive! I like it.

    Thanked by: Kris
  • BlazingServers Member, Host Rep

    Call OVH through Skype. They are helpful enough.

  • pbgben Member, Host Rep

    First, BACKUP! Or you're bound to lose all your shit...

  • Use 'watch' on smartctl while rsyncing off the server to a temporary backup site, and look for changes in Reallocated Sectors, Seek Errors, etc.

    I'd suggest a good dose of stress-ng, as you can tune it to hit the disk if you want.

    You can always write /dev/zero out to a file the size of whatever space you have left on the disk, and watch smartctl for increasing errors, especially reallocated sectors. They ain't unlimited.
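
    Roughly, that combo could look like this (device, interval and file path are just examples):

        # re-check the key SMART attributes every minute, highlighting any changes
        watch -d -n 60 'smartctl -A /dev/sdb | egrep "Reallocated|Pending|Seek_Error"'

        # fill the remaining free space with zeros so every spare sector gets written
        dd if=/dev/zero of=/root/fill.zero bs=1M
        rm /root/fill.zero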

  • raindog308 said: Only the strong shall survive! I like it.

    A few hundred days of uptime on the beasts I built and stressed; it would have been more if it weren't for those pesky XSAs. Still chugging along.

    I would sit twenty minutes per CPU cleaning old residue off with TSP, ArctiClean, isopropyl alcohol and a fresh micro-fiber cloth.

    Followed the same ritual on the heatsink, but also used melamine foam aka magic erasers to lay a slight key, like you would before repainting a vehicle. This ensured there was no chance of a layer of old thermal compound or grease, as they are essentially micro-polishers. Just tiny swirls making little circles. And since the magic eraser is white, you know exactly when it's clean of residual grease, and you can visually see the tiny scratches you make from well.. polishing the heatsink.

    Once spotless, I would use Noctua NT-H1 or Ceramique 2. I preferred Ceramique as it never spread off the chip, although Noctua is great if you remember to not use too much.

    Overkill? Sure. Only a 1-2°C difference from others, but every degree matters.

    I would let them go for the weekend stressing in the office. Figured if the machine was responsive and still alive with no kernel panic, it was good for production.

    I still prefer good ol' Westmere workhorses, specifically any 5600 in a 1U chassis, over a lot of the new blade crap I'm seeing.

  • @vimalware said:
    So, I started an extended SMART test over an hour or so ago after seeing several email alerts from smartmontools

        # 1  Extended offline    Completed: read failure       90%     35073         648257133
    

    Is this enough to ask for a replacement?
    (edit: this is a ZFS RAID1. I've never received any email alerts for /dev/sda)

    Yes it is. Open a ticket from the SyS panel and copy/paste the SMART data into the ticket.
    Last week they replaced both disks in my SYS-IP-1 server, and I can say the whole process was handled very professionally. :)
    But remember: backups!
    Also, they only change the faulty disk, so you must check that the bootloader is installed OK, and you must re-partition the new disk and resync the RAID, etc.
    They don't do that for you.
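
    For an mdadm software RAID1 the rebuild after the swap is roughly along these lines (assuming /dev/sda survived, /dev/sdb is the fresh disk, MBR partitioning, and md0/md1 as the arrays - adjust to your layout; a ZFS mirror would use zpool replace instead):

        # copy the partition table from the surviving disk to the new one
        sfdisk -d /dev/sda | sfdisk /dev/sdb

        # add the new partitions back into the arrays and let them resync
        mdadm --manage /dev/md0 --add /dev/sdb1
        mdadm --manage /dev/md1 --add /dev/sdb2
        cat /proc/mdstat

        # reinstall the bootloader on the new disk
        grub-install /dev/sdb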
