What is the standard behavior of providers on failing hard drives?

Giulio · June 2020

I have a dedicated server with OneProvider, which I believe is in the Online.net datacenter.

It has two 1TB HDDs that I use in RAID-Z1 with FreeBSD.
Performance is low for a variety of reasons, but I do not care about that; it is more than enough for the bargain price I'm paying.

I was wondering what the standard behavior of providers is for drives whose SMART attributes are labelled Old_age and Pre-fail. Should I wait for a disk to fail in order to get it replaced, with a small chance of both disks failing at the same time, or might it be possible to ask for at least one of the two to be replaced before it fails?

Here are both smartctl outputs for reference:

smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4
Device Model:     WDC WD1003FBYX-18Y7B0
Serial Number:    WD-WCAW30467720
LU WWN Device Id: 5 0014ee 2055ef1dd
Add. Product Id:  DELL(tm)
Firmware Version: 01.01V02
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Jun 10 00:43:30 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x00)     Offline data collection not supported.
SMART capabilities:            (0x0000) Automatic saving of SMART data                  is not implemented.
Error logging capability:        (0x00) Error logging supported.
                    General Purpose Logging supported.
SCT capabilities:          (0x303f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       22
  3 Spin_Up_Time            0x0027   171   168   021    Pre-fail  Always       -       4450
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       39722
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   111   105   000    Old_age   Always       -       36 (Min/Max 32/39)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         4         -
# 2  Short offline       Completed without error       00%         1         -

Selective Self-tests/Logging not supported

smartctl --all /dev/da1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4
Device Model:     WDC WD1003FBYX-18Y7B0
Serial Number:    WD-WCAW30579252
LU WWN Device Id: 5 0014ee 25aba18f9
Add. Product Id:  DELL(tm)
Firmware Version: 01.01V02
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Jun 10 00:43:52 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x00)     Offline data collection not supported.
SMART capabilities:            (0x0000) Automatic saving of SMART data                  is not implemented.
Error logging capability:        (0x00) Error logging supported.
                    General Purpose Logging supported.
SCT capabilities:          (0x303f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       3
  3 Spin_Up_Time            0x0027   172   171   021    Pre-fail  Always       -       4391
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       38
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52782
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       36
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   116   109   000    Old_age   Always       -       31 (Min/Max 27/34)
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         3         -
# 2  Short offline       Completed without error       00%         1         -

Selective Self-tests/Logging not supported

jackb · June 2020

Your drives look fine. The TYPE field (Pre-fail or Old_age) indicates what kind of attribute it is, not the current state of the drive: if the VALUE of a Pre-fail attribute is near THRESH, that is an early indicator of failure, whereas the other attributes are likely to just indicate old age.

What you want to look out for is:
1. Failing tests
2. Errors logged
3. Pending sectors
4. Offline uncorrectable sectors
5. A high reallocated sector count
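
Items 3 to 5 of that checklist could be automated along these lines. This is a minimal Python sketch, not part of smartmontools: the function name check_attributes, the WATCHED_RAW_IDS set, and the regular expression are assumptions made for illustration, and the parsing only targets the attribute table layout shown in the outputs earlier in this thread. Failing tests and logged errors (items 1 and 2) are not covered.

```python
import re

# Attribute IDs whose raw value is expected to stay at zero on a healthy
# drive: reallocated, pending, and offline-uncorrectable sectors.
WATCHED_RAW_IDS = {5, 197, 198}

# One row of the "Vendor Specific SMART Attributes" table (assumed layout).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>Pre-fail|Old_age)\s+\S+\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)


def check_attributes(smartctl_output):
    """Return a list of human-readable warnings for the attribute table."""
    warnings = []
    for line in smartctl_output.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # not an attribute row
        name = m["name"]
        value, thresh, raw = int(m["value"]), int(m["thresh"]), int(m["raw"])
        # A normalized value at or below a non-zero threshold means the
        # drive itself considers the attribute failed.
        if thresh > 0 and value <= thresh:
            warnings.append(f"{name}: value {value} at or below threshold {thresh}")
        # WHEN_FAILED is "-" unless the attribute is failing or has failed.
        if m["failed"] != "-":
            warnings.append(f"{name}: WHEN_FAILED is {m['failed']}")
        if int(m["id"]) in WATCHED_RAW_IDS and raw > 0:
            warnings.append(f"{name}: raw count {raw} is non-zero")
    return warnings


# Rows copied from the first drive's output above.
SAMPLE = """
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       22
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   046   046   000    Old_age   Always       -       39722
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(check_attributes(SAMPLE))  # prints [] for these rows
```

An empty list here matches the "drives look fine" reading; any entry in the list would be a reason to look more closely or to contact the provider.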

If you are concerned, you should run a long test. There hasn't been one on your drives in a long time, so it's probably worth doing.
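
How long ago the last long test ran can be read off by comparing the LifeTime(hours) column of the self-test log with Power_On_Hours. The following Python sketch shows that arithmetic on the first drive's numbers; the function name hours_since_last_test and the regular expression are assumptions for illustration, and the sketch assumes the log's hour counter has not wrapped around.

```python
import re

# One row of the "SMART Self-test log" table (assumed layout).
SELFTEST_ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def hours_since_last_test(power_on_hours, selftest_log, kind="Extended"):
    """Hours between the most recent completed test of `kind` and now.

    Returns None when the log holds no matching completed test.
    """
    completed = [
        int(m["hours"])
        for m in map(SELFTEST_ROW.match, selftest_log.splitlines())
        if m
        and m["desc"].startswith(kind)
        and m["status"].startswith("Completed")
    ]
    if not completed:
        return None
    return power_on_hours - max(completed)


# Self-test log rows copied from the first drive's output above.
LOG = """
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         4         -
# 2  Short offline       Completed without error       00%         1         -
"""

# Power_On_Hours for the first drive is 39722.
print(hours_since_last_test(39722, LOG))  # prints 39718
```

That is roughly four and a half years of powered-on time since the last extended test on that drive.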

SGraf · June 2020

Right now your disks are fine. Just run a long test and see if it brings up any issues.

I personally would expect a drive to be replaced when it is actively failing or indicating that it will fail soon (for example, through its "reallocated sector count").

Giulio · June 2020

Thank you very much for your suggestions; the long test is now running on both disks. Given the age of the drives, I guess the solution is to set up a cron job that runs tests periodically and sends alerts, in order to catch a failing drive as early as possible?

SGraf · June 2020

Just set up smartd (from smartmontools) for automatic monitoring

jackb · June 2020

@Giulio said:
Thank you very much for your suggestions; the long test is now running on both disks. Given the age of the drives, I guess the solution is to set up a cron job that runs tests periodically and sends alerts, in order to catch a failing drive as early as possible?

For a set-up-and-forget solution I'd suggest setting up smartd with notifications, as suggested by @SGraf, plus a monthly cron or similar job to long-test the drives. One drive at a time is best.
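
As an alternative to a separate cron job, smartd can schedule the tests itself through the -s directive in smartd.conf. The Python sketch below only generates candidate configuration lines that stagger the long tests so that one drive is tested at a time. The helper name smartd_lines, the mail address root, the 03:00 start hour, and the 14-day spacing are all assumptions for illustration; the lines would go into smartd.conf (on FreeBSD probably /usr/local/etc/smartd.conf, which is worth checking), and mail alerts need a working mailer on the host.

```python
def smartd_lines(devices, mail_to, hour=3, spacing_days=14):
    """Build one smartd.conf line per device with staggered long tests.

    Each line uses -a (default monitoring), -m (mail address for alerts)
    and -s with a schedule of the form T/MM/DD/d/HH, where L means a
    long self-test and "." matches any single character.
    """
    lines = []
    for index, device in enumerate(devices):
        day = 1 + index * spacing_days
        if day > 28:
            # Stay within days that exist in every month.
            raise ValueError("too many devices for one month at this spacing")
        schedule = f"L/../{day:02d}/./{hour:02d}"
        lines.append(f"{device} -a -m {mail_to} -s {schedule}")
    return lines


for line in smartd_lines(["/dev/da0", "/dev/da1"], "root"):
    print(line)
# prints:
# /dev/da0 -a -m root -s L/../01/./03
# /dev/da1 -a -m root -s L/../15/./03
```

With these two lines da0 would get a long test on the 1st of each month and da1 on the 15th, both at 03:00, so the two tests should not overlap on a mirrored pair.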

Giulio · June 2020

Is it possible that no type of self-testing is supported by the disks?

root:~ # smartctl -t short /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Self-test functions not supported

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.

root:~ # smartctl -t long /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Self-test functions not supported

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.

root:~ # smartctl -t offline /dev/da0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 12.0-RELEASE-p13 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

Warning! SMART Attribute Data Structure error: invalid SMART checksum.
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Execute Offline Immediate function not supported

Sending command: "Execute SMART off-line routine immediately in off-line mode".
Drive command "Execute SMART off-line routine immediately in off-line mode" successful.
Testing has begun.
Warning! SMART Attribute Data Structure error: invalid SMART checksum.

Edit: I believe that short and long are supported, while offline is not.

vivithemage · June 2020

The real killer for spinny drives is Reallocated_Sector_Ct. If I saw even one reallocated sector, I would replace the drive.
