smartctl - HDD Health - Help Understanding the Output

alshahad · February 2019

Hello,

I installed a new server today using a hard drive I already had from before. I ran smartctl on it. Judging from the result below, do you think the drive is healthy, dying, or already dead?

[root@server68 a]# smartctl --all /dev/sda
smartctl 5.43 2016-09-28 r4347 [x86_64-linux-2.6.32-754.11.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     ST4000LM024-2AN17V
Serial Number:    WCK0Y2X5
LU WWN Device Id: 5 000c50 0a9ccd0ce
Firmware Version: 0001
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-3 (unknown minor revision code: 0x006d)
Local Time is:    Wed Feb 27 08:30:30 2019 +03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 642) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30a5) SCT Status supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   061   006    Pre-fail  Always       -       83753634
  3 Spin_Up_Time            0x0003   100   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   085   085   020    Old_age   Always       -       16258
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       361628497
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       12648 (99 94 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
188 Command_Timeout         0x0032   100   082   000    Old_age   Always       -       163211247703
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   033   040    Old_age   Always   In_the_past 32 (12 183 35 27 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       4
192 Power-Off_Retract_Count 0x0032   092   092   000    Old_age   Always       -       16125
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       343076
194 Temperature_Celsius     0x0022   032   067   000    Old_age   Always       -       32 (0 21 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       83753634
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3665
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19426137091807
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       149499728557
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1638520732
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 11 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 11 occurred at disk power-on lifetime: 1853 hours (77 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:01:49.223  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      00:01:49.176  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:49.173  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:49.173  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      00:01:49.117  READ NATIVE MAX ADDRESS EXT

Error 10 occurred at disk power-on lifetime: 1853 hours (77 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:01:43.031  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:01:43.029  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      00:01:43.022  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:43.019  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:43.019  SET FEATURES [Set transfer mode]

Error 9 occurred at disk power-on lifetime: 1853 hours (77 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:01:37.279  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:01:37.277  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      00:01:37.270  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:37.267  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:37.267  SET FEATURES [Set transfer mode]

Error 8 occurred at disk power-on lifetime: 1853 hours (77 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:01:30.962  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:01:30.962  READ FPDMA QUEUED
  60 00 08 32 00 38 40 00      00:01:30.960  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      00:01:30.954  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:30.951  IDENTIFY DEVICE

Error 7 occurred at disk power-on lifetime: 1853 hours (77 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 00 ff ff ff 4f 00      00:01:25.095  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00      00:01:25.094  READ FPDMA QUEUED
  27 00 00 00 00 00 e0 00      00:01:25.061  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      00:01:25.058  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:01:25.058  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Looking forward to your replies.

Regards

doghouch · February 2019

Edited the thread to fix the formatting.

eol · February 2019

Drive good.
Move on.

imok · February 2019

More info: https://www.lowendtalk.com/discussion/93554/trying-to-understand-smartctl

nullnothere · February 2019

In terms of the actual errors (attributes 197 and 198), things look OK: nothing alarming yet, but definitely make sure you have backups. The likely bigger problem in your case is that you've probably enabled power saving/spindowns. The load cycle count is very high, and that's not a good sign. Your power-on hours and power cycle count are very reasonable in comparison.

Check that you've not enabled spindown on the drive.
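
To put the load cycle figure in perspective, here is a minimal Python sketch that divides the raw Load_Cycle_Count by the raw Power_On_Hours. The numbers are copied from attributes 193, 9 and 12 in the first post; the function name `cycles_per_hour` is made up for illustration.

```python
def cycles_per_hour(load_cycles: int, power_on_hours: int) -> float:
    """Average number of head load/unload cycles per powered-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    return load_cycles / power_on_hours


# Raw values taken from the smartctl output in the first post.
LOAD_CYCLE_COUNT = 343076   # attribute 193
POWER_ON_HOURS = 12648      # attribute 9
POWER_CYCLE_COUNT = 85      # attribute 12

rate = cycles_per_hour(LOAD_CYCLE_COUNT, POWER_ON_HOURS)
per_power_cycle = LOAD_CYCLE_COUNT / POWER_CYCLE_COUNT

print(f"{rate:.1f} load cycles per power-on hour")
print(f"{per_power_cycle:.0f} load cycles per power cycle")
```

That works out to roughly 27 head parks per powered-on hour, or about one every two minutes on average, which would be consistent with an aggressive power-saving or idle timer.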

rm_ · February 2019

It is not very good:

187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8

If these continue increasing, then it's basically done for.

Also, either right now or at some point in the past, it had a bad cable connection:

199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3665
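
One way to see whether these counters "continue increasing" is to save the attribute table from time to time and compare the raw values. The following is only a sketch: `parse_raw_values`, `increased` and the `WATCHED` list are hypothetical names, the parser assumes the ten-column layout shown in this thread, and the "later" snapshot is invented purely to demonstrate the comparison.

```python
def parse_raw_values(smart_output: str) -> dict:
    """Map attribute ID -> leading integer of RAW_VALUE for each attribute row."""
    values = {}
    for line in smart_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns, start with a numeric ID,
        # and carry a hex FLAG such as 0x0032 in the third column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            raw = fields[9]
            if raw.isdigit():
                values[int(fields[0])] = int(raw)
    return values


# Attributes discussed in this thread (assumed watch list, adjust as needed).
WATCHED = (5, 187, 197, 198, 199)


def increased(before: dict, after: dict) -> dict:
    """Return {id: (old, new)} for watched attributes whose raw value grew."""
    return {
        attr: (before.get(attr, 0), after[attr])
        for attr in WATCHED
        if attr in after and after[attr] > before.get(attr, 0)
    }


# Rows copied from the output in the first post.
SAMPLE = """\
187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3665
"""

baseline = parse_raw_values(SAMPLE)

# Hypothetical later snapshot, invented for the example only.
later = dict(baseline)
later[197] = 16

print(baseline)
print(increased(baseline, later))
```

In practice the text would come from a saved copy of `smartctl -a /dev/sda` rather than an embedded string.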

eol · February 2019

As long as there are no reallocated sectors I wouldn't be worried too much.

Shazan · February 2019

Try a long test with:

smartctl -t long /dev/sda

then check the results after a couple of hours with:

smartctl -a /dev/sda
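
As a rough guide to how long to wait, the drive reports its own estimate in the "Extended self-test routine recommended polling time" field (642 minutes in the first post). The sketch below pulls that number out of the text and converts it; the function name `extended_test_minutes` is made up, and the start time is simply the "Local Time" line from the first post used as an example.

```python
import re
from datetime import datetime, timedelta


def extended_test_minutes(smart_output: str):
    """Return the recommended polling time (minutes) for the extended self-test, or None."""
    match = re.search(
        r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes",
        smart_output,
    )
    return int(match.group(1)) if match else None


# Two lines copied from the smartctl output in the first post.
SAMPLE = """\
Extended self-test routine
recommended polling time:        ( 642) minutes.
"""

minutes = extended_test_minutes(SAMPLE)
hours, rest = divmod(minutes, 60)
print(f"Extended test: about {hours} h {rest} min")

# Example start time only (the "Local Time" reported in the first post).
started = datetime(2019, 2, 27, 8, 30)
finish = started + timedelta(minutes=minutes)
print("Check results after:", finish)
```

For this drive the estimate is closer to eleven hours than a couple, so checking earlier will most likely show the test still in progress.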

eol · February 2019

@Shazan said:
Try a long test with:

smartctl -t long /dev/sda

then check the results after a couple of hours with:

smartctl -a /dev/sda

I don't recommend this as it will degrade the drive even more.

jackb · February 2019

@eol said:

@Shazan said:
Try a long test with:

smartctl -t long /dev/sda

then check the results after a couple of hours with:

smartctl -a /dev/sda

I don't recommend this as it will degrade the drive even more.

Not the case - it'll better identify if the drive is worse than you already think. One long test won't harm a drive.

dahartigan · February 2019

The demise of the drive is impending. Back up your files to floppies now to avoid catastrophic catastrophe.

Shot2 · February 2019

These ST4000LM024 drives are the "crap" Seagate SMR drives: most of them acquire errors (too) early in their life, and (too) many die after a few weeks or months. They also perform quite poorly, due to the very nature of SMR technology and suboptimal firmware. Only the ST4000LM024 behaves a bit better, perhaps due to better-designed firmware. Still, I wouldn't put them in a server, nor would I trust them with data. Search for "rosewood" on the hddguru forums and you'll see how shitty this generation of Seagate drives is.

That said, I own one of these ST4000LM024 for temporary external backups and, just like yours, it picked up a handful of such errors after a year of use, and things haven't worsened since. Incidentally, I RMA'd another one yesterday (the 2TB version, with 50000 reallocated sectors after 3 months and a SMART diag failure).

eol · February 2019

True.
WD's implementation of SMR tech is far better.

Falzo · February 2019

@rm_ said:
It is not very good:

187 Reported_Uncorrect      0x0032   089   089   000    Old_age   Always       -       11
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8

If these continue increasing, then it's basically done for.

this!

@nullnothere said:
problem in your case is that you've probably enabled power saving/spindowns - the load cycle count is very high and that's not a good sign.

and that...

@jackb said:

@eol said:

@Shazan said:
Try a long test with:

smartctl -t long /dev/sda

then check the results after a couple of hours with:

smartctl -a /dev/sda

I don't recommend this as it will degrade the drive even more.

Not the case - it'll better identify if the drive is worse than you already think. One long test won't harm a drive.

and definitely this ;-)
