How to detect SMART errors on an NVMe disk?

akhfa Member
edited August 2020 in Help

Hi all,

I need help determining from SMART data whether an NVMe disk is failing.

My Hetrix monitoring tool reported that my RAID is not healthy, and this is what I got:

cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      1047552 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      498925888 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 59.1% (295085568/498925888) finish=16.9min speed=200056K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

But for the NVMe disks I don't see the SMART attributes I'm used to from regular SSDs:

smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NPSE48888
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            418,515,906,560 [418 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8491b74abf
Local Time is:                      Tue Aug 18 03:13:21 2020 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    19,867,513 [10.1 TB]
Data Units Written:                 48,641,973 [24.9 TB]
Host Read Commands:                 206,530,808
Host Write Commands:                1,943,029,925
Controller Busy Time:               9,376
Power Cycles:                       4
Power On Hours:                     8,754
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               56 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NZAC525555
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            511,991,951,360 [511 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 85910c8f1c
Local Time is:                      Tue Aug 18 03:13:39 2020 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Data Units Read:                    14,375,193 [7.36 TB]
Data Units Written:                 52,642,220 [26.9 TB]
Host Read Commands:                 164,570,410
Host Write Commands:                1,949,236,315
Controller Busy Time:               9,720
Power Cycles:                       4
Power On Hours:                     8,755
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               62 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Is "Media and Data Integrity Errors" the value to watch? Or where else do I need to look?

Thank you
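
As a rough guide to which fields matter, the sketch below reads four values from the SMART/Health Information section shown above and flags the conditions that usually point to a failing NVMe drive: a non-zero Critical Warning, Available Spare below its threshold, Percentage Used at or past 100%, and a non-zero Media and Data Integrity Errors count. The function names and the choice of fields are assumptions made for illustration, not smartmontools features. The sample text is copied from the /dev/nvme1 output.

```python
import re

# Excerpt of the `smartctl -a /dev/nvme1` output posted above.
SAMPLE = """\
Critical Warning:                   0x00
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
"""


def parse_nvme_health(smartctl_text):
    """Collect 'Name:   value' lines into a {name: raw value} dict."""
    fields = {}
    for line in smartctl_text.splitlines():
        match = re.match(r"^([^:]+):\s+(.*)$", line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def to_int(value):
    """Convert '1,943,029,925', '10%' or '0x00' to an integer."""
    token = value.split()[0].rstrip("%").replace(",", "")
    if token.lower().startswith("0x"):
        return int(token, 16)
    return int(token)


def health_problems(fields):
    """Return a list of warning strings; an empty list means nothing flagged."""
    problems = []
    if to_int(fields["Critical Warning"]) != 0:
        problems.append("critical warning flag set")
    spare = to_int(fields["Available Spare"])
    if spare < to_int(fields["Available Spare Threshold"]):
        problems.append("available spare below threshold")
    if to_int(fields["Percentage Used"]) >= 100:
        problems.append("rated endurance used up")
    if to_int(fields["Media and Data Integrity Errors"]) > 0:
        problems.append("media/data integrity errors")
    return problems


print(health_problems(parse_nvme_health(SAMPLE)))
```

With the values posted for either drive, none of the four conditions is met, so the list comes back empty.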

Comments

  • Try
    smartctl -a /dev/nvme0n1

    Also, looking at the kernel log might help you understand what happened. To me it looks like a simple scheduled RAID check.

    Thanked by 1akhfa
  • Shazan Member, Host Rep

    Yes, indeed, that looks like a check, not a rebuild.

    Thanked by 1akhfa
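
The difference between a check and a rebuild can be read straight from /proc/mdstat: a rebuild shows up as "recovery" together with a "_" in the member map (for example [U_]), while a scheduled check shows "check" with all members still "U". Below is a minimal sketch of that reading, assuming the mdstat layout posted above. The function name md_status and the returned dict shape are made up for illustration.

```python
import re

# The /proc/mdstat output from the original post.
SAMPLE = """\
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      1047552 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      498925888 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 59.1% (295085568/498925888) finish=16.9min speed=200056K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
"""


def md_status(mdstat_text):
    """Map each md array to its member map and any running sync action."""
    arrays = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if header:
            current = header.group(1)
            arrays[current] = {"members": None, "action": None, "degraded": None}
            continue
        if current is None:
            continue
        # "[2/2] [UU]": every U is a working member, "_" is a missing one.
        members = re.search(r"\[\d+/\d+\] \[([U_]+)\]", line)
        if members:
            arrays[current]["members"] = members.group(1)
            arrays[current]["degraded"] = "_" in members.group(1)
        # Progress lines name the action: check, resync, recovery or reshape.
        action = re.search(r"\b(check|resync|recovery|reshape)\s*=", line)
        if action:
            arrays[current]["action"] = action.group(1)
    return arrays


for name, state in md_status(SAMPLE).items():
    print(name, state)
```

For the output in the original post this reports md1 as running a check with both members present, and md0 as idle.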
  • Francisco Top Host, Host Rep, Veteran

    By default there's a cron job in /etc/cron/* that issues a check every couple of weeks.

    Francisco

    Thanked by 1akhfa
  • Shazan Member, Host Rep

    I wouldn't use a bitmap on an NVMe RAID anyway.
    It is extremely rare that you re-add an NVMe device.

  • @Gamma17 said:
    Try
    smartctl -a /dev/nvme0n1

    Also, looking at the kernel log might help you understand what happened. To me it looks like a simple scheduled RAID check.

    Tried it, and the result is similar. I don't know what the difference is between nvme0 and nvme0n1. What does the n(X) mean?

    @Shazan said:
    Yes, indeed, that looks like a check, not a rebuild.

    @Francisco said:
    By default there's a cron job in /etc/cron/* that issues a check every couple of weeks.

    Francisco

    Yeah, it seems to be only a check. The schedule is in /etc/cron.d/mdadm. Thank you for the answers!

    @Shazan said:
    I wouldn't use a bitmap on an NVMe RAID anyway.
    It is extremely rare that you re-add an NVMe device.

    Would you mind explaining this further? Is there any "method" other than a bitmap?
    This setup was done by Hetzner's installimage. Usually I don't change the RAID setup done by the provider.

    Thank you!!

  • vfuse Member, Host Rep
  • jackb Member, Host Rep

    @akhfa said:

    Would you mind explaining this further? Is there any "method" other than a bitmap?
    This setup was done by Hetzner's installimage. Usually I don't change the RAID setup done by the provider.

    Thank you!!

    Write-intent bitmaps are used by mdadm to speed up resyncs if you remove and then re-add a disk, or if the system crashes. They come at the cost of a bit of runtime performance.

    For most SSDs, a full resync is usually fast enough that you don't need the write-intent bitmap.

    https://raid.wiki.kernel.org/index.php/Write-intent_bitmap

    Thanked by 2akhfa Shazan
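
To put a number on "fast enough", here is a back-of-the-envelope calculation using the figures from the mdstat output in the original post. It assumes a full resync would run at roughly the same 200056K/sec as the check did, which is an assumption: a real resync also writes, so it may be slower. The helper name minutes_to_process is made up for illustration.

```python
def minutes_to_process(blocks_kib, speed_kib_per_sec):
    """Time in minutes to walk `blocks_kib` 1 KiB blocks at the given rate."""
    return blocks_kib / speed_kib_per_sec / 60


# Figures taken from the md1 lines of /proc/mdstat in the original post.
TOTAL_BLOCKS = 498925888   # size of md1 in 1 KiB blocks
DONE_BLOCKS = 295085568    # blocks already checked at 59.1%
SPEED = 200056             # reported speed in K/sec

remaining = minutes_to_process(TOTAL_BLOCKS - DONE_BLOCKS, SPEED)
full_pass = minutes_to_process(TOTAL_BLOCKS, SPEED)

print(f"remaining: {remaining:.1f} min, full pass: {full_pass:.1f} min")
```

The remaining time works out to about 17 minutes, in line with the finish=16.9min that mdstat reported, and a complete pass over the roughly 500 GB array takes a little over 40 minutes at that rate.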
  • Gamma17 Member
    edited August 2020

    @akhfa said: Tried it, and the result is similar. I don't know what the difference is between nvme0 and nvme0n1. What does the n(X) mean?

    n(x) is the namespace. You can google it for a more detailed explanation than I can give, but in short it is another storage abstraction at the SSD controller level. Most common NVMe SSDs do not (fully) support it and simply expose a single namespace.

    I guess how smartctl handles the two device names depends on the smartmontools version and the specific SSD.

    Thanked by 1akhfa
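
The naming scheme can be made concrete with a small parser. This is only a sketch of the usual Linux convention, nvme<controller>n<namespace>p<partition>. The function name split_nvme_name is made up for illustration, and it does not cover every naming variant a kernel may produce.

```python
import re

# nvme0     -> controller 0 (the character device for the controller itself)
# nvme0n1   -> namespace 1 on controller 0 (the block device)
# nvme0n1p2 -> partition 2 of that namespace
NVME_NAME = re.compile(r"^(?:/dev/)?nvme(\d+)(?:n(\d+)(?:p(\d+))?)?$")


def split_nvme_name(path):
    """Split an NVMe device name into controller, namespace and partition."""
    match = NVME_NAME.match(path)
    if match is None:
        raise ValueError(f"not an NVMe device name: {path!r}")
    controller, namespace, partition = (
        int(group) if group is not None else None for group in match.groups()
    )
    return {
        "controller": controller,
        "namespace": namespace,
        "partition": partition,
    }


for name in ("/dev/nvme0", "/dev/nvme0n1", "nvme1n1p2"):
    print(name, split_nvme_name(name))
```

Read this way, the RAID members in the mdstat output (nvme0n1p2 and nvme1n1p2) are partition 2 of namespace 1 on controllers 0 and 1.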
  • Webdock_io Member, Host Rep

    I second trying out nvme-cli. It's been a while since I used it, but I remember it being very good for gathering information on your NVMe drives.

    Thanked by 1akhfa