How to detect SMART error on NVMe Disk?

akhfa Member
edited August 2020 in Help

Hi all,

I need help working out from SMART data whether an NVMe disk is failing.

My Hetrix monitoring tool reported that my RAID is not healthy, and I got this:

cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      1047552 blocks super 1.2 [2/2] [UU]

md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      498925888 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 59.1% (295085568/498925888) finish=16.9min speed=200056K/sec
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>

But for the NVMe disks I don't see the usual SMART attributes that a regular SSD shows:

smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NPSE48888
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            418,515,906,560 [418 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8491b74abf
Local Time is:                      Tue Aug 18 03:13:21 2020 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    3%
Data Units Read:                    19,867,513 [10.1 TB]
Data Units Written:                 48,641,973 [24.9 TB]
Host Read Commands:                 206,530,808
Host Write Commands:                1,943,029,925
Controller Busy Time:               9,376
Power Cycles:                       4
Power On Hours:                     8,754
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               39 Celsius
Temperature Sensor 2:               56 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB512HAJQ-00000
Serial Number:                      S3W8NZAC525555
Firmware Version:                   EXA7301Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            511,991,951,360 [511 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 85910c8f1c
Local Time is:                      Tue Aug 18 03:13:39 2020 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Data Units Read:                    14,375,193 [7.36 TB]
Data Units Written:                 52,642,220 [26.9 TB]
Host Read Commands:                 164,570,410
Host Write Commands:                1,949,236,315
Controller Busy Time:               9,720
Power Cycles:                       4
Power On Hours:                     8,755
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               62 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Is it Media and Data Integrity Errors? Or where else do I need to look?

Thank you

Comments

  • Gamma17 Member

    Try
    smartctl -a /dev/nvme0n1

    Also, looking at the kernel log might help you understand what happened. To me it looks like a simple scheduled RAID check.
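
A minimal sketch for telling a scheduled check apart from a real problem using the /proc/mdstat text alone. It assumes the layout shown in the first post: a `[UU]`-style member map where `_` marks a missing device, and a progress line that names the running operation. The function name `mdstat_summary` is made up for illustration.

```shell
# Summarise /proc/mdstat-style text read from stdin. For each md array, print
# its name, whether a member is missing, and which background operation
# (check, resync, recovery, reshape) is running, or "idle" if none is.
mdstat_summary() {
  awk '
    function emit() { if (name != "") print name, state, action }
    /^md[0-9]+ :/ { emit(); name = $1; state = "unknown"; action = "idle" }
    # Member map, e.g. "[2/2] [UU]"; an underscore marks a missing member.
    name != "" && match($0, /\[[U_]+\]/) {
      state = (index(substr($0, RSTART, RLENGTH), "_") > 0) ? "degraded" : "healthy"
    }
    # Progress line, e.g. "[====>....]  check = 59.1% (...)".
    name != "" && match($0, /(check|resync|recovery|reshape) *=/) {
      action = substr($0, RSTART, RLENGTH)
      sub(/ *=$/, "", action)
    }
    END { emit() }
  '
}

# Typical use on a live system:
#   mdstat_summary < /proc/mdstat
```

For the output in the first post this prints `md0 healthy idle` and `md1 healthy check`. A rebuild after a disk failure would normally show `recovery` together with a `degraded` state instead.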

    Thanked by 1akhfa
  • Shazan Member, Host Rep

    Yes, indeed, that looks like a check, not a rebuild.

    Thanked by 1akhfa
  • Francisco Top Host, Host Rep, Veteran

    By default there's a cron job under /etc/cron.d/ that issues a check every couple of weeks.

    Francisco
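
To confirm that a running operation is just the periodic check, the md driver exposes its state in sysfs. A rough sketch follows, assuming the usual `/sys/block/<array>/md/sync_action` and `mismatch_cnt` files. The function name `md_check_status` and its optional second argument (an alternative sysfs root, handy for trying it against a fake directory) are illustrative only.

```shell
# Report what an md array is currently doing and the mismatch count recorded
# by its most recent check.
#   $1 = array name, e.g. md1
#   $2 = sysfs root (optional, defaults to /sys)
md_check_status() {
  md_dir="${2:-/sys}/block/$1/md"
  if [ ! -r "$md_dir/sync_action" ]; then
    echo "no md array named $1" >&2
    return 1
  fi
  printf '%s action=%s mismatch_cnt=%s\n' "$1" \
    "$(cat "$md_dir/sync_action")" "$(cat "$md_dir/mismatch_cnt")"
}

# Typical use on a live system:
#   md_check_status md1
```

While the cron-triggered check runs, `sync_action` should read `check`, and it goes back to `idle` once the check finishes.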

    Thanked by 1akhfa
  • Shazan Member, Host Rep

    I wouldn't use a bitmap on an NVMe RAID anyway.
    It is extremely rare that you re-add an NVMe device.

  • akhfa Member

    @Gamma17 said:
    Try
    smartctl -a /dev/nvme0n1

    Also, looking at the kernel log might help you understand what happened. To me it looks like a simple scheduled RAID check.

    Tried it, and the result is similar. I don't know what the difference is between nvme0 and nvme0n1. What does the n(X) mean?

    @Shazan said:
    Yes, indeed, that looks like a check, not a rebuild.

    @Francisco said:
    By default there's a cron job under /etc/cron.d/ that issues a check every couple of weeks.

    Francisco

    Yeah, it seems to be only a check. The schedule is in /etc/cron.d/mdadm. Thank you for the answers!

    @Shazan said:
    I wouldn't use a bitmap on an NVMe RAID anyway.
    It is extremely rare that you re-add an NVMe device.

    Would you mind explaining this further? Is there any "method" other than a bitmap?
    This setup was done by Hetzner's installimage. Usually I don't change the RAID setup done by the provider.

    Thank you!!

  • vfuse Member, Host Rep
  • jackb Member, Host Rep

    Write-intent bitmaps are used by mdadm to speed up resyncs if you remove and then re-add a disk, or if the system crashes. They come at the cost of a bit of runtime performance.

    For most SSDs, a full resync is usually fast enough that you don't need the write-intent bitmap.

    https://raid.wiki.kernel.org/index.php/Write-intent_bitmap
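
As a sketch of how that looks in practice: arrays with a write-intent bitmap show a `bitmap:` line in /proc/mdstat, and `mdadm --grow --bitmap=none` or `--bitmap=internal` removes or re-creates it. The helper names below are hypothetical, and `bitmap_command` only prints the command so that nothing is changed by accident.

```shell
# Print the names of md arrays that have a write-intent bitmap, given
# /proc/mdstat-style text on stdin.
arrays_with_bitmap() {
  awk '
    /^md[0-9]+ :/ { name = $1 }
    /^ *bitmap:/ && name != "" { print name }
  '
}

# Print (do not run) the mdadm command that removes or restores the bitmap.
#   $1 = array name, e.g. md1
#   $2 = none | internal
bitmap_command() {
  case "$2" in
    none|internal)
      printf 'mdadm --grow --bitmap=%s /dev/%s\n' "$2" "$1"
      ;;
    *)
      echo "usage: bitmap_command <array> none|internal" >&2
      return 1
      ;;
  esac
}

# Typical use on a live system:
#   arrays_with_bitmap < /proc/mdstat
#   bitmap_command md1 none
```

With the mdstat output from the first post, only `md1` is listed. Whether dropping the bitmap is worth it on a provider-installed setup is a judgement call, since the runtime cost is small either way.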

    Thanked by 2akhfa Shazan
  • Gamma17 Member
    edited August 2020

    @akhfa said: Tried it, and the result is similar. I don't know what the difference is between nvme0 and nvme0n1. What does the n(X) mean?

    n(X) is the namespace, which you can google to get a more detailed explanation than I can give. In short, it is another storage abstraction at the SSD controller level. Most common NVMe SSDs do not (fully) support multiple namespaces and simply expose one.

    I guess how it behaves depends on the smartmontools version and the specific SSD.
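
To make the naming concrete, here is a small sketch that splits a Linux NVMe device name into its parts, assuming the common `nvme<controller>n<namespace>p<partition>` pattern seen in this thread. The function name `nvme_name_parts` is invented for the example, and setups with NVMe multipath can number things differently.

```shell
# Split an NVMe device name into "controller namespace partition".
# A "-" is printed for parts that are absent. Returns non-zero for names
# that do not follow the nvme<C>[n<N>[p<P>]] pattern.
nvme_name_parts() {
  printf '%s\n' "${1#/dev/}" | awk '
    !/^nvme[0-9]+(n[0-9]+(p[0-9]+)?)?$/ { exit 1 }
    {
      ctrl = $0; ns = "-"; part = "-"
      # Peel off the trailing partition, then the trailing namespace.
      if (match(ctrl, /p[0-9]+$/)) { part = substr(ctrl, RSTART + 1); ctrl = substr(ctrl, 1, RSTART - 1) }
      if (match(ctrl, /n[0-9]+$/)) { ns = substr(ctrl, RSTART + 1); ctrl = substr(ctrl, 1, RSTART - 1) }
      print ctrl, ns, part
    }
  '
}

# Examples:
#   nvme_name_parts /dev/nvme0       # controller only
#   nvme_name_parts /dev/nvme0n1     # namespace 1 on controller nvme0
#   nvme_name_parts /dev/nvme1n1p2   # partition 2 of that namespace
```

So `/dev/nvme0` is the controller (a character device), `/dev/nvme0n1` is the first namespace on it (the block device), and `nvme0n1p2` is a partition of that namespace, which is what the RAID members in /proc/mdstat refer to.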

    Thanked by 1akhfa
  • Webdock_io Member, Host Rep

    I second trying out nvme-cli. It's been a while since I used it, but I remember it being very good for gathering information on your NVMe drives.
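
On the original question of where to look: NVMe drives report a fixed health log (the "NVMe Log 0x02" section in the smartctl output above) instead of the ATA attribute table. The fields that usually matter are Critical Warning (anything other than 0x00), Available Spare against its threshold, Percentage Used, and Media and Data Integrity Errors. Below is a hedged sketch that scans `smartctl -a` output for those fields. The function name `nvme_health_check` and the 90% wear cut-off are arbitrary choices for illustration, not vendor guidance.

```shell
# Read `smartctl -a` output for an NVMe device on stdin and print one line per
# finding. Exit status 0 means none of the checked fields looked bad.
nvme_health_check() {
  awk -F': *' '
    { sub(/[ \t\r]+$/, "") }   # drop trailing whitespace before comparing
    /^Critical Warning:/ && $2 != "0x00" { print "critical warning flags set: " $2; bad = 1 }
    /^Available Spare:/           { spare = $2 + 0 }
    /^Available Spare Threshold:/ { thresh = $2 + 0 }
    /^Percentage Used:/ && $2 + 0 >= 90 { print "wear estimate high: " $2; bad = 1 }
    /^Media and Data Integrity Errors:/ {
      n = $2; gsub(/,/, "", n)   # smartctl prints thousands separators
      if (n + 0 > 0) { print "media/data integrity errors: " $2; bad = 1 }
    }
    END {
      if (spare != "" && thresh != "" && spare < thresh) {
        print "available spare " spare "% is below threshold " thresh "%"
        bad = 1
      }
      exit bad
    }
  '
}

# Typical use on a live system (needs root):
#   smartctl -a /dev/nvme0 | nvme_health_check && echo "nothing alarming"
```

Both drives in the first post come out clean with this check: Critical Warning is 0x00, spare is 100% against a 10% threshold, wear is 3% and 9%, and there are no media errors. `nvme smart-log` from nvme-cli reads the same log page, though its field names differ from the ones matched here.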

    Thanked by 1akhfa