Should I be worried about these NVMe temps with Hetzner Auction Server?

integritly · August 2019

First time picking up a server with NVMe drives so I don't know if this is normal, or if I should be concerned running this box long term. Server is being used for a database and disk speed is important. Performance so far seems okay, although every so often a bench.sh pass will score roughly 33% of its usual numbers. Here's the smart log for both drives:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 69 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 8%
data_units_read                     : 55,552,962
data_units_written                  : 37,232,304
host_read_commands                  : 5,572,419,946
host_write_commands                 : 1,594,258,267
controller_busy_time                : 17,070
power_cycles                        : 27
power_on_hours                      : 18,393
unsafe_shutdowns                    : 10
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 84
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 69 C

Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 62 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 10%
data_units_read                     : 66,969,199
data_units_written                  : 38,698,184
host_read_commands                  : 5,876,328,377
host_write_commands                 : 1,600,756,699
controller_busy_time                : 17,278
power_cycles                        : 26
power_on_hours                      : 18,397
unsafe_shutdowns                    : 12
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 569
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 62 C

=== START OF INFORMATION SECTION ===
Model Number:                       THNSN5512GPU7 TOSHIBA
Serial Number:                      Z6IS108QTUHV
Firmware Version:                   57GA4103
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sat Aug 10 16:55:26 2019 CEST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.1600W       -        -    3  3  3  3     1000    1000
 4 -   0.0120W       -        -    4  4  4  4     5000   35000
 5 -   0.0060W       -        -    5  5  5  5   100000  110000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002

The first thing that caught my attention was the 569 minutes of "Warning Temperature Time" on drive 2.
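For a sense of scale, here is a rough sketch that sets the "Warning Temperature Time" counters against the power-on hours from the logs above. It assumes the counter is in minutes (as read above) and accumulates over the life of the drive, so it says nothing about when the warm periods happened. The function name is made up for illustration.

```python
def warning_time_fraction(warning_minutes: int, power_on_hours: int) -> float:
    """Fraction of powered-on time spent above the warning temperature."""
    return warning_minutes / (power_on_hours * 60)


# Values copied from the two smart logs above: (warning minutes, power-on hours).
drives = {
    "nvme0n1": (84, 18_393),
    "nvme1n1": (569, 18_397),
}

for name, (minutes, hours) in drives.items():
    share = warning_time_fraction(minutes, hours)
    print(f"{name}: {minutes} min of {hours} h powered on = {share:.4%}")
```

Both drives come out well under a tenth of a percent of their powered-on time, but since these are lifetime totals, most of it could have built up under a previous owner of the box.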

Bench.sh runs routinely score ~730-750 MB/s on I/O speed, but sometimes one of the 3 runs dips into the 250 MB/s range. Not sure if that's just a quirk, or if it's my drives getting temp throttled.
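One way to make the "sometimes a run dips" observation less anecdotal is to record every run and flag the ones that fall well below the median. This is only a sketch: the function name, the 0.5 cut-off, and the sample numbers are assumptions chosen to match the ranges described above, not real measurements.

```python
from statistics import median


def flag_slow_runs(speeds_mb_s, ratio=0.5):
    """Return (index, speed) pairs for runs slower than `ratio` times the median."""
    baseline = median(speeds_mb_s)
    return [(i, s) for i, s in enumerate(speeds_mb_s) if s < ratio * baseline]


# Hypothetical results in the ranges mentioned above (MB/s).
runs = [742.0, 251.0, 735.0]
for index, speed in flag_slow_runs(runs):
    print(f"run {index} looks slow: {speed} MB/s")
```

Noting the drive temperature next to each flagged run would show whether the dips line up with heat or happen at any temperature.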

I already cancelled one Hetzner order, an EX52 I ordered during the special: it was delivered about a week after I placed the order, and by then I had grabbed this one from the auction. I don't want to look like I'm abusing their 14-day return policy by cancelling this one too, but I am genuinely concerned there is a problem here.

So, do I have reason to swap this box for a different one, or is this normal and should I just go with it?

Thanks guys.

MikeA · August 2019

I mean, that's high? It's not the idle temp, I assume? If it is, surely it's wrong. Those temps are what you'd expect under heavy use or something. I checked some of my own servers, which use a variety of NVMe drives, and they're all between 29°C and 35°C.

SpeedBus · August 2019

THNSN5512GPU7 seems to have

Min Operating Temperature 32 °F
Max Operating Temperature 176 °F

Source: https://www.cnet.com/products/toshiba-xg3-series-thnsn5512gpu7-solid-state-drive-512-gb-pci-express-3-1-x4-nvme/

176 °F is 80 °C, so it seems okay for now, but it would be better to keep an eye on it, since SMART has already recorded Warning Temperature Time. Maybe create a bash script to check/log the temperature value and alert if it goes above 80 °C, so you can show it to the DC to check out.
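A minimal sketch of that check/log idea, written in Python rather than bash. It assumes `nvme-cli` is installed, the script runs as root, and `nvme smart-log` prints a `temperature : NN C` line like the logs above (newer nvme-cli versions format this line a little differently, so the regex only looks for the leading number). The function names, log path, and 78 °C default (the drive's own reported warning threshold) are illustrative choices, not anything Hetzner or Toshiba prescribe.

```python
import re
import subprocess
import time
from typing import Optional

# Matches the lowercase "temperature" line, not "Temperature Sensor 1".
TEMP_LINE = re.compile(r"^temperature\s*:\s*(\d+)", re.MULTILINE)


def parse_temperature(smart_log_text: str) -> Optional[int]:
    """Extract the composite temperature in Celsius, or None if not found."""
    match = TEMP_LINE.search(smart_log_text)
    return int(match.group(1)) if match else None


def read_smart_log(device: str) -> str:
    """Run `nvme smart-log <device>` and return its text output."""
    result = subprocess.run(
        ["nvme", "smart-log", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def log_line(device: str, temp_c: Optional[int], threshold_c: int) -> str:
    """Build one timestamped log entry, marked ALERT at or above the threshold."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    status = "ALERT" if temp_c is not None and temp_c >= threshold_c else "ok"
    return f"{stamp} {device} temp={temp_c} status={status}"


def monitor(device: str, threshold_c: int = 78, interval_s: int = 60,
            log_path: str = "nvme_temp.log") -> None:
    """Append one entry per interval until interrupted."""
    while True:
        temp_c = parse_temperature(read_smart_log(device))
        with open(log_path, "a") as fh:
            fh.write(log_line(device, temp_c, threshold_c) + "\n")
        time.sleep(interval_s)
```

Calling `monitor("/dev/nvme0n1")` would then build up a log that can be attached to a support ticket. Actual alerting (mail, webhook, whatever) is left out on purpose.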

Falzo · August 2019

Open a support ticket. I think temp throttling could really be the case, and that should be investigated. Cooling the NVMe might be tricky, but notifying them might help to raise awareness and have them think about solutions for it...

@Hetzner_OL

rm_ · August 2019

integritly said: concerned there is a problem here.

Which kernel version do you run? I upgraded from 4.9 to 4.14 and got an 11 °C temperature drop on a Samsung NVMe, because APST is implemented in that kernel (and newer). You guys tend to run some CentOS with 2.6.32 or old 3.10, so who knows what it has.

YellowHummingbird · August 2019

It's a bit on the warm side - I agree with @Falzo to ticket. One of my servers with them had the CPU thermal throttling under idle and they got it fixed within 15 mins of me opening a ticket - they're very helpful. Might be something simple like a failed or blocked fan.

integritly · August 2019

@MikeA said:
I mean, that's high? It's not the idle temp, I assume? If it is, surely it's wrong. Those temps are what you'd expect under heavy use or something. I checked some of my own servers, which use a variety of NVMe drives, and they're all between 29°C and 35°C.

Yeah, 29°C-35°C sounds a lot more reasonable. Granted, my server isn't idling right now, but the load is still quite light.

@SpeedBus said:
THNSN5512GPU7 seems to have

> Min Operating Temperature 32 °F
> Max Operating Temperature 176 °F
>

Source: https://www.cnet.com/products/toshiba-xg3-series-thnsn5512gpu7-solid-state-drive-512-gb-pci-express-3-1-x4-nvme/

176 °F is 80 °C, so it seems okay for now, but it would be better to keep an eye on it, since SMART has already recorded Warning Temperature Time. Maybe create a bash script to check/log the temperature value and alert if it goes above 80 °C, so you can show it to the DC to check out.

Thanks, will look into that. I've reached out to them with the current SMART logs to see what they say about it.

@Falzo said:
Open a support ticket. I think temp throttling could really be the case, and that should be investigated. Cooling the NVMe might be tricky, but notifying them might help to raise awareness and have them think about solutions for it...

@Hetzner_OL

Thank you, just did that. Will report back here with their reply.

@rm_ said:

integritly said: concerned there is a problem here.

Which kernel version do you run? I upgraded from 4.9 to 4.14 and got an 11 °C temperature drop on a Samsung NVMe, because APST is implemented in that kernel (and newer). You guys tend to run some CentOS with 2.6.32 or old 3.10, so who knows what it has.

I'm running Debian 9.9. Would go with 10, but my software is currently not compatible with it, so am stuck on the previous version for now.

integritly · August 2019

@YellowHummingbird said:
It's a bit on the warm side - I agree with @Falzo to ticket. One of my servers with them had the CPU thermal throttling under idle and they got it fixed within 15 mins of me opening a ticket - they're very helpful. Might be something simple like a failed or blocked fan.

Thanks, that's great to hear. I've opened up a ticket and will let you guys know what they say. Hopefully it's a quick fix like that as well. The server is otherwise running very well, and I like this IP allocation so I'd very much like to stay with it.

Thankfully neither drive seems to have reached the Critical temp threshold, so they've gotten hot but not too hot.

rm_ · August 2019

integritly said: I'm running Debian 9.9. Would go with 10, but my software is currently not compatible with it, so am stuck on the previous version for now.

Debian Stretch seems to use kernel 4.9. You can try 4.19 from backports.
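A quick way to see where a box stands is to compare the running kernel against the version mentioned earlier in the thread and look for the APST latency knob. This is a sketch under assumptions: the 4.14 figure is simply the one quoted above, not a verified minimum, and the sysfs path assumes the `nvme_core` module exposes its `default_ps_max_latency_us` parameter there. Function names are made up.

```python
import platform
import re
from pathlib import Path

# Assumed location of the nvme_core APST latency parameter.
APST_PARAM = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def kernel_version(release: str) -> tuple:
    """Turn a release string such as '4.9.0-9-amd64' into (4, 9)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return (0, 0)  # unknown format, treat as too old
    return int(match.group(1)), int(match.group(2))


def at_least(release: str, minimum: tuple = (4, 14)) -> bool:
    """True if the release is at or above the given (major, minor) version."""
    return kernel_version(release) >= minimum


release = platform.release()
print(f"running kernel {release}, at least 4.14: {at_least(release)}")
if APST_PARAM.exists():
    print(f"default_ps_max_latency_us = {APST_PARAM.read_text().strip()}")
else:
    print("APST latency parameter not found at the assumed path")
```

On a stock Stretch install this should report a 4.9 kernel. After installing a backports kernel and rebooting, the same script shows whether the parameter is present.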

Hetzner_OL · August 2019

Hi @integritly (Does this name mean that you only eat "real" grits and not the instant kind?) Sorry, off topic. And I think I'm missing cheese grits. And shrimp n' grits.

Sorry to just now respond, but I'm glad that you wrote a support ticket. Since you've already done that, I won't ask my colleague about it separately. --Katie

Clouvider · August 2019

I mean, you’re benching it and the throttling kicks in several benches down the line, so that’s heavy use, at least at the time, and that is going to raise your temps.

It doesn’t hurt to ask them to have a look if you see this under really low load.

integritly · August 2019

@Clouvider said:
I mean, you’re benching it and the throttling kicks in several benches down the line, so that’s heavy use, at least at the time, and that is going to raise your temps.

It doesn’t hurt to ask them to have a look if you see this under really low load.

Well the temps are the same all the time, not just when I run the bench tests. The throttling is also random, and often occurs on the first try as well, so not sure what the cause may be.

@Hetzner_OL said:
Hi @integritly (Does this name mean that you only eat "real" grits and not the instant kind?) Sorry, off topic. And I think I'm missing cheese grits. And shrimp n' grits.

Sorry to just now respond, but I'm glad that you wrote a support ticket. Since you've already done that, I won't ask my colleague about it separately. --Katie

Hehe, only real grits with shrimp and cheese.

Thanks for getting back to me on this, I've opened up a ticket and within minutes had someone go and replace a faulty fan on the server. +1 for lightning fast support, couldn't have asked for more.

Temps have seen a ~3°C improvement, more or less. Still running a little warm, but they haven't gotten into the Warning Temp Time at all, so it may just be a case of these Toshibas running a little warmer than other drives. Will continue to monitor it nonetheless, but at least I know the fans are working.

Thanks for all the help guys, really appreciate it.

jsg · August 2019

@integritly

FWIW: I'm just looking at a problem with a VPS that is based on a Hetzner NVMe dedi. The disk results are worse than awful, considerably slower than rusting spindle results.

Writing speed on those Hetzner "NVMes" is in the range of 20 to 45 MB/s.

integritly · August 2019

@jsg said:
@integritly

FWIW: I'm just looking at a problem with a VPS that is based on a Hetzner NVMe dedi. The disk results are worse than awful, considerably slower than rusting spindle results.

Writing speed on those Hetzner "NVMes" is in the range of 20 to 45 MB/s.

Interesting, the results I'm getting are much higher than that, but still intermittently dipping during some tests. 20-45 MB/s seems extremely low though, especially for being on NVMe storage. I'd definitely have them or your provider look into it.

willie · August 2019

This is on a CX11 NVMe:

$ dd if=/dev/zero of=foo bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0.587915 s, 89.2 MB/s

I didn't do this test repeatedly but can imagine a virtualized server having ups and downs hitting 20-45 MB/s at busy times.
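To see those ups and downs rather than imagine them, the write test can be repeated a few times. Below is a rough Python sketch that writes zeros to a temporary file and calls `fsync` before stopping the clock, since a plain `dd` of 50 MiB without `conv=fdatasync` or `oflag=direct` can finish while much of the data is still in the page cache. Function names and defaults are illustrative, and zeros may flatter storage that compresses or deduplicates.

```python
import os
import tempfile
import time


def timed_write(directory: str, size_mib: int = 50) -> float:
    """Write size_mib MiB of zeros, fsync, and return throughput in MB/s."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as fh:
            for _ in range(size_mib):
                fh.write(chunk)
            fh.flush()
            os.fsync(fh.fileno())  # wait until the data is on the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    # Decimal megabytes per second, the same unit dd reports.
    return size_mib * 1024 * 1024 / elapsed / 1e6


def repeated_writes(directory: str = ".", runs: int = 5, size_mib: int = 50) -> list:
    """Run the timed write several times and return every result."""
    return [timed_write(directory, size_mib) for _ in range(runs)]
```

Something like `print(repeated_writes("/path/on/the/nvme", runs=10))` at different times of day would show how wide the spread really is on a shared host.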
