Should I be worried about these NVMe temps with Hetzner Auction Server?
First time picking up a server with NVMe drives so I don't know if this is normal, or if I should be concerned running this box long term. Server is being used for a database and disk speed is important. Performance so far seems okay, although every so often a bench.sh pass will score roughly 33% of its usual numbers. Here's the smart log for both drives:
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 69 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 8%
data_units_read : 55,552,962
data_units_written : 37,232,304
host_read_commands : 5,572,419,946
host_write_commands : 1,594,258,267
controller_busy_time : 17,070
power_cycles : 27
power_on_hours : 18,393
unsafe_shutdowns : 10
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 84
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 69 C
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 62 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 10%
data_units_read : 66,969,199
data_units_written : 38,698,184
host_read_commands : 5,876,328,377
host_write_commands : 1,600,756,699
controller_busy_time : 17,278
power_cycles : 26
power_on_hours : 18,397
unsafe_shutdowns : 12
media_errors : 0
num_err_log_entries : 0
Warning Temperature Time : 569
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 62 C
=== START OF INFORMATION SECTION ===
Model Number: THNSN5512GPU7 TOSHIBA
Serial Number: Z6IS108QTUHV
Firmware Version: 57GA4103
PCI Vendor/Subsystem ID: 0x1179
IEEE OUI Identifier: 0x00080d
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Aug 10 16:55:26 2019 CEST
Firmware Updates (0x02): 1 Slot
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x000e): Wr_Unc DS_Mngmt Wr_Zero
Warning Comp. Temp. Threshold: 78 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.00W - - 0 0 0 0 0 0
1 + 2.40W - - 1 1 1 1 0 0
2 + 1.90W - - 2 2 2 2 0 0
3 - 0.1600W - - 3 3 3 3 1000 1000
4 - 0.0120W - - 4 4 4 4 5000 35000
5 - 0.0060W - - 5 5 5 5 100000 110000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002
The first thing that caught my attention was the 569 minutes of "Warning Temperature Time" on drive 2.
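To put that figure in perspective, here is a rough sketch of the arithmetic in Python. It assumes Warning Temperature Time is counted in minutes and that both drives share the 78 C warning threshold shown in the smartctl output. The function name is made up for illustration.

```python
def warning_time_share(warning_minutes: int, power_on_hours: int) -> float:
    """Return the percentage of powered-on time spent above the warning threshold."""
    total_minutes = power_on_hours * 60
    return 100.0 * warning_minutes / total_minutes


# Values copied from the smart logs above.
drives = {
    "nvme0n1": {"warning_minutes": 84, "power_on_hours": 18393, "temp_c": 69},
    "nvme1n1": {"warning_minutes": 569, "power_on_hours": 18397, "temp_c": 62},
}

# "Warning Comp. Temp. Threshold" from smartctl; assumed identical on both drives.
WARNING_THRESHOLD_C = 78

for name, d in drives.items():
    share = warning_time_share(d["warning_minutes"], d["power_on_hours"])
    headroom = WARNING_THRESHOLD_C - d["temp_c"]
    print(f"{name}: {share:.3f}% of power-on time above warning, "
          f"{headroom} C of headroom right now")
```

For the second drive this works out to roughly 0.05% of its power-on time, so the warning periods were short relative to the drive's life, though they still show the threshold has been crossed at some point.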
Bench.sh runs will routinely score ~730-750 MB/s on I/O speed, but sometimes one of the 3 runs will dip into the 250 MB/s range. Not sure if that's just a quirk, or if it's my drives getting temp throttled.
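One way to make those dips easier to spot across many runs is to compare each result with the median. This is only a sketch: the function name, the 0.5 cutoff, and the sample numbers are assumptions chosen to match the ranges described above, not real measurements.

```python
from statistics import median


def flag_slow_runs(speeds_mb_s, ratio=0.5):
    """Return the runs that fall below `ratio` times the median speed."""
    if not speeds_mb_s:
        return []
    typical = median(speeds_mb_s)
    return [s for s in speeds_mb_s if s < ratio * typical]


# Illustrative values only: two normal passes and one dip.
runs = [740, 250, 735]
print(flag_slow_runs(runs))
```

Logging the drive temperature next to each flagged run would help show whether the slow passes line up with heat or happen regardless of it.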
I already cancelled one Hetzner order, an EX52 I got during the special, because it was delivered about a week after I placed the order and by then I had grabbed this one from the auction. I don't want to look like I'm abusing their 14-day return policy by cancelling this one too, but I am genuinely concerned there is a problem here.
So, do I have reason to swap this box for a different one, or is this normal and should I just go with it?
Thanks guys.
Comments
I mean, that's high? It's not the idle temp I assume, if so surely it's wrong? Those temps are what you'd expect under heavy use or something. I checked some of my own servers which use a variety of NVMe drives and they're all between 29°C-35°C.
THNSN5512GPU7 seems to have
Source: https://www.cnet.com/products/toshiba-xg3-series-thnsn5512gpu7-solid-state-drive-512-gb-pci-express-3-1-x4-nvme/
176°F is about 80°C, so it seems okay (currently), but it would be better to keep an eye on it, as SMART has alerted you with Warning Temperature Time. Maybe create a bash script to check/log the temperature value and alert if it goes above 80°C, so you can show it to the DC to check out.
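A minimal sketch of that kind of check, written in Python rather than bash. It assumes the smart-log text looks like the output pasted in the original post (a line such as `temperature : 69 C`). The function names and the 80 C default are illustrative, not part of any tool.

```python
import re

# Matches the composite "temperature : 69 C" line, not "Temperature Sensor 1".
TEMP_LINE = re.compile(r"^temperature\s*:\s*(\d+)\s*°?\s*C", re.MULTILINE)


def read_temperature_c(smart_log_text):
    """Extract the composite temperature in Celsius, or None if not found."""
    match = TEMP_LINE.search(smart_log_text)
    return int(match.group(1)) if match else None


def check_temperature(smart_log_text, limit_c=80):
    """Return a log line, prefixed with ALERT when the temperature exceeds the limit."""
    temp = read_temperature_c(smart_log_text)
    if temp is None:
        return "UNKNOWN: no temperature line found"
    status = "ALERT" if temp > limit_c else "OK"
    return f"{status}: {temp} C (limit {limit_c} C)"


# Sample text in the same shape as the log in the original post.
sample = """critical_warning : 0
temperature : 69 C
available_spare : 100%
"""
print(check_temperature(sample))
```

In practice the text would come from the output of `nvme smart-log` for each drive, captured on a schedule (cron, for example) and appended to a file, so there is a timestamped record to attach to a support ticket.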
Open a support ticket. I think temp throttling could really be the case and that should be investigated. Cooling the NVMe might be tricky, but notifying them might help to raise awareness and have them think about solutions for it...
@Hetzner_OL
Which kernel version do you run? I upgraded from 4.9 to 4.14 and got an 11C temperature drop on Samsung NVMe due to having APST implemented in that kernel (and newer). You guys tend to run some centos with 2.6.32 or old 3.10 so who knows what it has.
It's a bit on the warm side - I agree with @Falzo to ticket. One of my servers with them had the CPU thermal throttling under idle and they got it fixed within 15 mins of me opening a ticket - they're very helpful. Might be something simple like a failed or blocked fan.
Yeah, 29°C-35°C sounds a lot more reasonable. Granted my server isn't idling right now, but the load is still quite light.
Thanks, will look into that. I've reached out to them with the current SMART logs to see what they say about it.
Thank you, just did that. Will report back here with their reply.
I'm running Debian 9.9. Would go with 10, but my software is currently not compatible with it, so am stuck on the previous version for now.
Thanks, that's great to hear. I've opened up a ticket and will let you guys know what they say. Hopefully it's a quick fix like that as well. The server is otherwise running very well, and I like this IP allocation so I'd very much like to stay with it.
Thankfully neither drive seems to have gotten into the Critical temp threshold, so they've gotten hot but not too hot.
Debian Stretch seems to use kernel 4.9. You can try 4.19 from backports.
Hi @integritly (Does this name mean that you only eat "real" grits and not the instant kind?) Sorry, off topic. And I think I'm missing cheese grits. And shrimp n' grits.
Sorry to just now respond, but I'm glad that you wrote a support ticket. Since you've already done that, I won't ask my colleague about it separately. --Katie
I mean you're benching it and the throttling engages several benches down the line, so that's heavy use, at least at the time, which is going to increase your temps at the time.
It doesn't hurt to ask them to have a look if you have this under really low load.
Well the temps are the same all the time, not just when I run the bench tests. The throttling is also random, and often occurs on the first try as well, so not sure what the cause may be.
Hehe, only real grits with shrimp and cheese.
Thanks for getting back to me on this, I've opened up a ticket and within minutes had someone go and replace a faulty fan on the server. +1 for lightning fast support, couldn't have asked for more.
Temps have seen a ~3°C improvement, more or less. Still running a little warm, but haven't gotten into the Warning Temp Time at all, so may just be a case of these Toshibas running a little warmer than other drives. Will continue to monitor it nonetheless, but at least I know the fans are working.
Thanks for all the help guys, really appreciate it.
@integritly
FWIW: I'm just looking at a problem with a VPS that is based on a Hetzner NVMe dedi. The disk results are worse than awful, considerably slower than rusting spindle results.
Writing speed on those Hetzner "NVMes" is in the range of 20 to 45 MB/s.
Interesting, the results I'm getting are much higher than that, but still intermittently dipping during some tests. 20-45 MB/s seems extremely low though, especially for being on NVMe storage. I'd definitely have them or your provider look into it.
This is on a CX11 NVMe:
I didn't do this test repeatedly but can imagine a virtualized server having ups and downs hitting 20-45 MB/s at busy times.