★ VirMach ★ RYZEN ★ NVMe ★★ $8.88/YR- 384MB ★★ $21.85/YR- 2.5GB ★ Instant ★ Japan Pre-order ★ & More

joekerr · March 2022

@xmd5 said:
Tokyo bought on the 12th is still waiting.
Invoice #1399478, Please help, thanks.
@VirMach

I ordered on the 12th and mine is pending too.

bark · March 2022

@VirMach

I requested the change last week, it may have gotten lost in the fog here.

Please change Amsterdam to Tokyo.

Invoice #1398758
Invoice Date 03/12/2022 Paid

At the moment it's just listed as 'pending'. (Limbo). So no work order.

qwerttaa · March 2022

@qwerttaa said:
i cant login now...help, pls....

ticket #552008

I still cannot receive the two-step verification email, so I can't log in to my account.
Please help, @VirMach. Ticket #552008

VirMach · March 2022

TYOC040 is facing issues related to the disks. Originally I thought this looked similar to a throttling issue I had seen in the past during stress testing. While that may still be a factor, there are definitely other factors at play because the idle temperatures are very low.

The SolusVM bug I mentioned previously definitely seems to also exist, as some VMs did not properly create in the first place, and these were not related to any particular disk but instead in specific bursts, as it would normally occur in the past before we patched the issue with a manual fix. We'll still look into that as well.

SMART also looks good, and the node did not run into any issues before the mass deployment.

SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        21 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    0%
Data Units Read:                    576 [294 MB]
Data Units Written:                 40 [20.4 MB]
Host Read Commands:                 5,865
Host Write Commands:                372
Controller Busy Time:               0
Power Cycles:                       22
Power On Hours:                     21
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               21 Celsius
Temperature Sensor 2:               16 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
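
For anyone who wants to run the same kind of check on their own drives, here is a minimal sketch that reads a SMART/Health text block like the one above and flags the obvious indicators. The function names and the three checks are my own assumptions for illustration, not anything VirMach runs, and as the post notes, a clean SMART log does not rule out a drive or firmware problem.

```python
import re

# A few lines copied from the SMART/Health output in the post.
SAMPLE = """\
Critical Warning:                   0x00
Temperature:                        21 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    0%
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
"""


def parse_smart(text):
    """Turn 'Key:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Read the leading integer of a field such as '100%' or '5,865'."""
    match = re.match(r"[\d,]+", value)
    return int(match.group().replace(",", ""))


def health_flags(fields):
    """Return a list of human-readable warnings (hypothetical checks)."""
    flags = []
    if int(fields["Critical Warning"], 16) != 0:
        flags.append("critical warning set")
    spare = leading_int(fields["Available Spare"])
    threshold = leading_int(fields["Available Spare Threshold"])
    if spare <= threshold:
        flags.append("available spare at or below threshold")
    if leading_int(fields["Media and Data Integrity Errors"]) > 0:
        flags.append("media/data integrity errors logged")
    return flags


if __name__ == "__main__":
    flags = health_flags(parse_smart(SAMPLE))
    print(flags or "no SMART flags raised")
```

On the values posted above this raises no flags, which matches the "SMART also looks good" observation.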

If certain operating systems were highly incompatible with Ryzen, something like this could theoretically occur. However, that would mean the problem should also appear on our Gen3 nodes, which is not the case.

In addition, the NVMe SSDs facing the problem are from the same manufacturer, and based on my research it seems that this may be a potential issue related to certain manufacturers, Gen4, and Linux virtualization. So it could be a potential software/hardware clash. Luckily, these represent a very small quantity of all of our NVMe SSDs so if that is the case, even if no solution is quickly discovered, it would only account for maybe 5% of all the NVMe SSDs we have, and we have excess NVMe SSDs so replacement would not be an issue.

The next step, I think, is to deploy the other Tokyo node that is also using some of these drives, and run some VMs/installations in testing to see if the same problem shows up.

I also have a contact at the manufacturer so I can see if this is a common issue that perhaps a firmware update can fix.

@LiliLabs said:

@VirMach said:
I'm actually looking into this now. It's not all the drives, one of them may be acting up though. That would explain why some are offline and some are online.

Hate adding to the spam, but I just realized that I'm also on TYOC040 and that's why my VM isn't booting. If only one drive is acting up, does that mean the hypervisors aren't using raid?

This was previously discussed on the old 2018 thread, but you are right that we're not utilizing RAID, and there are a few reasons for that. I'll try to cover it very quickly but if you have any further questions or concerns, let me know.

NVMe SSDs don't really play well with software RAID, and they also lose a lot of performance when it comes to hardware RAID. Plus, there are barely any options when it comes to hardware RAID, and they are extremely expensive. There are some controllers in between software and hardware, but those went up in price as well after COVID, and we still have concerns with the drivers and the device itself: like an NVMe SSD, it's just a device with DRAM and a controller, connected via PCIe, so if it fails we were worried about being able to restore it properly.

Proper hardware RAID10 would cost $2,000 for 4TB of usable space and would not really deliver any greater performance, due to the limitations of the controller and the added layer that puts the NVMe further away from the CPU. I do not believe any affordable provider offers hardware RAID.

Instead we decided not to go with RAID and as a result:

We have a large HDD per server that does more frequent backups on top of any external disaster recovery backups.
We put in the same amount of disk we would have put in anyway, but now customers get double the amount of usable space.
Customers get the unadulterated NVMe performance. Even if RAID can be slightly faster, it's not really faster "per drive", so all in all RAID would mean everyone can burst less.
If we lose a drive, we only lose that drive. We don't potentially lose the entire array if something goes wrong with it (which has happened).
We are able to get back up and running much more quickly than after a RAID controller failure, and we eliminate potentially catastrophic human error by DC hands when trying to restore an array.
Those on unaffected drives have the potential to continue using their service.
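
To make the capacity trade-off above concrete, here is a minimal sketch. The drive count and drive size are hypothetical, picked only so the RAID10 case lines up with the 4TB usable figure mentioned earlier; the even spread of VMs across drives is also an assumption.

```python
def usable_tb(drive_count, drive_tb, layout):
    """Usable capacity in TB for a set of identical drives.

    layout is "raid10" (mirrored pairs, half the raw space) or
    "independent" (no RAID, every drive exposed on its own).
    """
    raw = drive_count * drive_tb
    if layout == "raid10":
        return raw / 2
    if layout == "independent":
        return raw
    raise ValueError(f"unknown layout: {layout}")


def share_hit_by_one_failure(drive_count):
    """Fraction of VMs on the failed drive when drives are independent,
    assuming VMs are spread evenly across drives (an assumption)."""
    return 1 / drive_count


if __name__ == "__main__":
    drives, size_tb = 4, 2.0  # hypothetical node: 4 x 2TB NVMe
    print("RAID10 usable TB:     ", usable_tb(drives, size_tb, "raid10"))
    print("Independent usable TB:", usable_tb(drives, size_tb, "independent"))
    print("Share of VMs hit by one drive failure:",
          share_hit_by_one_failure(drives))
```

The flip side is that RAID10 would ride through a single drive failure with no downtime, while the independent layout relies on the per-server backups to restore the VMs on the failed drive.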

hhwpo · March 2022

@VirMach I'm in the first batch of users who ordered a VPS in Tokyo. Can you put me on an AMD 5950X node and speed up the provisioning? Invoice 1398578, thank you.

tototo · March 2022

I do not plan to use my Tokyo VPS in production, so I would be happy to help test it.

shkong · March 2022

@VirMach Hello, I would like to ask: my order was flagged by the system for manual review, but no staff member has responded yet. I purchased a server on the Tokyo node on March 12, and it has been more than two weeks. Will this affect the activation of the server?

riofredinand · March 2022

@VirMach said:
TYOC040 is facing issues related to the disks. Originally I thought this looked similar to a throttling issue I had seen in the past during stress testing. While that may still be a factor, there are definitely other factors at play because the idle temperatures are very low.

The SolusVM bug I mentioned previously definitely seems to also exist, as some VMs did not properly create in the first place, and these were not related to any particular disk but instead in specific bursts, as it would normally occur in the past before we patched the issue with a manual fix. We'll still look into that as well.

SMART also looks good, and the node did not run into any issues before the mass deployment.

SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        21 Celsius
Available Spare:                    100%
Available Spare Threshold:          25%
Percentage Used:                    0%
Data Units Read:                    576 [294 MB]
Data Units Written:                 40 [20.4 MB]
Host Read Commands:                 5,865
Host Write Commands:                372
Controller Busy Time:               0
Power Cycles:                       22
Power On Hours:                     21
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               21 Celsius
Temperature Sensor 2:               16 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

If certain operating systems were highly incompatible with Ryzen, something like this could theoretically occur. However, that would mean the problem should also appear on our Gen3 nodes, which is not the case.

In addition, the NVMe SSDs facing the problem are from the same manufacturer, and based on my research it seems that this may be a potential issue related to certain manufacturers, Gen4, and Linux virtualization. So it could be a potential software/hardware clash. Luckily, these represent a very small quantity of all of our NVMe SSDs so if that is the case, even if no solution is quickly discovered, it would only account for maybe 5% of all the NVMe SSDs we have, and we have excess NVMe SSDs so replacement would not be an issue.

The next step, I think, is to deploy the other Tokyo node that is also using some of these drives, and run some VMs/installations in testing to see if the same problem shows up.

I also have a contact at the manufacturer so I can see if this is a common issue that perhaps a firmware update can fix.

@LiliLabs said:

@VirMach said:
I'm actually looking into this now. It's not all the drives, one of them may be acting up though. That would explain why some are offline and some are online.

Hate adding to the spam, but I just realized that I'm also on TYOC040 and that's why my VM isn't booting. If only one drive is acting up, does that mean the hypervisors aren't using raid?

This was previously discussed on the old 2018 thread, but you are right that we're not utilizing RAID, and there are a few reasons for that. I'll try to cover it very quickly but if you have any further questions or concerns, let me know.

NVMe SSDs don't really play well with software RAID, and they also lose a lot of performance when it comes to hardware RAID. Plus, there are barely any options when it comes to hardware RAID, and they are extremely expensive. There are some controllers in between software and hardware, but those went up in price as well after COVID, and we still have concerns with the drivers and the device itself: like an NVMe SSD, it's just a device with DRAM and a controller, connected via PCIe, so if it fails we were worried about being able to restore it properly.

Proper hardware RAID10 would cost $2,000 for 4TB of usable space and would not really deliver any greater performance, due to the limitations of the controller and the added layer that puts the NVMe further away from the CPU. I do not believe any affordable provider offers hardware RAID.

Instead we decided not to go with RAID and as a result:

We have a large HDD per server that does more frequent backups on top of any external disaster recovery backups.

We put in the same amount of disk we would have put in anyway, but now customers get double the amount of usable space.

Customers get the unadulterated NVMe performance. Even if RAID can be slightly faster, it's not really faster "per drive", so all in all RAID would mean everyone can burst less.

If we lose a drive, we only lose that drive. We don't potentially lose the entire array if something goes wrong with it (which has happened).

We are able to get back up and running much more quickly than after a RAID controller failure, and we eliminate potentially catastrophic human error by DC hands when trying to restore an array.

Those on unaffected drives have the potential to continue using their service.

RAID is not necessary for NVMe SSDs; hot backups are enough. Hoping the problem will be solved soon.

hehuangCCs · March 2022

To avoid excessive stress testing of the MJJ, I recommend that you buy a passive heatsink for NVME.

Cheung · March 2022

@VirMach Can I participate in the early test of Tokyo VPS?
Order #554455
Invoice #1398343
Thanks

fluffernutter · March 2022

@VirMach Sounds good, thank you for the explanation! Curious: what brand was causing you trouble? We've always had fantastic luck with Kioxia drives.

elliotc · March 2022

Is it just me or everyone?

nick_ · March 2022

@elliotc said:

Is it just me or everyone?

So is mine.

stingeo · March 2022

@elliotc said:

Is it just me or everyone?

you're not alone

noisycode · March 2022

@elliotc said:

Is it just me or everyone?

not just you, yet not everyone, maybe 50% of users.

ben47955 · March 2022

@hehuangCCs said:
To avoid excessive stress testing of the MJJ, I recommend that you buy a passive heatsink for NVME.

Do you know what a 1U looks like?

matheny · March 2022

@ben47955 said:

@hehuangCCs said:
To avoid excessive stress testing of the MJJ, I recommend that you buy a passive heatsink for NVME.

Do you know what a 1U looks like?

Kalm down he's just joking lol

qianiqan · March 2022

Invoice #1400102

fluffernutter · March 2022

@hehuangCCs said:

Not enough, you need something like this.

Astro · March 2022

@VirMach any update on the spin the wheel ryzen upgrade winners?

VirMach · March 2022

@LiliLabs said:
@VirMach Sounds good, thank you for the explanation! Curious: what brand was causing you trouble? We've always had fantastic luck with Kioxia drives.

@LiliLabs said:

@hehuangCCs said:

Not enough, you need something like this.

We actually have some of those in the office, no joke, I was considering them at some point. It may be going in one of the storage nodes.

This one's a little sad and bent:

DanSummer · March 2022

@VirMach said:

We actually have some of those in the office, no joke, I was considering them at some point. It may be going in one of the storage nodes.

This one's a little sad and bent:

Is that bent from heat?

VirMach · March 2022

@DanSummer said: Is that bent from heat?

Or maybe it gets so cool so fast that it buckles.

VirMach · March 2022

Sorry for the wait guys, new Tokyo node almost ready. I may just create all the ones that didn't create on this one, and migrate anyone on the old node to this one as well if it passes the stability tests.

joekerr · March 2022

@VirMach said:
Sorry for the wait guys, new Tokyo node almost ready. I may just create all the ones that didn't create on this one, and migrate anyone on the old node to this one as well if it passes the stability tests.

ordered on 3/12
Can I get my server today?

VirMach · March 2022

@joekerr said:

@VirMach said:
Sorry for the wait guys, new Tokyo node almost ready. I may just create all the ones that didn't create on this one, and migrate anyone on the old node to this one as well if it passes the stability tests.

ordered on 3/12
Can I get my server today?

A node is running into issues, meaning less total space available, meaning probably not, sorry.

VirMach · March 2022

@tomle said:

@VirMach said:

@TimboJones said: Your heatsink likely isn't rated for the TDP of the CPU or it sucks.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/8

It's funny because the heatsink rated for 95 TDP functions better than those advertised for 105+ but that's a whole other can of worms.

With these rackmounts the fans end up being more important in most cases I've noticed though and even with excessive fans (as in fans all over, even ones that would not fit in normal use) the same problem I'm describing exists. So at this point it doesn't really seem to be a "problem" and more of a "feature" on how these function. As in, AMD wants it to get hot and maximize performance and then cut it off however they want by default unless you do a custom configuration as we've done.

I'm not saying our solution is perfect, I was just providing background on why it seems to be necessary and why I believe 5900X will function better than 5950X in this particular use case (something I assume no one thought of because they didn't intend for it to be used this way, in a rackmount server.)

(edit) By the way, the short version of my 95W TDP heatsink theory: I believe the material and design allow it to dissipate heat in a way that is beneficial for bursts. It's thin aluminum versus bulky copper. But the copper should outperform it in normal non-burst 24x7 operation.

@VirMach
I'd recommend you to look at Eco mode. In my experience, Eco mode doesn't impact single core performance at all but lowers all core performance by less than 10% while using less power and running a lot cooler.

My numbers from a 3900X (yes it's not the same CPU but the effect should be similar):
With Eco Mode: CB20 6523 multi & 505 single core @ 59W
Without Eco Mode: 7166 multi & 505 single core @ 110W

So around 45% less power (=less heat) while losing less than 10% performance. For me and my use case (24x7) this was a given win.

Yes for the motherboards that support it we do it this way. But some boards don't seem to support it, in which case we do manual changes that have similar effects. I haven't had time to look into why the others do not support it.

I do know Ryzen Master also supports it, but we don't use that right now, and AFAIK AMD doesn't support it on Linux. It seems like they just gave up on everything Ryzen-related when it comes to Linux. It may be possible; that's just another thing I haven't really looked into because it hasn't been necessary, since we can modify other configuration and get similar results.
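
For reference, the Eco mode figures quoted above (3900X, Cinebench R20) work out as follows. This is just the arithmetic on the posted numbers; the variable and function names are mine, and points per watt is only one rough way to compare the two modes.

```python
# Figures as quoted in the post (3900X, CB20 multi score and power draw).
eco = {"multi": 6523, "single": 505, "watts": 59}
stock = {"multi": 7166, "single": 505, "watts": 110}


def pct_drop(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def points_per_watt(result):
    """Multi-core score divided by power draw."""
    return result["multi"] / result["watts"]


if __name__ == "__main__":
    print(f"multi-core loss: {pct_drop(stock['multi'], eco['multi']):.1f}%")
    print(f"power saved:     {pct_drop(stock['watts'], eco['watts']):.1f}%")
    print(f"points/W stock:  {points_per_watt(stock):.1f}")
    print(f"points/W eco:    {points_per_watt(eco):.1f}")
```

That is roughly a 9% multi-core loss for about 46% less power, with single-core unchanged, which is consistent with the "around 45% less power for less than 10% performance" summary.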

VirMach · March 2022

@niko52 said:
Can I upgrade from 1.5GB RAM to 2.5GB RAM by paying the difference? Would I still get it for $21.85?

No, sorry.

TimboJones · March 2022

@VirMach said:

@TimboJones said: Your heatsink likely isn't rated for the TDP of the CPU or it sucks.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/8

It's funny because the heatsink rated for 95 TDP functions better than those advertised for 105+ but that's a whole other can of worms.

With these rackmounts the fans end up being more important in most cases I've noticed though and even with excessive fans (as in fans all over, even ones that would not fit in normal use) the same problem I'm describing exists. So at this point it doesn't really seem to be a "problem" and more of a "feature" on how these function. As in, AMD wants it to get hot and maximize performance and then cut it off however they want by default unless you do a custom configuration as we've done.

There's a difference between throttling down below base frequency due to inadequate heat sinking and having a limited-time boost window ABOVE the base frequency. I don't call coming down from the boost frequency throttling, and I don't think others do, either.

So if you're actually throttling below base frequency, something ain't right.
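
The distinction above can be written down as a simple rule. This is only a sketch of the terminology being argued about, not a monitoring tool; the function name and labels are made up, and the 3.4 GHz base and 4.9 GHz max boost used in the example are AMD's published figures for the 5950X as far as I know.

```python
def classify_clock(freq_mhz, base_mhz, max_boost_mhz):
    """Label an observed core clock relative to base and max boost.

    Below base counts as throttling; anything from base up to max
    boost is ordinary boost behaviour, even if the clock has come
    down from its peak.
    """
    if freq_mhz < base_mhz:
        return "throttling (below base)"
    if freq_mhz < max_boost_mhz:
        return "within boost range"
    return "at max boost"


if __name__ == "__main__":
    base, boost = 3400, 4900  # assumed 5950X spec, in MHz
    for observed in (2900, 3400, 4200, 4900):
        print(observed, "MHz ->", classify_clock(observed, base, boost))
```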

VirMach · March 2022

@TimboJones said:

@VirMach said:

@TimboJones said: Your heatsink likely isn't rated for the TDP of the CPU or it sucks.

https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/8

It's funny because the heatsink rated for 95 TDP functions better than those advertised for 105+ but that's a whole other can of worms.

With these rackmounts the fans end up being more important in most cases I've noticed though and even with excessive fans (as in fans all over, even ones that would not fit in normal use) the same problem I'm describing exists. So at this point it doesn't really seem to be a "problem" and more of a "feature" on how these function. As in, AMD wants it to get hot and maximize performance and then cut it off however they want by default unless you do a custom configuration as we've done.

There's a difference between throttling down below base frequency due to inadequate heat sinking and having a limited-time boost window ABOVE the base frequency. I don't call coming down from the boost frequency throttling, and I don't think others do, either.

So if you're actually throttling below base frequency, something ain't right.

I think something important to note is that my testing is extreme, to make sure it covers the absolute worst-case scenario. In these cases, that means the node was purposefully overloaded to a situation similar to a node about to crash, with load in at least the hundreds. The only time this would come close to being replicated is if something is seriously messed up, and maybe this does paint an incorrect picture.

But the 3000 series can put up with this better, perhaps because it's never allowed or designed to reach the performance level of the 5000 series.

What I can assure you is that in these tests, no one would be able to bend the laws of physics any further. With everything else kept exactly the same between the 3000 and 5000 series, down to the thermal paste used, there isn't any 1U chassis that could accommodate the level of cooling I provided to try to alleviate the throttling in these tests, even if it's up to spec. Yes, perhaps moving the 5000 series to 2U could help a little when it comes specifically to the heatsink and fans (and nothing else), but these processors are advertised at the same TDP, and in many benchmarks the 5000 series may actually stay cooler. As in, the temperatures definitely do seem to be lower in some of these burst scenarios. And to add to that, there will be less throttling if the cooling is extreme. However, when it comes to 24x7 operation, the 5000 series definitely has more periods where it throttles below the frequency of a similar 3000 series (and it also has more periods where it goes much higher above the base frequency).

Trust me, I've done more testing than I care to admit and ruled out pretty much everything else. I've used different boards and heatsinks, changed airflow, changed chassis, looked into different static pressure fans and physical configurations of the air shrouds, used processors from different batches, modified the ambient temperature, and even tested these on extreme liquid cooling, and the same pattern of differences remains.

I'm not saying that the 5000 series cannot be used, only that I have some concerns, especially when it comes to 5000 series versus 3000 series or 5950X versus 5900X.
