OVH Advance On Die ECC

nik · September 2024

Hi all,

I looked at the new OVH Advance line with EPYC 4004. Do they really only use On Die ECC and not regular ECC? Does someone have a new Advance server that could check which RAM Module is being used?

Thanks a lot

MechanicWeb · September 2024

If they do provide on-die ECC with Advance, this is a good move on their part. It will help debunk the many myths and fear mongering surrounding ECC RAM.

Full ECC has its advantages. But do not shy away from on-die ECC unless your workload is mission critical or heavily database oriented. It does fine for typical hosting workloads.

nik · September 2024

It would be solely for databases, which is why I asked this question in the first place. I also disagree that it is all myths and fear mongering for regular workloads. I have had several kernel panics with non-ECC hardware that resulted in a reboot (downtime). Yes, the chances are low, but we are running 1000s of VMs.
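To put "the chances are low, but we are running 1000s of VMs" into numbers, here is a minimal sketch of how a small per-host chance adds up across a fleet. The rate used in the example is purely hypothetical (not a measured error rate for any RAM type), and the function name is made up for illustration. It also assumes hosts fail independently.

```python
def p_any_event(p_single: float, n_hosts: int) -> float:
    """Chance that at least one of n_hosts independent hosts sees an event,
    given each host has probability p_single over the same period."""
    return 1.0 - (1.0 - p_single) ** n_hosts


if __name__ == "__main__":
    # Hypothetical rate: 0.1% chance per host per year. Not a measured value.
    for hosts in (1, 100, 1000):
        print(hosts, round(p_any_event(0.001, hosts), 3))
```

With that assumed 0.1% per host, a single host almost never sees a problem, while a fleet of 1000 has roughly a 63% chance of seeing at least one. The real per-host rate is exactly what the thread is arguing about.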

MechanicWeb · September 2024

@nik said: Yes, the chances are low, but we are running 1000s of VMs.

I haven't tested this scenario, nor do I know anyone who runs 1000 VMs on non-ECC. So I can't comment on it, other than to say that virtualization is not a regular hosting workload.

But I have a hunch that you were using either DDR3 or early DDR4.

With the latest DDR4, and subsequently with DDR5, you probably haven't noticed a single issue caused solely by non-ECC RAM. I am not aware of a single one. But I am still looking.

stxsh · September 2024

I've got a 7945HX w/ a Crucial 96GB (2x48GB) DDR5-5600 SODIMM kit. It has on-die ECC and memory's been extremely stable for me so far. I'm running Proxmox on it w/ a bunch of VMs (CasaOS, virtualized Unraid, etc).

I had some initial stability issues but those were unrelated to the RAM. It's been up for 102 days since the last BIOS update. I did run stress-ng on the RAM and CPU, and everything looks good so far.

crunchbits · September 2024

@nik said:
Hi all,

I looked at the new OVH Advance line with EPYC 4004. Do they really only use On Die ECC and not regular ECC? Does someone have a new Advance server that could check which RAM Module is being used?

Thanks a lot

On-die "ECC" isn't exactly the same thing as transfer ECC. It's kind of a marketing gimmick (not necessarily on OVH's part, but the manufacturers'). If you need full/transfer ECC RAM, there are specific (much more costly) sticks that have it. We run both, and for our internal/hypervisor builds we always use the full ECC variant.

On-die ECC only checks for errors in the data at rest on the RAM itself, not during or after transit to the CPU. It's still a good thing to have. The likely reason all DDR5 sticks have on-die ECC is that the higher clock speeds and tighter timings basically require it. Pushing that kind of performance and density without it would likely result in significantly more errors, so the JEDEC standard requires on-die ECC for all DDR5. Of course, it now gets marketed as "ECC RAM", which is very confusing/misleading imho. It is absolutely not the same thing.
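To make the at-rest vs. in-transit difference concrete, here is a toy sketch using a Hamming(7,4) single-error-correcting code. This is only an illustration: real DIMMs use much wider codes, and the function names and the two "read" models below are assumptions made up for this example, not how any particular memory controller is implemented.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused, positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def decode(bits):
    """Return (nibble, corrected). Fixes at most one flipped bit."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # nonzero syndrome = position of the flip
    corrected = syndrome != 0
    if corrected:
        code[syndrome] ^= 1
    d = [code[3], code[5], code[6], code[7]]
    return sum(b << i for i, b in enumerate(d)), corrected


def flip(bits, index):
    """Return a copy of bits with one bit inverted."""
    out = list(bits)
    out[index] ^= 1
    return out


def read_on_die_only(value, cell_flip=None, bus_flip=None):
    """On-die model: check bits stay inside the chip, raw data crosses the bus."""
    stored = encode(value)
    if cell_flip is not None:
        stored = flip(stored, cell_flip)        # error in the memory cells
    fixed, _ = decode(stored)                   # corrected inside the DRAM chip
    raw = [(fixed >> i) & 1 for i in range(4)]
    if bus_flip is not None:
        raw = flip(raw, bus_flip)               # error in transit, nothing checks it
    return sum(b << i for i, b in enumerate(raw))


def read_side_band(value, cell_flip=None, bus_flip=None):
    """Full/transfer model: check bits travel with the data to the controller."""
    word = encode(value)
    if cell_flip is not None:
        word = flip(word, cell_flip)
    if bus_flip is not None:
        word = flip(word, bus_flip)
    return decode(word)[0]                      # corrected at the memory controller
```

In this toy model, `read_on_die_only(0b1011, cell_flip=2)` still returns the original value, but `read_on_die_only(0b1011, bus_flip=0)` silently returns a different one, while `read_side_band(0b1011, bus_flip=5)` comes back intact.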

@MechanicWeb said:
But I have a hunch that you were using either DDR3 or early DDR4.

With the latest DDR4, and subsequently with DDR5, you probably haven't noticed a single issue caused solely by non-ECC RAM. I am not aware of a single one. But I am still looking.

I don't run any DDR3, and I don't know what you'd consider "early" DDR4 but we've absolutely had DDR4 ECC catch and fix errors, multiple times. I will say for years I had never seen anything, but once we hit a certain number of servers (+ time operating them) it's happened a few times. Would the errors they corrected have been noticeable? Unsure, but I see no reason not to run it for a production environment. Also nice that you get better diagnostics, early failure warnings, etc via IPMI.
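Besides IPMI, on Linux the corrected/uncorrected counters are usually visible through the kernel's EDAC sysfs tree when an EDAC driver is loaded for the platform. A minimal sketch of reading them, assuming the standard `/sys/devices/system/edac/mc` layout (the function name is made up for this example, and corrections done by on-die ECC generally will not show up here):

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if the EDAC tree is not present.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None      # file missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

An empty result usually means no EDAC driver is loaded, which by itself says nothing about whether the DIMMs are ECC.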

MechanicWeb · September 2024

@crunchbits said: we've absolutely had DDR4 ECC catch and fix errors, multiple times. I will say for years I had never seen anything, but once we hit a certain number of servers (+ time operating them) it's happened a few times. Would the errors they corrected have been noticeable? Unsure, but I see no reason not to run it for a production environment. Also nice that you get better diagnostics, early failure warnings, etc via IPMI.

What you said is one of the benefits of ECC. That is, without ECC you cannot monitor memory errors being corrected. That is absolutely correct.

But it does not necessarily mean, as you noted, that the error would have resulted in data corruption or a server crash. That's what I am saying.

If it were true that modern non-ECC RAM results in data corruption, you would have millions of such incidents, as basically all office computers are non-ECC.

Try searching on Google. There are almost zero incidents. A complete lack of evidence == it doesn't happen in the real world.

I can say this with confidence because I know quite a few server providers that have been running non-ECC RAM for half a decade now. Besides that, I have been searching for a server crash caused solely by non-ECC RAM for several years now. Everyone argued like you did, based on assumption; none could actually present an example of a crash or data loss. There is more to it, too. If you read research papers on non-ECC and ECC RAM, you will see why there is such a lack of evidence.

You, too, can try using non-ECC to see for yourself. Otherwise, you would be assuming based on decades-old theory and hardware.

That is not to say you should use non-ECC for mission-critical applications. You absolutely should not.

nik · September 2024

Back to my question: does anyone have an EPYC 4004 OVH server and can post the exact memory DIMMs being used?
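For anyone who has one of these servers, the DIMM details can usually be read with `sudo dmidecode -t memory`. A minimal sketch of pulling out the relevant fields is below. It assumes the usual dmidecode text layout ("Memory Device" blocks with "Key: Value" lines), the function names are made up for this example, and the values are whatever the firmware reports, so treat them as a hint rather than proof. A "Total Width" larger than the "Data Width" suggests the module carries extra ECC bits.

```python
import subprocess


def read_dmidecode():
    """Run dmidecode for memory info (needs root). Returns its text output."""
    return subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout


def parse_memory_devices(text):
    """Return one dict of fields per 'Memory Device' block."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped or stripped.startswith("Handle "):
            current = None              # block ended or a new record starts
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def width_bits(value):
    """Turn '72 bits' into 72. Returns None for 'Unknown' or empty values."""
    parts = value.split()
    return int(parts[0]) if parts and parts[0].isdigit() else None


def has_extra_ecc_bits(device):
    """True if the reported total width exceeds the data width."""
    total = width_bits(device.get("Total Width", ""))
    data = width_bits(device.get("Data Width", ""))
    return total is not None and data is not None and total > data


if __name__ == "__main__":
    for dev in parse_memory_devices(read_dmidecode()):
        print(dev.get("Manufacturer"), dev.get("Part Number"),
              dev.get("Size"), dev.get("Type"), has_extra_ecc_bits(dev))
```

Posting the "Manufacturer" and "Part Number" lines is enough to look up the exact module.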

danblaze · September 2024

In the real world, use a stable memory frequency (don't go overclocking, XMP or whatever), keep the heat well dissipated, and even non-ECC memory will run well.

That said, I also think it's best to match non-ECC memory to its most appropriate use case.

For example, I use it for error-correcting MinIO clusters, mainly storing images or videos, where the occasional bit flip or two is usually harmless or even unnoticeable.

For important business workloads, it's best to avoid such risks, whether or not bit flips actually happen.

DDR4 ECC memory doesn't add much cost, especially relative to critical business operations.
