OVH Advance On Die ECC

nik Member, Host Rep

Hi all,

I looked at the new OVH Advance line with EPYC 4004. Do they really only use On Die ECC and not regular ECC? Does someone have a new Advance server that could check which RAM Module is being used?

Thanks a lot

Comments

  • MechanicWeb Member, Patron Provider
    edited September 2024

    If they do provide on-die ECC with the Advance line, this is a good move on their part. It will help debunk the many myths and fear mongering surrounding ECC RAM.

    ECC has its advantages. But do not shy away from on-die ECC unless your workload is mission-critical or heavily database-oriented. It does fine for typical hosting workloads.

  • nik Member, Host Rep

    It would be solely for databases, which is why I am asking this question in the first place. And I also disagree about the myths and fear mongering for regular workloads. I have had several kernel panics with non-ECC hardware that resulted in a reboot (downtime). Yes, the chances are low, but we are running 1000s of VMs.
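The "low chance, but thousands of machines" point can be put in numbers. This is a minimal sketch that assumes independent machines and a made-up per-machine probability; `P_PER_MACHINE` is a hypothetical placeholder, not a measured error rate for any RAM type.

```python
def prob_at_least_one(p_per_machine: float, machines: int) -> float:
    """Probability that at least one of `machines` independent machines
    sees the event, given a per-machine probability `p_per_machine`."""
    return 1.0 - (1.0 - p_per_machine) ** machines


# Hypothetical yearly probability of a memory-induced crash on one machine.
# This value is an assumption chosen only to illustrate the scaling.
P_PER_MACHINE = 0.001

for n in (1, 100, 1000, 5000):
    print(f"{n:>5} machines: {prob_at_least_one(P_PER_MACHINE, n):.1%}")
```

Whatever the real per-machine figure is, the fleet-wide probability grows quickly with the machine count, which is why an operator with a large fleet can see events that a single-server user never does.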

  • MechanicWeb Member, Patron Provider
    edited September 2024

    @nik said: Yes, the chances are low, but we are running 1000s of VMs.

    I haven't tested this scenario, and I don't know anyone who runs 1000 VMs on non-ECC. So I can't comment on it, other than to say that virtualization is not a regular hosting workload.

    But I have a hunch that you were using either DDR3 or early DDR4.

    With latest DDR4, and subsequently on DDR5, you probably haven't noticed any single issue solely because of non-ECC RAM. I am not aware of a single issue. But I am still looking for one.

  • I've got a 7945HX w/ a Crucial 96G (48GBx2) DDR5-5600 SODIMM Kit. It has ON-DIE ECC and memory's been extremely stable for me so far. I'm running PROXMOX on it w/ a bunch of VMs + (casaos, virtualized UNRAID, etc).

    I had some initial stability issues, but those were unrelated to the RAM. It's been up for 102 days since the last BIOS update. I did run stress-ng on the RAM and CPU, and everything looks good so far.

  • crunchbits Member, Patron Provider, Top Host

    @nik said:
    Hi all,

    I looked at the new OVH Advance line with EPYC 4004. Do they really only use On Die ECC and not regular ECC? Does someone have a new Advance server that could check which RAM Module is being used?

    Thanks a lot

    On-die "ECC" isn't exactly the same thing as transfer ECC. It's kind of a marketing gimmick (not necessarily on OVH's part, but the manufacturers). It is not the same thing as full/transfer ECC, and if you need full ECC RAM there are specific (much more costly) sticks that have this. We run both, and for our internal/hypervisor builds it's always 100% of the time full ECC variant.

    On-die ECC is just checking for errors in the data at rest on the RAM itself, not during/after transit to CPU. It's still a good thing to have. It's likely that the reason all DDR5 sticks have on-die ECC is because the higher clock speeds and tighter timings basically require it. Pushing that kind of performance and density without would likely result in significantly more errors, so JEDEC standard requirement is on-die ECC for all DDR5. Of course now it gets marketed as "ECC RAM" and is very confusing/misleading imho. It is absolutely not the same thing.
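The at-rest versus in-transit distinction can be illustrated with a toy model. This sketch uses a Hamming(7,4) code purely as a stand-in; real DRAM uses much wider codes, and the function names here are hypothetical. It shows a flip in the stored cell being corrected by the chip, a flip on the bus going unnoticed when only on-die ECC exists, and the same bus flip being corrected when the check bits travel with the data to the memory controller.

```python
def hamming74_encode(d):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit. Returns (data_bits, flipped_position),
    where position 0 means no error was detected."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1        # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos


def on_die_read(stored_codeword):
    """Toy model of on-die ECC: the chip corrects internally and puts
    only the bare data bits on the bus."""
    data_bits, _ = hamming74_decode(stored_codeword)
    return data_bits


data = [1, 0, 1, 1]

# Case 1: on-die ECC only.
cell = hamming74_encode(data)
cell[4] ^= 1                       # bit flip while the data sits in the cell
bus_word = on_die_read(cell)
at_rest_fixed = bus_word == data   # the chip repaired the at-rest error
bus_word[2] ^= 1                   # bit flip on the way to the CPU
on_die_only_result = bus_word      # wrong, and nothing can detect it

# Case 2: check bits travel with the data and are verified at the controller.
wire = hamming74_encode(data)
wire[5] ^= 1                       # bit flip on the way to the CPU
side_band_result, fixed_pos = hamming74_decode(wire)

print("at-rest flip fixed by on-die ECC:", at_rest_fixed)
print("on-die only, after bus flip:", on_die_only_result, "expected", data)
print("controller-side check, after bus flip:", side_band_result)
```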

    @MechanicWeb said:
    But I have a hunch that you were using either DDR3 or early DDR4.

    With latest DDR4, and subsequently on DDR5, you probably haven't noticed any single issue solely because of non-ECC RAM. I am not aware of a single issue. But I am still looking for one.

    I don't run any DDR3, and I don't know what you'd consider "early" DDR4 but we've absolutely had DDR4 ECC catch and fix errors, multiple times. I will say for years I had never seen anything, but once we hit a certain number of servers (+ time operating them) it's happened a few times. Would the errors they corrected have been noticeable? Unsure, but I see no reason not to run it for a production environment. Also nice that you get better diagnostics, early failure warnings, etc via IPMI.
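The comment above mentions getting corrected-error diagnostics via IPMI. As a rough alternative sketch, and not the IPMI route itself, a Linux host with an EDAC driver loaded exposes per-memory-controller counters under sysfs. The code assumes the `ce_count` and `ue_count` files exist for each controller; on hosts without ECC or without an EDAC driver, the directory is simply absent and the function returns nothing.

```python
from pathlib import Path

# Standard location of the Linux EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: {"corrected": n, "uncorrected": n}}.
    Returns an empty dict when no EDAC controllers are present."""
    counts = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


found = read_edac_counts()
if not found:
    print("No EDAC memory controllers found on this host.")
for name, c in found.items():
    print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```

A rising corrected count on one controller is the kind of early failure warning described above.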

  • MechanicWeb Member, Patron Provider
    edited September 2024

    @crunchbits said: we've absolutely had DDR4 ECC catch and fix errors, multiple times. I will say for years I had never seen anything, but once we hit a certain number of servers (+ time operating them) it's happened a few times. Would the errors they corrected have been noticeable? Unsure, but I see no reason not to run it for a production environment. Also nice that you get better diagnostics, early failure warnings, etc via IPMI.

    What you said is one of the benefits of ECC. That is, without ECC, you cannot monitor memory errors getting fixed. That is absolutely correct.

    But it does not necessarily mean, as you noted, that the error would have resulted in data corruption or a server crash. That's what I am saying.

    If it were true that modern non-ECC RAM results in data corruption, you would have millions of such incidents, as basically all office computers are non-ECC.

    Try searching on Google. There are almost no incidents. A complete lack of evidence == it doesn't happen in the real world.

    I can say this with confidence because I know quite a few server providers that have been running non-ECC RAM for half a decade now. Besides that, I have been searching for a server crash caused solely by non-ECC RAM for several years now. Everyone argued like you did, based on assumption; none could actually present an example of a crash or data loss. There is more to it, too. If you read research papers on non-ECC and ECC RAM, you will see why there is such a lack of evidence.

    You, too, can try using non-ECC to see for yourself. Otherwise, you would be assuming based on decades-old theory and hardware.

    That is not to say you should use non-ECC for mission-critical applications. You absolutely should not.

  • nik Member, Host Rep

    Back to my question: does anyone have an EPYC 4004 OVH server and can post the exact memory DIMMs being used?
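For anyone with access to such a server, one way to answer this is to inspect the output of `dmidecode -t memory` (run as root). The sketch below parses that output and compares each DIMM's "Total Width" with its "Data Width": a module carrying extra check bits reports a total width larger than its data width, while a module with on-die ECC only reports the two as equal. The helper names are hypothetical, and the parsing assumes the usual blank-line-separated `Key: Value` layout of dmidecode.

```python
import re
import subprocess


def parse_memory_devices(dmidecode_text):
    """Split `dmidecode -t memory` output into one dict per Memory Device."""
    devices = []
    for block in dmidecode_text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the Physical Memory Array and other records
        fields = {}
        for line in lines:
            key, sep, value = line.partition(": ")
            if sep:
                fields[key] = value.strip()
        devices.append(fields)
    return devices


def width_bits(value):
    """Parse a value such as '72 bits'; return None for 'Unknown' or empty."""
    m = re.match(r"(\d+) bits", value or "")
    return int(m.group(1)) if m else None


def has_side_band_ecc(device):
    """True if the DIMM carries extra check bits, False if it does not,
    None if the widths are not reported (e.g. an empty slot)."""
    total = width_bits(device.get("Total Width"))
    data = width_bits(device.get("Data Width"))
    if total is None or data is None:
        return None
    return total > data


def report_local_dimms():
    """Run dmidecode on this machine (needs root) and print a summary."""
    out = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout
    for dev in parse_memory_devices(out):
        print(dev.get("Locator"), dev.get("Part Number"),
              dev.get("Total Width"), dev.get("Data Width"),
              "side-band ECC:", has_side_band_ecc(dev))
```

Calling `report_local_dimms()` as root prints the part number of each module, which can then be looked up on the manufacturer's site to confirm the exact DIMM type.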

  • In the real world, use a stable memory frequency (no overclocking, XMP, or the like), keep the heat well dissipated, and even non-ECC memory will run well.

    That said, I also think it's best to use non-ECC memory where it fits best.

    For example, I use it for error-correcting MinIO clusters that mainly store images or videos, where an occasional bit flip or two is usually harmless or even unnoticeable.

    For important business workloads, it's best to avoid such risks, whether or not bit flips actually happen.

    DDR4 ECC memory doesn't add much cost, especially relative to critical business operations.
