ASUS + Ryzen + ECC: unstable
This toy computer is made of these parts:
- ASUS PRIME B450M-K II
- Ryzen 7 1700
- Kingston UDIMM ECC 2666 MT/s
Once a day or so, the kernel logs:
- random invalid opcode errors across randomly-picked running processes;
- random SIGBUS across randomly-picked running processes.
Core dumps are also corrupted for extra fun! These errors are well explained by the whole process stack containing random garbage, and the instruction pointer pointing to non-existing addresses! It's a lot of fun!
Of course the EDAC kernel module claims to be working, and no ECC errors are reported.
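For anyone who wants to check the same thing: on Linux the EDAC driver exposes per-controller error counters under /sys/devices/system/edac/mc/. Below is a minimal Python sketch that reads them. The function name `read_edac_counts` is made up for illustration, and it assumes the usual `mcN/ce_count` and `mcN/ue_count` layout. Note that zero counters only mean nothing was reported, not that ECC is proven to work.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per EDAC memory controller.

    "ce" is the corrected-error count, "ue" the uncorrected-error count.
    An empty dict means no EDAC controllers were found under `root`.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # EDAC driver not loaded, or not a Linux host
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for key, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / filename).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this kernel/driver
                entry[key] = None
        counts[mc.name] = entry
    return counts
```

Calling `read_edac_counts()` with no argument reads the live counters; passing a different `root` is only there so the function can be tried against a copy of the directory.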
I have now updated the BIOS to the latest version. The next invalid opcode or SIGBUS and this pile of trash is off to eBay and I'll return to Supermicro + Xeon.
Open to ideas!
Comments
First gen Ryzen was pretty notorious for being picky about memory sticks right?
I have no idea! But on paper, these memory cards should be compatible with the motherboard and with the CPU. Clock, size and type all match what claims to be supported.
As a professional life coach and adult actor, I would still try a new stick of ram.
Gen 1 Ryzen was very unstable with Ram. Was never able to push 3200 until the last few supported bios came out. some people could only get 2800-3000 max.
Have you tried a non-OC speed for the RAM (2666 or lower)?
I haven't enabled any overclock, not intentionally at least. But 80% of the configurable options in this motherboard's BIOS are about overclock and overvolt, and there are so many weird acronyms that I may have left some fuckery going on, like "AI Tune", "Gentle Boost", "Auto Optimize" and the other 100 ones. This BIOS doesn't look like it's meant to be stable. In fact not even the motherboard hardware does. I'm more than convinced that straying from Supermicro was a mistake. But I sent a request for help to ASUS technical support. I'll just wait for an answer. Not hoping for anything.
Extended memtest?
I ran 4 hours of /usr/sbin/memtester in userspace yesterday, didn't catch anything. But if I ran it for a week it would likely crash, as everything seems to crash randomly if run for long enough. The computer is normally loaded 24/7 so eventually I'll discover if this BIOS update made it stable.
Normally XMP is an overclock. Make sure to disable that. Otherwise, if it is running at 2133/2400/2666, then simply try with fewer sticks. If it's failing with even one stick, then either the MOBO, RAM, or CPU is DOA. I hated messing with Zen 1 memory. I gave up after a while.
Didn't know about XMP, I'll check it next time I descend in the basement and plug a monitor into it. Right now thru SSH I see:
and this other tool called lshw apparently claims that the memory's current speed is 2666 MT/s and ECC is enabled:
Normally 2666 is not an overclock. The only other issue could be timings. Also, we could have been looking at RAM this whole time and the RAM could just be, well, fine. I know good tools for Windows but not Linux to test RAM.
Also wonder if it's a launch Zen 1 CPU. https://www.reddit.com/r/Amd/comments/fblhta/psa_if_you_bought_a_ryzen_1000_series_cpu_at/
Holy fuck, the guy from Github even has the same CPU model I have. Looping thru GCC is the method I too discovered for combing hardware defects, in fact for a decade I've had a weekly cronjob that GCC's thru a whole kernel source.
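A rough Python sketch of that idea, for anyone who wants to try it: run the same deterministic command many times and flag any run that crashes or produces different output. This is not the script from the linked post and not my actual cronjob; the function name `run_repeatedly` and the example file `test.c` are made up for illustration.

```python
import hashlib
import subprocess


def run_repeatedly(cmd, runs=10):
    """Run `cmd` `runs` times and return a list of (run_index, reason) anomalies.

    On healthy hardware a deterministic command should exit 0 every time and
    produce byte-identical stdout, so the returned list should be empty.
    """
    anomalies = []
    reference = None  # SHA-256 of stdout from the first successful run
    for i in range(runs):
        proc = subprocess.run(cmd, capture_output=True)
        if proc.returncode != 0:
            # On POSIX a negative status means the process died from a signal
            anomalies.append((i, f"exit status {proc.returncode}"))
            continue
        digest = hashlib.sha256(proc.stdout).hexdigest()
        if reference is None:
            reference = digest
        elif digest != reference:
            anomalies.append((i, "output differs from first successful run"))
    return anomalies
```

For example, `run_repeatedly(["gcc", "-O2", "-S", "-o", "-", "test.c"], runs=1000)` compiles a hypothetical `test.c` to assembly on stdout a thousand times. A single small file is a much lighter load than a full kernel build, so treat a clean result as weak evidence only.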
I'm gonna check if this CPU has some sort of revision number or serial number and if AMD has published a recall notice for it. Gonna spend some time looking into this.
All right thanks @Kevinf100 for that link, I submitted an RMA request to AMD and a tech support request. If nothing comes out of it I'll buy a new CPU.
No problem. Glad the problem was figured out. Very early model you have. Not sure if they'll RMA it tbh, but this problem (I think) hit the first few months of 1700, 1700x, and 1800x (no idea about the other models). Not sure if my own 1700 has this bug. I didn't want to RMA it because it overclocked to 3.9 on 1.4 volts. So it was a good chip.
I'll take a picture of the chip if I should swap it off the motherboard, that will show the manufacturing date / serial number, and if it may be part of the early faulty batches. I'm not aware of a better way to retrieve a Ryzen's serial number, it doesn't seem to be exposed to the OS.
Has AMD ever acknowledged the manufacturing defect? I only found claims on reddit that AMD used to RMA those chips for free during 2017-2020.
I don't remember if they ever publicly did, it's been a while. However, people were 100% getting them RMA'd hassle-free, suggesting at least the RMA team knew about it or didn't want to argue. One comment claimed it was due to SMT and turning it off fixed it. Unsure if true, but hopefully AMD will still replace the chip today.
The interesting exchange I just had with ASUS tech support.
I wrote:
The answer I received from ASUS, in Italian, translated:
I wasn't expecting anything so no expectations were shattered. I'm buying a Ryzen 2xxx in the next couple of days, this might fix the crashes. I still haven't disassembled the computer to confirm that the CPU serial number is flawed. I guess I'll find out when I'll put the new CPU in. Too time consuming otherwise.
You can't run memory tests from userspace. You need to run memtest86 from boot and get at least one pass when suspecting RAM issues.
Why?
man memtester says:
Anyway, memory testing programs have never found any hardware fault for me; the GCC method linked above has been more reliable. Although this one computer seems to have a faulty CPU and the memory might be fine.
So guys I bought an R7 2700; I'll throw away the current R7 1700 @ 40€ as soon as I get the new one in the mail.
Darling, If I am to remember this, I am sorry. Oh, part of TT24 (ja, doubleread
Your remembrance is quite on spot, they say that you start leaving a mark in people's mind when you notice that you're repeating yourself. I believe I played my Supermicro record about 10 times now, so perhaps 5 times is the best amount to avoid a "doubleread" deja vu :)
Food for thought.
If anyone is interested, I replaced the CPU today and took a pic of the serial number printed on the chip. It appears this is a faulty batch, from week 33 of 2017. For what it's worth, there are claims of faulty chips up to week 48 of 2017 for the first generation of Ryzen:
All I know about this CPU is that my running processes were crashing once a day for the most assorted reasons with it mounted on my computer.
This CPU will be on sale for 40€.
Interesting, there's some on eBay with the same bad datecode:
https://www.ebay.co.uk/itm/186875110318
https://www.ebay.co.uk/itm/276816006806
https://www.ebay.co.uk/itm/226514260832
https://www.ebay.co.uk/itm/396123338853
https://www.ebay.co.uk/itm/196897325597
Oh, actually. I just realised the datecode is the second line.
And my Google search also had 2 results explaining the defect, saying the problem chips are before week 25 (1725), so your 1733 should actually be fine - and one of the people said they got a 1733 back from RMA'ing their 1716 chip. Weird.
Anyway, glad you figured out your issue.
@davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.
At least rebooting the machine every now and then is more fun than an idler doing nothing of interest.
Yeah. Might as well give €200, what do you think?
I think it's self-evident that today it's 2025 and that any previous owner of this CPU did not take advantage of the free 3-year warranty from AMD.
Pwned.
Owmhhh that hurts!
No okay, I see that some information is missing about this defect. Which is:
That's because they are ztupid. Especially redditz ztupid... never believe what someone shitposts on reddit if your life depends on it. There is this widespread tradition within the circles of reddit that him who doesn't know the facts, must teach them to others. Especially if the truth about the facts is 1 simple google search away from the professing shitposter.
What I mean is, there are all sorts of inconsistent reports around the web about this defect on the first generation Ryzen cpus. It is often repeated that it is associated with the manufacturing date antecedent week 25 of the serial number, less frequently that it occurs up to week 33, and some reports for up to week 48. Reality is, no one knows if the date code has even any relevance to the likelihood of the cpu being faulty, it's just a made-up supposition that was never confirmed by AMD, yet people have liked to attribute this defect to the date printed on the chip. As for myself, all I know is that with this cpu I get abnormal software failures that I don't get with other hardware, and swapping the cpu for a newer model is just an attempt.
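To make the conflicting claims concrete, here is a small Python sketch that splits a YYWW date code (1733 = week 33 of 2017) and lists which of the reported cutoffs it would fall under. The three cutoffs are only the ones mentioned in this thread, none confirmed by AMD, and the names `parse_date_code` and `claims_covering` are made up for illustration.

```python
# (cutoff_week, inclusive) pairs for 2017, as reported in this thread only:
# "before week 25", "up to week 33", "up to week 48". None confirmed by AMD.
REPORTED_CLAIMS = ((25, False), (33, True), (48, True))


def parse_date_code(code):
    """Split a 4-digit YYWW date code such as "1733" into (year, week)."""
    if len(code) != 4 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected 4 digits (YYWW), got {code!r}")
    year = 2000 + int(code[:2])
    week = int(code[2:])
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in {code!r}")
    return year, week


def claims_covering(code):
    """Return the cutoff weeks whose reported claim would include this chip."""
    year, week = parse_date_code(code)
    if year != 2017:
        # All claims in the thread concern 2017 production
        return []
    covered = []
    for cutoff, inclusive in REPORTED_CLAIMS:
        if week < cutoff or (inclusive and week == cutoff):
            covered.append(cutoff)
    return covered
```

With these assumptions, `claims_covering("1733")` returns `[33, 48]` and `claims_covering("1716")` returns `[25, 33, 48]`, which shows how the same chip is "fine" or "faulty" depending on whose report you pick.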
I kind of understand how you can get in that predicament. I bought a first gen iPod Nano shortly after release in late 2006. I got the recall notice on that in November 2011 about the "possibly exploding battery" - definitely got it, as it's marked as read in my e-mail folder. I guess I was busy with starting a new job, but I didn't do anything.
I was still using the iPod for quite a few years after that, and then stopped. Anyway, last year, 2024, I found my iPod. Gave it a charge and it came back to life, battery still seeming fine, with maybe a very slight bulge on the case, but nothing terrible. Googled how to replace the battery and discovered the whole warranty replacement program and that it'd been running for over 5 years before they decided that there were no 1st gens left. Shame.
It has a new home now - inside a LiPo fire-proof bag outside the house while I decide what to do with it. I think it's too old to run the alternate OS you can get, it's too small to hold an especially useful selection of songs, so it doesn't seem worth buying it a new battery, I don't want to use it in case it explodes at an inopportune moment, but it's also otherwise in pristine condition because it always lived inside a special holder. It even still has the film on the screen (cut off at the screen because the cover also went over the wheel). But I can't even sell it as a collectible as it has my name and e-mail address etched on the back!
@ralf bitcoin fixes that