ASUS + Ryzen + ECC: unstable

davide Member
edited January 7 in General

This toy computer is made of these parts:

  • ASUS PRIME B450M-K II
  • Ryzen 7 1700
  • Kingston UDIMM ECC 2666 MT/s

Once a day or so, the kernel logs:

  • invalid opcodes across randomly-picked running processes;
  • random SIGBUS across randomly-picked running processes.

Core dumps are also corrupted for extra fun! These errors are well explained by the whole process stack containing random garbage, and the instruction pointer pointing to non-existent addresses! It's a lot of fun!

Of course the EDAC kernel module claims to be working, and no ECC errors are reported.
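One way to double-check the "no ECC errors" observation is to read the EDAC counters directly from sysfs. The following is a minimal sketch, assuming the kernel exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/`; the function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (corrected, uncorrected)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (corrected, uncorrected) in counts.items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

Counters that stay at zero while processes keep crashing would be consistent with a fault outside the DRAM path, although they cannot prove it.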

Now I updated the bios to the latest version. The next invalid opcode or SIGBUS and this pile of trash is off to ebay and I'll return to Supermicro + Xeon.

Open to ideas!

Comments

  • Rubben Member

    First gen Ryzen was pretty notorious for being picky about memory sticks right?

  • davide Member
    edited January 7

    @Rubben said:
    First gen Ryzen was pretty notorious for being picky about memory sticks right?

    I have no idea! But on paper, these memory cards should be compatible with the motherboard and with the CPU. Clock, size and type all match what claims to be supported.

  • Rubben Member

    @davide said:

    @Rubben said:
    First gen Ryzen was pretty notorious for being picky about memory sticks right?

    I have no idea! But on paper, these memory cards should be compatible with the motherboard and with the CPU. Clock, size and type all match!

    As a professional life coach and adult actor, I would still try a new stick of ram.

  • @davide said:

    @Rubben said:
    First gen Ryzen was pretty notorious for being picky about memory sticks right?

    I have no idea! But on paper, these memory cards should be compatible with the motherboard and with the CPU. Clock, size and type all match what claims to be supported.

    Gen 1 Ryzen was very unstable with RAM. I was never able to push 3200 until the last few supported BIOS versions came out; some people could only get 2800-3000 max.
    Have you tried a non-OC speed for the RAM (2666 or lower)?

  • davide Member
    edited January 7

    @Kevinf100 said:
    Gen 1 Ryzen was very unstable with RAM. I was never able to push 3200 until the last few supported BIOS versions came out; some people could only get 2800-3000 max.
    Have you tried a non-OC speed for the RAM (2666 or lower)?

    I haven't enabled any overclock, not intentionally at least. But 80% of the configurable options in this motherboard's BIOS are about overclocking and overvolting, and there are so many weird acronyms that I may have left some fuckery going on, like "AI Tune", "Gentle Boost", "Auto Optimize" and the other 100 ones. This BIOS doesn't look like it's meant to be stable. In fact not even the motherboard hardware does. I'm more than convinced that straying from Supermicro was a mistake. But I sent a request for help to ASUS technical support. I'll just wait for an answer. Not hoping for anything.

  • Extended memtest?

  • davide Member

    @CyberneticTitan said:
    Extended memtest?

    I ran 4 hours of /usr/sbin/memtester in userspace yesterday, and it didn't catch anything. But if I ran it for a week it would likely crash, as everything seems to crash randomly if run for long enough. The computer is normally loaded 24/7 so eventually I'll discover if this BIOS update made it stable.

  • @davide said:

    @Kevinf100 said:
    Gen 1 Ryzen was very unstable with RAM. I was never able to push 3200 until the last few supported BIOS versions came out; some people could only get 2800-3000 max.
    Have you tried a non-OC speed for the RAM (2666 or lower)?

    I haven't enabled any overclock, not intentionally at least. But 80% of the configurable options in this motherboard's BIOS are about overclocking and overvolting, and there are so many weird acronyms that I may have left some fuckery going on, like "AI Tune", "Gentle Boost", "Auto Optimize" and the other 100 ones. This BIOS doesn't look like it's meant to be stable. In fact not even the motherboard hardware does. I'm more than convinced that straying from Supermicro was a mistake. But I sent a request for help to ASUS technical support. I'll just wait for an answer. Not hoping for anything.

    Normally XMP is an overclock. Make sure to disable that. Otherwise, if it is running at 2133/2400/2666, then simply try with fewer sticks. If it's failing with even one stick, then either the MOBO, RAM, or CPU is DOA. I hated messing with Zen 1 memory. I gave up after a while.

    Thanked by 1davide
  • davide Member

    @Kevinf100 said:
    Normally XMP is an overclock. Make sure to disable that. Otherwise, if it is running at 2133/2400/2666, then simply try with fewer sticks. If it's failing with even one stick, then either the MOBO, RAM, or CPU is DOA. I hated messing with Zen 1 memory. I gave up after a while.

    Didn't know about XMP, I'll check it next time I descend into the basement and plug a monitor into it. Right now thru SSH I see:

    # dmidecode
    [...]
    
    Handle 0x002D, DMI type 4, 48 bytes
    Processor Information
            Socket Designation: AM4
            Type: Central Processor
            Family: Zen
            Manufacturer: Advanced Micro Devices, Inc.
            Version: AMD Ryzen 7 1700 Eight-Core Processor
            Voltage: 1.2 V
            External Clock: 100 MHz
            Max Speed: 3750 MHz
            Current Speed: 3000 MHz
            [...]
    
    Handle 0x002F, DMI type 17, 92 bytes
    Memory Device
            Total Width: 128 bits
            Data Width: 64 bits
            Size: 16 GB
            Form Factor: DIMM
            Set: None
            Locator: DIMM_A1
            Bank Locator: BANK 0
            Type: DDR4
            Type Detail: Synchronous Unbuffered (Unregistered)
            Speed: 2666 MT/s
            Manufacturer: Kingston
            Serial Number: 5FB6B1E2
            Asset Tag: Not Specified
            Part Number: 9965745-042.A01G
            Rank: 2
            Configured Memory Speed: 2666 MT/s
            Minimum Voltage: 1.2 V
            Maximum Voltage: 1.2 V
            Configured Voltage: 1.2 V
            Memory Technology: DRAM
            [...]
    

    and this other tool called lshw apparently claims that the memory's current speed is 2666 MT/s and ECC is enabled:

    # lshw -c memory
      *-firmware
           description: BIOS
           vendor: American Megatrends Inc.
           physical id: 0
           version: 4622
           date: 09/26/2024
           size: 64KiB
           capacity: 16MiB
           capabilities: pci apm upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
      *-memory
           description: System Memory
           physical id: 28
           slot: System board or motherboard
           size: 32GiB
           capabilities: ecc
           configuration: errordetection=multi-bit-ecc
         *-bank:0
              description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2666 MHz (0.4 ns)
              product: 9965745-042.A01G
              vendor: Kingston
              physical id: 0
              serial: 5FB6B1E2
              slot: DIMM_A1
              size: 16GiB
              width: 64 bits
              clock: 2666MHz (0.4ns)
         *-bank:1
              description: DIMM DDR4 Synchronous Unbuffered (Unregistered) 2666 MHz (0.4 ns)
              product: 9965745-026.A00G
              vendor: Kingston
              physical id: 1
              serial: 24D87152
              slot: DIMM_B1
              size: 16GiB
              width: 64 bits
              clock: 2666MHz (0.4ns)
      *-cache:0
           description: L1 cache
           physical id: 2a
           slot: L1 - Cache
           size: 768KiB
           capacity: 768KiB
           clock: 1GHz (1.0ns)
           capabilities: pipeline-burst internal write-back unified
           configuration: level=1
      *-cache:1
           description: L2 cache
           physical id: 2b
           slot: L2 - Cache
           size: 4MiB
           capacity: 4MiB
           clock: 1GHz (1.0ns)
           capabilities: pipeline-burst internal write-back unified
           configuration: level=2
      *-cache:2
           description: L3 cache
           physical id: 2c
           slot: L3 - Cache
           size: 16MiB
           capacity: 16MiB
           clock: 1GHz (1.0ns)
           capabilities: pipeline-burst internal write-back unified
           configuration: level=3
    
  • Normally 2666 is not an overclock. The only other issue could be timings. Also, we could be looking at RAM this whole time and the RAM could just be, well, fine. I know good tools for Windows but not Linux to test RAM.
    Also wonder if it's a launch Zen 1 CPU. https://www.reddit.com/r/Amd/comments/fblhta/psa_if_you_bought_a_ryzen_1000_series_cpu_at/

    Thanked by 1davide
  • davide Member
    edited January 7

    @Kevinf100 said:
    Normally 2666 is not an overclock. The only other issue could be timings. Also, we could be looking at RAM this whole time and the RAM could just be, well, fine. I know good tools for Windows but not Linux to test RAM.
    Also wonder if it's a launch Zen 1 CPU. https://www.reddit.com/r/Amd/comments/fblhta/psa_if_you_bought_a_ryzen_1000_series_cpu_at/

    Holy fuck, the guy from Github even has the same CPU model I have. Looping thru GCC is the method I too discovered for combing hardware defects, in fact for a decade I've had a weekly cronjob that GCC's thru a whole kernel source.

    I'm gonna check if this CPU has some sort of revision number or serial number and if AMD has published a recall notice for it. Gonna spend some time looking into this.
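The "loop a compiler and watch for failures" method mentioned above can be sketched roughly as follows. This is only an illustration, not the cronjob or the script from the linked report: the helper name `stress_loop` and the example command are assumptions, and any long, CPU-heavy, deterministic command could be substituted.

```python
import subprocess
import sys


def stress_loop(cmd, iterations):
    """Run cmd repeatedly; return (iteration, returncode) for every failure."""
    failures = []
    for i in range(iterations):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX a negative return code means the process itself died from
        # a signal (for example -11 for SIGSEGV, -7 for SIGBUS on x86 Linux).
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures


# Hypothetical usage, assuming a source file big_file.c exists:
#   failures = stress_loop(["gcc", "-O2", "-c", "big_file.c", "-o", "/dev/null"], 1000)
#   print(failures or "no failures")
```

Because the same input is compiled every time, any run that fails while the others succeed points at the hardware rather than at the source code.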

    Thanked by 1Kevinf100
  • davide Member

    All right thanks @Kevinf100 for that link, I submitted an RMA request to AMD and a tech support request. If nothing comes out of it I'll buy a new CPU.

  • Kevinf100 Member
    edited January 8

    No problem. Glad the problem was figured out. Very early model you have. Not sure if they'll RMA it tbh, but this problem (I think) hit the first few months of 1700, 1700X, and 1800X (no idea about the other models). Not sure if the 1700 I have has this bug. I didn't want to RMA it because it overclocked to 3.9 GHz at 1.4 volts. So it was a good chip.

  • davide Member

    @Kevinf100 said:
    No problem. Glad the problem was figured out. Very early model you have. Not sure if they'll RMA it tbh, but this problem (I think) hit the first few months of 1700, 1700X, and 1800X (no idea about the other models). Not sure if the 1700 I have has this bug. I didn't want to RMA it because it overclocked to 3.9 GHz at 1.4 volts. So it was a good chip.

    I'll take a picture of the chip if I should swap it off the motherboard, that will show the manufacturing date / serial number, and if it may be part of the early faulty batches. I'm not aware of a better way to retrieve a Ryzen's serial number, it doesn't seem to be exposed to the OS.

    Has AMD ever acknowledged the manufacturing defect? I only found claims on reddit that AMD used to RMA those chips for free during 2017-2020.

  • @davide said:

    @Kevinf100 said:
    No problem. Glad the problem was figured out. Very early model you have. Not sure if they'll RMA it tbh, but this problem (I think) hit the first few months of 1700, 1700X, and 1800X (no idea about the other models). Not sure if the 1700 I have has this bug. I didn't want to RMA it because it overclocked to 3.9 GHz at 1.4 volts. So it was a good chip.

    I'll take a picture of the chip if I should swap it off the motherboard, that will show the manufacturing date / serial number, and if it may be part of the early faulty batches. I'm not aware of a better way to retrieve a Ryzen's serial number, it doesn't seem to be exposed to the OS.

    Has AMD ever acknowledged the manufacturing defect? I only found claims on reddit that AMD used to RMA those chips for free during 2017-2020.

    I don't remember if they ever publicly did, it's been a while. However, people were 100% getting them RMA'd hassle-free, suggesting at least the RMA team knew about it or didn't want to argue. One comment claimed it was due to SMT and turning it off fixed it. Unsure if true, but hopefully AMD will still replace the chip today.

    Thanked by 1davide
  • davide Member
    edited January 9

    The interesting exchange I just had with ASUS tech support.

    I wrote:

    Subject: Suspected frequent memory corruption
    Hello,

    since I purchased this motherboard I have observed frequent abnormal terminations of processes under linux, with the kernel log containing entries such as:

    [Tue Jan  7 03:30:45 2025] traps: munin-graph[23851] trap invalid opcode ip:55f33ed49fe6 sp:7ffd201e54f8 error:0 in perl[55f33ec78000+195000]
    [Tue Jan  7 03:53:25 2025] traps: patterner[13060] trap stack segment ip:55ff8a29f045 sp:7f99981fdb50 error:0 in patterner[55ff8a26f000+56000]
    

    Upon examining the core-dump of the affected processes via GDB, I found stack corruption and invalid instruction-pointer register; these crashes apparently affect any running process indiscriminately, including system processes, approximately once or twice per day.

    The computer has ECC RAM and EDAC appears enabled, as per these kernel log entries:

    [   15.488841] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
    [   15.488845] EDAC amd64: F17h detected (node 0).
    [   15.488848] EDAC MC: UMC0 chip selects:
    [   15.488849] EDAC amd64: MC: 0:  8192MB 1:  8192MB
    [   15.488850] EDAC amd64: MC: 2:     0MB 3:     0MB
    [   15.488853] EDAC MC: UMC1 chip selects:
    [   15.488854] EDAC amd64: MC: 0:  8192MB 1:  8192MB
    [   15.488855] EDAC amd64: MC: 2:     0MB 3:     0MB
    

    No ECC errors are reported to userspace, but because of these crashes, I suspect that memory errors do happen and remain uncorrected and undetected. Perhaps some settings in the BIOS need to be set to special values to obtain stability?

    Please advise.

    The answer I received from ASUS, in Italian, translated:

    Dear Mr. Davide,

    My name is Giorgio and I will do my best to meet your request.

    Regarding the issue you mentioned, the PRIME B450M-K II motherboard is not supported by Asus for the Linux OS and there is no official information available. We invite you to check if the installed memories are included in the list of memories tested/certified by Asus with the motherboard.

    QVL memory link: [link]

    Additionally, we invite you to check if the motherboard's BIOS is updated to the latest version 4622, which you can download from the following link. For the BIOS update, you can use the EZFLASH utility found in BIOS>TOOLS>EZFLASH. To access the BIOS, press the DELETE or F2 key during PC startup.

    BIOS link: [link]

    EZFLASH guide link: [link]

    I am available for any further questions or support requests.

    Best regards,

    Asus Customer Service Giorgio_P

    I wasn't expecting anything so no expectations were shattered. I'm buying a Ryzen 2xxx in the next couple of days, this might fix the crashes. I still haven't disassembled the computer to confirm that the CPU serial number belongs to a faulty batch. I guess I'll find out when I put the new CPU in. Too time consuming otherwise.
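To see whether crashes like the `traps:` lines quoted in the support request really hit processes indiscriminately, the kernel log can be tallied per process and trap type. The sketch below is an assumption-laden illustration: the regular expression is derived only from the two sample lines in this thread, other kernel versions may format these messages differently, and the names `parse_trap` and `tally_traps` are made up.

```python
import re
from collections import Counter

# Pattern modelled on the two sample "traps:" lines from the kernel log above.
TRAP_RE = re.compile(
    r"traps: (?P<comm>\S+)\[(?P<pid>\d+)\] trap (?P<kind>.+?) "
    r"ip:(?P<ip>[0-9a-f]+) sp:(?P<sp>[0-9a-f]+) error:(?P<error>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[)?"
)


def parse_trap(line):
    """Return the fields of a 'traps:' log line as a dict, or None."""
    match = TRAP_RE.search(line)
    return match.groupdict() if match else None


def tally_traps(lines):
    """Count traps per (process name, trap kind)."""
    tally = Counter()
    for line in lines:
        trap = parse_trap(line)
        if trap:
            tally[(trap["comm"], trap["kind"])] += 1
    return tally
```

Feeding it the output of `dmesg` or a saved kernel log gives a quick picture of how widely the faults are spread across unrelated programs.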

  • You can't run memory tests from userspace. You need to run memtest86 from boot and get at least one pass when suspecting RAM issues.

  • davide Member
    edited January 9

    Why? man memtester says:

    memtester is an effective userspace tester for stress-testing the memory subsystem. It is very effective at finding intermittent and non-deterministic faults.

    Anyway memory testing programs have never found any hardware fault for me, the GCC method linked above has been more reliable instead. Although this one computer seems to have a faulty CPU and the memories might be fine.

    So guys I bought an R7 2700; I'll throw away the current R7 1700 @ 40€ as soon as I get the new one in the mail.

    Thanked by 1darkimmortal
  • Arkas Moderator

    @davide said: This toy computer is made of these parts:

    Darling, If I am to remember this, I am sorry. Oh, part of TT24 (ja, doubleread

    Thanked by 1davide
  • davide Member
    edited January 10

    @Arkas said:
    Darling, If I am to remember this, I am sorry. Oh, part of TT24 (ja, doubleread

    Your remembrance is quite spot on; they say that you start leaving a mark in people's minds when you notice that you're repeating yourself. I believe I've played my Supermicro record about 10 times now, so perhaps 5 times is the best amount to avoid a "doubleread" deja vu :)

    Food for thought.

  • If anyone is interested, I replaced the CPU today and took a pic of the serial number printed on the chip. It appears this is a faulty batch, from week 33 of 2017. For what it's worth, there are claims of faulty chips up to week 48 of 2017 for the first generation of Ryzen:



    All I know about this CPU is that my running processes were crashing once a day for the most assorted reasons with it mounted on my computer.

    This CPU will be on sale for 40€.

  • ralf Member

    Oh, actually. I just realised the datecode is the second line.

    And my google search also had 2 results explaining the defect saying the problem chips are before week 25 (1725), so your 1733 should actually be fine - and one of the people said they got a 1733 back from RMA'ing their 1716 chip. Weird.

    Anyway, glad you figured out your issue.

  • emgh Member, Megathread Squad

    @davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.

    Thanked by 2ralf davide
  • ralf Member

    @emgh said:
    @davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.

    At least rebooting the machine every now and then is more fun than an idler doing nothing of interest.

    Thanked by 2emgh lukast__
  • emgh Member, Megathread Squad

    @ralf said:

    @emgh said:
    @davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.

    At least rebooting the machine every now and then is more fun than an idler doing nothing of interest.

    Yeah. Might as well give €200, what do you think?

    Thanked by 1davide
  • davide Member
    edited January 14

    @emgh said:

    @ralf said:

    @emgh said:
    @davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.

    At least rebooting the machine every now and then is more fun than an idler doing nothing of interest.

    Yeah. Might as well give €200, what do you think?

    I think it's self-evident that today it's 2025 and that any previous owner of this CPU did not take advantage of the free 3-year warranty from AMD.

    Pwned.

    Owmhhh that hurts!


    No okay, I see that some information is missing about this defect. Which is:

    • this is a widespread defect affecting a large production batch of early CPUs, not just this one specimen;
    • most owners of these defective CPUs don't reach the software conditions that trigger the fault, or are not disturbed by the fault if they trigger it. In fact the fault is often labeled "the linux/GCC segfault bug", because it is incorrectly believed that it manifests exclusively thru segfaults when compiling with GCC under linux. Most people don't use linux.
  • davide Member
    edited January 14

    @ralf said:
    And my google search also had 2 results explaining the defect saying the problem chips are before week 25 (1725)

    That's because they are ztupid. Especially redditz ztupid... :) never believe what someone shitposts on reddit, even if your life depends on it. There is this widespread tradition within the circles of reddit that he who doesn't know the facts must teach them to others. Especially if the truth about the facts is 1 simple google search away from the professing shitposter.

    What I mean is, there are all sorts of inconsistent reports around the web about this defect on the first generation Ryzen CPUs. It is often repeated that it is associated with a manufacturing date earlier than week 25 of 2017 in the date code, less frequently that it occurs up to week 33, and some reports go up to week 48. Reality is, no one knows if the date code even has any relevance to the likelihood of the CPU being faulty; it's just a made-up supposition that was never confirmed by AMD, yet people have liked to attribute this defect to the date printed on the chip. As for myself, all I know is that with this CPU I get abnormal software failures that I don't get with other hardware, and swapping the CPU for a newer model is just an attempt.
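For illustration only, the conflicting cutoffs mentioned in this thread can be compared against a date code with a few lines of Python. The sketch assumes the code is a two-digit year followed by a two-digit week (as in "1733" or "1725" above); the marking string in the usage example and the function names are hypothetical, and none of the cutoffs has been confirmed by AMD.

```python
import re

# Last affected week of 2017 according to each unconfirmed claim in the thread.
CLAIMED_CUTOFFS = {
    "before week 25": 24,
    "up to week 33": 33,
    "up to week 48": 48,
}


def parse_date_code(marking):
    """Extract (year, week) from the first YYWW group in a marking line."""
    match = re.search(r"\b(\d{2})(\d{2})", marking)
    if not match:
        return None
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        return None
    return year, week


def matching_claims(year, week):
    """List the claims under which a chip from this year/week would be suspect."""
    return [
        label
        for label, last_week in CLAIMED_CUTOFFS.items()
        if year == 17 and week <= last_week
    ]
```

For a hypothetical marking such as `"UA 1733PGT"`, `parse_date_code` yields `(17, 33)`, and `matching_claims(17, 33)` lists only the week-33 and week-48 claims, which shows how the verdict depends entirely on which report one believes.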

    Thanked by 1ralf
  • ralf Member

    @davide said:

    @emgh said:

    @ralf said:

    @emgh said:
    @davide I’m interested in a faulty Ryzen 1700 for €40. Maybe even €100.

    At least rebooting the machine every now and then is more fun than an idler doing nothing of interest.

    Yeah. Might as well give €200, what do you think?

    I think it's self-evident that today it's 2025 and that any previous owner of this CPU did not take advantage of the free 3-year warranty from AMD.

    I kind of understand how you can get in that predicament. I bought a first gen iPod Nano shortly after release in late 2006. I got the recall notice on that in November 2011 about the "possibly exploding battery" - definitely got it, as it's marked as read in my e-mail folder. I guess I was busy with starting a new job, but I didn't do anything.

    I was still using the iPod for quite a few years after that, and then stopped. Anyway, last year, in 2024, I found my iPod. Gave it a charge and it came back to life, battery still seeming fine, but maybe a very slight bulge on the case, nothing terrible. Googled how to replace the battery and discovered the whole warranty replacement program and that it'd been running for over 5 years before they decided that there were no 1st gens left. Shame.

    It has a new home now - inside a LiPo fire-proof bag outside the house while I decide what to do with it. I think it's too old to run the alternate OS you can get, it's too small to hold an especially useful selection of songs, so it doesn't seem worth buying it a new battery, I don't want to use it in case it explodes at an inopportune moment, but it's also otherwise in pristine condition because it always lived inside a special holder. It even still has the film on the screen (cut off at the screen because the cover also went over the wheel). But I can't even sell it as a collectible as it has my name and e-mail address etched on the back!

    Thanked by 1davide
  • emgh Member, Megathread Squad

    @ralf bitcoin fixes that

    Thanked by 2ralf davide