
Comments
Well, that was my guess too, someone with a 128GB machine should try it if they can.
Apparently it will actually run with less than 80GB of RAM, but it will probably be insanely slow: CPU-only is really slow to begin with, even with the recommended 140GB of combined VRAM+RAM, which is what you need for 20 tok/s with a GPU.
Anyone with 128GB ram should definitely try the new DeepSeek R1 Dynamic 1.58-bit, since the results will be way better than any distilled version like Qwen-2.5-72B. The output is not that far off from the unquantized version in many cases.
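For anyone wondering whether a given quant will fit in RAM, here is a rough back-of-the-envelope sketch. It only counts the weights (no KV cache, no runtime overhead), and the parameter counts are my assumptions rather than numbers from this thread. Dynamic quants mix bit widths per layer, so treat the bits-per-weight value as an average.

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Parameter counts are assumptions, not numbers from this thread.
# The bit widths are averages; dynamic quants mix several widths per layer.
examples = [
    ("70B dense at 4 bits", 70e9, 4.0),
    ("70B dense at 8 bits", 70e9, 8.0),
    ("671B at a flat 1.58 bits", 671e9, 1.58),
]

for label, n_params, bits in examples:
    print(f"{label}: ~{weights_size_gb(n_params, bits):.0f} GB")
```

Whatever number comes out, leave headroom for the context and the OS on top of it.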
I see there are reports that it runs pretty fast on CPU only with 120GB RAM too, possibly around 30 tokens/s.
How do you get this token speed?
If you really are OK with a low rate... DeepSeek R1 14B on an RPi 5 does 1-2 tok/sec (linked to the benchmark screenshot):
I had planned to test running DeepSeek R1 on different KS servers.
But with the results posted above, it seems unusable for most uses.
I don't have any servers available with 128GB ram to compare.
If I get motivated, I'll test with a hetzner server with hourly billing and post the results.
I really think the new AMD "AI" chips paired with 128GB RAM are going to be the next big thing, real consumer AI on the "cheap".
https://www.youtube.com/shorts/CSSvJhRiP18
nearly!
Didn't think LLMs would even be worth running on ARM chips but now I want to test the RK3588 with this - would love to see some NPU utilization but I know it's a wet dream that's never gonna happen
Apparently this is a really shitty idea, it takes hours.
Don't do it guys.
I used the script here; pull the maintenance branch of it: https://github.com/Otterwerks/llm-benchmark/tree/maintenance/package-updates
So has anyone managed to selfhost a 70B (or bigger) model or is that too large?
No biggie, it ran on my KS-LE-B. You should have at least something like 96GB of memory though. It's slower than the 32B, but it runs.
Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?
I don't expect it to be phenomenal in terms of speed. Just something which would serve as a decent personal assistant for coding and shit talk.
You can run the 70b model if you wish.
But it isn't going to be fast enough for a chat.
Well it's shit talk. So it's ok for it to crap out stuff o throw
Try 7B, otherwise Hetzner still sells E5 with 1080, might be good enough for partial offloading.
What speeds are y'all getting on cpu? (tokens per second)
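To keep the numbers comparable, here is a minimal sketch of how tok/s can be computed. It assumes you run the model through Ollama, whose generate response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sample values below are made up for illustration, not a real measurement.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from an Ollama-style response (duration in nanoseconds)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


# Made-up sample response fields, only to show the calculation.
sample = {"eval_count": 120, "eval_duration": 60_000_000_000}

speed = tokens_per_second(sample["eval_count"], sample["eval_duration"])
print(f"{speed:.2f} tok/s")
```

Prompt processing is timed separately, so this only covers generation speed.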
The recently released Mistral Small is also interesting.
It's about 24B and can run on 32GB easily.
https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/
You can make any model "think".
https://www.reddit.com/r/LocalLLaMA/comments/1iggetv/make_your_mistral_small_3_24b_think_like/
The new deepsex model dropped.
I decided to abuse the shit out of the Mystery box.
Luckily, as always, Unsloth dropped dynamic quantized models.
So you can put the new new deepsex on that dedi.
For the best results, I created a RAID YOLO partition with ZFS.
So I can read at speeds up to 3.8GB/sec, which should be good enough.
It runs, faster than expected, still slow as fuck though.
For some reason the LLM only pulls about 1.3GB/sec, sometimes up to 2GB/sec, from the disks, but never maxes out the 3.8GB/sec.
At this point, I question my decision putting these drives into yolo.
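A rough way to see why the disk read rate matters: when the weights don't fit in RAM and have to be streamed from disk, generation speed is capped at about read bandwidth divided by the bytes read per token. The sketch below uses the read rates mentioned above, but the 8 GB-per-token figure is purely hypothetical, since the real value depends on the model, the quant, and how much the page cache already holds.

```python
def max_tokens_per_second(read_gb_per_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tok/s when every token needs gb_read_per_token from disk."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return read_gb_per_s / gb_read_per_token


# Hypothetical: GB that must come from disk for each generated token.
GB_PER_TOKEN = 8.0

# Observed and theoretical read rates quoted in the post above.
for rate in (1.3, 2.0, 3.8):
    bound = max_tokens_per_second(rate, GB_PER_TOKEN)
    print(f"{rate} GB/s -> at most {bound:.2f} tok/s")
```

This is only a ceiling; it says nothing about why the reads stall below what the array can deliver.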
New QWEN 3 coder on MYSTERY
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
This looks promising. Can you say the specs of your MYSTERY and what model size you're using?
5bit QXL one, as suggested by unsloth.
Xeona G
GLM 4.5 Air Q2 seems to run fast enough to chat on 64GB.
Q3 works fine too.
51 of 62GB used.
Q4 works okay with 4k context size on 64GB
59GB of 64GB used.
I wonder when OVH are going to give us a Kimsufi with a GPU.