LLM (deepseek?) on KimSufi server

Neoon · January 2025

@Cybr said:

@Neoon said:

@Cybr said:

@Neoon said:
72b running, on a 11$/m dedi, with 15GB to spare, is nuts.

edit: here is the joke:

Why don’t scientists trust atoms?
Because they make up everything! 😄

Have you tried the new DeepSeek R1 Dynamic 1.58-bit that just got released? They achieved an 80% size reduction. I'm interested in how well it can perform on a low/medium-end CPU.

If it's on ollama, fine; too lazy to compile shit.
edit: seems like with some params, it compiles fine for CPU only.

I wasn't going to install all these crap nvidia dependencies.

Looks like it is on ollama, but minimum VRAM+RAM=80GB, so your low end box probably won't have enough ram to even try it CPU only.

Well, that was my guess too, someone with a 128GB machine should try it if they can.

Cybr · January 2025

@Neoon said: Well, that was my guess too, someone with a 128GB machine should try it if they can.

Apparently it will actually run with less than 80GB of RAM, but it will probably be insanely slow: CPU-only is really slow to begin with, even with the 140GB of VRAM+RAM recommended for reaching 20 tok/s with a GPU.

Cybr · January 2025

@Neoon said: Well, that was my guess too, someone with a 128GB machine should try it if they can.

Anyone with 128GB ram should definitely try the new DeepSeek R1 Dynamic 1.58-bit, since the results will be way better than any distilled version like Qwen-2.5-72B. The output is not that far off from the unquantized version in many cases.

I see there are reports that it runs pretty fast on CPU only with 120GB RAM too, possibly around 30 tokens/s.

plumberg · January 2025

@charger said:

KS-LE-B with E3-1245 v6 and 32gb of ram:
deepseek-r1:32b Prompt eval: 2.79 t/s Response: 1.34 t/s Total: 1.36 t/s

KS-LE-E with E5-1650 v3 and 64gb of ram:
deepseek-r1:32b Prompt eval: 3.24 t/s Response: 1.85 t/s Total: 1.86 t/s

So performance is not fantastic, but honestly for the few bucks a month and them mostly idling anyways I see a use case where tokens/s is not super important like background jobs and such

How do you get this token speed ?

raindog308 · January 2025

@Adam1 said: I wont be using it interactively. a baseline performance is required, but it would be quite low.

If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

remy · January 2025

I had planned to test running deepseek r1 on different KS
But with the results posted above, it seems unusable for most uses.
I don't have any servers available with 128GB ram to compare.
If I get motivated, I'll test with a hetzner server with hourly billing and post the results.

Adam1 · January 2025

@raindog308 said: If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

I really think the new AMD "AI" chips paired with 128GB RAM are going to be the next big thing, real consumer AI on the "cheap".

https://www.youtube.com/shorts/CSSvJhRiP18

nearly!

beanman109 · January 2025

@raindog308 said:

@Adam1 said: I wont be using it interactively. a baseline performance is required, but it would be quite low.

If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

Didn't think LLMs would even be worth running on ARM chips but now I want to test the RK3588 with this - would love to see some NPU utilization but I know it's a wet dream that's never gonna happen

Neoon · January 2025

@Neoon said:
Do you guys think, it will run well on a swap file?

Apparently this is a really shitty idea, it takes hours.
Don't do it guys.

charger · January 2025

@plumberg said:

@charger said:

KS-LE-B with E3-1245 v6 and 32gb of ram:
deepseek-r1:32b Prompt eval: 2.79 t/s Response: 1.34 t/s Total: 1.36 t/s

KS-LE-E with E5-1650 v3 and 64gb of ram:
deepseek-r1:32b Prompt eval: 3.24 t/s Response: 1.85 t/s Total: 1.86 t/s

So performance is not fantastic, but honestly for the few bucks a month and them mostly idling anyways I see a use case where tokens/s is not super important like background jobs and such

How do you get this token speed ?

used the script here, pull the maintenance branch of it https://github.com/Otterwerks/llm-benchmark/tree/maintenance/package-updates
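If you'd rather not pull the whole script, here is a minimal sketch of how the same three numbers can be derived straight from Ollama's API response. It assumes a local Ollama server on the default port with the model already pulled; the helper names are made up for illustration, and this is not necessarily how the linked benchmark computes its figures.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def rates_from_response(resp: dict) -> dict:
    """Turn Ollama's timing fields (durations in nanoseconds) into tokens/s."""
    ns = 1e9
    prompt_s = resp["prompt_eval_duration"] / ns
    eval_s = resp["eval_duration"] / ns
    prompt_tokens = resp["prompt_eval_count"]
    eval_tokens = resp["eval_count"]
    return {
        "prompt_eval_tps": prompt_tokens / prompt_s,
        "response_tps": eval_tokens / eval_s,
        "total_tps": (prompt_tokens + eval_tokens) / (prompt_s + eval_s),
    }


def benchmark(model: str, prompt: str) -> dict:
    """Send one non-streaming generate request and report tokens/s."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return rates_from_response(json.load(r))
```

Calling `benchmark("deepseek-r1:32b", "Why is the sky blue?")` should give figures comparable to the ones quoted above, though a single prompt is noisier than a proper benchmark run.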

BasToTheMax · February 2025

So has anyone managed to selfhost a 70B (or bigger) model or is that too large?

Neoon · February 2025

@BasToTheMax said:
So has anyone managed to selfhost a 70B (or bigger) model or is that too large?

No biggy, it ran on my KS-LE-B, however you should have at least something like 96GB of memory, slower than the 32B but runs.

plumberg · February 2025

Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

I don't expect it to run phenomenally in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk

Neoon · February 2025

@plumberg said:
Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

I don't expect it to run phenomenally in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk

You can run the 70b model if you wish.
But it isn't going to be fast enough for a chat.

plumberg · February 2025

@Neoon said:

@plumberg said:
Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

I don't expect it to run phenomenally in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk

You can run the 70b model if you wish.
But it isn't going to be fast enough for a chat.

Well it's shit talk. So it's ok for it to crap out stuff o throw

Neoon · February 2025

@plumberg said:

@Neoon said:

@plumberg said:
Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

I don't expect it to run phenomenally in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk

You can run the 70b model if you wish.
But it isn't going to be fast enough for a chat.

Well it's shit talk. So it's ok for it to crap out stuff o throw

Try 7B, otherwise Hetzner still sells E5 with 1080, might be good enough for partial offloading.

BasToTheMax · February 2025

What speeds are y'all getting on cpu? (tokens per second)

Neoon · February 2025

The recently released mistral-small is also interesting.
It's about 24B and can run on 32GB easily.

tridinebandim · February 2025

https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/

A huge shoutout to UnslothAI for their incredible efforts! Thanks to their hard work, we can now run the full DeepSeek-R1 671B parameter model in its dynamic 1.58-bit quantized form (compressed to just 131GB) on Llama.cpp! And the best part? You no longer have to despair about needing massive enterprise-class GPUs or servers — it’s possible to run this model on your personal machine (albeit slowly for most consumer hardware).
Note: The only true DeepSeek-R1 model on Ollama is the 671B version available here: https://ollama.com/library/deepseek-r1:671b. Other versions are distilled models.

This guide focuses on running the full DeepSeek-R1 Dynamic 1.58-bit quantized model using Llama.cpp integrated with Open WebUI. For this tutorial, we’ll demonstrate the steps with an M4 Max + 128GB RAM machine. You can adapt the settings to your own configuration.
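A rough back-of-the-envelope sketch for the "will it fit" question, using only the numbers quoted above (671B parameters, 1.58 bits, 131GB). The `overhead_gb` value is a made-up placeholder for KV cache and OS headroom, not a measured figure, and the dynamic quant does not use the same bit width for every layer, so treat the result as a ballpark only.

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_memory(size_gb: float, ram_gb: float, vram_gb: float = 0.0,
                   overhead_gb: float = 8.0) -> bool:
    """True if the weights plus a hypothetical overhead fit in RAM + VRAM."""
    return size_gb + overhead_gb <= ram_gb + vram_gb


# 671B parameters at a nominal 1.58 bits per weight.
estimate = quantized_size_gb(671, 1.58)
print(f"estimated size: {estimate:.1f} GB")

# The 131GB file on a 128GB box, with and without a 24GB GPU.
print(fits_in_memory(131, ram_gb=128))              # False
print(fits_in_memory(131, ram_gb=128, vram_gb=24))  # True
```

The estimate lands close to the 131GB quoted in the guide. When the model does not fit, llama.cpp memory-maps the file by default and pages weights in from disk, which is why it still runs but slowly.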

Neoon · February 2025

You can make any model "think".

https://www.reddit.com/r/LocalLLaMA/comments/1iggetv/make_your_mistral_small_3_24b_think_like/

Neoon · June 2025

The new deepsex model dropped.
I decided to abuse the shit out of the Mystery box.

Luckily, as always, Unsloth dropped dynamic quantized models.
So you can put the new new deepsex on that dedi.

For the best results, I created a RAID YOLO partition with ZFS.
So I can read at speeds up to 3.8GB/sec, which should be good enough.

It runs, faster than expected, still slow as fuck though.

For some reason the LLM only pulls about 1.3GB/sec, sometimes up to 2GB/sec, from the disks but never maxes out the 3.8GB/sec.

At this point, I question my decision putting these drives into yolo.

Neoon · June 2025

./llama.cpp/llama-cli -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<｜User｜>What is 1+1?<｜Assistant｜>"


llama_perf_sampler_print:    sampling time =      43.66 ms /   424 runs   (    0.10 ms per token,  9711.41 tokens per second)
llama_perf_context_print:        load time =  230877.43 ms
llama_perf_context_print: prompt eval time =   27384.76 ms /    10 tokens ( 2738.48 ms per token,     0.37 tokens per second)
llama_perf_context_print:        eval time =  733528.75 ms /   413 runs   ( 1776.10 ms per token,     0.56 tokens per second)
llama_perf_context_print:       total time =  761051.91 ms /   423 tokens
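For comparing runs, a small sketch that recomputes tokens/s from the two timing lines above and then estimates how much data gets read per generated token. The second part assumes disk reads are the bottleneck and stay at roughly the 1.3GB/sec mentioned earlier for the whole run, which is a guess rather than something measured here; the function names are illustrative only.

```python
import re

LOG = """\
llama_perf_context_print: prompt eval time =   27384.76 ms /    10 tokens ( 2738.48 ms per token,     0.37 tokens per second)
llama_perf_context_print:        eval time =  733528.75 ms /   413 runs   ( 1776.10 ms per token,     0.56 tokens per second)
"""

# Matches "<kind> time = <ms> ms / <count> tokens|runs" in llama.cpp's perf output.
LINE = re.compile(r"(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)")


def parse_perf(log: str) -> dict:
    """Return tokens per second for each timing line found in the log."""
    rates = {}
    for kind, ms, count in LINE.findall(log):
        rates[kind] = int(count) / (float(ms) / 1000.0)
    return rates


def gb_read_per_token(read_gb_per_s: float, tokens_per_s: float) -> float:
    """GB pulled from disk per generated token, assuming a steady read rate."""
    return read_gb_per_s / tokens_per_s


rates = parse_perf(LOG)
print({k: round(v, 2) for k, v in rates.items()})
print(f"{gb_read_per_token(1.3, rates['eval']):.1f} GB read per token")
```

Under those assumptions each token costs a bit over 2GB of reads, so even the full 3.8GB/sec would leave generation under 2 tokens/s.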

Neoon · July 2025

New QWEN 3 coder on MYSTERY
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

hezi · July 2025

@Neoon said:
New QWEN 3 coder on MYSTERY
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

This looks promising. Can you say the specs of your MYSTERY and what model size you're using?

Neoon · July 2025

@hezi said:

@Neoon said:
New QWEN 3 coder on MYSTERY
https://huggingface.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

This looks promising. Can you say the specs of your MYSTERY and what model size you're using?

5bit QXL one, as suggested by unsloth.
Xeona G

Neoon · August 2025

GLM 4.5 Air Q2 seems to run fast enough to chat on 64gig.

Neoon · August 2025

Q3 works fine too.
51 of 62GB used.

Neoon · August 2025

Q4 works okay with 4k context size on 64GB
59GB of 64GB used.

barbarza · August 2025

I wonder when OVH are going to give us a Kimsufi with a GPU.

Neoon · September 2025
