LLM (deepseek?) on KimSufi server

Comments

  • Neoon Community Contributor, Veteran

    @Cybr said:

    @Neoon said:

    @Cybr said:

    @Neoon said:
    72b running, on a 11$/m dedi, with 15GB to spare, is nuts.

    edit: here is the joke:

    Why don’t scientists trust atoms?
    Because they make up everything! 😄

    Have you tried the new DeepSeek R1 Dynamic 1.58-bit that just got released? They achieved an 80% size reduction. I'm interested in how well it can perform on a low/medium-end CPU.

    If it's on ollama fine, too lazy to compile shit.
    edit: seems like with some params, it compiles fine for CPU only.

    I wasn't going to install all these crap nvidia dependencies.

    Looks like it is on ollama, but minimum VRAM+RAM=80GB, so your low end box probably won't have enough ram to even try it CPU only.

    Well, that was my guess too, someone with a 128GB machine should try it if they can.

  • @Neoon said:

    @Cybr said:

    @Neoon said:

    @Cybr said:

    @Neoon said:
    72b running, on a 11$/m dedi, with 15GB to spare, is nuts.

    edit: here is the joke:

    Why don’t scientists trust atoms?
    Because they make up everything! 😄

    Have you tried the new DeepSeek R1 Dynamic 1.58-bit that just got released? They achieved an 80% size reduction. I'm interested in how well it can perform on a low/medium-end CPU.

    If it's on ollama fine, too lazy to compile shit.
    edit: seems like with some params, it compiles fine for CPU only.

    I wasn't going to install all these crap nvidia dependencies.

    Looks like it is on ollama, but minimum VRAM+RAM=80GB, so your low end box probably won't have enough ram to even try it CPU only.

    Well, that was my guess too, someone with a 128GB machine should try it if they can.

    Apparently it will actually run with less than 80GB of RAM, but it will probably be insanely slow. CPU-only is slow to begin with, even with the recommended 140GB of VRAM+RAM, which is what it takes to reach 20 tok/s with a GPU.

  • @Neoon said:

    @Cybr said:

    @Neoon said:

    @Cybr said:

    @Neoon said:
    72b running, on a 11$/m dedi, with 15GB to spare, is nuts.

    edit: here is the joke:

    Why don’t scientists trust atoms?
    Because they make up everything! 😄

    Have you tried the new DeepSeek R1 Dynamic 1.58-bit that just got released? They achieved an 80% size reduction. I'm interested in how well it can perform on a low/medium-end CPU.

    If it's on ollama fine, too lazy to compile shit.
    edit: seems like with some params, it compiles fine for CPU only.

    I wasn't going to install all these crap nvidia dependencies.

    Looks like it is on ollama, but minimum VRAM+RAM=80GB, so your low end box probably won't have enough ram to even try it CPU only.

    Well, that was my guess too, someone with a 128GB machine should try it if they can.

    Anyone with 128GB ram should definitely try the new DeepSeek R1 Dynamic 1.58-bit, since the results will be way better than any distilled version like Qwen-2.5-72B. The output is not that far off from the unquantized version in many cases.

    I see there are reports that it runs pretty fast on CPU only with 120GB RAM too, possibly around 30 tokens/s.

    Thanked by 1xxsl
  • plumberg Veteran, Megathread Squad

    @charger said:

    KS-LE-B with E3-1245 v6 and 32gb of ram:
    deepseek-r1:32b Prompt eval: 2.79 t/s Response: 1.34 t/s Total: 1.36 t/s

    KS-LE-E with E5-1650 v3 and 64gb of ram:
    deepseek-r1:32b Prompt eval: 3.24 t/s Response: 1.85 t/s Total: 1.86 t/s

    So performance is not fantastic, but honestly for the few bucks a month and them mostly idling anyways I see a use case where tokens/s is not super important like background jobs and such

    How do you get this token speed ?

  • raindog308 Administrator, Veteran

    @Adam1 said: I won't be using it interactively. A baseline performance is required, but it would be quite low.

    If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

    Thanked by 1admax
  • I had planned to test running deepseek r1 on different KS
    But with the results posted above, it seems unusable for most uses.
    I don't have any servers available with 128GB ram to compare.
    If I get motivated, I'll test with a hetzner server with hourly billing and post the results.

  • Adam1 Member
    edited January 2025

    @raindog308 said: If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

    I really think the new AMD "AI" chips paired with 128GB RAM are going to be the next big thing, real consumer AI on the "cheap".

    https://www.youtube.com/shorts/CSSvJhRiP18

    nearly!

  • beanman109 Member, Host Rep, Megathread Squad

    @raindog308 said:

    @Adam1 said: I won't be using it interactively. A baseline performance is required, but it would be quite low.

    If you really are OK with a low rate...Deepseek R1 14B on an RPi 5 is 1-2 tok/sec (linked to the benchmark screenshot):

    Didn't think LLMs would even be worth running on ARM chips but now I want to test the RK3588 with this - would love to see some NPU utilization but I know it's a wet dream that's never gonna happen

    Thanked by 1admax
  • Neoon Community Contributor, Veteran

    @Neoon said:
    Do you guys think, it will run well on a swap file?

    Apparently this is a really shitty idea, it takes hours.
    Don't do it guys.

    Thanked by 1admax
  • @plumberg said:

    @charger said:

    KS-LE-B with E3-1245 v6 and 32gb of ram:
    deepseek-r1:32b Prompt eval: 2.79 t/s Response: 1.34 t/s Total: 1.36 t/s

    KS-LE-E with E5-1650 v3 and 64gb of ram:
    deepseek-r1:32b Prompt eval: 3.24 t/s Response: 1.85 t/s Total: 1.86 t/s

    So performance is not fantastic, but honestly for the few bucks a month and them mostly idling anyways I see a use case where tokens/s is not super important like background jobs and such

    How do you get this token speed ?

    I used the script here; pull its maintenance branch: https://github.com/Otterwerks/llm-benchmark/tree/maintenance/package-updates
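For anyone who would rather not install the benchmark script, here is a minimal sketch of how such numbers can be derived directly from Ollama. It assumes a local Ollama server on the default port and relies on the timing fields (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, durations in nanoseconds) that `/api/generate` returns. The function names are made up for illustration, and the way "total" is computed here is an assumption, not necessarily what the linked script does.

```python
import json
import urllib.request

NS_PER_SECOND = 1_000_000_000


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tokens/s."""
    if duration_ns <= 0:
        return 0.0
    return count * NS_PER_SECOND / duration_ns


def summarize(stats: dict) -> dict:
    """Derive prompt, response and total rates from an Ollama response dict."""
    prompt_n = stats.get("prompt_eval_count", 0)
    prompt_ns = stats.get("prompt_eval_duration", 0)
    eval_n = stats.get("eval_count", 0)
    eval_ns = stats.get("eval_duration", 0)
    return {
        "prompt_eval_tps": tokens_per_second(prompt_n, prompt_ns),
        "response_tps": tokens_per_second(eval_n, eval_ns),
        # Assumption: "total" = all tokens divided by all evaluation time.
        "total_tps": tokens_per_second(prompt_n + eval_n, prompt_ns + eval_ns),
    }


def benchmark(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Run one non-streaming generation and return the derived rates."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return summarize(json.load(response))
```

Calling `benchmark("deepseek-r1:32b", "Why is the sky blue?")` on a box that has the model pulled should return the three rates. A single prompt is noisy, so averaging a few runs gives a fairer number.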

  • So has anyone managed to selfhost a 70B (or bigger) model or is that too large?

  • Neoon Community Contributor, Veteran

    @BasToTheMax said:
    So has anyone managed to selfhost a 70B (or bigger) model or is that too large?

    No biggie, it ran on my KS-LE-B. You should have at least something like 96GB of memory, though. It's slower than the 32B, but it runs.

  • plumberg Veteran, Megathread Squad

    Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

    I don't expect it to be phenomenal in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk.

  • Neoon Community Contributor, Veteran

    @plumberg said:
    Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

    I don't expect it to be phenomenal in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk.

    You can run the 70b model if you wish.
    But it isn't going to be fast enough for a chat.

    Thanked by 1plumberg
  • plumberg Veteran, Megathread Squad

    @Neoon said:

    @plumberg said:
    Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

    I don't expect it to be phenomenal in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk.

    You can run the 70b model if you wish.
    But it isn't going to be fast enough for a chat.

    Well it's shit talk. So it's ok for it to crap out stuff o throw

  • Neoon Community Contributor, Veteran

    @plumberg said:

    @Neoon said:

    @plumberg said:
    Is there any guidance on what could be a good model to run on 128GB DDR4 + dual E5-2699v4?

    I don't expect it to be phenomenal in terms of speed. Something which would serve as a decent personal assistant for coding and shit talk.

    You can run the 70b model if you wish.
    But it isn't going to be fast enough for a chat.

    Well it's shit talk. So it's ok for it to crap out stuff o throw

    Try 7B, otherwise Hetzner still sells E5 with 1080, might be good enough for partial offloading.

    Thanked by 1plumberg
  • What speeds are y'all getting on cpu? (tokens per second)

  • Neoon Community Contributor, Veteran

    The recently released mistral-small is also interesting.
    It's about 24b and can run on 32gig easily.

    Thanked by 1BasToTheMax
  • https://docs.openwebui.com/tutorials/integrations/deepseekr1-dynamic/

    A huge shoutout to UnslothAI for their incredible efforts! Thanks to their hard work, we can now run the full DeepSeek-R1 671B parameter model in its dynamic 1.58-bit quantized form (compressed to just 131GB) on Llama.cpp! And the best part? You no longer have to despair about needing massive enterprise-class GPUs or servers — it’s possible to run this model on your personal machine (albeit slowly for most consumer hardware).
    Note: The only true DeepSeek-R1 model on Ollama is the 671B version available here: https://ollama.com/library/deepseek-r1:671b. Other versions are distilled models.

    This guide focuses on running the full DeepSeek-R1 Dynamic 1.58-bit quantized model using Llama.cpp integrated with Open WebUI. For this tutorial, we’ll demonstrate the steps with an M4 Max + 128GB RAM machine. You can adapt the settings to your own configuration.
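As a rough sanity check on the sizes quoted above, weight storage can be estimated as parameters times bits per weight divided by 8. The sketch below is back-of-the-envelope only: it assumes one uniform bit width (the dynamic quant actually mixes precisions per layer), ignores KV cache and runtime overhead, and the 8-bit baseline used for the reduction figure is an assumption. The function names are made up for illustration.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a uniform bit width."""
    return params_billion * bits_per_weight / 8


def size_reduction(baseline_bits: float, quant_bits: float) -> float:
    """Fraction of weight storage saved by going from baseline to quant bits."""
    return 1 - quant_bits / baseline_bits


# 671B parameters at an average of 1.58 bits per weight.
quantized_gb = weights_size_gb(671, 1.58)
print(f"estimated weights: {quantized_gb:.1f} GB")

# Assumption: the unquantized baseline stores 8 bits per weight.
print(f"estimated reduction: {size_reduction(8, 1.58):.0%}")
```

The estimate lands within a couple of GB of the 131GB quoted in the guide, and the reduction comes out close to the 80% mentioned earlier in the thread, so the naive formula is good enough for deciding whether a box is even in the right ballpark.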

  • Neoon Community Contributor, Veteran
    edited June 2025

    The new deepsex model dropped.
    I decided to abuse the shit out of the Mystery box.

    Luckily, as always, Unsloth dropped dynamic quantized models.
    So you can put the new new deepsex on that dedi.

    For the best results, I created a Raid YOLO partition with ZFS.
    So I can read at speeds up to 3.8GB/sec, which should be good enough.

    It runs, faster than expected, still slow as fuck though.

    For some reason the LLM only pulls about 1.3GB/sec, sometimes up to 2GB/sec, from the disks but never maxes out the 3.8GB/sec.

    At this point, I question my decision putting these drives into yolo.

    Thanked by 1barbarza
  • Neoon Community Contributor, Veteran
    ./llama.cpp/llama-cli -hf unsloth/DeepSeek-R1-0528-GGUF:IQ1_S \
        --cache-type-k q4_0 \
        --threads 12 -no-cnv --prio 2 \
        --temp 0.6 \
        --ctx-size 8192 \
        --seed 3407 \
        --prompt "<|User|>What is 1+1?<|Assistant|>"
    
    
    llama_perf_sampler_print:    sampling time =      43.66 ms /   424 runs   (    0.10 ms per token,  9711.41 tokens per second)
    llama_perf_context_print:        load time =  230877.43 ms
    llama_perf_context_print: prompt eval time =   27384.76 ms /    10 tokens ( 2738.48 ms per token,     0.37 tokens per second)
    llama_perf_context_print:        eval time =  733528.75 ms /   413 runs   ( 1776.10 ms per token,     0.56 tokens per second)
    llama_perf_context_print:       total time =  761051.91 ms /   423 tokens
    
    Thanked by 1admax
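To compare runs like the one above without eyeballing the log, here is a small sketch that pulls the prompt and generation rates out of `llama_perf_context_print` lines. It also shows a very rough estimate of how much data is streamed per generated token, under the assumption that the 1.3GB/sec disk read rate mentioned earlier applies to this run and that all of those reads serve token generation (page-cache hits would make the real figure differ). The helper names are made up for illustration.

```python
import re

# Matches the "prompt eval time" and "eval time" lines printed by llama.cpp.
PERF_LINE = re.compile(
    r"(?P<label>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms"
    r"\s*/\s*(?P<n>\d+) (?:tokens|runs)"
)


def parse_rates(log: str) -> dict:
    """Return tokens per second for each timing line found in the log."""
    rates = {}
    for match in PERF_LINE.finditer(log):
        seconds = float(match["ms"]) / 1000
        rates[match["label"]] = int(match["n"]) / seconds
    return rates


def gb_read_per_token(disk_gb_per_s: float, tokens_per_s: float) -> float:
    """Rough data volume streamed from disk per generated token."""
    return disk_gb_per_s / tokens_per_s


LOG = """
llama_perf_context_print: prompt eval time =   27384.76 ms /    10 tokens
llama_perf_context_print:        eval time =  733528.75 ms /   413 runs
"""

rates = parse_rates(LOG)
print({name: round(value, 2) for name, value in rates.items()})
# Assumption: 1.3 GB/s is the sustained read rate during generation.
print(round(gb_read_per_token(1.3, rates["eval"]), 2), "GB per token")
```

With the numbers from this run, the generation rate matches the 0.56 tokens per second that llama.cpp printed, and the disk estimate works out to a bit over 2GB read per token. Treat that last figure as an order-of-magnitude hint only.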
  • Neoon Community Contributor, Veteran
    edited July 2025
  • hezi Member

    This looks promising. Can you say the specs of your MYSTERY and what model size you're using?

  • Neoon Community Contributor, Veteran
    edited July 2025

    @hezi said:

    This looks promising. Can you say the specs of your MYSTERY and what model size you're using?

    5bit QXL one, as suggested by unsloth.
    Xeona G

    Thanked by 1hezi
  • Neoon Community Contributor, Veteran
    edited August 2025

    GLM 4.5 Air Q2 seems to run fast enough to chat on 64gig.

  • Neoon Community Contributor, Veteran
    edited August 2025

    Q3 works fine too.
    51 of 62GB used.

    Thanked by 2hezi BasToTheMax
  • Neoon Community Contributor, Veteran

    Q4 works okay with 4k context size on 64GB
    59GB of 64GB used.

  • I wonder when OVH are going to give us a Kimsufi with a GPU.

  • Neoon Community Contributor, Veteran

    Thanked by 1BasToTheMax