LLM (deepseek?) on KimSufi server

Adam1 Member
edited January 28 in Help

With some of the cheap 64GB and 32GB dedis from Kimsufi around, I'm wondering if anyone has experience running LLMs on CPU-only Kimsufi servers. DeepSeek looks promising?

More to the point, any benchmarks?

I'm considering using one of my 64GB RAM KS's, primarily for local API access and rewriting, so performance isn't too important as long as it gets the job done.

I'll be picking up an AMD APU with 128GB RAM later in the year, but until then I'm limited to 16GB machines at home, although the memory bandwidth is probably much higher than most KS's.
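
For anyone who already has numbers, please share them. Otherwise, this is roughly how I'd measure it myself - a minimal sketch with llama-cpp-python, where the model path is a placeholder for whatever quantized GGUF you've downloaded and n_threads should match the box's physical cores:

```
# Rough tokens/sec check on a CPU-only box.
# pip install llama-cpp-python ; the GGUF path below is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,     # bigger contexts slow CPU inference down noticeably
    n_threads=8,    # match the physical core count of the dedi
)

prompt = "Rewrite this paragraph in a more formal tone: ..."
start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.2f} tok/s")
```

If anyone runs something like this on a KS box, tok/s numbers would be much appreciated.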

Comments

  • spiritlhl Member
    edited January 28

    At least 16GB RAM and an RTX 3060 8GB GPU; CPU-only, you can just run the 8B version.

  • Neoon Community Contributor, Veteran

    That's why I buy a KS-Game-LE.

    The 14B runs on the Nocix boxes, but it's too slow.
    It runs fine on a Ryzen 2600.

    The i7 should be in between; it should be fast enough.

    I'm hoping for a 32 or 64GB upgrade though.

  • Adam1 Member

    @Neoon said: That's why I buy a KS-Game-LE.

    Did you run an LLM on it? What kind of performance?

  • Neoon Community Contributor, Veteran

    @Adam1 said:

    @Neoon said: That's why I buy a KS-Game-LE.

    Did you run an LLM on it? What kind of performance?

    My order is still pending, so I shall wait.
    I guess for $12 you can't go wrong with that hardware.

    The E5 would have worse performance and cost more.

  • seenu Member

    Without a GPU, the speed will be damn slow.

    Eventually you will lose interest in using it

  • Araki Member
    edited January 28

    The memory speed is all that realistically matters, and you probably shouldn't expect good memory speed out of a Kimsufi server. I had one with 32GB and they put in two RAM modules clocked at 2133 MHz. The two-channel + low-clock combo is awful for LLMs; all it would realistically run is 7B models, and DeepSeek, which needs ~200GB even for the lowest quants, won't even fit.

    Productive CPU-only inference for big (~70B) models is possible, but you should hunt for 8-channel DDR5 servers. Refer to this spreadsheet, which is outdated but gives the general idea, and, specifically for DeepSeek, to this GitHub comment where someone runs the 4-bit version of DeepSeek V3 on an EPYC server. Still, expect prompt processing to take a while if you're planning to use long prompts.
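
    If you want to sanity-check what a particular box's RAM can actually do before committing, a crude copy test is enough to compare machines - a rough sketch (a proper STREAM-style benchmark is more accurate):

    ```
    # Very rough memory-bandwidth estimate; not a substitute for STREAM,
    # but enough to compare two servers.
    import time
    import numpy as np

    N = 512 * 1024 * 1024 // 8           # ~512 MB of float64
    a = np.random.rand(N)
    b = np.empty_like(a)

    runs = 10
    start = time.time()
    for _ in range(runs):
        np.copyto(b, a)                  # one read + one write per element
    elapsed = time.time() - start

    bytes_moved = 2 * a.nbytes * runs    # read a + write b, each run
    print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s effective copy bandwidth")
    ```

    Dual-channel DDR4 at 2133 MHz will land far below what an 8-channel DDR5 box shows here, which is the whole point.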

  • Adam1 Member

    @seenu said: Without a GPU, the speed will be damn slow.
    Eventually you will lose interest in using it

    I won't be using it interactively. Some baseline performance is required, but it can be quite low.

  • Adam1 Member

    @Araki said: The memory speed is all that realistically matters

    Sure, but let's just say we're limited to KS servers for the purposes of this thread. Many of us have idlers that could be used for this, so to me it's interesting and potentially useful to see what's possible on such old hardware.

  • ScreenReader Member

    From my experience with 8-13B models, if you load them into 64GB of DDR4 RAM you'll only get 8-10 tok/s at most. Once you go past 4k of context it crawls down to 4 tok/s, and it halves again once you put 8k of context on it (Q5_K_M with llama.cpp, AVX2, and KV cache enabled).

    If your work is on-demand and not running continuously all day, IMO it's not worth the wait (you waste way too much time). The only time I'd recommend doing this is when your data is really sensitive and you're under NDA to keep it secure (yes, I've done this with 64GB DDR4 + llama.cpp-vulkan on an AMD gfx803 GPU too).

    Otherwise, just use an on-demand GPU that you can deploy with a script / Terraform, use the endpoint, then destroy the instance once you're done. Some providers even offer hourly or per-second pricing (you only pay for the GPU while it's actively running inference, then you pay only for CPU/RAM per hour for the runner container).

    My recommendations if you want to go with the second option:
    https://modal.com/pricing
    https://www.runpod.io/pricing
    I've been using them for more than a year, and the recent price decreases really help reduce my spending. I never go over $10 a month for the actual work I need done (summarizing files, code generation, multi-language translation, OCR using a vision model, text-to-audio generation, audio-to-audio generation).
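
    Once an instance like that is up, the client side is just an OpenAI-compatible API call - a sketch where the endpoint URL, key and model name are placeholders for whatever you deployed (most GPU providers expose an OpenAI-compatible endpoint via something like vLLM):

    ```
    # Call a temporary GPU endpoint, then tear the instance down afterwards
    # so you stop paying for the GPU. All identifiers below are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://YOUR-ENDPOINT.example.com/v1",  # placeholder
        api_key="YOUR_KEY",                               # placeholder
    )

    resp = client.chat.completions.create(
        model="your-deployed-model",                      # placeholder
        messages=[{"role": "user", "content": "Rewrite this paragraph: ..."}],
    )
    print(resp.choices[0].message.content)
    ```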

  • Adam1 Member

    @ScreenReader said: Otherwise, just use an on-demand GPU that you can deploy with a script / Terraform, use the endpoint, then destroy the instance once you're done. Some providers even offer hourly or per-second pricing (you only pay for the GPU while it's actively running inference, then you pay only for CPU/RAM per hour for the runner container).

    Thanks, but let's stick to KS hardware for this thread. There are countless other threads for other platforms.

  • Adam1 Member

    @ScreenReader said: From my experience with 8-13B models, if you load them into 64GB of DDR4 RAM you'll only get 8-10 tok/s at most.

    Well, that's OK actually, as it would mostly just be rewriting paragraphs via API. Nothing interactive. That's my use case anyway.

  • raindog308 Administrator, Veteran

    @Adam1 said: I'll be picking up an AMD APU with 128GB RAM later in the year, but until then I'm limited to 16GB machines at home, although the memory bandwidth is probably much higher than most KS's.

    16GB GPU or 16GB system RAM?

    If you have 32GB or 64GB of RAM on a laptop, run your LLM on that. It's going to outperform a Kimsufi.

    Of course, you're not gambling a lot to try out a Kimsufi.

    I have a $20-a-month ColoCrossing dedi with 32GB of RAM...perhaps I should try running an LLM on it...

  • Adam1 Member

    @raindog308 said: 16GB GPU or 16GB system RAM?

    System RAM, so CPU-only LLMs. I do have a laptop 3060 with (only) 6GB VRAM. However, I'm more interested in what I can do with my KS dedis that have a lot of spare CPU and RAM.

  • Adam1 Member

    @raindog308 said: I have a $20-a-month ColoCrossing dedi with 32GB of RAM...perhaps I should try running an LLM on it...

    Won't hurt, only time to waste finding out :D

    I hope cheap AMD APUs with configurable VRAM will be possible at KS at some point, maybe in 10+ years' time haha.

  • beanman109 Member, Megathread Squad

    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

  • kvz12 Member

    @beanman109 said:
    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

    What are the obvious reasons?

  • beanman109 Member, Megathread Squad

    @kvz12 said:

    @beanman109 said:
    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

    What are the obvious reasons?

    Bias in the responses from the Chinese government, end of story. We don't need to turn this into a political thread; I just want answers that aren't DeepSeek.

  • cainyxues Member

    @beanman109 isn't there an uncensored model too [just saying]

  • cainyxues Member

    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

  • Levi Member

    A bit off topic: so DeepSeek is better while free vs. the cgpt pro plan? Damn, I'm paying $20/mo for cgpt :O. Using it for coding.

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    @beanman109 isn't there an uncensored model too [just saying]

    Not as far as I know? Unless the locally run version is uncensored

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.
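
    For poking at it from a script instead of the web UI, Ollama's local REST API works too - a minimal sketch, assuming the phi4:14b tag has already been pulled and Ollama is listening on its default port:

    ```
    # Send one chat request to a locally running Ollama instance.
    import requests

    r = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "phi4:14b",
            "messages": [{"role": "user", "content": "Write a small HTML/JS countdown clock."}],
            "stream": False,
        },
        timeout=600,  # CPU-only generation can take a while
    )
    print(r.json()["message"]["content"])
    ```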

  • BasToTheMax Member

    @beanman109 said:

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.

    I'm self-hosting Open WebUI too, running llama 3.1 8B via ollama on my Contabo VPS.
    It generates about 1 word per second.
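
    To turn "about 1 word per second" into a comparable tokens/sec figure, the generate endpoint reports token counts and timings - a rough sketch against Ollama's default port:

    ```
    # Measure generation speed from Ollama's response metadata.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": "Rewrite this paragraph in plain English: ...",
            "stream": False,
        },
        timeout=600,
    )
    data = r.json()

    # eval_count = generated tokens, eval_duration is in nanoseconds
    print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.2f} tok/s")
    ```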

  • beanman109 Member, Megathread Squad

    @BasToTheMax said:

    @beanman109 said:

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.

    I'm self-hosting Open WebUI too, running llama 3.1 8B via ollama on my Contabo VPS.
    It generates about 1 word per second.

    I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

  • Levi Member

    @beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

    That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

  • jnd Member
    edited January 28

    I don't think it's worth it without a GPU, or at all. You can simply pay per token at OpenRouter or similar; you get 1M input + 1M output tokens for under $1 (DeepSeek R1 Distill Llama 70B, for example). Unless you really need a private instance, you can't beat the shared-compute model that gets costs very low.
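
    For reference, the client side is just the OpenAI-compatible API pointed at OpenRouter - a sketch; the model slug below is an example, so check their model list for current names and pricing:

    ```
    # Pay-per-token via OpenRouter instead of self-hosting.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",                   # placeholder
    )

    resp = client.chat.completions.create(
        model="deepseek/deepseek-r1-distill-llama-70b",  # example slug, may change
        messages=[{"role": "user", "content": "Rewrite this paragraph: ..."}],
    )
    print(resp.choices[0].message.content)
    ```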

  • beanman109 Member, Megathread Squad

    @Levi said:

    @beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

    That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

    It's running on an E3-1275 v5 that I pay $100 a year for - cheaper than ChatGPT ¯\_(ツ)_/¯

  • @cainyxues said:
    @beanman109 isn't there an uncensored model too [just saying]

    Dolphin Mistral is a nice uncensored model. It answers pretty much any question, no matter how sketchy, but I don't think any of the 'big players' offer uncensored versions of their models (DeepSeek included): https://ollama.com/library/dolphin-mistral:7b

    An issue with DeepSeek is that it generates a LOT of tokens whilst it's 'thinking' (basically an internal dialogue) and then uses that content to provide a coherent answer... which means it's really slow to answer questions on lowend hardware.
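
    One blunt mitigation on slow hardware is to cap how many tokens the model is allowed to generate - a sketch using Ollama's num_predict option (the distilled tag below is just an example; the cap can cut an answer short, but it stops runaway thinking):

    ```
    # Hard-cap the number of generated tokens for a reasoning model.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:8b",           # example distilled tag
            "prompt": "Answer briefly: why is the sky blue?",
            "stream": False,
            "options": {"num_predict": 512},     # upper bound on generated tokens
        },
        timeout=600,
    )
    print(r.json()["response"])
    ```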

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Update:

    phi4:14b shits the bed when doing anything currently.
    Tested on the 1275 v5 and an i5-12500: it only eats about 14GB of RAM, but it types at about 1-3 words per second after thinking about things for a good minute or two, and the code it outputs is way worse than llama from my couple of quick tests.

    llama3.1:8b I would actually call usable on both machines: it wrote my HTML clock code in about 1-2 minutes on the 1275 v5 and about 30 seconds on the i5-12500, while using about 10GB of RAM.

  • rattlecattle Member
    edited January 28

    Been running the DeepSeek R1 distilled models on a 128GB dedi with an 8GB GTX 1080 GPU. Performance is acceptable so far.

    You can only run the distilled models of DeepSeek R1; running the actual R1 isn't possible on consumer hardware anyway.

    Also, the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned with DeepSeek R1.
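
    For anyone with a similar 8GB card, splitting a distilled model between GPU and system RAM is straightforward with llama-cpp-python built with CUDA support - a sketch where the model path and layer count are placeholders to tune against actual VRAM usage:

    ```
    # Partial GPU offload: as many layers as fit in 8 GB VRAM, the rest stays in system RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/deepseek-r1-distill-qwen-14b.Q4_K_M.gguf",  # placeholder
        n_ctx=4096,
        n_gpu_layers=20,   # tune until VRAM is nearly full
        n_threads=8,
    )

    print(llm("Summarize: ...", max_tokens=200)["choices"][0]["text"])
    ```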
