LLM (deepseek?) on KimSufi server

Adam1 Member
edited January 28 in Help

With some of the cheap 64GB and 32GB dedis from Kimsufi around, I'm wondering if anyone has experience running LLMs on CPU-only Kimsufi servers. DeepSeek looks promising?

More to the point, any benchmarks?

I'm considering using one of my 64GB RAM KS's, primarily for local API access and rewriting, so performance isn't too important as long as it gets the job done.

I'll be picking up an AMD APU with 128GB RAM later in the year, but until then I'm limited to 16GB machines at home, although the memory bandwidth is probably much higher than most KS's.
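
For anyone who already has numbers, please share them. Otherwise, this is roughly how I'd measure it myself - a minimal sketch with llama-cpp-python, where the model path is a placeholder for whatever quantized GGUF you've downloaded and n_threads should match the box's physical cores:

```
# Rough tokens/sec check on a CPU-only box.
# pip install llama-cpp-python ; the GGUF path below is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,     # bigger contexts slow CPU inference down noticeably
    n_threads=8,    # match the physical core count of the dedi
)

prompt = "Rewrite this paragraph in a more formal tone: ..."
start = time.time()
out = llm(prompt, max_tokens=256)
elapsed = time.time() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.2f} tok/s")
```

If anyone runs something like this on a KS box, tok/s numbers would be much appreciated.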

Comments

  • spiritlhl Member
    edited January 28

    At least 16GB RAM and an RTX 3060 8GB GPU; CPU-only, you can just run the 8B version.

  • Neoon Community Contributor, Veteran

    That's why I buy a KS-Game-LE.

    The 14B runs on the Nocix boxes, but it's too slow.
    It runs fine on a Ryzen 2600.

    The i7 should be in between; it should be fast enough.

    I'm hoping for a 32 or 64GB upgrade though.

  • Adam1 Member

    @Neoon said: That's why I buy a KS-Game-LE.

    Did you run an LLM on it? What kind of performance?

  • Neoon Community Contributor, Veteran

    @Adam1 said:

    @Neoon said: That's why I buy a KS-Game-LE.

    Did you run an LLM on it? What kind of performance?

    My order is still pending, so I shall wait.
    I guess for $12 you can't go wrong with that hardware.

    The E5 would have worse performance and cost more.

  • seenu Member

    Without a GPU, the speed will be damn slow.

    Eventually you will lose interest in using it

  • Araki Member
    edited January 28

    The memory speed is all that realistically matters, and you probably shouldn't expect good memory speed out of a Kimsufi server. I had one with 32GB and they put in two RAM modules clocked at 2133 MHz. The two-channel + low-clock combo is awful for LLMs; all it would realistically run is 7B models, and DeepSeek, which needs ~200GB even for the lowest quants, won't even fit.

    Productive CPU-only inference for big (~70B) models is possible, but you should hunt for 8-channel DDR5 servers. Refer to this spreadsheet, which is outdated but gives the general idea, and, specifically for DeepSeek, to this GitHub comment where someone runs the 4-bit version of DeepSeek V3 on an EPYC server. Still, expect prompt processing to take a while if you're planning to use long prompts.
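
    If you want to sanity-check what a particular box's RAM can actually do before committing, a crude copy test is enough to compare machines - a rough sketch (a proper STREAM-style benchmark is more accurate):

    ```
    # Very rough memory-bandwidth estimate; not a substitute for STREAM,
    # but enough to compare two servers.
    import time
    import numpy as np

    N = 512 * 1024 * 1024 // 8           # ~512 MB of float64
    a = np.random.rand(N)
    b = np.empty_like(a)

    runs = 10
    start = time.time()
    for _ in range(runs):
        np.copyto(b, a)                  # one read + one write per element
    elapsed = time.time() - start

    bytes_moved = 2 * a.nbytes * runs    # read a + write b, each run
    print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s effective copy bandwidth")
    ```

    Dual-channel DDR4 at 2133 MHz will land far below what an 8-channel DDR5 box shows here, which is the whole point.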

  • Adam1 Member

    @seenu said: Without a GPU, the speed will be damn slow.
    Eventually you will lose interest in using it

    I won't be using it interactively. Some baseline performance is required, but it can be quite low.

  • Adam1 Member

    @Araki said: The memory speed is all that realistically matters

    Sure, but let's just say we're limited to KS servers for the purposes of this thread. Many of us have idlers that could be used for this, so to me it's interesting and potentially useful to see what's possible on such old hardware.

  • ScreenReader Member

    From my experience with 8-13B models, if you load them into 64GB of DDR4 RAM you'll only get 8-10 tok/s at most. Once you go past 4k of context it crawls down to 4 tok/s, and it halves again once you put 8k of context on it (Q5_K_M with llama.cpp, AVX2, and KV cache enabled).

    If your work is on-demand and not running continuously all day, IMO it's not worth the wait (you waste way too much time). The only time I'd recommend doing this is when your data is really sensitive and you're under NDA to keep it secure (yes, I've done this with 64GB DDR4 + llama.cpp-vulkan on an AMD gfx803 GPU too).

    Otherwise, just use an on-demand GPU that you can deploy with a script / Terraform, use the endpoint, then destroy the instance once you're done. Some providers even offer hourly or per-second pricing (you only pay for the GPU while it's actively running inference, then you pay only for CPU/RAM per hour for the runner container).

    My recommendations if you want to go with the second option:
    https://modal.com/pricing
    https://www.runpod.io/pricing
    I've been using them for more than a year, and the recent price decreases really help reduce my spending. I never go over $10 a month for the actual work I need done (summarizing files, code generation, multi-language translation, OCR using a vision model, text-to-audio generation, audio-to-audio generation).
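
    Once an instance like that is up, the client side is just an OpenAI-compatible API call - a sketch where the endpoint URL, key and model name are placeholders for whatever you deployed (most GPU providers expose an OpenAI-compatible endpoint via something like vLLM):

    ```
    # Call a temporary GPU endpoint, then tear the instance down afterwards
    # so you stop paying for the GPU. All identifiers below are placeholders.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://YOUR-ENDPOINT.example.com/v1",  # placeholder
        api_key="YOUR_KEY",                               # placeholder
    )

    resp = client.chat.completions.create(
        model="your-deployed-model",                      # placeholder
        messages=[{"role": "user", "content": "Rewrite this paragraph: ..."}],
    )
    print(resp.choices[0].message.content)
    ```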

  • Adam1 Member

    @ScreenReader said: Otherwise, just use an on-demand GPU that you can deploy with a script / Terraform, use the endpoint, then destroy the instance once you're done. Some providers even offer hourly or per-second pricing (you only pay for the GPU while it's actively running inference, then you pay only for CPU/RAM per hour for the runner container).

    Thanks, but let's stick to KS hardware for this thread. There are countless other threads for other platforms.

  • Adam1 Member

    @ScreenReader said: From my experience with 8-13B models, if you load them into 64GB of DDR4 RAM you'll only get 8-10 tok/s at most.

    Well, that's OK actually, as it would mostly just be rewriting paragraphs via API. Nothing interactive. That's my use case anyway.

  • raindog308 Administrator, Veteran

    @Adam1 said: I'll be picking up an AMD APU with 128GB RAM later in the year, but until then I'm limited to 16GB machines at home, although the memory bandwidth is probably much higher than most KS's.

    16GB GPU or 16GB system RAM?

    If you have 32GB or 64GB of RAM on a laptop, run your LLM on that. It's going to outperform a Kimsufi.

    Of course, you're not gambling a lot to try out a Kimsufi.

    I have a $20-a-month ColoCrossing dedi with 32GB of RAM...perhaps I should try running an LLM on it...

  • Adam1 Member

    @raindog308 said: 16GB GPU or 16GB system RAM?

    System RAM, so CPU-only LLMs. I do have a laptop 3060 with (only) 6GB VRAM. However, I'm more interested in what I can do with my KS dedis that have a lot of spare CPU and RAM.

  • Adam1 Member

    @raindog308 said: I have a $20-a-month ColoCrossing dedi with 32GB of RAM...perhaps I should try running an LLM on it...

    Won't hurt, only time to waste finding out :D

    I hope cheap AMD APUs with configurable VRAM will be possible at KS at some point, maybe in 10+ years' time haha.

  • beanman109 Member, Megathread Squad

    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

  • kvz12 Member

    @beanman109 said:
    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

    What are the obvious reasons?

  • beanman109 Member, Megathread Squad

    @kvz12 said:

    @beanman109 said:
    I still need to find a use for a 32GB dedi, what LLM would everyone recommend running locally? Not interested in Deepseek for obvious reasons

    What are the obvious reasons?

    Bias in the responses from the Chinese government, end of story. We don't need to turn this into a political thread; I just want answers that aren't DeepSeek.

  • cainyxues Member

    @beanman109 isn't there an uncensored model too [just saying]

  • cainyxues Member

    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

  • Levi Member

    A bit off topic: so DeepSeek is better while free vs. the cgpt pro plan? Damn, I'm paying $20/mo for cgpt :O. Using it for coding.

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    @beanman109 isn't there an uncensored model too [just saying]

    Not as far as I know? Unless the locally run version is uncensored

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.
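
    For poking at it from a script instead of the web UI, Ollama's local REST API works too - a minimal sketch, assuming the phi4:14b tag has already been pulled and Ollama is listening on its default port:

    ```
    # Send one chat request to a locally running Ollama instance.
    import requests

    r = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "phi4:14b",
            "messages": [{"role": "user", "content": "Write a small HTML/JS countdown clock."}],
            "stream": False,
        },
        timeout=600,  # CPU-only generation can take a while
    )
    print(r.json()["message"]["content"])
    ```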

  • BasToTheMax Member

    @beanman109 said:

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.

    I'm self-hosting Open WebUI too, running llama 3.1 8B via ollama on my Contabo VPS.
    It generates about 1 word per second.
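
    To turn "about 1 word per second" into a comparable tokens/sec figure, the generate endpoint reports token counts and timings - a rough sketch against Ollama's default port:

    ```
    # Measure generation speed from Ollama's response metadata.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",
            "prompt": "Rewrite this paragraph in plain English: ...",
            "stream": False,
        },
        timeout=600,
    )
    data = r.json()

    # eval_count = generated tokens, eval_duration is in nanoseconds
    print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.2f} tok/s")
    ```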

  • beanman109 Member, Megathread Squad

    @BasToTheMax said:

    @beanman109 said:

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Trying out llama3.1 on Open WebUI currently. I believe ollama has support for phi-4, so I'll have a look at that soon. It's running on my Nuyek dedi, so 32GB DDR4 and an E3-1275 v5.

    I'm self-hosting Open WebUI too, running llama 3.1 8B via ollama on my Contabo VPS.
    It generates about 1 word per second.

    I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

  • Levi Member

    @beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

    That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

  • jnd Member
    edited January 28

    I don't think it's worth it without a GPU, or at all. You can simply pay per token at OpenRouter or similar; you get 1M input + 1M output tokens for under $1 (DeepSeek R1 Distill Llama 70B, for example). Unless you really need a private instance, you can't beat the shared-compute model that gets costs very low.
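
    For reference, the client side is just the OpenAI-compatible API pointed at OpenRouter - a sketch; the model slug below is an example, so check their model list for current names and pricing:

    ```
    # Pay-per-token via OpenRouter instead of self-hosting.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",                   # placeholder
    )

    resp = client.chat.completions.create(
        model="deepseek/deepseek-r1-distill-llama-70b",  # example slug, may change
        messages=[{"role": "user", "content": "Rewrite this paragraph: ..."}],
    )
    print(resp.choices[0].message.content)
    ```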

  • beanman109 Member, Megathread Squad

    @Levi said:

    @beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

    That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

    It's running on an E3-1275 v5 that I pay $100 a year for - cheaper than ChatGPT ¯\_(ツ)_/¯

  • @cainyxues said:
    @beanman109 isn't there an uncensored model too [just saying]

    Dolphin Mistral is a nice uncensored model. It answers pretty much any question, no matter how sketchy, but I don't think any of the 'big players' offer uncensored versions of their models (DeepSeek included): https://ollama.com/library/dolphin-mistral:7b

    An issue with DeepSeek is that it generates a LOT of tokens whilst it's 'thinking' (basically an internal dialogue) and then uses that content to provide a coherent answer... which means it's really slow to answer questions on lowend hardware.
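
    One blunt mitigation on slow hardware is to cap how many tokens the model is allowed to generate - a sketch using Ollama's num_predict option (the distilled tag below is just an example; the cap can cut an answer short, but it stops runaway thinking):

    ```
    # Hard-cap the number of generated tokens for a reasoning model.
    import requests

    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:8b",           # example distilled tag
            "prompt": "Answer briefly: why is the sky blue?",
            "stream": False,
            "options": {"num_predict": 512},     # upper bound on generated tokens
        },
        timeout=600,
    )
    print(r.json()["response"])
    ```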

  • beanman109 Member, Megathread Squad

    @cainyxues said:
    As for 32GB RAM dedis, you have to understand that the dedis provided by OVH or even ColoCrossing don't have new-gen CPUs, so you won't get the performance required; it's better to stick with a local machine. Btw @beanman109, I heard phi-4, the new Microsoft open-source model, is good too. Try it and give a review on how it performs 😉

    Update:

    phi4:14b shits the bed when doing anything currently.
    Tested on the 1275 v5 and an i5-12500: it only eats about 14GB of RAM, but it types at about 1-3 words per second after thinking about things for a good minute or two, and the code it outputs is way worse than llama from my couple of quick tests.

    llama3.1:8b I would actually call usable on both machines: it wrote my HTML clock code in about 1-2 minutes on the 1275 v5 and about 30 seconds on the i5-12500, while using about 10GB of RAM.

  • rattlecattle Member
    edited January 28

    Been running the DeepSeek R1 distilled models on a 128GB dedi with an 8GB GTX 1080 GPU. Performance is acceptable so far.

    You can only run the distilled models of DeepSeek R1; running the actual R1 isn't possible on consumer hardware anyway.

    Also, the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned with DeepSeek R1.
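
    For anyone with a similar 8GB card, splitting a distilled model between GPU and system RAM is straightforward with llama-cpp-python built with CUDA support - a sketch where the model path and layer count are placeholders to tune against actual VRAM usage:

    ```
    # Partial GPU offload: as many layers as fit in 8 GB VRAM, the rest stays in system RAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="/models/deepseek-r1-distill-qwen-14b.Q4_K_M.gguf",  # placeholder
        n_ctx=4096,
        n_gpu_layers=20,   # tune until VRAM is nearly full
        n_threads=8,
    )

    print(llm("Summarize: ...", max_tokens=200)["choices"][0]["text"])
    ```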
