LLM (deepseek?) on KimSufi server

jnd · January 2025

@beanman109 said:

@Levi said:

@beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

It's running on a E3-1275 v5 that I pay $100 a year for - cheaper than ChatGPT ¯_(ツ)_/¯

$100 per year is quite expensive for such slow output. I mean I just checked one of the better coder models at OpenRouter, you can go through more than 400M tokens before reaching your $100:
Qwen2.5 Coder 32B Instruct
qwen/qwen-2.5-coder-32b-instruct
Created Nov 11, 2024
33,000 context (might be too low for larger projects)
$0.07/M input tokens, $0.16/M output tokens
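
To put a rough number on that, here is a minimal sketch of the arithmetic using only the two prices quoted above. The function names are made up for illustration, and it ignores any platform fees or surcharges on credit purchases, so treat the result as an upper bound rather than a quote.

```python
# Prices quoted above, in USD per million tokens.
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.16


def tokens_for_budget(budget_usd: float, price_per_million: float) -> float:
    """Return how many tokens a budget buys at a flat per-million price."""
    return budget_usd / price_per_million * 1_000_000


def blended_tokens(budget_usd: float, input_share: float) -> float:
    """Tokens bought when `input_share` (0..1) of all tokens are input tokens.

    The split between input and output is an assumption you have to
    supply yourself; it depends entirely on how you use the model.
    """
    blended_price = (
        input_share * INPUT_PRICE_PER_M
        + (1 - input_share) * OUTPUT_PRICE_PER_M
    )
    return tokens_for_budget(budget_usd, blended_price)


if __name__ == "__main__":
    budget = 100.0
    print(f"output only: {tokens_for_budget(budget, OUTPUT_PRICE_PER_M) / 1e6:.0f}M tokens")
    print(f"input only:  {tokens_for_budget(budget, INPUT_PRICE_PER_M) / 1e6:.0f}M tokens")
    print(f"50/50 mix:   {blended_tokens(budget, 0.5) / 1e6:.0f}M tokens")
```

At the output price alone, $100 works out to about 625M tokens. Input tokens are cheaper still, so any mix of the two stays above 400M.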

beanman109 · January 2025

@jnd said:

@beanman109 said:

@Levi said:

@beanman109 said: I got a 90 line HTML/JS code prompt for a clock / countdown timer website completed in about 1-2 minutes (rough guess)

That's bad... very bad. In cgpt 4o it is like 10 - 15 seconds or less.

It's running on a E3-1275 v5 that I pay $100 a year for - cheaper than ChatGPT ¯_(ツ)_/¯

$100 per year is quite expensive for such slow output. I mean I just checked one of the better coder models at OpenRouter, you can go through more than 400M tokens before reaching your $100:
Qwen2.5 Coder 32B Instruct
qwen/qwen-2.5-coder-32b-instruct
Created Nov 11, 2024
33,000 context (might be too low for larger projects)
$0.07/M input tokens, $0.16/M output tokens

sir this is a website about selfhosting and servers
to me it adds up to pay $100 per year for a box that can selfhost a remote LLM reasonably well for what i intend to use it for, that's not including the fact that server can have other uses while it's not cpu maxed outputting prompts

also i don't understand LLM tokens so i don't know how quickly i would go through 400M of them, could smash them out in a month for all i know

beanman109 · January 2025

@jnd said: $100 per year is quite expensive for such slow output. I mean I just checked one of the better coder models at OpenRouter, you can go through more than 400M tokens before reaching your $100:

i just tried openrouter with $5 of credit, the $5 turned into $4.4 after they deducted fees? then i made a single prompt and now i'm at $4.37

this aint it chief

jnd · January 2025

@beanman109 said:

sir this is a website about selfhosting and servers
to me it adds up to pay $100 per year for a box that can selfhost a remote LLM reasonably well for what i intend to use it for, that's not including the fact that server can have other uses while it's not cpu maxed outputting prompts

also i don't understand LLM tokens so i don't know how quickly i would go through 400M of them, could smash them out in a month for all i know

Sure, I know, because I tried to self-host an LLM too and ended up giving away the VPS because it was a waste of time without a GPU (and with one you will pay much more). I'm just saying that you can either slowly run some tiny model with not-so-great output quality, or pay pennies for something better. Or run it on your home PC.

Roughly speaking, one token is a bit less than one word (a common rule of thumb for English is about three-quarters of a word per token). You can try your typical task and see how many tokens it consumes. For small tasks it won't be much at all.
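
If you want a ballpark before trying it for real, here is a minimal sketch. It assumes the common rule of thumb of about four tokens for every three English words, which is only an approximation: real tokenizers differ per model, and code usually tokenizes less efficiently than prose. The function names and the usage numbers at the bottom are hypothetical.

```python
# Assumption: ~4 tokens per 3 English words (rule of thumb, not exact).
TOKENS_PER_3_WORDS = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its whitespace-separated words."""
    words = len(text.split())
    return round(words * TOKENS_PER_3_WORDS / 3)


def days_until_exhausted(total_tokens: float, tokens_per_day: float) -> float:
    """Return how many days a token allowance lasts at a steady daily usage."""
    return total_tokens / tokens_per_day


if __name__ == "__main__":
    prompt = "Write a small HTML and JS page with a clock and a countdown timer"
    print("prompt tokens (estimate):", estimate_tokens(prompt))

    # Hypothetical usage: 100 requests a day at 2,000 tokens each.
    daily = 100 * 2_000
    print("days for 400M tokens:", days_until_exhausted(400_000_000, daily))
```

With those made-up usage numbers, 400M tokens would last 2,000 days, so smashing through them in a month would take far heavier use than the occasional coding prompt.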

HostSlick · January 2025

I plan to buy 50 or 100 single/dual EPYC servers to start with once we get another batch of racks, which will be delivered in summer.
Time to offer GPU Options i guess.

The Gigabyte G292-Z20 can fit up to 4 GPUs.

Levi · January 2025

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

HostSlick · January 2025

@Levi said:

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

Side gig to try it out. I'm getting the said Gigabyte servers myself anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?

And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.

beanman109 · January 2025

@HostSlick said: And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

Sounds like those badboys have been mined on

Levi · January 2025

4000€ for gpu

wadhah · January 2025

@HostSlick said:

@Levi said:

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

Side gig to try it out. I'm getting the said Gigabyte servers myself anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?

And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.

Not sure if I'm allowed to link sites or not, so this is a screenshot from a Google search for the A100

Is this not the same card you were offered? At less than half price? I can link the website if allowed (or just google a100 price), the 80gb version is 18.5k

HostSlick · January 2025

@wadhah said:

@HostSlick said:

@Levi said:

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

Side gig to try it out. I'm getting the said Gigabyte servers myself anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?

And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.

Not sure if I'm allowed to link sites or not, so this is a screenshot from a Google search for the A100

Is this not the same card you were offered? At less than half price? I can link the website if allowed (or just google a100 price), the 80gb version is 18.5k

40gb one

wadhah · January 2025

@HostSlick said:

@wadhah said:

@HostSlick said:

@Levi said:

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

Side gig to try it out. I'm getting the said Gigabyte servers myself anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?

And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.

Not sure if I'm allowed to link sites or not, so this is a screenshot from a Google search for the A100

Is this not the same card you were offered? At less than half price? I can link the website if allowed (or just google a100 price), the 80gb version is 18.5k

40gb one

Holy shit man, is the offer you got reliable or just random?

HostSlick · January 2025

@wadhah said:

@HostSlick said:

@wadhah said:

@HostSlick said:

@Levi said:

@HostSlick said: Time to offer GPU Options i guess.

Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs age very fast, faster than CPUs. A CPU can be milked for a decade; a GPU gets 2-3 years before 2 generations pass, and you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...

Side gig to try it out. I'm getting the said Gigabyte servers myself anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?

And what do you think about the Nvidia Tesla A100? I recently got an offer of 4000€ each

When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.

Not sure if I'm allowed to link sites or not, so this is a screenshot from a Google search for the A100

Is this not the same card you were offered? At less than half price? I can link the website if allowed (or just google a100 price), the 80gb version is 18.5k

40gb one

Holy shit man, is the offer you got reliable or just random?

Reliable supplier, but the price applies to a minimum of 10 pieces.

You can find a lot on eBay for 4500-4600 though, so.

dav848 · January 2025

Buy an Apple M2 device

cainyxues · January 2025

@beanman109 said:

@cainyxues said:
@beanman109 isn't there an uncensored model too [just saying]

Not as far as I know? Unless the locally run version is uncensored

Yeah, the locally hosted version is uncensored. By the way, sorry to hear that phi-4 was not up to the mark. I saw the reviews online, and they were quite on point, so I thought maybe it was good.

gks · January 2025

@rattlecattle said:
Been running Deepseek R1 the distilled models, on a 128 GB dedi with a 8 GB GTX 1080 GPU. Its performance is acceptable so far.

Can only run the distilled models of deepseek r1. Running the actual deepseek r1 isn't possible on consumer hardware anyway.

Also the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned on DeepSeek R1 outputs.

Ryzen, DDR 5, and 2 x 8 GB gpu works?

cainyxues · January 2025

@gks said:
Ryzen, DDR 5, and 2 x 8 GB gpu works?

Why not? It will run for sure, but what's the GPU model [AMD are generally bad as I have heard]

gks · January 2025

@cainyxues said:

@gks said:
Ryzen, DDR 5, and 2 x 8 GB gpu works?

Why not? It will run for sure, but what's the GPU model [AMD are generally bad as I have heard]

A used Nvidia one from eBay would be a fine start. It's not worth buying a new expensive one for a homelab.

cainyxues · January 2025

@gks said:

@cainyxues said:

@gks said:
Ryzen, DDR 5, and 2 x 8 GB gpu works?

Why not? It will run for sure, but what's the GPU model [AMD are generally bad as I have heard]

A used Nvidia one from eBay would be a fine start. It's not worth buying a new expensive one for a homelab.

Yup, that would be fine. But it might also be better to look into the M-series or K-series GPUs, as they have more VRAM and also work great for LLMs.

rattlecattle · January 2025

@gks said:

@rattlecattle said:
Been running Deepseek R1 the distilled models, on a 128 GB dedi with a 8 GB GTX 1080 GPU. Its performance is acceptable so far.

Can only run the distilled models of deepseek r1. Running the actual deepseek r1 isn't possible on consumer hardware anyway.

Also the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned on DeepSeek R1 outputs.

Ryzen, DDR 5, and 2 x 8 GB gpu works?

It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.
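
As a rough way to check the "fit the complete model in the combined GPU memory" rule, here is a back-of-the-envelope sketch. It only counts the weights (parameters times bits per weight) and adds a flat overhead for KV cache and runtime buffers. The 20% overhead, the function names, and the example model sizes are assumptions for illustration, not anything Ollama reports, and real quantized files often use slightly more bits per weight than their nominal figure.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    gpu_vram_gb: list[float],
    overhead_fraction: float = 0.2,  # assumption: KV cache, buffers, etc.
) -> bool:
    """Return True if weights plus overhead fit in the combined VRAM."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= sum(gpu_vram_gb)


if __name__ == "__main__":
    two_cards = [8.0, 8.0]  # the 2 x 8 GB setup asked about
    for size in (7, 14, 32):  # hypothetical model sizes, in billions of parameters
        need = weight_memory_gb(size, 4)
        ok = fits_in_vram(size, 4, two_cards)
        print(f"{size}B @ 4-bit: ~{need:.1f} GB weights, fits in 2 x 8 GB: {ok}")
```

Anything that does not fit spills over to system RAM and the CPU, which is where the slowdown comes from.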

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

cainyxues · January 2025

@rattlecattle said:
It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

How does it do that? I have learnt that the slower the data transfer, the more it will bottleneck performance. [There was a very famous YouTuber who tried this with Mac minis over an Ethernet connection, if I remember correctly, and it bottlenecked.]

Neoon · January 2025

@rattlecattle said:

@gks said:

@rattlecattle said:
Been running Deepseek R1 the distilled models, on a 128 GB dedi with a 8 GB GTX 1080 GPU. Its performance is acceptable so far.

Can only run the distilled models of deepseek r1. Running the actual deepseek r1 isn't possible on consumer hardware anyway.

Also the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned on DeepSeek R1 outputs.

Ryzen, DDR 5, and 2 x 8 GB gpu works?

It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

So we buy more KS-Game-LE to make an A.I cluster? ok

gks · January 2025

@cainyxues said:

@rattlecattle said:
It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

How does it do that? I have learnt that the slower the data transfer, the more it will bottleneck performance. [There was a very famous YouTuber who tried this with Mac minis over an Ethernet connection, if I remember correctly, and it bottlenecked.]

I prefer dedicated servers with GPUs on the motherboard bus rather than using Ethernet for shuffling data. I have lots of data on damn old HDDs. The new trend in AI now is to generate SQL for a data warehouse and to generate reports and dashboards using the AI itself.

allthemtings · January 2025

@Neoon said:

@rattlecattle said:

@gks said:

@rattlecattle said:
Been running Deepseek R1 the distilled models, on a 128 GB dedi with a 8 GB GTX 1080 GPU. Its performance is acceptable so far.

Can only run the distilled models of deepseek r1. Running the actual deepseek r1 isn't possible on consumer hardware anyway.

Also the distilled models are not the same as the actual R1. It's more like, say, the base Llama model fine-tuned on DeepSeek R1 outputs.

Ryzen, DDR 5, and 2 x 8 GB gpu works?

It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

So we buy more KS-Game-LE to make an A.I cluster? ok

That one guy who ordered bulk LE-B's with the 1245v5's

rattlecattle · January 2025

@cainyxues said:

@rattlecattle said:
It does. Even can get away with older Xeons. The GPU - single or multiple matters the most. If running via Ollama it would automatically offload the layers to both the GPUs. The ideal config is to fit in the complete model in the combined GPU memory.

That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

How does it do that? I have learnt that the slower the data transfer, the more it will bottleneck performance. [There was a very famous YouTuber who tried this with Mac minis over an Ethernet connection, if I remember correctly, and it bottlenecked.]

Plain Ethernet would definitely bottleneck. Best to have the setup in a single system or to use something like GPUDirect RDMA for interconnection.

@Neoon said: So we buy more KS-Game-LE to make an A.I cluster? ok

Why not. Probably time to hook up the electric toothbrush. Every device counts.

gks · January 2025

@beanman109 said:

@jnd said: $100 per year is quite expensive for such slow output. I mean I just checked one of the better coder models at OpenRouter, you can go through more than 400M tokens before reaching your $100:

i just tried openrouter with $5 of credit, the $5 turned into $4.4 after they deducted fees? then i made a single prompt and now i'm at $4.37

this aint it chief

If your project is not continuous, you can get an Azure trial account for experiments: 30 days, 200 USD credit. You may find cheap trial-like offers for about 10 USD. I can't genuinely promise it, but a few online services offer them.

Azure has AI services, including ChatGPT models. For learning purposes you may spend about 10 USD but get 200 USD of credit. I use vector DB, AI Search, Cosmos DB, many analytics services, Data Factory, etc. along with Azure AI and IoT. The credit expires after the month, so obviously it is a throwaway account.

Not useful for serious projects, but for learning it is cheap. You can use a React web app that works with a Python API for ChatGPT; that way you can also stop paying 20 USD per month for ChatGPT.

ChatGPT batch processing reduces cost a lot.

I wish these cloud platforms would open-source their stack, or offer a meaningful way for students and early adopters to learn things. The cloud vendor tools themselves become technical debt for you, and they cost a lot when the vendor stops supporting them or the ecosystem is too weak.

cainyxues · January 2025

You guys should also try Groq, it's free and fast. Also, OpenRouter provides Llama models for free, right? There is also the Google Gemini free tier with Flash 2.0 and real-time interaction.

Adam1 · January 2025

@HostSlick said: Time to offer GPU Options i guess.

You could try looking at AMD APUs: they can be configured to use as much system RAM as you like, take up far less space and consume far less power. The Ryzen "AI" chips are great for this. Similar performance to M4 chips, but with the advantage of being able to use far more RAM (and running x86).

Adam1 · January 2025

@cainyxues said:
You guys should also try Groq, it's free and fast. Also, OpenRouter provides Llama models for free, right? There is also the Google Gemini free tier with Flash 2.0 and real-time interaction.

This thread was supposed to be about running it on cheap dedis, for all kinds of reasons. There are countless threads about LLMs on other services.

Adam1 · January 2025

@rattlecattle said: That being said there is something like https://github.com/exo-explore/exo which aims to combine multiple devices as one powerful inference cluster. Haven't used though.

A YT channel I watch has dabbled with it, with mixed results clustering M4 Mac Minis. It'll probably get better, though, so it is something to watch.
