
Comments
$100 per year is quite expensive for such slow output. I mean I just checked one of the better coder models at OpenRouter, you can go through more than 400M tokens before reaching your $100:
Qwen2.5 Coder 32B Instruct
qwen/qwen-2.5-coder-32b-instruct
Created Nov 11, 2024
33,000 context (might be too low for larger projects)
$0.07/M input tokens, $0.16/M output tokens
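To put the quoted pricing into numbers, here is a minimal sketch of the budget arithmetic. The two prices are the ones listed above; the function name is made up for illustration, and any fees charged when buying credit are not modelled.

```python
# Prices quoted above for qwen/qwen-2.5-coder-32b-instruct, in USD per million tokens.
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.16


def tokens_for_budget(budget_usd: float, price_per_million: float) -> float:
    """Return how many tokens a budget buys at a given per-million-token price."""
    return budget_usd / price_per_million * 1_000_000


if __name__ == "__main__":
    budget = 100.0  # the yearly server cost being compared against
    input_m = tokens_for_budget(budget, INPUT_PRICE_PER_M) / 1e6
    output_m = tokens_for_budget(budget, OUTPUT_PRICE_PER_M) / 1e6
    print(f"input only:  about {input_m:.0f}M tokens")
    print(f"output only: about {output_m:.0f}M tokens")
```

Even if every token were billed at the higher output rate, $100 buys about 625M tokens, so "more than 400M" is a conservative figure.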
sir this is a website about selfhosting and servers
to me it adds up to pay $100 per year for a box that can selfhost a remote LLM reasonably well for what i intend to use it for, that's not including the fact that server can have other uses while it's not cpu maxed outputting prompts
also i don't understand LLM tokens so i don't know how quickly i would go through 400M of them, could smash them out in a month for all i know
i just tried openrouter with $5 of credit, the $5 turned into $4.4 after they deducted fees? then i made a single prompt and now i'm at $4.37
this aint it chief
Sure, I know, because I tried to self-host an LLM too and ended up giving away the VPS because it was a waste of time without a GPU (and with one you will pay much more). I'm just saying that you can slowly run some tiny model with not-so-great output quality, or you can pay pennies for something better. Or run it on your home PC.
Roughly speaking, one token is one word. You can try your typical task and see how many tokens it consumes. For small tasks it won't be much at all.
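A rough way to try that before spending anything is to count words, following the one-token-per-word rule of thumb. This is only a sketch: the function names are hypothetical, real tokenizers usually split code and uncommon words into several tokens, so actual counts tend to be higher, and the usage figure reported by the provider is the one that gets billed.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: one whitespace-separated word counts as one token."""
    return len(text.split())


def estimate_cost_usd(prompt: str, expected_output_words: int,
                      input_price_per_m: float = 0.07,
                      output_price_per_m: float = 0.16) -> float:
    """Estimate one request's cost using the per-million prices quoted earlier."""
    input_tokens = estimate_tokens(prompt)
    output_tokens = expected_output_words  # same one-word-one-token assumption
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


if __name__ == "__main__":
    prompt = "write a function that reverses a string"
    print(estimate_tokens(prompt), "tokens (estimated)")
    print(f"${estimate_cost_usd(prompt, expected_output_words=200):.8f} per request (estimated)")
```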
I plan to buy 50 or 100 single/dual Epyc servers to start with, once we get another batch of racks, which will be delivered in summer.
Time to offer GPU Options i guess.
The Gigabyte G292-Z20 can fit up to 4 GPUs.
Unfeasible. Either you concentrate entirely on the GPU business or make GPU hosting a side gig (a very expensive one). GPUs get old very fast, faster than CPUs. A CPU can be milked for a decade; a GPU lasts 2-3 years before 2 generations pass, and then you are done. Power draw alone is insane and no one wants to pay for a 3-year-old card...
Side gig to try it out. I'm getting the said Gigabyte servers anyway for normal uses. Why not do it when a client asks for a quote and you can fulfill it?
And what do you think about the Nvidia Tesla A100? I recently got an offer for 4000€ each
When a client asks for it and has the money to rent it, why not. I don't expect this to go in bulk.
Sounds like those badboys have been mined on
4000€ for gpu
Not sure if I'm allowed to link sites or not, so this is a screenshot from a Google search for the A100
Is this not the same card you were offered? At less than half price? I can link the website if allowed (or just google a100 price), the 80gb version is 18.5k
40gb one
Holy shit man, is the offer you got reliable or just random?
Reliable supplier, but the price applies for a minimum of 10 pieces.
On eBay you can find a lot though for 4500-4600, so.
Buy an M2 Apple device
yaa the locally hosted version is uncensored, btw sorry to hear that phi-4 was not up to the mark. I saw the reviews online, and they were quite on point so I thought maybe they are good.
Would Ryzen, DDR5, and 2 x 8 GB GPUs work?
Why not? It will run for sure, but what's the GPU model? [AMD are generally bad, as I have heard]
A used Nvidia one from eBay would be a fine start. Not worth buying a new expensive one for a homelab
Yup, would be fine. But it might also be better to look into M series or K series GPUs, as they have more VRAM and also work great for LLMs
It does. You can even get away with older Xeons. The GPU, single or multiple, matters the most. If running via Ollama, it would automatically offload the layers to both GPUs. The ideal config is to fit the complete model in the combined GPU memory.
That being said, there is something like https://github.com/exo-explore/exo which aims to combine multiple devices into one powerful inference cluster. Haven't used it though.
How does it do that? I have learnt that the slower the data transfer, the more it bottlenecks performance. [There was a very famous YouTuber who tried this with Mac Minis over an Ethernet connection, if I remember, and it bottlenecked.]
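Whether a model fits in the combined GPU memory can be estimated on paper. The sketch below assumes weight size is parameters times bits per weight, plus a flat overhead for context cache and runtime buffers. The 2 GB overhead and both function names are illustrative assumptions, not values taken from Ollama, and real quantised files are usually somewhat larger than the nominal bit width suggests.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameter count times bytes per weight."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 gpu_vram_gb: list[float], overhead_gb: float = 2.0) -> bool:
    """Check whether weights plus an assumed overhead fit in the combined VRAM."""
    needed = model_size_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= sum(gpu_vram_gb)


if __name__ == "__main__":
    gpus = [8.0, 8.0]  # the 2 x 8 GB setup asked about above
    for params in (7, 14, 32):
        verdict = "fits" if fits_in_vram(params, 4, gpus) else "does not fit"
        print(f"{params}B at 4 bits: ~{model_size_gb(params, 4):.1f} GB of weights, {verdict}")
```

Under these assumptions a 14B model at 4 bits fits in 2 x 8 GB, while a 32B model at 4 bits (about 16 GB of weights alone) does not, so part of it would end up in system RAM and run slower.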
So we buy more KS-Game-LE to make an A.I cluster? ok
I prefer dedicated servers with GPUs and the motherboard bus rather than using Ethernet for shuffling. I have lots of data on damn old HDDs. The new trend in AI now is to generate SQL for the data warehouse and generate reports and dashboards using AI itself.
That one guy who ordered bulk LE-B's with the 1245v5's

Plain Ethernet would definitely bottleneck. Best to have the setup in a single system or to use something like GPUDirect RDMA for interconnection.
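To get a feel for the gap, here is a back-of-the-envelope comparison of how long it takes to move the same payload over different links. The link speeds are nominal figures (PCIe 4.0 x16 is roughly 32 GB/s per direction), the 100 MB payload is an arbitrary example, and latency and protocol overhead are ignored, so real numbers will be worse, especially for many small transfers.

```python
# Nominal link bandwidths in GB/s (1 GB = 1000 MB); real throughput is lower.
LINKS_GB_PER_S = {
    "1 Gbit Ethernet": 1 / 8,    # 1 Gbit/s = 0.125 GB/s
    "10 Gbit Ethernet": 10 / 8,  # 10 Gbit/s = 1.25 GB/s
    "PCIe 4.0 x16": 32.0,        # approximate, per direction
}


def transfer_ms(payload_mb: float, link_gb_per_s: float) -> float:
    """Time in milliseconds to move payload_mb over a link, ignoring latency."""
    payload_gb = payload_mb / 1000
    seconds = payload_gb / link_gb_per_s
    return seconds * 1000


if __name__ == "__main__":
    payload_mb = 100.0  # hypothetical amount of data exchanged between devices
    for name, speed in LINKS_GB_PER_S.items():
        print(f"{name}: {transfer_ms(payload_mb, speed):.1f} ms")
```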
Why not. Probably time to hook up the electric toothbrush. Every device counts.
If your project is not continuous, you can get an Azure trial account for experiments: 30 days, 200 USD credit. You may also find some cheap trial-like offers for 10 USD. I can't promise they are genuine, but a few online services offer them.
Azure has AI services, and it has ChatGPT models. For learning purposes you may spend about 10 USD but get 200 USD credit. I use vector DB, AI Search, Cosmos DB, many analytics services, Data Factory, etc. along with Azure AI and IoT. After the month the credit expires, so obviously it is a throwaway account.
Not useful for serious projects, but for learning it is cheap. You can use a React web app that works with a Python API for ChatGPT; that way you can also stop paying 20 USD per month for ChatGPT.
ChatGPT batch reduces cost a lot.
I wish these cloud platforms would open-source their stack, or allow a meaningful way for students and early adopters to learn things, as the cloud vendor tools themselves cost you technical debt, and cost a lot when the vendor stops supporting them or the ecosystem is too weak.
you guys should also try groq, it's free and fast. also openrouter provides llama models for free, right? also there is the google gemini free tier with flash 2.0 and real-time interaction
You could try looking at AMD APUs - they can be configured to use as much system RAM as you like, take up far less space and consume far less power. The Ryzen "AI" chips are great for this. Similar performance to M4 chips, but with the advantage of being able to use far more RAM (and running x86).
this thread was supposed to be about running it on cheap dedis - for all kinds of reasons. There are countless threads about LLMs on other services.
A YT channel I watch has dabbled with it, with mixed results clustering M4 Mac Minis. It'll probably get better, though, so it is something to watch.