Minimum spec for ollama with llama 3.2 3B
windswept321
Member
Hey guys, I have a small batch use-case for running an ollama instance 24/7 with llama 3.2 3B.
Unfortunately I’m Scottish so…
What do you think is the minimum I could get away with for running it? VPS or dedi?
Thanks!
Comments
I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.
Wait what?
How is the speed?
What are the CPU specs?
CPU: EPYC 9554.
Speed: around 5 tokens/sec.
You should use OpenRouter first. https://openrouter.ai/meta-llama/llama-3.2-3b-instruct
The price is very cheap for that model.
131,000 context
$0.015/M input tokens
$0.025/M output tokens
I also have an NVIDIA 3060 GPU in my personal computer and a 4060 Ti in another PC.
Those can run 8B models easily as well.
I don't think renting a server with a GPU is worth the price.
For example, instead of renting a GPU, you can buy $100 of credit on OpenRouter. They even have several free models.
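As a rough illustration of how cheap that is at the prices quoted above (the request volume below is a made-up example, swap in your own numbers):

```python
# Rough monthly cost estimate for llama-3.2-3b-instruct on OpenRouter,
# using the per-million-token prices quoted above. The traffic figures
# are hypothetical placeholders.

INPUT_PRICE_PER_M = 0.015   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.025  # USD per 1M output tokens

# Hypothetical workload: 5,000 requests/day, ~800 input + ~300 output tokens each
requests_per_day = 5_000
input_tokens = requests_per_day * 800 * 30   # tokens per month
output_tokens = requests_per_day * 300 * 30  # tokens per month

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"~${cost:.2f}/month for {input_tokens/1e6:.0f}M in / {output_tokens/1e6:.0f}M out tokens")
# -> roughly $2.93/month for that (hypothetical) volume
```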
Any 4 vCPU server that can fit the GGUF quant into RAM will work. On a VPS, the speed will depend on how much your neighbors saturate the RAM bandwidth. If you want a dedicated server for this, look for one with DDR5, because realistically memory speed is all that matters.
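As a rough sketch of that "does the quant fit in RAM" check (the GGUF file sizes below are approximate, check the actual download size for your quant, and the overhead allowance is a guess):

```python
# Back-of-the-envelope check: will a GGUF quant of Llama 3.2 3B fit in RAM?
# Approximate GGUF file sizes in GB -- verify against the real files.
QUANT_SIZE_GB = {
    "Q4_K_M": 2.0,
    "Q8_0": 3.4,
    "F16": 6.4,
}

def fits_in_ram(quant: str, ram_gb: float, overhead_gb: float = 1.0) -> bool:
    """Weights plus a rough allowance for KV cache / runtime / OS must fit.
    overhead_gb is a guess; tune it for your context length and setup."""
    return QUANT_SIZE_GB[quant] + overhead_gb < ram_gb

for quant in QUANT_SIZE_GB:
    print(quant, "| fits in 4 GB:", fits_in_ram(quant, 4), "| fits in 8 GB:", fits_in_ram(quant, 8))
```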
Honestly, it's more about picking the right tool for the job. You're better off renting a GPU from vast.ai or RunPod (my pick) to finish the same task in 3 hours instead of a month; you'll save your time and possibly your money. Even the cheapest GPU on these platforms should munch through your data easily.
Nice. It's slow but still decent 👌
Yeah, I knew about it. But sometimes it may help to have an LLM locally, for privacy reasons (I think).
I have some spare compute and RAM (sans GPU) and would love to try something out.
Hence the q.
Agreed. GPU makes it easy.
Like I think a local one may help for privacy purposes only. Other than that, I prefer to use monthly or prepaid credits and be done with it.
Thanks
I’m getting like 10-11 tokens/s for the Llama 3.2 3B Q8 model on an iPhone 16 Pro Max.
This is for 24/7 website functionality on the backend. Nothing too strenuous.
I don't want to trust random services over a few years when something is likely to break at some point and give me headaches...
prompt eval rate: 41.47 tokens/s
eval rate: 13.27 tokens/s
13.27 is the average tokens per second using the qwen2.5-coder:7b model
on an AMD EPYC 9654 96-core processor (QEMU pc-i440fx-9.0 virtual CPU @ 2.0 GHz).
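If you want to reproduce those rate numbers yourself, ollama's non-streaming /api/generate response includes the token counts and durations; a minimal sketch (the model tag and prompt here are just placeholders, and it assumes ollama is on its default port):

```python
# Compute prompt-eval / eval rates from a local ollama instance.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:3b",  # placeholder model tag
        "prompt": "Explain why RAM bandwidth matters for CPU inference.",
        "stream": False,
    },
    timeout=300,
).json()

# Durations are reported in nanoseconds.
prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate: {eval_rate:.2f} tokens/s")
```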
https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/ and colocation maybe?
I'd say it would be easier to use services. If anything "breaks," it's as simple as changing the URI you invoke the API on, basically a one-line change if you keep it in config.
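A minimal sketch of that idea, assuming you point an OpenAI-compatible client at either a local ollama instance or a hosted provider like OpenRouter (the environment variable names here are arbitrary choices for the example):

```python
# Switch between a local ollama instance and OpenRouter purely via config.
# Both expose OpenAI-compatible chat endpoints, so only the base URL,
# API key and model name change.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),  # local ollama default
    # e.g. LLM_BASE_URL=https://openrouter.ai/api/v1 to switch to OpenRouter
    api_key=os.getenv("LLM_API_KEY", "ollama"),  # ollama ignores the key; OpenRouter needs a real one
)

resp = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "llama3.2:3b"),  # e.g. meta-llama/llama-3.2-3b-instruct on OpenRouter
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```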
For my initial use case, a 3B LLM works fine. I've started working on automated translation though... seems like it will need something a bit stronger so I'll look into these services too, thanks.
As long as it's set it and forget it with non-expiring credit it should be ok.
I still need to figure out the bare minimum for a 3B Llama 3.2 instance though
Also interesting, thanks! Near future maybe...