Minimum spec for ollama with llama 3.2 3B
windswept321
Member
Hey guys, I have a small batch use-case for running an ollama instance 24/7 with llama 3.2 3B.
Unfortunately I’m Scottish so…
What do you think is the minimum I could get away with for running it? VPS or dedi?
Thanks!
Comments
I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.
Wait what?
How is the speed?
What are the CPU specs?
CPU: EPYC 9554.
Speed: around 5 tokens/sec.
You should use OpenRouter first. https://openrouter.ai/meta-llama/llama-3.2-3b-instruct
The price is very cheap for that model.
131,000 context
$0.015/M input tokens
$0.025/M output tokens
I also have an NVIDIA 3060 GPU in my personal computer and a 4060 Ti in another PC.
Those can run 8B models easily as well.
I don't think renting a server with a GPU is worth the price.
For example, instead of renting a GPU, you can buy $100 of credit on OpenRouter. They even have several free models.
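As a rough illustration of how cheap that is at the prices quoted above (the request volume below is a made-up example, swap in your own numbers):

```python
# Rough monthly cost estimate for llama-3.2-3b-instruct on OpenRouter,
# using the per-million-token prices quoted above. The traffic figures
# are hypothetical placeholders.

INPUT_PRICE_PER_M = 0.015   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.025  # USD per 1M output tokens

# Hypothetical workload: 5,000 requests/day, ~800 input + ~300 output tokens each
requests_per_day = 5_000
input_tokens = requests_per_day * 800 * 30   # tokens per month
output_tokens = requests_per_day * 300 * 30  # tokens per month

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"~${cost:.2f}/month for {input_tokens/1e6:.0f}M in / {output_tokens/1e6:.0f}M out tokens")
# -> roughly $2.93/month for that (hypothetical) volume
```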
Any 4 vCPU server that can fit the GGUF quant into RAM will work. On a VPS, the speed will depend on how much your neighbors saturate the RAM bandwidth. If you want a dedicated server for this, look for one with DDR5, because realistically memory speed is all that matters.
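As a rough sketch of that "does the quant fit in RAM" check (the GGUF file sizes below are approximate, check the actual download size for your quant, and the overhead allowance is a guess):

```python
# Back-of-the-envelope check: will a GGUF quant of Llama 3.2 3B fit in RAM?
# Approximate GGUF file sizes in GB -- verify against the real files.
QUANT_SIZE_GB = {
    "Q4_K_M": 2.0,
    "Q8_0": 3.4,
    "F16": 6.4,
}

def fits_in_ram(quant: str, ram_gb: float, overhead_gb: float = 1.0) -> bool:
    """Weights plus a rough allowance for KV cache / runtime / OS must fit.
    overhead_gb is a guess; tune it for your context length and setup."""
    return QUANT_SIZE_GB[quant] + overhead_gb < ram_gb

for quant in QUANT_SIZE_GB:
    print(quant, "| fits in 4 GB:", fits_in_ram(quant, 4), "| fits in 8 GB:", fits_in_ram(quant, 8))
```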
Honestly, it's more about picking the right tool for the job. You're better off renting a GPU from vast.ai or RunPod (my pick) to finish the same task in 3 hours instead of a month; you'll save your time and possibly your money. Even the cheapest GPU on these platforms should munch through your data easily.
Nice. It's slow but still decent 👌
Yeah, I knew about it. But sometimes it may help to have an LLM locally, for privacy reasons (I think).
I have some spare compute and RAM (sans GPU) and would love to try something out.
Hence the q.
Agreed. GPU makes it easy.
Like I think a local one may help for privacy purposes only. Other than that, I prefer to use monthly or prepaid credits and be done with it.
Thanks
I’m getting like 10-11 tokens/s for the Llama 3.2 3B Q8 model on an iPhone 16 Pro Max.
This is for 24/7 website functionality on the backend. Nothing too strenuous.
I don't want to trust random services over a few years when something is likely to break at some point and give me headaches...
prompt eval rate: 41.47 tokens/s
eval rate: 13.27 tokens/s
13.27 is the average tokens per second using the qwen2.5-coder:7b model
on an AMD EPYC 9654 96-core processor (QEMU pc-i440fx-9.0 virtual CPU @ 2.0 GHz).
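If you want to reproduce those rate numbers yourself, ollama's non-streaming /api/generate response includes the token counts and durations; a minimal sketch (the model tag and prompt here are just placeholders, and it assumes ollama is on its default port):

```python
# Compute prompt-eval / eval rates from a local ollama instance.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:3b",  # placeholder model tag
        "prompt": "Explain why RAM bandwidth matters for CPU inference.",
        "stream": False,
    },
    timeout=300,
).json()

# Durations are reported in nanoseconds.
prompt_rate = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
eval_rate = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
print(f"eval rate: {eval_rate:.2f} tokens/s")
```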
https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/ and colocation maybe?
I'd say it would be easier to use services. If anything "breaks," it's as simple as changing the URI you invoke the API on, basically a one-line change if you keep it in config.
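A minimal sketch of that idea, assuming you point an OpenAI-compatible client at either a local ollama instance or a hosted provider like OpenRouter (the environment variable names here are arbitrary choices for the example):

```python
# Switch between a local ollama instance and OpenRouter purely via config.
# Both expose OpenAI-compatible chat endpoints, so only the base URL,
# API key and model name change.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),  # local ollama default
    # e.g. LLM_BASE_URL=https://openrouter.ai/api/v1 to switch to OpenRouter
    api_key=os.getenv("LLM_API_KEY", "ollama"),  # ollama ignores the key; OpenRouter needs a real one
)

resp = client.chat.completions.create(
    model=os.getenv("LLM_MODEL", "llama3.2:3b"),  # e.g. meta-llama/llama-3.2-3b-instruct on OpenRouter
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```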
For my initial use case, a 3B LLM works fine. I've started working on automated translation though... seems like it will need something a bit stronger so I'll look into these services too, thanks.
As long as it's set it and forget it with non-expiring credit it should be ok.
I still need to figure out the bare minimum for a 3B Llama 3.2 instance though
Also interesting, thanks! Near future maybe...