Minimum spec for ollama with llama 3.2 3B

windswept321 Member
edited December 2024 in Help

Hey guys, I have a small batch use case for running an ollama instance 24/7 with Llama 3.2 3B.

Unfortunately I’m Scottish so…

What do you think is the minimum I could get away with for running it? VPS or dedi?

Thanks!
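
Rough back-of-the-envelope for what "bare minimum" means in RAM terms, assuming the usual Q4_K_M GGUF of Llama 3.2 3B (the parameter count, bits per weight, and KV-cache shape below are assumptions, not measurements):

```python
# Hypothetical sizing sketch for llama3.2:3b on CPU; all constants are assumptions.

def estimate_ram_gb(params_b=3.2,          # Llama 3.2 3B is roughly 3.2B parameters
                    bits_per_weight=4.5,   # Q4_K_M averages about 4.5 bits per weight
                    n_layers=28,           # assumed model depth
                    n_kv_heads=8,          # grouped-query attention KV heads (assumed)
                    head_dim=128,
                    ctx_len=8192,
                    kv_bytes=2):           # fp16 KV cache
    weights = params_b * 1e9 * bits_per_weight / 8
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * ctx_len  # K and V
    overhead = 0.5e9                       # runtime buffers / OS headroom (guess)
    return (weights + kv_cache + overhead) / 1e9

print(f"~{estimate_ram_gb():.1f} GB")      # ~3.2 GB: a 4 GB VPS is tight, 8 GB is comfortable
```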

Comments

  • yakoudev Member

    I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.

  • plumberg Veteran, Megathread Squad

    @yakoudev said:
    I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.

    Wait what?
    How is the speed?
    What are the CPU specs?

  • yakoudev Member

    @plumberg said:

    @yakoudev said:
    I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.

    Wait what?
    How is the speed?
    What are the CPU specs?

    CPU: EPYC 9654.

    Speed: around 5 tokens/sec

  • quanhua92 Member
    edited December 2024

    You should use OpenRouter first. https://openrouter.ai/meta-llama/llama-3.2-3b-instruct
    The price is very cheap for that model.
    131,000-token context window
    $0.015/M input tokens
    $0.025/M output tokens

    I also have an NVIDIA 3060 GPU in my personal computer and a 4060 Ti in another PC.
    Those can run 8B models easily as well.
    I don't think renting a server with a GPU is worth the price.

    For example, instead of renting a GPU, you can buy $100 worth of tokens from OpenRouter. They even have several free models.
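
    For reference, a minimal sketch of what calling that model through OpenRouter looks like; it speaks the OpenAI-style API, so the openai Python package works (the environment variable name here is just an assumption):

    ```python
    # Hedged example: an OpenRouter chat completion for the model linked above.
    # Assumes `pip install openai` and an OPENROUTER_API_KEY environment variable.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    resp = client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)
    ```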

  • Araki

    Any 4vCPU server that can fit the GGUF quant into RAM will work. On a VPS, the speed will depend on how your neighbors saturate the RAM bandwidth. If you want to get a dedicated server for this, search for one with DDR5 because realistically the memory speed is all that matters.

    Honestly, it's more about picking the right tool for the job. You're better off renting a GPU from vast.ai or RunPod (my pick) to finish the same task in 3 hours instead of 1 month; you'll save your time and possibly your money. Even the cheapest GPU on these platforms should munch through your data easily.
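
    A quick way to sanity-check the "memory speed is all that matters" point: on CPU, decode speed is roughly memory bandwidth divided by the model's size, since every generated token streams all the weights once. The bandwidth figures below are ballpark assumptions, not benchmarks:

    ```python
    # Rough upper bounds on tokens/sec for a ~1.8 GB Q4 quant of a 3B model.
    # Bandwidth numbers are assumed typical values, not measurements.
    model_gb = 1.8

    for label, bw_gbs in [("dual-channel DDR4-3200", 50),
                          ("dual-channel DDR5-5600", 85),
                          ("busy shared VPS (guess)", 20)]:
        print(f"{label:26s} ~{bw_gbs / model_gb:5.1f} tok/s ceiling")
    ```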

  • plumberg Veteran, Megathread Squad

    @yakoudev said:

    @plumberg said:

    @yakoudev said:
    I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.

    Wait what?
    How is the speed?
    What are the CPU specs?

    CPU: EPYC 9654.

    Speed: around 5 tokens/sec

    Nice. It's slow but still decent 👌

  • plumberg Veteran, Megathread Squad

    @quanhua92 said:
    You should use OpenRouter first. https://openrouter.ai/meta-llama/llama-3.2-3b-instruct
    The price is very cheap for that model.
    131,000-token context window
    $0.015/M input tokens
    $0.025/M output tokens

    I also have an NVIDIA 3060 GPU in my personal computer and a 4060 Ti in another PC.
    Those can run 8B models easily as well.
    I don't think renting a server with a GPU is worth the price.

    For example, instead of renting a GPU, you can buy $100 worth of tokens from OpenRouter. They even have several free models.

    Yeah, I know about it. But sometimes it may help to have an LLM locally, for privacy reasons (I think).

    I have some spare compute and RAM (sans GPU) and would love to try something out.
    Hence the question.

  • plumberg Veteran, Megathread Squad

    @Araki said:
    Any 4vCPU server that can fit the GGUF quant into RAM will work. On a VPS, the speed will depend on how your neighbors saturate the RAM bandwidth. If you want to get a dedicated server for this, search for one with DDR5 because realistically the memory speed is all that matters.

    Honestly, it's more about picking the right tool for the job. You're better off renting a GPU from vast.ai or RunPod (my pick) to finish the same task in 3 hours instead of 1 month; you'll save your time and possibly your money. Even the cheapest GPU on these platforms should munch through your data easily.

    Agreed, a GPU makes it easy.
    I think a local one may help for privacy purposes only. Other than that, I'd prefer to use monthly or prepaid credits and be done with it.

    Thanks

  • I’m getting like 10-11 tokens/s for the Llama 3.2 3B Q8 model on an iPhone 16 Pro Max.

  • windswept321 Member

    This is for 24/7 website functionality on the backend. Nothing too strenuous.

    I don't want to trust random services over a few years when something is likely to break at some point and give me headaches...

  • yakoudev Member
    edited December 2024

    @plumberg said:

    @yakoudev said:

    @plumberg said:

    @yakoudev said:
    I run Qwen Coder 7B on a VPS with 16 GB RAM and 8 vCPUs; it runs smoothly.

    Wait what?
    How is the speed?
    What are the CPU specs?

    CPU: EPYC 9654.

    Speed: around 5 tokens/sec

    Nice. It's slow but still decent 👌

    prompt eval rate: 41.47 tokens/s
    eval rate: 13.27 tokens/s

    13.27 is the average tokens per second using the qwen2.5-coder:7b model
    on an AMD EPYC 9654 96-Core Processor (pc-i440fx-9.0 virtual CPU @ 2.0 GHz).
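
    For anyone who wants to reproduce those numbers: ollama's local REST API returns token counts and durations (in nanoseconds) with each response, so the rates can be computed directly. A small sketch, assuming the default localhost:11434 port and the requests package:

    ```python
    # Sketch: compute prompt eval / eval rates from ollama's /api/generate response.
    import requests

    r = requests.post("http://localhost:11434/api/generate", json={
        "model": "qwen2.5-coder:7b",
        "prompt": "Write a one-line Python hello world.",
        "stream": False,
    }).json()

    prompt_rate = r["prompt_eval_count"] / (r["prompt_eval_duration"] / 1e9)
    eval_rate = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")
    print(f"eval rate:        {eval_rate:.2f} tokens/s")
    ```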

  • tsusu Member
    edited December 2024

    @windswept321 said:
    This is for 24/7 website functionality on the backend. Nothing too strenuous.

    I don't want to trust random services over a few years when something is likely to break at some point and give me headaches...

    I'd say it would be easier to use services. If anything "breaks," it would be as simple as changing the URI you're invoking the API on, so it's like a one-line change if you have it as config.
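
    To illustrate the one-line-change point: ollama exposes an OpenAI-compatible endpoint and OpenRouter speaks the same API, so the backend can live entirely in config. A hedged sketch (the env var names are made up):

    ```python
    # Switch between a local ollama instance and OpenRouter by changing one setting.
    import os
    from openai import OpenAI

    if os.environ.get("LLM_BACKEND", "local") == "local":
        # ollama's OpenAI-compatible endpoint; the api_key value is ignored locally
        client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        model = "llama3.2:3b"
    else:
        client = OpenAI(base_url="https://openrouter.ai/api/v1",
                        api_key=os.environ["OPENROUTER_API_KEY"])
        model = "meta-llama/llama-3.2-3b-instruct"

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
    )
    print(resp.choices[0].message.content)
    ```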

  • windswept321 Member

    @tsusu said:

    @windswept321 said:
    This is for 24/7 website functionality on the backend. Nothing too strenuous.

    I don't want to trust random services over a few years when something is likely to break at some point and give me headaches...

    I'd say it would be easier to use services. If anything "breaks," it would be as simple as changing the URI you're invoking the API on, so it's like a one-line change if you have it as config.

    For my initial use case, a 3B LLM works fine. I've started working on automated translation though... it seems like that will need something a bit stronger, so I'll look into these services too, thanks.

    As long as it's set-and-forget with non-expiring credit, it should be OK.

    I still need to figure out the bare minimum for a 3B Llama 3.2 instance, though.
