Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!


Shells Virtual Desktop
BMail.ag - Secure Email Service
Server.net
CPLicense.net
VPS Server
Buy VPN
Vultr
VMs for AI
HostDare
ReliableSite White-Label Dedicated Hosting for Resellers
InterServer VPS
BMail.ag - Secure Email Service
Best VPN
High-Performance Bare Metal Server Solutions
Karvl.com
Server Mania Cloud Hosting
DataWagon Hosting
AlphaVPS Hosting
Evoxt.com
Clouvider
VPS Hosting with NVMe
Residential IPs in the US & 4G Mobile Proxies in EU & US with Unlimited Bandwidth
ReliableSite White-Label Dedicated Hosting for Resellers
Rabisu - Hosting Solutions
Shells Virtual Desktop
New on LowEndTalk? Please Register and read our Community Rules.

All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.

VPS hosting for LLMs.

Need your opinion on what you're using to host LLMs?
I am starting to learn AI and would love to host my own LMS online.
What would be the cheapest way I can host LLM?

Comments

  • mrTommrTom Member

    CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

    So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

  • If you are just testing and playing around, Hugging Face would be a good place to start.

    The Pro plan costs $9/month and will give you a daily quota of 25 minutes of half H200 GPU (time counted only processing the requests).

    It is not much, but other than that the cheapest I could find when I looked into it was $0.50/hour, and that was for the instance being on, not actually the processing time.

    Thanked by 1dev077
  • NeoonNeoon Community Contributor, Veteran

    KS-LE-B, dedi with 64gig, can be abused 24/7, nobody is gonna complain.
    on the VPS the other hand, you might run into abuse problems.

    Besides that, CPU only isn't going to be fast, prob cheaper just to a GPUs with 16GB+ VRAM.

    Thanked by 1ariq01
  • If you only need it for a few hours at a time, I quite like vast.ai

  • @mrTom said:
    CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

    So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

    If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:

    Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

  • @idroid007 said:
    Need your opinion on what you're using to host LLMs?

    in my own production, 4x3090 x 2. just some coder models along with it's embedded/reranker stack. being used for ~40 ish people.

    I am starting to learn AI and would love to host my own LMS online.

    if you're just starting / learning to deploy LLM, use free service that hands out credit to deploy gpu server like modal.com. if you're willing to pay take a look at runpod / salad.com

    What would be the cheapest way I can host LLM?

    rent them gpu by hours. if you trust openrouter enough, go with them. https://openrouter.ai/docs/guides/features/zdr
    don't ever get into BigGPU purchasing without proper planning (especially your in/out token spending). well, unless you can buy them at MSRP i guess. see https://www.latent.space/p/gpu-bubble

    if your client really need confidentiality and can't use external resources. make them pay for their own gpu server. i manage few H100 cluster like this for banks / private corporation.

    @motafoka said:

    @mrTom said:
    CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

    So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

    If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:
    ...
    Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

    CPU inference really going to crawl once you have a lot of context (like beyond 100k).
    imo using huggingface pro for inference doesn't seems to be efficient. 2mil token is nothing (ymmv)

    Thanked by 2dev077 motafoka
  • I tried to run LLama on my local dedi on E5-2660v4 with 64GB RAM and it was very slow. I had to wait c.1 min for an answer to appear.

  • I've been running self-hosted Llama models on my own VPS for about a year now. Honestly the biggest factor is RAM – you need at least 16GB for anything useful, 32GB+ if you want decent speed.
    I use Ollama – easiest setup I've found, works well on Ubuntu with Docker. For cheap VPS with enough RAM, look at Hetzner or OVH – better price/RAM ratio than most providers.
    The catch nobody mentions: LLMs hammer your CPU between requests and memory leaks are real if you run them 24/7. Worth monitoring closely.

    Thanked by 1idroid007
  • host_chost_c Patron Provider, Top Host, Megathread Squad
    edited March 13

    @JohnFilch123 said: I tried to run LLama on my local dedi on E5-2660v4 with 64GB RAM and it was very slow. I had to wait c.1 min for an answer to appear.

    Bro, it is Friday, what do you expect, LLM is on vacation, you were lucky he replayed you in 1 MIN. :D

    But yes, x86 is not the way to go.

    @idroid007 - it will be much much cheaper to use a subscription based LLM at the moment, there are plenty you can choose from, let this be the problem of companies that have big pockets to perfect, don't waste your time and $$ for the moment on a technology that is basically in infant state and is as efficient as a 5.3L V8 Corvette from the 70's.

  • @host_c said: Bro, it is Friday, what do you expect, LLM is on vacation, you were lucky he replayed you in 1 MIN

    It was more like an experiment. Def not good to be run on a VPS.

    Thanked by 1host_c
  • @ScreenReader said:

    @idroid007 said:
    Need your opinion on what you're using to host LLMs?

    in my own production, 4x3090 x 2. just some coder models along with it's embedded/reranker stack. being used for ~40 ish people.

    I am starting to learn AI and would love to host my own LMS online.

    if you're just starting / learning to deploy LLM, use free service that hands out credit to deploy gpu server like modal.com. if you're willing to pay take a look at runpod / salad.com

    What would be the cheapest way I can host LLM?

    rent them gpu by hours. if you trust openrouter enough, go with them. https://openrouter.ai/docs/guides/features/zdr
    don't ever get into BigGPU purchasing without proper planning (especially your in/out token spending). well, unless you can buy them at MSRP i guess. see https://www.latent.space/p/gpu-bubble

    if your client really need confidentiality and can't use external resources. make them pay for their own gpu server. i manage few H100 cluster like this for banks / private corporation.

    @motafoka said:

    @mrTom said:
    CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

    So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

    If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:
    ...
    Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

    CPU inference really going to crawl once you have a lot of context (like beyond 100k).
    imo using huggingface pro for inference doesn't seems to be efficient. 2mil token is nothing (ymmv)

    I got the impression that Hugging Face pro would run on half nVidia H200.

    What was the use case on that $6 tokes usage?

  • @motafoka said:

    I got the impression that Hugging Face pro would run on half nVidia H200.

    What was the use case on that $6 tokes usage?

    testing mcp servers that we create to make sure it really works with the recent / popular LLM for the tool calling process.

    Thanked by 1motafoka
Sign In or Register to comment.