VPS hosting for LLMs.

idroid007 · March 12

Need your opinion on what you're using to host LLMs?
I am starting to learn AI and would love to host my own LMS online.
What would be the cheapest way I can host LLM?

mrTom · March 12

CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

motafoka · March 12

If you are just testing and playing around, Hugging Face would be a good place to start.

The Pro plan costs $9/month and will give you a daily quota of 25 minutes of half H200 GPU (time counted only processing the requests).

It is not much, but other than that the cheapest I could find when I looked into it was $0.50/hour, and that was for the instance being on, not actually the processing time.

Neoon · March 12

KS-LE-B, dedi with 64gig, can be abused 24/7, nobody is gonna complain.
on the VPS the other hand, you might run into abuse problems.

Besides that, CPU only isn't going to be fast, prob cheaper just to a GPUs with 16GB+ VRAM.

nitrousdev · March 12

If you only need it for a few hours at a time, I quite like vast.ai

motafoka · March 12

@mrTom said:
CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:

Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

ScreenReader · March 13

@idroid007 said:
Need your opinion on what you're using to host LLMs?

in my own production, 4x3090 x 2. just some coder models along with it's embedded/reranker stack. being used for ~40 ish people.

I am starting to learn AI and would love to host my own LMS online.

if you're just starting / learning to deploy LLM, use free service that hands out credit to deploy gpu server like modal.com. if you're willing to pay take a look at runpod / salad.com

What would be the cheapest way I can host LLM?

rent them gpu by hours. if you trust openrouter enough, go with them. https://openrouter.ai/docs/guides/features/zdr
don't ever get into BigGPU purchasing without proper planning (especially your in/out token spending). well, unless you can buy them at MSRP i guess. see https://www.latent.space/p/gpu-bubble

if your client really need confidentiality and can't use external resources. make them pay for their own gpu server. i manage few H100 cluster like this for banks / private corporation.

@motafoka said:

@mrTom said:
CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:
...
Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

CPU inference really going to crawl once you have a lot of context (like beyond 100k).
imo using huggingface pro for inference doesn't seems to be efficient. 2mil token is nothing (ymmv)

JohnFilch123 · March 13

I tried to run LLama on my local dedi on E5-2660v4 with 64GB RAM and it was very slow. I had to wait c.1 min for an answer to appear.

Slav_FixFlex · March 13

I've been running self-hosted Llama models on my own VPS for about a year now. Honestly the biggest factor is RAM – you need at least 16GB for anything useful, 32GB+ if you want decent speed.
I use Ollama – easiest setup I've found, works well on Ubuntu with Docker. For cheap VPS with enough RAM, look at Hetzner or OVH – better price/RAM ratio than most providers.
The catch nobody mentions: LLMs hammer your CPU between requests and memory leaks are real if you run them 24/7. Worth monitoring closely.

host_c · March 13

@JohnFilch123 said: I tried to run LLama on my local dedi on E5-2660v4 with 64GB RAM and it was very slow. I had to wait c.1 min for an answer to appear.

Bro, it is Friday, what do you expect, LLM is on vacation, you were lucky he replayed you in 1 MIN.

But yes, x86 is not the way to go.

@idroid007 - it will be much much cheaper to use a subscription based LLM at the moment, there are plenty you can choose from, let this be the problem of companies that have big pockets to perfect, don't waste your time and $$ for the moment on a technology that is basically in infant state and is as efficient as a 5.3L V8 Corvette from the 70's.

JohnFilch123 · March 13

@host_c said: Bro, it is Friday, what do you expect, LLM is on vacation, you were lucky he replayed you in 1 MIN

It was more like an experiment. Def not good to be run on a VPS.

motafoka · March 15

@ScreenReader said:

@idroid007 said:
Need your opinion on what you're using to host LLMs?

in my own production, 4x3090 x 2. just some coder models along with it's embedded/reranker stack. being used for ~40 ish people.

I am starting to learn AI and would love to host my own LMS online.

if you're just starting / learning to deploy LLM, use free service that hands out credit to deploy gpu server like modal.com. if you're willing to pay take a look at runpod / salad.com

What would be the cheapest way I can host LLM?

rent them gpu by hours. if you trust openrouter enough, go with them. https://openrouter.ai/docs/guides/features/zdr
don't ever get into BigGPU purchasing without proper planning (especially your in/out token spending). well, unless you can buy them at MSRP i guess. see https://www.latent.space/p/gpu-bubble

if your client really need confidentiality and can't use external resources. make them pay for their own gpu server. i manage few H100 cluster like this for banks / private corporation.

@motafoka said:

@mrTom said:
CPUs are not efficient for LLM, it can be done, but only with "small" models (4B, 8B) and rather slow. For good performance you would need a GPU with lots of RAM (great timing!) which is going to be expensive.

So i would suggest not hosting it yourself but using one of the many APIs like Openrouter for the LLM Part.

If you are not yet a hardcore user, Hugging Face Pro subscription has 2million tokens monthly on some providers:
...
Disclaimer: I don't subscribe this service, ended up using my desktop GPU for my testing (RX 7800XT) - less processing power, smaller models, but it was already paid for.

CPU inference really going to crawl once you have a lot of context (like beyond 100k).
imo using huggingface pro for inference doesn't seems to be efficient. 2mil token is nothing (ymmv)

I got the impression that Hugging Face pro would run on half nVidia H200.

What was the use case on that $6 tokes usage?

ScreenReader · March 16

@motafoka said:

@ScreenReader said:
...

I got the impression that Hugging Face pro would run on half nVidia H200.

What was the use case on that $6 tokes usage?

testing mcp servers that we create to make sure it really works with the recent / popular LLM for the tool calling process.

Howdy, Stranger!

Categories

In this Discussion

VPS hosting for LLMs.

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

VPS hosting for LLMs.

Comments