Choosing between renting dedicated servers with GPUs and using a GPU rental service for machine lear

vitobotta · November 2024

Hey everyone, I'll be working on revamping our recommendation system for event attendees. We also plan to add some features that could benefit from language capabilities similar to what OpenAI offers.

Right now, the recommendation system doesn't need much processing power and can run asynchronously on standard CPUs. But the new version will process some tasks in real time, so we'll need more powerful hardware.

Considering I might want to run some language-related tasks on self-hosted models rather than through OpenAI, which would be better: renting dedicated servers with GPUs or using a GPU rental service?

Has anyone used either option and can share their thoughts on price and performance? We need a solution that can manage about 500 AI-related requests per second.

Thanks a lot for any insights.

egoror · November 2024

AFAIK the rule of thumb is to use GPU rental for learning or project-specific work, and a dedicated server with a GPU for inference and long-term use?

AXYZE · November 2024

Will you use that performance 24/7? 12 hours a day?
One day a month?

If it's not constant usage, you're better off using hosted models such as Gemini Flash 8B, GPT-4o mini, or Llama on Together/Groq/Fireworks/SambaNova.

You'll pay 1% for 100% of the compute. There are no scaling limits and no failing hardware, and you can use any model you can imagine. You can even push 1000 rq/s and you still pay the same money! Just use multiple providers for the same model.

It's like S3. If you need to store 100GB/400GB/2TB for a couple of hours or a couple of days and then delete it, that will be WAY more cost-efficient than getting your own server with storage and running it 24/7. With LLMs the difference is even bigger, because GPUs are way more expensive than a 2TB HDD and there's no cheap way to make them HA (like with RAID1).
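
The utilization argument above can be turned into a rough break-even check. This is only a sketch: the server price, throughput, and hosted per-token price are hypothetical inputs you would replace with real quotes, and the function names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day billing month


def dedicated_cost_per_million_tokens(monthly_server_cost, tokens_per_second, utilization):
    """Effective cost per 1M tokens on a rented GPU server at a given utilization (0..1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilization
    return monthly_server_cost / tokens_per_month * 1_000_000


def break_even_utilization(monthly_server_cost, tokens_per_second, hosted_price_per_million):
    """Fraction of the month the server must be busy to match the hosted per-token price.

    A result above 1.0 means the server cannot match the hosted price even at 24/7 load.
    """
    max_tokens_per_month = tokens_per_second * SECONDS_PER_MONTH
    hosted_cost_at_full_load = max_tokens_per_month / 1_000_000 * hosted_price_per_million
    return monthly_server_cost / hosted_cost_at_full_load
```

With made-up numbers, `break_even_utilization(2592, 1000, 2.0)` says how busy a server costing 2592 per month and sustaining 1000 tokens/s has to be before it matches a hosted price of 2.0 per million tokens. The sketch ignores operations time, redundancy, and the difference between input and output token pricing.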

ScreenReader · November 2024

On-demand is always cheaper than renting a dedicated GPU server (unless you're the service provider). Take a look at OpenRouter.

vitobotta · November 2024

@egoror said:
AFAIK the rule of thumb is to use GPU rental for learning or project-specific work, and a dedicated server with a GPU for inference and long-term use?

I'm okay with handling those servers, even though it adds a bit more work. Do you think I could get better performance for the money this way instead of using a GPU rental service?

@AXYZE said:
Will you use that performance 24/7? 12 hours a day?
One day a month?

If it's not constant usage, you're better off using hosted models such as Gemini Flash 8B, GPT-4o mini, or Llama on Together/Groq/Fireworks/SambaNova.

You'll pay 1% for 100% of the compute. There are no scaling limits and no failing hardware, and you can use any model you can imagine. You can even push 1000 rq/s and you still pay the same money! Just use multiple providers for the same model.

It's like S3. If you need to store 100GB/400GB/2TB for a couple of hours or a couple of days and then delete it, that will be WAY more cost-efficient than getting your own server with storage and running it 24/7. With LLMs the difference is even bigger, because GPUs are way more expensive than a 2TB HDD and there's no cheap way to make them HA (like with RAID1).

It would be a part of our usual traffic, so it would be in constant use.
Can you explain what you mean by being able to use those models as much as I want without the cost changing?

@ScreenReader said:
On-demand is always cheaper than renting a dedicated GPU server (unless you're the service provider). Take a look at OpenRouter.

I have an OpenRouter account for personal use, although I mostly use local models on my Mac now. The pricing is really reasonable, especially for models like Qwen, even with the larger 72 billion parameter version, which generally gives better quality than the smaller 7 to 8 billion parameter models. However, the performance with OpenRouter can be quite inconsistent because they route requests to different providers. Sometimes the requests are really fast, and other times they are very slow, even with the same model.
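
One way to put numbers on that inconsistency is to record per-request latencies and compare the median with the tail. A minimal sketch, assuming you already collect latencies in milliseconds yourself; `latency_summary` is a hypothetical helper, not part of any provider's SDK.

```python
import statistics


def latency_summary(samples_ms):
    """Return median, 95th percentile and max for a list of latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    cut_points = statistics.quantiles(samples_ms, n=20, method="inclusive")
    return {
        "p50": statistics.median(samples_ms),
        "p95": cut_points[-1],
        "max": max(samples_ms),
    }
```

A large gap between `p50` and `p95` for the same model would be consistent with requests landing on providers of very different speed.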

AXYZE · November 2024

@vitobotta said:
It would be a part of our usual traffic, so it would be in constant use.
Can you explain what you mean by being able to use those models as much as I want without the cost changing?

Once you outgrow your hardware (tokens/second, model parameters, context that has to be stored in VRAM), you can't do anything other than spend a bunch of money again. It also means you have a problem until you fix it, which may be a disaster if it happens in the middle of work. Are you going to teleport to the nearest electronics shop and then teleport to the DC?
If, for example, you need performance of 80tk/s for one hour and less than 40rq/s the rest of the time, while one GPU does 50rq/s... the second GPU will just do nothing for 23 hours per day.

With hosted LLMs you don't care: you can send requests to 8 of them at once and now you have access to 8x the performance, while still paying the same amount per token. You outgrow your Llama providers? In a couple of minutes you can add Gemma endpoints and do a simple round-robin between them.
It's like comparing buying a server with running a Kubernetes cluster where you would pay $0 for compute, with $0 commitment, and only pay for bandwidth, with instant scaling up to 10x+ GPU. It makes a lot of sense to go with hosted LLMs, because you pay 1% for 100% of the compute/local LLM cost.

The big companies provide outstanding value for your money, because their HW is utilized constantly as people live in different timezones. This alone allows them to cut costs by 70%+, because they do not have "peak hours", as every hour is a "peak hour". There's no way to beat them on price unless you are really going to do 24/7 compute.
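
The "simple round-robin" between providers mentioned above could look roughly like this. It is a sketch, not any provider's API: each provider is represented by a hypothetical `(name, callable)` pair, and in practice the callable would wrap whatever HTTP client you use for that endpoint.

```python
import itertools


class RoundRobinRouter:
    """Rotate requests across providers; if one fails, fall over to the next."""

    def __init__(self, providers):
        # providers: list of (name, callable) pairs; callable takes a prompt string.
        if not providers:
            raise ValueError("at least one provider is required")
        self._providers = list(providers)
        self._cycle = itertools.cycle(range(len(self._providers)))

    def complete(self, prompt):
        errors = []
        # Try each provider at most once per request, continuing the rotation.
        for _ in range(len(self._providers)):
            name, call = self._providers[next(self._cycle)]
            try:
                return name, call(prompt)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append((name, exc))
        raise RuntimeError(f"all providers failed: {errors}")
```

Adding another endpoint is then a matter of appending one more pair to the list. The sketch is not thread-safe and does no rate limiting or backoff, which a production setup at hundreds of requests per second would likely need.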

vitobotta · November 2024

I haven't worked with hosted LLMs at scale, so I thought it would cost a lot more for our kind of usage. Now I understand your point better. We'll give that option a try first. How about tasks that don't use LLMs? Do you have any suggestions for services we could use instead of renting servers with GPUs?

AXYZE · November 2024

@vitobotta said:
I haven't worked with hosted LLMs at scale, so I thought it would cost a lot more for our kind of usage. Now I understand your point better. We'll give that option a try first. How about tasks that don't use LLMs? Do you have any suggestions for services we could use instead of renting servers with GPUs?

What exactly do you need? How long will it be used per day?
With LLMs it's heavily skewed towards hosted solutions, because an additional 40-100ms per reply doesn't matter, so a single US datacenter that serves the whole world is all they need. For other things, renting a GPU server may still be a good option.

vitobotta · November 2024

It's hard to give a precise estimate at this early stage, but we expect that requests using AI will make up part of our regular traffic. The usage will likely be spread out throughout the day, with some occasional spikes.
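
For traffic that is spread out with occasional spikes, dedicated hardware has to be sized for the peak, while the average load decides how much of it sits idle. A rough sizing sketch follows; the 500 requests per second comes from the first post, the 50 rq/s per GPU is the hypothetical figure used earlier in the thread, and the average load is a pure assumption.

```python
import math


def gpus_needed(peak_requests_per_second, requests_per_second_per_gpu):
    """Number of GPUs required to absorb the peak load."""
    return math.ceil(peak_requests_per_second / requests_per_second_per_gpu)


def average_utilization(avg_requests_per_second, gpu_count, requests_per_second_per_gpu):
    """Share of the provisioned capacity that is actually used on average."""
    return avg_requests_per_second / (gpu_count * requests_per_second_per_gpu)


# Assumed inputs: peak of 500 rq/s, 50 rq/s per GPU, average of 200 rq/s.
gpu_count = gpus_needed(500, 50)
utilization = average_utilization(200, gpu_count, 50)
print(gpu_count, utilization)
```

The lower the resulting utilization, the more the comparison tilts towards pay-per-use services, since idle dedicated capacity is still billed.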
