AI server GPU recommendations
professorparabellum
Member
in General
I want to build a machine to run a local LLM. I run a Minecraft server that sees around 1,200 players at peak, and we get somewhere in the ballpark of 1 million chat messages daily, which is impossible to moderate by hand. We have an AI chat filter, but it runs on Gemini, and even though they have the cheapest tokens we would still eat through an astronomical amount on a daily basis.
What GPU should I go with for this? I was looking at the Nvidia T4 because it has a very low TDP for how powerful it is and it's around $1k. They also have the L4, which is twice as fast and rated for the same TDP as the T4, but it's a lot more expensive.
Comments
I would look into consumer GPUs such as the higher end 30-series cards with high VRAM.
How 'intelligent' do you need the AI to be?
If you're just asking it to identify toxic comments you can probably use a 7B (so maybe 4-5GB of VRAM), but to parse that much text you might need to run multiple models in parallel to process it all, so you might want to consider getting multiple lower-end cards rather than one big one.
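For reference, loading a 7B model 4-bit quantized so it fits in roughly 4-5GB of VRAM looks something like this. Just a sketch, assuming the transformers + bitsandbytes stack; the model name and prompt are placeholders, not a recommendation:

```python
# Sketch: load a ~7B chat model in 4-bit so it fits in roughly 4-5GB of VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example only; any 7B chat model works here

quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_cfg,
    device_map="auto",  # spread across whatever GPUs are available
)

prompt = "Is the following Minecraft chat message toxic? Answer yes or no.\nMessage: you suck lol\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=3)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```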
Run a simple profanity filter, if it's enough for World of Warcraft it'll be enough for you.
What's the language model supposed to do? Considering the misspellings, you might have 10^7 tokens per day. It's a metric crapton of data for a language model.
Maybe 30 regexes could do the job instead.
@crunchbits has relatively cheap 4090's
It has to be intelligent enough to detect people trying to bypass the filter (e.g. fxck instead of fuck) and to detect conversations that imply anything inappropriate, like something sexual or racist.
I didn't think about token usage like that, that's good to know. Regex does maybe 50% of the job just fine, but the other half is messages that can't be caught by regex. Think: innuendos.
Gemini 1.5 Flash-8B costs $0.0375 per million input tokens.
An Nvidia T4 will realistically give you 20 tokens per second.
How did you come to the conclusion that spending $1k on Nvidia is cheaper than Gemini, where you'll pay 4 cents per million tokens?
Or are you using the Gemini Pro model and think that a model of that caliber can be run on a single GPU that is 6 years old?
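For scale, calling Flash-8B is only a few lines. A rough sketch with the google-generativeai client; the model name and prompt are only illustrative:

```python
# Sketch: classify one chat message with Gemini 1.5 Flash-8B.
# Assumes the google-generativeai package and a GEMINI_API_KEY env var.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-8b")

msg = "fxck you noob"
resp = model.generate_content(
    "Reply with only TOXIC or OK for this Minecraft chat message:\n" + msg
)
print(resp.text.strip())  # e.g. "TOXIC"
```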
Using an LLM for filtering is not a good choice; simpler language models from Hugging Face are a better fit.
For 7B-8B parameter models, the 3080 is a good choice, achieving an output of 90-100 tokens per second. Even a 4090 only reaches 120-140 tokens per second.
How many MC servers even run AI to do this?
WoW, RuneScape, League of Legends, and New World (to bring up big names) don't, as far as I'm aware. What do they do? They have a simple filter and report system. Someone bypasses the filter? Well, add what they did to it. It's a game of cat and mouse. They will always be ahead.
You also need to take into account the AI causing false positives, and if someone knows you're doing this, how they might spam chat messages in order to grief.
Most people here are quoting tokens per second, but that's output; you're more interested in input speed. Your output can be something simple like a number in [0, 1] that you threshold for immediate ban or manual review. Most LLM benchmarks that measure tokens/s and total response time are useless for you because, unless you want very detailed reports, they include the time it takes for the LLM to respond with full content.
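To make the "single number" idea concrete, here's a rough sketch of reading the next-token probability of a toxic/clean label from one forward pass instead of generating text. The model name and prompt wording are assumptions, not a recommendation:

```python
# Sketch: get a toxicity score in [0, 1] from one forward pass, no generation loop.
# Compares the next-token probability of "1" (toxic) vs "0" (clean).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # illustrative choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def toxicity_score(message: str) -> float:
    prompt = f"Answer 1 if this Minecraft chat message is toxic, 0 if not.\nMessage: {message}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]           # logits for the next token only
    one = tok.encode("1", add_special_tokens=False)[0]
    zero = tok.encode("0", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[zero, one]], dim=-1)
    return probs[1].item()                            # P("1") renormalised over {0, 1}

if toxicity_score("fxck you") > 0.9:
    print("flag for immediate ban")
```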
I suggest you rent a couple of GPUs to play around with, first in Google Colab because they have the T4 you are eyeing, then shell out a couple bucks an hour before dumping into your own rig.
You should also ask how much delay you can accept for issuing a ban. If you can process each user's chat logs once per day, you'll probably see a significant resource savings versus processing every sentence sent.
Latency for an Nvidia T4 is around 50ms per token on modern 8B models.
50ms * 20 = 1s, so it's 20 tokens per second in his case, the same number I quoted earlier.
Why do you think that is wrong, and why do you suggest looking at input token ingestion speed instead? He still needs the result (output) to censor these messages.
Either way, it doesn't matter at all, because Gemini Flash is way cheaper than an Nvidia T4 for input-intensive workloads; it's not even close. I have no idea how he managed to ramp up the Gemini price so much that a slow, 6-year-old Nvidia T4 is more budget friendly.
As I argued, you don't really need a lot of output tokens. It can be a single real number indicating the model's confidence that a ban is warranted, a list of indices that map to each input sentence, etc.; you can be creative.
OP wants to process as many as 1M input messages per day. I would wager the output of the model is far less than that, so using output token processing speed to estimate the level of compute required seems like a mismatch.
The GPU is doing 20 tokens per second.
An output of '0' / '1' is still one full token.
Therefore this GPU won't process more than 20 messages per second, because each message costs at least one output token.
Right?
You just cannot go above that.
And the thing is, Gemini Flash-8B costs pennies for that amount of volume.
If he processes 10M input tokens per day:
That's $0.40 per day.
$12 per month.
$144 per year.
And somehow buying a 6-year-old GPU that costs $1,000, will easily eat $12+ of electricity per month, and can fail at any time (requiring you to reinvest $1k)... is a budget solution? No way.
The time it takes to obtain the first token does not scale linearly with the number of input tokens. For example, this guy did a benchmark: https://www.reddit.com/r/LocalLLaMA/comments/1cjvic2/analysis_of_time_to_first_token_ttft_of_llms/
Even supposing a 1-token output and a 20-messages-per-second limit, one can easily engineer this to be higher, e.g. batch two messages at a time. Quoting some GPU's tokens/s limit doesn't tell you how many messages you can detect as spam per unit time.
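To illustrate the batching idea, a rough sketch of packing several messages into one prompt and parsing indices back out. The actual LLM call is left abstract because any backend would do:

```python
# Sketch: pack a batch of chat messages into one prompt and parse the reply.
import re

def build_batch_prompt(messages: list[str]) -> str:
    numbered = "\n".join(f"{i}: {m}" for i, m in enumerate(messages))
    return (
        "Below are Minecraft chat messages, one per line, prefixed with an index.\n"
        "Reply with only the indices of toxic messages, comma-separated, or NONE.\n\n"
        + numbered
    )

def parse_flagged(reply: str) -> list[int]:
    return [int(i) for i in re.findall(r"\d+", reply)]

batch = ["gg wp", "fxck you noob", "anyone selling diamonds?"]
prompt = build_batch_prompt(batch)
# reply = call_your_llm(prompt)   # hypothetical: plug in whatever backend you use
reply = "1"                        # example reply, for illustration only
print([batch[i] for i in parse_flagged(reply)])  # ['fxck you noob']
```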
Gemini Flash tokens are sold below the cost achievable with NVIDIA GPUs and typical household electricity.
My friend, batching requires padding so all prompts have the same number of tokens. Minecraft messages will be 1-50 characters long. It won't do magic, and you just introduced a bunch of latency because you need to hold incoming messages to batch-process them. It will be slower in practice.
If you take a look at how batching in commercial LLMs (for example Claude) works, you'll see that they gather prompts for hours just to avoid the penalty that comes from padding etc.
I'm not sure why you just ignored the fact that there is a benchmark showing there is little penalty to the TTFT when you 10x the number of tokens.
Whether or not batching is an acceptable solution is not up to either of us. All I know is that taking the figure of 20 tokens/s for a T4 and using it as some indication of performance for a downstream task is flaky at best.
Because the benchmark you've sent is on HBM, not GDDR, and additionally he won't even get Flash Attention.
It won't translate to the T4 at all; a lot of those nice techniques are either killed by bandwidth or by the lack of modern features.
Now that I'm thinking about OP's use case, I would say that detection shouldn't be done by an LLM at all.
An LLM could generate nice regexes if you feed it a lot of messages (which can be labeled to further improve the rules), and from then on the regexes alone would do the work. Why? Because if there's a bypass, you can add an additional rule in seconds. With an LLM you need to bloat the system prompt, which will eventually get out of hand and cause false positives.
REGEX is all you need
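To show what I mean, a rough sketch of that kind of rule set with character-substitution handling so "fxck"-style bypasses get caught. The word list and substitution map are only examples you'd keep extending from your own logs:

```python
# Sketch: a small rule-based filter that catches common character-substitution bypasses.
import re

SUBS = {"a": "[a@4]", "e": "[e3]", "i": "[i1!l]", "o": "[o0]", "u": "[uv*x]", "s": "[s5$z]"}
BANNED = ["fuck", "bitch"]  # illustrative; extend from flagged chat logs

def leetify(word: str) -> str:
    # "fuck" -> pattern tolerating substitutions, repeats, and separators (f u c k, fxck, ...)
    return r"[\W_]*".join(SUBS.get(c, re.escape(c)) + "+" for c in word)

FILTER = re.compile("|".join(leetify(w) for w in BANNED), re.IGNORECASE)

for msg in ["fxck you", "f u c k", "free diamonds at spawn"]:
    print(msg, "->", "BLOCKED" if FILTER.search(msg) else "ok")
```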
We're currently testing this out. So far the results seem promising, but we are training the models on our chat logs so they can fully understand the context and whatnot.
Nobody, to my knowledge. But it takes away the work of having to "catch" the mouse. Right now we're just feeding a week's worth of chat logs into a filtering model to see how accurate it is, and it seems to be delivering promising results.
Get a basic GPU server from Hetzner's auction and if that doesn't suit you, cancel it. It's billed hourly. They have multiple lineups that you can try.
I was looking at Gemini Pro, probably; I had no idea about Flash. Honestly, now that you bring this up, I might just have to scrap the idea of building an AI server; that price point is unbeatable.
It's available at AI Studio / OpenRouter at that price, and throughput is ~180 tok/s, so it's absolutely blazing fast; it won't introduce a lot of latency.
I would advise you to choose OpenRouter so you can easily switch between models, and also check out Llama 3.1 8B, Qwen2.5 7B and Gemma 9B.
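OpenRouter speaks the OpenAI-compatible API, so switching models is just a string change. A quick sketch; the model slugs are examples, check the current catalogue:

```python
# Sketch: compare a few small models through OpenRouter's OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

for model in ["google/gemini-flash-1.5-8b", "meta-llama/llama-3.1-8b-instruct", "qwen/qwen-2.5-7b-instruct"]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Reply TOXIC or OK: 'fxck you noob'"}],
        max_tokens=2,
    )
    print(model, "->", resp.choices[0].message.content)
```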
Yeah, I'll really have to look into that, thank you for this.
Hmmm, you could start with the T4 for cost-effectiveness and upgrade as your needs grow.
Take a look at this model list:
https://artificialanalysis.ai/leaderboards/models
Specifically look at the latency (median time to first chunk) - on a Minecraft server, messages delayed in chat by even 500ms are going to give players the perception of lag and be frustrating.
Take it from us, a GPU cloud provider: applying LLMs to every task is a common mistake that will likely cost you more time and money for this use case.
Let's crunch some numbers on using LLMs for chat filtering. Taking typical Minecraft messages (15 words × 4 tokens/word = 60 tokens), with 1M daily messages, we're looking at 60M tokens/day. At Gemini Flash's rate ($0.0375/1M tokens), that's $2.25 daily, scaling to $67.50 monthly or $810 yearly.
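A quick sanity check of those numbers:

```python
# Back-of-the-envelope cost of running every message through Gemini 1.5 Flash-8B.
messages_per_day = 1_000_000
tokens_per_message = 15 * 4            # ~15 words at ~4 tokens/word (rough upper bound)
price_per_million = 0.0375             # Flash-8B input pricing, USD

daily_tokens = messages_per_day * tokens_per_message          # 60,000,000
daily_cost = daily_tokens / 1_000_000 * price_per_million     # 2.25
print(daily_cost, daily_cost * 30, daily_cost * 30 * 12)      # 2.25, 67.5, 810.0
```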
While these costs might be justifiable, there are alternatives. One path is training a CPU-efficient filtering model, though latency becomes your main hurdle. Pre-existing datasets on Huggingface could save you the hassle of annotating your own chat logs.
For the self-hosted route, P100s offer solid price-to-performance value - I've used them in research. You could QLoRA a smaller model like Qwen-0.5B, deploy it with ExLlama or vLLM, and get impressive results. Though worth considering: if power costs exceed Gemini Flash pricing, the main benefit becomes customization flexibility.
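Deploying something like that with vLLM is only a few lines. A rough sketch; the model name is a stand-in for whatever checkpoint you end up tuning:

```python
# Sketch: serve a small instruct model with vLLM's offline engine and batch-classify messages.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")            # placeholder; use your tuned checkpoint
params = SamplingParams(max_tokens=2, temperature=0.0)    # we only want a short verdict

prompts = [
    "Reply TOXIC or OK for this Minecraft chat message: 'gg wp'",
    "Reply TOXIC or OK for this Minecraft chat message: 'fxck you noob'",
]
for out in llm.generate(prompts, params):
    print(out.prompt[-30:], "->", out.outputs[0].text.strip())
```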
As stated above, you don't need an LLM for this. Yours is a binary classification task (predict toxic or not toxic). You don't need to generate any text, which is the expensive part of such models.
Check out the toxicity classification models on HuggingFace: https://huggingface.co/models?sort=downloads&search=toxic
With optimized BERT-based models, you can classify a chat message in ≈1ms on a T4 or ≈10ms on a CPU.
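For example, a minimal sketch with one popular model from that search (unitary/toxic-bert here; any of the others slots in the same way):

```python
# Sketch: BERT-sized toxicity classifier, no text generation involved.
from transformers import pipeline

# pass device=0 to run on the first GPU instead of CPU
clf = pipeline("text-classification", model="unitary/toxic-bert")

msgs = ["gg well played", "fxck you noob"]
for msg, result in zip(msgs, clf(msgs)):
    print(msg, "->", result["label"], round(result["score"], 3))
```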