

Comments
Actually, a Q2 quant runs with 256GB; 512GB would be better, though, for a Q4 quant.
$7..... (per megabyte)
I'm saving up for a house deposit using sticks of 16GB DDR4. I think I'm halfway there at 64GB total.
That's a lot of chrome tabs
I've tested it and I can't confirm this. The model's weights in 4-bit quantization come to less than 500 GB. I also tested the model on 8x RTX PRO 6000 cards, and the performance difference is huge. These online stories that every big model can be run on 1 TB of RAM are false. Any model run this way will be unusable.
People forget that the model weights are not the big problem, because you still need VRAM/RAM for the KV cache and context. It is not worth running such a model with an 8k context...
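To put rough numbers on the weights vs. KV cache point, here is a minimal sketch. It assumes a standard attention layout with an fp16 cache (2 bytes per element); models using MLA or a quantized KV cache will need less. Every config value below is a hypothetical placeholder, not the real config of any model mentioned in this thread.

```python
GIB = 2**30

def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache for one sequence in GiB.

    Stores one K and one V vector per layer, per KV head, per token.
    """
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / GIB

# Hypothetical model config (placeholders, assumed for illustration only).
N_PARAMS = 1.0e12      # assumed parameter count
BITS = 4.5             # assumed effective bits per weight for a 4-bit quant
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

print(f"weights: {weights_gib(N_PARAMS, BITS):.0f} GiB")
for ctx in (8_192, 131_072):
    kv = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx)
    print(f"KV cache at {ctx} tokens: {kv:.1f} GiB")
```

The cache grows linearly with context length, so whatever it costs at 8k, it costs 16x that at 128k, on top of the weights.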
You have 8x RTX PRO just laying around?
In them basement
Of course, the basement, and upstairs he sells cat and dog food, the usual.
Don't forget them baguettes
Nah, either you sell cat and dog food or baguettes.
I actually do run huge models on CPU/RAM-only machines, and yes, they are slow, but for background tasks or if you're fine with the speed (around 10-12 t/s on gpt-oss:120b, a mid-range model, and 2-4 t/s on large models like GLM/Deepseek/Minimax/Kimi etc.) those setups do work.
@Neoon have you tried unsloth ui/studio? Or are you still on llama.cpp (which is probably the best idea)? I have no time atm to switch from my old ollama setup to llama.cpp. Do a perfect bash script to install and perfectly configure llama.cpp, put it somewhere in your repos, and I might try it out and get you access via VPN... (which really depends on what you want to do with that). It's a 768GB machine, so e.g. the 4-bit quant should fit.
Unsloth studio is cancer on Windows, won't even start, so I didn't bother.
VPN uuuh, hot, I am into VPNs.
For RÄM only, use ik_llama.cpp instead if you wanna use only a specific model.
Otherwise llama.cpp, everything else is just a wrapper and/or trash.
There already is: https://pastebin.com/raw/gKYBcXqc
You just mod it a little bit: https://pastebin.com/raw/s7bgVsyH
Anyone got a spare NVIDIA with 300GB nvram? So I can eat some bread and wash my clothes.
Damn bro, sorry to hear that. See how these poor guys cope with hunger, they press mosquitos into burgers and fry them:
Bro, did NVIDIA even release a 300GB GPU? Ragebait I sense.
At least if you shitpost, shitpost with confidence.
I am quite intrigued. What is the power consumption cost like, for example running 10 hours at 100% on all cores of a dual-CPU machine vs 1 hour of multi-GPU time (let's say 100 t/s, a generous estimate)?
Is this more efficient by far if you have stuff you do not care about batching overnight? Or simply leftover hardware?
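The energy side of that question can be sketched with simple arithmetic. The token rates come from this thread (around 10 t/s on CPU, a generous 100 t/s on GPUs); the wattages are assumed placeholders, not measurements, so plug in your own wall-meter readings.

```python
def energy_kwh(avg_watts: float, hours: float) -> float:
    """Energy used at a constant average power draw."""
    return avg_watts * hours / 1000.0

def tokens_generated(tokens_per_second: float, hours: float) -> float:
    """Total tokens produced at a steady generation rate."""
    return tokens_per_second * hours * 3600

def wh_per_1k_tokens(avg_watts: float, tokens_per_second: float) -> float:
    """Watt-hours per 1000 tokens (W / (t/s) gives joules per token)."""
    joules_per_token = avg_watts / tokens_per_second
    return joules_per_token * 1000 / 3600

# Assumed power draws (hypothetical, for illustration only).
CPU_WATTS = 600    # dual-CPU box at 100% on all cores
GPU_WATTS = 3000   # multi-GPU rig under load

for name, watts, tps, hours in (("CPU", CPU_WATTS, 10, 10),
                                ("GPU", GPU_WATTS, 100, 1)):
    print(name,
          f"{tokens_generated(tps, hours):.0f} tokens,",
          f"{energy_kwh(watts, hours):.1f} kWh,",
          f"{wh_per_1k_tokens(watts, tps):.2f} Wh per 1k tokens")
```

At 10 t/s for 10 hours and 100 t/s for 1 hour both setups produce the same number of tokens, so the comparison comes down to the wattage you actually measure.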
It's actually idle time on hardware that is needed once a week, where using hourly compute on e.g. a hyperscaler would be more expensive (for running it 4 or 5 times a month) than the server for a month.
So it has a few days idling per week that I use for self-hosted LLM experiments, testing, etc.
Otherwise any LLM subscription by the big AI names would be cheaper (at least with the number of tokens I produce).
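The rent-by-the-hour vs. monthly-server trade-off above is a break-even calculation. A minimal sketch follows; all prices and run lengths are made-up placeholders, not quotes from any provider.

```python
def on_demand_cost(hourly_rate: float, hours_per_run: float,
                   runs_per_month: int) -> float:
    """Monthly bill when renting compute only for the runs themselves."""
    return hourly_rate * hours_per_run * runs_per_month

def break_even_runs(monthly_server_cost: float, hourly_rate: float,
                    hours_per_run: float) -> float:
    """Runs per month above which the flat-rate server is cheaper."""
    return monthly_server_cost / (hourly_rate * hours_per_run)

def cheaper_option(monthly_server_cost: float, hourly_rate: float,
                   hours_per_run: float, runs_per_month: int) -> str:
    """Return 'dedicated' or 'on-demand', whichever costs less per month."""
    rented = on_demand_cost(hourly_rate, hours_per_run, runs_per_month)
    return "dedicated" if monthly_server_cost < rented else "on-demand"

# Hypothetical numbers (assumptions for illustration only).
SERVER_PER_MONTH = 300.0   # flat monthly price of the big-RAM server
HOURLY_RATE = 10.0         # comparable on-demand instance, per hour
HOURS_PER_RUN = 8.0        # length of the weekly job

print(break_even_runs(SERVER_PER_MONTH, HOURLY_RATE, HOURS_PER_RUN))
print(cheaper_option(SERVER_PER_MONTH, HOURLY_RATE, HOURS_PER_RUN, 5))
```

Once the weekly job already pays for the box, the idle days in between are effectively free compute for LLM experiments.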
Just a joke, I don't know a thing about NVIDIA hardware.
Welcome to the club.