New on LowEndTalk? Please Register and read our Community Rules.
All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.
All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.

Comments
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.
Not smart enough to output it properly like it should!
Reguards

I think I have to uninstall Proxmox.
After removing Proxmox, GTP OSS 120b is actually usuable on the KS-LE-B 64GB.
Got GLM 4.5 Air also working.
https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B
Can you paste the command how you run this? I get like 5 t/s on my LE-B
DDR3 or DDR4? what CPU?
DDR4 + E3-1270 v6
odd, did you compile it? what model are you using/quant?
Vibe Coding also works on the KS-LE-B.
It took 10 minutes, to build a simple landing page.
It took another 20 minutes to edit the file and add a dark mode.
That was 10.5k tokens, the initial opencoder prompt is 16k tokens.
I had to disable the initial prompt and use a custom one.
I've been trying the uncensored version of Qwen-3.5 9B as well as Ministral-3 8B (Q4_K_M quantization) with the llama.cpp Docker image on my KS-5 (Xeon-E3 1270 v6, 32 GB DDR4 RAM @ 2400 MHz).
They respectively use 11.6 GB and 8.1 GB RAM so I have plenty to spare even with other services running on the server, namely Seafile and Immich in their own podman containers.
Output token generation is 4 to 5 t/s, which is okay for Ministral but Qwen spends so much time thinking in loops when after following their recommended parameters (temperature, min and max P, repetition and presence penalties) that it takes 10 mins to reply to "What's up, man?".
Unfortunately at the 2300 to 2800 output tokens mark, llama-server abruptly stops either model with an "Error in the input stream" message. Nothing shows up in the logs, I'll have to investigate later.
People modded the offical qwen files, to be less thinking etc.
Check out https://www.reddit.com/r/LocalLLaMA/
Also I suggest you be using llama.cpp, everything else is overhead.
Depending on model, up to 25t/s is possible on a Xeon with DDR4.
llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.
Disgusting, bare metal, nothing else.
Also self compiled.
Haha true, in this case with the amount of computations going on bare metal is always going to be more effective. I'll see how it compares to podman now that I've run these little tests.
Problem is, too many models, 1.3TB used already.
Gem Gemma Gemma 4 just dropped
https://huggingface.co/collections/unsloth/gemma-4
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
128B fucking dense, it takes minutes for a response.
What did you ask that its shittin pants?
Only llama 7b model can run on a kimsufi 64gb but its slow like 2 or 5 second delay better use ovh ai end points at 20 dollar budget you can use a lot on ovh ai end point
Ligma
show a clip of this thing in action
Someone fucked up, they have to rebuild the model and waiting for patches.
My screen capture software is broken, idk why, idk.
edit: It wasn't done generating, it actually took, 3 minutes and 50s for a Hi.
WE ARE COOKED.
It was actually reading from the NVMe with 2GB/sec.
Gotta get a smoler model.
Running fully in memory now.
I had to reduce the context size from like 40k to 12k, still free memory though, could still increase it though.
For some reason the screen capture software works again.
Hope you happy @allthemtings
so not usable irl
Id say a batch job
I ts posible