LLM (deepseek?) on KimSufi server

Neoon · January 22

GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

tfgp99 · January 22

@Neoon said:
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

Not smart enough to output it properly like it should!

Reguards

Neoon · January 24

I think I have to uninstall Proxmox.

Neoon · January 24

After removing Proxmox, GTP OSS 120b is actually usuable on the KS-LE-B 64GB.
Got GLM 4.5 Air also working.

Neoon · January 29

https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B

brauni · January 29

@Neoon said:
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

Can you paste the command how you run this? I get like 5 t/s on my LE-B

Neoon · January 29

@brauni said:

@Neoon said:
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

Can you paste the command how you run this? I get like 5 t/s on my LE-B

DDR3 or DDR4? what CPU?

brauni · January 29

@Neoon said:

@brauni said:

@Neoon said:
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

Can you paste the command how you run this? I get like 5 t/s on my LE-B

DDR3 or DDR4? what CPU?

DDR4 + E3-1270 v6

Neoon · January 29

@brauni said:

@Neoon said:

@brauni said:

@Neoon said:
GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

Can you paste the command how you run this? I get like 5 t/s on my LE-B

DDR3 or DDR4? what CPU?

DDR4 + E3-1270 v6

odd, did you compile it? what model are you using/quant?

Neoon · January 29

Vibe Coding also works on the KS-LE-B.
It took 10 minutes, to build a simple landing page.
It took another 20 minutes to edit the file and add a dark mode.

That was 10.5k tokens, the initial opencoder prompt is 16k tokens.
I had to disable the initial prompt and use a custom one.

miniopt · March 26

I've been trying the uncensored version of Qwen-3.5 9B as well as Ministral-3 8B (Q4_K_M quantization) with the llama.cpp Docker image on my KS-5 (Xeon-E3 1270 v6, 32 GB DDR4 RAM @ 2400 MHz).

They respectively use 11.6 GB and 8.1 GB RAM so I have plenty to spare even with other services running on the server, namely Seafile and Immich in their own podman containers.

Output token generation is 4 to 5 t/s, which is okay for Ministral but Qwen spends so much time thinking in loops when after following their recommended parameters (temperature, min and max P, repetition and presence penalties) that it takes 10 mins to reply to "What's up, man?".

Unfortunately at the 2300 to 2800 output tokens mark, llama-server abruptly stops either model with an "Error in the input stream" message. Nothing shows up in the logs, I'll have to investigate later.

Neoon · March 26

People modded the offical qwen files, to be less thinking etc.
Check out https://www.reddit.com/r/LocalLLaMA/

Also I suggest you be using llama.cpp, everything else is overhead.
Depending on model, up to 25t/s is possible on a Xeon with DDR4.

miniopt · March 26

@Neoon said:
People modded the offical qwen files, to be less thinking etc.
Check out https://www.reddit.com/r/LocalLLaMA/

Also I suggest you be using llama.cpp, everything else is overhead.
Depending on model, up to 25t/s is possible on a Xeon with DDR4.

llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

Neoon · March 26

@miniopt said:

@Neoon said:
People modded the offical qwen files, to be less thinking etc.
Check out https://www.reddit.com/r/LocalLLaMA/

Also I suggest you be using llama.cpp, everything else is overhead.
Depending on model, up to 25t/s is possible on a Xeon with DDR4.

llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

Disgusting, bare metal, nothing else.
Also self compiled.

miniopt · March 26

@Neoon said:

@miniopt said:

@Neoon said:
People modded the offical qwen files, to be less thinking etc.
Check out https://www.reddit.com/r/LocalLLaMA/

Also I suggest you be using llama.cpp, everything else is overhead.
Depending on model, up to 25t/s is possible on a Xeon with DDR4.

llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

Disgusting, bare metal, nothing else.
Also self compiled.

Haha true, in this case with the amount of computations going on bare metal is always going to be more effective. I'll see how it compares to podman now that I've run these little tests.

Neoon · March 26

Problem is, too many models, 1.3TB used already.

Neoon · April 2

Gem Gemma Gemma 4 just dropped
https://huggingface.co/collections/unsloth/gemma-4

Neoon · April 29

The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

plumberg · April 30

@Neoon said:
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

What did you ask that its shittin pants?

diwakerd · April 30

Only llama 7b model can run on a kimsufi 64gb but its slow like 2 or 5 second delay better use ovh ai end points at 20 dollar budget you can use a lot on ovh ai end point

Neoon · April 30

@plumberg said:

@Neoon said:
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

What did you ask that its shittin pants?

Ligma

allthemtings · April 30

@Neoon said:
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

show a clip of this thing in action

Neoon · April 30

@allthemtings said:

@Neoon said:
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

show a clip of this thing in action

Someone fucked up, they have to rebuild the model and waiting for patches.

Neoon · May 3

@allthemtings said:

@Neoon said:
The Final Boss for the KS-LE-B has spawned in.
https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

128B fucking dense, it takes minutes for a response.

show a clip of this thing in action

My screen capture software is broken, idk why, idk.

edit: It wasn't done generating, it actually took, 3 minutes and 50s for a Hi.
WE ARE COOKED.

Neoon · May 3

It was actually reading from the NVMe with 2GB/sec.
Gotta get a smoler model.

Neoon · May 3

Running fully in memory now.
I had to reduce the context size from like 40k to 12k, still free memory though, could still increase it though.

Neoon · May 3

For some reason the screen capture software works again.
Hope you happy @allthemtings

allthemtings · May 3

@Neoon said:
For some reason the screen capture software works again.
Hope you happy @allthemtings

jugganuts · May 3

so not usable irl

plumberg · May 3

@jugganuts said:
so not usable irl

Id say a batch job

I ts posible

Howdy, Stranger!

Categories

In this Discussion

LLM (deepseek?) on KimSufi server

Comments

Howdy, Stranger!

Quick Links

Categories

In this Discussion

LLM (deepseek?) on KimSufi server

Comments