LLM (deepseek?) on KimSufi server

Comments

  • NeoonNeoon Community Contributor, Veteran
    edited January 22

    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Thanked by 3loay ariq01 plumberg
  • @Neoon said:
    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Not smart enough to output it properly like it should! :D

    Regards

    Thanked by 2barbarza ariq01
  • NeoonNeoon Community Contributor, Veteran

    I think I have to uninstall Proxmox.

    Thanked by 3barbarza Freek miniopt
  • NeoonNeoon Community Contributor, Veteran

    After removing Proxmox, GPT-OSS 120B is actually usable on the KS-LE-B 64GB.
    Got GLM 4.5 Air also working.

  • @Neoon said:
    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Can you paste the command how you run this? I get like 5 t/s on my LE-B :'(

  • NeoonNeoon Community Contributor, Veteran

    @brauni said:

    @Neoon said:
    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Can you paste the command how you run this? I get like 5 t/s on my LE-B :'(

    DDR3 or DDR4? what CPU?

  • @Neoon said:

    @brauni said:

    @Neoon said:
    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Can you paste the command how you run this? I get like 5 t/s on my LE-B :'(

    DDR3 or DDR4? what CPU?

    DDR4 + E3-1270 v6

  • NeoonNeoon Community Contributor, Veteran
    edited January 29

    @brauni said:

    @Neoon said:

    @brauni said:

    @Neoon said:
    GLM 4.7 Flash runs great on KS-LE-B 64GB Baguette.

    Can you paste the command how you run this? I get like 5 t/s on my LE-B :'(

    DDR3 or DDR4? what CPU?

    DDR4 + E3-1270 v6

    odd, did you compile it? what model are you using/quant?

  • NeoonNeoon Community Contributor, Veteran

    Vibe Coding also works on the KS-LE-B.
    It took 10 minutes to build a simple landing page.
    It took another 20 minutes to edit the file and add a dark mode.

    That was 10.5k tokens; the initial opencoder prompt alone is 16k tokens.
    I had to disable the initial prompt and use a custom one.

  • minioptminiopt Member
    edited March 26

    I've been trying the uncensored version of Qwen-3.5 9B as well as Ministral-3 8B (Q4_K_M quantization) with the llama.cpp Docker image on my KS-5 (Xeon-E3 1270 v6, 32 GB DDR4 RAM @ 2400 MHz).

    They respectively use 11.6 GB and 8.1 GB RAM so I have plenty to spare even with other services running on the server, namely Seafile and Immich in their own podman containers.

    Output token generation is 4 to 5 t/s, which is okay for Ministral, but Qwen spends so much time thinking in loops, even after following their recommended parameters (temperature, min and max P, repetition and presence penalties), that it takes 10 minutes to reply to "What's up, man?".

    Unfortunately, at around 2300 to 2800 output tokens, llama-server abruptly stops either model with an "Error in the input stream" message. Nothing shows up in the logs; I'll have to investigate later.
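For a sense of scale, the figures in the post above can be turned into simple arithmetic. This is only a sketch: it assumes a steady decode rate and ignores prompt processing time, and the function names are made up for illustration.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds needed to emit output_tokens at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def tokens_in(minutes: float, tokens_per_second: float) -> int:
    """Number of tokens emitted in the given time at a steady decode rate."""
    return int(minutes * 60 * tokens_per_second)


if __name__ == "__main__":
    # Rates reported in the post: 4 to 5 t/s.
    for rate in (4.0, 5.0):
        print(f"{rate} t/s for 10 min: about {tokens_in(10, rate)} tokens")
    # Time to reach the token counts where the server reportedly stops.
    for n in (2300, 2800):
        print(f"{n} tokens at 4.5 t/s: about {generation_seconds(n, 4.5) / 60:.1f} min")
```

At 4 to 5 t/s, a 10 minute reply works out to roughly 2400 to 3000 tokens, which is in the same range as the reported cut-off. That may be a coincidence; the arithmetic says nothing about the cause.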

  • NeoonNeoon Community Contributor, Veteran

    People modded the official Qwen files to think less, etc.
    Check out https://www.reddit.com/r/LocalLLaMA/

    Also, I suggest using llama.cpp; everything else is overhead.
    Depending on the model, up to 25 t/s is possible on a Xeon with DDR4.
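A rough way to sanity-check figures like these: CPU token generation is usually limited by memory bandwidth, because each generated token has to read the active weights from RAM once. The sketch below gives a back-of-the-envelope ceiling, not a benchmark. The dual-channel memory layout, the 4.5 bits per weight, and both parameter counts are assumptions chosen for illustration, not values taken from the thread.

```python
def peak_bandwidth_gbs(mega_transfers: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers * 1e6 * bus_bytes * channels / 1e9


def decode_ceiling_tps(bandwidth_gbs: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token.

    active_params_b is in billions of parameters, so the product below is in GB.
    """
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


if __name__ == "__main__":
    bw = peak_bandwidth_gbs(2400)  # DDR4-2400, assumed dual channel
    print(f"peak bandwidth: {bw:.1f} GB/s")
    # Hypothetical models: a sparse one with ~3B active parameters, a dense 9B one.
    for name, active_b in (("3B active (MoE)", 3.0), ("9B dense", 9.0)):
        tps = decode_ceiling_tps(bw, active_b, bits_per_weight=4.5)
        print(f"{name}: ceiling of about {tps:.1f} t/s")
```

Real throughput lands below this ceiling, since sustained bandwidth is lower than the theoretical peak and the KV cache is read as well. It does suggest why models with few active parameters can reach the 20+ t/s range on this class of hardware while a dense 9B model stays in single digits.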

  • minioptminiopt Member
    edited March 26

    @Neoon said:
    People modded the official Qwen files to think less, etc.
    Check out https://www.reddit.com/r/LocalLLaMA/

    Also, I suggest using llama.cpp; everything else is overhead.
    Depending on the model, up to 25 t/s is possible on a Xeon with DDR4.

    llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

  • NeoonNeoon Community Contributor, Veteran

    @miniopt said:

    @Neoon said:
    People modded the official Qwen files to think less, etc.
    Check out https://www.reddit.com/r/LocalLLaMA/

    Also, I suggest using llama.cpp; everything else is overhead.
    Depending on the model, up to 25 t/s is possible on a Xeon with DDR4.

    llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

    Disgusting, bare metal, nothing else.
    Also self compiled.

  • minioptminiopt Member

    @Neoon said:

    @miniopt said:

    @Neoon said:
    People modded the official Qwen files to think less, etc.
    Check out https://www.reddit.com/r/LocalLLaMA/

    Also, I suggest using llama.cpp; everything else is overhead.
    Depending on the model, up to 25 t/s is possible on a Xeon with DDR4.

    llama-server is the official UI for llama.cpp so that's what it runs under the hood. It's included in the Docker image, you just have to pass the -s, --host and --port args to llama.cpp.

    Disgusting, bare metal, nothing else.
    Also self compiled.

    Haha true, in this case with the amount of computations going on bare metal is always going to be more effective. I'll see how it compares to podman now that I've run these little tests.

  • NeoonNeoon Community Contributor, Veteran

    Problem is, too many models, 1.3TB used already.

    Thanked by 1ariq01
  • NeoonNeoon Community Contributor, Veteran

    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

  • plumbergplumberg Veteran, Megathread Squad

    @Neoon said:
    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

    What did you ask that its shittin pants?

  • Only a Llama 7B model can run on a Kimsufi 64GB, and it's slow, with something like a 2 to 5 second delay. Better to use OVH AI Endpoints; on a 20 dollar budget you can get a lot of usage out of them.

  • NeoonNeoon Community Contributor, Veteran

    @plumberg said:

    @Neoon said:
    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

    What did you ask that its shittin pants?

    Ligma

    Thanked by 1plumberg
  • allthemtingsallthemtings Member, Megathread Squad

    @Neoon said:
    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

    show a clip of this thing in action

    Thanked by 1plumberg
  • NeoonNeoon Community Contributor, Veteran

    @allthemtings said:

    @Neoon said:
    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

    show a clip of this thing in action

    Someone fucked up; they have to rebuild the model and are waiting for patches.

    Thanked by 1allthemtings
  • NeoonNeoon Community Contributor, Veteran
    edited May 3

    @allthemtings said:

    @Neoon said:
    The Final Boss for the KS-LE-B has spawned in.
    https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF

    128B fucking dense, it takes minutes for a response.

    show a clip of this thing in action

    My screen capture software is broken, idk why, idk.

    edit: It wasn't done generating; it actually took 3 minutes and 50 seconds for a "Hi".
    WE ARE COOKED.

    Thanked by 1BasToTheMax
  • NeoonNeoon Community Contributor, Veteran

    It was actually reading from the NVMe at 2 GB/s.
    Gotta get a smoler model.

    Thanked by 1allthemtings
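To see why reading weights from disk hurts so much, here is a minimal sketch of the cost per token. The 2 GB/s figure is from the post above; the amount of data that spills out of RAM and has to be re-read for each token is a made-up number for illustration, since the thread does not state the model file size.

```python
def seconds_per_token(streamed_gb: float, disk_gbs: float = 2.0) -> float:
    """Lower bound on time per token when streamed_gb must be read from disk each token."""
    if disk_gbs <= 0:
        raise ValueError("disk_gbs must be positive")
    return streamed_gb / disk_gbs


if __name__ == "__main__":
    # Hypothetical: 16 GB of a dense model's weights do not fit in RAM.
    spill_gb = 16.0
    per_token = seconds_per_token(spill_gb)
    print(f"at least {per_token:.0f} s per token")
    print(f"10 tokens: at least {10 * per_token / 60:.1f} min")
```

With a dense model every weight is touched for every token, so any part that does not stay in memory is re-read from disk each time. Even a short reply then takes minutes.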
  • NeoonNeoon Community Contributor, Veteran

    Running fully in memory now.
    I had to reduce the context size from about 40k to 12k. There is still free memory, so I could increase it again.

    Thanked by 1allthemtings
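Reducing the context helps because the KV cache grows linearly with context length. The sketch below uses the usual size estimate for standard attention with grouped KV heads. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real dimensions of the model in this thread, and the cache is assumed to be stored in 16-bit values.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


if __name__ == "__main__":
    # Hypothetical model dimensions, for illustration only.
    dims = dict(n_layers=88, n_kv_heads=8, head_dim=128)
    for n_ctx in (40_000, 12_000):
        gib = kv_cache_bytes(n_ctx, **dims) / 2**30
        print(f"context {n_ctx}: about {gib:.1f} GiB of KV cache")
```

Whatever the real dimensions are, going from 40k to 12k context cuts the cache to 30% of its former size, which can be enough to let the weights stay fully in RAM.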
  • NeoonNeoon Community Contributor, Veteran

    For some reason the screen capture software works again.
    Hope you happy @allthemtings

  • allthemtingsallthemtings Member, Megathread Squad

    @Neoon said:
    For some reason the screen capture software works again.
    Hope you happy @allthemtings

  • jugganutsjugganuts Member

    so not usable irl :(

  • plumbergplumberg Veteran, Megathread Squad

    @jugganuts said:
    so not usable irl :(

    I'd say a batch job.

    It's possible.
