MSFS on Proxmox with GPU Passthrough (DXGI HANG) - $100 Bounty

Comments

  • One thing you need is either a connected monitor or a dummy HDMI plug that simulates a connected monitor if you want to play remotely (Sunshine/Moonlight streaming).

    Currently, I'm using GLKVM for that purpose. And in case something really catastrophic happens, I still have access to my BIOS.

  • Now, next step is Fortnite on VM :)

    Thanked by 1jason5545
  • @jason5545 said:

    One thing you need is either a connected monitor or a dummy HDMI plug that simulates a connected monitor if you want to play remotely (Sunshine/Moonlight streaming).

    Currently, I'm using GLKVM for that purpose. And in case something really catastrophic happens, I still have access to my BIOS.

    GL.iNet Comet KVM?

  • @xemaps said:

    @jason5545 said:

    One thing you need is either a connected monitor or a dummy HDMI plug that simulates a connected monitor if you want to play remotely (Sunshine/Moonlight streaming).

    Currently, I'm using GLKVM for that purpose. And in case something really catastrophic happens, I still have access to my BIOS.

    GL.iNet Comet KVM?

    Yes.

    Thanked by 1xemaps
  • @EagleCorals said:
    Now, next step is Fortnite on VM :)

    I mainly play simulator games, so EAC isn't an issue for me. Someone on Proxmox tried it two or three years ago. I would happily test it if it doesn't require tinkering with my host environment. Will report back.

  • Sharing some more tweaks:

    Technical Analysis Report: Resolving Hardware Sensor Detection Failure on Linux

    Date: August 7, 2025
    Subject: Resolving a hardware sensor detection failure on an ASUS ROG STRIX B650E-I GAMING WIFI motherboard running under a Linux environment.
    System Context: Proxmox VE (Debian 13 "Trixie" base) with Kernel 6.14.8-2-pve.


    Executive Summary

    This report documents the diagnosis and resolution of an issue where the Linux operating system was unable to detect and report fan speeds and other critical sensor data on a modern server platform. The hardware consists of an ASUS ROG STRIX B650E-I motherboard and an AMD Ryzen 9 9950X3D CPU. The initial investigation revealed that the standard hardware monitoring suite, lm-sensors, failed to identify the motherboard's Super I/O (S.I.O.) controller chip.

    The root cause was traced to an unknown chip ID, 0xd802, reported by the sensors-detect utility. The working hypothesis was that this new chip was compatible with the existing Nuvoton nct6775 family driver, but its ID was not yet included in the driver's official support list.

    The problem was successfully resolved by forcing the nct6775 kernel module to recognize this specific ID. This action immediately enabled full monitoring capabilities, including fan speeds, temperatures, and voltages. The solution was then made permanent by creating a modprobe configuration file and updating the initramfs, ensuring persistence across system reboots.


    1. Problem Description

    • Hardware Platform:

      • Motherboard: ASUSTeK COMPUTER INC. ROG STRIX B650E-I GAMING WIFI
      • CPU: AMD Ryzen 9 9950X3D 16-Core Processor
    • Software Environment:

      • Operating System: Debian GNU/Linux 13 (trixie) / Proxmox VE
      • Kernel: 6.14.8-2-pve
    • Initial Symptoms:

      • When using the standard sensors command, fan speeds were either missing or reported as N/A.
      • This lack of visibility prevented administrators from monitoring the thermal state of the chassis and CPU, posing a potential risk to server stability and hardware longevity under load.

    2. Diagnostic Process and Analysis

    2.1. Initial Diagnostic Tool (lm-sensors)

    The standard diagnostic approach for such issues involves using the lm-sensors package. The sensors-detect script was executed with superuser privileges to scan the system for all available hardware monitoring chips.

    2.2. Analysis of sensors-detect Output

    The output from sensors-detect provided the definitive clue:

    # Board: ASUSTeK COMPUTER INC. ROG STRIX B650E-I GAMING WIFI
    ...
    Probing for Super-I/O at 0x2e/0x2f
    Trying family `VIA/Winbond/Nuvoton/Fintek'...               Yes
    Found unknown chip with ID 0xd802
        (logical device B has address 0x290, could be sensors)
    ...
    Sorry, no sensors were detected.
    
    • Key Finding 1: Unknown Chip Identified
      The log clearly states that during the Super I/O probe, it found a chip with the ID 0xd802. The Super I/O chip is a critical component responsible for low-bandwidth devices and, most importantly, the hardware monitoring sensors (fans, temperatures, voltages).

    • Key Finding 2: Detection Failure
      Because the ID 0xd802 was not present in the sensors-detect database, it could not match it to a known kernel driver. This resulted in a complete failure to configure any sensors, leading to the final message: Sorry, no sensors were detected.
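A minimal sketch of pulling the unknown chip ID out of a sensors-detect log, assuming the log contains a "Found unknown chip with ID ..." line like the one quoted above. The function name extract_unknown_chip_ids is made up for illustration, and the --auto flag in the comment is an assumption about the installed lm-sensors version.

```shell
#!/bin/sh
# Print every unknown Super I/O chip ID found in sensors-detect output on stdin.
extract_unknown_chip_ids() {
  sed -n 's/^.*Found unknown chip with ID \(0x[0-9a-fA-F]*\).*$/\1/p'
}

# Demo with the lines quoted in the report. On a real host the input would
# come from something like: sudo sensors-detect --auto
printf '%s\n' \
  "Probing for Super-I/O at 0x2e/0x2f" \
  "Found unknown chip with ID 0xd802" \
  "Sorry, no sensors were detected." | extract_unknown_chip_ids
```

The printed ID is the value that later goes into the force_id module parameter.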

    2.3. Hypothesis Formulation

    Based on the evidence, the following hypothesis was formed:

    1. New Hardware Support Lag: The ROG STRIX B650E-I is a new motherboard. It is common for its components, like the S.I.O. chip, to be newer than the versions of the kernel drivers distributed with the OS. The nct6775 driver likely supports the chip's functionality, but does not yet officially list its ID.
    2. Chip Family Compatibility: Hardware vendors like Nuvoton often maintain architectural compatibility across chip generations. It was highly probable that the new 0xd802 chip was a variant compatible with the well-established nct6775 driver family (which covers a range of models like the nct6779, nct6798, etc.).
    3. Solution Path: The most direct solution path would be to manually instruct the nct6775 kernel module to "force" ownership and initialization of this unknown chip ID, bypassing the need to wait for an official kernel patch.

    3. Solution and Implementation

    3.1. Immediate Mitigation: Forcing Module Loading

    To test the hypothesis, the nct6775 kernel module was loaded with the force_id parameter:

    sudo modprobe nct6775 force_id=0xd802
    

    3.2. Verification of Results

    Immediately after executing the command, the sensors command was run again. The result was a success:

    nct6799-isa-0290
    Adapter: ISA adapter
    ...
    fan1:                         1719 RPM  (min =    0 RPM)
    fan2:                         1425 RPM  (min =    0 RPM)
    fan7:                            0 RPM  (min =    0 RPM)
    SYSTIN:                        +44.0°C ...
    CPUTIN:                        +48.0°C ...
    ...
    
    • Verification Success: The output now included a new block for an nct6799-isa-0290 device. This confirmed that, with the ID forced to 0xd802, the nct6775 driver treated the chip as an NCT6799-class part and initialized it.
    • Objective Achieved: The fan speeds for fan1 and fan2 were now clearly visible, resolving the core problem. As a benefit, all other motherboard sensors (temperatures and voltages) became available as well.
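To make the verification repeatable, here is a rough sketch that reads sensors output and lists each fan with a simple status. The function name report_fans and the status wording are invented for this example; it assumes fan lines look like the ones in the output above.

```shell
#!/bin/sh
# Read `sensors` output on stdin and print "<label> <rpm> <status>" per fan line.
report_fans() {
  awk '$1 ~ /^fan[0-9]+:$/ {
    label = $1
    sub(/:$/, "", label)          # drop the trailing colon from "fan1:"
    rpm = $2 + 0                  # force numeric interpretation
    status = (rpm > 0) ? "spinning" : "stopped-or-unconnected"
    print label, rpm, status
  }'
}

# Demo with lines shaped like the output above; on the host: sensors | report_fans
printf '%s\n' \
  "fan1:   1719 RPM  (min =    0 RPM)" \
  "fan7:      0 RPM  (min =    0 RPM)" | report_fans
```

A fan reported at 0 RPM may simply be an unpopulated header, so the status is only a hint.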

    3.3. Permanent Configuration

    To ensure the solution persists after a reboot, the module option was made permanent.

    1. Create Module Configuration File: A configuration file was created to automate the application of the force_id parameter on boot:

      echo 'options nct6775 force_id=0xd802' | sudo tee /etc/modprobe.d/nct6775.conf
      

      This file instructs the system to always use the force_id=0xd802 option whenever the nct6775 module is loaded.

    2. Update Initial RAM Disk (initramfs): To ensure the kernel loads the module with the correct options during the early boot process, the initramfs was rebuilt:

      sudo update-initramfs -u
      

    4. Conclusion

    The root cause of this issue was a predictable support lag between the release of new motherboard hardware and its inclusion in official Linux kernel drivers. By following a systematic diagnostic process, the key identifier of the unsupported hardware (0xd802) was located from the sensors-detect logs.

    The final solution, implemented by forcing an existing, compatible driver to claim the new device, was both effective and immediate. This case serves as an excellent example of the power and flexibility of the Linux ecosystem, which often provides the tools necessary to manage hardware interoperability challenges without waiting for official vendor or kernel updates. The server's full hardware monitoring capabilities have been successfully restored and made permanent.

    Thanked by 1devjorge
  • It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
    Setup-wise I have a feeling it would be easier, based on what I tried.
    I am currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

  • @jason5545 what a boss. Super impressed by your digging. Thanks for reporting back to help out any future seeker. This to me is what the internet is for. An LLM can't help like this.

    Thanked by 1jason5545
  • @EagleCorals said:
    It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
    Setup-wise I have a feeling it would be easier, based on what I tried.
    I am currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

    I had only heard of Proxmox and XCP-ng in the open-source territory; it's clear I have more to learn.

  • EagleCorals Member
    edited August 2025

    @jason5545 said:

    @EagleCorals said:
    It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
    Setup-wise I have a feeling it would be easier, based on what I tried.
    I am currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

    I had only heard of Proxmox and XCP-ng in the open-source territory; it's clear I have more to learn.

    I did it some weeks ago, just to test it.
    in /boot/loader.conf
    vmm_load="YES"
    pptdevs="0/21/0"

    then in the bhyve vm config just put the passthru
    passthru0="0/21/0"

    worked as expected.
    now I am using it for my homeassistant vm under bhyve.

    but it would be interesting to see how it compares with your setup and game.

    https://wiki.freebsd.org/bhyve/pci_passthru
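For anyone wondering where a value like "0/21/0" comes from: a small sketch, assuming the device is looked up with pciconf and that its selector has the form pciDOMAIN:BUS:SLOT:FUNCTION. The helper name selector_to_ppt and the xhci0 sample device are made up for illustration.

```shell
#!/bin/sh
# Convert a pciconf selector such as "pci0:0:21:0" (domain:bus:slot:function)
# into the "bus/slot/function" form used by pptdevs and passthru.
selector_to_ppt() {
  echo "$1" | sed -n 's|^.*pci[0-9][0-9]*:\([0-9][0-9]*\):\([0-9][0-9]*\):\([0-9][0-9]*\).*$|\1/\2/\3|p'
}

# Demo with a hypothetical line prefix as printed by `pciconf -l`.
selector_to_ppt "xhci0@pci0:0:21:0:"
```

The result can then be pasted into pptdevs in /boot/loader.conf and into the passthru line of the VM config.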

    ps: prepare to go down the rabbit hole, freebsd is not linux. 😊😊

    the nice thing is you can have a fully working workstation for day-to-day use while running VMs with bhyve.
    not as with Proxmox, where you basically can't use the machine for anything else.

    Thanked by 1jason5545
  • jason5545 Member
    edited August 2025

    @angstrom request to move to General, Thanks

  • very cool you got it running!

    no sound crackles? that was my biggest problem but some years ago.

    you said i can PM you? save your dollars and get an addon and ENJOY!

    Thanked by 2jason5545 Nick
  • Added some NUT configuration notes:

    Technical Report: Implementing a Configurable Server Shutdown Delay for NUT Monitoring Systems

    Date: August 08, 2025
    Subject: Optimizing the shutdown strategy for the LAB-22BR70G server, which is protected by an APC Back-UPS RS 1500MS and monitored by Network UPS Tools (NUT).

    --

    1. Executive Summary

    This report addresses the lack of proactive control in the default shutdown strategy of Network UPS Tools (NUT). The standard configuration is reactive, typically initiating a server shutdown only when the UPS battery reaches a critically low level. This leaves the timing of the shutdown to the battery's state of charge rather than to the administrator, and it drains the battery close to empty during every long outage. This document presents a complete solution: an automated management script, nut-delay-manager.sh. This script lets system administrators view and configure a fixed "shutdown delay," so the server waits for a user-defined period (e.g., 5 minutes) after a power failure before initiating its shutdown sequence. Outages shorter than the delay are ridden out on battery, and longer ones end in a shutdown at a predictable time.

    2. Background and Problem Statement

    The server LAB-22BR70G is protected against power loss by an APC Back-UPS RS 1500MS Uninterruptible Power Supply (UPS). The server utilizes the Network UPS Tools (NUT) software suite to monitor the UPS status and automate a graceful shutdown when necessary.

    A review of the current configuration file, upsmon.conf, reveals the use of an immediate shutdown command: SHUTDOWNCMD "/sbin/shutdown -h +0". The trigger for this command relies on NUT's default logic, which is to wait for the UPS to signal a LOWBATT (low battery) condition.

    This default strategy presents the following risks:

    • Unpredictable Timing and Deep Discharge: Because shutdown starts only at LOWBATT, the moment it happens depends on load and battery condition, and every long outage drains the battery almost completely before the server reacts.
    • Lack of Flexibility: The "shutdown on low battery" approach is rigid and does not allow administrators to define a grace period based on external factors, such as the expected time for power restoration or for a backup generator to start.

    A more flexible and resilient shutdown policy is required, replacing the "low battery trigger" with a "fixed time delay."

    3. Proposed Solution

    To address the stated problem, this report proposes the implementation of a timer-based shutdown delay policy utilizing NUT's built-in scheduling tool, upssched. The logic of this new policy is as follows:

    1. Power Loss Detection (ONBATT): When upsmon detects that the UPS has switched to battery power, it will instruct upssched to start a countdown timer for a predefined duration (e.g., 300 seconds).
    2. Power Restoration Detection (ONLINE): If utility power is restored before the timer expires, upssched will cancel the shutdown timer. The server will continue to operate normally with no interruption.
    3. Timer Expiration: If the timer completes its countdown and utility power has not been restored, upssched will execute a command script. This script invokes upsmon -c fsd (Force Shutdown), compelling the server to begin a graceful shutdown immediately.
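The three rules above can be sketched as a tiny simulation. This is only an illustration of the intended logic, not how upssched is implemented; the function name simulate_outage and the NAME@SECONDS event notation are invented here, and a second ONBATT while a timer is already pending is assumed to leave that timer untouched.

```shell
#!/bin/sh
# simulate_outage DELAY_SEC EVENT...
# Each EVENT is NAME@SECONDS in chronological order, e.g. ONBATT@0 ONLINE@120.
# Prints "shutdown@<t>" when the timer would expire, otherwise "no-shutdown".
simulate_outage() {
  delay=$1
  shift
  deadline=""
  for ev in "$@"; do
    name=${ev%@*}
    t=${ev#*@}
    # A pending timer that ran out before this event has already fired.
    if [ -n "$deadline" ] && [ "$t" -ge "$deadline" ]; then
      echo "shutdown@$deadline"
      return 0
    fi
    case $name in
      ONBATT) if [ -z "$deadline" ]; then deadline=$((t + delay)); fi ;;
      ONLINE) deadline="" ;;
    esac
  done
  if [ -n "$deadline" ]; then
    echo "shutdown@$deadline"
  else
    echo "no-shutdown"
  fi
}

simulate_outage 300 ONBATT@0 ONLINE@120   # power back after 2 minutes
simulate_outage 300 ONBATT@0              # power never returns
```

With a 300-second delay, the first call prints no-shutdown and the second prints shutdown@300.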

    To simplify the deployment and ongoing management of this configuration, a Bash script named nut-delay-manager.sh has been developed. This script automates all necessary configuration file modifications and provides a simple command-line interface for administrators to view and set the shutdown delay time.

    4. Implementation Details

    4.1. Management Script: nut-delay-manager.sh

    This script is the core of the solution, encapsulating all configuration logic.

    #!/bin/bash
    
    # ==============================================================================
    # NUT (Network UPS Tools) Server Shutdown Delay Manager
    # Function: View and set the delay time from power loss until server shutdown
    #           is initiated.
    # Version:  1.1
    # ==============================================================================
    
    # --- Configuration File Paths ---
    UPSSCHED_CONF="/etc/nut/upssched.conf"
    CMDSCRIPT_PATH="/usr/local/sbin/nut-shutdown-trigger.sh"
    UPSMON_CONF="/etc/nut/upsmon.conf"
    TIMER_NAME="shutdown-delay"
    
    # --- Color Codes ---
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    NC='\033[0m' # No Color
    
    # --- Check for root privileges ---
    if [ "$(id -u)" -ne 0 ]; then
      echo -e "${RED}Error: This script must be run with root privileges. Please use sudo.${NC}"
      echo "Usage: sudo ./nut-delay-manager.sh [view|set <minutes>]"
      exit 1
    fi
    
    # --- Function: Display usage ---
    usage() {
      echo "NUT Shutdown Delay Management Tool"
      echo "----------------------------------"
      echo "Usage: sudo $0 [view|set <minutes>]"
      echo
      echo "Commands:"
      echo "  view          View the current server shutdown delay."
      echo "  set <minutes> Set a new delay time in minutes."
      echo
      echo "Examples:"
      echo "  sudo $0 view"
      echo "  sudo $0 set 5"
      exit 1
    }
    
    # --- Function: View current setting ---
    view_setting() {
      echo "Checking current NUT server shutdown delay setting..."
      if [ ! -f "$UPSSCHED_CONF" ]; then
        echo -e "${YELLOW}Warning: upssched config file not found (${UPSSCHED_CONF}).${NC}"
        echo "The system might not be configured for delayed shutdown, or it uses a non-standard configuration."
        return
      fi
    
      # Match only the active rule line, not comments that mention the timer
      local delay_line=$(grep "^AT .*START-TIMER ${TIMER_NAME} " "$UPSSCHED_CONF" | head -n 1)
      if [ -z "$delay_line" ]; then
        echo -e "${YELLOW}Warning: No active delay timer found in the configuration file.${NC}"
        echo "The system is likely configured to shut down on LOWBATT condition, not after a fixed delay."
        return
      fi
    
      # Fields: AT ONBATT * START-TIMER <name> <seconds>, so the delay is field 6
      local delay_sec=$(echo "$delay_line" | awk '{print $6}')
      if [[ "$delay_sec" =~ ^[0-9]+$ ]]; then
        local delay_min=$((delay_sec / 60))
        echo -e "${GREEN}Configuration Found!${NC}"
        echo "Current server shutdown delay is set to: ${delay_sec} seconds (${delay_min} minutes)."
        echo "This means the system will wait for ${delay_min} minutes on battery before initiating shutdown."
      else
        echo -e "${RED}Error: The delay value found in the config file is not a valid number.${NC}"
        echo "Config line: $delay_line"
      fi
    }
    
    # --- Function: Set new delay ---
    set_setting() {
      local delay_min=$1
      if ! [[ "$delay_min" =~ ^[1-9][0-9]*$ ]]; then
        echo -e "${RED}Error: Please provide a valid number of minutes (greater than 0).${NC}"
        usage
      fi
    
      local delay_sec=$((delay_min * 60))
    
      echo "Preparing to set the server shutdown delay to ${delay_min} minutes (${delay_sec} seconds)..."
      echo -e "${YELLOW}This will modify the following files:${NC}"
      echo "- ${CMDSCRIPT_PATH} (created or overwritten)"
      echo "- ${UPSSCHED_CONF} (created or overwritten)"
      echo "- ${UPSMON_CONF} (modified)"
      read -p "Are you sure you want to proceed? (y/n): " confirm
      if [ "$confirm" != "y" ]; then
        echo "Operation cancelled."
        exit 0
      fi
    
      # 1. Create the command script (CMDSCRIPT)
      echo "Creating command script: ${CMDSCRIPT_PATH}"
      cat << EOF > "$CMDSCRIPT_PATH"
    #!/bin/sh
    # This script is auto-generated by nut-delay-manager.sh
    # It is called by upssched when the timer expires to force a system shutdown.
    
    if [ "\$1" = "${TIMER_NAME}" ]; then
      /sbin/upsmon -c fsd
    fi
    EOF
      chmod +x "$CMDSCRIPT_PATH"
      echo " -> Done"
    
      # 2. Create/overwrite upssched.conf
      echo "Configuring upssched.conf..."
      # Back up old config file
      [ -f "$UPSSCHED_CONF" ] && cp "$UPSSCHED_CONF" "$UPSSCHED_CONF.bak.$(date +%F-%T)"
      cat << EOF > "$UPSSCHED_CONF"
    # This file is auto-generated by nut-delay-manager.sh
    CMDSCRIPT ${CMDSCRIPT_PATH}
    PIPEFN /var/run/nut/upssched.pipe
    LOCKFN /var/run/nut/upssched.lock
    
    # When power fails (ONBATT), start a shutdown timer for ${delay_sec} seconds
    AT ONBATT * START-TIMER ${TIMER_NAME} ${delay_sec}
    
    # When power returns (ONLINE), cancel the timer
    AT ONLINE * CANCEL-TIMER ${TIMER_NAME}
    EOF
      echo " -> Done"
    
      # 3. Modify upsmon.conf
      echo "Modifying upsmon.conf to enable the scheduler..."
      # Back up old config file
      cp "$UPSMON_CONF" "$UPSMON_CONF.bak.$(date +%F-%T)"
      # Ensure NOTIFYCMD points to upssched
      sed -i '/^NOTIFYCMD/d' "$UPSMON_CONF"
      echo "NOTIFYCMD /sbin/upssched" >> "$UPSMON_CONF"
      # Ensure relevant events trigger execution
      sed -i '/^NOTIFYFLAG ONBATT/d' "$UPSMON_CONF"
      sed -i '/^NOTIFYFLAG ONLINE/d' "$UPSMON_CONF"
      echo "NOTIFYFLAG ONBATT SYSLOG+EXEC" >> "$UPSMON_CONF"
      echo "NOTIFYFLAG ONLINE SYSLOG+EXEC" >> "$UPSMON_CONF"
      echo " -> Done"
    
      # 4. Restart NUT services
      echo "Restarting NUT services to apply the new configuration..."
      if systemctl restart nut-server nut-client nut-monitor &> /dev/null; then
        echo -e "${GREEN}NUT services restarted successfully.${NC}"
      else
        echo -e "${YELLOW}Warning: Failed to restart NUT services automatically. Please do it manually:${NC}"
        echo "sudo systemctl restart nut-server nut-client nut-monitor"
      fi
    
      echo -e "${GREEN}Configuration complete! Server shutdown delay is now set to ${delay_min} minutes.${NC}"
    }
    
    # --- Main script logic ---
    case "$1" in
      view)
        view_setting
        ;;
      set)
        set_setting "$2"
        ;;
      *)
        usage
        ;;
    esac
    

    5. Usage and Operational Procedures

    5.1. Installation Procedure
    1. Create the script file:

      nano nut-delay-manager.sh
      
    2. Paste the content: Copy the entire script from Section 4.1 and paste it into the nano editor.

    3. Save and exit: Press Ctrl + X, followed by Y, and then Enter.

    4. Grant execute permissions:

      chmod +x nut-delay-manager.sh
      
    5. (Recommended) Move to system path: To make the script globally accessible, move it to /usr/local/sbin.

      sudo mv nut-delay-manager.sh /usr/local/sbin/
      
    5.2. Command Reference

    Important: All commands must be executed with sudo privileges.

    • To view the current delay setting:

      sudo nut-delay-manager.sh view
      
    • To set the delay time (example: set to 5 minutes):

      sudo nut-delay-manager.sh set 5
      

      The script will prompt for confirmation. After entering y, it will automatically apply all configurations and restart the necessary NUT services.
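As an independent cross-check of what ended up in upssched.conf, a short sketch that reads the delay straight from the rule line. It assumes rules of the form "AT ONBATT * START-TIMER <name> <seconds>", as generated by the script; the function name read_timer_delay is made up for this example.

```shell
#!/bin/sh
# read_timer_delay CONF_FILE TIMER_NAME
# Print the delay in seconds of the first matching START-TIMER rule.
read_timer_delay() {
  conf=$1
  timer=$2
  awk -v timer="$timer" '
    $1 == "AT" && $4 == "START-TIMER" && $5 == timer { print $6; exit }
  ' "$conf"
}

# Typical call on the server (path and timer name as used by the script):
#   read_timer_delay /etc/nut/upssched.conf shutdown-delay
```

After `set 5`, the call in the comment should print 300 if the file was written as expected.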

    6. Conclusion and Recommendations

    This report has detailed the implementation of a robust server shutdown delay policy. Through the use of the nut-delay-manager.sh script, system administrators can move beyond NUT's default, inflexible shutdown mechanism to a precisely controlled, timer-based approach.

    Key Benefits:

    • Predictable Outage Handling: Transient power events shorter than the delay are ridden out on battery, and longer outages end in a shutdown at a known time.
    • Simplified Administration: Condenses a complex, multi-file configuration process into a single, intuitive command.
    • Traceability: The script's automatic backup feature and clear configuration comments facilitate future audits and maintenance.

    It is recommended that this script be adopted as a standard operating procedure for all critical systems protected by a UPS to ensure an enhanced and predictable response to power failures.

  • @devjorge said:
    very cool you got it running!

    no sound crackles? that was my biggest problem but some years ago.

    you said i can PM you? save your dollars and get an addon and ENJOY!

    Thanks, will use it in my Fenix A320 :D
    Because I use Apollo, the sound driver automatically switches to the Steam streaming one. No crackling issues so far.

    Thanked by 1devjorge
  • devjorge Member
    edited August 2025

    Do shut down quickly and don't deplete your UPS battery too often or too deeply...
    If it is lead-acid, it does not like that. Shut down at 12 V or earlier so you don't hurt it.

    It works great the first few times you test; you get good "runtime" when the power cuts out...
    but when you really need it, maybe in a year, the battery could be almost dead.
    Changing the battery before 2 years is advised, because once they can't hold a charge anymore they just die when the power cuts out...
    This often happens without any warning and leaves no time to shut down cleanly.

    //:add
    But a question I have: what is your UPS rated for, in VA and in real watts?
    If your UPS says 1000 VA, that's far from being 1000 real watts.
    Mostly you must divide VA by 2 to get near the real supported wattage.
    I would not run your system on a 1000 VA unit.
    At least 2000 VA to have some margin for peak spikes.

    Thanked by 1jason5545
  • jason5545 Member
    edited August 2025

    @devjorge said:
    Do shut down quickly and don't deplete your UPS battery too often or too deeply...
    If it is lead-acid, it does not like that. Shut down at 12 V or earlier so you don't hurt it.

    It works great the first few times you test; you get good "runtime" when the power cuts out...
    but when you really need it, maybe in a year, the battery could be almost dead.
    Changing the battery before 2 years is advised, because once they can't hold a charge anymore they just die when the power cuts out...
    This often happens without any warning and leaves no time to shut down cleanly.

    //:add
    But a question I have: what is your UPS rated for, in VA and in real watts?
    If your UPS says 1000 VA, that's far from being 1000 real watts.
    Mostly you must divide VA by 2 to get near the real supported wattage.
    I would not run your system on a 1000 VA unit.
    At least 2000 VA to have some margin for peak spikes.

    Here is the upsc output from my unit. I've removed the serial number, but the rest is straight from the system.

    upsc apc1500ms

    Init SSL without certificate database
    battery.charge: 100
    battery.charge.low: 10
    battery.charge.warning: 50
    battery.date: 2001/09/25
    battery.mfr.date: 2025/01/11
    battery.runtime: 2790
    battery.runtime.low: 120
    battery.type: PbAc
    battery.voltage: 27.3
    battery.voltage.nominal: 24.0
    device.mfr: American Power Conversion
    device.model: Back-UPS RS 1500MS
    device.serial: 5B25****
    device.type: ups
    driver.debug: 0
    driver.flag.allow_killpower: 0
    driver.name: usbhid-ups
    driver.parameter.pollfreq: 30
    driver.parameter.pollinterval: 2
    driver.parameter.port: auto
    driver.parameter.synchronous: auto
    driver.state: quiet
    driver.version: 2.8.1
    driver.version.data: APC HID 0.100
    driver.version.internal: 0.52
    driver.version.usb: libusb-1.0.28 (API: 0x100010a)
    input.sensitivity: high
    input.transfer.high: 144
    input.transfer.low: 88
    input.transfer.reason: input voltage out of range
    input.voltage: 110.0
    input.voltage.nominal: 120
    ups.beeper.status: disabled
    ups.delay.shutdown: 20
    ups.firmware: 966.h4 .D
    ups.firmware.aux: h4
    ups.load: 17
    ups.mfr: American Power Conversion
    ups.mfr.date: 2025/01/11
    ups.model: Back-UPS RS 1500MS
    ups.productid: 0002
    ups.realpower.nominal: 900
    ups.serial: 5B25****
    ups.status: OL
    ups.test.result: No test initiated
    ups.timer.reboot: 0
    ups.timer.shutdown: -1
    ups.vendorid: 051d

    Based on this data and the model number, the unit is a 1500VA model with a nominal real power rating of 900 Watts (ups.realpower.nominal: 900). This gives it a power factor of 0.6. My current load is sitting at only 17% (ups.load: 17), which should provide a good amount of headroom for my needs and help accommodate any peak power draws.
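The arithmetic in that paragraph, as a quick sketch. It assumes ups.load is a percentage of the nominal real power (900 W), which is how the numbers are read in this thread; the function name ups_headroom is invented for the example.

```shell
#!/bin/sh
# ups_headroom VA_RATING NOMINAL_WATTS LOAD_PERCENT
# Prints the power factor, the current load in watts, and the remaining headroom.
ups_headroom() {
  awk -v va="$1" -v w="$2" -v load="$3" 'BEGIN {
    load_w = w * load / 100
    printf "power_factor=%.2f load_watts=%.0f headroom_watts=%.0f\n", w / va, load_w, w - load_w
  }'
}

ups_headroom 1500 900 17
# prints: power_factor=0.60 load_watts=153 headroom_watts=747
```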

    Regarding the battery health, I believe my usage patterns are helping to preserve it. The concern about deep discharging lead-acid batteries is very valid. Repeatedly discharging them too deeply can cause sulfation, where lead sulfate crystals build up and reduce the battery's ability to hold a charge. This seems to be a primary cause of premature battery failure.

    The most common power issues I have are brief outages when a main breaker trips. The UPS only runs on battery for about five minutes in these cases. These short, infrequent events should count as shallow discharge cycles, which are much less damaging to the battery's long-term health than a full deep discharge.

    For any predictable, longer-term outages, such as scheduled maintenance, I make it a point to shut down the protected equipment properly beforehand to avoid draining the battery at all. I think this approach, combined with the fact that the UPS isn't overloaded, should hopefully extend the battery's life and ensure it's ready when an unexpected outage occurs.

    Thanks again for your valuable advice and for raising these important points.

  • great to see the data

    battery.voltage: 27.3
    battery.type: PbAc

    if you didn't know: you have two 12 V batteries in series, and chemically they are old-style lead-acid.
    Initiate shutdown when the voltage gets below 24.0 V (the nominal voltage).

    ups.load: 17 @ 900W = 153 Watts usage when you measured.

    Did you measure load when gaming?

    The batteries drain dramatically quicker with double or triple the load, and the voltage drops faster.

    Thanked by 1jason5545
  • Replacement-Battery-APC-Back-UPS-1500 B00PCSKZZU

    9 Ah per battery.

    In series, 24 V * 9 Ah = ~216 VAh (similar to watt-hours), but that is the DC charge before conversion to AC at the outlet.
    Subtract 20-30% for conversion losses and you have about 150 Wh remaining if you drain the battery completely.
    At 17% load you can sustain almost an hour if you are lucky, but your batteries don't like that.

    Max discharge for PbAc should never be more than 50%, so you can use at most 75 Wh without hurting your batteries.

    Another math issue that lowers the number again:
    We've calculated VAh with the nominal voltage, but the voltage at full charge is about 27 V and the minimum can be as low as 21-22 V, depending on your UPS's cut-off voltage (mostly 10.5-11 V per battery), and the batteries don't like voltage this low ;)

    Voltage drops faster with more load, and the batteries have to deliver more amps to supply the same amount of watts, which adds some % more loss.
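A back-of-the-envelope version of the estimate above, as a sketch only. The 70% conversion efficiency (the 30% end of the quoted loss range), the 50% depth-of-discharge limit, and the 153 W load are the figures from this thread; the function name battery_budget is made up, and real runtime will be lower for the voltage-sag reasons just described.

```shell
#!/bin/sh
# battery_budget VOLTS AMP_HOURS EFFICIENCY_PERCENT MAX_DOD_PERCENT LOAD_WATTS
# Prints DC energy, AC energy after losses, usable energy within the
# depth-of-discharge limit, and the runtime that usable energy buys.
battery_budget() {
  awk -v v="$1" -v ah="$2" -v eff="$3" -v dod="$4" -v load="$5" 'BEGIN {
    dc_wh = v * ah
    ac_wh = dc_wh * eff / 100
    usable_wh = ac_wh * dod / 100
    printf "dc_wh=%.0f ac_wh=%.0f usable_wh=%.0f runtime_min=%.0f\n", dc_wh, ac_wh, usable_wh, usable_wh / load * 60
  }'
}

battery_budget 24 9 70 50 153
# prints: dc_wh=216 ac_wh=151 usable_wh=76 runtime_min=30
```

So at the measured 153 W, staying within a 50% discharge works out to roughly half an hour, in line with the "about 75 Wh" figure.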

    Thanked by 1jason5545
  • @jason5545 said:

    @jason5545 said:

    @Falzo said:
    @jason5545 thanks for reporting back in detail, lots of people don't do that when they find a solution.

    Could you add info about your final GRUB settings, vfio, and blacklist/modprobe config as well?

    Had my own run-ins with passthru and there are lots of rather useless suggestions floating around, which makes it so hard to cut through and find the real solutions.

    Will do it after I restore all my flightsim addons. :wink:


    The Ultimate Guide to Proxmox VE Gaming VM Optimization and Troubleshooting

    Date: August 2, 2025
    Hardware Platform: AMD Ryzen 9 9950X3D (Host), NVIDIA RTX 4070 Ti Super (VM Passthrough), AMD iGPU (Used by LXC)
    Project Goal: To resolve boot stability issues, performance bottlenecks, and random crashes in a Windows 11 gaming VM built on high-end hardware, achieving a near-bare-metal gaming experience.
    Final Outcome: Achieved stable VM startup and shutdown, and resolved the pci_irq_handler crash error. In-game GPU utilization was boosted from an initial 30% to a stable 97%. The system bottleneck was successfully shifted from the CPU to the GPU, with both performance and system stability meeting expectations.


    1. Root Cause Diagnosis: A Dual Dilemma of Stability and Performance

    This project faced two core, interconnected challenges that required a systematic solution.

    1. The Stability Challenge: Boot Failures and Random Crashes

      • Initial Fault: After a qm stop command or a crash, the VM would fail to start, reporting a fatal error: kvm: ../hw/pci/pci.c:1654: pci_irq_handler: Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
      • Root Cause Analysis: This error indicates that KVM received an invalid Interrupt Request (IRQ) from the passthrough GPU. This is a classic "GPU Reset Bug" in VFIO environments. When the VM shuts down or crashes, the passthrough GPU is not correctly reset, leaving it in an unstable state and causing its next initialization attempt to fail. Further analysis revealed that unstable GPU overclocking was the direct cause of the initial crashes that triggered this reset bug.
    2. The Performance Challenge: A Severe CPU Bottleneck

      • Initial Fault: Despite allocating 16 cores to the VM, game performance was poor, with GPU utilization hovering around a mere 30%.
      • Root Cause Analysis: The root of the problem was Proxmox's inability to correctly handle the asymmetrical CCD architecture of the Ryzen 9 9950X3D. The Windows VM's game threads were being randomly scheduled onto the high-frequency cores that lacked 3D V-Cache, or were frequently migrating between the two CCDs. This introduced significant latency, crippling performance in latency-sensitive games.

    2. The Systematic Solution: From Host-Level to VM Fine-Tuning

    We adopted a bottom-up, layered optimization strategy to ensure each step built a solid foundation for the next.

    Phase 1: Proxmox Host-Level Configuration (The Bedrock)

    These settings are prerequisites for any successful passthrough, aiming to establish a correct IOMMU environment and flexible driver management capabilities for the host.

    1. GRUB Kernel Parameter Configuration:

      • Objective: To correctly enable IOMMU and improve device grouping, creating the necessary conditions for hardware passthrough.
      • Implementation: Edit /etc/default/grub and modify the GRUB_CMDLINE_LINUX_DEFAULT line:

        GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction"
        
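        • amd_iommu=on: Intended to enable the IOMMU on AMD platforms. Recent kernels already enable the AMD IOMMU by default when it is turned on in firmware, so this flag is usually redundant but harmless.
        • iommu=pt: Enables passthrough mode for better performance.
        • pcie_acs_override=...: Breaks up non-ideal IOMMU groups, allowing devices like the GPU to be passed through independently. Note that this weakens the isolation between the devices that get split apart.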
      • Activation: Run update-grub and reboot the host.
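After the reboot, one way to see how the devices ended up grouped is to walk sysfs. This is a minimal sketch and not part of the original setup; the function name `list_iommu_groups` and its optional root argument (there only so the logic can be tried against a test directory) are my own.

```shell
# list_iommu_groups [ROOT]
# Prints "group <n>: <pci-address>" for every device found under ROOT.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue        # glob matched nothing: no groups exposed
        group="${dev%/devices/*}"        # strip the trailing "/devices/<addr>"
        printf 'group %s: %s\n' "${group##*/}" "${dev##*/}"
    done
}
```

Running `list_iommu_groups | grep 0000:01:00` should show whether both GPU functions (video and audio) sit in a group of their own. Empty output suggests the IOMMU is not active.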

    2. Precise Kernel Module Management (The blacklist vs. softdep Decision):

      • Challenge: We needed vfio-pci to claim the NVIDIA dGPU at boot, while also allowing the host's amdgpu driver to load normally for the iGPU used by an LXC container.
      • The Wrong Approach: Using blacklist completely prevents a driver from loading. This would stop our hookscript from returning the GPU to the host driver after VM shutdown and would also prevent the iGPU from functioning.
      • The Correct Approach: Use softdep (soft dependency) to establish a driver loading priority.
      • Implementation:

        • Edit /etc/modprobe.d/vfio.conf:

          # Specify the exact PCI IDs for the NVIDIA dGPU and its audio device
          options vfio-pci ids=10de:2705,10de:22bb
          
          # Create soft dependencies to ensure vfio-pci loads before any NVIDIA drivers
          softdep nvidia pre: vfio-pci
          softdep nvidia_drm pre: vfio-pci
          
        • Edit /etc/modprobe.d/pve-blacklist.conf (The Key Correction):

          # Comment out the blacklisting of NVIDIA and AMD drivers to allow softdep to manage them
          #blacklist nvidiafb
          #blacklist nouveau
          #blacklist nvidia
          #blacklist radeon
          #blacklist amdgpu
          
          # Retain blacklisting for generic audio drivers to prevent conflicts with GPU HDMI audio
          blacklist snd_hda_codec_hdmi
          # ... other snd_hda_* drivers ...
          
      • Activation: Run update-initramfs -u and reboot the host.
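To confirm the softdep ordering did what it should after the reboot, you can look at which driver sysfs reports for each GPU function. A small sketch, assuming the 0000:01:00.x addresses used elsewhere in this guide; `bound_driver` is a hypothetical helper name, and its second argument exists only for testing.

```shell
# bound_driver PCI_ADDR [SYSFS_ROOT]
# Prints the name of the driver currently bound to the device, or "none".
bound_driver() {
    link="${2:-/sys/bus/pci/devices}/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

If the binding worked, `bound_driver 0000:01:00.0` and `bound_driver 0000:01:00.1` should both print vfio-pci. `lspci -nnk -s 01:00` shows the same information on its "Kernel driver in use" line.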

    3. Automated Hookscript for Seamless Driver Handoff:

      • Objective: Before the VM starts, automatically unbind the GPU from the host and hand it to vfio-pci. After the VM shuts down, automatically return the GPU to the host driver and trigger a reset, permanently curing the "Reset Bug".
      • Implementation: Create a hookscript at /var/lib/vz/snippets/gpu-manager.sh and apply it to the VM with qm set 100 --hookscript local:snippets/gpu-manager.sh. (Script contents are detailed below).

    Phase 2: Virtual Machine Level Configuration (The Performance Leap)

    With a stable host foundation, we performed precision surgery on the VM itself.

    1. CPU Core Pinning and NUMA Optimization (The Decisive Battle):

      • Data-Driven Decision: We used lscpu -e and cat /sys/devices/system/cpu/cpu*/cache/index3/size to precisely identify that the V-Cache CCD (with 96MB of L3 cache) resides on physical cores 0-7 (logical threads 0-7, 16-23).
      • Precision Pinning (affinity): We set affinity: 2-7,18-23, locking the VM's 12 vCPUs (6 physical cores plus their SMT siblings) firmly onto the V-Cache CCD. We deliberately reserved cores 0/1 and their SMT siblings for the host to handle I/O and interrupts.
      • Enable Virtual NUMA (numa: 1): This makes the Windows guest OS aware that it is running on a single, tightly-coupled NUMA node. This optimizes its internal scheduling policy, ensuring game workloads do not stray outside the V-Cache domain.
    2. Memory and Peripheral Performance Optimization (Consolidating Gains):

      • hugepages: 2: Uses 2MB hugepages to reduce memory access latency.
      • balloon: 0: Disables memory ballooning to guarantee a stable memory supply.
      • discard=on, iothread=1: Enables TRIM and an I/O thread for our ZFS storage, boosting disk performance.
    3. GPU Passthrough Stability Tuning (A Discussion on rombar=0):

      • Parameter's Function: Adding rombar=0 to the hostpci0 parameter instructs Proxmox to ignore the GPU's Option ROM. This can circumvent known firmware compatibility issues on certain NVIDIA 30/40 series cards and is a powerful tool for resolving stubborn crashes.
      • Special Note: In this specific case, we discovered that the direct trigger for the VM crashes was excessive GPU overclocking. After removing the overclock, the system became stable. Therefore, rombar=0 was not necessary in this scenario. This provides a crucial lesson: before resorting to low-level workarounds like rombar=0, higher-level instability factors such as overclocking, cooling, and driver versions should be ruled out first.
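Going back to the core pinning in step 1: the V-Cache lookup can be scripted instead of read off by eye. This is only a sketch of the idea, using the same sysfs cache files as the cat command in that step. The helper name `vcache_cpus` and the optional root argument (for testing) are assumptions of mine. The kernel reports the size in K, so the 96 MB CCD shows up as 98304K.

```shell
# vcache_cpus [ROOT]
# Prints the logical CPUs that sit behind the largest L3 cache, e.g. "0,1,2".
vcache_cpus() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu[0-9]*/cache/index3/size; do
        [ -r "$f" ] || continue
        cpu="${f#"$root"/cpu}"           # "<n>/cache/index3/size"
        size=$(cat "$f")                 # e.g. "98304K"
        printf '%s %s\n' "${cpu%%/*}" "${size%K}"
    done | sort -n | awk '
        { cpu[NR] = $1; kb[NR] = $2; if ($2 + 0 > max) max = $2 + 0 }
        END {
            for (i = 1; i <= NR; i++)
                if (kb[i] + 0 == max) { out = out sep cpu[i]; sep = "," }
            print out
        }'
}
```

The output is a plain comma-separated list of every logical CPU on the big-cache CCD. Which of those to leave for the host (cores 0/1 and their siblings in this guide) is still a manual choice.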

    3. Conclusion and Final Configuration

    This optimization project proves that by deeply understanding the underlying hardware architecture and applying systematic configuration, it is entirely possible to build a stable, high-performance, top-tier gaming VM on Proxmox VE. The keys to success were:

    1. A Layered Problem-Solving Approach: First, ensure host-level stability and correctness, then optimize for performance at the virtual machine level.
    2. Data-Driven Precision Tuning: Abandon guesswork and use system utilities to precisely locate the V-Cache cores.
    3. Flexible Driver Management: Correctly use softdep and a hookscript to perfectly resolve the GPU reset bug and conflicts in a mixed-GPU setup.
    4. Top-Down Troubleshooting: When addressing stability, start with the application layer (overclocking) before considering low-level solutions (rombar).

    In the end, we not only resolved all initial issues but also established a "Golden Standard" configuration for this high-end hardware, laying a solid foundation for future virtualization endeavors.


    Appendix I: Final "Golden Standard" Configuration File (/etc/pve/qemu-server/100.conf)

    # --- CPU & NUMA ---
    # Pin VM cores to the V-Cache CCD (cores 2-7 and their SMT counterparts)
    affinity: 2-7,18-23
    cpu: host,hidden=1
    cores: 12
    sockets: 1
    numa: 1
    
    # --- Memory ---
    balloon: 0
    hugepages: 2
    memory: 32768
    
    # --- Passthrough Hardware & Automation ---
    bios: ovmf
    hookscript: local:snippets/gpu-manager.sh
    hostpci0: 0000:01:00,pcie=1,x-vga=1
    vga: none
    usb0: host=1d6b:0104
    
    # --- Storage ---
    scsihw: virtio-scsi-single
    virtio0: local-zfs:vm-100-disk-0,discard=on,iothread=1,size=300G
    # ... other virtio disks ...
    
    # --- System & Boot ---
    boot: order=virtio0;ide0;net0
    efidisk0: local-zfs:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
    ide0: none,media=cdrom
    machine: pc-q35-9.2+pve1
    name: Windows
    net0: virtio=BC:24:11:65:B7:FD,bridge=vmbr0,firewall=1
    onboot: 1
    ostype: win11
    smbios1: uuid=64ea7622-afea-4da0-bbb5-9947461cac19
    tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
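As a quick sanity check on the memory settings above (memory: 32768 with hugepages: 2), the number of 2 MB pages the VM needs is simple division, and the host's current pool can be read from /proc/meminfo. A sketch with made-up helper names; the optional file argument is only there for testing.

```shell
# hugepages_needed MEM_MIB PAGE_MIB -> number of hugepages backing the VM
hugepages_needed() {
    echo $(( $1 / $2 ))
}

# hugepages_free [MEMINFO_FILE] -> current value of HugePages_Free
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

`hugepages_needed 32768 2` gives the page count for this VM. Comparing it with `hugepages_free` only tells you the state of the pool at that moment, not whether a start will succeed.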
    

    Appendix II: Automated Driver Management Hookscript (/var/lib/vz/snippets/gpu-manager.sh)

    This script is the core component for achieving a seamless and stable handoff of the NVIDIA dGPU between the host and the VM. It resolves the "Reset Bug" that causes the VM to fail to restart after shutdown.

    Purpose and Principle:
    * Before VM Start (pre-start): The script forcibly unbinds the passthrough GPU devices from their host drivers (e.g., nvidia) and ensures they are ready to be claimed by the vfio-pci driver.
    * After VM Stop (post-stop): The script releases the GPU from vfio-pci and then triggers a rescan of the host's PCI bus. This prompts the host's nvidia driver to reclaim and completely re-initialize (reset) the GPU, returning it to a clean state, ready for the next VM boot.

    Script Content:

    #!/bin/bash
    # Proxmox VE VM Hookscript for GPU Passthrough Driver Management
    
    # --- USER CONFIGURATION ---
    # Please fill in the PCI addresses of your GPU's functions (video and audio).
    # You can use `lspci -nns <GPU_BUS_ID>` (e.g., `lspci -nns 01:00`) to find all functions.
    GPU_DEVICES="0000:01:00.0 0000:01:00.1"
    
    # --- LOGGING CONFIGURATION ---
    LOG_FILE="/var/log/pve/qemu-server/hookscript.log"
    
    # --- SCRIPT BODY ---
    VM_ID=$1
    PHASE=$2
    
    log_echo() {
        echo "$(date '+%Y-%m-%d %H:%M:%S') - VM $VM_ID - $PHASE:" "$@" >> "$LOG_FILE"
    }
    
    log_echo "Hookscript triggered."
    
    if [ "$PHASE" == "pre-start" ]; then
        log_echo "Unbinding GPU devices from host drivers..."
        for DEV in $GPU_DEVICES; do
            # To ensure QEMU can take over, we forcibly override the driver to vfio-pci.
            # This works even if the device is not currently bound to any driver.
            echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
            # If the device is already bound to a driver (like nvidia, nouveau), unbind it first.
            if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
                log_echo "Device $DEV is bound to $(basename $(readlink /sys/bus/pci/devices/$DEV/driver)). Unbinding..."
                echo "$DEV" > /sys/bus/pci/devices/$DEV/driver/unbind
            fi
        done
        # Ensure the vfio-pci module is loaded
        modprobe -i vfio-pci
        log_echo "GPU devices are now ready for vfio-pci."
    
    elif [ "$PHASE" == "post-stop" ]; then
        log_echo "Rebinding GPU devices to host drivers..."
        for DEV in $GPU_DEVICES; do
            # Clear the vfio-pci driver override
            echo "" > /sys/bus/pci/devices/$DEV/driver_override
            # If the device is still bound to vfio-pci, unbind it
            if [ -e /sys/bus/pci/devices/$DEV/driver ] && [ "$(basename $(readlink /sys/bus/pci/devices/$DEV/driver))" == "vfio-pci" ]; then
                log_echo "Device $DEV is bound to vfio-pci. Unbinding..."
                echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/unbind
            fi
        done
    
        # Trigger a PCI bus rescan. This is the most critical step!
        # This prompts the host kernel to re-discover the "ownerless" GPU devices,
        # allowing the appropriate host driver (nvidia) to bind to and initialize them,
        # thereby completing the hardware reset.
        log_echo "Triggering PCI bus rescan to re-initialize GPU..."
        echo 1 > /sys/bus/pci/rescan
        log_echo "GPU devices have been returned to the host."
    fi
    
    exit 0
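One thing worth knowing about the script above: a mistyped address in GPU_DEVICES only produces a failed write to sysfs, and the script still logs success and exits 0. An optional pre-flight check can catch that earlier. This is a sketch, not part of the original hookscript; `is_pci_addr` and `check_gpu_devices` are hypothetical helpers, and the sysfs root argument is there for testing.

```shell
# is_pci_addr ADDR -> succeeds if ADDR looks like "0000:01:00.0"
is_pci_addr() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'
}

# check_gpu_devices "ADDR ADDR ..." [SYSFS_ROOT]
# Prints one problem per line; prints nothing and succeeds if all entries look usable.
check_gpu_devices() {
    root="${2:-/sys/bus/pci/devices}"
    rc=0
    for dev in $1; do
        if ! is_pci_addr "$dev"; then
            echo "malformed address: $dev"; rc=1
        elif [ ! -e "$root/$dev" ]; then
            echo "no such device: $dev"; rc=1
        fi
    done
    return "$rc"
}
```

For example, `check_gpu_devices "$GPU_DEVICES" || exit 1` near the top of the pre-start branch would abort before anything is unbound.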
    

    Installation and Usage Instructions:

    1. Save the Script: Save the content above to /var/lib/vz/snippets/gpu-manager.sh on your Proxmox host.
    2. Grant Execute Permissions: In the host shell, run the following command to make the script executable.
      chmod +x /var/lib/vz/snippets/gpu-manager.sh
    3. Apply to the Virtual Machine: Run the following command to associate the script with your Windows VM (using ID 100 as an example).
      qm set 100 --hookscript local:snippets/gpu-manager.sh
    4. Verify (Optional): After starting and stopping your VM, you can check the log file to confirm the script executed as expected.
      cat /var/log/pve/qemu-server/hookscript.log
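Rather than reading the whole log, you can count how often each phase fired for a VM. A small sketch that relies on the line format written by log_echo in the script ("<timestamp> - VM <id> - <phase>: <message>"); the helper name `count_phase_runs` is my own.

```shell
# count_phase_runs VM_ID PHASE [LOG_FILE]
# Counts "Hookscript triggered." entries for the given VM and phase.
count_phase_runs() {
    grep -c -- " - VM $1 - $2: Hookscript triggered\.\$" \
        "${3:-/var/log/pve/qemu-server/hookscript.log}"
}
```

After one clean start/stop cycle, `count_phase_runs 100 pre-start` and `count_phase_runs 100 post-stop` should have gone up by the same amount. A missing post-stop entry would mean the GPU was never handed back.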

    This is the final config as of now; I'll keep posting here if I find better parameters.

    Follow-up: why nvidia-drm.modeset=0 makes GPU passthrough steadier

    Quick TL;DR: setting nvidia-drm.modeset=0 disables NVIDIA’s DRM/KMS on the host, so the host won’t grab the card for consoles/plymouth/Wayland. That keeps the GPU “clean” for VFIO, which in practice reduces “device is busy”, failed resets, and flaky re-binds after a VM shuts down.

    What it actually changes

    With KMS on (modeset=1), nvidia_drm registers a DRM device and the host can light up /dev/dri/* for fbcon, plymouth, or a display manager.

    With KMS off (modeset=0), nvidia_drm doesn’t expose KMS, so the host is far less likely to touch the card. VFIO can claim it early and keep it.

    Why this helps VFIO

    Fewer conflicts from fbcon/Wayland/plymouth touching the GPU.

    Better odds that resets succeed and the card is re-usable across VM start/stop cycles.

    Especially helpful when the host uses a different adapter for display (e.g., an AMD iGPU) and the NVIDIA card is only for passthrough.

    Trade-offs

    No KMS on the host for that NVIDIA card (Wayland/DRM features won’t work on it).

    Not an issue if the host has another display adapter or is headless.

    Check your current state

    Is KMS enabled for nvidia_drm? (Y = on, N = off)

    cat /sys/module/nvidia_drm/parameters/modeset

    Do DRM nodes exist for this GPU?

    ls -l /dev/dri/ || true

    Did the kernel get the parameter?

    cat /proc/cmdline
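If you would rather get a direct answer than scan the whole command line, the value can be pulled out with a few lines of shell. A sketch; `modeset_from_cmdline` is a made-up helper name. It accepts both the dash and the underscore spelling, since the kernel treats them as equivalent in parameter names, and it reports the last occurrence.

```shell
# modeset_from_cmdline "CMDLINE" -> value of nvidia-drm.modeset, or "unset"
modeset_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        awk -F= '
            $1 == "nvidia-drm.modeset" || $1 == "nvidia_drm.modeset" { v = $2 }
            END { print (v == "" ? "unset" : v) }'
}
```

Usage on a live host: `modeset_from_cmdline "$(cat /proc/cmdline)"`.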

    Apply the setting (Debian/Proxmox example)

    sudo nano /etc/default/grub

    Add nvidia-drm.modeset=0 (keep your existing IOMMU flags)

    Example:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt nvidia-drm.modeset=0"

    sudo update-grub

    If you use systemd-boot on Proxmox, add the parameter to /etc/kernel/cmdline instead of /etc/default/grub, then run:

    sudo proxmox-boot-tool refresh
    sudo reboot
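For the GRUB case, the edit itself can be scripted so that running it twice does not add the parameter twice. This is a rough sketch with a hypothetical helper name. It assumes a single GRUB_CMDLINE_LINUX_DEFAULT="..." line and a parameter without sed-special characters, and it keeps a .bak copy of the file. Try it on a copy of /etc/default/grub first.

```shell
# add_kernel_param PARAM FILE
# Appends PARAM inside the quotes of GRUB_CMDLINE_LINUX_DEFAULT unless already present.
add_kernel_param() {
    line=$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$2") || return 1
    current=$(printf '%s\n' "$line" | cut -d'"' -f2)
    case " $current " in
        *" $1 "*) return 0 ;;            # already there, nothing to do
    esac
    sed -i.bak "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $1\"|" "$2"
}
```

Example: `add_kernel_param nvidia-drm.modeset=0 /etc/default/grub`, followed by update-grub and a reboot as described above.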

    (Optional) Bind the card to vfio-pci early
    This approach does not rely on softdep or blacklist entries.

    Find your device IDs

    lspci -nn | grep -i nvidia

    Then set them (replace 10de:XXXX,10de:YYYY with your IDs)

    sudo nano /etc/modprobe.d/vfio.conf

    contents:

    options vfio-pci ids=10de:XXXX,10de:YYYY disable_vga=1

    Rebuild initramfs if your distro requires it, then reboot.
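The ID lookup and the options line can be chained so the IDs are not retyped by hand. A sketch; `vfio_ids_line` is a made-up name, and it reads `lspci -nn` output on stdin. It only emits the ids= part, so add disable_vga=1 yourself if you want it.

```shell
# vfio_ids_line  (reads `lspci -nn` output on stdin)
# Prints an "options vfio-pci ids=..." line for all NVIDIA functions found.
vfio_ids_line() {
    grep -i nvidia |
        grep -Eo '\[[0-9a-f]{4}:[0-9a-f]{4}\]' |   # only [vendor:device] pairs
        tr -d '[]' |
        awk '!seen[$0]++' |                         # drop duplicates, keep order
        paste -sd, - |
        sed 's/^/options vfio-pci ids=/'
}
```

Usage: `lspci -nn | vfio_ids_line`. Check the result before putting it into /etc/modprobe.d/vfio.conf, because it will pick up every NVIDIA device in the machine, not just the one meant for passthrough.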

    In short, nvidia-drm.modeset=0 keeps the host from “owning” the card via KMS, so VFIO can take exclusive control early. If your host display runs on another GPU (iGPU/AMD), this is a near-zero-cost stability win for passthrough.
