MSFS on Proxmox with GPU Passthrough (DXGI HANG) - $100 Bounty

jason5545 · August 2025

If you want to play remotely (Sunshine/Moonlight streaming), you need either a connected monitor or a dummy HDMI plug that simulates one.

Currently, I'm using GLKVM for that purpose. And if something really catastrophic happens, I still have access to my BIOS.

EagleCorals · August 2025

Now, next step is Fortnite on VM

xemaps · August 2025

@jason5545 said:

If you want to play remotely (Sunshine/Moonlight streaming), you need either a connected monitor or a dummy HDMI plug that simulates one.

Currently, I'm using GLKVM for that purpose. And if something really catastrophic happens, I still have access to my BIOS.

GL.iNet Comet KVM?

jason5545 · August 2025

@xemaps said:

@jason5545 said:

If you want to play remotely (Sunshine/Moonlight streaming), you need either a connected monitor or a dummy HDMI plug that simulates one.

Currently, I'm using GLKVM for that purpose. And if something really catastrophic happens, I still have access to my BIOS.

GL.iNet Comet KVM?

Yes.

jason5545 · August 2025

@EagleCorals said:
Now, next step is Fortnite on VM

I mainly play simulator games, so EAC isn't an issue for me. Someone on Proxmox tried it two or three years ago. I would happily test it if it doesn't require tinkering with my host environment. Will report back.

jason5545 · August 2025

Share some more Tweaks:

Technical Analysis Report: Resolving Hardware Sensor Detection Failure on Linux

Date: August 7, 2025
Subject: Resolving a hardware sensor detection failure on an ASUS ROG STRIX B650E-I GAMING WIFI motherboard running under a Linux environment.
System Context: Proxmox VE (Debian 13 "Trixie" base) with Kernel 6.14.8-2-pve.

Executive Summary

This report documents the diagnosis and resolution of an issue where the Linux operating system was unable to detect and report fan speeds and other critical sensor data on a modern server platform. The hardware consists of an ASUS ROG STRIX B650E-I motherboard and an AMD Ryzen 9 9950X3D CPU. The initial investigation revealed that the standard hardware monitoring suite, lm-sensors, failed to identify the motherboard's Super I/O (S.I.O.) controller chip.

The root cause was traced to an unknown chip ID, 0xd802, reported by the sensors-detect utility. The working hypothesis was that this new chip was compatible with the existing Nuvoton nct6775 family driver, but its ID was not yet included in the driver's official support list.

The problem was successfully resolved by forcing the nct6775 kernel module to recognize this specific ID. This action immediately enabled full monitoring capabilities, including fan speeds, temperatures, and voltages. The solution was then made permanent by creating a modprobe configuration file and updating the initramfs, ensuring persistence across system reboots.

1. Problem Description

Hardware Platform:
- Motherboard: ASUSTeK COMPUTER INC. ROG STRIX B650E-I GAMING WIFI
- CPU: AMD Ryzen 9 9950X3D 16-Core Processor
Software Environment:
- Operating System: Debian GNU/Linux 13 (trixie) / Proxmox VE
- Kernel: 6.14.8-2-pve
Initial Symptoms:
- When using the standard sensors command, fan speeds were either missing or reported as N/A.
- This lack of visibility prevented administrators from monitoring the thermal state of the chassis and CPU, posing a potential risk to server stability and hardware longevity under load.

2. Diagnostic Process and Analysis

2.1. Initial Diagnostic Tool (lm-sensors)

The standard diagnostic approach for such issues involves using the lm-sensors package. The sensors-detect script was executed with superuser privileges to scan the system for all available hardware monitoring chips.

2.2. Analysis of sensors-detect Output

The output from sensors-detect provided the definitive clue:

# Board: ASUSTeK COMPUTER INC. ROG STRIX B650E-I GAMING WIFI
...
Probing for Super-I/O at 0x2e/0x2f
Trying family `VIA/Winbond/Nuvoton/Fintek'...               Yes
Found unknown chip with ID 0xd802
    (logical device B has address 0x290, could be sensors)
...
Sorry, no sensors were detected.

Key Finding 1: Unknown Chip Identified
The log clearly states that during the Super I/O probe, it found a chip with the ID 0xd802. The Super I/O chip is a critical component responsible for low-bandwidth devices and, most importantly, the hardware monitoring sensors (fans, temperatures, voltages).
Key Finding 2: Detection Failure
Because the ID 0xd802 was not present in the sensors-detect database, it could not match it to a known kernel driver. This resulted in a complete failure to configure any sensors, leading to the final message: Sorry, no sensors were detected.
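
For anyone repeating this diagnosis, the unknown ID can be pulled out of a saved sensors-detect log rather than read by eye. The following is only a minimal sketch: the function name extract_unknown_chip_ids is hypothetical, and it assumes the log uses the exact "Found unknown chip with ID" wording shown in the excerpt above.

```shell
#!/bin/bash
# Print the ID of every unknown Super I/O chip found in sensors-detect
# output read from stdin (for example a saved log of the run).
# Assumption: the line format matches the excerpt quoted in this report.
extract_unknown_chip_ids() {
  sed -n 's/^Found unknown chip with ID \(0x[0-9a-fA-F]*\).*$/\1/p'
}

# Example: feed the relevant lines of the excerpt through the function.
sample_log='Probing for Super-I/O at 0x2e/0x2f
Found unknown chip with ID 0xd802
    (logical device B has address 0x290, could be sensors)'
printf '%s\n' "$sample_log" | extract_unknown_chip_ids   # prints 0xd802
```

The printed value is what would later be passed to the driver as force_id.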

2.3. Hypothesis Formulation

Based on the evidence, the following hypothesis was formed:

New Hardware Support Lag: The ROG STRIX B650E-I is a new motherboard. It is common for its components, like the S.I.O. chip, to be newer than the versions of the kernel drivers distributed with the OS. The nct6775 driver likely supports the chip's functionality, but does not yet officially list its ID.
Chip Family Compatibility: Hardware vendors like Nuvoton often maintain architectural compatibility across chip generations. It was highly probable that the new 0xd802 chip was a variant compatible with the well-established nct6775 driver family (which covers a range of models like the nct6779, nct6798, etc.).
Solution Path: The most direct solution path would be to manually instruct the nct6775 kernel module to "force" ownership and initialization of this unknown chip ID, bypassing the need to wait for an official kernel patch.

3. Solution and Implementation

3.1. Immediate Mitigation: Forcing Module Loading

To test the hypothesis, the nct6775 kernel module was loaded with the force_id parameter:

sudo modprobe nct6775 force_id=0xd802

3.2. Verification of Results

Immediately after executing the command, the sensors command was run again. The result was a success:

nct6799-isa-0290
Adapter: ISA adapter
...
fan1:                         1719 RPM  (min =    0 RPM)
fan2:                         1425 RPM  (min =    0 RPM)
fan7:                            0 RPM  (min =    0 RPM)
SYSTIN:                        +44.0°C ...
CPUTIN:                        +48.0°C ...
...

Verification Success: The output now included a new block for a nct6799-isa-0290 device. This confirmed that the nct6775 driver correctly identified the 0xd802 chip as a compatible model and initialized it.
Objective Achieved: The fan speeds for fan1 and fan2 were now clearly visible, resolving the core problem. As a benefit, all other motherboard sensors (temperatures and voltages) became available as well.

3.3. Permanent Configuration

To ensure the solution persists after a reboot, the module option was made permanent.

Create Module Configuration File: A configuration file was created to automate the application of the force_id parameter on boot:
```
echo 'options nct6775 force_id=0xd802' | sudo tee /etc/modprobe.d/nct6775.conf
```
This file instructs the system to always use the force_id=0xd802 option whenever the nct6775 module is loaded.
Update Initial RAM Disk (initramfs): To ensure the kernel loads the module with the correct options during the early boot process, the initramfs was rebuilt:
```
sudo update-initramfs -u
```
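
Before rebooting, it may be worth confirming that the option really landed in a modprobe configuration file. The check below is a sketch under assumptions: the helper name has_nct6775_force_id is made up for illustration, and the directory argument exists only so the function can be tried against a scratch directory instead of /etc/modprobe.d.

```shell
#!/bin/bash
# Return 0 if any *.conf file in the given modprobe.d directory sets
# force_id to the given value for the nct6775 module.
# conf_dir defaults to /etc/modprobe.d; the function only reads files.
has_nct6775_force_id() {
  local id="$1" conf_dir="${2:-/etc/modprobe.d}"
  grep -qsE "^[[:space:]]*options[[:space:]]+nct6775[[:space:]].*force_id=${id}([[:space:]]|\$)" \
    "$conf_dir"/*.conf
}

if has_nct6775_force_id 0xd802; then
  echo "force_id option is configured"
else
  echo "force_id option not found"
fi
```

After the reboot, running sensors again and looking for the nct6799-isa-0290 block is the real confirmation; this check only covers the configuration file.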

4. Conclusion

The root cause of this issue was a predictable support lag between the release of new motherboard hardware and its inclusion in official Linux kernel drivers. By following a systematic diagnostic process, the key identifier of the unsupported hardware (0xd802) was located from the sensors-detect logs.

The final solution, implemented by forcing an existing, compatible driver to claim the new device, was both effective and immediate. This case serves as an excellent example of the power and flexibility of the Linux ecosystem, which often provides the tools necessary to manage hardware interoperability challenges without waiting for official vendor or kernel updates. The server's full hardware monitoring capabilities have been successfully restored and made permanent.

EagleCorals · August 2025

It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
Setup-wise I have a feeling it would be easier, based on what I tried.
I'm currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

tall_ice · August 2025

@jason5545 what a boss. Super impressed by your digging. Thanks for reporting back to help out any future seekers. This to me is what the internet is for. An LLM can't help like this.

jason5545 · August 2025

@EagleCorals said:
It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
Setup-wise I have a feeling it would be easier, based on what I tried.
I'm currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

I had only heard of Proxmox and XCP-ng in open-source territory; it's clear I have more to learn.

EagleCorals · August 2025

@jason5545 said:

@EagleCorals said:
It would be interesting to see how FreeBSD with a bhyve VM compares to a Proxmox VM.
Setup-wise I have a feeling it would be easier, based on what I tried.
I'm currently away from my gaming rig, so I can't test. But it is something I keep on my todo list.

I had only heard of Proxmox and XCP-ng in open-source territory; it's clear I have more to learn.

I did it some weeks ago, just to test it.
in /boot/loader.conf
vmm_load="YES"
pptdevs="0/21/0"

then in the bhyve vm config just put the passthru
passthru0="0/21/0"

worked as expected.
now I am using it for my homeassistant vm under bhyve.

but it would be interesting to see how it compares with your setup and game.

https://wiki.freebsd.org/bhyve/pci_passthru

ps: prepare to go down the rabbit hole, freebsd is not linux. 😊😊

the nice thing is you can have a fully working workstation for day-to-day use while running VMs with bhyve.
unlike with proxmox, where you basically can't use the machine for anything else.

jason5545 · August 2025

@angstrom request to move to General, Thanks

devjorge · August 2025

very cool you got it running!

no sound crackles? that was my biggest problem but some years ago.

you said i can PM you? save your dollars and get an addon and ENJOY!

jason5545 · August 2025

Added some NUT configuration notes:

Technical Report: Implementing a Configurable Server Shutdown Delay for NUT Monitoring Systems

Date: August 08, 2025
Subject: Optimizing the shutdown strategy for the LAB-22BR70G server, which is protected by an APC Back-UPS RS 1500MS and monitored by Network UPS Tools (NUT).

--

1. Executive Summary

This report addresses the inherent lack of proactive control in the default shutdown strategy of Network UPS Tools (NUT). The standard configuration is reactive, typically initiating a server shutdown only when the UPS battery reaches a critically low level. This behavior can lead to unnecessary downtime during short-term power outages that resolve within minutes. This document presents a complete solution: an automated management script, nut-delay-manager.sh. This script empowers system administrators to easily view and configure a fixed "shutdown delay," allowing the server to wait for a user-defined period (e.g., 5 minutes) after a power failure before initiating its shutdown sequence. This implementation significantly enhances system stability and uptime.

2. Background and Problem Statement

The server LAB-22BR70G is protected against power loss by an APC Back-UPS RS 1500MS Uninterruptible Power Supply (UPS). The server utilizes the Network UPS Tools (NUT) software suite to monitor the UPS status and automate a graceful shutdown when necessary.

A review of the current configuration file, upsmon.conf, reveals the use of an immediate shutdown command: SHUTDOWNCMD "/sbin/shutdown -h +0". The trigger for this command relies on NUT's default logic, which is to wait for the UPS to signal a LOWBATT (low battery) condition.

This default strategy presents the following risks:

Over-Reactivity to Minor Outages: The system may begin a full shutdown sequence in response to a brief power fluctuation or a short-term outage that would have otherwise resolved before exhausting the battery, leading to unnecessary downtime.
Lack of Flexibility: The "shutdown on low battery" approach is rigid and does not allow administrators to define a grace period based on external factors, such as the expected time for power restoration or for a backup generator to start.

A more flexible and resilient shutdown policy is required, replacing the "low battery trigger" with a "fixed time delay."

3. Proposed Solution

To address the stated problem, this report proposes the implementation of a timer-based shutdown delay policy utilizing NUT's built-in scheduling tool, upssched. The logic of this new policy is as follows:

Power Loss Detection (ONBATT): When upsmon detects that the UPS has switched to battery power, it will instruct upssched to start a countdown timer for a predefined duration (e.g., 300 seconds).
Power Restoration Detection (ONLINE): If utility power is restored before the timer expires, upssched will cancel the shutdown timer. The server will continue to operate normally with no interruption.
Timer Expiration: If the timer completes its countdown and utility power has not been restored, upssched will execute a command script. This script invokes upsmon -c fsd (Force Shutdown), compelling the server to begin a graceful shutdown immediately.

To simplify the deployment and ongoing management of this configuration, a Bash script named nut-delay-manager.sh has been developed. This script automates all necessary configuration file modifications and provides a simple command-line interface for administrators to view and set the shutdown delay time.

4. Implementation Details

4.1. Management Script: `nut-delay-manager.sh`

This script is the core of the solution, encapsulating all configuration logic.

#!/bin/bash

# ==============================================================================
# NUT (Network UPS Tools) Server Shutdown Delay Manager
# Function: View and set the delay time from power loss until server shutdown
#           is initiated.
# Version:  1.1
# ==============================================================================

# --- Configuration File Paths ---
UPSSCHED_CONF="/etc/nut/upssched.conf"
CMDSCRIPT_PATH="/usr/local/sbin/nut-shutdown-trigger.sh"
UPSMON_CONF="/etc/nut/upsmon.conf"
TIMER_NAME="shutdown-delay"

# --- Color Codes ---
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

# --- Check for root privileges ---
if [ "$(id -u)" -ne 0 ]; then
  echo -e "${RED}Error: This script must be run with root privileges. Please use sudo.${NC}"
  echo "Usage: sudo ./nut-delay-manager.sh [view|set <minutes>]"
  exit 1
fi

# --- Function: Display usage ---
usage() {
  echo "NUT Shutdown Delay Management Tool"
  echo "----------------------------------"
  echo "Usage: sudo $0 [view|set <minutes>]"
  echo
  echo "Commands:"
  echo "  view          View the current server shutdown delay."
  echo "  set <minutes> Set a new delay time in minutes."
  echo
  echo "Examples:"
  echo "  sudo $0 view"
  echo "  sudo $0 set 5"
  exit 1
}

# --- Function: View current setting ---
view_setting() {
  echo "Checking current NUT server shutdown delay setting..."
  if [ ! -f "$UPSSCHED_CONF" ]; then
    echo -e "${YELLOW}Warning: upssched config file not found (${UPSSCHED_CONF}).${NC}"
    echo "The system might not be configured for delayed shutdown, or it uses a non-standard configuration."
    return
  fi

  local delay_line=$(grep "START-TIMER ${TIMER_NAME}" "$UPSSCHED_CONF")
  if [ -z "$delay_line" ]; then
    echo -e "${YELLOW}Warning: No active delay timer found in the configuration file.${NC}"
    echo "The system is likely configured to shut down on LOWBATT condition, not after a fixed delay."
    return
  fi

  # Fields: AT(1) ONBATT(2) *(3) START-TIMER(4) <timer name>(5) <seconds>(6)
  local delay_sec=$(echo "$delay_line" | awk '{print $6}')
  if [[ "$delay_sec" =~ ^[0-9]+$ ]]; then
    local delay_min=$((delay_sec / 60))
    echo -e "${GREEN}Configuration Found!${NC}"
    echo "Current server shutdown delay is set to: ${delay_sec} seconds (${delay_min} minutes)."
    echo "This means the system will wait for ${delay_min} minutes on battery before initiating shutdown."
  else
    echo -e "${RED}Error: The delay value found in the config file is not a valid number.${NC}"
    echo "Config line: $delay_line"
  fi
}

# --- Function: Set new delay ---
set_setting() {
  local delay_min=$1
  if ! [[ "$delay_min" =~ ^[1-9][0-9]*$ ]]; then
    echo -e "${RED}Error: Please provide a valid number of minutes (greater than 0).${NC}"
    usage
  fi

  local delay_sec=$((delay_min * 60))

  echo "Preparing to set the server shutdown delay to ${delay_min} minutes (${delay_sec} seconds)..."
  echo -e "${YELLOW}This will modify the following files:${NC}"
  echo "- ${CMDSCRIPT_PATH} (created or overwritten)"
  echo "- ${UPSSCHED_CONF} (created or overwritten)"
  echo "- ${UPSMON_CONF} (modified)"
  read -p "Are you sure you want to proceed? (y/n): " confirm
  if [ "$confirm" != "y" ]; then
    echo "Operation cancelled."
    exit 0
  fi

  # 1. Create the command script (CMDSCRIPT)
  echo "Creating command script: ${CMDSCRIPT_PATH}"
  cat << EOF > "$CMDSCRIPT_PATH"
#!/bin/sh
# This script is auto-generated by nut-delay-manager.sh
# It is called by upssched when the timer expires to force a system shutdown.

if [ "\$1" = "${TIMER_NAME}" ]; then
  /sbin/upsmon -c fsd
fi
EOF
  chmod +x "$CMDSCRIPT_PATH"
  echo " -> Done"

  # 2. Create/overwrite upssched.conf
  echo "Configuring upssched.conf..."
  # Back up old config file
  [ -f "$UPSSCHED_CONF" ] && cp "$UPSSCHED_CONF" "$UPSSCHED_CONF.bak.$(date +%F-%T)"
  cat << EOF > "$UPSSCHED_CONF"
# This file is auto-generated by nut-delay-manager.sh
CMDSCRIPT ${CMDSCRIPT_PATH}
PIPEFN /var/run/nut/upssched.pipe
LOCKFN /var/run/nut/upssched.lock

# When power fails (ONBATT), start a shutdown timer for ${delay_sec} seconds
AT ONBATT * START-TIMER ${TIMER_NAME} ${delay_sec}

# When power returns (ONLINE), cancel the timer
AT ONLINE * CANCEL-TIMER ${TIMER_NAME}
EOF
  echo " -> Done"

  # 3. Modify upsmon.conf
  echo "Modifying upsmon.conf to enable the scheduler..."
  # Back up old config file
  cp "$UPSMON_CONF" "$UPSMON_CONF.bak.$(date +%F-%T)"
  # Ensure NOTIFYCMD points to upssched
  sed -i '/^NOTIFYCMD/d' "$UPSMON_CONF"
  echo "NOTIFYCMD /sbin/upssched" >> "$UPSMON_CONF"
  # Ensure relevant events trigger execution
  sed -i '/^NOTIFYFLAG ONBATT/d' "$UPSMON_CONF"
  sed -i '/^NOTIFYFLAG ONLINE/d' "$UPSMON_CONF"
  echo "NOTIFYFLAG ONBATT SYSLOG+EXEC" >> "$UPSMON_CONF"
  echo "NOTIFYFLAG ONLINE SYSLOG+EXEC" >> "$UPSMON_CONF"
  echo " -> Done"

  # 4. Restart NUT services
  echo "Restarting NUT services to apply the new configuration..."
  if systemctl restart nut-server nut-client nut-monitor &> /dev/null; then
    echo -e "${GREEN}NUT services restarted successfully.${NC}"
  else
    echo -e "${YELLOW}Warning: Failed to restart NUT services automatically. Please do it manually:${NC}"
    echo "sudo systemctl restart nut-server nut-client nut-monitor"
  fi

  echo -e "${GREEN}Configuration complete! Server shutdown delay is now set to ${delay_min} minutes.${NC}"
}

# --- Main script logic ---
case "$1" in
  view)
    view_setting
    ;;
  set)
    set_setting "$2"
    ;;
  *)
    usage
    ;;
esac
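
The START-TIMER rule written by the script has six whitespace-separated fields, with the delay in seconds as the last one. A standalone sketch of reading that value back is shown here; get_timer_delay is a hypothetical helper and is not part of nut-delay-manager.sh.

```shell
#!/bin/bash
# Print the delay (in seconds) of a named START-TIMER rule in upssched.conf.
# Fields: AT(1) <notifytype>(2) <upsname>(3) START-TIMER(4) <name>(5) <seconds>(6)
get_timer_delay() {
  local conf="$1" timer="$2"
  awk -v t="$timer" \
    '$1 == "AT" && $4 == "START-TIMER" && $5 == t { print $6; exit }' "$conf"
}

# Example against a scratch copy of the generated rules.
demo_conf=$(mktemp)
cat > "$demo_conf" << 'EOF'
# When power fails (ONBATT), start a shutdown timer for 300 seconds
AT ONBATT * START-TIMER shutdown-delay 300
AT ONLINE * CANCEL-TIMER shutdown-delay
EOF
get_timer_delay "$demo_conf" shutdown-delay   # prints 300
rm -f "$demo_conf"
```

Matching on the field positions rather than on a substring also skips comment lines that happen to mention the timer.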

5. Usage and Operational Procedures

5.1. Installation Procedure

Create the script file:
```
nano nut-delay-manager.sh
```
Paste the content: Copy the entire script from Section 4.1 and paste it into the nano editor.
Save and exit: Press Ctrl + X, followed by Y, and then Enter.
Grant execute permissions:
```
chmod +x nut-delay-manager.sh
```
(Recommended) Move to system path: To make the script globally accessible, move it to /usr/local/sbin.
```
sudo mv nut-delay-manager.sh /usr/local/sbin/
```

5.2. Command Reference

Important: All commands must be executed with sudo privileges.

To view the current delay setting:
```
sudo nut-delay-manager.sh view
```
To set the delay time (example: set to 5 minutes):
```
sudo nut-delay-manager.sh set 5
```
The script will prompt for confirmation. After entering y, it will automatically apply all configurations and restart the necessary NUT services.

6. Conclusion and Recommendations

This report has detailed the implementation of a robust server shutdown delay policy. Through the use of the nut-delay-manager.sh script, system administrators can move beyond NUT's default, inflexible shutdown mechanism to a precisely controlled, timer-based approach.

Key Benefits:
* Increased System Resilience: Significantly reduces unnecessary downtime caused by transient power events.
* Simplified Administration: Condenses a complex, multi-file configuration process into a single, intuitive command.
* Traceability: The script's automatic backup feature and clear configuration comments facilitate future audits and maintenance.

It is recommended that this script be adopted as a standard operating procedure for all critical systems protected by a UPS to ensure an enhanced and predictable response to power failures.

jason5545 · August 2025

@devjorge said:
very cool you got it running!

no sound crackles? that was my biggest problem but some years ago.

you said i can PM you? save your dollars and get an addon and ENJOY!

Thanks, will use it in my Fenix A320
Because I use Apollo, the sound driver automatically switches to the Steam Streaming one. No crackling issues so far.

devjorge · August 2025

Do shut down quickly, and don't deplete your UPS battery too often or too deeply...
If it is lead-acid, it does not like that. Shut down at 12 V or earlier to avoid hurting it.

It works great the first few times you test. You get good "runtime" when power cuts out...
but when you really need it, maybe in a year, the battery could be almost dead.
Changing the battery before 2 years is advised, because when batteries can't hold a charge anymore they just die when power cuts out...
This often happens without any prior warning and leaves no time to shut down cleanly.

//:add
But I have a question: what is your UPS rated for, in VA and in real watts?
If your UPS says 1000 VA, that's far from being 1000 real watts.
Mostly you must divide VA by 2 to get near the real supported wattage.
I would not run your system on a 1000 VA unit.
At least 2000 VA to have some margin for peak spikes.

jason5545 · August 2025

@devjorge said:
Do shut down quickly, and don't deplete your UPS battery too often or too deeply...
If it is lead-acid, it does not like that. Shut down at 12 V or earlier to avoid hurting it.

It works great the first few times you test. You get good "runtime" when power cuts out...
but when you really need it, maybe in a year, the battery could be almost dead.
Changing the battery before 2 years is advised, because when batteries can't hold a charge anymore they just die when power cuts out...
This often happens without any prior warning and leaves no time to shut down cleanly.

//:add
But I have a question: what is your UPS rated for, in VA and in real watts?
If your UPS says 1000 VA, that's far from being 1000 real watts.
Mostly you must divide VA by 2 to get near the real supported wattage.
I would not run your system on a 1000 VA unit.
At least 2000 VA to have some margin for peak spikes.

Here is the upsc output from my unit. I've removed the serial number, but the rest is straight from the system.

upsc apc1500ms

Init SSL without certificate database
battery.charge: 100
battery.charge.low: 10
battery.charge.warning: 50
battery.date: 2001/09/25
battery.mfr.date: 2025/01/11
battery.runtime: 2790
battery.runtime.low: 120
battery.type: PbAc
battery.voltage: 27.3
battery.voltage.nominal: 24.0
device.mfr: American Power Conversion
device.model: Back-UPS RS 1500MS
device.serial: 5B25****
device.type: ups
driver.debug: 0
driver.flag.allow_killpower: 0
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.synchronous: auto
driver.state: quiet
driver.version: 2.8.1
driver.version.data: APC HID 0.100
driver.version.internal: 0.52
driver.version.usb: libusb-1.0.28 (API: 0x100010a)
input.sensitivity: high
input.transfer.high: 144
input.transfer.low: 88
input.transfer.reason: input voltage out of range
input.voltage: 110.0
input.voltage.nominal: 120
ups.beeper.status: disabled
ups.delay.shutdown: 20
ups.firmware: 966.h4 .D
ups.firmware.aux: h4
ups.load: 17
ups.mfr: American Power Conversion
ups.mfr.date: 2025/01/11
ups.model: Back-UPS RS 1500MS
ups.productid: 0002
ups.realpower.nominal: 900
ups.serial: 5B25****
ups.status: OL
ups.test.result: No test initiated
ups.timer.reboot: 0
ups.timer.shutdown: -1
ups.vendorid: 051d

Based on this data and the model number, the unit is a 1500VA model with a nominal real power rating of 900 Watts (ups.realpower.nominal: 900). This gives it a power factor of 0.6. My current load is sitting at only 17% (ups.load: 17), which should provide a good amount of headroom for my needs and help accommodate any peak power draws.
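
The same arithmetic can be reproduced from the upsc fields. This is only a sketch: upsc_value and ups_summary are hypothetical helper names, and the 1500 VA rating is passed in by hand because it comes from the model name rather than from any field in the dump.

```shell
#!/bin/bash
# Read the value of one key from upsc-style "key: value" lines on stdin.
upsc_value() {
  awk -F': ' -v k="$1" '$1 == k { print $2; exit }'
}

# Power factor = rated watts / rated VA.
# Load in watts = ups.load percent of the rated watts.
ups_summary() {
  local dump="$1" rated_va="$2" watts load
  watts=$(printf '%s\n' "$dump" | upsc_value ups.realpower.nominal)
  load=$(printf '%s\n' "$dump" | upsc_value ups.load)
  awk -v w="$watts" -v va="$rated_va" -v l="$load" \
    'BEGIN { printf "power_factor=%.2f load_watts=%.0f\n", w / va, w * l / 100 }'
}

# Two fields copied from the dump above.
sample='ups.load: 17
ups.realpower.nominal: 900'
ups_summary "$sample" 1500   # prints: power_factor=0.60 load_watts=153

# On a live system the dump would come from the UPS itself, e.g.:
#   ups_summary "$(upsc apc1500ms 2>/dev/null)" 1500
```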

Regarding the battery health, I believe my usage patterns are helping to preserve it. The concern about deep discharging lead-acid batteries is very valid. Repeatedly discharging them too deeply can cause sulfation, where lead sulfate crystals build up and reduce the battery's ability to hold a charge. This seems to be a primary cause of premature battery failure.

The most common power issues I have are brief outages when a main breaker trips. The UPS only runs on battery for about five minutes in these cases. These short, infrequent events should count as shallow discharge cycles, which are much less damaging to the battery's long-term health than a full deep discharge.

For any predictable, longer-term outages, such as scheduled maintenance, I make it a point to shut down the protected equipment properly beforehand to avoid draining the battery at all. I think this approach, combined with the fact that the UPS isn't overloaded, should hopefully extend the battery's life and ensure it's ready when an unexpected outage occurs.

Thanks again for your valuable advice and for raising these important points.

devjorge · August 2025

great to see the data

battery.voltage: 27.3
battery.type: PbAc

If you didn't know: you have two 12 V batteries in series, and chemically they are old-style lead-acid.
Initiate shutdown when the voltage drops below 24.0 V (the nominal voltage).

ups.load: 17 @ 900W = 153 Watts usage when you measured.

Did you measure load when gaming?

The batteries drain dramatically quicker with double or triple the load, voltage drops faster.

devjorge · August 2025

Replacement-Battery-APC-Back-UPS-1500 B00PCSKZZU

9 Ah per battery.

In series, 24 V * 9 Ah = ~216 VAh (similar to watt-hours), but that is DC charge before conversion to AC at the outlet.
Subtract 20-30% for transformer losses and you have about 150 Wh remaining if you drain the battery completely.
At 17% load you can sustain almost an hour if you are lucky, but your batteries don't like that.

Maximum discharge for PbAc should never be more than 50%, so you can use at most 75 Wh without hurting your batteries.

Another math issue that lowers the number again:
We calculated VAh with the nominal voltage, but the voltage at full charge is about 27 V and the minimum can be as low as 21-22 V, depending on your UPS's cut-off voltage (mostly 10.5-11 V per battery). The batteries don't like voltages that low.

Voltage drops faster with more load, and the batteries must supply more amps to deliver the same watts, which adds a few percent more loss.
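
The estimate above can be written out as a small calculation. Treat it as a rough sketch: estimate_runtime is a hypothetical helper, the 70% efficiency is the pessimistic end of the 20-30% loss range, and the 153 W load is simply 17% of 900 W. It ignores the voltage sag under load described above, so real runtime would be shorter.

```shell
#!/bin/bash
# Rough usable-energy and runtime estimate for a lead-acid UPS battery pack.
# Arguments: nominal volts, amp-hours, conversion efficiency (0-1),
#            maximum depth of discharge (0-1), load in watts.
estimate_runtime() {
  awk -v v="$1" -v ah="$2" -v eff="$3" -v dod="$4" -v load="$5" 'BEGIN {
    stored    = v * ah          # energy held in the pack (Wh)
    delivered = stored * eff    # after conversion losses (Wh)
    usable    = delivered * dod # without exceeding the discharge limit (Wh)
    printf "stored_wh=%.0f delivered_wh=%.0f usable_wh=%.0f runtime_min=%.0f\n",
           stored, delivered, usable, usable / load * 60
  }'
}

# 24 V pack, 9 Ah, 70% efficiency, 50% depth of discharge, 153 W load.
estimate_runtime 24 9 0.70 0.50 153
# prints: stored_wh=216 delivered_wh=151 usable_wh=76 runtime_min=30
```

Under these assumptions the battery-friendly budget is roughly half an hour at the measured load, which is consistent with the "almost an hour if drained completely" figure.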

jason5545 · September 2025

@jason5545 said:

@jason5545 said:

@Falzo said:
@jason5545 thanks for reporting back in detail, lots of people don't do that when they find a solution.

Could you add infos about your final grub settings, vfio, blacklist/modprobe config as well?

Had my own run-ins with passthru and there is lots of rather useless suggestions floating around which makes it so hard to cut through and find the real solutions.

Will do it after I restore all my flightsim addons.

The Ultimate Guide to Proxmox VE Gaming VM Optimization and Troubleshooting

Date: August 2, 2025
Hardware Platform: AMD Ryzen 9 9950X3D (Host), NVIDIA RTX 4070 Ti Super (VM Passthrough), AMD iGPU (Used by LXC)
Project Goal: To resolve boot stability issues, performance bottlenecks, and random crashes in a Windows 11 gaming VM built on high-end hardware, achieving a near-bare-metal gaming experience.
Final Outcome: Achieved stable VM startup and shutdown, and resolved the pci_irq_handler crash error. In-game GPU utilization was boosted from an initial 30% to a stable 97%. The system bottleneck was successfully shifted from the CPU to the GPU, with both performance and system stability meeting expectations.

1. Root Cause Diagnosis: A Dual Dilemma of Stability and Performance

This project faced two core, interconnected challenges that required a systematic solution.

The Stability Challenge: Boot Failures and Random Crashes

Initial Fault: The VM would fail to start after a qm stop command or a crash, reporting a fatal error: kvm: ../hw/pci/pci.c:1654: pci_irq_handler: Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed.

Root Cause Analysis: This error indicates that KVM received an invalid Interrupt Request (IRQ) from the passthrough GPU. This is a classic "GPU Reset Bug" in VFIO environments. When the VM shuts down or crashes, the passthrough GPU is not correctly reset, leaving it in an unstable state and causing its next initialization attempt to fail. Further analysis revealed that unstable GPU overclocking was the direct cause of the initial crashes that triggered this reset bug.

The Performance Challenge: A Severe CPU Bottleneck

Initial Fault: Despite allocating 16 cores to the VM, game performance was poor, with GPU utilization hovering around a mere 30%.

Root Cause Analysis: The root of the problem was Proxmox's inability to correctly handle the asymmetrical CCD architecture of the Ryzen 9 9950X3D. The Windows VM's game threads were being randomly scheduled onto the high-frequency cores that lacked 3D V-Cache, or were frequently migrating between the two CCDs. This introduced significant latency, crippling performance in latency-sensitive games.

2. The Systematic Solution: From Host-Level to VM Fine-Tuning

We adopted a bottom-up, layered optimization strategy to ensure each step built a solid foundation for the next.

Phase 1: Proxmox Host-Level Configuration (The Bedrock)

These settings are prerequisites for any successful passthrough, aiming to establish a correct IOMMU environment and flexible driver management capabilities for the host.
GRUB Kernel Parameter Configuration:
Objective: To correctly enable IOMMU and improve device grouping, creating the necessary conditions for hardware passthrough.
Implementation: Edit /etc/default/grub and modify the GRUB_CMDLINE_LINUX_DEFAULT line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction"
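amd_iommu=on: Intended to enable the IOMMU on AMD platforms. On AMD the IOMMU is normally active by default once it is enabled in firmware, so this parameter is likely redundant but harmless.

iommu=pt: Enables passthrough mode for better performance.

pcie_acs_override=...: Breaks up non-ideal IOMMU groups, allowing devices like the GPU to be passed through independently. Note that this weakens the isolation between the devices that are split apart.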
Activation: Run update-grub and reboot the host.
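
After the reboot, the resulting grouping can be inspected through sysfs. The loop below is a common way to do this, offered as a sketch; list_iommu_groups is a name chosen here for illustration, and the optional argument exists only so the function can be pointed at a test directory.

```shell
#!/bin/bash
# List every PCI device together with its IOMMU group number.
# root defaults to the real sysfs path; an empty result usually means
# the IOMMU is not active.
list_iommu_groups() {
  local root="${1:-/sys/kernel/iommu_groups}" dev group
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue
    group="${dev%/devices/*}"   # strip the device part of the path
    group="${group##*/}"        # keep only the group number
    printf 'group %s: %s\n' "$group" "${dev##*/}"
  done | sort -V
}

list_iommu_groups
```

Ideally the GPU and its audio function end up in a group that contains nothing else the host still needs.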
Precise Kernel Module Management (The blacklist vs. softdep Decision):
Challenge: We needed vfio-pci to claim the NVIDIA dGPU at boot, while also allowing the host's amdgpu driver to load normally for the iGPU used by an LXC container.

The Wrong Approach: Using blacklist completely prevents a driver from loading. This would stop our hookscript from returning the GPU to the host driver after VM shutdown and would also prevent the iGPU from functioning.

The Correct Approach: Use softdep (soft dependency) to establish a driver loading priority.
Implementation:
Edit /etc/modprobe.d/vfio.conf:
# Specify the exact PCI IDs for the NVIDIA dGPU and its audio device
options vfio-pci ids=10de:2705,10de:22bb

# Create soft dependencies to ensure vfio-pci loads before any NVIDIA drivers
softdep nvidia pre: vfio-pci
softdep nvidia_drm pre: vfio-pci
Edit /etc/modprobe.d/pve-blacklist.conf (The Key Correction):
# Comment out the blacklisting of NVIDIA and AMD drivers to allow softdep to manage them
#blacklist nvidiafb
#blacklist nouveau
#blacklist nvidia
#blacklist radeon
#blacklist amdgpu

# Retain blacklisting for generic audio drivers to prevent conflicts with GPU HDMI audio
blacklist snd_hda_codec_hdmi
# ... other snd_hda_* drivers ...
Activation: Run update-initramfs -u and reboot the host.
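To check after the reboot that the driver priority actually took effect, the bound driver can be read straight from sysfs. This is only a sketch under assumptions: bound_driver and its second argument are hypothetical names introduced here, and the device addresses are the ones used elsewhere in this post.

```shell
# Print the kernel driver currently bound to a PCI device, or "none".
# The second argument (an alternative sysfs root) is a hypothetical
# convenience for testing; it defaults to /sys.
bound_driver() (
    dev="$1"
    sysfs="${2:-/sys}"
    link="$sysfs/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
)
```

On a setup like the one described, bound_driver 0000:01:00.0 would be expected to report vfio-pci for the NVIDIA card, while the iGPU's address should still report amdgpu.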
Automated Hookscript for Seamless Driver Handoff:

Objective: Before the VM starts, automatically unbind the GPU from the host and hand it to vfio-pci. After the VM shuts down, automatically return the GPU to the host driver and trigger a reset, permanently curing the "Reset Bug".

Implementation: Create a hookscript at /var/lib/vz/snippets/gpu-manager.sh and apply it to the VM with qm set 100 --hookscript local:snippets/gpu-manager.sh. (Script contents are detailed below).

Phase 2: Virtual Machine Level Configuration (The Performance Leap)

With a stable host foundation, we performed precision surgery on the VM itself.

CPU Core Pinning and NUMA Optimization (The Decisive Battle):

Data-Driven Decision: We used lscpu -e and cat /sys/devices/system/cpu/cpu*/cache/index3/size to precisely identify that the V-Cache CCD (with 96MB of L3 cache) resides on physical cores 0-7 (logical threads 0-7, 16-23).

Precision Pinning (affinity): We set affinity: 2-7,18-23, locking the VM's 12 vCPUs onto six physical cores of the V-Cache CCD and their SMT siblings. We deliberately reserved cores 0/1 and their SMT siblings (threads 16/17) for the host to handle I/O and interrupts.

Enable Virtual NUMA (numa: 1): This makes the Windows guest OS aware that it is running on a single, tightly-coupled NUMA node. This optimizes its internal scheduling policy, ensuring game workloads do not stray outside the V-Cache domain.
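
The cache lookup and the affinity arithmetic above can be sketched as two small helpers. This is an illustration under assumptions, not the exact commands we ran: the names l3_domains and count_cpus and the optional directory argument are hypothetical; the sysfs files size and shared_cpu_list under cache/index3 are standard.

```shell
# Group logical CPUs by the L3 cache they share, one "<size> <cpu list>" per domain.
# The optional argument (alternative CPU directory) is only there for dry runs.
l3_domains() (
    cpu_root="${1:-/sys/devices/system/cpu}"
    for idx in "$cpu_root"/cpu[0-9]*/cache/index3; do
        [ -r "$idx/size" ] || continue
        printf '%s %s\n' "$(cat "$idx/size")" "$(cat "$idx/shared_cpu_list")"
    done | sort -u    # every CPU in a domain reports the same line
)

# Count the logical CPUs named by an affinity string such as "2-7,18-23".
count_cpus() (
    total=0
    for range in $(printf '%s' "$1" | tr ',' ' '); do
        first="${range%-*}"    # "2-7" -> 2, a single "5" stays 5
        last="${range#*-}"     # "2-7" -> 7, a single "5" stays 5
        total=$((total + last - first + 1))
    done
    echo "$total"
)
```

The domain with the larger size is the V-Cache CCD, and its CPU list is what the affinity string should be a subset of. count_cpus "2-7,18-23" prints 12, which should match the cores value of the VM.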

Memory and Peripheral Performance Optimization (Consolidating Gains):

hugepages: 2: Backs guest memory with 2MB hugepages, which cuts TLB misses and page-table overhead and so reduces memory access latency.

balloon: 0: Disables memory ballooning to guarantee a stable memory supply.

discard=on, iothread=1: Enables TRIM and an I/O thread for our ZFS storage, boosting disk performance.
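
As a rough sanity check for the hugepages setting, the number of pages the VM needs can be worked out and compared with what the host reports. A minimal sketch, assuming the 32768 MB memory value from the final configuration; hugepages_needed and hugepages_status are hypothetical helper names.

```shell
# Pages needed to back a VM: memory in MiB divided by page size in MiB, rounded up.
hugepages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Show the host's hugepage counters from a meminfo-style file
# (defaults to /proc/meminfo; the argument exists only for dry runs).
hugepages_status() {
    grep -E '^HugePages_(Total|Free):' "${1:-/proc/meminfo}"
}
```

hugepages_needed 32768 2 gives 16384, so while the VM is running HugePages_Total on the host should be at least that large.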

GPU Passthrough Stability Tuning (A Discussion on rombar=0):

Parameter's Function: Adding rombar=0 to the hostpci0 parameter hides the GPU's Option ROM from the guest's memory map. This can circumvent known firmware compatibility issues on certain NVIDIA 30/40 series cards and is a powerful tool for resolving stubborn crashes.

Special Note: In this specific case, we discovered that the direct trigger for the VM crashes was excessive GPU overclocking. After removing the overclock, the system became stable. Therefore, rombar=0 was not necessary in this scenario. This provides a crucial lesson: before resorting to low-level workarounds like rombar=0, higher-level instability factors such as overclocking, cooling, and driver versions should be ruled out first.

3. Conclusion and Final Configuration

This optimization project proves that by deeply understanding the underlying hardware architecture and applying systematic configuration, it is entirely possible to build a stable, high-performance, top-tier gaming VM on Proxmox VE. The keys to success were:

A Layered Problem-Solving Approach: First, ensure host-level stability and correctness, then optimize for performance at the virtual machine level.

Data-Driven Precision Tuning: Abandon guesswork and use system utilities to precisely locate the V-Cache cores.

Flexible Driver Management: Correctly use softdep and a hookscript to perfectly resolve the GPU reset bug and conflicts in a mixed-GPU setup.

Top-Down Troubleshooting: When addressing stability, start with the application layer (overclocking) before considering low-level solutions (rombar).

In the end, we not only resolved all initial issues but also established a "Golden Standard" configuration for this high-end hardware, laying a solid foundation for future virtualization endeavors.

Appendix I: Final "Golden Standard" Configuration File (/etc/pve/qemu-server/100.conf)
# --- CPU & NUMA ---
# Pin VM cores to the V-Cache CCD (cores 2-7 and their SMT counterparts)
affinity: 2-7,18-23
cpu: host,hidden=1
cores: 12
sockets: 1
numa: 1

# --- Memory ---
balloon: 0
hugepages: 2
memory: 32768

# --- Passthrough Hardware & Automation ---
bios: ovmf
hookscript: local:snippets/gpu-manager.sh
hostpci0: 0000:01:00,pcie=1,x-vga=1
vga: none
usb0: host=1d6b:0104

# --- Storage ---
scsihw: virtio-scsi-single
virtio0: local-zfs:vm-100-disk-0,discard=on,iothread=1,size=300G
# ... other virtio disks ...

# --- System & Boot ---
boot: order=virtio0;ide0;net0
efidisk0: local-zfs:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
ide0: none,media=cdrom
machine: pc-q35-9.2+pve1
name: Windows
net0: virtio=BC:24:11:65:B7:FD,bridge=vmbr0,firewall=1
onboot: 1
ostype: win11
smbios1: uuid=64ea7622-afea-4da0-bbb5-9947461cac19
tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0

Appendix II: Automated Driver Management Hookscript (/var/lib/vz/snippets/gpu-manager.sh)

This script is the core component for achieving a seamless and stable handoff of the NVIDIA dGPU between the host and the VM. It resolves the "Reset Bug" that causes the VM to fail to restart after shutdown.

Purpose and Principle:
* Before VM Start (pre-start): The script forcibly unbinds the passthrough GPU devices from their host drivers (e.g., nvidia) and ensures they are ready to be claimed by the vfio-pci driver.
* After VM Stop (post-stop): The script releases the GPU from vfio-pci and then triggers a rescan of the host's PCI bus. This prompts the host's nvidia driver to reclaim and completely re-initialize (reset) the GPU, returning it to a clean state, ready for the next VM boot.
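
The post-stop step relies on a bus rescan to get the GPU picked up by a host driver again. If that does not happen on a given host, one possible alternative is to ask the driver core to probe the device explicitly through /sys/bus/pci/drivers_probe. The sketch below is an illustration under assumptions, not part of the script in this appendix: reprobe_device and its second argument are hypothetical names, and a rebind can only succeed if a matching host driver (for example nvidia) is actually installed and loaded.

```shell
# Ask the kernel to look for a driver for one PCI device.
# Usage: reprobe_device 0000:01:00.0
# The second argument (alternative sysfs root) is a hypothetical hook for dry runs.
reprobe_device() (
    dev="$1"
    sysfs="${2:-/sys}"
    # Writing a device address to drivers_probe triggers a driver match attempt.
    printf '%s\n' "$dev" > "$sysfs/bus/pci/drivers_probe"
)
```

This would be called once per entry in GPU_DEVICES, after driver_override has been cleared and the device has been released from vfio-pci.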

Script Content:
#!/bin/bash
# Proxmox VE VM Hookscript for GPU Passthrough Driver Management

# --- USER CONFIGURATION ---
# Please fill in the PCI addresses of your GPU's functions (video and audio).
# You can use `lspci -nns <GPU_BUS_ID>` (e.g., `lspci -nns 01:00`) to find all functions.
GPU_DEVICES="0000:01:00.0 0000:01:00.1"

# --- LOGGING CONFIGURATION ---
LOG_FILE="/var/log/pve/qemu-server/hookscript.log"

# --- SCRIPT BODY ---
VM_ID=$1
PHASE=$2

log_echo() {
    # Append a timestamped line to the log; quoting the path keeps the redirect safe.
    echo "$(date '+%Y-%m-%d %H:%M:%S') - VM $VM_ID - $PHASE:" "$@" >> "$LOG_FILE"
}

log_echo "Hookscript triggered."

if [ "$PHASE" == "pre-start" ]; then
    log_echo "Unbinding GPU devices from host drivers..."
    for DEV in $GPU_DEVICES; do
        # To ensure QEMU can take over, we forcibly override the driver to vfio-pci.
        # This works even if the device is not currently bound to any driver.
        echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
        # If the device is already bound to a driver (like nvidia, nouveau), unbind it first.
        if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
            log_echo "Device $DEV is bound to $(basename $(readlink /sys/bus/pci/devices/$DEV/driver)). Unbinding..."
            echo "$DEV" > /sys/bus/pci/devices/$DEV/driver/unbind
        fi
    done
    # Ensure the vfio-pci module is loaded
    modprobe -i vfio-pci
    log_echo "GPU devices are now ready for vfio-pci."

elif [ "$PHASE" == "post-stop" ]; then
    log_echo "Rebinding GPU devices to host drivers..."
    for DEV in $GPU_DEVICES; do
        # Clear the vfio-pci driver override
        echo "" > /sys/bus/pci/devices/$DEV/driver_override
        # If the device is still bound to vfio-pci, unbind it
        if [ -e /sys/bus/pci/devices/$DEV/driver ] && [ "$(basename $(readlink /sys/bus/pci/devices/$DEV/driver))" == "vfio-pci" ]; then
            log_echo "Device $DEV is bound to vfio-pci. Unbinding..."
            echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/unbind
        fi
    done

    # Trigger a PCI bus rescan. This is the most critical step!
    # This prompts the host kernel to re-discover the "ownerless" GPU devices,
    # allowing the appropriate host driver (nvidia) to bind to and initialize them,
    # thereby completing the hardware reset.
    log_echo "Triggering PCI bus rescan to re-initialize GPU..."
    echo 1 > /sys/bus/pci/rescan
    log_echo "GPU devices have been returned to the host."
fi

exit 0
Installation and Usage Instructions:

Save the Script: Save the content above to /var/lib/vz/snippets/gpu-manager.sh on your Proxmox host.

Grant Execute Permissions: In the host shell, run the following command to make the script executable.
chmod +x /var/lib/vz/snippets/gpu-manager.sh

Apply to the Virtual Machine: Run the following command to associate the script with your Windows VM (using ID 100 as an example).
qm set 100 --hookscript local:snippets/gpu-manager.sh

Verify (Optional): After starting and stopping your VM, you can check the log file to confirm the script executed as expected.
cat /var/log/pve/qemu-server/hookscript.log
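
For a quicker overview than reading the whole log, the "Hookscript triggered." lines can be counted per phase. A small sketch, assuming the log format written by log_echo above; summarize_hook_log is a hypothetical helper name, and the four phase names are the ones Proxmox passes to hookscripts.

```shell
# Count how often the hookscript fired in each phase.
# The optional argument overrides the log path used by the script above.
summarize_hook_log() (
    log="${1:-/var/log/pve/qemu-server/hookscript.log}"
    [ -r "$log" ] || { echo "no readable log at $log" >&2; exit 1; }
    for phase in pre-start post-start pre-stop post-stop; do
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        count=$(grep -c -- "- $phase: Hookscript triggered\." "$log" || true)
        printf '%s %s\n' "$phase" "$count"
    done
)
```

After one clean start/stop cycle, each phase should have been counted once; a pre-start without a matching post-stop hints at a shutdown where the GPU was never handed back.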

This is my final config as of now. I will keep posting here if I find better parameters.

Follow-up: why nvidia-drm.modeset=0 makes GPU passthrough steadier

Quick TL;DR: setting nvidia-drm.modeset=0 disables NVIDIA’s DRM/KMS on the host, so the host won’t grab the card for consoles/plymouth/Wayland. That keeps the GPU “clean” for VFIO, which in practice reduces “device is busy”, failed resets, and flaky re-binds after a VM shuts down.

What it actually changes

With KMS on (modeset=1), nvidia_drm registers a DRM device and the host can light up /dev/dri/* for fbcon, plymouth, or a display manager.

With KMS off (modeset=0), nvidia_drm doesn’t expose KMS, so the host is far less likely to touch the card. VFIO can claim it early and keep it.

Why this helps VFIO

Fewer conflicts from fbcon/Wayland/plymouth touching the GPU.

Better odds that resets succeed and the card is re-usable across VM start/stop cycles.

Especially helpful when the host uses a different adapter for display (e.g., an AMD iGPU) and the NVIDIA card is only for passthrough.

Trade-offs

No KMS on the host for that NVIDIA card (Wayland/DRM features won’t work on it).

Not an issue if the host has another display adapter or is headless.

Check your current state

Is KMS enabled for nvidia_drm? (Y = on, N = off)

cat /sys/module/nvidia_drm/parameters/modeset

Do DRM nodes exist for this GPU?

ls -l /dev/dri/ || true

Did the kernel get the parameter?

cat /proc/cmdline
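
If you would rather script the last check than read the command line by eye, something like the following works. It is only a sketch: cmdline_has is a hypothetical helper, the file argument exists for dry runs, and the comparison is exact, so nvidia_drm.modeset=0 (with an underscore) would need to be checked separately.

```shell
# Succeed if the kernel command line contains exactly the given parameter.
cmdline_has() (
    param="$1"
    file="${2:-/proc/cmdline}"
    set -f                          # no glob expansion while splitting words
    for word in $(cat "$file"); do
        [ "$word" = "$param" ] && exit 0
    done
    exit 1
)
```

Example use: cmdline_has nvidia-drm.modeset=0 && echo "parameter is active".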

Enable it (Debian/Proxmox example)

sudo nano /etc/default/grub

Add nvidia-drm.modeset=0 (keep your existing IOMMU flags)

Example:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt nvidia-drm.modeset=0"

sudo update-grub

If you use systemd-boot on Proxmox, add the parameter to /etc/kernel/cmdline instead of /etc/default/grub, then run:

sudo proxmox-boot-tool refresh
sudo reboot

(Optional) Bind the card to vfio-pci early
This part avoids softdep/blacklist.

Find your device IDs

lspci -nn | grep -i nvidia

Then set them (replace 10de:XXXX,10de:YYYY with your IDs)

sudo nano /etc/modprobe.d/vfio.conf

options vfio-pci ids=10de:XXXX,10de:YYYY disable_vga=1

Rebuild the initramfs if your distro requires it (update-initramfs -u on Debian/Proxmox), then reboot.
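
To avoid copying the IDs by hand, the ids= value can be derived from the lspci output. This is a hedged sketch, not a required step: vfio_ids_from_lspci is a hypothetical helper that reads lspci -nn output on stdin and keeps only the [vendor:device] pairs of NVIDIA lines.

```shell
# Turn `lspci -nn` output into a comma-separated vendor:device list for vfio-pci.
vfio_ids_from_lspci() {
    grep -i nvidia \
        | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | tr -d '[]' \
        | awk '!seen[$0]++' \
        | paste -sd, -
}

# Example (commented out so nothing runs when this snippet is sourced):
# echo "options vfio-pci ids=$(lspci -nn | vfio_ids_from_lspci) disable_vga=1"
```

The class codes that lspci -nn also prints, such as [0300], have no colon and are therefore skipped; duplicate IDs from identical cards are collapsed while the original order is kept. Check the result before writing it to /etc/modprobe.d/vfio.conf, since every NVIDIA device on the host will be included.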

In short, nvidia-drm.modeset=0 keeps the host from “owning” the card via KMS, so VFIO can take exclusive control early. If your host display runs on another GPU (iGPU/AMD), this is a near-zero-cost stability win for passthrough.
