
MSFS on Proxmox with GPU Passthrough (DXGI HANG) - $100 Bounty

Hello LET Community,

I'm reaching out for some help with my new homelab. I'm trying to move away from bare-metal Windows, because frankly, Windows is just a mess as a virtualization host. My goal is to use Proxmox VE as my main OS and run a high-performance Windows VM for gaming.

As detailed in my previous thread, "Overcoming Hardware Hurdles: My Journey From Rented Servers to an ITX Homelab", I've got the hardware sorted.


The Problem

I've successfully configured VFIO and passed my RTX 4070 Ti Super to a Windows VM. The guest OS recognizes the GPU correctly, but Microsoft Flight Simulator is giving me major headaches:

  • Microsoft Flight Simulator 2024: Fails during loading with a DXGI_ERROR_DEVICE_HUNG error.
  • Microsoft Flight Simulator 2020: This ran when I tested it last year, but the performance was terrible with unplayably low FPS.

Both simulators run flawlessly on the same machine with a bare-metal Windows install, so I'm confident the issue lies within my Proxmox/VM configuration.


My Setup

Here's a look at the hardware and software I'm using:

  • Motherboard: ROG STRIX B650-I
  • CPU: AMD Ryzen 9 9950X3D
  • GPU (Passthrough): ASUS Dual RTX 4070 Ti Super OC 16G
  • RAM: 96GB
  • Storage: Two GM7 4TB M.2 drives, a BX500 1TB, plus an MX500 500GB boot drive
  • Hypervisor: Proxmox VE 8.4

The Bounty: $100 via PayPal

I've been stuck on this for a while and I'm willing to reward whoever can solve it.

I am offering a $100 bounty via PayPal to the person who provides the specific steps, configuration, or solution that gets Microsoft Flight Simulator running smoothly in the VM.

Any advice on what to try next would be greatly appreciated.

Thank you!

Thanked by 1jolo22

Comments

  • CfrCfr Member
    edited July 2025

    Could it maybe be that you're only passing part of the GPU "functions" through and not the entire PCI-E device?

    Thanked by 2jason5545 tux
  • A few things I'd check: verify that you've stopped Proxmox from using your GPU, and that you are passing through all GPU functions (usually the GPU itself at *.0 and its audio device at *.1).

  • SlowDDSlowDD Member
    edited July 2025

    this might be an easy fix, which pci-e slot are you using on the motherboard?

    nvm, it only has 1 pci-e port

  • g519g519 Member

    Do other 3D apps/games work in the VM? For example, do the 3DMark tests complete, or does something like Unigine Heaven work?

  • ralfralf Member

    @jason5545 said: Fails during loading with a DXGI_ERROR_DEVICE_HUNG error

    Sadly, this is code for "shrug dunno mate". Could literally be anything. Faulty VRAM, not enough VRAM, timeout, invalid operation, too much put on the command queue, etc...

    As @james50a said, there's almost certainly an audio device there (for over HDMI), but possibly other things too.

    I'm assuming you've blacklisted the PCI device on the host so that it doesn't load the module for it. If you've not done that, you'll have problems passing it to the guest.

    Have a look using lspci on the host to find the card (ideally before blacklisting the device and trying to share it with the guest). Once you've found it, have a look at its IRQ, e.g. cat /sys/bus/pci/devices/0000:00:03.2/irq, then find any devices with the same IRQ, e.g. grep 26 /sys/bus/pci/devices/*/irq, and make sure they're also blacklisted and forwarded to the guest. Also forward anything that looks like it's on the same device level, because otherwise you risk having both host and guest attempting to drive the same part of the GPU via different devices.

    You might have to dump the VGA firmware and manually add that to the guest too. Without that, maybe it can present as a dumb VGA card but crash as soon as something tries to render using the GPU. I'd have thought Windows would have already used the GPU before it got to that point though.
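
    A minimal sketch of that check, assuming a Linux host with sysfs mounted. The function names and the SYSFS_PCI override are hypothetical, introduced only for illustration. Unlike the grep above, the IRQ comparison is exact, so searching for 26 will not also match 126.

```shell
# Sketch only: SYSFS_PCI is a hypothetical override so the functions can be
# pointed at a test directory instead of the real sysfs tree.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print every function of one PCI slot, e.g. "0000:01:00" -> .0 (video), .1 (audio).
pci_functions() {
    local slot="$1" dev
    for dev in "$SYSFS_PCI/$slot".*; do
        if [ -e "$dev" ]; then
            basename "$dev"
        fi
    done
    return 0
}

# Print every PCI device whose irq file contains exactly the given number.
pci_devices_with_irq() {
    local irq="$1" dev
    for dev in "$SYSFS_PCI"/*; do
        [ -r "$dev/irq" ] || continue
        if [ "$(cat "$dev/irq")" = "$irq" ]; then
            basename "$dev"
        fi
    done
    return 0
}
```

    For example, pci_functions 0000:01:00 lists the functions that would all need to go to the guest together; the slot address here is the one from the VM config posted further down the thread.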

  • I could never understand why someone would want their daily driver to be a VM host with VM passthrough. Get a fucking server to do server things, get a workstation to do workstation things.

    Less fucking mess.

  • Mind running dxdiag to see if anything is not supported but is supposed to be?

    Thanked by 1jason5545
  • you need the right bios in your vm, or the nvidia drivers lock down because consumer cards are not allowed to run in a vm. that's the most probable cause of your error message, or pcie passthru is not set up correctly. this can be a mess.
    Are you hoping you'd get better performance than with windows directly installed? forget it. you will have microstuttering and audio de-syncs... it's just not worth the hassle… keep windows for playing... on a separate disk... boot it when you wanna play. else stick to linux as desktop... for working!

  • AdvinAdvin Member, Host Rep

    @devjorge said:
    you need the right bios in your vm, or the nvidia drivers lock down because consumer cards are not allowed to run in a vm. that's the most probable cause of your error message, or pcie passthru is not set up correctly. this can be a mess.
    Are you hoping you'd get better performance than with windows directly installed? forget it. you will have microstuttering and audio de-syncs... it's just not worth the hassle… keep windows for playing... on a separate disk... boot it when you wanna play. else stick to linux as desktop... for working!

    As far as I'm aware, it's fine to pass through a consumer card to a VM. The problem is splitting it across multiple VMs.

    Thanked by 2tentor host_c
  • As far as I'm aware, it's fine to pass through a consumer card to a VM. The problem is splitting it across multiple VMs.

    it had been years and I had to look it up again: you need a UEFI bios in the vm (OVMF) or the nvidia drivers will not work properly. there might be some kernel boot flag missing, like iommu. your mainboard must have full IOMMU support enabled (VT-d on Intel, AMD-Vi on AMD) and so on.

    https://pve.proxmox.com/wiki/PCI_Passthrough
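
    To see whether the IOMMU is actually active and how devices are grouped, the groups can be listed from sysfs. This is a minimal sketch, assuming the usual /sys/kernel/iommu_groups layout; the function names and the IOMMU_ROOT override are hypothetical. If the first function prints nothing, the IOMMU is most likely disabled in firmware or on the kernel command line.

```shell
# Sketch only: IOMMU_ROOT is a hypothetical override used for testing.
IOMMU_ROOT="${IOMMU_ROOT:-/sys/kernel/iommu_groups}"

# Print "group <n>: <pci-address>" for every device in every IOMMU group.
list_iommu_groups() {
    local dev group
    for dev in "$IOMMU_ROOT"/*/devices/*; do
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<addr>"
        group="${group##*/}"        # keep only the group number
        echo "group $group: ${dev##*/}"
    done
    return 0
}

# Print every device that shares an IOMMU group with the given PCI address.
# Everything listed here has to be passed through (or left unused) together.
iommu_group_members() {
    local addr="$1" g
    for g in "$IOMMU_ROOT"/*; do
        if [ -e "$g/devices/$addr" ]; then
            ls "$g/devices"
        fi
    done
    return 0
}
```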

    Thanked by 1jason5545
  • devjorgedevjorge Member
    edited July 2025

    blacklisting drivers (nvidia, nouveau, vesa) can help. if linux loads gpu drivers, your vm will not run correctly, or passthru will not work at all or will drop out.

    no brag but i have better cpu and gpu and MS2024 runs, let's say: it runs... but MS is a beast. I don't wanna know how it feels in a vm... when you tested with 2020 and fps was bad, why do you think it'll run better with the new fs? :D

    Thanked by 1jason5545
  • @g519 said:
    Do other 3d apps/games work in the vm? Like does 3dmark tests complete or something like unigine heaven work?

    Yes. 3D Mark is fine.

  • For some reason, LET is not sending reply notifications to my Gmail. Thanks for all the replies.

  • @devjorge said:
    blacklisting drivers (nvidia, nouveau, vesa) can help. if linux loads gpu drivers, your vm will not run correctly, or passthru will not work at all or will drop out.

    no brag but i have better cpu and gpu and MS2024 runs, let's say: it runs... but MS is a beast. I don't wanna know how it feels in a vm... when you tested with 2020 and fps was bad, why do you think it'll run better with the new fs? :D

    Asobo Studio said they used an updated engine in 2024. That's why I'm giving it a second try.

  • raindog308raindog308 Administrator, Veteran

    @ralf said: Sadly, this is code for "shrug dunno mate".

    In my experience, MSFS is more finicky than normal GPU-heavy games, which are already pretty finicky.

    @devjorge said: no brag but i have better cpu and gpu and MS2024 runs lets's say: it runs... but MS is a beast.

    So if you have a 4080+ video card, MS2024 should (pun incoming) fly, no? I haven't upgraded from 2020 myself.. You may not be able to turn up rendering every single blade of grass but...req for MS2024 is a 3070.

  • @Kevinf100 said:
    Mind running dxdiag to see if anything is not supported but suppose to be?

    Will post when I get home. Thanks

  • this is what worked for me for Windows GPU passthru with proxmox
    basically, you have to blacklist all the modules for the devices you pass through
    and set vfio (if I remember correctly)

    root@pve:/etc/modprobe.d# cat kvm.conf 
    options kvm ignore_msrs=1
    
    
    root@pve:/etc/modprobe.d# cat pve-blacklist.conf 
    # This file contains a list of modules which are not supported by Proxmox VE 
    
    # nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
    blacklist nvidiafb
    blacklist nouveau
    blacklist nvidia*
    blacklist snd_hda_intel
    
    
    root@pve:/etc/modprobe.d# cat vfio.conf 
    options vfio-pci ids=10de:1b80,10de:10f0 disable_vga=1
    
    
    lspci -nnn
    lspci -v
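
    The ids= values are specific to each card (the ones above belong to the poster's own GPU), so they should be read from lspci on the host rather than copied. Below is a small sketch that builds the line from lspci -nn output. vfio_ids_line is a hypothetical helper name, and the command in the comment assumes the GPU sits at 01:00 as in the VM config posted later in the thread.

```shell
# Sketch only: read `lspci -nn` output on stdin and print a vfio-pci options
# line containing every [vendor:device] pair found.
# Use plain -nn (not -v), otherwise subsystem IDs would be picked up as well.
#
# Assumed usage on the host:
#   lspci -nn -s 01:00 | vfio_ids_line
vfio_ids_line() {
    local ids
    # [10de:2705] style tokens are 4 hex digits, a colon, 4 hex digits;
    # class codes such as [0300] have no colon and are skipped.
    ids="$(grep -o '\[[0-9a-f]\{4\}:[0-9a-f]\{4\}\]' | tr -d '[]' | paste -sd, -)"
    echo "options vfio-pci ids=$ids"
}
```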
    
    Thanked by 1jason5545
  • @devjorge said:
    blacklisting drivers (nvidia, nouveau, vesa) can help. if linux loads gpu drivers, your vm will not run correctly, or passthru will not work at all or will drop out.

    no brag but i have better cpu and gpu and MS2024 runs, let's say: it runs... but MS is a beast. I don't wanna know how it feels in a vm... when you tested with 2020 and fps was bad, why do you think it'll run better with the new fs? :D

    @EagleCorals said:
    this is what worked for me for Windows GPU passthru with proxmox
    basically, you have to blacklist all the modules for the devices you pass through
    and set vfio (if I remember correctly)

    root@pve:/etc/modprobe.d# cat kvm.conf
    options kvm ignore_msrs=1


    root@pve:/etc/modprobe.d# cat pve-blacklist.conf
    # This file contains a list of modules which are not supported by Proxmox VE

    # nvidiafb see bugreport https://bugzilla.proxmox.com/show_bug.cgi?id=701
    blacklist nvidiafb
    blacklist nouveau
    blacklist nvidia*
    blacklist snd_hda_intel


    root@pve:/etc/modprobe.d# cat vfio.conf
    options vfio-pci ids=10de:1b80,10de:10f0 disable_vga=1


    lspci -nnn
    lspci -v

    Here is my current vfio config
    cat /etc/modprobe.d/vfio.conf
    options vfio-pci ids=10de:2705,10de:22bb
    softdep nouveau pre: vfio-pci
    softdep nvidia pre: vfio-pci
    softdep nvidiafb pre: vfio-pci
    softdep nvidia_drm pre: vfio-pci
    softdep drm pre: vfio-pci

    Will give blacklist a shot and report back, thanks

  • @Kevinf100 said:
    Mind running dxdiag to see if anything is not supported but suppose to be?


    dxdiag shows everything enabled (sorry that it isn't in English). Just FYI, the sudoVDA entry is there because I'm using Apollo. Thanks.

  • ralfralf Member
    edited July 2025

    @raindog308 said:

    @ralf said: Sadly, this is code for "shrug dunno mate".

    In my experience, MSFS is more finicky than normal GPU-heavy games, which are already pretty finicky.

    Hehe, I just meant that I encounter DXGI_ERROR_DEVICE_HUNG multiple times per day (I'm a graphics developer for games, doing lots of hacky experimental things with raytracing) and it's literally just signalling that the GPU gave up doing whatever it was doing with an error, but provided no information as to why. Back in the day, that mostly indicated a GPU crash or a shader that never terminated, but there can be many reasons now, e.g. writing to a bound render target that got unmapped somehow, etc. All it really tells you is that something went wrong and the GPU firmware didn't know how to recover.

  • @jason5545 said:

    Here is my current vfio config
    cat /etc/modprobe.d/vfio.conf
    options vfio-pci ids=10de:2705,10de:22bb
    softdep nouveau pre: vfio-pci
    softdep nvidia pre: vfio-pci
    softdep nvidiafb pre: vfio-pci
    softdep nvidia_drm pre: vfio-pci
    softdep drm pre: vfio-pci

    Will give blacklist a shot and report it back,, thanks

    I think you really need to blacklist the devices.
    I'm not partitioning the GPU between multiple VMs, just passing it through to one VM.
    Worked for me, with the settings I posted.
    Installed nvidia drivers, played games and all.
    One thing you need is either a connected monitor, or a dummy hdmi plug that simulates a plugged monitor, if you want to play remote (sunshine/moonlight streaming).

    Thanked by 1jason5545
  • @Cfr said:
    Could it maybe be that you're only passing part of the GPU "functions" through and not the entire PCI-E device?

    The first DXGI HANG issue has been solved, I forgot to tick the PCI Express tickbox. My bad.
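
    A quick way to catch that mistake from the shell, as a sketch: the function below (a hypothetical helper, not a Proxmox tool) reads a VM config on stdin and prints any hostpci entry that lacks pcie=1. The option corresponds to the PCI-Express checkbox in the GUI and, per the Proxmox documentation, requires the q35 machine type.

```shell
# Sketch only: print hostpci lines from a Proxmox VM config (stdin) that do
# not carry pcie=1. No output means every passed-through device is presented
# to the guest as a PCIe device.
#
# Assumed usage on the host:
#   hostpci_missing_pcie < /etc/pve/qemu-server/100.conf
hostpci_missing_pcie() {
    grep '^hostpci[0-9]*:' | grep -v 'pcie=1' || true
}
```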

    @EagleCorals said:

    I think you really need to blacklist the devices.
    I'm not partitioning the GPU between multiple VMs, just passing it through to one VM.
    Worked for me, with the settings I posted.
    Installed nvidia drivers, played games and all.
    One thing you need is either a connected monitor, or a dummy hdmi plug that simulates a plugged monitor, if you want to play remote (sunshine/moonlight streaming).

    Now it's working, but I've run into the problem described in https://forum.level1techs.com/t/good-benchmarks-poor-gaming-performance-w-rtx-4090-vfio-proxmox/215088: my GPU usage is low, which results in low FPS. I'm guessing it's a core pinning issue, maybe?
    But at least it's improving.

    Here's my vm conf if it helps

    G:/home/l# cat /etc/pve/qemu-server/100.conf
    affinity: 0-15
    bios: ovmf
    boot: order=virtio0;ide0;net0
    cores: 16
    cpu: host
    efidisk0: local-zfs:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
    hostpci0: 0000:01:00,pcie=1,x-vga=1
    ide0: none,media=cdrom
    machine: pc-q35-9.2+pve1
    memory: 32768
    meta: creation-qemu=9.2.0,ctime=1752965706
    name: Windows
    net0: virtio=BC:24:11:65:B7:FD,bridge=vmbr0,firewall=1
    numa: 1
    onboot: 1
    ostype: win11
    scsihw: virtio-scsi-single
    smbios1: uuid=64ea7622-afea-4da0-bbb5-9947461cac19
    sockets: 1
    tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
    usb0: host=1d6b:0104
    virtio0: local-zfs:vm-100-disk-0,iothread=1,size=300G
    virtio1: local-zfs:vm-100-disk-3,discard=on,iothread=1,size=750G
    virtio2: local-zfs:vm-100-disk-4,discard=on,iothread=1,size=150G
    virtio3: local-zfs:vm-100-disk-5,discard=on,iothread=1,size=150G
    args: -machine hpet=off -cpu 'host,topoext=on'

  • CfrCfr Member

    @jason5545 said:

    @Cfr said:
    Could it maybe be that you're only passing part of the GPU "functions" through and not the entire PCI-E device?

    The first DXGI HANG issue has been solved, I forgot to tick the PCI Express tickbox. My bad.

    Does this qualify me for the $100 bounty? 😄

    Kidding, hope you'll get it to work properly

    Thanked by 1jason5545
  • I fixed it with the help of Perplexity Deep Research and Google Gemini.

    Proxmox VE Gaming VM: In-Depth Performance & Stability Optimization Analysis Report

    Date: August 2, 2025
    Project Objective: To resolve severe performance degradation and system instability for a Windows 11 gaming VM running on Proxmox VE, equipped with an AMD Ryzen 9 9950X3D CPU and an NVIDIA RTX 4070 Ti SUPER GPU.
    Initial State: In-game GPU utilization was critically low (~30%), performance was a fraction of its potential, and the VM was subject to random, unrecoverable crashes.
    Final Outcome: GPU utilization under heavy load now consistently reaches 97%, achieving near-native gaming performance. The virtual machine operates with complete long-term stability.


    1. Core Problem Diagnosis

    The initial analysis concluded that the root cause was not a failure of the GPU passthrough itself, but rather a severe CPU bottleneck compounded by a critical stability issue. Despite allocating a significant number of cores to the VM, the virtualization layer was failing to correctly manage the Ryzen 9 9950X3D's asymmetric dual-CCD layout (one CCD with 3D V-Cache, one higher-clocked CCD without it).

    This manifested in two primary problems:

    1. Erroneous Core Scheduling: The guest OS (Windows) scheduler was randomly assigning latency-sensitive game threads to the non-V-Cache CCD or, worse, allowing threads to migrate between the two functionally distinct CCDs. This cross-CCD traversal introduced massive latency, crippling game performance.
    2. Resource Contention & Jitter: VM vCPUs were competing for the same physical core resources as the Proxmox host's system tasks, leading to context-switching overhead and performance jitter.

    Furthermore, the random VM crashes pointed towards a lower-level hardware interrupt handling or firmware compatibility conflict.

    2. Systematic Optimization Strategy & Implementation

    A multi-layered optimization strategy was employed, addressing performance bottlenecks first, followed by stability hardening.

    Phase 1: CPU Resource Layer Optimization (The Decisive Factor)

    This phase was the most critical and yielded the most significant performance gains. The objective was to isolate and dedicate the specialized 3D V-Cache CCD exclusively to the gaming VM.

    1. Precision V-Cache Core Identification: Using lscpu -e and cat /sys/devices/system/cpu/cpu0/cache/index3/size, we performed a data-driven verification to conclusively identify the 96MB 3D V-Cache CCD as being located on physical cores 0-7 (logical threads 0-7 & 16-23). This eliminated all guesswork and formed the foundation for precision tuning.
    2. Precision Core Pinning (affinity): The VM was configured with affinity: 2-7,18-23. This locked the VM's 12 vCPUs squarely onto the V-Cache CCD, while strategically reserving cores 0 and 1 for the Proxmox host to handle system interrupts, thereby minimizing host-guest interference.
    3. Enabling Virtual NUMA (numa: 1): This setting exposed the tightly-coupled core group to the guest OS as a single NUMA node. This allowed the Windows scheduler to optimize its internal thread and memory placement policies, ensuring all critical game computations remained within the fast V-Cache domain.
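
    The identification and pinning steps above can be sketched as a small script. It assumes the standard Linux sysfs cache layout; CPU_ROOT and both function names are hypothetical, introduced only so the logic can be exercised against a test tree.

```shell
# Sketch only: CPU_ROOT is a hypothetical override used for testing.
CPU_ROOT="${CPU_ROOT:-/sys/devices/system/cpu}"

# Print one line per distinct L3 cache domain: "<cpus sharing it> <size>".
# On a dual-CCD X3D part, the domain with the larger size is the V-Cache CCD.
l3_domains() {
    local c
    for c in "$CPU_ROOT"/cpu[0-9]*; do
        [ -r "$c/cache/index3/size" ] || continue
        echo "$(cat "$c/cache/index3/shared_cpu_list") $(cat "$c/cache/index3/size")"
    done | sort -u
}

# Count the logical CPUs in a list such as "2-7,18-23", to confirm that the
# affinity setting covers as many CPUs as the VM has vCPUs.
cpulist_count() {
    local list="$1" part lo hi n=0
    local IFS=,
    for part in $list; do
        lo="${part%-*}"   # "2-7" -> 2, "5" -> 5
        hi="${part#*-}"   # "2-7" -> 7, "5" -> 5
        n=$(( n + hi - lo + 1 ))
    done
    echo "$n"
}
```

    Here cpulist_count "2-7,18-23" gives 12, which matches the cores: 12 setting in the final configuration.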

    Phase 2: Memory & I/O Subsystem Optimization (Foundation Hardening)

    1. HugePages (hugepages: 2): Enabled 2MB memory pages to significantly reduce TLB (Translation Lookaside Buffer) pressure and lower memory access latency for the large memory allocation.
    2. Memory Ballooning (balloon: 0): Disabled dynamic memory reclamation by the host, guaranteeing a stable and non-revocable memory pool for the VM, which is critical for gaming.
    3. Storage I/O Optimization: Enabled discard=on (TRIM/UNMAP) and iothread=1 for all ZFS-based virtual disks to improve long-term SSD performance and I/O throughput.
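
    As a sanity check on the hugepages setting, the arithmetic is simple enough to script. This is a sketch with hypothetical helper names; it only computes how many pages a given allocation implies and reads the free-page counter from /proc/meminfo-style input (that counter refers to the default huge page size, which is usually 2 MiB on x86-64).

```shell
# Sketch only: number of huge pages needed for a VM of <mem_mib> MiB when
# each page is <page_mib> MiB, rounded up.
hugepages_needed() {
    local mem_mib="$1" page_mib="$2"
    echo $(( (mem_mib + page_mib - 1) / page_mib ))
}

# Print the HugePages_Free value from /proc/meminfo content on stdin.
# Assumed usage on the host:
#   hugepages_free < /proc/meminfo
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }'
}
```

    For the configuration in this report, hugepages_needed 32768 2 gives 16384, i.e. the 32 GiB VM needs 16384 pages of 2 MiB each.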

    Phase 3: Stability Troubleshooting (Critical Error Resolution)

    With performance addressed, we tackled the VM crashes, which presented with a fatal kvm: ... pci_irq_handler: Assertion ... failed. error.

    1. Failure Analysis: The error message indicated that the passthrough GPU was sending a malformed or invalid Interrupt Request (IRQ) to the KVM hypervisor.
    2. Root Cause Identification: This was diagnosed as the well-known "ROM BAR issue" affecting modern NVIDIA GPUs (30/40 series) in VFIO environments. A firmware-level incompatibility between the GPU's onboard Option ROM and the KVM hypervisor was causing an improper initialization state, which was triggered under heavy load.
    3. Resolution (rombar=0): By adding rombar=0 to the hostpci0 directive, we instructed Proxmox to completely ignore the GPU's Option ROM. This bypassed the firmware conflict, allowing the in-guest NVIDIA driver to initialize the hardware from a "clean slate," thereby providing full, stable control and completely resolving the crashes.

    4. Final Results & Performance Comparison

    | Performance Metric | Before Optimization | After Optimization | Status |
    |---|---|---|---|
    | GPU Utilization (In-Game) | ~30% | 97% | ✅ Significant Uplift |
    | System Bottleneck | CPU (Incorrect Scheduling) | GPU (Ideal State) | ✅ Successfully Shifted |
    | System Stability | Random Crashes (IRQ Error) | Rock-Solid, Long-Term | ✅ Problem Eradicated |
    | Subjective Gaming Experience | Stuttering, Low FPS | Fluid, Near-Native | ✅ Transformative |

    5. Conclusion

    This project was a textbook success in advanced VFIO tuning. It demonstrates that achieving high-performance, stable gaming on Proxmox VE is entirely feasible but requires a deep understanding and precise control of the underlying hardware architecture.

    The success was predicated on:
    1. Data-Driven Decisions: Precisely locating the V-Cache CCD with system commands rather than relying on assumptions.
    2. A Holistic Approach: Addressing both the macro-level strategic problem (CPU core pinning) and the micro-level but critical detail (the rombar=0 stability fix).
    3. Systematic Elimination: Methodically identifying and resolving each bottleneck, from performance to stability.

    The final result is a "Golden Standard" configuration for this high-end hardware combination, establishing a robust and powerful foundation for any virtualization task.


    Appendix: Final Golden Standard Configuration (/etc/pve/qemu-server/100.conf)

    # --- CPU & NUMA ---
    affinity: 2-7,18-23
    cpu: host,hidden=1
    cores: 12
    sockets: 1
    numa: 1
    
    # --- Memory ---
    balloon: 0
    hugepages: 2
    memory: 32768
    
    # --- Passthrough Hardware ---
    bios: ovmf
    hostpci0: 0000:01:00,pcie=1,x-vga=1,rombar=0
    vga: none
    usb0: host=1d6b:0104
    
    # --- Storage ---
    scsihw: virtio-scsi-single
    virtio0: local-zfs:vm-100-disk-0,discard=on,iothread=1,size=300G
    virtio1: local-zfs:vm-100-disk-3,discard=on,iothread=1,size=750G
    virtio2: local-zfs:vm-100-disk-4,discard=on,iothread=1,size=150G
    virtio3: local-zfs:vm-100-disk-5,discard=on,iothread=1,size=150G
    
    # --- System & Boot ---
    boot: order=virtio0;ide0;net0
    efidisk0: local-zfs:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
    ide0: none,media=cdrom
    machine: pc-q35-9.2+pve1
    meta: creation-qemu=9.2.0,ctime=1752965706
    name: Windows
    net0: virtio=BC:24:11:65:B7:FD,bridge=vmbr0,firewall=1
    onboot: 1
    ostype: win11
    smbios1: uuid=64ea7622-afea-4da0-bbb5-9947461cac19
    tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
    
    Thanked by 3Falzo xemaps tall_ice
  • @Cfr said:

    @jason5545 said:

    @Cfr said:
    Could it maybe be that you're only passing part of the GPU "functions" through and not the entire PCI-E device?

    The first DXGI HANG issue has been solved, I forgot to tick the PCI Express tickbox. My bad.

    Does this qualify me for the $100 bounty? 😄

    Kidding, hope you'll get it to work properly

    If you don't mind taking some part of it, feel free to PM your PayPal account.

    @devjorge said:

    blacklisting drivers (nvidia, nouveau, vesa) can help. if linux loads gpu drivers, your vm will not run correctly, or passthru will not work at all or will drop out.

    no brag but i have better cpu and gpu and MS2024 runs, let's say: it runs... but MS is a beast. I don't wanna know how it feels in a vm... when you tested with 2020 and fps was bad, why do you think it'll run better with the new fs? :D

    You're welcome to PM too.

    Thanked by 2Falzo devjorge
  • @jason5545 thanks for reporting back in detail, lots of people don't do that when they find a solution.

    Could you add info about your final grub settings, vfio, blacklist/modprobe config as well?

    Had my own run-ins with passthru and there is lots of rather useless suggestions floating around which makes it so hard to cut through and find the real solutions.

    Thanked by 2jason5545 xemaps
  • @Falzo said:
    @jason5545 thanks for reporting back in detail, lots of people don't do that when they find a solution.

    Could you add infos about your final grub settings, vfio, blacklist/modprobe config as well?

    Had my own run-ins with passthru and there is lots of rather useless suggestions floating around which makes it so hard to cut through and find the real solutions.

    Will do it after I restore all my flightsim addons. :wink:

    Thanked by 1Falzo
  • jason5545jason5545 Member
    edited August 2025

    @jason5545 said:

    @Falzo said:
    @jason5545 thanks for reporting back in detail, lots of people don't do that when they find a solution.

    Could you add infos about your final grub settings, vfio, blacklist/modprobe config as well?

    Had my own run-ins with passthru and there is lots of rather useless suggestions floating around which makes it so hard to cut through and find the real solutions.

    Will do it after I restore all my flightsim addons. :wink:


    The Ultimate Guide to Proxmox VE Gaming VM Optimization and Troubleshooting

    Date: August 2, 2025
    Hardware Platform: AMD Ryzen 9 9950X3D (Host), NVIDIA RTX 4070 Ti Super (VM Passthrough), AMD iGPU (Used by LXC)
    Project Goal: To resolve boot stability issues, performance bottlenecks, and random crashes in a Windows 11 gaming VM built on high-end hardware, achieving a near-bare-metal gaming experience.
    Final Outcome: Achieved stable VM startup and shutdown, and resolved the pci_irq_handler crash error. In-game GPU utilization was boosted from an initial 30% to a stable 97%. The system bottleneck was successfully shifted from the CPU to the GPU, with both performance and system stability meeting expectations.


    1. Root Cause Diagnosis: A Dual Dilemma of Stability and Performance

    This project faced two core, interconnected challenges that required a systematic solution.

    1. The Stability Challenge: Boot Failures and Random Crashes

      • Initial Fault: The VM would fail to start after a qm stop command or a crash, reporting a fatal error: kvm: ../hw/pci/pci.c:1654: pci_irq_handler: Assertion '0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
      • Root Cause Analysis: This error indicates that KVM received an invalid Interrupt Request (IRQ) from the passthrough GPU. This is a classic "GPU Reset Bug" in VFIO environments. When the VM shuts down or crashes, the passthrough GPU is not correctly reset, leaving it in an unstable state and causing its next initialization attempt to fail. Further analysis revealed that unstable GPU overclocking was the direct cause of the initial crashes that triggered this reset bug.
    2. The Performance Challenge: A Severe CPU Bottleneck

      • Initial Fault: Despite allocating 16 cores to the VM, game performance was poor, with GPU utilization hovering around a mere 30%.
      • Root Cause Analysis: The root of the problem was Proxmox's inability to correctly handle the asymmetrical CCD architecture of the Ryzen 9 9950X3D. The Windows VM's game threads were being randomly scheduled onto the high-frequency cores that lacked 3D V-Cache, or were frequently migrating between the two CCDs. This introduced significant latency, crippling performance in latency-sensitive games.

    2. The Systematic Solution: From Host-Level to VM Fine-Tuning

    We adopted a bottom-up, layered optimization strategy to ensure each step built a solid foundation for the next.

    Phase 1: Proxmox Host-Level Configuration (The Bedrock)

    These settings are prerequisites for any successful passthrough, aiming to establish a correct IOMMU environment and flexible driver management capabilities for the host.

    1. GRUB Kernel Parameter Configuration:

      • Objective: To correctly enable IOMMU and improve device grouping, creating the necessary conditions for hardware passthrough.
      • Implementation: Edit /etc/default/grub and modify the GRUB_CMDLINE_LINUX_DEFAULT line:

        GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction"
        
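        • amd_iommu=on: Intended to enable the IOMMU on AMD platforms. On current kernels the AMD IOMMU is already enabled by default once it is switched on in the firmware, so this flag is generally redundant.
        • iommu=pt: Puts host-owned devices into IOMMU passthrough (identity-mapped) mode, which reduces DMA translation overhead on the host.
        • pcie_acs_override=...: Breaks up non-ideal IOMMU groups, allowing devices like the GPU to be passed through independently. The split groups no longer reflect real hardware isolation, so it is best used only when the native grouping is unsuitable.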
      • Activation: Run update-grub and reboot the host.

    2. Precise Kernel Module Management (The blacklist vs. softdep Decision):

      • Challenge: We needed vfio-pci to claim the NVIDIA dGPU at boot, while also allowing the host's amdgpu driver to load normally for the iGPU used by an LXC container.
      • The Wrong Approach: Using blacklist completely prevents a driver from loading. This would stop our hookscript from returning the GPU to the host driver after VM shutdown and would also prevent the iGPU from functioning.
      • The Correct Approach: Use softdep (soft dependency) to establish a driver loading priority.
      • Implementation:

        • Edit /etc/modprobe.d/vfio.conf:

          # Specify the exact PCI IDs for the NVIDIA dGPU and its audio device
          options vfio-pci ids=10de:2705,10de:22bb
          
          # Create soft dependencies to ensure vfio-pci loads before any NVIDIA drivers
          softdep nvidia pre: vfio-pci
          softdep nvidia_drm pre: vfio-pci
          
        • Edit /etc/modprobe.d/pve-blacklist.conf (The Key Correction):

          # Comment out the blacklisting of NVIDIA and AMD drivers to allow softdep to manage them
          #blacklist nvidiafb
          #blacklist nouveau
          #blacklist nvidia
          #blacklist radeon
          #blacklist amdgpu
          
          # Retain blacklisting for generic audio drivers to prevent conflicts with GPU HDMI audio
          blacklist snd_hda_codec_hdmi
          # ... other snd_hda_* drivers ...
          
      • Activation: Run update-initramfs -u and reboot the host.
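
    Once the host is back up, the driver that actually claimed each GPU function can be read from sysfs. A minimal sketch follows; bound_driver is a hypothetical helper name, and its optional second argument (an alternative sysfs root) exists only to make the function testable.

```shell
#!/bin/bash
# Print the name of the kernel driver bound to a PCI device, or "none".
# Usage: bound_driver <pci-address> [sysfs-devices-root]
bound_driver() {
    local dev="$1" root="${2:-/sys/bus/pci/devices}"
    local link="$root/$dev/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

    On this setup, bound_driver 0000:01:00.0 and bound_driver 0000:01:00.1 would be expected to print vfio-pci if the softdep ordering worked. lspci -nnk -s 01:00 shows the same information in its "Kernel driver in use" line.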

    3. Automated Hookscript for Seamless Driver Handoff:

      • Objective: Before the VM starts, automatically unbind the GPU from the host and hand it to vfio-pci. After the VM shuts down, automatically return the GPU to the host driver and trigger a reset, which resolved the "Reset Bug" in this setup.
      • Implementation: Create a hookscript at /var/lib/vz/snippets/gpu-manager.sh and apply it to the VM with qm set 100 --hookscript local:snippets/gpu-manager.sh. (Script contents are detailed below).

    Phase 2: Virtual Machine Level Configuration (The Performance Leap)

    With a stable host foundation, we performed precision surgery on the VM itself.

    1. CPU Core Pinning and NUMA Optimization (The Decisive Battle):

      • Data-Driven Decision: We used lscpu -e and cat /sys/devices/system/cpu/cpu*/cache/index3/size to precisely identify that the V-Cache CCD (with 96MB of L3 cache) resides on physical cores 0-7 (logical threads 0-7, 16-23).
      • Precision Pinning (affinity): We set affinity: 2-7,18-23, locking the VM's 12 vCPUs onto six physical cores of the V-Cache CCD (cores 2-7) and their SMT siblings (threads 18-23). We deliberately reserved cores 0/1 and their SMT siblings (threads 16/17) for the host to handle I/O and interrupts.
      • Enable Virtual NUMA (numa: 1): This makes the Windows guest OS aware that it is running on a single, tightly-coupled NUMA node. This optimizes its internal scheduling policy, ensuring game workloads do not stray outside the V-Cache domain.
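
    The identification step can be condensed into a few lines. This is a sketch under assumptions: l3_domains is an illustrative name, the optional argument only redirects the sysfs root for testing, and it relies on the shared_cpu_list file that Linux exposes next to the size file queried above.

```shell
#!/bin/bash
# Print each distinct L3 cache domain as "<logical CPUs> <L3 size>".
# Usage: l3_domains [sysfs-cpu-root]
l3_domains() {
    local root="${1:-/sys/devices/system/cpu}" cache
    for cache in "$root"/cpu[0-9]*/cache/index3; do
        [ -r "$cache/shared_cpu_list" ] || continue
        printf '%s %s\n' "$(cat "$cache/shared_cpu_list")" "$(cat "$cache/size")"
    done | sort -u   # every CPU in a domain reports the same line, so deduplicate
}
```

    On the host described here, the domain reporting the larger L3 size would be the one listing threads 0-7,16-23, which is the set the affinity value is drawn from.
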
    2. Memory and Peripheral Performance Optimization (Consolidating Gains):

      • hugepages: 2: Backs the VM's memory with 2MB hugepages, which reduces TLB misses and therefore memory access latency.
      • balloon: 0: Disables memory ballooning to guarantee a stable memory supply.
      • discard=on, iothread=1: Enables TRIM and an I/O thread for our ZFS storage, boosting disk performance.
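
    As a quick sanity check on the hugepages setting, the number of pages the host has to supply follows directly from the VM's memory size. The sketch below is illustrative; hugepages_needed and hugepages_free are hypothetical helper names, and the optional file argument exists only for testing.

```shell
#!/bin/bash
# Number of hugepages needed to back a VM: ceil(memory / page size), both in MiB.
# Usage: hugepages_needed <memory-MiB> <page-size-MiB>
hugepages_needed() {
    local mem_mib="$1" page_mib="$2"
    echo $(( (mem_mib + page_mib - 1) / page_mib ))
}

# Hugepages currently free on the host, read from /proc/meminfo.
# Usage: hugepages_free [meminfo-file]
hugepages_free() {
    awk '/^HugePages_Free:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

    For the configuration used here (memory: 32768 with 2MB pages), hugepages_needed 32768 2 prints 16384.
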
    3. GPU Passthrough Stability Tuning (A Discussion on rombar=0):

      • Parameter's Function: Adding rombar=0 to the hostpci0 parameter hides the GPU's Option ROM from the guest's memory map. This can circumvent known firmware compatibility issues on certain NVIDIA 30/40 series cards and is a powerful tool for resolving stubborn crashes.
      • Special Note: In this specific case, we discovered that the direct trigger for the VM crashes was excessive GPU overclocking. After removing the overclock, the system became stable. Therefore, rombar=0 was not necessary in this scenario. This provides a crucial lesson: before resorting to low-level workarounds like rombar=0, higher-level instability factors such as overclocking, cooling, and driver versions should be ruled out first.

    3. Conclusion and Final Configuration

    This optimization project proves that by deeply understanding the underlying hardware architecture and applying systematic configuration, it is entirely possible to build a stable, high-performance, top-tier gaming VM on Proxmox VE. The keys to success were:

    1. A Layered Problem-Solving Approach: First, ensure host-level stability and correctness, then optimize for performance at the virtual machine level.
    2. Data-Driven Precision Tuning: Abandon guesswork and use system utilities to precisely locate the V-Cache cores.
    3. Flexible Driver Management: Correctly use softdep and a hookscript to perfectly resolve the GPU reset bug and conflicts in a mixed-GPU setup.
    4. Top-Down Troubleshooting: When addressing stability, start with the application layer (overclocking) before considering low-level solutions (rombar).

    In the end, we not only resolved all initial issues but also established a "Golden Standard" configuration for this high-end hardware, laying a solid foundation for future virtualization endeavors.


    Appendix I: Final "Golden Standard" Configuration File (/etc/pve/qemu-server/100.conf)

    # --- CPU & NUMA ---
    # Pin VM cores to the V-Cache CCD (cores 2-7 and their SMT counterparts)
    affinity: 2-7,18-23
    cpu: host,hidden=1
    cores: 12
    sockets: 1
    numa: 1
    
    # --- Memory ---
    balloon: 0
    hugepages: 2
    memory: 32768
    
    # --- Passthrough Hardware & Automation ---
    bios: ovmf
    hookscript: local:snippets/gpu-manager.sh
    hostpci0: 0000:01:00,pcie=1,x-vga=1
    vga: none
    usb0: host=1d6b:0104
    
    # --- Storage ---
    scsihw: virtio-scsi-single
    virtio0: local-zfs:vm-100-disk-0,discard=on,iothread=1,size=300G
    # ... other virtio disks ...
    
    # --- System & Boot ---
    boot: order=virtio0;ide0;net0
    efidisk0: local-zfs:vm-100-disk-1,efitype=4m,pre-enrolled-keys=1,size=1M
    ide0: none,media=cdrom
    machine: pc-q35-9.2+pve1
    name: Windows
    net0: virtio=BC:24:11:65:B7:FD,bridge=vmbr0,firewall=1
    onboot: 1
    ostype: win11
    smbios1: uuid=64ea7622-afea-4da0-bbb5-9947461cac19
    tpmstate0: local-zfs:vm-100-disk-2,size=4M,version=v2.0
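
    One consistency check worth automating is that the number of logical CPUs in the affinity list matches the cores value. A minimal sketch, assuming the plain "key: value" layout shown above; conf_get and count_cpus are hypothetical helper names, not Proxmox tools.

```shell
#!/bin/bash
# Print the value of the first "key: value" line in a VM config file.
# Usage: conf_get <key> <config-file>
conf_get() {
    awk -v key="$1" 'index($0, key ": ") == 1 { print substr($0, length(key) + 3); exit }' "$2"
}

# Count the logical CPUs in a list such as "2-7,18-23".
count_cpus() {
    local list="$1" total=0 part lo hi old_ifs="$IFS"
    IFS=','
    for part in $list; do
        lo="${part%-*}"   # text before the dash (or the whole entry)
        hi="${part#*-}"   # text after the dash (or the whole entry)
        total=$(( total + hi - lo + 1 ))
    done
    IFS="$old_ifs"
    echo "$total"
}
```

    With the file above, count_cpus "$(conf_get affinity /etc/pve/qemu-server/100.conf)" and conf_get cores /etc/pve/qemu-server/100.conf should both print 12.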
    

    Appendix II: Automated Driver Management Hookscript (/var/lib/vz/snippets/gpu-manager.sh)

    This script is the core component for achieving a seamless and stable handoff of the NVIDIA dGPU between the host and the VM. It resolves the "Reset Bug" that causes the VM to fail to restart after shutdown.

    Purpose and Principle:
    * Before VM Start (pre-start): The script forcibly unbinds the passthrough GPU devices from their host drivers (e.g., nvidia) and ensures they are ready to be claimed by the vfio-pci driver.
    * After VM Stop (post-stop): The script releases the GPU from vfio-pci and then triggers a rescan of the host's PCI bus. This prompts the host's nvidia driver to reclaim and completely re-initialize (reset) the GPU, returning it to a clean state, ready for the next VM boot.

    Script Content:

    #!/bin/bash
    # Proxmox VE VM Hookscript for GPU Passthrough Driver Management
    
    # --- USER CONFIGURATION ---
    # Please fill in the PCI addresses of your GPU's functions (video and audio).
    # You can use `lspci -nns <GPU_BUS_ID>` (e.g., `lspci -nns 01:00`) to find all functions.
    GPU_DEVICES="0000:01:00.0 0000:01:00.1"
    
    # --- LOGGING CONFIGURATION ---
    LOG_FILE="/var/log/pve/qemu-server/hookscript.log"
    
    # --- SCRIPT BODY ---
    VM_ID=$1
    PHASE=$2
    
    log_echo() {
        echo "$(date '+%Y-%m-%d %H:%M:%S') - VM $VM_ID - $PHASE:" "$@" >> "$LOG_FILE"
    }
    
    log_echo "Hookscript triggered."
    
    if [ "$PHASE" == "pre-start" ]; then
        log_echo "Unbinding GPU devices from host drivers..."
        for DEV in $GPU_DEVICES; do
            # To ensure QEMU can take over, we forcibly override the driver to vfio-pci.
            # This works even if the device is not currently bound to any driver.
            echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
            # If the device is already bound to a driver (like nvidia, nouveau), unbind it first.
            if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
                log_echo "Device $DEV is bound to $(basename $(readlink /sys/bus/pci/devices/$DEV/driver)). Unbinding..."
                echo "$DEV" > /sys/bus/pci/devices/$DEV/driver/unbind
            fi
        done
        # Ensure the vfio-pci module is loaded
        modprobe -i vfio-pci
        log_echo "GPU devices are now ready for vfio-pci."
    
    elif [ "$PHASE" == "post-stop" ]; then
        log_echo "Rebinding GPU devices to host drivers..."
        for DEV in $GPU_DEVICES; do
            # Clear the vfio-pci driver override
            echo "" > /sys/bus/pci/devices/$DEV/driver_override
            # If the device is still bound to vfio-pci, unbind it
            if [ -e /sys/bus/pci/devices/$DEV/driver ] && [ "$(basename $(readlink /sys/bus/pci/devices/$DEV/driver))" == "vfio-pci" ]; then
                log_echo "Device $DEV is bound to vfio-pci. Unbinding..."
                echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/unbind
            fi
        done
    
        # Trigger a PCI bus rescan. This is the most critical step!
        # This prompts the host kernel to re-discover the "ownerless" GPU devices,
        # allowing the appropriate host driver (nvidia) to bind to and initialize them,
        # thereby completing the hardware reset.
        log_echo "Triggering PCI bus rescan to re-initialize GPU..."
        echo 1 > /sys/bus/pci/rescan
        log_echo "GPU devices have been returned to the host."
    fi
    
    exit 0
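
    Instead of hard-coding GPU_DEVICES, the list of functions can be derived from the slot address. A minimal sketch; pci_functions is a hypothetical helper, and the optional second argument only redirects the sysfs root for testing.

```shell
#!/bin/bash
# Print every PCI function (domain:bus:device.function) that belongs to a slot.
# Usage: pci_functions <domain:bus:device> [sysfs-devices-root]
pci_functions() {
    local slot="$1" root="${2:-/sys/bus/pci/devices}" path
    for path in "$root/$slot".*; do
        [ -e "$path" ] || continue
        printf '%s\n' "${path##*/}"
    done
}
```

    In the script above, GPU_DEVICES="$(pci_functions 0000:01:00 | tr '\n' ' ')" would then yield the same two addresses on this hardware, assuming the card exposes only the video and audio functions.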
    

    Installation and Usage Instructions:

    1. Save the Script: Save the content above to /var/lib/vz/snippets/gpu-manager.sh on your Proxmox host.
    2. Grant Execute Permissions: In the host shell, run the following command to make the script executable.
      chmod +x /var/lib/vz/snippets/gpu-manager.sh
    3. Apply to the Virtual Machine: Run the following command to associate the script with your Windows VM (using ID 100 as an example).
      qm set 100 --hookscript local:snippets/gpu-manager.sh
    4. Verify (Optional): After starting and stopping your VM, you can check the log file to confirm the script executed as expected.
      cat /var/log/pve/qemu-server/hookscript.log
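
    Before the first VM start, a quick pre-flight check can catch a missing execute bit or a syntax error. A sketch only; check_hookscript is a hypothetical helper name, and bash -n merely parses the script without running it.

```shell
#!/bin/bash
# Verify that a hookscript exists, is executable, and parses cleanly.
# Usage: check_hookscript <path-to-script>
check_hookscript() {
    local script="$1"
    [ -f "$script" ] || { echo "missing: $script"; return 1; }
    [ -x "$script" ] || { echo "not executable: $script"; return 1; }
    bash -n "$script" || { echo "syntax error: $script"; return 1; }
    echo "ok: $script"
}
```

    For the script in this post, that would be check_hookscript /var/lib/vz/snippets/gpu-manager.sh.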

    This is the final config as of now. I will continue to post here if I find better parameters.

  • 👍 Thank you! This topic will certainly help with multiple (x)GPU environments on Proxmox!
    But I don't know of anyone who hosts Ryzen + GPU so far. Hope it's coming.
    As a second choice (?!), heavy players could install Windows and games directly on the Ryzen.

  • @xemaps said:
    👍 Thank you! This topic will certainly help with multiple (x)GPU environments on Proxmox!
    But I don't know of anyone who hosts Ryzen + GPU so far. Hope it's coming.
    As a second choice (?!), heavy players could install Windows and games directly on the Ryzen.

    I'm just very frustrated with Windows: spyware, unwanted/buggy Windows Updates, and I even got memory leaks from VMware Workstation just because I have a dual-GPU setup. Imagine if I hadn't disabled my iGPU: VMware's s**tty graphical stack would have used up all the RAM in just three days. :'(
