Decrease Fail Over time in Proxmox HA
Hi,
I have a 2-node Proxmox cluster with a test VM on it that has replication enabled along with HA.
So the VM's disks are replicated at a set interval, and whenever a node goes down, HA moves the VM to the other node and starts it.
But failover is taking ~5 minutes, which is slow in my opinion.
I have already tried using a hardware watchdog instead of the software watchdog that Proxmox HA uses by default, but there is no improvement.
Both nodes are connected via LAN.
Any pointers/suggestions on how to speed up the Failover process?

Comments
/etc/pve/datacenter.cfg:
migration: type=insecure
Create a blank VM on the 2nd node, set it to boot from the replicated disks, and set your monitor to start this VM when the main one fails and stop it when the main one is back online.
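A minimal sketch of that monitor idea, not a tested failover tool. Everything named here is an assumption for illustration: the address `MAIN_NODE_IP`, the VMID `STANDBY_VMID`, the polling interval, and the failure threshold. It probes the main node rather than the VM's own IP, since the standby would answer on the VM's IP once it boots from the replicated disks. It only relies on `ping` and on `qm start` / `qm stop` being available on the 2nd node.

```python
import subprocess
import time
from typing import Callable

MAIN_NODE_IP = "192.0.2.10"  # hypothetical: address of the node hosting the main VM
STANDBY_VMID = 9100          # hypothetical: VMID of the blank standby VM
CHECK_INTERVAL_S = 5         # assumed polling interval in seconds
FAILS_BEFORE_START = 3       # assumed: consecutive failed probes before acting


def ping_ok(ip: str) -> bool:
    """Return True if a single ping to ip is answered within 1 second (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def qm(action: str) -> None:
    """Run `qm start` or `qm stop` for the standby VM on the local node."""
    subprocess.run(["qm", action, str(STANDBY_VMID)], check=False)


class StandbyMonitor:
    """Start the standby after N failed probes in a row; stop it when the probe succeeds again."""

    def __init__(
        self,
        probe: Callable[[], bool],
        control: Callable[[str], None],
        fails_before_start: int = FAILS_BEFORE_START,
    ) -> None:
        self.probe = probe
        self.control = control
        self.threshold = fails_before_start
        self.fails = 0
        self.standby_running = False

    def tick(self) -> None:
        """Run one probe and start or stop the standby if the state changed."""
        if self.probe():
            self.fails = 0
            if self.standby_running:
                self.control("stop")
                self.standby_running = False
        else:
            self.fails += 1
            if self.fails >= self.threshold and not self.standby_running:
                self.control("start")
                self.standby_running = True


def run_forever() -> None:
    """Poll the main node forever; meant to run on the 2nd node."""
    monitor = StandbyMonitor(lambda: ping_ok(MAIN_NODE_IP), qm)
    while True:
        monitor.tick()
        time.sleep(CHECK_INTERVAL_S)
```

Calling `run_forever()` on the 2nd node would start the loop. With the assumed values the standby starts roughly 15 seconds after the main node stops answering. Note that nothing here fences the main node, so a network blip between the nodes could leave both VMs running.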
Proxmox docs do say to expect it to take a couple minutes:
But I don't know how to get it down from 5 minutes to 2 minutes.
This helped me save ~15-20 sec
I was hoping to get Proxmox's HA failover to be faster rather than writing my own script to monitor and start/stop VMs. But that's a last resort I might fall back on if I cannot get Proxmox's failover to be fast enough.
The cluster should take at most 2 minutes to declare a node offline and start recovering the VM. Starting the VM adds another 5-10 s, but that is just the start itself. If the VM is slow to boot, that adds more time.
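As a rough worked example of that arithmetic: taking the worst-case figures quoted above (2 minutes to declare the node offline, 10 s to start the VM), whatever is left of an observed failover time would have to come from elsewhere, such as guest boot. The function name is made up for illustration.

```python
DETECT_MAX_S = 120  # "max 2 minutes" to declare the node offline
START_MAX_S = 10    # upper end of the quoted 5-10 s to start the VM


def unexplained_seconds(observed_total_s: float) -> float:
    """Time left over after worst-case detection and VM start (e.g. guest boot)."""
    return observed_total_s - DETECT_MAX_S - START_MAX_S


# A 5-minute failover leaves 170 s that detection and VM start cannot account for.
print(unexplained_seconds(5 * 60))  # prints 170
```

Since 2 minutes is a maximum, the real leftover can only be larger than this.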
Is your 5 minutes the time it takes for the VM to be available on the other node, or the time it takes for the VM to come online (booted up and responding to ping)?
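One way to pin down the second number is to time the outage as seen from a third machine: wait until the VM stops answering pings, then measure how long it takes to answer again. A small sketch, assuming a Linux `ping` and a hypothetical VM address `VM_IP`. The function names are illustrative.

```python
import subprocess
import time
from typing import Callable

VM_IP = "192.0.2.50"  # hypothetical: address of the HA-managed VM


def ping_ok(ip: str) -> bool:
    """Return True if a single ping to ip is answered within 1 second (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def measure_outage(
    probe: Callable[[], bool],
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds between the first failed probe and the next successful one."""
    # Phase 1: keep polling while the VM still answers.
    while probe():
        sleep(interval_s)
    went_down = clock()
    # Phase 2: keep polling until the VM answers again.
    while not probe():
        sleep(interval_s)
    return clock() - went_down
```

Start `measure_outage(lambda: ping_ok(VM_IP))` first, then take the node down. The result is accurate to about one polling interval, and the call blocks until the VM is reachable again.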
I think it's working like that now: it takes ~2.5 min on average for the VM to start responding to pings. Earlier it was taking more than 3-4 minutes.
But can we reduce even further the time the cluster takes to declare a node offline?
It is hard-coded in Proxmox. You could find and replace it if you like, but a shorter fence delay or watchdog timeout is riskier, because the node might be able to recover and you would have 2 VMs running at the same time.
The solution is using shared storage at the expense of budget and complexity.