High Availability & the Low End

How does everyone implement high availability for their hosted services, if at all? I am specifically asking users; this is NOT about what kind of HA providers offer. For example, I haven't personally found virtual/failover IPs in any offers here so far. Do you just go with DNS round-robin, or do you use Cloudflare Zero Trust tunnel replicas, or perhaps an entry point with a truly HA cloud provider?

What about storage? Do you use Ceph or something else, or do you just not bother with HA in any way whatsoever?

Basically I am asking for rough sketches of how people make sure their stuff stays available in the face of low end uptimes :)

Comments

  • donli Member

    @silun said:
    How does everyone implement high availability for their hosted services, if at all? [...]

    I only run stuff on the low end for which a 24-hour downtime isn't critical.

  • The only services I consider critical are DNS and email, so I just have multiple DNS and email servers (and those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other). That way I don't really care if a whole data centre burns down.

  • donli Member

    @cmeerw said:
    ... That way I don't really care if a whole data centre burns down.

    Because that occasionally happens...


  • silun Member

    @cmeerw said:
    those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other

    Like a poor man's HA mailserver! Dead simple and straightforward idea for single person mailservers, if a bit pedestrian :D

  • donli Member

    @silun said:

    @cmeerw said:
    those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other

    Like a poor man's HA mailserver! Dead simple and straightforward idea for single person mailservers, if a bit pedestrian

    RAIV - Redundant Array of Inexpensive VPSes.

  • It depends on what you're trying to do, but the usual framework would be HAProxy for traffic management and identical clusters of containers running on geographically separated hosts.

    Databases are the biggest challenge, but they can be replicated to/from 'free tier' cloud services serving as a 'master' record, because insert/update/delete actions are computationally cheap, so you can run a lot of them on small engines. As long as you keep your LowEnd DBs synced to the 'master', you can reliably do the multi-table select queries on cheap compute resources.

    You can use Ansible and scripting for the synchronization, backup, deployment and coordination tasks across the clusters. But you have to ensure that the data is replicated to each cluster before you need it there, because each user session will be directed to a different host/cluster. How much that matters depends on your use case.
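
    As a rough illustration of the sync step only (hostnames, credentials and the database name are placeholders, and this naive full-dump pull just stands in for whatever replication mechanism is actually used):

    # refresh the local LowEnd read copy from the free-tier 'master' database
    mysqldump -h master-db.example.com -u reader -p"$MASTER_PW" --single-transaction appdb \
      | mysql -u app -p"$LOCAL_PW" appdb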

  • @donli said:
    I only run stuff on the low end for which a 24-hour downtime isn't critical.

    What's that, if I may ask?
    A backup server?

  • silun Member

    Thank you @CloudHopper, those are some interesting points about databases for sure! About HAProxy though: that's running on yet another unreliable low-end box, so how have you chosen to mitigate that? If the proxy goes down, so do all the services behind it.

  • emgh Member, Megathread Squad

    @CloudHopper said: Databases are the biggest challenge

    Indeed. A hassle.

    Therefore, I don't do HA.

    But I have a fully automated bash script to set up EVERYTHING. If everything goes tits up, getting it back up is just running the bash script and updating the DNS records.

    Every service we have is backed up regularly to both B2 and R2, and the bash script always uses the latest backup for dynamic data. Code is fetched from buckets or from Docker Hub.

    If you can take the occasional 20-minute downtime on a complete failure (rare with a good provider), I'd seriously consider a good DR plan and not focus on HA.

    HA becomes increasingly complex when you run several services and each service has several different DBs and caching systems as dependencies, etc. It's just too much of a hassle.
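
    A minimal sketch of what that kind of restore flow could look like (bucket names, paths and the compose file are placeholders, not the actual script; it assumes timestamped backup filenames):

    # grab the newest backup from B2 with rclone and bring the stack back up
    LATEST=$(rclone lsf b2:my-backups/app/ | sort | tail -n 1)
    rclone copy "b2:my-backups/app/$LATEST" /srv/restore/

    # code/images come from Docker Hub; dynamic data is restored from /srv/restore
    docker compose -f /srv/app/docker-compose.yml up -d
    # ...then update the DNS records to point at the new box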

  • donli Member

    @suyadi92 said:

    @donli said:
    I only run stuff on the low end for which a 24-hour downtime isn't critical.

    What's that, if I may ask?
    A backup server?

    Extra backup dumps, server monitors, VPSes for trying out other OSes/experimental ISOs and software that is easiest to remove via an OS reinstall, and cheap extra VPN locations.

  • ralf Member

    @silun said:
    Thank you @CloudHopper, those are some interesting points about databases for sure! About HAProxy though: that's running on yet another unreliable low-end box, so how have you chosen to mitigate that? If the proxy goes down, so do all the services behind it.

    Yeah, I personally have my www.domain.com and domain.com do round-robin DNS to 3 different servers (all in Europe currently, but that's just because that's where I live). All 3 of them run haproxy and forward normal web traffic to the closest one that has a copy of the websites, and similarly with the app backend traffic (but I have more of those). Each of the servers also has a country.domain.com DNS entry that's just for that one server too.

    I keep thinking about whether I want to make that anycast or not, but there aren't many cheap and good solutions for that. I think I'll probably make both my phone app (and maybe the website) fire off an AJAX request to a couple of servers at random, and if the ping is lower than the current one, offer a redirect to that specific server via a popup (or in the background in the phone app).

    All the copies of the website are rsync'd hourly from the master copy. I thought about more complicated ways of doing things, but ultimately settled on this. The only problem is that if you're copying something that relies on multiple files at once, e.g. a new page with images, there's the possibility of catching it mid-copy.

    I know some people like to shut that webserver instance down before copying to it, so haproxy fails over to a different instance, and bring the webserver back up when the copy has finished. Doing that can still cause dropped connections though, and as my static web content doesn't change very often, I'm not worried about the multiple-file issue.

    If you really care, I've seen some approaches using 2 copies of a webserver, with an iptables rule that uses snat to rewrite a port to one or the other webserver. In-progress connections will continue until they're ended, so the idea is to copy the data to the one that's disabled, change the snat so it points to the one that just received the new data, and leave the other one idle until all its connections are terminated by the client.
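
    Roughly, that two-instance swap could look like this (untested sketch: ports and paths are invented, and the port rewrite shown here is a nat-table REDIRECT rule rather than snat; it assumes the redirect is rule 1 in the chain):

    # one-off setup: two webserver instances on 8081/8082, port 80 redirected to the live one
    iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8081

    # deploy: copy the new content to the idle instance, then swap the redirect to it
    rsync -a --delete master:/var/www/site/ /var/www/site-8082/
    iptables -t nat -R PREROUTING 1 -p tcp --dport 80 -j REDIRECT --to-ports 8082
    # already-established connections keep going to 8081 until the clients close them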

    For database stuff between the app backends, it's generally a tough problem, especially if you're using something like mysql or postgres. I've gone for the really low-end approach: sqlite3 databases on each node, with litefs (for now) to copy the data across.

    So, every node has its own write-only database and a read-only copy of the master database. Instead of writing directly to the shared database, a node writes to its write-only database (how you do this is up to you: either store data in JSON/XML as a command or, as I do, insert a record into a table that matches the master database). That gets automatically replicated to another server as read-only, and that server is the one that copies the new data into the master database, which is then automatically replicated out to all the copies.

    It sounds complicated, and in truth it's more work than just using a connection to a remote mysql, but it allows the application nodes to keep read-only access to the database even if they lose the connection to the master server, and to queue writes until the master server connection comes back up. So externally the system is very fault tolerant, I can distribute it across loads of cheap VPSes, and I don't particularly care if any go down. I can also easily scale the backend just by adding more VPSes and running another instance on each.

    As well as my crazy database system, I also rely on the client and server to handle out-of-date or missing data. The client receives data as an ordered list of events from the server, each time telling the server the last ID/timestamp it received, and the server only sends data if it has something newer than that. Modified records are done by giving them a new ID/timestamp and saying which record they replace. The client also resends data to the server if it hasn't received an acknowledgement for an hour, so even if it was accepted by a server that then failed and never forwarded it to the master server, the client will retry and send it again. So it actually doesn't matter if any of the app servers go down, even if they were all running entirely out of a RAM disk.

    It's still in my plan to improve HA on the one server that does the integration, but that's not there yet. At least, even if I have to rebuild it manually (if all 5 of my backups fail), I know I can still reconstruct all the data from what's on the worker nodes.
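
    If it helps to picture the write-queue idea above, here's a toy version with the plain sqlite3 CLI (table, column and file names are invented, and litefs handles the actual replication differently; this only shows the "merge a node's writes into the master" step):

    # on an app node: record the write locally instead of touching the master directly
    sqlite3 /data/node-writes.db "INSERT INTO events (id, ts, payload) VALUES ('node1-00042', strftime('%s','now'), 'new score for player 7');"

    # on the integration server: fold the replicated copy of that node DB into the master
    sqlite3 /data/master.db "ATTACH '/replicas/node1-writes.db' AS node;
      INSERT OR IGNORE INTO events (id, ts, payload) SELECT id, ts, payload FROM node.events;"
    # IDs are made globally unique per node ('node1-...') so the merge never collides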

  • @ralf said:
    All the copies of the website are rsync'd hourly from the master copy. [...] The only problem is that if you're copying something that relies on multiple files at once, e.g. a new page with images, there's the possibility of catching it mid-copy. [...]

    Why not have a symlink that points to the current version of the site? rsync to a new directory and then change the symlink to point to the new version.

    Of course, there is still the problem that a client might end up requesting a particular file, and the next request then already goes to the new version of the site. But the only way to solve that would be to have the version encoded in the URL (and leave the old version around for some time).
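
    A minimal sketch of that (paths are placeholders; the mv -T rename is what makes the switch atomic):

    # sync into a fresh release directory, then flip the "current" symlink in one step
    REL=/var/www/releases/$(date +%Y%m%d%H%M)
    rsync -a master:/var/www/site/ "$REL"/

    ln -sfn "$REL" /var/www/current.new
    mv -T /var/www/current.new /var/www/current   # the webserver docroot points at /var/www/current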

  • How much would you pay for the following:

    1. multi-provider/region DNS that does load-balanced query routing (latency-, geoproximity-, or endpoint-health-based, e.g. Cloudflare Load Balancer, AWS Route 53, etc.)
    2. multi-provider/region storage with unlimited snapshots and space
    3. transparent or one-click multi-provider/region failover of stateful workloads (including any custom databases).
  • ralf Member

    @cmeerw said:

    @ralf said:
    All the copies of the website are rsync'd hourly from the master copy. [...] The only problem is that if you're copying something that relies on multiple files at once, e.g. a new page with images, there's the possibility of catching it mid-copy. [...]

    Why not have a symlink that points to the current version of the site? rsync to a new directory and then change the symlink to point to the new version.

    Of course, there is still the problem that a client might end up requesting a particular file, and the next request then already goes to the new version of the site. But the only way to solve that would be to have the version encoded in the URL (and leave the old version around for some time).

    Yeah, you could do that too. I used to do something similar with a hybrid incremental/full backup system, but I just wanted to write this quickly and not worry about multiple directories, removing old copies, etc. Like I say, my website data changes infrequently, so I don't care too much about that edge case.

  • ralf Member

    @cmeerw said:
    The only services I consider critical are DNS and email, so I just have multiple DNS and email servers (and those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other). That way I don't really care if a whole data centre burns down.

    I was thinking about how you actually achieve that. If you have exim, do you mind sharing how you get each mail server to deliver to itself and the other without creating mail loops?

    The only way I can think of without massively hacking up the configuration file (which always goes wrong if I meddle with it too much) is to just have 2 mail servers in VMs on each server, so the front one receives the mail from the outside and queues it for delivery to both backends. Even then, if one goes down, I think it'd be hard to stop the eventual bounce message from being sent, so I'm not sure how you're doing it behind the scenes.

    Or is it just as simple as aliases.virtual containing entries like:

    [email protected]: ralf@localhost, ralf@backuphost
    @domain.org: catchall@localhost, catchall@backuphost
  • @ralf said:

    @cmeerw said:
    The only services I consider critical are DNS and email, so I just have multiple DNS and email servers (and those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other). That way I don't really care if a whole data centre burns down.

    I was thinking about how you actually achieve that. If you have exim, do you mind sharing how you get each mail server to deliver to itself and the other without creating mail loops?

    The only way I can think of without massively hacking up the configuration file (which always goes wrong if I meddle with it too much) is to just have 2 mail servers in VMs on each server, so the front one receives the mail from the outside and queues it for delivery to both backends. Even then, if one goes down, I think it'd be hard to stop the eventual bounce message from being sent, so I'm not sure how you're doing it behind the scenes.

    I did something like this once a long time ago, and it really was "behind the scenes", since the solution was to simply sync the backend filesystem.
    As long as both nodes were running, one acted as the primary and received all email. The mail files were then synced from the primary to the secondary.
    If the primary failed, the secondary received all the emails and no sync was done. Once the primary resumed operation, it did a one-time sync from the secondary and then resumed its role as primary mailserver.
    I think we also had some hack to block port 25 on the secondary as long as the primary was up, since some people ignored the MX priorities and delivered to the secondary even if the primary was operational.

    It was basically just a bunch of sh scripts, but it worked surprisingly well for almost a decade. I can't remember for sure, but this was the late 90s, so the software used was probably postfix or qmail.
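
    A rough reconstruction of the shape of those scripts, just as an illustration (hostnames and paths are invented, and the one-off sync back to the primary after an outage isn't shown):

    PRIMARY=mx1.example.net

    if ping -c 1 -W 2 "$PRIMARY" >/dev/null 2>&1; then
        # primary is up: mirror its mail store and keep port 25 closed here
        rsync -a "$PRIMARY":/var/mail/ /var/mail/
        iptables -C INPUT -p tcp --dport 25 -j REJECT 2>/dev/null || \
            iptables -A INPUT -p tcp --dport 25 -j REJECT
    else
        # primary is down: open port 25 so the secondary starts accepting mail
        iptables -D INPUT -p tcp --dport 25 -j REJECT 2>/dev/null || true
    fi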

  • @ralf said:

    @cmeerw said:
    The only services I consider critical are DNS and email, so I just have multiple DNS and email servers (and those email servers deliver each email to 2 email mailbox servers, so if the main mailbox server goes down I can just switch to the other). That way I don't really care if a whole data centre burns down.

    I was thinking about how you actually achieve that. If you have exim, do you mind sharing how you get each mail server to deliver to itself and the other without creating mail loops?

    The only way I can think of without massively hacking up the configuration file (which always goes wrong if I meddle with it too much) is to just have 2 mail servers in VMs on each server, so the front one receives the mail from the outside and queues it for delivery to both backends. Even then, if one goes down, I think it'd be hard to stop the eventual bounce message from being sent, so I'm not sure how you're doing it behind the scenes.

    Or is it just as simple as aliases.virtual containing entries like:

    [email protected]: ralf@localhost, ralf@backuphost
    @domain.org: catchall@localhost, catchall@backuphost

    Yes, that would be the basic idea - and you just want to make sure that you retry long enough and/or don't send any bounces back for delivery failures. In my case I also keep all that user configuration in LDAP (with LDAP synced to each server).

  • quicksilver03 Member, Host Rep

    For DNS I currently have a MariaDB primary with several replicas, and scripts to quickly promote a replica to primary; API and web traffic can switch to the new primary with a DNS change.
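
    The promotion itself can be as small as this kind of sketch (hostnames and credentials are placeholders, and the generic MariaDB statements stand in for whatever the real scripts do):

    NEW_PRIMARY=db2.example.net

    # stop replicating, forget the old primary, and allow writes on the promoted node
    mysql -h "$NEW_PRIMARY" -u admin -p"$DB_PW" -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = OFF;"

    # then repoint the DNS record used by the API/web servers at the new primary
    # (the actual call depends on the DNS provider's API)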

    I tried Galera replication over a WAN, but in the end I found that it was very fragile and would break almost weekly. My next try of an active-active architecture will be with CockroachDB, which hopefully will work better across heterogeneous networks.

    For incoming mail, I have exim on 2 different servers, pointed to by the MX records. Each exim delivers mail with LMTP to a Dovecot instance on the same server, and each Dovecot instance replicates the mailboxes with the other one. For outgoing mail, both servers are declared in the SPF record and I use one or the other depending on which account I'm sending from.

  • jsg Member, Resident Benchmarker

    multiple users said:
    Databases are the biggest challenge

    Probably the most convenient solution, and the one I use (and have built into my own software), is to simply double-write any DB write operations: once "normally" to the DB and once to a shadow DB or log server. Done properly and a bit smartly, that gives at least four nines of "HA", and it is very convenient and easy to restore from.
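
    A tiny illustration of the double-write idea with the sqlite3 CLI (file and table names are invented; in practice this lives inside the application code):

    SQL="INSERT INTO events (ts, payload) VALUES (strftime('%s','now'), 'user 42 updated profile');"

    # 1. the normal write to the live DB
    sqlite3 /srv/app/live.db "$SQL"

    # 2. the same statement to the shadow DB; if that fails, queue it in a replay log
    sqlite3 /mnt/shadow/shadow.db "$SQL" || printf '%s\n' "$SQL" >> /mnt/shadow/replay.sql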
