Failover design?

quade Member

Hey all
Yesterday my monitoring systems went crazy, and I watched 20 of my AMS Vultr servers switch into an unknown state one by one.
Then the phone started ringing with angry customers.

I was able to re-route some traffic because 3 of my AMS servers stayed online (using floating IPs), but I lost some features that depend on specific servers that were down.
I deployed a bunch of servers from snapshots to Germany, but AMS was not reachable, so the image could not be copied across.

Vultr said their upstream provider had traffic issues.

The servers went down at about 9am my time and came back around 2pm. I had to point domains at different IPs, and it took a number of hours for the changes to take effect.

Given we are already on thin ice with one customer over a similar AMS Vultr issue, I would like to explore failover solutions.

I asked Vultr about failover with their servers/systems, where I can point an IP address at, say, Germany, and they said:

I spoke to one of our senior engineers and he was recommending that you anycast your own prefixes. In order to accomplish this you would need to get an IP block from one of the RIRs (ARIN, RIPE, APNIC etc.). Once you have this IP block, you will need to set up a BGP session with us. You can either bring your own ASN or we can provide you with a private one for you to use. After configuring BGP on different VMs in our different POPs, the IP block can be propagated from several different datacenters.

Is this my only option?

Basically, I would like to design "region balancing and failover" for my IoT sensors reporting data to our servers,
so everything goes to 1 IP, which then routes to either AMS or Germany, or both.
If AMS goes down, it re-routes everything to Germany.

I hope that makes sense.
I'm looking for suggestions on how to achieve this.
We are a small company and don't have dedicated network guys, so ideally the solution would be easy to manage.

Comments

  • You can use Cloudflare for load balancing/failover.
    https://developers.cloudflare.com/load-balancing/about

    You can configure it so that when one IP fails, traffic is routed to a different server. Or spread traffic across both and remove one when its health check fails.

  • You can check Route 53 too.

  • both these options look really interesting, thanks

  • vfuse Member, Host Rep

    For our API we just use the Cloudflare API to check whether all IPs behind a hostname are still online. If one goes offline, we remove/rename it at Cloudflare, and rename it back once it's back online. A floating IP is nice but is usually still limited to a specific location.

    We also had almost 4 hours of downtime in Vultr/AMS on Monday, but this solution worked fine for us.
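
A minimal sketch of the decision step in such a monitor, in Python. This is not the script mentioned in the thread: the function names, the documentation-range IPs, and the rule of never emptying the record set are all assumptions made for illustration. The calls that would actually add or delete records through the DNS provider's API are deliberately left out.

```python
import socket
from typing import Dict, Iterable, List, Tuple


def tcp_alive(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def plan_dns_changes(
    pool: Iterable[str], published: Iterable[str], health: Dict[str, bool]
) -> Tuple[List[str], List[str]]:
    """Decide which A records to remove and which to add back.

    pool      -- every IP that may serve the hostname (hypothetical list)
    published -- IPs currently present in DNS
    health    -- latest check result per IP (missing means unhealthy)
    """
    published_set = set(published)
    healthy = [ip for ip in pool if health.get(ip, False)]
    if not healthy:
        # Assumed safety rule: never empty the record set. If every check
        # fails, the monitor itself may be the broken part, so do nothing.
        return [], []
    to_remove = sorted(ip for ip in published_set if ip not in healthy)
    to_add = sorted(ip for ip in healthy if ip not in published_set)
    return to_remove, to_add
```

A scheduler would call `tcp_alive` for each IP in the pool, pass the results to `plan_dns_changes`, and then apply the two lists through whatever DNS API is in use.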

  • Thanks vfuse. The issue is that our IoT sensors are all programmed with a specific IP or domain name (domain name where possible).
    Would this solution still work?

  • coolice Member
    edited August 2020

    You can spread your servers across multiple DCs to minimize the impact of a network outage, even if some locations are a bit further away, and you can even use different providers...

    If your sensors are set to use specific domains / subdomains... you can do "failover" manually with DNS and a low TTL (300), implying you set up some failover for your virtual machines (MySQL master-master replication and Unison for file sync), or Proxmox storage replication if you decide to go with dedicated servers.

    It is simple, does not rely on any fancy third-party technology, panel, or service to work correctly, and can use different providers.

  • vfuse Member, Host Rep
    edited August 2020

    @quade said:
    Thanks vfuse. The issue is that our IoT sensors are all programmed with a specific IP or domain name (domain name where possible).
    Would this solution still work?

    If they're all programmed on a specific IP no, on a domain yes.

    Edit: pm'ed you the script we currently use

  • Brend4n Member
    edited August 2020

    @vfuse said:

    If they're all programmed on a specific IP no, on a domain yes.

    Edit: pm'ed you the script we currently use

    If pricing isn't an issue though, you're probably better off using Cloudflare's load balancing/failover (esp. in a prod environment). They have tons of options for health checks/load balancing, and you don't have to deal with reliability concerns, e.g. if the monitoring server goes down.

  • Cloudflare's load balancer is one of the quickest to implement and (maybe) needs no architecture changes on your side.

    Another alternative is to run more than one load balancer of your own with different providers and use Route 53 as DNS to monitor the uptime of the load balancers, but this is more complicated than the first solution.

  • vfuse Member, Host Rep

    @Brend4n said:

    If pricing isn't an issue though, you're probably better off using Cloudflare's load balancing/failover (esp. in a prod environment). They have tons of options for health checks/load balancing, and you don't have to deal with reliability concerns, e.g. if the monitoring server goes down.

    Of course, if pricing is not an issue there are several enterprise options available. @OP didn't mention that pricing was not an issue; in the case of CF this could end up costing a lot of money for IoT (making millions/billions of requests per day).

  • Brend4n Member
    edited August 2020

    @vfuse said:

    Of course, if pricing is not an issue there are several enterprise options available. @OP didn't mention that pricing was not an issue; in the case of CF this could end up costing a lot of money for IoT (making millions/billions of requests per day).

    Fair point, although I assume that if OP has 20 Vultr servers and is using them for business, he can spare $5-$10/month for Cloudflare's load balancer... or maybe $2.95 for ClouDNS. Plenty of affordable options besides self-hosting.

  • @coolice said:
    You can spread your servers across multiple DCs to minimize the impact of a network outage, even if some locations are a bit further away, and you can even use different providers...

    If your sensors are set to use specific domains / subdomains... you can do "failover" manually with DNS and a low TTL (300), implying you set up some failover for your virtual machines (MySQL master-master replication and Unison for file sync), or Proxmox storage replication if you decide to go with dedicated servers.

    It is simple, does not rely on any fancy third-party technology, panel, or service to work correctly, and can use different providers.

    Thanks @coolice
    Can you explain a little more about how to set this up (or point me in the right direction)?
    I'm not sure how to spread between multiple DCs.

    We have in the past run a continuous "ping google.com" and seen ping times go really high when one server starts to get saturated, so it would be good to have a solution to re-route traffic if this happens.

  • @vfuse said:

    Of course, if pricing is not an issue there are several enterprise options available. @OP didn't mention that pricing was not an issue; in the case of CF this could end up costing a lot of money for IoT (making millions/billions of requests per day).

    At the ingest point alone, the IoT sensors are doing around 5 billion requests a month.
    That doesn't include any of the other APIs, but that traffic is from myServerA to myServerB, so I have other solutions for it.

    So I guess Cloudflare could end up costing a lot.
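
For a sense of scale, the monthly figure can be turned into an average rate. A quick back-of-the-envelope calculation in Python, assuming a 30-day month and evenly spread traffic (real sensor traffic is likely burstier, so peaks would be higher); it says nothing about what any provider would charge:

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month


def average_rate(requests_per_month: float) -> float:
    """Average requests per second for a given monthly volume."""
    return requests_per_month / SECONDS_PER_MONTH


ingest_per_month = 5_000_000_000  # the ingest figure quoted in the thread
print(f"{average_rate(ingest_per_month):.0f} requests/second on average")
print(f"{ingest_per_month / 30 / 1e6:.0f} million requests/day")
```

That works out to roughly 1,900 requests per second, or about 167 million per day.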

  • @quade said:

    At the ingest point alone, the IoT sensors are doing around 5 billion requests a month.
    That doesn't include any of the other APIs, but that traffic is from myServerA to myServerB, so I have other solutions for it.

    So I guess Cloudflare could end up costing a lot.

    Ah, then you can try self-hosting an uptime script and updating via the Cloudflare API as @vfuse suggested. Alternatively, there's a hosted platform that does exactly this: https://failover.cc/ . I haven't tried them though.

  • coolice Member
    edited August 2020

    @quade said:

    Thanks @coolice
    Can you explain a little more about how to set this up (or point me in the right direction)?
    I'm not sure how to spread between multiple DCs.

    We have in the past run a continuous "ping google.com" and seen ping times go really high when one server starts to get saturated, so it would be good to have a solution to re-route traffic if this happens.

    Spread your virtual machines across multiple locations (data centers), even with the same provider. That way you will not get alerts for 20 VMs down, but maybe just 5...

    Then you can have a dedicated server somewhere with a different provider, running multiple virtual machines that act as reserves for the main ones, and sync it with MySQL master-master replication (google it).

    In addition to failover (in case of a provider failure you just change the A record, as I pointed out in my first post), you can also do simple load balancing when one of the VMs has too many requests, using round-robin DNS: just go to your DNS and add a second A record pointing to the reserve server (without removing the first one) for the subdomain that VM is using, and within 300 seconds it will spread the load between the two.

    Then, to avoid relying on a third-party API or panel to move IPs or DNS, you can also avoid relying on third-party DNS (by spinning up a couple of small VPSes with different providers to serve as your DNS servers).

    P.S. A 300-second TTL is an example of the minimum that third-party DNS providers allow; you can go even lower if you host your own DNS.
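
To illustrate what a second A record means for a client, here is a small Python sketch that resolves every IPv4 address behind a name and connects to the first one that answers. Whether the sensors' firmware behaves like this is an assumption, and the helper names are made up for this example.

```python
import socket
from typing import Iterable, List, Optional, Tuple


def resolve_all(hostname: str, port: int) -> List[Tuple[str, int]]:
    """Return every IPv4 address published for hostname, in resolver order."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    seen: List[Tuple[str, int]] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in seen:
            seen.append(addr)
    return seen


def connect_first_available(
    addresses: Iterable[Tuple[str, int]], timeout: float = 3.0
) -> Optional[socket.socket]:
    """Try each address in turn and return the first socket that connects."""
    for addr in addresses:
        try:
            return socket.create_connection(addr, timeout=timeout)
        except OSError:
            continue  # this server is unreachable, try the next record
    return None
```

With a hypothetical hostname, usage would look like `connect_first_available(resolve_all("datashare1.example.com", 443))`. A client that only ever tries the first address gets the load spreading of round-robin DNS but not this fallback.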

  • I never knew a subdomain could have multiple IPs; I thought there was just one.

    The only problem I can foresee is that these IoT devices first create a TCP connection with the server, then start transmitting data.
    If there is load balancing between 3 IPs, each time it round-robins to a new server it will have to re-initiate the connection.

    Is there also a way to do this load balancing between 3 IPs, but with sticky sessions and least connections?

    i.e. don't throw all the load at the first server only, but if a sensor connects to serverA, maintain that connection until the connection is dead.

    The next IoT sensor will be routed to the box with the least connections (serverB), and so on...
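
The behaviour described here (keep an open connection where it is, send each new one to the least-loaded box) can be sketched in a few lines of Python. This is only a toy model of the selection logic, with made-up class and method names; a real deployment would put it in a TCP load balancer in front of the servers rather than in application code.

```python
from typing import Dict, Iterable


class LeastConnectionsBalancer:
    """Assign each new connection to the backend with the fewest open ones."""

    def __init__(self, backends: Iterable[str]) -> None:
        self.active: Dict[str, int] = {name: 0 for name in backends}
        self.assigned: Dict[str, str] = {}  # connection id -> backend

    def connect(self, conn_id: str) -> str:
        # Sticky: a connection that is already open stays where it is.
        if conn_id in self.assigned:
            return self.assigned[conn_id]
        # min() returns the first backend among ties, in insertion order.
        backend = min(self.active, key=lambda name: self.active[name])
        self.active[backend] += 1
        self.assigned[conn_id] = backend
        return backend

    def disconnect(self, conn_id: str) -> None:
        # Free the slot only when the connection is actually dead.
        backend = self.assigned.pop(conn_id, None)
        if backend is not None:
            self.active[backend] -= 1
```

With backends `serverA`, `serverB`, `serverC`, the first sensor lands on serverA, the second on serverB, and a sensor that is still connected is never moved.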

  • vfuse Member, Host Rep

    If it's opening a TCP connection and staying connected, it will go to a single IP. For a least-connections type of LB you would have to go with CF's load balancing option or something else.

  • coolice Member
    edited August 2020

    Until Cloudflare fails...

    @quade said:

    I never knew a subdomain could have multiple IPs; I thought there was just one.

    The only problem I can foresee is that these IoT devices first create a TCP connection with the server, then start transmitting data.
    If there is load balancing between 3 IPs, each time it round-robins to a new server it will have to re-initiate the connection.

    Is there also a way to do this load balancing between 3 IPs, but with sticky sessions and least connections?

    i.e. don't throw all the load at the first server only, but if a sensor connects to serverA, maintain that connection until the connection is dead.

    The next IoT sensor will be routed to the box with the least connections (serverB), and so on...

    The more complexity you add, the more can go wrong; that is why even big OpenStack cloud providers have issues from time to time.

    Yes, you can go full cluster for MySQL (MariaDB Galera) with 3 VMs, and you can add more VMs to be your HAProxy, but that is more things to support and more ways for something to go wrong and cause you downtime...

    Round-robin DNS is not per connection, it is per TTL... Device A wants to send sensor data to datashare1.domain.com. It asks DNS where that address resolves and, on a round-robin (random) basis, is given one of the IPs, for VM-DB1-A or VM-DB1-B. If the TTL is 60 seconds and it sends again before the TTL has expired, it uses the IP in its cache. Only after the TTL expires does it make a second DNS query for datashare1.domain.com, and it is again given an IP on a round-robin basis...

    Anyway, that basic design is only meant to be temporary load balancing.

    I think you should go with the main VMs spread across different data centers, collecting data on them as now,
    and do DB replication to a dedicated server with multiple VMs for failover.
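
The "per TTL, not per connection" point can be shown with a toy simulation in Python. Everything here is a simplified stand-in: a real DNS server usually returns the whole record set in rotating order rather than a single IP, and the hostname, IPs, and class names are invented for the example.

```python
import itertools
from typing import Callable, Dict, List, Tuple


class RoundRobinDNS:
    """Toy authoritative server: hands out the record set's IPs in rotation."""

    def __init__(self, records: Dict[str, List[str]], ttl: float) -> None:
        self.ttl = ttl
        self._cycles = {name: itertools.cycle(ips) for name, ips in records.items()}

    def query(self, name: str) -> Tuple[str, float]:
        return next(self._cycles[name]), self.ttl


class CachingClient:
    """Client that reuses an answer until its TTL has expired."""

    def __init__(self, dns: RoundRobinDNS, clock: Callable[[], float]) -> None:
        self.dns = dns
        self.clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def resolve(self, name: str) -> str:
        now = self.clock()
        cached = self._cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: no new DNS query, same server
        ip, ttl = self.dns.query(name)
        self._cache[name] = (ip, now + ttl)
        return ip
```

With a 60-second TTL, a device that sends data at t=0 and t=30 talks to the same server both times; only a lookup after the TTL has run out can hand it the other IP.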

  • Thanks all for your valuable advice.
    I've been testing out Cloudflare in DNS-only mode and it does appear to fail over within a few minutes.
    I understand proxied mode might be quicker.

    I'll test out scripting this failover and see how it goes.
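
When scripting a failover like this, one common precaution is to require several consecutive failed checks before switching, and several good ones before switching back, so that a single dropped probe does not move traffic. A minimal Python sketch of that idea follows; the thresholds and the class name are assumptions, not values from this thread.

```python
class FailoverState:
    """Track consecutive check results and decide when to switch."""

    def __init__(self, fail_after: int = 3, recover_after: int = 5) -> None:
        self.fail_after = fail_after        # failures needed to leave primary
        self.recover_after = recover_after  # successes needed to return to it
        self.primary_active = True
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, primary_healthy: bool) -> bool:
        """Feed one health-check result; return True if primary should serve."""
        if primary_healthy == self.primary_active:
            self._streak = 0
            return self.primary_active
        self._streak += 1
        needed = self.fail_after if self.primary_active else self.recover_after
        if self._streak >= needed:
            self.primary_active = primary_healthy
            self._streak = 0
        return self.primary_active
```

Each time the return value changes, the script would update DNS to point at the other location.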

  • Proxied may be quicker, but if Cloudflare has issues you will have them too.

  • If you want a self-hosted option, you might take a look at gdnsd.
    I have used it for a couple of years and it has been an excellent self-monitoring failover/load-balancing solution.
