50-node Kubernetes cluster in under 9 minutes

vitobotta · October 2023

First tests with the foundation of what will be ClusterNinja: from zero to kubeconfig for a cluster with 50 nodes total (3 masters for HA, 47 worker nodes of different types in different pools and in different locations) created in only 8 minutes and 54 seconds TOTAL.

This includes everything, from creating all the resources (servers, load balancer, private network, firewall) to deploying Kubernetes and components required for managing upgrades as well as provisioning persistent volumes and load balancers out of the box.

The upgrade of the same cluster from Kubernetes 1.26 to 1.27 also took only 4 minutes.

Do you know how long these things take on GCP/AWS/Azure?

Thoughts?

Francisco · October 2023

@vitobotta said: Thoughts?

Does grannies knitting wordpress load pretty quick then?

Francisco

vitobotta · October 2023

@Francisco said:

@vitobotta said: Thoughts?

Does grannies knitting wordpress load pretty quick then?

Francisco

no doubt

ehab · October 2023

that's pretty impressive, was it from different providers and networks?

vitobotta · October 2023

@ehab said:

that's pretty impressive, was it from different providers and networks?

No, this test was with Hetzner for now with servers in 3 different locations. I need to do more work to do a test like this with multi cloud, although the first experiments (at a smaller scale) were successful.

vitobotta · October 2023

I was wondering if any providers would be interested in partnering when I am ready to launch or soon after. ClusterNinja will be a "bring your own provider" managed service and will support some popular providers natively, but it will also be possible to use nodes from any provider of your choice. I could potentially partner with some LET providers with good reputation and offer them as recommended options for those who want to save more money.

Which providers on LET would be good to talk with about this?

akhfa · October 2023

@vitobotta said:
First tests with the foundation of what will be ClusterNinja: from zero to kubeconfig for a cluster with 50 nodes total (3 masters for HA, 47 worker nodes of different types in different pools and in different locations) created in only 8 minutes and 54 seconds TOTAL.

This includes everything, from creating all the resources (servers, load balancer, private network, firewall) to deploying Kubernetes and components required for managing upgrades as well as provisioning persistent volumes and load balancers out of the box.

The upgrade of the same cluster from Kubernetes 1.26 to 1.27 also took only 4 minutes.

Do you know how long these things take on GCP/AWS/Azure?

Thoughts?

Assuming that it will spawn on multiple providers, the challenge for distributed storage is how to encrypt the connection when all traffic goes over public IPs. You can use a CNI that encrypts connections by default, at the cost of storage performance, and there may be a further performance hit when using public connections between providers, depending on the peering.

What I can think of is to put IO-heavy apps like databases outside the cluster for multi-provider deployments.

ehab · October 2023

@vitobotta said:
No, this test was with Hetzner for now with servers in 3 different locations. I need to do more work to do a test like this with multi cloud, although the first experiments (at a smaller scale) were successful.

Val · October 2023

This works because VMs are created pretty fast by the API on Hetzner Cloud. It might be slower, for instance, on Azure, where VMs take a bit of time to be created.
Great job! Love all your automation stuff.

vitobotta · October 2023

@akhfa said:

@vitobotta said:
First tests with the foundation of what will be ClusterNinja: from zero to kubeconfig for a cluster with 50 nodes total (3 masters for HA, 47 worker nodes of different types in different pools and in different locations) created in only 8 minutes and 54 seconds TOTAL.

This includes everything, from creating all the resources (servers, load balancer, private network, firewall) to deploying Kubernetes and components required for managing upgrades as well as provisioning persistent volumes and load balancers out of the box.

The upgrade of the same cluster from Kubernetes 1.26 to 1.27 also took only 4 minutes.

Do you know how long these things take on GCP/AWS/Azure?

Thoughts?

Assuming that it will spawn on multiple providers, the challenge for distributed storage is how to encrypt the connection when all traffic goes over public IPs. You can use a CNI that encrypts connections by default, at the cost of storage performance, and there may be a further performance hit when using public connections between providers, depending on the peering.

In the alpha version of the multi-cloud setup I have tested, the nodes talk to each other over a WireGuard mesh. For storage, I have tested successfully with Longhorn, which I am going to recommend to users for its ease of use and features. For workloads that require high performance and do not need replication at the storage level (such as databases that handle their own replication), I will recommend, and optionally install, Rancher's local path provisioner.

All this works beautifully in my initial tests.
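
To make the mesh idea concrete, here is a minimal sketch of how per-node WireGuard configs could be generated so that every node lists every other node as a peer. It assumes a full mesh and a wg-quick style config; the node names, IP addresses, key placeholders, the `/24` mesh subnet, and the helper names `render_config` and `tunnel_count` are all made up for illustration and are not taken from ClusterNinja.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    public_ip: str    # address reachable from the other providers
    mesh_ip: str      # private address inside the WireGuard mesh (assumed 10.99.0.0/24)
    public_key: str   # WireGuard public key (placeholders below)

WG_PORT = 51820  # port commonly used in WireGuard examples

def render_config(me: Node, nodes: list[Node]) -> str:
    """Render a wg-quick style config for `me` with every other node as a peer."""
    lines = [
        "[Interface]",
        f"Address = {me.mesh_ip}/24",
        f"ListenPort = {WG_PORT}",
        "PrivateKey = <private key of this node>",
    ]
    for peer in nodes:
        if peer.name == me.name:
            continue  # a node is never its own peer
        lines += [
            "",
            "[Peer]",
            f"PublicKey = {peer.public_key}",
            f"Endpoint = {peer.public_ip}:{WG_PORT}",
            f"AllowedIPs = {peer.mesh_ip}/32",
        ]
    return "\n".join(lines)

def tunnel_count(n: int) -> int:
    """Number of node pairs in a full mesh of n nodes."""
    return n * (n - 1) // 2

# Hypothetical nodes at three providers (documentation-range IPs).
nodes = [
    Node("node-a", "203.0.113.10", "10.99.0.1", "<public key a>"),
    Node("node-b", "198.51.100.20", "10.99.0.2", "<public key b>"),
    Node("node-c", "192.0.2.30", "10.99.0.3", "<public key c>"),
]
print(render_config(nodes[0], nodes))
print(tunnel_count(50))  # node pairs in a 50-node full mesh
```

Because traffic between mesh addresses is encrypted by WireGuard itself, storage replication between nodes would not need a separate encryption layer in a setup like this, although the latency between providers still applies.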

vitobotta · October 2023

@akhfa said:

@vitobotta said:
First tests with the foundation of what will be ClusterNinja: from zero to kubeconfig for a cluster with 50 nodes total (3 masters for HA, 47 worker nodes of different types in different pools and in different locations) created in only 8 minutes and 54 seconds TOTAL.

This includes everything, from creating all the resources (servers, load balancer, private network, firewall) to deploying Kubernetes and components required for managing upgrades as well as provisioning persistent volumes and load balancers out of the box.

The upgrade of the same cluster from Kubernetes 1.26 to 1.27 also took only 4 minutes.

Do you know how long these things take on GCP/AWS/Azure?

Thoughts?

Assuming that it will spawn on multiple providers, the challenge for distributed storage is how to encrypt the connection when all traffic goes over public IPs. You can use a CNI that encrypts connections by default, at the cost of storage performance, and there may be a further performance hit when using public connections between providers, depending on the peering.

What I can think of is to put IO-heavy apps like databases outside the cluster for multi-provider deployments.

I saw you added the last sentence. Like I said, for performance they can just provision volumes on the root disk with the local path provisioner. These disks are usually NVMe or at least SSD, so they are OK for most tasks. Later on I could also add features like automatic provisioning of resources outside the cluster, if there is demand. To be honest, though, things have changed dramatically over the past 2-3 years, and there are some awesome Kubernetes operators that make Kubernetes the best platform even for running databases, which is something people would have thought more than twice about doing only 2-3 years ago.
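
As an illustration of the local-disk option, the sketch below builds a PersistentVolumeClaim manifest that requests a volume from the local path provisioner. It assumes the provisioner is installed with its default StorageClass name `local-path`; the claim name, the size, and the helper `local_path_pvc` are hypothetical.

```python
import json

def local_path_pvc(name: str, size: str, storage_class: str = "local-path") -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # A local volume lives on one node's disk, so a single-node access mode.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

# Hypothetical claim for a database that handles its own replication.
pvc = local_path_pvc("postgres-data", "20Gi")
print(json.dumps(pvc, indent=2))
```

kubectl accepts JSON manifests, so the printed output can be piped to `kubectl apply -f -`. Swapping the `storage_class` argument for the name of a replicated class (Longhorn's, for example) would be the only change needed for workloads that want replication at the storage level.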

vitobotta · October 2023

@Val said:
This works because VMs are created pretty fast by the API on Hetzner Cloud. It might be slower, for instance, on Azure, where VMs take a bit of time to be created.
Great job! Love all your automation stuff.

Yeah Hetzner is ridiculously fast, I have to agree. So far it's the fastest I have seen at creating servers. One newly created server is up and running in like 20-30 seconds. I wonder if they keep a bunch of servers ready at all times just to be assigned to an account, because otherwise I can't explain that speed.
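
A quick back-of-the-envelope check with the figures from this thread (50 nodes, 8 minutes 54 seconds in total, 20-30 seconds per server) shows why per-server speed matters so much: if the servers were created strictly one after another, server creation alone would take longer than the whole run, so the creation requests presumably overlap. This is only arithmetic on the quoted numbers and says nothing about how the provisioning is actually scheduled.

```python
# Figures quoted in the thread.
TOTAL_NODES = 50
TOTAL_SECONDS = 8 * 60 + 54   # 8 minutes 54 seconds, from zero to kubeconfig
BOOT_SECONDS = (20, 30)       # observed time for one new server to come up

def sequential_creation_seconds(nodes: int, per_server_seconds: int) -> int:
    """Time to create all servers if each one waited for the previous one."""
    return nodes * per_server_seconds

best_case = sequential_creation_seconds(TOTAL_NODES, BOOT_SECONDS[0])
worst_case = sequential_creation_seconds(TOTAL_NODES, BOOT_SECONDS[1])

print(f"whole run: {TOTAL_SECONDS} s")
print(f"sequential creation alone: {best_case}-{worst_case} s")
print(f"average wall-clock time per node: {TOTAL_SECONDS / TOTAL_NODES:.1f} s")
```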

tjn · October 2023

It's pretty impressive @vitobotta keep it up!

MrLime · October 2023

Another great post @vitobotta. Looking forward to ClusterNinja.

emgh · October 2023

@vitobotta said:

@Val said:
This works because VMs are created pretty fast by the API on Hetzner Cloud. It might be slower, for instance, on Azure, where VMs take a bit of time to be created.
Great job! Love all your automation stuff.

Yeah Hetzner is ridiculously fast, I have to agree. So far it's the fastest I have seen at creating servers. One newly created server is up and running in like 20-30 seconds. I wonder if they keep a bunch of servers ready at all times just to be assigned to an account, because otherwise I can't explain that speed.

In the OS tab they have (or had) a symbol that said "rapid deploy" or something similar (they all did). I think that means it's available on stand-by.

vitobotta · October 2023

@tjn said:
It's pretty impressive @vitobotta keep it up!

@MrLime said:
Another great post @vitobotta. Looking forward to ClusterNinja.

Thanks!

@emgh said:

@vitobotta said:

@Val said:
This works because VMs are created pretty fast by the API on Hetzner Cloud. It might be slower, for instance, on Azure, where VMs take a bit of time to be created.
Great job! Love all your automation stuff.

Yeah Hetzner is ridiculously fast, I have to agree. So far it's the fastest I have seen at creating servers. One newly created server is up and running in like 20-30 seconds. I wonder if they keep a bunch of servers ready at all times just to be assigned to an account, because otherwise I can't explain that speed.

In the OS tab they have (or had) a symbol that said "rapid deploy" or something similar (they all did). I think that means it's available on stand-by.

Oh, I have never noticed that symbol

Jack_SBE · October 2023

@vitobotta said:
I was wondering if any providers would be interested in partnering when I am ready to launch or soon after. ClusterNinja will be a "bring your own provider" managed service and will support some popular providers natively, but it will also be possible to use nodes from any provider of your choice. I could potentially partner with some LET providers with good reputation and offer them as recommended options for those who want to save more money.

Which providers on LET would be good to talk with about this?

This seems great! We would possibly be interested if you’re still looking for providers.

bangudopw · October 2023

Which CNI did you deploy with?

ddorian43 · October 2023

@vitobotta said: I wonder if they keep a bunch of servers ready at all times just to be assigned to an account, because otherwise I can't explain that speed.

This is the only way. Same as with fast FaaS.

The point is not starting fast but maintaining it, day-2 operations & being cheap enough. One provider that failed on "cheap" is/was elest.io IMHO.

vitobotta · October 2023

@Jack_SBE said:

@vitobotta said:
I was wondering if any providers would be interested in partnering when I am ready to launch or soon after. ClusterNinja will be a "bring your own provider" managed service and will support some popular providers natively, but it will also be possible to use nodes from any provider of your choice. I could potentially partner with some LET providers with good reputation and offer them as recommended options for those who want to save more money.

Which providers on LET would be good to talk with about this?

This seems great! We would possibly be interested if you’re still looking for providers.

Sounds good! I'll make a note and get in touch a bit later.

@ddorian43 said:

@vitobotta said: I wonder if they keep a bunch of servers ready at all times just to be assigned to an account, because otherwise I can't explain that speed.

This is the only way. Same as with fast FaaS.

The point is not starting fast but maintaining it, day-2 operations & being cheap enough. One provider that failed on "cheap" is/was elest.io IMHO.

How did they fail? What do you mean? Thanks

nanankcornering · October 2023

@vitobotta said: Do you know how long these things take on GCP/AWS/Azure?

more than 10. just creating an EKS cluster (not nodes!!) takes 5mins.

vitobotta · October 2023

@nanankcornering said:

@vitobotta said: Do you know how long these things take on GCP/AWS/Azure?

more than 10. just creating an EKS cluster (not nodes!!) takes 5mins.

Exactly

ddorian43 · October 2023

@vitobotta said: How did they fail? What do you mean?

Too expensive IMHO for BYOC.

vitobotta · October 2023

@ddorian43 said:

@vitobotta said: How did they fail? What do you mean?

Too expensive IMHO for BYOC.

Their pricing is completely different from what I have in mind. I won't add a % on top of the instance price. You will be able to choose whichever nodes you want. ClusterNinja will likely be priced based on the size of the cluster. I still need to think about it, but it's likely going that way. WDYT?

sibaper · October 2023

Upgrading an empty cluster is easy. For comparison's sake, upgrading the control plane on AWS/GCP is slower but has no downtime.

Maybe start with your sales pitch. What about node availability (what happens if hosts run out of nodes)?
Are you offering managed services?

vitobotta · October 2023

@sibaper said:
Upgrading an empty cluster is easy. For comparison's sake, upgrading the control plane on AWS/GCP is slower but has no downtime.

Maybe start with your sales pitch. What about node availability (what happens if hosts run out of nodes)?
Are you offering managed services?

The cluster was not empty, I installed several things on purpose to fill nodes etc.

Node availability is not my concern, since this is a "bring your own provider" kind of service, so that is the responsibility of the infrastructure provider you choose. I don't provide infrastructure, although that may change if I form some nice partnerships with some providers here.

Yes, it will be a managed Kubernetes service with some nice features but I am not ready to share more details yet since I don't know what will be available at launch if we try to launch in 3-4 months.
