HA storage solution?

marvel Member

Hello all,

I am looking for HA storage, but most commercial solutions are way too expensive. They either charge per TB or charge a pretty serious amount per storage node, so I am looking for an open-source solution.

So basically I need a fully redundant distributed file system, starting with two nodes with local SSD storage (I don't want to mess around with DAS/external SAS etc.). If one storage node goes down, all my VPSes should just keep running. It should be easily expandable with more nodes and/or SSDs so it can grow over time.

So I found out about Ceph. Does anybody have any experience with it? There is also a frontend for it called PetaSAN, but I've been reading mixed reviews on that, plus I'd rather do everything on the command line so I'm not dependent on some GUI.

Any ideas? Thanks!

Comments

  • Lunar Member

    Ceph is the way to go, but you need a lot of RAM for OSD (drive) caching: preferably 2-4 GB of RAM per OSD. https://docs.ceph.com/docs/nautilus/start/hardware-recommendations/

    Thanked by: marvel
  • @Lunar said:
    Ceph is the way to go, but you need a lot of RAM for OSD (drive) caching: preferably 2-4 GB of RAM per OSD. https://docs.ceph.com/docs/nautilus/start/hardware-recommendations/

    Ok, but I can translate OSD to disk, right? So for 10 disks I'd need 40 GB of RAM?

  • Lunar Member

    @marvel said:
    Ok, but I can translate OSD to disk, right? So for 10 disks I'd need 40 GB of RAM?

    Yeah, OSDs are really just disks; that's the Ceph term for them. Forgot to mention: per their docs, the recommended amount of RAM is also about 1 GB per 1 TB of storage, so it depends on how many TB of storage you have as well. Also, with Ceph, don't use any kind of RAID. Ceph needs to see the raw disks, and each OSD is a single disk.

    Thanked by: marvel
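
    A quick back-of-the-envelope sketch of the two rules of thumb above (2-4 GB of RAM per OSD, and roughly 1 GB per TB of storage); the figures are just the examples from this thread, not official Ceph sizing guidance:

        # Rough Ceph OSD RAM estimate based on the rules of thumb quoted above.
        # Illustrative only; check the Ceph hardware recommendations for your release.
        def osd_ram_estimate_gb(num_osds, tb_per_osd, gb_per_osd=4, gb_per_tb=1):
            """Take the larger of: N GB per OSD, or N GB per TB of raw storage."""
            per_osd_rule = num_osds * gb_per_osd
            per_tb_rule = num_osds * tb_per_osd * gb_per_tb
            return max(per_osd_rule, per_tb_rule)

        print(osd_ram_estimate_gb(10, 2))  # 10 x 2 TB disks -> 40 GB (per-OSD rule wins)
        print(osd_ram_estimate_gb(10, 8))  # 10 x 8 TB disks -> 80 GB (per-TB rule wins)
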
  • @Lunar said:
    Yeah, OSDs are really just disks; that's the Ceph term for them. Forgot to mention: per their docs, the recommended amount of RAM is also about 1 GB per 1 TB of storage, so it depends on how many TB of storage you have as well. Also, with Ceph, don't use any kind of RAID. Ceph needs to see the raw disks, and each OSD is a single disk.

    Nice, it works a bit like ZFS then. Sounds good, I'll check it out and do a test setup soon.

    Do you also happen to know if parts of it can run on virtual machines? The monitor and admin services, for instance. Or do you need physical machines?

    Thanks for all the helpful answers!

  • Lunar Member

    @marvel said:
    Nice, it works a bit like ZFS then. Sounds good, I'll check it out and do a test setup soon.

    Do you also happen to know if parts of it can run on virtual machines? The monitor and admin services, for instance. Or do you need physical machines?

    Thanks for all the helpful answers!

    One other thing to note with Ceph: you should do 3x replication (3 servers with the same drives in each) at the very least for HA. So overall, Ceph can be a lot more expensive.

    Good question, although I'm not sure. I believe Ceph recommends dedicated hardware for everything, but you can run the OSD daemons and monitors on the same servers. A good example of an easy-to-set-up Ceph cluster is the one Proxmox lets you build with their tools; it runs on each Proxmox node in a cluster. https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster

    Thanked by: marvel, doughnet
  • jlay Member
    edited March 2020

    I've been pretty happy with GlusterFS, but it really loves CPU time and bandwidth. If you can spring for InfiniBand, RDMA helps things run way better.

    For the best reliability you'd want three nodes, or two replicas with an arbiter. I prefer dispersed volumes that take advantage of erasure coding.

    Thanked by: marvel, quicksilver03
  • marvel Member
    edited March 2020

    @Lunar said:
    One other thing to note with Ceph: you should do 3x replication (3 servers with the same drives in each) at the very least for HA. So overall, Ceph can be a lot more expensive.

    Good question, although I'm not sure. I believe Ceph recommends dedicated hardware for everything, but you can run the OSD daemons and monitors on the same servers. A good example of an easy-to-set-up Ceph cluster is the one Proxmox lets you build with their tools; it runs on each Proxmox node in a cluster. https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster

    3x, ouch. Hmm, yeah, that will be a lot more expensive then. But I'd rather spend my money on hardware than on licensing. Plus I can start small and grow over time, which is a huge plus.

    I've been looking at Proxmox, but I really want to go with OpenStack instead, which I think is better for production usage.

    By the way, I've been thinking: if I need 3 storage nodes, fine, but there is no RAID, right? So say I have 6 x 2 TB in each node, I will net 12 TB total, right? That makes it a bit better.
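
    A rough sanity check on that math, assuming straight 3x replication across the nodes and ignoring Ceph's own overhead and free-space headroom (so treat it as an upper bound):

        # Usable capacity under N-way replication (illustrative numbers only).
        def usable_tb_replicated(nodes, drives_per_node, tb_per_drive, replicas=3):
            raw_tb = nodes * drives_per_node * tb_per_drive
            return raw_tb / replicas

        # 3 nodes x 6 x 2 TB = 36 TB raw -> 12 TB usable with 3x replication
        print(usable_tb_replicated(3, 6, 2))  # 12.0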

  • @jlay said:
    I've been pretty happy with GlusterFS, but it really loves CPU time and bandwidth. If you can spring for InfiniBand, RDMA helps things run way better.

    For the best reliability you'd want three nodes, or two replicas with an arbiter. I prefer dispersed volumes that take advantage of erasure coding.

    Yeah, I've been looking at 40 Gb InfiniBand, but it all adds up. Thanks!

  • jlay Member
    edited March 2020

    @marvel said:
    3x, ouch. Hmm, yeah, that will be a lot more expensive then. But I'd rather spend my money on hardware than on licensing. Plus I can start small and grow over time, which is a huge plus.

    I've been looking at Proxmox, but I really want to go with OpenStack instead, which I think is better for production usage.

    By the way, I've been thinking: if I need 3 storage nodes, fine, but there is no RAID, right? So say I have 6 x 2 TB in each node, I will net 12 TB total, right? That makes it a bit better.

    For resilience, some of the storage is going to be lost, but how much depends on the topology and implementation choices.

    I have three nodes in my lab running GlusterFS with RDMA (Mellanox CX-3s); each one has two 5 TB drives. That's 30 TB of drives, 20 TB usable. With erasure-coded (dispersed) volumes like mine you get more usable space because it isn't replicating; replicas would use at least half of the space for redundancy.

    The trade-offs are performance and resiliency; the volume type and structure matter a lot. Think of it kind of like network RAID.

    Thanked by: marvel
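
    For comparison with the replication math earlier in the thread, a rough erasure-coding sketch; the 4+2 split below is my assumption for a layout that matches those 30 TB raw / 20 TB usable figures, not something stated in the post:

        # Usable capacity under data+redundancy erasure coding (dispersed volumes).
        def usable_tb_erasure(raw_tb, data_bricks, redundancy_bricks):
            return raw_tb * data_bricks / (data_bricks + redundancy_bricks)

        # 6 bricks of 5 TB = 30 TB raw; a 4+2 dispersed layout -> 20 TB usable
        print(usable_tb_erasure(30, 4, 2))  # 20.0
        # 2x replication on the same 30 TB raw would leave only 15 TB usable
        print(usable_tb_erasure(30, 1, 1))  # 15.0
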
  • I've never fucked with vSANs before, but I thought of StarWind vSAN when reading this thread.

    https://www.starwindsoftware.com/starwind-virtual-san-free

  • @jlay said:
    For resilience, some of the storage is going to be lost, but how much depends on the topology and implementation choices.

    I have three nodes in my lab running GlusterFS with RDMA (Mellanox CX-3s); each one has two 5 TB drives. That's 30 TB of drives, 20 TB usable. With erasure-coded (dispersed) volumes like mine you get more usable space because it isn't replicating; replicas would use at least half of the space for redundancy.

    The trade-offs are performance and resiliency; the volume type and structure matter a lot. Think of it kind of like network RAID.

    Well, I bit the bullet and ordered 3 nodes, each with 4 x 2 TB NVMe, so 8 TB per node. I was expecting to net 8 TB, but from what you're saying it's probably more. If it's half the space (like you say, network RAID 10) I should perhaps get something like 12. What's great with Ceph is that I can just keep adding space and nodes as I go.

    I went for Mellanox as well, everything 40 GbE including the hypervisor machines. Also, I assume this is active/active, right? So all nodes are being used all the time and the load is divided over the different nodes.

    Also, you're running GlusterFS, but how is that different from Ceph? I did some digging and it seems you can't use block devices (iSCSI) with Gluster, right? So for virtualization one always has to go with Ceph.

    Anyway, can't wait to try it out and see how it performs :smile:

  • jlay Member
    edited March 2020

    @marvel said:
    Well, I bit the bullet and ordered 3 nodes, each with 4 x 2 TB NVMe, so 8 TB per node. I was expecting to net 8 TB, but from what you're saying it's probably more. If it's half the space (like you say, network RAID 10) I should perhaps get something like 12. What's great with Ceph is that I can just keep adding space and nodes as I go.

    I went for Mellanox as well, everything 40 GbE including the hypervisor machines. Also, I assume this is active/active, right? So all nodes are being used all the time and the load is divided over the different nodes.

    Also, you're running GlusterFS, but how is that different from Ceph? I did some digging and it seems you can't use block devices (iSCSI) with Gluster, right? So for virtualization one always has to go with Ceph.

    Anyway, can't wait to try it out and see how it performs :smile:

    That's some awesome gear! Look into using RDMA transport instead of TCP; it reduces latency a lot.

    The main thing I like about GlusterFS is that it's less centralized. I'm not too familiar with Ceph, but I believe it has some sort of central metadata system, whereas Gluster handles that cleverly with metadata on the inodes.

    Libvirt is able to use GlusterFS volumes as pools, with just normal qcow or raw files on top. It's live-migration friendly; I use it in my lab at home all the time.

    There was some project to do block devices with Gluster, but I'm not sure if it's still going.
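
    A minimal sketch of that libvirt setup, assuming the Gluster volume is already FUSE-mounted at a placeholder path (/mnt/gv0); the pool name and paths are made up, and libvirt also has native Gluster pool types, but a plain directory pool over the mount is the simplest way to get qcow2/raw files onto it:

        # Sketch: register a FUSE-mounted GlusterFS volume as a libvirt storage pool.
        # Requires the libvirt-python bindings; names and paths below are placeholders.
        import libvirt

        POOL_XML = """
        <pool type='dir'>
          <name>gluster-vms</name>
          <target>
            <path>/mnt/gv0/vm-images</path>
          </target>
        </pool>
        """

        conn = libvirt.open('qemu:///system')          # connect to the local hypervisor
        pool = conn.storagePoolDefineXML(POOL_XML, 0)  # persistent pool definition
        pool.setAutostart(1)                           # bring the pool up on host boot
        pool.create(0)                                 # activate it now
        print(pool.name(), pool.isActive())
        conn.close()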

  • @jlay said:
    That's some awesome gear! Look into using RDMA transport instead of TCP; it reduces latency a lot.

    The main thing I like about GlusterFS is that it's less centralized. I'm not too familiar with Ceph, but I believe it has some sort of central metadata system, whereas Gluster handles that cleverly with metadata on the inodes.

    Libvirt is able to use GlusterFS volumes as pools, with just normal qcow or raw files on top. It's live-migration friendly; I use it in my lab at home all the time.

    There was some project to do block devices with Gluster, but I'm not sure if it's still going.

    Right, I need to see how to enable that, but I think it's just a matter of installing the relevant RDMA packages and starting the service? I will look into Gluster as well :)

  • jlay Member

    @marvel said:
    Right, I need to see how to enable that, but I think it's just a matter of installing the relevant RDMA packages and starting the service? I will look into Gluster as well :)

    Setting up RDMA is pretty involved; I'd look to some guides for it. If I remember correctly, Red Hat had the best reference.

    For Gluster to take advantage of it, the volume needs both TCP and RDMA transport types listed (it's a volume option).
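
    A hedged sketch of what that volume option looks like at creation time; the hostnames and brick paths are placeholders, the disperse/redundancy counts are just an example layout, and it simply shells out to the gluster CLI:

        # Sketch: create a dispersed Gluster volume with both TCP and RDMA transports.
        # Hostnames and brick paths are placeholders; run from a node in the trusted pool.
        import subprocess

        bricks = [f"node{i}:/data/brick1/gv0" for i in (1, 2, 3)]

        subprocess.run(
            ["gluster", "volume", "create", "gv0",
             "disperse", "3", "redundancy", "1",
             "transport", "tcp,rdma", *bricks],
            check=True,
        )
        subprocess.run(["gluster", "volume", "start", "gv0"], check=True)
        subprocess.run(["gluster", "volume", "info", "gv0"], check=True)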
