HA storage solution?

marvel Member

Hello all,

I am looking for HA storage, but most commercial solutions are way too expensive. They either charge per TB or charge a pretty serious amount per storage node, so I am looking for an open-source solution.

So basically I need a fully redundant distributed file system, starting with two nodes with local SSD storage (I don't want to mess around with DAS/external SAS etc.). If one storage node goes down, all my VPSes should just keep running. It should be easily expandable with more nodes and/or SSDs so it can grow over time.

So I found out about Ceph. Does anybody have any experience with it? There is also a frontend for it called PetaSAN, but I've been reading mixed reviews on that, plus I'd rather do everything on the command line so I'm not dependent on some GUI.

Any ideas? Thanks!

Comments

  • Lunar Member

    Ceph is the way to go, but you need a lot of RAM for OSD (drive) caching: preferably 2-4 GB of RAM per OSD. https://docs.ceph.com/docs/nautilus/start/hardware-recommendations/

    Thanked by: marvel
  • @Lunar said:
    Ceph is the way to go, but you need a lot of RAM for OSD (drive) caching: preferably 2-4 GB of RAM per OSD. https://docs.ceph.com/docs/nautilus/start/hardware-recommendations/

    Ok, but I can translate OSD to disk, right? So for 10 disks I'd need 40 GB of RAM?

  • Lunar Member

    @marvel said:
    Ok, but I can translate OSD to disk, right? So for 10 disks I'd need 40 GB of RAM?

    Yeah, OSDs are really just disks; that's the Ceph term for them. Forgot to mention: per their docs, the recommended amount of RAM is also about 1 GB per 1 TB of storage, so it depends on how many TB of storage you have as well. Also, with Ceph, don't use any kind of RAID. Ceph needs to see the raw disks, and each OSD is a single disk.

    Thanked by: marvel
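
    A quick back-of-the-envelope sketch of the two rules of thumb above (2-4 GB of RAM per OSD, and roughly 1 GB per TB of storage); the figures are just the examples from this thread, not official Ceph sizing guidance:

        # Rough Ceph OSD RAM estimate based on the rules of thumb quoted above.
        # Illustrative only; check the Ceph hardware recommendations for your release.
        def osd_ram_estimate_gb(num_osds, tb_per_osd, gb_per_osd=4, gb_per_tb=1):
            """Take the larger of: N GB per OSD, or N GB per TB of raw storage."""
            per_osd_rule = num_osds * gb_per_osd
            per_tb_rule = num_osds * tb_per_osd * gb_per_tb
            return max(per_osd_rule, per_tb_rule)

        print(osd_ram_estimate_gb(10, 2))  # 10 x 2 TB disks -> 40 GB (per-OSD rule wins)
        print(osd_ram_estimate_gb(10, 8))  # 10 x 8 TB disks -> 80 GB (per-TB rule wins)
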
  • @Lunar said:
    Yeah, OSDs are really just disks; that's the Ceph term for them. Forgot to mention: per their docs, the recommended amount of RAM is also about 1 GB per 1 TB of storage, so it depends on how many TB of storage you have as well. Also, with Ceph, don't use any kind of RAID. Ceph needs to see the raw disks, and each OSD is a single disk.

    Nice, it works a bit like ZFS then. Sounds good, I'll check it out and do a test setup soon.

    Do you also happen to know if parts of it can run on virtual machines? The monitor and admin services, for instance. Or do you need physical machines?

    Thanks for all the helpful answers!

  • Lunar Member

    @marvel said:
    Nice, it works a bit like ZFS then. Sounds good, I'll check it out and do a test setup soon.

    Do you also happen to know if parts of it can run on virtual machines? The monitor and admin services, for instance. Or do you need physical machines?

    Thanks for all the helpful answers!

    One other thing to note with Ceph: you should do 3x replication (3 servers with the same drives in each) at the very least for HA. So overall, Ceph can be a lot more expensive.

    Good question, although I'm not sure. I believe Ceph recommends dedicated hardware for everything, but you can run the OSD daemons and monitors on the same servers. A good example of an easy-to-set-up Ceph cluster is the one Proxmox lets you build with their tools; it runs on each Proxmox node in a cluster. https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster

    Thanked by: marvel, doughnet
  • jlay Member
    edited March 2020

    I've been pretty happy with GlusterFS, but it really loves CPU time and bandwidth. If you can spring for InfiniBand, RDMA helps things run way better.

    For the best reliability you'd want three nodes, or two replicas with an arbiter. I prefer dispersed volumes that take advantage of erasure coding.

    Thanked by: marvel, quicksilver03
  • marvel Member
    edited March 2020

    @Lunar said:
    One other thing to note with Ceph: you should do 3x replication (3 servers with the same drives in each) at the very least for HA. So overall, Ceph can be a lot more expensive.

    Good question, although I'm not sure. I believe Ceph recommends dedicated hardware for everything, but you can run the OSD daemons and monitors on the same servers. A good example of an easy-to-set-up Ceph cluster is the one Proxmox lets you build with their tools; it runs on each Proxmox node in a cluster. https://pve.proxmox.com/wiki/Deploy_Hyper-Converged_Ceph_Cluster

    3x, ouch. Hmm, yeah, that will be a lot more expensive then. But I'd rather spend my money on hardware than on licensing. Plus I can start small and grow over time, which is a huge plus.

    I've been looking at Proxmox, but I really want to go with OpenStack instead, which I think is better for production usage.

    By the way, I've been thinking: if I need 3 storage nodes, fine, but there is no RAID, right? So say I have 6 x 2 TB in each node, I will net 12 TB total, right? That makes it a bit better.
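
    A rough sanity check on that math, assuming straight 3x replication across the nodes and ignoring Ceph's own overhead and free-space headroom (so treat it as an upper bound):

        # Usable capacity under N-way replication (illustrative numbers only).
        def usable_tb_replicated(nodes, drives_per_node, tb_per_drive, replicas=3):
            raw_tb = nodes * drives_per_node * tb_per_drive
            return raw_tb / replicas

        # 3 nodes x 6 x 2 TB = 36 TB raw -> 12 TB usable with 3x replication
        print(usable_tb_replicated(3, 6, 2))  # 12.0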

  • @jlay said:
    I've been pretty happy with GlusterFS, but it really loves CPU time and bandwidth. If you can spring for InfiniBand, RDMA helps things run way better.

    For the best reliability you'd want three nodes, or two replicas with an arbiter. I prefer dispersed volumes that take advantage of erasure coding.

    Yeah, I've been looking at 40 Gb InfiniBand, but it all adds up. Thanks!

  • jlay Member
    edited March 2020

    @marvel said:
    3x, ouch. Hmm, yeah, that will be a lot more expensive then. But I'd rather spend my money on hardware than on licensing. Plus I can start small and grow over time, which is a huge plus.

    I've been looking at Proxmox, but I really want to go with OpenStack instead, which I think is better for production usage.

    By the way, I've been thinking: if I need 3 storage nodes, fine, but there is no RAID, right? So say I have 6 x 2 TB in each node, I will net 12 TB total, right? That makes it a bit better.

    For resilience, some of the storage is going to be lost, but how much depends on the topology and implementation choices.

    I have three nodes in my lab running GlusterFS with RDMA (Mellanox CX-3s); each one has two 5 TB drives. That's 30 TB of drives, 20 TB usable. With erasure-coded (dispersed) volumes like mine you get more usable space because it isn't replicating; replicas would use at least half of the space for redundancy.

    The trade-offs are performance and resiliency; the volume type and structure matter a lot. Think of it kind of like network RAID.

    Thanked by: marvel
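
    For comparison with the replication math earlier in the thread, a rough erasure-coding sketch; the 4+2 split below is my assumption for a layout that matches those 30 TB raw / 20 TB usable figures, not something stated in the post:

        # Usable capacity under data+redundancy erasure coding (dispersed volumes).
        def usable_tb_erasure(raw_tb, data_bricks, redundancy_bricks):
            return raw_tb * data_bricks / (data_bricks + redundancy_bricks)

        # 6 bricks of 5 TB = 30 TB raw; a 4+2 dispersed layout -> 20 TB usable
        print(usable_tb_erasure(30, 4, 2))  # 20.0
        # 2x replication on the same 30 TB raw would leave only 15 TB usable
        print(usable_tb_erasure(30, 1, 1))  # 15.0
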
  • I've never fucked with vSANs before, but I thought of StarWind vSAN when reading this thread.

    https://www.starwindsoftware.com/starwind-virtual-san-free

  • @jlay said:
    For resilience, some of the storage is going to be lost, but how much depends on the topology and implementation choices.

    I have three nodes in my lab running GlusterFS with RDMA (Mellanox CX-3s); each one has two 5 TB drives. That's 30 TB of drives, 20 TB usable. With erasure-coded (dispersed) volumes like mine you get more usable space because it isn't replicating; replicas would use at least half of the space for redundancy.

    The trade-offs are performance and resiliency; the volume type and structure matter a lot. Think of it kind of like network RAID.

    Well, I bit the bullet and ordered 3 nodes, each with 4 x 2 TB NVMe, so 8 TB per node. I was expecting to net 8 TB, but from what you're saying it's probably more. If it's half the space (like you say, network RAID 10) I should perhaps get something like 12. What's great with Ceph is that I can just keep adding space and nodes as I go.

    I went for Mellanox as well, everything 40 GbE including the hypervisor machines. Also, I assume this is active/active, right? So all nodes are being used all the time and the load is divided over the different nodes.

    Also, you're running GlusterFS, but how is that different from Ceph? I did some digging and it seems you can't use block devices (iSCSI) with Gluster, right? So for virtualization one always has to go with Ceph.

    Anyway, can't wait to try it out and see how it performs :smile:

  • jlay Member
    edited March 2020

    @marvel said:
    Well, I bit the bullet and ordered 3 nodes, each with 4 x 2 TB NVMe, so 8 TB per node. I was expecting to net 8 TB, but from what you're saying it's probably more. If it's half the space (like you say, network RAID 10) I should perhaps get something like 12. What's great with Ceph is that I can just keep adding space and nodes as I go.

    I went for Mellanox as well, everything 40 GbE including the hypervisor machines. Also, I assume this is active/active, right? So all nodes are being used all the time and the load is divided over the different nodes.

    Also, you're running GlusterFS, but how is that different from Ceph? I did some digging and it seems you can't use block devices (iSCSI) with Gluster, right? So for virtualization one always has to go with Ceph.

    Anyway, can't wait to try it out and see how it performs :smile:

    That's some awesome gear! Look into using RDMA transport instead of TCP; it reduces latency a lot.

    The main thing I like about GlusterFS is that it's less centralized. I'm not too familiar with Ceph, but I believe it has some sort of central metadata system, whereas Gluster handles that cleverly with metadata on the inodes.

    Libvirt is able to use GlusterFS volumes as pools, with just normal qcow or raw files on top. It's live-migration friendly; I use it in my lab at home all the time.

    There was some project to do block devices with Gluster, but I'm not sure if it's still going.
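
    A minimal sketch of that libvirt setup, assuming the Gluster volume is already FUSE-mounted at a placeholder path (/mnt/gv0); the pool name and paths are made up, and libvirt also has native Gluster pool types, but a plain directory pool over the mount is the simplest way to get qcow2/raw files onto it:

        # Sketch: register a FUSE-mounted GlusterFS volume as a libvirt storage pool.
        # Requires the libvirt-python bindings; names and paths below are placeholders.
        import libvirt

        POOL_XML = """
        <pool type='dir'>
          <name>gluster-vms</name>
          <target>
            <path>/mnt/gv0/vm-images</path>
          </target>
        </pool>
        """

        conn = libvirt.open('qemu:///system')          # connect to the local hypervisor
        pool = conn.storagePoolDefineXML(POOL_XML, 0)  # persistent pool definition
        pool.setAutostart(1)                           # bring the pool up on host boot
        pool.create(0)                                 # activate it now
        print(pool.name(), pool.isActive())
        conn.close()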

  • @jlay said:
    That's some awesome gear! Look into using RDMA transport instead of TCP; it reduces latency a lot.

    The main thing I like about GlusterFS is that it's less centralized. I'm not too familiar with Ceph, but I believe it has some sort of central metadata system, whereas Gluster handles that cleverly with metadata on the inodes.

    Libvirt is able to use GlusterFS volumes as pools, with just normal qcow or raw files on top. It's live-migration friendly; I use it in my lab at home all the time.

    There was some project to do block devices with Gluster, but I'm not sure if it's still going.

    Right, I need to see how to enable that, but I think it's just a matter of installing the relevant RDMA packages and starting the service? I will look into Gluster as well :)

  • jlay Member

    @marvel said:
    Right, I need to see how to enable that, but I think it's just a matter of installing the relevant RDMA packages and starting the service? I will look into Gluster as well :)

    Setting up RDMA is pretty involved; I'd look to some guides for it. If I remember correctly, Red Hat had the best reference.

    For Gluster to take advantage of it, the volume needs both TCP and RDMA transport types listed (it's a volume option).
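
    A hedged sketch of what that volume option looks like at creation time; the hostnames and brick paths are placeholders, the disperse/redundancy counts are just an example layout, and it simply shells out to the gluster CLI:

        # Sketch: create a dispersed Gluster volume with both TCP and RDMA transports.
        # Hostnames and brick paths are placeholders; run from a node in the trusted pool.
        import subprocess

        bricks = [f"node{i}:/data/brick1/gv0" for i in (1, 2, 3)]

        subprocess.run(
            ["gluster", "volume", "create", "gv0",
             "disperse", "3", "redundancy", "1",
             "transport", "tcp,rdma", *bricks],
            check=True,
        )
        subprocess.run(["gluster", "volume", "start", "gv0"], check=True)
        subprocess.run(["gluster", "volume", "info", "gv0"], check=True)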
