
Building a New High Performance Virtualisation and Storage Cluster

randvegeta Member, Provider

In 2022, at least in Hong Kong, I'm looking to dramatically reduce the total footprint of our servers while increasing density and performance and simplifying management.

Our current infrastructure in HK has been built out over more than a decade. Many of our racks are still connected to the 'core' over only Gbit Ethernet. Much of our equipment is running on Xeon E5s and DDR3.

Of course we're not going to throw these things out, but it seems like it's time for some new gear, to try and replace our existing 'cloud' with the latest and greatest.

Our typical virtualisation nodes run dual Xeon E5s with 128GB RAM, while our storage nodes run single-socket E5s with 32-64GB RAM and 8x 3.5" hot-swap bays (some nodes are SSD, some HDD, and some hybrid), all connected together over 10G.

Anyone have experience with, and opinions on, more modern gear? I'm especially interested in 100Gbit networking. I'm trying to come up with a highly robust solution with few bottleneck points. 10G networking definitely felt like a bottleneck, but upping to 100G will probably reveal other bottlenecks elsewhere in the system.

The emphasis of the new system is density, but not at the cost of complexity. We're not limited by space, and running a Ceph/Virtuozzo Storage-style solution added considerable complexity, seemingly at a performance cost as well. Our current approach is to run numerous standalone storage nodes in RAID 10, with backup nodes and scheduled backups. At least for us, this turned out to be more reliable, faster, and easier (simpler) to manage. The cost, of course, is a great deal of disk waste and general excess redundancy. That said, I'm still willing to use such a model, as the reduced labour is probably worth the upfront cost. As such, I'm expecting a pretty high upfront cost to build the first system.
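For what it's worth, the "disk waste" trade-off can be put into rough numbers. The sketch below uses illustrative figures (an 8-bay node of 8 TB drives) and assumes Ceph's default 3x replication; a standalone RAID 10 node plus a dedicated backup node effectively halves the usable ratio again.

```python
# Back-of-envelope usable capacity under different redundancy schemes.
# Figures are illustrative, not an actual deployment.
def usable_tb(raw_tb, scheme):
    factors = {
        "raid10": 1 / 2,       # mirrored pairs: half the raw space survives
        "ceph_3x": 1 / 3,      # Ceph's default 3-way replication
        "ceph_ec_4_2": 4 / 6,  # 4+2 erasure coding: better ratio, more complexity
    }
    return raw_tb * factors[scheme]

raw = 8 * 8.0  # one 8-bay node of 8 TB drives = 64 TB raw
for scheme in ("raid10", "ceph_3x", "ceph_ec_4_2"):
    print(f"{scheme}: {usable_tb(raw, scheme):.1f} TB usable")
```

Erasure coding recovers efficiency, but it adds exactly the operational complexity being avoided here.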

Anyone with real world experience that can share some advice on building such a system?


Comments

  • I don't know what kind of answer you are expecting here, but if you are looking for advice from experience, then I suppose @Francisco is a pro at such stuff.

  • wdmg Member, LIR

    If you're looking to condense hypervisors and don't mind a fun setup...

    I would recommend some Supermicro servers with H11SSL-I revision 2 ("rev2") motherboards (supporting AMD EPYC 7001- and 7002-series CPUs), paired with a 32-core CPU such as the EPYC 7551P (only ~$200 on eBay) in a 1U. I'd recommend a minimum of 128 GB of memory per node; remember that AMD EPYC needs ECC RAM, though these days the cost difference isn't much, depending on your vendor.

    You can then slap a Mellanox ConnectX-6 in there if you want 100GbE, but realistically, without some performance tuning on the NIC, you won't fill the full pipe even under load, so I'm going to assume this is just to avoid single-node congestion like you'd experience with a 1G RJ45 port. Pair that with 4x 2TB SSDs plus an NVMe drive and it makes a decent build for <$3k a piece.

    I would recommend only going 40GbE on these nodes though; it's a bit better cost-wise. Like all builds, this one has a downside: unless you get a better chassis than the cheap 813s or similar, you're limited to a single PSU, but that can be mitigated with an automatic transfer switch (ATS).

    If you do need dual PSUs on your machines instead of rocking an ATS, you can stick with Supermicro and grab a suitable chassis, or explore other options. I've personally not had much luck with HPE's or Dell's EPYC machines in the interactions I've had, so I'd recommend other vendors... YMMV.

    Then connect this to something like a Cisco N9K or, if you love Arista (as I do), link the nodes directly into something like the DCS-7060CX-32S-R, which runs about $5K USD on eBay as of today (a mixture of 40G and 100G ports). You can get good optics from fs.com very affordably; grab the cabling and the optics from FS and link them together.

    Now, let's talk storage:

    I would highly recommend the Supermicro 4U chassis with 36 bays for 3.5" HDDs; a few months ago they could be had for ~$400 each (at least from my local vendor). Your mileage will vary today, but they shouldn't be overly pricey. Pair one with any decent Xeon CPU (e.g. E5-2650s or E5-2680s) and fill your drive bays.

    I've played with Ceph over the last little while, and while it is definitely good, it's not worth it unless you have at least 4-5 storage servers. If you have fewer, I'd suggest you look into GlusterFS -- you can always migrate one node at a time to Ceph later if you scale out enough. One perk of Ceph is that you get a full S3-compatible object storage endpoint in addition to the block storage, versus GlusterFS simply providing volume storage.

    If you do Ceph and use CephFS, you'll want multiple metadata servers (at least 2), as it really helps.

    Gluster does have native integration with libvirt, so you'll be fine in that area for block storage, though I'm not entirely sure how good it is when it comes to LXD (it's on my list to eventually try out).
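    For reference, a Gluster-backed disk in a libvirt domain is just a `<disk type='network'>` element. Here's a minimal sketch of generating one; the volume name, image name, and host are made-up placeholders:

    ```python
    # Build a minimal libvirt <disk> element for a GlusterFS-backed volume.
    # Volume, image, and host names below are hypothetical placeholders.
    import xml.etree.ElementTree as ET

    def gluster_disk_xml(volume, image, host, target_dev="vda"):
        disk = ET.Element("disk", type="network", device="disk")
        ET.SubElement(disk, "driver", name="qemu", type="qcow2")
        source = ET.SubElement(disk, "source", protocol="gluster",
                               name=f"{volume}/{image}")
        ET.SubElement(source, "host", name=host, port="24007")  # Gluster's default port
        ET.SubElement(disk, "target", dev=target_dev, bus="virtio")
        return ET.tostring(disk, encoding="unicode")

    print(gluster_disk_xml("gv0", "guest1.qcow2", "gluster1.example.com"))
    ```

    The element would go inside a domain's `<devices>` section; check the libvirt docs for the exact attributes your version supports.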

    Down to the cost of it all:

    So, for, say, 4 hypervisors (32-core 7551P, 128G memory, 1U single PSU with a Mellanox ConnectX-6 at 25GbE) you're looking at approximately $3,500/hypervisor (frankly, ~$1,500 of that will be the ConnectX-6, which is why I'd recommend something like a ConnectX-4, still only a few hundred), so $14,000 in hypervisors before shipping/taxes.

    Storage-wise, presuming you run all 8 TB disks with 36 bays and fill 4 storage servers: 8TB Seagate Barracudas are about $135/ea on Amazon, so you'd need 144 of them, which will be $19,440 before shipping/taxes. The storage units themselves with caddies are likely ~$600 each, so around $2,400 in storage servers of this model.

    Lastly, your switch. In this case I'd suggest the Supermicro; the 100GbE one is about $5,000 on eBay, though if you end up with just 40GbE, there are much cheaper (~$1K USD) pure-40GbE 32-port switches available.

    Your FS optics will likely run you another $2,000-3,000 for coded 100G and 40G optics, unless you buy your own coding box and DIY it.

    So, the final number for a complete upgrade: $43,840 USD, which does not account for shipping, duties/taxes/customs costs.
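    For anyone checking the arithmetic, the line items quoted above sum up as follows (same figures as above; shipping/taxes excluded):

    ```python
    # Recompute the quoted build cost from the per-item figures above.
    hypervisors   = 4 * 3500      # 4x EPYC 7551P nodes at ~$3,500 each
    disks         = 4 * 36 * 135  # 144x 8 TB Barracudas at $135 each
    storage_nodes = 4 * 600       # 4x 36-bay 4U chassis at ~$600 each
    switch        = 5000          # the 100GbE switch
    optics        = 3000          # upper end of the fs.com optics estimate

    total = hypervisors + disks + storage_nodes + switch + optics
    print(total)  # 43840
    ```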

  • servarica_hani Member, Provider

    I assume you are looking at 100G networking for the connections between racks.

    If you are buying new, then go for it; but if you are looking for savings, used 40Gbps networking is dirt cheap, and you can simply use LACP to run more than one cable between racks.

    In our setup, each rack is connected to the core rack using 4x 40Gbps (2x from each switch), which gives us redundancy as well.

    The core network rack has 2x 32-port 40Gbps switches. This setup allows us to connect up to 16 racks, each with a 160Gbps uplink and full network redundancy in every rack.
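    As a sanity check, the port and bandwidth budget of that topology works out like this (a sketch of the arithmetic only, not a config):

    ```python
    # Port/bandwidth budget for the described core: 2x 32-port 40G switches,
    # each rack uplinked with 4x 40G (2 links to each core switch).
    core_switches = 2
    ports_per_switch = 32
    links_per_rack_per_switch = 2
    link_gbps = 40

    # Total core ports divided by ports consumed per rack.
    racks = (core_switches * ports_per_switch) // (core_switches * links_per_rack_per_switch)
    uplink_gbps = core_switches * links_per_rack_per_switch * link_gbps

    print(racks, uplink_gbps)  # 16 160
    ```

    Losing one core switch still leaves each rack with 2x 40G, i.e. half the uplink rather than an outage.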

    Our racks are 10kW each, so I am not sure you want to go denser than that (usually, if you go beyond 10kW per rack, you need to change how your cooling works, with in-row or even rear-of-rack cooling).

    100Gb from the servers is overkill in my experience, as you will need top-of-the-line servers, and even with those I doubt you can run anything useful while sustaining the full 100Gbps.

    If you have more questions, please let me know, as our setup is considered dense and cheap :)

    Thanks

  • randvegeta Member, Provider

    40G and LACP should be fine. What kind of medium do you connect over? (Fiber/copper?)

    How cheap is "Dirt Cheap"? I'm using 10G now, so 40G is already 4X. If the NICs and switches are "cheap", I can always double up and make it 80G per node. I don't even need bonding. I'm happy with 'dumb' load balancing. Or even just using the 2nd interface as a backup, or for making backups.

    I will start with 1 rack. Actually 2 racks... the core, and 1 extra rack for the new 'cloud'. I'll roll out the fiber later for the other racks. It's easier to do in empty racks after all.

    Redundancy and resilience are my priority. A slow network actually inhibits the 'resilience' of a heavily loaded cluster.

  • key900 Member
    edited December 2021

    @randvegeta said:
    In 2022, at least in Hong Kong, I'm looking to

    Anyone with real world experience that can share some advice on building such a system?

    You will be disappointed with the performance of RAID 10 vs. clustered block storage unless you spend a lot of cash on it. So the right question is: how much are you willing to spend? Using 8 bays per server is not a good idea, since Ceph runs standalone disks with no RAID set, so you are required to use a load of disks to get decent speed, especially for IOPS.

  • wdmg Member, LIR

    @randvegeta said:
    40G and LACP should be fine. What kind of medium do you connect over? (Fiber/copper?)

    How cheap is "Dirt Cheap"? I'm using 10G now, so 40G is already 4X. If the NICs and switches are "cheap", I can always double up and make it 80G per node. I don't even need bonding. I'm happy with 'dumb' load balancing. Or even just using the 2nd interface as a backup, or for making backups.

    I will start with 1 rack. Actually 2 racks... the core, and 1 extra rack for the new 'cloud'. I'll roll out the fiber later for the other racks. It's easier to do in empty racks after all.

    Redundancy and resilience are my priority. A slow network actually inhibits the 'resilience' of a heavily loaded cluster.

    A decent 40G switch like the Arista DCS-7050QX-32-F (US $650.00 on ebay) would suffice. I'd say connect with OM4 MMF cable (good value, price), the optics (40G QSFP+) on fs.com are about $40 USD a piece.

  • key900 Member
    edited December 2021

    @randvegeta said:
    40G and LACP should be fine. What kind of medium do you connect over? (Fiber/copper?)

    How cheap is "Dirt Cheap"? I'm using 10G now, so 40G is already 4X. If the NICs and switches are "cheap", I can always double up and make it 80G per node. I don't even need bonding. I'm happy with 'dumb' load balancing. Or even just using the 2nd interface as a backup, or for making backups.

    I will start with 1 rack. Actually 2 racks... the core, and 1 extra rack for the new 'cloud'. I'll roll out the fiber later for the other racks. It's easier to do in empty racks after all.

    Redundancy and resilience are my priority. A slow network actually inhibits the 'resilience' of a heavily loaded cluster.

    Why use copper, when you can use Mellanox FDR 56Gb/s dual-port MCX354A network cards for your cluster? It will be private anyway, so just set up a dual connection into an SX6025 Mellanox 36-port unmanaged 56Gb/s FDR InfiniBand SDN switch: https://www.ebay.com/itm/SX6025-Mellanox-36-port-Unmanaged-56Gb-s-FDR-Infiniband-SDN-Switch-/254659533329?mkcid=16&mkevt=1&_trksid=p2349624.m46890.l6249&mkrid=711-127632-2357-0

  • randvegeta Member, Provider

    @wdmg said: A decent 40G switch like the Arista DCS-7050QX-32-F (US $650.00 on ebay) would suffice. I'd say connect with OM4 MMF cable (good value, price), the optics (40G QSFP+) on fs.com are about $40 USD a piece.

    Wow. That seems too cheap. I remember not long ago forking out thousands for 10g!

  • @randvegeta said:

    Wow. That seems too cheap. I remember not long ago forking out thousands for 10g!

    I think one thing's definitely for sure.

    Don't invest in networking gear (as an asset that is).

  • servarica_hani Member, Provider

    For us: we have a lot of copper FDR cables left over from the Ceph experiment we ran a few years ago. (It went very badly, and we moved to 12 servers with 6 disks each instead; but Ceph was new at the time, and I assume it is better now. I even have approx. 10 servers with 12 disks each and plan to rerun the experiment just to see how it goes.)

    But those cables are limited to 7m and are really thick, so in the last few racks I just used MTP fibre cables and SR4 optics (from fs.com or eBay).

    @wdmg said:

    A decent 40G switch like the Arista DCS-7050QX-32-F (US $650.00 on ebay) would suffice. I'd say connect with OM4 MMF cable (good value, price), the optics (40G QSFP+) on fs.com are about $40 USD a piece.

    Yes, that's the 40Gbps switch we use, and most of our 10Gbps switches are also Arista (they were really cheap a year ago, but for some reason they are double the price now).

  • AlexBarakov Member, Provider

    @servarica_hani said: (they were really cheap 1 year ago but for some reason they are double the price now )

    Everything new is essentially out of stock, and wait times are months; one guy told me he'll be waiting a year for a Cisco. That's driving used/grey-market prices up as well. I am currently looking to add another MX204 for an edge location and I can't even get ETAs for delivery. They also seem significantly more expensive than last year.

  • randvegeta Member, Provider

    @AlexBarakov said:

    @servarica_hani said: (they were really cheap 1 year ago but for some reason they are double the price now )

    Everything new is essentially out of stock, and wait times are months; one guy told me he'll be waiting a year for a Cisco. That's driving used/grey-market prices up as well. I am currently looking to add another MX204 for an edge location and I can't even get ETAs for delivery. They also seem significantly more expensive than last year.

    I've noticed. I remember getting a used MX204 for <$3K. I was being asked $20K for a used one only a couple of weeks ago. Massive difference.

  • laoban Member

    @wdmg said:
    A decent 40G switch like the Arista DCS-7050QX-32-F (US $650.00 on ebay) would suffice. I'd say connect with OM4 MMF cable (good value, price), the optics (40G QSFP+) on fs.com are about $40 USD a piece.

    It is InfiniBand, right? Any drawbacks to using InfiniBand for a Ceph deployment?

  • servarica_hani Member, Provider

    @laoban said:

    @wdmg said:
    A decent 40G switch like the Arista DCS-7050QX-32-F (US $650.00 on ebay) would suffice. I'd say connect with OM4 MMF cable (good value, price), the optics (40G QSFP+) on fs.com are about $40 USD a piece.

    It is InfiniBand, right? Any drawbacks to using InfiniBand for a Ceph deployment?

    We used InfiniBand on our Ceph servers (IPoIB).

    It worked well for communication between Ceph nodes, like rebalancing and so on.

    We had a lot of issues maintaining driver support on our XenServer hosts, since it is not supported out of the box; we had to recompile the driver after each kernel update, which was a lot of work.

    Our new test build will be fully Ethernet.

  • @servarica_hani said:
    We used InfiniBand on our Ceph servers (IPoIB).
    It worked well for communication between Ceph nodes, like rebalancing and so on.

    I read on the Proxmox forum that someone said 56Gb IPoIB performance was atrocious, less than 10GbE. Can you share your experience, please?

  • servarica_hani Member, Provider

    @laoban said:

    @servarica_hani said:
    We used InfiniBand on our Ceph servers (IPoIB).
    It worked well for communication between Ceph nodes, like rebalancing and so on.

    I read on the Proxmox forum that someone said 56Gb IPoIB performance was atrocious, less than 10GbE. Can you share your experience, please?

    When the cluster was rebalancing, I did see 3GB/s, and a few times 4GB/s, so it was utilising the IB really well.

    We initially tried to have the client side use IB as well, and it was the worst; then we switched the client side to Ethernet (the reason for the bad performance is that XenServer has no native support for it).
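    For context on those rebalance numbers: FDR InfiniBand signals at 56Gb/s but uses 64/66b line encoding, so the usable data rate is roughly 54Gb/s. A rough conversion (assuming a single FDR port carries the traffic):

    ```python
    # Rough utilisation of one FDR InfiniBand link during Ceph rebalancing.
    signal_gbps = 56
    data_gbps = signal_gbps * 64 / 66  # 64/66b encoding -> ~54.3 Gb/s usable

    for observed_gb_per_s in (3, 4):   # the rebalance rates reported above
        gbps = observed_gb_per_s * 8
        print(f"{observed_gb_per_s} GB/s = {gbps} Gb/s, "
              f"~{gbps / data_gbps:.0%} of usable FDR bandwidth")
    ```

    So 3-4 GB/s works out to roughly half the link, which is respectable for IPoIB given the disks and CPU also have to keep up.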

  • randvegeta Member, Provider

    Found an interesting video on YouTube showing that StarLink actually still works in pretty heavily wooded areas.

    Earlier in the video it shows that even in extremely obstructed areas, the dish can still pick up a signal, making it useful on the go. I mean, they were getting 20 minutes of downtime per hour, but apparently the remaining 40 minutes was nice and fast, so it's useful if it's your only option for connectivity in the middle of nowhere.

  • @randvegeta said:

    I think you posted this in the wrong thread, chief.

  • @HalfEatenPie said:

    @randvegeta said:

    I think you posted this in the wrong thread, chief.

    At least in his own thread. :D

  • randvegeta Member, Provider

    @HalfEatenPie said:

    @randvegeta said:

    I think you posted this in the wrong thread, chief.

    You're right.

    Silly me. This was supposed to be on my StarLink thread. :s
