50-100TB video storage cluster
Hey.
I need a cost-effective & reliable solution.
50-100TB storage.
Only large (10-25GB) video files that will be streamed.
100% legit stuff, no "dmca ignored" needed.
No need for high availability, but it would be a nice benefit.
It needs to have a good network to Poland.
My current idea is to get 4x Hetzner auction servers (i7-3770, 32GB RAM, 4x 8TB) within the same DC.
That will give me 32TB per server, 128TB total.
80-90TB usable would be great, that's why I'm looking at erasure coding instead of RAID10. RAID5 isn't a great idea for such an array because of rebuild times, right?
Now, which would be the best solution for the cheapest, but still reliable, cluster?
- Proxmox cluster + CEPH + erasure coding
- Proxmox cluster + GlusterFS + erasure coding
- Proxmox + ZFS simple master/slave (only 50% usable)
- One Proxmox + connected Minio cluster + erasure coding
- Something else?
These servers at Hetzner will have just a 1GbE link, so I'm not sure if a distributed FS is the right thing to do.
I don't have a hard budget limit, but let's say it's 100-200 euro/mo. It's more of a fun project for me; I want to check if it's possible to build a cheap-ass storage cluster that is highly reliable and provides good enough performance.
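Quick back-of-envelope on what the 4x 32TB plan actually yields as usable space under different redundancy schemes (the erasure-coding profiles below are illustrative assumptions, not a recommendation for a specific Ceph/MinIO config):

```python
# Usable capacity for 4 servers x 4x 8TB = 128TB raw, under
# replication vs a couple of example erasure-coding profiles.

RAW_TB = 4 * 4 * 8  # 128TB raw

def ec_usable(raw_tb, data_shards, parity_shards):
    """Usable capacity with erasure coding: data / (data + parity)."""
    return raw_tb * data_shards / (data_shards + parity_shards)

replicated = RAW_TB / 2              # RAID10 / 2x replication
ec_2_2 = ec_usable(RAW_TB, 2, 2)     # survives any 2 lost shards
ec_3_1 = ec_usable(RAW_TB, 3, 1)     # one shard of redundancy

print(f"2x replication: {replicated:.0f}TB usable")  # 64TB
print(f"EC 2+2: {ec_2_2:.0f}TB usable")              # 64TB
print(f"EC 3+1: {ec_3_1:.0f}TB usable")              # 96TB
```

So hitting 80-90TB usable on 128TB raw basically forces something like a 3+1 profile, i.e. only one shard of redundancy across the 4 hosts.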
Comments
There are 40 euro 40TB servers as well!
So 160TB for 160 euro. Perfect, now I just need a solution to get 100TB of usable and reliable storage on it.
There's no SSD cache and just 1Gbit, that's why I don't know if these solutions are good.
For reliability and even usability it is much easier to have one server with many disks.
Another thing you need is at least 25% parity overhead to be safe
So if you go with ZFS RAIDz2 you will need 8-disk groups with 2 disks for parity (we do 6 disks with 2 for parity, but you can live with 8).
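The parity overhead for those group widths works out like this (2 parity disks per raidz2 group in every case):

```python
# Parity overhead for raidz2 vdevs of different widths, each group
# always giving 2 disks to parity.

def raidz2_parity_fraction(width):
    """Fraction of a raidz2 vdev's raw space that goes to parity."""
    return 2 / width

for width in (6, 8, 10):
    print(f"{width}-disk raidz2: {raidz2_parity_fraction(width):.0%} parity")
# 6-disk: ~33%, 8-disk: 25%, 10-disk: 20%
```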
Management of a single server is always much easier than a Ceph cluster (honestly, 100TB usable is too low to justify a Ceph cluster).
From the tests that we posted on reddit, ZFS performed much better than Ceph (we tested 72 disks in a single server as ZFS and 72 disks across 5 servers as a Ceph cluster).
Note: we have years of experience with ZFS and its optimization, while Ceph is new to us, so I assume that with better tuning Ceph can perform better, but I don't think it can reach ZFS levels.
Ceph makes sense once you reach a certain level. For example, for us, with more than 1k disks it started to be painful to manage all of that as individual servers, and that's why we are interested in Ceph.
If you check out the Hetzner server auction, there are servers at 140.70 and 144.70 euro/month with 160TB of HDD and 2x 960GB NVMe drives in a single server.
https://www.hetzner.com/sb?hddcount_from=0&hdd_from=16000&hdd_to=17000
Hmmm... maybe going distributed with just 32TB per node is not worth it like you say...
Any experience with RAIDz2 vs erasure coding?
We did it.
Compared to mirrors, which are faster, and the results were highly in favour of ZFS.
https://www.reddit.com/r/DataHoarder/comments/up9tiu/servarica_distributed_storage_test_series/
check the first 2 parts as well if you want to see the full process
https://www.reddit.com/r/DataHoarder/comments/tb69gv/servarica_distributed_storage_test_series_ceph/
https://www.reddit.com/r/DataHoarder/comments/tf73wt/servarica_distributed_storage_test_series_part2/
Gone from Hetzner auction already. Now $50 per 4x10TB.
If it is a 100TB cluster for streaming >10GB video clips, the "unlimited" bandwidth may be a potential issue.
Stream? What are your traffic requirements though? You need to be sure of that if you are going with Hetzner; it appears there is some undefined FUP, and with such large files being streamed you may hit that FUP if you have a lot of traffic.
I would also suggest Andy10gbit and WalkerServers for large storage solutions. Andy appears expensive on the surface, but PMing him on Discord and asking nicely may (depending on stock) get you a good discounted deal (from my own humble experience with him).
euh... but having 4 servers ==> 4x 1Gbps ports, vs 1 huge server with 1Gbps. Better network capacity with multiple servers.
But if that 1 server is 5/10 gbps that is a different story
Depending on your knowledge and experience regarding distributed file systems I would also consider a simple setup without Ceph, GlusterFS etc.
Instead you could simply store the files across all your servers, maintain a simple index which file is stored on which server and use this for redirecting requests to the correct server.
For example srv.example.com/foo redirects to srv1.example.com/foo, srv.example.com/bar to srv42.example.com/bar and so on.
Avoiding distributed file systems could make maintenance easier and the whole setup more flexible, as servers could be placed in any datacenter.
You might also abstain from any local redundancy like RAID or RAIDz if instead you save each file on two (or three) servers in different datacenters.
Then you could also use the redirector service as a load balancer.
In my personal opinion this sounds more robust, reliable, flexible and scalable.
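A minimal sketch of that index + redirect idea (hostnames, paths and the index contents are made-up examples; a real index would live in a small database and be updated on upload):

```python
# Tiny redirector: a map from file path to the servers holding it,
# and an HTTP handler that answers with a 302 to one of them.

from http.server import BaseHTTPRequestHandler

# Each file is stored on two servers, so either one can serve it.
INDEX = {
    "/foo.mp4": ["srv1.example.com", "srv2.example.com"],
    "/bar.mp4": ["srv2.example.com", "srv3.example.com"],
}

def redirect_target(path, index=INDEX):
    """Pick a server holding `path` and build the redirect URL."""
    servers = index.get(path)
    return f"https://{servers[0]}{path}" if servers else None

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = redirect_target(self.path)
        if target is None:
            self.send_error(404)
            return
        self.send_response(302)            # temporary redirect
        self.send_header("Location", target)
        self.end_headers()

# To serve: http.server.HTTPServer(("", 8080), Redirector).serve_forever()
```

Picking `servers[0]` every time is the dumbest possible balancing; rotating through the list or checking server health would spread load properly.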
Ceph will use part of that 1Gbps for its internal communication,
and during rebalancing it will use all of it.
Also, if you can max out the 1Gbps, that's not bad for 100TB of usable data, but more is better.
I can assure you, you can max out a 1Gbps line 24/7 with only 512GB of content.
Which is why I'm asking OP to think about his traffic as well, not only storage. (Not a priority, but worth thinking about.)
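To put a number on it, here is how much a fully saturated 1Gbps port pushes in a month, against a 250TB fair-use cap (treat both figures as rough):

```python
# Monthly traffic of a 1Gbps port running flat-out, vs a 250TB FUP.

GBPS_BYTES = 1e9 / 8              # 1Gbps in bytes per second
SECONDS_PER_MONTH = 30 * 24 * 3600

monthly_tb = GBPS_BYTES * SECONDS_PER_MONTH / 1e12
print(f"1Gbps flat-out is about {monthly_tb:.0f}TB/month")  # ~324TB > 250TB cap
```

So a single busy port can blow past the cap with room to spare; traffic planning matters as much as the storage layout.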
Is it iptv stuff with vod? It is not innocent...
Stream archive platform for several top streamers & project archiving for YouTubers.
Access will be available for patrons & editors which are making "shorts" for YouTube & TikTok etc.
That's why I say it's more of a "fun project" than something that needs to be available 24/7. It's not business critical. Some downtime is fine, losing all the data not so much xD
I know about the Hetzner drama; if I get near this 250TB limit I will just add a Hetzner Cloud VPS (internal traffic is not counted).
50-100TB of storage is not needed at this moment (eventually it will be, but idk how fast it will fill up). The sponsor (a streamer) of this project wants to pay for a whole year and not think about limits or scaling. Just use it (for his editors & supporters) and give upload access to his Twitch & YouTube friends. So many people are pissed off at some recent events on Twitch & YouTube; long story, and even I don't know it fully as they have some NDAs etc.
Based on suggestions from here... What about this:
2 Big servers
Xeon E3-1275V6
64GB ECC
4x 16TB (64TB per server)
1Gbps, 250TB FUP
And then set up a filesystem with erasure coding (AFAIK ZFS doesn't support it)?
I don't think that going RAID-5 on 16TB drives is a good idea; rebuild times are going to be looong, with too high a risk of losing data.
I would have two pools of data, one per server, like @dfroe suggested. That would theoretically give me 2Gbps and 500TB of traffic. Of course scaling won't be that perfect, but it will still be a major improvement over a single 1Gbps server.
Or... maybe... single server is better idea? Then I would get:
1 Mega big server
Intel Xeon W-2145
128GB RAM
10x 16TB (160TB total) + 2x 960GB NVMe
1Gbps, 250TB FUP.
I think two servers is the better way; happy to hear your suggestions!
@servarica_hani thank you very much for your input, your reddit posts gave me nice info
Mentally strong people use YOLORAID.
Buy top-tier hard drives and they won't fail.
I'm glad @dfroe said that (assuming no edits to his post).
Anyway, I was wondering: why ZFS? Are you skipping HW RAID? (Which people normally skip, or put the disks behind an HBA, when using ZFS.)
Don't know if you heard about Linus (the media content creator, I think) who lost data due to using ZFS; worth a look so you don't make the same mistake.
Then I was reading that ZFS needs scrubbing (periodically, recurring every X time) to make sure all files are intact and not sitting in bad sectors / corrupted. It also seems that ECC RAM is crucial for this to work properly and to avoid potential issues.
I'm a fan of simplicity: HW RAID or mdadm + monitoring. But I understand simplicity is not enough for all scenarios, and there are cases where ZFS, Ceph, etc. are the way to go due to customizability (if that's a word lol).
Would love to hear more from @servarica_hani as it seems they had proper experience with ZFS.
I think we had some threads about ZFS experience here on LET; and of course you'll find them on reddit (datahoarder).
I've personally run ZFS on FreeBSD and Linux for more than a decade, and it's one piece of code that I am really happy exists. You need to know a bit about how it works in order to use it right, but it's a real pleasure to handle and maintain ZFS pools. I started with ZFS on FreeBSD but meanwhile also use it on all my Linux servers.
However, ZFS is something to be used locally on one machine. Of course you can do certain helpful jobs with zfs send/receive and snapshots, but it is no distributed file system.
And regarding simplicity: yes, ZFS has features like RAIDz(2), integrated checksums, scrubbing, advanced RAM caching etc., and relying on ECC RAM is a must if you care about your data. But that's no disadvantage of ZFS at all; with other file systems you simply do not know when bit rot occurs.
Where, if you don't mind me asking? Just checked the Hetzner auctions and didn't see that; don't think I have ever noticed it.
There is a lot of misinformation about scrubbing out there.
Scrubbing matters most when data is not accessed regularly. For example, if you have a backup server that you write data to once and may restore from a few years in the future, scrubbing is really important, as data can rot while at rest.
But for active data that is regularly accessed (at least once per year) you usually don't need scrubbing, as the same checking is done whenever you read the data.
Another thing that lowers the need for scrubbing is having RAIDz2: even with bit rot, you would need blocks in the same position on 3 disks of the group to be corrupted before you have data loss, which is statistically very unlikely.
Finally, having proper enterprise disks makes a huge difference, as their BER rating is usually 10x less likely to produce a bit error than consumer disks, which adds to the reliability of the system.
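To see just how unlikely that "3 corrupted blocks in the same position" event is, here is a back-of-envelope binomial calculation (the per-block probability `p` is a made-up illustrative number, not a real drive spec):

```python
# Odds of actual data loss from bit rot in one 8-disk raidz2 group:
# the same block would have to be corrupted on 3 or more disks at once.

from math import comb

def prob_at_least(k, n, p):
    """P(at least k of n independent events, each with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 1e-9                       # assumed chance a given block has rotted on a disk
print(prob_at_least(3, 8, p))  # on the order of 5.6e-26: statistically negligible
```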
Note: ECC RAM is a must with ZFS; don't do ZFS without ECC.
From our experience ZFS is years ahead of any hardware RAID solution. It is an open-source solution with whole communities to support you in case of issues, and it does everything to maintain your data.
@servarica_hani thanks for the reply.
Do you have time to elaborate on this: "it is years ahead of any HW RAID solution"?
What advantages do you see in comparison to just having a proper (modern) HW RAID solution?
Edit:Edit:Edit... typos.
I will use @dfroe's answer here.
I will add the following example
In case you need to build a big array of 60 disks:
Using hardware RAID you are forced to do RAID 5, 6, 10, 50 or 60, or to have more than one array with LVM gluing them together, which is a band-aid solution, as you get 2 layers of overhead between you and your disks.
Those options either waste too much space (RAID 10, 50, 60)
or have too little parity (RAID 5 and 6).
In ZFS, you can divide your disks into groups, each with 2 parity disks, so you get one filesystem with 6 groups of 10 disks, where each group uses 2 of its 10 disks for parity.
Add to that: when one disk fails, the rebuild only touches that group, not the whole array, while in RAID 5, 6, 50 or 60 either the whole array or half of it has to be reread.
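The rebuild difference in numbers, for a hypothetical 60x 10TB array (disk size is an example figure):

```python
# Data that must be read to rebuild one failed 10TB disk: only the
# surviving disks of the affected raidz2 group, versus (almost) every
# disk of one big traditional array.

DISK_TB = 10
TOTAL_DISKS = 60
GROUP_WIDTH = 10   # 6 raidz2 groups of 10 disks each

raidz2_rebuild_read = (GROUP_WIDTH - 1) * DISK_TB   # rest of the group
raid6_rebuild_read = (TOTAL_DISKS - 1) * DISK_TB    # the whole array

print(f"raidz2 group rebuild reads ~{raidz2_rebuild_read}TB")
print(f"single RAID6 rebuild reads ~{raid6_rebuild_read}TB")
```

And in practice ZFS resilvers only allocated blocks, so the real number on a part-full pool is usually lower still.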
ZFS has integrated checksums, which means it has a native way to find and fix bit rot: it checks a block's checksum when it reads it and will immediately fix it if something is wrong.
Add to that layered caching, which has only just started to appear in some hardware RAID solutions.
Another aspect is the excellent metrics we get out of each disk in the system, and the ability to monitor the smallest details of disk operations, all of which would be hidden behind the HW RAID device if we went the hardware RAID route.
This is just what I remembered right now, but the list goes on.
Thanks for help guys!
In the end we decided to go with 3 servers.
One AX41 (Ryzen 5 3600, 64GB RAM, 2x 512GB NVMe)
Two auction servers (Xeon E5-1650V3, 128GB, 10x 10TB Enterprise HDD)
AX41 will act as main server (getting streams from streamers, hosting website, discord bot etc.)
Two auction servers will be setup as cdn1.* and cdn2.* and they will just have video files.
With that solution we can stream videos at up to 2Gbps (if we have perfect load balancing between the servers) and have a separate server which makes sure that navigation / stream downloading stays perfect even if the two storage servers are under heavy load.
I'd still like to hear your opinions on RAIDz2 on ZFS vs. erasure coding on another filesystem. Share your experiences!
Edit: I just added a "480GB SSD Datacenter Edition" to these storage servers to have a separate boot drive. Plenty of space left; maybe I can use ~300GB of it to help the HDDs? I know ZFS can cache on SSD and RAM (two layers), good idea?
Choosing between RAID-6 and RAIDz2 (ZFS) is probably a question of personal taste.
Building a software RAID-6 follows a strict layered approach, and each layer is quite easy to understand. The RAID-6 itself is built with mdadm, and on top of that you put a filesystem of your trust (e.g. ext4 or XFS). You may add LVM in between if it adds some benefit.
ZFS, on the other hand, combines these three layers (RAID, LVM, filesystem). There are advantages and disadvantages; I learned to love the advantages.
Another advantage of RAIDz besides what's already mentioned: for a successful rebuild, each individual block only has to be readable from at least one disk, while with traditional RAID all remaining disks must be fully readable. That reduces the chance of losing the whole array to hardware failure.
Speaking for ZFS (which I personally would use), a RAIDz2 with 10 disks will perform pretty well.
The best performance with RAIDz (the most efficient distribution across disks) is achieved with 2^n data disks plus parity (which holds for 8+2 disks in RAIDz2).
In theory you get read performance of 8x a single disk, and write performance limited by the weakest disk. That should be fine for your use case, as you will read much more data than you write.
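A quick sanity check of that theoretical read figure against the 1Gbps NIC (the per-disk throughput is an assumed typical value for 7.2k HDDs, not a measurement):

```python
# Theoretical sequential read of an 8+2 raidz2 versus a 1Gbps NIC.

DISK_READ_MBS = 180        # assumed sequential read per HDD, MB/s
DATA_DISKS = 8             # 10-disk raidz2 = 8 data + 2 parity
NIC_MBS = 1000 / 8         # 1Gbps is roughly 125 MB/s

pool_read = DATA_DISKS * DISK_READ_MBS   # ~1440 MB/s in theory
bottleneck = min(pool_read, NIC_MBS)     # the NIC, by a wide margin
print(f"pool ~{pool_read} MB/s, NIC ~{NIC_MBS:.0f} MB/s, limit {bottleneck:.0f} MB/s")
```

In other words, the network is the bottleneck long before the disks are, which supports the "simple single-box ZFS" approach over a distributed setup here.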
Having a large amount of RAM will be beneficial for ARC (ZFS RAM cache) which will work out of the box.
There is also L2ARC, which can optionally be configured as an SSD cache, but intuitively I doubt this would significantly boost your actual end-user performance.
You can add an L2ARC without any disadvantages, but it might not make anything better, either. You will have to test and benchmark under real-life conditions.
A separate SLOG on redundant (not single) SSDs can improve ZFS's weak performance during synchronous writes, but in your use case you will mostly have asynchronous writes, which always go into RAM first, so I wouldn't configure a (separate) SLOG.
TL;DR: RAIDz2 with 10 disks and a large amount of RAM should be a good choice for delivering video files.
For now I've gone with BTRFS as it's supported out of the box with Hetzner images, no extra hassle.
I'll try to install ZFS on second box and compare them. Could you share best practices to do it?
On BTRFS box I've chosen Ubuntu
I wouldn't call it a 'best practice', but in Linux environments it can be easier for example to use the first 100 GB of each disk traditionally for the operating system (for example with mdadm RAID, LVM, etc., whatever you are used to) and then create a partition with the remaining 9+ TB for the ZFS pool. It makes no difference whether you create the ZFS pool on the disks or partitions of the disks. This way you are more flexible to reinstall the linux os keeping the zpool untouched so you can import it again.
If you're looking for documentation, there are tons of it to be found on the internet.
I have a few Hetzner servers with a private 10Gb switch; you could use that for the Proxmox idea.
I don't remember exactly, but the switch was ~40 euro, and per server it was about 8 euro for the 10Gb card, I guess.
I've already setup btrfs like this.
200GB for the OS on ext4 (ext4 should provide better performance, from what I saw online).
The rest (around 74TB after RAID-6) is BTRFS, mounted as /home/.
I'll try to do the similar thing with ZFS.
I need to study how LVM/ZFS works or try to do it with Proxmox.
I've tried to setup ZFS on it, but failed.
Tried everything I could think of.
QEMU (with virtio to enable more than 4 disks) + Proxmox VNC installation got me "almost working": it's stuck at 99% installation progress on "make bootable drive". htop shows there is still around 8% CPU usage by qemu, and RAM usage is going up and down, but it's still stuck at 99%.
I saw online that it is a bug with the floppy drive, so I'll try Q35 QEMU now; maybe that will fix it.
Do you guys have method how I can setup ZFS with RAID-Z2 on 10 disks where system is on that ZFS array?
Debian, Proxmox, Ubuntu... I dont care.
I've thought about adding an SSD for boot+OS, but the Hetzner upgrade option for the SX132 is not good for me: I can't install a SATA SSD or NVMe SSD, I can only install "NVMe Datacenter" and only in high capacities (wtf?); I can't choose a small capacity like 480GB. It becomes unnecessarily expensive, and the point of this project is to be as cheap as possible without compromising on reliability; speed is already good enough.
Maybe someone has an idea how to do it. I've wasted 8h trying to do it xD
I saw online that it is a bug with the floppy drive, so I'll try Q35 QEMU now; maybe that will fix it.
Why?
Ask them to connect IPMI to your server; you will have it for 3 hours, and you can mount
real ISOs and have real VNC directly from UEFI on down. Why complicate it with QEMU?