About your experience with HA storage (ceph)

FlorinMarian Member, Host Rep

Hello, dears!

In my search to increase customer satisfaction and offer new services to the general public, I have been playing with k8s / GitLab DevOps tools, and soon the DevSecOps side as well.

I would really like to move my services (the hazi.ro website and related services such as the database, DNS, mail, NextCloud, GitLab and others) to a production-ready k8s cluster, but I need more opinions on the most efficient way to distribute the storage so that both data integrity and availability are ensured.

I find the idea of replicating the storage on 4x RAID0 SSDs across 3 different servers through the CephFS offered by Proxmox very tempting, but I have no knowledge of it at all in terms of performance/reliability.

Can anyone who has used/researched it provide a relevant opinion?

I would avoid the simple option of a single NAS serving the entire k8s cluster, because with only the 6 disks each physical node allows, it does not offer the performance I need.

Thank you!


Comments

  • djunior Member
    edited April 12

    Hello Florin,

    Ceph uses a lot of resources and network bandwidth. The three servers need at least one 10 Gbit private port and another port for public access.

    Three different servers, each with 4x SSDs, is your idea, correct? Don't use any RAID; Ceph handles that.

    What kind of drives are you going to use? Proxmox is a great thing to use for clustering and also for CEPH, but what is your expected performance, and what are your exact configurations? Because if you don't do this the right way, performance is going to be horrible.
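
    For what it's worth, a minimal sketch of how that separation usually looks on a Proxmox node: the SSDs are handed to Ceph raw (one OSD per disk, no RAID underneath) and the replication traffic stays on its own 10 Gbit network. The subnets and device names below are placeholders, and older Proxmox releases use the createmon / createosd spellings instead of the subcommand form:

    # 10.10.10.0/24 = public (client) network, 10.10.20.0/24 = cluster (replication) network (example values)
    pveceph init --network 10.10.10.0/24 --cluster-network 10.10.20.0/24
    pveceph mon create                    # run on each of the three nodes
    pveceph osd create /dev/sdc           # one OSD per raw SSD, no RAID controller in between
    pveceph osd create /dev/sdd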

    Thanked by 1: FlorinMarian
  • FlorinMarian Member, Host Rep

    @djunior said:
    Hello Florin,

    Ceph uses a lot of resources and network bandwidth. The three servers need at least one 10 Gbit private port and another port for public access.

    Yes, I have (at least) 3 nodes with 2x RJ45 + 2x SFP+ each.

    Three different servers, each with 4x SSDs, is your idea, correct? Don't use any RAID; Ceph handles that.

    What kind of drives are you going to use? Proxmox is a great thing to use for clustering and also for CEPH, but what is your expected performance, and what are your exact configurations? Because if you don't do this the right way, performance is going to be horrible.

    All nodes have 6x Intel S3510 1.2TB, where I intend to use RAID1 across two disks for the OS itself and the other 4 drives via Ceph.
    Performance is expected to be like those 4 disks in RAID0.

    Thank you!

  • user123 Member
    edited April 13

    IIRC, ZXPlay migrated their servers to CEPH several years ago and lost customer data. Many customers who had prepaid for longer duration plans got screwed. You can tell I'm still totally not bitter about it.

    Thanked by 1: FlorinMarian
  • FlorinMarian Member, Host Rep

    @user123 said:
    IIRC, ZXPlay migrated their servers to CEPH several years ago and lost customer data. Many customers who had prepaid for longer duration plans got screwed. You can tell I'm still totally not bitter about it.

    Hey!
    Thanks for the reply!
    It is quite obvious that you can lose your data if you play dangerously.
    Likewise, RAID10 does not help you if the same partition of the intact disk dies during the synchronization of the replaced disk.
    It's a dangerous game; that's why daily backups will remain in place even afterwards, not just now, because as some people here have said, "RAID1 is not a backup!"

  • RapToN Member, Host Rep

    @FlorinMarian said:

    @user123 said:
    IIRC, ZXPlay migrated their servers to CEPH several years ago and lost customer data. Many customers who had prepaid for longer duration plans got screwed. You can tell I'm still totally not bitter about it.

    Hey!
    Thanks for the reply!
    It is quite obvious that you can lose your data if you play dangerously.
    Likewise, RAID10 does not help you if the same partition of the intact disk dies during the synchronization of the replaced disk.
    It's a dangerous game; that's why daily backups will remain in place even afterwards, not just now, because as some people here have said, "RAID1 is not a backup!"

    Ceph is great and anyone who has sufficient infrastructure to run a Ceph cluster should do so.

    However, there is a reason why we were in a test phase for a long time until we were satisfied with our ceph cluster and could be sure that we would know how to handle it in the event of a disaster.
    Compared to a SW or HW RAID, ceph is quite complex.

  • hyperblast Member

    my experience

    fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    ---------------------------------
    Block Size | 4k            (IOPS) | 64k           (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
               |                      |
    Block Size | 512k          (IOPS) | 1m            (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    

    with ceph:

    fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    ---------------------------------
    Block Size | 4k            (IOPS) | 64k           (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
               |                      |
    Block Size | 512k          (IOPS) | 1m            (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    
    
  • FlorinMarian Member, Host Rep

    @hyperblast said:
    my experience

    > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    > ---------------------------------
    > Block Size | 4k            (IOPS) | 64k           (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    > Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    > Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
    >            |                      |
    > Block Size | 512k          (IOPS) | 1m            (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    > Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    > Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    > 

    with ceph:

    > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    > ---------------------------------
    > Block Size | 4k            (IOPS) | 64k           (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    > Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    > Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
    >            |                      |
    > Block Size | 512k          (IOPS) | 1m            (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    > Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    > Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    > 
    > 

    Amazing.
    What config, if you don't mind? :smile:

  • sibaper Member
    edited April 13

    @FlorinMarian said: I would really like to move my services (the hazi.ro website and related services such as the database, DNS, mail, NextCloud, GitLab and others) to a production-ready k8s cluster, but I need more opinions on the most efficient way to distribute the storage so that both data integrity and availability are ensured.

    Move to Kubernetes? Don't, just don't. It's easy to manage and maintain if you're using managed services, but rolling your own will become a burden sooner or later.

    This is coming from someone who manages Kubernetes for a living.

  • FlorinMarian Member, Host Rep

    @sibaper said:

    @FlorinMarian said: I would really like to move my services (the hazi.ro website and related services such as the database, DNS, mail, NextCloud, GitLab and others) to a production-ready k8s cluster, but I need more opinions on the most efficient way to distribute the storage so that both data integrity and availability are ensured.

    Move to Kubernetes? Don't, just don't. It's easy to manage and maintain if you're using managed services, but rolling your own will become a burden sooner or later.

    This is coming from someone who manages Kubernetes for a living.

    Hey!
    I also do Kubernetes through the jobs I've had and will have (DevOps -> DevSecOps transition in progress).
    HAZI.ro has been doing this since day one; for 3 years it has been training me to be a valuable employee. If it didn't teach me so many things, it wouldn't even deserve to exist anymore, because the profit is minor or non-existent. But having changed jobs three times in the last 2 years and 2 months, my current salary is ~4 times higher than the first, even though they have all been DevOps Engineer positions.

  • vsys_host Member, Patron Provider

    Kubernetes was created to make CI/CD deployments easier. The mentioned services like the database, DNS, and mail are definitely not services that need to be deployed often. I'm not sure even WHMCS will be updated more than a few times per year. Putting non-microservice infrastructure into Kubernetes just because you can is not best practice.

  • FlorinMarian Member, Host Rep

    @vsys_host said:
    Kubernetes was created to make CI/CD deployments easier. The mentioned services like the database, DNS, and mail are definitely not services that need to be deployed often. I'm not sure even WHMCS will be updated more than a few times per year. Putting non-microservice infrastructure into Kubernetes just because you can is not best practice.

    Hey!
    Certainly, the purpose of migrating to containers is not necessarily related to performance (although failover wouldn't hurt), but to keep my DevOps-related skills in practice, something I completely lack in my day-to-day job.

    Thanked by 1: emgh
  • Athorio Member, Patron Provider

    We're running Ceph as our main storage backend for our VPS offering on 40G nodes; if there's something specific you want to know, just ask.

    Thanked by 1: FlorinMarian
  • Athorio Member, Patron Provider

    @hyperblast said:
    my experience

    > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    > ---------------------------------
    > Block Size | 4k            (IOPS) | 64k           (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    > Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    > Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
    >            |                      |
    > Block Size | 512k          (IOPS) | 1m            (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    > Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    > Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    > 

    with ceph:

    > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    > ---------------------------------
    > Block Size | 4k            (IOPS) | 64k           (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    > Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    > Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
    >            |                      |
    > Block Size | 512k          (IOPS) | 1m            (IOPS)
    >   ------   | ---            ----  | ----           ----
    > Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    > Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    > Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    > 
    > 

    I highly doubt these numbers. 10 GB/s of writes needs a 100G connection.

  • lowprofile Member

    FYI when visiting hazi.ro

  • Athorio Member, Patron Provider

    @lowprofile said:
    FYI when visiting hazi.ro

    :D

  • FlorinMarian Member, Host Rep

    @lowprofile said:
    FYI when visiting hazi.ro

    0 reports out of 92 tests https://virustotal.com/gui/url/cb54f32b73f2160364018db966cbce4565d5abb4186c67ade46f0959c22edda1?nocache=1

    Thanks anyway for your reporting, even if there's nothing to do on our side ^.^

  • marian Member

    @FlorinMarian said: even if there's nothing to do on our side

    let me give you an idea (if you accept community help this time):

    you can write an email to ask details about this issue:

    https://support.malwarebytes.com/hc/en-us/p/contact_support

    Thanked by 1: FlorinMarian
  • host_c Member, Patron Provider
    edited April 18

    @FlorinMarian

    Hi, friend.

    I will not try to reinvent the wheel here, but I hate SW RAID like hell. This is my personal experience, and for a lot of reasons I will stick to it.

    Ceph / ZFS / mdadm / VMware vSAN and all the others work perfectly under high IO as long as they are a RAID 10 layout (striped mirrors). Today, a modern-generation ASIC on a RAID card with 2GB+ cache will do the same.

    Synthetic tests will give you holy-grail values; you will see, for example, 8 HDDs in RAID 10 posting numbers beyond SSD values. That is fine until the cache goes to shit and IO is high (20+ VPS / node); then hell is going to be unleashed.

    NAS/SAN-type storage has some pretty big/strict requirements, and latency is the biggest of them.

    With Ceph you need network, and I am not talking about Cisco Catalyst or Nexus, or Junos switches; the only way to fly in SAN/NAS is Arista, as they have sub-1ms switches. Also, multipath is a must!

    From my point of view, Ceph might work nicely in business, but not in hosting (those that do it have my utmost respect, as it is hard as hell to implement).

    The number of drives you will burn for redundancy in a 3-node config is high, 33% as I recall.

    Taking a node offline for maintenance will be difficult, as it also participates in the storage.

    An NFS share would work the same or better; it will also require low-latency network cards (Chelsio is king here).

    Of all the storage protocols out there, NFS / CIFS / Ceph... iSCSI is the king, and I mean iSCSI over FC. It was developed as a storage protocol with high bandwidth and low latency.
    Unfortunately, it is unusable in Proxmox and very/only usable in VMware with VMFS!!! - do google this.

    You will have a less problematic life if you stick with HW RAID on a low number of nodes (under 10), or ZFS.

    Thanked by 2: FlorinMarian, ehab
  • hyperblast Member

    @Athorio said:

    @hyperblast said:
    my experience

    > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    > > ---------------------------------
    > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > >   ------   | ---            ----  | ----           ----
    > > Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    > > Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    > > Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
    > >            |                      |
    > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > >   ------   | ---            ----  | ----           ----
    > > Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    > > Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    > > Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    > > 

    with ceph:

    > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    > > ---------------------------------
    > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > >   ------   | ---            ----  | ----           ----
    > > Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    > > Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    > > Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
    > >            |                      |
    > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > >   ------   | ---            ----  | ----           ----
    > > Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    > > Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    > > Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    > > 
    > > 

    I highly doubt these numbers. 10 GB/s of writes needs a 100G connection.

    Believe whatever you like. YABS produced these values: a system without Ceph and a system with Ceph!

  • Athorio Member, Patron Provider

    @hyperblast said:

    @Athorio said:

    @hyperblast said:
    my experience

    > > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    > > > ---------------------------------
    > > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > > >   ------   | ---            ----  | ----           ----
    > > > Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    > > > Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    > > > Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
    > > >            |                      |
    > > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > > >   ------   | ---            ----  | ----           ----
    > > > Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    > > > Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    > > > Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    > > > 

    with ceph:

    > > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    > > > ---------------------------------
    > > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > > >   ------   | ---            ----  | ----           ----
    > > > Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    > > > Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    > > > Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
    > > >            |                      |
    > > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > > >   ------   | ---            ----  | ----           ----
    > > > Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    > > > Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    > > > Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    > > > 
    > > > 

    I highly doubt these numbers. 10 GB/s of writes needs a 100G connection.

    Believe whatever you like. YABS produced these values: a system without Ceph and a system with Ceph!

    If you don't have hypervisors with 100G, it's impossible to achieve these values.

    What you see there are cached values from your hypervisor, then.

    Thanked by 1: host_c
  • host_c Member, Patron Provider
    edited April 18

    @FlorinMarian said: Likewise, RAID10 does not help you if the same partition of the intact disk dies during the synchronization of the replaced disk.

    Let me give you a short crash course on RAID; you can also verify this with Google:

    RAID5 / ZFS RAIDZ1 - it has not been recommended for production use for over 2 decades, as with large multi-TB drives you have roughly a 50% chance of a second drive failing in the first 24h of a rebuild. Array capacity is the number of drives minus 1. The recommended maximum number of drives is 9. Beyond that you do RAID 50, which is basically a stripe (RAID 0) of RAID 5 arrays.

    RAID6 / ZFS RAIDZ2 - it has been the industry standard for over 2 decades for backups or large files. Total array capacity is the number of drives minus 2. It is a dual-parity RAID, withstanding any 2 drive failures. Rebuild time is high on large drives and can take up to several weeks depending on individual drive size for HW RAID; a ZFS rebuild will be faster, but not by much.
    IO is low, as you have 2 parity calculations for each write/read cycle. The recommended maximum number of drives per array is 12; after that you do RAID 60, a stripe (RAID 0) of RAID 6 arrays.

    RAID10 - the industry standard since the beginning of time in any setup, be it HW RAID or SW RAID! Storage capacity is the number of drives divided by 2. The array consists of 2-drive mirrors multiplied by the number of mirrors you have (if you have 8 drives total, you will have 4 x 2-drive mirrors) - hope I explained this right :)

    Rebuild overhead is almost none, as the data is simply copied from the good member to the newly replaced member drive. There is no parity calculation overhead like in RAID 5 or 6.
    Also, read speed scales with the number of drives, while write speed scales with half the number of drives (as everything has to be mirrored).

    The probability of a second drive failing in the same mirror pair is almost non-existent (below 0.00xxx).

    There is no RAID type that gives you both high capacity and a low number of drives reserved for redundancy.
    RAID 10 is the only high-speed RAID on the market, at least to my knowledge.

    If I omitted anything, please fill in the details.

    Cheers!
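
    To make those capacity rules concrete, here is a worked example for a hypothetical set of eight 1.2 TB drives (illustrative numbers only, following the formulas above):

    8 x 1.2 TB = 9.6 TB raw
    RAID5 / RAIDZ1 : (8 - 1) x 1.2 TB = 8.4 TB usable, survives 1 drive failure
    RAID6 / RAIDZ2 : (8 - 2) x 1.2 TB = 7.2 TB usable, survives any 2 drive failures
    RAID10         : (8 / 2) x 1.2 TB = 4.8 TB usable, 4 mirrored pairs;
                     reads scale with ~8 drives, writes with ~4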

  • hyperblast Member

    @Athorio said:

    @hyperblast said:

    @Athorio said:

    @hyperblast said:
    my experience

    > > > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/vda1):
    > > > > ---------------------------------
    > > > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > > > >   ------   | ---            ----  | ----           ----
    > > > > Read       | 388.28 MB/s  (97.0k) | 2.47 GB/s    (38.6k)
    > > > > Write      | 389.30 MB/s  (97.3k) | 2.48 GB/s    (38.8k)
    > > > > Total      | 777.58 MB/s (194.3k) | 4.96 GB/s    (77.5k)
    > > > >            |                      |
    > > > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > > > >   ------   | ---            ----  | ----           ----
    > > > > Read       | 3.21 GB/s     (6.2k) | 3.24 GB/s     (3.1k)
    > > > > Write      | 3.38 GB/s     (6.6k) | 3.45 GB/s     (3.3k)
    > > > > Total      | 6.60 GB/s    (12.9k) | 6.70 GB/s     (6.5k)
    > > > > 

    with ceph:

    > > > > fio Disk Speed Tests (Mixed R/W 50/50) (Partition /dev/sda1):
    > > > > ---------------------------------
    > > > > Block Size | 4k            (IOPS) | 64k           (IOPS)
    > > > >   ------   | ---            ----  | ----           ----
    > > > > Read       | 325.31 MB/s  (81.3k) | 4.25 GB/s    (66.4k)
    > > > > Write      | 326.17 MB/s  (81.5k) | 4.27 GB/s    (66.7k)
    > > > > Total      | 651.49 MB/s (162.8k) | 8.52 GB/s   (133.2k)
    > > > >            |                      |
    > > > > Block Size | 512k          (IOPS) | 1m            (IOPS)
    > > > >   ------   | ---            ----  | ----           ----
    > > > > Read       | 9.07 GB/s    (17.7k) | 8.78 GB/s     (8.5k)
    > > > > Write      | 9.56 GB/s    (18.6k) | 9.37 GB/s     (9.1k)
    > > > > Total      | 18.64 GB/s   (36.4k) | 18.15 GB/s   (17.7k)
    > > > > 
    > > > > 

    I highly doubt these numbers. 10 GB/s of writes needs a 100G connection.

    Believe whatever you like. YABS produced these values: a system without Ceph and a system with Ceph!

    If you don't have hypervisors with 100G, it's impossible to achieve these values.

    What you see there are cached values from your hypervisor, then.

    @dataforest @Avoro @php-friends, please comment. The two sets of values are from dataforest servers.

  • host_c Member, Patron Provider
    edited April 18

    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    Thanked by 1: Athorio
  • hyperblast Member

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

  • Athorio Member, Patron Provider

    @hyperblast said:

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

    Dude, Ceph is replicated over the network; only 1/n is stored locally. Mostly 10G, sometimes 40G, is used.

    Considering these facts, even without replication you can only transfer around 4 GB/s max over 40G; it's physics, nothing more.
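
    As a rough back-of-the-envelope check of that, assuming 3-way replication and ignoring protocol overhead:

    40 Gbit/s / 8 bits per byte       = 5 GB/s raw line rate
    minus framing and Ceph overhead   = roughly 4 GB/s of client throughput at best
    with pool size=3, every client write also triggers 2 replica copies on the
    cluster network, so sustained write speed lands well below that.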

  • hyperblast Member

    @Athorio said:

    @hyperblast said:

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

    Dude, Ceph is replicated over the network; only 1/n is stored locally. Mostly 10G, sometimes 40G, is used.

    Considering these facts, even without replication you can only transfer around 4 GB/s max over 40G; it's physics, nothing more.

    dude, so that means that yabs delivers incorrect values?

  • Athorio Member, Patron Provider

    @hyperblast said:

    @Athorio said:

    @hyperblast said:

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

    Dude, Ceph is replicated over the network; only 1/n is stored locally. Mostly 10G, sometimes 40G, is used.

    Considering these facts, even without replication you can only transfer around 4 GB/s max over 40G; it's physics, nothing more.

    dude, so that means that yabs delivers incorrect values?

    Semi; it's the way you look at it.

    You won't be able to do a long test (a couple of minutes) at these rates, especially not with fsync=1.

  • crunchbits Member, Patron Provider, Top Host

    @hyperblast said:

    @Athorio said:

    @hyperblast said:

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

    Dude, Ceph is replicated over the network; only 1/n is stored locally. Mostly 10G, sometimes 40G, is used.

    Considering these facts, even without replication you can only transfer around 4 GB/s max over 40G; it's physics, nothing more.

    dude, so that means that yabs delivers incorrect values?

    Nah, it's just easy to "game" with RAM cache. There's nothing wrong with that (you should leverage it), but @Athorio is right too: 18GB/s = ~144Gbps, meaning a pair of 100G NICs would have to be dedicated to the storage network alone, and it would have to be a fairly large Ceph cluster to even hit those speeds (a lot of OSDs).

    With ZFS and writeback/RAM you can hit 18-22GB/s on HDDs in YABS :smile: It's just a tool, one of many, to give you a quick idea. Personally, I think YABS is best used for comparing your VM to itself over time, or maybe for extreme outliers (i.e. a 100 GB6 single-core score confirming the CPU steal you see to be true).

  • Athorio Member, Patron Provider

    @Athorio said:

    @hyperblast said:

    @Athorio said:

    @hyperblast said:

    @host_c said:
    @hyperblast

    Not even bare-metal PCIe Gen 4 NVMe can do 8 GB/s @ 1 MB blocks, not to mention a VPS. Cached.

    I have to admit, well done, respect. :+1:

    the honour is due: @dataforest @Avoro
    maybe the representatives of @dataforest can comment on this?! thanks!

    Dude, Ceph is replicated over the network; only 1/n is stored locally. Mostly 10G, sometimes 40G, is used.

    Considering these facts, even without replication you can only transfer around 4 GB/s max over 40G; it's physics, nothing more.

    dude, so that means that yabs delivers incorrect values?

    Semi; it's the way you look at it.

    You won't be able to do a long test (a couple of minutes) at these rates, especially not with fsync=1.

    Oh, and to prove that, you can easily use:

    dd if=/dev/urandom of=test bs=1G count=20 conv=fsync
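
    If you want a number that bypasses the page cache entirely, something along these lines with fio should show the sustained rate rather than the cached one (file name and sizes are just example values; note that dd reading from /dev/urandom can also bottleneck on random-number generation itself):

    fio --name=sync-write --filename=./fio.test --size=4G \
        --rw=write --bs=1M --direct=1 --fsync=1 --ioengine=libaio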

  • cu_olly Member

    Storytime.

    In the summer of 2022, (company) wanted to build 50TB-usable redundant networked data storage, but for the cheapest possible price. I had used and compared Ceph to ZFS back in 2019, on a single local server, and found that Ceph ate CPU cycles for breakfast with nothing extra to show for it -- but Ceph was the go-to in free/open-source distributed storage in 2022, so we went with it anyway.

    We studied various hardware strategies that achieved our goals (cheapest, 50TB usable storage) and settled on a 6-node system with the following specs:
    2u generic server MATX chassis (ATX PSU support, no hot-swap or front bays)
    Cheapest (any vendor) Z690 MATX board
    Intel i5 12400 (6 cores, cheap CPU with high clocks)
    16GB RAM
    2x NVMe to 6-port SATA controllers
    1x PCIe 6-port SATA controller
    (+ 6x onboard SATA)
    1x Intel XXV710 25G NIC
    24x 1TB SATA SSD

    The builds were very difficult. All the SSDs were stacked like bricks in the front of the chassis. Splitting the SATA power so many times was a nightmare. We had to use slim SATA data cables because there was no room for standard ones. One of the NVMe 6-port SATA controllers was mounted on a PCIe riser and that made cable management, coupled with airflow to the NIC, a challenge. Still, we pulled it off, 6 times no less.

    Then came the Ceph configuration. Pretty simple; 6 OSD (storage) nodes of course, with 2 also doubling up as MON(itor) nodes. We let Ceph handle each SSD natively and chose the default 3-copy strategy, so of the 144TB raw capacity, we had 48TB usable. (NB, 2TB less than our 50TB goal, but it was workable). The configuration was remarkably quick and simple, with network handled by a 25G link to each server, a Mikrotik CRS326 on switching duty, and for testing purposes, a 100G Ryzen 3900X client.
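
    For reference, that default 3-copy behaviour boils down to a replicated pool with size 3; a minimal sketch, with the pool name and PG count made up for illustration:

    ceph osd pool create vmdata 128 128 replicated   # 128 placement groups
    ceph osd pool set vmdata size 3                  # 3 copies: 144 TB raw / 3 = 48 TB usable
    ceph osd pool set vmdata min_size 2              # keep serving IO with one copy missing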

    The performance? Fucking diabolical. With a single-threaded sequential read workload, we were getting less than the performance of a single SATA SSD - about 3Gbit/s. What’s more, the Ceph client was munching CPU cycles like they were going out of fashion, the MON servers were both going at it, and even the OSD servers that weren’t being read from were happily singing for no reason. Single-threaded write performance was equally as bad, and again, all OSD servers were hit with CPU cycles for seemingly no reason, with the MONs about twice as much. Luckily, the performance did scale somewhat with IO multi-threading, but at 64 threads we managed to max out the CPU of all boxes while only hitting ~50Gbit of throughput. So, yeah, diabolical.

    This wasn’t the expected outcome and wasn’t fit for our purpose. Even with basic RAID, one can expect a linear read performance increase as the number of disks (copies) increases, but Ceph had no magic performance trickery up its sleeve at all. So what in the fuck to do?

    Well, I’m not going to share too much of the secret sauce, but ZFS. Always ZFS to the rescue. All 6 servers were wiped of their Ceph virus, reinstalled with Debian, then we set up some funky ZFS nesting. Each server ended up in a RAID50, with 4x 6-disk RAID5 volumes in a parent RAID0. This gave 20TB of usable capacity per server, with up to 4 disks worth of failure before data loss. 120TB of usable space over all 6 servers. But how do we link them together?

    iSCSI, of course! This is where the sauce turns secret real quick, but to summarize, the 100G client maintained its own ZFS RAIDZ pool with 6 independent underlying iSCSI disks (one server = one disk). The performance was incredible. Single-threaded reads were coming in at 30Gbit/s, and multi-threaded read/write workloads were dancing at 90Gbit/s, effectively bottlenecking the client port and NIC.
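
    Not the secret sauce itself, but the generic shape of that client side, assuming open-iscsi with one LUN exported per storage server (portal IPs and device names are placeholders):

    # discover and log in to the 6 storage servers (repeat discovery per portal IP)
    iscsiadm -m discovery -t sendtargets -p 10.0.0.11
    iscsiadm -m node --login
    # the 6 LUNs appear as local block devices; build one raidz pool across them
    zpool create remote raidz1 sdb sdc sdd sde sdf sdg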

    However, scaling the storage over multiple clients in tandem is very much a challenge, involving a mindbending plethora of iSCSI targets, and with no way of sharing the data across multiple clients simultaneously -- though Ceph wouldn't have done that either. The solution was finalized in Jan 2023 and has been running in production ever since. We've taken out entire servers for upgrades and reboots without any data loss or downtime, just some resilvering (which goes real quick!).

    Is this something I would recommend everybody go and build? Absolutely not. But if you want the highest performance, lowest cost network redundant storage and you love a challenge… go for it.
