Practical experience of RAID1 BTRFS, ZFS and mdadm raid1

With 2 nvme SSD, which filesystem would you suggest:

  • BTRFS raid 1
  • ZFS raid 1
  • mdadm raid 1

I heard that ZFS is not good for SSDs because it was developed with HDDs in mind, so using it on an SSD is not sensible. BTRFS is said to still be unreliable in RAID setups? mdadm is good but lacks snapshot and self-healing properties.

Which one works well for running KVM on top of it?

Comments

  • Erisa Member
    edited November 2022

    @mgcAna said: I heard that ZFS is not good for SSDs because it was developed with HDDs in mind, so using it on an SSD is not sensible.

    This is not exactly true (maybe it once was?). ZFS is fine to use on an SSD but may need some configuration to work exactly how you expect (and tbh this is always true, SSD or not; ZFS is high maintenance)

    Some loose thoughts I had the last time this came up: https://lowendtalk.com/discussion/comment/3459232/#Comment_3459232

  • You seem to have plenty of experience with ZFS.
    So you suggest using ZFS on an SSD but decreasing the ARC size and also enabling compression?

    What would you suggest for 64GB RAM with a modern CPU and 2x2TB NVMe SSDs?

  • Hxxx Member
    edited November 2022

    if you are not using ECC RAM stay away from ZFS.

    MDADM RAID 1 works just fine.
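    For reference, the mdadm route really is about this simple. A minimal sketch; device names like /dev/nvme0n1 are placeholders for your actual disks:

```shell
# Create a RAID1 (mirror) array from two NVMe drives.
# WARNING: this destroys any existing data on the listed devices.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
    /dev/nvme0n1 /dev/nvme1n1

# Put a plain filesystem on top:
mkfs.ext4 /dev/md0

# Check sync progress and array health:
cat /proc/mdstat
mdadm --detail /dev/md0
```

    KVM images then just live on the filesystem on /dev/md0 like any other files.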

  • Erisa Member
    edited November 2022

    @mgcAna said: You seem to have plenty of experience with ZFS.

    Not in production (business) use, but I have used it on a personal level for almost 3 years now. I may not have the deepest understanding in the world, but there were many experiences and lessons learnt throughout that time.

    @mgcAna said: So you suggest using ZFS on an SSD but decreasing the ARC size

    You don't have to decrease it, it's just probably a good idea, because when your backing store is NVMe you don't exactly need a large ARC. At the same time you might still want some, or you might have tons of spare RAM.

    @mgcAna said: and also enabling compression?

    Up to you. It carries a tiny CPU hit (much less so on modern CPUs) and saves space and some amount of I/O strain. The latter concern is less important on an NVMe SSD and the former depends on your needs. Some compression methods like zstd have an extremely minimal CPU cost for a moderate compression ratio, which is cool.

    @mgcAna said: What would you suggest for 64GB RAM with a modern CPU and 2x2TB NVMe SSDs?

    Whatever works for you. Seriously. The ARC will default to a maximum size of half your RAM (32 GB); if you have plenty of spare RAM after running your applications then you can just keep it that way, unless you expect the usage to spike suddenly. Otherwise you can set it to 4GB, 2GB, etc., whatever works best. There will come a point where shrinking it further starts causing ARC starvation and high CPU usage. You want to set it a comfortable amount above that point.
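    To make that concrete, a sketch assuming Linux OpenZFS (the 4 GiB figure is just an example):

```shell
# Cap the ARC at 4 GiB at runtime (needs root):
echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max

# And make it stick across reboots:
echo "options zfs zfs_arc_max=$((4 * 1024 * 1024 * 1024))" \
    > /etc/modprobe.d/zfs.conf
```

    Watch `arcstat` or `arc_summary` for hit rates before shrinking it further.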

    @Hxxx said:
    if you are not using ECC RAM stay away from ZFS.

    Hey so I've not heard this point mentioned before and I'm trying to understand it. I did a little googling to try and research what the pitfalls might be with non-ECC on ZFS but I just found pages mentioning how it might cause problems with ZFS checksums but also other posts saying the risk is no different to other filesystems.
    My (personal use!) ZFS pools of many TBs have been running on non-ECC RAM for almost 3 years now and I've not run into a single problem. Am I just lucky here?

    Could you explain a little more about what you mean so I can understand?

  • Shakib Member, Patron Provider

    I will choose the same old EXT4 and call it a day.

    One word. Reliable.

    Using XFS for backup storage.

  • @Hxxx said:
    if you are not using ECC RAM stay away from ZFS.

    must it be ECC Registered?

  • @Hxxx said:
    if you are not using ECC RAM stay away from ZFS.

    MDADM RAID 1 works just fine.

    So this advice always distresses me. It seems to stem from the fact that ZFS can tell that the data written to disk wasn't what was expected.

    ZFS isn't more or less prone to memory bit flips causing bad data going to disk. It just knows they happened. Would you rather know that you just read something bogus? Or would you rather just silently get a bogus byte?

    The risk is identical with any filesystem. ECC RAM is equally important on any system. The question is just whether you'll ever notice. :smile:

  • @lewellyn @masiqbal
    Taken from theeee webs. A random place but I like how it explains it so illllll just COPY PASTA.

    -
    The main reason for using ZFS over legacy file systems is the ability to assure data integrity. But ZFS is only one piece of the data integrity puzzle. The other part of the puzzle is ECC memory.

    ZFS covers the risk of your storage subsystem serving corrupt data. ECC memory covers the risk of corrupt memory. If you leave any of these parts out, you are compromising data integrity.

    If you care about data integrity, you need to use ZFS in combination with ECC memory. If you don't care that much about data integrity, it doesn't really matter if you use either ZFS or ECC memory.

    Please remember that ZFS was developed to assure data integrity in a corporate IT environment, where data integrity is top priority and ECC memory in servers is the norm, a foundation on which ZFS has been built. ZFS is not some magic pixie dust that protects your data under all circumstances. If its requirements are not met, data integrity is not assured.
    source: https://louwrentius.com/please-use-zfs-with-ecc-memory.html#:~:text=ZFS covers the risk of,in combination with ECC memory.

    That's Google's first answer when you type ZFS and ECC RAM. Thank you, Google, for saving me the time to explain it.

  • rm_ Member, IPv6 Advocate
    edited December 2022

    Now that the actual argument has been copy-pasted, it sounds more sensible:

    @Hxxx said: If you care about data integrity, you need to use ZFS in combination with ECC memory.

    "ZFS alone does not protect you 100%, ECC RAM is also needed for that"
    does not translate into "stay away from ZFS if not using ECC" in any shape or form, thanks.

    @Hxxx said: That's Google's first answer. Thank you, Google.

    Way to not only be wrong, but to be a total wanker about it.

  • Hxxx Member
    edited December 2022

    @rm_ well, the overall consensus is that ZFS should be used with ECC for maximum data integrity. Many sources explain the whys.

    If data gets corrupted in non-ECC RAM (bit flip), ZFS will just receive that data and store it like nothing happened... so if you are using ZFS for data integrity, then it will not protect your data as intended.

    Thank you Sir, as always I appreciate your useful replies. Certainly what you think is very important.

    Explaining it just for you, @rm_. I don't want you to stay confused; be a better person.

  • rm_ Member, IPv6 Advocate

    @Hxxx said: well, the overall consensus is that ZFS should be used with ECC for maximum data integrity. Many sources explain the whys.

    That stems from the paranoia that a ZFS "scrub" might get confused by an error in non-ECC RAM and do a "mis-recover" of data from parity, which would actually corrupt the data. But that seems dubious to begin with; people were using parity RAIDs even before ZFS, and those never had an ECC requirement. Even here, too many chances have to align exactly right for that to happen. And secondly, that does not apply to RAID1, which is what the OP is asking about, since it is not a parity RAID level. Certainly not a reason to avoid ZFS entirely just because of non-ECC.

  • There is still good reason to use Btrfs or ZFS without ECC RAM. It is not going to be any worse than non-checksumming filesystems, and it gives you an early warning that RAM is bad. People will ignore the occasional BSOD, application crash, corrupted download, etc. associated with bad RAM, but a filesystem error is less likely to be ignored.

  • I've been using ZFS on SSD's for a while now, and I haven't had any major issues.

    I will say that ZFS will rip/wear through consumer SSD's pretty quickly though - so if you have a use case that requires a lot of I/O keep that in mind. I killed a couple of consumer SSD's on proxmox with ZFS in less than a year.

    I only use "enterprise" SSD's and ECC RAM in production - been a very positive experience so far.

  • Didn't know about the SSD wear-out issue. Interesting. ZFS was invented when there were no SSDs, right? Might work better with spinning rust.

  • As far as I know, there are three types of ECC: Registered, Unbuffered, and On-die ECC.
    I still do not know which type of ECC is safe for ZFS. Registered ECC should be the safest. I am curious whether unbuffered and on-die ECC are safe enough for ZFS. I would like to hear any experiences about it.

  • darkimmortal Member
    edited December 2022

    @masiqbal said:
    As far as I know, there are three types of ECC: Registered, Unbuffered, and On-die ECC.
    I still do not know which type of ECC is safe for ZFS. Registered ECC should be the safest. I am curious whether unbuffered and on-die ECC are safe enough for ZFS. I would like to hear any experiences about it.

    There is no safety difference between registered and unbuffered. Unbuffered has a performance advantage but is worse for max capacity and cost.

    On-die ECC is better than nothing, but it’s not going to halt the system and/or ring alarm bells before your data gets hosed if a stick spontaneously fails

  • lewellyn Member
    edited December 2022

    @tjn said:
    I've been using ZFS on SSD's for a while now, and I haven't had any major issues.

    I will say that ZFS will rip/wear through consumer SSD's pretty quickly though - so if you have a use case that requires a lot of I/O keep that in mind. I killed a couple of consumer SSD's on proxmox with ZFS in less than a year.

    I only use "enterprise" SSD's and ECC RAM in production - been a very positive experience so far.

    @Hxxx said:
    Didn't know about the SSD wear-out issue. Interesting. ZFS was invented when there were no SSDs, right? Might work better with spinning rust.

    ZFS was invented with SSD in mind: look at the external ZIL and L2ARC.

    Specifically, at the time, there was murmuring from the manufacturers that spinning rust was going to start being slower to increase density. This would have made external ZIL/L2ARC on SSD vital for performance. Could you imagine a RAIDZ3 of a couple dozen 3000 RPM 30TB HDDs without an external ZIL and L2ARC? OMFG.

    I and others were running ZFS on SSD for OpenSolaris compiles back then. Remember that, at Sun, a nightly build wasn't just "it was built at a specific time": the goal was for it to be done building by the time people started working for the day.

    And even the datacenter-grade SSDs weren't as good in any metric as today's consumer disks, so I'd like to see reliable hard numbers that ZFS generally causes untoward SSD demise versus other solutions. Especially on platforms where ZFS is in-kernel rather than a bolt-on (as in-kernel filesystems tend to get extra care in related parts of code).

    Especially with Linux, the VFS is... (I want to be generous and avoid a flame war here, though I have done so on LKML in the past...) less than performant. Once you start adding in filesystems which would reasonably need VFS work, outside the kernel, it's not fair to blame the filesystem for inherent flaws in the kernel which it happens to tickle and there is animosity towards those who would like to see fixes (due to the risk factor for no provable benefit to any filesystems widely used and shipped in the kernel).

    Very specifically, Linus has hated ZFS since its introduction. There was once active work done to make it harder to let ZFS land if somehow the CDDL vs GPL thing came to a peaceful resolution (and then Oracle bought Sun, so that whole line of thought was doomed). And even now, he'll advocate against anything that might make ZFS better, especially if there's even a theoretical harm to any other filesystem (and even if that harm is a bug that the filesystem maintainers intend to fix regardless).

    If you DO run ZFS on SSD, ensure you follow the best practices:

    • Only use 4K-native SSDs (you may need to do extra work with 512e disks, but it's still possible; truly 4K is preferable)
    • Only use SSDs with working TRIM support (that may be what killed your consumer SSDs btw, but non-TRIM SSDs will be overworked with ZFS and even btrfs RAID at any grade due to CoW)
    • Set recordsize appropriately (small such as 8K for many workloads but as high as like 1M for things like bulk storage of literal ISOs)
    • Set ashift properly (12 and 13 are common suggestions)
    • Turn off atime (this is generally something you wanna do for performance on most RAIDs anyhow, but I'm trying to be helpful here)
    • Set compression (lz4 is a good balance of CPU price/performance) to reduce read/write cycles, which can also improve performance in addition to potentially eking out more life due to fewer blocks being rewritten. Though if your workloads involve lots of random writes to files that remain on the filesystem (e.g. databases), you may not want to do this. For the standard bulk-storage type things many here do, though, you most definitely do want to.

    ZFS is a highly tuneable filesystem and one generally wants to tune it.

    tl;dr: Use good quality SSDs, tune your pool and datasets, and expect it to work better on BSD/Illumos where the kernel isn't hostile to it.
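    Putting that list together, a pool-creation sketch (pool name, dataset layout, and device names here are hypothetical; adjust to taste):

```shell
# Mirrored (RAID1-equivalent) pool of two NVMe drives: 4K sectors (ashift=12),
# TRIM enabled, lz4 compression and atime off pool-wide:
zpool create -o ashift=12 -o autotrim=on \
    -O compression=lz4 -O atime=off \
    tank mirror /dev/nvme0n1 /dev/nvme1n1

# recordsize per workload: small for VM images/databases, large for bulk files:
zfs create -o recordsize=8K tank/vms
zfs create -o recordsize=1M tank/isos
```

    For KVM specifically, many people instead carve out zvols (`zfs create -V`) with a matching volblocksize rather than using file-backed images.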

  • this guy fucks. B)
    Excellent post. Thanks for replying.
    @lewellyn

  • darkimmortal Member
    edited December 2022

    Here's some good fresh evidence of why ECC matters when using Btrfs/ZFS: https://lore.kernel.org/linux-btrfs/[email protected]/T/#m7f80e2f1163433dd244fbf740594d9932ddd2b8c

    Indeed an extent tree corruption.

    And furthermore, it's also a bitflip:

    hex(1140559929556992) = 0x40d554d724000
    hex(14660022714368) = 0x00d554d724000

    The difference is 0x400000000000 (whatever how many bits), exactly one
    bitflipped.

    That exactly matches the result, all copies are corrupted, but all
    checksums passes.

    As the checksum is calculated using the in-memory metadata, if your
    hardware memory has bitflip, there is no way to detect the problem.

    Please run a full memtest before doing anything to verify the hardware
    problem, then fix the hardware memory, and finally run "btrfs check
    --repair" to fix the problem.

    ...

    Darn, this was expensive memory.

    Filesystem (recoverably) hosed by a single bitflip in memory. Filesystems like ext4 would likely have kept working after an fsck, but not without silent data loss.

    The design of btrfs handles dog shit disks, flapping controllers, and a lack of power loss protection better than other FSes, but everything else must be perfect
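    For anyone following along, the early-warning part is mostly the scrub. Roughly (the mount path is hypothetical):

```shell
# Verify checksums on all copies; with a RAID1 profile, bad blocks are
# repaired from the good mirror where possible:
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data

# Per-device error counters (read/write/csum) accumulated so far:
btrfs device stats /mnt/data
```

    Running this on a schedule is what turns "silent corruption" into a visible error long before things go read-only.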

  • rm_ Member, IPv6 Advocate
    edited December 2022

    @darkimmortal said: Here's some good fresh evidence of why ECC matters when using Btrfs/ZFS

    There's nothing specific to "when using Btrfs/ZFS" about this.

    Btrfs just helped catch the faulty memory. If you used Ext4/XFS, the error would have gone unnoticed much longer, and corrupted your data or your filesystem without you knowing. So it is just "evidence why ECC matters" - period.

    ...Or in fact just "why not buying faulty RAM matters", because please do have three guesses whether or not the user in the mailing list ran a 24h MemTest run after buying the machine or RAM in question? Nobody does that anymore, that's too difficult, and it's new RAM man, how can there be any issue with it.

    Returned 2 or 3 sticks last year due to "faulty out of the box", i.e. a specific address having persistent error in MemTest86+, or just the module in general not being stable in TestMem5 (discovered when trying to overclock, that one stick out of the 4 returns errors even at its stock frequency).

  • darkimmortal Member
    edited December 2022

    @rm_ said:

    @darkimmortal said: Here's some good fresh evidence of why ECC matters when using Btrfs/ZFS

    There's nothing specific to "when using Btrfs/ZFS" about this.

    Btrfs just helped catch the faulty memory. If you used Ext4/XFS, the error would have gone unnoticed much longer, and corrupted your data or your filesystem without you knowing. So it is just "evidence why ECC matters" - period.

    ...Or in fact just "why not buying faulty RAM matters", because please do have three guesses whether or not the user in the mailing list ran a 24h MemTest run after buying the machine or RAM in question? Nobody does that anymore, that's too difficult, and it's new RAM man, how can there be any issue with it.

    Returned 2 or 3 sticks last year due to "faulty out of the box", i.e. a specific address having persistent error in MemTest86+, or just the module in general not being stable in TestMem5 (discovered when trying to overclock, that one stick out of the 4 returns errors even at its stock frequency).

    I agree ECC should always be used regardless of filesystem. But in this case the filesystem has failed read-only from a single bitflip. This would not happen on ext4/xfs, as they have mature repair tools that run automatically. Sure, they'll probably lose data, and silently, but total loss is unlikely.

  • rm_ Member, IPv6 Advocate
    edited December 2022

    @darkimmortal said: But in this case the filesystem has failed readonly from a single bitflip.

    Due to some wild outlandish FS structure values, not knowing what to do with those, and to prevent further damage. "Failed" only in a sense that a good properly designed system should always fail-safe, not continue operating regardless, based on bad inputs, and possibly multiplying the damage by an order of magnitude.

    @darkimmortal said: This would not happen on ext4/xfs, as they have mature repair tools that run automatically.

    They do not. fsck.ext4 does not run automatically, only (if ever) at reboot, and only after an improper shutdown of the filesystem. And xfs_repair never runs automatically; it has to be invoked by the user.

    So how do you know you'd better run fsck.ext4 or xfs_repair? By observing some issue with the filesystem, files missing, data going corrupt? And you praise those FSes for allowing that to happen, while Btrfs gets bashed for "going read-only" in an instant, to prevent such damage?

  • darkimmortal Member
    edited December 2022

    @rm_ said:

    @darkimmortal said: But in this case the filesystem has failed read-only from a single bitflip.

    Due to some wild outlandish FS structure values, not knowing what to do with those, and to prevent further damage. "Failed" only in a sense that a good properly designed system should always fail-safe, not continue operating regardless, based on bad inputs, and possibly multiplying the damage by an order of magnitude.

    @darkimmortal said: This would not happen on ext4/xfs, as they have mature repair tools that run automatically.

    They do not. fsck.ext4 does not run automatically, only (if ever) at reboot, and only after an improper shutdown of the filesystem. And xfs_repair never runs automatically; it has to be invoked by the user.

    So how do you know you'd better run fsck.ext4 or xfs_repair? By observing some issue with the filesystem, files missing, data going corrupt? And you praise those FSes for allowing that to happen, while Btrfs gets bashed for "going read-only" in an instant, to prevent such damage?

    I’m not bashing btrfs, I use it on all my systems. Other filesystems are not as good outside of performance use cases, but they do handle bit flips without exploding.

  • rm_ Member, IPv6 Advocate
    edited December 2022

    @darkimmortal said: they do handle bit flips without exploding

    Turning read-only is not exploding. What you are saying is that continuing to somehow work, like Ext4 and XFS would, letting bad memory chew through your data and FS structures, until the damage is so great that it reaches the user-visible level and makes you run fsck to try and salvage what remains, is better than stopping in its tracks (going read-only) instantly on the first occurrence of the first memory error. Come on, really?

  • @rm_ said:

    @darkimmortal said: they do handle bit flips without exploding

    Turning read-only is not exploding. What you are saying is that continuing to somehow work, like Ext4 and XFS would, letting bad memory chew through your data and FS structures, until the damage is so great that it reaches the user-visible level and makes you run fsck to try and salvage what remains, is better than stopping in its tracks (going read-only) instantly on the first occurrence of the first memory error. Come on, really?

    I agree that turning read-only at the first sign of an issue is better design, but for the average user it’s only better if there is a repair tool. What if this is a many-TB filesystem where re-creating the entire fs is a ballache? Btrfs check --repair is covered in warnings to only use it as a last resort on the advice of a dev.
