Critical Data Backup

jsg · December 2021

@darkimmortal said:
Without RAID, when you hit a URE and if you notice it, you are left with a filesystem-specific manual faff to identify the file affected and restore it from backup

(a) URE != corrupted data
(b) If you back up files from a RAID'ed disk, the backup will contain the (possibly "repaired" by RAID) data.
(c) If the problem isn't a URE but corrupted data, RAID will not magically repair it.
Note that 'corrupted' isn't limited to e.g. bit flips but can, and often does, result from an application writing out corrupted data.

The whole thing is much more complicated when you look closer. Example: drive 1 of a RAID 1 has some bits flipped, drive 2 has some bits flipped too, and neither (or both) reports a CRC error.
In short, checksums (very significantly) increase the chance of detecting errors, but they aren't a guarantee, nor do they address all possible situations, and they usually cannot correct data.
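
To make the mirror example a bit more concrete, here is a minimal sketch using CRC32 from Python's standard `zlib` module. The helper names (`flip_bit`, `pick_good_copy`) and the sample block are made up for illustration; real RAID and filesystem implementations work differently in detail.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated bit flip)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def pick_good_copy(copies, stored_crc):
    """Return the first copy whose CRC32 matches the stored checksum.

    Returns None if no copy matches: the checksum can tell us that
    every copy is bad, but it cannot rebuild the data.
    """
    for copy in copies:
        if zlib.crc32(copy) == stored_crc:
            return copy
    return None


# Hypothetical data block and the checksum recorded when it was written.
original = b"critical data block"
stored_crc = zlib.crc32(original)

# Case 1: one mirror has a flipped bit, the other is intact.
one_bad = [flip_bit(original, 3), original]

# Case 2: both mirrors have (different) flipped bits.
both_bad = [flip_bit(original, 3), flip_bit(original, 42)]

recovered = pick_good_copy(one_bad, stored_crc)   # the intact copy
unrecoverable = pick_good_copy(both_bad, stored_crc)  # None
```

With one bad mirror the checksum says which copy to trust. With both mirrors damaged it only says that neither is usable, which is where the backup comes in. Note also that a 32-bit checksum cannot distinguish all possible corruptions, so a match is strong evidence, not proof.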

Erasure codes, on the other hand, can correct corrupted data (e.g. bit flips), up to a limit that depends on the code.
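
As a small illustration of correction (as opposed to mere detection), here is a sketch of Hamming(7,4), a textbook error-correcting code that stores 4 data bits in 7 bits and can repair any single flipped bit. It is used here only as a stand-in for the general idea; storage systems typically use other codes (e.g. Reed-Solomon), and the function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 if no error was seen, otherwise the 1-based
    position of the bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome points at the bad bit
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

# One flipped bit: corrected.
damaged = list(codeword)
damaged[4] ^= 1
repaired, where = hamming74_decode(damaged)

# Two flipped bits: beyond what this code can fix, so it "repairs" wrongly.
badly_damaged = list(codeword)
badly_damaged[0] ^= 1
badly_damaged[1] ^= 1
wrong, _ = hamming74_decode(badly_damaged)
```

The second case is the important caveat: every such code has a correction limit, and beyond it the decoder may silently return wrong data.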

Also, one must differentiate, most importantly between data that was correct when written but got corrupted in storage (e.g. bit flips on the device) and "logically corrupt" data, i.e. data that is stored faithfully (often with a correct CRC) but is corrupt on a higher level. Example: due to some internal problem an app writes out hex data containing some digits between 'g' and 'z'; that wrong data is nevertheless written out and stored correctly.
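
A rough sketch of the hex example: the storage-level checksum is perfectly happy with the bad payload, and only a check that knows what the data is supposed to look like catches it. The function name `is_valid_hex` and the sample payloads are assumptions made up for illustration.

```python
import string
import zlib

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def is_valid_hex(text: str) -> bool:
    """Domain-level check: non-empty and made only of hex digits."""
    return len(text) > 0 and all(ch in HEX_DIGITS for ch in text)


good_payload = "deadbeef01"
bad_payload = "deadbezz01"  # app bug: 'z' is not a hex digit

# "Write" the bad payload and record its checksum, then "read" it back.
written = bad_payload.encode("ascii")
crc_at_write = zlib.crc32(written)
read_back = bytes(written)

storage_ok = zlib.crc32(read_back) == crc_at_write        # True: stored faithfully
domain_ok = is_valid_hex(read_back.decode("ascii"))       # False: logically corrupt
```

RAID, checksums and erasure codes all operate at the `storage_ok` level. They will faithfully preserve, and faithfully back up, the broken payload.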

darkimmortal · December 2021

Right, nothing other than domain-specific checks can detect logically corrupt data such as that caused by software bugs. On server-grade hardware there should be nothing else in between that and UREs. So it is an argument about semantics: when I say corrupt data I mean the only type of corruption that one could expect to run into and be able to detect/fix, i.e. UREs (due to a transient disk issue at write time or bitrot in place).
