* bcachefs csum: what about scrubbing with ec?
@ 2022-04-24 13:10 Janpieter Sollie
  2022-05-07 18:27 ` Kent Overstreet
  0 siblings, 1 reply; 2+ messages in thread
From: Janpieter Sollie @ 2022-04-24 13:10 UTC (permalink / raw)
  To: linux-bcachefs



Hi everyone,


I'm still learning about bcachefs. My experiments so far are mostly targeted at bcachefs-tools;
digging into the whole bcachefs structure is still somewhat advanced for me.
I'm considering implementing bcachefs scrub: during the past months, I had several fs upgrades
where the filesystem decided the checksum wasn't correct, whereas the data was fine.  It would
be nice if that were fixable.
However, something I thought about:
All non-encryption checksum algorithms are at most 64 bits wide, and as such fit in the .lo part of
bch_csum.
What about using the upper part for ec? The possibilities are:
- Using another checksum algorithm to check whether a failed checksum isn't an error of the
checksum itself.  Maybe bit 4 of CSUM_TYPES could say: if 0, no 2nd checksum; if 1, use crc32c
(as it is mostly hardware accelerated).
This would be fast, and it would tell whether the checksum is corrupt or the file is.
It would, however, limit future checksums to 3: 1b1000 (a none algorithm can't have a 2nd
checksum), 1b1011 and 1b1100 (as those do not have room for a 2nd checksum + it may be unsafe).
- Using Reed-Solomon as a new algorithm (nr 8).
I'm not entirely sure about this one: using RS in a situation where there are only 16 bytes for ec in a
billion-size data block (many files are > 1GB these days) is mostly useless.
But, for what it's worth, it would allow correcting single-bit errors on the fly (e.g. the user
would never notice a bit error in their file, as it would be corrected automatically).
Would either feature be worth investigating?
I had a chat with woobilicious about the topic on IRC, but we weren't really sure about the
usefulness of either of them.
So, what do other developers think about it?
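
To make the first option concrete, here is a minimal sketch of a second checksum kept in the
upper half of a bch_csum-like value. It is only an illustration under stated assumptions, not
bcachefs code: struct csum128, CSUM_TYPE_HAS_SECOND, primary_csum() (an FNV-1a stand-in for
whichever 64-bit checksum is configured) and the verdict names are all hypothetical. Only the
.lo/.hi split and the use of crc32c come from the text above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the two 64-bit halves of bch_csum. */
struct csum128 {
	uint64_t lo;	/* primary checksum (at most 64 bits wide) */
	uint64_t hi;	/* proposed: secondary crc32c of the same data */
};

/* Hypothetical flag: the "bit 4" (binary 1000) of the checksum type. */
#define CSUM_TYPE_HAS_SECOND	(1U << 3)

/* Stand-in for the primary 64-bit checksum (FNV-1a, NOT what bcachefs uses). */
static uint64_t primary_csum(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint64_t h = 0xcbf29ce484222325ULL;

	while (len--)
		h = (h ^ *p++) * 0x100000001b3ULL;
	return h;
}

/* Plain bitwise CRC32C (Castagnoli); real code would use the accelerated version. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFU;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
	}
	return ~crc;
}

enum csum_verdict { CSUM_OK, CSUM_PRIMARY_BAD, CSUM_SECOND_BAD, CSUM_DATA_BAD };

static struct csum128 csum_compute(const void *buf, size_t len)
{
	struct csum128 c = { primary_csum(buf, len), crc32c(buf, len) };
	return c;
}

/* Two independent checksums let us guess which side of (data, checksum) is wrong. */
static enum csum_verdict csum_verify(const void *buf, size_t len, struct csum128 stored)
{
	bool lo_ok = primary_csum(buf, len) == stored.lo;
	bool hi_ok = crc32c(buf, len) == stored.hi;

	if (lo_ok && hi_ok)
		return CSUM_OK;
	if (hi_ok)
		return CSUM_PRIMARY_BAD;	/* data still matches crc32c: .lo is suspect */
	if (lo_ok)
		return CSUM_SECOND_BAD;		/* data still matches .lo: .hi is suspect */
	return CSUM_DATA_BAD;			/* neither matches: assume the data is damaged */
}
```

A scrub pass could rewrite the checksum only for the CSUM_PRIMARY_BAD and CSUM_SECOND_BAD
verdicts, and treat CSUM_DATA_BAD as real data damage.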

Technically, an invalid checksum points at either an invalid checksum calculation (in which case
the file can be scrubbed) or a damaged file (in which case the file must be marked as dirty).
Currently, the only way to scrub is to assume the data is correct and the checksum isn't.
I'd be glad to hear your opinions


Janpieter Sollie



* Re: bcachefs csum: what about scrubbing with ec?
  2022-04-24 13:10 bcachefs csum: what about scrubbing with ec? Janpieter Sollie
@ 2022-05-07 18:27 ` Kent Overstreet
  0 siblings, 0 replies; 2+ messages in thread
From: Kent Overstreet @ 2022-05-07 18:27 UTC (permalink / raw)
  To: Janpieter Sollie; +Cc: linux-bcachefs

On Sun, Apr 24, 2022 at 03:10:57PM +0200, Janpieter Sollie wrote:
> Hi everyone,
> 
> 
> I'm still learning about bcachefs. My experiments so far are mostly targeted at bcachefs-tools;
> digging into the whole bcachefs structure is still somewhat advanced for me.
> I'm considering implementing bcachefs scrub: during the past months, I had several fs upgrades
> where the filesystem decided the checksum wasn't correct, whereas the data was fine.  It would
> be nice if that were fixable.
> However, something I thought about:
> All non-encryption checksum algorithms are at most 64 bits wide, and as such fit in the .lo part of
> bch_csum.
> What about using the upper part for ec? The possibilities are:
> - Using another checksum algorithm to check whether a failed checksum isn't an error of the
> checksum itself.  Maybe bit 4 of CSUM_TYPES could say: if 0, no 2nd checksum; if 1, use crc32c
> (as it is mostly hardware accelerated).

Not following your idea here. What do you mean by an error of the checksum
itself?

> This would be fast, and it would tell whether the checksum is corrupt or the file is.
> It would, however, limit future checksums to 3: 1b1000 (a none algorithm can't have a 2nd
> checksum), 1b1011 and 1b1100 (as those do not have room for a 2nd checksum + it may be unsafe).

> - Using Reed-Solomon as a new algorithm (nr 8).
> I'm not entirely sure about this one: using RS in a situation where there are only 16 bytes for ec in a
> billion-size data block (many files are > 1GB these days) is mostly useless.
> But, for what it's worth, it would allow correcting single-bit errors on the fly (e.g. the user
> would never notice a bit error in their file, as it would be corrected automatically).
> Would either feature be worth investigating?

Hmm, maybe.

In theory, on modern hardware we really shouldn't be seeing much bitrot, since
hdds and ssds have not just checksums but strong error correcting codes -
bitflips should have been caught & fixed by the devices themselves. But maybe
there's still too much buggy/garbage hardware out there - storing the error
correcting code in bch_csum is a cool idea for that.

> Technically, an invalid checksum points at either an invalid checksum calculation (in which case
> the file can be scrubbed) or a damaged file (in which case the file must be marked as dirty).
> Currently, the only way to scrub is to assume the data is correct and the checksum isn't.

So generally if you get a checksum error you have to assume the data is garbage.
Technically, all that you know is that there's an error somewhere in (data +
checksum); but given that our checksums are stored separately and themselves
checksummed (in btree nodes) a checksum error pretty much always means the data
was damaged, and the only way to recover is from another replica (that will
hopefully pass the checksum check!).
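
As a rough illustration of that recovery path (not bcachefs code), the sketch below walks a
list of replicas and returns the first one whose data matches the checksum stored separately
in the metadata. struct replica and pick_good_replica() are made-up names, and crc32c stands
in for whatever checksum type is configured.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one copy of an extent. */
struct replica {
	const uint8_t	*data;
	size_t		len;
};

/* Plain bitwise CRC32C (Castagnoli), used here as the example checksum. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFU;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
	}
	return ~crc;
}

/*
 * The expected checksum comes from the (itself checksummed) metadata, so a
 * mismatch is treated as damaged data and the next replica is tried.
 * Returns the index of the first good replica, or -1 if none passes.
 */
static int pick_good_replica(const struct replica *r, size_t nr, uint32_t expected)
{
	for (size_t i = 0; i < nr; i++)
		if (crc32c(r[i].data, r[i].len) == expected)
			return (int)i;
	return -1;
}
```

A return value of -1 corresponds to the case where no replica passes the check and the data
cannot be recovered this way.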

Also marking something dirty has a specific meaning in filesystem terminology -
dirty is a property that cached data may have, and it means that the data in the
cache has been updated and must be written back. It doesn't really fit here :)

So, using RS for a checksum type is an interesting idea. I think this kind of small-block RS
has been done before, and code might even exist in the kernel for it already (as
opposed to the large-block RS that RAID uses) - we'd have to check how the
performance is, and make sure it can still catch bitflips.
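
To show the general shape of "correct a single bitflip, then make sure the result still
passes the checksum", here is a small sketch. Note that it does NOT use Reed-Solomon: it
swaps in a much simpler Hamming-style locator (the XOR of the 1-based positions of all set
bits), which fits in 64 bits and can point at one flipped bit. bit_locator() and
check_and_repair() are hypothetical names, and crc32c stands in for the primary checksum.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC32C (Castagnoli), used to confirm any repair. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFU;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78U & (0U - (crc & 1U)));
	}
	return ~crc;
}

/* XOR of (index + 1) over all set bits; one flipped bit changes it by its position. */
static uint64_t bit_locator(const uint8_t *buf, size_t len)
{
	uint64_t x = 0;

	for (size_t i = 0; i < len * 8; i++)
		if (buf[i / 8] & (1U << (i % 8)))
			x ^= (uint64_t)i + 1;
	return x;
}

/* Returns 0 if clean, 1 if one bit was repaired in place, -1 if unrecoverable. */
static int check_and_repair(uint8_t *buf, size_t len, uint32_t crc, uint64_t locator)
{
	if (crc32c(buf, len) == crc)
		return 0;

	uint64_t syndrome = bit_locator(buf, len) ^ locator;
	if (syndrome == 0 || syndrome > (uint64_t)len * 8)
		return -1;

	size_t bit = (size_t)(syndrome - 1);
	buf[bit / 8] ^= (uint8_t)(1U << (bit % 8));
	if (crc32c(buf, len) == crc)
		return 1;	/* the checksum confirms the single-bit repair */

	buf[bit / 8] ^= (uint8_t)(1U << (bit % 8));	/* undo: not a single-bit error */
	return -1;
}
```

The checksum is re-verified after the tentative flip, so a multi-bit error that happens to
produce a plausible locator is still reported as unrecoverable rather than silently "fixed".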

Could be a fun project :)

