* Changing the checksum algorithm on an existing BTRFS filesystem
@ 2021-10-04  5:26 Matthew Warren
  2021-10-04  6:20 ` Nikolay Borisov
  2021-10-04  9:51 ` David Sterba
  0 siblings, 2 replies; 8+ messages in thread
From: Matthew Warren @ 2021-10-04  5:26 UTC (permalink / raw)
  To: linux-btrfs

Is there currently any way to change the checksum used by btrfs
without recreating the filesystem and copying data to the new fs?

Matthew Warren

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04  5:26 Changing the checksum algorithm on an existing BTRFS filesystem Matthew Warren
@ 2021-10-04  6:20 ` Nikolay Borisov
  2021-10-04  9:51 ` David Sterba
  1 sibling, 0 replies; 8+ messages in thread
From: Nikolay Borisov @ 2021-10-04  6:20 UTC (permalink / raw)
  To: Matthew Warren, linux-btrfs



On 4.10.21 г. 8:26, Matthew Warren wrote:
> Is there currently any way to change the checksum used by btrfs
> without recreating the filesystem and copying data to the new fs?

No

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04  5:26 Changing the checksum algorithm on an existing BTRFS filesystem Matthew Warren
  2021-10-04  6:20 ` Nikolay Borisov
@ 2021-10-04  9:51 ` David Sterba
  2021-10-04 16:01   ` Matthew Warren
  1 sibling, 1 reply; 8+ messages in thread
From: David Sterba @ 2021-10-04  9:51 UTC (permalink / raw)
  To: Matthew Warren; +Cc: linux-btrfs

On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
> Is there currently any way to change the checksum used by btrfs
> without recreating the filesystem and copying data to the new fs?

I have a WIP patch but it's buggy. It works on small filesystems but
when I tried it on TB-sized images it blew up somewhere. Also the WIP is
not too smart: it deletes the whole checksum tree and rebuilds it from
the on-disk data, so it lacks the verification against the existing
csums. I intend to debug it some day, but it's a nice-to-have feature.
I'm aware that people have been asking for it, but at this point it would
be too dangerous to provide even the prototype.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04  9:51 ` David Sterba
@ 2021-10-04 16:01   ` Matthew Warren
  2021-10-04 20:37     ` Chris Murphy
  2021-10-04 22:54     ` Qu Wenruo
  0 siblings, 2 replies; 8+ messages in thread
From: Matthew Warren @ 2021-10-04 16:01 UTC (permalink / raw)
  To: dsterba, Matthew Warren, linux-btrfs

I don't know how btrfs is laid out internally, but is the checksum tree
separate from the file (meta)data or is it part of the (meta)data? If it's
separate, it should be possible to build a second csum tree and then
replace the old one once it's done, right?

Matthew Warren

On Mon, Oct 4, 2021 at 4:52 AM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
> > Is there currently any way to change the checksum used by btrfs
> > without recreating the filesystem and copying data to the new fs?
>
> I have a WIP patch but it's buggy. It works on small filesystems but
> when I tried it on TB-sized images it blew up somewhere. Also the WIP is
> not too smart: it deletes the whole checksum tree and rebuilds it from
> the on-disk data, so it lacks the verification against the existing
> csums. I intend to debug it some day, but it's a nice-to-have feature.
> I'm aware that people have been asking for it, but at this point it would
> be too dangerous to provide even the prototype.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04 16:01   ` Matthew Warren
@ 2021-10-04 20:37     ` Chris Murphy
  2021-10-04 22:54     ` Qu Wenruo
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2021-10-04 20:37 UTC (permalink / raw)
  To: Matthew Warren; +Cc: David Sterba, Btrfs BTRFS

On Mon, Oct 4, 2021 at 12:02 PM Matthew Warren
<matthewwarren101010@gmail.com> wrote:
>
> I don't know how btrfs is laid out internally, but is the checksum tree
> separate from the file (meta)data or is it part of the (meta)data? If it's
> separate, it should be possible to build a second csum tree and then
> replace the old one once it's done, right?

Yes, but the other trees still use the old checksum algo. The simplest
way to do this and ensure it's atomic, and thus crash-safe, is offline
conversion. And of course offline conversion doesn't really scale.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04 16:01   ` Matthew Warren
  2021-10-04 20:37     ` Chris Murphy
@ 2021-10-04 22:54     ` Qu Wenruo
  2021-10-05  3:26       ` Zygo Blaxell
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2021-10-04 22:54 UTC (permalink / raw)
  To: Matthew Warren, dsterba, linux-btrfs



On 2021/10/5 00:01, Matthew Warren wrote:
> I don't know how btrfs is laid out internally, but is the checksum tree
> separate from the file (meta)data or is it part of the (meta)data? If it's
> separate, it should be possible to build a second csum tree and then
> replace the old one once it's done, right?

There are several problems, even for off-line convert.

1) Metadata checksum
    Unlike data checksum, metadata checksum is inlined into the metadata
    header.
    Thus there is no way to make the csum co-exist for metadata, and we
    need to re-write (CoW) all metadata of the fs to convert them to the
    new algo.

2) Superblock flags/csum_type
    Currently we have only one csum_type, which is shared by both data
    and metadata.

    We need extra csum_type to indicate the destination csum algo.
    We will also need to sync this bit to kernel, no matter if the kernel
    really uses that.

Thanks,
Qu

>
> Matthew Warren
>
> On Mon, Oct 4, 2021 at 4:52 AM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
>>> Is there currently any way to change the checksum used by btrfs
>>> without recreating the filesystem and copying data to the new fs?
>>
>> I have a WIP patch but it's buggy. It works on small filesystems but
>> when I tried it on TB-sized images it blew up somewhere. Also the WIP is
>> not too smart: it deletes the whole checksum tree and rebuilds it from
>> the on-disk data, so it lacks the verification against the existing
>> csums. I intend to debug it some day, but it's a nice-to-have feature.
>> I'm aware that people have been asking for it, but at this point it would
>> be too dangerous to provide even the prototype.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-04 22:54     ` Qu Wenruo
@ 2021-10-05  3:26       ` Zygo Blaxell
  2021-10-05  6:00         ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Zygo Blaxell @ 2021-10-05  3:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Matthew Warren, dsterba, linux-btrfs

On Tue, Oct 05, 2021 at 06:54:27AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/10/5 00:01, Matthew Warren wrote:
> > I don't know how btrfs is laid out internally, but is the checksum tree
> > separate from the file (meta)data or is it part of the (meta)data? If it's
> > separate, it should be possible to build a second csum tree and then
> > replace the old one once it's done, right?
> 
> There are several problems, even for off-line convert.

I worked out something of a plan to do an online conversion, but there
are some unfortunate issues with the btrfs on-disk design that make online
conversion harder than it has to be.  It would be easier to do an online
conversion to a new on-disk format that supports multiple csum types,
than to convert from one of the existing csum types to another.

> 1) Metadata checksum
>    Unlike data checksum, metadata checksum is inlined into the metadata
>    header.
>    Thus there is no way to make the csum co-exist for metadata, and we
>    need to re-write (CoW) all metadata of the fs to convert them to the
>    new algo.

Metadata is the easy case.  In the metadata pages the csum is a
fixed-width field.  There is also a transid/generation field on each page.
So if you know that the csum was changed from crc32c to xxhash on transid
123456 (and you'd know this because the online conversion would store
this in an item at the beginning of the conversion process), then any
time you read a page with a transid below 123456 you check crc32c, any
time you read a page with a transid of 123456 or above you check xxhash, and any
time you write a page you use xxhash.  Once conversion is set up, run a
metadata balance, and all the metadata pages are converted to use the new
csum algorithm.  A few other places (chunk headers and superblocks) need
to be handled separately.  Allow only one conversion at a time to keep
things simple; otherwise, we'd need to maintain a list of transids and
csum algorithms to figure out which to use, or we'd have to brute-force
all possible algorithms until we figure out which one a metadata page
is using.  When all the metadata block groups have been balanced, the
conversion is done, and future csum checks use only the new algorithm.
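
To make the transid rule concrete, here is a minimal standalone sketch
(struct, field and function names are hypothetical, not actual kernel
code):

    #include <stdint.h>

    enum csum_algo { CSUM_CRC32C, CSUM_XXHASH };

    /* Hypothetical record stored when the conversion is started. */
    struct csum_convert_state {
        enum csum_algo old_algo;
        enum csum_algo new_algo;
        uint64_t convert_transid;   /* e.g. 123456 in the example above */
    };

    /* Pick the algorithm to verify a metadata page against, based on the
     * transid stamped in its header: pages written before the conversion
     * started carry the old csum, everything else carries the new one. */
    static enum csum_algo metadata_read_algo(const struct csum_convert_state *st,
                                             uint64_t page_transid)
    {
        return page_transid < st->convert_transid ? st->old_algo : st->new_algo;
    }

    /* Writes always use the new algorithm once the conversion is set up. */
    static enum csum_algo metadata_write_algo(const struct csum_convert_state *st)
    {
        return st->new_algo;
    }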

An offline conversion tool can read, verify, compute, and write the new
csum algorithm in place, as there is a fixed-length 32-byte csum field in
each metadata page.  It would need to do some kind of data
journalling to be crash-safe, but that's not difficult to arrange
(and in practice it can be YOLOed with no journal most of the time,
using dup or raid1 metadata to fix any blocks mangled by a crash).
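
As an illustration of the in-place step, a rough sketch of converting a
single metadata page (the csum routines and the journalling around the
write-back are left to the caller; all names are made up):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    #define CSUM_FIELD_SIZE 32   /* fixed-width csum field at the front of each page */

    /* Stand-in for whichever csum routines the tool links against. */
    typedef void (*csum_fn)(const uint8_t *buf, size_t len,
                            uint8_t out[CSUM_FIELD_SIZE]);

    /* Verify the page with the old algorithm, then overwrite the fixed-size
     * field with the csum computed by the new one.  The caller writes the
     * page back (and handles crash-safety via journalling or dup/raid1). */
    static bool convert_metadata_page(uint8_t *page, size_t page_size,
                                      csum_fn old_csum, csum_fn new_csum)
    {
        uint8_t expected[CSUM_FIELD_SIZE];

        old_csum(page + CSUM_FIELD_SIZE, page_size - CSUM_FIELD_SIZE, expected);
        if (memcmp(expected, page, CSUM_FIELD_SIZE) != 0)
            return false;       /* don't convert a page that fails verification */

        new_csum(page + CSUM_FIELD_SIZE, page_size - CSUM_FIELD_SIZE, page);
        return true;
    }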

> 2) Superblock flags/csum_type
>    Currently we have only one csum_type, which is shared by both data
>    and metadata.
> 
>    We need extra csum_type to indicate the destination csum algo.
>    We will also need to sync this bit to kernel, no matter if the kernel
>    really uses that.

In the data csum tree it's a bit awkward.  Although there are 136 bits
of (objectid, type, offset) for each csum item, for historical reasons
the objectid and type fields are constant, leaving no room to identify
the csum type in the item key (and we don't want to use any item data
bytes for this as it will make the items larger).  We could create a
new csum tree with a different root, or use the objectid in the key to
identify the csum algorithm, but this requires completely new multi-tree
code to read and modify multiple csum trees during conversion.  It would
have to ensure there are no overlapping csums in two trees (or distant
locations in a single tree, which is just as bad) at the same time.
Every read would require potentially two tree lookups--all at places
where there have historically been bugs in the csum tree implementation.
This double-lookup and double-update (and quadruple-kernel-bug-fixing)
would have to continue until the last item in the old csum tree is
deleted.
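
For reference, the current key layout looks roughly like this (struct and
constant names are illustrative; the real definitions live in the kernel's
btrfs headers, so treat the exact values as an assumption):

    #include <stdint.h>

    struct btrfs_key_sketch {
        uint64_t objectid;   /* 64 bits */
        uint8_t  type;       /*  8 bits */
        uint64_t offset;     /* 64 bits (136 bits total) */
    };

    #define EXTENT_CSUM_OBJECTID  ((uint64_t)-10)   /* same for every csum item */
    #define EXTENT_CSUM_KEY       128               /* same for every csum item */

    /* The only variable part is offset = logical bytenr of the data, so the
     * key has nowhere to say which algorithm produced the csums. */
    static struct btrfs_key_sketch data_csum_key(uint64_t data_bytenr)
    {
        struct btrfs_key_sketch key = {
            .objectid = EXTENT_CSUM_OBJECTID,
            .type     = EXTENT_CSUM_KEY,
            .offset   = data_bytenr,
        };
        return key;
    }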

Ideally a future csum tree (csum_tree_v2?) would have keys of (objectid =
bytenr of data block, type = CSUM_ITEM_KEY, offset = csum algorithm).
Then the existing csum tree code can be used more or less unmodified,
except for the detail that each csum item might contain csums from
any one of the csum algorithms.  This is a simpler modification to the
csum tree code, since it uses only one tree, adjacent block csums are
adjacent in the tree key order, and lookup performance doesn't change.
The gotcha is that doing this requires a less-simple modification to the
fsync log tree code, which is apparently leveraging the existing csum item
key format to share a lot of code between fsync log tree and csum tree.
Also a single extent might have csums of different sizes, which will
complicate code that deals with e.g. printing csum values.
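
Reusing the illustrative struct from the sketch above, the proposed
csum_tree_v2 key would look something like this (again hypothetical, and
the algorithm numbering is only an example):

    enum csum_algo_v2 { V2_CRC32C = 0, V2_XXHASH = 1, V2_SHA256 = 2, V2_BLAKE2B = 3 };

    /* objectid becomes the data bytenr and offset carries the algorithm, so
     * csums for adjacent blocks stay adjacent in key order regardless of
     * which algorithm they were computed with. */
    static struct btrfs_key_sketch data_csum_key_v2(uint64_t data_bytenr,
                                                    enum csum_algo_v2 algo)
    {
        struct btrfs_key_sketch key = {
            .objectid = data_bytenr,
            .type     = EXTENT_CSUM_KEY,   /* CSUM_ITEM_KEY in the proposal */
            .offset   = (uint64_t)algo,
        };
        return key;
    }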

The advantage of csum_tree_v2 is that once the tree is converted to
a multi-csum tree format, it's much easier/faster to convert it again
(e.g. maybe you decide sha256 is too slow for you and you want to go
back to xxhash or blake2b), and you can mkfs new filesystems using
csum_tree_v2 from the beginning for online conversion later.

Any change in the tree organization probably means we have to solve the
multi-tree problem above anyway, since whatever new tree format we choose
to support multiple csums, the existing csum tree isn't laid out in the
same format.  There might be a way out of this if we borrow an idea from
space_cache=v2, and convert the csum tree to csum_tree_v2 format during
mount, without changing the csum algorithm at the same time.  This would
be relatively fast (compared to reading all the data blocks) but it
would eliminate the need to update multiple trees online (conversion
would simply copy the csums to a new tree and delete the old one, while
nothing is allowed to write to the filesystem).  After the csum tree
is converted to csum_tree_v2, the kernel would be able to complete the
conversion online.

Regardless of how multiple csum algorithms are supported, a data balance
would be sufficient to recompute all of the data csums by brute-force
reading (checking old csum) and writing (computing new csum) the data.
This is very wasteful as it rewrites many (but not all) of the metadata
trees again, as well as all of the data blocks, but it would work and
could be used until a proper conversion tool exists.

It might be possible to do something like scrub, but only reading the
data blocks and writing the csum trees.  Except not really scrub--I don't
think any of the existing scrub code can be used this way.  Maybe a
variation of defrag that only rewrites csums?  It might be easier to
build this component from scratch--none of the existing building blocks
in the kernel really fits.

Offline conversion of the existing csum tree from one csum algorithm
to another is non-trivial, since the csum data is packed into pages
dynamically.  Changing from any csum algorithm to one with a different
length will require not only rewriting the content of the csum tree pages,
but also allocating new tree structures to store them.

Offline conversion to csum_tree_v2 is easy since it changes only the
arrangement of bits in the btrfs key structure, not the size of anything.
It can be done in place the same way metadata can be converted in place.

> Thanks,
> Qu
> 
> > 
> > Matthew Warren
> > 
> > On Mon, Oct 4, 2021 at 4:52 AM David Sterba <dsterba@suse.cz> wrote:
> > > 
> > > On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
> > > > Is there currently any way to change the checksum used by btrfs
> > > > without recreating the filesystem and copying data to the new fs?
> > > 
> > > I have a WIP patch but it's buggy. It works on small filesystems but
> > > when I tried it on TB-sized images it blew up somewhere. Also the WIP is
> > > not too smart: it deletes the whole checksum tree and rebuilds it from
> > > the on-disk data, so it lacks the verification against the existing
> > > csums. I intend to debug it some day, but it's a nice-to-have feature.
> > > I'm aware that people have been asking for it, but at this point it would
> > > be too dangerous to provide even the prototype.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Changing the checksum algorithm on an existing BTRFS filesystem
  2021-10-05  3:26       ` Zygo Blaxell
@ 2021-10-05  6:00         ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2021-10-05  6:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Matthew Warren, dsterba, linux-btrfs



On 2021/10/5 11:26, Zygo Blaxell wrote:
> On Tue, Oct 05, 2021 at 06:54:27AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2021/10/5 00:01, Matthew Warren wrote:
>>> I don't know how btrfs is laid out internally, but is the checksum tree
>>> separate from the file (meta)data or is it part of the (meta)data? If it's
>>> separate, it should be possible to build a second csum tree and then
>>> replace the old one once it's done, right?
>>
>> There are several problems, even for off-line convert.
>
> I worked out something of a plan to do an online conversion, but there
> are some unfortunate issues with the btrfs on-disk design that make online
> conversion harder than it has to be.  It would be easier to do an online
> conversion to a new on-disk format that supports multiple csum types,
> than to convert from one of the existing csum types to another.
>
>> 1) Metadata checksum
>>     Unlike data checksum, metadata checksum is inlined into the metadata
>>     header.
>>     Thus there is no way to make the csum co-exist for metadata, and we
>>     need to re-write (CoW) all metadata of the fs to convert them to the
>>     new algo.
>
> Metadata is the easy case.  In the metadata pages the csum is a
> fixed-width field.  There is also a transid/generation field on each page.
> So if you know that the csum was changed from crc32c to xxhash on transid
> 123456 (and you'd know this because the online conversion would store
> this in an item at the beginning of the conversion process), then any
> time you read a page with a transid below 123456 you check crc32c, any
> time you read a page with a transid of 123456 or above you check xxhash, and any
> time you write a page you use xxhash.

This means we need a way to record such info on-disk (source checksum
algo, dest checksum algo, transid of the initial convert).

Thus a format change is needed (which is not really a big deal compared
to other things involved).
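
Something as small as this (purely illustrative, not an existing btrfs
structure) would be enough, whether it lives in a new item or in reserved
superblock space:

    #include <stdint.h>

    /* Hypothetical on-disk record describing an in-progress conversion. */
    struct csum_convert_record {
        uint16_t source_csum_type;   /* algo in use before the convert started */
        uint16_t target_csum_type;   /* algo everything is being rewritten to */
        uint64_t convert_transid;    /* transid at which the convert began */
    } __attribute__((packed));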

>  Once conversion is set up, run a
> metadata balance, and all the metadata pages are converted to use the new
> csum algorithm.  A few other places (chunk headers and superblocks) need
> to be handled separately.  Allow only one conversion at a time to keep
> things simple;

Well, single conversion is already going to be painful...

> otherwise, we'd need to maintain a list of transids and
> csum algorithms to figure out which to use, or we'd have to brute-force
> all possible algorithms until we figure out which one a metadata page
> is using.  When all the metadata block groups have been balanced, the
> conversion is done, and future csum checks use only the new algorithm.

Yes, that's it, we need something like a metadata balance to do that.

But I doubt if we should do it in kernel space.
>
> An offline conversion tool can read, verify, compute, and write the new
> csum algorithm in place,

For in-place convert, it still needs some way to record the source/dest
algo.

It would work just like the uuid changing tool, and would be my preferred way
to do it.

> as there is a fixed-length 32-byte csum field in
> each metadata page.  It would need to do some kind of data
> journalling to be crash-safe, but that's not difficult to arrange

Either we go full CoW, or we don't care and go with a trial-and-error
approach. The only thing we need to know is the source and dest algo;
just two tries should be enough.
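
In other words, an interrupted in-place convert could be resumed by
classifying each metadata block with at most two checks, roughly like
this (illustrative names; csum_fn fills a 32-byte result):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    typedef void (*csum_fn)(const uint8_t *buf, size_t len, uint8_t out[32]);

    enum block_state { BLOCK_CONVERTED, BLOCK_UNCONVERTED, BLOCK_CORRUPTED };

    /* A block is either fully old or fully new, so two tries settle it. */
    static enum block_state classify_block(const uint8_t *page, size_t page_size,
                                           csum_fn source_algo, csum_fn target_algo)
    {
        uint8_t sum[32];

        target_algo(page + 32, page_size - 32, sum);
        if (memcmp(sum, page, 32) == 0)
            return BLOCK_CONVERTED;

        source_algo(page + 32, page_size - 32, sum);
        if (memcmp(sum, page, 32) == 0)
            return BLOCK_UNCONVERTED;   /* still needs the new csum written */

        return BLOCK_CORRUPTED;
    }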

> (and in practice it can be YOLOed with no journal most of the time,
> using dup or raid1 metadata to fix any blocks mangled by a crash).
>
>> 2) Superblock flags/csum_type
>>     Currently we have only one csum_type, which is shared by both data
>>     and metadata.
>>
>>     We need extra csum_type to indicate the destination csum algo.
>>     We will also need to sync this bit to kernel, no matter if the kernel
>>     really uses that.
>
> In the data csum tree it's a bit awkward.  Although there are 136 bits
> of (objectid, type, offset) for each csum item, for historical reasons
> the objectid and type fields are constant, leaving no room to identify
> the csum type in the item key (and we don't want to use any item data
> bytes for this as it will make the items larger).  We could create a
> new csum tree with a different root, or use the objectid in the key to
> identify the csum algorithm, but this requires completely new multi-tree
> code to read and modify multiple csum trees during conversion.

Yes, that's the problem.

>  It would
> have to ensure there are no overlapping csums in two trees (or distant
> locations in a single tree, which is just as bad) at the same time.
> Every read would require potentially two tree lookups--all at places
> where there have historically been bugs in the csum tree implementation.
> This double-lookup and double-update (and quadruple-kernel-bug-fixing)
> would have to continue until the last item in the old csum tree is
> deleted.
>
> Ideally a future csum tree (csum_tree_v2?) would have keys of (objectid =
> bytenr of data block, type = CSUM_ITEM_KEY, offset = csum algorithm).

This indeed sounds pretty interesting.

> Then the existing csum tree code can be used more or less unmodified,
> except for the detail that each csum item might contain csums from
> any one of the csum algorithms.  This is a simpler modification to the
> csum tree code, since it uses only one tree, adjacent block csums are
> adjacent in the tree key order, and lookup performance doesn't change.
> The gotcha is that doing this requires a less-simple modification to the
> fsync log tree code, which is apparently leveraging the existing csum item
> key format to share a lot of code between fsync log tree and csum tree.
> Also a single extent might have csums of different sizes, which will
> complicate code that deals with e.g. printing csum values.
>
> The advantage of csum_tree_v2 is that once the tree is converted to
> a multi-csum tree format, it's much easier/faster to convert it again
> (e.g. maybe you decide sha256 is too slow for you and you want to go
> back to xxhash or blake2b), and you can mkfs new filesystems using
> csum_tree_v2 from the beginning for online conversion later.

That's why I'd prefer the offline convert.

We would only need to update the new csum tree, without touching any of
the runtime csum lookup code.

Then we just rename the objectid after finishing the csum tree update,
and finally drop the old csum tree.

>
> Any change in the tree organization probably means we have to solve the
> multi-tree problem above anyway, since whatever new tree format we choose
> to support multiple csums, the existing csum tree isn't laid out in the
> same format.  There might be a way out of this if we borrow an idea from
> space_cache=v2, and convert the csum tree to csum_tree_v2 format during
> mount, without changing the csum algorithm at the same time.  This would
> be relatively fast (compared to reading all the data blocks) but it
> would eliminate the need to update multiple trees online (conversion
> would simply copy the csums to a new tree and delete the old one, while
> nothing is allowed to write to the filesystem).  After the csum tree
> is converted to csum_tree_v2, the kernel would be able to complete the
> conversion online.
>
> Regardless of how multiple csum algorithms are supported, a data balance
> would be sufficient to recompute all of the data csums by brute-force
> reading (checking old csum) and writing (computing new csum) the data.
> This is very wasteful as it rewrites many (but not all) of the metadata
> trees again, as well as all of the data blocks, but it would work and
> could be used until a proper conversion tool exists.
>
> It might be possible to do something like scrub, but only reading the
> data blocks and writing the csum trees.  Except not really scrub--I don't
> think any of the existing scrub code can be used this way.  Maybe a
> variation of defrag that only rewrites csums?  It might be easier to
> build this component from scratch--none of the existing building blocks
> in the kernel really fits.
>
> Offline conversion of the existing csum tree from one csum algorithm
> to another is non-trivial, since the csum data is packed into pages
> dynamically.  Changing from any csum algorithm to one with a different
> length will require not only rewriting the content of the csum tree pages,
> but also allocating new tree structures to store them.

For offline convert, we need to do regular tree CoWs to create a new
csum tree, and rely on CoW to protect against power loss or interruption.

The good thing is, we can easily find where we were, so there is no need
for an extra on-disk record of the progress.

>
> Offline conversion to csum_tree_v2 is easy since it changes only the
> arrangement of bits in the btrfs key structure, not the size of anything.
> It can be done in place the same way metadata can be converted in place.

If we go offline, then we don't need on-line convert, and thus no need
for csum_tree_v2 for conversion purposes.

Thanks,
Qu

>
>> Thanks,
>> Qu
>>
>>>
>>> Matthew Warren
>>>
>>> On Mon, Oct 4, 2021 at 4:52 AM David Sterba <dsterba@suse.cz> wrote:
>>>>
>>>> On Mon, Oct 04, 2021 at 12:26:16AM -0500, Matthew Warren wrote:
>>>>> Is there currently any way to change the checksum used by btrfs
>>>>> without recreating the filesystem and copying data to the new fs?
>>>>
>>>> I have a WIP patch but it's buggy. It works on small filesystems but
>>>> when I tried it on TB-sized images it blew up somewhere. Also the WIP is
>>>> not too smart: it deletes the whole checksum tree and rebuilds it from
>>>> the on-disk data, so it lacks the verification against the existing
>>>> csums. I intend to debug it some day, but it's a nice-to-have feature.
>>>> I'm aware that people have been asking for it, but at this point it would
>>>> be too dangerous to provide even the prototype.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-10-05  6:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-04  5:26 Changing the checksum algorithm on an existing BTRFS filesystem Matthew Warren
2021-10-04  6:20 ` Nikolay Borisov
2021-10-04  9:51 ` David Sterba
2021-10-04 16:01   ` Matthew Warren
2021-10-04 20:37     ` Chris Murphy
2021-10-04 22:54     ` Qu Wenruo
2021-10-05  3:26       ` Zygo Blaxell
2021-10-05  6:00         ` Qu Wenruo
