* [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range
@ 2022-11-15 16:29 fdmanana
  2022-11-15 21:45 ` Boris Burkov
  2022-11-18 16:11 ` David Sterba
  0 siblings, 2 replies; 4+ messages in thread
From: fdmanana @ 2022-11-15 16:29 UTC
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When trying to see if we can clone a file range, there are cases where we
end up sending two write operations. This happens when the inode from the
source root has an i_size that is not sector size aligned and the length
from the current offset to its i_size is less than the remaining length we
are trying to clone.

Issuing two write operations when we could instead issue a single write
operation is not incorrect. However, it is not optimal, especially if the
extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
to the send ioctl. In that case we can end up sending an encoded write
with an offset that is not sector size aligned, which makes the receiver
fall back to decompressing the data and writing it using regular buffered
IO (so re-compressing the data in case the fs is mounted with compression
enabled), because encoded writes fail with -EINVAL when an offset is not
sector size aligned.
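
As a minimal sketch of the alignment condition (this is not kernel code),
the check below assumes a sector size of 4096 bytes. That value, and the
names SECTORSIZE and check_alignment, are assumptions used only for
illustration:

```shell
#!/bin/bash
# Assumption: 4096-byte sectors. The real value depends on the filesystem.
SECTORSIZE=4096

# Print "aligned" if the byte offset in $1 is a multiple of SECTORSIZE,
# otherwise print "unaligned".
check_alignment() {
    if [ $(( $1 % SECTORSIZE )) -eq 0 ]; then
        echo "aligned"
    else
        echo "unaligned"
    fi
}

check_alignment 32768   # 32K
check_alignment 33792   # 33K
```

Under that assumption, an encoded write at offset 32K passes the check,
while one at offset 33K is the kind that is rejected with -EINVAL.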

The following example shows the non-optimal behaviour due to an unaligned
encoded write. It also triggered a bug in the receiver's fallback logic
(decompression + regular buffered IO), which is fixed by the patchset
referred to by the Link tag at the bottom of this changelog:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdj
   MNT=/mnt/sdj

   mkfs.btrfs -f $DEV > /dev/null
   mount -o compress $DEV $MNT

   # File foo has a size of 33K, not aligned to the sector size.
   xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo

   xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar

   # Now clone the first 32K of file bar into foo at offset 0.
   xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo

   # Snapshot the default subvolume and create a full send stream (v2).
   btrfs subvolume snapshot -r $MNT $MNT/snap

   btrfs send --compressed-data -f /tmp/test.send $MNT/snap

   echo -e "\nFile bar in the original filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT
   mkfs.btrfs -f $DEV > /dev/null
   mount $DEV $MNT

   echo -e "\nReceiving stream in a new filesystem..."
   btrfs receive -f /tmp/test.send $MNT

   echo -e "\nFile bar in the new filesystem:"
   od -A d -t x1 $MNT/snap/bar

   umount $MNT

Before this patch, the send stream included one regular write and one
encoded write for file 'bar', with the latter not being sector size
aligned, causing the receiver to fall back to decompression + buffered
writes. The output of the btrfs receive command in verbose mode (-vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   write bar - offset=32768 length=1024
   encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
   encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
   (...)

This patch avoids the regular write followed by an unaligned encoded write
so that we end up sending a single encoded write that is aligned. So after
this patch the stream content is (output of btrfs receive -vvv):

   (...)
   mkfile o258-7-0
   rename o258-7-0 -> bar
   utimes
   clone bar - source=foo source offset=0 offset=0 length=32768
   encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
   (...)

So we get better behaviour and avoid the silent data loss bug in
versions of btrfs-progs affected by the bug referred to by the Link tag
below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
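
The offsets and lengths in the two receive outputs follow from simple
arithmetic on the sizes used in the example. A small sketch, where the
variable names are illustrative and do not come from the kernel code:

```shell
#!/bin/bash
# Values taken from the example: bar is 64K, its first 32K was sent as a
# clone from foo, and foo (the clone source) has an i_size of 33K.
offset=32768        # where the clone operation ended
end=65536           # end of bar's data
src_i_size=33792    # i_size of the clone source (33K)

# Before the patch: a regular write up to the source i_size, then a second
# write for whatever remains.
first_write_len=$((src_i_size - offset))
second_write_offset=$((offset + first_write_len))
second_write_len=$((end - second_write_offset))

# After the patch: a single write covering the whole remaining range.
single_write_len=$((end - offset))

echo "before: write offset=$offset length=$first_write_len"
echo "before: encoded_write offset=$second_write_offset unencoded_file_len=$second_write_len"
echo "after:  encoded_write offset=$offset unencoded_file_len=$single_write_len"
```

The second write before the patch starts at 33792, which is not a multiple
of 4096, while the single write after the patch starts at 32768.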

Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 6950d3f9cbc1..5a00d08c8300 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5935,6 +5935,7 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		u64 ext_len;
 		u64 clone_len;
 		u64 clone_data_offset;
+		bool crossed_src_i_size = false;
 
 		if (slot >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(clone_root->root, path);
@@ -5992,8 +5993,10 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 		if (key.offset >= clone_src_i_size)
 			break;
 
-		if (key.offset + ext_len > clone_src_i_size)
+		if (key.offset + ext_len > clone_src_i_size) {
 			ext_len = clone_src_i_size - key.offset;
+			crossed_src_i_size = true;
+		}
 
 		clone_data_offset = btrfs_file_extent_offset(leaf, ei);
 		if (btrfs_file_extent_disk_bytenr(leaf, ei) == disk_byte) {
@@ -6054,6 +6057,25 @@ static int clone_range(struct send_ctx *sctx, struct btrfs_path *dst_path,
 				ret = send_clone(sctx, offset, clone_len,
 						 clone_root);
 			}
+		} else if (crossed_src_i_size && clone_len < len) {
+			/*
+			 * If we are at i_size of the clone source inode and we
+			 * can not clone from it, terminate the loop. This is
+			 * to avoid sending two write operations, one with a
+			 * length matching clone_len and the final one after
+			 * this loop with a length of len - clone_len.
+			 *
+			 * When using encoded writes (BTRFS_SEND_FLAG_COMPRESSED
+			 * was passed to the send ioctl), this helps avoid
+			 * sending an encoded write for an offset that is not
+			 * sector size aligned, in case the i_size of the source
+			 * inode is not sector size aligned. Such a write makes
+			 * the receiver fall back to decompressing the data and
+			 * writing it using regular buffered IO, which, while not
+			 * incorrect, is not optimal due to decompression and
+			 * possible re-compression at the receiver.
+			 */
+			break;
 		} else {
 			ret = send_extent_data(sctx, dst_path, offset,
 					       clone_len);
-- 
2.35.1
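
The order of the branches after this patch can be modelled with a
simplified sketch. It is not the kernel code: the function name
clone_range_action and its arguments are hypothetical, and the real clone
branch has further conditions that are omitted here.

```shell
#!/bin/bash
# Hypothetical model of the branch order at the end of the clone_range()
# loop body. Prints the action taken for one source extent.
#   $1: 1 if the source extent's disk_bytenr matches disk_byte, else 0
#   $2: 1 if the extent crossed the source i_size (crossed_src_i_size), else 0
#   $3: clone_len
#   $4: len (remaining length we are trying to clone)
clone_range_action() {
    if [ "$1" -eq 1 ]; then
        echo "clone"    # send_clone() path
    elif [ "$2" -eq 1 ] && [ "$3" -lt "$4" ]; then
        echo "break"    # new branch: leave the rest to the write after the loop
    else
        echo "write"    # send_extent_data() path
    fi
}

# First extent of foo shares data with bar: cloned.
clone_range_action 1 0 32768 65536
# Second extent of foo is truncated at i_size (1K left) with 32K remaining.
clone_range_action 0 1 1024 32768
```

With the values from the example, the second call takes the new branch, so
the remaining 32K is sent by the single write issued after the loop.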



* Re: [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range
  2022-11-15 16:29 [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range fdmanana
@ 2022-11-15 21:45 ` Boris Burkov
  2022-11-16 10:28   ` Filipe Manana
  2022-11-18 16:11 ` David Sterba
  1 sibling, 1 reply; 4+ messages in thread
From: Boris Burkov @ 2022-11-15 21:45 UTC
  To: fdmanana; +Cc: linux-btrfs

Nice fix, confirmed that it works for me.

FWIW, I was curious if this fix would result in the "opposite" problem
if you reflinked less than the full file and needed to finish the loop
to get the next big chunk to be aligned. But reflink fails if the end
is not aligned, so every variant I tried with foo size = 32K and
reflink size <32K worked in a good, predictable way resulting in encoded
writes and such.

Would it make sense to add reflink + send/recv tests like this test.sh
to fstests?  I can do it if you like the idea but don't have time.

> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>


* Re: [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range
  2022-11-15 21:45 ` Boris Burkov
@ 2022-11-16 10:28   ` Filipe Manana
  0 siblings, 0 replies; 4+ messages in thread
From: Filipe Manana @ 2022-11-16 10:28 UTC
  To: Boris Burkov; +Cc: linux-btrfs

On Tue, Nov 15, 2022 at 01:45:47PM -0800, Boris Burkov wrote:
> Nice fix, confirmed that it works for me.

Not exactly a fix, I see it more as a performance improvement.

It works around the bug in receive for this type of scenario, but the real bug
is in the fallback code at the receiver. There are also more cases where it
needs to fall back to decompression + write, such as passing --force-decompress
to receive or the kernel at the receiver not supporting the encoded writes
ioctl, as well as a few more cases.

> 
> FWIW, I was curious if this fix would result in the "opposite" problem
> if you reflinked less than the full file and needed to finish the loop
> to get the next big chunk to be aligned. But reflink fails if the end
> is not aligned, so every variant I tried with foo size = 32K and
> reflink size <32K worked in a good, predictable way resulting in encoded
> writes and such.
> 
> Would it make sense to add reflink + send/recv tests like this test.sh
> to fstests?  I can do it if you like the idea but don't have time.

Yes, the goal is to add it to fstests, and I'll do it. In general if you
see me pasting a reproducer in a changelog, you can assume it will end up
in fstests sooner or later. Nowadays it's more later than sooner, as with
the new fstests maintainer things flow more slowly and the test should go
after the respective kernel patch is merged in Linus' tree, so there's some
delay.

Thanks.



* Re: [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range
  2022-11-15 16:29 [PATCH] btrfs: send: avoid unaligned encoded writes when attempting to clone range fdmanana
  2022-11-15 21:45 ` Boris Burkov
@ 2022-11-18 16:11 ` David Sterba
  1 sibling, 0 replies; 4+ messages in thread
From: David Sterba @ 2022-11-18 16:11 UTC
  To: fdmanana; +Cc: linux-btrfs

Added to misc-next, thanks.


