All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
@ 2021-09-28 23:51 Qu Wenruo
  2021-09-29  9:27 ` Filipe Manana
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2021-09-28 23:51 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
There are two every basic send streams:
(a/m/ctime and uuid omitted)

  Stream 1: (Parent subvolume)
  subvol   ./parent_subv           transid=8
  chown    ./parent_subv/          gid=0 uid=0
  chmod    ./parent_subv/          mode=755
  utimes   ./parent_subv/
  mkfile   ./parent_subv/o257-7-0
  rename   ./parent_subv/o257-7-0  dest=./parent_subv/source
  utimes   ./parent_subv/
  write    ./parent_subv/source    offset=0 len=16384
  chown    ./parent_subv/source    gid=0 uid=0
  chmod    ./parent_subv/source    mode=600
  utimes   ./parent_subv/source

  Stream 2: (snapshot and clone)
  snapshot ./dest_subv             transid=14 parent_transid=10
  utimes   ./dest_subv/
  mkfile   ./dest_subv/o258-14-0
  rename   ./dest_subv/o258-14-0   dest=./dest_subv/reflink
  utimes   ./dest_subv/
  clone    ./dest_subv/reflink     offset=0 len=16384 from=./dest_subv/source clone_offset=0
  chown    ./dest_subv/reflink     gid=0 uid=0
  chmod    ./dest_subv/reflink     mode=600
  utimes   ./dest_subv/reflink

But if we receive the first stream with default mount options, then
remount to nodatasum, and try to receive the second stream, it will fail:

 # mount /mnt/btrfs
 # btrfs receive -f ~/parent_stream /mnt/btrfs/
 At subvol parent_subv
 # mount -o remount,nodatasum /mnt/btrfs
 # btrfs receive -f ~/clone_stream /mnt/btrfs/
 At snapshot dest_subv
 ERROR: failed to clone extents to reflink: Invalid argument
 # echo $?
 1

[CAUSE]
Btrfs doesn't allow clone source and destination to have different NODATASUM
flags.
This is to prevent a data extent to be owned by both NODATASUM inode and
regular DATASUM inode.

For above receive operations, the clone destination is inheriting the
NODATASUM flag from mount option, while the clone source has no
NODATASUM flag, thus preventing us from doing the clone.

[FIX]
Btrfs send/receive doesn't require the underlying inode has the same
flags (thus we can send from compressed extent and receive on a
non-compressed filesystem).

So here we can just fall back to buffered write to copy the data from
the source file if we got an -EINVAL error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Reason for RFC:

Such fallback can lead to hidden bugs not being exposed, thus a new
warning is added for such fallback case.

Personally I really want to do more comprehensive check in user space to
ensure it's only mismatching NODATASUM flags causing the problem.
Then we can completely remove the warning message.

But unfortunately that check can go out-of-sync with kernel and due to
the lack of NODATASUM flags interface we're not really able to check
that easily.

So I took the advice from Filipe to just do a simple fall back.

Any feedback on such maybe niche point would help.
(Really hope it's me being paranoid again)
---
 cmds/receive.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/cmds/receive.c b/cmds/receive.c
index 48c774cea587..4cb857a13cdf 100644
--- a/cmds/receive.c
+++ b/cmds/receive.c
@@ -705,6 +705,51 @@ out:
 	return ret;
 }
 
+#define BUFFER_SIZE	SZ_32K
+static int buffered_copy(int src_fd, int dst_fd, u64 src_offset, u64 len,
+			 u64 dest_offset)
+{
+	unsigned char *buf;
+	u64 cur_offset = 0;
+	int ret = 0;
+
+	buf = calloc(BUFFER_SIZE, 1);
+	if (!buf)
+		return -ENOMEM;
+
+	while (cur_offset < len) {
+		u32 copy_len = min_t(u32, BUFFER_SIZE, len - cur_offset);
+		u32 write_offset = 0;
+		ssize_t read_size;
+
+		read_size = pread(src_fd, buf, copy_len, src_offset + cur_offset);
+		if (read_size < 0) {
+			ret = -errno;
+			error("failed to read source file: %m");
+			goto out;
+		}
+
+		/* Write the buffer to dest file */
+		while (write_offset < read_size) {
+			ssize_t write_size;
+
+			write_size = pwrite(dst_fd, buf + write_offset,
+					read_size - write_offset,
+					dest_offset + cur_offset + write_offset);
+			if (write_size < 0) {
+				ret = -errno;
+				error("failed to write source file: %m");
+				goto out;
+			}
+			write_offset += write_size;
+		}
+		cur_offset += read_size;
+	}
+out:
+	free(buf);
+	return ret;
+}
+
 static int process_clone(const char *path, u64 offset, u64 len,
 			 const u8 *clone_uuid, u64 clone_ctransid,
 			 const char *clone_path, u64 clone_offset,
@@ -788,8 +833,16 @@ static int process_clone(const char *path, u64 offset, u64 len,
 	ret = ioctl(rctx->write_fd, BTRFS_IOC_CLONE_RANGE, &clone_args);
 	if (ret < 0) {
 		ret = -errno;
-		error("failed to clone extents to %s: %m", path);
-		goto out;
+		if (ret != -EINVAL) {
+			error("failed to clone extents to %s: %m", path);
+			goto out;
+		}
+
+		warning(
+		"failed to clone extents to %s, fallback to buffered write",
+			path);
+		ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
+				    len, offset);
 	}
 
 out:
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
  2021-09-28 23:51 [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed Qu Wenruo
@ 2021-09-29  9:27 ` Filipe Manana
  2021-09-29  9:33   ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2021-09-29  9:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Sep 29, 2021 at 12:55 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> There are two every basic send streams:
> (a/m/ctime and uuid omitted)
>
>   Stream 1: (Parent subvolume)
>   subvol   ./parent_subv           transid=8
>   chown    ./parent_subv/          gid=0 uid=0
>   chmod    ./parent_subv/          mode=755
>   utimes   ./parent_subv/
>   mkfile   ./parent_subv/o257-7-0
>   rename   ./parent_subv/o257-7-0  dest=./parent_subv/source
>   utimes   ./parent_subv/
>   write    ./parent_subv/source    offset=0 len=16384
>   chown    ./parent_subv/source    gid=0 uid=0
>   chmod    ./parent_subv/source    mode=600
>   utimes   ./parent_subv/source
>
>   Stream 2: (snapshot and clone)
>   snapshot ./dest_subv             transid=14 parent_transid=10
>   utimes   ./dest_subv/
>   mkfile   ./dest_subv/o258-14-0
>   rename   ./dest_subv/o258-14-0   dest=./dest_subv/reflink
>   utimes   ./dest_subv/
>   clone    ./dest_subv/reflink     offset=0 len=16384 from=./dest_subv/source clone_offset=0
>   chown    ./dest_subv/reflink     gid=0 uid=0
>   chmod    ./dest_subv/reflink     mode=600
>   utimes   ./dest_subv/reflink
>
> But if we receive the first stream with default mount options, then
> remount to nodatasum, and try to receive the second stream, it will fail:
>
>  # mount /mnt/btrfs
>  # btrfs receive -f ~/parent_stream /mnt/btrfs/
>  At subvol parent_subv
>  # mount -o remount,nodatasum /mnt/btrfs
>  # btrfs receive -f ~/clone_stream /mnt/btrfs/
>  At snapshot dest_subv
>  ERROR: failed to clone extents to reflink: Invalid argument
>  # echo $?
>  1
>
> [CAUSE]
> Btrfs doesn't allow clone source and destination to have different NODATASUM
> flags.
> This is to prevent a data extent to be owned by both NODATASUM inode and
> regular DATASUM inode.
>
> For above receive operations, the clone destination is inheriting the
> NODATASUM flag from mount option, while the clone source has no
> NODATASUM flag, thus preventing us from doing the clone.
>
> [FIX]
> Btrfs send/receive doesn't require the underlying inode has the same
> flags (thus we can send from compressed extent and receive on a
> non-compressed filesystem).
>
> So here we can just fall back to buffered write to copy the data from
> the source file if we got an -EINVAL error.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Reason for RFC:
>
> Such fallback can lead to hidden bugs not being exposed, thus a new
> warning is added for such fallback case.
>
> Personally I really want to do more comprehensive check in user space to
> ensure it's only mismatching NODATASUM flags causing the problem.
> Then we can completely remove the warning message.
>
> But unfortunately that check can go out-of-sync with kernel and due to
> the lack of NODATASUM flags interface we're not really able to check
> that easily.
>
> So I took the advice from Filipe to just do a simple fall back.
>
> Any feedback on such maybe niche point would help.
> (Really hope it's me being paranoid again)
> ---
>  cmds/receive.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 55 insertions(+), 2 deletions(-)
>
> diff --git a/cmds/receive.c b/cmds/receive.c
> index 48c774cea587..4cb857a13cdf 100644
> --- a/cmds/receive.c
> +++ b/cmds/receive.c
> @@ -705,6 +705,51 @@ out:
>         return ret;
>  }
>
> +#define BUFFER_SIZE    SZ_32K
> +static int buffered_copy(int src_fd, int dst_fd, u64 src_offset, u64 len,

At the very least leave a blank line between the #define and the
function's declaration.

> +                        u64 dest_offset)
> +{
> +       unsigned char *buf;
> +       u64 cur_offset = 0;
> +       int ret = 0;
> +
> +       buf = calloc(BUFFER_SIZE, 1);

It could be simpler:

char buf[SZ_32K];

then use ARRAY_SIZE() below.

> +       if (!buf)
> +               return -ENOMEM;
> +
> +       while (cur_offset < len) {

This is a bit confusing, comparing an offset to a length.
Renaming "cur_offset" to "copied" would be more logical imo.

> +               u32 copy_len = min_t(u32, BUFFER_SIZE, len - cur_offset);
> +               u32 write_offset = 0;
> +               ssize_t read_size;
> +
> +               read_size = pread(src_fd, buf, copy_len, src_offset + cur_offset);
> +               if (read_size < 0) {
> +                       ret = -errno;
> +                       error("failed to read source file: %m");
> +                       goto out;
> +               }

Normally we should only exit if errno is not EINTR, and retry
(continue) on the EINTR case.

> +
> +               /* Write the buffer to dest file */
> +               while (write_offset < read_size) {

Same here, like "write_offset" to "written".

> +                       ssize_t write_size;
> +
> +                       write_size = pwrite(dst_fd, buf + write_offset,
> +                                       read_size - write_offset,
> +                                       dest_offset + cur_offset + write_offset);
> +                       if (write_size < 0) {
> +                               ret = -errno;
> +                               error("failed to write source file: %m");
> +                               goto out;
> +                       }

Same here regarding dealing with EINTR.

> +                       write_offset += write_size;
> +               }
> +               cur_offset += read_size;
> +       }
> +out:
> +       free(buf);
> +       return ret;
> +}
> +
>  static int process_clone(const char *path, u64 offset, u64 len,
>                          const u8 *clone_uuid, u64 clone_ctransid,
>                          const char *clone_path, u64 clone_offset,
> @@ -788,8 +833,16 @@ static int process_clone(const char *path, u64 offset, u64 len,
>         ret = ioctl(rctx->write_fd, BTRFS_IOC_CLONE_RANGE, &clone_args);
>         if (ret < 0) {
>                 ret = -errno;
> -               error("failed to clone extents to %s: %m", path);
> -               goto out;
> +               if (ret != -EINVAL) {
> +                       error("failed to clone extents to %s: %m", path);
> +                       goto out;
> +               }
> +
> +               warning(
> +               "failed to clone extents to %s, fallback to buffered write",
> +                       path);

What if we have thousands of clone operations?
Is there any rate limited print() in progs like there is for kernel?

That's one reason why my proposal had in mind an optional flag to
trigger this behaviour.

Thanks for doing it, I was planning on doing something similar soon.

> +               ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
> +                                   len, offset);
>         }
>
>  out:
> --
> 2.33.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
  2021-09-29  9:27 ` Filipe Manana
@ 2021-09-29  9:33   ` Qu Wenruo
  2021-09-29  9:59     ` Filipe Manana
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2021-09-29  9:33 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs



On 2021/9/29 17:27, Filipe Manana wrote:
> On Wed, Sep 29, 2021 at 12:55 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> There are two every basic send streams:
>> (a/m/ctime and uuid omitted)
>>
>>    Stream 1: (Parent subvolume)
>>    subvol   ./parent_subv           transid=8
>>    chown    ./parent_subv/          gid=0 uid=0
>>    chmod    ./parent_subv/          mode=755
>>    utimes   ./parent_subv/
>>    mkfile   ./parent_subv/o257-7-0
>>    rename   ./parent_subv/o257-7-0  dest=./parent_subv/source
>>    utimes   ./parent_subv/
>>    write    ./parent_subv/source    offset=0 len=16384
>>    chown    ./parent_subv/source    gid=0 uid=0
>>    chmod    ./parent_subv/source    mode=600
>>    utimes   ./parent_subv/source
>>
>>    Stream 2: (snapshot and clone)
>>    snapshot ./dest_subv             transid=14 parent_transid=10
>>    utimes   ./dest_subv/
>>    mkfile   ./dest_subv/o258-14-0
>>    rename   ./dest_subv/o258-14-0   dest=./dest_subv/reflink
>>    utimes   ./dest_subv/
>>    clone    ./dest_subv/reflink     offset=0 len=16384 from=./dest_subv/source clone_offset=0
>>    chown    ./dest_subv/reflink     gid=0 uid=0
>>    chmod    ./dest_subv/reflink     mode=600
>>    utimes   ./dest_subv/reflink
>>
>> But if we receive the first stream with default mount options, then
>> remount to nodatasum, and try to receive the second stream, it will fail:
>>
>>   # mount /mnt/btrfs
>>   # btrfs receive -f ~/parent_stream /mnt/btrfs/
>>   At subvol parent_subv
>>   # mount -o remount,nodatasum /mnt/btrfs
>>   # btrfs receive -f ~/clone_stream /mnt/btrfs/
>>   At snapshot dest_subv
>>   ERROR: failed to clone extents to reflink: Invalid argument
>>   # echo $?
>>   1
>>
>> [CAUSE]
>> Btrfs doesn't allow clone source and destination to have different NODATASUM
>> flags.
>> This is to prevent a data extent to be owned by both NODATASUM inode and
>> regular DATASUM inode.
>>
>> For above receive operations, the clone destination is inheriting the
>> NODATASUM flag from mount option, while the clone source has no
>> NODATASUM flag, thus preventing us from doing the clone.
>>
>> [FIX]
>> Btrfs send/receive doesn't require the underlying inode has the same
>> flags (thus we can send from compressed extent and receive on a
>> non-compressed filesystem).
>>
>> So here we can just fall back to buffered write to copy the data from
>> the source file if we got an -EINVAL error.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Reason for RFC:
>>
>> Such fallback can lead to hidden bugs not being exposed, thus a new
>> warning is added for such fallback case.
>>
>> Personally I really want to do more comprehensive check in user space to
>> ensure it's only mismatching NODATASUM flags causing the problem.
>> Then we can completely remove the warning message.
>>
>> But unfortunately that check can go out-of-sync with kernel and due to
>> the lack of NODATASUM flags interface we're not really able to check
>> that easily.
>>
>> So I took the advice from Filipe to just do a simple fall back.
>>
>> Any feedback on such maybe niche point would help.
>> (Really hope it's me being paranoid again)
>> ---
>>   cmds/receive.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 55 insertions(+), 2 deletions(-)
>>
>> diff --git a/cmds/receive.c b/cmds/receive.c
>> index 48c774cea587..4cb857a13cdf 100644
>> --- a/cmds/receive.c
>> +++ b/cmds/receive.c
>> @@ -705,6 +705,51 @@ out:
>>          return ret;
>>   }
>>
>> +#define BUFFER_SIZE    SZ_32K
>> +static int buffered_copy(int src_fd, int dst_fd, u64 src_offset, u64 len,
>
> At the very least leave a blank line between the #define and the
> function's declaration.
>
>> +                        u64 dest_offset)
>> +{
>> +       unsigned char *buf;
>> +       u64 cur_offset = 0;
>> +       int ret = 0;
>> +
>> +       buf = calloc(BUFFER_SIZE, 1);
>
> It could be simpler:
>
> char buf[SZ_32K];
>
> then use ARRAY_SIZE() below.
>
>> +       if (!buf)
>> +               return -ENOMEM;
>> +
>> +       while (cur_offset < len) {
>
> This is a bit confusing, comparing an offset to a length.
> Renaming "cur_offset" to "copied" would be more logical imo.
>
>> +               u32 copy_len = min_t(u32, BUFFER_SIZE, len - cur_offset);
>> +               u32 write_offset = 0;
>> +               ssize_t read_size;
>> +
>> +               read_size = pread(src_fd, buf, copy_len, src_offset + cur_offset);
>> +               if (read_size < 0) {
>> +                       ret = -errno;
>> +                       error("failed to read source file: %m");
>> +                       goto out;
>> +               }
>
> Normally we should only exit if errno is not EINTR, and retry
> (continue) on the EINTR case.

For this, I'm a little concerned.

EINTR means the operation is interrupted (by signal).
In that case, doesn't it mean the user want to stop the receive?

>
>> +
>> +               /* Write the buffer to dest file */
>> +               while (write_offset < read_size) {
>
> Same here, like "write_offset" to "written".
>
>> +                       ssize_t write_size;
>> +
>> +                       write_size = pwrite(dst_fd, buf + write_offset,
>> +                                       read_size - write_offset,
>> +                                       dest_offset + cur_offset + write_offset);
>> +                       if (write_size < 0) {
>> +                               ret = -errno;
>> +                               error("failed to write source file: %m");
>> +                               goto out;
>> +                       }
>
> Same here regarding dealing with EINTR.
>
>> +                       write_offset += write_size;
>> +               }
>> +               cur_offset += read_size;
>> +       }
>> +out:
>> +       free(buf);
>> +       return ret;
>> +}
>> +
>>   static int process_clone(const char *path, u64 offset, u64 len,
>>                           const u8 *clone_uuid, u64 clone_ctransid,
>>                           const char *clone_path, u64 clone_offset,
>> @@ -788,8 +833,16 @@ static int process_clone(const char *path, u64 offset, u64 len,
>>          ret = ioctl(rctx->write_fd, BTRFS_IOC_CLONE_RANGE, &clone_args);
>>          if (ret < 0) {
>>                  ret = -errno;
>> -               error("failed to clone extents to %s: %m", path);
>> -               goto out;
>> +               if (ret != -EINVAL) {
>> +                       error("failed to clone extents to %s: %m", path);
>> +                       goto out;
>> +               }
>> +
>> +               warning(
>> +               "failed to clone extents to %s, fallback to buffered write",
>> +                       path);
>
> What if we have thousands of clone operations?
> Is there any rate limited print() in progs like there is for kernel?

Unfortunately we don't yet have.

But the good news (that I didn't catch at time of writing) is, we now
have global verbose/quite switch, so that we can easily hide those
warning by default and only output such warning for verbose run.

Thanks,
Qu

>
> That's one reason why my proposal had in mind an optional flag to
> trigger this behaviour.
>
> Thanks for doing it, I was planning on doing something similar soon.
>
>> +               ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
>> +                                   len, offset);
>>          }
>>
>>   out:
>> --
>> 2.33.0
>>
>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
  2021-09-29  9:33   ` Qu Wenruo
@ 2021-09-29  9:59     ` Filipe Manana
  2021-09-29 10:25       ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2021-09-29  9:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Wed, Sep 29, 2021 at 10:33 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/9/29 17:27, Filipe Manana wrote:
> > On Wed, Sep 29, 2021 at 12:55 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> There are two every basic send streams:
> >> (a/m/ctime and uuid omitted)
> >>
> >>    Stream 1: (Parent subvolume)
> >>    subvol   ./parent_subv           transid=8
> >>    chown    ./parent_subv/          gid=0 uid=0
> >>    chmod    ./parent_subv/          mode=755
> >>    utimes   ./parent_subv/
> >>    mkfile   ./parent_subv/o257-7-0
> >>    rename   ./parent_subv/o257-7-0  dest=./parent_subv/source
> >>    utimes   ./parent_subv/
> >>    write    ./parent_subv/source    offset=0 len=16384
> >>    chown    ./parent_subv/source    gid=0 uid=0
> >>    chmod    ./parent_subv/source    mode=600
> >>    utimes   ./parent_subv/source
> >>
> >>    Stream 2: (snapshot and clone)
> >>    snapshot ./dest_subv             transid=14 parent_transid=10
> >>    utimes   ./dest_subv/
> >>    mkfile   ./dest_subv/o258-14-0
> >>    rename   ./dest_subv/o258-14-0   dest=./dest_subv/reflink
> >>    utimes   ./dest_subv/
> >>    clone    ./dest_subv/reflink     offset=0 len=16384 from=./dest_subv/source clone_offset=0
> >>    chown    ./dest_subv/reflink     gid=0 uid=0
> >>    chmod    ./dest_subv/reflink     mode=600
> >>    utimes   ./dest_subv/reflink
> >>
> >> But if we receive the first stream with default mount options, then
> >> remount to nodatasum, and try to receive the second stream, it will fail:
> >>
> >>   # mount /mnt/btrfs
> >>   # btrfs receive -f ~/parent_stream /mnt/btrfs/
> >>   At subvol parent_subv
> >>   # mount -o remount,nodatasum /mnt/btrfs
> >>   # btrfs receive -f ~/clone_stream /mnt/btrfs/
> >>   At snapshot dest_subv
> >>   ERROR: failed to clone extents to reflink: Invalid argument
> >>   # echo $?
> >>   1
> >>
> >> [CAUSE]
> >> Btrfs doesn't allow clone source and destination to have different NODATASUM
> >> flags.
> >> This is to prevent a data extent to be owned by both NODATASUM inode and
> >> regular DATASUM inode.
> >>
> >> For above receive operations, the clone destination is inheriting the
> >> NODATASUM flag from mount option, while the clone source has no
> >> NODATASUM flag, thus preventing us from doing the clone.
> >>
> >> [FIX]
> >> Btrfs send/receive doesn't require the underlying inode has the same
> >> flags (thus we can send from compressed extent and receive on a
> >> non-compressed filesystem).
> >>
> >> So here we can just fall back to buffered write to copy the data from
> >> the source file if we got an -EINVAL error.
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Reason for RFC:
> >>
> >> Such fallback can lead to hidden bugs not being exposed, thus a new
> >> warning is added for such fallback case.
> >>
> >> Personally I really want to do more comprehensive check in user space to
> >> ensure it's only mismatching NODATASUM flags causing the problem.
> >> Then we can completely remove the warning message.
> >>
> >> But unfortunately that check can go out-of-sync with kernel and due to
> >> the lack of NODATASUM flags interface we're not really able to check
> >> that easily.
> >>
> >> So I took the advice from Filipe to just do a simple fall back.
> >>
> >> Any feedback on such maybe niche point would help.
> >> (Really hope it's me being paranoid again)
> >> ---
> >>   cmds/receive.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 55 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/cmds/receive.c b/cmds/receive.c
> >> index 48c774cea587..4cb857a13cdf 100644
> >> --- a/cmds/receive.c
> >> +++ b/cmds/receive.c
> >> @@ -705,6 +705,51 @@ out:
> >>          return ret;
> >>   }
> >>
> >> +#define BUFFER_SIZE    SZ_32K
> >> +static int buffered_copy(int src_fd, int dst_fd, u64 src_offset, u64 len,
> >
> > At the very least leave a blank line between the #define and the
> > function's declaration.
> >
> >> +                        u64 dest_offset)
> >> +{
> >> +       unsigned char *buf;
> >> +       u64 cur_offset = 0;
> >> +       int ret = 0;
> >> +
> >> +       buf = calloc(BUFFER_SIZE, 1);
> >
> > It could be simpler:
> >
> > char buf[SZ_32K];
> >
> > then use ARRAY_SIZE() below.
> >
> >> +       if (!buf)
> >> +               return -ENOMEM;
> >> +
> >> +       while (cur_offset < len) {
> >
> > This is a bit confusing, comparing an offset to a length.
> > Renaming "cur_offset" to "copied" would be more logical imo.
> >
> >> +               u32 copy_len = min_t(u32, BUFFER_SIZE, len - cur_offset);
> >> +               u32 write_offset = 0;
> >> +               ssize_t read_size;
> >> +
> >> +               read_size = pread(src_fd, buf, copy_len, src_offset + cur_offset);
> >> +               if (read_size < 0) {
> >> +                       ret = -errno;
> >> +                       error("failed to read source file: %m");
> >> +                       goto out;
> >> +               }
> >
> > Normally we should only exit if errno is not EINTR, and retry
> > (continue) on the EINTR case.
>
> For this, I'm a little concerned.
>
> EINTR means the operation is interrupted (by signal).
> In that case, doesn't it mean the user want to stop the receive?

There will be plenty of opportunities to be interrupted and still be responsive.
But ok, you can leave it as it is.

>
> >
> >> +
> >> +               /* Write the buffer to dest file */
> >> +               while (write_offset < read_size) {
> >
> > Same here, like "write_offset" to "written".
> >
> >> +                       ssize_t write_size;
> >> +
> >> +                       write_size = pwrite(dst_fd, buf + write_offset,
> >> +                                       read_size - write_offset,
> >> +                                       dest_offset + cur_offset + write_offset);
> >> +                       if (write_size < 0) {
> >> +                               ret = -errno;
> >> +                               error("failed to write source file: %m");
> >> +                               goto out;
> >> +                       }
> >
> > Same here regarding dealing with EINTR.
> >
> >> +                       write_offset += write_size;
> >> +               }
> >> +               cur_offset += read_size;
> >> +       }
> >> +out:
> >> +       free(buf);
> >> +       return ret;
> >> +}
> >> +
> >>   static int process_clone(const char *path, u64 offset, u64 len,
> >>                           const u8 *clone_uuid, u64 clone_ctransid,
> >>                           const char *clone_path, u64 clone_offset,
> >> @@ -788,8 +833,16 @@ static int process_clone(const char *path, u64 offset, u64 len,
> >>          ret = ioctl(rctx->write_fd, BTRFS_IOC_CLONE_RANGE, &clone_args);
> >>          if (ret < 0) {
> >>                  ret = -errno;
> >> -               error("failed to clone extents to %s: %m", path);
> >> -               goto out;
> >> +               if (ret != -EINVAL) {
> >> +                       error("failed to clone extents to %s: %m", path);
> >> +                       goto out;
> >> +               }
> >> +
> >> +               warning(
> >> +               "failed to clone extents to %s, fallback to buffered write",
> >> +                       path);
> >
> > What if we have thousands of clone operations?
> > Is there any rate limited print() in progs like there is for kernel?
>
> Unfortunately we don't yet have.
>
> But the good news (that I didn't catch at time of writing) is, we now
> have global verbose/quite switch, so that we can easily hide those
> warning by default and only output such warning for verbose run.

The problem with this is that it will possibly hide future bugs with
the kernel sending incorrect clone operations.

Having this fallback-to-read-write behaviour by default would hide
some bugs we had in the past on the kernel side, and unless users
start running receive with the verbose switch, we won't know about it.
Even if they do run with the verbose switch, some might not notice the
warnings at all, and for those who notice it they might not bother to
report the warnings since the receive succeeded and they didn't find
any data corruption/loss.

Or we might start receiving reports about send/receive doing less
cloning/extent sharing than before, and we won't easily know that the
receive side has fallen back to this read-write behaviour due to some
bug on kernel.

That's why making the behaviour explicit through a new command line
flag would cause less surprises and make it harder to miss regressions
on the kernel. And add some documentation explaining on which cases
the flag is useful.

Thanks.

>
> Thanks,
> Qu
>
> >
> > That's one reason why my proposal had in mind an optional flag to
> > trigger this behaviour.
> >
> > Thanks for doing it, I was planning on doing something similar soon.
> >
> >> +               ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
> >> +                                   len, offset);
> >>          }
> >>
> >>   out:
> >> --
> >> 2.33.0
> >>
> >
> >



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
  2021-09-29  9:59     ` Filipe Manana
@ 2021-09-29 10:25       ` Qu Wenruo
  2021-09-29 10:44         ` Filipe Manana
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2021-09-29 10:25 UTC (permalink / raw)
  To: fdmanana, Qu Wenruo; +Cc: linux-btrfs



On 2021/9/29 17:59, Filipe Manana wrote:
[...]
>>> Normally we should only exit if errno is not EINTR, and retry
>>> (continue) on the EINTR case.
>>
>> For this, I'm a little concerned.
>>
>> EINTR means the operation is interrupted (by signal).
>> In that case, doesn't it mean the user want to stop the receive?
> 
> There will be plenty of opportunities to be interrupted and still be responsive.
> But ok, you can leave it as it is.
> 
>>
[...]
>>>
>>> What if we have thousands of clone operations?
>>> Is there any rate limited print() in progs like there is for kernel?
>>
>> Unfortunately we don't yet have.
>>
>> But the good news (that I didn't catch at time of writing) is, we now
>> have global verbose/quite switch, so that we can easily hide those
>> warning by default and only output such warning for verbose run.
> 
> The problem with this is that it will possibly hide future bugs with
> the kernel sending incorrect clone operations.
> 
> Having this fallback-to-read-write behaviour by default would hide
> some bugs we had in the past on the kernel side, and unless users
> start running receive with the verbose switch, we won't know about it.
> Even if they do run with the verbose switch, some might not notice the
> warnings at all, and for those who notice it they might not bother to
> report the warnings since the receive succeeded and they didn't find
> any data corruption/loss.
> 
> Or we might start receiving reports about send/receive doing less
> cloning/extent sharing than before, and we won't easily know that the
> receive side has fallen back to this read-write behaviour due to some
> bug on kernel.
> 
> That's why making the behaviour explicit through a new command line
> flag would cause less surprises and make it harder to miss regressions
> on the kernel. And add some documentation explaining on which cases
> the flag is useful.

This sounds indeed better, handling both cases well.

I'll try to add a new option to receive, maybe something called 
--clone-fallback.

Any advice on the option name would be very helpful.

Thanks,
Qu

> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> That's one reason why my proposal had in mind an optional flag to
>>> trigger this behaviour.
>>>
>>> Thanks for doing it, I was planning on doing something similar soon.
>>>
>>>> +               ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
>>>> +                                   len, offset);
>>>>           }
>>>>
>>>>    out:
>>>> --
>>>> 2.33.0
>>>>
>>>
>>>
> 
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed
  2021-09-29 10:25       ` Qu Wenruo
@ 2021-09-29 10:44         ` Filipe Manana
  0 siblings, 0 replies; 6+ messages in thread
From: Filipe Manana @ 2021-09-29 10:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs

On Wed, Sep 29, 2021 at 11:25 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2021/9/29 17:59, Filipe Manana wrote:
> [...]
> >>> Normally we should only exit if errno is not EINTR, and retry
> >>> (continue) on the EINTR case.
> >>
> >> For this, I'm a little concerned.
> >>
> >> EINTR means the operation is interrupted (by signal).
> >> In that case, doesn't it mean the user want to stop the receive?
> >
> > There will be plenty of opportunities to be interrupted and still be responsive.
> > But ok, you can leave it as it is.
> >
> >>
> [...]
> >>>
> >>> What if we have thousands of clone operations?
> >>> Is there any rate limited print() in progs like there is for kernel?
> >>
> >> Unfortunately we don't yet have.
> >>
> >> But the good news (that I didn't catch at time of writing) is, we now
> >> have global verbose/quite switch, so that we can easily hide those
> >> warning by default and only output such warning for verbose run.
> >
> > The problem with this is that it will possibly hide future bugs with
> > the kernel sending incorrect clone operations.
> >
> > Having this fallback-to-read-write behaviour by default would hide
> > some bugs we had in the past on the kernel side, and unless users
> > start running receive with the verbose switch, we won't know about it.
> > Even if they do run with the verbose switch, some might not notice the
> > warnings at all, and for those who notice it they might not bother to
> > report the warnings since the receive succeeded and they didn't find
> > any data corruption/loss.
> >
> > Or we might start receiving reports about send/receive doing less
> > cloning/extent sharing than before, and we won't easily know that the
> > receive side has fallen back to this read-write behaviour due to some
> > bug on kernel.
> >
> > That's why making the behaviour explicit through a new command line
> > flag would cause less surprises and make it harder to miss regressions
> > on the kernel. And add some documentation explaining on which cases
> > the flag is useful.
>
> This sounds indeed better, handling both cases well.
>
> I'll try to add a new option to receive, maybe something called
> --clone-fallback.
>
> Any advice on the option name would be very helpful.

--clone-fallback

Sounds fine to me. It's short enough and gives some clue what it's about.
It's not completely self-explanatory, but I don't have a better
suggestion, and I think any name will always require an explanation of
what it does and when it's useful in the documentation (--help, man
page).

Thanks.


>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> That's one reason why my proposal had in mind an optional flag to
> >>> trigger this behaviour.
> >>>
> >>> Thanks for doing it, I was planning on doing something similar soon.
> >>>
> >>>> +               ret = buffered_copy(clone_fd, rctx->write_fd, clone_offset,
> >>>> +                                   len, offset);
> >>>>           }
> >>>>
> >>>>    out:
> >>>> --
> >>>> 2.33.0
> >>>>
> >>>
> >>>
> >
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-29 10:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-28 23:51 [PATCH RFC] btrfs-progs: receive: fallback to buffered copy if clone failed Qu Wenruo
2021-09-29  9:27 ` Filipe Manana
2021-09-29  9:33   ` Qu Wenruo
2021-09-29  9:59     ` Filipe Manana
2021-09-29 10:25       ` Qu Wenruo
2021-09-29 10:44         ` Filipe Manana

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.