Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/2] Allow deduplication of the eof block when it is safe to do so
@ 2019-12-16 18:26 fdmanana
  2019-12-16 18:26 ` [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file fdmanana
  2019-12-16 18:26 ` [PATCH 2/2] Btrfs: make deduplication with range including the last block work fdmanana
  0 siblings, 2 replies; 17+ messages in thread
From: fdmanana @ 2019-12-16 18:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Hi,

This short series allows deduplication of the last block of a file when
the eof is not aligned to the sector size, as long as the range's end
offset matches the eof of the destination file.

This is a safe case unlike the case where we attempt to clone the block in
the middle of a file (which results in a corruption I found last year and
affected both btrfs and xfs).

This is motivated by btrfs users reporting lower deduplication scores
starting with kernel 5.0, which was the kernel release where btrfs was
changed to use the generic VFS helper generic_remap_file_range_prep().
Users observed that the last block was no longer deduplicated when a
file's size is not block size aligned.  For btrfs this is specially
important because references are kept per extent and not per block, so
not having the last block deduplicated means the entire extent is kept
allocated, making the deduplication not effective and often pointless in
many cases.

Thanks.

Filipe Manana (2):
  fs: allow deduplication of eof block into the end of the destination
    file
  Btrfs: make deduplication with range including the last block work

 fs/btrfs/ioctl.c |  3 ++-
 fs/read_write.c  | 10 ++++------
 2 files changed, 6 insertions(+), 7 deletions(-)

-- 
2.11.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2019-12-16 18:26 [PATCH 0/2] Allow deduplication of the eof block when it is safe to do so fdmanana
@ 2019-12-16 18:26 ` fdmanana
  2019-12-17 15:52   ` Josef Bacik
  2020-01-07 16:23   ` Filipe Manana
  2019-12-16 18:26 ` [PATCH 2/2] Btrfs: make deduplication with range including the last block work fdmanana
  1 sibling, 2 replies; 17+ messages in thread
From: fdmanana @ 2019-12-16 18:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

We always round down, to a multiple of the filesystem's block size, the
length to deduplicate at generic_remap_check_len().  However this is only
needed if an attempt to deduplicate the last block into the middle of the
destination file is requested, since that leads into a corruption if the
length of the source file is not block size aligned.  When an attempt to
deduplicate the last block into the end of the destination file is
requested, we should allow it because it is safe to do it - there's no
stale data exposure and we are prepared to compare the data ranges for
a length not aligned to the block (or page) size - in fact we even do
the data compare before adjusting the deduplication length.

After btrfs was updated to use the generic helpers from VFS (by commit
34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
and deduplication")) we started to have user reports of deduplication
not reflinking the last block anymore, and whence users getting lower
deduplication scores.  The main use case is deduplication of entire
files that have a size not aligned to the block size of the filesystem.

We already allow cloning the last block to the end (and beyond) of the
destination file, so allow for deduplication as well.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/read_write.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5bbf587f5bc1..7458fccc59e1 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
  * else.  Assume that the offsets have already been checked for block
  * alignment.
  *
- * For deduplication we always scale down to the previous block because we
- * can't meaningfully compare post-EOF contents.
- *
- * For clone we only link a partial EOF block above the destination file's EOF.
+ * For clone we only link a partial EOF block above or at the destination file's
+ * EOF.  For deduplication we accept a partial EOF block only if it ends at the
+ * destination file's EOF (can not link it into the middle of a file).
  *
  * Shorten the request if possible.
  */
@@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
 	if ((*len & blkmask) == 0)
 		return 0;
 
-	if ((remap_flags & REMAP_FILE_DEDUP) ||
-	    pos_out + *len < i_size_read(inode_out))
+	if (pos_out + *len < i_size_read(inode_out))
 		new_len &= ~blkmask;
 
 	if (new_len == *len)
-- 
2.11.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2019-12-16 18:26 [PATCH 0/2] Allow deduplication of the eof block when it is safe to do so fdmanana
  2019-12-16 18:26 ` [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file fdmanana
@ 2019-12-16 18:26 ` fdmanana
  2019-12-17 15:54   ` Josef Bacik
  2019-12-29  5:22   ` Zygo Blaxell
  1 sibling, 2 replies; 17+ messages in thread
From: fdmanana @ 2019-12-16 18:26 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Since btrfs was migrated to use the generic VFS helpers for clone and
deduplication, it stopped allowing for the last block of a file to be
deduplicated when the source file size is not sector size aligned (when
eof is somewhere in the middle of the last block). There are two reasons
for that:

1) The generic code always rounds down, to a multiple of the block size,
   the range's length for deduplications. This means we end up never
   deduplicating the last block when the eof is not block size aligned,
   even for the safe case where the destination range's end offset matches
   the destination file's size. That rounding down operation is done at
   generic_remap_check_len();

2) Because of that, the btrfs specific code does not expect anymore any
   non-aligned range length's for deduplication and therefore does not
   work if such nona-aligned length is given.

This patch addresses that second part, and it depends on a patch that
fixes generic_remap_check_len(), in the VFS, which was submitted ealier
and has the following subject:

  "fs: allow deduplication of eof block into the end of the destination file"

These two patches address reports from users that started seeing lower
deduplication rates due to the last block never being deduplicated when
the file size is not aligned to the filesystem's block size.

Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3418decb9e61..c41c276ff272 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3237,6 +3237,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
 static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
 				   struct inode *dst, u64 dst_loff)
 {
+	const u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
 	int ret;
 
 	/*
@@ -3244,7 +3245,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
 	 * source range to serialize with relocation.
 	 */
 	btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
-	ret = btrfs_clone(src, dst, loff, len, len, dst_loff, 1);
+	ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);
 	btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
 
 	return ret;
-- 
2.11.0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2019-12-16 18:26 ` [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file fdmanana
@ 2019-12-17 15:52   ` Josef Bacik
  2020-01-07 16:23   ` Filipe Manana
  1 sibling, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2019-12-17 15:52 UTC (permalink / raw)
  To: fdmanana, linux-fsdevel
  Cc: linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

On 12/16/19 1:26 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We always round down, to a multiple of the filesystem's block size, the
> length to deduplicate at generic_remap_check_len().  However this is only
> needed if an attempt to deduplicate the last block into the middle of the
> destination file is requested, since that leads into a corruption if the
> length of the source file is not block size aligned.  When an attempt to
> deduplicate the last block into the end of the destination file is
> requested, we should allow it because it is safe to do it - there's no
> stale data exposure and we are prepared to compare the data ranges for
> a length not aligned to the block (or page) size - in fact we even do
> the data compare before adjusting the deduplication length.
> 
> After btrfs was updated to use the generic helpers from VFS (by commit
> 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> and deduplication")) we started to have user reports of deduplication
> not reflinking the last block anymore, and whence users getting lower
> deduplication scores.  The main use case is deduplication of entire
> files that have a size not aligned to the block size of the filesystem.
> 
> We already allow cloning the last block to the end (and beyond) of the
> destination file, so allow for deduplication as well.
> 
> Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2019-12-16 18:26 ` [PATCH 2/2] Btrfs: make deduplication with range including the last block work fdmanana
@ 2019-12-17 15:54   ` Josef Bacik
  2019-12-29  5:22   ` Zygo Blaxell
  1 sibling, 0 replies; 17+ messages in thread
From: Josef Bacik @ 2019-12-17 15:54 UTC (permalink / raw)
  To: fdmanana, linux-fsdevel
  Cc: linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

On 12/16/19 1:26 PM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Since btrfs was migrated to use the generic VFS helpers for clone and
> deduplication, it stopped allowing for the last block of a file to be
> deduplicated when the source file size is not sector size aligned (when
> eof is somewhere in the middle of the last block). There are two reasons
> for that:
> 
> 1) The generic code always rounds down, to a multiple of the block size,
>     the range's length for deduplications. This means we end up never
>     deduplicating the last block when the eof is not block size aligned,
>     even for the safe case where the destination range's end offset matches
>     the destination file's size. That rounding down operation is done at
>     generic_remap_check_len();
> 
> 2) Because of that, the btrfs specific code does not expect anymore any
>     non-aligned range length's for deduplication and therefore does not
>     work if such nona-aligned length is given.
> 

Does anybody else rely on this behavior that needs a change like this for their fs?

> This patch addresses that second part, and it depends on a patch that
> fixes generic_remap_check_len(), in the VFS, which was submitted ealier
> and has the following subject:
> 
>    "fs: allow deduplication of eof block into the end of the destination file"
> 
> These two patches address reports from users that started seeing lower
> deduplication rates due to the last block never being deduplicated when
> the file size is not aligned to the filesystem's block size.
> 
> Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2019-12-16 18:26 ` [PATCH 2/2] Btrfs: make deduplication with range including the last block work fdmanana
  2019-12-17 15:54   ` Josef Bacik
@ 2019-12-29  5:22   ` Zygo Blaxell
  2020-01-07 16:18     ` Filipe Manana
  1 sibling, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2019-12-29  5:22 UTC (permalink / raw)
  To: fdmanana
  Cc: linux-fsdevel, linux-btrfs, linux-xfs, darrick.wong, Filipe Manana

[-- Attachment #1: Type: text/plain, Size: 3682 bytes --]

On Mon, Dec 16, 2019 at 06:26:56PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Since btrfs was migrated to use the generic VFS helpers for clone and
> deduplication, it stopped allowing for the last block of a file to be
> deduplicated when the source file size is not sector size aligned (when
> eof is somewhere in the middle of the last block). There are two reasons
> for that:
> 
> 1) The generic code always rounds down, to a multiple of the block size,
>    the range's length for deduplications. This means we end up never
>    deduplicating the last block when the eof is not block size aligned,
>    even for the safe case where the destination range's end offset matches
>    the destination file's size. That rounding down operation is done at
>    generic_remap_check_len();
> 
> 2) Because of that, the btrfs specific code does not expect anymore any
>    non-aligned range length's for deduplication and therefore does not
>    work if such nona-aligned length is given.
> 
> This patch addresses that second part, and it depends on a patch that
> fixes generic_remap_check_len(), in the VFS, which was submitted ealier
> and has the following subject:
> 
>   "fs: allow deduplication of eof block into the end of the destination file"
> 
> These two patches address reports from users that started seeing lower
> deduplication rates due to the last block never being deduplicated when
> the file size is not aligned to the filesystem's block size.
> 
> Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Should these patches be marked for stable (5.0+, but see below for
caveats about 5.0)?  The bug affects 5.3 and 5.4 which are still active,
and dedupe is an important feature for some users.

> ---
>  fs/btrfs/ioctl.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 3418decb9e61..c41c276ff272 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3237,6 +3237,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
>  static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
>  				   struct inode *dst, u64 dst_loff)
>  {
> +	const u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
>  	int ret;
>  
>  	/*
> @@ -3244,7 +3245,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
>  	 * source range to serialize with relocation.
>  	 */
>  	btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
> -	ret = btrfs_clone(src, dst, loff, len, len, dst_loff, 1);
> +	ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);

A heads-up for anyone backporting this to 5.0:  this patch depends on

	57a50e2506df Btrfs: remove no longer needed range length checks for deduplication

Simply resolving the git conflict without including 57a50e2506df produces
a kernel where dedupe rounds the size of the dst file up to the next
block boundary.  This is because 57a50e2506df changes the value of
"len".  Before 57a50e2506df, "len" is equivalent to "ALIGN(len, bs)"
at the btrfs_clone line; after 57a50e2506df, "len" is the unaligned
dedupe request length passed to the btrfs_extent_same_range function.
This changes the semantics of the btrfs_clone line significantly.

57a50e2506df is in 5.1, so 5.1+ kernels do not require any additional
patches.

4.20 and earlier don't have the bug, so don't need a fix.

>  	btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
>  
>  	return ret;
> -- 
> 2.11.0
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2019-12-29  5:22   ` Zygo Blaxell
@ 2020-01-07 16:18     ` Filipe Manana
  2020-01-07 18:16       ` Zygo Blaxell
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-07 16:18 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: linux-fsdevel, linux-btrfs, xfs, Darrick J. Wong, Filipe Manana

On Sun, Dec 29, 2019 at 5:22 AM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Mon, Dec 16, 2019 at 06:26:56PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Since btrfs was migrated to use the generic VFS helpers for clone and
> > deduplication, it stopped allowing for the last block of a file to be
> > deduplicated when the source file size is not sector size aligned (when
> > eof is somewhere in the middle of the last block). There are two reasons
> > for that:
> >
> > 1) The generic code always rounds down, to a multiple of the block size,
> >    the range's length for deduplications. This means we end up never
> >    deduplicating the last block when the eof is not block size aligned,
> >    even for the safe case where the destination range's end offset matches
> >    the destination file's size. That rounding down operation is done at
> >    generic_remap_check_len();
> >
> > 2) Because of that, the btrfs specific code does not expect anymore any
> >    non-aligned range length's for deduplication and therefore does not
> >    work if such nona-aligned length is given.
> >
> > This patch addresses that second part, and it depends on a patch that
> > fixes generic_remap_check_len(), in the VFS, which was submitted ealier
> > and has the following subject:
> >
> >   "fs: allow deduplication of eof block into the end of the destination file"
> >
> > These two patches address reports from users that started seeing lower
> > deduplication rates due to the last block never being deduplicated when
> > the file size is not aligned to the filesystem's block size.
> >
> > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> Should these patches be marked for stable (5.0+, but see below for
> caveats about 5.0)?  The bug affects 5.3 and 5.4 which are still active,
> and dedupe is an important feature for some users.

Usually I only mark things for stable that are critical: corruptions,
crashes and memory leaks for example.
I don't think this is a critical issue, since none of those things
happen. It's certainly inconvenient to not have
an extent fully deduplicated, but it's just that.

If a maintainer wants to add it for stable, I'm fine with it.

>
> > ---
> >  fs/btrfs/ioctl.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 3418decb9e61..c41c276ff272 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -3237,6 +3237,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
> >  static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> >                                  struct inode *dst, u64 dst_loff)
> >  {
> > +     const u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> >       int ret;
> >
> >       /*
> > @@ -3244,7 +3245,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> >        * source range to serialize with relocation.
> >        */
> >       btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
> > -     ret = btrfs_clone(src, dst, loff, len, len, dst_loff, 1);
> > +     ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);
>
> A heads-up for anyone backporting this to 5.0:  this patch depends on
>
>         57a50e2506df Btrfs: remove no longer needed range length checks for deduplication

For any kernel without that cleanup patch, backporting the first patch
in the series (the one touching only fs/read_write.c) is enough.
For any kernel with that cleanup patch, then both patches in the
series are needed.

Thanks.

>
> Simply resolving the git conflict without including 57a50e2506df produces
> a kernel where dedupe rounds the size of the dst file up to the next
> block boundary.  This is because 57a50e2506df changes the value of
> "len".  Before 57a50e2506df, "len" is equivalent to "ALIGN(len, bs)"
> at the btrfs_clone line; after 57a50e2506df, "len" is the unaligned
> dedupe request length passed to the btrfs_extent_same_range function.
> This changes the semantics of the btrfs_clone line significantly.
>
> 57a50e2506df is in 5.1, so 5.1+ kernels do not require any additional
> patches.
>
> 4.20 and earlier don't have the bug, so don't need a fix.
>
> >       btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
> >
> >       return ret;
> > --
> > 2.11.0
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2019-12-16 18:26 ` [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file fdmanana
  2019-12-17 15:52   ` Josef Bacik
@ 2020-01-07 16:23   ` Filipe Manana
  2020-01-07 17:57     ` Darrick J. Wong
  1 sibling, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-07 16:23 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-btrfs, xfs, Darrick J. Wong, Filipe Manana, Alexander Viro

On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> We always round down, to a multiple of the filesystem's block size, the
> length to deduplicate at generic_remap_check_len().  However this is only
> needed if an attempt to deduplicate the last block into the middle of the
> destination file is requested, since that leads into a corruption if the
> length of the source file is not block size aligned.  When an attempt to
> deduplicate the last block into the end of the destination file is
> requested, we should allow it because it is safe to do it - there's no
> stale data exposure and we are prepared to compare the data ranges for
> a length not aligned to the block (or page) size - in fact we even do
> the data compare before adjusting the deduplication length.
>
> After btrfs was updated to use the generic helpers from VFS (by commit
> 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> and deduplication")) we started to have user reports of deduplication
> not reflinking the last block anymore, and whence users getting lower
> deduplication scores.  The main use case is deduplication of entire
> files that have a size not aligned to the block size of the filesystem.
>
> We already allow cloning the last block to the end (and beyond) of the
> destination file, so allow for deduplication as well.
>
> Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Darrick, Al, any feedback?

Thanks.

> ---
>  fs/read_write.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 5bbf587f5bc1..7458fccc59e1 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
>   * else.  Assume that the offsets have already been checked for block
>   * alignment.
>   *
> - * For deduplication we always scale down to the previous block because we
> - * can't meaningfully compare post-EOF contents.
> - *
> - * For clone we only link a partial EOF block above the destination file's EOF.
> + * For clone we only link a partial EOF block above or at the destination file's
> + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> + * destination file's EOF (can not link it into the middle of a file).
>   *
>   * Shorten the request if possible.
>   */
> @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
>         if ((*len & blkmask) == 0)
>                 return 0;
>
> -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> -           pos_out + *len < i_size_read(inode_out))
> +       if (pos_out + *len < i_size_read(inode_out))
>                 new_len &= ~blkmask;
>
>         if (new_len == *len)
> --
> 2.11.0
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-07 16:23   ` Filipe Manana
@ 2020-01-07 17:57     ` Darrick J. Wong
  2020-01-08 11:36       ` Filipe Manana
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-07 17:57 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> >
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > We always round down, to a multiple of the filesystem's block size, the
> > length to deduplicate at generic_remap_check_len().  However this is only
> > needed if an attempt to deduplicate the last block into the middle of the
> > destination file is requested, since that leads into a corruption if the
> > length of the source file is not block size aligned.  When an attempt to
> > deduplicate the last block into the end of the destination file is
> > requested, we should allow it because it is safe to do it - there's no
> > stale data exposure and we are prepared to compare the data ranges for
> > a length not aligned to the block (or page) size - in fact we even do
> > the data compare before adjusting the deduplication length.
> >
> > After btrfs was updated to use the generic helpers from VFS (by commit
> > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > and deduplication")) we started to have user reports of deduplication
> > not reflinking the last block anymore, and whence users getting lower
> > deduplication scores.  The main use case is deduplication of entire
> > files that have a size not aligned to the block size of the filesystem.
> >
> > We already allow cloning the last block to the end (and beyond) of the
> > destination file, so allow for deduplication as well.
> >
> > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> 
> Darrick, Al, any feedback?

Is there a fstest to check for correct operation of dedupe at or beyond
source and destfile EOF?  Particularly if one range is /not/ at EOF?
And that an mmap read of the EOF block will see zeroes past EOF before
and after the dedupe operation?

If I fallocate a 16k file, write 'X' into the first 5000 bytes,
write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
should that work?

I'm convinced that we could support dedupe to EOF when the ranges of the
two files both end at the respective file's EOF, but it's the weirder
corner cases that I worry about...

--D

> Thanks.
> 
> > ---
> >  fs/read_write.c | 10 ++++------
> >  1 file changed, 4 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 5bbf587f5bc1..7458fccc59e1 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> >   * else.  Assume that the offsets have already been checked for block
> >   * alignment.
> >   *
> > - * For deduplication we always scale down to the previous block because we
> > - * can't meaningfully compare post-EOF contents.
> > - *
> > - * For clone we only link a partial EOF block above the destination file's EOF.
> > + * For clone we only link a partial EOF block above or at the destination file's
> > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > + * destination file's EOF (can not link it into the middle of a file).
> >   *
> >   * Shorten the request if possible.
> >   */
> > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> >         if ((*len & blkmask) == 0)
> >                 return 0;
> >
> > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > -           pos_out + *len < i_size_read(inode_out))
> > +       if (pos_out + *len < i_size_read(inode_out))
> >                 new_len &= ~blkmask;
> >
> >         if (new_len == *len)
> > --
> > 2.11.0
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2020-01-07 16:18     ` Filipe Manana
@ 2020-01-07 18:16       ` Zygo Blaxell
  2020-01-08 11:42         ` Filipe Manana
  0 siblings, 1 reply; 17+ messages in thread
From: Zygo Blaxell @ 2020-01-07 18:16 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-fsdevel, linux-btrfs, xfs, Darrick J. Wong, Filipe Manana

[-- Attachment #1: Type: text/plain, Size: 5949 bytes --]

On Tue, Jan 07, 2020 at 04:18:42PM +0000, Filipe Manana wrote:
> On Sun, Dec 29, 2019 at 5:22 AM Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> >
> > On Mon, Dec 16, 2019 at 06:26:56PM +0000, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Since btrfs was migrated to use the generic VFS helpers for clone and
> > > deduplication, it stopped allowing for the last block of a file to be
> > > deduplicated when the source file size is not sector size aligned (when
> > > eof is somewhere in the middle of the last block). There are two reasons
> > > for that:
> > >
> > > 1) The generic code always rounds down, to a multiple of the block size,
> > >    the range's length for deduplications. This means we end up never
> > >    deduplicating the last block when the eof is not block size aligned,
> > >    even for the safe case where the destination range's end offset matches
> > >    the destination file's size. That rounding down operation is done at
> > >    generic_remap_check_len();
> > >
> > > 2) Because of that, the btrfs specific code does not expect anymore any
> > >    non-aligned range length's for deduplication and therefore does not
> > >    work if such nona-aligned length is given.
> > >
> > > This patch addresses that second part, and it depends on a patch that
> > > fixes generic_remap_check_len(), in the VFS, which was submitted ealier
> > > and has the following subject:
> > >
> > >   "fs: allow deduplication of eof block into the end of the destination file"
> > >
> > > These two patches address reports from users that started seeing lower
> > > deduplication rates due to the last block never being deduplicated when
> > > the file size is not aligned to the filesystem's block size.
> > >
> > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > Should these patches be marked for stable (5.0+, but see below for
> > caveats about 5.0)?  The bug affects 5.3 and 5.4 which are still active,
> > and dedupe is an important feature for some users.
> 
> Usually I only mark things for stable that are critical: corruptions,
> crashes and memory leaks for example.
> I don't think this is a critical issue, since none of those things
> happen. It's certainly inconvenient to not have
> an extent fully deduplicated, but it's just that.

In btrfs the reference counting is done by extent and extents are
immutable, so extents are either fully deduplicated, or not deduplicated
at all.  We have to dedupe every part of an extent, and if we fail to
do so, no data space is saved while metadata usage increases for the
new partial extent reference.

This bug means the dedupe feature is not usable _at all_ for single-extent
files with non-aligned EOF, and that is a significant problem for users
that rely on dedupe to manage space usage on btrfs (e.g. for build
servers where there are millions of duplicate odd-sized small files, and
the space savings from working dedupe can be 90% or more).  Doubling or
tripling space usage for the same data is beyond inconvenience.

It is possible to work around the bug in userspace and recover the space
with clone, but there is no way to do it safely on live data without a
working dedupe-range ioctl.

> If a maintainer wants to add it for stable, I'm fine with it.

At this point it would only affect 5.4--all the other short-term kernels
are closed, and none of the LTS kernels need the patch--but it would be
nice if 5.4 had working dedupe.

> >
> > > ---
> > >  fs/btrfs/ioctl.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > index 3418decb9e61..c41c276ff272 100644
> > > --- a/fs/btrfs/ioctl.c
> > > +++ b/fs/btrfs/ioctl.c
> > > @@ -3237,6 +3237,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
> > >  static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> > >                                  struct inode *dst, u64 dst_loff)
> > >  {
> > > +     const u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> > >       int ret;
> > >
> > >       /*
> > > @@ -3244,7 +3245,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> > >        * source range to serialize with relocation.
> > >        */
> > >       btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
> > > -     ret = btrfs_clone(src, dst, loff, len, len, dst_loff, 1);
> > > +     ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);
> >
> > A heads-up for anyone backporting this to 5.0:  this patch depends on
> >
> >         57a50e2506df Btrfs: remove no longer needed range length checks for deduplication
> 
> For any kernel without that cleanup patch, backporting the first patch
> in the series (the one touching only fs/read_write.c) is enough.
> For any kernel with that cleanup patch, then both patches in the
> series are needed.
> 
> Thanks.
> 
> >
> > Simply resolving the git conflict without including 57a50e2506df produces
> > a kernel where dedupe rounds the size of the dst file up to the next
> > block boundary.  This is because 57a50e2506df changes the value of
> > "len".  Before 57a50e2506df, "len" is equivalent to "ALIGN(len, bs)"
> > at the btrfs_clone line; after 57a50e2506df, "len" is the unaligned
> > dedupe request length passed to the btrfs_extent_same_range function.
> > This changes the semantics of the btrfs_clone line significantly.
> >
> > 57a50e2506df is in 5.1, so 5.1+ kernels do not require any additional
> > patches.
> >
> > 4.20 and earlier don't have the bug, so don't need a fix.
> >
> > >       btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
> > >
> > >       return ret;
> > > --
> > > 2.11.0
> > >
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-07 17:57     ` Darrick J. Wong
@ 2020-01-08 11:36       ` Filipe Manana
  2020-01-08 16:15         ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-08 11:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> > >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > We always round down, to a multiple of the filesystem's block size, the
> > > length to deduplicate at generic_remap_check_len().  However this is only
> > > needed if an attempt to deduplicate the last block into the middle of the
> > > destination file is requested, since that leads into a corruption if the
> > > length of the source file is not block size aligned.  When an attempt to
> > > deduplicate the last block into the end of the destination file is
> > > requested, we should allow it because it is safe to do it - there's no
> > > stale data exposure and we are prepared to compare the data ranges for
> > > a length not aligned to the block (or page) size - in fact we even do
> > > the data compare before adjusting the deduplication length.
> > >
> > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > and deduplication")) we started to have user reports of deduplication
> > > not reflinking the last block anymore, and whence users getting lower
> > > deduplication scores.  The main use case is deduplication of entire
> > > files that have a size not aligned to the block size of the filesystem.
> > >
> > > We already allow cloning the last block to the end (and beyond) of the
> > > destination file, so allow for deduplication as well.
> > >
> > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> >
> > Darrick, Al, any feedback?
>
> Is there a fstest to check for correct operation of dedupe at or beyond
> source and destfile EOF?  Particularly if one range is /not/ at EOF?

Such as what generic/158 does already?


> And that an mmap read of the EOF block will see zeroes past EOF before
> and after the dedupe operation?

Can you elaborate a bit more? Why an mmap read and not a buffered or a
direct IO read before and after deduplication?
Is there anything special for the mmap reads on xfs, is that your
concern? Or is the idea to deduplicate while the file is mmap'ed?

>
> If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> should that work?

You haven't mentioned the size of the second file, nor if the first
file has a size of 16K which I assume (instead of fallocate with the
keep size flag).

Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
the first file into the range 60k - 60k + 5000 of the second file, and
that the second file has a size of 60k + 5000.
If so, that fails with -EINVAL because the source range is not block
size aligned, and we already have generic fstests that test attempt to
duplication and clone non-aligned ranges that don't end at eof.
This patch doesn't change that behaviour, it only aims to allow
deduplication of the eof block of the source file into the eof of the
destination file.


>
> I'm convinced that we could support dedupe to EOF when the ranges of the
> two files both end at the respective file's EOF, but it's the weirder
> corner cases that I worry about...

Well, we used to do that in btrfs before migrating to the generic code.
Since I discovered the corruption due to deduplication of the eof
block into the middle of a file in 2018's summer, the btrfs fix
allowed deduplication of the eof block only if the destination end
offset matched the eof of the destination file:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0

Since then no issues were found nor users reported any problems so far.

Any other specific test you would like to see?

Thanks.

>
> --D
>
> > Thanks.
> >
> > > ---
> > >  fs/read_write.c | 10 ++++------
> > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > >   * else.  Assume that the offsets have already been checked for block
> > >   * alignment.
> > >   *
> > > - * For deduplication we always scale down to the previous block because we
> > > - * can't meaningfully compare post-EOF contents.
> > > - *
> > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > + * For clone we only link a partial EOF block above or at the destination file's
> > > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > > + * destination file's EOF (can not link it into the middle of a file).
> > >   *
> > >   * Shorten the request if possible.
> > >   */
> > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > >         if ((*len & blkmask) == 0)
> > >                 return 0;
> > >
> > > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > -           pos_out + *len < i_size_read(inode_out))
> > > +       if (pos_out + *len < i_size_read(inode_out))
> > >                 new_len &= ~blkmask;
> > >
> > >         if (new_len == *len)
> > > --
> > > 2.11.0
> > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2020-01-07 18:16       ` Zygo Blaxell
@ 2020-01-08 11:42         ` Filipe Manana
  2020-01-08 14:53           ` David Sterba
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-08 11:42 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: linux-fsdevel, linux-btrfs, xfs, Darrick J. Wong, Filipe Manana

On Tue, Jan 7, 2020 at 6:16 PM Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
>
> On Tue, Jan 07, 2020 at 04:18:42PM +0000, Filipe Manana wrote:
> > On Sun, Dec 29, 2019 at 5:22 AM Zygo Blaxell
> > <ce3g8jdj@umail.furryterror.org> wrote:
> > >
> > > On Mon, Dec 16, 2019 at 06:26:56PM +0000, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Since btrfs was migrated to use the generic VFS helpers for clone and
> > > > deduplication, it stopped allowing for the last block of a file to be
> > > > deduplicated when the source file size is not sector size aligned (when
> > > > eof is somewhere in the middle of the last block). There are two reasons
> > > > for that:
> > > >
> > > > 1) The generic code always rounds down, to a multiple of the block size,
> > > >    the range's length for deduplications. This means we end up never
> > > >    deduplicating the last block when the eof is not block size aligned,
> > > >    even for the safe case where the destination range's end offset matches
> > > >    the destination file's size. That rounding down operation is done at
> > > >    generic_remap_check_len();
> > > >
> > > > 2) Because of that, the btrfs specific code does not expect anymore any
> > > >    non-aligned range length's for deduplication and therefore does not
> > > >    work if such nona-aligned length is given.
> > > >
> > > > This patch addresses that second part, and it depends on a patch that
> > > > fixes generic_remap_check_len(), in the VFS, which was submitted ealier
> > > > and has the following subject:
> > > >
> > > >   "fs: allow deduplication of eof block into the end of the destination file"
> > > >
> > > > These two patches address reports from users that started seeing lower
> > > > deduplication rates due to the last block never being deduplicated when
> > > > the file size is not aligned to the filesystem's block size.
> > > >
> > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >
> > > Should these patches be marked for stable (5.0+, but see below for
> > > caveats about 5.0)?  The bug affects 5.3 and 5.4 which are still active,
> > > and dedupe is an important feature for some users.
> >
> > Usually I only mark things for stable that are critical: corruptions,
> > crashes and memory leaks for example.
> > I don't think this is a critical issue, since none of those things
> > happen. It's certainly inconvenient to not have
> > an extent fully deduplicated, but it's just that.
>
> In btrfs the reference counting is done by extent and extents are
> immutable, so extents are either fully deduplicated, or not deduplicated
> at all.  We have to dedupe every part of an extent, and if we fail to
> do so, no data space is saved while metadata usage increases for the
> new partial extent reference.

Yes, I know. That was explained in the cover letter, why allowing
deduplication of the eof block is more important for btrfs than it is
for xfs for example.

>
> This bug means the dedupe feature is not usable _at all_ for single-extent
> files with non-aligned EOF, and that is a significant problem for users
> that rely on dedupe to manage space usage on btrfs (e.g. for build
> servers where there are millions of duplicate odd-sized small files, and
> the space savings from working dedupe can be 90% or more).  Doubling or
> tripling space usage for the same data is beyond inconvenience.

Sure, I understand that, I know how btrfs manages extents and I'm well
familiar with its cloning/deduplication implementation.

Still, it's not something I consider critical enough to get to stable,
as there's no corruption, data loss or a crash.
That doesn't mean the patches aren't going to stable branches, that
depends on the maintainers of each subsystem (vfs, btrfs).

Thanks.

>
> It is possible to work around the bug in userspace and recover the space
> with clone, but there is no way to do it safely on live data without a
> working dedupe-range ioctl.
>
> > If a maintainer wants to add it for stable, I'm fine with it.
>
> At this point it would only affect 5.4--all the other short-term kernels
> are closed, and none of the LTS kernels need the patch--but it would be
> nice if 5.4 had working dedupe.
>
> > >
> > > > ---
> > > >  fs/btrfs/ioctl.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > > > index 3418decb9e61..c41c276ff272 100644
> > > > --- a/fs/btrfs/ioctl.c
> > > > +++ b/fs/btrfs/ioctl.c
> > > > @@ -3237,6 +3237,7 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
> > > >  static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> > > >                                  struct inode *dst, u64 dst_loff)
> > > >  {
> > > > +     const u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> > > >       int ret;
> > > >
> > > >       /*
> > > > @@ -3244,7 +3245,7 @@ static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
> > > >        * source range to serialize with relocation.
> > > >        */
> > > >       btrfs_double_extent_lock(src, loff, dst, dst_loff, len);
> > > > -     ret = btrfs_clone(src, dst, loff, len, len, dst_loff, 1);
> > > > +     ret = btrfs_clone(src, dst, loff, len, ALIGN(len, bs), dst_loff, 1);
> > >
> > > A heads-up for anyone backporting this to 5.0:  this patch depends on
> > >
> > >         57a50e2506df Btrfs: remove no longer needed range length checks for deduplication
> >
> > For any kernel without that cleanup patch, backporting the first patch
> > in the series (the one touching only fs/read_write.c) is enough.
> > For any kernel with that cleanup patch, then both patches in the
> > series are needed.
> >
> > Thanks.
> >
> > >
> > > Simply resolving the git conflict without including 57a50e2506df produces
> > > a kernel where dedupe rounds the size of the dst file up to the next
> > > block boundary.  This is because 57a50e2506df changes the value of
> > > "len".  Before 57a50e2506df, "len" is equivalent to "ALIGN(len, bs)"
> > > at the btrfs_clone line; after 57a50e2506df, "len" is the unaligned
> > > dedupe request length passed to the btrfs_extent_same_range function.
> > > This changes the semantics of the btrfs_clone line significantly.
> > >
> > > 57a50e2506df is in 5.1, so 5.1+ kernels do not require any additional
> > > patches.
> > >
> > > 4.20 and earlier don't have the bug, so don't need a fix.
> > >
> > > >       btrfs_double_extent_unlock(src, loff, dst, dst_loff, len);
> > > >
> > > >       return ret;
> > > > --
> > > > 2.11.0
> > > >
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] Btrfs: make deduplication with range including the last block work
  2020-01-08 11:42         ` Filipe Manana
@ 2020-01-08 14:53           ` David Sterba
  0 siblings, 0 replies; 17+ messages in thread
From: David Sterba @ 2020-01-08 14:53 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Zygo Blaxell, linux-fsdevel, linux-btrfs, xfs, Darrick J. Wong,
	Filipe Manana

On Wed, Jan 08, 2020 at 11:42:05AM +0000, Filipe Manana wrote:
> > > > Should these patches be marked for stable (5.0+, but see below for
> > > > caveats about 5.0)?  The bug affects 5.3 and 5.4 which are still active,
> > > > and dedupe is an important feature for some users.
> > >
> > > Usually I only mark things for stable that are critical: corruptions,
> > > crashes and memory leaks for example.
> > > I don't think this is a critical issue, since none of those things
> > > happen. It's certainly inconvenient to not have
> > > an extent fully deduplicated, but it's just that.
> >
> > In btrfs the reference counting is done by extent and extents are
> > immutable, so extents are either fully deduplicated, or not deduplicated
> > at all.  We have to dedupe every part of an extent, and if we fail to
> > do so, no data space is saved while metadata usage increases for the
> > new partial extent reference.
> 
> Yes, I know. That was explained in the cover letter, why allowing
> deduplication of the eof block is more important for btrfs than it is
> for xfs for example.
> 
> >
> > This bug means the dedupe feature is not usable _at all_ for single-extent
> > files with non-aligned EOF, and that is a significant problem for users
> > that rely on dedupe to manage space usage on btrfs (e.g. for build
> > servers where there are millions of duplicate odd-sized small files, and
> > the space savings from working dedupe can be 90% or more).  Doubling or
> > tripling space usage for the same data is beyond inconvenience.
> 
> Sure, I understand that, I know how btrfs manages extents and I'm well
> familiar with its cloning/deduplication implementation.
> 
> Still, it's not something I consider critical enough to get to stable,
> as there's no corruption, data loss or a crash.
> That doesn't mean the patches aren't going to stable branches, that
> depends on the maintainers of each subsystem (vfs, btrfs).

To me this looks like a usability bug and regression so I'm all for
adding it to stable. Less serious fixes than corruption, data loss and
crash land in stable kernels anyway, so if this fixes behaviour and
usecases then it qualifies.

I evaluate each patch for stable inclusion so the CC: stable is not
required to be in the patch itself when posted, and late requests
for inclusion to stable have been working well so we have the process in
place.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-08 11:36       ` Filipe Manana
@ 2020-01-08 16:15         ` Darrick J. Wong
  2020-01-09 19:00           ` Filipe Manana
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-08 16:15 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote:
> On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> > > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > We always round down, to a multiple of the filesystem's block size, the
> > > > length to deduplicate at generic_remap_check_len().  However this is only
> > > > needed if an attempt to deduplicate the last block into the middle of the
> > > > destination file is requested, since that leads into a corruption if the
> > > > length of the source file is not block size aligned.  When an attempt to
> > > > deduplicate the last block into the end of the destination file is
> > > > requested, we should allow it because it is safe to do it - there's no
> > > > stale data exposure and we are prepared to compare the data ranges for
> > > > a length not aligned to the block (or page) size - in fact we even do
> > > > the data compare before adjusting the deduplication length.
> > > >
> > > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > > and deduplication")) we started to have user reports of deduplication
> > > > not reflinking the last block anymore, and whence users getting lower
> > > > deduplication scores.  The main use case is deduplication of entire
> > > > files that have a size not aligned to the block size of the filesystem.
> > > >
> > > > We already allow cloning the last block to the end (and beyond) of the
> > > > destination file, so allow for deduplication as well.
> > > >
> > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > >
> > > Darrick, Al, any feedback?
> >
> > Is there a fstest to check for correct operation of dedupe at or beyond
> > source and destfile EOF?  Particularly if one range is /not/ at EOF?
> 
> Such as what generic/158 does already?

Urk, heh. :)

> > And that an mmap read of the EOF block will see zeroes past EOF before
> > and after the dedupe operation?
> 
> Can you elaborate a bit more? Why an mmap read and not a buffered or a
> direct IO read before and after deduplication?
> Is there anything special for the mmap reads on xfs, is that your
> concern? Or is the idea to deduplicate while the file is mmap'ed?

I cite mmap reads past EOF specifically because unlike buffered/direct
reads where the VFS will stop reading exactly at EOF, a memory mapping
maps in an entire memory page, and the fs is supposed to ensure that the
bytes past EOF are zeroed.

Hm now that I look at g/158 it doesn't actually verify mmap reads.  I
looked around and can't really see anything that checks mmap reads
before and after a dedupe operation at EOF.

> > If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> > then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> > should that work?
> 
> You haven't mentioned the size of the second file, nor if the first
> file has a size of 16K which I assume (instead of fallocate with the
> keep size flag).

Er, sorry, yes.  The first file is 16,384 bytes long; the second file is
66,440 bytes.

> Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
> the first file into the range 60k - 60k + 5000 of the second file, and
> that the second file has a size of 60k + 5000.

Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from
the first file into the second file (offset: 61440, length: 8192).  The
source range is entirely below EOF, and the dest range ends right at
EOF in the second file.

> If so, that fails with -EINVAL because the source range is not block
> size aligned, and we already have generic fstests that test attempt to
> duplication and clone non-aligned ranges that don't end at eof.
> This patch doesn't change that behaviour, it only aims to allow
> deduplication of the eof block of the source file into the eof of the
> destination file.
> 
> 
> >
> > I'm convinced that we could support dedupe to EOF when the ranges of the
> > two files both end at the respective file's EOF, but it's the weirder
> > corner cases that I worry about...
> 
> Well, we used to do that in btrfs before migrating to the generic code.
> Since I discovered the corruption due to deduplication of the eof
> block into the middle of a file in 2018's summer, the btrfs fix
> allowed deduplication of the eof block only if the destination end
> offset matched the eof of the destination file:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
> 
> Since then no issues were found nor users reported any problems so far.

<nod> I'm ok with that one scenario, it's the "one range ends at eof,
the other doesn't" case that I'm picking on. :)

(Another way to shut me up would be to run generic/52[12] with
TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding
out. :))

> Any other specific test you would like to see?

No, just that.  And mmap reads. :)

--D

> Thanks.
> 
> >
> > --D
> >
> > > Thanks.
> > >
> > > > ---
> > > >  fs/read_write.c | 10 ++++------
> > > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > > >   * else.  Assume that the offsets have already been checked for block
> > > >   * alignment.
> > > >   *
> > > > - * For deduplication we always scale down to the previous block because we
> > > > - * can't meaningfully compare post-EOF contents.
> > > > - *
> > > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > > + * For clone we only link a partial EOF block above or at the destination file's
> > > > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > > > + * destination file's EOF (can not link it into the middle of a file).
> > > >   *
> > > >   * Shorten the request if possible.
> > > >   */
> > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > > >         if ((*len & blkmask) == 0)
> > > >                 return 0;
> > > >
> > > > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > > -           pos_out + *len < i_size_read(inode_out))
> > > > +       if (pos_out + *len < i_size_read(inode_out))
> > > >                 new_len &= ~blkmask;
> > > >
> > > >         if (new_len == *len)
> > > > --
> > > > 2.11.0
> > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-08 16:15         ` Darrick J. Wong
@ 2020-01-09 19:00           ` Filipe Manana
  2020-01-09 19:12             ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Filipe Manana @ 2020-01-09 19:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote:
> > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> > > > >
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > We always round down, to a multiple of the filesystem's block size, the
> > > > > length to deduplicate at generic_remap_check_len().  However this is only
> > > > > needed if an attempt to deduplicate the last block into the middle of the
> > > > > destination file is requested, since that leads into a corruption if the
> > > > > length of the source file is not block size aligned.  When an attempt to
> > > > > deduplicate the last block into the end of the destination file is
> > > > > requested, we should allow it because it is safe to do it - there's no
> > > > > stale data exposure and we are prepared to compare the data ranges for
> > > > > a length not aligned to the block (or page) size - in fact we even do
> > > > > the data compare before adjusting the deduplication length.
> > > > >
> > > > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > > > and deduplication")) we started to have user reports of deduplication
> > > > > not reflinking the last block anymore, and whence users getting lower
> > > > > deduplication scores.  The main use case is deduplication of entire
> > > > > files that have a size not aligned to the block size of the filesystem.
> > > > >
> > > > > We already allow cloning the last block to the end (and beyond) of the
> > > > > destination file, so allow for deduplication as well.
> > > > >
> > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Darrick, Al, any feedback?
> > >
> > > Is there a fstest to check for correct operation of dedupe at or beyond
> > > source and destfile EOF?  Particularly if one range is /not/ at EOF?
> >
> > Such as what generic/158 does already?
>
> Urk, heh. :)
>
> > > And that an mmap read of the EOF block will see zeroes past EOF before
> > > and after the dedupe operation?
> >
> > Can you elaborate a bit more? Why an mmap read and not a buffered or a
> > direct IO read before and after deduplication?
> > Is there anything special for the mmap reads on xfs, is that your
> > concern? Or is the idea to deduplicate while the file is mmap'ed?
>
> I cite mmap reads past EOF specifically because unlike buffered/direct
> reads where the VFS will stop reading exactly at EOF, a memory mapping
> maps in an entire memory page, and the fs is supposed to ensure that the
> bytes past EOF are zeroed.

Ok, on btrfs we zero out the rest of the page both in for all reads.

So, it's not problem:

$ file_size=$((64 * 1024 + 500))
$ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo
$ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar
$ sync

$ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0
$file_size" -c "mread -v 0 68K" foo

Both mreads returns exactly the same, eof is still the same and all
bytes after it have a value of 0.

This is the same as it is for cloning.

>
> Hm now that I look at g/158 it doesn't actually verify mmap reads.  I
> looked around and can't really see anything that checks mmap reads
> before and after a dedupe operation at EOF.
>
> > > If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> > > should that work?
> >
> > You haven't mentioned the size of the second file, nor if the first
> > file has a size of 16K which I assume (instead of fallocate with the
> > keep size flag).
>
> Er, sorry, yes.  The first file is 16,384 bytes long; the second file is
> 66,440 bytes.
>
> > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
> > the first file into the range 60k - 60k + 5000 of the second file, and
> > that the second file has a size of 60k + 5000.
>
> Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from
> the first file into the second file (offset: 61440, length: 8192).  The
> source range is entirely below EOF, and the dest range ends right at
> EOF in the second file.

The dedupe operations fails with -EINVAL, both before and after this patch,
since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL.
This happens at generic_remap_checks(), which called before
generic_remap_check_len().

This is no different from the other existing tests in fstests, which
cover this case already.

>
> > If so, that fails with -EINVAL because the source range is not block
> > size aligned, and we already have generic fstests that test attempt to
> > duplication and clone non-aligned ranges that don't end at eof.
> > This patch doesn't change that behaviour, it only aims to allow
> > deduplication of the eof block of the source file into the eof of the
> > destination file.
> >
> >
> > >
> > > I'm convinced that we could support dedupe to EOF when the ranges of the
> > > two files both end at the respective file's EOF, but it's the weirder
> > > corner cases that I worry about...
> >
> > Well, we used to do that in btrfs before migrating to the generic code.
> > Since I discovered the corruption due to deduplication of the eof
> > block into the middle of a file in 2018's summer, the btrfs fix
> > allowed deduplication of the eof block only if the destination end
> > offset matched the eof of the destination file:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
> >
> > Since then no issues were found nor users reported any problems so far.
>
> <nod> I'm ok with that one scenario, it's the "one range ends at eof,
> the other doesn't" case that I'm picking on. :)
>
> (Another way to shut me up would be to run generic/52[12] with
> TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding
> out. :))

Well, it will take nearly 2 weeks to get those tests to complete with
1 billion ops, taking into
account each takes around 500 seconds to run here (on xfs) with 1
million operations...

>
> > Any other specific test you would like to see?
>
> No, just that.  And mmap reads. :)

Given that we allow cloning of eof into eof already (and that doesn't
cause any known problems),
I don't see why you are so concerned with allowing dedupe to behave the same.

The only thing I can see it could be a problem would be comparing the
contents of the last page beyond the eof position,
but as stated on the changelog, it's not a problem since we call
vfs_dedupe_file_range_compare() before we
round down the length at generic_remap_check_len() - the data compare
function already has the correct behaviour.

Thanks.

>
> --D
>
> > Thanks.
> >
> > >
> > > --D
> > >
> > > > Thanks.
> > > >
> > > > > ---
> > > > >  fs/read_write.c | 10 ++++------
> > > > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > > > --- a/fs/read_write.c
> > > > > +++ b/fs/read_write.c
> > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > > > >   * else.  Assume that the offsets have already been checked for block
> > > > >   * alignment.
> > > > >   *
> > > > > - * For deduplication we always scale down to the previous block because we
> > > > > - * can't meaningfully compare post-EOF contents.
> > > > > - *
> > > > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > > > + * For clone we only link a partial EOF block above or at the destination file's
> > > > > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > > > > + * destination file's EOF (can not link it into the middle of a file).
> > > > >   *
> > > > >   * Shorten the request if possible.
> > > > >   */
> > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > > > >         if ((*len & blkmask) == 0)
> > > > >                 return 0;
> > > > >
> > > > > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > > > -           pos_out + *len < i_size_read(inode_out))
> > > > > +       if (pos_out + *len < i_size_read(inode_out))
> > > > >                 new_len &= ~blkmask;
> > > > >
> > > > >         if (new_len == *len)
> > > > > --
> > > > > 2.11.0
> > > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-09 19:00           ` Filipe Manana
@ 2020-01-09 19:12             ` Darrick J. Wong
  2020-01-14 14:36               ` Filipe Manana
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2020-01-09 19:12 UTC (permalink / raw)
  To: Filipe Manana
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Thu, Jan 09, 2020 at 07:00:09PM +0000, Filipe Manana wrote:
> On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> >
> > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote:
> > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> > > > > >
> > > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > > >
> > > > > > We always round down, to a multiple of the filesystem's block size, the
> > > > > > length to deduplicate at generic_remap_check_len().  However this is only
> > > > > > needed if an attempt to deduplicate the last block into the middle of the
> > > > > > destination file is requested, since that leads into a corruption if the
> > > > > > length of the source file is not block size aligned.  When an attempt to
> > > > > > deduplicate the last block into the end of the destination file is
> > > > > > requested, we should allow it because it is safe to do it - there's no
> > > > > > stale data exposure and we are prepared to compare the data ranges for
> > > > > > a length not aligned to the block (or page) size - in fact we even do
> > > > > > the data compare before adjusting the deduplication length.
> > > > > >
> > > > > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > > > > and deduplication")) we started to have user reports of deduplication
> > > > > > not reflinking the last block anymore, and whence users getting lower
> > > > > > deduplication scores.  The main use case is deduplication of entire
> > > > > > files that have a size not aligned to the block size of the filesystem.
> > > > > >
> > > > > > We already allow cloning the last block to the end (and beyond) of the
> > > > > > destination file, so allow for deduplication as well.
> > > > > >
> > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Darrick, Al, any feedback?
> > > >
> > > > Is there a fstest to check for correct operation of dedupe at or beyond
> > > > source and destfile EOF?  Particularly if one range is /not/ at EOF?
> > >
> > > Such as what generic/158 does already?
> >
> > Urk, heh. :)
> >
> > > > And that an mmap read of the EOF block will see zeroes past EOF before
> > > > and after the dedupe operation?
> > >
> > > Can you elaborate a bit more? Why an mmap read and not a buffered or a
> > > direct IO read before and after deduplication?
> > > Is there anything special for the mmap reads on xfs, is that your
> > > concern? Or is the idea to deduplicate while the file is mmap'ed?
> >
> > I cite mmap reads past EOF specifically because unlike buffered/direct
> > reads where the VFS will stop reading exactly at EOF, a memory mapping
> > maps in an entire memory page, and the fs is supposed to ensure that the
> > bytes past EOF are zeroed.
> 
> Ok, on btrfs we zero out the rest of the page both in for all reads.
> 
> So, it's not problem:
> 
> $ file_size=$((64 * 1024 + 500))
> $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo
> $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar
> $ sync
> 
> $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0
> $file_size" -c "mread -v 0 68K" foo
> 
> Both mreads returns exactly the same, eof is still the same and all
> bytes after it have a value of 0.
> 
> This is the same as it is for cloning.

Oh good. :)

> >
> > Hm now that I look at g/158 it doesn't actually verify mmap reads.  I
> > looked around and can't really see anything that checks mmap reads
> > before and after a dedupe operation at EOF.
> >
> > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> > > > should that work?
> > >
> > > You haven't mentioned the size of the second file, nor if the first
> > > file has a size of 16K which I assume (instead of fallocate with the
> > > keep size flag).
> >
> > Er, sorry, yes.  The first file is 16,384 bytes long; the second file is
> > 66,440 bytes.
> >
> > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
> > > the first file into the range 60k - 60k + 5000 of the second file, and
> > > that the second file has a size of 60k + 5000.
> >
> > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from
> > the first file into the second file (offset: 61440, length: 8192).  The
> > source range is entirely below EOF, and the dest range ends right at
> > EOF in the second file.
> 
> The dedupe operations fails with -EINVAL, both before and after this patch,
> since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL.
> This happens at generic_remap_checks(), which called before
> generic_remap_check_len().
> 
> This is no different from the other existing tests in fstests, which
> cover this case already.

Ah, ok.

> > > If so, that fails with -EINVAL because the source range is not block
> > > size aligned, and we already have generic fstests that test attempt to
> > > duplication and clone non-aligned ranges that don't end at eof.
> > > This patch doesn't change that behaviour, it only aims to allow
> > > deduplication of the eof block of the source file into the eof of the
> > > destination file.
> > >
> > >
> > > >
> > > > I'm convinced that we could support dedupe to EOF when the ranges of the
> > > > two files both end at the respective file's EOF, but it's the weirder
> > > > corner cases that I worry about...
> > >
> > > Well, we used to do that in btrfs before migrating to the generic code.
> > > Since I discovered the corruption due to deduplication of the eof
> > > block into the middle of a file in 2018's summer, the btrfs fix
> > > allowed deduplication of the eof block only if the destination end
> > > offset matched the eof of the destination file:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
> > >
> > > Since then no issues were found nor users reported any problems so far.
> >
> > <nod> I'm ok with that one scenario, it's the "one range ends at eof,
> > the other doesn't" case that I'm picking on. :)
> >
> > (Another way to shut me up would be to run generic/52[12] with
> > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding
> > out. :))
> 
> Well, it will take nearly 2 weeks to get those tests to complete with
> 1 billion ops, taking into
> account each takes around 500 seconds to run here (on xfs) with 1
> million operations...

I know.  I probably should have clarified: every month or two I build a
kernel from tip, throw it in a VM, and let it run for however long it
goes until it dies, fixing whatever problems as I find them.

*So far* no billionaire plutocrats have complained about the huge
compute bills that are probably racking up. ;)

(FWIW I also tossed your patch onto another one of these VMs and it
hasn't died yet, so that's probably an ok sign)

> >
> > > Any other specific test you would like to see?
> >
> > No, just that.  And mmap reads. :)
> 
> Given that we allow cloning of eof into eof already (and that doesn't
> cause any known problems),
> I don't see why you are so concerned with allowing dedupe to behave the same.
> 
> The only thing I can see it could be a problem would be comparing the
> contents of the last page beyond the eof position,
> but as stated on the changelog, it's not a problem since we call
> vfs_dedupe_file_range_compare() before we
> round down the length at generic_remap_check_len() - the data compare
> function already has the correct behaviour.

<nod>

--D

> Thanks.
> 
> >
> > --D
> >
> > > Thanks.
> > >
> > > >
> > > > --D
> > > >
> > > > > Thanks.
> > > > >
> > > > > > ---
> > > > > >  fs/read_write.c | 10 ++++------
> > > > > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > > > > --- a/fs/read_write.c
> > > > > > +++ b/fs/read_write.c
> > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > > > > >   * else.  Assume that the offsets have already been checked for block
> > > > > >   * alignment.
> > > > > >   *
> > > > > > - * For deduplication we always scale down to the previous block because we
> > > > > > - * can't meaningfully compare post-EOF contents.
> > > > > > - *
> > > > > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > > > > + * For clone we only link a partial EOF block above or at the destination file's
> > > > > > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > > > > > + * destination file's EOF (can not link it into the middle of a file).
> > > > > >   *
> > > > > >   * Shorten the request if possible.
> > > > > >   */
> > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > > > > >         if ((*len & blkmask) == 0)
> > > > > >                 return 0;
> > > > > >
> > > > > > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > > > > -           pos_out + *len < i_size_read(inode_out))
> > > > > > +       if (pos_out + *len < i_size_read(inode_out))
> > > > > >                 new_len &= ~blkmask;
> > > > > >
> > > > > >         if (new_len == *len)
> > > > > > --
> > > > > > 2.11.0
> > > > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file
  2020-01-09 19:12             ` Darrick J. Wong
@ 2020-01-14 14:36               ` Filipe Manana
  0 siblings, 0 replies; 17+ messages in thread
From: Filipe Manana @ 2020-01-14 14:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-fsdevel, linux-btrfs, xfs, Filipe Manana, Alexander Viro

On Thu, Jan 9, 2020 at 7:12 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Thu, Jan 09, 2020 at 07:00:09PM +0000, Filipe Manana wrote:
> > On Wed, Jan 8, 2020 at 4:15 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >
> > > On Wed, Jan 08, 2020 at 11:36:04AM +0000, Filipe Manana wrote:
> > > > On Tue, Jan 7, 2020 at 5:57 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > > > >
> > > > > On Tue, Jan 07, 2020 at 04:23:15PM +0000, Filipe Manana wrote:
> > > > > > On Mon, Dec 16, 2019 at 6:28 PM <fdmanana@kernel.org> wrote:
> > > > > > >
> > > > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > > > >
> > > > > > > We always round down, to a multiple of the filesystem's block size, the
> > > > > > > length to deduplicate at generic_remap_check_len().  However this is only
> > > > > > > needed if an attempt to deduplicate the last block into the middle of the
> > > > > > > destination file is requested, since that leads into a corruption if the
> > > > > > > length of the source file is not block size aligned.  When an attempt to
> > > > > > > deduplicate the last block into the end of the destination file is
> > > > > > > requested, we should allow it because it is safe to do it - there's no
> > > > > > > stale data exposure and we are prepared to compare the data ranges for
> > > > > > > a length not aligned to the block (or page) size - in fact we even do
> > > > > > > the data compare before adjusting the deduplication length.
> > > > > > >
> > > > > > > After btrfs was updated to use the generic helpers from VFS (by commit
> > > > > > > 34a28e3d77535e ("Btrfs: use generic_remap_file_range_prep() for cloning
> > > > > > > and deduplication")) we started to have user reports of deduplication
> > > > > > > not reflinking the last block anymore, and whence users getting lower
> > > > > > > deduplication scores.  The main use case is deduplication of entire
> > > > > > > files that have a size not aligned to the block size of the filesystem.
> > > > > > >
> > > > > > > We already allow cloning the last block to the end (and beyond) of the
> > > > > > > destination file, so allow for deduplication as well.
> > > > > > >
> > > > > > > Link: https://lore.kernel.org/linux-btrfs/2019-1576167349.500456@svIo.N5dq.dFFD/
> > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > > >
> > > > > > Darrick, Al, any feedback?
> > > > >
> > > > > Is there a fstest to check for correct operation of dedupe at or beyond
> > > > > source and destfile EOF?  Particularly if one range is /not/ at EOF?
> > > >
> > > > Such as what generic/158 does already?
> > >
> > > Urk, heh. :)
> > >
> > > > > And that an mmap read of the EOF block will see zeroes past EOF before
> > > > > and after the dedupe operation?
> > > >
> > > > Can you elaborate a bit more? Why an mmap read and not a buffered or a
> > > > direct IO read before and after deduplication?
> > > > Is there anything special for the mmap reads on xfs, is that your
> > > > concern? Or is the idea to deduplicate while the file is mmap'ed?
> > >
> > > I cite mmap reads past EOF specifically because unlike buffered/direct
> > > reads where the VFS will stop reading exactly at EOF, a memory mapping
> > > maps in an entire memory page, and the fs is supposed to ensure that the
> > > bytes past EOF are zeroed.
> >
> > Ok, on btrfs we zero out the rest of the page both in for all reads.
> >
> > So, it's not problem:
> >
> > $ file_size=$((64 * 1024 + 500))
> > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" foo
> > $ xfs_io -f -c "pwrite -S 0xba 0 $file_size" bar
> > $ sync
> >
> > $ xfs_io -c "mmap 0 68K" -c "mread -v 0 68K" -c "dedupe bar 0 0
> > $file_size" -c "mread -v 0 68K" foo
> >
> > Both mreads returns exactly the same, eof is still the same and all
> > bytes after it have a value of 0.
> >
> > This is the same as it is for cloning.
>
> Oh good. :)
>
> > >
> > > Hm now that I look at g/158 it doesn't actually verify mmap reads.  I
> > > looked around and can't really see anything that checks mmap reads
> > > before and after a dedupe operation at EOF.
> > >
> > > > > If I fallocate a 16k file, write 'X' into the first 5000 bytes,
> > > > > write 'X' into the first 66,440 bytes (60k + 5000) of a second file, and
> > > > > then try to dedupe (first file, 0-8k) with (second file, 60k-68k),
> > > > > should that work?
> > > >
> > > > You haven't mentioned the size of the second file, nor if the first
> > > > file has a size of 16K which I assume (instead of fallocate with the
> > > > keep size flag).
> > >
> > > Er, sorry, yes.  The first file is 16,384 bytes long; the second file is
> > > 66,440 bytes.
> > >
> > > > Anyway, I assume you actually meant to dedupe the range 0 - 5000 from
> > > > the first file into the range 60k - 60k + 5000 of the second file, and
> > > > that the second file has a size of 60k + 5000.
> > >
> > > Nope, I meant to say to dedupe the range (offset: 0, length: 8192) from
> > > the first file into the second file (offset: 61440, length: 8192).  The
> > > source range is entirely below EOF, and the dest range ends right at
> > > EOF in the second file.
> >
> > The dedupe operations fails with -EINVAL, both before and after this patch,
> > since the whenever one of ranges crosses eof, we must fail dedupe with -EINVAL.
> > This happens at generic_remap_checks(), which called before
> > generic_remap_check_len().
> >
> > This is no different from the other existing tests in fstests, which
> > cover this case already.
>
> Ah, ok.
>
> > > > If so, that fails with -EINVAL because the source range is not block
> > > > size aligned, and we already have generic fstests that test attempt to
> > > > duplication and clone non-aligned ranges that don't end at eof.
> > > > This patch doesn't change that behaviour, it only aims to allow
> > > > deduplication of the eof block of the source file into the eof of the
> > > > destination file.
> > > >
> > > >
> > > > >
> > > > > I'm convinced that we could support dedupe to EOF when the ranges of the
> > > > > two files both end at the respective file's EOF, but it's the weirder
> > > > > corner cases that I worry about...
> > > >
> > > > Well, we used to do that in btrfs before migrating to the generic code.
> > > > Since I discovered the corruption due to deduplication of the eof
> > > > block into the middle of a file in 2018's summer, the btrfs fix
> > > > allowed deduplication of the eof block only if the destination end
> > > > offset matched the eof of the destination file:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de02b9f6bb65a6a1848f346f7a3617b7a9b930c0
> > > >
> > > > Since then no issues were found nor users reported any problems so far.
> > >
> > > <nod> I'm ok with that one scenario, it's the "one range ends at eof,
> > > the other doesn't" case that I'm picking on. :)
> > >
> > > (Another way to shut me up would be to run generic/52[12] with
> > > TIME_FACTOR=1000 (i.e. 1 billion fsx ops) and see what comes exploding
> > > out. :))
> >
> > Well, it will take nearly 2 weeks to get those tests to complete with
> > 1 billion ops, taking into
> > account each takes around 500 seconds to run here (on xfs) with 1
> > million operations...
>
> I know.  I probably should have clarified: every month or two I build a
> kernel from tip, throw it in a VM, and let it run for however long it
> goes until it dies, fixing whatever problems as I find them.
>
> *So far* no billionaire plutocrats have complained about the huge
> compute bills that are probably racking up. ;)
>
> (FWIW I also tossed your patch onto another one of these VMs and it
> hasn't died yet, so that's probably an ok sign)

For 100 million ops, on 5.5-rc5 kernel with this patchset and on xfs:

$ TIME_FACTOR=100 MKFS_OPTIONS="-m reflink=1,rmapbt=1" ./check
generic/521 generic/522
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 debian5 5.5.0-rc5-btrfs-next-67 #1 SMP
PREEMPT Thu Jan 9 13:48:07 WET 2020
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1 /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

generic/521 482s ...  48961s
generic/522 536s ...  54432s
Ran: generic/521 generic/522
Passed all 2 tests

Similar for btrfs, no problems found (yet at least).

Thanks.

>
> > >
> > > > Any other specific test you would like to see?
> > >
> > > No, just that.  And mmap reads. :)
> >
> > Given that we allow cloning of eof into eof already (and that doesn't
> > cause any known problems),
> > I don't see why you are so concerned with allowing dedupe to behave the same.
> >
> > The only thing I can see it could be a problem would be comparing the
> > contents of the last page beyond the eof position,
> > but as stated on the changelog, it's not a problem since we call
> > vfs_dedupe_file_range_compare() before we
> > round down the length at generic_remap_check_len() - the data compare
> > function already has the correct behaviour.
>
> <nod>
>
> --D
>
> > Thanks.
> >
> > >
> > > --D
> > >
> > > > Thanks.
> > > >
> > > > >
> > > > > --D
> > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > > ---
> > > > > > >  fs/read_write.c | 10 ++++------
> > > > > > >  1 file changed, 4 insertions(+), 6 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > index 5bbf587f5bc1..7458fccc59e1 100644
> > > > > > > --- a/fs/read_write.c
> > > > > > > +++ b/fs/read_write.c
> > > > > > > @@ -1777,10 +1777,9 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
> > > > > > >   * else.  Assume that the offsets have already been checked for block
> > > > > > >   * alignment.
> > > > > > >   *
> > > > > > > - * For deduplication we always scale down to the previous block because we
> > > > > > > - * can't meaningfully compare post-EOF contents.
> > > > > > > - *
> > > > > > > - * For clone we only link a partial EOF block above the destination file's EOF.
> > > > > > > + * For clone we only link a partial EOF block above or at the destination file's
> > > > > > > + * EOF.  For deduplication we accept a partial EOF block only if it ends at the
> > > > > > > + * destination file's EOF (can not link it into the middle of a file).
> > > > > > >   *
> > > > > > >   * Shorten the request if possible.
> > > > > > >   */
> > > > > > > @@ -1796,8 +1795,7 @@ static int generic_remap_check_len(struct inode *inode_in,
> > > > > > >         if ((*len & blkmask) == 0)
> > > > > > >                 return 0;
> > > > > > >
> > > > > > > -       if ((remap_flags & REMAP_FILE_DEDUP) ||
> > > > > > > -           pos_out + *len < i_size_read(inode_out))
> > > > > > > +       if (pos_out + *len < i_size_read(inode_out))
> > > > > > >                 new_len &= ~blkmask;
> > > > > > >
> > > > > > >         if (new_len == *len)
> > > > > > > --
> > > > > > > 2.11.0
> > > > > > >

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 18:26 [PATCH 0/2] Allow deduplication of the eof block when it is safe to do so fdmanana
2019-12-16 18:26 ` [PATCH 1/2] fs: allow deduplication of eof block into the end of the destination file fdmanana
2019-12-17 15:52   ` Josef Bacik
2020-01-07 16:23   ` Filipe Manana
2020-01-07 17:57     ` Darrick J. Wong
2020-01-08 11:36       ` Filipe Manana
2020-01-08 16:15         ` Darrick J. Wong
2020-01-09 19:00           ` Filipe Manana
2020-01-09 19:12             ` Darrick J. Wong
2020-01-14 14:36               ` Filipe Manana
2019-12-16 18:26 ` [PATCH 2/2] Btrfs: make deduplication with range including the last block work fdmanana
2019-12-17 15:54   ` Josef Bacik
2019-12-29  5:22   ` Zygo Blaxell
2020-01-07 16:18     ` Filipe Manana
2020-01-07 18:16       ` Zygo Blaxell
2020-01-08 11:42         ` Filipe Manana
2020-01-08 14:53           ` David Sterba

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git