All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
@ 2016-11-18  0:07 Omar Sandoval
  2016-11-18  0:07 ` [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly Omar Sandoval
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Omar Sandoval @ 2016-11-18  0:07 UTC (permalink / raw)
  To: linux-btrfs, linux-xfs
  Cc: Darrick J. Wong, Qu Wenruo, Christoph Hellwig, kernel-team

From: Omar Sandoval <osandov@fb.com>

This is the follow-up to the discussions here [1] and here [2] that
makes Btrfs treat a src_length of 0 to dedupe as "until EOF". The
implementation is straightforward, but there are a few catches that
convinced me to post this as an RFC:

1. We still don't know for sure whether userspace cares about the
   original Btrfs behavior. Darrick and I both seem to think not, but
   hopefully someone will speak up if it matters to them.
2. When doing the implicit EOF, XFS returns 0 for bytes_deduped. I
   copied this in my implementation, but I'm guessing that's a bug
   rather than a feature.
3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
   implicit EOF gets around this in the existing XFS implementation. I
   copied this for the Btrfs implementation.

So now we have 3 options:

a) Apply these patches as-is.
b) Fix XFS to both return the actual bytes_deduped and cap the length
   for the EOF case. Do the same for Btrfs.
c) Make XFS consistent with the existing Btrfs behavior.

I'm starting to lean towards option c after writing this cover letter,
but I might as well send these out and get a second opinion.

Thanks!

1: http://marc.info/?l=linux-btrfs&m=146828374631829&w=2
2: http://marc.info/?l=linux-btrfs&m=147694962912532&w=2

Omar Sandoval (2):
  Btrfs: refactor btrfs_extent_same() slightly
  Btrfs: make a source length of 0 imply EOF for dedupe

 fs/btrfs/ioctl.c | 45 ++++++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

-- 
2.10.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly
  2016-11-18  0:07 [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
@ 2016-11-18  0:07 ` Omar Sandoval
  2016-11-18  3:22     ` Qu Wenruo
  2016-11-18  0:07 ` [RFC PATCH 2/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Omar Sandoval @ 2016-11-18  0:07 UTC (permalink / raw)
  To: linux-btrfs, linux-xfs
  Cc: Darrick J. Wong, Qu Wenruo, Christoph Hellwig, kernel-team

From: Omar Sandoval <osandov@fb.com>

This just makes the following patch cleaner.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ioctl.c | 33 ++++++++++++---------------------
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7acbd2c..e23f945 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3124,26 +3124,27 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 	int ret;
 	u64 len = olen;
 	struct cmp_pages cmp;
-	int same_inode = 0;
+	bool same_inode = (src == dst);
 	u64 same_lock_start = 0;
 	u64 same_lock_len = 0;
 
-	if (src == dst)
-		same_inode = 1;
-
 	if (len == 0)
 		return 0;
 
-	if (same_inode) {
+	if (same_inode)
 		inode_lock(src);
+	else
+		btrfs_double_inode_lock(src, dst);
 
-		ret = extent_same_check_offsets(src, loff, &len, olen);
-		if (ret)
-			goto out_unlock;
-		ret = extent_same_check_offsets(src, dst_loff, &len, olen);
-		if (ret)
-			goto out_unlock;
+	ret = extent_same_check_offsets(src, loff, &len, olen);
+	if (ret)
+		goto out_unlock;
 
+	ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
+	if (ret)
+		goto out_unlock;
+
+	if (same_inode) {
 		/*
 		 * Single inode case wants the same checks, except we
 		 * don't want our length pushed out past i_size as
@@ -3171,16 +3172,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 
 		same_lock_start = min_t(u64, loff, dst_loff);
 		same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start;
-	} else {
-		btrfs_double_inode_lock(src, dst);
-
-		ret = extent_same_check_offsets(src, loff, &len, olen);
-		if (ret)
-			goto out_unlock;
-
-		ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
-		if (ret)
-			goto out_unlock;
 	}
 
 	/* don't make the dst file partly checksummed */
-- 
2.10.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-18  0:07 [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
  2016-11-18  0:07 ` [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly Omar Sandoval
@ 2016-11-18  0:07 ` Omar Sandoval
  2016-11-18  5:38 ` [RFC PATCH 0/2] " Christoph Hellwig
  2016-11-23  2:02 ` Zygo Blaxell
  3 siblings, 0 replies; 18+ messages in thread
From: Omar Sandoval @ 2016-11-18  0:07 UTC (permalink / raw)
  To: linux-btrfs, linux-xfs
  Cc: Darrick J. Wong, Qu Wenruo, Christoph Hellwig, kernel-team

From: Omar Sandoval <osandov@fb.com>

This makes us consistent with both reflink and the XFS implementation of
dedupe.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 fs/btrfs/ioctl.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e23f945..320fb35 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3121,21 +3121,25 @@ static int extent_same_check_offsets(struct inode *inode, u64 off, u64 *plen,
 static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
 			     struct inode *dst, u64 dst_loff)
 {
-	int ret;
-	u64 len = olen;
+	int ret = 0;
+	u64 len;
 	struct cmp_pages cmp;
 	bool same_inode = (src == dst);
 	u64 same_lock_start = 0;
 	u64 same_lock_len = 0;
 
-	if (len == 0)
-		return 0;
-
 	if (same_inode)
 		inode_lock(src);
 	else
 		btrfs_double_inode_lock(src, dst);
 
+	if (olen == 0) {
+		olen = i_size_read(src) - loff;
+		if (olen == 0)
+			goto out_unlock;
+	}
+	len = olen;
+
 	ret = extent_same_check_offsets(src, loff, &len, olen);
 	if (ret)
 		goto out_unlock;
-- 
2.10.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly
  2016-11-18  0:07 ` [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly Omar Sandoval
@ 2016-11-18  3:22     ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2016-11-18  3:22 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-xfs
  Cc: Darrick J. Wong, Christoph Hellwig, kernel-team



At 11/18/2016 08:07 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> This just makes the following patch cleaner.
Looks good for me.

It is just a good refactoring. Making code short and easier to read.

And it doesn't affect the len = 0 behavior.
So no matter what the choice we do, it can be applied independently.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Thanks,
Qu
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/ioctl.c | 33 ++++++++++++---------------------
>  1 file changed, 12 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7acbd2c..e23f945 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3124,26 +3124,27 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>  	int ret;
>  	u64 len = olen;
>  	struct cmp_pages cmp;
> -	int same_inode = 0;
> +	bool same_inode = (src == dst);
>  	u64 same_lock_start = 0;
>  	u64 same_lock_len = 0;
>
> -	if (src == dst)
> -		same_inode = 1;
> -
>  	if (len == 0)
>  		return 0;
>
> -	if (same_inode) {
> +	if (same_inode)
>  		inode_lock(src);
> +	else
> +		btrfs_double_inode_lock(src, dst);
>
> -		ret = extent_same_check_offsets(src, loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> -		ret = extent_same_check_offsets(src, dst_loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> +	ret = extent_same_check_offsets(src, loff, &len, olen);
> +	if (ret)
> +		goto out_unlock;
>
> +	ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (same_inode) {
>  		/*
>  		 * Single inode case wants the same checks, except we
>  		 * don't want our length pushed out past i_size as
> @@ -3171,16 +3172,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>
>  		same_lock_start = min_t(u64, loff, dst_loff);
>  		same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start;
> -	} else {
> -		btrfs_double_inode_lock(src, dst);
> -
> -		ret = extent_same_check_offsets(src, loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> -
> -		ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
>  	}
>
>  	/* don't make the dst file partly checksummed */
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly
@ 2016-11-18  3:22     ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2016-11-18  3:22 UTC (permalink / raw)
  To: Omar Sandoval, linux-btrfs, linux-xfs
  Cc: Darrick J. Wong, Christoph Hellwig, kernel-team



At 11/18/2016 08:07 AM, Omar Sandoval wrote:
> From: Omar Sandoval <osandov@fb.com>
>
> This just makes the following patch cleaner.
Looks good for me.

It is just a good refactoring. Making code short and easier to read.

And it doesn't affect the len = 0 behavior.
So no matter what the choice we do, it can be applied independently.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Thanks,
Qu
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  fs/btrfs/ioctl.c | 33 ++++++++++++---------------------
>  1 file changed, 12 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 7acbd2c..e23f945 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3124,26 +3124,27 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>  	int ret;
>  	u64 len = olen;
>  	struct cmp_pages cmp;
> -	int same_inode = 0;
> +	bool same_inode = (src == dst);
>  	u64 same_lock_start = 0;
>  	u64 same_lock_len = 0;
>
> -	if (src == dst)
> -		same_inode = 1;
> -
>  	if (len == 0)
>  		return 0;
>
> -	if (same_inode) {
> +	if (same_inode)
>  		inode_lock(src);
> +	else
> +		btrfs_double_inode_lock(src, dst);
>
> -		ret = extent_same_check_offsets(src, loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> -		ret = extent_same_check_offsets(src, dst_loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> +	ret = extent_same_check_offsets(src, loff, &len, olen);
> +	if (ret)
> +		goto out_unlock;
>
> +	ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
> +	if (ret)
> +		goto out_unlock;
> +
> +	if (same_inode) {
>  		/*
>  		 * Single inode case wants the same checks, except we
>  		 * don't want our length pushed out past i_size as
> @@ -3171,16 +3172,6 @@ static int btrfs_extent_same(struct inode *src, u64 loff, u64 olen,
>
>  		same_lock_start = min_t(u64, loff, dst_loff);
>  		same_lock_len = max_t(u64, loff, dst_loff) + len - same_lock_start;
> -	} else {
> -		btrfs_double_inode_lock(src, dst);
> -
> -		ret = extent_same_check_offsets(src, loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
> -
> -		ret = extent_same_check_offsets(dst, dst_loff, &len, olen);
> -		if (ret)
> -			goto out_unlock;
>  	}
>
>  	/* don't make the dst file partly checksummed */
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-18  0:07 [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
  2016-11-18  0:07 ` [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly Omar Sandoval
  2016-11-18  0:07 ` [RFC PATCH 2/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
@ 2016-11-18  5:38 ` Christoph Hellwig
  2016-11-22 21:17   ` Darrick J. Wong
  2016-11-23  2:02 ` Zygo Blaxell
  3 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2016-11-18  5:38 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-btrfs, linux-xfs, Darrick J. Wong, Qu Wenruo,
	Christoph Hellwig, kernel-team

On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> So now we have 3 options:
> 
> a) Apply these patches as-is.
> b) Fix XFS to both return the actual bytes_deduped and cap the length
>    for the EOF case. Do the same for Btrfs.
> c) Make XFS consistent with the existing Btrfs behavior.
> 
> I'm starting to lean towards option c after writing this cover letter,
> but I might as well send these out and get a second opinion.

I agree - btrfs was there first, and there is no fatal flaw in the ABI,
so let's adjust XFS to it.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-18  5:38 ` [RFC PATCH 0/2] " Christoph Hellwig
@ 2016-11-22 21:17   ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2016-11-22 21:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo, kernel-team

On Thu, Nov 17, 2016 at 09:38:53PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > So now we have 3 options:
> > 
> > a) Apply these patches as-is.
> > b) Fix XFS to both return the actual bytes_deduped and cap the length
> >    for the EOF case. Do the same for Btrfs.
> > c) Make XFS consistent with the existing Btrfs behavior.
> > 
> > I'm starting to lean towards option c after writing this cover letter,
> > but I might as well send these out and get a second opinion.
> 
> I agree - btrfs was there first, and there is no fatal flaw in the ABI,
> so let's adjust XFS to it.

Ok.  I'll send in a fix for generic/182, a patch to fix XFS, and I'll
roll the same fix into the ocfs2 patches before the next submission.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-18  0:07 [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
                   ` (2 preceding siblings ...)
  2016-11-18  5:38 ` [RFC PATCH 0/2] " Christoph Hellwig
@ 2016-11-23  2:02 ` Zygo Blaxell
  2016-11-23  2:44   ` Darrick J. Wong
  2016-11-23  4:26   ` Dave Chinner
  3 siblings, 2 replies; 18+ messages in thread
From: Zygo Blaxell @ 2016-11-23  2:02 UTC (permalink / raw)
  To: Omar Sandoval
  Cc: linux-btrfs, linux-xfs, Darrick J. Wong, Qu Wenruo,
	Christoph Hellwig, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1933 bytes --]

On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
>    implicit EOF gets around this in the existing XFS implementation. I
>    copied this for the Btrfs implementation.

Somewhat tangential to this patch, but on the dedup topic:  Can we raise
or drop that 16MB limit?

The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
behavior for a 128MB extent is to generate 8x16MB shared extent references
with different extent offsets to a single 128MB physical extent.
These references no longer look like the original 128MB extent to a
userspace dedup tool.  That raises the difficulty level substantially
for a userspace dedup tool when it tries to figure out which extents to
keep and which to discard or rewrite.

XFS may not have this problem--I haven't checked.  On btrfs it's
definitely not as simple as "bang two inode/offset/length pairs together
with dedup and disk space will be freed automagically."  If dedup is
done incorrectly on btrfs, it can end up just making the filesystem slow
without freeing any space.

The 16MB limit doesn't seem to be useful in practice.  The two useful
effects of the limit seem to be DoS mitigation.  There is no checking of
the RAM usage that I can find (i.e. if you fire off 16 dedup threads,
they want 256MB of RAM; put another way, if you want to tie up 16GB of
kernel RAM, all you have to do is create 1024 dedup threads), so it's
not an effective DoS mitigation feature.  Internally dedup could verify
blocks in batches of 16MB and check for signals/release and reacquire
locks in between, so it wouldn't tie up the kernel or the two inodes
for excessively long periods.

Even if we want to keep the 16MB limit, there's also no way to query the
kernel from userspace to find out what the limit is, other than by trial
and error.  It's not even in a header file, userspace just has to *know*.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23  2:02 ` Zygo Blaxell
@ 2016-11-23  2:44   ` Darrick J. Wong
  2016-11-24  5:16     ` Zygo Blaxell
  2016-11-23  4:26   ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2016-11-23  2:44 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo,
	Christoph Hellwig, kernel-team

On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> >    implicit EOF gets around this in the existing XFS implementation. I
> >    copied this for the Btrfs implementation.
> 
> Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> or drop that 16MB limit?
> 
> The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> behavior for a 128MB extent is to generate 8x16MB shared extent references
> with different extent offsets to a single 128MB physical extent.
> These references no longer look like the original 128MB extent to a
> userspace dedup tool.  That raises the difficulty level substantially
> for a userspace dedup tool when it tries to figure out which extents to
> keep and which to discard or rewrite.
> 
> XFS may not have this problem--I haven't checked.  On btrfs it's
> definitely not as simple as "bang two inode/offset/length pairs together
> with dedup and disk space will be freed automagically."  If dedup is
> done incorrectly on btrfs, it can end up just making the filesystem slow
> without freeing any space.

I copied the 16M limit into xfs/ocfs2 because btrfs had it. :)

The VFS now limits the size of the incoming struct file_dedupe_range to
whatever a page size is.  On x86 that only allows us 126 dedupe
candidates, which means that a single call can perform up to ~2GB of IO.
Storage is getting faster, but 2GB is still a fair amount for a single
call.  Of course in XFS we do the dedupe one file and one page at a time
to keep the memory footprint sane.

On ppc64 with its huge 64k pages that comes out to 32GB of IO.

One thing we (speaking for XFS, anyway) /could/ do is limit based on the
theoretical IO count instead of clamping the length, e.g.

if ((u64)dest_count * len >= (1ULL << 31))
	return -E2BIG;

That way you /could/ specify a larger extent size if you pass in fewer
file descriptors.  OTOH XFS will merge all the records together, so even
if you deduped the whole 128M in 4k chunks you'll still end up with a
single block mapping record and a single backref.

Were I starting from scratch I'd probably just dump the existing dedupe
interface in favor of a non-vectorized dedupe_range call taking the same
parameters as clone_range:

int dedupe_range(src_fd, src_off, dest_fd, dest_off);

I'd also change the semantics to "Find and share all identical blocks in
this subrange.  Differing blocks are left alone." because it seems silly
that duperemove can issue large requests but a single byte difference in
the middle causes info->status to be set to FILE_DEDUPE_RANGE_DIFFERS
and info->bytes_deduped only changes if the entire range was deduped.

> The 16MB limit doesn't seem to be useful in practice.  The two useful
> effects of the limit seem to be DoS mitigation.  There is no checking of
> the RAM usage that I can find (i.e. if you fire off 16 dedup threads,
> they want 256MB of RAM; put another way, if you want to tie up 16GB of
> kernel RAM, all you have to do is create 1024 dedup threads), so it's
> not an effective DoS mitigation feature.  Internally dedup could verify
> blocks in batches of 16MB and check for signals/release and reacquire
> locks in between, so it wouldn't tie up the kernel or the two inodes
> for excessively long periods.

(Does btrfs actually do the extent_same stuff in parallel??)

> Even if we want to keep the 16MB limit, there's also no way to query the
> kernel from userspace to find out what the limit is, other than by trial
> and error.  It's not even in a header file, userspace just has to *know*.

-E2BIG?

--D

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23  2:02 ` Zygo Blaxell
  2016-11-23  2:44   ` Darrick J. Wong
@ 2016-11-23  4:26   ` Dave Chinner
  2016-11-23 13:55     ` Zygo Blaxell
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-11-23  4:26 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Darrick J. Wong,
	Qu Wenruo, Christoph Hellwig, kernel-team

On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> >    implicit EOF gets around this in the existing XFS implementation. I
> >    copied this for the Btrfs implementation.
> 
> Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> or drop that 16MB limit?
> 
> The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> behavior for a 128MB extent is to generate 8x16MB shared extent references
> with different extent offsets to a single 128MB physical extent.
> These references no longer look like the original 128MB extent to a
> userspace dedup tool.  That raises the difficulty level substantially
> for a userspace dedup tool when it tries to figure out which extents to
> keep and which to discard or rewrite.

That, IMO, is a btrfs design/implementation problem, not a problem
with the API. Applications are always going to end up doing things
that aren't perfectly aligned to extent boundaries or sizes
regardless of the size limit that is placed on the dedupe ranges.

> XFS may not have this problem--I haven't checked.

It doesn't - it tracks shared blocks exactly and merges adjacent
extent records whenever possible.

> Even if we want to keep the 16MB limit, there's also no way to query the
> kernel from userspace to find out what the limit is, other than by trial
> and error.  It's not even in a header file, userspace just has to *know*.

So add a define to the API to make it visible to applications and
document it in the man page.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23  4:26   ` Dave Chinner
@ 2016-11-23 13:55     ` Zygo Blaxell
  2016-11-23 22:13       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2016-11-23 13:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Darrick J. Wong,
	Qu Wenruo, Christoph Hellwig, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2181 bytes --]

On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > >    implicit EOF gets around this in the existing XFS implementation. I
> > >    copied this for the Btrfs implementation.
> > 
> > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > or drop that 16MB limit?
> > 
> > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > with different extent offsets to a single 128MB physical extent.
> > These references no longer look like the original 128MB extent to a
> > userspace dedup tool.  That raises the difficulty level substantially
> > for a userspace dedup tool when it tries to figure out which extents to
> > keep and which to discard or rewrite.
> 
> That, IMO, is a btrfs design/implementation problem, not a problem
> with the API. Applications are always going to end up doing things
> that aren't perfectly aligned to extent boundaries or sizes
> regardless of the size limit that is placed on the dedupe ranges.

Given that XFS doesn't have all the problems btrfs does, why does XFS
have the same aribitrary size limit?  Especially since XFS demonstrably
doesn't need it?

> > XFS may not have this problem--I haven't checked.
> 
> It doesn't - it tracks shared blocks exactly and merges adjacent
> extent records whenever possible.
> 
> > Even if we want to keep the 16MB limit, there's also no way to query the
> > kernel from userspace to find out what the limit is, other than by trial
> > and error.  It's not even in a header file, userspace just has to *know*.
> 
> So add a define to the API to make it visible to applications and
> document it in the man page.

To answer some of my own questions on the btrfs side:  It looks like
the btrfs implementation does have a reason for it (fixed-size arrays).

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23 13:55     ` Zygo Blaxell
@ 2016-11-23 22:13       ` Dave Chinner
  2016-11-23 23:14         ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-11-23 22:13 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Darrick J. Wong,
	Qu Wenruo, Christoph Hellwig, kernel-team

On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > >    implicit EOF gets around this in the existing XFS implementation. I
> > > >    copied this for the Btrfs implementation.
> > > 
> > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > or drop that 16MB limit?
> > > 
> > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > with different extent offsets to a single 128MB physical extent.
> > > These references no longer look like the original 128MB extent to a
> > > userspace dedup tool.  That raises the difficulty level substantially
> > > for a userspace dedup tool when it tries to figure out which extents to
> > > keep and which to discard or rewrite.
> > 
> > That, IMO, is a btrfs design/implementation problem, not a problem
> > with the API. Applications are always going to end up doing things
> > that aren't perfectly aligned to extent boundaries or sizes
> > regardless of the size limit that is placed on the dedupe ranges.
> 
> Given that XFS doesn't have all the problems btrfs does, why does XFS
> have the same aribitrary size limit?  Especially since XFS demonstrably
> doesn't need it?

Creating a new-but-slightly-incompatible jsut for XFS makes no
sense - we have multiple filesystems that support this functionality
and so they all should use the same APIs and present (as far as is
possible) the same behaviour to userspace.

IOWs it's more important to use existing APIs than to invent a new
one that does almost the same thing. This way userspace applications
don't need to be changed to support new XFS functionality and we
make life easier for everyone. A shiny new API without warts would
be nice, but we've already got to support the existing one forever,
it does the job we need and so it's less burden on everyone if we
just use it as is.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23 22:13       ` Dave Chinner
@ 2016-11-23 23:14         ` Zygo Blaxell
  2016-11-23 23:53           ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2016-11-23 23:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Darrick J. Wong,
	Qu Wenruo, Christoph Hellwig, kernel-team

[-- Attachment #1: Type: text/plain, Size: 2858 bytes --]

On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > > >    implicit EOF gets around this in the existing XFS implementation. I
> > > > >    copied this for the Btrfs implementation.
> > > > 
> > > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > > or drop that 16MB limit?
> > > > 
> > > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > > with different extent offsets to a single 128MB physical extent.
> > > > These references no longer look like the original 128MB extent to a
> > > > userspace dedup tool.  That raises the difficulty level substantially
> > > > for a userspace dedup tool when it tries to figure out which extents to
> > > > keep and which to discard or rewrite.
> > > 
> > > That, IMO, is a btrfs design/implementation problem, not a problem
> > > with the API. Applications are always going to end up doing things
> > > that aren't perfectly aligned to extent boundaries or sizes
> > > regardless of the size limit that is placed on the dedupe ranges.
> > 
> > Given that XFS doesn't have all the problems btrfs does, why does XFS
> > have the same aribitrary size limit?  Especially since XFS demonstrably
> > doesn't need it?
> 
> Creating a new-but-slightly-incompatible jsut for XFS makes no
> sense - we have multiple filesystems that support this functionality
> and so they all should use the same APIs and present (as far as is
> possible) the same behaviour to userspace.

OK.  Let's just remove the limit on all the filesystems then.
XFS doesn't need it, and btrfs can be fixed.

> IOWs it's more important to use existing APIs than to invent a new
> one that does almost the same thing. This way userspace applications
> don't need to be changed to support new XFS functionality and we
> make life easier for everyone. 

Except removing the limit doesn't work that way.  An application that
didn't impose an undocumented limit on itself wouldn't break when moved
to a filesystem that imposed no such limit, i.e. if XFS had no limit,
an application that moved from btrfs to XFS would just work.

> A shiny new API without warts would
> be nice, but we've already got to support the existing one forever,
> it does the job we need and so it's less burden on everyone if we
> just use it as is.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23 23:14         ` Zygo Blaxell
@ 2016-11-23 23:53           ` Dave Chinner
  2016-11-24  1:26             ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2016-11-23 23:53 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Darrick J. Wong,
	Qu Wenruo, Christoph Hellwig, kernel-team

On Wed, Nov 23, 2016 at 06:14:47PM -0500, Zygo Blaxell wrote:
> On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote:
> > On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> > > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > > > >    implicit EOF gets around this in the existing XFS implementation. I
> > > > > >    copied this for the Btrfs implementation.
> > > > > 
> > > > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > > > or drop that 16MB limit?
> > > > > 
> > > > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > > > with different extent offsets to a single 128MB physical extent.
> > > > > These references no longer look like the original 128MB extent to a
> > > > > userspace dedup tool.  That raises the difficulty level substantially
> > > > > for a userspace dedup tool when it tries to figure out which extents to
> > > > > keep and which to discard or rewrite.
> > > > 
> > > > That, IMO, is a btrfs design/implementation problem, not a problem
> > > > with the API. Applications are always going to end up doing things
> > > > that aren't perfectly aligned to extent boundaries or sizes
> > > > regardless of the size limit that is placed on the dedupe ranges.
> > > 
> > > Given that XFS doesn't have all the problems btrfs does, why does XFS
> > > have the same aribitrary size limit?  Especially since XFS demonstrably
> > > doesn't need it?
> > 
> > Creating a new-but-slightly-incompatible jsut for XFS makes no
> > sense - we have multiple filesystems that support this functionality
> > and so they all should use the same APIs and present (as far as is
> > possible) the same behaviour to userspace.
> 
> OK.  Let's just remove the limit on all the filesystems then.
> XFS doesn't need it, and btrfs can be fixed.

Yet applications still have to support kernel versions where btrfs
has a limit. IOWs, we can remove the limit for future improvement,
but that doesn't mean userspace is free from having to know about
the existing limit constraints.

That is, once a behaviour has been exposed to userspace through an
API, we can't just change it and act like it was always that way -
apps still have to support kernels that expose the old behaviour.
i.e. the old behaviour is there forever, and this why designing
userspace APIs is /hard/. It's also why it's better to use an
existing, slightly less than ideal API than invent a new one that
will simply have different problems exposed in future...

> > IOWs it's more important to use existing APIs than to invent a new
> > one that does almost the same thing. This way userspace applications
> > don't need to be changed to support new XFS functionality and we
> > make life easier for everyone. 
> 
> Except removing the limit doesn't work that way.  An application that
> didn't impose an undocumented limit on itself wouldn't break when moved
> to a filesystem that imposed no such limit, i.e. if XFS had no limit,
> an application that moved from btrfs to XFS would just work.

It goes /both ways/ though. Write an app on XFS that does not care
about limits and it won't work on btrfs because it gets unexpected
errors.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23 23:53           ` Dave Chinner
@ 2016-11-24  1:26             ` Darrick J. Wong
  2016-11-25  4:20               ` Zygo Blaxell
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2016-11-24  1:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Zygo Blaxell, Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo,
	Christoph Hellwig, kernel-team

On Thu, Nov 24, 2016 at 10:53:24AM +1100, Dave Chinner wrote:
> On Wed, Nov 23, 2016 at 06:14:47PM -0500, Zygo Blaxell wrote:
> > On Thu, Nov 24, 2016 at 09:13:28AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 23, 2016 at 08:55:59AM -0500, Zygo Blaxell wrote:
> > > > On Wed, Nov 23, 2016 at 03:26:32PM +1100, Dave Chinner wrote:
> > > > > On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > > > > > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > > > > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > > > > > >    implicit EOF gets around this in the existing XFS implementation. I
> > > > > > >    copied this for the Btrfs implementation.
> > > > > > 
> > > > > > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > > > > > or drop that 16MB limit?
> > > > > > 
> > > > > > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > > > > > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > > > > > with different extent offsets to a single 128MB physical extent.
> > > > > > These references no longer look like the original 128MB extent to a
> > > > > > userspace dedup tool.  That raises the difficulty level substantially
> > > > > > for a userspace dedup tool when it tries to figure out which extents to
> > > > > > keep and which to discard or rewrite.
> > > > > 
> > > > > That, IMO, is a btrfs design/implementation problem, not a problem
> > > > > with the API. Applications are always going to end up doing things
> > > > > that aren't perfectly aligned to extent boundaries or sizes
> > > > > regardless of the size limit that is placed on the dedupe ranges.
> > > > 
> > > > Given that XFS doesn't have all the problems btrfs does, why does XFS
> > > > have the same aribitrary size limit?  Especially since XFS demonstrably
> > > > doesn't need it?
> > > 
> > > Creating a new-but-slightly-incompatible jsut for XFS makes no
> > > sense - we have multiple filesystems that support this functionality
> > > and so they all should use the same APIs and present (as far as is
> > > possible) the same behaviour to userspace.
> > 
> > OK.  Let's just remove the limit on all the filesystems then.
> > XFS doesn't need it, and btrfs can be fixed.
> 
> Yet applications still have to support kernel versions where btrfs
> has a limit. IOWs, we can remove the limit for future improvement,
> but that doesn't mean userspace is free from having to know about
> the existing limit constraints.

Keep in mind that the number of bytes deduped is returned to userspace
via file_dedupe_range.info[x].bytes_deduped, so a properly functioning
userspace program actually /can/ detect that its 128MB request got cut
down to only 16MB and re-issue the request with the offsets moved up by
16MB.  The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c)
implements this strategy.  duperemove (the only other user I know of)
also does this.

So it's really no big deal to increase the limit beyond 16MB, eliminate
it entirely, or even change it to cap the total request size while
dropping the per-item IO limit.

As I mentioned in my other reply, the only hesitation I have for not
killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a
single ioctl call.

(Dave: That said, if you want to kill it, I'm more than happy to do so
for XFS and ocfs2.)

--D

> That is, once a behaviour has been exposed to userspace through an
> API, we can't just change it and act like it was always that way -
> apps still have to support kernels that expose the old behaviour.
> i.e. the old behaviour is there forever, and this why designing
> userspace APIs is /hard/. It's also why it's better to use an
> existing, slightly less than ideal API than invent a new one that
> will simply have different problems exposed in future...
> 
> > > IOWs it's more important to use existing APIs than to invent a new
> > > one that does almost the same thing. This way userspace applications
> > > don't need to be changed to support new XFS functionality and we
> > > make life easier for everyone. 
> > 
> > Except removing the limit doesn't work that way.  An application that
> > didn't impose an undocumented limit on itself wouldn't break when moved
> > to a filesystem that imposed no such limit, i.e. if XFS had no limit,
> > an application that moved from btrfs to XFS would just work.
> 
> It goes /both ways/ though. Write an app on XFS that does not care
> about limits and it won't work on btrfs because it gets unexpected
> errors.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-23  2:44   ` Darrick J. Wong
@ 2016-11-24  5:16     ` Zygo Blaxell
  0 siblings, 0 replies; 18+ messages in thread
From: Zygo Blaxell @ 2016-11-24  5:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo,
	Christoph Hellwig, kernel-team

[-- Attachment #1: Type: text/plain, Size: 5814 bytes --]

On Tue, Nov 22, 2016 at 06:44:19PM -0800, Darrick J. Wong wrote:
> On Tue, Nov 22, 2016 at 09:02:10PM -0500, Zygo Blaxell wrote:
> > On Thu, Nov 17, 2016 at 04:07:48PM -0800, Omar Sandoval wrote:
> > > 3. Both XFS and Btrfs cap each dedupe operation to 16MB, but the
> > >    implicit EOF gets around this in the existing XFS implementation. I
> > >    copied this for the Btrfs implementation.
> > 
> > Somewhat tangential to this patch, but on the dedup topic:  Can we raise
> > or drop that 16MB limit?
> > 
> > The maximum btrfs extent length is 128MB.  Currently the btrfs dedup
> > behavior for a 128MB extent is to generate 8x16MB shared extent references
> > with different extent offsets to a single 128MB physical extent.
> > These references no longer look like the original 128MB extent to a
> > userspace dedup tool.  That raises the difficulty level substantially
> > for a userspace dedup tool when it tries to figure out which extents to
> > keep and which to discard or rewrite.
> > 
> > XFS may not have this problem--I haven't checked.  On btrfs it's
> > definitely not as simple as "bang two inode/offset/length pairs together
> > with dedup and disk space will be freed automagically."  If dedup is
> > done incorrectly on btrfs, it can end up just making the filesystem slow
> > without freeing any space.
> 
> I copied the 16M limit into xfs/ocfs2 because btrfs had it. :)

Finally, a clearly stated rationale.  ;)

> The VFS now limits the size of the incoming struct file_dedupe_range to
> whatever a page size is.  On x86 that only allows us 126 dedupe
> candidates, which means that a single call can perform up to ~2GB of IO.
> Storage is getting faster, but 2GB is still a fair amount for a single
> call.  Of course in XFS we do the dedupe one file and one page at a time
> to keep the memory footprint sane.
> 
> On ppc64 with its huge 64k pages that comes out to 32GB of IO.
> 
> One thing we (speaking for XFS, anyway) /could/ do is limit based on the
> theoretical IO count instead of clamping the length, e.g.
> 
> if ((u64)dest_count * len >= (1ULL << 31))
> 	return -E2BIG;
> 
> That way you /could/ specify a larger extent size if you pass in fewer
> file descriptors.  OTOH XFS will merge all the records together, so even
> if you deduped the whole 128M in 4k chunks you'll still end up with a
> single block mapping record and a single backref.

This is why I'm mystified that XFS has this limitation.  On btrfs there
were at least _reasons_ for it, even if they were just "we have a v0.3
implementation and nobody's even started optimizing it yet."

The btrfs code calls kzalloc (with size limited by MAX_DEDUPE_LEN) in
the context of the thread executing the ioctl.  It then loads up all the
pages, compares them, then decides whether to continue with clone_range
for the whole extent, or not.  btrfs doesn't seem to ever merge these.

> Were I starting from scratch I'd probably just dump the existing dedupe
> interface in favor of a non-vectorized dedupe_range call taking the same
> parameters as clone_range:
> 
> int dedupe_range(src_fd, src_off, dest_fd, dest_off);
> 
> I'd also change the semantics to "Find and share all identical blocks in
> this subrange.  Differing blocks are left alone." because it seems silly
> that duperemove can issue large requests but a single byte difference in
> the middle causes info->status to be set to FILE_DEDUPE_RANGE_DIFFERS
> and info->bytes_deduped only changes if the entire range was deduped.

It'd also be nice if it replaced all existing shared refs to the dst
blocks at the same time.  On btrfs, dedup agents have to find all the
shared refs (either through brute force or by using LOGICAL_INO to look
them all up through backrefs) and feed each one into extent_same until
the last reference to dst is removed.  But maybe this is only needed to
work around a btrfs thing that never happens on XFS... :-P

> > The 16MB limit doesn't seem to be useful in practice.  The two useful
> > effects of the limit seem to be DoS mitigation.  There is no checking of
> > the RAM usage that I can find (i.e. if you fire off 16 dedup threads,
> > they want 256MB of RAM; put another way, if you want to tie up 16GB of
> > kernel RAM, all you have to do is create 1024 dedup threads), so it's
> > not an effective DoS mitigation feature.  Internally dedup could verify
> > blocks in batches of 16MB and check for signals/release and reacquire
> > locks in between, so it wouldn't tie up the kernel or the two inodes
> > for excessively long periods.
> 
> (Does btrfs actually do the extent_same stuff in parallel??)

A btrfs dedup agent can invoke multiple extent_sames in separate threads.
It's not a good idea on a single filesystem--they all end up blocking
on a disk IO or throttled in transaction commit--but in theory two users
could be running separate dedup agents simultaneously.

A malicious agent doesn't care about sanity if it just wants to tie
up resources.

> > Even if we want to keep the 16MB limit, there's also no way to query the
> > kernel from userspace to find out what the limit is, other than by trial
> > and error.  It's not even in a header file, userspace just has to *know*.
> 
> -E2BIG?

That's how it currently works:  an application fills in the size it wants,
gets an error if it's too big (EINVAL?), and then has to guess which of
the parameters the kernel didn't like.  Btrfs has had maybe half a dozen
slight variations of this so far, always removing a restriction.

> --D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-24  1:26             ` Darrick J. Wong
@ 2016-11-25  4:20               ` Zygo Blaxell
  2016-11-28 17:58                 ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Zygo Blaxell @ 2016-11-25  4:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo,
	Christoph Hellwig, kernel-team

[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]

On Wed, Nov 23, 2016 at 05:26:18PM -0800, Darrick J. Wong wrote:
[...]
> Keep in mind that the number of bytes deduped is returned to userspace
> via file_dedupe_range.info[x].bytes_deduped, so a properly functioning
> userspace program actually /can/ detect that its 128MB request got cut
> down to only 16MB and re-issue the request with the offsets moved up by
> 16MB.  The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c)
> implements this strategy.  duperemove (the only other user I know of)
> also does this.
> 
> So it's really no big deal to increase the limit beyond 16MB, eliminate
> it entirely, or even change it to cap the total request size while
> dropping the per-item IO limit.
> 
> As I mentioned in my other reply, the only hesitation I have for not
> killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a
> single ioctl call.

Everything's relative.  btrfs has ioctls that will do hundreds of
terabytes of IO and take months to run.  2GB of data is nothing.

Deduping entire 100TB files with a single ioctl call makes as much
sense to me as reflink copying them with a single ioctl call.  The only
reason I see to keep the limit is to work around something wrong with
the implementation.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe
  2016-11-25  4:20               ` Zygo Blaxell
@ 2016-11-28 17:58                 ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2016-11-28 17:58 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Dave Chinner, Omar Sandoval, linux-btrfs, linux-xfs, Qu Wenruo,
	Christoph Hellwig, kernel-team

On Thu, Nov 24, 2016 at 11:20:39PM -0500, Zygo Blaxell wrote:
> On Wed, Nov 23, 2016 at 05:26:18PM -0800, Darrick J. Wong wrote:
> [...]
> > Keep in mind that the number of bytes deduped is returned to userspace
> > via file_dedupe_range.info[x].bytes_deduped, so a properly functioning
> > userspace program actually /can/ detect that its 128MB request got cut
> > down to only 16MB and re-issue the request with the offsets moved up by
> > 16MB.  The dedupe client in xfs_io (see dedupe_ioctl() in io/reflink.c)
> > implements this strategy.  duperemove (the only other user I know of)
> > also does this.
> > 
> > So it's really no big deal to increase the limit beyond 16MB, eliminate
> > it entirely, or even change it to cap the total request size while
> > dropping the per-item IO limit.
> > 
> > As I mentioned in my other reply, the only hesitation I have for not
> > killing XFS_MAX_DEDUPE_LEN is that I feel that 2GB is enough IO for a
> > single ioctl call.
> 
> Everything's relative.  btrfs has ioctls that will do hundreds of
> terabytes of IO and take months to run.  2GB of data is nothing.
> 
> Deduping entire 100TB files with a single ioctl call makes as much
> sense to me as reflink copying them with a single ioctl call.  The only
> reason I see to keep the limit is to work around something wrong with
> the implementation.

Ok.  I'll post patches removing the 16MB limitation for XFS and ocfs2 in 4.10
if nobody objects.

--D

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-11-28 17:59 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-18  0:07 [RFC PATCH 0/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
2016-11-18  0:07 ` [RFC PATCH 1/2] Btrfs: refactor btrfs_extent_same() slightly Omar Sandoval
2016-11-18  3:22   ` Qu Wenruo
2016-11-18  3:22     ` Qu Wenruo
2016-11-18  0:07 ` [RFC PATCH 2/2] Btrfs: make a source length of 0 imply EOF for dedupe Omar Sandoval
2016-11-18  5:38 ` [RFC PATCH 0/2] " Christoph Hellwig
2016-11-22 21:17   ` Darrick J. Wong
2016-11-23  2:02 ` Zygo Blaxell
2016-11-23  2:44   ` Darrick J. Wong
2016-11-24  5:16     ` Zygo Blaxell
2016-11-23  4:26   ` Dave Chinner
2016-11-23 13:55     ` Zygo Blaxell
2016-11-23 22:13       ` Dave Chinner
2016-11-23 23:14         ` Zygo Blaxell
2016-11-23 23:53           ` Dave Chinner
2016-11-24  1:26             ` Darrick J. Wong
2016-11-25  4:20               ` Zygo Blaxell
2016-11-28 17:58                 ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.