* [PATCH 0/4] [RFC] btrfs: offline dedupe
@ 2013-04-16 22:15 Mark Fasheh
  2013-04-16 22:15 ` [PATCH 1/4] btrfs: abstract out range locking in clone ioctl() Mark Fasheh
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 22:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, chris.mason

Hi,
 
The following series of patches implements in btrfs an ioctl to do
offline deduplication of file extents.

To be clear, "offline" in this sense means that the file system is
mounted and running, but the dedupe is not done during file writes,
but after the fact when some userspace software initiates a dedupe.

The primary patch is loosely based on one sent by Josef Bacik back
in January 2011.

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508

I've made significant updates and changes from the original. In
particular the structure passed is more fleshed out, this series has a
high degree of code sharing between itself and the clone code, and the
locking has been updated.


The ioctl accepts a struct:

struct btrfs_ioctl_same_args {
	__u64 logical_offset;	/* in - start of extent in source */
	__u64 length;		/* in - length of extent */
	__u16 total_files;	/* in - total elements in info array */
	__u16 files_deduped;	/* out - number of files that got deduped */
	__u32 reserved;
	struct btrfs_ioctl_same_extent_info info[0];
};

Userspace puts each duplicate extent (other than the source) in an
item in the info array. As there can be multiple dedupes in one
operation, each info item has its own status and 'bytes_deduped'
member. This provides a number of benefits:

- We don't have to fail the entire ioctl because one of the dedupes failed.

- Userspace will always know how much progress was made on a file as we always
  return the number of bytes deduped.


#define BTRFS_SAME_DATA_DIFFERS	1
/* For extent-same ioctl */
struct btrfs_ioctl_same_extent_info {
	__s64 fd;		/* in - destination file */
	__u64 logical_offset;	/* in - start of extent in destination */
	__u64 bytes_deduped;	/* out - total # of bytes we were able
				 * to dedupe from this file */
	/* status of this dedupe operation:
	 * 0 if dedup succeeds
	 * < 0 for error
	 * == BTRFS_SAME_DATA_DIFFERS if data differs
	 */
	__s32 status;		/* out - see above description */
	__u32 reserved;
};
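
For illustration, a minimal userspace caller might look like the sketch
below. This is untested, and it assumes the struct definitions above plus
the BTRFS_IOC_FILE_EXTENT_SAME definition from patch 4 have been copied
into a local header (the name "btrfs-extent-same.h" here is made up):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include "btrfs-extent-same.h"	/* hypothetical local copy of the ioctl ABI */

/*
 * Ask the kernel to dedupe one block-aligned source range into each of
 * the nr_dests destination (fd, offset) pairs, then report the per-item
 * status and byte counts that come back in the info array.
 */
static int same_range(int src_fd, uint64_t off, uint64_t len,
		      int *dest_fds, uint64_t *dest_offs, uint16_t nr_dests)
{
	struct btrfs_ioctl_same_args *args;
	int i, ret;

	args = calloc(1, sizeof(*args) + nr_dests * sizeof(args->info[0]));
	if (!args)
		return -1;

	args->logical_offset = off;
	args->length = len;
	args->total_files = nr_dests;
	for (i = 0; i < nr_dests; i++) {
		args->info[i].fd = dest_fds[i];
		args->info[i].logical_offset = dest_offs[i];
	}

	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret == 0) {
		for (i = 0; i < nr_dests; i++)
			printf("dest %d: status %d, %llu bytes deduped\n",
			       i, args->info[i].status,
			       (unsigned long long)args->info[i].bytes_deduped);
	}

	free(args);
	return ret;
}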


The kernel patches are based on Linux v3.8. The ioctl has been tested
in the most basic sense, and should not be trusted to keep your data
safe. There are bugs.

A git tree for the kernel changes can be found at:

https://github.com/markfasheh/btrfs-extent-same


I have a userspace project, duperemove available at:

https://github.com/markfasheh/duperemove

Hopefully this can serve as an example of one possible usage of the ioctl.

duperemove takes a list of files as arguments and will search them for
duplicated extents. My most recent changes have been to integrate it
with btrfs_extent_same so that the '-D' switch will have it fire off
dedupe requests once processing of data is complete. Integration with
extent_same has *not* been tested yet so don't expect that to work
flawlessly.

Within the duperemove repo is a file, btrfs-extent-same.c, that acts as
a test wrapper around the ioctl. It can be compiled completely
separately from the rest of the project via "make
btrfs-extent-same". This makes direct testing of the ioctl more
convenient.

Code review is very much appreciated. Thanks,
     --Mark

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] btrfs: abstract out range locking in clone ioctl()
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
@ 2013-04-16 22:15 ` Mark Fasheh
  2013-04-16 22:15 ` [PATCH 2/4] btrfs_ioctl_clone: Move clone code into its own function Mark Fasheh
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 22:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, chris.mason, Mark Fasheh

The range locking in btrfs_ioctl_clone is trivially broken out into its own
function. This reduces the complexity of btrfs_ioctl_clone() a bit
and makes that locking code available to future functions in
fs/btrfs/ioctl.c.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c |   36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 338f259..7c80738 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2412,6 +2412,26 @@ out:
 	return ret;
 }
 
+static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
+{
+	/* do any pending delalloc/csum calc on src, one way or
+	   another, and lock file content */
+	while (1) {
+		struct btrfs_ordered_extent *ordered;
+		lock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1);
+		ordered = btrfs_lookup_first_ordered_extent(inode,
+							    off + len - 1);
+		if (!ordered &&
+		    !test_range_bit(&BTRFS_I(inode)->io_tree, off,
+				    off + len - 1, EXTENT_DELALLOC, 0, NULL))
+			break;
+		unlock_extent(&BTRFS_I(inode)->io_tree, off, off + len - 1);
+		if (ordered)
+			btrfs_put_ordered_extent(ordered);
+		btrfs_wait_ordered_range(inode, off, len);
+	}
+}
+
 static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 				       u64 off, u64 olen, u64 destoff)
 {
@@ -2529,21 +2549,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	truncate_inode_pages_range(&inode->i_data, destoff,
 				   PAGE_CACHE_ALIGN(destoff + len) - 1);
 
-	/* do any pending delalloc/csum calc on src, one way or
-	   another, and lock file content */
-	while (1) {
-		struct btrfs_ordered_extent *ordered;
-		lock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
-		ordered = btrfs_lookup_first_ordered_extent(src, off + len - 1);
-		if (!ordered &&
-		    !test_range_bit(&BTRFS_I(src)->io_tree, off, off + len - 1,
-				    EXTENT_DELALLOC, 0, NULL))
-			break;
-		unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
-		if (ordered)
-			btrfs_put_ordered_extent(ordered);
-		btrfs_wait_ordered_range(src, off, len);
-	}
+	lock_extent_range(src, off, len);
 
 	/* clone data */
 	key.objectid = btrfs_ino(src);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/4] btrfs_ioctl_clone: Move clone code into its own function
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
  2013-04-16 22:15 ` [PATCH 1/4] btrfs: abstract out range locking in clone ioctl() Mark Fasheh
@ 2013-04-16 22:15 ` Mark Fasheh
  2013-04-16 22:15 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Mark Fasheh
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 22:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, chris.mason, Mark Fasheh

There are some 250+ lines here that are easily encapsulated into their own
function. I don't change how anything works here, just create and document
the new btrfs_clone() function from the btrfs_ioctl_clone() code.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c |  232 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 128 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7c80738..d237447 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2432,125 +2432,43 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+/**
+ * btrfs_clone() - clone a range from inode file to another
+ *
+ * @src: Inode to clone from
+ * @inode: Inode to clone to
+ * @off: Offset within source to start clone from
+ * @olen: Original length, passed by user, of range to clone
+ * @olen_aligned: Block-aligned value of olen, extent_same uses
+ *               identical values here
+ * @destoff: Offset within @inode to start clone
+ */
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff)
 {
-	struct inode *inode = fdentry(file)->d_inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
-	struct btrfs_trans_handle *trans;
-	struct btrfs_path *path;
+	struct btrfs_path *path = NULL;
 	struct extent_buffer *leaf;
-	char *buf;
+	struct btrfs_trans_handle *trans;
+	char *buf = NULL;
 	struct btrfs_key key;
 	u32 nritems;
 	int slot;
 	int ret;
-	u64 len = olen;
-	u64 bs = root->fs_info->sb->s_blocksize;
-
-	/*
-	 * TODO:
-	 * - split compressed inline extents.  annoying: we need to
-	 *   decompress into destination's address_space (the file offset
-	 *   may change, so source mapping won't do), then recompress (or
-	 *   otherwise reinsert) a subrange.
-	 * - allow ranges within the same file to be cloned (provided
-	 *   they don't overlap)?
-	 */
-
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
-	if (btrfs_root_readonly(root))
-		return -EROFS;
-
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = src_file.file->f_dentry->d_inode;
-
-	ret = -EINVAL;
-	if (src == inode)
-		goto out_fput;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
-
-	/* don't make the dst file partly checksummed */
-	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
-	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
-
-	ret = -EISDIR;
-	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+	u64 len = olen_aligned;
 
 	ret = -ENOMEM;
 	buf = vmalloc(btrfs_level_size(root, 0));
 	if (!buf)
-		goto out_fput;
+		return ret;
 
 	path = btrfs_alloc_path();
 	if (!path) {
 		vfree(buf);
-		goto out_fput;
-	}
-	path->reada = 2;
-
-	if (inode < src) {
-		mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&src->i_mutex, I_MUTEX_CHILD);
-	} else {
-		mutex_lock_nested(&src->i_mutex, I_MUTEX_PARENT);
-		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
-	}
-
-	/* determine range to clone */
-	ret = -EINVAL;
-	if (off + len > src->i_size || off + len < off)
-		goto out_unlock;
-	if (len == 0)
-		olen = len = src->i_size - off;
-	/* if we extend to eof, continue to block boundary */
-	if (off + len == src->i_size)
-		len = ALIGN(src->i_size, bs) - off;
-
-	/* verify the end result is block aligned */
-	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
-	    !IS_ALIGNED(destoff, bs))
-		goto out_unlock;
-
-	if (destoff > inode->i_size) {
-		ret = btrfs_cont_expand(inode, inode->i_size, destoff);
-		if (ret)
-			goto out_unlock;
+		return ret;
 	}
 
-	/* truncate page cache pages from target inode range */
-	truncate_inode_pages_range(&inode->i_data, destoff,
-				   PAGE_CACHE_ALIGN(destoff + len) - 1);
-
-	lock_extent_range(src, off, len);
-
+	path->reada = 2;
 	/* clone data */
 	key.objectid = btrfs_ino(src);
 	key.type = BTRFS_EXTENT_DATA_KEY;
@@ -2796,14 +2714,120 @@ next:
 		key.offset++;
 	}
 	ret = 0;
+
 out:
 	btrfs_release_path(path);
+	btrfs_free_path(path);
+	vfree(buf);
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct inode *inode = fdentry(file)->d_inode;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct fd src_file;
+	struct inode *src;
+	int ret;
+	u64 len = olen;
+	u64 bs = root->fs_info->sb->s_blocksize;
+
+	/*
+	 * TODO:
+	 * - split compressed inline extents.  annoying: we need to
+	 *   decompress into destination's address_space (the file offset
+	 *   may change, so source mapping won't do), then recompress (or
+	 *   otherwise reinsert) a subrange.
+	 * - allow ranges within the same file to be cloned (provided
+	 *   they don't overlap)?
+	 */
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	if (btrfs_root_readonly(root))
+		return -EROFS;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	src_file = fdget(srcfd);
+	if (!src_file.file) {
+		ret = -EBADF;
+		goto out_drop_write;
+	}
+
+	ret = -EXDEV;
+	if (src_file.file->f_path.mnt != file->f_path.mnt)
+		goto out_fput;
+
+	src = src_file.file->f_dentry->d_inode;
+
+	ret = -EINVAL;
+	if (src == inode)
+		goto out_fput;
+
+	/* the src must be open for reading */
+	if (!(src_file.file->f_mode & FMODE_READ))
+		goto out_fput;
+
+	/* don't make the dst file partly checksummed */
+	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto out_fput;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
+		goto out_fput;
+
+	ret = -EXDEV;
+	if (src->i_sb != inode->i_sb)
+		goto out_fput;
+
+	if (inode < src) {
+		mutex_lock_nested(&inode->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&src->i_mutex, I_MUTEX_CHILD);
+	} else {
+		mutex_lock_nested(&src->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode->i_mutex, I_MUTEX_CHILD);
+	}
+
+	/* determine range to clone */
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out_unlock;
+	if (len == 0)
+		olen = len = src->i_size - off;
+	/* if we extend to eof, continue to block boundary */
+	if (off + len == src->i_size)
+		len = ALIGN(src->i_size, bs) - off;
+
+	/* verify the end result is block aligned */
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs) ||
+	    !IS_ALIGNED(destoff, bs))
+		goto out_unlock;
+
+	if (destoff > inode->i_size) {
+		ret = btrfs_cont_expand(inode, inode->i_size, destoff);
+		if (ret)
+			goto out_unlock;
+	}
+
+	/* truncate page cache pages from target inode range */
+	truncate_inode_pages_range(&inode->i_data, destoff,
+				   PAGE_CACHE_ALIGN(destoff + len) - 1);
+
+	lock_extent_range(src, off, len);
+
+	ret = btrfs_clone(src, inode, off, olen, len, destoff);
+
 	unlock_extent(&BTRFS_I(src)->io_tree, off, off + len - 1);
 out_unlock:
 	mutex_unlock(&src->i_mutex);
 	mutex_unlock(&inode->i_mutex);
-	vfree(buf);
-	btrfs_free_path(path);
 out_fput:
 	fdput(src_file);
 out_drop_write:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
  2013-04-16 22:15 ` [PATCH 1/4] btrfs: abstract out range locking in clone ioctl() Mark Fasheh
  2013-04-16 22:15 ` [PATCH 2/4] btrfs_ioctl_clone: Move clone code into its own function Mark Fasheh
@ 2013-04-16 22:15 ` Mark Fasheh
  2013-05-06 12:38   ` David Sterba
  2013-04-16 22:15 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 22:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, chris.mason, Mark Fasheh

We want this for btrfs_extent_same. Basically readpage and friends do their
own extent locking but for the purposes of dedupe, we want to have both
files locked down across a set of readpage operations (so that we can
compare data). Introduce this variant and a flag which can be set for
extent_read_full_page() to indicate that we are already locked.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/extent_io.c |   44 ++++++++++++++++++++++++++++++++------------
 fs/btrfs/extent_io.h |    2 ++
 2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1b319df..9256503 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2592,7 +2592,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 				   struct page *page,
 				   get_extent_t *get_extent,
 				   struct bio **bio, int mirror_num,
-				   unsigned long *bio_flags)
+				   unsigned long *bio_flags, int parent_locked)
 {
 	struct inode *inode = page->mapping->host;
 	u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
@@ -2625,7 +2625,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 	}
 
 	end = page_end;
-	while (1) {
+	while (1 && !parent_locked) {
 		lock_extent(tree, start, end);
 		ordered = btrfs_lookup_ordered_extent(inode, start);
 		if (!ordered)
@@ -2659,15 +2659,18 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 			kunmap_atomic(userpage);
 			set_extent_uptodate(tree, cur, cur + iosize - 1,
 					    &cached, GFP_NOFS);
-			unlock_extent_cached(tree, cur, cur + iosize - 1,
-					     &cached, GFP_NOFS);
+			if (!parent_locked)
+				unlock_extent_cached(tree, cur,
+						     cur + iosize - 1,
+						     &cached, GFP_NOFS);
 			break;
 		}
 		em = get_extent(inode, page, pg_offset, cur,
 				end - cur + 1, 0);
 		if (IS_ERR_OR_NULL(em)) {
 			SetPageError(page);
-			unlock_extent(tree, cur, end);
+			if (!parent_locked)
+				unlock_extent(tree, cur, end);
 			break;
 		}
 		extent_offset = cur - em->start;
@@ -2719,7 +2722,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 		if (test_range_bit(tree, cur, cur_end,
 				   EXTENT_UPTODATE, 1, NULL)) {
 			check_page_uptodate(tree, page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -2729,7 +2733,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 		 */
 		if (block_start == EXTENT_MAP_INLINE) {
 			SetPageError(page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 			cur = cur + iosize;
 			pg_offset += iosize;
 			continue;
@@ -2756,7 +2761,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 		}
 		if (ret) {
 			SetPageError(page);
-			unlock_extent(tree, cur, cur + iosize - 1);
+			if (!parent_locked)
+				unlock_extent(tree, cur, cur + iosize - 1);
 		}
 		cur = cur + iosize;
 		pg_offset += iosize;
@@ -2778,7 +2784,21 @@ int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
 	int ret;
 
 	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
-				      &bio_flags);
+				      &bio_flags, 0);
+	if (bio)
+		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
+	return ret;
+}
+
+int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
+				 get_extent_t *get_extent, int mirror_num)
+{
+	struct bio *bio = NULL;
+	unsigned long bio_flags = 0;
+	int ret;
+
+	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
+				      &bio_flags, 1);
 	if (bio)
 		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
 	return ret;
@@ -3648,14 +3668,14 @@ int extent_readpages(struct extent_io_tree *tree,
 			continue;
 		for (i = 0; i < nr; i++) {
 			__extent_read_full_page(tree, pagepool[i], get_extent,
-					&bio, 0, &bio_flags);
+						&bio, 0, &bio_flags, 0);
 			page_cache_release(pagepool[i]);
 		}
 		nr = 0;
 	}
 	for (i = 0; i < nr; i++) {
 		__extent_read_full_page(tree, pagepool[i], get_extent,
-					&bio, 0, &bio_flags);
+					&bio, 0, &bio_flags, 0);
 		page_cache_release(pagepool[i]);
 	}
 
@@ -4620,7 +4640,7 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
 			ClearPageError(page);
 			err = __extent_read_full_page(tree, page,
 						      get_extent, &bio,
-						      mirror_num, &bio_flags);
+						      mirror_num, &bio_flags, 0);
 			if (err)
 				ret = err;
 		} else {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 2eacfab..71752fc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -202,6 +202,8 @@ int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
 			  get_extent_t *get_extent, int mirror_num);
+int extent_read_full_page_nolock(struct extent_io_tree *tree, struct page *page,
+				 get_extent_t *get_extent, int mirror_num);
 int __init extent_io_init(void);
 void extent_io_exit(void);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/4] btrfs: offline dedupe
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
                   ` (2 preceding siblings ...)
  2013-04-16 22:15 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Mark Fasheh
@ 2013-04-16 22:15 ` Mark Fasheh
  2013-05-06 12:36   ` David Sterba
  2013-04-16 22:50 ` [PATCH 0/4] [RFC] " Marek Otahal
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 22:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, chris.mason, Mark Fasheh

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c |  272 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/ioctl.h |   28 +++++-
 2 files changed, 299 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d237447..7cad49e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2412,6 +2415,61 @@ out:
 	return ret;
 }
 
+static noinline int fill_data(struct inode *inode, u64 off, u64 len,
+			      char **cur_buffer)
+{
+	struct page *page;
+	void *addr;
+	char *buffer;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	buffer = kmalloc(len, GFP_NOFS);
+	if (!buffer)
+		return -ENOMEM;
+
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!PageUptodate(page)) {
+			extent_read_full_page_nolock(tree, page,
+						     btrfs_get_extent, 0);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+	*cur_buffer = buffer;
+
+out:
+	if (ret)
+		kfree(buffer);
+	return ret;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2432,6 +2490,218 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode1, loff1, len);
+		lock_extent_range(inode2, loff2, len);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+		lock_extent_range(inode1, loff1, len);
+	}
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff)
+{
+	char *orig_buffer = NULL;
+	char *dst_inode_buffer = NULL;
+	int ret;
+
+	/*
+	 * btrfs_clone() can't handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = fill_data(src, loff, len, &orig_buffer);
+	if (ret) {
+		printk(KERN_ERR "btrfs: unable to populate source data "
+		       "buffer.\n");
+		goto out;
+	}
+
+	ret = fill_data(dst, dst_loff, len, &dst_inode_buffer);
+	if (ret) {
+		printk(KERN_ERR "btrfs: unable to populate destination data "
+		       "buffer.\n");
+		goto out;
+	}
+
+	ret = memcmp(orig_buffer, dst_inode_buffer, len);
+	if (ret) {
+		ret = BTRFS_SAME_DATA_DIFFERS;
+		printk(KERN_ERR "btrfs: data for inode %lu does not "
+		       "match\n", dst->i_ino);
+		goto out;
+	}
+
+	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+out:
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	kfree(dst_inode_buffer);
+	kfree(orig_buffer);
+	return ret;
+}
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct btrfs_ioctl_same_extent_info *info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int args_size;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp)))
+		return -EFAULT;
+
+	args_size = sizeof(tmp) + (tmp.total_files *
+			sizeof(struct btrfs_ioctl_same_extent_info));
+
+	/* Keep size of ioctl argument sane */
+	if (args_size > PAGE_CACHE_SIZE)
+		return -ENOMEM;
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		goto out;
+	/* Make sure args didn't change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp)))
+		goto out;
+
+	if ((sizeof(tmp) + (sizeof(*info) * args->total_files)) > args_size)
+		goto out;
+
+	/* pre-format 'out' fields to sane default values */
+	args->files_deduped = 0;
+	for (i = 0; i < args->total_files; i++) {
+		info = &args->info[i];
+		info->bytes_deduped = 0;
+		info->status = 0;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out;
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		goto out;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	/*
+	 * Since we have to memcmp the data to make sure it does actually match
+	 * each other we need to allocate 2 buffers to handle this.  So limit the
+	 * blocksize to 1 megabyte to make sure nobody abuses this.
+	 */ 
+	ret = -ENOMEM;
+	if (len > 1 * 1024 * 1024)
+		goto out;
+
+	ret = 0;
+	for (i = 0; i < args->total_files; i++) {
+		u64 dest_off;
+
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
+			info->status = -EBADF;
+			continue;
+		}
+
+		dst = dst_file->f_dentry->d_inode;
+		if (S_ISDIR(dst->i_mode)) {
+			printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
+			info->status = -EISDIR;
+			goto next;
+		}
+
+		info->status = -EINVAL;
+		if (dst == src) {
+			printk(KERN_ERR "btrfs: file dup %lld\n", info->fd);
+			goto next;
+		}
+
+		if (BTRFS_I(dst)->root != BTRFS_I(src)->root) {
+			printk(KERN_ERR "btrfs: cannot dedup across subvolumes"
+			       " %lld\n", info->fd);
+			goto next;
+		}
+
+		dest_off = info->logical_offset;
+
+		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
+			goto next;
+		if (!IS_ALIGNED(dest_off, bs))
+			goto next;
+
+		info->status = btrfs_extent_same(src, off, len, dst,
+						 info->logical_offset);
+		if (info->status == 0) {
+			info->bytes_deduped = len;
+			args->files_deduped++;
+		} else {
+			printk(KERN_ERR "error %d from btrfs_extent_same\n",
+				info->status);
+		}
+next:
+		fput(dst_file);
+		dst_file = NULL;
+	}
+
+	if (copy_to_user(argp, args, args_size))
+		ret = -EFAULT;
+
+out:
+	if (dst_file)
+		fput(dst_file);
+	kfree(args);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4044,6 +4314,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_qgroup_limit(file, argp);
 	case BTRFS_IOC_DEV_REPLACE:
 		return btrfs_ioctl_dev_replace(root, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index dabca9c..445e448 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -304,6 +304,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 total_files;	/* in - total elements in info array */
+	__u16 files_deduped;	/* out - number of files that got deduped */
+	__u32 reserved;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -498,5 +523,6 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
-
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 #endif
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
                   ` (3 preceding siblings ...)
  2013-04-16 22:15 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-04-16 22:50 ` Marek Otahal
  2013-04-16 23:17   ` Mark Fasheh
  2013-04-20 15:49 ` Gabriel de Perthuis
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Marek Otahal @ 2013-04-16 22:50 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs

Hi Mark, 

could you compare (apart from online/offline) your implementation to LiuBo's work, which appeared on the ML a while ago:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html

It would be interesting if the two approaches could share some code, and also
to get confirmation that using one technique does not preclude using the other in the future.

Best wishes, 
Mark

On Tuesday 16 April 2013 15:15:31 Mark Fasheh wrote:
> Hi,
>  
> The following series of patches implements in btrfs an ioctl to do
> offline deduplication of file extents.
> 
> To be clear, "offline" in this sense means that the file system is
> mounted and running, but the dedupe is not done during file writes,
> but after the fact when some userspace software initiates a dedupe.
> 
> The primary patch is loosely based off of one sent by Josef Bacik back
> in January, 2011.
> 
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/8508
> 
> I've made significant updates and changes from the original. In
> particular the structure passed is more fleshed out, this series has a
> high degree of code sharing between itself and the clone code, and the
> locking has been updated.
> 
...
> 
> Code review is very much appreciated. Thanks,
>      --Mark
-- 

Marek Otahal :o)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-16 22:50 ` [PATCH 0/4] [RFC] " Marek Otahal
@ 2013-04-16 23:17   ` Mark Fasheh
  2013-04-17  4:00     ` Liu Bo
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-04-16 23:17 UTC (permalink / raw)
  To: Marek Otahal; +Cc: linux-btrfs

On Wed, Apr 17, 2013 at 12:50:04AM +0200, Marek Otahal wrote:
> could you
> compare (apart from online/offline) your implementation to LiuBo's work?,
> appeared on ML a while ago:
> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html

Well that's the primary difference. Liu Bo's patch requires a format change
also since it's done online. My patch requires no format change. So they're
complementary approaches in my opinion.

There's also the possibility that some other file systems could pick up the
ioctl. Ocfs2 in particular should be able to.


> It would be interesting if the two approaches could share some code, and
> also confirmation that using one technique does not disregard using the
> other in future.

Both features can exist together and probably should, I can see great uses
for both cases.

I haven't looked at the patches but with respect to code sharing I'll take a
look. My patches don't actually add any custom code for the actual "let's
de-dupe this extent" as I re-use the code from btrfs_ioctl_clone().
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-16 23:17   ` Mark Fasheh
@ 2013-04-17  4:00     ` Liu Bo
  0 siblings, 0 replies; 28+ messages in thread
From: Liu Bo @ 2013-04-17  4:00 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: Marek Otahal, linux-btrfs

On Tue, Apr 16, 2013 at 04:17:15PM -0700, Mark Fasheh wrote:
> On Wed, Apr 17, 2013 at 12:50:04AM +0200, Marek Otahal wrote:
> > could you
> > compare (apart from online/offline) your implementation to LiuBo's work?,
> > appeared on ML a while ago:
> > http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg23656.html
> 
> Well that's the primary difference. Liu Bo's patch requires a format change
> also since it's done online. My patch requires no format change. So they're
> complementary approaches in my opinion.
> 
> There's also the possibility that some other file systems could pick up the
> ioctl. Ocfs2 in particular should be able to.
> 
> 
> > It would be interesting if the two approaches could share some code, and
> > also confirmation that using one technique does not disregard using the
> > other in future.
> 
> Both features can exist together and probably should, I can see great uses
> for both cases.
> 
> I haven't looked at the patches but with respect to code sharing I'll take a
> look. My patches don't actually add any custom code for the actual "let's
> de-dupe this extent" as I re-use the code from btrfs_ioctl_clone().

In online dedup, I just make some changes in the write path; since we regard dedup as a
special kind of compression, doing dedup the same way as compression is the goal.

The difference is where the hash database lives: offline dedup puts it in userspace
while online dedup keeps it in the kernel.

Although there might be no code that can be shared here, I agree that both
the online and offline approaches are useful.

Good job.

thanks,
liubo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
                   ` (4 preceding siblings ...)
  2013-04-16 22:50 ` [PATCH 0/4] [RFC] " Marek Otahal
@ 2013-04-20 15:49 ` Gabriel de Perthuis
  2013-04-21 20:02   ` Mark Fasheh
  2013-05-07  7:33 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Gabriel de Perthuis
  2013-05-09 21:31 ` Gabriel de Perthuis
  7 siblings, 1 reply; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-04-20 15:49 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs

> Hi,
>
> The following series of patches implements in btrfs an ioctl to do
> offline deduplication of file extents.

I am a fan of this patch, the API is just right.  I just have a few 
tweaks to suggest to the argument checking.

At first the 1M limitation on the length was a bit inconvenient, but 
making repeated calls in userspace is okay and allows for better error 
recovery (for example, repeating the calls when only a portion of the 
ranges is identical).  The destination file appears to be fragmented 
into 1M extents, but these are contiguous so it's not really a problem.
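
For what it's worth, the chunked calls I mean look roughly like this (a
sketch only, reusing the hypothetical same_range() helper sketched in the
cover letter):

#define SAME_CHUNK	(1024 * 1024)	/* matches the current per-call cap */

/* Split a long range into <=1M calls; stop if the ioctl itself fails. */
static int same_range_chunked(int src_fd, uint64_t src_off,
			      int dest_fd, uint64_t dest_off, uint64_t len)
{
	while (len) {
		uint64_t chunk = len > SAME_CHUNK ? SAME_CHUNK : len;
		int ret = same_range(src_fd, src_off, chunk,
				     &dest_fd, &dest_off, 1);

		if (ret)
			return ret;
		src_off += chunk;
		dest_off += chunk;
		len -= chunk;
	}
	return 0;
}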

Requiring the offset or the length to align is spurious however; it 
doesn't translate to any alignment in the extent tree (especially with 
compression).  Requiring a minimum length of a few blocks or dropping 
the alignment condition entirely would make more sense.

I notice there is a restriction on cross-subvolume deduplication. 
Hopefully it can be lifted like it was done in 3.6 for the clone ioctl.

Deduplicating frozen subvolumes works fine, which is great for backups.

Basic integration with bedup, my offline deduplication tool, is in an 
experimental branch:

   https://github.com/g2p/bedup/tree/wip/dedup-syscall

Thanks to this, I look forward to shedding most of the caveats given in 
the bedup readme and some unnecessary subtleties in the code.

> I've made significant updates and changes from the original. In
> particular the structure passed is more fleshed out, this series has a
> high degree of code sharing between itself and the clone code, and the
> locking has been updated.
>
> The ioctl accepts a struct:
>
> struct btrfs_ioctl_same_args {
> 	__u64 logical_offset;	/* in - start of extent in source */
> 	__u64 length;		/* in - length of extent */
> 	__u16 total_files;	/* in - total elements in info array */

Nit: total_files sounds like it would count the source file.
dest_count would be better.

By the way, extent-same might be better named range-same, since there is 
no need for the input to fall on extent boundaries.

> 	__u16 files_deduped;	/* out - number of files that got deduped */
> 	__u32 reserved;
> 	struct btrfs_ioctl_same_extent_info info[0];
> };
>
> Userspace puts each duplicate extent (other than the source) in an
> item in the info array. As there can be multiple dedupes in one
> operation, each info item has its own status and 'bytes_deduped'
> member. This provides a number of benefits:
>
> - We don't have to fail the entire ioctl because one of the dedupes failed.
>
> - Userspace will always know how much progress was made on a file as we always
>    return the number of bytes deduped.
>
>
> #define BTRFS_SAME_DATA_DIFFERS	1
> /* For extent-same ioctl */
> struct btrfs_ioctl_same_extent_info {
> 	__s64 fd;		/* in - destination file */
> 	__u64 logical_offset;	/* in - start of extent in destination */
> 	__u64 bytes_deduped;	/* out - total # of bytes we were able
> 				 * to dedupe from this file */
> 	/* status of this dedupe operation:
> 	 * 0 if dedup succeeds
> 	 * < 0 for error
> 	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> 	 */
> 	__s32 status;		/* out - see above description */
> 	__u32 reserved;
> };


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-20 15:49 ` Gabriel de Perthuis
@ 2013-04-21 20:02   ` Mark Fasheh
  2013-04-22  8:56     ` Gabriel de Perthuis
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-04-21 20:02 UTC (permalink / raw)
  To: Gabriel de Perthuis; +Cc: linux-btrfs

On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:
>> Hi,
>>
>> The following series of patches implements in btrfs an ioctl to do
>> offline deduplication of file extents.
>
> I am a fan of this patch, the API is just right.  I just have a few tweaks 
> to suggest to the argument checking.

Awesome, thanks for looking this over!


> At first the 1M limitation on the length was a bit inconvenient, but making 
> repeated calls in userspace is okay and allows for better error recovery 
> (for example, repeating the calls when only a portion of the ranges is 
> identical).  The destination file appears to be fragmented into 1M extents, 
> but these are contiguous so it's not really a problem.

Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
instead of erroring, we just limit the dedupe to 1M. If you want to see what
I'm talking about, the patch is at the top of my tree now:

https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b

This way userspace doesn't have to guess at what size is the max, and we can
change it in the future, etc.

Furthermore, I'm thinking it might even be better for us to just internally
loop on the entire range asked for. That won't necessarily fix the issue
where we fragment into 1M extents, but it would ease the interface even
more.

My only concern with looping over a large range would be the (almost)
unbounded nature of the operation... For example, if someone passes in a 4
Gig range and 100 files to do that on, we could be in the ioctl for some
time.

The middle ground would be to loop like I was talking about but limit the
maximum length (by just truncating the value, as above). The limit in this
case would obviously be much larger than 1 megabyte but not so large that we
can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
so to start?


> Requiring the offset or the length to align is spurious however; it doesn't 
> translate to any alignment in the extent tree (especially with 
> compression).  Requiring a minimum length of a few blocks or dropping the 
> alignment condition entirely would make more sense.

I'll take a look at this. Some of those checks are there for my own sanity
at the moment.

I really like that the start offset should align but there's no reason that
length can't be aligned to blocksize internally.

Are you sure that extents don't have to start at block boundaries? If that's
the case and we never have to change the start offset (to align it) then we
could drop the check entirely.


> I notice there is a restriction on cross-subvolume deduplication. Hopefully 
> it can be lifted like it was done in 3.6 for the clone ioctl.

Ok if it was removed from clone then this might be a spurious check on my
part. Most of the real extent work in btrfs-same is done by the code from
the clone ioctl.


> Deduplicating frozen subvolumes works fine, which is great for backups.
>
> Basic integration with bedup, my offline deduplication tool, is in an 
> experimental branch:
>
>   https://github.com/g2p/bedup/tree/wip/dedup-syscall
>
> Thanks to this, I look forward to shedding most of the caveats given in the 
> bedup readme and some unnecessary subtleties in the code.

Again, I'm really glad this is working out for you :)

I'll check out your bedup patch early this week. It will be instructive to
see how another engineer uses the ioctl.


>> I've made significant updates and changes from the original. In
>> particular the structure passed is more fleshed out, this series has a
>> high degree of code sharing between itself and the clone code, and the
>> locking has been updated.
>>
>> The ioctl accepts a struct:
>>
>> struct btrfs_ioctl_same_args {
>> 	__u64 logical_offset;	/* in - start of extent in source */
>> 	__u64 length;		/* in - length of extent */
>> 	__u16 total_files;	/* in - total elements in info array */
>
> Nit: total_files sounds like it would count the source file.
> dest_count would be better.
>
> By the way, extent-same might be better named range-same, since there is no 
> need for the input to fall on extent boundaries.

I actually don't like a lot of the naming. Much of it is historic from
Josef's early patch and the rest of the names I came up with earlier in
development while I was still feeling out the problem. I'm definitely gonna
fix up total_files though - you're absolutely right that it's needlessly
confusing.


Thanks again for looking at this Gabriel. If you don't have any objections
I'll add you as CC to this series when I send new versions to the list from
now on.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] [RFC] btrfs: offline dedupe
  2013-04-21 20:02   ` Mark Fasheh
@ 2013-04-22  8:56     ` Gabriel de Perthuis
  0 siblings, 0 replies; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-04-22  8:56 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs

> On Sat, Apr 20, 2013 at 05:49:25PM +0200, Gabriel de Perthuis wrote:
>>> Hi,
>>>
>>> The following series of patches implements in btrfs an ioctl to do
>>> offline deduplication of file extents.
>>
>> I am a fan of this patch, the API is just right.  I just have a few tweaks
>> to suggest to the argument checking.
>
> Awesome, thanks for looking this over!
>
>> At first the 1M limitation on the length was a bit inconvenient, but making
>> repeated calls in userspace is okay and allows for better error recovery
>> (for example, repeating the calls when only a portion of the ranges is
>> identical).  The destination file appears to be fragmented into 1M extents,
>> but these are contiguous so it's not really a problem.
>
> Yeah I agree it's a bit inconvenient. To that end, I fixed things up so that
> instead of erroring, we just limit the dedupe to 1M. If you want to see what
> I'm talking about, the patch is at the top of my tree now:
>
> https://github.com/markfasheh/btrfs-extent-same/commit/b39f93c2e78385ceea850b59edbd759120543a8b
>
> This way userspace doesn't have to guess at what size is the max, and we can
> change it in the future, etc.
>
> Furthermore, I'm thinking it might even be better for us to just internally
> loop on the entire range asked for. That won't necessarily fix the issue
> where we fragment into 1M extents, but it would ease the interface even
> more.
>
> My only concern with looping over a large range would be the (almost)
> unbounded nature of the operation... For example, if someone passes in a 4
> Gig range and 100 files to do that on, we could be in the ioctl for some
> time.
>
> The middle ground would be to loop like I was talking about but limit the
> maximum length (by just truncating the value, as above). The limit in this
> case would obviously be much larger than 1 megabyte but not so large that we
> can go off for an extreme amount of time. I'm thinking maybe 16 megabytes or
> so to start?

A cursor-style API could work here: make the offset and length 
parameters in/out, exit early in case of error or after the read quota 
has been used up.

The caller can retry as long as the length is nonzero (and at least one 
block), and the syscall will return frequently enough that it won't 
block an unmount operation or concurrent access to the ranges.
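
Purely as a hypothetical sketch, driving such a cursor-style variant from
userspace could look like the loop below; none of the in/out behaviour it
assumes exists in the posted patches:

static int same_range_cursor(int src_fd, struct btrfs_ioctl_same_args *args)
{
	/* Hypothetical: relies on the kernel advancing logical_offset and
	 * shrinking length after each partial pass. */
	while (args->length > 0) {
		if (ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args) < 0)
			return -1;
		if (args->info[0].status != 0)
			break;	/* error or data differs; offsets show progress */
	}
	return 0;
}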

>> Requiring the offset or the length to align is spurious however; it doesn't
>> translate to any alignment in the extent tree (especially with
>> compression).  Requiring a minimum length of a few blocks or dropping the
>> alignment condition entirely would make more sense.
>
> I'll take a look at this. Some of those checks are there for my own sanity
> at the moment.
>
> I really like that the start offset should align but there's no reason that
> length can't be aligned to blocksize internally.
>
> Are you sure that extents don't have to start at block boundaries? If that's
> the case and we never have to change the start offset (to align it) then we
> could drop the check entirely.

I've had a look, and btrfs fiemap only sets FIEMAP_EXTENT_NOT_ALIGNED 
for inline extents, so the alignment requirement makes sense.  The 
caller should do the alignment and decide if it wants to extend a bit 
and accept a not-same status or shrink a bit, so just keep it as is and 
maybe add explanatory comments.
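
Caller-side, the "shrink a bit" option is only a few lines; a sketch,
assuming bs is the filesystem block size and a power of two:

/* Shrink [*off, *off + *len) to block boundaries before issuing the ioctl. */
static void shrink_to_blocks(uint64_t bs, uint64_t *off, uint64_t *len)
{
	uint64_t end = (*off + *len) & ~(bs - 1);	/* round end down */
	uint64_t start = (*off + bs - 1) & ~(bs - 1);	/* round start up */

	*off = start;
	*len = end > start ? end - start : 0;
}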

>> I notice there is a restriction on cross-subvolume deduplication. Hopefully
>> it can be lifted like it was done in 3.6 for the clone ioctl.
>
> Ok if it was removed from clone then this might be a spurious check on my
> part. Most of the real extent work in btrfs-same is done by the code from
> the clone ioctl.

Good to have this code shared (compression support is another win). 
bedup will need feature parity to switch from one ioctl to the other.

>> Deduplicating frozen subvolumes works fine, which is great for backups.
>>
>> Basic integration with bedup, my offline deduplication tool, is in an
>> experimental branch:
>>
>>    https://github.com/g2p/bedup/tree/wip/dedup-syscall
>>
>> Thanks to this, I look forward to shedding most of the caveats given in the
>> bedup readme and some unnecessary subtleties in the code.
>
> Again, I'm really glad this is working out for you :)
>
> I'll check out your bedup patch early this week. It will be instructive to
> see how another engineer uses the ioctl.

See ranges_same and dedup_fileset.  The ImmutableFDs stuff can be 
removed and the fact that dedup can be partially successful over a range 
will ripple through.

>>> I've made significant updates and changes from the original. In
>>> particular the structure passed is more fleshed out, this series has a
>>> high degree of code sharing between itself and the clone code, and the
>>> locking has been updated.
>>>
>>> The ioctl accepts a struct:
>>>
>>> struct btrfs_ioctl_same_args {
>>> 	__u64 logical_offset;	/* in - start of extent in source */
>>> 	__u64 length;		/* in - length of extent */
>>> 	__u16 total_files;	/* in - total elements in info array */
>>
>> Nit: total_files sounds like it would count the source file.
>> dest_count would be better.
>>
>> By the way, extent-same might be better named range-same, since there is no
>> need for the input to fall on extent boundaries.
>
> I actually don't like a lot of the naming. Much of it is historic from
> Josef's early patch and the rest of the names I came up with earlier in
> development while I was still feeling out the problem. I'm definitely gonna
> fix up total_files though - you're absolutely right that it's needlessly
> confusing.

> Thanks again for looking at this Gabriel. If you don't have any objections
> I'll add you as CC to this series when I send new versions to the list from
> now on.
> 	--Mark

Of course, thanks!

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-04-16 22:15 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-05-06 12:36   ` David Sterba
  0 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2013-05-06 12:36 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Josef Bacik, chris.mason

On Tue, Apr 16, 2013 at 03:15:35PM -0700, Mark Fasheh wrote:
> +static void btrfs_double_lock(struct inode *inode1, u64 loff1,
> +			      struct inode *inode2, u64 loff2, u64 len)
> +{
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode1, loff1, len);
> +		lock_extent_range(inode2, loff2, len);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode2, loff2, len);
> +		lock_extent_range(inode1, loff1, len);
> +	}

You can decrease the code size by swapping just the pointers.
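
Untested, but roughly (the swap() macro keeps it short):

static void btrfs_double_lock(struct inode *inode1, u64 loff1,
			      struct inode *inode2, u64 loff2, u64 len)
{
	/* Order by inode pointer once, then take the locks a single time. */
	if (inode1 > inode2) {
		swap(inode1, inode2);
		swap(loff1, loff2);
	}
	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
	mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
	lock_extent_range(inode1, loff1, len);
	lock_extent_range(inode2, loff2, len);
}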

> +}
> +
> +static long btrfs_ioctl_file_extent_same(struct file *file,
> +					 void __user *argp)
> +{
> +	struct btrfs_ioctl_same_args *args;
> +	struct btrfs_ioctl_same_args tmp;
> +	struct btrfs_ioctl_same_extent_info *info;
> +	struct inode *src = file->f_dentry->d_inode;
> +	struct file *dst_file = NULL;
> +	struct inode *dst;
> +	u64 off;
> +	u64 len;
> +	int args_size;
> +	int i;
> +	int ret;
> +	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> +
> +	if (copy_from_user(&tmp,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   sizeof(tmp)))
> +		return -EFAULT;
> +
> +	args_size = sizeof(tmp) + (tmp.total_files *
> +			sizeof(struct btrfs_ioctl_same_extent_info));
> +
> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE)
> +		return -ENOMEM;

Using E2BIG            7      /* Argument list too long */
makes more sense to me; it's not really an ENOMEM condition.

> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args)
> +		return -ENOMEM;

(like here)

> +
> +	ret = -EFAULT;
> +		if (BTRFS_I(dst)->root != BTRFS_I(src)->root) {
> +			printk(KERN_ERR "btrfs: cannot dedup across subvolumes"
> +			       " %lld\n", info->fd);
> +			goto next;
> +		}
...
> +		info->status = btrfs_extent_same(src, off, len, dst,
> +						 info->logical_offset);
> +		if (info->status == 0) {
> +			info->bytes_deduped = len;
> +			args->files_deduped++;
> +		} else {
> +			printk(KERN_ERR "error %d from btrfs_extent_same\n",

missing "btrfs:" prefix

> +				info->status);
> +		}
> +next:

> --- a/fs/btrfs/ioctl.h
> +++ b/fs/btrfs/ioctl.h
> +/* For extent-same ioctl */
> +struct btrfs_ioctl_same_extent_info {
> +	__s64 fd;		/* in - destination file */
> +	__u64 logical_offset;	/* in - start of extent in destination */
> +	__u64 bytes_deduped;	/* out - total # of bytes we were able
> +				 * to dedupe from this file */
> +	/* status of this dedupe operation:
> +	 * 0 if dedup succeeds
> +	 * < 0 for error
> +	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> +	 */
> +	__s32 status;		/* out - see above description */
> +	__u32 reserved;
> +};
> +
> +struct btrfs_ioctl_same_args {
> +	__u64 logical_offset;	/* in - start of extent in source */
> +	__u64 length;		/* in - length of extent */
> +	__u16 total_files;	/* in - total elements in info array */
> +	__u16 files_deduped;	/* out - number of files that got deduped */
> +	__u32 reserved;

Please add a few more reserved bytes here; we may want to enhance the
call with some fine tunables or extended status. This is an external
interface, so we don't need to count every byte here, and the extra room
makes minor future enhancements easier.

> +	struct btrfs_ioctl_same_extent_info info[0];
> +};
> +
>  struct btrfs_ioctl_space_info {
>  	__u64 flags;
>  	__u64 total_bytes;
> @@ -498,5 +523,6 @@ struct btrfs_ioctl_send_args {
>  				      struct btrfs_ioctl_get_dev_stats)
>  #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
>  				    struct btrfs_ioctl_dev_replace_args)
> -
> +#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
> +					 struct btrfs_ioctl_same_args)

Feel free to claim the ioctl number at

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Development_notes.2C_please_read

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()
  2013-04-16 22:15 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Mark Fasheh
@ 2013-05-06 12:38   ` David Sterba
  0 siblings, 0 replies; 28+ messages in thread
From: David Sterba @ 2013-05-06 12:38 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Josef Bacik, chris.mason

On Tue, Apr 16, 2013 at 03:15:34PM -0700, Mark Fasheh wrote:
> @@ -2625,7 +2625,7 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
>  	}
>  
>  	end = page_end;
> -	while (1) {
> +	while (1 && !parent_locked) {

the patch is ok, just this caught my eye :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
                   ` (5 preceding siblings ...)
  2013-04-20 15:49 ` Gabriel de Perthuis
@ 2013-05-07  7:33 ` Gabriel de Perthuis
  2013-05-09 21:31 ` Gabriel de Perthuis
  7 siblings, 0 replies; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-05-07  7:33 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs

> We want this for btrfs_extent_same. Basically readpage and friends do their
> own extent locking but for the purposes of dedupe, we want to have both
> files locked down across a set of readpage operations (so that we can
> compare data). Introduce this variant and a flag which can be set for
> extent_read_full_page() to indicate that we are already locked.

This one can get stuck in TASK_UNINTERRUPTIBLE:

[32129.522257] SysRq : Show Blocked State
[32129.524337]   task                        PC stack   pid father
[32129.526515] python          D ffff88021f394280     0 16281      1 0x00000004
[32129.528656]  ffff88020e079a48 0000000000000082 ffff88013d3cdd40 ffff88020e079fd8
[32129.530840]  ffff88020e079fd8 ffff88020e079fd8 ffff8802138dc5f0 ffff88013d3cdd40
[32129.533044]  0000000000000000 0000000000001fff ffff88015286f440 0000000000000008
[32129.535285] Call Trace:
[32129.537522]  [<ffffffff816dcca9>] schedule+0x29/0x70
[32129.539829]  [<ffffffffa02b4908>] wait_extent_bit+0xf8/0x150 [btrfs]
[32129.542130]  [<ffffffff8107ea00>] ? finish_wait+0x80/0x80
[32129.544463]  [<ffffffffa02b4f84>] lock_extent_bits+0x44/0xa0 [btrfs]
[32129.546824]  [<ffffffffa02b4ff3>] lock_extent+0x13/0x20 [btrfs]
[32129.549198]  [<ffffffffa02dc0cf>] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
[32129.551602]  [<ffffffffa02dccfc>] btrfs_submit_compressed_read+0x25c/0x4c0 [btrfs]
[32129.554028]  [<ffffffffa029d131>] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
[32129.556457]  [<ffffffffa02b2d07>] submit_one_bio+0x67/0xa0 [btrfs]
[32129.558899]  [<ffffffffa02b7ecd>] extent_read_full_page_nolock+0x4d/0x60 [btrfs]
[32129.561290]  [<ffffffffa02c8052>] fill_data+0xb2/0x230 [btrfs]
[32129.563623]  [<ffffffffa02cd57e>] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
[32129.565924]  [<ffffffff816ddbae>] ? _raw_spin_lock+0xe/0x20
[32129.568207]  [<ffffffff8119b907>] ? inode_get_bytes+0x47/0x60
[32129.570472]  [<ffffffff811a8297>] do_vfs_ioctl+0x97/0x560
[32129.572700]  [<ffffffff8119bb5a>] ? sys_newfstat+0x2a/0x40
[32129.574882]  [<ffffffff811a87f1>] sys_ioctl+0x91/0xb0
[32129.577008]  [<ffffffff816e64dd>] system_call_fastpath+0x1a/0x1f

Side note, I wish btrfs used TASK_KILLABLE[1] instead.

[1]: https://lwn.net/Articles/288056/
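
Purely as an illustration of the difference (generic kernel idiom, not
actual btrfs code):

	/* sketch: sleep until 'condition' becomes true or a fatal signal
	 * arrives; TASK_UNINTERRUPTIBLE sleeps cannot be broken this way */
	ret = wait_event_killable(wq, condition);
	if (ret)	/* -ERESTARTSYS: the task received a fatal signal */
		return ret;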


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock()
  2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
                   ` (6 preceding siblings ...)
  2013-05-07  7:33 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Gabriel de Perthuis
@ 2013-05-09 21:31 ` Gabriel de Perthuis
  7 siblings, 0 replies; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-05-09 21:31 UTC (permalink / raw)
  To: linux-btrfs

>> We want this for btrfs_extent_same. Basically readpage and friends do their
>> own extent locking but for the purposes of dedupe, we want to have both
>> files locked down across a set of readpage operations (so that we can
>> compare data). Introduce this variant and a flag which can be set for
>> extent_read_full_page() to indicate that we are already locked.
> 
> This one can get stuck in TASK_UNINTERRUPTIBLE:
> 
> [32129.522257] SysRq : Show Blocked State
> [32129.524337]   task                        PC stack   pid father
> [32129.526515] python          D ffff88021f394280     0 16281      1 0x00000004
> [32129.528656]  ffff88020e079a48 0000000000000082 ffff88013d3cdd40 ffff88020e079fd8
> [32129.530840]  ffff88020e079fd8 ffff88020e079fd8 ffff8802138dc5f0 ffff88013d3cdd40
> [32129.533044]  0000000000000000 0000000000001fff ffff88015286f440 0000000000000008
> [32129.535285] Call Trace:
> [32129.537522]  [<ffffffff816dcca9>] schedule+0x29/0x70
> [32129.539829]  [<ffffffffa02b4908>] wait_extent_bit+0xf8/0x150 [btrfs]
> [32129.542130]  [<ffffffff8107ea00>] ? finish_wait+0x80/0x80
> [32129.544463]  [<ffffffffa02b4f84>] lock_extent_bits+0x44/0xa0 [btrfs]
> [32129.546824]  [<ffffffffa02b4ff3>] lock_extent+0x13/0x20 [btrfs]
> [32129.549198]  [<ffffffffa02dc0cf>] add_ra_bio_pages.isra.8+0x17f/0x2d0 [btrfs]
> [32129.551602]  [<ffffffffa02dccfc>] btrfs_submit_compressed_read+0x25c/0x4c0 [btrfs]
> [32129.554028]  [<ffffffffa029d131>] btrfs_submit_bio_hook+0x1d1/0x1e0 [btrfs]
> [32129.556457]  [<ffffffffa02b2d07>] submit_one_bio+0x67/0xa0 [btrfs]
> [32129.558899]  [<ffffffffa02b7ecd>] extent_read_full_page_nolock+0x4d/0x60 [btrfs]
> [32129.561290]  [<ffffffffa02c8052>] fill_data+0xb2/0x230 [btrfs]
> [32129.563623]  [<ffffffffa02cd57e>] btrfs_ioctl+0x1f7e/0x2560 [btrfs]
> [32129.565924]  [<ffffffff816ddbae>] ? _raw_spin_lock+0xe/0x20
> [32129.568207]  [<ffffffff8119b907>] ? inode_get_bytes+0x47/0x60
> [32129.570472]  [<ffffffff811a8297>] do_vfs_ioctl+0x97/0x560
> [32129.572700]  [<ffffffff8119bb5a>] ? sys_newfstat+0x2a/0x40
> [32129.574882]  [<ffffffff811a87f1>] sys_ioctl+0x91/0xb0
> [32129.577008]  [<ffffffff816e64dd>] system_call_fastpath+0x1a/0x1f

For anyone trying those patches, there's a fix here:
https://github.com/g2p/linux/tree/v3.9%2Bbtrfs-extent-same


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-08-06 18:42 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-08-06 19:11   ` Zach Brown
  0 siblings, 0 replies; 28+ messages in thread
From: Zach Brown @ 2013-08-06 19:11 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-btrfs, Josef Bacik, Chris Mason, Gabriel de Perthuis, David Sterba

On Tue, Aug 06, 2013 at 11:42:51AM -0700, Mark Fasheh wrote:
> This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
> de-duplicate a list of extents across a range of files.

This addresses all my previous issues (save the interface that we've
agreed to disagree on).  Thanks for taking the time to fix it up.

Reviewed-by: Zach Brown <zab@redhat.com>

- z

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] btrfs: offline dedupe
  2013-08-06 18:42 [PATCH 0/4] btrfs: out-of-band (aka offline) dedupe v4 Mark Fasheh
@ 2013-08-06 18:42 ` Mark Fasheh
  2013-08-06 19:11   ` Zach Brown
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-08-06 18:42 UTC (permalink / raw)
  To: linux-btrfs, Josef Bacik
  Cc: Chris Mason, Gabriel de Perthuis, David Sterba, Zach Brown, Mark Fasheh

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c           | 278 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  27 +++++
 2 files changed, 305 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e90c519..7db2ee6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2456,6 +2459,34 @@ out:
 	return ret;
 }
 
+static struct page *extent_same_get_page(struct inode *inode, u64 off)
+{
+	struct page *page;
+	pgoff_t index;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	index = off >> PAGE_CACHE_SHIFT;
+
+	page = grab_cache_page(inode->i_mapping, index);
+	if (!page)
+		return NULL;
+
+	if (!PageUptodate(page)) {
+		if (extent_read_full_page_nolock(tree, page, btrfs_get_extent,
+						 0))
+			return NULL;
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
+	}
+	unlock_page(page);
+
+	return page;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2476,6 +2507,251 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		swap(inode1, inode2);
+		swap(loff1, loff2);
+	}
+
+	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+	lock_extent_range(inode1, loff1, len);
+	if (inode1 != inode2) {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+	}
+}
+
+static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
+			  u64 dst_loff, u64 len)
+{
+	int ret = 0;
+	struct page *src_page, *dst_page;
+	unsigned int cmp_len = PAGE_CACHE_SIZE;
+	void *addr, *dst_addr;
+
+	while (len) {
+		if (len < PAGE_CACHE_SIZE)
+			cmp_len = len;
+
+		src_page = extent_same_get_page(src, loff);
+		if (!src_page)
+			return -EINVAL;
+		dst_page = extent_same_get_page(dst, dst_loff);
+		if (!dst_page) {
+			page_cache_release(src_page);
+			return -EINVAL;
+		}
+		addr = kmap_atomic(src_page);
+		dst_addr = kmap_atomic(dst_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dst_page);
+
+		if (memcmp(addr, dst_addr, cmp_len))
+			ret = BTRFS_SAME_DATA_DIFFERS;
+
+		kunmap_atomic(addr);
+		kunmap_atomic(dst_addr);
+		page_cache_release(src_page);
+		page_cache_release(dst_page);
+
+		if (ret)
+			break;
+
+		loff += cmp_len;
+		dst_loff += cmp_len;
+		len -= cmp_len;
+	}
+
+	return ret;
+}
+
+static int extent_same_check_offsets(struct inode *inode, u64 off, u64 len)
+{
+	u64 bs = BTRFS_I(inode)->root->fs_info->sb->s_blocksize;
+
+	if (off + len > inode->i_size || off + len < off)
+		return -EINVAL;
+	/* Check that we are block aligned - btrfs_clone() requires this */
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff)
+{
+	int ret;
+
+	/*
+	 * btrfs_clone() can't handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = extent_same_check_offsets(src, loff, len);
+	if (ret)
+		goto out_unlock;
+
+	ret = extent_same_check_offsets(dst, dst_loff, len);
+	if (ret)
+		goto out_unlock;
+
+	/* don't make the dst file partly checksummed */
+	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+	    (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = btrfs_cmp_data(src, loff, dst, dst_loff, len);
+	if (ret == 0)
+		ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+out_unlock:
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	return ret;
+}
+
+#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args = argp;
+	struct btrfs_ioctl_same_args same;
+	struct btrfs_ioctl_same_extent_info info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(&same,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(same))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	off = same.logical_offset;
+	len = same.length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation.
+	 * This is intended to bound the total time spent in this
+	 * ioctl to something sane.
+	 */
+	if (len > BTRFS_MAX_DEDUPE_LEN)
+		len = BTRFS_MAX_DEDUPE_LEN;
+
+	if (WARN_ON_ONCE(bs < PAGE_CACHE_SIZE)) {
+		/*
+		 * Btrfs does not support blocksize < page_size. As a
+		 * result, btrfs_cmp_data() won't correctly handle
+		 * this situation without an update.
+		 */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = -EACCES;
+	if (!S_ISREG(src->i_mode))
+		goto out;
+
+	ret = 0;
+	for (i = 0; i < same.dest_count; i++) {
+		if (copy_from_user(&info, &args->info[i], sizeof(info))) {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		info.bytes_deduped = 0;
+
+		dst_file = fget(info.fd);
+		if (!dst_file) {
+			info.status = -EBADF;
+			goto next;
+		}
+
+		if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info.status = -EINVAL;
+			goto next;
+		}
+
+		info.status = -EXDEV;
+		if (file->f_path.mnt != dst_file->f_path.mnt)
+			goto next;
+
+		dst = dst_file->f_dentry->d_inode;
+		if (src->i_sb != dst->i_sb)
+			goto next;
+
+		if (S_ISDIR(dst->i_mode)) {
+			info.status = -EISDIR;
+			goto next;
+		}
+
+		if (!S_ISREG(dst->i_mode)) {
+			info.status = -EACCES;
+			goto next;
+		}
+
+		info.status = btrfs_extent_same(src, off, len, dst,
+						info.logical_offset);
+		if (info.status == 0)
+			info.bytes_deduped += len;
+
+next:
+		if (dst_file)
+			fput(dst_file);
+
+		if (__put_user_unaligned(info.status, &args->info[i].status) ||
+		    __put_user_unaligned(info.bytes_deduped,
+					 &args->info[i].bytes_deduped)) {
+			ret = -EFAULT;
+			goto out;
+		}                                                               
+	}
+
+out:
+	mnt_drop_write_file(file);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4151,6 +4427,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index fa3a5f9..5465bc2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -305,6 +305,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -510,5 +535,7 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-07-26 16:30 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-07-26 22:09   ` Zach Brown
  0 siblings, 0 replies; 28+ messages in thread
From: Zach Brown @ 2013-07-26 22:09 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-btrfs, Josef Bacik, Chris Mason, Gabriel de Perthuis, David Sterba

> +static struct page *extent_same_get_page(struct inode *inode, u64 off)
> +{
> +	struct page *page;
> +	pgoff_t index;
> +	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +
> +	index = off >> PAGE_CACHE_SHIFT;
> +
> +	page = grab_cache_page(inode->i_mapping, index);
> +	if (!page)
> +		return NULL;
> +
> +	if (!PageUptodate(page)) {
> +		extent_read_full_page_nolock(tree, page, btrfs_get_extent, 0);

Do we need to test for errors from extent_read_full_page_nolock()?

> +static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
> +			  u64 dst_loff, u64 len)
> +{
> +	int ret = 0;
> +	struct page *src_page, *dst_page;
> +	unsigned int cmp_len = PAGE_CACHE_SIZE;
> +	void *addr, *dst_addr;
> +
> +	while (len) {
> +		if (len < PAGE_CACHE_SIZE)
> +			cmp_len = len;
> +
> +		src_page = extent_same_get_page(src, loff);
> +		if (!src_page)
> +			return -EINVAL;
> +		dst_page = extent_same_get_page(dst, dst_loff);
> +		if (!dst_page) {
> +			page_cache_release(src_page);
> +			return -EINVAL;
> +		}
> +		addr = kmap(src_page);
> +		dst_addr = kmap(dst_page);

Acquiring multiple kmaps can deadlock if you get enough tasks holding
the first kmap while their second kmap blocks waiting for free kmap
slots.  Just use kmap_atomic().  It has enough static slots for task
context to grab two.
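
Roughly this pattern (sketch):

		addr = kmap_atomic(src_page);
		dst_addr = kmap_atomic(dst_page);

		flush_dcache_page(src_page);
		flush_dcache_page(dst_page);

		if (memcmp(addr, dst_addr, cmp_len))
			ret = BTRFS_SAME_DATA_DIFFERS;

		/* release the atomic kmaps in reverse order */
		kunmap_atomic(dst_addr);
		kunmap_atomic(addr);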

> +
> +		flush_dcache_page(src_page);
> +		flush_dcache_page(dst_page);
> +
> +		if (memcmp(addr, dst_addr, cmp_len))
> +			ret = BTRFS_SAME_DATA_DIFFERS;

Might as well add the sub-page offset from the file offset so that this
can't silently corrupt data if someone starts trying to get block_size <
page_size patches working.  Or, I guess, add a WARN_ON error path up in
the caller where block alignment is tested?
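
I.e. something like (sketch only -- cmp_len would also need clamping at
page boundaries):

		if (memcmp(addr + offset_in_page(loff),
			   dst_addr + offset_in_page(dst_loff), cmp_len))
			ret = BTRFS_SAME_DATA_DIFFERS;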

> +	args_size = sizeof(tmp) + (tmp.dest_count *
> +			sizeof(struct btrfs_ioctl_same_extent_info));
> +
> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE) {
> +		ret = -E2BIG;
> +		goto out_drop_write;
> +	}
> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args) {
> +		ret = -ENOMEM;
> +		goto out_drop_write;
> +	}
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(args,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   args_size))
> +		goto out;

None of this is needed, and getting rid of that arg size limit would be
nice.  Copy into and store from on-stack structs as they're used.

> +	/* Make sure args didn't change magically between copies. */
> +	if (memcmp(&tmp, args, sizeof(tmp)))
> +		goto out;
> +
> +	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
> +		goto out;

This should be deleted.  This magical change can happen after this
second copy and test, and if it does... who cares?

> +	/* pre-format 'out' fields to sane default values */
> +	for (i = 0; i < args->dest_count; i++) {
> +		info = &args->info[i];
> +		info->bytes_deduped = 0;
> +		info->status = 0;
> +	}

No need, just copy the output fields as they're discovered.  Because the
copies to userspace can partially fail, userspace can't trust the output
fields if the syscall returns an error code anyway.

> +
> +	off = args->logical_offset;
> +	len = args->length;
> +
> +	/*
> +	 * Limit the total length we will dedupe for each operation. 
> +	 * This is intended to bound the entire ioctl to something sane.
> +	 */
> +	if (len > BTRFS_MAX_DEDUPE_LEN)
> +		len = BTRFS_MAX_DEDUPE_LEN;

The comment doesn't really explain *why* it wants to bound the entire
ioctl, given that the ioctl can lock and clone in chunks.  Are callers
expected to notice truncated dedups and fix up and resubmit the
remainder of their extents? 

Is this just a leftover from the allocated temporary comparison buffer
that we can now remove? 

> +	ret = -EISDIR;
> +	if (S_ISDIR(src->i_mode))
> +		goto out;

Doesn't this have the same ISLNK problem that the destination file had?
Shouldn't both tests be !S_ISREG()?

> +	ret = 0;
> +	for (i = 0; i < args->dest_count; i++) {
> +		u64 dest_off;
> +
> +		info = &args->info[i];

		if (copy_from_user(&info, &args->info[i], sizeof(info))) {
			ret = -EFAULT;
			goto out;
		}

> +		if (S_ISDIR(dst->i_mode)) {
> +			info->status = -EISDIR;
> +			goto next;
> +		}
> +
> +		if (S_ISLNK(dst->i_mode)) {
> +			info->status = -EACCES;
> +			goto next;
> +		}

( !S_ISREG() )

> +
> +		info->status = -EINVAL;
> +		/* don't make the dst file partly checksummed */
> +		if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
> +		    (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
> +			goto next;
> +
> +		dest_off = info->logical_offset;
> +		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
> +			goto next;
> +		if (!IS_ALIGNED(dest_off, bs))
> +			goto next;

It just occurred to me: shouldn't these be tested under i_mutex?  Can't
altering the NODATASUM flags race after the test but before extent_same
gets the locks?  (Similar with the i_size test, I suppose.)
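
E.g. redo the checks once both locks are held, something like (sketch;
extent_same_check_offsets() here stands for a helper doing the
i_size/alignment checks):

	btrfs_double_lock(src, loff, dst, dst_loff, len);

	ret = extent_same_check_offsets(src, loff, len);
	if (ret)
		goto out_unlock;

	ret = extent_same_check_offsets(dst, dst_loff, len);
	if (ret)
		goto out_unlock;

	/* don't make the dst file partly checksummed */
	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
	    (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM)) {
		ret = -EINVAL;
		goto out_unlock;
	}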

> +		info->status = btrfs_extent_same(src, off, len, dst, dest_off);
> +		if (info->status == 0) {
> +			info->bytes_deduped += len;
> +		} else
> +			break;	

		if (__put_user_unaligned(info.status, &args->info[i].status) ||
		    __put_user_unaligned(info.bytes_deduped, &args->info[i].bytes_deduped)) {
			ret = -EFAULT;
			goto out;
		}

> +
> +	if (copy_to_user(argp, args, args_size))
> +		ret = -EFAULT;
> +
> +out:
> +	if (dst_file)
> +		fput(dst_file);
> +	kfree(args);

Copying to and from the stack as needed gets rid of this final copy and
free.

> +#define BTRFS_SAME_DATA_DIFFERS	1
> +/* For extent-same ioctl */
> +struct btrfs_ioctl_same_extent_info {
> +	__s64 fd;		/* in - destination file */
> +	__u64 logical_offset;	/* in - start of extent in destination */
> +	__u64 bytes_deduped;	/* out - total # of bytes we were able
> +				 * to dedupe from this file */
> +	/* status of this dedupe operation:
> +	 * 0 if dedup succeeds
> +	 * < 0 for error
> +	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> +	 */
> +	__s32 status;		/* out - see above description */
> +	__u32 reserved;
> +};

(I still think the output fields are more complexity than is justified,
but I'm not going to push it if Josef and Chris are fine with it.)

- z

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] btrfs: offline dedupe
  2013-07-26 16:30 [PATCH 0/4] btrfs: offline dedupe v3 Mark Fasheh
@ 2013-07-26 16:30 ` Mark Fasheh
  2013-07-26 22:09   ` Zach Brown
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-07-26 16:30 UTC (permalink / raw)
  To: linux-btrfs, Josef Bacik
  Cc: Chris Mason, Gabriel de Perthuis, David Sterba, Zach Brown, Mark Fasheh

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c           | 283 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  27 +++++
 2 files changed, 310 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e90c519..09ca76d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2456,6 +2459,32 @@ out:
 	return ret;
 }
 
+static struct page *extent_same_get_page(struct inode *inode, u64 off)
+{
+	struct page *page;
+	pgoff_t index;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	index = off >> PAGE_CACHE_SHIFT;
+
+	page = grab_cache_page(inode->i_mapping, index);
+	if (!page)
+		return NULL;
+
+	if (!PageUptodate(page)) {
+		extent_read_full_page_nolock(tree, page, btrfs_get_extent, 0);
+		lock_page(page);
+		if (!PageUptodate(page)) {
+			unlock_page(page);
+			page_cache_release(page);
+			return NULL;
+		}
+		unlock_page(page);
+	}
+
+	return page;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2476,6 +2505,258 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		swap(inode1, inode2);
+		swap(loff1, loff2);
+	}
+
+	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+	lock_extent_range(inode1, loff1, len);
+	if (inode1 != inode2) {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+	}
+}
+
+static int btrfs_cmp_data(struct inode *src, u64 loff, struct inode *dst,
+			  u64 dst_loff, u64 len)
+{
+	int ret = 0;
+	struct page *src_page, *dst_page;
+	unsigned int cmp_len = PAGE_CACHE_SIZE;
+	void *addr, *dst_addr;
+
+	while (len) {
+		if (len < PAGE_CACHE_SIZE)
+			cmp_len = len;
+
+		src_page = extent_same_get_page(src, loff);
+		if (!src_page)
+			return -EINVAL;
+		dst_page = extent_same_get_page(dst, dst_loff);
+		if (!dst_page) {
+			page_cache_release(src_page);
+			return -EINVAL;
+		}
+		addr = kmap(src_page);
+		dst_addr = kmap(dst_page);
+
+		flush_dcache_page(src_page);
+		flush_dcache_page(dst_page);
+
+		if (memcmp(addr, dst_addr, cmp_len))
+			ret = BTRFS_SAME_DATA_DIFFERS;
+
+		kunmap(src_page);
+		kunmap(dst_page);
+		page_cache_release(src_page);
+		page_cache_release(dst_page);
+
+		if (ret)
+			break;
+
+		loff += cmp_len;
+		dst_loff += cmp_len;
+		len -= cmp_len;
+	}
+
+	return ret;
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff)
+{
+	int ret;
+
+	/*
+	 * btrfs_clone() can't handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = btrfs_cmp_data(src, loff, dst, dst_loff, len);
+	if (ret == 0)
+		ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	return ret;
+}
+
+#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct btrfs_ioctl_same_extent_info *info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int args_size;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	bool is_admin = capable(CAP_SYS_ADMIN);
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp))) {
+		ret = -EFAULT;
+		goto out_drop_write;
+	}
+
+	args_size = sizeof(tmp) + (tmp.dest_count *
+			sizeof(struct btrfs_ioctl_same_extent_info));
+
+	/* Keep size of ioctl argument sane */
+	if (args_size > PAGE_CACHE_SIZE) {
+		ret = -E2BIG;
+		goto out_drop_write;
+	}
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args) {
+		ret = -ENOMEM;
+		goto out_drop_write;
+	}
+
+	ret = -EFAULT;
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		goto out;
+	/* Make sure args didn't change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp)))
+		goto out;
+
+	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
+		goto out;
+
+	/* pre-format 'out' fields to sane default values */
+	for (i = 0; i < args->dest_count; i++) {
+		info = &args->info[i];
+		info->bytes_deduped = 0;
+		info->status = 0;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation. 
+	 * This is intended to bound the entire ioctl to something sane.
+	 */
+	if (len > BTRFS_MAX_DEDUPE_LEN)
+		len = BTRFS_MAX_DEDUPE_LEN;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out;
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		goto out;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = 0;
+	for (i = 0; i < args->dest_count; i++) {
+		u64 dest_off;
+
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			info->status = -EBADF;
+			continue;
+		}
+
+		if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
+			info->status = -EINVAL;
+			goto next;
+		}
+
+		info->status = -EXDEV;
+		if (file->f_path.mnt != dst_file->f_path.mnt)
+			goto next;
+
+		dst = dst_file->f_dentry->d_inode;
+		if (src->i_sb != dst->i_sb)
+			goto next;
+
+		if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+			goto next;
+		}
+
+		if (S_ISLNK(dst->i_mode)) {
+			info->status = -EACCES;
+			goto next;
+		}
+
+		info->status = -EINVAL;
+		/* don't make the dst file partly checksummed */
+		if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+		    (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
+			goto next;
+
+		dest_off = info->logical_offset;
+		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
+			goto next;
+		if (!IS_ALIGNED(dest_off, bs))
+			goto next;
+
+		info->status = btrfs_extent_same(src, off, len, dst, dest_off);
+		if (info->status == 0) {
+			info->bytes_deduped += len;
+		} else
+			break;	
+
+next:
+		fput(dst_file);
+		dst_file = NULL;
+	}
+
+	if (copy_to_user(argp, args, args_size))
+		ret = -EFAULT;
+
+out:
+	if (dst_file)
+		fput(dst_file);
+	kfree(args);
+out_drop_write:
+	mnt_drop_write_file(file);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4151,6 +4432,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index fa3a5f9..5465bc2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -305,6 +305,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -510,5 +535,7 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-07-15 20:55   ` Zach Brown
@ 2013-07-17  0:14     ` Gabriel de Perthuis
  0 siblings, 0 replies; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-07-17  0:14 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 15 Jul 2013 13:55:51 -0700, Zach Brown wrote:
> I'd get rid of all this code by only copying each input argument on to
> the stack as it's needed and by getting rid of the writable output
> struct fields.  (more on this later)

> As I said, I'd get rid of the output fields.  Like the other vectored io
> syscalls, the return value can indicate the number of initial
> consecutive bytes that worked.  When no progress was made, it can
> return errors.  Userspace is left to sort out the resulting state and
> figure out the extents to retry in exactly the same way that it found
> the initial extents to attempt to dedupe in the first place.
> 
> (And imagine strace trying to print the inputs and outputs.  Poor, poor,
> strace! :))

The dedup branch that uses this syscall[1] doesn't compare files
before submitting them anymore (the kernel will do it, and ranges
may not fit in cache once I get rid of an unnecessary loop).

I don't have strong opinions on the return style, but it would be
good to have the syscall always make progress by finding at least
one good range before bailing out, and signaling which files were
involved.  With those constraints, the current struct seems like the
cleanest way to pass the data.  The early return you suggest is a
good idea if Mark agrees, but the return condition should be
something like: if one range with bytes_deduped != 0 doesn't get
bytes_deduped incremented by iteration_len in this iteration, bail
out.
That's sufficient to guarantee progress and to know which ranges
were involved.
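
In pseudo-code, roughly (a sketch; 'iteration_len' and the surrounding
chunked loop are illustrative, not taken from the patch):

		prev = info->bytes_deduped;
		info->status = btrfs_extent_same(src, src_off, iteration_len,
						 dst, dest_off);
		if (info->status == 0)
			info->bytes_deduped += iteration_len;
		/* a range that already made progress must keep advancing
		 * by a full iteration, otherwise bail out */
		if (prev != 0 && info->bytes_deduped != prev + iteration_len)
			break;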

> I hope this helps!
> 
> - z

Thank you and everyone involved for the progress on this.

[1] https://github.com/g2p/bedup/tree/wip/dedup-syscall



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-06-11 20:31 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-07-15 20:55   ` Zach Brown
  2013-07-17  0:14     ` Gabriel de Perthuis
  0 siblings, 1 reply; 28+ messages in thread
From: Zach Brown @ 2013-07-15 20:55 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Josef Bacik

Josef asked that I check out the offline dedup patches.  I hope that I
found the most recent posting in my archives :).

The first three patches seemed fine.  I have questions about the
read/compare/clone core:

> +		addr = kmap(page);
> +		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
> +		kunmap(page);

I think you need flush_dcache_page() here 'cause there could be writable
user address mappings of these page cache pages.
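
I.e. something like (sketch):

		addr = kmap(page);
		/* flush any dirty user-space aliases before reading the
		 * page contents through the kernel mapping */
		flush_dcache_page(page);
		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
		kunmap(page);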

> +static void btrfs_double_lock(struct inode *inode1, u64 loff1,
> +			      struct inode *inode2, u64 loff2, u64 len)
> +{
> +	if (inode1 < inode2) {
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode1, loff1, len);
> +		lock_extent_range(inode2, loff2, len);
> +	} else {
> +		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
> +		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
> +		lock_extent_range(inode2, loff2, len);
> +		lock_extent_range(inode1, loff1, len);
> +	}
> +}

I'd do it a little differently to reduce the surface area for future 1/2
copy-paste typos.  (And maybe make it safe for when src == dst?)

	if (inode1 > inode2) {
		swap(inode1, inode2);
		swap(loff1, loff2);
	}

	mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
	lock_extent_range(inode1, loff1, len);
	if (inode1 != inode2) {
		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
		lock_extent_range(inode2, loff2, len);
	}

If lockdep lets us do this, anyway :/.  (Then maybe a similar swap to
match in unlock, I'm not sure.)

> +	if (copy_from_user(&tmp,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   sizeof(tmp))) {
> +		ret = -EFAULT;
> +		goto out_drop_write;
> +	}
> +
> +	args_size = sizeof(tmp) + (tmp.dest_count *
> +			sizeof(struct btrfs_ioctl_same_extent_info));
> +
> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE) {
> +		ret = -E2BIG;
> +		goto out_drop_write;
> +	}
> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args) {
> +		ret = -ENOMEM;
> +		goto out_drop_write;
> +	}
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(args,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   args_size))
> +		goto out;
> +	/* Make sure args didn't change magically between copies. */
> +	if (memcmp(&tmp, args, sizeof(tmp)))
> +		goto out;
> +
> +	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
> +		goto out;
> +
> +	/* pre-format 'out' fields to sane default values */
> +	for (i = 0; i < args->dest_count; i++) {
> +		info = &args->info[i];
> +		info->bytes_deduped = 0;
> +		info->status = 0;
> +	}

I'd get rid of all this code by only copying each input argument on to
the stack as it's needed and by getting rid of the writable output
struct fields.  (more on this later)

> +	/*
> +	 * Limit the total length we will dedupe for each operation. 
> +	 * This is intended to bound the entire ioctl to something sane.
> +	 */
> +	if (len > BTRFS_MAX_DEDUPE_LEN)
> +		len = BTRFS_MAX_DEDUPE_LEN;

[ ... ]

> +	src_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	dst_buffer = btrfs_kvmalloc(len, GFP_NOFS);
> +	if (!src_buffer || !dst_buffer) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}

Why use these allocated intermediate buffers at all?  Regardless of how
large a region you decide to lock and clone at a time, you can get page
refs for both files and use kmap_atomic() to compare without copying. 

And if you don't have the buffers, why not lock and clone as much as
the user asked for?

I could understand wanting to reduce lock hold times, but you could do
that by doing the initial io waits on shared page cache references and
only using the expensive locks to verify the pages and perform the
clone.  But hopefully that complexity isn't really needed?

> +		if (S_ISDIR(dst->i_mode)) {
> +			info->status = -EISDIR;
> +			goto next;
> +		}

Does anything stop us from getting here with symlinks?  It looks like
btrfs_symlink() helpfully overrides i_fop with the _file_ operations
which have .ioctl.  I didn't actually try it.

> +	if (copy_to_user(argp, args, args_size))
> +		ret = -EFAULT;

I don't think it should be possible to return -EFAULT after all the work
is done just because some input memory was accidentally read-only for
whatever reason.

> +#define BTRFS_SAME_DATA_DIFFERS	1
> +/* For extent-same ioctl */
> +struct btrfs_ioctl_same_extent_info {
> +	__s64 fd;		/* in - destination file */
> +	__u64 logical_offset;	/* in - start of extent in destination */
> +	__u64 bytes_deduped;	/* out - total # of bytes we were able
> +				 * to dedupe from this file */
> +	/* status of this dedupe operation:
> +	 * 0 if dedup succeeds
> +	 * < 0 for error
> +	 * == BTRFS_SAME_DATA_DIFFERS if data differs
> +	 */
> +	__s32 status;		/* out - see above description */
> +	__u32 reserved;
> +};

As I said, I'd get rid of the output fields.  Like the other vectored io
syscalls, the return value can indicate the number of initial
consecutive bytes that worked.  When no progress was made, it can
return errors.  Userspace is left to sort out the resulting state and
figure out the extents to retry in exactly the same way that it found
the initial extents to attempt to dedupe in the first place.

(And imagine strace trying to print the inputs and outputs.  Poor, poor,
strace! :))
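
To illustrate, a userspace caller of that hypothetical return-value
interface (not what the patch currently implements) might look roughly
like:

	/* sketch: resubmit the remainder until done or no progress */
	while (args.length > 0) {
		long done = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, &args);
		if (done <= 0)
			break;		/* error or no progress: inspect errno */
		args.logical_offset += done;
		args.length -= done;
		/* ... advance each destination offset by 'done' as well ... */
	}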

I hope this helps!

- z

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] btrfs: offline dedupe
  2013-06-11 20:31 [PATCH 0/4] btrfs: offline dedupe v2 Mark Fasheh
@ 2013-06-11 20:31 ` Mark Fasheh
  2013-07-15 20:55   ` Zach Brown
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-06-11 20:31 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Chris Mason, Josef Bacik, Gabriel de Perthuis, David Sterba, Mark Fasheh

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of file, offset pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ioctl.c           | 329 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  27 ++++
 2 files changed, 356 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e90c519..9351b69 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2456,6 +2459,54 @@ out:
 	return ret;
 }
 
+static noinline int fill_data(struct inode *inode, u64 off, u64 len,
+			      char *buffer)
+{
+	struct page *page;
+	void *addr;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!PageUptodate(page)) {
+			extent_read_full_page_nolock(tree, page,
+						     btrfs_get_extent, 0);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+out:
+	if (ret)
+		kfree(buffer);
+	return ret;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2476,6 +2527,282 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode1, loff1, len);
+		lock_extent_range(inode2, loff2, len);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+		lock_extent_range(inode1, loff1, len);
+	}
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff,
+			     char *src_buffer, char *dst_buffer)
+{
+	int ret;
+
+	/*
+	 * btrfs_clone() can't handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = fill_data(src, loff, len, src_buffer);
+	if (ret)
+		goto out;
+
+	ret = fill_data(dst, dst_loff, len, dst_buffer);
+	if (ret)
+		goto out;
+
+	ret = memcmp(src_buffer, dst_buffer, len);
+	if (ret) {
+		ret = BTRFS_SAME_DATA_DIFFERS;
+		goto out;
+	}
+
+	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+out:
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	return ret;
+}
+
+static void *btrfs_kvmalloc(size_t size, gfp_t flags)
+{
+	void *ret;
+
+	ret = kmalloc(size, flags);
+	if (!ret)
+		ret = __vmalloc(size, flags, PAGE_KERNEL);
+	return ret;
+}
+
+static void btrfs_kvfree(void *ptr)
+{
+	if (is_vmalloc_addr(ptr))
+		vfree(ptr);
+	else
+		kfree(ptr);
+}
+
+#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct btrfs_ioctl_same_extent_info *info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int args_size;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+	char *src_buffer = NULL;
+	char *dst_buffer = NULL;
+
+	if (!(file->f_mode & FMODE_READ))
+		return -EINVAL;
+
+	if (btrfs_root_readonly(BTRFS_I(src)->root))
+		return -EROFS;
+
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp))) {
+		ret = -EFAULT;
+		goto out_drop_write;
+	}
+
+	args_size = sizeof(tmp) + (tmp.dest_count *
+			sizeof(struct btrfs_ioctl_same_extent_info));
+
+	/* Keep size of ioctl argument sane */
+	if (args_size > PAGE_CACHE_SIZE) {
+		ret = -E2BIG;
+		goto out_drop_write;
+	}
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args) {
+		ret = -ENOMEM;
+		goto out_drop_write;
+	}
+
+	ret = -EFAULT;
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		goto out;
+	/* Make sure args didn't change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp)))
+		goto out;
+
+	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
+		goto out;
+
+	/* pre-format 'out' fields to sane default values */
+	for (i = 0; i < args->dest_count; i++) {
+		info = &args->info[i];
+		info->bytes_deduped = 0;
+		info->status = 0;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation. 
+	 * This is intended to bound the entire ioctl to something sane.
+	 */
+	if (len > BTRFS_MAX_DEDUPE_LEN)
+		len = BTRFS_MAX_DEDUPE_LEN;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out;
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		goto out;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	src_buffer = btrfs_kvmalloc(len, GFP_NOFS);
+	dst_buffer = btrfs_kvmalloc(len, GFP_NOFS);
+	if (!src_buffer || !dst_buffer) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = 0;
+	for (i = 0; i < args->dest_count; i++) {
+		u64 dest_off;
+		u64 src_off;
+		u64 op_len;
+
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			info->status = -EBADF;
+			continue;
+		}
+
+		if (!(dst_file->f_mode & FMODE_WRITE)) {
+			info->status = -EINVAL;
+			goto next;
+		}
+
+		info->status = -EXDEV;
+		if (file->f_path.mnt != dst_file->f_path.mnt)
+			goto next;
+
+		dst = dst_file->f_dentry->d_inode;
+		if (src->i_sb != dst->i_sb)
+			goto next;
+
+		if (S_ISDIR(dst->i_mode)) {
+			info->status = -EISDIR;
+			goto next;
+		}
+
+		info->status = -EINVAL;
+		/* don't make the dst file partly checksummed */
+		if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
+		    (BTRFS_I(dst)->flags & BTRFS_INODE_NODATASUM))
+			goto next;
+
+		dest_off = info->logical_offset;
+		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
+			goto next;
+		if (!IS_ALIGNED(dest_off, bs))
+			goto next;
+
+		/*
+		 * The purpose of this loop is to limit the number of
+		 * bytes we dedupe during a single call to
+		 * btrfs_extent_same().
+		 *
+		 * In order to memcmp the data we have to allocate a
+		 * pair of buffers. We don't want to allocate too
+		 * large a buffer, so limiting the size for each
+		 * dedupe is an easy way to do this.
+		 */
+		src_off = off;
+		op_len = len;
+		while (op_len) {
+			u64 tmp_len;
+
+			tmp_len = op_len;
+			if (op_len > BTRFS_ONE_DEDUPE_LEN)
+				tmp_len = BTRFS_ONE_DEDUPE_LEN;
+
+			info->status = btrfs_extent_same(src, src_off, tmp_len,
+							 dst, dest_off,
+							 src_buffer, dst_buffer);
+			if (info->status == 0) {
+				info->bytes_deduped += tmp_len;
+			} else
+				break;
+
+			dest_off += tmp_len;
+			src_off += tmp_len;
+			op_len -= tmp_len;
+		}
+
+next:
+		fput(dst_file);
+		dst_file = NULL;
+	}
+
+	if (copy_to_user(argp, args, args_size))
+		ret = -EFAULT;
+
+out:
+	btrfs_kvfree(src_buffer);
+	btrfs_kvfree(dst_buffer);
+	if (dst_file)
+		fput(dst_file);
+	kfree(args);
+out_drop_write:
+	mnt_drop_write_file(file);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4151,6 +4478,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index fa3a5f9..5465bc2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -305,6 +305,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -510,5 +535,7 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-05-24 22:38         ` Mark Fasheh
@ 2013-05-24 23:36           ` Gabriel de Perthuis
  0 siblings, 0 replies; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-05-24 23:36 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: David Sterba, linux-btrfs, Chris Mason, Josef Bacik

On Sat, 25 May 2013 00:38:27 CEST, Mark Fasheh wrote:
> On Fri, May 24, 2013 at 09:50:14PM +0200, Gabriel de Perthuis wrote:
>>> Sure. Actually, you got me thinking about some sanity checks... I need to
>>> add at least this check:
>>>
>>> 	if (btrfs_root_readonly(root))
>>> 		return -EROFS;
>>>
>>> which isn't in there as of now.
>>
>> It's not needed and I'd rather do without it: read-only snapshots and deduplication go together well for backups.
>> Data and metadata are guaranteed to be immutable; extent storage isn't.  This is also the case with raid.
>
> You're absolutely right - I miswrote the check I meant.
> Specifically, I was thinking about when the entire fs is readonly due to
> either some error or the user mounted with -oro. So something more like:
>
> 	if (root->fs_info->sb->s_flags & MS_RDONLY)
> 		return -EROFS;
>
> I think that should be reasonable and wouldn't affect most use cases,
> right?

That's all right.

>>> Also I don't really check the open mode (read, write, etc) on files passed
>>> in. We do this in the clone ioctl and it makes sense there since data (to
>>> the user) can change. With this ioctl though data won't ever change (even if
>>> the underlying extent does). So I left the checks out. A part of me is
>>> thinking we might want to be conservative to start with though and just add
>>> those type of checks in. Basically, I figure the source should be open for
>>> read at least and target files need write access.
>>
>> I don't know of any privileged files that one would be able to open(2),
>> but if this is available to unprivileged users the files all need to be
>> open for reading so that it can't be used to guess at their contents. As
>> long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't
>> hurt my use case.
>
> Oh ok so this seems to make sense. How does this logic sound:
>
> We're not going to worry about write access since it would be entirely
> reasonable for the user to want to do this on a readonly submount
> (specifically for the purpose of deduplicating backups).
>
> Read access needs to be provided however so we know that the user has access
> to the file data.
>
> So basically, if a user can open any files for read, they can check their
> contents and dedupe them.
>
> Letting users dedupe files in say, /etc seems kind of weird to me but I'm
> struggling to come up with a good explanation of why that should mean we
> limit this ioctl to root.
> 	--Mark

I agree with that model.  Most of the code is shared with clone (and the
copy_range RFC) which are unprivileged, so it doesn't increase the potential
surface for bugs much.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-05-24 19:50       ` Gabriel de Perthuis
@ 2013-05-24 22:38         ` Mark Fasheh
  2013-05-24 23:36           ` Gabriel de Perthuis
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-05-24 22:38 UTC (permalink / raw)
  To: Gabriel de Perthuis; +Cc: David Sterba, linux-btrfs, Chris Mason, Josef Bacik

On Fri, May 24, 2013 at 09:50:14PM +0200, Gabriel de Perthuis wrote:
> > Sure. Actually, you got me thinking about some sanity checks... I need to
> > add at least this check:
> > 
> > 	if (btrfs_root_readonly(root))
> > 		return -EROFS;
> > 
> > which isn't in there as of now.
> 
> It's not needed and I'd rather do without it: read-only snapshots and deduplication go together well for backups.
> Data and metadata are guaranteed to be immutable; extent storage isn't.  This is also the case with raid.

You're absolutely right - I miswrote the check I meant.
Specifically, I was thinking about when the entire fs is readonly due to
either some error or the user mounted with -oro. So something more like:

	if (root->fs_info->sb->s_flags & MS_RDONLY)
		return -EROFS;

I think that should be reasonable and wouldn't affect most use cases,
right?


> > Also I don't really check the open mode (read, write, etc) on files passed
> > in. We do this in the clone ioctl and it makes sense there since data (to
> > the user) can change. With this ioctl though data won't ever change (even if
> > the underlying extent does). So I left the checks out. A part of me is
> > thinking we might want to be conservative to start with though and just add
> > those type of checks in. Basically, I figure the source should be open for
> > read at least and target files need write access.
> 
> I don't know of any privileged files that one would be able to open(2),
> but if this is available to unprivileged users the files all need to be
> open for reading so that it can't be used to guess at their contents. As
> long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't
> hurt my use case.

Oh ok so this seems to make sense. How does this logic sound:

We're not going to worry about write access since it would be entirely
reasonable for the user to want to do this on a readonly submount
(specifically for the purpose of deduplicating backups).

Read access needs to be provided however so we know that the user has access
to the file data.

So basically, if a user can open any files for read, they can check their
contents and dedupe them.
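
For illustration, a rough sketch of what those checks might look like,
assuming we only require FMODE_READ on the source fd and on each destination
fd (the error codes are placeholders):

	/* sketch only: require read access on the source fd so the ioctl
	 * can't be used to probe files the caller can't read */
	if (!(file->f_mode & FMODE_READ))
		return -EINVAL;

	/* and likewise for each destination inside the dedupe loop */
	if (!(dst_file->f_mode & FMODE_READ)) {
		info->status = -EINVAL;
		goto next;
	}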

Letting users dedupe files in, say, /etc seems kind of weird to me, but I'm
struggling to come up with a good explanation of why that should mean we
limit this ioctl to root.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-05-24 18:17     ` Mark Fasheh
@ 2013-05-24 19:50       ` Gabriel de Perthuis
  2013-05-24 22:38         ` Mark Fasheh
  0 siblings, 1 reply; 28+ messages in thread
From: Gabriel de Perthuis @ 2013-05-24 19:50 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: David Sterba, linux-btrfs, Chris Mason, Josef Bacik

>>> +#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
>>> +#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
>>> +
>>> +static long btrfs_ioctl_file_extent_same(struct file *file,
>>> +					 void __user *argp)
>>> +{
>>> +	struct btrfs_ioctl_same_args *args;
>>> +	struct btrfs_ioctl_same_args tmp;
>>> +	struct btrfs_ioctl_same_extent_info *info;
>>> +	struct inode *src = file->f_dentry->d_inode;
>>> +	struct file *dst_file = NULL;
>>> +	struct inode *dst;
>>> +	u64 off;
>>> +	u64 len;
>>> +	int args_size;
>>> +	int i;
>>> +	int ret;
>>> +	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
>>
>> The ioctl is available to non-root, so extra care should be taken with
>> potential overflows, etc. I haven't spotted anything so far.
> 
> 
> Sure. Actually, you got me thinking about some sanity checks... I need to
> add at least this check:
> 
> 	if (btrfs_root_readonly(root))
> 		return -EROFS;
> 
> which isn't in there as of now.

It's not needed and I'd rather do without it; read-only snapshots and deduplication go together well for backups.
Data and metadata are guaranteed to be immutable; extent storage isn't.  This is also the case with raid.


> Also I don't really check the open mode (read, write, etc) on files passed
> in. We do this in the clone ioctl and it makes sense there since data (to
> the user) can change. With this ioctl though data won't ever change (even if
> the underlying extent does). So I left the checks out. A part of me is
> thinking we might want to be conservative to start with, though, and just add
> those types of checks in. Basically, I figure the source should be open for
> read at least and target files need write access.

I don't know of any privileged files that one would be able to open(2), but if this is available to unprivileged users the files all need to be open for reading so that it can't be used to guess at their contents.
As long as root gets to bypass the checks (no btrfs_root_readonly) it doesn't hurt my use case.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-05-24 14:05   ` David Sterba
@ 2013-05-24 18:17     ` Mark Fasheh
  2013-05-24 19:50       ` Gabriel de Perthuis
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-05-24 18:17 UTC (permalink / raw)
  To: dsterba, linux-btrfs, Chris Mason, Josef Bacik, Gabriel de Perthuis

Hey David, thanks again for the review! Comments are inline below.

On Fri, May 24, 2013 at 04:05:36PM +0200, David Sterba wrote:
> On Tue, May 21, 2013 at 11:28:28AM -0700, Mark Fasheh wrote:
> > +static noinline int fill_data(struct inode *inode, u64 off, u64 len,
> > +			      char **cur_buffer)
> > +{
> > +	struct page *page;
> > +	void *addr;
> > +	char *buffer;
> > +	pgoff_t index;
> > +	pgoff_t last_index;
> > +	int ret = 0;
> > +	int bytes_copied = 0;
> > +	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> > +
> > +	buffer = kmalloc(len, GFP_NOFS);
> 
> I've looked at the caller chain and len can be as big as 1MB, this is
> likely to fail to allocate. More comments below.

Well, it's there in the description that we'll do this 1MB at a time at most :)


> > @@ -2476,6 +2534,236 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
> > +static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
> > +			     struct inode *dst, u64 dst_loff)
> > +{
> > +	char *orig_buffer = NULL;
> > +	char *dst_inode_buffer = NULL;
> > +	int ret;
> > +
> > +	/*
> > +	 * btrfs_clone() can't handle extents in the same file
> > +	 * yet. Once that works, we can drop this check and replace it
> > +	 * with a check for the same inode, but overlapping extents.
> > +	 */
> > +	if (src == dst)
> > +		return -EINVAL;
> > +
> > +	btrfs_double_lock(src, loff, dst, dst_loff, len);
> > +
> 
> Let's see how fill_data is actually used:
> 
> > +	ret = fill_data(src, loff, len, &orig_buffer);
> 
> Now orig_buffer is allocated from inside fill_data
> 
> > +	ret = fill_data(dst, dst_loff, len, &dst_inode_buffer);
> 
> And again, although the buffers are just memcmp'ed:
> 
> > +	ret = memcmp(orig_buffer, dst_inode_buffer, len);
> 
> extents linked:
> 
> > +	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
> > +
> > +out:
> > +	btrfs_double_unlock(src, loff, dst, dst_loff, len);
> > +
> and thrown away, all of this to be repeated on next call of
> btrfs_extent_same:
> 
> > +	kfree(dst_inode_buffer);
> > +	kfree(orig_buffer);
> > +	return ret;
> > +}
> 
> Apart from the potential kmalloc failure, I think at least the second
> fill_data can be replaced with memcpy with orig_buffer. (Possibly both
> fill_data/memcpy could be replaced by memcmp reading from both inodes in
> parallel.)

We have to do the read (fill_data) on each inode for each iteration. The
design of this is that we only lock down the inodes when we're deduping and
then relock on the next pass. Otherwise we could be doing a ton of work
while preventing access to file data. The obvious downside is that data can
change between locks so we have to always check.

Keeping the kmalloc'd buffers around isn't a bad idea and luckily is trivial
to handle. I can just alloc them up front in the caller and pass them
through on each call.
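
Roughly, as a sketch only (the extra btrfs_extent_same() parameters are
hypothetical at this point):

	/* sketch: allocate the comparison buffers once per ioctl call,
	 * sized to the per-iteration limit, and free both at 'out'
	 * (kfree(NULL) is fine) */
	char *src_buf = NULL, *dst_buf = NULL;

	src_buf = kmalloc(BTRFS_ONE_DEDUPE_LEN, GFP_NOFS);
	dst_buf = kmalloc(BTRFS_ONE_DEDUPE_LEN, GFP_NOFS);
	if (!src_buf || !dst_buf) {
		ret = -ENOMEM;
		goto out;
	}

	/* then, in the per-destination loop, hand them down */
	info->status = btrfs_extent_same(src, src_off, tmp_len,
					 dst, dest_off, src_buf, dst_buf);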


> > +#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> > +#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
> > +
> > +static long btrfs_ioctl_file_extent_same(struct file *file,
> > +					 void __user *argp)
> > +{
> > +	struct btrfs_ioctl_same_args *args;
> > +	struct btrfs_ioctl_same_args tmp;
> > +	struct btrfs_ioctl_same_extent_info *info;
> > +	struct inode *src = file->f_dentry->d_inode;
> > +	struct file *dst_file = NULL;
> > +	struct inode *dst;
> > +	u64 off;
> > +	u64 len;
> > +	int args_size;
> > +	int i;
> > +	int ret;
> > +	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
> 
> The ioctl is available to non-root, so extra care should be taken with
> potential overflows, etc. I haven't spotted anything so far.


Sure. Actually, you got me thinking about some sanity checks... I need to
add at least this check:

	if (btrfs_root_readonly(root))
		return -EROFS;

which isn't in there as of now.


Also I don't really check the open mode (read, write, etc) on files passed
in. We do this in the clone ioctl and it makes sense there since data (to
the user) can change. With this ioctl though data won't ever change (even if
the underlying extent does). So I left the checks out. A part of me is
thinking we might want to be conservative to start with, though, and just add
those types of checks in. Basically, I figure the source should be open for
read at least and target files need write access.


> > +	if (copy_from_user(&tmp,
> > +			   (struct btrfs_ioctl_same_args __user *)argp,
> > +			   sizeof(tmp)))
> > +		return -EFAULT;
> > +
> > +	args_size = sizeof(tmp) + (tmp.dest_count *
> > +			sizeof(struct btrfs_ioctl_same_extent_info));
> 
> A minor note: computing the size like this works as long as the
> structures do not require extra padding between elements, i.e. the gap at
> the end of an element before the aligned start of the next element. That's
> ok now, and I don't think the compiler would use alignment larger
> than 8 bytes (i.e. for u64).
> 
> size of btrfs_ioctl_same_extent_info is 4x u64, and 'info' starts at 3rd
> u64 inside btrfs_ioctl_same_args.
> 
> A proposed paranoid-safety check is something like
> 
> 	struct btrfs_ioctl_same_extent_info dummy[2];
> 
> 	ASSERT(sizeof(struct btrfs_ioctl_same_extent_info)
> 		== (char *)&dummy[1] - (char *)&dummy[0])

Maybe BUILD_BUG_ON is even better actually. If someone messed this up the
build would catch it and we'd never even get a module with bad checks.

I'll have to think up how to catch it in that context though.
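
For instance, dropped near the top of btrfs_ioctl_file_extent_same(),
something like this might do (a sketch only; the 32 is just what the current
struct works out to, per the 4x u64 note above):

	/* sketch only: catch at build time any padding that the args_size
	 * arithmetic doesn't account for */
	BUILD_BUG_ON(offsetof(struct btrfs_ioctl_same_args, info) !=
		     sizeof(struct btrfs_ioctl_same_args));
	BUILD_BUG_ON(sizeof(struct btrfs_ioctl_same_extent_info) != 32);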


> > +
> > +	ret = 0;
> > +	for (i = 0; i < args->dest_count; i++) {
> > +		u64 dest_off;
> > +		u64 src_off;
> > +		u64 op_len;
> > +
> > +		info = &args->info[i];
> > +
> > +		dst_file = fget(info->fd);
> > +		if (!dst_file) {
> > +			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
> 
> I think adding the keyword 'dedup' to the messages would make it clearer
> at which operation it happened.

Good idea.... Btw, I was actually thinking of just ditching those printks in
the next iteration of this series. We don't tend to printk as often in other
parts of the btrfs code. Any opinion on that? :)
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/4] btrfs: offline dedupe
  2013-05-21 18:28 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
@ 2013-05-24 14:05   ` David Sterba
  2013-05-24 18:17     ` Mark Fasheh
  0 siblings, 1 reply; 28+ messages in thread
From: David Sterba @ 2013-05-24 14:05 UTC (permalink / raw)
  To: Mark Fasheh
  Cc: linux-btrfs, Chris Mason, Josef Bacik, Gabriel de Perthuis, David Sterba

On Tue, May 21, 2013 at 11:28:28AM -0700, Mark Fasheh wrote:
> +static noinline int fill_data(struct inode *inode, u64 off, u64 len,
> +			      char **cur_buffer)
> +{
> +	struct page *page;
> +	void *addr;
> +	char *buffer;
> +	pgoff_t index;
> +	pgoff_t last_index;
> +	int ret = 0;
> +	int bytes_copied = 0;
> +	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> +
> +	buffer = kmalloc(len, GFP_NOFS);

I've looked at the caller chain and len can be as big as 1MB, this is
likely to fail to allocate. More comments below.

> +	if (!buffer)
> +		return -ENOMEM;
> +
> +	index = off >> PAGE_CACHE_SHIFT;
> +	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
> +
> +	while (index <= last_index) {
> +		page = grab_cache_page(inode->i_mapping, index);
> +		if (!page) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +
> +		if (!PageUptodate(page)) {
> +			extent_read_full_page_nolock(tree, page,
> +						     btrfs_get_extent, 0);
> +			lock_page(page);
> +			if (!PageUptodate(page)) {
> +				unlock_page(page);
> +				page_cache_release(page);
> +				ret = -EINVAL;
> +				goto out;
> +			}
> +		}
> +
> +		addr = kmap(page);
> +		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
> +		kunmap(page);
> +		unlock_page(page);
> +		page_cache_release(page);
> +		bytes_copied += PAGE_CACHE_SIZE;
> +		index++;
> +	}
> +
> +	*cur_buffer = buffer;
> +
> +out:
> +	if (ret)
> +		kfree(buffer);
> +	return ret;
> +}
> +
>  static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
>  {
>  	/* do any pending delalloc/csum calc on src, one way or
> @@ -2476,6 +2534,236 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
> +static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
> +			     struct inode *dst, u64 dst_loff)
> +{
> +	char *orig_buffer = NULL;
> +	char *dst_inode_buffer = NULL;
> +	int ret;
> +
> +	/*
> +	 * btrfs_clone() can't handle extents in the same file
> +	 * yet. Once that works, we can drop this check and replace it
> +	 * with a check for the same inode, but overlapping extents.
> +	 */
> +	if (src == dst)
> +		return -EINVAL;
> +
> +	btrfs_double_lock(src, loff, dst, dst_loff, len);
> +

Let's see how fill_data is actually used:

> +	ret = fill_data(src, loff, len, &orig_buffer);

Now orig_buffer is allocated from inside fill_data

> +	ret = fill_data(dst, dst_loff, len, &dst_inode_buffer);

And again, although the buffers are just memcmp'ed:

> +	ret = memcmp(orig_buffer, dst_inode_buffer, len);

extents linked:

> +	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
> +
> +out:
> +	btrfs_double_unlock(src, loff, dst, dst_loff, len);
> +
and thrown away, all of this to be repeated on next call of
btrfs_extent_same:

> +	kfree(dst_inode_buffer);
> +	kfree(orig_buffer);
> +	return ret;
> +}

Apart from the potential kmalloc failure, I think at least the second
fill_data can be replaced with memcpy with orig_buffer. (Possibly both
fill_data/memcpy could be replaced by memcmp reading from both inodes in
parallel.)
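
Something along these lines, purely as a sketch: btrfs_cmp_data() is a
made-up name, it reuses fill_data() for the reads, and it assumes len stays a
multiple of PAGE_CACHE_SIZE (which the block-size alignment checks make
likely on 4k-sector filesystems):

	static int btrfs_cmp_data(struct inode *src, u64 loff,
				  struct inode *dst, u64 dst_loff, u64 len)
	{
		char *src_buf = NULL;
		char *dst_buf = NULL;
		int ret = 0;

		while (len) {
			u64 cmp_len = min_t(u64, len, PAGE_CACHE_SIZE);

			/* read one page-sized chunk from each inode */
			ret = fill_data(src, loff, cmp_len, &src_buf);
			if (ret)
				break;
			ret = fill_data(dst, dst_loff, cmp_len, &dst_buf);
			if (ret) {
				kfree(src_buf);
				break;
			}

			if (memcmp(src_buf, dst_buf, cmp_len))
				ret = BTRFS_SAME_DATA_DIFFERS;

			kfree(src_buf);
			kfree(dst_buf);
			if (ret)
				break;

			loff += cmp_len;
			dst_loff += cmp_len;
			len -= cmp_len;
		}
		return ret;
	}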

> +#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
> +#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
> +
> +static long btrfs_ioctl_file_extent_same(struct file *file,
> +					 void __user *argp)
> +{
> +	struct btrfs_ioctl_same_args *args;
> +	struct btrfs_ioctl_same_args tmp;
> +	struct btrfs_ioctl_same_extent_info *info;
> +	struct inode *src = file->f_dentry->d_inode;
> +	struct file *dst_file = NULL;
> +	struct inode *dst;
> +	u64 off;
> +	u64 len;
> +	int args_size;
> +	int i;
> +	int ret;
> +	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;

The ioctl is available to non-root, so extra care should be taken with
potential overflows, etc. I haven't spotted anything so far.

> +	if (copy_from_user(&tmp,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   sizeof(tmp)))
> +		return -EFAULT;
> +
> +	args_size = sizeof(tmp) + (tmp.dest_count *
> +			sizeof(struct btrfs_ioctl_same_extent_info));

A minor note: computing the size like this works as long as the
structures do not require extra padding between elements, i.e. the gap at
the end of an element before the aligned start of the next element. That's
ok now, and I don't think the compiler would use alignment larger
than 8 bytes (i.e. for u64).

size of btrfs_ioctl_same_extent_info is 4x u64, and 'info' starts at 3rd
u64 inside btrfs_ioctl_same_args.

A proposed paranoid-safety check is something like

	struct btrfs_ioctl_same_extent_info dummy[2];

	ASSERT(sizeof(struct btrfs_ioctl_same_extent_info)
		== (char *)&dummy[1] - (char *)&dummy[0])

> +	/* Keep size of ioctl argument sane */
> +	if (args_size > PAGE_CACHE_SIZE)
> +		return -E2BIG;
> +
> +	args = kmalloc(args_size, GFP_NOFS);
> +	if (!args)
> +		return -ENOMEM;
> +
> +	ret = -EFAULT;
> +	if (copy_from_user(args,
> +			   (struct btrfs_ioctl_same_args __user *)argp,
> +			   args_size))
> +		goto out;
> +	/* Make sure args didn't change magically between copies. */

Not much fun left :)

> +	if (memcmp(&tmp, args, sizeof(tmp)))
> +		goto out;
> +
> +	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
> +		goto out;
> +
> +	/* pre-format 'out' fields to sane default values */
> +	for (i = 0; i < args->dest_count; i++) {
> +		info = &args->info[i];
> +		info->bytes_deduped = 0;
> +		info->status = 0;
> +	}
> +
> +	off = args->logical_offset;
> +	len = args->length;
> +
> +	/*
> +	 * Limit the total length we will dedupe for each operation. 
> +	 * This is intended to bound the entire ioctl to something sane.
> +	 */
> +	if (len > BTRFS_MAX_DEDUPE_LEN)
> +		len = BTRFS_MAX_DEDUPE_LEN;
> +
> +	ret = -EINVAL;
> +	if (off + len > src->i_size || off + len < off)
> +		goto out;
> +	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
> +		goto out;
> +
> +	ret = -EISDIR;
> +	if (S_ISDIR(src->i_mode))
> +		goto out;
> +
> +	ret = 0;
> +	for (i = 0; i < args->dest_count; i++) {
> +		u64 dest_off;
> +		u64 src_off;
> +		u64 op_len;
> +
> +		info = &args->info[i];
> +
> +		dst_file = fget(info->fd);
> +		if (!dst_file) {
> +			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);

I think adding the keyword 'dedup' to the messages would make it clearer
at which operation it happened.

> +			info->status = -EBADF;
> +			continue;
> +		}
> +
> +		dst = dst_file->f_dentry->d_inode;
> +		if (S_ISDIR(dst->i_mode)) {
> +			printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
> +			info->status = -EISDIR;
> +			goto next;
> +		}
> +
> +		info->status = -EINVAL;
> +		if (dst == src) {
> +			printk(KERN_ERR "btrfs: file dup %lld\n", info->fd);
> +			goto next;
> +		}
> +
> +		dest_off = info->logical_offset;
> +
> +		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
> +			goto next;
> +		if (!IS_ALIGNED(dest_off, bs))
> +			goto next;
> +
> +		/*
> +		 * The purpose of this loop is to limit the number of
> +		 * bytes we dedupe during a single call to
> +		 * btrfs_extent_same().
> +		 *
> +		 * In order to memcmp the data we have to allocate a
> +		 * pair of buffers. We don't want to allocate too
> +		 * large a buffer, so limiting the size for each
> +		 * dedupe is an easy way to do this.
> +		 */
> +		src_off = off;
> +		op_len = len;
> +		while (op_len) {
> +			u64 tmp_len;
> +
> +			tmp_len = op_len;
> +			if (op_len > BTRFS_ONE_DEDUPE_LEN)
> +				tmp_len = BTRFS_ONE_DEDUPE_LEN;
> +
> +			info->status = btrfs_extent_same(src, src_off, tmp_len,
> +							 dst, dest_off);
> +			if (info->status == 0) {
> +				info->bytes_deduped += tmp_len;
> +			} else
> +				break;
> +
> +			dest_off += tmp_len;
> +			src_off += tmp_len;
> +			op_len -= tmp_len;
> +		}
> +
> +next:
> +		fput(dst_file);
> +		dst_file = NULL;
> +	}
> +
> +	if (copy_to_user(argp, args, args_size))
> +		ret = -EFAULT;
> +
> +out:
> +	if (dst_file)
> +		fput(dst_file);
> +	kfree(args);
> +	return ret;
> +}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] btrfs: offline dedupe
  2013-05-21 18:28 [PATCH 0/4] btrfs: offline dedupe v1 Mark Fasheh
@ 2013-05-21 18:28 ` Mark Fasheh
  2013-05-24 14:05   ` David Sterba
  0 siblings, 1 reply; 28+ messages in thread
From: Mark Fasheh @ 2013-05-21 18:28 UTC (permalink / raw)
  To: linux-btrfs
  Cc: Chris Mason, Josef Bacik, Gabriel de Perthuis, David Sterba, Mark Fasheh

This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to
de-duplicate a list of extents across a range of files.

Internally, the ioctl re-uses code from the clone ioctl. This avoids
rewriting a large chunk of extent handling code.

Userspace passes in an array of (file, offset) pairs along with a length
argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison
of the user data before deduping the extent. Status and number of bytes
deduped are returned for each operation.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
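
Not part of the patch, just an illustration: a minimal userspace sketch of
driving the ioctl, assuming the new definitions below end up visible through
an updated linux/btrfs.h uapi header:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* sketch only: dedupe the first 'len' bytes of dst_path against src_path */
static int dedupe_pair(const char *src_path, const char *dst_path, __u64 len)
{
	struct btrfs_ioctl_same_args *args;
	int src_fd, dst_fd, ret = -1;

	src_fd = open(src_path, O_RDONLY);
	dst_fd = open(dst_path, O_RDONLY);
	args = calloc(1, sizeof(*args) + sizeof(args->info[0]));
	if (src_fd < 0 || dst_fd < 0 || !args)
		goto out;

	args->logical_offset = 0;
	args->length = len;
	args->dest_count = 1;
	args->info[0].fd = dst_fd;
	args->info[0].logical_offset = 0;

	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret == 0)
		printf("status %d, deduped %llu bytes\n",
		       args->info[0].status,
		       (unsigned long long)args->info[0].bytes_deduped);
out:
	free(args);
	if (src_fd >= 0)
		close(src_fd);
	if (dst_fd >= 0)
		close(dst_fd);
	return ret;
}
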
 fs/btrfs/ioctl.c           | 290 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/btrfs.h |  27 +++++
 2 files changed, 317 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e90c519..54fcb90 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -57,6 +57,9 @@
 #include "send.h"
 #include "dev-replace.h"
 
+static int btrfs_clone(struct inode *src, struct inode *inode,
+		       u64 off, u64 olen, u64 olen_aligned, u64 destoff);
+
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
 {
@@ -2456,6 +2459,61 @@ out:
 	return ret;
 }
 
+static noinline int fill_data(struct inode *inode, u64 off, u64 len,
+			      char **cur_buffer)
+{
+	struct page *page;
+	void *addr;
+	char *buffer;
+	pgoff_t index;
+	pgoff_t last_index;
+	int ret = 0;
+	int bytes_copied = 0;
+	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+
+	buffer = kmalloc(len, GFP_NOFS);
+	if (!buffer)
+		return -ENOMEM;
+
+	index = off >> PAGE_CACHE_SHIFT;
+	last_index = (off + len - 1) >> PAGE_CACHE_SHIFT;
+
+	while (index <= last_index) {
+		page = grab_cache_page(inode->i_mapping, index);
+		if (!page) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		if (!PageUptodate(page)) {
+			extent_read_full_page_nolock(tree, page,
+						     btrfs_get_extent, 0);
+			lock_page(page);
+			if (!PageUptodate(page)) {
+				unlock_page(page);
+				page_cache_release(page);
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+
+		addr = kmap(page);
+		memcpy(buffer + bytes_copied, addr, PAGE_CACHE_SIZE);
+		kunmap(page);
+		unlock_page(page);
+		page_cache_release(page);
+		bytes_copied += PAGE_CACHE_SIZE;
+		index++;
+	}
+
+	*cur_buffer = buffer;
+
+out:
+	if (ret)
+		kfree(buffer);
+	return ret;
+}
+
 static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 {
 	/* do any pending delalloc/csum calc on src, one way or
@@ -2476,6 +2534,236 @@ static inline void lock_extent_range(struct inode *inode, u64 off, u64 len)
 	}
 }
 
+static void btrfs_double_unlock(struct inode *inode1, u64 loff1,
+				struct inode *inode2, u64 loff2, u64 len)
+{
+	unlock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1);
+	unlock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
+
+	mutex_unlock(&inode1->i_mutex);
+	mutex_unlock(&inode2->i_mutex);
+}
+
+static void btrfs_double_lock(struct inode *inode1, u64 loff1,
+			      struct inode *inode2, u64 loff2, u64 len)
+{
+	if (inode1 < inode2) {
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode1, loff1, len);
+		lock_extent_range(inode2, loff2, len);
+	} else {
+		mutex_lock_nested(&inode2->i_mutex, I_MUTEX_PARENT);
+		mutex_lock_nested(&inode1->i_mutex, I_MUTEX_CHILD);
+		lock_extent_range(inode2, loff2, len);
+		lock_extent_range(inode1, loff1, len);
+	}
+}
+
+static int btrfs_extent_same(struct inode *src, u64 loff, u64 len,
+			     struct inode *dst, u64 dst_loff)
+{
+	char *orig_buffer = NULL;
+	char *dst_inode_buffer = NULL;
+	int ret;
+
+	/*
+	 * btrfs_clone() can't handle extents in the same file
+	 * yet. Once that works, we can drop this check and replace it
+	 * with a check for the same inode, but overlapping extents.
+	 */
+	if (src == dst)
+		return -EINVAL;
+
+	btrfs_double_lock(src, loff, dst, dst_loff, len);
+
+	ret = fill_data(src, loff, len, &orig_buffer);
+	if (ret) {
+		printk(KERN_ERR "btrfs: unable to populate source data "
+		       "buffer.\n");
+		goto out;
+	}
+
+	ret = fill_data(dst, dst_loff, len, &dst_inode_buffer);
+	if (ret) {
+		printk(KERN_ERR "btrfs: unable to populate destination data "
+		       "buffer.\n");
+		goto out;
+	}
+
+	ret = memcmp(orig_buffer, dst_inode_buffer, len);
+	if (ret) {
+		ret = BTRFS_SAME_DATA_DIFFERS;
+		printk(KERN_ERR "btrfs: data for inode %lu does not "
+		       "match\n", dst->i_ino);
+		goto out;
+	}
+
+	ret = btrfs_clone(src, dst, loff, len, len, dst_loff);
+
+out:
+	btrfs_double_unlock(src, loff, dst, dst_loff, len);
+
+	kfree(dst_inode_buffer);
+	kfree(orig_buffer);
+	return ret;
+}
+
+#define BTRFS_MAX_DEDUPE_LEN	(16 * 1024 * 1024)
+#define BTRFS_ONE_DEDUPE_LEN	(1 * 1024 * 1024)
+
+static long btrfs_ioctl_file_extent_same(struct file *file,
+					 void __user *argp)
+{
+	struct btrfs_ioctl_same_args *args;
+	struct btrfs_ioctl_same_args tmp;
+	struct btrfs_ioctl_same_extent_info *info;
+	struct inode *src = file->f_dentry->d_inode;
+	struct file *dst_file = NULL;
+	struct inode *dst;
+	u64 off;
+	u64 len;
+	int args_size;
+	int i;
+	int ret;
+	u64 bs = BTRFS_I(src)->root->fs_info->sb->s_blocksize;
+
+	if (copy_from_user(&tmp,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   sizeof(tmp)))
+		return -EFAULT;
+
+	args_size = sizeof(tmp) + (tmp.dest_count *
+			sizeof(struct btrfs_ioctl_same_extent_info));
+
+	/* Keep size of ioctl argument sane */
+	if (args_size > PAGE_CACHE_SIZE)
+		return -E2BIG;
+
+	args = kmalloc(args_size, GFP_NOFS);
+	if (!args)
+		return -ENOMEM;
+
+	ret = -EFAULT;
+	if (copy_from_user(args,
+			   (struct btrfs_ioctl_same_args __user *)argp,
+			   args_size))
+		goto out;
+	/* Make sure args didn't change magically between copies. */
+	if (memcmp(&tmp, args, sizeof(tmp)))
+		goto out;
+
+	if ((sizeof(tmp) + (sizeof(*info) * args->dest_count)) > args_size)
+		goto out;
+
+	/* pre-format 'out' fields to sane default values */
+	for (i = 0; i < args->dest_count; i++) {
+		info = &args->info[i];
+		info->bytes_deduped = 0;
+		info->status = 0;
+	}
+
+	off = args->logical_offset;
+	len = args->length;
+
+	/*
+	 * Limit the total length we will dedupe for each operation. 
+	 * This is intended to bound the entire ioctl to something sane.
+	 */
+	if (len > BTRFS_MAX_DEDUPE_LEN)
+		len = BTRFS_MAX_DEDUPE_LEN;
+
+	ret = -EINVAL;
+	if (off + len > src->i_size || off + len < off)
+		goto out;
+	if (!IS_ALIGNED(off, bs) || !IS_ALIGNED(off + len, bs))
+		goto out;
+
+	ret = -EISDIR;
+	if (S_ISDIR(src->i_mode))
+		goto out;
+
+	ret = 0;
+	for (i = 0; i < args->dest_count; i++) {
+		u64 dest_off;
+		u64 src_off;
+		u64 op_len;
+
+		info = &args->info[i];
+
+		dst_file = fget(info->fd);
+		if (!dst_file) {
+			printk(KERN_ERR "btrfs: invalid fd %lld\n", info->fd);
+			info->status = -EBADF;
+			continue;
+		}
+
+		dst = dst_file->f_dentry->d_inode;
+		if (S_ISDIR(dst->i_mode)) {
+			printk(KERN_ERR "btrfs: file is dir %lld\n", info->fd);
+			info->status = -EISDIR;
+			goto next;
+		}
+
+		info->status = -EINVAL;
+		if (dst == src) {
+			printk(KERN_ERR "btrfs: file dup %lld\n", info->fd);
+			goto next;
+		}
+
+		dest_off = info->logical_offset;
+
+		if (dest_off + len > dst->i_size || dest_off + len < dest_off)
+			goto next;
+		if (!IS_ALIGNED(dest_off, bs))
+			goto next;
+
+		/*
+		 * The purpose of this loop is to limit the number of
+		 * bytes we dedupe during a single call to
+		 * btrfs_extent_same().
+		 *
+		 * In order to memcmp the data we have to allocate a
+		 * pair of buffers. We don't want to allocate too
+		 * large a buffer, so limiting the size for each
+		 * dedupe is an easy way to do this.
+		 */
+		src_off = off;
+		op_len = len;
+		while (op_len) {
+			u64 tmp_len;
+
+			tmp_len = op_len;
+			if (op_len > BTRFS_ONE_DEDUPE_LEN)
+				tmp_len = BTRFS_ONE_DEDUPE_LEN;
+
+			info->status = btrfs_extent_same(src, src_off, tmp_len,
+							 dst, dest_off);
+			if (info->status == 0) {
+				info->bytes_deduped += tmp_len;
+			} else
+				break;
+
+			dest_off += tmp_len;
+			src_off += tmp_len;
+			op_len -= tmp_len;
+		}
+
+next:
+		fput(dst_file);
+		dst_file = NULL;
+	}
+
+	if (copy_to_user(argp, args, args_size))
+		ret = -EFAULT;
+
+out:
+	if (dst_file)
+		fput(dst_file);
+	kfree(args);
+	return ret;
+}
+
 /**
  * btrfs_clone() - clone a range from inode file to another
  *
@@ -4151,6 +4439,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_FILE_EXTENT_SAME:
+		return btrfs_ioctl_file_extent_same(file, argp);
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index fa3a5f9..5465bc2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -305,6 +305,31 @@ struct btrfs_ioctl_clone_range_args {
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+#define BTRFS_SAME_DATA_DIFFERS	1
+/* For extent-same ioctl */
+struct btrfs_ioctl_same_extent_info {
+	__s64 fd;		/* in - destination file */
+	__u64 logical_offset;	/* in - start of extent in destination */
+	__u64 bytes_deduped;	/* out - total # of bytes we were able
+				 * to dedupe from this file */
+	/* status of this dedupe operation:
+	 * 0 if dedup succeeds
+	 * < 0 for error
+	 * == BTRFS_SAME_DATA_DIFFERS if data differs
+	 */
+	__s32 status;		/* out - see above description */
+	__u32 reserved;
+};
+
+struct btrfs_ioctl_same_args {
+	__u64 logical_offset;	/* in - start of extent in source */
+	__u64 length;		/* in - length of extent */
+	__u16 dest_count;	/* in - total elements in info array */
+	__u16 reserved1;
+	__u32 reserved2;
+	struct btrfs_ioctl_same_extent_info info[0];
+};
+
 struct btrfs_ioctl_space_info {
 	__u64 flags;
 	__u64 total_bytes;
@@ -510,5 +535,7 @@ struct btrfs_ioctl_send_args {
 				      struct btrfs_ioctl_get_dev_stats)
 #define BTRFS_IOC_DEV_REPLACE _IOWR(BTRFS_IOCTL_MAGIC, 53, \
 				    struct btrfs_ioctl_dev_replace_args)
+#define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
+					 struct btrfs_ioctl_same_args)
 
 #endif /* _UAPI_LINUX_BTRFS_H */
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2013-04-16 22:15 [PATCH 0/4] [RFC] btrfs: offline dedupe Mark Fasheh
2013-04-16 22:15 ` [PATCH 1/4] btrfs: abtract out range locking in clone ioctl() Mark Fasheh
2013-04-16 22:15 ` [PATCH 2/4] btrfs_ioctl_clone: Move clone code into it's own function Mark Fasheh
2013-04-16 22:15 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Mark Fasheh
2013-05-06 12:38   ` David Sterba
2013-04-16 22:15 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
2013-05-06 12:36   ` David Sterba
2013-04-16 22:50 ` [PATCH 0/4] [RFC] " Marek Otahal
2013-04-16 23:17   ` Mark Fasheh
2013-04-17  4:00     ` Liu Bo
2013-04-20 15:49 ` Gabriel de Perthuis
2013-04-21 20:02   ` Mark Fasheh
2013-04-22  8:56     ` Gabriel de Perthuis
2013-05-07  7:33 ` [PATCH 3/4] btrfs: Introduce extent_read_full_page_nolock() Gabriel de Perthuis
2013-05-09 21:31 ` Gabriel de Perthuis
2013-05-21 18:28 [PATCH 0/4] btrfs: offline dedupe v1 Mark Fasheh
2013-05-21 18:28 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
2013-05-24 14:05   ` David Sterba
2013-05-24 18:17     ` Mark Fasheh
2013-05-24 19:50       ` Gabriel de Perthuis
2013-05-24 22:38         ` Mark Fasheh
2013-05-24 23:36           ` Gabriel de Perthuis
2013-06-11 20:31 [PATCH 0/4] btrfs: offline dedupe v2 Mark Fasheh
2013-06-11 20:31 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
2013-07-15 20:55   ` Zach Brown
2013-07-17  0:14     ` Gabriel de Perthuis
2013-07-26 16:30 [PATCH 0/4] btrfs: offline dedupe v3 Mark Fasheh
2013-07-26 16:30 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
2013-07-26 22:09   ` Zach Brown
2013-08-06 18:42 [PATCH 0/4] btrfs: out-of-band (aka offline) dedupe v4 Mark Fasheh
2013-08-06 18:42 ` [PATCH 4/4] btrfs: offline dedupe Mark Fasheh
2013-08-06 19:11   ` Zach Brown
