* [PATCH v11 00/13] Btrfs dedupe framework
@ 2016-06-15  2:09 Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
                   ` (14 more replies)
  0 siblings, 15 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160524

In this update, the patchset goes through another re-organization, along
with other fixes, to address comments from the community.
1) Move on-disk backend and dedupe props out of the patchset
   Suggested by David.
   There is still some discussion on the on-disk format.
   And dedupe prop is still not 100% determined.

   So it's better to focus on the current in-memory backend only, which
   doesn't bring any on-disk format change.

   Once the framework is done, new backends and props can be added more
   easily.

2) Better enable/disable and buffered write race avoidance
   Inspired by Mark.
   Although we didn't trigger it with our test case in the previous
   version, if we manually add a 5s delay to __btrfs_buffered_write(),
   it's possible to trigger the disable vs. buffered write race.

   The cause is a window between __btrfs_buffered_write() and
   btrfs_dirty_pages().
   In that window, sync_filesystem() can return very quickly since there
   are no dirty pages, so dedupe disable can start and finish, and the
   buffered writer may then dereference the NULL dedupe info pointer.

   Now we use sb->s_writers.rw_sem to wait for all current writers and
   block further writers, then sync the fs, change the dedupe status and
   finally unblock writers (like freeze).
   This provides clearer logic and code, and is safer than the previous
   method, because there is no window before we dirty pages.

3) Fix the ENOSPC problem with a better solution.
   Pointed out by Josef.
   The last 2 patches from Wang fix the ENOSPC problem with a more
   comprehensive method for delalloc metadata reservation, along with a
   small outstanding_extents improvement to cooperate with the tunable
   max extent size.

Now the whole patchset only adds the in-memory backend.
No other backend nor prop, so we can focus on the framework itself.

Next version will focus on ioctl interface modification suggested by
David.

Thanks,
Qu

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug.
  Add handling for the multiple-hashes-on-same-bytenr corner case to
  fix a transaction abort error.
  Increase the dedup rate by enhancing the delayed ref handler for both
  backends.
  Move dedup_add() to run_delayed_ref() time, to fix a transaction
  abort error.
  Increase the dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to
  save some code.
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameter.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Add back a missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
v11:
  Remove other backend and props support to focus on the framework and
  in-memory backend. Suggested by David.
  Better disable and buffered write race protection.
  Comprehensive fix to dedupe metadata ENOSPC problem.

Qu Wenruo (3):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON

Wang Xiaoguang (10):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: improve inode's outstanding_extents computation
  btrfs: dedupe: fix false ENOSPC

 fs/btrfs/Makefile           |   2 +-
 fs/btrfs/ctree.h            |  25 +-
 fs/btrfs/dedupe.c           | 710 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h           | 210 +++++++++++++
 fs/btrfs/delayed-ref.c      |  30 +-
 fs/btrfs/delayed-ref.h      |   8 +
 fs/btrfs/disk-io.c          |   4 +
 fs/btrfs/extent-tree.c      |  83 +++++-
 fs/btrfs/extent_io.c        |  63 +++-
 fs/btrfs/extent_io.h        |  15 +-
 fs/btrfs/file.c             |  26 +-
 fs/btrfs/free-space-cache.c |   5 +-
 fs/btrfs/inode-map.c        |   4 +-
 fs/btrfs/inode.c            | 434 ++++++++++++++++++++++-----
 fs/btrfs/ioctl.c            |  80 ++++-
 fs/btrfs/ordered-data.c     |  46 ++-
 fs/btrfs/ordered-data.h     |  14 +
 fs/btrfs/relocation.c       |  46 ++-
 fs/btrfs/sysfs.c            |   2 +
 include/uapi/linux/btrfs.h  |  41 +++
 20 files changed, 1701 insertions(+), 147 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c
 create mode 100644 fs/btrfs/dedupe.h

-- 
2.8.3




^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 02/13] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the header for the btrfs in-band (write time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
backends and 1 dedupe hash algorithm.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h           |   7 +++
 fs/btrfs/dedupe.h          | 149 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c         |   1 +
 include/uapi/linux/btrfs.h |  16 +++++
 4 files changed, 173 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..8f70f53d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1091,6 +1091,13 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	/*
+	 * Inband de-duplication related structures
+	 */
+	unsigned long dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 0000000..d7b1a77
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,149 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include <linux/btrfs.h>
+#include <linux/wait.h>
+#include <crypto/hash.h>
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_type;
+
+	struct crypto_shash *dedupe_driver;
+
+	/*
+	 * Use a mutex to protect both backends.
+	 * Even for in-memory backends, the rb-tree can be quite large,
+	 * so a mutex is better for such a use case.
+	 */
+	struct mutex lock;
+
+	/* following members are only used in in-memory backend */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedupe info
+ * Called at dedupe enable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (from unsupported param to tree creation error for some backends)
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ *
+ * Return 0 for success
+ * Return <0 for any error
+ * (error from hash codes)
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss; nothing is done
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/*
+ * Add a dedupe hash into dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/*
+ * Remove a dedupe hash from dedupe info
+ * Return 0 for success
+ * Return <0 for any error
+ * (tree operation error for some backends)
+ *
+ * NOTE: if a hash deletion error is not handled well, it will lead
+ * to a corrupted fs, as a later dedupe write can point to a
+ * non-existent or even wrong extent.
+ */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr);
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1142127..dccd608 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2593,6 +2593,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
+	mutex_init(&fs_info->dedupe_ioctl_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2bdd1e3..1279bf0 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -619,6 +619,22 @@ struct btrfs_ioctl_get_dev_stats {
 	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
 };
 
+/* In-band dedupe related */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY		0
+#define BTRFS_DEDUPE_BACKEND_ONDISK		1
+
+/* Only support inmemory yet, so count is still only 1 */
+#define BTRFS_DEDUPE_BACKEND_COUNT		1
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256		0
+
+
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
-- 
2.8.3





* [PATCH v11 02/13] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/Makefile          |   2 +-
 fs/btrfs/dedupe.c          | 160 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h          |  13 +++-
 include/uapi/linux/btrfs.h |   2 +
 4 files changed, 174 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o free-space-tree.o
+	   uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 0000000..941ee37
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,160 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+	struct rb_node hash_node;
+	struct rb_node bytenr_node;
+	struct list_head lru_list;
+
+	u64 bytenr;
+	u32 num_bytes;
+
+	u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+			    u16 backend, u64 blocksize, u64 limit)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+	if (!dedupe_info)
+		return -ENOMEM;
+
+	dedupe_info->hash_type = type;
+	dedupe_info->backend = backend;
+	dedupe_info->blocksize = blocksize;
+	dedupe_info->limit_nr = limit;
+
+	/* only support SHA256 yet */
+	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(dedupe_info->dedupe_driver)) {
+		int ret;
+
+		ret = PTR_ERR(dedupe_info->dedupe_driver);
+		kfree(dedupe_info);
+		return ret;
+	}
+
+	dedupe_info->hash_root = RB_ROOT;
+	dedupe_info->bytenr_root = RB_ROOT;
+	dedupe_info->current_nr = 0;
+	INIT_LIST_HEAD(&dedupe_info->lru_list);
+	mutex_init(&dedupe_info->lock);
+
+	*ret_info = dedupe_info;
+	return 0;
+}
+
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
+				  u16 backend, u64 blocksize, u64 limit_nr,
+				  u64 limit_mem, u64 *ret_limit)
+{
+	if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+	    blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+	    blocksize < fs_info->tree_root->sectorsize ||
+	    !is_power_of_2(blocksize))
+		return -EINVAL;
+	/*
+	 * For new backend and hash types, we return a special error code
+	 * as they can be easily extended.
+	 */
+	if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+		return -EOPNOTSUPP;
+	if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+		return -EOPNOTSUPP;
+
+	/* Backend specific check */
+	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		if (!limit_nr && !limit_mem)
+			*ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+		else {
+			u64 tmp = (u64)-1;
+
+			if (limit_mem) {
+				tmp = limit_mem / (sizeof(struct inmem_hash) +
+					btrfs_dedupe_hash_size(hash_type));
+				/* Too small limit_mem to fill a hash item */
+				if (!tmp)
+					return -EINVAL;
+			}
+			if (!limit_nr)
+				limit_nr = (u64)-1;
+
+			*ret_limit = min(tmp, limit_nr);
+		}
+	}
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		*ret_limit = 0;
+	return 0;
+}
+
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr, u64 limit_mem)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	u64 limit = 0;
+	int ret = 0;
+
+	/* only one limit is accepted for enable */
+	if (limit_nr && limit_mem)
+		return -EINVAL;
+
+	ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
+				     limit_nr, limit_mem, &limit);
+	if (ret < 0)
+		return ret;
+
+	dedupe_info = fs_info->dedupe_info;
+	if (dedupe_info) {
+		/* Check if we are re-enabling with a different dedupe config */
+		if (dedupe_info->blocksize != blocksize ||
+		    dedupe_info->hash_type != type ||
+		    dedupe_info->backend != backend) {
+			btrfs_dedupe_disable(fs_info);
+			goto enable;
+		}
+
+		/* An on-the-fly limit change is OK */
+		mutex_lock(&dedupe_info->lock);
+		fs_info->dedupe_info->limit_nr = limit;
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+enable:
+	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
+	if (ret < 0)
+		return ret;
+	fs_info->dedupe_info = dedupe_info;
+	/* We must ensure dedupe_enabled is set after dedupe_info */
+	smp_wmb();
+	fs_info->dedupe_enabled = 1;
+	return ret;
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	/* Place holder for bisect, will be implemented in later patches */
+	return 0;
+}
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index d7b1a77..9162d2c 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -68,8 +68,17 @@ static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 	return (hash && hash->bytenr);
 }
 
-int btrfs_dedupe_hash_size(u16 type);
-struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+static inline int btrfs_dedupe_hash_size(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return -EINVAL;
+	return sizeof(struct btrfs_dedupe_hash) + btrfs_dedupe_sizes[type];
+}
+
+static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
+{
+	return kzalloc(btrfs_dedupe_hash_size(type), GFP_NOFS);
+}
 
 /*
 * Initialize inband dedupe info
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 1279bf0..bc3416e 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -634,6 +634,8 @@ struct btrfs_ioctl_get_dev_stats {
 /* Hash algorithm, only support SHA256 yet */
 #define BTRFS_DEDUPE_HASH_SHA256		0
 
+/* Default dedupe limit on number of hash */
+#define BTRFS_DEDUPE_LIMIT_NR_DEFAULT	(32 * 1024)
 
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
-- 
2.8.3





* [PATCH v11 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 02/13] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 04/13] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce static function inmem_add() to add hash into in-memory tree.
And now we can implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/dedupe.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 941ee37..be83aca 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
 	u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return NULL;
+	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+			GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -158,3 +166,146 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	/* Place holder for bisect, will be implemented in later patches */
 	return 0;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+			     struct inmem_hash *hash, int hash_len)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->hash_node, parent, p);
+	rb_insert_color(&hash->hash_node, root);
+	return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+			       struct inmem_hash *hash)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+		if (hash->bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->bytenr_node, parent, p);
+	rb_insert_color(&hash->bytenr_node, root);
+	return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+			struct inmem_hash *hash)
+{
+	list_del(&hash->lru_list);
+	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+	if (!WARN_ON(dedupe_info->current_nr == 0))
+		dedupe_info->current_nr--;
+
+	kfree(hash);
+}
+
+/*
+ * Insert a hash into in-memory dedupe tree
+ * Removes the least recently used hash once the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	int ret = 0;
+	u16 type = dedupe_info->hash_type;
+	struct inmem_hash *ihash;
+
+	ihash = inmem_alloc_hash(type);
+
+	if (!ihash)
+		return -ENOMEM;
+
+	/* Copy the data out */
+	ihash->bytenr = hash->bytenr;
+	ihash->num_bytes = hash->num_bytes;
+	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+	if (ret > 0) {
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+				btrfs_dedupe_sizes[type]);
+	if (ret > 0) {
+		/*
+		 * We only keep one hash in the tree to save memory, so
+		 * if hashes conflict, free the one we were inserting.
+		 */
+		rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	list_add(&ihash->lru_list, &dedupe_info->lru_list);
+	dedupe_info->current_nr++;
+
+	/* Remove the least recently used hashes while we exceed the limit */
+	while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+		struct inmem_hash *last;
+
+		last = list_entry(dedupe_info->lru_list.prev,
+				  struct inmem_hash, lru_list);
+		__inmem_del(dedupe_info, last);
+	}
+out:
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(!btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	/* ignore old hash */
+	if (dedupe_info->blocksize != hash->num_bytes)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_add(dedupe_info, hash);
+	return -EINVAL;
+}
-- 
2.8.3





* [PATCH v11 04/13] btrfs: dedupe: Introduce function to remove hash from in-memory tree
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (2 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang, Mark Fasheh

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce static function inmem_del() to remove a hash from the
in-memory dedupe tree, and implement the btrfs_dedupe_del() and
btrfs_dedupe_disable() interfaces.

For btrfs_dedupe_disable(), also add new functions to wait for existing
writers and block incoming ones, to eliminate all possible races.

Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 126 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index be83aca..960b039 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -161,12 +161,6 @@ enable:
 	return ret;
 }
 
-int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
-{
-	/* Place holder for bisect, will be implemented in later patches */
-	return 0;
-}
-
 static int inmem_insert_hash(struct rb_root *root,
 			     struct inmem_hash *hash, int hash_len)
 {
@@ -309,3 +303,129 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		return inmem_add(dedupe_info, hash);
 	return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+		if (bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct inmem_hash *hash;
+
+	mutex_lock(&dedupe_info->lock);
+	hash = inmem_search_bytenr(dedupe_info, bytenr);
+	if (!hash) {
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+	__inmem_del(dedupe_info, hash);
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_del(dedupe_info, bytenr);
+	return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+	struct inmem_hash *entry, *tmp;
+
+	mutex_lock(&dedupe_info->lock);
+	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+		__inmem_del(dedupe_info, entry);
+	mutex_unlock(&dedupe_info->lock);
+}
+
+/*
+ * Helper function to wait and block all incoming writers
+ *
+ * Use the rw_sem introduced for freeze to wait for/block writers,
+ * so while writers are blocked no new write will happen and we can
+ * do things quite safely; this is especially helpful for dedupe
+ * disable, as it affects buffered writes.
+ */
+static void block_all_writers(struct btrfs_fs_info *fs_info)
+{
+	struct super_block *sb = fs_info->sb;
+
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+	down_write(&sb->s_umount);
+}
+
+static void unblock_all_writers(struct btrfs_fs_info *fs_info)
+{
+	struct super_block *sb = fs_info->sb;
+
+	up_write(&sb->s_umount);
+	percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	int ret;
+
+	dedupe_info = fs_info->dedupe_info;
+
+	if (!dedupe_info)
+		return 0;
+
+	/* Don't allow disable status change in RO mount */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	/*
+	 * Wait for all unfinished writers and block further writers.
+	 * Then sync the whole fs so all current write will go through
+	 * dedupe, and all later write won't go through dedupe.
+	 */
+	block_all_writers(fs_info);
+	ret = sync_filesystem(fs_info->sb);
+	fs_info->dedupe_enabled = 0;
+	fs_info->dedupe_info = NULL;
+	unblock_all_writers(fs_info);
+	if (ret < 0)
+		return ret;
+
+	/* now we are OK to clean up everything */
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
-- 
2.8.3

* [PATCH v11 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (3 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 04/13] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 06/13] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs

For in-band dedupe, btrfs needs to increase a data ref while the
delayed_refs spinlock is held, so add a new function,
btrfs_add_delayed_data_ref_locked(), to increase an extent ref with
delayed_refs->lock already held.
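The locked/unlocked API split described above is a common kernel pattern: a
helper that requires the caller to hold the lock, plus a wrapper that acquires
it around the call. A minimal userspace sketch of the pattern (the boolean
"lock" and all names are illustrative, not the btrfs API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: the split between a *_locked() helper that
 * requires the caller to hold the lock and a wrapper that takes it,
 * mirroring btrfs_add_delayed_data_ref_locked() vs.
 * btrfs_add_delayed_data_ref(). */
struct ref_store {
	bool locked;	/* stands in for delayed_refs->lock */
	long refs;
};

/* Caller must already hold store->locked. */
static void add_ref_locked(struct ref_store *store, long count)
{
	assert(store->locked);	/* lockdep-style contract check */
	store->refs += count;
}

/* Convenience wrapper: takes the lock, inserts, drops the lock. */
static void add_ref(struct ref_store *store, long count)
{
	store->locked = true;
	add_ref_locked(store, count);
	store->locked = false;
}
```

The benefit is that a caller already inside the critical section (as
inmem_search() is) can insert without dropping and retaking the lock.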

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/delayed-ref.c | 30 +++++++++++++++++++++++-------
 fs/btrfs/delayed-ref.h |  8 ++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and qrecord.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action)
+{
+	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+			qrecord, bytenr, num_bytes, ref_root, reserved,
+			action, 1);
+	add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+			num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-					bytenr, num_bytes, ref_root, reserved,
-					action, 1);
-
-	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-				   num_bytes, parent, ref_root, owner, offset,
-				   action);
+	btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+			bytenr, num_bytes, parent, ref_root, owner, offset,
+			reserved, action);
 	spin_unlock(&delayed_refs->lock);
 
 	return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 5fca953..5830341 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 	}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes, u64 parent,
 			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes,
-- 
2.8.3

* [PATCH v11 06/13] btrfs: dedupe: Introduce function to search for an existing hash
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (4 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce static function inmem_search() to handle the job for in-memory
hash tree.

The trick is that we must ensure the delayed ref head is not being run
at the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.
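The "not being run" requirement forces a lock-ordering dance in
inmem_search(): drop the hash-tree lock, lock the ref head, retake the
hash-tree lock, then search again, since the hash may have vanished in the
window. A minimal userspace sketch of that drop/relock/revalidate shape (flag
variables stand in for the real locks; all names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the inmem_search() locking dance: we may not take
 * the ref-head mutex while holding the hash lock (ABBA risk), so we
 * drop the hash lock, take the head lock, retake the hash lock and
 * re-search to confirm the entry survived the unlocked window. */
struct hash_entry {
	unsigned long long bytenr;
	bool present;	/* entry may vanish while locks are dropped */
};

/* Pretend hash-tree lookup: returns the entry only if still present. */
static struct hash_entry *search(struct hash_entry *e)
{
	return e->present ? e : NULL;
}

/* Returns 1 on a validated hit, 0 if the entry vanished. */
static int search_with_revalidate(struct hash_entry *e,
				  bool *hash_lock, bool *head_lock)
{
	struct hash_entry *found;

	*hash_lock = true;
	found = search(e);
	if (!found) {
		*hash_lock = false;
		return 0;
	}
	/* Drop the hash lock before taking the head lock: avoids ABBA. */
	*hash_lock = false;
	*head_lock = true;
	*hash_lock = true;
	/* The entry may have been deleted in the window; search again. */
	found = search(e);
	*hash_lock = false;
	*head_lock = false;
	return found ? 1 : 0;
}
```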

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/dedupe.c | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 960b039..867f481 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -429,3 +430,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	kfree(dedupe_info);
 	return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+	struct rb_node **p = &dedupe_info->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+	u16 hash_type = dedupe_info->hash_type;
+	int hash_len = btrfs_dedupe_sizes[hash_type];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+		if (memcmp(hash, entry->hash, hash_len) < 0) {
+			p = &(*p)->rb_left;
+		} else if (memcmp(hash, entry->hash, hash_len) > 0) {
+			p = &(*p)->rb_right;
+		} else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &dedupe_info->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
+	struct btrfs_delayed_ref_head *insert_head;
+	struct btrfs_delayed_data_ref *insert_dref;
+	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+	struct inmem_hash *found_hash;
+	int free_insert = 1;
+	u64 bytenr;
+	u32 num_bytes;
+
+	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+	if (!insert_head)
+		return -ENOMEM;
+	insert_head->extent_op = NULL;
+	insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+	if (!insert_dref) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		return -ENOMEM;
+	}
+	if (root->fs_info->quota_enabled &&
+	    is_fstree(root->root_key.objectid)) {
+		insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+		if (!insert_qrecord) {
+			kmem_cache_free(btrfs_delayed_ref_head_cachep,
+					insert_head);
+			kmem_cache_free(btrfs_delayed_data_ref_cachep,
+					insert_dref);
+			return -ENOMEM;
+		}
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto free_mem;
+	}
+
+again:
+	mutex_lock(&dedupe_info->lock);
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	/* If we don't find a duplicated extent, just return. */
+	if (!found_hash) {
+		ret = 0;
+		goto out;
+	}
+	bytenr = found_hash->bytenr;
+	num_bytes = found_hash->num_bytes;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	spin_lock(&delayed_refs->lock);
+	head = btrfs_find_delayed_ref_head(trans, bytenr);
+	if (!head) {
+		/*
+		 * We can safely insert a new delayed_ref as long as we
+		 * hold delayed_refs->lock.
+		 * Only need to use atomic inc_extent_ref()
+		 */
+		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+				insert_dref, insert_head, insert_qrecord,
+				bytenr, num_bytes, 0, root->root_key.objectid,
+				btrfs_ino(inode), file_pos, 0,
+				BTRFS_ADD_DELAYED_REF);
+		spin_unlock(&delayed_refs->lock);
+
+		/* add_delayed_data_ref_locked will free unused memory */
+		free_insert = 0;
+		hash->bytenr = bytenr;
+		hash->num_bytes = num_bytes;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * We can't lock the ref head with dedupe_info->lock held, or we
+	 * would cause an ABBA deadlock.
+	 */
+	mutex_unlock(&dedupe_info->lock);
+	ret = btrfs_delayed_ref_lock(trans, head);
+	spin_unlock(&delayed_refs->lock);
+	if (ret == -EAGAIN)
+		goto again;
+
+	mutex_lock(&dedupe_info->lock);
+	/* Search again to ensure the hash is still here */
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	if (!found_hash) {
+		ret = 0;
+		mutex_unlock(&head->mutex);
+		goto out;
+	}
+	ret = 1;
+	hash->bytenr = bytenr;
+	hash->num_bytes = num_bytes;
+
+	/*
+	 * Increase the extent ref right now, before the delayed ref is run,
+	 * or we may increase the ref on a non-existent extent.
+	 */
+	btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
+			     root->root_key.objectid,
+			     btrfs_ino(inode), file_pos);
+	mutex_unlock(&head->mutex);
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_end_transaction(trans, root);
+
+free_mem:
+	if (free_insert) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		kmem_cache_free(btrfs_delayed_data_ref_cachep, insert_dref);
+		kfree(insert_qrecord);
+	}
+	return ret;
+}
+
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	int ret = -EINVAL;
+
+	if (!hash)
+		return 0;
+
+	/*
+	 * This function doesn't check fs_info->dedupe_enabled, as it must
+	 * ensure any already hashed extent goes through the dedupe routine
+	 */
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		ret = inmem_search(dedupe_info, inode, file_pos, hash);
+
+	/* It's possible hash->bytenr/num_bytes have already changed */
+	if (ret == 0) {
+		hash->num_bytes = 0;
+		hash->bytenr = 0;
+	}
+	return ret;
+}
-- 
2.8.3

* [PATCH v11 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (5 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 06/13] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 08/13] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

For both the in-memory and on-disk dedupe backends, only the SHA256
hash algorithm is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.
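The calculation walks the dedupe block one sector-sized page at a time
through crypto_shash init/update/final. A minimal userspace sketch of that
chunked shape, where a toy FNV-1a checksum stands in for SHA256 (names and
constants are illustrative, not the kernel crypto API):

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace sketch of the calc-hash loop: the kernel code feeds the
 * dedupe block to crypto_shash one sector-sized chunk at a time
 * (init/update/final). A toy FNV-1a checksum stands in for SHA256,
 * purely to show the chunked update shape. */
#define FNV_OFFSET 1469598103934665603ULL
#define FNV_PRIME  1099511628211ULL

static uint64_t fnv1a_update(uint64_t h, const uint8_t *d, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		h ^= d[i];
		h *= FNV_PRIME;
	}
	return h;
}

/* Hash one dedupe block of dedupe_bs bytes, sectorsize at a time. */
static uint64_t hash_dedupe_block(const uint8_t *block, size_t dedupe_bs,
				  size_t sectorsize)
{
	uint64_t h = FNV_OFFSET;	/* like crypto_shash_init() */

	for (size_t off = 0; off < dedupe_bs; off += sectorsize)
		h = fnv1a_update(h, block + off, sectorsize);	/* _update() */
	return h;			/* like crypto_shash_final() */
}
```

Because the checksum is fed sequentially, hashing in sector-sized chunks gives
the same result as one pass over the whole block, which is what lets the
kernel process one mapped page at a time.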

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/dedupe.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 867f481..4c5b3fc 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -614,3 +614,49 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	SHASH_DESC_ON_STACK(sdesc, tfm);
+	u64 dedupe_bs;
+	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	sdesc->tfm = tfm;
+	sdesc->flags = 0;
+	ret = crypto_shash_init(sdesc);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(sdesc, d, sectorsize);
+		kunmap(p);
+		put_page(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(sdesc, hash->hash);
+	return ret;
+}
-- 
2.8.3

* [PATCH v11 08/13] btrfs: ordered-extent: Add support for dedupe
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (6 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 09/13] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles
non-compressed source extents.
Support for compressed source extents will be added later.
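The hash attached to an ordered extent carries three-way semantics: NULL
means no dedupe, bytenr == 0 a dedupe miss (hash to be inserted), bytenr != 0
a hit whose extent ref was already increased. A minimal userspace sketch of
those semantics (illustrative names, not btrfs code):

```c
#include <stddef.h>

/* Illustrative model of the btrfs_ordered_extent->hash semantics. */
enum dedupe_state { DEDUPE_NONE, DEDUPE_MISS, DEDUPE_HIT };

struct dedupe_hash { unsigned long long bytenr; };

static enum dedupe_state ordered_dedupe_state(const struct dedupe_hash *h)
{
	if (!h)
		return DEDUPE_NONE;	/* no deduplication for this extent */
	if (h->bytenr == 0)
		return DEDUPE_MISS;	/* hash will be added to the tree */
	return DEDUPE_HIT;		/* extent ref is ALREADY increased */
}
```

The hit case is why the ordered extent must copy the hash even while dedupe
is being disabled: the already-taken reference has to be accounted for.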

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/ordered-data.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/ordered-data.h | 13 +++++++++++++
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e96634a..7b1fce4 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int dio, int compress_type)
+				      int type, int dio, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,33 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->hash = NULL;
+	/*
+	 * A hash hit means we have already incremented the extent's delayed
+	 * ref.
+	 * We must handle this even if another process is trying to
+	 * turn off dedupe, otherwise we will leak a reference.
+	 */
+	if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+		struct btrfs_dedupe_info *dedupe_info;
+
+		dedupe_info = root->fs_info->dedupe_info;
+		if (WARN_ON(dedupe_info == NULL)) {
+			kmem_cache_free(btrfs_ordered_extent_cache,
+					entry);
+			return -EINVAL;
+		}
+		entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+		if (!entry->hash) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+		entry->hash->bytenr = hash->bytenr;
+		entry->hash->num_bytes = hash->num_bytes;
+		memcpy(entry->hash->hash, hash->hash,
+		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	}
+
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
@@ -250,15 +279,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash)
+{
+	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+					  disk_len, type, 0,
+					  BTRFS_COMPRESS_NONE, hash);
+}
+
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 1,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +304,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type);
+					  compress_type, NULL);
 }
 
 /*
@@ -577,6 +614,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 4515077..8dda4a5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/*
+	 * For in-band deduplication.
+	 * If hash is NULL, no deduplication is done.
+	 * If hash->bytenr is zero, this is a dedupe miss and the hash will
+	 * be added into the dedupe tree.
+	 * If hash->bytenr is non-zero, this is a dedupe hit and the extent
+	 * ref is *ALREADY* increased.
+	 */
+	struct btrfs_dedupe_hash *hash;
 };
 
 /*
@@ -172,6 +182,9 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
 				   int uptodate);
 int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 			     u64 start, u64 len, u64 disk_len, int type);
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash);
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type);
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
-- 
2.8.3

* [PATCH v11 09/13] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (7 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 08/13] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 10/13] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Core implementation of in-band de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to de-duplicate at the extent level.

The work flow is as below:
1) Run delalloc range for an inode
2) Calculate the hash for the delalloc range in units of dedupe_bs
3) On a hash match (duplicate), just increase the source extent ref
   and insert the file extent.
   On a hash miss, go through the normal cow_file_range() fallback,
   and add the hash into the dedupe tree.
   Compression for the hash-miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the memory limit.
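The LRU behavior can be modeled with a minimal userspace sketch: a
most-recently-used-first store where a hit moves the entry to the front and
inserting past the limit evicts the tail. The kernel code pairs a list with an
rb-tree for lookup; the flat array and the limit below are purely
illustrative:

```c
#include <stddef.h>
#include <string.h>

/* Userspace model of the in-memory backend's LRU limit: hashes live in
 * a most-recently-used-first array; a hit moves the entry to the front
 * (as inmem_search_hash() re-adds a hit to the LRU head), and inserting
 * beyond the limit evicts the least recently used tail entry. */
#define LRU_LIMIT 3

struct lru_store {
	unsigned long long bytenr[LRU_LIMIT];
	size_t count;
};

/* Move the entry at index i to the front (most recently used). */
static void lru_touch(struct lru_store *s, size_t i)
{
	unsigned long long v = s->bytenr[i];

	memmove(&s->bytenr[1], &s->bytenr[0], i * sizeof(v));
	s->bytenr[0] = v;
}

/* Insert a hash: a hit refreshes it, a miss may evict the tail. */
static void lru_insert(struct lru_store *s, unsigned long long bytenr)
{
	for (size_t i = 0; i < s->count; i++) {
		if (s->bytenr[i] == bytenr) {
			lru_touch(s, i);
			return;
		}
	}
	if (s->count == LRU_LIMIT)
		s->count--;	/* drop the least recently used entry */
	memmove(&s->bytenr[1], &s->bytenr[0], s->count * sizeof(bytenr));
	s->bytenr[0] = bytenr;
	s->count++;
}
```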

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c |  18 ++++
 fs/btrfs/inode.c       | 257 ++++++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/relocation.c  |  16 +++
 3 files changed, 256 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..e0db77e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2405,6 +2406,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
 	if (btrfs_delayed_ref_is_head(node)) {
 		struct btrfs_delayed_ref_head *head;
+		struct btrfs_fs_info *fs_info = root->fs_info;
+
 		/*
 		 * we've hit the end of the chain and we were supposed
 		 * to insert this extent into the tree.  But, it got
@@ -2419,6 +2422,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 			btrfs_pin_extent(root, node->bytenr,
 					 node->num_bytes, 1);
 			if (head->is_data) {
+				/*
+				 * If insert_reserved is given, it means
+				 * a new extent was reserved, then deleted
+				 * in one transaction, and inc/dec merged to 0.
+				 *
+				 * In this case, we need to remove its dedupe
+				 * hash.
+				 */
+				btrfs_dedupe_del(trans, fs_info, node->bytenr);
 				ret = btrfs_del_csums(trans, root,
 						      node->bytenr,
 						      node->num_bytes);
@@ -6826,6 +6838,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 
 		if (is_data) {
+			ret = btrfs_dedupe_del(trans, info, bytenr);
+			if (ret < 0) {
+				btrfs_abort_transaction(trans, extent_root,
+							ret);
+				goto out;
+			}
 			ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
 			if (ret) {
 				btrfs_abort_transaction(trans, extent_root, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e5558d9..23a725f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 					   u64 len, u64 orig_start,
 					   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
 	struct page **pages;
 	unsigned long nr_pages;
 	int compress_type;
+	struct btrfs_dedupe_hash *hash;
 	struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 				     u64 compressed_size,
 				     struct page **pages,
 				     unsigned long nr_pages,
-				     int compress_type)
+				     int compress_type,
+				     struct btrfs_dedupe_hash *hash)
 {
 	struct async_extent *async_extent;
 
@@ -365,6 +369,7 @@ static noinline int add_async_extent(struct async_cow *cow,
 	async_extent->pages = pages;
 	async_extent->nr_pages = nr_pages;
 	async_extent->compress_type = compress_type;
+	async_extent->hash = hash;
 	list_add_tail(&async_extent->list, &cow->extents);
 	return 0;
 }
@@ -616,7 +621,7 @@ cont:
 		 */
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret,
-				 compress_type);
+				 compress_type, NULL);
 
 		if (start + num_bytes < end) {
 			start += num_bytes;
@@ -641,7 +646,7 @@ cleanup_and_bail_uncompressed:
 		if (redirty)
 			extent_range_redirty_for_io(inode, start, end);
 		add_async_extent(async_cow, start, end - start + 1,
-				 0, NULL, 0, BTRFS_COMPRESS_NONE);
+				 0, NULL, 0, BTRFS_COMPRESS_NONE, NULL);
 		*num_added += 1;
 	}
 
@@ -671,6 +676,38 @@ static void free_async_extent_pages(struct async_extent *async_extent)
 	async_extent->pages = NULL;
 }
 
+static void end_dedupe_extent(struct inode *inode, u64 start,
+			      u32 len, unsigned long page_ops)
+{
+	int i;
+	unsigned nr_pages = len / PAGE_SIZE;
+	struct page *page;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = find_get_page(inode->i_mapping,
+				     start >> PAGE_SHIFT);
+		/* page should be already locked by caller */
+		if (WARN_ON(!page))
+			continue;
+
+		/* We need to do this by ourselves as we skipped IO */
+		if (page_ops & PAGE_CLEAR_DIRTY)
+			clear_page_dirty_for_io(page);
+		if (page_ops & PAGE_SET_WRITEBACK)
+			set_page_writeback(page);
+
+		end_extent_writepage(page, 0, start,
+				     start + PAGE_SIZE - 1);
+		if (page_ops & PAGE_END_WRITEBACK)
+			end_page_writeback(page);
+		if (page_ops & PAGE_UNLOCK)
+			unlock_page(page);
+
+		start += PAGE_SIZE;
+		put_page(page);
+	}
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -687,6 +724,7 @@ static noinline void submit_compressed_extents(struct inode *inode,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	struct extent_io_tree *io_tree;
+	struct btrfs_dedupe_hash *hash;
 	int ret = 0;
 
 again:
@@ -696,6 +734,7 @@ again:
 		list_del(&async_extent->list);
 
 		io_tree = &BTRFS_I(inode)->io_tree;
+		hash = async_extent->hash;
 
 retry:
 		/* did the compression code fall back to uncompressed IO? */
@@ -712,7 +751,8 @@ retry:
 					     async_extent->start,
 					     async_extent->start +
 					     async_extent->ram_size - 1,
-					     &page_started, &nr_written, 0);
+					     &page_started, &nr_written, 0,
+					     hash);
 
 			/* JDM XXX */
 
@@ -722,15 +762,26 @@ retry:
 			 * and IO for us.  Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
-				extent_write_locked_range(io_tree,
-						  inode, async_extent->start,
-						  async_extent->start +
-						  async_extent->ram_size - 1,
-						  btrfs_get_extent,
-						  WB_SYNC_ALL);
-			else if (ret)
+			if (!page_started && !ret) {
+				/* Skip IO for dedupe async_extent */
+				if (btrfs_dedupe_hash_hit(hash))
+					end_dedupe_extent(inode,
+						async_extent->start,
+						async_extent->ram_size,
+						PAGE_CLEAR_DIRTY |
+						PAGE_SET_WRITEBACK |
+						PAGE_END_WRITEBACK |
+						PAGE_UNLOCK);
+				else
+					extent_write_locked_range(io_tree,
+						inode, async_extent->start,
+						async_extent->start +
+						async_extent->ram_size - 1,
+						btrfs_get_extent,
+						WB_SYNC_ALL);
+			} else if (ret)
 				unlock_page(async_cow->locked_page);
+			kfree(hash);
 			kfree(async_extent);
 			cond_resched();
 			continue;
@@ -857,6 +908,7 @@ retry:
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		kfree(hash);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -874,6 +926,7 @@ out_free:
 				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK |
 				     PAGE_SET_ERROR);
 	free_async_extent_pages(async_extent);
+	kfree(hash);
 	kfree(async_extent);
 	goto again;
 }
@@ -927,7 +980,7 @@ static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
 				   unsigned long *nr_written,
-				   int unlock)
+				   int unlock, struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 alloc_hint = 0;
@@ -986,11 +1039,16 @@ static noinline int cow_file_range(struct inode *inode,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
-		ret = btrfs_reserve_extent(root, cur_alloc_size,
+		if (btrfs_dedupe_hash_hit(hash)) {
+			ins.objectid = hash->bytenr;
+			ins.offset = hash->num_bytes;
+		} else {
+			ret = btrfs_reserve_extent(root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
 					   &ins, 1, 1);
-		if (ret < 0)
-			goto out_unlock;
+			if (ret < 0)
+				goto out_unlock;
+		}
 
 		em = alloc_extent_map();
 		if (!em) {
@@ -1027,8 +1085,9 @@ static noinline int cow_file_range(struct inode *inode,
 			goto out_reserve;
 
 		cur_alloc_size = ins.offset;
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size, 0);
+		ret = btrfs_add_ordered_extent_dedupe(inode, start,
+				ins.objectid, cur_alloc_size, ins.offset,
+				0, hash);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1040,7 +1099,14 @@ static noinline int cow_file_range(struct inode *inode,
 				goto out_drop_extent_cache;
 		}
 
-		btrfs_dec_block_group_reservations(root->fs_info, ins.objectid);
+		/*
+		 * A hash hit didn't allocate an extent, so there is no
+		 * need to decrease the bg reservation; otherwise we would
+		 * underflow reservations and block balance.
+		 */
+		if (!btrfs_dedupe_hash_hit(hash))
+			btrfs_dec_block_group_reservations(root->fs_info,
+							   ins.objectid);
 
 		if (disk_num_bytes < cur_alloc_size)
 			break;
@@ -1081,6 +1147,72 @@ out_unlock:
 	goto out;
 }
 
+static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
+			    struct async_cow *async_cow, int *num_added)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct page *locked_page = async_cow->locked_page;
+	u16 hash_algo;
+	u64 actual_end;
+	u64 isize = i_size_read(inode);
+	u64 dedupe_bs;
+	u64 cur_offset = start;
+	int ret = 0;
+
+	actual_end = min_t(u64, isize, end + 1);
+	/* If dedupe is not enabled, don't split extent into dedupe_bs */
+	if (fs_info->dedupe_enabled && dedupe_info) {
+		dedupe_bs = dedupe_info->blocksize;
+		hash_algo = dedupe_info->hash_type;
+	} else {
+		dedupe_bs = SZ_128M;
+		/* Just a dummy value, to avoid a NULL pointer access */
+		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+	}
+
+	while (cur_offset < end) {
+		struct btrfs_dedupe_hash *hash = NULL;
+		u64 len;
+
+		len = min(end + 1 - cur_offset, dedupe_bs);
+		if (len < dedupe_bs)
+			goto next;
+
+		hash = btrfs_dedupe_alloc_hash(hash_algo);
+		if (!hash) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
+		if (ret < 0) {
+			kfree(hash);
+			goto out;
+		}
+
+		ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
+		if (ret < 0) {
+			kfree(hash);
+			goto out;
+		}
+		ret = 0;
+
+next:
+		/* Redirty the locked page if it corresponds to our extent */
+		if (page_offset(locked_page) >= start &&
+		    page_offset(locked_page) <= end)
+			__set_page_dirty_nobuffers(locked_page);
+
+		add_async_extent(async_cow, cur_offset, len, 0, NULL, 0,
+				 BTRFS_COMPRESS_NONE, hash);
+		cur_offset += len;
+		(*num_added)++;
+	}
+out:
+	return ret;
+}
+
 /*
  * work queue call back to started compression on a file and pages
  */
@@ -1088,11 +1220,18 @@ static noinline void async_cow_start(struct btrfs_work *work)
 {
 	struct async_cow *async_cow;
 	int num_added = 0;
+	int ret = 0;
 	async_cow = container_of(work, struct async_cow, work);
 
-	compress_file_range(async_cow->inode, async_cow->locked_page,
-			    async_cow->start, async_cow->end, async_cow,
-			    &num_added);
+	if (inode_need_compress(async_cow->inode))
+		compress_file_range(async_cow->inode, async_cow->locked_page,
+				    async_cow->start, async_cow->end, async_cow,
+				    &num_added);
+	else
+		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+				       async_cow->end, async_cow, &num_added);
+	WARN_ON(ret);
+
 	if (num_added == 0) {
 		btrfs_add_delayed_iput(async_cow->inode);
 		async_cow->inode = NULL;
@@ -1141,6 +1280,8 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 {
 	struct async_cow *async_cow;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
 	unsigned long nr_pages;
 	u64 cur_end;
 	int limit = 10 * SZ_1M;
@@ -1155,7 +1296,11 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_cow->locked_page = locked_page;
 		async_cow->start = start;
 
-		if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
+		if (fs_info->dedupe_enabled && dedupe_info) {
+			u64 len = max_t(u64, SZ_512K, dedupe_info->blocksize);
+
+			cur_end = min(end, start + len - 1);
+		} else if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
 		    !btrfs_test_opt(root, FORCE_COMPRESS))
 			cur_end = end;
 		else
@@ -1418,7 +1563,7 @@ out_check:
 		if (cow_start != (u64)-1) {
 			ret = cow_file_range(inode, locked_page,
 					     cow_start, found_key.offset - 1,
-					     page_started, nr_written, 1);
+					     page_started, nr_written, 1, NULL);
 			if (ret) {
 				if (!nolock && nocow)
 					btrfs_end_write_no_snapshoting(root);
@@ -1502,7 +1647,7 @@ out_check:
 
 	if (cow_start != (u64)-1) {
 		ret = cow_file_range(inode, locked_page, cow_start, end,
-				     page_started, nr_written, 1);
+				     page_started, nr_written, 1, NULL);
 		if (ret)
 			goto error;
 	}
@@ -1553,6 +1698,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
@@ -1560,9 +1707,9 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode)) {
+	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
 		ret = cow_file_range(inode, locked_page, start, end,
-				      page_started, nr_written, 1);
+				      page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
@@ -2092,7 +2239,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       u64 disk_bytenr, u64 disk_num_bytes,
 				       u64 num_bytes, u64 ram_bytes,
 				       u8 compression, u8 encryption,
-				       u16 other_encoding, int extent_type)
+				       u16 other_encoding, int extent_type,
+				       struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_file_extent_item *fi;
@@ -2154,10 +2302,43 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
-	ret = btrfs_alloc_reserved_file_extent(trans, root,
+
+	/*
+	 * Only for the no-dedupe or hash miss case do we need to increase
+	 * the extent reference.
+	 * For the hash hit case, the reference is already increased.
+	 */
+	if (!hash || hash->bytenr == 0)
+		ret = btrfs_alloc_reserved_file_extent(trans, root,
 					root->root_key.objectid,
 					btrfs_ino(inode), file_pos,
 					ram_bytes, &ins);
+	if (ret < 0)
+		goto out_qgroup;
+
+	/*
+	 * A hash hit won't create a new data extent, so its reserved quota
+	 * space won't be freed by the new delayed_ref_head.
+	 * We need to free it here.
+	 */
+	if (btrfs_dedupe_hash_hit(hash))
+		btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
+
+	/* Add missed hash into dedupe tree */
+	if (hash && hash->bytenr == 0) {
+		hash->bytenr = ins.objectid;
+		hash->num_bytes = ins.offset;
+
+		/*
+		 * Here we ignore a dedupe_add error, as even if it fails,
+		 * it won't corrupt the filesystem; it will only slightly
+		 * reduce the dedupe rate.
+		 */
+		btrfs_dedupe_add(trans, root->fs_info, hash);
+	}
+
+out_qgroup:
+
 	/*
 	 * Release the reserved range from inode dirty range map, as it is
 	 * already moved into delayed_ref_head
@@ -2848,6 +3029,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	u64 logical_len = ordered_extent->len;
 	bool nolock;
 	bool truncated = false;
+	int hash_hit = btrfs_dedupe_hash_hit(ordered_extent->hash);
 
 	nolock = btrfs_is_free_space_inode(inode);
 
@@ -2941,8 +3123,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->disk_len,
 						logical_len, logical_len,
 						compress_type, 0, 0,
-						BTRFS_FILE_EXTENT_REG);
-		if (!ret)
+						BTRFS_FILE_EXTENT_REG,
+						ordered_extent->hash);
+		/* Hash hit case doesn't reserve delalloc bytes */
+		if (!ret && !hash_hit)
 			btrfs_release_delalloc_bytes(root,
 						     ordered_extent->start,
 						     ordered_extent->disk_len);
@@ -2993,15 +3177,17 @@ out:
 		 * wrong we need to return the space for this ordered extent
 		 * back to the allocator.  We only free the extent in the
 		 * truncated case if we didn't write out the extent at all.
+		 *
+		 * For the hash hit case, never free that extent, as it is
+		 * still being used by others.
 		 */
-		if ((ret || !logical_len) &&
+		if ((ret || !logical_len) && !hash_hit &&
 		    !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 		    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
 			btrfs_free_reserved_extent(root, ordered_extent->start,
 						   ordered_extent->disk_len, 1);
 	}
 
-
 	/*
 	 * This needs to be done to make sure anybody waiting knows we are done
 	 * updating everything for this ordered extent.
@@ -10292,7 +10478,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 						  cur_offset, ins.objectid,
 						  ins.offset, ins.offset,
 						  ins.offset, 0, 0, 0,
-						  BTRFS_FILE_EXTENT_PREALLOC);
+						  BTRFS_FILE_EXTENT_PREALLOC,
+						  NULL);
 		if (ret) {
 			btrfs_free_reserved_extent(root, ins.objectid,
 						   ins.offset, 0);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..b7de713 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "dedupe.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -3910,6 +3911,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_path *path;
 	struct btrfs_extent_item *ei;
+	struct btrfs_fs_info *fs_info = rc->extent_root->fs_info;
 	u64 flags;
 	u32 item_size;
 	int ret;
@@ -4032,6 +4034,20 @@ restart:
 				rc->search_start = key.objectid;
 			}
 		}
+		/*
+		 * This data extent will be replaced, but the normal
+		 * dedupe_del() only happens at run_delayed_ref() time, which
+		 * is too late, so delete the dedupe hash early to prevent
+		 * its ref count from increasing during relocation.
+		 */
+		if (rc->stage == MOVE_DATA_EXTENTS &&
+		    (flags & BTRFS_EXTENT_FLAG_DATA)) {
+			ret = btrfs_dedupe_del(trans, fs_info, key.objectid);
+			if (ret < 0) {
+				err = ret;
+				break;
+			}
+		}
 
 		btrfs_end_transaction_throttle(trans, rc->extent_root);
 		btrfs_btree_balance_dirty(rc->extent_root);
-- 
2.8.3
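
The dedupe_bs splitting performed by hash_file_ranges() in the hunk above can be modeled in isolation. Below is a user-space sketch (the names and counting structure are ours, not kernel code): full dedupe_bs blocks get a hash computed, while a trailing partial block is passed through without one (the `goto next` path).

```c
/*
 * User-space model of the range splitting in hash_file_ranges():
 * walk [start, end] (end inclusive) in dedupe_bs steps; only full
 * blocks are hashed, a short tail is added without a hash.
 */
#include <assert.h>
#include <stdint.h>

struct split_result {
	unsigned int hashed;	/* full dedupe_bs blocks that get a hash */
	unsigned int plain;	/* trailing partial block, no hash */
};

static struct split_result split_range(uint64_t start, uint64_t end,
				       uint64_t dedupe_bs)
{
	struct split_result r = { 0, 0 };
	uint64_t cur = start;

	while (cur < end) {
		/* len = min(end + 1 - cur_offset, dedupe_bs) in the patch */
		uint64_t len = end + 1 - cur;

		if (len > dedupe_bs)
			len = dedupe_bs;
		if (len < dedupe_bs)
			r.plain++;	/* the "goto next" path: no hash */
		else
			r.hashed++;
		cur += len;
	}
	return r;
}
```

With a 64KB dedupe_bs, a 128KB range yields two hashed blocks, while a 100KB range yields one hashed block plus one unhashed 36KB tail.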




^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v11 10/13] btrfs: dedupe: Add ioctl for inband deduplication
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (8 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 09/13] btrfs: dedupe: Inband in-memory only de-duplication implementation Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:09 ` [PATCH v11 11/13] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add an ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag, to indicate that btrfs now supports
inband dedupe.
However, this doesn't introduce any on-disk format change; it's just a
pseudo RO compat flag.

All these ioctl interfaces are stateless, which means the caller doesn't
need to know the previous dedupe state before calling them, and only
needs to specify the final desired state.

For example, to enable dedupe with a specified block size and limit,
just fill the ioctl structure and call the enable ioctl.
There is no need to check whether dedupe is already running.

These ioctls handle things like re-configuration or disabling cleanly.
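
As a sanity check of the stateless interface described above, a minimal user-space sketch follows. The command values mirror the uapi additions below, but the struct here uses plain stdint types and omits the 512-byte padding, and dedupe_fill_enable() is a hypothetical helper of ours, not part of the patch.

```c
/*
 * User-space sketch of a stateless dedupe "enable" call.
 * Command values mirror the uapi additions in this patch; the struct
 * is a simplified stand-in (no 512-byte padding) and
 * dedupe_fill_enable() is a hypothetical helper for illustration.
 */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTRFS_DEDUPE_CTL_ENABLE  1
#define BTRFS_DEDUPE_CTL_DISABLE 2
#define BTRFS_DEDUPE_CTL_STATUS  3

struct dedupe_args {
	uint16_t cmd;		/* In: command */
	uint64_t blocksize;	/* In/Out: for enable/status */
	uint64_t limit_nr;	/* In/Out: for enable/status */
	uint64_t limit_mem;	/* In/Out: for enable/status */
	uint64_t current_nr;	/* Out: for status */
	uint16_t backend;	/* In/Out: for enable/status */
	uint16_t hash_type;	/* In/Out: for enable/status */
	uint8_t  status;	/* Out: for status */
};

/*
 * Fill args for "enable": only the desired final state is specified;
 * the caller never needs to query or tear down the previous state.
 */
static void dedupe_fill_enable(struct dedupe_args *a, uint64_t blocksize,
			       uint64_t limit_nr)
{
	memset(a, 0, sizeof(*a));
	a->cmd = BTRFS_DEDUPE_CTL_ENABLE;
	a->blocksize = blocksize;
	a->limit_nr = limit_nr;
	/* backend/hash_type left 0: in-memory backend, SHA-256 */
}

/* A real caller would then do: ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &a); */
```

Per the description above, re-running the enable call with new parameters is also how re-configuration is done.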

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c          | 48 ++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h          | 15 ++++++++++
 fs/btrfs/disk-io.c         |  3 ++
 fs/btrfs/extent-tree.c     |  7 +++--
 fs/btrfs/ioctl.c           | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/sysfs.c           |  2 ++
 include/uapi/linux/btrfs.h | 23 ++++++++++++++++
 7 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4c5b3fc..74e396a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
 			GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !dedupe_info) {
+		dargs->status = 0;
+		dargs->blocksize = 0;
+		dargs->backend = 0;
+		dargs->hash_type = 0;
+		dargs->limit_nr = 0;
+		dargs->current_nr = 0;
+		return;
+	}
+	mutex_lock(&dedupe_info->lock);
+	dargs->status = 1;
+	dargs->blocksize = dedupe_info->blocksize;
+	dargs->backend = dedupe_info->backend;
+	dargs->hash_type = dedupe_info->hash_type;
+	dargs->limit_nr = dedupe_info->limit_nr;
+	dargs->limit_mem = dedupe_info->limit_nr *
+		(sizeof(struct inmem_hash) +
+		 btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	dargs->current_nr = dedupe_info->current_nr;
+	mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -395,6 +422,27 @@ static void unblock_all_writers(struct btrfs_fs_info *fs_info)
 	percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	fs_info->dedupe_enabled = 0;
+	/* Same store ordering as in dedupe disable */
+	smp_wmb();
+	dedupe_info = fs_info->dedupe_info;
+	fs_info->dedupe_info = NULL;
+
+	if (!dedupe_info)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 9162d2c..f605a7f 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -91,6 +91,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 			u64 blocksize, u64 limit_nr, u64 limit_mem);
 
+
+/*
+ * Get inband dedupe status.
+ * Since it needs to access the different backends' hash sizes, which
+ * are not exported, we need such a simple helper function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -102,6 +111,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
+ * Clean up the current btrfs_dedupe_info.
+ * Called at umount time.
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedupe.
  * Caller must ensure [start, start + dedupe_bs) has valid data.
  *
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dccd608..9918e2ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -50,6 +50,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -3902,6 +3903,8 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_qgroup_config(fs_info);
 
+	btrfs_dedupe_cleanup(fs_info);
+
 	if (percpu_counter_sum(&fs_info->delalloc_bytes)) {
 		btrfs_info(fs_info, "at unmount delalloc count %lld",
 		       percpu_counter_sum(&fs_info->delalloc_bytes));
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e0db77e..f6213e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2427,10 +2427,13 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 				 * a new extent is reserved, then deleted
 				 * in one transaction, and inc/dec get merged to 0.
 				 *
-				 * In this case, we need to remove its dedup
+				 * In this case, we need to remove its dedupe
 				 * hash.
 				 */
-				btrfs_dedupe_del(trans, fs_info, node->bytenr);
+				ret = btrfs_dedupe_del(trans, fs_info,
+						       node->bytenr);
+				if (ret < 0)
+					return ret;
 				ret = btrfs_del_csums(trans, root,
 						      node->bytenr,
 						      node->num_bytes);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e6714e2..25aa620 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -61,6 +61,7 @@
 #include "qgroup.h"
 #include "tree-log.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_64BIT
 /* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
@@ -3268,6 +3269,69 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
 	return olen;
 }
 
+static long btrfs_ioctl_dedupe_ctl(struct btrfs_root *root, void __user *args)
+{
+	struct btrfs_ioctl_dedupe_args *dargs;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	dargs = memdup_user(args, sizeof(*dargs));
+	if (IS_ERR(dargs)) {
+		ret = PTR_ERR(dargs);
+		return ret;
+	}
+
+	if (dargs->cmd >= BTRFS_DEDUPE_CTL_LAST) {
+		ret = -EINVAL;
+		goto out;
+	}
+	switch (dargs->cmd) {
+	case BTRFS_DEDUPE_CTL_ENABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_enable(fs_info, dargs->hash_type,
+					 dargs->backend, dargs->blocksize,
+					 dargs->limit_nr, dargs->limit_mem);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		if (ret < 0)
+			break;
+
+		/* Also copy the result to caller for further use */
+		btrfs_dedupe_status(fs_info, dargs);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	case BTRFS_DEDUPE_CTL_DISABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_disable(fs_info);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		break;
+	case BTRFS_DEDUPE_CTL_STATUS:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		btrfs_dedupe_status(fs_info, dargs);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	default:
+		/*
+		 * Use this return value to inform user-space progs that the
+		 * kernel doesn't support such a new command.
+		 */
+		ret = -EOPNOTSUPP;
+		break;
+	}
+out:
+	kfree(dargs);
+	return ret;
+}
+
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 				     struct inode *inode,
 				     u64 endoff,
@@ -5619,6 +5683,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+#ifdef CONFIG_BTRFS_DEBUG
+	case BTRFS_IOC_DEDUPE_CTL:
+		return btrfs_ioctl_dedupe_ctl(root, argp);
+#endif
 	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
 		return btrfs_ioctl_get_supported_features(argp);
 	case BTRFS_IOC_GET_FEATURES:
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4879656..4da84ca 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -206,6 +206,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
+BTRFS_FEAT_ATTR_COMPAT_RO(dedupe, DEDUPE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -218,6 +219,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(dedupe),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index bc3416e..6f25b9f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -240,6 +240,7 @@ struct btrfs_ioctl_fs_info_args {
  * struct btrfs_ioctl_feature_flags
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE		(1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
@@ -636,6 +637,26 @@ struct btrfs_ioctl_get_dev_stats {
 
 /* Default dedupe limit on number of hash */
 #define BTRFS_DEDUPE_LIMIT_NR_DEFAULT	(32 * 1024)
+/*
+ * In-band de-duplication control commands.
+ * For re-configuration, calling enable again will handle it.
+ */
+#define BTRFS_DEDUPE_CTL_ENABLE	1
+#define BTRFS_DEDUPE_CTL_DISABLE 2
+#define BTRFS_DEDUPE_CTL_STATUS	3
+#define BTRFS_DEDUPE_CTL_LAST	4
+struct btrfs_ioctl_dedupe_args {
+	__u16 cmd;		/* In: command (see the macros above) */
+	__u64 blocksize;	/* In/Out: For enable/status */
+	__u64 limit_nr;		/* In/Out: For enable/status */
+	__u64 limit_mem;	/* In/Out: For enable/status */
+	__u64 current_nr;	/* Out: For status output */
+	__u16 backend;		/* In/Out: For enable/status */
+	__u16 hash_type;	/* In/Out: For enable/status */
+	__u8 status;		/* Out: For status output */
+	/* pad the structure to 512 bytes */
+	__u8 __unused[467];
+};
 
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
@@ -845,6 +866,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				    struct btrfs_ioctl_dev_replace_args)
 #define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
 					 struct btrfs_ioctl_same_args)
+#define BTRFS_IOC_DEDUPE_CTL	_IOWR(BTRFS_IOCTL_MAGIC, 55, \
+				      struct btrfs_ioctl_dedupe_args)
 #define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
 				   struct btrfs_ioctl_feature_flags)
 #define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
-- 
2.8.3
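
The limit_mem value reported by btrfs_dedupe_status() above is just limit_nr times the per-entry footprint. A sketch of that conversion follows; note the 104-byte entry size used in the test below is a made-up stand-in for sizeof(struct inmem_hash), which is not exported.

```c
/*
 * Sketch of the limit_nr -> limit_mem conversion done by
 * btrfs_dedupe_status(): reported memory is the hash-count limit times
 * the per-entry footprint (tree entry plus raw hash bytes).
 */
#include <assert.h>
#include <stdint.h>

#define SHA256_DIGEST_SIZE 32	/* only supported hash type in this series */

static uint64_t dedupe_limit_mem(uint64_t limit_nr, uint64_t entry_size,
				 uint64_t hash_size)
{
	return limit_nr * (entry_size + hash_size);
}
```

With the default limit of 32768 hashes this keeps the in-memory backend's footprint bounded to a few megabytes.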





* [PATCH v11 11/13] btrfs: relocation: Enhance error handling to avoid BUG_ON
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (9 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 10/13] btrfs: dedupe: Add ioctl for inband deduplication Qu Wenruo
@ 2016-06-15  2:09 ` Qu Wenruo
  2016-06-15  2:10 ` [PATCH v11 12/13] btrfs: improve inode's outstanding_extents computation Qu Wenruo
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:09 UTC (permalink / raw)
  To: linux-btrfs

Since the introduction of the btrfs dedupe tree, it's possible for
balance to race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased while the node is neither added to
the backref_cache nor is nr_nodes decreased again, causing the BUG_ON()
in backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: 0000 [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  [<ffffffffa01f71d3>]
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  [<ffffffffa01cc177>]
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [<ffffffffa01cdb12>] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [<ffffffffa01d9611>] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [<ffffffffa01ddaf0>] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [<ffffffff81171cda>] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [<ffffffff81171e4c>] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [<ffffffff81193f04>] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [<ffffffff811da423>] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [<ffffffff8119913d>] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [<ffffffff81199209>] ? vma_link+0xb9/0xc0
[ 2611.693303]  [<ffffffff811e7e81>] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [<ffffffff8104e024>] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [<ffffffff811e83c1>] SyS_ioctl+0x41/0x70
[ 2611.694758]  [<ffffffff816dfc6e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  [<ffffffffa01f6fc1>]
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP <ffff88002a81fb30>

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks() and continues
balancing.
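
The resulting control flow in relocate_tree_blocks() — skip -ENOENT entries, break out of the loop on any other error so the shared cleanup path still runs — can be sketched as a user-space model. All names below are illustrative, not kernel code.

```c
/*
 * User-space model of the fixed loop in relocate_tree_blocks():
 * -ENOENT while building a backref tree (the block's root is already
 * gone, e.g. the dedupe tree being disabled) is skipped; any other
 * error stops the loop but still reaches the common cleanup path.
 */
#include <assert.h>
#include <errno.h>

/* Stand-in for build_backref_tree(): returns 0 on success, -ENOENT if
 * the block's root vanished, or another negative errno on failure. */
static int build_one(int simulated_result)
{
	return simulated_result;
}

/* Process simulated results; returns the first real error (not -ENOENT)
 * or 0, and counts how many blocks were relocated. */
static int relocate_blocks(const int *results, int nr, int *relocated)
{
	int err = 0;
	int i;

	*relocated = 0;
	for (i = 0; i < nr; i++) {
		int ret = build_one(results[i]);

		if (ret == -ENOENT)
			continue;	/* root freed underneath us: skip */
		if (ret < 0) {
			err = ret;
			break;		/* real error: stop, then clean up */
		}
		(*relocated)++;
	}
	/* common cleanup (finish_pending_nodes() in the patch) runs here */
	return err;
}
```

The key point is that -ENOENT is not treated as a failure: a vanished root simply means that block no longer needs relocation.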

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/relocation.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index b7de713..32fcd8d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -887,6 +887,13 @@ again:
 		root = read_fs_root(rc->extent_root->fs_info, key.offset);
 		if (IS_ERR(root)) {
 			err = PTR_ERR(root);
+			/*
+			 * Clean up the current node: it may not have been
+			 * added to the backref_cache even though nr_nodes
+			 * was already increased, which would trigger the
+			 * BUG_ON() in backref_cache_cleanup().
+			 */
+			remove_backref_node(&rc->backref_cache, cur);
 			goto out;
 		}
 
@@ -2991,14 +2998,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}
 
 	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);
 
 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root (currently only the dedupe tree) of this
+			 * tree block is being freed and can't be reached.
+			 * Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}
 
 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3006,11 +3020,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.8.3





* [PATCH v11 12/13] btrfs: improve inode's outstanding_extents computation
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (10 preceding siblings ...)
  2016-06-15  2:09 ` [PATCH v11 11/13] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
@ 2016-06-15  2:10 ` Qu Wenruo
  2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:10 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang, Mark Fasheh, Josef Bacik

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to
64KB. With that modification, fsstress tests often trigger these
warnings from btrfs_destroy_inode():
	WARN_ON(BTRFS_I(inode)->outstanding_extents);
	WARN_ON(BTRFS_I(inode)->reserved_extents);

The simple test program below can reproduce this issue reliably.
Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB for the test,
otherwise there won't be such a WARNING.
	#include <string.h>
	#include <unistd.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <fcntl.h>

	int main(void)
	{
		int fd;
		char buf[68 * 1024];

		memset(buf, 0, 68 * 1024);
		fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
		pwrite(fd, buf, 68 * 1024, 64 * 1024);
		return 0;
	}

When BTRFS_MAX_EXTENT_SIZE is 64KB, and the buffered data range is:
64KB						128KB		132KB
|-----------------------------------------------|---------------|
                         64KB + 4KB

1) For the above data range, btrfs_delalloc_reserve_metadata() will
reserve metadata and set BTRFS_I(inode)->outstanding_extents to 2:
(68KB + 64KB - 1) / 64KB == 2

Outstanding_extents: 2

2) Then btrfs_dirty_pages() will be called to dirty the pages and set
the EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be
called twice.
The 1st set_bit_hook() call will set the DELALLOC flag for the first 64KB.
64KB						128KB
|-----------------------------------------------|
	64KB DELALLOC
Outstanding_extents: 2

set_bit_hook() uses the FIRST_DELALLOC flag to avoid re-increasing the
outstanding_extents counter.
So the 1st set_bit_hook() call won't modify outstanding_extents,
it's still 2.

Then the FIRST_DELALLOC flag is *CLEARED*.

3) 2nd btrfs_set_bit_hook() call.
Because FIRST_DELALLOC has been cleared by the previous set_bit_hook()
call, btrfs_set_bit_hook() will increase
BTRFS_I(inode)->outstanding_extents by one, so it is now 3.
64KB                                            128KB            132KB
|-----------------------------------------------|----------------|
	64KB DELALLOC				   4KB DELALLOC
Outstanding_extents: 3

But the correct outstanding_extents number should be 2, not 3.
The 2nd btrfs_set_bit_hook() call screwed this up, leading to the
WARN_ON().

Normally we could solve this by only increasing outstanding_extents in
set_bit_hook().
But the problem is that delalloc_reserve/release_metadata() only have a
'length' parameter and calculate an inaccurate outstanding_extents.
If we only relied on set_bit_hook(), release_metadata() would screw
things up, as it would decrease by that inaccurate number.

So the fix we use is:
1) Increase an *INACCURATE* outstanding_extents in
   btrfs_delalloc_reserve_metadata(), just as a placeholder.
2) Increase the *accurate* outstanding_extents in set_bit_hook().
   This is the real increase.
3) Decrease the *INACCURATE* outstanding_extents before returning.
   This brings outstanding_extents back to the correct value.

For the default 128MB BTRFS_MAX_EXTENT_SIZE, due to the limitation of
__btrfs_buffered_write(), each iteration will only handle about 2MB of
data.
So btrfs_dirty_pages() won't need to handle ranges crossing 2 extents.
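
The extent counting used throughout the description above is a plain round-up division; here is a tiny check with the 64KB test value (the helper name is ours; the kernel open-codes this with div64_u64):

```c
/*
 * The outstanding_extents arithmetic from the description above:
 * num_extents = (len + BTRFS_MAX_EXTENT_SIZE - 1) / BTRFS_MAX_EXTENT_SIZE,
 * shown with the 64KB test value of BTRFS_MAX_EXTENT_SIZE.
 */
#include <assert.h>
#include <stdint.h>

#define MAX_EXTENT_SIZE (64 * 1024ULL)	/* test value, default is 128MB */

static uint64_t num_extents(uint64_t len)
{
	return (len + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
}
```

A 68KB write thus reserves 2 extents up front, matching step 1) above.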

Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/inode.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/ioctl.c |  6 ++---
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8f70f53d..62037e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3094,6 +3094,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
 			       int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      struct extent_state **cached_state);
+int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
+			    struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *new_root,
 			     struct btrfs_root *parent_root,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 23a725f..4a02383 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1723,11 +1723,15 @@ static void btrfs_split_extent_hook(struct inode *inode,
 				    struct extent_state *orig, u64 split)
 {
 	u64 size;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 
 	/* not delalloc, ignore it */
 	if (!(orig->state & EXTENT_DELALLOC))
 		return;
 
+	if (root == root->fs_info->tree_root)
+		return;
+
 	size = orig->end - orig->start + 1;
 	if (size > BTRFS_MAX_EXTENT_SIZE) {
 		u64 num_extents;
@@ -1765,11 +1769,15 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 {
 	u64 new_size, old_size;
 	u64 num_extents;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 
 	/* not delalloc, ignore it */
 	if (!(other->state & EXTENT_DELALLOC))
 		return;
 
+	if (root == root->fs_info->tree_root)
+		return;
+
 	if (new->start > other->start)
 		new_size = new->end - other->start + 1;
 	else
@@ -1876,13 +1884,16 @@ static void btrfs_set_bit_hook(struct inode *inode,
 	if (!(state->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
 		u64 len = state->end + 1 - state->start;
+		u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
+					    BTRFS_MAX_EXTENT_SIZE);
 		bool do_list = !btrfs_is_free_space_inode(inode);
 
-		if (*bits & EXTENT_FIRST_DELALLOC) {
+		if (*bits & EXTENT_FIRST_DELALLOC)
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		} else {
+
+		if (root != root->fs_info->tree_root) {
 			spin_lock(&BTRFS_I(inode)->lock);
-			BTRFS_I(inode)->outstanding_extents++;
+			BTRFS_I(inode)->outstanding_extents += num_extents;
 			spin_unlock(&BTRFS_I(inode)->lock);
 		}
 
@@ -1930,7 +1941,7 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 
 		if (*bits & EXTENT_FIRST_DELALLOC) {
 			*bits &= ~EXTENT_FIRST_DELALLOC;
-		} else if (!(*bits & EXTENT_DO_ACCOUNTING)) {
+		} else if (!(*bits & EXTENT_DO_ACCOUNTING) && do_list) {
 			spin_lock(&BTRFS_I(inode)->lock);
 			BTRFS_I(inode)->outstanding_extents -= num_extents;
 			spin_unlock(&BTRFS_I(inode)->lock);
@@ -2123,9 +2134,54 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      struct extent_state **cached_state)
 {
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
+				    BTRFS_MAX_EXTENT_SIZE);
+
 	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
-	return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
-				   cached_state);
+	ret = set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
+				  cached_state);
+
+	/*
+	 * btrfs_delalloc_reserve_metadata() will first add a number of
+	 * outstanding extents according to the data length, which is
+	 * inaccurate for cases like dirtying already dirty pages.
+	 * So here we decrease that inaccurate number, to make
+	 * outstanding_extents rely only on the correct values added by
+	 * set_bit_hook().
+	 *
+	 * Also, we skipped the metadata space reserve for space cache inodes,
+	 * so don't modify the outstanding_extents value.
+	 */
+	if (ret == 0 && root != root->fs_info->tree_root) {
+		spin_lock(&BTRFS_I(inode)->lock);
+		BTRFS_I(inode)->outstanding_extents -= num_extents;
+		spin_unlock(&BTRFS_I(inode)->lock);
+	}
+
+	return ret;
+}
+
+int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
+			    struct extent_state **cached_state)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
+				    BTRFS_MAX_EXTENT_SIZE);
+
+	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
+	ret = set_extent_defrag(&BTRFS_I(inode)->io_tree, start, end,
+				cached_state);
+
+	if (ret == 0 && root != root->fs_info->tree_root) {
+		spin_lock(&BTRFS_I(inode)->lock);
+		BTRFS_I(inode)->outstanding_extents -= num_extents;
+		spin_unlock(&BTRFS_I(inode)->lock);
+	}
+
+	return ret;
 }
 
 /* see btrfs_writepage_start_hook for details on why this is required */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 25aa620..f38b472 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1236,10 +1236,8 @@ again:
 				(page_cnt - i_done) << PAGE_SHIFT);
 	}
 
-
-	set_extent_defrag(&BTRFS_I(inode)->io_tree, page_start, page_end - 1,
-			  &cached_state);
-
+	btrfs_set_extent_defrag(inode, page_start,
+				page_end - 1, &cached_state);
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 			     page_start, page_end - 1, &cached_state,
 			     GFP_NOFS);
-- 
2.8.3





* [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (11 preceding siblings ...)
  2016-06-15  2:10 ` [PATCH v11 12/13] btrfs: improve inode's outstanding_extents computation Qu Wenruo
@ 2016-06-15  2:10 ` Qu Wenruo
  2016-06-15  3:11   ` kbuild test robot
                     ` (2 more replies)
  2016-06-20 16:03 ` [PATCH v11 00/13] Btrfs dedupe framework David Sterba
  2016-06-22  1:48 ` Qu Wenruo
  14 siblings, 3 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  2:10 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang, Josef Bacik, Mark Fasheh

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

When testing in-band dedupe, we sometimes got an ENOSPC error even though
the fs still had plenty of free space. After some debugging work, we found
that it is btrfs_delalloc_reserve_metadata() which sometimes tries to
reserve plenty of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated from the difference between outstanding_extents
and reserved_extents. The case below shows how ENOSPC occurs:

  1, Buffered write 128MB of data in units of 1MB, so finally we'll have
the inode's outstanding_extents be 1, and reserved_extents be 128.
Note it's btrfs_merge_extent_hook() that merges these 1MB units into
one big outstanding extent, but it does not change reserved_extents.

  2, When writing dirty pages, for in-band dedupe, cow_file_range() will
split the big extent above in units of 16KB (assume our in-band dedupe
blocksize is 16KB). When the first split operation finishes, we'll have 2
outstanding extents and 128 reserved extents, and if right then the newly
generated ordered extent is dispatched to run and completes,
btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
called to release metadata; after that we will have 1 outstanding extent
and 1 reserved extent (also see the logic in drop_outstanding_extent()).
Later cow_file_range() continues to handle the remaining data range
[16KB, 128MB), and if no other ordered extent is dispatched to run, there
will be 8191 outstanding extents and 1 reserved extent.

  3, Now if another buffered write to this file comes in,
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at
least 8191 outstanding extents; for a 64K node size that is
8191*65536*16, about 8GB of metadata, so obviously it returns ENOSPC.

But in fact, when a file goes through in-band dedupe, its max extent size
will no longer be BTRFS_MAX_EXTENT_SIZE (128MB); it will be limited by the
in-band dedupe blocksize, so the current metadata reservation method in
btrfs is not appropriate or correct. Here we introduce
btrfs_max_extent_size(), which returns the max extent size for the
corresponding file, depending on whether it goes through in-band dedupe.
We use this value for metadata reservation and for extent_io merge, split
and clear operations, so we can make sure the difference between
outstanding_extents and reserved_extents will not grow so big.

Currently only buffered writes go through in-band dedupe, and only if
in-band dedupe is enabled.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h            |  16 +++--
 fs/btrfs/dedupe.h           |  37 +++++++++++
 fs/btrfs/extent-tree.c      |  62 ++++++++++++++----
 fs/btrfs/extent_io.c        |  63 +++++++++++++++++-
 fs/btrfs/extent_io.h        |  15 ++++-
 fs/btrfs/file.c             |  26 +++++---
 fs/btrfs/free-space-cache.c |   5 +-
 fs/btrfs/inode-map.c        |   4 +-
 fs/btrfs/inode.c            | 155 ++++++++++++++++++++++++++------------------
 fs/btrfs/ioctl.c            |   6 +-
 fs/btrfs/ordered-data.h     |   1 +
 fs/btrfs/relocation.c       |   8 +--
 12 files changed, 290 insertions(+), 108 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 62037e9..21f2689 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2649,10 +2649,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 				      struct btrfs_block_rsv *rsv,
 				      u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+				    u32 max_extent_size);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+				     u32 max_extent_size);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+				 u32 max_extent_size);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+				  u32 max_extent_size);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
 					      unsigned short type);
@@ -3093,7 +3097,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
 			       int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
-			      struct extent_state **cached_state);
+			      struct extent_state **cached_state, int dedupe);
 int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
 			    struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
@@ -3188,7 +3192,7 @@ int btrfs_release_file(struct inode *inode, struct file *file);
 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      struct page **pages, size_t num_pages,
 		      loff_t pos, size_t write_bytes,
-		      struct extent_state **cached);
+		      struct extent_state **cached, int dedupe);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f605a7f..fd6096c 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -22,6 +22,7 @@
 #include <linux/btrfs.h>
 #include <linux/wait.h>
 #include <crypto/hash.h>
+#include "btrfs_inode.h"
 
 static int btrfs_dedupe_sizes[] = { 32 };
 
@@ -63,6 +64,42 @@ struct btrfs_dedupe_info {
 
 struct btrfs_trans_handle;
 
+static inline u64 btrfs_dedupe_blocksize(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	BUG_ON(fs_info->dedupe_info == NULL);
+	return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * For in-band dedupe, the max extent size is limited by the in-band
+ * dedupe blocksize.
+ */
+static inline u64 btrfs_max_extent_size(struct inode *inode, int do_dedupe)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (do_dedupe) {
+		BUG_ON(dedupe_info == NULL);
+		return dedupe_info->blocksize;
+	} else {
+		return BTRFS_MAX_EXTENT_SIZE;
+	}
+}
+
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
 	return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f6213e7..6146729 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5642,22 +5642,29 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 /**
  * drop_outstanding_extent - drop an outstanding extent
  * @inode: the inode we're dropping the extent for
- * @num_bytes: the number of bytes we're releasing.
+ * @num_bytes: the number of bytes we're releasing.
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * This is called when we are freeing up an outstanding extent, either called
  * after an error or after an extent is written.  This will return the number of
  * reserved extents that need to be freed.  This must be called with
  * BTRFS_I(inode)->lock held.
  */
-static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes)
+static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes,
+					u32 max_extent_size)
 {
 	unsigned drop_inode_space = 0;
 	unsigned dropped_extents = 0;
 	unsigned num_extents = 0;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	num_extents = (unsigned)div64_u64(num_bytes +
-					  BTRFS_MAX_EXTENT_SIZE - 1,
-					  BTRFS_MAX_EXTENT_SIZE);
+					  max_extent_size - 1,
+					  max_extent_size);
 	ASSERT(num_extents);
 	ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents);
 	BTRFS_I(inode)->outstanding_extents -= num_extents;
@@ -5727,7 +5734,13 @@ static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes,
 	return btrfs_calc_trans_metadata_size(root, old_csums - num_csums);
 }
 
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
+/*
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
+ */
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+				    u32 max_extent_size)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
@@ -5741,6 +5754,9 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	u64 to_free = 0;
 	unsigned dropped;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
 	 * mutex since we won't race with anybody.  We need this mostly to make
@@ -5762,8 +5778,8 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
-					 BTRFS_MAX_EXTENT_SIZE - 1,
-					 BTRFS_MAX_EXTENT_SIZE);
+					 max_extent_size - 1,
+					 max_extent_size);
 	BTRFS_I(inode)->outstanding_extents += nr_extents;
 	nr_extents = 0;
 
@@ -5821,7 +5837,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 out_fail:
 	spin_lock(&BTRFS_I(inode)->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
+	dropped = drop_outstanding_extent(inode, num_bytes, max_extent_size);
 	/*
 	 * If the inodes csum_bytes is the same as the original
 	 * csum_bytes then we know we haven't raced with any free()ers
@@ -5887,20 +5903,27 @@ out_fail:
  * btrfs_delalloc_release_metadata - release a metadata reservation for an inode
  * @inode: the inode to release the reservation for
  * @num_bytes: the number of bytes we're releasing
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * This will release the metadata reservation for an inode.  This can be called
  * once we complete IO for a given set of bytes to release their metadata
  * reservations.
  */
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+				     u32 max_extent_size)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 to_free = 0;
 	unsigned dropped;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	num_bytes = ALIGN(num_bytes, root->sectorsize);
 	spin_lock(&BTRFS_I(inode)->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
+	dropped = drop_outstanding_extent(inode, num_bytes, max_extent_size);
 
 	if (num_bytes)
 		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
@@ -5924,6 +5947,9 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
  * @inode: inode we're writing to
  * @start: start range we are writing to
  * @len: how long the range we are writing to
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * TODO: This function will finally replace old btrfs_delalloc_reserve_space()
  *
@@ -5943,14 +5969,18 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
  * Return 0 for success
  * Return <0 for error(-ENOSPC or -EQUOT)
  */
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+				 u32 max_extent_size)
 {
 	int ret;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	ret = btrfs_check_data_free_space(inode, start, len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_delalloc_reserve_metadata(inode, len);
+	ret = btrfs_delalloc_reserve_metadata(inode, len, max_extent_size);
 	if (ret < 0)
 		btrfs_free_reserved_data_space(inode, start, len);
 	return ret;
@@ -5961,6 +5991,9 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
  * @inode: inode we're releasing space for
  * @start: start position of the space already reserved
  * @len: the len of the space already reserved
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
  * called in the case that we don't need the metadata AND data reservations
@@ -5971,9 +6004,10 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
  * list if there are no delalloc bytes left.
  * Also it will handle the qgroup reserved space.
  */
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len)
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+				  u32 max_extent_size)
 {
-	btrfs_delalloc_release_metadata(inode, len);
+	btrfs_delalloc_release_metadata(inode, len, max_extent_size);
 	btrfs_free_reserved_data_space(inode, start, len);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a3412d6..4e3bac2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "dedupe.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -605,7 +606,7 @@ static int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	btrfs_debug_check_extent_io_range(tree, start, end);
 
 	if (bits & EXTENT_DELALLOC)
-		bits |= EXTENT_NORESERVE;
+		bits |= EXTENT_NORESERVE | EXTENT_DEDUPE;
 
 	if (delete)
 		bits |= ~EXTENT_CTLBITS;
@@ -1491,6 +1492,61 @@ out:
 	return ret;
 }
 
+static void adjust_one_outstanding_extent(struct inode *inode, u64 len)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	u64 dedupe_blocksize = fs_info->dedupe_info->blocksize;
+	unsigned old_extents, new_extents;
+
+	old_extents = div64_u64(len + dedupe_blocksize - 1, dedupe_blocksize);
+	new_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
+				BTRFS_MAX_EXTENT_SIZE);
+	if (old_extents <= new_extents)
+		return;
+
+	spin_lock(&BTRFS_I(inode)->lock);
+	BTRFS_I(inode)->outstanding_extents -= old_extents - new_extents;
+	spin_unlock(&BTRFS_I(inode)->lock);
+}
+
+/*
+ * For an extent with the EXTENT_DEDUPE flag, if it later does not go through
+ * in-band dedupe, we need to adjust the number of outstanding_extents.
+ * This is because for an extent with the EXTENT_DEDUPE flag, the number of
+ * outstanding extents is calculated from the in-band dedupe blocksize, so
+ * here we need to adjust it.
+ */
+void adjust_buffered_io_outstanding_extents(struct extent_io_tree *tree,
+					    u64 start, u64 end)
+{
+	struct inode *inode = tree->mapping->host;
+	struct rb_node *node;
+	struct extent_state *state;
+
+	spin_lock(&tree->lock);
+	node = tree_search(tree, start);
+	if (!node)
+		goto out;
+
+	while (1) {
+		state = rb_entry(node, struct extent_state, rb_node);
+		if (state->start > end)
+			goto out;
+		/*
+		 * The whole range is locked, so we can safely clear
+		 * EXTENT_DEDUPE flag.
+		 */
+		state->state &= ~EXTENT_DEDUPE;
+		adjust_one_outstanding_extent(inode,
+				state->end - state->start + 1);
+		node = rb_next(node);
+		if (!node)
+			break;
+	}
+out:
+	spin_unlock(&tree->lock);
+}
+
 /*
  * find a contiguous range of bytes in the file marked as delalloc, not
  * more than 'max_bytes'.  start and end are used to return the range,
@@ -1506,6 +1562,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	u64 cur_start = *start;
 	u64 found = 0;
 	u64 total_bytes = 0;
+	unsigned pre_state;
 
 	spin_lock(&tree->lock);
 
@@ -1523,7 +1580,8 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
 		if (found && (state->start != cur_start ||
-			      (state->state & EXTENT_BOUNDARY))) {
+		    (state->state & EXTENT_BOUNDARY) ||
+		    (state->state ^ pre_state) & EXTENT_DEDUPE)) {
 			goto out;
 		}
 		if (!(state->state & EXTENT_DELALLOC)) {
@@ -1539,6 +1597,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		found++;
 		*end = state->end;
 		cur_start = state->end + 1;
+		pre_state = state->state;
 		node = rb_next(node);
 		total_bytes += state->end - state->start + 1;
 		if (total_bytes >= max_bytes)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c0c1c4f..7ba66b0 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
 #define EXTENT_DAMAGED		(1U << 14)
 #define EXTENT_NORESERVE	(1U << 15)
 #define EXTENT_QGROUP_RESERVED	(1U << 16)
+#define EXTENT_DEDUPE		(1U << 17)
 #define EXTENT_IOBITS		(EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -250,6 +251,8 @@ static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
 			GFP_NOFS);
 }
 
+void adjust_buffered_io_outstanding_extents(struct extent_io_tree *tree,
+					    u64 start, u64 end);
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			   unsigned bits, struct extent_changeset *changeset);
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
@@ -289,10 +292,16 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		       struct extent_state **cached_state);
 
 static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
-		u64 end, struct extent_state **cached_state)
+		u64 end, struct extent_state **cached_state, int dedupe)
 {
-	return set_extent_bit(tree, start, end,
-			      EXTENT_DELALLOC | EXTENT_UPTODATE,
+	unsigned bits;
+
+	if (dedupe)
+		bits = EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEDUPE;
+	else
+		bits = EXTENT_DELALLOC | EXTENT_UPTODATE;
+
+	return set_extent_bit(tree, start, end, bits,
 			      NULL, cached_state, GFP_NOFS);
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 159a934..e3e00e7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -42,6 +42,7 @@
 #include "volumes.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -488,7 +489,7 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 			     struct page **pages, size_t num_pages,
 			     loff_t pos, size_t write_bytes,
-			     struct extent_state **cached)
+			     struct extent_state **cached, int dedupe)
 {
 	int err = 0;
 	int i;
@@ -502,8 +503,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
 
 	end_of_last_block = start_pos + num_bytes - 1;
+
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
-					cached);
+					cached, dedupe);
 	if (err)
 		return err;
 
@@ -1496,6 +1498,11 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
 	bool need_unlock;
+	u32 max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+	int dedupe = inode_need_dedupe(inode);
+
+	if (dedupe)
+		max_extent_size = btrfs_dedupe_blocksize(inode);
 
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
@@ -1558,7 +1565,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			break;
 
 reserve_metadata:
-		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
+		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes,
+						      max_extent_size);
 		if (ret) {
 			if (!only_release_metadata)
 				btrfs_free_reserved_data_space(inode, pos,
@@ -1643,14 +1651,15 @@ again:
 			}
 			if (only_release_metadata) {
 				btrfs_delalloc_release_metadata(inode,
-								release_bytes);
+					release_bytes, max_extent_size);
 			} else {
 				u64 __pos;
 
 				__pos = round_down(pos, root->sectorsize) +
 					(dirty_pages << PAGE_SHIFT);
 				btrfs_delalloc_release_space(inode, __pos,
-							     release_bytes);
+							     release_bytes,
+							     max_extent_size);
 			}
 		}
 
@@ -1660,7 +1669,7 @@ again:
 		if (copied > 0)
 			ret = btrfs_dirty_pages(root, inode, pages,
 						dirty_pages, pos, copied,
-						NULL);
+						NULL, dedupe);
 		if (need_unlock)
 			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 					     lockstart, lockend, &cached_state,
@@ -1701,11 +1710,12 @@ again:
 	if (release_bytes) {
 		if (only_release_metadata) {
 			btrfs_end_write_no_snapshoting(root);
-			btrfs_delalloc_release_metadata(inode, release_bytes);
+			btrfs_delalloc_release_metadata(inode, release_bytes,
+							max_extent_size);
 		} else {
 			btrfs_delalloc_release_space(inode,
 						round_down(pos, root->sectorsize),
-						release_bytes);
+						release_bytes, max_extent_size);
 		}
 	}
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 69d270f..dd7e6af 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1296,7 +1296,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 
 	/* Everything is written out, now we dirty the pages in the file. */
 	ret = btrfs_dirty_pages(root, inode, io_ctl->pages, io_ctl->num_pages,
-				0, i_size_read(inode), &cached_state);
+				0, i_size_read(inode), &cached_state, 0);
 	if (ret)
 		goto out_nospc;
 
@@ -3533,7 +3533,8 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 
 	if (ret) {
 		if (release_metadata)
-			btrfs_delalloc_release_metadata(inode, inode->i_size);
+			btrfs_delalloc_release_metadata(inode, inode->i_size,
+							0);
 #ifdef DEBUG
 		btrfs_err(root->fs_info,
 			"failed to write free ino cache for root %llu",
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 70107f7..99c1f8e 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -488,14 +488,14 @@ again:
 	/* Just to make sure we have enough space */
 	prealloc += 8 * PAGE_SIZE;
 
-	ret = btrfs_delalloc_reserve_space(inode, 0, prealloc);
+	ret = btrfs_delalloc_reserve_space(inode, 0, prealloc, 0);
 	if (ret)
 		goto out_put;
 
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_space(inode, 0, prealloc);
+		btrfs_delalloc_release_space(inode, 0, prealloc, 0);
 		goto out_put;
 	}
 	btrfs_free_reserved_data_space(inode, 0, prealloc);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a02383..918d5e0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -315,7 +315,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	}
 
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags);
-	btrfs_delalloc_release_metadata(inode, end + 1 - start);
+	btrfs_delalloc_release_metadata(inode, end + 1 - start, 0);
 	btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 out:
 	/*
@@ -347,6 +347,7 @@ struct async_cow {
 	struct page *locked_page;
 	u64 start;
 	u64 end;
+	int dedupe;
 	struct list_head extents;
 	struct btrfs_work work;
 };
@@ -1163,14 +1164,8 @@ static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
 
 	actual_end = min_t(u64, isize, end + 1);
 	/* If dedupe is not enabled, don't split extent into dedupe_bs */
-	if (fs_info->dedupe_enabled && dedupe_info) {
-		dedupe_bs = dedupe_info->blocksize;
-		hash_algo = dedupe_info->hash_type;
-	} else {
-		dedupe_bs = SZ_128M;
-		/* Just dummy, to avoid access NULL pointer */
-		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
-	}
+	dedupe_bs = dedupe_info->blocksize;
+	hash_algo = dedupe_info->hash_type;
 
 	while (cur_offset < end) {
 		struct btrfs_dedupe_hash *hash = NULL;
@@ -1223,13 +1218,13 @@ static noinline void async_cow_start(struct btrfs_work *work)
 	int ret = 0;
 	async_cow = container_of(work, struct async_cow, work);
 
-	if (inode_need_compress(async_cow->inode))
+	if (async_cow->dedupe)
+		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+				       async_cow->end, async_cow, &num_added);
+	else
 		compress_file_range(async_cow->inode, async_cow->locked_page,
 				    async_cow->start, async_cow->end, async_cow,
 				    &num_added);
-	else
-		ret = hash_file_ranges(async_cow->inode, async_cow->start,
-				       async_cow->end, async_cow, &num_added);
 	WARN_ON(ret);
 
 	if (num_added == 0) {
@@ -1276,7 +1271,7 @@ static noinline void async_cow_free(struct btrfs_work *work)
 
 static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 				u64 start, u64 end, int *page_started,
-				unsigned long *nr_written)
+				unsigned long *nr_written, int dedupe)
 {
 	struct async_cow *async_cow;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -1295,10 +1290,10 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_cow->root = root;
 		async_cow->locked_page = locked_page;
 		async_cow->start = start;
+		async_cow->dedupe = dedupe;
 
-		if (fs_info->dedupe_enabled && dedupe_info) {
+		if (dedupe) {
 			u64 len = max_t(u64, SZ_512K, dedupe_info->blocksize);
-
 			cur_end = min(end, start + len - 1);
 		} else if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
 		    !btrfs_test_opt(root, FORCE_COMPRESS))
@@ -1696,25 +1691,35 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 			      u64 start, u64 end, int *page_started,
 			      unsigned long *nr_written)
 {
-	int ret;
+	int ret, dedupe;
 	int force_cow = need_force_cow(inode, start, end);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = root->fs_info->dedupe_info;
+
+	dedupe = test_range_bit(io_tree, start, end, EXTENT_DEDUPE, 1, NULL);
+	BUG_ON(dedupe && dedupe_info == NULL);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		if (dedupe)
+			adjust_buffered_io_outstanding_extents(io_tree,
+							       start, end);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		if (dedupe)
+			adjust_buffered_io_outstanding_extents(io_tree,
+							       start, end);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+	} else if (!inode_need_compress(inode) && !dedupe) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
 		ret = cow_file_range_async(inode, locked_page, start, end,
-					   page_started, nr_written);
+					   page_started, nr_written, dedupe);
 	}
 	return ret;
 }
@@ -1724,6 +1729,8 @@ static void btrfs_split_extent_hook(struct inode *inode,
 {
 	u64 size;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int do_dedupe = orig->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 
 	/* not delalloc, ignore it */
 	if (!(orig->state & EXTENT_DELALLOC))
@@ -1733,7 +1740,7 @@ static void btrfs_split_extent_hook(struct inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > BTRFS_MAX_EXTENT_SIZE) {
+	if (size > max_extent_size) {
 		u64 num_extents;
 		u64 new_size;
 
@@ -1742,13 +1749,13 @@ static void btrfs_split_extent_hook(struct inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-					BTRFS_MAX_EXTENT_SIZE);
+		num_extents = div64_u64(new_size + max_extent_size - 1,
+					max_extent_size);
 		new_size = split - orig->start;
-		num_extents += div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-					BTRFS_MAX_EXTENT_SIZE);
-		if (div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1,
-			      BTRFS_MAX_EXTENT_SIZE) >= num_extents)
+		num_extents += div64_u64(new_size + max_extent_size - 1,
+					 max_extent_size);
+		if (div64_u64(size + max_extent_size - 1,
+			      max_extent_size) >= num_extents)
 			return;
 	}
 
@@ -1770,6 +1777,8 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	u64 new_size, old_size;
 	u64 num_extents;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int do_dedupe = other->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 
 	/* not delalloc, ignore it */
 	if (!(other->state & EXTENT_DELALLOC))
@@ -1784,7 +1793,7 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= BTRFS_MAX_EXTENT_SIZE) {
+	if (new_size <= max_extent_size) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents--;
 		spin_unlock(&BTRFS_I(inode)->lock);
@@ -1796,7 +1805,6 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	 * accounted for before we merged into one big extent.  If the number of
 	 * extents we accounted for is <= the amount we need for the new range
 	 * then we can return, otherwise drop.  Think of it like this
-	 *
 	 * [ 4k][MAX_SIZE]
 	 *
 	 * So we've grown the extent by a MAX_SIZE extent, this would mean we
@@ -1810,14 +1818,14 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1,
-				BTRFS_MAX_EXTENT_SIZE);
+	num_extents = div64_u64(old_size + max_extent_size - 1,
+				max_extent_size);
 	old_size = new->end - new->start + 1;
-	num_extents += div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1,
-				 BTRFS_MAX_EXTENT_SIZE);
+	num_extents += div64_u64(old_size + max_extent_size - 1,
+				 max_extent_size);
 
-	if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-		      BTRFS_MAX_EXTENT_SIZE) >= num_extents)
+	if (div64_u64(new_size + max_extent_size - 1,
+		      max_extent_size) >= num_extents)
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -1883,9 +1891,11 @@ static void btrfs_set_bit_hook(struct inode *inode,
 	 */
 	if (!(state->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
+		int do_dedupe = *bits & EXTENT_DEDUPE;
+		u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 		u64 len = state->end + 1 - state->start;
-		u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
-					    BTRFS_MAX_EXTENT_SIZE);
+		u64 num_extents = div64_u64(len + max_extent_size - 1,
+					    max_extent_size);
 		bool do_list = !btrfs_is_free_space_inode(inode);
 
 		if (*bits & EXTENT_FIRST_DELALLOC)
@@ -1922,8 +1932,10 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 				 unsigned *bits)
 {
 	u64 len = state->end + 1 - state->start;
-	u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE -1,
-				    BTRFS_MAX_EXTENT_SIZE);
+	int do_dedupe = state->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
+	u64 num_extents = div64_u64(len + max_extent_size - 1,
+				    max_extent_size);
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
@@ -1954,7 +1966,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 		 */
 		if (*bits & EXTENT_DO_ACCOUNTING &&
 		    root != root->fs_info->tree_root)
-			btrfs_delalloc_release_metadata(inode, len);
+			btrfs_delalloc_release_metadata(inode, len,
+							max_extent_size);
 
 		/* For sanity tests. */
 		if (btrfs_test_is_dummy_root(root))
@@ -2132,16 +2145,18 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
-			      struct extent_state **cached_state)
+			      struct extent_state **cached_state,
+			      int dedupe)
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
-				    BTRFS_MAX_EXTENT_SIZE);
+	u64 max_extent_size = btrfs_max_extent_size(inode, dedupe);
+	u64 num_extents = div64_u64(end - start + max_extent_size,
+				    max_extent_size);
 
 	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
 	ret = set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
-				  cached_state);
+				  cached_state, dedupe);
 
 	/*
 	 * btrfs_delalloc_reserve_metadata() will first add number of
@@ -2168,13 +2183,15 @@ int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
-				    BTRFS_MAX_EXTENT_SIZE);
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
+	u64 num_extents = div64_u64(end - start + max_extent_size,
+				    max_extent_size);
 
 	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
 	ret = set_extent_defrag(&BTRFS_I(inode)->io_tree, start, end,
 				cached_state);
 
+	/* see same comments in btrfs_set_extent_delalloc */
 	if (ret == 0 && root != root->fs_info->tree_root) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents -= num_extents;
@@ -2233,7 +2250,7 @@ again:
 	}
 
 	ret = btrfs_delalloc_reserve_space(inode, page_start,
-					   PAGE_SIZE);
+					   PAGE_SIZE, 0);
 	if (ret) {
 		mapping_set_error(page->mapping, ret);
 		end_extent_writepage(page, ret, page_start, page_end);
@@ -2241,7 +2258,8 @@ again:
 		goto out;
 	 }
 
-	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
+	btrfs_set_extent_delalloc(inode, page_start, page_end,
+				  &cached_state, 0);
 	ClearPageChecked(page);
 	set_page_dirty(page);
 out:
@@ -3086,6 +3104,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	bool nolock;
 	bool truncated = false;
 	int hash_hit = btrfs_dedupe_hash_hit(ordered_extent->hash);
+	u32 max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
+	if (ordered_extent->hash)
+		max_extent_size = root->fs_info->dedupe_info->blocksize;
 
 	nolock = btrfs_is_free_space_inode(inode);
 
@@ -3211,7 +3233,9 @@ out_unlock:
 			     ordered_extent->len - 1, &cached_state, GFP_NOFS);
 out:
 	if (root != root->fs_info->tree_root)
-		btrfs_delalloc_release_metadata(inode, ordered_extent->len);
+		btrfs_delalloc_release_metadata(inode, ordered_extent->len,
+						max_extent_size);
+
 	if (trans)
 		btrfs_end_transaction(trans, root);
 
@@ -4903,7 +4927,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 		goto out;
 
 	ret = btrfs_delalloc_reserve_space(inode,
-			round_down(from, blocksize), blocksize);
+			round_down(from, blocksize), blocksize, 0);
 	if (ret)
 		goto out;
 
@@ -4912,7 +4936,7 @@ again:
 	if (!page) {
 		btrfs_delalloc_release_space(inode,
 				round_down(from, blocksize),
-				blocksize);
+				blocksize, 0);
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -4955,7 +4979,7 @@ again:
 			  0, 0, &cached_state, GFP_NOFS);
 
 	ret = btrfs_set_extent_delalloc(inode, block_start, block_end,
-					&cached_state);
+					&cached_state, 0);
 	if (ret) {
 		unlock_extent_cached(io_tree, block_start, block_end,
 				     &cached_state, GFP_NOFS);
@@ -4983,7 +5007,7 @@ again:
 out_unlock:
 	if (ret)
 		btrfs_delalloc_release_space(inode, block_start,
-					     blocksize);
+					     blocksize, 0);
 	unlock_page(page);
 	put_page(page);
 out:
@@ -7806,9 +7830,10 @@ static void adjust_dio_outstanding_extents(struct inode *inode,
 					   const u64 len)
 {
 	unsigned num_extents;
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
 
-	num_extents = (unsigned) div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
-					   BTRFS_MAX_EXTENT_SIZE);
+	num_extents = (unsigned) div64_u64(len + max_extent_size - 1,
+					   max_extent_size);
 	/*
 	 * If we have an outstanding_extents count still set then we're
 	 * within our reservation, otherwise we need to adjust our inode
@@ -8827,6 +8852,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
 
 	if (check_direct_IO(BTRFS_I(inode)->root, iocb, iter, offset))
 		return 0;
@@ -8856,12 +8882,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 			inode_unlock(inode);
 			relock = true;
 		}
-		ret = btrfs_delalloc_reserve_space(inode, offset, count);
+		ret = btrfs_delalloc_reserve_space(inode, offset, count, 0);
 		if (ret)
 			goto out;
 		dio_data.outstanding_extents = div64_u64(count +
-						BTRFS_MAX_EXTENT_SIZE - 1,
-						BTRFS_MAX_EXTENT_SIZE);
+						max_extent_size - 1,
+						max_extent_size);
 
 		/*
 		 * We need to know how many extents we reserved so that we can
@@ -8888,7 +8914,8 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		if (ret < 0 && ret != -EIOCBQUEUED) {
 			if (dio_data.reserve)
 				btrfs_delalloc_release_space(inode, offset,
-							     dio_data.reserve);
+							     dio_data.reserve,
+							     0);
 			/*
 			 * On error we might have left some ordered extents
 			 * without submitting corresponding bios for them, so
@@ -8904,7 +8931,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 					0);
 		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, offset,
-						     count - (size_t)ret);
+						     count - (size_t)ret, 0);
 	}
 out:
 	if (wakeup)
@@ -9164,7 +9191,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * being processed by btrfs_page_mkwrite() function.
 	 */
 	ret = btrfs_delalloc_reserve_space(inode, page_start,
-					   reserved_space);
+					   reserved_space, 0);
 	if (!ret) {
 		ret = file_update_time(vma->vm_file);
 		reserved = 1;
@@ -9216,7 +9243,7 @@ again:
 			BTRFS_I(inode)->outstanding_extents++;
 			spin_unlock(&BTRFS_I(inode)->lock);
 			btrfs_delalloc_release_space(inode, page_start,
-						PAGE_SIZE - reserved_space);
+						PAGE_SIZE - reserved_space, 0);
 		}
 	}
 
@@ -9233,7 +9260,7 @@ again:
 			  0, 0, &cached_state, GFP_NOFS);
 
 	ret = btrfs_set_extent_delalloc(inode, page_start, end,
-					&cached_state);
+					&cached_state, 0);
 	if (ret) {
 		unlock_extent_cached(io_tree, page_start, page_end,
 				     &cached_state, GFP_NOFS);
@@ -9271,7 +9298,7 @@ out_unlock:
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, page_start, reserved_space);
+	btrfs_delalloc_release_space(inode, page_start, reserved_space, 0);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f38b472..41aa2c4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1142,7 +1142,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 	ret = btrfs_delalloc_reserve_space(inode,
 			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+			page_cnt << PAGE_SHIFT, 0);
 	if (ret)
 		return ret;
 	i_done = 0;
@@ -1233,7 +1233,7 @@ again:
 		spin_unlock(&BTRFS_I(inode)->lock);
 		btrfs_delalloc_release_space(inode,
 				start_index << PAGE_SHIFT,
-				(page_cnt - i_done) << PAGE_SHIFT);
+				(page_cnt - i_done) << PAGE_SHIFT, 0);
 	}
 
 	btrfs_set_extent_defrag(inode, page_start,
@@ -1258,7 +1258,7 @@ out:
 	}
 	btrfs_delalloc_release_space(inode,
 			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+			page_cnt << PAGE_SHIFT, 0);
 	return ret;
 
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 8dda4a5..41f81e5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -75,6 +75,7 @@ struct btrfs_ordered_sum {
 				 * in the logging code. */
 #define BTRFS_ORDERED_PENDING 11 /* We are waiting for this ordered extent to
 				  * complete in the current transaction. */
+
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
 	u64 file_offset;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 32fcd8d..16bb383 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3146,7 +3146,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE);
+		ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE, 0);
 		if (ret)
 			goto out;
 
@@ -3159,7 +3159,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 						   mask);
 			if (!page) {
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+							PAGE_SIZE, 0);
 				ret = -ENOMEM;
 				goto out;
 			}
@@ -3178,7 +3178,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 				unlock_page(page);
 				put_page(page);
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+							PAGE_SIZE, 0);
 				ret = -EIO;
 				goto out;
 			}
@@ -3199,7 +3199,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 			nr++;
 		}
 
-		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
+		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL, 0);
 		set_page_dirty(page);
 
 		unlock_extent(&BTRFS_I(inode)->io_tree,
-- 
2.8.3




^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC
  2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
@ 2016-06-15  3:11   ` kbuild test robot
  2016-06-15  3:17   ` [PATCH v11.1 " Qu Wenruo
  2016-06-15  3:26   ` [PATCH v11 " kbuild test robot
  2 siblings, 0 replies; 34+ messages in thread
From: kbuild test robot @ 2016-06-15  3:11 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: kbuild-all, linux-btrfs, Wang Xiaoguang, Josef Bacik, Mark Fasheh

[-- Attachment #1: Type: text/plain, Size: 5769 bytes --]

Hi,

[auto build test ERROR on v4.7-rc3]
[cannot apply to btrfs/next next-20160614]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160615-101646
config: i386-randconfig-a0-201624 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/btrfs/tests/extent-io-tests.c: In function 'test_find_delalloc':
>> fs/btrfs/tests/extent-io-tests.c:117:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/extent-io-tests.c:148:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/extent-io-tests.c:203:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
--
   fs/btrfs/tests/inode-tests.c: In function 'test_extent_accounting':
>> fs/btrfs/tests/inode-tests.c:969:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:984:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1018:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1041:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1060:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1097:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~

vim +/set_extent_delalloc +117 fs/btrfs/tests/extent-io-tests.c

294e30fe Josef Bacik 2013-10-09  111  	}
294e30fe Josef Bacik 2013-10-09  112  
294e30fe Josef Bacik 2013-10-09  113  	/* Test this scenario
294e30fe Josef Bacik 2013-10-09  114  	 * |--- delalloc ---|
294e30fe Josef Bacik 2013-10-09  115  	 * |---  search  ---|
294e30fe Josef Bacik 2013-10-09  116  	 */
b9ef22de Feifei Xu   2016-06-01 @117  	set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
294e30fe Josef Bacik 2013-10-09  118  	start = 0;
294e30fe Josef Bacik 2013-10-09  119  	end = 0;
294e30fe Josef Bacik 2013-10-09  120  	found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,

:::::: The code at line 117 was first introduced by commit
:::::: b9ef22dedde08ab1b4ccd5f53344984c4dcb89f4 Btrfs: self-tests: Support non-4k page size

:::::: TO: Feifei Xu <xufeifei@linux.vnet.ibm.com>
:::::: CC: David Sterba <dsterba@suse.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 26196 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v11.1 13/13] btrfs: dedupe: fix false ENOSPC
  2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
  2016-06-15  3:11   ` kbuild test robot
@ 2016-06-15  3:17   ` Qu Wenruo
  2016-06-15  3:26   ` [PATCH v11 " kbuild test robot
  2 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-15  3:17 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang, Josef Bacik, Mark Fasheh

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

When testing in-band dedupe, we sometimes got an ENOSPC error even though
the fs still had plenty of free space. After some debugging, we found that
btrfs_delalloc_reserve_metadata() sometimes tries to reserve a huge amount
of metadata space, even for a very small data range.

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated from the difference between outstanding_extents
and reserved_extents. The following case shows how ENOSPC occurs:

  1, Buffered write 128MB of data in units of 1MB, so finally the inode's
outstanding_extents will be 1 and reserved_extents will be 128.
Note it's btrfs_merge_extent_hook() that merges these 1MB units into
one big outstanding extent, but it does not change reserved_extents.

  2, When writing dirty pages, for in-band dedupe, cow_file_range() will
split the big extent above in units of 16KB (assume our in-band dedupe
blocksize is 16KB). When the first split operation finishes, we'll have 2
outstanding extents and 128 reserved extents. Suppose just then the newly
generated ordered extent is dispatched to run and completes;
btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
called to release metadata, after which we will have 1 outstanding extent
and 1 reserved extent (also see the logic in drop_outstanding_extent()).
Later cow_file_range() continues to handle the remaining data range
[16KB, 128MB), and if no other ordered extent is dispatched to run, there
will be 8191 outstanding extents and 1 reserved extent.

  3, Now if another buffered write for this file enters,
btrfs_delalloc_reserve_metadata() will try to reserve metadata for at
least 8191 outstanding extents. For a 64K node size that is 8191*65536*16,
about 8GB of metadata, so it will obviously return an ENOSPC error.

But indeed, when a file goes through in-band dedupe, its max extent size
is no longer BTRFS_MAX_EXTENT_SIZE (128MB); it is limited by the in-band
dedupe blocksize, so the current metadata reservation method in btrfs is
not appropriate. Here we introduce btrfs_max_extent_size(), which returns
the max extent size for the corresponding file (depending on whether it
goes through in-band dedupe), and we use this value for metadata
reservation and for extent_io merge, split and clear operations. This
ensures the difference between outstanding_extents and reserved_extents
will not grow so big.

Currently only buffered writes go through in-band dedupe, and only when
in-band dedupe is enabled.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
v11.1
  Fix compile errors in the self tests.
---
 fs/btrfs/ctree.h                 |  16 ++--
 fs/btrfs/dedupe.h                |  37 ++++++++++
 fs/btrfs/extent-tree.c           |  62 ++++++++++++----
 fs/btrfs/extent_io.c             |  63 +++++++++++++++-
 fs/btrfs/extent_io.h             |  15 +++-
 fs/btrfs/file.c                  |  26 +++++--
 fs/btrfs/free-space-cache.c      |   5 +-
 fs/btrfs/inode-map.c             |   4 +-
 fs/btrfs/inode.c                 | 155 +++++++++++++++++++++++----------------
 fs/btrfs/ioctl.c                 |   6 +-
 fs/btrfs/ordered-data.h          |   1 +
 fs/btrfs/relocation.c            |   8 +-
 fs/btrfs/tests/extent-io-tests.c |   6 +-
 fs/btrfs/tests/inode-tests.c     |  12 +--
 14 files changed, 299 insertions(+), 117 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 62037e9..21f2689 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2649,10 +2649,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
 void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 				      struct btrfs_block_rsv *rsv,
 				      u64 qgroup_reserved);
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+				    u32 max_extent_size);
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+				     u32 max_extent_size);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+				 u32 max_extent_size);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+				  u32 max_extent_size);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
 					      unsigned short type);
@@ -3093,7 +3097,7 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
 int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
 			       int nr);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
-			      struct extent_state **cached_state);
+			      struct extent_state **cached_state, int dedupe);
 int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
 			    struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
@@ -3188,7 +3192,7 @@ int btrfs_release_file(struct inode *inode, struct file *file);
 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      struct page **pages, size_t num_pages,
 		      loff_t pos, size_t write_bytes,
-		      struct extent_state **cached);
+		      struct extent_state **cached, int dedupe);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
 			      struct file *file_out, loff_t pos_out,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f605a7f..fd6096c 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -22,6 +22,7 @@
 #include <linux/btrfs.h>
 #include <linux/wait.h>
 #include <crypto/hash.h>
+#include "btrfs_inode.h"
 
 static int btrfs_dedupe_sizes[] = { 32 };
 
@@ -63,6 +64,42 @@ struct btrfs_dedupe_info {
 
 struct btrfs_trans_handle;
 
+static inline u64 btrfs_dedupe_blocksize(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	BUG_ON(fs_info->dedupe_info == NULL);
+	return fs_info->dedupe_info->blocksize;
+}
+
+static inline int inode_need_dedupe(struct inode *inode)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	return 1;
+}
+
+/*
+ * For in-band dedupe, its max extent size will be limited by in-band
+ * dedupe blocksize.
+ */
+static inline u64 btrfs_max_extent_size(struct inode *inode, int do_dedupe)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (do_dedupe) {
+		BUG_ON(dedupe_info == NULL);
+		return dedupe_info->blocksize;
+	} else {
+		return BTRFS_MAX_EXTENT_SIZE;
+	}
+}
+
+
 static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 {
 	return (hash && hash->bytenr);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f6213e7..6146729 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5642,22 +5642,29 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 /**
  * drop_outstanding_extent - drop an outstanding extent
  * @inode: the inode we're dropping the extent for
- * @num_bytes: the number of bytes we're releasing.
+ * @num_bytes: the number of bytes we're releasing.
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * This is called when we are freeing up an outstanding extent, either called
  * after an error or after an extent is written.  This will return the number of
  * reserved extents that need to be freed.  This must be called with
  * BTRFS_I(inode)->lock held.
  */
-static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes)
+static unsigned drop_outstanding_extent(struct inode *inode, u64 num_bytes,
+					u32 max_extent_size)
 {
 	unsigned drop_inode_space = 0;
 	unsigned dropped_extents = 0;
 	unsigned num_extents = 0;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	num_extents = (unsigned)div64_u64(num_bytes +
-					  BTRFS_MAX_EXTENT_SIZE - 1,
-					  BTRFS_MAX_EXTENT_SIZE);
+					  max_extent_size - 1,
+					  max_extent_size);
 	ASSERT(num_extents);
 	ASSERT(BTRFS_I(inode)->outstanding_extents >= num_extents);
 	BTRFS_I(inode)->outstanding_extents -= num_extents;
@@ -5727,7 +5734,13 @@ static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes,
 	return btrfs_calc_trans_metadata_size(root, old_csums - num_csums);
 }
 
-int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
+/*
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
+ */
+int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes,
+				    u32 max_extent_size)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_block_rsv *block_rsv = &root->fs_info->delalloc_block_rsv;
@@ -5741,6 +5754,9 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	u64 to_free = 0;
 	unsigned dropped;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
 	 * mutex since we won't race with anybody.  We need this mostly to make
@@ -5762,8 +5778,8 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
-					 BTRFS_MAX_EXTENT_SIZE - 1,
-					 BTRFS_MAX_EXTENT_SIZE);
+					 max_extent_size - 1,
+					 max_extent_size);
 	BTRFS_I(inode)->outstanding_extents += nr_extents;
 	nr_extents = 0;
 
@@ -5821,7 +5837,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 
 out_fail:
 	spin_lock(&BTRFS_I(inode)->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
+	dropped = drop_outstanding_extent(inode, num_bytes, max_extent_size);
 	/*
 	 * If the inodes csum_bytes is the same as the original
 	 * csum_bytes then we know we haven't raced with any free()ers
@@ -5887,20 +5903,27 @@ out_fail:
  * btrfs_delalloc_release_metadata - release a metadata reservation for an inode
  * @inode: the inode to release the reservation for
  * @num_bytes: the number of bytes we're releasing
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * This will release the metadata reservation for an inode.  This can be called
  * once we complete IO for a given set of bytes to release their metadata
  * reservations.
  */
-void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
+void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes,
+				     u32 max_extent_size)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 to_free = 0;
 	unsigned dropped;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	num_bytes = ALIGN(num_bytes, root->sectorsize);
 	spin_lock(&BTRFS_I(inode)->lock);
-	dropped = drop_outstanding_extent(inode, num_bytes);
+	dropped = drop_outstanding_extent(inode, num_bytes, max_extent_size);
 
 	if (num_bytes)
 		to_free = calc_csum_metadata_size(inode, num_bytes, 0);
@@ -5924,6 +5947,9 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
  * @inode: inode we're writing to
  * @start: start range we are writing to
  * @len: how long the range we are writing to
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to in-band
+ * dedupe blocksize, otherwise max_extent_size should be BTRFS_MAX_EXTENT_SIZE.
+ * Also if max_extent_size is 0, it'll be set to BTRFS_MAX_EXTENT_SIZE.
  *
  * TODO: This function will finally replace old btrfs_delalloc_reserve_space()
  *
@@ -5943,14 +5969,18 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
  * Return 0 for success
  * Return <0 for error(-ENOSPC or -EQUOT)
  */
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len,
+				 u32 max_extent_size)
 {
 	int ret;
 
+	if (max_extent_size == 0)
+		max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	ret = btrfs_check_data_free_space(inode, start, len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_delalloc_reserve_metadata(inode, len);
+	ret = btrfs_delalloc_reserve_metadata(inode, len, max_extent_size);
 	if (ret < 0)
 		btrfs_free_reserved_data_space(inode, start, len);
 	return ret;
@@ -5961,6 +5991,9 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
  * @inode: inode we're releasing space for
  * @start: start position of the space already reserved
  * @len: the len of the space already reserved
+ * @max_extent_size: for in-band dedupe, max_extent_size will be set to the
+ * in-band dedupe blocksize; otherwise it should be BTRFS_MAX_EXTENT_SIZE.
+ * If max_extent_size is 0, it will be treated as BTRFS_MAX_EXTENT_SIZE.
  *
  * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
  * called in the case that we don't need the metadata AND data reservations
@@ -5971,9 +6004,10 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
  * list if there are no delalloc bytes left.
  * Also it will handle the qgroup reserved space.
  */
-void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len)
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len,
+				  u32 max_extent_size)
 {
-	btrfs_delalloc_release_metadata(inode, len);
+	btrfs_delalloc_release_metadata(inode, len, max_extent_size);
 	btrfs_free_reserved_data_space(inode, start, len);
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a3412d6..4e3bac2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -20,6 +20,7 @@
 #include "locking.h"
 #include "rcu-string.h"
 #include "backref.h"
+#include "dedupe.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -605,7 +606,7 @@ static int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	btrfs_debug_check_extent_io_range(tree, start, end);
 
 	if (bits & EXTENT_DELALLOC)
-		bits |= EXTENT_NORESERVE;
+		bits |= EXTENT_NORESERVE | EXTENT_DEDUPE;
 
 	if (delete)
 		bits |= ~EXTENT_CTLBITS;
@@ -1491,6 +1492,61 @@ out:
 	return ret;
 }
 
+static void adjust_one_outstanding_extent(struct inode *inode, u64 len)
+{
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	u64 dedupe_blocksize = fs_info->dedupe_info->blocksize;
+	unsigned old_extents, new_extents;
+
+	old_extents = div64_u64(len + dedupe_blocksize - 1, dedupe_blocksize);
+	new_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
+				BTRFS_MAX_EXTENT_SIZE);
+	if (old_extents <= new_extents)
+		return;
+
+	spin_lock(&BTRFS_I(inode)->lock);
+	BTRFS_I(inode)->outstanding_extents -= old_extents - new_extents;
+	spin_unlock(&BTRFS_I(inode)->lock);
+}
+
+/*
+ * For an extent with the EXTENT_DEDUPE flag set, if it later does not go
+ * through in-band dedupe, we need to adjust its number of
+ * outstanding_extents.
+ * That is because for an extent with EXTENT_DEDUPE set, the number of
+ * outstanding extents was calculated using the in-band dedupe blocksize,
+ * not BTRFS_MAX_EXTENT_SIZE.
+ */
+void adjust_buffered_io_outstanding_extents(struct extent_io_tree *tree,
+					    u64 start, u64 end)
+{
+	struct inode *inode = tree->mapping->host;
+	struct rb_node *node;
+	struct extent_state *state;
+
+	spin_lock(&tree->lock);
+	node = tree_search(tree, start);
+	if (!node)
+		goto out;
+
+	while (1) {
+		state = rb_entry(node, struct extent_state, rb_node);
+		if (state->start > end)
+			goto out;
+		/*
+		 * The whole range is locked, so we can safely clear
+		 * EXTENT_DEDUPE flag.
+		 */
+		state->state &= ~EXTENT_DEDUPE;
+		adjust_one_outstanding_extent(inode,
+				state->end - state->start + 1);
+		node = rb_next(node);
+		if (!node)
+			break;
+	}
+out:
+	spin_unlock(&tree->lock);
+}
+
 /*
  * find a contiguous range of bytes in the file marked as delalloc, not
  * more than 'max_bytes'.  start and end are used to return the range,
@@ -1506,6 +1562,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	u64 cur_start = *start;
 	u64 found = 0;
 	u64 total_bytes = 0;
+	unsigned pre_state = 0;
 
 	spin_lock(&tree->lock);
 
@@ -1523,7 +1580,8 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 	while (1) {
 		state = rb_entry(node, struct extent_state, rb_node);
 		if (found && (state->start != cur_start ||
-			      (state->state & EXTENT_BOUNDARY))) {
+		    (state->state & EXTENT_BOUNDARY) ||
+		    (state->state ^ pre_state) & EXTENT_DEDUPE)) {
 			goto out;
 		}
 		if (!(state->state & EXTENT_DELALLOC)) {
@@ -1539,6 +1597,7 @@ static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
 		found++;
 		*end = state->end;
 		cur_start = state->end + 1;
+		pre_state = state->state;
 		node = rb_next(node);
 		total_bytes += state->end - state->start + 1;
 		if (total_bytes >= max_bytes)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c0c1c4f..7ba66b0 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -20,6 +20,7 @@
 #define EXTENT_DAMAGED		(1U << 14)
 #define EXTENT_NORESERVE	(1U << 15)
 #define EXTENT_QGROUP_RESERVED	(1U << 16)
+#define EXTENT_DEDUPE		(1U << 17)
 #define EXTENT_IOBITS		(EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS		(EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
@@ -250,6 +251,8 @@ static inline int clear_extent_bits(struct extent_io_tree *tree, u64 start,
 			GFP_NOFS);
 }
 
+void adjust_buffered_io_outstanding_extents(struct extent_io_tree *tree,
+					    u64 start, u64 end);
 int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 			   unsigned bits, struct extent_changeset *changeset);
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
@@ -289,10 +292,16 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		       struct extent_state **cached_state);
 
 static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
-		u64 end, struct extent_state **cached_state)
+		u64 end, struct extent_state **cached_state, int dedupe)
 {
-	return set_extent_bit(tree, start, end,
-			      EXTENT_DELALLOC | EXTENT_UPTODATE,
+	unsigned bits;
+
+	if (dedupe)
+		bits = EXTENT_DELALLOC | EXTENT_UPTODATE | EXTENT_DEDUPE;
+	else
+		bits = EXTENT_DELALLOC | EXTENT_UPTODATE;
+
+	return set_extent_bit(tree, start, end, bits,
 			      NULL, cached_state, GFP_NOFS);
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 159a934..e3e00e7 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -42,6 +42,7 @@
 #include "volumes.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_inode_defrag_cachep;
 /*
@@ -488,7 +489,7 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 			     struct page **pages, size_t num_pages,
 			     loff_t pos, size_t write_bytes,
-			     struct extent_state **cached)
+			     struct extent_state **cached, int dedupe)
 {
 	int err = 0;
 	int i;
@@ -502,8 +503,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	num_bytes = round_up(write_bytes + pos - start_pos, root->sectorsize);
 
 	end_of_last_block = start_pos + num_bytes - 1;
+
 	err = btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block,
-					cached);
+					cached, dedupe);
 	if (err)
 		return err;
 
@@ -1496,6 +1498,11 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 	bool only_release_metadata = false;
 	bool force_page_uptodate = false;
 	bool need_unlock;
+	u32 max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+	int dedupe = inode_need_dedupe(inode);
+
+	if (dedupe)
+		max_extent_size = btrfs_dedupe_blocksize(inode);
 
 	nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
 			PAGE_SIZE / (sizeof(struct page *)));
@@ -1558,7 +1565,8 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			break;
 
 reserve_metadata:
-		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
+		ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes,
+						      max_extent_size);
 		if (ret) {
 			if (!only_release_metadata)
 				btrfs_free_reserved_data_space(inode, pos,
@@ -1643,14 +1651,15 @@ again:
 			}
 			if (only_release_metadata) {
 				btrfs_delalloc_release_metadata(inode,
-								release_bytes);
+					release_bytes, max_extent_size);
 			} else {
 				u64 __pos;
 
 				__pos = round_down(pos, root->sectorsize) +
 					(dirty_pages << PAGE_SHIFT);
 				btrfs_delalloc_release_space(inode, __pos,
-							     release_bytes);
+							     release_bytes,
+							     max_extent_size);
 			}
 		}
 
@@ -1660,7 +1669,7 @@ again:
 		if (copied > 0)
 			ret = btrfs_dirty_pages(root, inode, pages,
 						dirty_pages, pos, copied,
-						NULL);
+						NULL, dedupe);
 		if (need_unlock)
 			unlock_extent_cached(&BTRFS_I(inode)->io_tree,
 					     lockstart, lockend, &cached_state,
@@ -1701,11 +1710,12 @@ again:
 	if (release_bytes) {
 		if (only_release_metadata) {
 			btrfs_end_write_no_snapshoting(root);
-			btrfs_delalloc_release_metadata(inode, release_bytes);
+			btrfs_delalloc_release_metadata(inode, release_bytes,
+							max_extent_size);
 		} else {
 			btrfs_delalloc_release_space(inode,
 						round_down(pos, root->sectorsize),
-						release_bytes);
+						release_bytes, max_extent_size);
 		}
 	}
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 69d270f..dd7e6af 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1296,7 +1296,7 @@ static int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode,
 
 	/* Everything is written out, now we dirty the pages in the file. */
 	ret = btrfs_dirty_pages(root, inode, io_ctl->pages, io_ctl->num_pages,
-				0, i_size_read(inode), &cached_state);
+				0, i_size_read(inode), &cached_state, 0);
 	if (ret)
 		goto out_nospc;
 
@@ -3533,7 +3533,8 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 
 	if (ret) {
 		if (release_metadata)
-			btrfs_delalloc_release_metadata(inode, inode->i_size);
+			btrfs_delalloc_release_metadata(inode, inode->i_size,
+							0);
 #ifdef DEBUG
 		btrfs_err(root->fs_info,
 			"failed to write free ino cache for root %llu",
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index 70107f7..99c1f8e 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -488,14 +488,14 @@ again:
 	/* Just to make sure we have enough space */
 	prealloc += 8 * PAGE_SIZE;
 
-	ret = btrfs_delalloc_reserve_space(inode, 0, prealloc);
+	ret = btrfs_delalloc_reserve_space(inode, 0, prealloc, 0);
 	if (ret)
 		goto out_put;
 
 	ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
 					      prealloc, prealloc, &alloc_hint);
 	if (ret) {
-		btrfs_delalloc_release_space(inode, 0, prealloc);
+		btrfs_delalloc_release_space(inode, 0, prealloc, 0);
 		goto out_put;
 	}
 	btrfs_free_reserved_data_space(inode, 0, prealloc);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a02383..918d5e0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -315,7 +315,7 @@ static noinline int cow_file_range_inline(struct btrfs_root *root,
 	}
 
 	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &BTRFS_I(inode)->runtime_flags);
-	btrfs_delalloc_release_metadata(inode, end + 1 - start);
+	btrfs_delalloc_release_metadata(inode, end + 1 - start, 0);
 	btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 out:
 	/*
@@ -347,6 +347,7 @@ struct async_cow {
 	struct page *locked_page;
 	u64 start;
 	u64 end;
+	int dedupe;
 	struct list_head extents;
 	struct btrfs_work work;
 };
@@ -1163,14 +1164,8 @@ static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
 
 	actual_end = min_t(u64, isize, end + 1);
 	/* If dedupe is not enabled, don't split extent into dedupe_bs */
-	if (fs_info->dedupe_enabled && dedupe_info) {
-		dedupe_bs = dedupe_info->blocksize;
-		hash_algo = dedupe_info->hash_type;
-	} else {
-		dedupe_bs = SZ_128M;
-		/* Just dummy, to avoid access NULL pointer */
-		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
-	}
+	dedupe_bs = dedupe_info->blocksize;
+	hash_algo = dedupe_info->hash_type;
 
 	while (cur_offset < end) {
 		struct btrfs_dedupe_hash *hash = NULL;
@@ -1223,13 +1218,13 @@ static noinline void async_cow_start(struct btrfs_work *work)
 	int ret = 0;
 	async_cow = container_of(work, struct async_cow, work);
 
-	if (inode_need_compress(async_cow->inode))
+	if (async_cow->dedupe)
+		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+				       async_cow->end, async_cow, &num_added);
+	else
 		compress_file_range(async_cow->inode, async_cow->locked_page,
 				    async_cow->start, async_cow->end, async_cow,
 				    &num_added);
-	else
-		ret = hash_file_ranges(async_cow->inode, async_cow->start,
-				       async_cow->end, async_cow, &num_added);
 	WARN_ON(ret);
 
 	if (num_added == 0) {
@@ -1276,7 +1271,7 @@ static noinline void async_cow_free(struct btrfs_work *work)
 
 static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 				u64 start, u64 end, int *page_started,
-				unsigned long *nr_written)
+				unsigned long *nr_written, int dedupe)
 {
 	struct async_cow *async_cow;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -1295,10 +1290,10 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_cow->root = root;
 		async_cow->locked_page = locked_page;
 		async_cow->start = start;
+		async_cow->dedupe = dedupe;
 
-		if (fs_info->dedupe_enabled && dedupe_info) {
+		if (dedupe) {
 			u64 len = max_t(u64, SZ_512K, dedupe_info->blocksize);
-
 			cur_end = min(end, start + len - 1);
 		} else if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
 		    !btrfs_test_opt(root, FORCE_COMPRESS))
@@ -1696,25 +1691,35 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 			      u64 start, u64 end, int *page_started,
 			      unsigned long *nr_written)
 {
-	int ret;
+	int ret, dedupe;
 	int force_cow = need_force_cow(inode, start, end);
+	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = root->fs_info->dedupe_info;
+
+	dedupe = test_range_bit(io_tree, start, end, EXTENT_DEDUPE, 1, NULL);
+	BUG_ON(dedupe && dedupe_info == NULL);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		if (dedupe)
+			adjust_buffered_io_outstanding_extents(io_tree,
+							       start, end);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		if (dedupe)
+			adjust_buffered_io_outstanding_extents(io_tree,
+							       start, end);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+	} else if (!inode_need_compress(inode) && !dedupe) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
 		ret = cow_file_range_async(inode, locked_page, start, end,
-					   page_started, nr_written);
+					   page_started, nr_written, dedupe);
 	}
 	return ret;
 }
@@ -1724,6 +1729,8 @@ static void btrfs_split_extent_hook(struct inode *inode,
 {
 	u64 size;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int do_dedupe = orig->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 
 	/* not delalloc, ignore it */
 	if (!(orig->state & EXTENT_DELALLOC))
@@ -1733,7 +1740,7 @@ static void btrfs_split_extent_hook(struct inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > BTRFS_MAX_EXTENT_SIZE) {
+	if (size > max_extent_size) {
 		u64 num_extents;
 		u64 new_size;
 
@@ -1742,13 +1749,13 @@ static void btrfs_split_extent_hook(struct inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-					BTRFS_MAX_EXTENT_SIZE);
+		num_extents = div64_u64(new_size + max_extent_size - 1,
+					max_extent_size);
 		new_size = split - orig->start;
-		num_extents += div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-					BTRFS_MAX_EXTENT_SIZE);
-		if (div64_u64(size + BTRFS_MAX_EXTENT_SIZE - 1,
-			      BTRFS_MAX_EXTENT_SIZE) >= num_extents)
+		num_extents += div64_u64(new_size + max_extent_size - 1,
+					 max_extent_size);
+		if (div64_u64(size + max_extent_size - 1,
+			      max_extent_size) >= num_extents)
 			return;
 	}
 
@@ -1770,6 +1777,8 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	u64 new_size, old_size;
 	u64 num_extents;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	int do_dedupe = other->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 
 	/* not delalloc, ignore it */
 	if (!(other->state & EXTENT_DELALLOC))
@@ -1784,7 +1793,7 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= BTRFS_MAX_EXTENT_SIZE) {
+	if (new_size <= max_extent_size) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents--;
 		spin_unlock(&BTRFS_I(inode)->lock);
@@ -1796,7 +1805,6 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	 * accounted for before we merged into one big extent.  If the number of
 	 * extents we accounted for is <= the amount we need for the new range
 	 * then we can return, otherwise drop.  Think of it like this
-	 *
 	 * [ 4k][MAX_SIZE]
 	 *
 	 * So we've grown the extent by a MAX_SIZE extent, this would mean we
@@ -1810,14 +1818,14 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1,
-				BTRFS_MAX_EXTENT_SIZE);
+	num_extents = div64_u64(old_size + max_extent_size - 1,
+				max_extent_size);
 	old_size = new->end - new->start + 1;
-	num_extents += div64_u64(old_size + BTRFS_MAX_EXTENT_SIZE - 1,
-				 BTRFS_MAX_EXTENT_SIZE);
+	num_extents += div64_u64(old_size + max_extent_size - 1,
+				 max_extent_size);
 
-	if (div64_u64(new_size + BTRFS_MAX_EXTENT_SIZE - 1,
-		      BTRFS_MAX_EXTENT_SIZE) >= num_extents)
+	if (div64_u64(new_size + max_extent_size - 1,
+		      max_extent_size) >= num_extents)
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -1883,9 +1891,11 @@ static void btrfs_set_bit_hook(struct inode *inode,
 	 */
 	if (!(state->state & EXTENT_DELALLOC) && (*bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
+		int do_dedupe = *bits & EXTENT_DEDUPE;
+		u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
 		u64 len = state->end + 1 - state->start;
-		u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
-					    BTRFS_MAX_EXTENT_SIZE);
+		u64 num_extents = div64_u64(len + max_extent_size - 1,
+					    max_extent_size);
 		bool do_list = !btrfs_is_free_space_inode(inode);
 
 		if (*bits & EXTENT_FIRST_DELALLOC)
@@ -1922,8 +1932,10 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 				 unsigned *bits)
 {
 	u64 len = state->end + 1 - state->start;
-	u64 num_extents = div64_u64(len + BTRFS_MAX_EXTENT_SIZE -1,
-				    BTRFS_MAX_EXTENT_SIZE);
+	int do_dedupe = state->state & EXTENT_DEDUPE;
+	u64 max_extent_size = btrfs_max_extent_size(inode, do_dedupe);
+	u64 num_extents = div64_u64(len + max_extent_size - 1,
+				    max_extent_size);
 
 	spin_lock(&BTRFS_I(inode)->lock);
 	if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
@@ -1954,7 +1966,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 		 */
 		if (*bits & EXTENT_DO_ACCOUNTING &&
 		    root != root->fs_info->tree_root)
-			btrfs_delalloc_release_metadata(inode, len);
+			btrfs_delalloc_release_metadata(inode, len,
+							max_extent_size);
 
 		/* For sanity tests. */
 		if (btrfs_test_is_dummy_root(root))
@@ -2132,16 +2145,18 @@ static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
-			      struct extent_state **cached_state)
+			      struct extent_state **cached_state,
+			      int dedupe)
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
-				    BTRFS_MAX_EXTENT_SIZE);
+	u64 max_extent_size = btrfs_max_extent_size(inode, dedupe);
+	u64 num_extents = div64_u64(end - start + max_extent_size,
+				    max_extent_size);
 
 	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
 	ret = set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
-				  cached_state);
+				  cached_state, dedupe);
 
 	/*
 	 * btrfs_delalloc_reserve_metadata() will first add number of
@@ -2168,13 +2183,15 @@ int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u64 num_extents = div64_u64(end - start + BTRFS_MAX_EXTENT_SIZE,
-				    BTRFS_MAX_EXTENT_SIZE);
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
+	u64 num_extents = div64_u64(end - start + max_extent_size,
+				    max_extent_size);
 
 	WARN_ON((end & (PAGE_SIZE - 1)) == 0);
 	ret = set_extent_defrag(&BTRFS_I(inode)->io_tree, start, end,
 				cached_state);
 
+	/* see same comments in btrfs_set_extent_delalloc */
 	if (ret == 0 && root != root->fs_info->tree_root) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->outstanding_extents -= num_extents;
@@ -2233,7 +2250,7 @@ again:
 	}
 
 	ret = btrfs_delalloc_reserve_space(inode, page_start,
-					   PAGE_SIZE);
+					   PAGE_SIZE, 0);
 	if (ret) {
 		mapping_set_error(page->mapping, ret);
 		end_extent_writepage(page, ret, page_start, page_end);
@@ -2241,7 +2258,8 @@ again:
 		goto out;
 	 }
 
-	btrfs_set_extent_delalloc(inode, page_start, page_end, &cached_state);
+	btrfs_set_extent_delalloc(inode, page_start, page_end,
+				  &cached_state, 0);
 	ClearPageChecked(page);
 	set_page_dirty(page);
 out:
@@ -3086,6 +3104,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	bool nolock;
 	bool truncated = false;
 	int hash_hit = btrfs_dedupe_hash_hit(ordered_extent->hash);
+	u32 max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
+	if (ordered_extent->hash)
+		max_extent_size = root->fs_info->dedupe_info->blocksize;
 
 	nolock = btrfs_is_free_space_inode(inode);
 
@@ -3211,7 +3233,9 @@ out_unlock:
 			     ordered_extent->len - 1, &cached_state, GFP_NOFS);
 out:
 	if (root != root->fs_info->tree_root)
-		btrfs_delalloc_release_metadata(inode, ordered_extent->len);
+		btrfs_delalloc_release_metadata(inode, ordered_extent->len,
+						max_extent_size);
+
 	if (trans)
 		btrfs_end_transaction(trans, root);
 
@@ -4903,7 +4927,7 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
 		goto out;
 
 	ret = btrfs_delalloc_reserve_space(inode,
-			round_down(from, blocksize), blocksize);
+			round_down(from, blocksize), blocksize, 0);
 	if (ret)
 		goto out;
 
@@ -4912,7 +4936,7 @@ again:
 	if (!page) {
 		btrfs_delalloc_release_space(inode,
 				round_down(from, blocksize),
-				blocksize);
+				blocksize, 0);
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -4955,7 +4979,7 @@ again:
 			  0, 0, &cached_state, GFP_NOFS);
 
 	ret = btrfs_set_extent_delalloc(inode, block_start, block_end,
-					&cached_state);
+					&cached_state, 0);
 	if (ret) {
 		unlock_extent_cached(io_tree, block_start, block_end,
 				     &cached_state, GFP_NOFS);
@@ -4983,7 +5007,7 @@ again:
 out_unlock:
 	if (ret)
 		btrfs_delalloc_release_space(inode, block_start,
-					     blocksize);
+					     blocksize, 0);
 	unlock_page(page);
 	put_page(page);
 out:
@@ -7806,9 +7830,10 @@ static void adjust_dio_outstanding_extents(struct inode *inode,
 					   const u64 len)
 {
 	unsigned num_extents;
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
 
-	num_extents = (unsigned) div64_u64(len + BTRFS_MAX_EXTENT_SIZE - 1,
-					   BTRFS_MAX_EXTENT_SIZE);
+	num_extents = (unsigned) div64_u64(len + max_extent_size - 1,
+					   max_extent_size);
 	/*
 	 * If we have an outstanding_extents count still set then we're
 	 * within our reservation, otherwise we need to adjust our inode
@@ -8827,6 +8852,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	bool wakeup = true;
 	bool relock = false;
 	ssize_t ret;
+	u64 max_extent_size = btrfs_max_extent_size(inode, 0);
 
 	if (check_direct_IO(BTRFS_I(inode)->root, iocb, iter, offset))
 		return 0;
@@ -8856,12 +8882,12 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 			inode_unlock(inode);
 			relock = true;
 		}
-		ret = btrfs_delalloc_reserve_space(inode, offset, count);
+		ret = btrfs_delalloc_reserve_space(inode, offset, count, 0);
 		if (ret)
 			goto out;
 		dio_data.outstanding_extents = div64_u64(count +
-						BTRFS_MAX_EXTENT_SIZE - 1,
-						BTRFS_MAX_EXTENT_SIZE);
+						max_extent_size - 1,
+						max_extent_size);
 
 		/*
 		 * We need to know how many extents we reserved so that we can
@@ -8888,7 +8914,8 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		if (ret < 0 && ret != -EIOCBQUEUED) {
 			if (dio_data.reserve)
 				btrfs_delalloc_release_space(inode, offset,
-							     dio_data.reserve);
+							     dio_data.reserve,
+							     0);
 			/*
 			 * On error we might have left some ordered extents
 			 * without submitting corresponding bios for them, so
@@ -8904,7 +8931,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 					0);
 		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode, offset,
-						     count - (size_t)ret);
+						     count - (size_t)ret, 0);
 	}
 out:
 	if (wakeup)
@@ -9164,7 +9191,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * being processed by btrfs_page_mkwrite() function.
 	 */
 	ret = btrfs_delalloc_reserve_space(inode, page_start,
-					   reserved_space);
+					   reserved_space, 0);
 	if (!ret) {
 		ret = file_update_time(vma->vm_file);
 		reserved = 1;
@@ -9216,7 +9243,7 @@ again:
 			BTRFS_I(inode)->outstanding_extents++;
 			spin_unlock(&BTRFS_I(inode)->lock);
 			btrfs_delalloc_release_space(inode, page_start,
-						PAGE_SIZE - reserved_space);
+						PAGE_SIZE - reserved_space, 0);
 		}
 	}
 
@@ -9233,7 +9260,7 @@ again:
 			  0, 0, &cached_state, GFP_NOFS);
 
 	ret = btrfs_set_extent_delalloc(inode, page_start, end,
-					&cached_state);
+					&cached_state, 0);
 	if (ret) {
 		unlock_extent_cached(io_tree, page_start, page_end,
 				     &cached_state, GFP_NOFS);
@@ -9271,7 +9298,7 @@ out_unlock:
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, page_start, reserved_space);
+	btrfs_delalloc_release_space(inode, page_start, reserved_space, 0);
 out_noreserve:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f38b472..41aa2c4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1142,7 +1142,7 @@ static int cluster_pages_for_defrag(struct inode *inode,
 
 	ret = btrfs_delalloc_reserve_space(inode,
 			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+			page_cnt << PAGE_SHIFT, 0);
 	if (ret)
 		return ret;
 	i_done = 0;
@@ -1233,7 +1233,7 @@ again:
 		spin_unlock(&BTRFS_I(inode)->lock);
 		btrfs_delalloc_release_space(inode,
 				start_index << PAGE_SHIFT,
-				(page_cnt - i_done) << PAGE_SHIFT);
+				(page_cnt - i_done) << PAGE_SHIFT, 0);
 	}
 
 	btrfs_set_extent_defrag(inode, page_start,
@@ -1258,7 +1258,7 @@ out:
 	}
 	btrfs_delalloc_release_space(inode,
 			start_index << PAGE_SHIFT,
-			page_cnt << PAGE_SHIFT);
+			page_cnt << PAGE_SHIFT, 0);
 	return ret;
 
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 8dda4a5..41f81e5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -75,6 +75,7 @@ struct btrfs_ordered_sum {
 				 * in the logging code. */
 #define BTRFS_ORDERED_PENDING 11 /* We are waiting for this ordered extent to
 				  * complete in the current transaction. */
+
 struct btrfs_ordered_extent {
 	/* logical offset in the file */
 	u64 file_offset;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 32fcd8d..16bb383 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3146,7 +3146,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 	index = (cluster->start - offset) >> PAGE_SHIFT;
 	last_index = (cluster->end - offset) >> PAGE_SHIFT;
 	while (index <= last_index) {
-		ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE);
+		ret = btrfs_delalloc_reserve_metadata(inode, PAGE_SIZE, 0);
 		if (ret)
 			goto out;
 
@@ -3159,7 +3159,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 						   mask);
 			if (!page) {
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+							PAGE_SIZE, 0);
 				ret = -ENOMEM;
 				goto out;
 			}
@@ -3178,7 +3178,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 				unlock_page(page);
 				put_page(page);
 				btrfs_delalloc_release_metadata(inode,
-							PAGE_SIZE);
+							PAGE_SIZE, 0);
 				ret = -EIO;
 				goto out;
 			}
@@ -3199,7 +3199,7 @@ static int relocate_file_extent_cluster(struct inode *inode,
 			nr++;
 		}
 
-		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
+		btrfs_set_extent_delalloc(inode, page_start, page_end, NULL, 0);
 		set_page_dirty(page);
 
 		unlock_extent(&BTRFS_I(inode)->io_tree,
diff --git a/fs/btrfs/tests/extent-io-tests.c b/fs/btrfs/tests/extent-io-tests.c
index d19ab03..0a31527 100644
--- a/fs/btrfs/tests/extent-io-tests.c
+++ b/fs/btrfs/tests/extent-io-tests.c
@@ -114,7 +114,7 @@ static int test_find_delalloc(u32 sectorsize)
 	 * |--- delalloc ---|
 	 * |---  search  ---|
 	 */
-	set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
+	set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL, 0);
 	start = 0;
 	end = 0;
 	found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
@@ -145,7 +145,7 @@ static int test_find_delalloc(u32 sectorsize)
 		test_msg("Couldn't find the locked page\n");
 		goto out_bits;
 	}
-	set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL);
+	set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL, 0);
 	start = test_start;
 	end = 0;
 	found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
@@ -200,7 +200,7 @@ static int test_find_delalloc(u32 sectorsize)
 	 *
 	 * We are re-using our test_start from above since it works out well.
 	 */
-	set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL);
+	set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL, 0);
 	start = test_start;
 	end = 0;
 	found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 29648c0..aede6cb 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -967,7 +967,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	/* [BTRFS_MAX_EXTENT_SIZE] */
 	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
-					NULL);
+					NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
@@ -983,7 +983,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
 					BTRFS_MAX_EXTENT_SIZE + sectorsize - 1,
-					NULL);
+					NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
@@ -1018,7 +1018,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
 					(BTRFS_MAX_EXTENT_SIZE >> 1)
 					+ sectorsize - 1,
-					NULL);
+					NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
@@ -1041,7 +1041,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize,
 			(BTRFS_MAX_EXTENT_SIZE << 1) + 3 * sectorsize - 1,
-			NULL);
+			NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
@@ -1059,7 +1059,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + sectorsize,
-			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL);
+			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
@@ -1096,7 +1096,7 @@ static int test_extent_accounting(u32 sectorsize, u32 nodesize)
 	BTRFS_I(inode)->outstanding_extents++;
 	ret = btrfs_set_extent_delalloc(inode,
 			BTRFS_MAX_EXTENT_SIZE + sectorsize,
-			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL);
+			BTRFS_MAX_EXTENT_SIZE + 2 * sectorsize - 1, NULL, 0);
 	if (ret) {
 		test_msg("btrfs_set_extent_delalloc returned %d\n", ret);
 		goto out;
-- 
2.8.3




^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC
  2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
  2016-06-15  3:11   ` kbuild test robot
  2016-06-15  3:17   ` [PATCH v11.1 " Qu Wenruo
@ 2016-06-15  3:26   ` kbuild test robot
  2 siblings, 0 replies; 34+ messages in thread
From: kbuild test robot @ 2016-06-15  3:26 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: kbuild-all, linux-btrfs, Wang Xiaoguang, Josef Bacik, Mark Fasheh

Hi,

[auto build test WARNING on v4.7-rc3]
[cannot apply to btrfs/next next-20160614]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160615-101646
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': unknown attribute
>> fs/btrfs/tests/extent-io-tests.c:117:28: sparse: not enough arguments for function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c:148:28: sparse: not enough arguments for function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c:203:28: sparse: not enough arguments for function set_extent_delalloc
   fs/btrfs/tests/extent-io-tests.c: In function 'test_find_delalloc':
   fs/btrfs/tests/extent-io-tests.c:117:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/extent-io-tests.c:148:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, sectorsize, max_bytes - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/extent-io-tests.c:203:2: error: too few arguments to function 'set_extent_delalloc'
     set_extent_delalloc(&tmp, max_bytes, total_dirty - 1, NULL);
     ^~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/../ctree.h:40:0,
                    from fs/btrfs/tests/extent-io-tests.c:24:
   fs/btrfs/tests/../extent_io.h:294:19: note: declared here
    static inline int set_extent_delalloc(struct extent_io_tree *tree, u64 start,
                      ^~~~~~~~~~~~~~~~~~~
--
   include/linux/compiler.h:232:8: sparse: attribute 'no_sanitize_address': unknown attribute
>> fs/btrfs/tests/inode-tests.c:969:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:984:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1018:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1041:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1060:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c:1097:40: sparse: not enough arguments for function btrfs_set_extent_delalloc
   fs/btrfs/tests/inode-tests.c: In function 'test_extent_accounting':
   fs/btrfs/tests/inode-tests.c:969:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, 0, BTRFS_MAX_EXTENT_SIZE - 1,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:984:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1018:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode, BTRFS_MAX_EXTENT_SIZE >> 1,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1041:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1060:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~
   fs/btrfs/tests/inode-tests.c:1097:8: error: too few arguments to function 'btrfs_set_extent_delalloc'
     ret = btrfs_set_extent_delalloc(inode,
           ^~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from fs/btrfs/tests/inode-tests.c:21:0:
   fs/btrfs/tests/../ctree.h:3099:5: note: declared here
    int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
        ^~~~~~~~~~~~~~~~~~~~~~~~~

vim +117 fs/btrfs/tests/extent-io-tests.c

294e30fe Josef Bacik        2013-10-09  101  			ret = -ENOMEM;
294e30fe Josef Bacik        2013-10-09  102  			goto out;
294e30fe Josef Bacik        2013-10-09  103  		}
294e30fe Josef Bacik        2013-10-09  104  		SetPageDirty(page);
294e30fe Josef Bacik        2013-10-09  105  		if (index) {
294e30fe Josef Bacik        2013-10-09  106  			unlock_page(page);
294e30fe Josef Bacik        2013-10-09  107  		} else {
09cbfeaf Kirill A. Shutemov 2016-04-01  108  			get_page(page);
294e30fe Josef Bacik        2013-10-09  109  			locked_page = page;
294e30fe Josef Bacik        2013-10-09  110  		}
294e30fe Josef Bacik        2013-10-09  111  	}
294e30fe Josef Bacik        2013-10-09  112  
294e30fe Josef Bacik        2013-10-09  113  	/* Test this scenario
294e30fe Josef Bacik        2013-10-09  114  	 * |--- delalloc ---|
294e30fe Josef Bacik        2013-10-09  115  	 * |---  search  ---|
294e30fe Josef Bacik        2013-10-09  116  	 */
b9ef22de Feifei Xu          2016-06-01 @117  	set_extent_delalloc(&tmp, 0, sectorsize - 1, NULL);
294e30fe Josef Bacik        2013-10-09  118  	start = 0;
294e30fe Josef Bacik        2013-10-09  119  	end = 0;
294e30fe Josef Bacik        2013-10-09  120  	found = find_lock_delalloc_range(inode, &tmp, locked_page, &start,
294e30fe Josef Bacik        2013-10-09  121  					 &end, max_bytes);
294e30fe Josef Bacik        2013-10-09  122  	if (!found) {
294e30fe Josef Bacik        2013-10-09  123  		test_msg("Should have found at least one delalloc\n");
294e30fe Josef Bacik        2013-10-09  124  		goto out_bits;
294e30fe Josef Bacik        2013-10-09  125  	}

:::::: The code at line 117 was first introduced by commit
:::::: b9ef22dedde08ab1b4ccd5f53344984c4dcb89f4 Btrfs: self-tests: Support non-4k page size

:::::: TO: Feifei Xu <xufeifei@linux.vnet.ibm.com>
:::::: CC: David Sterba <dsterba@suse.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (12 preceding siblings ...)
  2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
@ 2016-06-20 16:03 ` David Sterba
  2016-06-21  0:36   ` Qu Wenruo
  2016-06-22  1:48 ` Qu Wenruo
  14 siblings, 1 reply; 34+ messages in thread
From: David Sterba @ 2016-06-20 16:03 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, chandan

Hi,

I'm looking at how well this patchset merges with the rest; so far
there are expected conflicts with Chandan's subpage-blocksize
patchset. For the easy parts, we can add stub patches to extend
functions like cow_file_range with the parameters that are added by
the other patchset.

Honestly I don't know which patchset to take first. As
subpage-blocksize is in the core, there are no user-visibility or
interface questions, but it must not cause any regressions.

Dedupe is optional and off by default, so we mainly have to make sure
it does not have any impact when turned off.

So I see three possible ways:

* merge subpage first, as it defines the API, adapt dedupe
* merge dedupe first, as it only enhances the existing API, adapt subpage
* create a common branch for both, merge relevant parts of each
  patchset, add more patches to prepare common ground for either patch

You can work out yourself which variant puts the work on whom. My
preference is the 3rd variant, as it does not force any particular
merge order on us.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-20 16:03 ` [PATCH v11 00/13] Btrfs dedupe framework David Sterba
@ 2016-06-21  0:36   ` Qu Wenruo
  2016-06-21  9:13     ` David Sterba
  0 siblings, 1 reply; 34+ messages in thread
From: Qu Wenruo @ 2016-06-21  0:36 UTC (permalink / raw)
  To: dsterba, linux-btrfs, chandan



At 06/21/2016 12:03 AM, David Sterba wrote:
> Hi,
>
> I'm looking at how well this patchset merges with the rest; so far
> there are expected conflicts with Chandan's subpage-blocksize
> patchset. For the easy parts, we can add stub patches to extend
> functions like cow_file_range with the parameters that are added by
> the other patchset.
>
> Honestly I don't know which patchset to take first. As
> subpage-blocksize is in the core, there are no user-visibility or
> interface questions, but it must not cause any regressions.
>
> Dedupe is optional and off by default, so we mainly have to make sure
> it does not have any impact when turned off.
>
> So I see three possible ways:
>
> * merge subpage first, as it defines the API, adapt dedupe

Personally, I'd like to merge subpage first.

AFAIK, it's more important than dedupe.
It affects whether a fs created in a 64K page size environment can be
mounted on a 4K page size system.

Furthermore, dedupe is still not ready to be merged.

The main undetermined part is the ioctl interface.
I'm still working on the stateful ioctl interface (with a -f option for
stateless use), along with some minor changes to allow easy extension
(so that a user-space caller can know exactly which optional parameters
are not supported, for later dedupe rate accounting and other things).

Wang and I are waiting for feedback on the V11 patchset.
The latest ENOSPC fix may need another version to address such feedback.

Furthermore, for dedupe it's quite easy to avoid any possible problem
related to a sectorsize change.

Just increase the minimal dedupe blocksize to the maximum sectorsize
(64K), and the possible conflicts would be solved.

Thanks,
Qu

> * merge dedupe first, as it only enhances the existing API, adapt subpage
> * create a common branch for both, merge relevant parts of each
>   patchset, add more patches to prepare common ground for either patch
>
> You can work out yourself which variant puts the work on whom. My
> preference is the 3rd variant, as it does not force any particular
> merge order on us.
>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-21  0:36   ` Qu Wenruo
@ 2016-06-21  9:13     ` David Sterba
  2016-06-21  9:26       ` Qu Wenruo
  0 siblings, 1 reply; 34+ messages in thread
From: David Sterba @ 2016-06-21  9:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, chandan

On Tue, Jun 21, 2016 at 08:36:49AM +0800, Qu Wenruo wrote:
> > I'm looking at how well this patchset merges with the rest; so far
> > there are expected conflicts with Chandan's subpage-blocksize
> > patchset. For the easy parts, we can add stub patches to extend
> > functions like cow_file_range with the parameters that are added by
> > the other patchset.
> >
> > Honestly I don't know which patchset to take first. As
> > subpage-blocksize is in the core, there are no user-visibility or
> > interface questions, but it must not cause any regressions.
> >
> > Dedupe is optional and off by default, so we mainly have to make
> > sure it does not have any impact when turned off.
> >
> > So I see three possible ways:
> >
> > * merge subpage first, as it defines the API, adapt dedupe
> 
> Personally, I'd like to merge subpage first.
> 
> AFAIK, it's more important than dedupe.
> It affects whether a fs created in a 64K page size environment can be
> mounted on a 4K page size system.

Yeah, but I'm now concerned about the way both will be integrated in the
development or preview branches, not really the functionality itself.

Now the conflicts are not trivial, so this takes extra time on my side
and I can't be sure about the result in the end if I put only minor
effort into resolving the conflicts ("make it compile"). And I don't want
to do that too often.

As stated in past discussions, features of this impact should spend
one development cycle in for-next, even if they're not ready for merge
or reviews are still going on.

The subpage patchset is now in a relatively good shape to start actual
testing, which already revealed some problems.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-21  9:13     ` David Sterba
@ 2016-06-21  9:26       ` Qu Wenruo
  2016-06-21  9:34         ` David Sterba
  0 siblings, 1 reply; 34+ messages in thread
From: Qu Wenruo @ 2016-06-21  9:26 UTC (permalink / raw)
  To: dsterba, linux-btrfs, chandan



At 06/21/2016 05:13 PM, David Sterba wrote:
> On Tue, Jun 21, 2016 at 08:36:49AM +0800, Qu Wenruo wrote:
>>> I'm looking at how well this patchset merges with the rest; so far
>>> there are expected conflicts with Chandan's subpage-blocksize
>>> patchset. For the easy parts, we can add stub patches to extend
>>> functions like cow_file_range with the parameters that are added by
>>> the other patchset.
>>>
>>> Honestly I don't know which patchset to take first. As
>>> subpage-blocksize is in the core, there are no user-visibility or
>>> interface questions, but it must not cause any regressions.
>>>
>>> Dedupe is optional and off by default, so we mainly have to make
>>> sure it does not have any impact when turned off.
>>>
>>> So I see three possible ways:
>>>
>>> * merge subpage first, as it defines the API, adapt dedupe
>>
>> Personally, I'd like to merge subpage first.
>>
>> AFAIK, it's more important than dedupe.
>> It affects whether a fs created in a 64K page size environment can be
>> mounted on a 4K page size system.
>
> Yeah, but I'm now concerned about the way both will be integrated in the
> development or preview branches, not really the functionality itself.
>
> Now the conflicts are not trivial, so this takes extra time on my side
> and I can't be sure about the result in the end if I put only minor
> effort into resolving the conflicts ("make it compile"). And I don't want
> to do that too often.
>
> As stated in past discussions, features of this impact should spend
> one development cycle in for-next, even if they're not ready for merge
> or reviews are still going on.
>
> The subpage patchset is now in a relatively good shape to start actual
> testing, which already revealed some problems.
>
>
I'm completely OK to do the rebase, but since I don't have a 64K page
size machine to test the rebase on, we can only test that 4K systems
are unaffected.

Although not much help, at least it would be better than just making it
compile.

Also, such a rebase may help us expose bad designs or unexpected corner
cases in dedupe.
So if it's OK, please let me try to do the rebase.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-21  9:26       ` Qu Wenruo
@ 2016-06-21  9:34         ` David Sterba
  2016-06-21 16:55           ` Chandan Rajendra
  0 siblings, 1 reply; 34+ messages in thread
From: David Sterba @ 2016-06-21  9:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, chandan

On Tue, Jun 21, 2016 at 05:26:23PM +0800, Qu Wenruo wrote:
> > Yeah, but I'm now concerned about the way both will be integrated in the
> > development or preview branches, not really the functionality itself.
> >
> > Now the conflicts are not trivial, so this takes extra time on my side
> > and I can't be sure about the result in the end if I put only minor
> > effort into resolving the conflicts ("make it compile"). And I don't want
> > to do that too often.
> >
> > As stated in past discussions, features of this impact should spend
> > one development cycle in for-next, even if they're not ready for merge
> > or reviews are still going on.
> >
> > The subpage patchset is now in a relatively good shape to start actual
> > testing, which already revealed some problems.
> >
> >
> I'm completely OK to do the rebase, but since I don't have a 64K page
> size machine to test the rebase on, we can only test that 4K systems
> are unaffected.
>
> Although not much help, at least it would be better than just making
> it compile.
>
> Also, such a rebase may help us expose bad designs or unexpected
> corner cases in dedupe.
> So if it's OK, please let me try to do the rebase.

Well, if you base dedupe on subpage, then it could be hard to find
which patchset introduces a bug, or whether it's a combination of both.
We should be able to test the features independently, and thus I'm
proposing to first find some common patchset that makes that possible.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-21  9:34         ` David Sterba
@ 2016-06-21 16:55           ` Chandan Rajendra
  2016-06-23 12:17             ` David Sterba
  0 siblings, 1 reply; 34+ messages in thread
From: Chandan Rajendra @ 2016-06-21 16:55 UTC (permalink / raw)
  To: dsterba; +Cc: Qu Wenruo, linux-btrfs

On Tuesday, June 21, 2016 11:34:57 AM David Sterba wrote:
> On Tue, Jun 21, 2016 at 05:26:23PM +0800, Qu Wenruo wrote:
> > > Yeah, but I'm now concerned about the way both will be integrated in the
> > > development or preview branches, not really the functionality itself.
> > >
> > > Now the conflicts are not trivial, so this takes extra time on my side
> > > and I can't be sure about the result in the end if I put only minor
> > > effort into resolving the conflicts ("make it compile"). And I don't want
> > > to do that too often.
> > >
> > > As stated in past discussions, features of this impact should spend
> > > one development cycle in for-next, even if they're not ready for merge
> > > or reviews are still going on.
> > >
> > > The subpage patchset is now in a relatively good shape to start actual
> > > testing, which already revealed some problems.
> > >
> > >
> > I'm completely OK to do the rebase, but since I don't have a 64K page
> > size machine to test the rebase on, we can only test that 4K systems
> > are unaffected.
> >
> > Although not much help, at least it would be better than just making
> > it compile.
> >
> > Also, such a rebase may help us expose bad designs or unexpected
> > corner cases in dedupe.
> > So if it's OK, please let me try to do the rebase.
> 
> Well, if you base dedupe on subpage, then it could be hard to find
> which patchset introduces a bug, or whether it's a combination of both.
> We should be able to test the features independently, and thus I'm
> proposing to first find some common patchset that makes that possible.
>

Hi David,

I am not sure if I understood the above statement correctly. Do you mean to
commit the 'common/simple' patches from both the subpage-blocksize & dedupe
patchsets first and then bring in the complicated ones later?

If yes, then we have a problem doing that w.r.t. the subpage-blocksize
patchset: the first few patches bring in the core changes necessary
for all the remaining patches.

-- 
chandan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
                   ` (13 preceding siblings ...)
  2016-06-20 16:03 ` [PATCH v11 00/13] Btrfs dedupe framework David Sterba
@ 2016-06-22  1:48 ` Qu Wenruo
  2016-06-24  6:54   ` Satoru Takeuchi
  14 siblings, 1 reply; 34+ messages in thread
From: Qu Wenruo @ 2016-06-22  1:48 UTC (permalink / raw)
  To: linux-btrfs

Here is the long-awaited (simple and theoretical) performance test for
dedupe.

These results may be added to the btrfs wiki page, as advice on dedupe
use cases.

The full results can be checked on Google Drive:
https://drive.google.com/file/d/0BxpkL3ehzX3pb05WT1lZSnRGbjA/view?usp=sharing

[Short Conclusion]
For high dedupe rates and easily compressible data:
with 4 or more CPU cores, dedupe speed is on par with lzo compression,
and about 35% faster than default dd.

With 2 CPU cores, lzo compression is faster than dedupe, but both are
faster than default dd.

With 1 CPU core, lzo compression is on par with the SAS HDD, while
dedupe is slower than default dd.

[Test Platform]
The test platform is Fujitsu PRIMERGY RX300 S7.
CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2 nodes)
Memory: 32G, limited to 8G in the performance tests
Disk: 300G SAS HDDs with hardware RAID 5/6

[Test method]
Just do a 40G buffered write into a new btrfs.
Since that is 5 times the total usable memory (in fact 7 times, since
only 5.7G of memory is available), flushing will happen several times.

dd if=/dev/zero bs=1M count=40960 of=/mnt/btrfs/out

[Future plan]
More tests on less theoretical cases, like low-to-medium dedupe rates,
which may lead to slower performance than raw dd.

Considering lzo is already the fastest compression method btrfs
provides so far, SHA512 should make dedupe even faster, faster than
compression.

Also, current dedupe first splits the delalloc range into 512K
segments, then splits those into 128K blocks (the default dedupe block
size) and balances the hashing work across CPUs, so with a smaller
dedupe block size, dedupe should be faster and make full use of all
CPUs.

Thanks,
Qu


At 06/15/2016 10:09 AM, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160524
>
> In this update, the patchset goes through another re-organization along
> with other fixes to address comments from community.
> 1) Move on-disk backend and dedupe props out of the patchset
>    Suggested by David.
>    There is still some discussion on the on-disk format.
>    And dedupe prop is still not 100% determined.
>
>    So it's better to focus on the current in-memory backend only, which
>    doesn't bring any on-disk format change.
>
>    Once the framework is done, new backends and props can be added more
>    easily.
>
> 2) Better enable/disable and buffered write race avoidance
>    Inspired by Mark.
>    Although we didn't trigger it with our test case in the previous
>    version, if we manually add a 5s delay to __btrfs_buffered_write(),
>    it's possible to trigger the disable vs. buffered write race.
>
>    The cause is that there is a window between
>    __btrfs_buffered_write() and btrfs_dirty_pages().
>    In that window, sync_filesystem() can return very quickly since
>    there is no dirty page.
>    During that window, dedupe disable can happen and finish, and a
>    buffered writer may access the NULL dedupe info pointer.
>
>    Now we use sb->s_writers.rw_sem to wait for all current writers and
>    block further writers, then sync the fs, change the dedupe status
>    and finally unblock writers (like freeze).
>    This provides clearer logic and code, and is safer than the previous
>    method, because there is no window before we dirty pages.
>
> 3) Fix ENOSPC problem with better solution.
>    Pointed out by Josef.
>    The last 2 patches from Wang fix the ENOSPC problem with a more
>    comprehensive method for delalloc metadata reservation,
>    along with a small outstanding-extents improvement to cooperate
>    with the tunable max extent size.
>
> Now the whole patchset only adds the in-memory backend as a whole.
> No other backends or props.
> So we can focus on the framework itself.
>
> Next version will focus on ioctl interface modification suggested by
> David.
>
> Thanks,
> Qu
>
> Changelog:
> v2:
>   Totally reworked to handle multiple backends
> v3:
>   Fix a stupid but deadly on-disk backend bug
>   Add handle for multiple hash on same bytenr corner case to fix abort
>   trans error
>   Increase dedup rate by enhancing delayed ref handler for both backend.
>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>   Increase dedup block size up limit to 8M.
> v4:
>   Add dedup prop for disabling dedup for given files/dirs.
>   Merge inmem_search() and ondisk_search() into generic_search() to save
>   some code
>   Fix another delayed_ref related bug.
>   Use the same mutex for both inmem and ondisk backend.
>   Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>   rate.
> v5:
>   Reuse compress routine for much simpler dedup function.
>   Slightly improved performance due to above modification.
>   Fix race between dedup enable/disable
>   Fix for false ENOSPC report
> v6:
>   Further enable/disable race window fix.
>   Minor format change according to checkpatch.
> v7:
>   Fix one concurrency bug with balance.
>   Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>   btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>   and wrong parameter.
>   Rebased to integration-4.6.
> v8:
>   Rename 'dedup' to 'dedupe'.
>   Add support to allow dedupe and compression work at the same time.
>   Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>   who exposed most of them.
>   Small dedupe hit case performance improvement.
> v9:
>   Re-order the patchset to completely separate pure in-memory and any
>   on-disk format change.
>   Fold bug fixes into its original patch.
> v10:
>   Adding back missing bug fix patch.
>   Reduce on-disk item size.
>   Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
> v11:
>   Remove other backend and props support to focus on the framework and
>   in-memory backend. Suggested by David.
>   Better disable and buffered write race protection.
>   Comprehensive fix to dedupe metadata ENOSPC problem.
>
> Qu Wenruo (3):
>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>   btrfs: dedupe: Inband in-memory only de-duplication implement
>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>
> Wang Xiaoguang (10):
>   btrfs: dedupe: Introduce dedupe framework and its header
>   btrfs: dedupe: Introduce function to initialize dedupe info
>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>   btrfs: dedupe: Introduce function to search for an existing hash
>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>   btrfs: ordered-extent: Add support for dedupe
>   btrfs: dedupe: Add ioctl for inband dedupelication
>   btrfs: improve inode's outstanding_extents computation
>   btrfs: dedupe: fix false ENOSPC
>
>  fs/btrfs/Makefile           |   2 +-
>  fs/btrfs/ctree.h            |  25 +-
>  fs/btrfs/dedupe.c           | 710 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/dedupe.h           | 210 +++++++++++++
>  fs/btrfs/delayed-ref.c      |  30 +-
>  fs/btrfs/delayed-ref.h      |   8 +
>  fs/btrfs/disk-io.c          |   4 +
>  fs/btrfs/extent-tree.c      |  83 +++++-
>  fs/btrfs/extent_io.c        |  63 +++-
>  fs/btrfs/extent_io.h        |  15 +-
>  fs/btrfs/file.c             |  26 +-
>  fs/btrfs/free-space-cache.c |   5 +-
>  fs/btrfs/inode-map.c        |   4 +-
>  fs/btrfs/inode.c            | 434 ++++++++++++++++++++++-----
>  fs/btrfs/ioctl.c            |  80 ++++-
>  fs/btrfs/ordered-data.c     |  46 ++-
>  fs/btrfs/ordered-data.h     |  14 +
>  fs/btrfs/relocation.c       |  46 ++-
>  fs/btrfs/sysfs.c            |   2 +
>  include/uapi/linux/btrfs.h  |  41 +++
>  20 files changed, 1701 insertions(+), 147 deletions(-)
>  create mode 100644 fs/btrfs/dedupe.c
>  create mode 100644 fs/btrfs/dedupe.h
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-21 16:55           ` Chandan Rajendra
@ 2016-06-23 12:17             ` David Sterba
  2016-06-24  2:50               ` Qu Wenruo
  2016-06-24  4:10               ` Chandan Rajendra
  0 siblings, 2 replies; 34+ messages in thread
From: David Sterba @ 2016-06-23 12:17 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: dsterba, Qu Wenruo, linux-btrfs

On Tue, Jun 21, 2016 at 10:25:19PM +0530, Chandan Rajendra wrote:
> > > I'm completely OK to do the rebase, but since I don't have a 64K page
> > > size machine to test the rebase on, we can only test that 4K systems
> > > are unaffected.
> > >
> > > Although not much help, at least it would be better than just making
> > > it compile.
> > >
> > > Also, such a rebase may help us expose bad designs or unexpected
> > > corner cases in dedupe.
> > > So if it's OK, please let me try to do the rebase.
> > 
> > Well, if you base dedupe on subpage, then it could be hard to find
> > which patchset introduces a bug, or whether it's a combination of both.
> > We should be able to test the features independently, and thus I'm
> > proposing to first find some common patchset that makes that possible.
> 
> I am not sure if I understood the above statement correctly. Do you mean to
> commit the 'common/simple' patches from both the subpage-blocksize & dedupe
> patchset first and then bring in the complicated ones later?

That would be great yes, but ...

> If yes, then we have a problem doing that w.r.t subpage-blocksize
> patchset. The first few patches bring in the core changes necessary for the
> other remaining patches.

... not easily possible. I looked again for common functions that change
the signature and found only cow_file_range and run_delalloc_nocow. The
plan:

- separate patch that adds new parameters required by both patches to
  the functions
- update all call sites, add 0/NULL as defaults for the new unused
  parameters
- rebase both patches on top of this patch

How does this help: if a patch starts to use the new parameter, it
changes only the value at all call sites. This is much easier to verify
and merge manually compared to adding a new parameter to the middle of
the list, namely when the functions take 6+.

The other conflicts, like the conversion from PAGE_SIZE to block-oriented
iterations, will be harder, but these are usually localized and can be
resolved. We'll see if there are other options to reduce the clashes, but
at the moment it's stuck at the two functions. Does that explain it
better?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-23 12:17             ` David Sterba
@ 2016-06-24  2:50               ` Qu Wenruo
  2016-06-24  4:34                 ` Chandan Rajendra
  2016-06-24  9:29                 ` Chandan Rajendra
  2016-06-24  4:10               ` Chandan Rajendra
  1 sibling, 2 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-24  2:50 UTC (permalink / raw)
  To: dsterba, Chandan Rajendra, linux-btrfs

Hi Chandan, David,

While trying to rebase the dedupe patchset on top of Chandan's sub-page
size patchset (using David's for-next-test-20160620), the rebase itself
was quite simple, but I'm afraid I found some bugs in the sub-page size
patchset, *without* the dedupe patchset applied.

These bugs seem to be unrelated to each other:
1) state leak at btrfs rmmod time
2) bytes_may_use leak when qgroup hits its EDQUOT limit
3) selftests are run several times at module load time
    15 times, to be more exact.
    Since I didn't find any obvious constant related to running them 15
    times, I assume it's not designed to run 15 times.

The reproducer for 1) and 2) is quite simple, extracted from btrfs/022 
test case:
------
dev=/dev/sdb5
mnt=/mnt/test

umount $dev &> /dev/null

mkfs.btrfs $dev -f
mount $dev $mnt -o nospace_cache
btrfs dedupe enable $mnt
btrfs sub create $mnt/sub
btrfs quota enable $mnt


# Just use a small limit to make the ftrace output less noisy.
btrfs qgroup limit 512K 0/257 $mnt
dd if=/dev/urandom of=$mnt/sub/test bs=1M count=1
umount $mnt
rmmod btrfs
------

At unmount time, a kernel warning will happen due to the may_use bytes
leak. I could dig into it further, as it looks like a bug in the space
reservation failure path.
------
BTRFS: space_info 1 has 8044544 free, is not full
BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0, 
may_use=409600, readonly=0
------

And at rmmod time, btrfs will detect an extent_state leak, whose length
is always 4095 bytes (page size - 1).

Hope this helps, and I'm willing to help fix these problems.

Thanks,
Qu

At 06/23/2016 08:17 PM, David Sterba wrote:
> On Tue, Jun 21, 2016 at 10:25:19PM +0530, Chandan Rajendra wrote:
>>>> I'm completely OK to do the rebase, but since I don't have 64K page size
>>>> machine to test the rebase, we can only test if 4K system is unaffected.
>>>>
>>>> Although not much help, at least it would be better than making it compile.
>>>>
>>>> Also such rebase may help us to expose bad design/unexpected corner case
>>>> in dedupe.
>>>> So if it's OK, please let me try to do the rebase.
>>>
>>> Well, if you base dedupe on subpage, then it could be hard to find the
>>> patchset that introduces bugs, or combination of both. We should be able
>>> to test the features independently, and thus I'm proposing to first find
>>> some common patchset that makes that possible.
>>
>> I am not sure if I understood the above statement correctly. Do you mean to
>> commit the 'common/simple' patches from both the subpage-blocksize & dedupe
>> patchset first and then bring in the complicated ones later?
>
> That would be great yes, but ...
>
>> If yes, then we have a problem doing that w.r.t subpage-blocksize
>> patchset. The first few patches bring in the core changes necessary for the
>> other remaining patches.
>
> ... not easily possible. I looked again for common functions that change
> the singature and found only cow_file_range and run_delalloc_nocow. The
> plan:
>
> - separate patch that adds new parameters required by both patches to
>   the functions
> - update all call sites, add 0/NULL as defaults for the new unused
>   parameters
> - rebase both patches on top of this patch
>
> How does this help: if a patch starts to use the new parameter, it
> changes only the value at all call sites. This is much easier to verify
> and merge manually compared to adding a new parameter to the middle of
> the list, namely when the functions take 6+.
>
> The other conflicts like conversion from PAGE_SIZE to the the block
> oriented iterations will be harder, but these are usually localized and
> can be resolved. We'll see if there are other options to reduce the
> clashes but at the moment it's stuck at the two functions. Does that
> explain it better?
>
>



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-23 12:17             ` David Sterba
  2016-06-24  2:50               ` Qu Wenruo
@ 2016-06-24  4:10               ` Chandan Rajendra
  1 sibling, 0 replies; 34+ messages in thread
From: Chandan Rajendra @ 2016-06-24  4:10 UTC (permalink / raw)
  To: dsterba; +Cc: Qu Wenruo, linux-btrfs

On Thursday, June 23, 2016 02:17:38 PM David Sterba wrote:
> On Tue, Jun 21, 2016 at 10:25:19PM +0530, Chandan Rajendra wrote:
> > > > I'm completely OK to do the rebase, but since I don't have 64K page size 
> > > > machine to test the rebase, we can only test if 4K system is unaffected.
> > > > 
> > > > Although not much help, at least it would be better than making it compile.
> > > > 
> > > > Also such rebase may help us to expose bad design/unexpected corner case 
> > > > in dedupe.
> > > > So if it's OK, please let me try to do the rebase.
> > > 
> > > Well, if you base dedupe on subpage, then it could be hard to find the
> > > patchset that introduces bugs, or combination of both. We should be able
> > > to test the features independently, and thus I'm proposing to first find
> > > some common patchset that makes that possible.
> > 
> > I am not sure if I understood the above statement correctly. Do you mean to
> > commit the 'common/simple' patches from both the subpage-blocksize & dedupe
> > patchset first and then bring in the complicated ones later?
> 
> That would be great yes, but ...
> 
> > If yes, then we have a problem doing that w.r.t subpage-blocksize
> > patchset. The first few patches bring in the core changes necessary for the
> > other remaining patches.
> 
> ... not easily possible. I looked again for common functions that change
> the singature and found only cow_file_range and run_delalloc_nocow. The
> plan:
> 
> - separate patch that adds new parameters required by both patches to
>   the functions
> - update all call sites, add 0/NULL as defaults for the new unused
>   parameters
> - rebase both patches on top of this patch
> 
> How does this help: if a patch starts to use the new parameter, it
> changes only the value at all call sites. This is much easier to verify
> and merge manually compared to adding a new parameter to the middle of
> the list, namely when the functions take 6+.

David, I can implement it. In my next post of the subpage-blocksize patchset, I
will bring in this change.


-- 
chandan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-24  2:50               ` Qu Wenruo
@ 2016-06-24  4:34                 ` Chandan Rajendra
  2016-06-24  9:29                 ` Chandan Rajendra
  1 sibling, 0 replies; 34+ messages in thread
From: Chandan Rajendra @ 2016-06-24  4:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
> Hi Chandan, David,
> 
> When I'm trying to rebase dedupe patchset on top of Chadan's sub page 
> size patchset (using David's for-next-test-20160620), although the 
> rebase itself is quite simple, but I'm afraid that I found some bugs for 
> sub page size patchset, *without* dedupe patchset applied.
> 
> These bugs seems to be unrelated to each other
> 1) state leak at btrfs rmmod time
> 2) bytes_may_use leak at qgroup EDQUOTA error time
> 3) selftest is run several times at modules load time
>     15 times, to be more exact
>     And since I didn't found any immediate number related to run it 15
>     times, I assume at least it's not designed to do it 15 times.
>

Ah, in btrfs_run_sanity_tests(), just after

for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
        sectorsize = test_sectorsize[i];

I missed out on adding "if (sectorsize > PAGE_SIZE) break;". I will fix
this up in the next post of the patchset. Thanks for pointing this out.

> The reproducer for 1) and 2) is quite simple, extracted from btrfs/022 
> test case:
> ------
> dev=/dev/sdb5
> mnt=/mnt/test
> 
> umount $dev &> /dev/null
> 
> mkfs.btrfs $dev -f
> mount $dev $mnt -o nospace_cache
> btrfs dedupe enable $mnt
> btrfs sub create $mnt/sub
> btrfs quota enable $mnt
> 
> 
> # Just use small limit, making ftrace less noise.
> btrfs qgroup limit 512K 0/257 $mnt
> dd if=/dev/urandom of=$mnt/sub/test bs=1M count=1
> umount $mnt
> rmmod btrfs
> ------
> 
> At unmount time, kernel warning will happen due to may_use bytes leak.
> I could dig it further, as it looks like a bug in space reservation 
> failure case.
> ------
> BTRFS: space_info 1 has 8044544 free, is not full
> BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0, 
> may_use=409600, readonly=0
> ------
> 
> And at rmmod time, btrfs will detect extent_state leak, whose length is 
> always 4095 (page size - 1).
>

Qu, I will investigate and fix this issue. And thanks a lot for the
reproducer test.

-- 
chandan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-22  1:48 ` Qu Wenruo
@ 2016-06-24  6:54   ` Satoru Takeuchi
  2016-06-24  8:30     ` Qu Wenruo
  0 siblings, 1 reply; 34+ messages in thread
From: Satoru Takeuchi @ 2016-06-24  6:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 2016/06/22 10:48, Qu Wenruo wrote:
> Here is the long-awaited (simple and theoretical) performance test for dedupe.
> 
> Such results may be added to the btrfs wiki page, as advice for dedupe use cases.
> 
> The full result can be checked from google drive:
> https://drive.google.com/file/d/0BxpkL3ehzX3pb05WT1lZSnRGbjA/view?usp=sharing
> 
> [Short Conclusion]
> For high dedupe rate and easily compressible data,
> if cpu cores >= 4, dedupe speed is on par with lzo compression,
> and faster than default dd, about 35% faster.
> 
> if cpu == 2, lzo compression is faster than dedupe, but both faster than default dd.
> 
> For cpu == 1, lzo compression is on par with SAS HDD, while dedupe is slower than default dd.

It's better to clarify the meaning of the numbers
described in your graph.

> 
> [Test Platform]
> The test platform is Fujitsu PRIMERGY RX300 S7.
> CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2 nodes)
> Memory: 32G, while limited to 8G in performance tests
> Disk: 300G SAS HDDs with hardware RAID 5/6

How about describing the kernel versions too?

> 
> [Test method]
> Just do 40G buffered write into a new btrfs.
> Since it's 5 times(in fact 7 times, since only 5.7G available memory) the total usable memory, flush will happen several times.
> 
> dd if=/dev/zero bs=1M count=40960 of=/mnt/btrfs/out
> 
> [Future plan]
> More tests on less theoretical cases, like low-to-medium dedupe rates,
> which may lead to slower performance than raw dd.
> 
> Considering lzo is already the fastest compression method btrfs provides so far, SHA512 should make dedupe even faster, faster than compression.
> 
> Also, current dedupe splits each dealloc range into 512K segments first, then into 128K (the default dedupe block size) chunks, and balances the hash work across CPUs, so for smaller dedupe block sizes, dedupe should be faster and make full use of all CPUs.

How about adding the performance of the kernel without the
dedupe patchset? By adding such data and comparing it with the
"default" data, you can prove the dedupe patchset doesn't
affect performance at all when dedupe is disabled.

Thanks,
Satoru

> 
> Thanks,
> Qu
> 
> 
> At 06/15/2016 10:09 AM, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux.git wang_dedupe_20160524
>>
>> In this update, the patchset goes through another re-organization along
>> with other fixes to address comments from community.
>> 1) Move on-disk backend and dedupe props out of the patchset
>>    Suggested by David.
>>    There is still some discussion on the on-disk format.
>>    And dedupe prop is still not 100% determined.
>>
>>    So it's better to focus on the current in-memory backend only, which
>>    doesn't bring any on-disk format change.
>>
>>    Once the framework is done, new backends and props can be added more
>>    easily.
>>
>> 2) Better enable/disable and buffered write race avoidance
>>    Inspired by Mark.
>>    Although in previous version, we didn't trigger it with our test
>>    case, but if we manually add delay(5s) to __btrfs_buffered_write(),
>>    it's possible to trigger disable and buffered write race.
>>
>>    The cause is, there is a windows between __btrfs_buffered_write() and
>>    btrfs_dirty_pages().
>>    In that window, sync_filesystem() can return very quickly since there
>>    is no dirty page.
>>    During that window, dedupe disable can happen and finish, and
>>    buffered writer may access to the NULL pointer of dedupe info.
>>
>>    Now we use sb->s_writers.rw_sem to wait all current writers and block
>>    further writers, then sync the fs, change dedupe status and finally
>>    unblock writers. (Like freeze)
>>    This provides clearer logical and code, and safer than previous
>>    method, because there is no windows before we dirty pages.
>>
>> 3) Fix ENOSPC problem with better solution.
>>    Pointed out by Josef.
>>    The last 2 patches from Wang fixes ENOSPC problem, in a more
>>    comprehensive method for delalloc metadata reservation.
>>    Alone with small outstanding extents improvement, to co-operate with
>>    tunable max extent size.
>>
>> Now the whole patchset will only add in-memory backend as a whole.
>> No other backend nor prop.
>> So we can focus on the framework itself.
>>
>> Next version will focus on ioctl interface modification suggested by
>> David.
>>
>> Thanks,
>> Qu
>>
>> Changelog:
>> v2:
>>   Totally reworked to handle multiple backends
>> v3:
>>   Fix a stupid but deadly on-disk backend bug
>>   Add handle for multiple hash on same bytenr corner case to fix abort
>>   trans error
>>   Increase dedup rate by enhancing delayed ref handler for both backend.
>>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>>   Increase dedup block size up limit to 8M.
>> v4:
>>   Add dedup prop for disabling dedup for given files/dirs.
>>   Merge inmem_search() and ondisk_search() into generic_search() to save
>>   some code
>>   Fix another delayed_ref related bug.
>>   Use the same mutex for both inmem and ondisk backend.
>>   Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>>   rate.
>> v5:
>>   Reuse compress routine for much simpler dedup function.
>>   Slightly improved performance due to above modification.
>>   Fix race between dedup enable/disable
>>   Fix for false ENOSPC report
>> v6:
>>   Further enable/disable race window fix.
>>   Minor format change according to checkpatch.
>> v7:
>>   Fix one concurrency bug with balance.
>>   Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>>   btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>>   and wrong parameter.
>>   Rebased to integration-4.6.
>> v8:
>>   Rename 'dedup' to 'dedupe'.
>>   Add support to allow dedupe and compression work at the same time.
>>   Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>>   who exposed most of them.
>>   Small dedupe hit case performance improvement.
>> v9:
>>   Re-order the patchset to completely separate pure in-memory and any
>>   on-disk format change.
>>   Fold bug fixes into its original patch.
>> v10:
>>   Adding back missing bug fix patch.
>>   Reduce on-disk item size.
>>   Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
>> v11:
>>   Remove other backend and props support to focus on the framework and
>>   in-memory backend. Suggested by David.
>>   Better disable and buffered write race protection.
>>   Comprehensive fix to dedupe metadata ENOSPC problem.
>>
>> Qu Wenruo (3):
>>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>>   btrfs: dedupe: Inband in-memory only de-duplication implement
>>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>>
>> Wang Xiaoguang (10):
>>   btrfs: dedupe: Introduce dedupe framework and its header
>>   btrfs: dedupe: Introduce function to initialize dedupe info
>>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>>   btrfs: dedupe: Introduce function to search for an existing hash
>>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>>   btrfs: ordered-extent: Add support for dedupe
>>   btrfs: dedupe: Add ioctl for inband dedupelication
>>   btrfs: improve inode's outstanding_extents computation
>>   btrfs: dedupe: fix false ENOSPC
>>
>>  fs/btrfs/Makefile           |   2 +-
>>  fs/btrfs/ctree.h            |  25 +-
>>  fs/btrfs/dedupe.c           | 710 ++++++++++++++++++++++++++++++++++++++++++++
>>  fs/btrfs/dedupe.h           | 210 +++++++++++++
>>  fs/btrfs/delayed-ref.c      |  30 +-
>>  fs/btrfs/delayed-ref.h      |   8 +
>>  fs/btrfs/disk-io.c          |   4 +
>>  fs/btrfs/extent-tree.c      |  83 +++++-
>>  fs/btrfs/extent_io.c        |  63 +++-
>>  fs/btrfs/extent_io.h        |  15 +-
>>  fs/btrfs/file.c             |  26 +-
>>  fs/btrfs/free-space-cache.c |   5 +-
>>  fs/btrfs/inode-map.c        |   4 +-
>>  fs/btrfs/inode.c            | 434 ++++++++++++++++++++++-----
>>  fs/btrfs/ioctl.c            |  80 ++++-
>>  fs/btrfs/ordered-data.c     |  46 ++-
>>  fs/btrfs/ordered-data.h     |  14 +
>>  fs/btrfs/relocation.c       |  46 ++-
>>  fs/btrfs/sysfs.c            |   2 +
>>  include/uapi/linux/btrfs.h  |  41 +++
>>  20 files changed, 1701 insertions(+), 147 deletions(-)
>>  create mode 100644 fs/btrfs/dedupe.c
>>  create mode 100644 fs/btrfs/dedupe.h
>>
> 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-24  6:54   ` Satoru Takeuchi
@ 2016-06-24  8:30     ` Qu Wenruo
  0 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-24  8:30 UTC (permalink / raw)
  To: Satoru Takeuchi, Qu Wenruo, linux-btrfs



On 06/24/2016 02:54 PM, Satoru Takeuchi wrote:
> On 2016/06/22 10:48, Qu Wenruo wrote:
>> Here is the long-awaited (simple and theoretical) performance test for dedupe.
>>
>> Such results may be added to the btrfs wiki page, as advice for dedupe use cases.
>>
>> The full result can be checked from google drive:
>> https://drive.google.com/file/d/0BxpkL3ehzX3pb05WT1lZSnRGbjA/view?usp=sharing
>>
>> [Short Conclusion]
>> For high dedupe rate and easily compressible data,
>> if cpu cores >= 4, dedupe speed is on par with lzo compression,
>> and faster than default dd, about 35% faster.
>>
>> if cpu == 2, lzo compression is faster than dedupe, but both faster than default dd.
>>
>> For cpu == 1, lzo compression is on par with SAS HDD, while dedupe is slower than default dd.
>
> It's better to clarify the meaning of the numbers
> described in you graph.

Right, I just missed the unit (MBytes/s)

>
>>
>> [Test Platform]
>> The test platform is Fujitsu PRIMERGY RX300 S7.
>> CPU: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz (2 nodes)
>> Memory: 32G, while limited to 8G in performance tests
>> Disk: 300G SAS HDDs with hardware RAID 5/6
>
> How about describing the kernel versions too?

Yes, I should have mentioned it: the wang_dedupe_20160524 branch.

>
>>
>> [Test method]
>> Just do 40G buffered write into a new btrfs.
>> Since it's 5 times(in fact 7 times, since only 5.7G available memory) the total usable memory, flush will happen several times.
>>
>> dd if=/dev/zero bs=1M count=40960 of=/mnt/btrfs/out
>>
>> [Future plan]
>> More tests on less theoretical cases, like low-to-medium dedupe rates,
>> which may lead to slower performance than raw dd.
>>
>> Considering lzo is already the fastest compression method btrfs provides so far, SHA512 should make dedupe even faster, faster than compression.
>>
>> Also, current dedupe splits each dealloc range into 512K segments first, then into 128K (the default dedupe block size) chunks, and balances the hash work across CPUs, so for smaller dedupe block sizes, dedupe should be faster and make full use of all CPUs.
>
> How about adding the performance of the kernel without
> dedupe patchset? By adding such data and comparing it with
> "default" data, you can prove dedupe patchset doesn't
> affect performance at all if dedupe is disabled.

Right, makes sense.
I'll add more tests and results in the next version.

Thanks,
Qu

>
> Thanks,
> Satoru
>
>>
>> Thanks,
>> Qu
>>
>>
>> At 06/15/2016 10:09 AM, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160524
>>>
>>> In this update, the patchset goes through another re-organization along
>>> with other fixes to address comments from community.
>>> 1) Move on-disk backend and dedupe props out of the patchset
>>>    Suggested by David.
>>>    There is still some discussion on the on-disk format.
>>>    And dedupe prop is still not 100% determined.
>>>
>>>    So it's better to focus on the current in-memory backend only, which
>>>    doesn't bring any on-disk format change.
>>>
>>>    Once the framework is done, new backends and props can be added more
>>>    easily.
>>>
>>> 2) Better enable/disable and buffered write race avoidance
>>>    Inspired by Mark.
>>>    Although in previous version, we didn't trigger it with our test
>>>    case, but if we manually add delay(5s) to __btrfs_buffered_write(),
>>>    it's possible to trigger disable and buffered write race.
>>>
>>>    The cause is, there is a windows between __btrfs_buffered_write() and
>>>    btrfs_dirty_pages().
>>>    In that window, sync_filesystem() can return very quickly since there
>>>    is no dirty page.
>>>    During that window, dedupe disable can happen and finish, and
>>>    buffered writer may access to the NULL pointer of dedupe info.
>>>
>>>    Now we use sb->s_writers.rw_sem to wait all current writers and block
>>>    further writers, then sync the fs, change dedupe status and finally
>>>    unblock writers. (Like freeze)
>>>    This provides clearer logical and code, and safer than previous
>>>    method, because there is no windows before we dirty pages.
>>>
>>> 3) Fix ENOSPC problem with better solution.
>>>    Pointed out by Josef.
>>>    The last 2 patches from Wang fixes ENOSPC problem, in a more
>>>    comprehensive method for delalloc metadata reservation.
>>>    Alone with small outstanding extents improvement, to co-operate with
>>>    tunable max extent size.
>>>
>>> Now the whole patchset will only add in-memory backend as a whole.
>>> No other backend nor prop.
>>> So we can focus on the framework itself.
>>>
>>> Next version will focus on ioctl interface modification suggested by
>>> David.
>>>
>>> Thanks,
>>> Qu
>>>
>>> Changelog:
>>> v2:
>>>   Totally reworked to handle multiple backends
>>> v3:
>>>   Fix a stupid but deadly on-disk backend bug
>>>   Add handle for multiple hash on same bytenr corner case to fix abort
>>>   trans error
>>>   Increase dedup rate by enhancing delayed ref handler for both backend.
>>>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>>>   Increase dedup block size up limit to 8M.
>>> v4:
>>>   Add dedup prop for disabling dedup for given files/dirs.
>>>   Merge inmem_search() and ondisk_search() into generic_search() to save
>>>   some code
>>>   Fix another delayed_ref related bug.
>>>   Use the same mutex for both inmem and ondisk backend.
>>>   Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>>>   rate.
>>> v5:
>>>   Reuse compress routine for much simpler dedup function.
>>>   Slightly improved performance due to above modification.
>>>   Fix race between dedup enable/disable
>>>   Fix for false ENOSPC report
>>> v6:
>>>   Further enable/disable race window fix.
>>>   Minor format change according to checkpatch.
>>> v7:
>>>   Fix one concurrency bug with balance.
>>>   Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>>>   btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>>>   and wrong parameter.
>>>   Rebased to integration-4.6.
>>> v8:
>>>   Rename 'dedup' to 'dedupe'.
>>>   Add support to allow dedupe and compression work at the same time.
>>>   Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>>>   who exposed most of them.
>>>   Small dedupe hit case performance improvement.
>>> v9:
>>>   Re-order the patchset to completely separate pure in-memory and any
>>>   on-disk format change.
>>>   Fold bug fixes into its original patch.
>>> v10:
>>>   Adding back missing bug fix patch.
>>>   Reduce on-disk item size.
>>>   Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
>>> v11:
>>>   Remove other backend and props support to focus on the framework and
>>>   in-memory backend. Suggested by David.
>>>   Better disable and buffered write race protection.
>>>   Comprehensive fix to dedupe metadata ENOSPC problem.
>>>
>>> Qu Wenruo (3):
>>>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>>>   btrfs: dedupe: Inband in-memory only de-duplication implement
>>>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>>>
>>> Wang Xiaoguang (10):
>>>   btrfs: dedupe: Introduce dedupe framework and its header
>>>   btrfs: dedupe: Introduce function to initialize dedupe info
>>>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>>>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>>>   btrfs: dedupe: Introduce function to search for an existing hash
>>>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>>>   btrfs: ordered-extent: Add support for dedupe
>>>   btrfs: dedupe: Add ioctl for inband dedupelication
>>>   btrfs: improve inode's outstanding_extents computation
>>>   btrfs: dedupe: fix false ENOSPC
>>>
>>>  fs/btrfs/Makefile           |   2 +-
>>>  fs/btrfs/ctree.h            |  25 +-
>>>  fs/btrfs/dedupe.c           | 710 ++++++++++++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/dedupe.h           | 210 +++++++++++++
>>>  fs/btrfs/delayed-ref.c      |  30 +-
>>>  fs/btrfs/delayed-ref.h      |   8 +
>>>  fs/btrfs/disk-io.c          |   4 +
>>>  fs/btrfs/extent-tree.c      |  83 +++++-
>>>  fs/btrfs/extent_io.c        |  63 +++-
>>>  fs/btrfs/extent_io.h        |  15 +-
>>>  fs/btrfs/file.c             |  26 +-
>>>  fs/btrfs/free-space-cache.c |   5 +-
>>>  fs/btrfs/inode-map.c        |   4 +-
>>>  fs/btrfs/inode.c            | 434 ++++++++++++++++++++++-----
>>>  fs/btrfs/ioctl.c            |  80 ++++-
>>>  fs/btrfs/ordered-data.c     |  46 ++-
>>>  fs/btrfs/ordered-data.h     |  14 +
>>>  fs/btrfs/relocation.c       |  46 ++-
>>>  fs/btrfs/sysfs.c            |   2 +
>>>  include/uapi/linux/btrfs.h  |  41 +++
>>>  20 files changed, 1701 insertions(+), 147 deletions(-)
>>>  create mode 100644 fs/btrfs/dedupe.c
>>>  create mode 100644 fs/btrfs/dedupe.h
>>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-24  2:50               ` Qu Wenruo
  2016-06-24  4:34                 ` Chandan Rajendra
@ 2016-06-24  9:29                 ` Chandan Rajendra
  2016-06-25  1:22                   ` Qu Wenruo
  1 sibling, 1 reply; 34+ messages in thread
From: Chandan Rajendra @ 2016-06-24  9:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
> Hi Chandan, David,
> 
> When I'm trying to rebase dedupe patchset on top of Chadan's sub page 
> size patchset (using David's for-next-test-20160620), although the 
> rebase itself is quite simple, but I'm afraid that I found some bugs for 
> sub page size patchset, *without* dedupe patchset applied.
> 
> These bugs seems to be unrelated to each other
> 1) state leak at btrfs rmmod time

The leak was due to not freeing 'cached_state' in
read_extent_buffer_pages(). I have fixed this and the fix will be part of the
patchset when I post the next version to the mailing list.

I have always compiled the btrfs code into the vmlinux image and hence
have never rmmod'ed the btrfs module during my local testing. The space
leak messages might have appeared when I shut down my guest, so I had
never noticed them before. Thanks once again for informing me about it.

> 2) bytes_may_use leak at qgroup EDQUOTA error time

I have a slightly older version of btrfs-progs which does not yet have
the "btrfs dedupe" command. I will get the new version and check whether
the space leak can be reproduced on my machine.

However, I don't see the space leak warning messages when the reproducer
script is executed after commenting out "btrfs dedupe enable $mnt".

> 3) selftest is run several times at modules load time
>     15 times, to be more exact
>     And since I didn't found any immediate number related to run it 15
>     times, I assume at least it's not designed to do it 15 times.
> 
> The reproducer for 1) and 2) is quite simple, extracted from btrfs/022 
> test case:
> ------
> dev=/dev/sdb5
> mnt=/mnt/test
> 
> umount $dev &> /dev/null
> 
> mkfs.btrfs $dev -f
> mount $dev $mnt -o nospace_cache
> btrfs dedupe enable $mnt
> btrfs sub create $mnt/sub
> btrfs quota enable $mnt
> 
> 
> # Just use small limit, making ftrace less noise.
> btrfs qgroup limit 512K 0/257 $mnt
> dd if=/dev/urandom of=$mnt/sub/test bs=1M count=1
> umount $mnt
> rmmod btrfs
> ------
> 
> At unmount time, kernel warning will happen due to may_use bytes leak.
> I could dig it further, as it looks like a bug in space reservation 
> failure case.
> ------
> BTRFS: space_info 1 has 8044544 free, is not full
> BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0, 
> may_use=409600, readonly=0
> ------
> 
> And at rmmod time, btrfs detects an extent_state leak whose length is
> always 4095 (page size - 1).
> 
> Hope this will help, and I'm willing to help to fix the problem.
> 
> Thanks,
> Qu
> 
> At 06/23/2016 08:17 PM, David Sterba wrote:
> > On Tue, Jun 21, 2016 at 10:25:19PM +0530, Chandan Rajendra wrote:
> >>>> I'm completely OK to do the rebase, but since I don't have 64K page size
> >>>> machine to test the rebase, we can only test if 4K system is unaffected.
> >>>>
> >>>> Although not much help, at least it would be better than making it compile.
> >>>>
> >>>> Also such rebase may help us to expose bad design/unexpected corner case
> >>>> in dedupe.
> >>>> So if it's OK, please let me try to do the rebase.
> >>>
> >>> Well, if you base dedupe on subpage, then it could be hard to find the
> >>> patchset that introduces bugs, or combination of both. We should be able
> >>> to test the features independently, and thus I'm proposing to first find
> >>> some common patchset that makes that possible.
> >>
> >> I am not sure if I understood the above statement correctly. Do you mean to
> >> commit the 'common/simple' patches from both the subpage-blocksize & dedupe
> >> patchset first and then bring in the complicated ones later?
> >
> > That would be great yes, but ...
> >
> >> If yes, then we have a problem doing that w.r.t subpage-blocksize
> >> patchset. The first few patches bring in the core changes necessary for the
> >> other remaining patches.
> >
> > ... not easily possible. I looked again for common functions that change
> > the signature and found only cow_file_range and run_delalloc_nocow. The
> > plan:
> >
> > - separate patch that adds new parameters required by both patches to
> >   the functions
> > - update all call sites, add 0/NULL as defaults for the new unused
> >   parameters
> > - rebase both patches on top of this patch
> >
> > How does this help: if a patch starts to use the new parameter, it
> > changes only the value at all call sites. This is much easier to verify
> > and merge manually compared to adding a new parameter to the middle of
> > the list, namely when the functions take 6+.
> >
> > The other conflicts like conversion from PAGE_SIZE to the block
> > oriented iterations will be harder, but these are usually localized and
> > can be resolved. We'll see if there are other options to reduce the
> > clashes but at the moment it's stuck at the two functions. Does that
> > explain it better?
> >
> >
> 
> 

-- 
chandan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-24  9:29                 ` Chandan Rajendra
@ 2016-06-25  1:22                   ` Qu Wenruo
  2016-06-25  5:45                     ` Chandan Rajendra
  0 siblings, 1 reply; 34+ messages in thread
From: Qu Wenruo @ 2016-06-25  1:22 UTC (permalink / raw)
  To: Chandan Rajendra, Qu Wenruo; +Cc: dsterba, linux-btrfs



On 06/24/2016 05:29 PM, Chandan Rajendra wrote:
> On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
>> Hi Chandan, David,
>>
>> When I'm trying to rebase the dedupe patchset on top of Chandan's sub page
>> size patchset (using David's for-next-test-20160620), although the
>> rebase itself is quite simple, I'm afraid I found some bugs in the
>> sub page size patchset, *without* the dedupe patchset applied.
>>
>> These bugs seem to be unrelated to each other:
>> 1) state leak at btrfs rmmod time
>
> The leak was due to not freeing 'cached_state' in
> read_extent_buffer_pages(). I have fixed this and the fix will be part of the
> patchset when I post the next version to the mailing list.
>
> I have always compiled the btrfs code as part of the vmlinux image and hence
> have never rmmod the btrfs module during my local testing. The space leak
> messages might have appeared when I shut down my guest. Hence I had never
> noticed them before. Thanks once again for informing me about it.
>
>> 2) bytes_may_use leak at qgroup EDQUOTA error time
>
> I have a slightly older version of btrfs-progs which does not yet have the
> "btrfs dedupe" command. I will get the new version and check whether the
> space leak can be reproduced on my machine.
>
> However, I don't see the space leak warning messages when the reproducer
> script is executed after commenting out "btrfs dedupe enable $mnt".

Strange.
That dedupe command does nothing at all, as I'm using the branch
without the dedupe patchset.
Even with the btrfs-progs dedupe patchset, "dedupe enable" only outputs an
ENOTTY error message.

I'll double check if it's related to the dedupe.

BTW, are you testing with 4K page size?

Thanks,
Qu
>
>> 3) selftest is run several times at module load time
>>     15 times, to be exact
>>     Since I didn't find any obvious reason for it to run 15 times, I
>>     assume it's not designed to do so.
>>
>> The reproducer for 1) and 2) is quite simple, extracted from btrfs/022
>> test case:
>> ------
>> dev=/dev/sdb5
>> mnt=/mnt/test
>>
>> umount $dev &> /dev/null
>>
>> mkfs.btrfs $dev -f
>> mount $dev $mnt -o nospace_cache
>> btrfs dedupe enable $mnt
>> btrfs sub create $mnt/sub
>> btrfs quota enable $mnt
>>
>>
>> # Use a small limit to make the ftrace output less noisy.
>> btrfs qgroup limit 512K 0/257 $mnt
>> dd if=/dev/urandom of=$mnt/sub/test bs=1M count=1
>> umount $mnt
>> rmmod btrfs
>> ------
>>
>> At unmount time, a kernel warning happens due to the may_use bytes leak.
>> I could dig into it further, as it looks like a bug in the space
>> reservation failure path.
>> ------
>> BTRFS: space_info 1 has 8044544 free, is not full
>> BTRFS: space_info total=8388608, used=344064, pinned=0, reserved=0,
>> may_use=409600, readonly=0
>> ------
>>
>> And at rmmod time, btrfs detects an extent_state leak whose length is
>> always 4095 (page size - 1).
>>
>> Hope this will help, and I'm willing to help to fix the problem.
>>
>> Thanks,
>> Qu
>>
>> At 06/23/2016 08:17 PM, David Sterba wrote:
>>> On Tue, Jun 21, 2016 at 10:25:19PM +0530, Chandan Rajendra wrote:
>>>>>> I'm completely OK to do the rebase, but since I don't have 64K page size
>>>>>> machine to test the rebase, we can only test if 4K system is unaffected.
>>>>>>
>>>>>> Although not much help, at least it would be better than making it compile.
>>>>>>
>>>>>> Also such rebase may help us to expose bad design/unexpected corner case
>>>>>> in dedupe.
>>>>>> So if it's OK, please let me try to do the rebase.
>>>>>
>>>>> Well, if you base dedupe on subpage, then it could be hard to find the
>>>>> patchset that introduces bugs, or combination of both. We should be able
>>>>> to test the features independently, and thus I'm proposing to first find
>>>>> some common patchset that makes that possible.
>>>>
>>>> I am not sure if I understood the above statement correctly. Do you mean to
>>>> commit the 'common/simple' patches from both the subpage-blocksize & dedupe
>>>> patchset first and then bring in the complicated ones later?
>>>
>>> That would be great yes, but ...
>>>
>>>> If yes, then we have a problem doing that w.r.t subpage-blocksize
>>>> patchset. The first few patches bring in the core changes necessary for the
>>>> other remaining patches.
>>>
>>> ... not easily possible. I looked again for common functions that change
>>> the signature and found only cow_file_range and run_delalloc_nocow. The
>>> plan:
>>>
>>> - separate patch that adds new parameters required by both patches to
>>>   the functions
>>> - update all call sites, add 0/NULL as defaults for the new unused
>>>   parameters
>>> - rebase both patches on top of this patch
>>>
>>> How does this help: if a patch starts to use the new parameter, it
>>> changes only the value at all call sites. This is much easier to verify
>>> and merge manually compared to adding a new parameter to the middle of
>>> the list, namely when the functions take 6+.
>>>
>>> The other conflicts like conversion from PAGE_SIZE to the block
>>> oriented iterations will be harder, but these are usually localized and
>>> can be resolved. We'll see if there are other options to reduce the
>>> clashes but at the moment it's stuck at the two functions. Does that
>>> explain it better?
>>>
>>>
>>
>>
>


* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-25  1:22                   ` Qu Wenruo
@ 2016-06-25  5:45                     ` Chandan Rajendra
  2016-06-27  3:04                       ` Qu Wenruo
  0 siblings, 1 reply; 34+ messages in thread
From: Chandan Rajendra @ 2016-06-25  5:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, dsterba, linux-btrfs

On Saturday, June 25, 2016 09:22:44 AM Qu Wenruo wrote:
> 
> On 06/24/2016 05:29 PM, Chandan Rajendra wrote:
> > On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
> >> Hi Chandan, David,
> >>
> >> When I'm trying to rebase the dedupe patchset on top of Chandan's sub page
> >> size patchset (using David's for-next-test-20160620), although the
> >> rebase itself is quite simple, I'm afraid I found some bugs in the
> >> sub page size patchset, *without* the dedupe patchset applied.
> >>
> >> These bugs seem to be unrelated to each other:
> >> 1) state leak at btrfs rmmod time
> >
> > The leak was due to not freeing 'cached_state' in
> > read_extent_buffer_pages(). I have fixed this and the fix will be part of the
> > patchset when I post the next version to the mailing list.
> >
> > I have always compiled the btrfs code as part of the vmlinux image and hence
> > have never rmmod the btrfs module during my local testing. The space leak
> > messages might have appeared when I shut down my guest. Hence I had never
> > noticed them before. Thanks once again for informing me about it.
> >
> >> 2) bytes_may_use leak at qgroup EDQUOTA error time
> >
> > I have a slightly older version of btrfs-progs which does not yet have the
> > "btrfs dedupe" command. I will get the new version and check whether the
> > space leak can be reproduced on my machine.
> >
> > However, I don't see the space leak warning messages when the reproducer
> > script is executed after commenting out "btrfs dedupe enable $mnt".
> 
> Strange.
> That dedupe command does nothing at all, as I'm using the branch
> without the dedupe patchset.
> Even with the btrfs-progs dedupe patchset, "dedupe enable" only outputs an
> ENOTTY error message.
> 
> I'll double check if it's related to the dedupe.
> 
> BTW, are you testing with 4K page size?

Yes, I executed the script with a 4K page size. I had based my patchset on top
of the 4.7-rc2 kernel. If you are interested, you can get the kernel sources at
'https://github.com/chandanr/linux subpagesize-blocksize'.

I will soon rebase my patchset on David's master branch. I will let you know
if I hit the space leak issue on the rebased kernel.

-- 
chandan



* Re: [PATCH v11 00/13] Btrfs dedupe framework
  2016-06-25  5:45                     ` Chandan Rajendra
@ 2016-06-27  3:04                       ` Qu Wenruo
  0 siblings, 0 replies; 34+ messages in thread
From: Qu Wenruo @ 2016-06-27  3:04 UTC (permalink / raw)
  To: Chandan Rajendra, Qu Wenruo; +Cc: dsterba, linux-btrfs



At 06/25/2016 01:45 PM, Chandan Rajendra wrote:
> On Saturday, June 25, 2016 09:22:44 AM Qu Wenruo wrote:
>>
>> On 06/24/2016 05:29 PM, Chandan Rajendra wrote:
>>> On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
>>>> Hi Chandan, David,
>>>>
>>>> When I'm trying to rebase the dedupe patchset on top of Chandan's sub page
>>>> size patchset (using David's for-next-test-20160620), although the
>>>> rebase itself is quite simple, I'm afraid I found some bugs in the
>>>> sub page size patchset, *without* the dedupe patchset applied.
>>>>
>>>> These bugs seem to be unrelated to each other:
>>>> 1) state leak at btrfs rmmod time
>>>
>>> The leak was due to not freeing 'cached_state' in
>>> read_extent_buffer_pages(). I have fixed this and the fix will be part of the
>>> patchset when I post the next version to the mailing list.
>>>
>>> I have always compiled the btrfs code as part of the vmlinux image and hence
>>> have never rmmod the btrfs module during my local testing. The space leak
>>> messages might have appeared when I shut down my guest. Hence I had never
>>> noticed them before. Thanks once again for informing me about it.
>>>
>>>> 2) bytes_may_use leak at qgroup EDQUOTA error time
>>>
>>> I have a slightly older version of btrfs-progs which does not yet have the
>>> "btrfs dedupe" command. I will get the new version and check whether the
>>> space leak can be reproduced on my machine.
>>>
>>> However, I don't see the space leak warning messages when the reproducer
>>> script is executed after commenting out "btrfs dedupe enable $mnt".
>>
>> Strange.
>> That dedupe command does nothing at all, as I'm using the branch
>> without the dedupe patchset.
>> Even with the btrfs-progs dedupe patchset, "dedupe enable" only outputs an
>> ENOTTY error message.
>>
>> I'll double check if it's related to the dedupe.
>>
>> BTW, are you testing with 4K page size?
>
> Yes, I executed the script with 4k page size. I had based my patchset on top
> of 4.7-rc2 kernel. If you are interested, you can get the kernel sources at
> 'https://github.com/chandanr/linux subpagesize-blocksize'.
>
> I will soon rebase my patchset on David's master branch. I will let you know
> if I hit the space leak issue on the rebased kernel.
>

Thanks for your info.

Confirmed that the space leak is unrelated to your sub page size patchset.

Another patchset is causing the bug.
I have bisected it and will inform the author.

Many thanks for your help.
Qu




end of thread, other threads:[~2016-06-27  3:04 UTC | newest]

Thread overview: 34+ messages
2016-06-15  2:09 [PATCH v11 00/13] Btrfs dedupe framework Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 01/13] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 02/13] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 03/13] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 04/13] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 05/13] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 06/13] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 07/13] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 08/13] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 09/13] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 10/13] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
2016-06-15  2:09 ` [PATCH v11 11/13] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
2016-06-15  2:10 ` [PATCH v11 12/13] btrfs: improve inode's outstanding_extents computation Qu Wenruo
2016-06-15  2:10 ` [PATCH v11 13/13] btrfs: dedupe: fix false ENOSPC Qu Wenruo
2016-06-15  3:11   ` kbuild test robot
2016-06-15  3:17   ` [PATCH v11.1 " Qu Wenruo
2016-06-15  3:26   ` [PATCH v11 " kbuild test robot
2016-06-20 16:03 ` [PATCH v11 00/13] Btrfs dedupe framework David Sterba
2016-06-21  0:36   ` Qu Wenruo
2016-06-21  9:13     ` David Sterba
2016-06-21  9:26       ` Qu Wenruo
2016-06-21  9:34         ` David Sterba
2016-06-21 16:55           ` Chandan Rajendra
2016-06-23 12:17             ` David Sterba
2016-06-24  2:50               ` Qu Wenruo
2016-06-24  4:34                 ` Chandan Rajendra
2016-06-24  9:29                 ` Chandan Rajendra
2016-06-25  1:22                   ` Qu Wenruo
2016-06-25  5:45                     ` Chandan Rajendra
2016-06-27  3:04                       ` Qu Wenruo
2016-06-24  4:10               ` Chandan Rajendra
2016-06-22  1:48 ` Qu Wenruo
2016-06-24  6:54   ` Satoru Takeuchi
2016-06-24  8:30     ` Qu Wenruo
