* [PATCH v10 00/21] Btrfs dedupe framework
@ 2016-04-01  6:34 Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
                   ` (22 more replies)
  0 siblings, 23 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160401

In this patchset, we're proud to bring a completely new storage backend:
the Khala backend.

With the Khala backend, all dedupe hashes are stored in the Khala,
shared with every Khalai protoss, with unlimited storage and almost zero
search latency.
A perfect backend for any Khalai protoss. "My life for Aiur!"

Unfortunately, such a backend is not available to humans.


OK, that super-fancy, date-appropriate backend aside, this is still a
serious patchset.
In this patchset, we mainly address Chris's comments on the on-disk
format change:
1) Reduced dedupe hash item and bytenr item sizes.
   The dedupe hash item structure size is reduced from 41 bytes
   (9-byte hash_item + 32-byte hash)
   to 29 bytes (5-byte hash_item + 24-byte hash).
   Without the last patch it is even smaller, at only 24 bytes
   (the 24-byte hash alone).
   And the dedupe bytenr item structure size is reduced from 32 bytes
   (full hash) to 0.

2) Hide the dedupe ioctls behind CONFIG_BTRFS_DEBUG.
   As advised by David, this makes btrfs dedupe an experimental feature
   for advanced users.
   It allows the patchset to be merged while still leaving us room to
   change the ioctl interface in the future.

3) Add back missing bug fix patches.
   Two bug fix patches were missed in the previous iteration; they are
   now added back.

Patches 1~11 provide the fully backward-compatible in-memory backend.
Patches 12~14 provide the per-file dedupe flag feature.
Patches 15~20 provide the on-disk dedupe backend, with persistent dedupe
state for the in-memory backend.
The last patch is preparation for possible dedupe-compress co-work.
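Patch structure aside, the core write-path idea shared by both backends can be sketched in a few lines of userspace Python. This is a toy model, not the kernel implementation: the dict stands in for the hash tree, and the fake allocator and fixed block size are simplifying assumptions.

```python
import hashlib

DEDUPE_BLOCKSIZE = 128 * 1024  # default dedupe block size in this series

class InbandDedupe:
    """Toy model of in-band dedupe: dedupe hash -> existing extent bytenr."""
    def __init__(self):
        self.hash_table = {}   # digest -> bytenr of the existing extent
        self.next_bytenr = 0   # fake extent allocator
        self.dedupe_hits = 0

    def write_block(self, data):
        assert len(data) == DEDUPE_BLOCKSIZE
        digest = hashlib.sha256(data).digest()
        if digest in self.hash_table:
            # Hash hit: point the new file range at the existing extent;
            # no new data extent is allocated.
            self.dedupe_hits += 1
            return self.hash_table[digest]
        # Hash miss: allocate a new extent and remember its hash.
        bytenr = self.next_bytenr
        self.next_bytenr += DEDUPE_BLOCKSIZE
        self.hash_table[digest] = bytenr
        return bytenr

fs = InbandDedupe()
a = fs.write_block(b"x" * DEDUPE_BLOCKSIZE)
b = fs.write_block(b"y" * DEDUPE_BLOCKSIZE)
c = fs.write_block(b"x" * DEDUPE_BLOCKSIZE)  # duplicate of the first block
assert a == c and a != b and fs.dedupe_hits == 1
```

The in-memory backend keeps this table in RAM only (lost on umount); the on-disk backend persists it in a dedicated tree.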


Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug.
  Handle the multiple-hashes-on-one-bytenr corner case to fix an abort
  trans error.
  Increase the dedup rate by enhancing the delayed ref handler for both
  backends.
  Move dedup_add() to run_delayed_ref() time, to fix an abort trans error.
  Increase the dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code.
  Fix another delayed_ref related bug.
  Use the same mutex for both the inmem and ondisk backends.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase the
  dedup rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix a race between dedup enable/disable.
  Fix a false ENOSPC report.
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  from wrong parameters.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into its original patch.
v10:
  Add back the missing bug fix patches.
  Reduce the on-disk item size.
  Hide the dedupe ioctls under CONFIG_BTRFS_DEBUG.

Qu Wenruo (9):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  btrfs: dedupe: Add support for on-disk hash search
  btrfs: dedupe: Add support to delete hash for on-disk backend
  btrfs: dedupe: Add support for adding hash for on-disk backend
  btrfs: dedupe: Preparation for compress-dedupe co-work

Wang Xiaoguang (12):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: try more times to alloc metadata reserve space
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: dedupe: add an inode nodedupe flag
  btrfs: dedupe: add a property handler for online dedupe
  btrfs: dedupe: add per-file online dedupe control

 fs/btrfs/Makefile            |    2 +-
 fs/btrfs/ctree.h             |   80 ++-
 fs/btrfs/dedupe.c            | 1239 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h            |  181 ++++++
 fs/btrfs/delayed-ref.c       |   30 +-
 fs/btrfs/delayed-ref.h       |    8 +
 fs/btrfs/disk-io.c           |   35 +-
 fs/btrfs/disk-io.h           |    1 +
 fs/btrfs/extent-tree.c       |   41 +-
 fs/btrfs/inode.c             |  250 +++++++--
 fs/btrfs/ioctl.c             |   74 ++-
 fs/btrfs/ordered-data.c      |   46 +-
 fs/btrfs/ordered-data.h      |   13 +
 fs/btrfs/props.c             |   41 ++
 fs/btrfs/relocation.c        |   41 +-
 fs/btrfs/sysfs.c             |    2 +
 include/trace/events/btrfs.h |    3 +-
 include/uapi/linux/btrfs.h   |   23 +
 18 files changed, 2052 insertions(+), 58 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c
 create mode 100644 fs/btrfs/dedupe.h

-- 
2.7.4




^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the header for the btrfs online (write-time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
backends and 1 dedupe hash algorithm.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h   |   5 ++
 fs/btrfs/dedupe.h  | 134 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |   1 +
 3 files changed, 140 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..022ab61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1860,6 +1860,11 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	/* Inband de-duplication related structures */
+	unsigned int dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 0000000..40f4808
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include <linux/btrfs.h>
+#include <linux/wait.h>
+#include <crypto/hash.h>
+
+/*
+ * Dedupe storage backends
+ * The on-disk backend provides persistent storage but has a larger overhead
+ * The in-memory backend is fast but loses all its hashes on umount
+ */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY		0
+#define BTRFS_DEDUPE_BACKEND_ONDISK		1
+
+/* Only support inmemory yet, so count is still only 1 */
+#define BTRFS_DEDUPE_BACKEND_COUNT		1
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256		0
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For callers outside of dedupe.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_type;
+
+	struct crypto_shash *dedupe_driver;
+	struct mutex lock;
+
+	/* following members are only used in in-memory dedupe mode */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedupe info.
+ * Called at dedupe enable time.
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss; nothing is done.
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/* Add a dedupe hash into dedupe info */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/* Remove a dedupe hash from dedupe info */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr);
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c95e3ce..3cf4c11 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2584,6 +2584,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
+	mutex_init(&fs_info->dedupe_ioctl_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
-- 
2.7.4
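The search/add contract documented in dedupe.h above — on a hit, btrfs_dedupe_search() *increases* the extent ref and fills hash->bytenr/num_bytes; on a miss it does nothing — can be modelled in userspace. This is a hedged sketch: a dict and a refcount map stand in for the rb-trees and the extent tree, and all names here are illustrative.

```python
class DedupeHash:
    """Userspace stand-in for struct btrfs_dedupe_hash."""
    def __init__(self, digest):
        self.bytenr = 0          # 0 means no duplicate found yet
        self.num_bytes = 0
        self.hash = digest

class DedupeInfo:
    def __init__(self):
        self.by_hash = {}        # digest -> (bytenr, num_bytes)
        self.extent_refs = {}    # bytenr -> extent reference count

    def add(self, digest, bytenr, num_bytes):
        self.by_hash[digest] = (bytenr, num_bytes)
        self.extent_refs.setdefault(bytenr, 1)

    def search(self, hash_obj):
        """Return 1 on a hash hit (extent ref is *increased* and
        bytenr/num_bytes are filled in), 0 on a miss with nothing done."""
        found = self.by_hash.get(hash_obj.hash)
        if found is None:
            return 0
        bytenr, num_bytes = found
        self.extent_refs[bytenr] += 1
        hash_obj.bytenr, hash_obj.num_bytes = bytenr, num_bytes
        return 1

def dedupe_hash_hit(hash_obj):
    # Mirrors btrfs_dedupe_hash_hit(): hash present and bytenr non-zero.
    return hash_obj is not None and hash_obj.bytenr != 0

info = DedupeInfo()
info.add(b"\xaa" * 32, bytenr=4096, num_bytes=128 * 1024)
h = DedupeHash(b"\xaa" * 32)
assert not dedupe_hash_hit(h)
assert info.search(h) == 1
assert dedupe_hash_hit(h) and h.bytenr == 4096
assert info.extent_refs[4096] == 2   # ref bumped by the hit
```

Bumping the ref at search time is what keeps the found extent alive until the writer's delayed ref is processed.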





* [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-04-01  9:59   ` kbuild test robot
  2016-04-01  6:34 ` [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
                   ` (20 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add a generic function to initialize dedupe info.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/Makefile |   2 +-
 fs/btrfs/dedupe.c | 154 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h |  16 +++++-
 3 files changed, 169 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o free-space-tree.o
+	   uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 0000000..2211588
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,154 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+struct inmem_hash {
+	struct rb_node hash_node;
+	struct rb_node bytenr_node;
+	struct list_head lru_list;
+
+	u64 bytenr;
+	u32 num_bytes;
+
+	u8 hash[];
+};
+
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+			    u16 backend, u64 blocksize, u64 limit)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+	if (!dedupe_info)
+		return -ENOMEM;
+
+	dedupe_info->hash_type = type;
+	dedupe_info->backend = backend;
+	dedupe_info->blocksize = blocksize;
+	dedupe_info->limit_nr = limit;
+
+	/* only support SHA256 yet */
+	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(dedupe_info->dedupe_driver)) {
+		int ret;
+
+		ret = PTR_ERR(dedupe_info->dedupe_driver);
+		kfree(dedupe_info);
+		return ret;
+	}
+
+	dedupe_info->hash_root = RB_ROOT;
+	dedupe_info->bytenr_root = RB_ROOT;
+	dedupe_info->current_nr = 0;
+	INIT_LIST_HEAD(&dedupe_info->lru_list);
+	mutex_init(&dedupe_info->lock);
+
+	*ret_info = dedupe_info;
+	return 0;
+}
+
+static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
+				  u16 backend, u64 blocksize, u64 limit_nr,
+				  u64 limit_mem, u64 *ret_limit)
+{
+	if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+	    blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+	    blocksize < fs_info->tree_root->sectorsize ||
+	    !is_power_of_2(blocksize))
+		return -EINVAL;
+	/*
+	 * For an unsupported backend or hash type, we return a special
+	 * error code, as support can easily be extended later.
+	 */
+	if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+		return -EOPNOTSUPP;
+	if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+		return -EOPNOTSUPP;
+
+	/* Backend specific check */
+	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		if (!limit_nr && !limit_mem)
+			*ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+		else {
+			u64 tmp = (u64)-1;
+
+			if (limit_mem) {
+				tmp = limit_mem / (sizeof(struct inmem_hash) +
+					btrfs_dedupe_hash_size(hash_type));
+				/* Too small limit_mem to fill a hash item */
+				if (!tmp)
+					return -EINVAL;
+			}
+			if (!limit_nr)
+				limit_nr = (u64)-1;
+
+			*ret_limit = min(tmp, limit_nr);
+		}
+	}
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		*ret_limit = 0;
+	return 0;
+}
+
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr, u64 limit_mem)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	u64 limit = 0;
+	int ret = 0;
+
+	/* only one limit is accepted for enable */
+	if (limit_nr && limit_mem)
+		return -EINVAL;
+
+	ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
+				     limit_nr, limit_mem, &limit);
+	if (ret < 0)
+		return ret;
+
+	dedupe_info = fs_info->dedupe_info;
+	if (dedupe_info) {
+		/* Check if we are re-enabling with a different dedupe config */
+		if (dedupe_info->blocksize != blocksize ||
+		    dedupe_info->hash_type != type ||
+		    dedupe_info->backend != backend) {
+			btrfs_dedupe_disable(fs_info);
+			goto enable;
+		}
+
+		/* On-fly limit change is OK */
+		mutex_lock(&dedupe_info->lock);
+		fs_info->dedupe_info->limit_nr = limit;
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+enable:
+	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
+	if (ret < 0)
+		return ret;
+	fs_info->dedupe_info = dedupe_info;
+	/* We must ensure dedupe_enabled is modified after dedupe_info */
+	smp_wmb();
+	fs_info->dedupe_enabled = 1;
+	return ret;
+}
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 40f4808..e5d0d34 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -39,6 +39,9 @@
 #define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
 #define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
 
+/* Default dedupe limit on number of hash */
+#define BTRFS_DEDUPE_LIMIT_NR_DEFAULT	(32 * 1024)
+
 /* Hash algorithm, only support SHA256 yet */
 #define BTRFS_DEDUPE_HASH_SHA256		0
 
@@ -81,8 +84,17 @@ static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 	return (hash && hash->bytenr);
 }
 
-int btrfs_dedupe_hash_size(u16 type);
-struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+static inline int btrfs_dedupe_hash_size(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return -EINVAL;
+	return sizeof(struct btrfs_dedupe_hash) + btrfs_dedupe_sizes[type];
+}
+
+static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
+{
+	return kzalloc(btrfs_dedupe_hash_size(type), GFP_NOFS);
+}
 
 /*
  * Initial inband dedupe info
-- 
2.7.4
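check_dedupe_parameter() above boils down to simple validation and arithmetic. The sketch below reproduces it in userspace; note that the 64-byte sizeof(struct inmem_hash) overhead is an assumed nominal value (the real size is architecture-dependent), and the case where both limits are given is simplified relative to the kernel code.

```python
# Constants from dedupe.h in this series
BLOCKSIZE_MIN = 16 * 1024
BLOCKSIZE_MAX = 8 * 1024 * 1024
LIMIT_NR_DEFAULT = 32 * 1024
SHA256_LEN = 32
# Assumed per-entry struct overhead (rb nodes + list head + bytenr/num_bytes).
INMEM_HASH_OVERHEAD = 64

def check_params(blocksize, sectorsize, limit_nr=0, limit_mem=0):
    """Simplified model of check_dedupe_parameter() for the in-memory
    backend: validate the blocksize, then derive the hash-count limit.
    Raises ValueError where the kernel returns -EINVAL."""
    if (blocksize > BLOCKSIZE_MAX or blocksize < BLOCKSIZE_MIN or
            blocksize < sectorsize or blocksize & (blocksize - 1)):
        raise ValueError("invalid dedupe blocksize")
    if limit_nr and limit_mem:
        raise ValueError("only one limit is accepted")
    if not limit_nr and not limit_mem:
        return LIMIT_NR_DEFAULT
    if limit_mem:
        # Memory limit -> number of hash entries that fit.
        nr = limit_mem // (INMEM_HASH_OVERHEAD + SHA256_LEN)
        if nr == 0:
            raise ValueError("limit_mem too small for a single hash entry")
        return nr
    return limit_nr

assert check_params(128 * 1024, 4096) == 32 * 1024            # defaults
assert check_params(16 * 1024, 4096, limit_mem=1 << 20) == (1 << 20) // 96
```

The power-of-two check via `blocksize & (blocksize - 1)` is the same trick `is_power_of_2()` uses in the kernel.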





* [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-06-01 19:37   ` Mark Fasheh
  2016-04-01  6:34 ` [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
                   ` (19 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_add() to add a hash into the
in-memory tree.
With it we can now implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 2211588..4e8455e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
 	u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return NULL;
+	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+			GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -152,3 +160,146 @@ enable:
 	fs_info->dedupe_enabled = 1;
 	return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+			     struct inmem_hash *hash, int hash_len)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->hash_node, parent, p);
+	rb_insert_color(&hash->hash_node, root);
+	return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+			       struct inmem_hash *hash)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+		if (hash->bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->bytenr_node, parent, p);
+	rb_insert_color(&hash->bytenr_node, root);
+	return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+			struct inmem_hash *hash)
+{
+	list_del(&hash->lru_list);
+	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+	if (!WARN_ON(dedupe_info->current_nr == 0))
+		dedupe_info->current_nr--;
+
+	kfree(hash);
+}
+
+/*
+ * Insert a hash into in-memory dedupe tree
+ * Will evict the least recently used hashes if we exceed the limit.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory.
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	int ret = 0;
+	u16 type = dedupe_info->hash_type;
+	struct inmem_hash *ihash;
+
+	ihash = inmem_alloc_hash(type);
+
+	if (!ihash)
+		return -ENOMEM;
+
+	/* Copy the data out */
+	ihash->bytenr = hash->bytenr;
+	ihash->num_bytes = hash->num_bytes;
+	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+	if (ret > 0) {
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+				btrfs_dedupe_sizes[type]);
+	if (ret > 0) {
+		/*
+		 * We only keep one hash in tree to save memory, so if
+		 * hash conflicts, free the one to insert.
+		 */
+		rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	list_add(&ihash->lru_list, &dedupe_info->lru_list);
+	dedupe_info->current_nr++;
+
+	/* Remove the last dedupe hash if we exceed limit */
+	while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+		struct inmem_hash *last;
+
+		last = list_entry(dedupe_info->lru_list.prev,
+				  struct inmem_hash, lru_list);
+		__inmem_del(dedupe_info, last);
+	}
+out:
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(!btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	/* Ignore hashes computed with a different (old) dedupe blocksize */
+	if (dedupe_info->blocksize != hash->num_bytes)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_add(dedupe_info, hash);
+	return -EINVAL;
+}
-- 
2.7.4





* [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from in-memory tree
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (2 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-06-01 19:40   ` Mark Fasheh
  2016-04-01  6:34 ` [PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
                   ` (18 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_del() to remove a hash from the
in-memory dedupe tree.
And implement the btrfs_dedupe_del() and btrfs_dedupe_disable()
interfaces.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4e8455e..a229ded 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		return inmem_add(dedupe_info, hash);
 	return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+		if (bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct inmem_hash *hash;
+
+	mutex_lock(&dedupe_info->lock);
+	hash = inmem_search_bytenr(dedupe_info, bytenr);
+	if (!hash) {
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+	__inmem_del(dedupe_info, hash);
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_del(dedupe_info, bytenr);
+	return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+	struct inmem_hash *entry, *tmp;
+
+	mutex_lock(&dedupe_info->lock);
+	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+		__inmem_del(dedupe_info, entry);
+	mutex_unlock(&dedupe_info->lock);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	int ret;
+
+	/* Here we don't want to increase refs of dedupe_info */
+	fs_info->dedupe_enabled = 0;
+
+	dedupe_info = fs_info->dedupe_info;
+
+	if (!dedupe_info)
+		return 0;
+
+	/* Don't allow disable status change in RO mount */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	/*
+	 * Wait for all unfinished writes to complete their dedupe routine.
+	 * As disable is not a frequent operation, we are OK using the
+	 * heavy but safe sync_filesystem().
+	 */
+	down_read(&fs_info->sb->s_umount);
+	ret = sync_filesystem(fs_info->sb);
+	up_read(&fs_info->sb->s_umount);
+	if (ret < 0)
+		return ret;
+
+	fs_info->dedupe_info = NULL;
+
+	/* now we are OK to clean up everything */
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
-- 
2.7.4
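The reason the in-memory backend keeps a second, bytenr-indexed tree becomes visible in inmem_del(): extents are deleted by bytenr, not by hash, so the bytenr index is what locates the stale entry. A minimal userspace model (plain dicts instead of rb-trees, names illustrative):

```python
def inmem_del(by_hash, by_bytenr, bytenr):
    """Model of inmem_del(): look the hash up via the bytenr index and
    drop the entry from both indexes. Deleting an unknown bytenr is a
    successful no-op, matching the kernel function's 'return 0'."""
    digest = by_bytenr.pop(bytenr, None)
    if digest is None:
        return 0
    del by_hash[digest]
    return 0

by_hash = {b"\x01" * 32: 4096}
by_bytenr = {4096: b"\x01" * 32}
assert inmem_del(by_hash, by_bytenr, 8192) == 0   # unknown bytenr: no-op
assert inmem_del(by_hash, by_bytenr, 4096) == 0
assert not by_hash and not by_bytenr               # both indexes cleaned
```

Without the bytenr index, freeing an extent would require a full scan of the hash tree to find and invalidate its entry.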





* [PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (3 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs

For in-band dedupe, btrfs needs to increase a data ref while the
delayed_refs lock is held, so add a new function,
btrfs_add_delayed_data_ref_locked(), to increase an extent ref with
delayed_refs already locked.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/delayed-ref.c | 30 +++++++++++++++++++++++-------
 fs/btrfs/delayed-ref.h |  8 ++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do the real delayed data ref insertion.
+ * The caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action)
+{
+	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+			qrecord, bytenr, num_bytes, ref_root, reserved,
+			action, 1);
+	add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+			num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-					bytenr, num_bytes, ref_root, reserved,
-					action, 1);
-
-	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-				   num_bytes, parent, ref_root, owner, offset,
-				   action);
+	btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+			bytenr, num_bytes, parent, ref_root, owner, offset,
+			reserved, action);
 	spin_unlock(&delayed_refs->lock);
 
 	return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index c24b653..2765858 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 	}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes, u64 parent,
 			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes,
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (4 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-04-01  6:34 ` [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_search() to handle hash lookup in the
in-memory hash tree.

The trick is that we must ensure the delayed ref head is not being run at
the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.
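For readers unfamiliar with the in-memory backend, the lookup can be
pictured with a minimal userspace sketch. This is not the kernel code: the
toy_* names, the hand-rolled BST standing in for the kernel rb-tree, and
the 24-byte hash length are illustrative assumptions. The point is only
that entries are ordered by memcmp() over the hash bytes, and a hit is
moved to the LRU head so cold hashes age out first:

```c
#include <string.h>
#include <assert.h>

#define HASH_LEN 24	/* truncated hash length, an illustrative choice */

/* Toy stand-in for struct inmem_hash: a BST node plus an LRU link. */
struct toy_hash {
	struct toy_hash *left, *right;
	struct toy_hash *lru_prev, *lru_next;
	unsigned char hash[HASH_LEN];
	unsigned long long bytenr;
};

struct toy_dedupe {
	struct toy_hash *root;
	struct toy_hash *lru_head;	/* most recently used entry */
};

static void lru_unlink(struct toy_dedupe *d, struct toy_hash *e)
{
	if (e->lru_prev)
		e->lru_prev->lru_next = e->lru_next;
	if (e->lru_next)
		e->lru_next->lru_prev = e->lru_prev;
	if (d->lru_head == e)
		d->lru_head = e->lru_next;
	e->lru_prev = e->lru_next = NULL;
}

static void lru_push_front(struct toy_dedupe *d, struct toy_hash *e)
{
	e->lru_prev = NULL;
	e->lru_next = d->lru_head;
	if (d->lru_head)
		d->lru_head->lru_prev = e;
	d->lru_head = e;
}

/* Like inmem_search_hash(): walk the tree in memcmp() order; on a hit,
 * move the entry to the LRU head so cold hashes are evicted first. */
static struct toy_hash *toy_search(struct toy_dedupe *d,
				   const unsigned char *hash)
{
	struct toy_hash *cur = d->root;

	while (cur) {
		int cmp = memcmp(hash, cur->hash, HASH_LEN);

		if (cmp < 0) {
			cur = cur->left;
		} else if (cmp > 0) {
			cur = cur->right;
		} else {
			lru_unlink(d, cur);
			lru_push_front(d, cur);
			return cur;
		}
	}
	return NULL;
}

static void toy_insert(struct toy_dedupe *d, struct toy_hash *e)
{
	struct toy_hash **p = &d->root;

	while (*p)
		p = memcmp(e->hash, (*p)->hash, HASH_LEN) < 0 ?
			&(*p)->left : &(*p)->right;
	*p = e;
	lru_push_front(d, e);
}
```

The kernel does the same with struct rb_node and list_head helpers; only
the memcmp() ordering and the LRU refresh on hit carry over.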

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 185 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a229ded..9175a5f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -408,3 +409,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	kfree(dedupe_info);
 	return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+	struct rb_node **p = &dedupe_info->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+	u16 hash_type = dedupe_info->hash_type;
+	int hash_len = btrfs_dedupe_sizes[hash_type];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+		if (memcmp(hash, entry->hash, hash_len) < 0) {
+			p = &(*p)->rb_left;
+		} else if (memcmp(hash, entry->hash, hash_len) > 0) {
+			p = &(*p)->rb_right;
+		} else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &dedupe_info->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
+	struct btrfs_delayed_ref_head *insert_head;
+	struct btrfs_delayed_data_ref *insert_dref;
+	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+	struct inmem_hash *found_hash;
+	int free_insert = 1;
+	u64 bytenr;
+	u32 num_bytes;
+
+	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+	if (!insert_head)
+		return -ENOMEM;
+	insert_head->extent_op = NULL;
+	insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+	if (!insert_dref) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		return -ENOMEM;
+	}
+	if (root->fs_info->quota_enabled &&
+	    is_fstree(root->root_key.objectid)) {
+		insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+		if (!insert_qrecord) {
+			kmem_cache_free(btrfs_delayed_ref_head_cachep,
+					insert_head);
+			kmem_cache_free(btrfs_delayed_data_ref_cachep,
+					insert_dref);
+			return -ENOMEM;
+		}
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto free_mem;
+	}
+
+again:
+	mutex_lock(&dedupe_info->lock);
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	/* If we don't find a duplicated extent, just return. */
+	if (!found_hash) {
+		ret = 0;
+		goto out;
+	}
+	bytenr = found_hash->bytenr;
+	num_bytes = found_hash->num_bytes;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	spin_lock(&delayed_refs->lock);
+	head = btrfs_find_delayed_ref_head(trans, bytenr);
+	if (!head) {
+		/*
+		 * We can safely insert a new delayed_ref as long as we
+		 * hold delayed_refs->lock.
+		 * Only need to use atomic inc_extent_ref()
+		 */
+		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+				insert_dref, insert_head, insert_qrecord,
+				bytenr, num_bytes, 0, root->root_key.objectid,
+				btrfs_ino(inode), file_pos, 0,
+				BTRFS_ADD_DELAYED_REF);
+		spin_unlock(&delayed_refs->lock);
+
+		/* btrfs_add_delayed_data_ref_locked() will free any unused memory */
+		free_insert = 0;
+		hash->bytenr = bytenr;
+		hash->num_bytes = num_bytes;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * We can't lock the ref head while holding dedupe_info->lock, or we
+	 * would cause an ABBA deadlock.
+	 */
+	mutex_unlock(&dedupe_info->lock);
+	ret = btrfs_delayed_ref_lock(trans, head);
+	spin_unlock(&delayed_refs->lock);
+	if (ret == -EAGAIN)
+		goto again;
+
+	mutex_lock(&dedupe_info->lock);
+	/* Search again to ensure the hash is still here */
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	if (!found_hash) {
+		ret = 0;
+		mutex_unlock(&head->mutex);
+		goto out;
+	}
+	ret = 1;
+	hash->bytenr = bytenr;
+	hash->num_bytes = num_bytes;
+
+	/*
+	 * Increase the extent ref right now, to avoid the delayed ref being
+	 * run, or we may increase the ref on a non-existent extent.
+	 */
+	btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
+			     root->root_key.objectid,
+			     btrfs_ino(inode), file_pos);
+	mutex_unlock(&head->mutex);
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_end_transaction(trans, root);
+
+free_mem:
+	if (free_insert) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		kmem_cache_free(btrfs_delayed_data_ref_cachep, insert_dref);
+		kfree(insert_qrecord);
+	}
+	return ret;
+}
+
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	int ret = -EINVAL;
+
+	if (!hash)
+		return 0;
+
+	/*
+	 * This function doesn't follow fs_info->dedupe_enabled, as it needs
+	 * to ensure any hashed extent goes through the dedupe routine.
+	 */
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		ret = inmem_search(dedupe_info, inode, file_pos, hash);
+
+	/* It's possible hash->bytenr/num_bytes have already changed */
+	if (ret == 0) {
+		hash->num_bytes = 0;
+		hash->bytenr = 0;
+	}
+	return ret;
+}
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (5 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-05-17 13:15   ` David Sterba
  2016-04-01  6:34 ` [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Regardless of whether the in-memory or on-disk dedupe backend is used,
only the SHA256 hash algorithm is supported so far, so implement the
btrfs_dedupe_calc_hash() interface using SHA256.
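The page-by-page hashing loop can be sketched in userspace as follows. A
trivial FNV-1a fold stands in for the crypto_shash SHA256 context (an
illustrative assumption, since a real SHA256 implementation would obscure
the point); the init/update/final pattern over sector-sized chunks is what
mirrors the kernel code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <assert.h>

#define SECTORSIZE	4096
#define DEDUPE_BS	(4 * SECTORSIZE)	/* illustrative block size */

/* Stand-in for the crypto_shash SHA256 context: a running FNV-1a state.
 * Only the init/update/final usage pattern is the point here. */
struct toy_shash {
	uint64_t state;
};

static void toy_shash_init(struct toy_shash *s)
{
	s->state = 14695981039346656037ULL;	/* FNV-1a offset basis */
}

static void toy_shash_update(struct toy_shash *s, const uint8_t *d, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		s->state ^= d[i];
		s->state *= 1099511628211ULL;	/* FNV-1a prime */
	}
}

static uint64_t toy_shash_final(struct toy_shash *s)
{
	return s->state;
}

/* Like btrfs_dedupe_calc_hash(): feed the dedupe block to the hash one
 * sector at a time (the kernel maps and feeds one page at a time). */
static uint64_t calc_hash(const uint8_t *block)
{
	struct toy_shash s;
	int i;

	toy_shash_init(&s);
	for (i = 0; SECTORSIZE * i < DEDUPE_BS; i++)
		toy_shash_update(&s, block + (size_t)SECTORSIZE * i,
				 SECTORSIZE);
	return toy_shash_final(&s);
}
```

Since the hash state is updated incrementally, hashing sector by sector
produces the same digest as hashing the whole block at once, which is why
the kernel can kmap() and feed one page at a time.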

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9175a5f..bdaea3a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -593,3 +593,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	struct {
+		struct shash_desc desc;
+		char ctx[crypto_shash_descsize(tfm)];
+	} sdesc;
+	u64 dedupe_bs;
+	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	sdesc.desc.tfm = tfm;
+	sdesc.desc.flags = 0;
+	ret = crypto_shash_init(&sdesc.desc);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_CACHE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(&sdesc.desc, d, sectorsize);
+		kunmap(p);
+		page_cache_release(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(&sdesc.desc, hash->hash);
+	return ret;
+}
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (6 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
@ 2016-04-01  6:34 ` Qu Wenruo
  2016-06-01 22:06   ` Mark Fasheh
  2016-04-01  6:35 ` [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only covers non-compressed
source extents.
Support for compressed source extents will be added later.
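The hash pointer attached to an ordered extent encodes three states, per
the comment added to ordered-data.h. A small sketch makes the tristate
explicit (toy_hash and classify() are hypothetical names for illustration,
not kernel functions):

```c
#include <stddef.h>
#include <assert.h>

/* The three states btrfs_ordered_extent->hash can encode. */
enum dedupe_status {
	DEDUPE_NONE,	/* hash == NULL: no deduplication for this extent */
	DEDUPE_MISS,	/* hash->bytenr == 0: hash will be added to the tree */
	DEDUPE_HIT,	/* hash->bytenr != 0: extent ref already increased */
};

struct toy_hash {
	unsigned long long bytenr;
};

static enum dedupe_status classify(const struct toy_hash *hash)
{
	if (!hash)
		return DEDUPE_NONE;
	return hash->bytenr ? DEDUPE_HIT : DEDUPE_MISS;
}
```

The hit case is the dangerous one: the extent ref was already increased at
search time, so the ordered extent must finish the dedupe path even if
dedupe gets disabled in the meantime.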

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ordered-data.c | 44 ++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/ordered-data.h | 13 +++++++++++++
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0de7da5..ef24ad1 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int dio, int compress_type)
+				      int type, int dio, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->hash = NULL;
+	/*
+	 * A hash hit must go through the dedupe routine at all costs, even if
+	 * dedupe is disabled, as its delayed ref has already been increased.
+	 */
+	if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+		struct btrfs_dedupe_info *dedupe_info;
+
+		dedupe_info = root->fs_info->dedupe_info;
+		if (WARN_ON(dedupe_info == NULL)) {
+			kmem_cache_free(btrfs_ordered_extent_cache,
+					entry);
+			return -EINVAL;
+		}
+		entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+		if (!entry->hash) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+		entry->hash->bytenr = hash->bytenr;
+		entry->hash->num_bytes = hash->num_bytes;
+		memcpy(entry->hash->hash, hash->hash,
+		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	}
+
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
@@ -250,15 +277,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash)
+{
+	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+					  disk_len, type, 0,
+					  BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 1,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +302,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type);
+					  compress_type, NULL);
 }
 
 /*
@@ -577,6 +612,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..8a54476 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/*
+	 * For inband deduplication.
+	 * If hash is NULL, no deduplication is done.
+	 * If hash->bytenr is zero, this is a dedupe miss; the hash will
+	 * be added into the dedupe tree.
+	 * If hash->bytenr is non-zero, this is a dedupe hit; the extent ref
+	 * is *ALREADY* increased.
+	 */
+	struct btrfs_dedupe_hash *hash;
 };
 
 /*
@@ -172,6 +182,9 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
 				   int uptodate);
 int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 			     u64 start, u64 len, u64 disk_len, int type);
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash);
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type);
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (7 preceding siblings ...)
  2016-04-01  6:34 ` [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-06-01 22:08   ` Mark Fasheh
  2016-06-03 14:43   ` Josef Bacik
  2016-04-01  6:35 ` [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space Qu Wenruo
                   ` (13 subsequent siblings)
  22 siblings, 2 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do inband de-duplication at the extent level.

The work flow is as below:
1) Run the delalloc range for an inode
2) Calculate the hash for the delalloc range in units of dedupe_bs
3) For the hash match (duplicate) case, just increase the source extent
   ref and insert the file extent.
   For the hash miss case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the memory limit.
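The dedupe_bs splitting in step 2 can be sketched as follows. split_range()
and the 128KiB block size are illustrative assumptions; the real
hash_file_ranges() also handles page redirtying and queues async extents,
but the chunking rule is the same: only full dedupe_bs chunks are hashed,
and a partial tail falls back to plain CoW:

```c
#include <stdint.h>
#include <assert.h>

#define DEDUPE_BS	(128 * 1024)	/* illustrative dedupe block size */

/* Like the loop in hash_file_ranges(): walk the delalloc range
 * [start, end] (end inclusive, as in the kernel) in dedupe_bs units.
 * Only full-sized chunks are hashed and searched; a partial tail skips
 * dedupe and goes through the normal CoW path.
 * Returns the number of chunks hashed. */
static int split_range(uint64_t start, uint64_t end, uint64_t *tail_len)
{
	int hashed = 0;
	uint64_t cur = start;

	while (cur < end) {
		uint64_t len = end + 1 - cur;

		if (len > DEDUPE_BS)
			len = DEDUPE_BS;
		if (len == DEDUPE_BS)
			hashed++;		/* calc hash, search, maybe hit */
		else
			*tail_len = len;	/* plain cow_file_range() */
		cur += len;
	}
	return hashed;
}
```

This is also why larger dedupe_bs values trade dedupe granularity for less
hashing work: a write smaller than one dedupe_bs never enters the dedupe
path at all.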

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c |  18 ++++
 fs/btrfs/inode.c       | 235 ++++++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/relocation.c  |  16 ++++
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
 	if (btrfs_delayed_ref_is_head(node)) {
 		struct btrfs_delayed_ref_head *head;
+		struct btrfs_fs_info *fs_info = root->fs_info;
+
 		/*
 		 * we've hit the end of the chain and we were supposed
 		 * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 			btrfs_pin_extent(root, node->bytenr,
 					 node->num_bytes, 1);
 			if (head->is_data) {
+				/*
+				 * If insert_reserved is given, it means a new
+				 * extent was reserved, then deleted in one
+				 * transaction, and the inc/dec got merged to 0.
+				 *
+				 * In this case, we need to remove its dedupe
+				 * hash.
+				 */
+				btrfs_dedupe_del(trans, fs_info, node->bytenr);
 				ret = btrfs_del_csums(trans, root,
 						      node->bytenr,
 						      node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 
 		if (is_data) {
+			ret = btrfs_dedupe_del(trans, info, bytenr);
+			if (ret < 0) {
+				btrfs_abort_transaction(trans, extent_root,
+							ret);
+				goto out;
+			}
 			ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
 			if (ret) {
 				btrfs_abort_transaction(trans, extent_root, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..96790d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 					   u64 len, u64 orig_start,
 					   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
 	struct page **pages;
 	unsigned long nr_pages;
 	int compress_type;
+	struct btrfs_dedupe_hash *hash;
 	struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 				     u64 compressed_size,
 				     struct page **pages,
 				     unsigned long nr_pages,
-				     int compress_type)
+				     int compress_type,
+				     struct btrfs_dedupe_hash *hash)
 {
 	struct async_extent *async_extent;
 
@@ -365,6 +369,7 @@ static noinline int add_async_extent(struct async_cow *cow,
 	async_extent->pages = pages;
 	async_extent->nr_pages = nr_pages;
 	async_extent->compress_type = compress_type;
+	async_extent->hash = hash;
 	list_add_tail(&async_extent->list, &cow->extents);
 	return 0;
 }
@@ -616,7 +621,7 @@ cont:
 		 */
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret,
-				 compress_type);
+				 compress_type, NULL);
 
 		if (start + num_bytes < end) {
 			start += num_bytes;
@@ -641,7 +646,7 @@ cleanup_and_bail_uncompressed:
 		if (redirty)
 			extent_range_redirty_for_io(inode, start, end);
 		add_async_extent(async_cow, start, end - start + 1,
-				 0, NULL, 0, BTRFS_COMPRESS_NONE);
+				 0, NULL, 0, BTRFS_COMPRESS_NONE, NULL);
 		*num_added += 1;
 	}
 
@@ -671,6 +676,38 @@ static void free_async_extent_pages(struct async_extent *async_extent)
 	async_extent->pages = NULL;
 }
 
+static void end_dedupe_extent(struct inode *inode, u64 start,
+			      u32 len, unsigned long page_ops)
+{
+	int i;
+	unsigned nr_pages = len / PAGE_CACHE_SIZE;
+	struct page *page;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = find_get_page(inode->i_mapping,
+				     start >> PAGE_CACHE_SHIFT);
+		/* page should be already locked by caller */
+		if (WARN_ON(!page))
+			continue;
+
+		/* We need to do this by ourselves as we skipped IO */
+		if (page_ops & PAGE_CLEAR_DIRTY)
+			clear_page_dirty_for_io(page);
+		if (page_ops & PAGE_SET_WRITEBACK)
+			set_page_writeback(page);
+
+		end_extent_writepage(page, 0, start,
+				     start + PAGE_CACHE_SIZE - 1);
+		if (page_ops & PAGE_END_WRITEBACK)
+			end_page_writeback(page);
+		if (page_ops & PAGE_UNLOCK)
+			unlock_page(page);
+
+		start += PAGE_CACHE_SIZE;
+		page_cache_release(page);
+	}
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -687,6 +724,7 @@ static noinline void submit_compressed_extents(struct inode *inode,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	struct extent_io_tree *io_tree;
+	struct btrfs_dedupe_hash *hash;
 	int ret = 0;
 
 again:
@@ -696,6 +734,7 @@ again:
 		list_del(&async_extent->list);
 
 		io_tree = &BTRFS_I(inode)->io_tree;
+		hash = async_extent->hash;
 
 retry:
 		/* did the compression code fall back to uncompressed IO? */
@@ -712,7 +751,8 @@ retry:
 					     async_extent->start,
 					     async_extent->start +
 					     async_extent->ram_size - 1,
-					     &page_started, &nr_written, 0);
+					     &page_started, &nr_written, 0,
+					     hash);
 
 			/* JDM XXX */
 
@@ -722,15 +762,26 @@ retry:
 			 * and IO for us.  Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
-				extent_write_locked_range(io_tree,
-						  inode, async_extent->start,
-						  async_extent->start +
-						  async_extent->ram_size - 1,
-						  btrfs_get_extent,
-						  WB_SYNC_ALL);
-			else if (ret)
+			if (!page_started && !ret) {
+				/* Skip IO for dedupe async_extent */
+				if (btrfs_dedupe_hash_hit(hash))
+					end_dedupe_extent(inode,
+						async_extent->start,
+						async_extent->ram_size,
+						PAGE_CLEAR_DIRTY |
+						PAGE_SET_WRITEBACK |
+						PAGE_END_WRITEBACK |
+						PAGE_UNLOCK);
+				else
+					extent_write_locked_range(io_tree,
+						inode, async_extent->start,
+						async_extent->start +
+						async_extent->ram_size - 1,
+						btrfs_get_extent,
+						WB_SYNC_ALL);
+			} else if (ret)
 				unlock_page(async_cow->locked_page);
+			kfree(hash);
 			kfree(async_extent);
 			cond_resched();
 			continue;
@@ -856,6 +907,7 @@ retry:
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		kfree(hash);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -872,6 +924,7 @@ out_free:
 				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK |
 				     PAGE_SET_ERROR);
 	free_async_extent_pages(async_extent);
+	kfree(hash);
 	kfree(async_extent);
 	goto again;
 }
@@ -925,7 +978,7 @@ static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
 				   unsigned long *nr_written,
-				   int unlock)
+				   int unlock, struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 alloc_hint = 0;
@@ -984,11 +1037,16 @@ static noinline int cow_file_range(struct inode *inode,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
-		ret = btrfs_reserve_extent(root, cur_alloc_size,
+		if (btrfs_dedupe_hash_hit(hash)) {
+			ins.objectid = hash->bytenr;
+			ins.offset = hash->num_bytes;
+		} else {
+			ret = btrfs_reserve_extent(root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
 					   &ins, 1, 1);
-		if (ret < 0)
-			goto out_unlock;
+			if (ret < 0)
+				goto out_unlock;
+		}
 
 		em = alloc_extent_map();
 		if (!em) {
@@ -1025,8 +1083,9 @@ static noinline int cow_file_range(struct inode *inode,
 			goto out_reserve;
 
 		cur_alloc_size = ins.offset;
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size, 0);
+		ret = btrfs_add_ordered_extent_dedupe(inode, start,
+				ins.objectid, cur_alloc_size, ins.offset,
+				0, hash);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1076,6 +1135,68 @@ out_unlock:
 	goto out;
 }
 
+static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
+			    struct async_cow *async_cow, int *num_added)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct page *locked_page = async_cow->locked_page;
+	u16 hash_algo;
+	u64 actual_end;
+	u64 isize = i_size_read(inode);
+	u64 dedupe_bs;
+	u64 cur_offset = start;
+	int ret = 0;
+
+	actual_end = min_t(u64, isize, end + 1);
+	/* If dedupe is not enabled, don't split extent into dedupe_bs */
+	if (fs_info->dedupe_enabled && dedupe_info) {
+		dedupe_bs = dedupe_info->blocksize;
+		hash_algo = dedupe_info->hash_type;
+	} else {
+		dedupe_bs = SZ_128M;
+		/* Just a dummy, to avoid accessing a NULL pointer */
+		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+	}
+
+	while (cur_offset < end) {
+		struct btrfs_dedupe_hash *hash = NULL;
+		u64 len;
+
+		len = min(end + 1 - cur_offset, dedupe_bs);
+		if (len < dedupe_bs)
+			goto next;
+
+		hash = btrfs_dedupe_alloc_hash(hash_algo);
+		if (!hash) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
+		if (ret < 0)
+			goto out;
+
+		ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
+		if (ret < 0)
+			goto out;
+		ret = 0;
+
+next:
+		/* Redirty the locked page if it corresponds to our extent */
+		if (page_offset(locked_page) >= start &&
+		    page_offset(locked_page) <= end)
+			__set_page_dirty_nobuffers(locked_page);
+
+		add_async_extent(async_cow, cur_offset, len, 0, NULL, 0,
+				 BTRFS_COMPRESS_NONE, hash);
+		cur_offset += len;
+		(*num_added)++;
+	}
+out:
+	return ret;
+}
+
 /*
  * work queue call back to started compression on a file and pages
  */
@@ -1083,11 +1204,18 @@ static noinline void async_cow_start(struct btrfs_work *work)
 {
 	struct async_cow *async_cow;
 	int num_added = 0;
+	int ret = 0;
 	async_cow = container_of(work, struct async_cow, work);
 
-	compress_file_range(async_cow->inode, async_cow->locked_page,
-			    async_cow->start, async_cow->end, async_cow,
-			    &num_added);
+	if (inode_need_compress(async_cow->inode))
+		compress_file_range(async_cow->inode, async_cow->locked_page,
+				    async_cow->start, async_cow->end, async_cow,
+				    &num_added);
+	else
+		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+				       async_cow->end, async_cow, &num_added);
+	WARN_ON(ret);
+
 	if (num_added == 0) {
 		btrfs_add_delayed_iput(async_cow->inode);
 		async_cow->inode = NULL;
@@ -1136,6 +1264,8 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 {
 	struct async_cow *async_cow;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
 	unsigned long nr_pages;
 	u64 cur_end;
 	int limit = 10 * SZ_1M;
@@ -1150,7 +1280,11 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_cow->locked_page = locked_page;
 		async_cow->start = start;
 
-		if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
+		if (fs_info->dedupe_enabled && dedupe_info) {
+			u64 len = max_t(u64, SZ_512K, dedupe_info->blocksize);
+
+			cur_end = min(end, start + len - 1);
+		} else if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
 		    !btrfs_test_opt(root, FORCE_COMPRESS))
 			cur_end = end;
 		else
@@ -1407,7 +1541,7 @@ out_check:
 		if (cow_start != (u64)-1) {
 			ret = cow_file_range(inode, locked_page,
 					     cow_start, found_key.offset - 1,
-					     page_started, nr_written, 1);
+					     page_started, nr_written, 1, NULL);
 			if (ret) {
 				if (!nolock && nocow)
 					btrfs_end_write_no_snapshoting(root);
@@ -1486,7 +1620,7 @@ out_check:
 
 	if (cow_start != (u64)-1) {
 		ret = cow_file_range(inode, locked_page, cow_start, end,
-				     page_started, nr_written, 1);
+				     page_started, nr_written, 1, NULL);
 		if (ret)
 			goto error;
 	}
@@ -1537,6 +1671,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
@@ -1544,9 +1680,9 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode)) {
+	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
 		ret = cow_file_range(inode, locked_page, start, end,
-				      page_started, nr_written, 1);
+				      page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
@@ -2076,7 +2212,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       u64 disk_bytenr, u64 disk_num_bytes,
 				       u64 num_bytes, u64 ram_bytes,
 				       u8 compression, u8 encryption,
-				       u16 other_encoding, int extent_type)
+				       u16 other_encoding, int extent_type,
+				       struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_file_extent_item *fi;
@@ -2138,10 +2275,37 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
-	ret = btrfs_alloc_reserved_file_extent(trans, root,
+
+	/*
+	 * Only for the no-dedupe or hash miss case do we need to increase
+	 * the extent reference.
+	 * For the hash hit case, the reference is already increased.
+	 */
+	if (!hash || hash->bytenr == 0)
+		ret = btrfs_alloc_reserved_file_extent(trans, root,
 					root->root_key.objectid,
 					btrfs_ino(inode), file_pos,
 					ram_bytes, &ins);
+	if (ret < 0)
+		goto out_qgroup;
+
+	/*
+	 * Hash hit won't create a new data extent, so its reserved quota
+	 * space won't be freed by new delayed_ref_head.
+	 * Need to free it here.
+	 */
+	if (btrfs_dedupe_hash_hit(hash))
+		btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
+
+	/* Add missed hash into dedupe tree */
+	if (hash && hash->bytenr == 0) {
+		hash->bytenr = ins.objectid;
+		hash->num_bytes = ins.offset;
+		ret = btrfs_dedupe_add(trans, root->fs_info, hash);
+	}
+
+out_qgroup:
+
 	/*
 	 * Release the reserved range from inode dirty range map, as it is
 	 * already moved into delayed_ref_head
@@ -2918,6 +3082,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->file_offset +
 						logical_len);
 	} else {
+		/* Must be checked before the hash is modified */
+		int hash_hit = btrfs_dedupe_hash_hit(ordered_extent->hash);
+
 		BUG_ON(root == root->fs_info->tree_root);
 		ret = insert_reserved_file_extent(trans, inode,
 						ordered_extent->file_offset,
@@ -2925,8 +3092,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->disk_len,
 						logical_len, logical_len,
 						compress_type, 0, 0,
-						BTRFS_FILE_EXTENT_REG);
-		if (!ret)
+						BTRFS_FILE_EXTENT_REG,
+						ordered_extent->hash);
+		/* Hash hit case doesn't reserve delalloc bytes */
+		if (!ret && !hash_hit)
 			btrfs_release_delalloc_bytes(root,
 						     ordered_extent->start,
 						     ordered_extent->disk_len);
@@ -2985,7 +3154,6 @@ out:
 						   ordered_extent->disk_len, 1);
 	}
 
-
 	/*
 	 * This needs to be done to make sure anybody waiting knows we are done
 	 * updating everything for this ordered extent.
@@ -9948,7 +10116,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 						  cur_offset, ins.objectid,
 						  ins.offset, ins.offset,
 						  ins.offset, 0, 0, 0,
-						  BTRFS_FILE_EXTENT_PREALLOC);
+						  BTRFS_FILE_EXTENT_PREALLOC,
+						  NULL);
 		if (ret) {
 			btrfs_free_reserved_extent(root, ins.objectid,
 						   ins.offset, 0);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 2bd0011..33183ce 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "dedupe.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -3909,6 +3910,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_path *path;
 	struct btrfs_extent_item *ei;
+	struct btrfs_fs_info *fs_info = rc->extent_root->fs_info;
 	u64 flags;
 	u32 item_size;
 	int ret;
@@ -4031,6 +4033,20 @@ restart:
 				rc->search_start = key.objectid;
 			}
 		}
+		/*
+		 * This data extent will be replaced, but the normal
+		 * dedupe_del() only happens at run_delayed_ref() time, which
+		 * is too late. So delete the dedupe hash early to prevent
+		 * its ref count from being increased during relocation.
+		 */
+		if (rc->stage == MOVE_DATA_EXTENTS &&
+		    (flags & BTRFS_EXTENT_FLAG_DATA)) {
+			ret = btrfs_dedupe_del(trans, fs_info, key.objectid);
+			if (ret < 0) {
+				err = ret;
+				break;
+			}
+		}
 
 		btrfs_end_transaction_throttle(trans, rc->extent_root);
 		btrfs_btree_balance_dirty(rc->extent_root);
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (8 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-05-17 13:20   ` David Sterba
  2016-04-01  6:35 ` [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication Qu Wenruo
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated by the difference between outstanding_extents and
reserved_extents.

When reserve_metadata_bytes() fails to reserve the desired metadata
space, it has already done some reclaim work, such as writing out
ordered extents.

In that case, outstanding_extents and reserved_extents may have already
changed, and a retry may then succeed in reserving enough metadata
space.

So this patch calls reserve_metadata_bytes() at most 3 times to ensure
we have really run out of space.

Such false ENOSPC is mainly caused by small file extents and
time-consuming delalloc functions, so it mainly affects in-band
de-duplication. (Compression should also be affected, but LZO/zlib is
faster than SHA256, so it is harder to trigger there than with dedupe.)

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dabd721..016d2ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 				 * a new extent is reserved, then deleted
 				 * in one transaction, and inc/dec get merged to 0.
 				 *
-				 * In this case, we need to remove its dedup
+				 * In this case, we need to remove its dedupe
 				 * hash.
 				 */
 				btrfs_dedupe_del(trans, fs_info, node->bytenr);
@@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
+	int loops = 0;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	    btrfs_transaction_in_commit(root->fs_info))
 		schedule_timeout(1);
 
+	num_bytes = ALIGN(num_bytes, root->sectorsize);
+
+again:
 	if (delalloc_lock)
 		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
-
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
 					 BTRFS_MAX_EXTENT_SIZE - 1,
@@ -5815,6 +5817,23 @@ out_fail:
 	}
 	if (delalloc_lock)
 		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
+	/*
+	 * The number of metadata bytes is calculated by the difference
+	 * between outstanding_extents and reserved_extents. Even when
+	 * reserve_metadata_bytes() fails to reserve the wanted metadata
+	 * bytes, it has already done some work to reclaim metadata space,
+	 * hence both outstanding_extents and reserved_extents may have
+	 * changed, and the bytes we try to reserve may also have changed
+	 * (maybe smaller). So here we try to reserve again. This is very
+	 * useful for online dedupe, which can easily eat almost all
+	 * metadata space.
+	 *
+	 * XXX: The value 3 is chosen arbitrarily. It works around the
+	 * online dedupe case, but we should find a better method later.
+	 */
+	if (unlikely(ret == -ENOSPC && loops++ < 3))
+		goto again;
+
 	return ret;
 }
 
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (9 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-27  1:29   ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add an ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag to indicate that btrfs now supports
inband dedupe. No on-disk format change is introduced; it is only a
pseudo RO compat flag.

All these ioctl interfaces are stateless, which means callers don't
need to care about the previous dedupe state before calling them; they
only need to specify the final desired state.

For example, if a user wants to enable dedupe with a specified block
size and limit, they just fill the ioctl structure and call the enable
ioctl. There is no need to check whether dedupe is already running.

These ioctls handle things like re-configuration and disabling quite well.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h           |  1 +
 fs/btrfs/dedupe.c          | 48 ++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h          | 15 ++++++++++
 fs/btrfs/disk-io.c         |  3 ++
 fs/btrfs/ioctl.c           | 68 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/sysfs.c           |  2 ++
 include/uapi/linux/btrfs.h | 23 ++++++++++++++++
 7 files changed, 160 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 022ab61..85044bf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -508,6 +508,7 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE		(1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index bdaea3a..cfb7fea 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
 			GFP_NOFS);
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !dedupe_info) {
+		dargs->status = 0;
+		dargs->blocksize = 0;
+		dargs->backend = 0;
+		dargs->hash_type = 0;
+		dargs->limit_nr = 0;
+		dargs->current_nr = 0;
+		return;
+	}
+	mutex_lock(&dedupe_info->lock);
+	dargs->status = 1;
+	dargs->blocksize = dedupe_info->blocksize;
+	dargs->backend = dedupe_info->backend;
+	dargs->hash_type = dedupe_info->hash_type;
+	dargs->limit_nr = dedupe_info->limit_nr;
+	dargs->limit_mem = dedupe_info->limit_nr *
+		(sizeof(struct inmem_hash) +
+		 btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	dargs->current_nr = dedupe_info->current_nr;
+	mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -371,6 +398,27 @@ static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
 	mutex_unlock(&dedupe_info->lock);
 }
 
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	fs_info->dedupe_enabled = 0;
+	/* same as disable */
+	smp_wmb();
+	dedupe_info = fs_info->dedupe_info;
+	fs_info->dedupe_info = NULL;
+
+	if (!dedupe_info)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index e5d0d34..f5d2b45 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -103,6 +103,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 			u64 blocksize, u64 limit_nr, u64 limit_mem);
 
+
+/*
+ * Get inband dedupe info.
+ * Since it needs to access different backends' hash sizes, which
+ * are not exported, we need such a simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -110,6 +119,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
 /*
+ * Cleanup the current btrfs_dedupe_info.
+ * Called at umount time.
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedup.
  * Caller must ensure [start, start + dedupe_bs) has valid data.
  */
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3cf4c11..ed6a6fd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -51,6 +51,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -3884,6 +3885,8 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_qgroup_config(fs_info);
 
+	btrfs_dedupe_cleanup(fs_info);
+
 	if (percpu_counter_sum(&fs_info->delalloc_bytes)) {
 		btrfs_info(fs_info, "at unmount delalloc count %lld",
 		       percpu_counter_sum(&fs_info->delalloc_bytes));
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 053e677..f659ed5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -61,6 +61,7 @@
 #include "qgroup.h"
 #include "tree-log.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_64BIT
 /* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
@@ -3206,6 +3207,69 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
 	return olen;
 }
 
+static long btrfs_ioctl_dedupe_ctl(struct btrfs_root *root, void __user *args)
+{
+	struct btrfs_ioctl_dedupe_args *dargs;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	dargs = memdup_user(args, sizeof(*dargs));
+	if (IS_ERR(dargs)) {
+		ret = PTR_ERR(dargs);
+		return ret;
+	}
+
+	if (dargs->cmd >= BTRFS_DEDUPE_CTL_LAST) {
+		ret = -EINVAL;
+		goto out;
+	}
+	switch (dargs->cmd) {
+	case BTRFS_DEDUPE_CTL_ENABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_enable(fs_info, dargs->hash_type,
+					 dargs->backend, dargs->blocksize,
+					 dargs->limit_nr, dargs->limit_mem);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		if (ret < 0)
+			break;
+
+		/* Also copy the result to caller for further use */
+		btrfs_dedupe_status(fs_info, dargs);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	case BTRFS_DEDUPE_CTL_DISABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_disable(fs_info);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		break;
+	case BTRFS_DEDUPE_CTL_STATUS:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		btrfs_dedupe_status(fs_info, dargs);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	default:
+		/*
+		 * Use this return value to inform progs that kernel
+		 * doesn't support such new command.
+		 */
+		ret = -EOPNOTSUPP;
+		break;
+	}
+out:
+	kfree(dargs);
+	return ret;
+}
+
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 				     struct inode *inode,
 				     u64 endoff,
@@ -5542,6 +5606,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+#ifdef CONFIG_BTRFS_DEBUG
+	case BTRFS_IOC_DEDUPE_CTL:
+		return btrfs_ioctl_dedupe_ctl(root, argp);
+#endif
 	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
 		return btrfs_ioctl_get_supported_features(argp);
 	case BTRFS_IOC_GET_FEATURES:
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 539e7b5..18686d1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -203,6 +203,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
+BTRFS_FEAT_ATTR_COMPAT_RO(dedupe, DEDUPE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -215,6 +216,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(dedupe),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..de48414 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -445,6 +445,27 @@ struct btrfs_ioctl_get_dev_stats {
 	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
 };
 
+/*
+ * De-duplication control modes
+ * For re-configuration, calling enable again will handle it
+ */
+#define BTRFS_DEDUPE_CTL_ENABLE	1
+#define BTRFS_DEDUPE_CTL_DISABLE 2
+#define BTRFS_DEDUPE_CTL_STATUS	3
+#define BTRFS_DEDUPE_CTL_LAST	4
+struct btrfs_ioctl_dedupe_args {
+	__u16 cmd;		/* In: command(see above macro) */
+	__u64 blocksize;	/* In/Out: For enable/status */
+	__u64 limit_nr;		/* In/Out: For enable/status */
+	__u64 limit_mem;	/* In/Out: For enable/status */
+	__u64 current_nr;	/* Out: For status output */
+	__u16 backend;		/* In/Out: For enable/status */
+	__u16 hash_type;	/* In/Out: For enable/status */
+	__u8 status;		/* Out: For status output */
+	/* pad to 512 bytes */
+	__u8 __unused[473];
+};
+
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
@@ -653,6 +674,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				    struct btrfs_ioctl_dev_replace_args)
 #define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
 					 struct btrfs_ioctl_same_args)
+#define BTRFS_IOC_DEDUPE_CTL	_IOWR(BTRFS_IOCTL_MAGIC, 55, \
+				      struct btrfs_ioctl_dedupe_args)
 #define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
 				   struct btrfs_ioctl_feature_flags)
 #define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (10 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the BTRFS_INODE_NODEDUPE flag, so we can explicitly disable
online data deduplication for specific files.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/ioctl.c | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 85044bf..0e8933c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2381,6 +2381,7 @@ do {                                                                   \
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
 #define BTRFS_INODE_COMPRESS		(1 << 11)
+#define BTRFS_INODE_NODEDUPE		(1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f659ed5..1fca655 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -161,7 +161,8 @@ void btrfs_update_iflags(struct inode *inode)
 /*
  * Inherit flags from the parent inode.
  *
- * Currently only the compression flags and the cow flags are inherited.
+ * Currently only the compression flags, dedupe flags and the cow flags
+ * are inherited.
  */
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 {
@@ -186,6 +187,9 @@ void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
 	}
 
+	if (flags & BTRFS_INODE_NODEDUPE)
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+
 	btrfs_update_iflags(inode);
 }
 
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (11 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

We use the btrfs extended attribute "btrfs.dedupe" to record per-file
online dedupe status, so add a dedupe property handler.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/props.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 3699212..a430886 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -42,6 +42,11 @@ static int prop_compression_apply(struct inode *inode,
 				  size_t len);
 static const char *prop_compression_extract(struct inode *inode);
 
+static int prop_dedupe_validate(const char *value, size_t len);
+static int prop_dedupe_apply(struct inode *inode, const char *value,
+			     size_t len);
+static const char *prop_dedupe_extract(struct inode *inode);
+
 static struct prop_handler prop_handlers[] = {
 	{
 		.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -50,6 +55,13 @@ static struct prop_handler prop_handlers[] = {
 		.extract = prop_compression_extract,
 		.inheritable = 1
 	},
+	{
+		.xattr_name = XATTR_BTRFS_PREFIX "dedupe",
+		.validate = prop_dedupe_validate,
+		.apply = prop_dedupe_apply,
+		.extract = prop_dedupe_extract,
+		.inheritable = 1
+	},
 };
 
 void __init btrfs_props_init(void)
@@ -426,4 +438,33 @@ static const char *prop_compression_extract(struct inode *inode)
 	return NULL;
 }
 
+static int prop_dedupe_validate(const char *value, size_t len)
+{
+	if (!strncmp("disable", value, len))
+		return 0;
+
+	return -EINVAL;
+}
+
+static int prop_dedupe_apply(struct inode *inode, const char *value, size_t len)
+{
+	if (len == 0) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	if (!strncmp("disable", value, len)) {
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static const char *prop_dedupe_extract(struct inode *inode)
+{
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return "disable";
 
+	return NULL;
+}
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (12 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce inode_need_dedupe() to implement per-file online dedupe control.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96790d0..c80fd74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -708,6 +708,18 @@ static void end_dedupe_extent(struct inode *inode, u64 start,
 	}
 }
 
+static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
+				    struct inode *inode)
+{
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return 0;
+
+	return 1;
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -1680,7 +1692,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+	} else if (!inode_need_compress(inode) &&
+		   !inode_need_dedupe(fs_info, inode)) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1, NULL);
 	} else {
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (13 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs

Since the introduction of the btrfs dedupe tree, it's possible for
balance to race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return
PTR_ERR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased while the node is neither added to
the backref_cache nor is nr_nodes decreased again, causing a BUG_ON()
in backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: 0000 [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  [<ffffffffa01f71d3>]
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  [<ffffffffa01cc177>]
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [<ffffffffa01cdb12>] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [<ffffffffa01d9611>] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [<ffffffffa01ddaf0>] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [<ffffffff81171cda>] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [<ffffffff81171e4c>] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [<ffffffff81193f04>] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [<ffffffff811da423>] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [<ffffffff8119913d>] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [<ffffffff81199209>] ? vma_link+0xb9/0xc0
[ 2611.693303]  [<ffffffff811e7e81>] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [<ffffffff8104e024>] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [<ffffffff811e83c1>] SyS_ioctl+0x41/0x70
[ 2611.694758]  [<ffffffff816dfc6e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  [<ffffffffa01f6fc1>]
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP <ffff88002a81fb30>

This patch calls remove_backref_node() in the error handling branch,
catches the returned -ENOENT in relocate_tree_blocks(), and continues
balancing.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/relocation.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 33183ce..d72a981 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -887,6 +887,13 @@ again:
 		root = read_fs_root(rc->extent_root->fs_info, key.offset);
 		if (IS_ERR(root)) {
 			err = PTR_ERR(root);
+			/*
+			 * Don't forget to clean up the current node.
+			 * It may not have been added to backref_cache, but
+			 * nr_nodes was still increased, which would trigger
+			 * the BUG_ON() in backref_cache_cleanup().
+			 */
+			remove_backref_node(&rc->backref_cache, cur);
 			goto out;
 		}
 
@@ -2990,14 +2997,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}
 
 	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);
 
 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root (currently only the dedupe tree) of the
+			 * tree block is going to be freed and can't be
+			 * reached. Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}
 
 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3005,11 +3019,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.7.4




^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (14 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Liu Bo, Wang Xiaoguang

Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes
as persistent hash storage instead of the in-memory-only implementation.

Unlike Liu Bo's implementation, this version doesn't hack the
bytenr -> hash search, but adds a new type, DEDUPE_BYTENR_ITEM, for
that search case, just like the in-memory backend.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h             | 53 +++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/dedupe.h            |  5 +++++
 fs/btrfs/disk-io.c           |  6 +++++
 fs/btrfs/relocation.c        |  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..659790c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -960,6 +964,36 @@ struct btrfs_csum_item {
 	u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+	__le64 blocksize;
+	__le64 limit_nr;
+	__le16 hash_type;
+	__le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bits of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * The hash, excluding its last 64 bits, follows in the item body
+ */
+
+/*
+ * Objectid: Bytenr of the extent
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * Offset: Last 64 bits of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * Its itemsize should always be 0.
+ */
+
 struct btrfs_dev_stats_item {
 	/*
 	 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY	228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY	230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY	231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY	232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
 	return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+		   blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+		   limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+		   hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+		   backend, 16);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
 	u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
 	/* dedupe blocksize */
 	u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
 	struct list_head lru_list;
 	u64 limit_nr;
 	u64 current_nr;
+
+	/* for persistent data like dedupe hashes and dedupe status */
+	struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
 	{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID,	.name_stem = "dreloc"	},
 	{ .id = BTRFS_UUID_TREE_OBJECTID,	.name_stem = "uuid"	},
 	{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID,	.name_stem = "free-space" },
+	{ .id = BTRFS_DEDUPE_TREE_OBJECTID,	.name_stem = "dedupe"	},
 	{ .id = 0,				.name_stem = "tree"	},
 };
 
@@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						  ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_DEDUPE_TREE_OBJECTID) {
+		if (fs_info->dedupe_enabled && fs_info->dedupe_info)
+			return fs_info->dedupe_info->dedupe_root;
+		return ERR_PTR(-ENOENT);
+	}
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d72a981..f26ac00 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -577,7 +577,8 @@ static int is_cowonly_root(u64 root_objectid)
 	    root_objectid == BTRFS_CSUM_TREE_OBJECTID ||
 	    root_objectid == BTRFS_UUID_TREE_OBJECTID ||
 	    root_objectid == BTRFS_QUOTA_TREE_OBJECTID ||
-	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
+	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID ||
+	    root_objectid == BTRFS_DEDUPE_TREE_OBJECTID)
 		return 1;
 	return 0;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d866f21..2c3d48a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -47,12 +47,13 @@ struct btrfs_qgroup_operation;
 		{ BTRFS_TREE_RELOC_OBJECTID,	"TREE_RELOC"	},	\
 		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_TREE"	},	\
 		{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },	\
+		{ BTRFS_DEDUPE_TREE_OBJECTID,	"DEDUPE_TREE"	},	\
 		{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" })
 
 #define show_root_type(obj)						\
 	obj, ((obj >= BTRFS_DATA_RELOC_TREE_OBJECTID) ||		\
 	      (obj >= BTRFS_ROOT_TREE_OBJECTID &&			\
-	       obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
+	       obj <= BTRFS_DEDUPE_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
 
 #define BTRFS_GROUP_FLAGS	\
 	{ BTRFS_BLOCK_GROUP_DATA,	"DATA"},	\
-- 
2.7.4





* [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (15 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-06-03 14:54   ` Josef Bacik
  2016-04-01  6:35 ` [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Since we will introduce a new on-disk dedupe method, introduce new
interfaces to resume a previous dedupe setup at mount time.

And since we introduce a new tree for dedupe status, also add a disable
handler for it.
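
When resuming, the parameters read back from the status item are passed
through the same sanity check as at enable time. A minimal userspace
sketch of that check follows; the conditions mirror the visible part of
check_dedupe_parameter() in this patch, while the MIN limit and the
power-of-two requirement are assumptions for the demo:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLOCKSIZE_MIN	(16 * 1024)		/* assumed */
#define DEMO_BLOCKSIZE_MAX	(8 * 1024 * 1024)
#define DEMO_EINVAL		22

/* Reject out-of-range or non-power-of-two dedupe blocksizes. */
static int demo_check_blocksize(uint64_t blocksize, uint32_t sectorsize)
{
	if (blocksize > DEMO_BLOCKSIZE_MAX ||
	    blocksize < DEMO_BLOCKSIZE_MIN ||
	    blocksize < sectorsize ||
	    (blocksize & (blocksize - 1)))	/* must be a power of two */
		return -DEMO_EINVAL;
	return 0;
}
```

Re-validating at resume time is what lets mount fail gracefully with
-EINVAL instead of trusting a possibly corrupted status item.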

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c  | 197 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/dedupe.h  |  13 ++++
 fs/btrfs/disk-io.c |  25 ++++++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 232 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index cfb7fea..a274c1c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 	return 0;
 }
 
+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+			    struct btrfs_dedupe_info *dedupe_info)
+{
+	struct btrfs_root *dedupe_root;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	struct btrfs_dedupe_status_item *status;
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	trans = btrfs_start_transaction(fs_info->tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	dedupe_root = btrfs_create_tree(trans, fs_info,
+				       BTRFS_DEDUPE_TREE_OBJECTID);
+	if (IS_ERR(dedupe_root)) {
+		ret = PTR_ERR(dedupe_root);
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+	dedupe_info->dedupe_root = dedupe_root;
+
+	key.objectid = 0;
+	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+				      sizeof(*status));
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+
+	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dedupe_status_item);
+	btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+					 dedupe_info->blocksize);
+	btrfs_set_dedupe_status_limit(path->nodes[0], status,
+			dedupe_info->limit_nr);
+	btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+			dedupe_info->hash_type);
+	btrfs_set_dedupe_status_backend(path->nodes[0], status,
+			dedupe_info->backend);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+	btrfs_free_path(path);
+	if (ret == 0)
+		btrfs_commit_transaction(trans, fs_info->tree_root);
+	return ret;
+}
+
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
 				  u16 backend, u64 blocksize, u64 limit_nr,
 				  u64 limit_mem, u64 *ret_limit)
 {
+	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+
 	if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
 	    blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
 	    blocksize < fs_info->tree_root->sectorsize ||
@@ -140,8 +201,12 @@ static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
 			*ret_limit = min(tmp, limit_nr);
 		}
 	}
-	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		/* Ondisk backend must use RO compat feature */
+		if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE))
+			return -EOPNOTSUPP;
 		*ret_limit = 0;
+	}
 	return 0;
 }
 
@@ -150,11 +215,16 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 {
 	struct btrfs_dedupe_info *dedupe_info;
 	u64 limit = 0;
+	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+	int create_tree;
 	int ret = 0;
 
 	/* only one limit is accepted for enable*/
 	if (limit_nr && limit_mem)
 		return -EINVAL;
+	/* enable and disable may modify ondisk data, so block RO fs */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
 
 	ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
 				     limit_nr, limit_mem, &limit);
@@ -179,9 +249,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 	}
 
 enable:
+	create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE;
+
 	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
 	if (ret < 0)
 		return ret;
+	if (create_tree) {
+		ret = init_dedupe_tree(fs_info, dedupe_info);
+		if (ret < 0) {
+			crypto_free_shash(dedupe_info->dedupe_driver);
+			kfree(dedupe_info);
+			return ret;
+		}
+	}
 	fs_info->dedupe_info = dedupe_info;
 	/* We must ensure dedupe_enabled is modified after dedupe_info */
 	smp_wmb();
@@ -189,6 +269,55 @@ enable:
 	return ret;
 }
 
+int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
+			struct btrfs_root *dedupe_root)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	struct btrfs_dedupe_status_item *status;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	u64 blocksize;
+	u64 limit_nr;
+	u64 limit;
+	u16 type;
+	u16 backend;
+	int ret = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = 0;
+	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+	if (ret > 0) {
+		ret = -ENOENT;
+		goto out;
+	} else if (ret < 0) {
+		goto out;
+	}
+	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dedupe_status_item);
+	blocksize = btrfs_dedupe_status_blocksize(path->nodes[0], status);
+	limit_nr = btrfs_dedupe_status_limit(path->nodes[0], status);
+	type = btrfs_dedupe_status_hash_type(path->nodes[0], status);
+	backend = btrfs_dedupe_status_backend(path->nodes[0], status);
+
+	ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
+				     limit_nr, 0, &limit);
+	if (ret < 0)
+		goto out;
+	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
+	if (ret < 0)
+		goto out;
+	dedupe_info->dedupe_root = dedupe_root;
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 static int inmem_insert_hash(struct rb_root *root,
 			     struct inmem_hash *hash, int hash_len)
 {
@@ -413,12 +542,74 @@ int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
 
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		inmem_destroy(dedupe_info);
-
+	if (dedupe_info->dedupe_root) {
+		free_root_extent_buffers(dedupe_info->dedupe_root);
+		kfree(dedupe_info->dedupe_root);
+	}
 	crypto_free_shash(dedupe_info->dedupe_driver);
 	kfree(dedupe_info);
 	return 0;
 }
 
+static int remove_dedupe_tree(struct btrfs_root *dedupe_root)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_fs_info *fs_info = dedupe_root->fs_info;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct extent_buffer *node;
+	int ret;
+	int nr;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	trans = btrfs_start_transaction(fs_info->tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
+	path->leave_spinning = 1;
+	key.objectid = 0;
+	key.offset = 0;
+	key.type = 0;
+
+	while (1) {
+		ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+		if (ret < 0)
+			goto out;
+		node = path->nodes[0];
+		nr = btrfs_header_nritems(node);
+		if (nr == 0) {
+			btrfs_release_path(path);
+			break;
+		}
+		path->slots[0] = 0;
+		ret = btrfs_del_items(trans, dedupe_root, path, 0, nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	ret = btrfs_del_root(trans, fs_info->tree_root, &dedupe_root->root_key);
+	if (ret)
+		goto out;
+
+	list_del(&dedupe_root->dirty_list);
+	btrfs_tree_lock(dedupe_root->node);
+	clean_tree_block(trans, fs_info, dedupe_root->node);
+	btrfs_tree_unlock(dedupe_root->node);
+	btrfs_free_tree_block(trans, dedupe_root, dedupe_root->node, 0, 1);
+	free_extent_buffer(dedupe_root->node);
+	free_extent_buffer(dedupe_root->commit_root);
+	kfree(dedupe_root);
+	ret = btrfs_commit_transaction(trans, fs_info->tree_root);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_dedupe_info *dedupe_info;
@@ -452,6 +643,8 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	/* now we are OK to clean up everything */
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		inmem_destroy(dedupe_info);
+	if (dedupe_info->dedupe_root)
+		ret = remove_dedupe_tree(dedupe_info->dedupe_root);
 
 	crypto_free_shash(dedupe_info->dedupe_driver);
 	kfree(dedupe_info);
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 1ac1bcb..2038ab8 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -123,6 +123,19 @@ void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
  */
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
+/*
+ * Restore previous dedupe setup from disk
+ * Called at mount time
+ */
+int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
+		       struct btrfs_root *dedupe_root);
+
+/*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
 /*
  * Cleanup current btrfs_dedupe_info
  * Called in umount time
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c7eda03..4ba27b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2162,7 +2162,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->extent_workers);
 }
 
-static void free_root_extent_buffers(struct btrfs_root *root)
+void free_root_extent_buffers(struct btrfs_root *root)
 {
 	if (root) {
 		free_extent_buffer(root->node);
@@ -2496,7 +2496,28 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
 		fs_info->free_space_root = root;
 	}
 
-	return 0;
+	location.objectid = BTRFS_DEDUPE_TREE_OBJECTID;
+	root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		if (ret != -ENOENT)
+			return ret;
+		return 0;
+	}
+
+	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+	ret = btrfs_dedupe_resume(fs_info, root);
+	if (ret < 0) {
+		if (ret == -EINVAL)
+			btrfs_err(fs_info,
+				"invalid dedupe parameter found");
+		if (ret == -EOPNOTSUPP)
+			btrfs_err(fs_info,
+				"unsupported dedupe parameter found");
+		free_root_extent_buffers(root);
+		kfree(root);
+	}
+	return ret;
 }
 
 int open_ctree(struct super_block *sb,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d00..42c4ff2 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -70,6 +70,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
 int btrfs_init_fs_root(struct btrfs_root *root);
 int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
 			 struct btrfs_root *root);
+void free_root_extent_buffers(struct btrfs_root *root);
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info);
 
 struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
-- 
2.7.4





* [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (16 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-06-03 14:57   ` Josef Bacik
  2016-04-01  6:35 ` [PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

The on-disk backend is now able to search hashes.
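
Because the key only carries the last 64 bits of the hash, the search
has to do a two-stage compare: match the 64-bit tail via the key, then
compare the remaining hash bytes stored in the item body, walking past
64-bit collisions. A userspace sketch with a plain array standing in for
the dedupe tree (structures and the linear scan are illustrative only):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HASH_LEN	32	/* e.g. SHA-256 */

struct demo_item {
	uint64_t tail;			/* last 64 bits of the hash (key) */
	uint64_t bytenr;		/* extent bytenr (key offset) */
	uint8_t body[DEMO_HASH_LEN - 8];/* rest of the hash (item body) */
};

/* Return 1 and set *bytenr_ret on a full match, 0 for not found. */
static int demo_find(const struct demo_item *items, int nr,
		     const uint8_t *hash, uint64_t *bytenr_ret)
{
	uint64_t tail;
	int i;

	memcpy(&tail, hash + DEMO_HASH_LEN - 8, 8);
	for (i = 0; i < nr; i++) {
		if (items[i].tail != tail)
			continue;	/* key mismatch */
		if (memcmp(items[i].body, hash, DEMO_HASH_LEN - 8))
			continue;	/* 64-bit collision, keep walking */
		*bytenr_ret = items[i].bytenr;
		return 1;
	}
	return 0;
}
```

The kernel version walks previous items with btrfs_previous_item()
instead of scanning an array, but the collision handling is the same
idea.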

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++++------
 fs/btrfs/dedupe.h |   1 +
 2 files changed, 151 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a274c1c..00f2a01 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -652,6 +652,112 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Compare ondisk hash with src.
+ * Return 0 if hash matches.
+ * Return non-zero for hash mismatch
+ *
+ * Caller should ensure the slot contains a valid hash item.
+ */
+static int memcmp_ondisk_hash(const struct btrfs_key *key,
+			      struct extent_buffer *node, int slot,
+			      int hash_len, const u8 *src)
+{
+	u64 offset;
+	int ret;
+
+	/* Return value doesn't make sense in this case though */
+	if (WARN_ON(hash_len <= 8 || key->type != BTRFS_DEDUPE_HASH_ITEM_KEY))
+		return -EINVAL;
+
+	/* compare the hash excluding the last 64 bits */
+	offset = btrfs_item_ptr_offset(node, slot);
+	ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
+	if (ret)
+		return ret;
+	return memcmp(&key->objectid, src + hash_len - 8, 8);
+}
+
+/*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	u8 *buf = NULL;
+	u64 hash_key;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	buf = kmalloc(hash_len, GFP_NOFS);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memcpy(&hash_key, hash + hash_len - 8, 8);
+	key.objectid = hash_key;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	WARN_ON(ret == 0);
+	while (1) {
+		struct extent_buffer *node;
+		struct btrfs_dedupe_hash_item *hash_item;
+		int slot;
+
+		ret = btrfs_previous_item(dedupe_root, path, hash_key,
+					  BTRFS_DEDUPE_HASH_ITEM_KEY);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+
+		node = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(node, &key, slot);
+
+		/*
+		 * Type or objectid mismatch means no previous item can
+		 * match, so stop searching
+		 */
+		if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+		    memcmp(&key.objectid, &hash_key, 8))
+			break;
+		hash_item = btrfs_item_ptr(node, slot,
+				struct btrfs_dedupe_hash_item);
+		/*
+		 * If the hash mismatch, it's still possible that previous item
+		 * has the desired hash.
+		 */
+		if (memcmp_ondisk_hash(&key, node, slot, hash_len, hash))
+			continue;
+		/* Found */
+		ret = 1;
+		*bytenr_ret = key.offset;
+		*num_bytes_ret = dedupe_info->blocksize;
+		break;
+	}
+out:
+	kfree(buf);
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
  * Caller must ensure the corresponding ref head is not being run.
  */
 static struct inmem_hash *
@@ -681,9 +787,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
 	return NULL;
 }
 
-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-			struct inode *inode, u64 file_pos,
-			struct btrfs_dedupe_hash *hash)
+/* Wrapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+				      u8 *hash, u64 *bytenr_ret,
+				      u32 *num_bytes_ret)
+{
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		struct inmem_hash *found_hash;
+		int ret;
+
+		found_hash = inmem_search_hash(dedupe_info, hash);
+		if (found_hash) {
+			ret = 1;
+			*bytenr_ret = found_hash->bytenr;
+			*num_bytes_ret = found_hash->num_bytes;
+		} else {
+			ret = 0;
+			*bytenr_ret = 0;
+			*num_bytes_ret = 0;
+		}
+		return ret;
+	} else if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		return ondisk_search_hash(dedupe_info, hash, bytenr_ret,
+					  num_bytes_ret);
+	}
+	return -EINVAL;
+}
+
+static int generic_search(struct btrfs_dedupe_info *dedupe_info,
+			  struct inode *inode, u64 file_pos,
+			  struct btrfs_dedupe_hash *hash)
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -693,9 +826,9 @@ static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
 	struct btrfs_delayed_ref_head *insert_head;
 	struct btrfs_delayed_data_ref *insert_dref;
 	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
-	struct inmem_hash *found_hash;
 	int free_insert = 1;
 	u64 bytenr;
+	u64 tmp_bytenr;
 	u32 num_bytes;
 
 	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
@@ -727,14 +860,9 @@ static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
 
 again:
 	mutex_lock(&dedupe_info->lock);
-	found_hash = inmem_search_hash(dedupe_info, hash->hash);
-	/* If we don't find a duplicated extent, just return. */
-	if (!found_hash) {
-		ret = 0;
+	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	if (ret <= 0)
 		goto out;
-	}
-	bytenr = found_hash->bytenr;
-	num_bytes = found_hash->num_bytes;
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
@@ -773,13 +901,17 @@ again:
 
 	mutex_lock(&dedupe_info->lock);
 	/* Search again to ensure the hash is still here */
-	found_hash = inmem_search_hash(dedupe_info, hash->hash);
-	if (!found_hash) {
-		ret = 0;
+	ret = generic_search_hash(dedupe_info, hash->hash, &tmp_bytenr,
+				  &num_bytes);
+	if (ret <= 0) {
 		mutex_unlock(&head->mutex);
 		goto out;
 	}
-	ret = 1;
+	if (tmp_bytenr != bytenr) {
+		mutex_unlock(&head->mutex);
+		mutex_unlock(&dedupe_info->lock);
+		goto again;
+	}
 	hash->bytenr = bytenr;
 	hash->num_bytes = num_bytes;
 
@@ -824,8 +956,9 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	if (WARN_ON(btrfs_dedupe_hash_hit(hash)))
 		return -EINVAL;
 
-	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
-		ret = inmem_search(dedupe_info, inode, file_pos, hash);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY ||
+	    dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		ret = generic_search(dedupe_info, inode, file_pos, hash);
 
 	/* It's possible hash->bytenr/num_bytenr already changed */
 	if (ret == 0) {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 2038ab8..bfcacd7 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -163,6 +163,7 @@ int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
  * *INCREASED*, and hash->bytenr/num_bytes will record the existing
  * extent data.
  * Return 0 for a hash miss. Nothing is done
+ * Return < 0 for error
  */
 int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 			struct inode *inode, u64 file_pos,
-- 
2.7.4





* [PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (17 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  6:35 ` [PATCH v10 20/21] btrfs: dedupe: Add support for adding " Qu Wenruo
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

The on-disk backend can now delete hashes.
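
Deletion has to remove both tree items, and only the bytenr is known up
front. The patch first locates the bytenr item (whose key offset holds
the hash tail), then builds the hash item's key from it. That key
juggling can be sketched as follows; `demo_key` and the helper are
illustrative, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative userspace key; not the kernel struct. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

#define DEMO_DEDUPE_HASH_ITEM_KEY	231
#define DEMO_DEDUPE_BYTENR_ITEM_KEY	232

/*
 * The bytenr item (objectid = bytenr, offset = hash tail) is found
 * first; its offset becomes the objectid of the hash item to delete,
 * and the bytenr becomes that key's offset.
 */
static struct demo_key hash_key_from_bytenr_item(struct demo_key bytenr_key)
{
	struct demo_key key = {
		.objectid = bytenr_key.offset,
		.type = DEMO_DEDUPE_HASH_ITEM_KEY,
		.offset = bytenr_key.objectid,
	};
	return key;
}
```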

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 00f2a01..7c5d58a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -500,6 +500,104 @@ static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
 	return 0;
 }
 
+/*
+ * If prepare_del is given, this will setup search_slot() for delete.
+ * Caller needs to do proper locking.
+ *
+ * Return > 0 for found.
+ * Return 0 for not found.
+ * Return < 0 for error.
+ */
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del)
+{
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	int ret;
+	int ins_len = 0;
+	int cow = 0;
+
+	if (prepare_del) {
+		if (WARN_ON(trans == NULL))
+			return -EINVAL;
+		cow = 1;
+		ins_len = -1;
+	}
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path,
+				ins_len, cow);
+
+	if (ret < 0)
+		return ret;
+	/*
+	 * Although it's almost impossible, it's still possible that
+	 * the last 64 bits are all 1.
+	 */
+	if (ret == 0)
+		return 1;
+
+	ret = btrfs_previous_item(dedupe_root, path, bytenr,
+				  BTRFS_DEDUPE_BYTENR_ITEM_KEY);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return 0;
+	return 1;
+}
+
+static int ondisk_del(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = 0;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(trans, dedupe_info, path, bytenr, 1);
+	if (ret <= 0)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	ret = btrfs_del_item(trans, dedupe_root, path);
+	btrfs_release_path(path);
+	if (ret < 0)
+		goto out;
+	/* Search for hash item and delete it */
+	key.objectid = key.offset;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = bytenr;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+	if (WARN_ON(ret > 0)) {
+		ret = -ENOENT;
+		goto out;
+	}
+	if (ret < 0)
+		goto out;
+	ret = btrfs_del_item(trans, dedupe_root, path);
+
+out:
+	btrfs_free_path(path);
+	mutex_unlock(&dedupe_info->lock);
+	return ret;
+}
+
 /* Remove a dedupe hash from dedupe tree */
 int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info, u64 bytenr)
@@ -514,6 +612,8 @@ int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_del(dedupe_info, bytenr);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_del(trans, dedupe_info, bytenr);
 	return -EINVAL;
 }
 
-- 
2.7.4





* [PATCH v10 20/21] btrfs: dedupe: Add support for adding hash for on-disk backend
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (18 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-06-03 15:03   ` Josef Bacik
  2016-04-01  6:35 ` [PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

The on-disk backend can now add hashes.

Since all needed on-disk backend functions are added, also allow the
on-disk backend to be used, by changing DEDUPE_BACKEND_COUNT from 1
(inmemory only) to 2 (inmemory + ondisk).
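
Before writing the two tree items, the add path skips extents whose
bytenr is already recorded and hashes that are already present (to save
dedupe tree space). A userspace sketch of that decision logic, with a
plain array standing in for the dedupe tree (all names are ours):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX	16

struct demo_entry {
	uint64_t bytenr;
	uint64_t hash_tail;	/* last 64 bits of the hash */
};

static struct demo_entry demo_table[DEMO_MAX];
static int demo_nr;

/* Return 1 if a new pair was inserted, 0 if it was skipped. */
static int demo_add(uint64_t bytenr, uint64_t hash_tail)
{
	int i;

	for (i = 0; i < demo_nr; i++) {
		/* extent already has a hash recorded */
		if (demo_table[i].bytenr == bytenr)
			return 0;
		/* same hash already present: don't re-add, save space */
		if (demo_table[i].hash_tail == hash_tail)
			return 0;
	}
	if (demo_nr >= DEMO_MAX)
		return 0;
	demo_table[demo_nr].bytenr = bytenr;
	demo_table[demo_nr].hash_tail = hash_tail;
	demo_nr++;
	return 1;
}
```

Skipping the re-add on a hash hit is the space/dedupe-rate trade-off the
commit mentions: only the first extent with a given hash is recorded.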

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h |  3 +-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 7c5d58a..1f0178e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -437,6 +437,87 @@ out:
 	return 0;
 }
 
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info,
+		      struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_key key;
+	u64 hash_offset;
+	u64 bytenr;
+	u32 num_bytes;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	if (WARN_ON(hash_len <= 8 ||
+	    !IS_ALIGNED(hash->bytenr, dedupe_root->sectorsize)))
+		return -EINVAL;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	btrfs_release_path(path);
+
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	if (ret < 0)
+		goto out;
+	/* Same hash found, don't re-add to save dedupe tree space */
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+
+	/* Insert hash->bytenr item */
+	memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = hash->bytenr;
+
+	/* The last 8 bytes are not included in the stored hash */
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+				      hash_len - 8);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+	write_extent_buffer(path->nodes[0], hash->hash,
+			    hash_offset, hash_len - 8);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+	btrfs_release_path(path);
+
+	/* Then bytenr->hash item */
+	key.objectid = hash->bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key, 0);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info,
 		     struct btrfs_dedupe_hash *hash)
@@ -458,6 +539,8 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_add(dedupe_info, hash);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_add(trans, dedupe_info, hash);
 	return -EINVAL;
 }
 
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index bfcacd7..1573456 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -31,8 +31,7 @@
 #define BTRFS_DEDUPE_BACKEND_INMEMORY		0
 #define BTRFS_DEDUPE_BACKEND_ONDISK		1
 
-/* Only support inmemory yet, so count is still only 1 */
-#define BTRFS_DEDUPE_BACKEND_COUNT		1
+#define BTRFS_DEDUPE_BACKEND_COUNT		2
 
 /* Dedup block size limit and default value */
 #define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (19 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 20/21] btrfs: dedupe: Add support for adding " Qu Wenruo
@ 2016-04-01  6:35 ` Qu Wenruo
  2016-04-01  8:53 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
  2016-06-03 15:20 ` [PATCH v10 00/21] Btrfs dedupe framework Josef Bacik
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  6:35 UTC (permalink / raw)
  To: linux-btrfs

For dedupe to work with compression, new members recording the compression
algorithm and the on-disk extent length are needed.

Add them for later compress-dedupe co-work.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h        | 22 +++++++++++++-
 fs/btrfs/dedupe.c       | 78 ++++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/dedupe.h       |  2 ++
 fs/btrfs/inode.c        |  2 ++
 fs/btrfs/ordered-data.c |  2 ++
 5 files changed, 85 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 659790c..fdbe66b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -982,8 +982,22 @@ struct btrfs_dedupe_status_item {
  * Offset: Bytenr of the hash
  *
  * Used for hash <-> bytenr search
- * Hash exclude the last 64 bit follows
  */
+struct btrfs_dedupe_hash_item {
+	/*
+	 * length of dedupe range on disk
+	 * For in-memory length, it's always
+	 * dedupe_info->block_size
+	 */
+	__le32 disk_len;
+
+	u8 compression;
+
+	/*
+	 * Hash follows, exclude the last 64bit,
+	 * as it's already in key.objectid.
+	 */
+} __attribute__ ((__packed__));
 
 /*
  * Objectid: bytenr
@@ -3316,6 +3330,12 @@ BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
 BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
 		   backend, 16);
 
+/* btrfs_dedupe_hash_item */
+BTRFS_SETGET_FUNCS(dedupe_hash_disk_len, struct btrfs_dedupe_hash_item,
+		   disk_len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_compression, struct btrfs_dedupe_hash_item,
+		   compression, 8);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 1f0178e..e91420d 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -31,6 +31,8 @@ struct inmem_hash {
 
 	u64 bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	u8 hash[];
 };
@@ -397,6 +399,8 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
 	/* Copy the data out */
 	ihash->bytenr = hash->bytenr;
 	ihash->num_bytes = hash->num_bytes;
+	ihash->disk_num_bytes = hash->disk_num_bytes;
+	ihash->compression = hash->compression;
 	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
 
 	mutex_lock(&dedupe_info->lock);
@@ -442,7 +446,8 @@ static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, u64 bytenr,
 				int prepare_del);
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
-			      u64 *bytenr_ret, u32 *num_bytes_ret);
+			      u64 *bytenr_ret, u32 *num_bytes_ret,
+			      u32 *disk_num_bytes_ret, u8 *compression);
 static int ondisk_add(struct btrfs_trans_handle *trans,
 		      struct btrfs_dedupe_info *dedupe_info,
 		      struct btrfs_dedupe_hash *hash)
@@ -450,7 +455,7 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
 	struct btrfs_key key;
-	u64 hash_offset;
+	struct btrfs_dedupe_hash_item *hash_item;
 	u64 bytenr;
 	u32 num_bytes;
 	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
@@ -475,7 +480,8 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
-	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+				 NULL, NULL);
 	if (ret < 0)
 		goto out;
 	/* Same hash found, don't re-add to save dedupe tree space */
@@ -491,13 +497,18 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 
 	/* The last 8 bit will not be included into hash */
 	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
-				      hash_len - 8);
+				      sizeof(*hash_item) + hash_len - 8);
 	WARN_ON(ret == -EEXIST);
 	if (ret < 0)
 		goto out;
-	hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+	hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				   struct btrfs_dedupe_hash_item);
+	btrfs_set_dedupe_hash_disk_len(path->nodes[0], hash_item,
+				       hash->disk_num_bytes);
+	btrfs_set_dedupe_hash_compression(path->nodes[0], hash_item,
+					  hash->compression);
 	write_extent_buffer(path->nodes[0], hash->hash,
-			    hash_offset, hash_len - 8);
+			    (unsigned long)(hash_item + 1), hash_len - 8);
 	btrfs_mark_buffer_dirty(path->nodes[0]);
 	btrfs_release_path(path);
 
@@ -845,7 +856,7 @@ static int memcmp_ondisk_hash(const struct btrfs_key *key,
 			      struct extent_buffer *node, int slot,
 			      int hash_len, const u8 *src)
 {
-	u64 offset;
+	struct btrfs_dedupe_hash_item *hash_item;
 	int ret;
 
 	/* Return value doesn't make sense in this case though */
@@ -853,8 +864,10 @@ static int memcmp_ondisk_hash(const struct btrfs_key *key,
 		return -EINVAL;
 
 	/* compare the hash exlcuding the last 64 bits */
-	offset = btrfs_item_ptr_offset(node, slot);
-	ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
+	hash_item = btrfs_item_ptr(node, slot,
+				   struct btrfs_dedupe_hash_item);
+	ret = memcmp_extent_buffer(node, src, (unsigned long)(hash_item + 1),
+				   hash_len - 8);
 	if (ret)
 		return ret;
 	return memcmp(&key->objectid, src + hash_len - 8, 8);
@@ -866,7 +879,8 @@ static int memcmp_ondisk_hash(const struct btrfs_key *key,
  * Return <0 for error
  */
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
-			      u64 *bytenr_ret, u32 *num_bytes_ret)
+			      u64 *bytenr_ret, u32 *num_bytes_ret,
+			      u32 *disk_num_bytes_ret, u8 *compression_ret)
 {
 	struct btrfs_path *path;
 	struct btrfs_key key;
@@ -930,8 +944,16 @@ static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
 			continue;
 		/* Found */
 		ret = 1;
-		*bytenr_ret = key.offset;
-		*num_bytes_ret = dedupe_info->blocksize;
+		if (bytenr_ret)
+			*bytenr_ret = key.offset;
+		if (num_bytes_ret)
+			*num_bytes_ret = dedupe_info->blocksize;
+		if (disk_num_bytes_ret)
+			*disk_num_bytes_ret = btrfs_dedupe_hash_disk_len(node,
+					hash_item);
+		if (compression_ret)
+			*compression_ret = btrfs_dedupe_hash_compression(node,
+					hash_item);
 		break;
 	}
 out:
@@ -973,7 +995,9 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
 /* Wapper for different backends, caller needs to hold dedupe_info->lock */
 static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
 				      u8 *hash, u64 *bytenr_ret,
-				      u32 *num_bytes_ret)
+				      u32 *num_bytes_ret,
+				      u32 *disk_num_bytes_ret,
+				      u8 *compression_ret)
 {
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
 		struct inmem_hash *found_hash;
@@ -984,15 +1008,20 @@ static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
 			ret = 1;
 			*bytenr_ret = found_hash->bytenr;
 			*num_bytes_ret = found_hash->num_bytes;
+			*disk_num_bytes_ret = found_hash->disk_num_bytes;
+			*compression_ret = found_hash->compression;
 		} else {
 			ret = 0;
 			*bytenr_ret = 0;
 			*num_bytes_ret = 0;
+			*disk_num_bytes_ret = 0;
+			*compression_ret = 0;
 		}
 		return ret;
 	} else if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
 		return ondisk_search_hash(dedupe_info, hash, bytenr_ret,
-					  num_bytes_ret);
+					  num_bytes_ret, disk_num_bytes_ret,
+					  compression_ret);
 	}
 	return -EINVAL;
 }
@@ -1013,6 +1042,8 @@ static int generic_search(struct btrfs_dedupe_info *dedupe_info,
 	u64 bytenr;
 	u64 tmp_bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
 	if (!insert_head)
@@ -1043,7 +1074,8 @@ static int generic_search(struct btrfs_dedupe_info *dedupe_info,
 
 again:
 	mutex_lock(&dedupe_info->lock);
-	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+				  &disk_num_bytes, &compression);
 	if (ret <= 0)
 		goto out;
 
@@ -1059,15 +1091,17 @@ again:
 		 */
 		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
 				insert_dref, insert_head, insert_qrecord,
-				bytenr, num_bytes, 0, root->root_key.objectid,
-				btrfs_ino(inode), file_pos, 0,
-				BTRFS_ADD_DELAYED_REF);
+				bytenr, disk_num_bytes, 0,
+				root->root_key.objectid, btrfs_ino(inode),
+				file_pos, 0, BTRFS_ADD_DELAYED_REF);
 		spin_unlock(&delayed_refs->lock);
 
 		/* add_delayed_data_ref_locked will free unused memory */
 		free_insert = 0;
 		hash->bytenr = bytenr;
 		hash->num_bytes = num_bytes;
+		hash->disk_num_bytes = disk_num_bytes;
+		hash->compression = compression;
 		ret = 1;
 		goto out;
 	}
@@ -1085,7 +1119,7 @@ again:
 	mutex_lock(&dedupe_info->lock);
 	/* Search again to ensure the hash is still here */
 	ret = generic_search_hash(dedupe_info, hash->hash, &tmp_bytenr,
-				  &num_bytes);
+				  &num_bytes, &disk_num_bytes, &compression);
 	if (ret <= 0) {
 		mutex_unlock(&head->mutex);
 		goto out;
@@ -1097,12 +1131,14 @@ again:
 	}
 	hash->bytenr = bytenr;
 	hash->num_bytes = num_bytes;
+	hash->disk_num_bytes = disk_num_bytes;
+	hash->compression = compression;
 
 	/*
 	 * Increase the extent ref right now, to avoid delayed ref run
 	 * Or we may increase ref on non-exist extent.
 	 */
-	btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
+	btrfs_inc_extent_ref(trans, root, bytenr, disk_num_bytes, 0,
 			     root->root_key.objectid,
 			     btrfs_ino(inode), file_pos);
 	mutex_unlock(&head->mutex);
@@ -1147,6 +1183,8 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	if (ret == 0) {
 		hash->num_bytes = 0;
 		hash->bytenr = 0;
+		hash->disk_num_bytes = 0;
+		hash->compression = 0;
 	}
 	return ret;
 }
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 1573456..9298b8b 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -54,6 +54,8 @@ static int btrfs_dedupe_sizes[] = { 32 };
 struct btrfs_dedupe_hash {
 	u64 bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	/* last field is a variable length array of dedupe hash */
 	u8 hash[];
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c80fd74..d49ef5c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2314,6 +2314,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (hash && hash->bytenr == 0) {
 		hash->bytenr = ins.objectid;
 		hash->num_bytes = ins.offset;
+		hash->disk_num_bytes = hash->num_bytes;
+		hash->compression = BTRFS_COMPRESS_NONE;
 		ret = btrfs_dedupe_add(trans, root->fs_info, hash);
 	}
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ef24ad1..695c0e2 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -227,6 +227,8 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 		}
 		entry->hash->bytenr = hash->bytenr;
 		entry->hash->num_bytes = hash->num_bytes;
+		entry->hash->disk_num_bytes = hash->disk_num_bytes;
+		entry->hash->compression = hash->compression;
 		memcpy(entry->hash->hash, hash->hash,
 		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (20 preceding siblings ...)
  2016-04-01  6:35 ` [PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
@ 2016-04-01  8:53 ` Qu Wenruo
  2016-06-03 15:20 ` [PATCH v10 00/21] Btrfs dedupe framework Josef Bacik
  22 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-04-01  8:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Liu Bo, Wang Xiaoguang

Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes,
serving as persistent hash storage instead of an in-memory-only
implementation.

Unlike Liu Bo's implementation, this version doesn't hack around the
bytenr -> hash search, but adds a new type, DEDUPE_BYTENR_ITEM, for that
search case, just like the in-memory backend.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
Fix a small rebase bug, which missed 4 lines.
---
 fs/btrfs/ctree.h             | 53 +++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/dedupe.h            |  5 +++++
 fs/btrfs/disk-io.c           |  6 +++++
 fs/btrfs/relocation.c        |  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..659790c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -960,6 +964,36 @@ struct btrfs_csum_item {
 	u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+	__le64 blocksize;
+	__le64 limit_nr;
+	__le16 hash_type;
+	__le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * Hash exclude the last 64 bit follows
+ */
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * Its itemsize should always be 0.
+ */
+
 struct btrfs_dev_stats_item {
 	/*
 	 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY	228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY	230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY	231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY	232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
 	return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+		   blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+		   limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+		   hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+		   backend, 16);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
 	u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
 	/* dedupe blocksize */
 	u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
 	struct list_head lru_list;
 	u64 limit_nr;
 	u64 current_nr;
+
+	/* for persist data like dedup-hash and dedupe status */
+	struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
 	{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID,	.name_stem = "dreloc"	},
 	{ .id = BTRFS_UUID_TREE_OBJECTID,	.name_stem = "uuid"	},
 	{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID,	.name_stem = "free-space" },
+	{ .id = BTRFS_DEDUPE_TREE_OBJECTID,	.name_stem = "dedupe"	},
 	{ .id = 0,				.name_stem = "tree"	},
 };
 
@@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						  ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_DEDUPE_TREE_OBJECTID) {
+		if (fs_info->dedupe_enabled && fs_info->dedupe_info)
+			return fs_info->dedupe_info->dedupe_root;
+		return ERR_PTR(-ENOENT);
+	}
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d72a981..f26ac00 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -577,7 +577,8 @@ static int is_cowonly_root(u64 root_objectid)
 	    root_objectid == BTRFS_CSUM_TREE_OBJECTID ||
 	    root_objectid == BTRFS_UUID_TREE_OBJECTID ||
 	    root_objectid == BTRFS_QUOTA_TREE_OBJECTID ||
-	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
+	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID ||
+	    root_objectid == BTRFS_DEDUPE_TREE_OBJECTID)
 		return 1;
 	return 0;
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d866f21..2c3d48a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -47,12 +47,13 @@ struct btrfs_qgroup_operation;
 		{ BTRFS_TREE_RELOC_OBJECTID,	"TREE_RELOC"	},	\
 		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_TREE"	},	\
 		{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },	\
+		{ BTRFS_DEDUPE_TREE_OBJECTID,	"DEDUPE_TREE"	},	\
 		{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" })
 
 #define show_root_type(obj)						\
 	obj, ((obj >= BTRFS_DATA_RELOC_TREE_OBJECTID) ||		\
 	      (obj >= BTRFS_ROOT_TREE_OBJECTID &&			\
-	       obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
+	       obj <= BTRFS_DEDUPE_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
 
 #define BTRFS_GROUP_FLAGS	\
 	{ BTRFS_BLOCK_GROUP_DATA,	"DATA"},	\
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-04-01  6:34 ` [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
@ 2016-04-01  9:59   ` kbuild test robot
  2016-05-11  0:00     ` Mark Fasheh
  0 siblings, 1 reply; 54+ messages in thread
From: kbuild test robot @ 2016-04-01  9:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Wang Xiaoguang

[-- Attachment #1: Type: text/plain, Size: 948 bytes --]

Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160401]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-rhel (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 36091 bytes --]

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband dedupelication
  2016-04-01  6:35 ` [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
@ 2016-04-27  1:29   ` Qu Wenruo
  2016-05-17 13:14     ` David Sterba
  0 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-04-27  1:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang, David Sterba

Hi David

Qu Wenruo wrote on 2016/04/01 14:35 +0800:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>
> Add ioctl interface for inband dedupelication, which includes:
> 1) enable
> 2) disable
> 3) status
>
> And a pseudo RO compat flag, to imply that btrfs now supports inband
> dedup.
> However we don't add any ondisk format change, it's just a pseudo RO
> compat flag.
>
> All these ioctl interface are state-less, which means caller don't need
> to bother previous dedupe state before calling them, and only need to
> care the final desired state.
>
> For example, if user want to enable dedupe with specified block size and
> limit, just fill the ioctl structure and call enable ioctl.
> No need to check if dedupe is already running.
>
> These ioctls will handle things like re-configure or disable quite well.
>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>  static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
>  				     struct inode *inode,
>  				     u64 endoff,
> @@ -5542,6 +5606,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>  		return btrfs_ioctl_get_fslabel(file, argp);
>  	case BTRFS_IOC_SET_FSLABEL:
>  		return btrfs_ioctl_set_fslabel(file, argp);

Would you mind if I add a new kernel config option, "Btrfs experimental 
features -> dedupe ioctl", for cases like dedupe and future experimental 
btrfs features?

BTRFS_DEBUG seems quite odd to me.
Although in-band dedupe is quite good at exposing bugs in 
backref/qgroup/delayed_refs, I still think it's not a debug tool.

So I hope to use "BTRFS_EXPERIMENTAL_DEDUPE_IOCTL" and add the 
corresponding Kconfig interface.

Thanks,
Qu

> +#ifdef CONFIG_BTRFS_DEBUG
> +	case BTRFS_IOC_DEDUPE_CTL:
> +		return btrfs_ioctl_dedupe_ctl(root, argp);
> +#endif
>  	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
>  		return btrfs_ioctl_get_supported_features(argp);
>  	case BTRFS_IOC_GET_FEATURES:
> diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
> index 539e7b5..18686d1 100644
> --- a/fs/btrfs/sysfs.c
> +++ b/fs/btrfs/sysfs.c
> @@ -203,6 +203,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
>  BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
>  BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
>  BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
> +BTRFS_FEAT_ATTR_COMPAT_RO(dedupe, DEDUPE);
>
>  static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(mixed_backref),
> @@ -215,6 +216,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
>  	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
>  	BTRFS_FEAT_ATTR_PTR(no_holes),
>  	BTRFS_FEAT_ATTR_PTR(free_space_tree),
> +	BTRFS_FEAT_ATTR_PTR(dedupe),
>  	NULL
>  };
>
> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index dea8931..de48414 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -445,6 +445,27 @@ struct btrfs_ioctl_get_dev_stats {
>  	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
>  };
>
> +/*
> + * de-duplication control modes
> + * For re-config, re-enable will handle it
> + */
> +#define BTRFS_DEDUPE_CTL_ENABLE	1
> +#define BTRFS_DEDUPE_CTL_DISABLE 2
> +#define BTRFS_DEDUPE_CTL_STATUS	3
> +#define BTRFS_DEDUPE_CTL_LAST	4
> +struct btrfs_ioctl_dedupe_args {
> +	__u16 cmd;		/* In: command(see above macro) */
> +	__u64 blocksize;	/* In/Out: For enable/status */
> +	__u64 limit_nr;		/* In/Out: For enable/status */
> +	__u64 limit_mem;	/* In/Out: For enable/status */
> +	__u64 current_nr;	/* Out: For status output */
> +	__u16 backend;		/* In/Out: For enable/status */
> +	__u16 hash_type;	/* In/Out: For enable/status */
> +	u8 status;		/* Out: For status output */
> +	/* pad to 512 bytes */
> +	u8 __unused[473];
> +};
> +
>  #define BTRFS_QUOTA_CTL_ENABLE	1
>  #define BTRFS_QUOTA_CTL_DISABLE	2
>  #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
> @@ -653,6 +674,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
>  				    struct btrfs_ioctl_dev_replace_args)
>  #define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
>  					 struct btrfs_ioctl_same_args)
> +#define BTRFS_IOC_DEDUPE_CTL	_IOWR(BTRFS_IOCTL_MAGIC, 55, \
> +				      struct btrfs_ioctl_dedupe_args)
>  #define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
>  				   struct btrfs_ioctl_feature_flags)
>  #define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
>



^ permalink raw reply	[flat|nested] 54+ messages in thread
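For illustration only, a Kconfig entry along the lines Qu proposes might look like this — the symbol name, prompt, and help text are hypothetical sketches, not part of the patchset:

```kconfig
config BTRFS_EXPERIMENTAL_DEDUPE_IOCTL
	bool "Btrfs experimental features -> dedupe ioctl"
	depends on BTRFS_FS
	default n
	help
	  Expose the in-band dedupe ioctls (enable/disable/status).
	  The ioctl ABI is still experimental and subject to change;
	  do not enable this on production systems.
```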

* Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-04-01  9:59   ` kbuild test robot
@ 2016-05-11  0:00     ` Mark Fasheh
  2016-05-11  0:21       ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-05-11  0:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Apr 01, 2016 at 05:59:13PM +0800, kbuild test robot wrote:
> Hi Wang,
> 
> [auto build test ERROR on btrfs/next]
> [also build test ERROR on v4.6-rc1 next-20160401]
> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
> config: x86_64-rhel (attached as .config)
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=x86_64 
> 
> Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
>       It only hurts bisectibility.
> 
> All errors (new ones prefixed by >>):
> 
> >> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!
> 
> ---
> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

Please correct this; we need to be able to bisect the kernel without random
patches breaking the build.
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-05-11  0:00     ` Mark Fasheh
@ 2016-05-11  0:21       ` Qu Wenruo
  2016-05-11  2:24         ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-05-11  0:21 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang



Mark Fasheh wrote on 2016/05/10 17:00 -0700:
> On Fri, Apr 01, 2016 at 05:59:13PM +0800, kbuild test robot wrote:
>> Hi Wang,
>>
>> [auto build test ERROR on btrfs/next]
>> [also build test ERROR on v4.6-rc1 next-20160401]
>> [if your patch is applied to the wrong git tree, please drop us a note to help improving the system]
>>
>> url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
>> config: x86_64-rhel (attached as .config)
>> reproduce:
>>         # save the attached .config to linux build tree
>>         make ARCH=x86_64
>>
>> Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
>>       It only hurts bisectibility.
>>
>> All errors (new ones prefixed by >>):
>>
>>>> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!
>>
>> ---
>> 0-DAY kernel test infrastructure                Open Source Technology Center
>> https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
>
> Please correct this, we need to be able to bisect a kernel without random
> patches breaking the build.
> 	--Mark
>
> --
> Mark Fasheh
>
>
The build bot is just using the wrong branch.

We're using integration-4.6, which is not among the branches the build bot tracks.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
  2016-05-11  0:21       ` Qu Wenruo
@ 2016-05-11  2:24         ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-05-11  2:24 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang



Qu Wenruo wrote on 2016/05/11 08:21 +0800:
>
>
> Mark Fasheh wrote on 2016/05/10 17:00 -0700:
>> On Fri, Apr 01, 2016 at 05:59:13PM +0800, kbuild test robot wrote:
>>> Hi Wang,
>>>
>>> [auto build test ERROR on btrfs/next]
>>> [also build test ERROR on v4.6-rc1 next-20160401]
>>> [if your patch is applied to the wrong git tree, please drop us a
>>> note to help improving the system]
>>>
>>> url:
>>> https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
>>>
>>> base:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
>>> next
>>> config: x86_64-rhel (attached as .config)
>>> reproduce:
>>>         # save the attached .config to linux build tree
>>>         make ARCH=x86_64
>>>
>>> Note: the
>>> linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD
>>> 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
>>>       It only hurts bisectibility.
>>>
>>> All errors (new ones prefixed by >>):
>>>
>>>>> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!
>>>
>>> ---
>>> 0-DAY kernel test infrastructure                Open Source
>>> Technology Center
>>> https://lists.01.org/pipermail/kbuild-all                   Intel
>>> Corporation
>>
>> Please correct this, we need to be able to bisect a kernel without random
>> patches breaking the build.
>>     --Mark
>>
>> --
>> Mark Fasheh
>>
>>
> The build bot is just using wrong branch.
>
> We're using integration-4.6, which is not in their build bot branching.
>
> Thanks,
> Qu

Oh, my fault, I just confused it with the old days when it used the wrong 
branch.

Will fix it in v11 patchset.

Thanks,
Qu
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html




* Re: [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication
  2016-04-27  1:29   ` Qu Wenruo
@ 2016-05-17 13:14     ` David Sterba
  2016-05-18  0:54       ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: David Sterba @ 2016-05-17 13:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba

On Wed, Apr 27, 2016 at 09:29:29AM +0800, Qu Wenruo wrote:
> > @@ -5542,6 +5606,10 @@ long btrfs_ioctl(struct file *file, unsigned int
> >  		return btrfs_ioctl_get_fslabel(file, argp);
> >  	case BTRFS_IOC_SET_FSLABEL:
> >  		return btrfs_ioctl_set_fslabel(file, argp);
> 
> Would you mind me to add a new kernel config "Btrfs experimental 
> features-> dedupe ioctl" for case like dedupe and further experimental 
> btrfs features?
> 
> The BTRFS_DEBUG seems quite odd for me.
> Although in-band dedupe is quite good at exposing bugs of 
> backref/qgroup/delayed_refs, but I still think it's not a debug tool.
> 
> So I hope to use "BTRFS_EXPERIMENTAL_DEDUPE_IOCTL" and add corresponding 
> Kconfig interfaces.

The point was not to add new config options, but if you find _DEBUG odd
we can rethink the "hide it under config option" approach. The alternate
way is to get the code into a good shape and add it to for-next. I don't
want to track yet another set of branches with the too experimental
stuff.


* Re: [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  2016-04-01  6:34 ` [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
@ 2016-05-17 13:15   ` David Sterba
  0 siblings, 0 replies; 54+ messages in thread
From: David Sterba @ 2016-05-17 13:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Apr 01, 2016 at 02:34:58PM +0800, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> 
> Unlike in-memory or on-disk dedupe method, only SHA256 hash method is
> supported yet, so implement btrfs_dedupe_calc_hash() interface using
> SHA256.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)
> 
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 9175a5f..bdaea3a 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -593,3 +593,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
>  	}
>  	return ret;
>  }
> +
> +int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
> +			   struct inode *inode, u64 start,
> +			   struct btrfs_dedupe_hash *hash)
> +{
> +	int i;
> +	int ret;
> +	struct page *p;
> +	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
> +	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
> +	struct {
> +		struct shash_desc desc;
> +		char ctx[crypto_shash_descsize(tfm)];
> +	} sdesc;

This construct has been obsoleted, please use SHASH_DESC_ON_STACK.
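For reference, a sketch of what the quoted hunk might look like after the suggested change. This is not buildable outside a kernel tree, and it is an editor's illustration based on the quoted patch, not code from the series:

```c
/*
 * SHASH_DESC_ON_STACK (include/crypto/hash.h) declares a shash_desc
 * together with correctly sized context storage on the stack, replacing
 * the open-coded variable-length struct from the quoted patch:
 */
struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
struct crypto_shash *tfm = dedupe_info->dedupe_driver;
SHASH_DESC_ON_STACK(sdesc, tfm);

sdesc->tfm = tfm;
/* ... then crypto_shash_init()/update()/final() against sdesc ... */
```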


* Re: [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space
  2016-04-01  6:35 ` [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space Qu Wenruo
@ 2016-05-17 13:20   ` David Sterba
  2016-05-18  0:57     ` Qu Wenruo
  2016-06-01 22:14     ` Mark Fasheh
  0 siblings, 2 replies; 54+ messages in thread
From: David Sterba @ 2016-05-17 13:20 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Apr 01, 2016 at 02:35:01PM +0800, Qu Wenruo wrote:
> @@ -5815,6 +5817,23 @@ out_fail:
>  	}
>  	if (delalloc_lock)
>  		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
> +	/*
> +	 * The number of metadata bytes is calculated by the difference
> +	 * between outstanding_extents and reserved_extents. Sometimes though
> +	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
> +	 * indeed it has already done some work to reclaim metadata space, hence
> +	 * both outstanding_extents and reserved_extents would have changed and
> +	 * the bytes we try to reserve would also has changed(may be smaller).
> +	 * So here we try to reserve again. This is much useful for online
> +	 * dedupe, which will easily eat almost all meta space.
> +	 *
> +	 * XXX: Indeed here 3 is arbitrarily choosed, it's a good workaround for
> +	 * online dedupe, later we should find a better method to avoid dedupe
> +	 * enospc issue.
> +	 */
> +	if (unlikely(ret == -ENOSPC && loops++ < 3))
> +		goto again;
> +

This does not seem right and needs to be addressed properly before I
consider adding the patchset to for-next. I don't have an idea how to fix
it.


* Re: [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication
  2016-05-17 13:14     ` David Sterba
@ 2016-05-18  0:54       ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-05-18  0:54 UTC (permalink / raw)
  To: dsterba, linux-btrfs, Wang Xiaoguang, David Sterba



David Sterba wrote on 2016/05/17 15:14 +0200:
> On Wed, Apr 27, 2016 at 09:29:29AM +0800, Qu Wenruo wrote:
>>> @@ -5542,6 +5606,10 @@ long btrfs_ioctl(struct file *file, unsigned int
>>>  		return btrfs_ioctl_get_fslabel(file, argp);
>>>  	case BTRFS_IOC_SET_FSLABEL:
>>>  		return btrfs_ioctl_set_fslabel(file, argp);
>>
>> Would you mind me to add a new kernel config "Btrfs experimental
>> features-> dedupe ioctl" for case like dedupe and further experimental
>> btrfs features?
>>
>> The BTRFS_DEBUG seems quite odd for me.
>> Although in-band dedupe is quite good at exposing bugs of
>> backref/qgroup/delayed_refs, but I still think it's not a debug tool.
>>
>> So I hope to use "BTRFS_EXPERIMENTAL_DEDUPE_IOCTL" and add corresponding
>> Kconfig interfaces.
>
> The point was not to add new config options, but if you find _DEBUG odd
> we can rethink the "hide it under config option" approach. The alternate
> way is to get the code into a good shape and add it to for-next. I don't
> want to track yet another set of branches with the too experimental
> stuff.

OK, I'll try the alternative method and get the in-memory backend code 
into good shape.

As for the ioctl interface, I'll post some ideas to the wiki and hope we 
can reach a good solution for it.

For example, adding a new 'force' flag so that 'enable --force' is 
stateless, while without that flag we go through the normal stateful 
enable/config method.

Going completely stateful is still another alternative, though.

Thanks,
Qu




* Re: [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space
  2016-05-17 13:20   ` David Sterba
@ 2016-05-18  0:57     ` Qu Wenruo
  2016-06-01 22:14     ` Mark Fasheh
  1 sibling, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-05-18  0:57 UTC (permalink / raw)
  To: dsterba, linux-btrfs, Wang Xiaoguang



David Sterba wrote on 2016/05/17 15:20 +0200:
> On Fri, Apr 01, 2016 at 02:35:01PM +0800, Qu Wenruo wrote:
>> @@ -5815,6 +5817,23 @@ out_fail:
>>  	}
>>  	if (delalloc_lock)
>>  		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
>> +	/*
>> +	 * The number of metadata bytes is calculated by the difference
>> +	 * between outstanding_extents and reserved_extents. Sometimes though
>> +	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
>> +	 * indeed it has already done some work to reclaim metadata space, hence
>> +	 * both outstanding_extents and reserved_extents would have changed and
>> +	 * the bytes we try to reserve would also has changed(may be smaller).
>> +	 * So here we try to reserve again. This is much useful for online
>> +	 * dedupe, which will easily eat almost all meta space.
>> +	 *
>> +	 * XXX: Indeed here 3 is arbitrarily choosed, it's a good workaround for
>> +	 * online dedupe, later we should find a better method to avoid dedupe
>> +	 * enospc issue.
>> +	 */
>> +	if (unlikely(ret == -ENOSPC && loops++ < 3))
>> +		goto again;
>> +
>
> This does not seem right and needs to be addressed properly before I
> consider adding the patchset to for-next. I don't have idea how to fix
> it.
>
>
Right, the code is bad, just as Josef pointed out.

In fact this behavior hides a lot of ENOSPC problems and exposes quite a 
few metadata reservation limitations of the current code.
(Mostly hidden by the default 128M max extent size)

We are actively debugging and testing a proper root fix for the problem.

Thanks,
Qu




* Re: [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree
  2016-04-01  6:34 ` [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
@ 2016-06-01 19:37   ` Mark Fasheh
  2016-06-02  0:49     ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-06-01 19:37 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Apr 01, 2016 at 02:34:54PM +0800, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> 
> Introduce static function inmem_add() to add hash into in-memory tree.
> And now we can implement the btrfs_dedupe_add() interface.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 151 insertions(+)
> 
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 2211588..4e8455e 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -32,6 +32,14 @@ struct inmem_hash {
>  	u8 hash[];
>  };
>  
> +static inline struct inmem_hash *inmem_alloc_hash(u16 type)
> +{
> +	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
> +		return NULL;
> +	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
> +			GFP_NOFS);
> +}
> +
>  static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
>  			    u16 backend, u64 blocksize, u64 limit)
>  {
> @@ -152,3 +160,146 @@ enable:
>  	fs_info->dedupe_enabled = 1;
>  	return ret;
>  }
> +
> +static int inmem_insert_hash(struct rb_root *root,
> +			     struct inmem_hash *hash, int hash_len)
> +{
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct inmem_hash *entry = NULL;
> +
> +	while (*p) {
> +		parent = *p;
> +		entry = rb_entry(parent, struct inmem_hash, hash_node);
> +		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
> +			p = &(*p)->rb_left;
> +		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
> +			p = &(*p)->rb_right;
> +		else
> +			return 1;
> +	}
> +	rb_link_node(&hash->hash_node, parent, p);
> +	rb_insert_color(&hash->hash_node, root);
> +	return 0;
> +}
> +
> +static int inmem_insert_bytenr(struct rb_root *root,
> +			       struct inmem_hash *hash)
> +{
> +	struct rb_node **p = &root->rb_node;
> +	struct rb_node *parent = NULL;
> +	struct inmem_hash *entry = NULL;
> +
> +	while (*p) {
> +		parent = *p;
> +		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
> +		if (hash->bytenr < entry->bytenr)
> +			p = &(*p)->rb_left;
> +		else if (hash->bytenr > entry->bytenr)
> +			p = &(*p)->rb_right;
> +		else
> +			return 1;
> +	}
> +	rb_link_node(&hash->bytenr_node, parent, p);
> +	rb_insert_color(&hash->bytenr_node, root);
> +	return 0;
> +}
> +
> +static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
> +			struct inmem_hash *hash)
> +{
> +	list_del(&hash->lru_list);
> +	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
> +	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
> +
> +	if (!WARN_ON(dedupe_info->current_nr == 0))
> +		dedupe_info->current_nr--;
> +
> +	kfree(hash);
> +}
> +
> +/*
> + * Insert a hash into in-memory dedupe tree
> + * Will remove exceeding last recent use hash.
> + *
> + * If the hash mathced with existing one, we won't insert it, to
> + * save memory
> + */
> +static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
> +		     struct btrfs_dedupe_hash *hash)
> +{
> +	int ret = 0;
> +	u16 type = dedupe_info->hash_type;
> +	struct inmem_hash *ihash;
> +
> +	ihash = inmem_alloc_hash(type);
> +
> +	if (!ihash)
> +		return -ENOMEM;
> +
> +	/* Copy the data out */
> +	ihash->bytenr = hash->bytenr;
> +	ihash->num_bytes = hash->num_bytes;
> +	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
> +
> +	mutex_lock(&dedupe_info->lock);

Can you describe somewhere in a comment why we need this mutex? It is
unclear just based on reading the code why we need a sleeping lock here.
	--Mark

--
Mark Fasheh


* Re: [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from in-memory tree
  2016-04-01  6:34 ` [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
@ 2016-06-01 19:40   ` Mark Fasheh
  2016-06-02  1:01     ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-06-01 19:40 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Apr 01, 2016 at 02:34:55PM +0800, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> 
> Introduce static function inmem_del() to remove hash from in-memory
> dedupe tree.
> And implement btrfs_dedupe_del() and btrfs_dedup_destroy() interfaces.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
> 
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 4e8455e..a229ded 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
>  		return inmem_add(dedupe_info, hash);
>  	return -EINVAL;
>  }
> +
> +static struct inmem_hash *
> +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
> +{
> +	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct inmem_hash *entry = NULL;
> +
> +	while (*p) {
> +		parent = *p;
> +		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
> +
> +		if (bytenr < entry->bytenr)
> +			p = &(*p)->rb_left;
> +		else if (bytenr > entry->bytenr)
> +			p = &(*p)->rb_right;
> +		else
> +			return entry;
> +	}
> +
> +	return NULL;
> +}
> +
> +/* Delete a hash from in-memory dedupe tree */
> +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
> +{
> +	struct inmem_hash *hash;
> +
> +	mutex_lock(&dedupe_info->lock);
> +	hash = inmem_search_bytenr(dedupe_info, bytenr);
> +	if (!hash) {
> +		mutex_unlock(&dedupe_info->lock);
> +		return 0;
> +	}
> +
> +	__inmem_del(dedupe_info, hash);
> +	mutex_unlock(&dedupe_info->lock);
> +	return 0;
> +}
> +
> +/* Remove a dedupe hash from dedupe tree */
> +int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
> +		     struct btrfs_fs_info *fs_info, u64 bytenr)
> +{
> +	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
> +
> +	if (!fs_info->dedupe_enabled)
> +		return 0;
> +
> +	if (WARN_ON(dedupe_info == NULL))
> +		return -EINVAL;
> +
> +	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
> +		return inmem_del(dedupe_info, bytenr);
> +	return -EINVAL;
> +}
> +
> +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
> +{
> +	struct inmem_hash *entry, *tmp;
> +
> +	mutex_lock(&dedupe_info->lock);
> +	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
> +		__inmem_del(dedupe_info, entry);
> +	mutex_unlock(&dedupe_info->lock);
> +}
> +
> +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_dedupe_info *dedupe_info;
> +	int ret;
> +
> +	/* Here we don't want to increase refs of dedupe_info */
> +	fs_info->dedupe_enabled = 0;

Can this clear of fs_info->dedupe_enabled race with another thread in write?
I don't see any locking (but perhaps that comes in a later patch).
	--Mark

--
Mark Fasheh


* Re: [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe
  2016-04-01  6:34 ` [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
@ 2016-06-01 22:06   ` Mark Fasheh
  2016-06-02  1:08     ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-06-01 22:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba

On Fri, Apr 01, 2016 at 02:34:59PM +0800, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> 
> Add ordered-extent support for dedupe.
> 
> Note, current ordered-extent support only supports non-compressed source
> extent.
> Support for compressed source extent will be added later.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>  fs/btrfs/ordered-data.c | 44 ++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/ordered-data.h | 13 +++++++++++++
>  2 files changed, 53 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 0de7da5..ef24ad1 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -26,6 +26,7 @@
>  #include "extent_io.h"
>  #include "disk-io.h"
>  #include "compression.h"
> +#include "dedupe.h"
>  
>  static struct kmem_cache *btrfs_ordered_extent_cache;
>  
> @@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
>   */
>  static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>  				      u64 start, u64 len, u64 disk_len,
> -				      int type, int dio, int compress_type)
> +				      int type, int dio, int compress_type,
> +				      struct btrfs_dedupe_hash *hash)
>  {
>  	struct btrfs_root *root = BTRFS_I(inode)->root;
>  	struct btrfs_ordered_inode_tree *tree;
> @@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>  	entry->inode = igrab(inode);
>  	entry->compress_type = compress_type;
>  	entry->truncated_len = (u64)-1;
> +	entry->hash = NULL;
> +	/*
> +	 * Hash hit must go through dedupe routine at all cost, even dedupe
> +	 * is disabled. As its delayed ref is already increased.
> +	 */

Initially, I had a hard time understanding this comment but I'm pretty sure
I know what you mean.

/*
 * A hash hit means we have already incremented the extents delayed ref.
 * We must handle this even if another process raced to turn off dedupe
 * otherwise we might leak a reference.
 */

might be better. Hope that helps.
	--Mark

--
Mark Fasheh


* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-04-01  6:35 ` [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
@ 2016-06-01 22:08   ` Mark Fasheh
  2016-06-02  1:12     ` Qu Wenruo
  2016-06-03 14:43   ` Josef Bacik
  1 sibling, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-06-01 22:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba

On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
> Core implement for inband de-duplication.
> It reuse the async_cow_start() facility to do the calculate dedupe hash.
> And use dedupe hash to do inband de-duplication at extent level.
> 
> The work flow is as below:
> 1) Run delalloc range for an inode
> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
> 3) For hash match(duplicated) case, just increase source extent ref
>    and insert file extent.
>    For hash mismatch case, go through the normal cow_file_range()
>    fallback, and add hash into dedupe_tree.
>    Compress for hash miss case is not supported yet.
> 
> Current implement restore all dedupe hash in memory rb-tree, with LRU
> behavior to control the limit.
> 
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c |  18 ++++
>  fs/btrfs/inode.c       | 235 ++++++++++++++++++++++++++++++++++++++++++-------
>  fs/btrfs/relocation.c  |  16 ++++
>  3 files changed, 236 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 53e1297..dabd721 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -37,6 +37,7 @@
>  #include "math.h"
>  #include "sysfs.h"
>  #include "qgroup.h"
> +#include "dedupe.h"
>  
>  #undef SCRAMBLE_DELAYED_REFS
>  
> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>  
>  	if (btrfs_delayed_ref_is_head(node)) {
>  		struct btrfs_delayed_ref_head *head;
> +		struct btrfs_fs_info *fs_info = root->fs_info;
> +
>  		/*
>  		 * we've hit the end of the chain and we were supposed
>  		 * to insert this extent into the tree.  But, it got
> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>  			btrfs_pin_extent(root, node->bytenr,
>  					 node->num_bytes, 1);
>  			if (head->is_data) {
> +				/*
> +				 * If insert_reserved is given, it means
> +				 * a new extent is revered, then deleted
> +				 * in one tran, and inc/dec get merged to 0.
> +				 *
> +				 * In this case, we need to remove its dedup
> +				 * hash.
> +				 */
> +				btrfs_dedupe_del(trans, fs_info, node->bytenr);
>  				ret = btrfs_del_csums(trans, root,
>  						      node->bytenr,
>  						      node->num_bytes);
> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>  		btrfs_release_path(path);
>  
>  		if (is_data) {
> +			ret = btrfs_dedupe_del(trans, info, bytenr);
> +			if (ret < 0) {
> +				btrfs_abort_transaction(trans, extent_root,
> +							ret);

I don't see why an error here should lead to a readonly fs.
	--Mark

--
Mark Fasheh


* Re: [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space
  2016-05-17 13:20   ` David Sterba
  2016-05-18  0:57     ` Qu Wenruo
@ 2016-06-01 22:14     ` Mark Fasheh
  1 sibling, 0 replies; 54+ messages in thread
From: Mark Fasheh @ 2016-06-01 22:14 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs, Wang Xiaoguang, Josef Bacik,
	Chris Mason

On Tue, May 17, 2016 at 03:20:16PM +0200, David Sterba wrote:
> On Fri, Apr 01, 2016 at 02:35:01PM +0800, Qu Wenruo wrote:
> > @@ -5815,6 +5817,23 @@ out_fail:
> >  	}
> >  	if (delalloc_lock)
> >  		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
> > +	/*
> > +	 * The number of metadata bytes is calculated by the difference
> > +	 * between outstanding_extents and reserved_extents. Sometimes though
> > +	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
> > +	 * indeed it has already done some work to reclaim metadata space, hence
> > +	 * both outstanding_extents and reserved_extents would have changed and
> > +	 * the bytes we try to reserve would also has changed(may be smaller).
> > +	 * So here we try to reserve again. This is much useful for online
> > +	 * dedupe, which will easily eat almost all meta space.
> > +	 *
> > +	 * XXX: Indeed here 3 is arbitrarily choosed, it's a good workaround for
> > +	 * online dedupe, later we should find a better method to avoid dedupe
> > +	 * enospc issue.
> > +	 */
> > +	if (unlikely(ret == -ENOSPC && loops++ < 3))
> > +		goto again;
> > +
> 
> This does not seem right and needs to be addressed properly before I
> consider adding the patchset to for-next. I don't have idea how to fix
> it.

Agreed, and this sort of issue is a reason why I strongly feel we don't want
to merge this series piecemeal until we know that after everything is
complete, we can end up with a fully baked in-band dedupe implementation.

Luckily, Qu says he's on it, so if he posts a workable fix here my whole
point becomes moot. Until then, though, this is exactly the type of 'fix
later' coding we need to avoid.
	--Mark

--
Mark Fasheh


* Re: [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree
  2016-06-01 19:37   ` Mark Fasheh
@ 2016-06-02  0:49     ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-06-02  0:49 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang



At 06/02/2016 03:37 AM, Mark Fasheh wrote:
> On Fri, Apr 01, 2016 at 02:34:54PM +0800, Qu Wenruo wrote:
>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>
>> Introduce static function inmem_add() to add hash into in-memory tree.
>> And now we can implement the btrfs_dedupe_add() interface.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>>  fs/btrfs/dedupe.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 151 insertions(+)
>>
>> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
>> index 2211588..4e8455e 100644
>> --- a/fs/btrfs/dedupe.c
>> +++ b/fs/btrfs/dedupe.c
>> @@ -32,6 +32,14 @@ struct inmem_hash {
>>  	u8 hash[];
>>  };
>>
>> +static inline struct inmem_hash *inmem_alloc_hash(u16 type)
>> +{
>> +	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
>> +		return NULL;
>> +	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
>> +			GFP_NOFS);
>> +}
>> +
>>  static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
>>  			    u16 backend, u64 blocksize, u64 limit)
>>  {
>> @@ -152,3 +160,146 @@ enable:
>>  	fs_info->dedupe_enabled = 1;
>>  	return ret;
>>  }
>> +
>> +static int inmem_insert_hash(struct rb_root *root,
>> +			     struct inmem_hash *hash, int hash_len)
>> +{
>> +	struct rb_node **p = &root->rb_node;
>> +	struct rb_node *parent = NULL;
>> +	struct inmem_hash *entry = NULL;
>> +
>> +	while (*p) {
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct inmem_hash, hash_node);
>> +		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
>> +			p = &(*p)->rb_left;
>> +		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
>> +			p = &(*p)->rb_right;
>> +		else
>> +			return 1;
>> +	}
>> +	rb_link_node(&hash->hash_node, parent, p);
>> +	rb_insert_color(&hash->hash_node, root);
>> +	return 0;
>> +}
>> +
>> +static int inmem_insert_bytenr(struct rb_root *root,
>> +			       struct inmem_hash *hash)
>> +{
>> +	struct rb_node **p = &root->rb_node;
>> +	struct rb_node *parent = NULL;
>> +	struct inmem_hash *entry = NULL;
>> +
>> +	while (*p) {
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
>> +		if (hash->bytenr < entry->bytenr)
>> +			p = &(*p)->rb_left;
>> +		else if (hash->bytenr > entry->bytenr)
>> +			p = &(*p)->rb_right;
>> +		else
>> +			return 1;
>> +	}
>> +	rb_link_node(&hash->bytenr_node, parent, p);
>> +	rb_insert_color(&hash->bytenr_node, root);
>> +	return 0;
>> +}
>> +
>> +static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
>> +			struct inmem_hash *hash)
>> +{
>> +	list_del(&hash->lru_list);
>> +	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
>> +	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
>> +
>> +	if (!WARN_ON(dedupe_info->current_nr == 0))
>> +		dedupe_info->current_nr--;
>> +
>> +	kfree(hash);
>> +}
>> +
>> +/*
>> + * Insert a hash into in-memory dedupe tree
>> + * Will remove exceeding last recent use hash.
>> + *
>> + * If the hash mathced with existing one, we won't insert it, to
>> + * save memory
>> + */
>> +static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
>> +		     struct btrfs_dedupe_hash *hash)
>> +{
>> +	int ret = 0;
>> +	u16 type = dedupe_info->hash_type;
>> +	struct inmem_hash *ihash;
>> +
>> +	ihash = inmem_alloc_hash(type);
>> +
>> +	if (!ihash)
>> +		return -ENOMEM;
>> +
>> +	/* Copy the data out */
>> +	ihash->bytenr = hash->bytenr;
>> +	ihash->num_bytes = hash->num_bytes;
>> +	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
>> +
>> +	mutex_lock(&dedupe_info->lock);
>
> Can you describe somewhere in a comment why we need this mutex? It is
> unclear just based on reading the code why we need a sleeping lock here.
> 	--Mark

For the on-disk backend, we will do B-tree operations inside the 
critical section, so in that case we need a sleeping lock.

It would be OK to use a spinlock for the in-memory backend and a mutex 
for the on-disk backend, but we want to re-use most of the code between 
them, just like in a later patch with generic_search_hash(), so we use a 
mutex for all backends.

And a mutex is not that much slower than a spinlock, unless there is a 
lot of contention.
(IIRC, linux-rt replaces most spinlocks with mutexes for better preemption)

For the inband dedupe case, the most time-consuming routine is not the 
hash insert, but the hash calculation.

So the mutex here is not an optimization hotspot AFAIK.

Thanks,
Qu

>
> --
> Mark Fasheh
>
>




* Re: [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from in-memory tree
  2016-06-01 19:40   ` Mark Fasheh
@ 2016-06-02  1:01     ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-06-02  1:01 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang



At 06/02/2016 03:40 AM, Mark Fasheh wrote:
> On Fri, Apr 01, 2016 at 02:34:55PM +0800, Qu Wenruo wrote:
>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>
>> Introduce static function inmem_del() to remove hash from in-memory
>> dedupe tree.
>> And implement btrfs_dedupe_del() and btrfs_dedup_destroy() interfaces.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>>  fs/btrfs/dedupe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 105 insertions(+)
>>
>> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
>> index 4e8455e..a229ded 100644
>> --- a/fs/btrfs/dedupe.c
>> +++ b/fs/btrfs/dedupe.c
>> @@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
>>  		return inmem_add(dedupe_info, hash);
>>  	return -EINVAL;
>>  }
>> +
>> +static struct inmem_hash *
>> +inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
>> +{
>> +	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
>> +	struct rb_node *parent = NULL;
>> +	struct inmem_hash *entry = NULL;
>> +
>> +	while (*p) {
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
>> +
>> +		if (bytenr < entry->bytenr)
>> +			p = &(*p)->rb_left;
>> +		else if (bytenr > entry->bytenr)
>> +			p = &(*p)->rb_right;
>> +		else
>> +			return entry;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +/* Delete a hash from in-memory dedupe tree */
>> +static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
>> +{
>> +	struct inmem_hash *hash;
>> +
>> +	mutex_lock(&dedupe_info->lock);
>> +	hash = inmem_search_bytenr(dedupe_info, bytenr);
>> +	if (!hash) {
>> +		mutex_unlock(&dedupe_info->lock);
>> +		return 0;
>> +	}
>> +
>> +	__inmem_del(dedupe_info, hash);
>> +	mutex_unlock(&dedupe_info->lock);
>> +	return 0;
>> +}
>> +
>> +/* Remove a dedupe hash from dedupe tree */
>> +int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
>> +		     struct btrfs_fs_info *fs_info, u64 bytenr)
>> +{
>> +	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
>> +
>> +	if (!fs_info->dedupe_enabled)
>> +		return 0;
>> +
>> +	if (WARN_ON(dedupe_info == NULL))
>> +		return -EINVAL;
>> +
>> +	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
>> +		return inmem_del(dedupe_info, bytenr);
>> +	return -EINVAL;
>> +}
>> +
>> +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
>> +{
>> +	struct inmem_hash *entry, *tmp;
>> +
>> +	mutex_lock(&dedupe_info->lock);
>> +	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
>> +		__inmem_del(dedupe_info, entry);
>> +	mutex_unlock(&dedupe_info->lock);
>> +}
>> +
>> +int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>> +{
>> +	struct btrfs_dedupe_info *dedupe_info;
>> +	int ret;
>> +
>> +	/* Here we don't want to increase refs of dedupe_info */
>> +	fs_info->dedupe_enabled = 0;
>
> Can this clear of fs_info->dedupe_enabled race with another thread in write?
> I don't see any locking (but perhaps that comes in a later patch).
> 	--Mark

Here we use sync fs to ensure no dedupe write will happen later.
(No need to ensure current transaction will go through dedupe)

Different buffered writers may see different values of the dedupe_enabled 
flag, but whether we do dedupe is not determined by the flag at buffered 
write time (for this v10 patchset).

Instead it's determined by the flag when we run the delalloc range.
And we use sync_fs() to write out all dirty pages, ensuring no dedupe 
write can happen later, before freeing dedupe_info.

So it should be fine, and test cases like btrfs/200 (not yet mainlined) 
exercise it.


Although the behavior may change a little in v11, to handle the ENOSPC 
problem.

In the next version, whether to do dedupe will be determined at buffered 
write time, with the result stored in the io_tree, to handle metadata 
reservation better.
And sync_fs() will still be used to ensure there is no other race.

Thanks,
Qu

>
> --
> Mark Fasheh
>
>



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe
  2016-06-01 22:06   ` Mark Fasheh
@ 2016-06-02  1:08     ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-06-02  1:08 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba



At 06/02/2016 06:06 AM, Mark Fasheh wrote:
> On Fri, Apr 01, 2016 at 02:34:59PM +0800, Qu Wenruo wrote:
>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>
>> Add ordered-extent support for dedupe.
>>
>> Note, current ordered-extent support only supports non-compressed source
>> extent.
>> Support for compressed source extent will be added later.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>>  fs/btrfs/ordered-data.c | 44 ++++++++++++++++++++++++++++++++++++++++----
>>  fs/btrfs/ordered-data.h | 13 +++++++++++++
>>  2 files changed, 53 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
>> index 0de7da5..ef24ad1 100644
>> --- a/fs/btrfs/ordered-data.c
>> +++ b/fs/btrfs/ordered-data.c
>> @@ -26,6 +26,7 @@
>>  #include "extent_io.h"
>>  #include "disk-io.h"
>>  #include "compression.h"
>> +#include "dedupe.h"
>>
>>  static struct kmem_cache *btrfs_ordered_extent_cache;
>>
>> @@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
>>   */
>>  static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>>  				      u64 start, u64 len, u64 disk_len,
>> -				      int type, int dio, int compress_type)
>> +				      int type, int dio, int compress_type,
>> +				      struct btrfs_dedupe_hash *hash)
>>  {
>>  	struct btrfs_root *root = BTRFS_I(inode)->root;
>>  	struct btrfs_ordered_inode_tree *tree;
>> @@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
>>  	entry->inode = igrab(inode);
>>  	entry->compress_type = compress_type;
>>  	entry->truncated_len = (u64)-1;
>> +	entry->hash = NULL;
>> +	/*
>> +	 * Hash hit must go through dedupe routine at all cost, even dedupe
>> +	 * is disabled. As its delayed ref is already increased.
>> +	 */
>
> Initially, I had a hard time understanding this comment but I'm pretty sure
> I know what you mean.
>
> /*
>  * A hash hit means we have already incremented the extents delayed ref.
>  * We must handle this even if another process raced to turn off dedupe
>  * otherwise we might leak a reference.
>  */
>
> might be better. Hope that helps.
> 	--Mark
>

Same meaning, much better grammar.

Thanks,
Qu
> --
> Mark Fasheh
>
>




* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-01 22:08   ` Mark Fasheh
@ 2016-06-02  1:12     ` Qu Wenruo
  2016-06-03 14:27       ` Josef Bacik
  0 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-06-02  1:12 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba



At 06/02/2016 06:08 AM, Mark Fasheh wrote:
> On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
>> Core implement for inband de-duplication.
>> It reuse the async_cow_start() facility to do the calculate dedupe hash.
>> And use dedupe hash to do inband de-duplication at extent level.
>>
>> The work flow is as below:
>> 1) Run delalloc range for an inode
>> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
>> 3) For hash match(duplicated) case, just increase source extent ref
>>    and insert file extent.
>>    For hash mismatch case, go through the normal cow_file_range()
>>    fallback, and add hash into dedupe_tree.
>>    Compress for hash miss case is not supported yet.
>>
>> Current implement restore all dedupe hash in memory rb-tree, with LRU
>> behavior to control the limit.
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  fs/btrfs/extent-tree.c |  18 ++++
>>  fs/btrfs/inode.c       | 235 ++++++++++++++++++++++++++++++++++++++++++-------
>>  fs/btrfs/relocation.c  |  16 ++++
>>  3 files changed, 236 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 53e1297..dabd721 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -37,6 +37,7 @@
>>  #include "math.h"
>>  #include "sysfs.h"
>>  #include "qgroup.h"
>> +#include "dedupe.h"
>>
>>  #undef SCRAMBLE_DELAYED_REFS
>>
>> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>>
>>  	if (btrfs_delayed_ref_is_head(node)) {
>>  		struct btrfs_delayed_ref_head *head;
>> +		struct btrfs_fs_info *fs_info = root->fs_info;
>> +
>>  		/*
>>  		 * we've hit the end of the chain and we were supposed
>>  		 * to insert this extent into the tree.  But, it got
>> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>>  			btrfs_pin_extent(root, node->bytenr,
>>  					 node->num_bytes, 1);
>>  			if (head->is_data) {
>> +				/*
>> +				 * If insert_reserved is given, it means
>> +				 * a new extent is revered, then deleted
>> +				 * in one tran, and inc/dec get merged to 0.
>> +				 *
>> +				 * In this case, we need to remove its dedup
>> +				 * hash.
>> +				 */
>> +				btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>  				ret = btrfs_del_csums(trans, root,
>>  						      node->bytenr,
>>  						      node->num_bytes);
>> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>>  		btrfs_release_path(path);
>>
>>  		if (is_data) {
>> +			ret = btrfs_dedupe_del(trans, info, bytenr);
>> +			if (ret < 0) {
>> +				btrfs_abort_transaction(trans, extent_root,
>> +							ret);
>
> I don't see why an error here should lead to a readonly fs.
> 	--Mark
>

Because such a deletion error can lead to corruption.

For example, extent A is already in the hash pool.
And when freeing extent A, we need to delete its hash, of course.

But if that deletion fails, the hash is still in the pool even though 
extent A no longer exists in the extent tree.

If we don't abort the transaction here, the next dedupe write may point 
to the non-existent extent A and cause corruption.

Thanks,
Qu
> --
> Mark Fasheh
>
>




* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-02  1:12     ` Qu Wenruo
@ 2016-06-03 14:27       ` Josef Bacik
  2016-06-04 10:26         ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 14:27 UTC (permalink / raw)
  To: Qu Wenruo, Mark Fasheh; +Cc: linux-btrfs, Wang Xiaoguang, David Sterba

On 06/01/2016 09:12 PM, Qu Wenruo wrote:
>
>
> At 06/02/2016 06:08 AM, Mark Fasheh wrote:
>> On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
>>> Core implement for inband de-duplication.
>>> It reuse the async_cow_start() facility to do the calculate dedupe hash.
>>> And use dedupe hash to do inband de-duplication at extent level.
>>>
>>> The work flow is as below:
>>> 1) Run delalloc range for an inode
>>> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
>>> 3) For hash match(duplicated) case, just increase source extent ref
>>>    and insert file extent.
>>>    For hash mismatch case, go through the normal cow_file_range()
>>>    fallback, and add hash into dedupe_tree.
>>>    Compress for hash miss case is not supported yet.
>>>
>>> Current implement restore all dedupe hash in memory rb-tree, with LRU
>>> behavior to control the limit.
>>>
>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/extent-tree.c |  18 ++++
>>>  fs/btrfs/inode.c       | 235
>>> ++++++++++++++++++++++++++++++++++++++++++-------
>>>  fs/btrfs/relocation.c  |  16 ++++
>>>  3 files changed, 236 insertions(+), 33 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 53e1297..dabd721 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -37,6 +37,7 @@
>>>  #include "math.h"
>>>  #include "sysfs.h"
>>>  #include "qgroup.h"
>>> +#include "dedupe.h"
>>>
>>>  #undef SCRAMBLE_DELAYED_REFS
>>>
>>> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>>
>>>      if (btrfs_delayed_ref_is_head(node)) {
>>>          struct btrfs_delayed_ref_head *head;
>>> +        struct btrfs_fs_info *fs_info = root->fs_info;
>>> +
>>>          /*
>>>           * we've hit the end of the chain and we were supposed
>>>           * to insert this extent into the tree.  But, it got
>>> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>>              btrfs_pin_extent(root, node->bytenr,
>>>                       node->num_bytes, 1);
>>>              if (head->is_data) {
>>> +                /*
>>> +                 * If insert_reserved is given, it means
>>> +                 * a new extent is revered, then deleted
>>> +                 * in one tran, and inc/dec get merged to 0.
>>> +                 *
>>> +                 * In this case, we need to remove its dedup
>>> +                 * hash.
>>> +                 */
>>> +                btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>>                  ret = btrfs_del_csums(trans, root,
>>>                                node->bytenr,
>>>                                node->num_bytes);
>>> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
>>> btrfs_trans_handle *trans,
>>>          btrfs_release_path(path);
>>>
>>>          if (is_data) {
>>> +            ret = btrfs_dedupe_del(trans, info, bytenr);
>>> +            if (ret < 0) {
>>> +                btrfs_abort_transaction(trans, extent_root,
>>> +                            ret);
>>
>> I don't see why an error here should lead to a readonly fs.
>>     --Mark
>>
>
> Because such deletion error can lead to corruption.
>
> For example, extent A is already in hash pool.
> And when freeing extent A, we need to delete its hash, of course.
>
> But if such deletion fails, which means the hash is still in the pool,
> even the extent A no longer exists in extent tree.

Except that in in-memory mode it doesn't matter, so don't abort 
if we're in in-memory mode.  Thanks,

Josef



* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-04-01  6:35 ` [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
  2016-06-01 22:08   ` Mark Fasheh
@ 2016-06-03 14:43   ` Josef Bacik
  2016-06-04 10:28     ` Qu Wenruo
  1 sibling, 1 reply; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 14:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 04/01/2016 02:35 AM, Qu Wenruo wrote:
> Core implement for inband de-duplication.
> It reuse the async_cow_start() facility to do the calculate dedupe hash.
> And use dedupe hash to do inband de-duplication at extent level.
>
> The work flow is as below:
> 1) Run delalloc range for an inode
> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
> 3) For hash match(duplicated) case, just increase source extent ref
>    and insert file extent.
>    For hash mismatch case, go through the normal cow_file_range()
>    fallback, and add hash into dedupe_tree.
>    Compress for hash miss case is not supported yet.
>
> Current implement restore all dedupe hash in memory rb-tree, with LRU
> behavior to control the limit.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/extent-tree.c |  18 ++++
>  fs/btrfs/inode.c       | 235 ++++++++++++++++++++++++++++++++++++++++++-------
>  fs/btrfs/relocation.c  |  16 ++++
>  3 files changed, 236 insertions(+), 33 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 53e1297..dabd721 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c


<snip>

> @@ -1076,6 +1135,68 @@ out_unlock:
>  	goto out;
>  }
>
> +static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
> +			    struct async_cow *async_cow, int *num_added)
> +{
> +	struct btrfs_root *root = BTRFS_I(inode)->root;
> +	struct btrfs_fs_info *fs_info = root->fs_info;
> +	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
> +	struct page *locked_page = async_cow->locked_page;
> +	u16 hash_algo;
> +	u64 actual_end;
> +	u64 isize = i_size_read(inode);
> +	u64 dedupe_bs;
> +	u64 cur_offset = start;
> +	int ret = 0;
> +
> +	actual_end = min_t(u64, isize, end + 1);
> +	/* If dedupe is not enabled, don't split extent into dedupe_bs */
> +	if (fs_info->dedupe_enabled && dedupe_info) {
> +		dedupe_bs = dedupe_info->blocksize;
> +		hash_algo = dedupe_info->hash_type;
> +	} else {
> +		dedupe_bs = SZ_128M;
> +		/* Just dummy, to avoid access NULL pointer */
> +		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
> +	}
> +
> +	while (cur_offset < end) {
> +		struct btrfs_dedupe_hash *hash = NULL;
> +		u64 len;
> +
> +		len = min(end + 1 - cur_offset, dedupe_bs);
> +		if (len < dedupe_bs)
> +			goto next;
> +
> +		hash = btrfs_dedupe_alloc_hash(hash_algo);
> +		if (!hash) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
> +		if (ret < 0)
> +			goto out;
> +
> +		ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
> +		if (ret < 0)
> +			goto out;

You leak hash in both of these cases.  Also if btrfs_dedupe_search

<snip>

> +	if (ret < 0)
> +		goto out_qgroup;
> +
> +	/*
> +	 * Hash hit won't create a new data extent, so its reserved quota
> +	 * space won't be freed by new delayed_ref_head.
> +	 * Need to free it here.
> +	 */
> +	if (btrfs_dedupe_hash_hit(hash))
> +		btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
> +
> +	/* Add missed hash into dedupe tree */
> +	if (hash && hash->bytenr == 0) {
> +		hash->bytenr = ins.objectid;
> +		hash->num_bytes = ins.offset;
> +		ret = btrfs_dedupe_add(trans, root->fs_info, hash);

I don't want to flip read-only if we fail this in the in-memory mode. 
Thanks,

Josef


* Re: [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  2016-04-01  6:35 ` [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
@ 2016-06-03 14:54   ` Josef Bacik
  0 siblings, 0 replies; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 14:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 04/01/2016 02:35 AM, Qu Wenruo wrote:
> Since we will introduce a new on-disk based dedupe method, introduce new
> interfaces to resume previous dedupe setup.
>
> And since we introduce a new tree for status, also add disable handler
> for it.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c  | 197 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/dedupe.h  |  13 ++++
>  fs/btrfs/disk-io.c |  25 ++++++-
>  fs/btrfs/disk-io.h |   1 +
>  4 files changed, 232 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index cfb7fea..a274c1c 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -21,6 +21,8 @@
>  #include "transaction.h"
>  #include "delayed-ref.h"
>  #include "qgroup.h"
> +#include "disk-io.h"
> +#include "locking.h"
>
>  struct inmem_hash {
>  	struct rb_node hash_node;
> @@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
>  	return 0;
>  }
>
> +static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
> +			    struct btrfs_dedupe_info *dedupe_info)
> +{
> +	struct btrfs_root *dedupe_root;
> +	struct btrfs_key key;
> +	struct btrfs_path *path;
> +	struct btrfs_dedupe_status_item *status;
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	trans = btrfs_start_transaction(fs_info->tree_root, 2);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		goto out;
> +	}
> +	dedupe_root = btrfs_create_tree(trans, fs_info,
> +				       BTRFS_DEDUPE_TREE_OBJECTID);
> +	if (IS_ERR(dedupe_root)) {
> +		ret = PTR_ERR(dedupe_root);
> +		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +		goto out;
> +	}
> +	dedupe_info->dedupe_root = dedupe_root;
> +
> +	key.objectid = 0;
> +	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
> +	key.offset = 0;
> +
> +	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
> +				      sizeof(*status));
> +	if (ret < 0) {
> +		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +		goto out;
> +	}
> +
> +	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +				struct btrfs_dedupe_status_item);
> +	btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
> +					 dedupe_info->blocksize);
> +	btrfs_set_dedupe_status_limit(path->nodes[0], status,
> +			dedupe_info->limit_nr);
> +	btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
> +			dedupe_info->hash_type);
> +	btrfs_set_dedupe_status_backend(path->nodes[0], status,
> +			dedupe_info->backend);
> +	btrfs_mark_buffer_dirty(path->nodes[0]);
> +out:
> +	btrfs_free_path(path);
> +	if (ret == 0)
> +		btrfs_commit_transaction(trans, fs_info->tree_root);

We still need to call btrfs_end_transaction() to clean things up if we 
aborted.  Thanks,

Josef


* Re: [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search
  2016-04-01  6:35 ` [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
@ 2016-06-03 14:57   ` Josef Bacik
  0 siblings, 0 replies; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 14:57 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 04/01/2016 02:35 AM, Qu Wenruo wrote:
> Now on-disk backend should be able to search hash now.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++++------
>  fs/btrfs/dedupe.h |   1 +
>  2 files changed, 151 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index a274c1c..00f2a01 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -652,6 +652,112 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>  }
>
>  /*
> + * Compare ondisk hash with src.
> + * Return 0 if hash matches.
> + * Return non-zero for hash mismatch
> + *
> + * Caller should ensure the slot contains a valid hash item.
> + */
> +static int memcmp_ondisk_hash(const struct btrfs_key *key,
> +			      struct extent_buffer *node, int slot,
> +			      int hash_len, const u8 *src)
> +{
> +	u64 offset;
> +	int ret;
> +
> +	/* Return value doesn't make sense in this case though */
> +	if (WARN_ON(hash_len <= 8 || key->type != BTRFS_DEDUPE_HASH_ITEM_KEY))

No magic numbers please.

> +		return -EINVAL;
> +
> +	/* compare the hash exlcuding the last 64 bits */
> +	offset = btrfs_item_ptr_offset(node, slot);
> +	ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
> +	if (ret)
> +		return ret;
> +	return memcmp(&key->objectid, src + hash_len - 8, 8);
> +}
> +
> + /*
> + * Return 0 for not found
> + * Return >0 for found and set bytenr_ret
> + * Return <0 for error
> + */
> +static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
> +			      u64 *bytenr_ret, u32 *num_bytes_ret)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_key key;
> +	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
> +	u8 *buf = NULL;
> +	u64 hash_key;
> +	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
> +	int ret;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	buf = kmalloc(hash_len, GFP_NOFS);
> +	if (!buf) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	memcpy(&hash_key, hash + hash_len - 8, 8);
> +	key.objectid = hash_key;
> +	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
> +	key.offset = (u64)-1;
> +
> +	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
> +	if (ret < 0)
> +		goto out;
> +	WARN_ON(ret == 0);
> +	while (1) {
> +		struct extent_buffer *node;
> +		struct btrfs_dedupe_hash_item *hash_item;
> +		int slot;
> +
> +		ret = btrfs_previous_item(dedupe_root, path, hash_key,
> +					  BTRFS_DEDUPE_HASH_ITEM_KEY);
> +		if (ret < 0)
> +			break;
> +		if (ret > 0) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		node = path->nodes[0];
> +		slot = path->slots[0];
> +		btrfs_item_key_to_cpu(node, &key, slot);
> +
> +		/*
> +		 * Type of objectid mismatch means no previous item may
> +		 * hit, exit searching
> +		 */
> +		if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
> +		    memcmp(&key.objectid, &hash_key, 8))
> +			break;
> +		hash_item = btrfs_item_ptr(node, slot,
> +				struct btrfs_dedupe_hash_item);
> +		/*
> +		 * If the hash mismatch, it's still possible that previous item
> +		 * has the desired hash.
> +		 */
> +		if (memcmp_ondisk_hash(&key, node, slot, hash_len, hash))
> +			continue;
> +		/* Found */
> +		ret = 1;
> +		*bytenr_ret = key.offset;
> +		*num_bytes_ret = dedupe_info->blocksize;
> +		break;
> +	}
> +out:
> +	kfree(buf);
> +	btrfs_free_path(path);
> +	return ret;
> +}
> +
> +/*
>   * Caller must ensure the corresponding ref head is not being run.
>   */
>  static struct inmem_hash *
> @@ -681,9 +787,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
>  	return NULL;
>  }
>
> -static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
> -			struct inode *inode, u64 file_pos,
> -			struct btrfs_dedupe_hash *hash)
> +/* Wapper for different backends, caller needs to hold dedupe_info->lock */
> +static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
> +				      u8 *hash, u64 *bytenr_ret,
> +				      u32 *num_bytes_ret)
> +{
> +	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
> +		struct inmem_hash *found_hash;
> +		int ret;
> +
> +		found_hash = inmem_search_hash(dedupe_info, hash);
> +		if (found_hash) {
> +			ret = 1;
> +			*bytenr_ret = found_hash->bytenr;
> +			*num_bytes_ret = found_hash->num_bytes;
> +		} else {
> +			ret = 0;
> +			*bytenr_ret = 0;
> +			*num_bytes_ret = 0;

Why set it to 0 only in the INMEMORY case?  If they need to be zero'ed 
perhaps do it at the start of the helper?  Thanks,

Josef


* Re: [PATCH v10 20/21] btrfs: dedupe: Add support for adding hash for on-disk backend
  2016-04-01  6:35 ` [PATCH v10 20/21] btrfs: dedupe: Add support for adding " Qu Wenruo
@ 2016-06-03 15:03   ` Josef Bacik
  0 siblings, 0 replies; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 15:03 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 04/01/2016 02:35 AM, Qu Wenruo wrote:
> Now on-disk backend can add hash now.
>
> Since all needed on-disk backend functions are added, also allow on-disk
> backend to be used, by changing DEDUPE_BACKEND_COUNT from 1(inmemory
> only) to 2 (inmemory + ondisk).
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/dedupe.h |  3 +-
>  2 files changed, 84 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 7c5d58a..1f0178e 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -437,6 +437,87 @@ out:
>  	return 0;
>  }
>
> +static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
> +				struct btrfs_dedupe_info *dedupe_info,
> +				struct btrfs_path *path, u64 bytenr,
> +				int prepare_del);
> +static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
> +			      u64 *bytenr_ret, u32 *num_bytes_ret);
> +static int ondisk_add(struct btrfs_trans_handle *trans,
> +		      struct btrfs_dedupe_info *dedupe_info,
> +		      struct btrfs_dedupe_hash *hash)
> +{
> +	struct btrfs_path *path;
> +	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
> +	struct btrfs_key key;
> +	u64 hash_offset;
> +	u64 bytenr;
> +	u32 num_bytes;
> +	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
> +	int ret;
> +
> +	if (WARN_ON(hash_len <= 8 ||
> +	    !IS_ALIGNED(hash->bytenr, dedupe_root->sectorsize)))
> +		return -EINVAL;
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	mutex_lock(&dedupe_info->lock);
> +
> +	ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
> +	if (ret < 0)
> +		goto out;
> +	if (ret > 0) {
> +		ret = 0;
> +		goto out;
> +	}
> +	btrfs_release_path(path);
> +
> +	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
> +	if (ret < 0)
> +		goto out;
> +	/* Same hash found, don't re-add to save dedupe tree space */
> +	if (ret > 0) {
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	/* Insert hash->bytenr item */
> +	memcpy(&key.objectid, hash->hash + hash_len - 8, 8);

No magic numbers please.  Thanks,

Josef


* Re: [PATCH v10 00/21] Btrfs dedupe framework
  2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
                   ` (21 preceding siblings ...)
  2016-04-01  8:53 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
@ 2016-06-03 15:20 ` Josef Bacik
  2016-06-04 10:37   ` Qu Wenruo
  22 siblings, 1 reply; 54+ messages in thread
From: Josef Bacik @ 2016-06-03 15:20 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 04/01/2016 02:34 AM, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160401
>
> In this patchset, we're proud to bring a completely new storage backend:
> Khala backend.
>
> With Khala backend, all dedupe hash will be restored in the Khala,
> shared with every Kalai protoss, with unlimited storage and almost zero
> search latency.
> A perfect backend for any Kalai protoss. "My life for Aiur!"
>
> Unfortunately, such backend is not available for human.
>
>
> OK, except the super-fancy and date-related backend, the patchset is
> still a serious patchset.
> In this patchset, we mostly addressed the on-disk format change comment from
> Chris:
> 1) Reduced dedupe hash item and bytenr item.
>    Now dedupe hash item structure size is reduced from 41 bytes
>    (9 bytes hash_item + 32 bytes hash)
>    to 29 bytes (5 bytes hash_item + 24 bytes hash)
>    Without the last patch, it's even less with only 24 bytes
>    (24 bytes hash only).
>    And dedupe bytenr item structure size is reduced from 32 bytes (full
>    hash) to 0.
>
> 2) Hide dedupe ioctls into CONFIG_BTRFS_DEBUG
>    Advised by David, to make btrfs dedupe as an experimental feature for
>    advanced user.
>    This is used to allow this patchset to be merged while still allow us
>    to change ioctl in the further.
>
> 3) Add back missing bug fix patches
>    I just missed 2 bug fix patches in previous iteration.
>    Adding them back.
>
> Now patch 1~11 provide the full backward-compatible in-memory backend.
> And patch 12~14 provide per-file dedupe flag feature.
> Patch 15~20 provide on-disk dedupe backend with persist dedupe state for
> in-memory backend.
> The last patch is just preparation for possible dedupe-compress co-work.
>

You can add

Reviewed-by: Josef Bacik <jbacik@fb.com>

to everything I didn't comment on (and not the ENOSPC one either, but I 
commented on that one last time).

But just because I've reviewed it doesn't mean it's ready to go in. 
Before we are going to take this I want to see the following

1) fsck support for dedupe that verifies the hashes with what is on disk 
so any xfstests we write are sure to catch problems.
2) xfstests.  They need to do the following things for both in memory 
and ondisk
     a) targeted verification.  So write one pattern, write the same
        pattern to a different file and use fiemap to verify they are the
        same.
     b) modify fsstress to have an option to always write the same
        pattern and then run a stress test while balancing.

Once the issues I've highlighted in the other patches are resolved, the 
above xfstests things are merged, and the fsck patches are 
reviewed/accepted, then we can move forward with including dedupe.  Thanks,

Josef


* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-03 14:27       ` Josef Bacik
@ 2016-06-04 10:26         ` Qu Wenruo
  2016-06-06 19:54           ` Mark Fasheh
  0 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-06-04 10:26 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, Mark Fasheh
  Cc: linux-btrfs, Wang Xiaoguang, David Sterba



On 06/03/2016 10:27 PM, Josef Bacik wrote:
> On 06/01/2016 09:12 PM, Qu Wenruo wrote:
>>
>>
>> At 06/02/2016 06:08 AM, Mark Fasheh wrote:
>>> On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
>>>> Core implement for inband de-duplication.
>>>> It reuse the async_cow_start() facility to do the calculate dedupe
>>>> hash.
>>>> And use dedupe hash to do inband de-duplication at extent level.
>>>>
>>>> The work flow is as below:
>>>> 1) Run delalloc range for an inode
>>>> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
>>>> 3) For hash match(duplicated) case, just increase source extent ref
>>>>    and insert file extent.
>>>>    For hash mismatch case, go through the normal cow_file_range()
>>>>    fallback, and add hash into dedupe_tree.
>>>>    Compress for hash miss case is not supported yet.
>>>>
>>>> Current implement restore all dedupe hash in memory rb-tree, with LRU
>>>> behavior to control the limit.
>>>>
>>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>> ---
>>>>  fs/btrfs/extent-tree.c |  18 ++++
>>>>  fs/btrfs/inode.c       | 235
>>>> ++++++++++++++++++++++++++++++++++++++++++-------
>>>>  fs/btrfs/relocation.c  |  16 ++++
>>>>  3 files changed, 236 insertions(+), 33 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 53e1297..dabd721 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -37,6 +37,7 @@
>>>>  #include "math.h"
>>>>  #include "sysfs.h"
>>>>  #include "qgroup.h"
>>>> +#include "dedupe.h"
>>>>
>>>>  #undef SCRAMBLE_DELAYED_REFS
>>>>
>>>> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
>>>> btrfs_trans_handle *trans,
>>>>
>>>>      if (btrfs_delayed_ref_is_head(node)) {
>>>>          struct btrfs_delayed_ref_head *head;
>>>> +        struct btrfs_fs_info *fs_info = root->fs_info;
>>>> +
>>>>          /*
>>>>           * we've hit the end of the chain and we were supposed
>>>>           * to insert this extent into the tree.  But, it got
>>>> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
>>>> btrfs_trans_handle *trans,
>>>>              btrfs_pin_extent(root, node->bytenr,
>>>>                       node->num_bytes, 1);
>>>>              if (head->is_data) {
>>>> +                /*
>>>> +                 * If insert_reserved is given, it means
>>>> +                 * a new extent is revered, then deleted
>>>> +                 * in one tran, and inc/dec get merged to 0.
>>>> +                 *
>>>> +                 * In this case, we need to remove its dedup
>>>> +                 * hash.
>>>> +                 */
>>>> +                btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>>>                  ret = btrfs_del_csums(trans, root,
>>>>                                node->bytenr,
>>>>                                node->num_bytes);
>>>> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
>>>> btrfs_trans_handle *trans,
>>>>          btrfs_release_path(path);
>>>>
>>>>          if (is_data) {
>>>> +            ret = btrfs_dedupe_del(trans, info, bytenr);
>>>> +            if (ret < 0) {
>>>> +                btrfs_abort_transaction(trans, extent_root,
>>>> +                            ret);
>>>
>>> I don't see why an error here should lead to a readonly fs.
>>>     --Mark
>>>
>>
>> Because such deletion error can lead to corruption.
>>
>> For example, extent A is already in hash pool.
>> And when freeing extent A, we need to delete its hash, of course.
>>
>> But if such deletion fails, which means the hash is still in the pool,
>> even the extent A no longer exists in extent tree.
>
> Except if we're in in-memory mode only it doesn't matter, so don't abort
> if we're in in-memory mode.  Thanks,
>
> Josef
>

If we can't ensure a hash is deleted along with its extent, we will screw
up the whole fs, as new writes can point to a non-existent extent.

Although you're right about the in-memory mode here, we won't abort the
transaction there, as inmem_del_hash() never returns an error code; it
always returns 0.

So there is still no need to change anything.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-03 14:43   ` Josef Bacik
@ 2016-06-04 10:28     ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-06-04 10:28 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang



On 06/03/2016 10:43 PM, Josef Bacik wrote:
> On 04/01/2016 02:35 AM, Qu Wenruo wrote:
>> Core implement for inband de-duplication.
>> It reuse the async_cow_start() facility to do the calculate dedupe hash.
>> And use dedupe hash to do inband de-duplication at extent level.
>>
>> The work flow is as below:
>> 1) Run delalloc range for an inode
>> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
>> 3) For hash match(duplicated) case, just increase source extent ref
>>    and insert file extent.
>>    For hash mismatch case, go through the normal cow_file_range()
>>    fallback, and add hash into dedupe_tree.
>>    Compress for hash miss case is not supported yet.
>>
>> Current implement restore all dedupe hash in memory rb-tree, with LRU
>> behavior to control the limit.
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  fs/btrfs/extent-tree.c |  18 ++++
>>  fs/btrfs/inode.c       | 235
>> ++++++++++++++++++++++++++++++++++++++++++-------
>>  fs/btrfs/relocation.c  |  16 ++++
>>  3 files changed, 236 insertions(+), 33 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 53e1297..dabd721 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>
>
> <snip>
>
>> @@ -1076,6 +1135,68 @@ out_unlock:
>>      goto out;
>>  }
>>
>> +static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
>> +                struct async_cow *async_cow, int *num_added)
>> +{
>> +    struct btrfs_root *root = BTRFS_I(inode)->root;
>> +    struct btrfs_fs_info *fs_info = root->fs_info;
>> +    struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
>> +    struct page *locked_page = async_cow->locked_page;
>> +    u16 hash_algo;
>> +    u64 actual_end;
>> +    u64 isize = i_size_read(inode);
>> +    u64 dedupe_bs;
>> +    u64 cur_offset = start;
>> +    int ret = 0;
>> +
>> +    actual_end = min_t(u64, isize, end + 1);
>> +    /* If dedupe is not enabled, don't split extent into dedupe_bs */
>> +    if (fs_info->dedupe_enabled && dedupe_info) {
>> +        dedupe_bs = dedupe_info->blocksize;
>> +        hash_algo = dedupe_info->hash_type;
>> +    } else {
>> +        dedupe_bs = SZ_128M;
>> +        /* Just dummy, to avoid access NULL pointer */
>> +        hash_algo = BTRFS_DEDUPE_HASH_SHA256;
>> +    }
>> +
>> +    while (cur_offset < end) {
>> +        struct btrfs_dedupe_hash *hash = NULL;
>> +        u64 len;
>> +
>> +        len = min(end + 1 - cur_offset, dedupe_bs);
>> +        if (len < dedupe_bs)
>> +            goto next;
>> +
>> +        hash = btrfs_dedupe_alloc_hash(hash_algo);
>> +        if (!hash) {
>> +            ret = -ENOMEM;
>> +            goto out;
>> +        }
>> +        ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
>> +        if (ret < 0)
>> +            goto out;
>> +
>> +        ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
>> +        if (ret < 0)
>> +            goto out;
>
> You leak hash in both of these cases.  Also if btrfs_dedup_search
>
> <snip>
>
>> +    if (ret < 0)
>> +        goto out_qgroup;
>> +
>> +    /*
>> +     * Hash hit won't create a new data extent, so its reserved quota
>> +     * space won't be freed by new delayed_ref_head.
>> +     * Need to free it here.
>> +     */
>> +    if (btrfs_dedupe_hash_hit(hash))
>> +        btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
>> +
>> +    /* Add missed hash into dedupe tree */
>> +    if (hash && hash->bytenr == 0) {
>> +        hash->bytenr = ins.objectid;
>> +        hash->num_bytes = ins.offset;
>> +        ret = btrfs_dedupe_add(trans, root->fs_info, hash);
>
> I don't want to flip read only if we fail this in the in-memory mode.
> Thanks,
>
> Josef

Right, unlike the btrfs_dedupe_del() case, nothing wrong will happen if
we fail to insert a hash; we would just slightly reduce the dedupe rate.

I'm OK with skipping the dedupe_add() error.

Thanks,
Qu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 00/21] Btrfs dedupe framework
  2016-06-03 15:20 ` [PATCH v10 00/21] Btrfs dedupe framework Josef Bacik
@ 2016-06-04 10:37   ` Qu Wenruo
  0 siblings, 0 replies; 54+ messages in thread
From: Qu Wenruo @ 2016-06-04 10:37 UTC (permalink / raw)
  To: Josef Bacik, Qu Wenruo, linux-btrfs



On 06/03/2016 11:20 PM, Josef Bacik wrote:
> On 04/01/2016 02:34 AM, Qu Wenruo wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux.git wang_dedupe_20160401
>>
>> In this patchset, we're proud to bring a completely new storage backend:
>> Khala backend.
>>
>> With Khala backend, all dedupe hash will be restored in the Khala,
>> shared with every Kalai protoss, with unlimited storage and almost zero
>> search latency.
>> A perfect backend for any Kalai protoss. "My life for Aiur!"
>>
>> Unfortunately, such backend is not available for human.
>>
>>
>> OK, except the super-fancy and date-related backend, the patchset is
>> still a serious patchset.
>> In this patchset, we mostly addressed the on-disk format change
>> comment from
>> Chris:
>> 1) Reduced dedupe hash item and bytenr item.
>>    Now dedupe hash item structure size is reduced from 41 bytes
>>    (9 bytes hash_item + 32 bytes hash)
>>    to 29 bytes (5 bytes hash_item + 24 bytes hash)
>>    Without the last patch, it's even less with only 24 bytes
>>    (24 bytes hash only).
>>    And dedupe bytenr item structure size is reduced from 32 bytes (full
>>    hash) to 0.
>>
>> 2) Hide dedupe ioctls into CONFIG_BTRFS_DEBUG
>>    Advised by David, to make btrfs dedupe as an experimental feature for
>>    advanced user.
>>    This is used to allow this patchset to be merged while still allow us
>>    to change ioctl in the further.
>>
>> 3) Add back missing bug fix patches
>>    I just missed 2 bug fix patches in previous iteration.
>>    Adding them back.
>>
>> Now patch 1~11 provide the full backward-compatible in-memory backend.
>> And patch 12~14 provide per-file dedupe flag feature.
>> Patch 15~20 provide on-disk dedupe backend with persist dedupe state for
>> in-memory backend.
>> The last patch is just preparation for possible dedupe-compress co-work.
>>
>
> You can add
>
> Reviewed-by: Josef Bacik <jbacik@fb.com>
>
> to everything I didn't comment on (and not the ENOSPC one either, but I
> commented on that one last time).

Thanks for the review.

All your comments will be addressed in the next version, except the ones I commented on.

>
> But just because I've reviewed it doesn't mean it's ready to go in.
> Before we are going to take this I want to see the following

Right, I won't rush to merge it, and I'm pretty sure you would like to
review the incoming ENOSPC fix further, as the root-cause fix will be a
little complicated and will affect a lot of common routines.

>
> 1) fsck support for dedupe that verifies the hashes with what is on disk

Nice advice; if the hash pool is screwed up, the whole fs will be screwed up.

But that's for the on-disk backend, and unfortunately the on-disk backend
will be excluded from the next version.

The on-disk backend will only be re-introduced after the in-memory-only
patchset.

> so any xfstests we write are sure to catch problems.


> 2) xfstests.  They need to do the following things for both in memory
> and ondisk
>     a) targeted verification.  So write one pattern, write the same
>        pattern to a different file and use fiemap to verify they are the
>        same.
Already in the previous xfstests patchset.

But it needs a little modification: as we may merge the in-memory and
on-disk backends in different kernel merge windows, the test cases may
need to be split per backend.

I'll update the xfstests with the V11 patchset to do in-memory-only checks.

>     b) modify fsstress to have an option to always write the same
>        pattern and then run a stress test while balancing.

We already have such test cases, and even the current fsstress pattern
is good enough to trigger some bugs in our test cases.

But it's still a good idea to make fsstress reproduce dedupe bugs more
precisely.

Thanks,
Qu
>
> Once the issues I've hilighted in the other patches are resolved and the
> above xfstests things are merged and the fsck patches are
> reviewed/accepted then we can move forward with including dedup.  Thanks,
>
> Josef
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-04 10:26         ` Qu Wenruo
@ 2016-06-06 19:54           ` Mark Fasheh
  2016-06-07  0:42             ` Qu Wenruo
  0 siblings, 1 reply; 54+ messages in thread
From: Mark Fasheh @ 2016-06-06 19:54 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Josef Bacik, Qu Wenruo, linux-btrfs, Wang Xiaoguang, David Sterba

On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote:
> 
> 
> On 06/03/2016 10:27 PM, Josef Bacik wrote:
> >On 06/01/2016 09:12 PM, Qu Wenruo wrote:
> >>
> >>
> >>At 06/02/2016 06:08 AM, Mark Fasheh wrote:
> >>>On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
> >>>>Core implement for inband de-duplication.
> >>>>It reuse the async_cow_start() facility to do the calculate dedupe
> >>>>hash.
> >>>>And use dedupe hash to do inband de-duplication at extent level.
> >>>>
> >>>>The work flow is as below:
> >>>>1) Run delalloc range for an inode
> >>>>2) Calculate hash for the delalloc range at the unit of dedupe_bs
> >>>>3) For hash match(duplicated) case, just increase source extent ref
> >>>>   and insert file extent.
> >>>>   For hash mismatch case, go through the normal cow_file_range()
> >>>>   fallback, and add hash into dedupe_tree.
> >>>>   Compress for hash miss case is not supported yet.
> >>>>
> >>>>Current implement restore all dedupe hash in memory rb-tree, with LRU
> >>>>behavior to control the limit.
> >>>>
> >>>>Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> >>>>Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >>>>---
> >>>> fs/btrfs/extent-tree.c |  18 ++++
> >>>> fs/btrfs/inode.c       | 235
> >>>>++++++++++++++++++++++++++++++++++++++++++-------
> >>>> fs/btrfs/relocation.c  |  16 ++++
> >>>> 3 files changed, 236 insertions(+), 33 deletions(-)
> >>>>
> >>>>diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>>>index 53e1297..dabd721 100644
> >>>>--- a/fs/btrfs/extent-tree.c
> >>>>+++ b/fs/btrfs/extent-tree.c
> >>>>@@ -37,6 +37,7 @@
> >>>> #include "math.h"
> >>>> #include "sysfs.h"
> >>>> #include "qgroup.h"
> >>>>+#include "dedupe.h"
> >>>>
> >>>> #undef SCRAMBLE_DELAYED_REFS
> >>>>
> >>>>@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
> >>>>btrfs_trans_handle *trans,
> >>>>
> >>>>     if (btrfs_delayed_ref_is_head(node)) {
> >>>>         struct btrfs_delayed_ref_head *head;
> >>>>+        struct btrfs_fs_info *fs_info = root->fs_info;
> >>>>+
> >>>>         /*
> >>>>          * we've hit the end of the chain and we were supposed
> >>>>          * to insert this extent into the tree.  But, it got
> >>>>@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
> >>>>btrfs_trans_handle *trans,
> >>>>             btrfs_pin_extent(root, node->bytenr,
> >>>>                      node->num_bytes, 1);
> >>>>             if (head->is_data) {
> >>>>+                /*
> >>>>+                 * If insert_reserved is given, it means
> >>>>+                 * a new extent is revered, then deleted
> >>>>+                 * in one tran, and inc/dec get merged to 0.
> >>>>+                 *
> >>>>+                 * In this case, we need to remove its dedup
> >>>>+                 * hash.
> >>>>+                 */
> >>>>+                btrfs_dedupe_del(trans, fs_info, node->bytenr);
> >>>>                 ret = btrfs_del_csums(trans, root,
> >>>>                               node->bytenr,
> >>>>                               node->num_bytes);
> >>>>@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
> >>>>btrfs_trans_handle *trans,
> >>>>         btrfs_release_path(path);
> >>>>
> >>>>         if (is_data) {
> >>>>+            ret = btrfs_dedupe_del(trans, info, bytenr);
> >>>>+            if (ret < 0) {
> >>>>+                btrfs_abort_transaction(trans, extent_root,
> >>>>+                            ret);
> >>>
> >>>I don't see why an error here should lead to a readonly fs.
> >>>    --Mark
> >>>
> >>
> >>Because such deletion error can lead to corruption.
> >>
> >>For example, extent A is already in hash pool.
> >>And when freeing extent A, we need to delete its hash, of course.
> >>
> >>But if such deletion fails, which means the hash is still in the pool,
> >>even the extent A no longer exists in extent tree.
> >
> >Except if we're in in-memory mode only it doesn't matter, so don't abort
> >if we're in in-memory mode.  Thanks,
> >
> >Josef
> >
> 
> If we can't ensure a hash is delete along with the extent, we will
> screw up the whole fs, as new write can points to non-exist extent.
> 
> Although you're right with in-memory mode here, we won't abort
> trans, as inmem_del_hash() won't return error code. It will always
> return 0.

Until a third party comes along and changes it to return an error code,
and neither you nor I are there to remind them to fix this check (or we
have simply forgotten).


> So still, no need to change anyway.

Personally I'd call this 'defensive coding' and do a check for in-memory
only before our abort_trans().  This would have no effect on our running
code but avoids the problem I stated above.  Alternatively, you could
clearly comment the exception. I don't like leaving it as-is for the reason 
I stated above.

Thanks,
	--Mark

--
Mark Fasheh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-06 19:54           ` Mark Fasheh
@ 2016-06-07  0:42             ` Qu Wenruo
  2016-06-07 16:55               ` Mark Fasheh
  0 siblings, 1 reply; 54+ messages in thread
From: Qu Wenruo @ 2016-06-07  0:42 UTC (permalink / raw)
  To: Mark Fasheh, Qu Wenruo
  Cc: Josef Bacik, linux-btrfs, Wang Xiaoguang, David Sterba



At 06/07/2016 03:54 AM, Mark Fasheh wrote:
> On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote:
>>
>>
>> On 06/03/2016 10:27 PM, Josef Bacik wrote:
>>> On 06/01/2016 09:12 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> At 06/02/2016 06:08 AM, Mark Fasheh wrote:
>>>>> On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
>>>>>> Core implement for inband de-duplication.
>>>>>> It reuse the async_cow_start() facility to do the calculate dedupe
>>>>>> hash.
>>>>>> And use dedupe hash to do inband de-duplication at extent level.
>>>>>>
>>>>>> The work flow is as below:
>>>>>> 1) Run delalloc range for an inode
>>>>>> 2) Calculate hash for the delalloc range at the unit of dedupe_bs
>>>>>> 3) For hash match(duplicated) case, just increase source extent ref
>>>>>>   and insert file extent.
>>>>>>   For hash mismatch case, go through the normal cow_file_range()
>>>>>>   fallback, and add hash into dedupe_tree.
>>>>>>   Compress for hash miss case is not supported yet.
>>>>>>
>>>>>> Current implement restore all dedupe hash in memory rb-tree, with LRU
>>>>>> behavior to control the limit.
>>>>>>
>>>>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>>>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>>> ---
>>>>>> fs/btrfs/extent-tree.c |  18 ++++
>>>>>> fs/btrfs/inode.c       | 235
>>>>>> ++++++++++++++++++++++++++++++++++++++++++-------
>>>>>> fs/btrfs/relocation.c  |  16 ++++
>>>>>> 3 files changed, 236 insertions(+), 33 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>>>> index 53e1297..dabd721 100644
>>>>>> --- a/fs/btrfs/extent-tree.c
>>>>>> +++ b/fs/btrfs/extent-tree.c
>>>>>> @@ -37,6 +37,7 @@
>>>>>> #include "math.h"
>>>>>> #include "sysfs.h"
>>>>>> #include "qgroup.h"
>>>>>> +#include "dedupe.h"
>>>>>>
>>>>>> #undef SCRAMBLE_DELAYED_REFS
>>>>>>
>>>>>> @@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>
>>>>>>     if (btrfs_delayed_ref_is_head(node)) {
>>>>>>         struct btrfs_delayed_ref_head *head;
>>>>>> +        struct btrfs_fs_info *fs_info = root->fs_info;
>>>>>> +
>>>>>>         /*
>>>>>>          * we've hit the end of the chain and we were supposed
>>>>>>          * to insert this extent into the tree.  But, it got
>>>>>> @@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>             btrfs_pin_extent(root, node->bytenr,
>>>>>>                      node->num_bytes, 1);
>>>>>>             if (head->is_data) {
>>>>>> +                /*
>>>>>> +                 * If insert_reserved is given, it means
>>>>>> +                 * a new extent is revered, then deleted
>>>>>> +                 * in one tran, and inc/dec get merged to 0.
>>>>>> +                 *
>>>>>> +                 * In this case, we need to remove its dedup
>>>>>> +                 * hash.
>>>>>> +                 */
>>>>>> +                btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>>>>>                 ret = btrfs_del_csums(trans, root,
>>>>>>                               node->bytenr,
>>>>>>                               node->num_bytes);
>>>>>> @@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
>>>>>> btrfs_trans_handle *trans,
>>>>>>         btrfs_release_path(path);
>>>>>>
>>>>>>         if (is_data) {
>>>>>> +            ret = btrfs_dedupe_del(trans, info, bytenr);
>>>>>> +            if (ret < 0) {
>>>>>> +                btrfs_abort_transaction(trans, extent_root,
>>>>>> +                            ret);
>>>>>
>>>>> I don't see why an error here should lead to a readonly fs.
>>>>>    --Mark
>>>>>
>>>>
>>>> Because such deletion error can lead to corruption.
>>>>
>>>> For example, extent A is already in hash pool.
>>>> And when freeing extent A, we need to delete its hash, of course.
>>>>
>>>> But if such deletion fails, which means the hash is still in the pool,
>>>> even the extent A no longer exists in extent tree.
>>>
>>> Except if we're in in-memory mode only it doesn't matter, so don't abort
>>> if we're in in-memory mode.  Thanks,
>>>
>>> Josef
>>>
>>
>> If we can't ensure a hash is delete along with the extent, we will
>> screw up the whole fs, as new write can points to non-exist extent.
>>
>> Although you're right with in-memory mode here, we won't abort
>> trans, as inmem_del_hash() won't return error code. It will always
>> return 0.
>
> Until a third party comes along and changes it to return an error code and
> neither you or I are there to remind them to fix this check (or have simply
> forgotten).
>
>
>> So still, no need to change anyway.
>
> Personally I'd call this 'defensive coding' and do a check for in-memory
> only before our abort_trans().  This would have no effect on our running
> code but avoids the problem I stated above.  Alternatively, you could
> clearly comment the exception. I don't like leaving it as-is for the reason
> I stated above.
>
> Thanks,
> 	--Mark
>

The whole 'defensive coding' is here just because the V10 patchset comes
with the full functionality, including both backends and other things
that land later.

A lot of code like this is here because we know what will be added later.
So it's true that it looks ridiculous until one knows an on-disk backend
is to be added.

I'm OK to move the check into btrfs_dedupe_del(), but this makes me
curious about the correct coding style for adding new functions.


If we have a clear view of the future functionality, should we leave
such interfaces in place for it?
Or should we add them only when adding the new functions?

And what level of integration should be done inside the btrfs code?
Should any caller of an exported btrfs function know all the possible
return values and the conditions that produce them?
Or should a caller only need to check the definition, without digging
into the implementation?
(yes, the isolation vs. integration thing)

If we have a clear idea on this, we could avoid such embarrassing
situations.

Thanks,
Qu

> --
> Mark Fasheh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
  2016-06-07  0:42             ` Qu Wenruo
@ 2016-06-07 16:55               ` Mark Fasheh
  0 siblings, 0 replies; 54+ messages in thread
From: Mark Fasheh @ 2016-06-07 16:55 UTC (permalink / raw)
  To: Qu Wenruo
  Cc: Qu Wenruo, Josef Bacik, linux-btrfs, Wang Xiaoguang, David Sterba

On Tue, Jun 07, 2016 at 08:42:46AM +0800, Qu Wenruo wrote:
> 
> 
> At 06/07/2016 03:54 AM, Mark Fasheh wrote:
> >On Sat, Jun 04, 2016 at 06:26:39PM +0800, Qu Wenruo wrote:
> >>
> >>
> >>On 06/03/2016 10:27 PM, Josef Bacik wrote:
> >>>On 06/01/2016 09:12 PM, Qu Wenruo wrote:
> >>>>
> >>>>
> >>>>At 06/02/2016 06:08 AM, Mark Fasheh wrote:
> >>>>>On Fri, Apr 01, 2016 at 02:35:00PM +0800, Qu Wenruo wrote:
> >>>>>>Core implement for inband de-duplication.
> >>>>>>It reuse the async_cow_start() facility to do the calculate dedupe
> >>>>>>hash.
> >>>>>>And use dedupe hash to do inband de-duplication at extent level.
> >>>>>>
> >>>>>>The work flow is as below:
> >>>>>>1) Run delalloc range for an inode
> >>>>>>2) Calculate hash for the delalloc range at the unit of dedupe_bs
> >>>>>>3) For hash match(duplicated) case, just increase source extent ref
> >>>>>>  and insert file extent.
> >>>>>>  For hash mismatch case, go through the normal cow_file_range()
> >>>>>>  fallback, and add hash into dedupe_tree.
> >>>>>>  Compress for hash miss case is not supported yet.
> >>>>>>
> >>>>>>Current implement restore all dedupe hash in memory rb-tree, with LRU
> >>>>>>behavior to control the limit.
> >>>>>>
> >>>>>>Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> >>>>>>Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> >>>>>>---
> >>>>>>fs/btrfs/extent-tree.c |  18 ++++
> >>>>>>fs/btrfs/inode.c       | 235
> >>>>>>++++++++++++++++++++++++++++++++++++++++++-------
> >>>>>>fs/btrfs/relocation.c  |  16 ++++
> >>>>>>3 files changed, 236 insertions(+), 33 deletions(-)
> >>>>>>
> >>>>>>diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>>>>>index 53e1297..dabd721 100644
> >>>>>>--- a/fs/btrfs/extent-tree.c
> >>>>>>+++ b/fs/btrfs/extent-tree.c
> >>>>>>@@ -37,6 +37,7 @@
> >>>>>>#include "math.h"
> >>>>>>#include "sysfs.h"
> >>>>>>#include "qgroup.h"
> >>>>>>+#include "dedupe.h"
> >>>>>>
> >>>>>>#undef SCRAMBLE_DELAYED_REFS
> >>>>>>
> >>>>>>@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct
> >>>>>>btrfs_trans_handle *trans,
> >>>>>>
> >>>>>>    if (btrfs_delayed_ref_is_head(node)) {
> >>>>>>        struct btrfs_delayed_ref_head *head;
> >>>>>>+        struct btrfs_fs_info *fs_info = root->fs_info;
> >>>>>>+
> >>>>>>        /*
> >>>>>>         * we've hit the end of the chain and we were supposed
> >>>>>>         * to insert this extent into the tree.  But, it got
> >>>>>>@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct
> >>>>>>btrfs_trans_handle *trans,
> >>>>>>            btrfs_pin_extent(root, node->bytenr,
> >>>>>>                     node->num_bytes, 1);
> >>>>>>            if (head->is_data) {
> >>>>>>+                /*
> >>>>>>+                 * If insert_reserved is given, it means
> >>>>>>+                 * a new extent is revered, then deleted
> >>>>>>+                 * in one tran, and inc/dec get merged to 0.
> >>>>>>+                 *
> >>>>>>+                 * In this case, we need to remove its dedup
> >>>>>>+                 * hash.
> >>>>>>+                 */
> >>>>>>+                btrfs_dedupe_del(trans, fs_info, node->bytenr);
> >>>>>>                ret = btrfs_del_csums(trans, root,
> >>>>>>                              node->bytenr,
> >>>>>>                              node->num_bytes);
> >>>>>>@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct
> >>>>>>btrfs_trans_handle *trans,
> >>>>>>        btrfs_release_path(path);
> >>>>>>
> >>>>>>        if (is_data) {
> >>>>>>+            ret = btrfs_dedupe_del(trans, info, bytenr);
> >>>>>>+            if (ret < 0) {
> >>>>>>+                btrfs_abort_transaction(trans, extent_root,
> >>>>>>+                            ret);
> >>>>>
> >>>>>I don't see why an error here should lead to a readonly fs.
> >>>>>   --Mark
> >>>>>
> >>>>
> >>>>Because such deletion error can lead to corruption.
> >>>>
> >>>>For example, extent A is already in hash pool.
> >>>>And when freeing extent A, we need to delete its hash, of course.
> >>>>
> >>>>But if such deletion fails, which means the hash is still in the pool,
> >>>>even the extent A no longer exists in extent tree.
> >>>
> >>>Except if we're in in-memory mode only it doesn't matter, so don't abort
> >>>if we're in in-memory mode.  Thanks,
> >>>
> >>>Josef
> >>>
> >>
> >>If we can't ensure a hash is delete along with the extent, we will
> >>screw up the whole fs, as new write can points to non-exist extent.
> >>
> >>Although you're right with in-memory mode here, we won't abort
> >>trans, as inmem_del_hash() won't return error code. It will always
> >>return 0.
> >
> >Until a third party comes along and changes it to return an error code and
> >neither you or I are there to remind them to fix this check (or have simply
> >forgotten).
> >
> >
> >>So still, no need to change anyway.
> >
> >Personally I'd call this 'defensive coding' and do a check for in-memory
> >only before our abort_trans().  This would have no effect on our running
> >code but avoids the problem I stated above.  Alternatively, you could
> >clearly comment the exception. I don't like leaving it as-is for the reason
> >I stated above.
> >
> >Thanks,
> >	--Mark
> >
> 
> The whole 'defensive coding' is here just because the V10 patchset
> comes with full function, including 2 backends and other things
> later.
> 
> A lot of code like this is here because we know what will be added later.
> So this is true that it looks ridiculous until one knows there is an
> on-disk backend to be added.

Yeah I get that - btw, my initial question was about either backend so
there's something to be said about the added confusion this brings.


> I'm OK to move the check to btrfs_dedupe_del(), but this makes me
> curious about the correct coding style for adding new function.
> 
> 
> If we have a clear view of the future functions , should we leave
> such interfaces for them?
> Or add them when adding the new functions?

The best answer is 'whatever makes the patch most readable'. But that's
vague and not useful.

Personally I find it easier when changes are self contained and build on
each other in sequence. So to take this as an example, I'd have the
'go-readonly' part implemented at the point where the disk backend actually 
needs that functionality.

This can happen inside of a function too - if there's some condition foo
which has to be handled but is not introduced until a later patch, I prefer
that the code to handle condition 'foo' be in the same patch, even if the
function it goes into was initially created earlier.  The alternative has
the reviewer flipping between patches.

The other btrfs devs might have different ideas.


> And what level of integration should be done inside btrfs codes?
> Should any caller of an exported btrfs function knows all possible
> return value and its condition?

Yes, the caller of a function should understand and be able to handle all
errors it might receive from it.


> Or caller only needs to check the definition without digging into
> the implementation?
> (yes, isolation vs integrations things)
> 
> If we have a clear idea on this, we could avoid such embarrassing situation.

Ideally we're doing both. I'll be honest, very few functions in btrfs are
commented in a way that makes this easy, so I often feel like I have to dig
through the code to make sure I don't blow up at the wrong error.

My suggestion is that we start documenting important functions in a
standardized and obvious way. Then, when a patch comes along which changes
return codes, we can ask that they update the comment. Kernel-doc
(Documentation/kernel-doc-nano-HOWTO.txt) makes this trivial (you can cut
and paste the example comment and just fill it in).

That wouldn't actually solve the problem but it would be a good start and
allow us to build some confidence in what we're calling.

Thanks,
	--Mark

--
Mark Fasheh


end of thread, other threads:[~2016-06-07 16:55 UTC | newest]

Thread overview: 54+ messages
2016-04-01  6:34 [PATCH v10 00/21] Btrfs dedupe framework Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
2016-04-01  9:59   ` kbuild test robot
2016-05-11  0:00     ` Mark Fasheh
2016-05-11  0:21       ` Qu Wenruo
2016-05-11  2:24         ` Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
2016-06-01 19:37   ` Mark Fasheh
2016-06-02  0:49     ` Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
2016-06-01 19:40   ` Mark Fasheh
2016-06-02  1:01     ` Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
2016-04-01  6:34 ` [PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
2016-05-17 13:15   ` David Sterba
2016-04-01  6:34 ` [PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
2016-06-01 22:06   ` Mark Fasheh
2016-06-02  1:08     ` Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
2016-06-01 22:08   ` Mark Fasheh
2016-06-02  1:12     ` Qu Wenruo
2016-06-03 14:27       ` Josef Bacik
2016-06-04 10:26         ` Qu Wenruo
2016-06-06 19:54           ` Mark Fasheh
2016-06-07  0:42             ` Qu Wenruo
2016-06-07 16:55               ` Mark Fasheh
2016-06-03 14:43   ` Josef Bacik
2016-06-04 10:28     ` Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space Qu Wenruo
2016-05-17 13:20   ` David Sterba
2016-05-18  0:57     ` Qu Wenruo
2016-06-01 22:14     ` Mark Fasheh
2016-04-01  6:35 ` [PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
2016-04-27  1:29   ` Qu Wenruo
2016-05-17 13:14     ` David Sterba
2016-05-18  0:54       ` Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
2016-06-03 14:54   ` Josef Bacik
2016-04-01  6:35 ` [PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
2016-06-03 14:57   ` Josef Bacik
2016-04-01  6:35 ` [PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
2016-04-01  6:35 ` [PATCH v10 20/21] btrfs: dedupe: Add support for adding " Qu Wenruo
2016-06-03 15:03   ` Josef Bacik
2016-04-01  6:35 ` [PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
2016-04-01  8:53 ` [PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
2016-06-03 15:20 ` [PATCH v10 00/21] Btrfs dedupe framework Josef Bacik
2016-06-04 10:37   ` Qu Wenruo
