* [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160322 

This updated version of inband de-duplication has the following features:
1) ONE unified dedupe framework
   Most of its code is contained in dedupe.c, exporting only a minimal
   interface to its callers.
   Both reviewers and future developers benefit from the unified
   framework.

2) TWO different back-ends with different trade-offs
   One is an improved version of the previous Fujitsu in-memory only
   dedupe backend.
   The other is an enhanced version of Liu Bo's dedupe implementation,
   with its tree structure changed to handle bytenr -> hash lookups for
   hash deletion, without the hideous data backref hack.

3) Support compression with dedupe
   Dedupe now works together with compression.
   This means a dedupe miss can still be compressed, and a dedupe hit
   can reuse compressed file extents.

4) Ioctl interface with persistent dedupe status
   As advised by David, we now use an ioctl to enable/disable dedupe.

   The dedupe status is recorded in the first item of the dedupe tree,
   so, just like quota, once enabled no extra ioctl is needed on the
   next mount.

5) Ability to disable dedupe for given dirs/files
   It works just like the compression property method, by adding a new
   xattr.

TODO:
1) Add extent-by-extent comparison to allow faster but collision-prone
   hash algorithms
   The current SHA256 hash is quite slow, and on some old (~5 years old)
   CPUs, the CPU can even become the bottleneck rather than the IO.
   A faster hash will inevitably produce collisions, so we need byte
   comparison of the extents before introducing a new dedupe hash
   algorithm.

2) Misc end-user related helpers
   For example, a handy and easy-to-implement dedupe rate report, and a
   method to query the in-memory hash size for users who want to use the
   'dedupe enable -l' option but don't know how much RAM to budget for
   it.
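   (A rough back-of-envelope estimate, not from this patchset: each
   in-memory entry is a struct inmem_hash holding two rb_nodes, a
   list_head, bytenr/num_bytes and a 32-byte SHA256 digest, i.e. roughly
   110-130 bytes per hash on a 64-bit kernel, so the default limit of
   32k hashes costs on the order of 4 MiB.)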

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug.
  Add handling for the multiple-hashes-on-one-bytenr corner case, to fix
  a transaction abort.
  Increase dedup rate by enhancing the delayed ref handler for both
  backends.
  Move dedup_add() to run_delayed_ref() time, to fix a transaction abort.
  Increase the dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify the return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl(), to allow progs to distinguish unsupported commands
  from wrong parameters.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.

Qu Wenruo (12):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  btrfs: dedupe: Add support for on-disk hash search
  btrfs: dedupe: Add support to delete hash for on-disk backend
  btrfs: dedupe: Add support for adding hash for on-disk backend
  btrfs: Fix a memory leak in inband dedupe hash
  btrfs: dedupe: Fix metadata balance error when dedupe is enabled
  btrfs: dedupe: Preparation for compress-dedupe co-work
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Fix a space cache delalloc bytes underflow bug

Wang Xiaoguang (15):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: dedupe: Add ioctl for inband deduplication
  btrfs: dedupe: add an inode nodedupe flag
  btrfs: dedupe: add a property handler for online dedupe
  btrfs: dedupe: add per-file online dedupe control
  btrfs: try more times to alloc metadata reserve space
  btrfs: dedupe: Fix a bug when running inband dedupe with balance
  btrfs: dedupe: Avoid submit IO for hash hit extent
  btrfs: dedupe: Add support for compression and dedupe

 fs/btrfs/Makefile            |    2 +-
 fs/btrfs/ctree.h             |   78 ++-
 fs/btrfs/dedupe.c            | 1188 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h            |  181 +++++++
 fs/btrfs/delayed-ref.c       |   30 +-
 fs/btrfs/delayed-ref.h       |    8 +
 fs/btrfs/disk-io.c           |   28 +-
 fs/btrfs/disk-io.h           |    1 +
 fs/btrfs/extent-tree.c       |   49 +-
 fs/btrfs/inode.c             |  338 ++++++++++--
 fs/btrfs/ioctl.c             |   70 ++-
 fs/btrfs/ordered-data.c      |   49 +-
 fs/btrfs/ordered-data.h      |   16 +-
 fs/btrfs/props.c             |   41 ++
 fs/btrfs/relocation.c        |   41 +-
 fs/btrfs/sysfs.c             |    2 +
 include/trace/events/btrfs.h |    3 +-
 include/uapi/linux/btrfs.h   |   25 +-
 18 files changed, 2073 insertions(+), 77 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c
 create mode 100644 fs/btrfs/dedupe.h

-- 
2.7.3





* [PATCH v8 01/27] btrfs: dedupe: Introduce dedupe framework and its header
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the header for the btrfs online (write time) de-duplication
framework.

The new de-duplication framework is going to support 2 different dedupe
backends and 1 dedupe hash algorithm.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h   |   5 ++
 fs/btrfs/dedupe.h  | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/disk-io.c |   1 +
 3 files changed, 138 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..022ab61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1860,6 +1860,11 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	/* Inband de-duplication related structures */
+	unsigned int dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 0000000..916039c
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,132 @@
+/*
+ * Copyright (C) 2015 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include <linux/btrfs.h>
+#include <linux/wait.h>
+#include <crypto/hash.h>
+
+/*
+ * Dedupe storage backends
+ * The on-disk backend provides persistent storage but has larger overhead
+ * The in-memory backend is fast but loses all its hashes on umount
+ */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY		0
+#define BTRFS_DEDUPE_BACKEND_ONDISK		1
+#define BTRFS_DEDUPE_BACKEND_COUNT		2
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256		0
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedup.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_type;
+
+	struct crypto_shash *dedupe_driver;
+	struct mutex lock;
+
+	/* following members are only used in in-memory dedupe mode */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initial inband dedupe info
+ * Called at dedupe enable time.
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedupe.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for a duplicated extent by the calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase the extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match; the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss; nothing is done in that case.
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash);
+
+/* Add a dedupe hash into dedupe info */
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash);
+
+/* Remove a dedupe hash from dedupe info */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr);
+#endif
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c95e3ce..3cf4c11 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2584,6 +2584,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
 	mutex_init(&fs_info->cleaner_delayed_iput_mutex);
+	mutex_init(&fs_info->dedupe_ioctl_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
-- 
2.7.3





* [PATCH v8 02/27] btrfs: dedupe: Introduce function to initialize dedupe info
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add generic function to initialize dedupe info.
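
To illustrate the sanity checks btrfs_dedupe_enable() performs on its
parameters, here is a minimal user-space sketch of the same validation
(illustrative only; it merely mirrors the limits defined in dedupe.h):

#include <stdbool.h>
#include <stdint.h>

/* Same limits as dedupe.h in this series */
#define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8ULL * 1024 * 1024)
#define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16ULL * 1024)

/* User-space equivalent of the kernel's is_power_of_2() check */
static bool dedupe_blocksize_valid(uint64_t blocksize, uint64_t sectorsize)
{
	return blocksize <= BTRFS_DEDUPE_BLOCKSIZE_MAX &&
	       blocksize >= BTRFS_DEDUPE_BLOCKSIZE_MIN &&
	       blocksize >= sectorsize &&
	       blocksize && (blocksize & (blocksize - 1)) == 0;
}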

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/Makefile |  2 +-
 fs/btrfs/dedupe.c | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dedupe.h | 16 +++++++--
 3 files changed, 112 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/dedupe.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17..1b8c627 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-	   uuid-tree.o props.o hash.o free-space-tree.o
+	   uuid-tree.o props.o hash.o free-space-tree.o dedupe.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
new file mode 100644
index 0000000..9a0e03b
--- /dev/null
+++ b/fs/btrfs/dedupe.c
@@ -0,0 +1,97 @@
+/*
+ * Copyright (C) 2016 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include "ctree.h"
+#include "dedupe.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "delayed-ref.h"
+
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	u64 limit = limit_nr;
+	int ret = 0;
+
+	/* Sanity check */
+	if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
+	    blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
+	    blocksize < fs_info->tree_root->sectorsize ||
+	    !is_power_of_2(blocksize))
+		return -EINVAL;
+	if (type >= ARRAY_SIZE(btrfs_dedupe_sizes))
+		return -EINVAL;
+	if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
+		return -EINVAL;
+
+	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY && limit_nr == 0)
+		limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK && limit_nr != 0)
+		limit = 0;
+
+	dedupe_info = fs_info->dedupe_info;
+	if (dedupe_info) {
+		/* Check if we are re-enabling with a different dedupe config */
+		if (dedupe_info->blocksize != blocksize ||
+		    dedupe_info->hash_type != type ||
+		    dedupe_info->backend != backend) {
+			btrfs_dedupe_disable(fs_info);
+			goto enable;
+		}
+
+		/* On-the-fly limit change is OK */
+		mutex_lock(&dedupe_info->lock);
+		fs_info->dedupe_info->limit_nr = limit;
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+enable:
+	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+	if (!dedupe_info)
+		return -ENOMEM;
+
+	dedupe_info->hash_type = type;
+	dedupe_info->backend = backend;
+	dedupe_info->blocksize = blocksize;
+	dedupe_info->limit_nr = limit;
+
+	/* Only support SHA256 yet */
+	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(dedupe_info->dedupe_driver)) {
+		btrfs_err(fs_info, "failed to init sha256 driver");
+		ret = PTR_ERR(dedupe_info->dedupe_driver);
+		goto out;
+	}
+
+	dedupe_info->hash_root = RB_ROOT;
+	dedupe_info->bytenr_root = RB_ROOT;
+	dedupe_info->current_nr = 0;
+	INIT_LIST_HEAD(&dedupe_info->lru_list);
+	mutex_init(&dedupe_info->lock);
+
+	fs_info->dedupe_info = dedupe_info;
+	/* We must ensure dedupe_enabled is modified after dedupe_info */
+	smp_wmb();
+	fs_info->dedupe_enabled = 1;
+
+out:
+	if (ret < 0)
+		kfree(dedupe_info);
+	return ret;
+}
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 916039c..ab1aef7 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -37,6 +37,9 @@
 #define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
 #define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
 
+/* Default dedupe limit on number of hash */
+#define BTRFS_DEDUPE_LIMIT_NR_DEFAULT	(32 * 1024)
+
 /* Hash algorithm, only support SHA256 yet */
 #define BTRFS_DEDUPE_HASH_SHA256		0
 
@@ -79,8 +82,17 @@ static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 	return (hash && hash->bytenr);
 }
 
-int btrfs_dedupe_hash_size(u16 type);
-struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+static inline int btrfs_dedupe_hash_size(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return -EINVAL;
+	return sizeof(struct btrfs_dedupe_hash) + btrfs_dedupe_sizes[type];
+}
+
+static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
+{
+	return kzalloc(btrfs_dedupe_hash_size(type), GFP_NOFS);
+}
 
 /*
  * Initial inband dedupe info
-- 
2.7.3





* [PATCH v8 03/27] btrfs: dedupe: Introduce function to add hash into in-memory tree
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_add() to add a hash into the
in-memory tree.
With it we can now implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 162 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9a0e03b..013600a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,25 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 
+struct inmem_hash {
+	struct rb_node hash_node;
+	struct rb_node bytenr_node;
+	struct list_head lru_list;
+
+	u64 bytenr;
+	u32 num_bytes;
+
+	u8 hash[];
+};
+
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return NULL;
+	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+			GFP_NOFS);
+}
+
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 			u64 blocksize, u64 limit_nr)
 {
@@ -95,3 +114,146 @@ out:
 		kfree(dedupe_info);
 	return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+			     struct inmem_hash *hash, int hash_len)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->hash_node, parent, p);
+	rb_insert_color(&hash->hash_node, root);
+	return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+			       struct inmem_hash *hash)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+		if (hash->bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->bytenr_node, parent, p);
+	rb_insert_color(&hash->bytenr_node, root);
+	return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+			struct inmem_hash *hash)
+{
+	list_del(&hash->lru_list);
+	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+	if (!WARN_ON(dedupe_info->current_nr == 0))
+		dedupe_info->current_nr--;
+
+	kfree(hash);
+}
+
+/*
+ * Insert a hash into the in-memory dedupe tree
+ * Will evict least recently used hashes when the limit is exceeded.
+ *
+ * If the hash matches an existing one, we won't insert it, to
+ * save memory
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	int ret = 0;
+	u16 type = dedupe_info->hash_type;
+	struct inmem_hash *ihash;
+
+	ihash = inmem_alloc_hash(type);
+
+	if (!ihash)
+		return -ENOMEM;
+
+	/* Copy the data out */
+	ihash->bytenr = hash->bytenr;
+	ihash->num_bytes = hash->num_bytes;
+	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+	if (ret > 0) {
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+				btrfs_dedupe_sizes[type]);
+	if (ret > 0) {
+		/*
+		 * We only keep one hash in tree to save memory, so if
+		 * hash conflicts, free the one to insert.
+		 */
+		rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	list_add(&ihash->lru_list, &dedupe_info->lru_list);
+	dedupe_info->current_nr++;
+
+	/* Remove the last dedupe hash if we exceed limit */
+	while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+		struct inmem_hash *last;
+
+		last = list_entry(dedupe_info->lru_list.prev,
+				  struct inmem_hash, lru_list);
+		__inmem_del(dedupe_info, last);
+	}
+out:
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(!btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	/* ignore old hash */
+	if (dedupe_info->blocksize != hash->num_bytes)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_add(dedupe_info, hash);
+	return -EINVAL;
+}
-- 
2.7.3





* [PATCH v8 04/27] btrfs: dedupe: Introduce function to remove hash from in-memory tree
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_del() to remove a hash from the
in-memory dedupe tree.
And implement the btrfs_dedupe_del() and btrfs_dedupe_disable()
interfaces.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 013600a..e44993b 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -257,3 +257,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		return inmem_add(dedupe_info, hash);
 	return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+		if (bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct inmem_hash *hash;
+
+	mutex_lock(&dedupe_info->lock);
+	hash = inmem_search_bytenr(dedupe_info, bytenr);
+	if (!hash) {
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+	__inmem_del(dedupe_info, hash);
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_del(dedupe_info, bytenr);
+	return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+	struct inmem_hash *entry, *tmp;
+
+	mutex_lock(&dedupe_info->lock);
+	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+		__inmem_del(dedupe_info, entry);
+	mutex_unlock(&dedupe_info->lock);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	int ret;
+
+	/* Here we don't want to increase refs of dedupe_info */
+	fs_info->dedupe_enabled = 0;
+
+	dedupe_info = fs_info->dedupe_info;
+
+	if (!dedupe_info)
+		return 0;
+
+	/* Don't allow disable status change in RO mount */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	/*
+	 * Wait for all unfinished writes to complete the dedupe routine.
+	 * As disable is not a frequent operation, we are OK to use the
+	 * heavy but safe sync_filesystem().
+	 */
+	down_read(&fs_info->sb->s_umount);
+	ret = sync_filesystem(fs_info->sb);
+	up_read(&fs_info->sb->s_umount);
+	if (ret < 0)
+		return ret;
+
+	fs_info->dedupe_info = NULL;
+
+	/* now we are OK to clean up everything */
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
-- 
2.7.3





* [PATCH v8 05/27] btrfs: delayed-ref: Add support for increasing data ref under spinlock
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

For in-band dedupe, btrfs needs to increase a data ref while holding
delayed_refs->lock, so add a new function,
btrfs_add_delayed_data_ref_locked(), to increase an extent ref with
delayed_refs->lock already held.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/delayed-ref.c | 30 +++++++++++++++++++++++-------
 fs/btrfs/delayed-ref.h |  8 ++++++++
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }
 
 /*
+ * Do the real delayed data ref insertion.
+ * Caller must hold delayed_refs->lock and have allocated memory
+ * for dref, head_ref and qrecord.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action)
+{
+	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+			qrecord, bytenr, num_bytes, ref_root, reserved,
+			action, 1);
+	add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+			num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-					bytenr, num_bytes, ref_root, reserved,
-					action, 1);
-
-	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-				   num_bytes, parent, ref_root, owner, offset,
-				   action);
+	btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+			bytenr, num_bytes, parent, ref_root, owner, offset,
+			reserved, action);
 	spin_unlock(&delayed_refs->lock);
 
 	return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index c24b653..2765858 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 	}
 }
 
+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes, u64 parent,
 			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes,
-- 
2.7.3





* [PATCH v8 06/27] btrfs: dedupe: Introduce function to search for an existing hash
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the static function inmem_search() to handle the search in the
in-memory hash tree.

The trick is that we must ensure the delayed ref head is not being run
at the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search()
interface.
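
The locking is the subtle part: dedupe_info->lock cannot be held while
taking the ref head mutex, or it would deadlock (ABBA) against the
delayed-ref path that holds the head mutex and then deletes a hash. A
standalone user-space illustration of the drop/relock/revalidate pattern
inmem_search() uses (pthread mutexes standing in for the kernel locks;
all names here are made up for the illustration):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER; /* dedupe_info->lock */
static pthread_mutex_t head_lock = PTHREAD_MUTEX_INITIALIZER; /* ref head mutex */
static bool hash_present = true;	/* stands in for the rb-tree entry */

/* Returns 1 on a revalidated hit, 0 on a miss */
static int search_with_revalidate(void)
{
	pthread_mutex_lock(&hash_lock);
	if (!hash_present) {
		pthread_mutex_unlock(&hash_lock);
		return 0;
	}
	/*
	 * The delete path takes head_lock first and then hash_lock, so we
	 * must drop hash_lock before taking head_lock ...
	 */
	pthread_mutex_unlock(&hash_lock);
	pthread_mutex_lock(&head_lock);
	/* ... and then revalidate: the hash may have been deleted meanwhile */
	pthread_mutex_lock(&hash_lock);
	if (!hash_present) {
		pthread_mutex_unlock(&hash_lock);
		pthread_mutex_unlock(&head_lock);
		return 0;
	}
	/* safe to use the hash and bump the extent ref here */
	pthread_mutex_unlock(&hash_lock);
	pthread_mutex_unlock(&head_lock);
	return 1;
}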

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index e44993b..0196188 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -362,3 +363,186 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	kfree(dedupe_info);
 	return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+	struct rb_node **p = &dedupe_info->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+	u16 hash_type = dedupe_info->hash_type;
+	int hash_len = btrfs_dedupe_sizes[hash_type];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+		if (memcmp(hash, entry->hash, hash_len) < 0) {
+			p = &(*p)->rb_left;
+		} else if (memcmp(hash, entry->hash, hash_len) > 0) {
+			p = &(*p)->rb_right;
+		} else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &dedupe_info->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
+	struct btrfs_delayed_ref_head *insert_head;
+	struct btrfs_delayed_data_ref *insert_dref;
+	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+	struct inmem_hash *found_hash;
+	int free_insert = 1;
+	u64 bytenr;
+	u32 num_bytes;
+
+	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+	if (!insert_head)
+		return -ENOMEM;
+	insert_head->extent_op = NULL;
+	insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+	if (!insert_dref) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		return -ENOMEM;
+	}
+	if (root->fs_info->quota_enabled &&
+	    is_fstree(root->root_key.objectid)) {
+		insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+		if (!insert_qrecord) {
+			kmem_cache_free(btrfs_delayed_ref_head_cachep,
+					insert_head);
+			kmem_cache_free(btrfs_delayed_data_ref_cachep,
+					insert_dref);
+			return -ENOMEM;
+		}
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto free_mem;
+	}
+
+again:
+	mutex_lock(&dedupe_info->lock);
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	/* If we don't find a duplicated extent, just return. */
+	if (!found_hash) {
+		ret = 0;
+		goto out;
+	}
+	bytenr = found_hash->bytenr;
+	num_bytes = found_hash->num_bytes;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	spin_lock(&delayed_refs->lock);
+	head = btrfs_find_delayed_ref_head(trans, bytenr);
+	if (!head) {
+		/*
+		 * We can safely insert a new delayed_ref as long as we
+		 * hold delayed_refs->lock.
+		 * Only need to use atomic inc_extent_ref()
+		 */
+		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+				insert_dref, insert_head, insert_qrecord,
+				bytenr, num_bytes, 0, root->root_key.objectid,
+				btrfs_ino(inode), file_pos, 0,
+				BTRFS_ADD_DELAYED_REF);
+		spin_unlock(&delayed_refs->lock);
+
+		/* add_delayed_data_ref_locked will free unused memory */
+		free_insert = 0;
+		hash->bytenr = bytenr;
+		hash->num_bytes = num_bytes;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * We can't lock the ref head with dedupe_info->lock held or we
+	 * will cause an ABBA deadlock.
+	 */
+	mutex_unlock(&dedupe_info->lock);
+	ret = btrfs_delayed_ref_lock(trans, head);
+	spin_unlock(&delayed_refs->lock);
+	if (ret == -EAGAIN)
+		goto again;
+
+	mutex_lock(&dedupe_info->lock);
+	/* Search again to ensure the hash is still here */
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	if (!found_hash) {
+		ret = 0;
+		mutex_unlock(&head->mutex);
+		goto out;
+	}
+	hash->bytenr = bytenr;
+	hash->num_bytes = num_bytes;
+
+	/*
+	 * Increase the extent ref right now, to avoid the delayed ref run,
+	 * or we may increase the ref on a non-existent extent.
+	 */
+	btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
+			     root->root_key.objectid,
+			     btrfs_ino(inode), file_pos);
+	mutex_unlock(&head->mutex);
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_end_transaction(trans, root);
+
+free_mem:
+	if (free_insert) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		kmem_cache_free(btrfs_delayed_data_ref_cachep, insert_dref);
+		kfree(insert_qrecord);
+	}
+	return ret;
+}
+
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	int ret = -EINVAL;
+
+	if (!hash)
+		return 0;
+
+	/*
+	 * This function doesn't follow fs_info->dedupe_enabled, as it needs
+	 * to ensure any already-hashed extent goes through the dedupe routine
+	 */
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (WARN_ON(btrfs_dedupe_hash_hit(hash)))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		ret = inmem_search(dedupe_info, inode, file_pos, hash);
+
+	/* It's possible hash->bytenr/num_bytes already changed */
+	if (ret == 0) {
+		hash->num_bytes = 0;
+		hash->bytenr = 0;
+	}
+	return ret;
+}
-- 
2.7.3





* [PATCH v8 07/27] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Unlike the dedupe backends (in-memory or on-disk), only the SHA256 hash
algorithm is supported so far, so implement the btrfs_dedupe_calc_hash()
interface using SHA256.
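
Conceptually, the hash covers one whole dedupe block, fed to the digest
in sector-sized pieces. A rough user-space analogue (illustrative only;
it uses OpenSSL's SHA-256, whereas the kernel code below uses the crypto
shash API, and the constants are just the series' defaults):

#include <openssl/sha.h>
#include <stdio.h>

#define DEDUPE_BS	(128 * 1024)	/* default dedupe block size */
#define SECTORSIZE	4096

/* Hash one full dedupe block of a file starting at 'offset' */
static int hash_dedupe_block(FILE *f, long offset,
			     unsigned char out[SHA256_DIGEST_LENGTH])
{
	unsigned char buf[SECTORSIZE];
	SHA256_CTX ctx;
	long done;

	if (fseek(f, offset, SEEK_SET))
		return -1;
	SHA256_Init(&ctx);
	for (done = 0; done < DEDUPE_BS; done += SECTORSIZE) {
		/* caller must ensure the whole block has valid data */
		if (fread(buf, 1, SECTORSIZE, f) != SECTORSIZE)
			return -1;
		SHA256_Update(&ctx, buf, SECTORSIZE);
	}
	SHA256_Final(out, &ctx);
	return 0;
}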

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 0196188..7ef2c37 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -546,3 +546,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	struct {
+		struct shash_desc desc;
+		char ctx[crypto_shash_descsize(tfm)];
+	} sdesc;
+	u64 dedupe_bs;
+	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	sdesc.desc.tfm = tfm;
+	sdesc.desc.flags = 0;
+	ret = crypto_shash_init(&sdesc.desc);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_CACHE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(&sdesc.desc, d, sectorsize);
+		kunmap(p);
+		page_cache_release(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(&sdesc.desc, hash->hash);
+	return ret;
+}
-- 
2.7.3





* [PATCH v8 08/27] btrfs: ordered-extent: Add support for dedupe
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles non-compressed
source extents.
Support for compressed source extents will be added later.
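
For reference, the hash attached to an ordered extent can be in one of
three states; this little enum is only an illustration of the convention
documented in the ordered-data.h hunk below, not code from the patch:

enum dedupe_ordered_state {
	DEDUPE_NONE,	/* hash == NULL: no deduplication for this extent */
	DEDUPE_MISS,	/* hash->bytenr == 0: hash gets added to the dedupe tree */
	DEDUPE_HIT,	/* hash->bytenr != 0: extent ref is already increased */
};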

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ordered-data.c | 44 ++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/ordered-data.h | 13 +++++++++++++
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0de7da5..ef24ad1 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int dio, int compress_type)
+				      int type, int dio, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->hash = NULL;
+	/*
+	 * A hash hit must go through the dedupe routine at all costs, even
+	 * if dedupe is disabled, as its delayed ref has already been increased.
+	 */
+	if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+		struct btrfs_dedupe_info *dedupe_info;
+
+		dedupe_info = root->fs_info->dedupe_info;
+		if (WARN_ON(dedupe_info == NULL)) {
+			kmem_cache_free(btrfs_ordered_extent_cache,
+					entry);
+			return -EINVAL;
+		}
+		entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+		if (!entry->hash) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+		entry->hash->bytenr = hash->bytenr;
+		entry->hash->num_bytes = hash->num_bytes;
+		memcpy(entry->hash->hash, hash->hash,
+		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	}
+
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);
 
@@ -250,15 +277,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash)
+{
+	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+					  disk_len, type, 0,
+					  BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 1,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +302,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type);
+					  compress_type, NULL);
 }
 
 /*
@@ -577,6 +612,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..8a54476 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16 @@ struct btrfs_ordered_extent {
 	struct completion completion;
 	struct btrfs_work flush_work;
 	struct list_head work_list;
+
+	/*
+	 * For inband deduplication
+	 * If hash is NULL, no deduplication.
+	 * If hash->bytenr is zero, means this is a dedupe miss, hash will
+	 * be added into dedupe tree.
+	 * If hash->bytenr is non-zero, this is a dedupe hit. Extent ref is
+	 * *ALREADY* increased.
+	 */
+	struct btrfs_dedupe_hash *hash;
 };
 
 /*
@@ -172,6 +182,9 @@ int btrfs_dec_test_first_ordered_pending(struct inode *inode,
 				   int uptodate);
 int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 			     u64 start, u64 len, u64 disk_len, int type);
+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				   u64 start, u64 len, u64 disk_len, int type,
+				   struct btrfs_dedupe_hash *hash);
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type);
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
-- 
2.7.3





* [PATCH v8 09/27] btrfs: dedupe: Inband in-memory only de-duplication implement
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses that hash to do inband de-duplication at the extent level.

The work flow is as below:
1) Run the delalloc range for an inode
2) Calculate the hash for the delalloc range in units of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash miss case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the limit.
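
In terms of the framework interfaces introduced earlier in the series,
the per-block decision in the delalloc path boils down to roughly the
following (a simplified sketch built on the real dedupe.h calls, not the
exact code in this patch; async-extent plumbing and most error handling
are omitted):

static int dedupe_one_block(struct btrfs_fs_info *fs_info,
			    struct inode *inode, u64 start)
{
	struct btrfs_dedupe_hash *hash;
	int ret;

	hash = btrfs_dedupe_alloc_hash(BTRFS_DEDUPE_HASH_SHA256);
	if (!hash)
		return -ENOMEM;

	ret = btrfs_dedupe_calc_hash(fs_info, inode, start, hash);
	if (ret >= 0)
		ret = btrfs_dedupe_search(fs_info, inode, start, hash);
	if (ret < 0) {
		kfree(hash);
		return ret;
	}

	if (btrfs_dedupe_hash_hit(hash)) {
		/*
		 * Hit: the search already increased the extent ref, so the
		 * file extent just points at hash->bytenr/num_bytes and no
		 * new space is allocated.
		 */
	} else {
		/*
		 * Miss: fall back to normal COW allocation; the hash is
		 * added via btrfs_dedupe_add() once the ordered extent
		 * finishes.
		 */
	}
	/* the real code hands 'hash' over to the async/ordered extent here */
	return 0;
}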

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c |  18 ++++++
 fs/btrfs/inode.c       | 168 ++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 162 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
 	if (btrfs_delayed_ref_is_head(node)) {
 		struct btrfs_delayed_ref_head *head;
+		struct btrfs_fs_info *fs_info = root->fs_info;
+
 		/*
 		 * we've hit the end of the chain and we were supposed
 		 * to insert this extent into the tree.  But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 			btrfs_pin_extent(root, node->bytenr,
 					 node->num_bytes, 1);
 			if (head->is_data) {
+				/*
+				 * If insert_reserved is given, it means a new
+				 * extent was reserved, then deleted in one
+				 * transaction, and inc/dec got merged to 0.
+				 *
+				 * In this case, we need to remove its dedupe
+				 * hash.
+				 */
+				btrfs_dedupe_del(trans, fs_info, node->bytenr);
 				ret = btrfs_del_csums(trans, root,
 						      node->bytenr,
 						      node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 
 		if (is_data) {
+			ret = btrfs_dedupe_del(trans, info, bytenr);
+			if (ret < 0) {
+				btrfs_abort_transaction(trans, extent_root,
+							ret);
+				goto out;
+			}
 			ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
 			if (ret) {
 				btrfs_abort_transaction(trans, extent_root, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..13ae366 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 					   u64 len, u64 orig_start,
 					   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
 	struct page **pages;
 	unsigned long nr_pages;
 	int compress_type;
+	struct btrfs_dedupe_hash *hash;
 	struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 				     u64 compressed_size,
 				     struct page **pages,
 				     unsigned long nr_pages,
-				     int compress_type)
+				     int compress_type,
+				     struct btrfs_dedupe_hash *hash)
 {
 	struct async_extent *async_extent;
 
@@ -365,6 +369,7 @@ static noinline int add_async_extent(struct async_cow *cow,
 	async_extent->pages = pages;
 	async_extent->nr_pages = nr_pages;
 	async_extent->compress_type = compress_type;
+	async_extent->hash = hash;
 	list_add_tail(&async_extent->list, &cow->extents);
 	return 0;
 }
@@ -616,7 +621,7 @@ cont:
 		 */
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret,
-				 compress_type);
+				 compress_type, NULL);
 
 		if (start + num_bytes < end) {
 			start += num_bytes;
@@ -641,7 +646,7 @@ cleanup_and_bail_uncompressed:
 		if (redirty)
 			extent_range_redirty_for_io(inode, start, end);
 		add_async_extent(async_cow, start, end - start + 1,
-				 0, NULL, 0, BTRFS_COMPRESS_NONE);
+				 0, NULL, 0, BTRFS_COMPRESS_NONE, NULL);
 		*num_added += 1;
 	}
 
@@ -687,6 +692,7 @@ static noinline void submit_compressed_extents(struct inode *inode,
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	struct extent_io_tree *io_tree;
+	struct btrfs_dedupe_hash *hash;
 	int ret = 0;
 
 again:
@@ -696,6 +702,7 @@ again:
 		list_del(&async_extent->list);
 
 		io_tree = &BTRFS_I(inode)->io_tree;
+		hash = async_extent->hash;
 
 retry:
 		/* did the compression code fall back to uncompressed IO? */
@@ -712,7 +719,8 @@ retry:
 					     async_extent->start,
 					     async_extent->start +
 					     async_extent->ram_size - 1,
-					     &page_started, &nr_written, 0);
+					     &page_started, &nr_written, 0,
+					     hash);
 
 			/* JDM XXX */
 
@@ -925,7 +933,7 @@ static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
 				   unsigned long *nr_written,
-				   int unlock)
+				   int unlock, struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 alloc_hint = 0;
@@ -984,11 +992,16 @@ static noinline int cow_file_range(struct inode *inode,
 		unsigned long op;
 
 		cur_alloc_size = disk_num_bytes;
-		ret = btrfs_reserve_extent(root, cur_alloc_size,
+		if (btrfs_dedupe_hash_hit(hash)) {
+			ins.objectid = hash->bytenr;
+			ins.offset = hash->num_bytes;
+		} else {
+			ret = btrfs_reserve_extent(root, cur_alloc_size,
 					   root->sectorsize, 0, alloc_hint,
 					   &ins, 1, 1);
-		if (ret < 0)
-			goto out_unlock;
+			if (ret < 0)
+				goto out_unlock;
+		}
 
 		em = alloc_extent_map();
 		if (!em) {
@@ -1025,8 +1038,9 @@ static noinline int cow_file_range(struct inode *inode,
 			goto out_reserve;
 
 		cur_alloc_size = ins.offset;
-		ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
-					       ram_size, cur_alloc_size, 0);
+		ret = btrfs_add_ordered_extent_dedupe(inode, start,
+				ins.objectid, cur_alloc_size, ins.offset,
+				0, hash);
 		if (ret)
 			goto out_drop_extent_cache;
 
@@ -1076,6 +1090,68 @@ out_unlock:
 	goto out;
 }
 
+static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
+			    struct async_cow *async_cow, int *num_added)
+{
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct page *locked_page = async_cow->locked_page;
+	u16 hash_algo;
+	u64 actual_end;
+	u64 isize = i_size_read(inode);
+	u64 dedupe_bs;
+	u64 cur_offset = start;
+	int ret = 0;
+
+	actual_end = min_t(u64, isize, end + 1);
+	/* If dedupe is not enabled, don't split extent into dedupe_bs */
+	if (fs_info->dedupe_enabled && dedupe_info) {
+		dedupe_bs = dedupe_info->blocksize;
+		hash_algo = dedupe_info->hash_type;
+	} else {
+		dedupe_bs = SZ_128M;
+		/* Just a dummy value, to avoid accessing a NULL pointer */
+		hash_algo = BTRFS_DEDUPE_HASH_SHA256;
+	}
+
+	while (cur_offset < end) {
+		struct btrfs_dedupe_hash *hash = NULL;
+		u64 len;
+
+		len = min(end + 1 - cur_offset, dedupe_bs);
+		if (len < dedupe_bs)
+			goto next;
+
+		hash = btrfs_dedupe_alloc_hash(hash_algo);
+		if (!hash) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		ret = btrfs_dedupe_calc_hash(fs_info, inode, cur_offset, hash);
+		if (ret < 0)
+			goto out;
+
+		ret = btrfs_dedupe_search(fs_info, inode, cur_offset, hash);
+		if (ret < 0)
+			goto out;
+		ret = 0;
+
+next:
+		/* Redirty the locked page if it corresponds to our extent */
+		if (page_offset(locked_page) >= start &&
+		    page_offset(locked_page) <= end)
+			__set_page_dirty_nobuffers(locked_page);
+
+		add_async_extent(async_cow, cur_offset, len, 0, NULL, 0,
+				 BTRFS_COMPRESS_NONE, hash);
+		cur_offset += len;
+		(*num_added)++;
+	}
+out:
+	return ret;
+}
+
 /*
  * work queue call back to started compression on a file and pages
  */
@@ -1083,11 +1159,18 @@ static noinline void async_cow_start(struct btrfs_work *work)
 {
 	struct async_cow *async_cow;
 	int num_added = 0;
+	int ret = 0;
 	async_cow = container_of(work, struct async_cow, work);
 
-	compress_file_range(async_cow->inode, async_cow->locked_page,
-			    async_cow->start, async_cow->end, async_cow,
-			    &num_added);
+	if (inode_need_compress(async_cow->inode))
+		compress_file_range(async_cow->inode, async_cow->locked_page,
+				    async_cow->start, async_cow->end, async_cow,
+				    &num_added);
+	else
+		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+				       async_cow->end, async_cow, &num_added);
+	WARN_ON(ret);
+
 	if (num_added == 0) {
 		btrfs_add_delayed_iput(async_cow->inode);
 		async_cow->inode = NULL;
@@ -1136,6 +1219,8 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 {
 	struct async_cow *async_cow;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
 	unsigned long nr_pages;
 	u64 cur_end;
 	int limit = 10 * SZ_1M;
@@ -1150,7 +1235,11 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 		async_cow->locked_page = locked_page;
 		async_cow->start = start;
 
-		if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
+		if (fs_info->dedupe_enabled && dedupe_info) {
+			u64 len = max_t(u64, SZ_512K, dedupe_info->blocksize);
+
+			cur_end = min(end, start + len - 1);
+		} else if (BTRFS_I(inode)->flags & BTRFS_INODE_NOCOMPRESS &&
 		    !btrfs_test_opt(root, FORCE_COMPRESS))
 			cur_end = end;
 		else
@@ -1407,7 +1496,7 @@ out_check:
 		if (cow_start != (u64)-1) {
 			ret = cow_file_range(inode, locked_page,
 					     cow_start, found_key.offset - 1,
-					     page_started, nr_written, 1);
+					     page_started, nr_written, 1, NULL);
 			if (ret) {
 				if (!nolock && nocow)
 					btrfs_end_write_no_snapshoting(root);
@@ -1486,7 +1575,7 @@ out_check:
 
 	if (cow_start != (u64)-1) {
 		ret = cow_file_range(inode, locked_page, cow_start, end,
-				     page_started, nr_written, 1);
+				     page_started, nr_written, 1, NULL);
 		if (ret)
 			goto error;
 	}
@@ -1537,6 +1626,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 {
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
@@ -1544,9 +1635,9 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode)) {
+	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
 		ret = cow_file_range(inode, locked_page, start, end,
-				      page_started, nr_written, 1);
+				      page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
@@ -2076,7 +2167,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 				       u64 disk_bytenr, u64 disk_num_bytes,
 				       u64 num_bytes, u64 ram_bytes,
 				       u8 compression, u8 encryption,
-				       u16 other_encoding, int extent_type)
+				       u16 other_encoding, int extent_type,
+				       struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_file_extent_item *fi;
@@ -2138,10 +2230,37 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	ins.objectid = disk_bytenr;
 	ins.offset = disk_num_bytes;
 	ins.type = BTRFS_EXTENT_ITEM_KEY;
-	ret = btrfs_alloc_reserved_file_extent(trans, root,
+
+	/*
+	 * Only for no-dedupe or hash miss case, we need to increase
+	 * extent reference
+	 * For hash hit case, reference is already increased
+	 */
+	if (!hash || hash->bytenr == 0)
+		ret = btrfs_alloc_reserved_file_extent(trans, root,
 					root->root_key.objectid,
 					btrfs_ino(inode), file_pos,
 					ram_bytes, &ins);
+	if (ret < 0)
+		goto out_qgroup;
+
+	/*
+	 * Hash hit won't create a new data extent, so its reserved quota
+	 * space won't be freed by new delayed_ref_head.
+	 * Need to free it here.
+	 */
+	if (btrfs_dedupe_hash_hit(hash))
+		btrfs_qgroup_free_data(inode, file_pos, ram_bytes);
+
+	/* Add missed hash into dedupe tree */
+	if (hash && hash->bytenr == 0) {
+		hash->bytenr = ins.objectid;
+		hash->num_bytes = ins.offset;
+		ret = btrfs_dedupe_add(trans, root->fs_info, hash);
+	}
+
+out_qgroup:
+
 	/*
 	 * Release the reserved range from inode dirty range map, as it is
 	 * already moved into delayed_ref_head
@@ -2925,7 +3044,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->disk_len,
 						logical_len, logical_len,
 						compress_type, 0, 0,
-						BTRFS_FILE_EXTENT_REG);
+						BTRFS_FILE_EXTENT_REG,
+						ordered_extent->hash);
 		if (!ret)
 			btrfs_release_delalloc_bytes(root,
 						     ordered_extent->start,
@@ -2985,7 +3105,6 @@ out:
 						   ordered_extent->disk_len, 1);
 	}
 
-
 	/*
 	 * This needs to be done to make sure anybody waiting knows we are done
 	 * updating everything for this ordered extent.
@@ -9948,7 +10067,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 						  cur_offset, ins.objectid,
 						  ins.offset, ins.offset,
 						  ins.offset, 0, 0, 0,
-						  BTRFS_FILE_EXTENT_PREALLOC);
+						  BTRFS_FILE_EXTENT_PREALLOC,
+						  NULL);
 		if (ret) {
 			btrfs_free_reserved_extent(root, ins.objectid,
 						   ins.offset, 0);
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (8 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 09/27] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-24 20:58   ` Chris Mason
  2016-03-22  1:35 ` [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
                   ` (18 subsequent siblings)
  28 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Liu Bo, Wang Xiaoguang

Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes,
acting as persistent hash storage instead of the in-memory only
implementation.

Unlike Liu Bo's implementation, this version does not rely on a hack for
the bytenr -> hash search, but adds a new key type, DEDUPE_BYTENR_ITEM,
for that search case, just like the in-memory backend.
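
To make the two search directions easier to picture, here is a small
sketch (illustration only, not part of the patch; hash, hash_len and
bytenr are assumed to be in scope) of how the two keys for one
deduplicated block are built, following the definitions added to ctree.h
below:

  struct btrfs_key key;
  u64 hash_tail;

  /* hash -> bytenr direction, used by the write path to find a match */
  memcpy(&hash_tail, hash + hash_len - 8, 8);  /* last 64 bits of SHA256 */
  key.objectid = hash_tail;
  key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
  key.offset = bytenr;     /* item body: btrfs_dedupe_hash_item + full hash */

  /* bytenr -> hash direction, used when the data extent is freed */
  key.objectid = bytenr;
  key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
  key.offset = hash_tail;  /* item body: the full hash only */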

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h             | 63 +++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/dedupe.h            |  5 ++++
 fs/btrfs/disk-io.c           |  1 +
 include/trace/events/btrfs.h |  3 ++-
 4 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 022ab61..bed9273 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -508,6 +511,7 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE		(1ULL << 1)
 
 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
@@ -537,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -959,6 +964,42 @@ struct btrfs_csum_item {
 	u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+	__le64 blocksize;
+	__le64 limit_nr;
+	__le16 hash_type;
+	__le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ */
+struct btrfs_dedupe_hash_item {
+	/* length of dedupe range */
+	__le32 len;
+
+	/* Hash follows */
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * all its content is hash.
+ * So no special item struct is needed.
+ */
+
 struct btrfs_dev_stats_item {
 	/*
 	 * grow this item struct at the end for future enhancements and keep
@@ -2167,6 +2208,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY	228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY	230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY	231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY	232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3263,6 +3311,19 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
 	return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+		   blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+		   limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+		   hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+		   backend, 16);
+
+/* btrfs_dedupe_hash_item */
+BTRFS_SETGET_FUNCS(dedupe_hash_len, struct btrfs_dedupe_hash_item, len, 32);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index ab1aef7..537f0b8 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -58,6 +58,8 @@ struct btrfs_dedupe_hash {
 	u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
 	/* dedupe blocksize */
 	u64 blocksize;
@@ -73,6 +75,9 @@ struct btrfs_dedupe_info {
 	struct list_head lru_list;
 	u64 limit_nr;
 	u64 current_nr;
+
+	/* for persist data like dedup-hash and dedupe status */
+	struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3cf4c11..57ae928 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -183,6 +183,7 @@ static struct btrfs_lockdep_keyset {
 	{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID,	.name_stem = "dreloc"	},
 	{ .id = BTRFS_UUID_TREE_OBJECTID,	.name_stem = "uuid"	},
 	{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID,	.name_stem = "free-space" },
+	{ .id = BTRFS_DEDUPE_TREE_OBJECTID,	.name_stem = "dedupe"	},
 	{ .id = 0,				.name_stem = "tree"	},
 };
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d866f21..2c3d48a 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -47,12 +47,13 @@ struct btrfs_qgroup_operation;
 		{ BTRFS_TREE_RELOC_OBJECTID,	"TREE_RELOC"	},	\
 		{ BTRFS_UUID_TREE_OBJECTID,	"UUID_TREE"	},	\
 		{ BTRFS_FREE_SPACE_TREE_OBJECTID, "FREE_SPACE_TREE" },	\
+		{ BTRFS_DEDUPE_TREE_OBJECTID,	"DEDUPE_TREE"	},	\
 		{ BTRFS_DATA_RELOC_TREE_OBJECTID, "DATA_RELOC_TREE" })
 
 #define show_root_type(obj)						\
 	obj, ((obj >= BTRFS_DATA_RELOC_TREE_OBJECTID) ||		\
 	      (obj >= BTRFS_ROOT_TREE_OBJECTID &&			\
-	       obj <= BTRFS_QUOTA_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
+	       obj <= BTRFS_DEDUPE_TREE_OBJECTID)) ? __show_root_type(obj) : "-"
 
 #define BTRFS_GROUP_FLAGS	\
 	{ BTRFS_BLOCK_GROUP_DATA,	"DATA"},	\
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (9 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-29 17:31   ` Alex Lyakas
  2016-03-22  1:35 ` [PATCH v8 12/27] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Since we will introduce a new on-disk dedupe backend, introduce new
interfaces to resume a previous dedupe setup at mount time and to clean
it up at umount time.

And since we introduce a new tree for the dedupe status, also teach the
disable handler to remove that tree.
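
Roughly, the intended lifecycle looks like the following sketch
(condensed from the disk-io.c hunks below, error handling dropped):

  /* mount: if an on-disk dedupe root exists, rebuild the in-memory
   * btrfs_dedupe_info from its status item, no new enable ioctl needed */
  location.objectid = BTRFS_DEDUPE_TREE_OBJECTID;
  root = btrfs_read_tree_root(tree_root, &location);
  if (!IS_ERR(root))
          ret = btrfs_dedupe_resume(fs_info, root);

  /* umount: free the in-memory structures, the on-disk tree stays */
  btrfs_dedupe_cleanup(fs_info);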

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c  | 269 +++++++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/dedupe.h  |  13 +++
 fs/btrfs/disk-io.c |  21 ++++-
 fs/btrfs/disk-io.h |   1 +
 4 files changed, 283 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 7ef2c37..1112fec 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"
 
 struct inmem_hash {
 	struct rb_node hash_node;
@@ -41,10 +43,103 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
 			GFP_NOFS);
 }
 
+static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
+			    u16 backend, u64 blocksize, u64 limit)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
+	if (!dedupe_info)
+		return -ENOMEM;
+
+	dedupe_info->hash_type = type;
+	dedupe_info->backend = backend;
+	dedupe_info->blocksize = blocksize;
+	dedupe_info->limit_nr = limit;
+
+	/* only support SHA256 yet */
+	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
+	if (IS_ERR(dedupe_info->dedupe_driver)) {
+		int ret;
+
+		ret = PTR_ERR(dedupe_info->dedupe_driver);
+		kfree(dedupe_info);
+		return ret;
+	}
+
+	dedupe_info->hash_root = RB_ROOT;
+	dedupe_info->bytenr_root = RB_ROOT;
+	dedupe_info->current_nr = 0;
+	INIT_LIST_HEAD(&dedupe_info->lru_list);
+	mutex_init(&dedupe_info->lock);
+
+	*ret_info = dedupe_info;
+	return 0;
+}
+
+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+			    struct btrfs_dedupe_info *dedupe_info)
+{
+	struct btrfs_root *dedupe_root;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	struct btrfs_dedupe_status_item *status;
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	trans = btrfs_start_transaction(fs_info->tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	dedupe_root = btrfs_create_tree(trans, fs_info,
+				       BTRFS_DEDUPE_TREE_OBJECTID);
+	if (IS_ERR(dedupe_root)) {
+		ret = PTR_ERR(dedupe_root);
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+	dedupe_info->dedupe_root = dedupe_root;
+
+	key.objectid = 0;
+	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+				      sizeof(*status));
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+
+	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dedupe_status_item);
+	btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+					 dedupe_info->blocksize);
+	btrfs_set_dedupe_status_limit(path->nodes[0], status,
+			dedupe_info->limit_nr);
+	btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+			dedupe_info->hash_type);
+	btrfs_set_dedupe_status_backend(path->nodes[0], status,
+			dedupe_info->backend);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+	btrfs_free_path(path);
+	if (ret == 0)
+		btrfs_commit_transaction(trans, fs_info->tree_root);
+	return ret;
+}
+
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 			u64 blocksize, u64 limit_nr)
 {
 	struct btrfs_dedupe_info *dedupe_info;
+	int create_tree;
+	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
 	u64 limit = limit_nr;
 	int ret = 0;
 
@@ -63,6 +158,14 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 		limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
 	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK && limit_nr != 0)
 		limit = 0;
+	/* Ondisk backend needs DEDUP RO compat feature */
+	if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE) &&
+	    backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return -EOPNOTSUPP;
+
+	/* Meaningless and unable to enable dedupe for RO fs */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
 
 	dedupe_info = fs_info->dedupe_info;
 	if (dedupe_info) {
@@ -81,29 +184,71 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 		return 0;
 	}
 
+	dedupe_info = NULL;
 enable:
-	dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
-	if (dedupe_info)
+	create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE;
+
+	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
+	if (ret < 0)
+		return ret;
+	if (create_tree) {
+		ret = init_dedupe_tree(fs_info, dedupe_info);
+		if (ret < 0)
+			goto out;
+	}
+
+	fs_info->dedupe_info = dedupe_info;
+	/* We must ensure dedupe_enabled is modified after dedupe_info */
+	smp_wmb();
+	fs_info->dedupe_enabled = 1;
+out:
+	if (ret < 0) {
+		crypto_free_shash(dedupe_info->dedupe_driver);
+		kfree(dedupe_info);
+	}
+	return ret;
+}
+
+int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
+			struct btrfs_root *dedupe_root)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	struct btrfs_dedupe_status_item *status;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	u64 blocksize;
+	u64 limit;
+	u16 type;
+	u16 backend;
+	int ret = 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
 		return -ENOMEM;
 
-	dedupe_info->hash_type = type;
-	dedupe_info->backend = backend;
-	dedupe_info->blocksize = blocksize;
-	dedupe_info->limit_nr = limit;
+	key.objectid = 0;
+	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+	key.offset = 0;
 
-	/* Only support SHA256 yet */
-	dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
-	if (IS_ERR(dedupe_info->dedupe_driver)) {
-		btrfs_err(fs_info, "failed to init sha256 driver");
-		ret = PTR_ERR(dedupe_info->dedupe_driver);
+	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+	if (ret > 0) {
+		ret = -ENOENT;
+		goto out;
+	} else if (ret < 0) {
 		goto out;
 	}
 
-	dedupe_info->hash_root = RB_ROOT;
-	dedupe_info->bytenr_root = RB_ROOT;
-	dedupe_info->current_nr = 0;
-	INIT_LIST_HEAD(&dedupe_info->lru_list);
-	mutex_init(&dedupe_info->lock);
+	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dedupe_status_item);
+	blocksize = btrfs_dedupe_status_blocksize(path->nodes[0], status);
+	limit = btrfs_dedupe_status_limit(path->nodes[0], status);
+	type = btrfs_dedupe_status_hash_type(path->nodes[0], status);
+	backend = btrfs_dedupe_status_backend(path->nodes[0], status);
+
+	ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
+	if (ret < 0)
+		goto out;
+	dedupe_info->dedupe_root = dedupe_root;
 
 	fs_info->dedupe_info = dedupe_info;
 	/* We must ensure dedupe_enabled is modified after dedupe_info */
@@ -111,11 +256,36 @@ enable:
 	fs_info->dedupe_enabled = 1;
 
 out:
-	if (ret < 0)
-		kfree(dedupe_info);
+	btrfs_free_path(path);
 	return ret;
 }
 
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info);
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	fs_info->dedupe_enabled = 0;
+
+	/* same as disable */
+	smp_wmb();
+	dedupe_info = fs_info->dedupe_info;
+	fs_info->dedupe_info = NULL;
+
+	if (!dedupe_info)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+	if (dedupe_info->dedupe_root) {
+		free_root_extent_buffers(dedupe_info->dedupe_root);
+		kfree(dedupe_info->dedupe_root);
+	}
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
+
 static int inmem_insert_hash(struct rb_root *root,
 			     struct inmem_hash *hash, int hash_len)
 {
@@ -325,6 +495,65 @@ static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
 	mutex_unlock(&dedupe_info->lock);
 }
 
+static int remove_dedupe_tree(struct btrfs_root *dedupe_root)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_fs_info *fs_info = dedupe_root->fs_info;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct extent_buffer *node;
+	int ret;
+	int nr;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+	trans = btrfs_start_transaction(fs_info->tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
+	path->leave_spinning = 1;
+	key.objectid = 0;
+	key.offset = 0;
+	key.type = 0;
+
+	while (1) {
+		ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+		if (ret < 0)
+			goto out;
+		node = path->nodes[0];
+		nr = btrfs_header_nritems(node);
+		if (nr == 0) {
+			btrfs_release_path(path);
+			break;
+		}
+		path->slots[0] = 0;
+		ret = btrfs_del_items(trans, dedupe_root, path, 0, nr);
+		if (ret)
+			goto out;
+		btrfs_release_path(path);
+	}
+
+	ret = btrfs_del_root(trans, fs_info->tree_root, &dedupe_root->root_key);
+	if (ret)
+		goto out;
+
+	list_del(&dedupe_root->dirty_list);
+	btrfs_tree_lock(dedupe_root->node);
+	clean_tree_block(trans, fs_info, dedupe_root->node);
+	btrfs_tree_unlock(dedupe_root->node);
+	btrfs_free_tree_block(trans, dedupe_root, dedupe_root->node, 0, 1);
+	free_extent_buffer(dedupe_root->node);
+	free_extent_buffer(dedupe_root->commit_root);
+	kfree(dedupe_root);
+	ret = btrfs_commit_transaction(trans, fs_info->tree_root);
+out:
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_dedupe_info *dedupe_info;
@@ -358,10 +587,12 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	/* now we are OK to clean up everything */
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		inmem_destroy(dedupe_info);
+	if (dedupe_info->dedupe_root)
+		ret = remove_dedupe_tree(dedupe_info->dedupe_root);
 
 	crypto_free_shash(dedupe_info->dedupe_driver);
 	kfree(dedupe_info);
-	return 0;
+	return ret;
 }
 
 /*
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 537f0b8..120e630 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -112,6 +112,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
  */
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
 
+ /*
+ * Restore previous dedupe setup from disk
+ * Called at mount time
+ */
+int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
+		       struct btrfs_root *dedupe_root);
+
+/*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
 /*
  * Calculate hash for dedup.
  * Caller must ensure [start, start + dedupe_bs) has valid data.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 57ae928..44d098d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -51,6 +51,7 @@
 #include "sysfs.h"
 #include "qgroup.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_X86
 #include <asm/cpufeature.h>
@@ -2156,7 +2157,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->extent_workers);
 }
 
-static void free_root_extent_buffers(struct btrfs_root *root)
+void free_root_extent_buffers(struct btrfs_root *root)
 {
 	if (root) {
 		free_extent_buffer(root->node);
@@ -2490,7 +2491,21 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
 		fs_info->free_space_root = root;
 	}
 
-	return 0;
+	location.objectid = BTRFS_DEDUPE_TREE_OBJECTID;
+	root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(root)) {
+		ret = PTR_ERR(root);
+		if (ret != -ENOENT)
+			return ret;
+		return 0;
+	}
+	set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
+	ret = btrfs_dedupe_resume(fs_info, root);
+	if (ret < 0) {
+		free_root_extent_buffers(root);
+		kfree(root);
+	}
+	return ret;
 }
 
 int open_ctree(struct super_block *sb,
@@ -3885,6 +3900,8 @@ void close_ctree(struct btrfs_root *root)
 
 	btrfs_free_qgroup_config(fs_info);
 
+	btrfs_dedupe_cleanup(fs_info);
+
 	if (percpu_counter_sum(&fs_info->delalloc_bytes)) {
 		btrfs_info(fs_info, "at unmount delalloc count %lld",
 		       percpu_counter_sum(&fs_info->delalloc_bytes));
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d00..42c4ff2 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -70,6 +70,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
 int btrfs_init_fs_root(struct btrfs_root *root);
 int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
 			 struct btrfs_root *root);
+void free_root_extent_buffers(struct btrfs_root *root);
 void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info);
 
 struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 12/27] btrfs: dedupe: Add support for on-disk hash search
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (10 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 13/27] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

With this patch, the on-disk backend is able to search hashes.
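
Since only the last 64 bits of the hash fit into the key objectid, the
search has to walk every candidate item sharing that objectid and compare
the full hash stored in the item body. A condensed sketch of that walk
(matching ondisk_search_hash() below, error handling dropped):

  memcpy(&key.objectid, hash + hash_len - 8, 8);  /* last 64 bits only */
  key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
  key.offset = (u64)-1;

  btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
  while (btrfs_previous_item(dedupe_root, path, key.objectid,
                             BTRFS_DEDUPE_HASH_ITEM_KEY) == 0) {
          /* read the on-disk hash from the item body and memcmp() it
           * against the full hash; on a match, this item's key.offset
           * is the bytenr and its len field the extent size */
  }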

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++-------
 fs/btrfs/dedupe.h |   1 +
 2 files changed, 118 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 1112fec..f73a4c7 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -595,6 +595,79 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	return ret;
 }
 
+ /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	u8 *buf = NULL;
+	u64 hash_key;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	buf = kmalloc(hash_len, GFP_NOFS);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memcpy(&hash_key, hash + hash_len - 8, 8);
+	key.objectid = hash_key;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	WARN_ON(ret == 0);
+	while (1) {
+		struct extent_buffer *node;
+		struct btrfs_dedupe_hash_item *hash_item;
+		int slot;
+
+		ret = btrfs_previous_item(dedupe_root, path, hash_key,
+					  BTRFS_DEDUPE_HASH_ITEM_KEY);
+		if (ret < 0)
+			goto out;
+		if (ret > 0) {
+			ret = 0;
+			goto out;
+		}
+
+		node = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(node, &key, slot);
+
+		if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+		    memcmp(&key.objectid, hash + hash_len - 8, 8))
+			break;
+		hash_item = btrfs_item_ptr(node, slot,
+				struct btrfs_dedupe_hash_item);
+		read_extent_buffer(node, buf, (unsigned long)(hash_item + 1),
+				   hash_len);
+		if (!memcmp(buf, hash, hash_len)) {
+			ret = 1;
+			*bytenr_ret = key.offset;
+			*num_bytes_ret = btrfs_dedupe_hash_len(node, hash_item);
+			break;
+		}
+	}
+out:
+	kfree(buf);
+	btrfs_free_path(path);
+	return ret;
+}
+
 /*
  * Caller must ensure the corresponding ref head is not being run.
  */
@@ -625,9 +698,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
 	return NULL;
 }
 
-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-			struct inode *inode, u64 file_pos,
-			struct btrfs_dedupe_hash *hash)
+/* Wapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+				      u8 *hash, u64 *bytenr_ret,
+				      u32 *num_bytes_ret)
+{
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		struct inmem_hash *found_hash;
+		int ret;
+
+		found_hash = inmem_search_hash(dedupe_info, hash);
+		if (found_hash) {
+			ret = 1;
+			*bytenr_ret = found_hash->bytenr;
+			*num_bytes_ret = found_hash->num_bytes;
+		} else {
+			ret = 0;
+			*bytenr_ret = 0;
+			*num_bytes_ret = 0;
+		}
+		return ret;
+	} else if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		return ondisk_search_hash(dedupe_info, hash, bytenr_ret,
+					  num_bytes_ret);
+	}
+	return -EINVAL;
+}
+
+static int generic_search(struct btrfs_dedupe_info *dedupe_info,
+			  struct inode *inode, u64 file_pos,
+			  struct btrfs_dedupe_hash *hash)
 {
 	int ret;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -637,9 +737,9 @@ static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
 	struct btrfs_delayed_ref_head *insert_head;
 	struct btrfs_delayed_data_ref *insert_dref;
 	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
-	struct inmem_hash *found_hash;
 	int free_insert = 1;
 	u64 bytenr;
+	u64 tmp_bytenr;
 	u32 num_bytes;
 
 	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
@@ -671,14 +771,9 @@ static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
 
 again:
 	mutex_lock(&dedupe_info->lock);
-	found_hash = inmem_search_hash(dedupe_info, hash->hash);
-	/* If we don't find a duplicated extent, just return. */
-	if (!found_hash) {
-		ret = 0;
+	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	if (ret <= 0)
 		goto out;
-	}
-	bytenr = found_hash->bytenr;
-	num_bytes = found_hash->num_bytes;
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
@@ -717,12 +812,17 @@ again:
 
 	mutex_lock(&dedupe_info->lock);
 	/* Search again to ensure the hash is still here */
-	found_hash = inmem_search_hash(dedupe_info, hash->hash);
-	if (!found_hash) {
-		ret = 0;
+	ret = generic_search_hash(dedupe_info, hash->hash, &tmp_bytenr,
+				  &num_bytes);
+	if (ret <= 0) {
 		mutex_unlock(&head->mutex);
 		goto out;
 	}
+	if (tmp_bytenr != bytenr) {
+		mutex_unlock(&head->mutex);
+		mutex_unlock(&dedupe_info->lock);
+		goto again;
+	}
 	hash->bytenr = bytenr;
 	hash->num_bytes = num_bytes;
 
@@ -767,8 +867,9 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	if (WARN_ON(btrfs_dedupe_hash_hit(hash)))
 		return -EINVAL;
 
-	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
-		ret = inmem_search(dedupe_info, inode, file_pos, hash);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY ||
+	    dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		ret = generic_search(dedupe_info, inode, file_pos, hash);
 
 	/* It's possible hash->bytenr/num_bytenr already changed */
 	if (ret == 0) {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 120e630..467ddd5 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -146,6 +146,7 @@ int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
  * *INCREASED*, and hash->bytenr/num_bytes will record the existing
  * extent data.
  * Return 0 for a hash miss. Nothing is done
+ * Return < 0 for error
  */
 int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 			struct inode *inode, u64 file_pos,
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 13/27] btrfs: dedupe: Add support to delete hash for on-disk backend
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (11 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 12/27] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 14/27] btrfs: dedupe: Add support for adding " Qu Wenruo
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

With this patch, the on-disk backend can delete hashes.
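
Deletion is driven by bytenr, since at extent free time only the bytenr
is known, not the hash, so it is a two-step walk. A condensed sketch of
the keys involved (matching ondisk_del() below, search and error handling
dropped):

  /* step 1: look up and delete the bytenr -> hash item; its key.offset
   * gives us the last 64 bits of the hash */
  key.objectid = bytenr;
  key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
  key.offset = (u64)-1;

  /* step 2: delete the mirrored hash -> bytenr item */
  key.objectid = hash_tail;       /* the key.offset found in step 1 */
  key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
  key.offset = bytenr;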

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index f73a4c7..c38137e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -468,6 +468,104 @@ static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
 	return 0;
 }
 
+/*
+ * If prepare_del is given, this will setup search_slot() for delete.
+ * Caller needs to do proper locking.
+ *
+ * Return > 0 for found.
+ * Return 0 for not found.
+ * Return < 0 for error.
+ */
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del)
+{
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	int ret;
+	int ins_len = 0;
+	int cow = 0;
+
+	if (prepare_del) {
+		if (WARN_ON(trans == NULL))
+			return -EINVAL;
+		cow = 1;
+		ins_len = -1;
+	}
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path,
+				ins_len, cow);
+
+	if (ret < 0)
+		return ret;
+	/*
+	 * Although it's almost impossible, it's still possible that
+	 * the last 64bits are all 1.
+	 */
+	if (ret == 0)
+		return 1;
+
+	ret = btrfs_previous_item(dedupe_root, path, bytenr,
+				  BTRFS_DEDUPE_BYTENR_ITEM_KEY);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return 0;
+	return 1;
+}
+
+static int ondisk_del(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = 0;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(trans, dedupe_info, path, bytenr, 1);
+	if (ret <= 0)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	ret = btrfs_del_item(trans, dedupe_root, path);
+	btrfs_release_path(path);
+	if (ret < 0)
+		goto out;
+	/* Search for hash item and delete it */
+	key.objectid = key.offset;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = bytenr;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+	if (WARN_ON(ret > 0)) {
+		ret = -ENOENT;
+		goto out;
+	}
+	if (ret < 0)
+		goto out;
+	ret = btrfs_del_item(trans, dedupe_root, path);
+
+out:
+	btrfs_free_path(path);
+	mutex_unlock(&dedupe_info->lock);
+	return ret;
+}
+
 /* Remove a dedupe hash from dedupe tree */
 int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info, u64 bytenr)
@@ -482,6 +580,8 @@ int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_del(dedupe_info, bytenr);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_del(trans, dedupe_info, bytenr);
 	return -EINVAL;
 }
 
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 14/27] btrfs: dedupe: Add support for adding hash for on-disk backend
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (12 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 13/27] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband deduplication Qu Wenruo
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

With this patch, the on-disk backend can add hashes.
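
A condensed sketch of what the add path checks before inserting anything
(matching ondisk_add() below, locking and error handling dropped); both
directions are probed first, so nothing is inserted twice, and a hash
already indexed for another bytenr is skipped to save dedupe tree space:

  if (ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0) > 0)
          return 0;       /* this bytenr is already indexed */
  if (ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes) > 0)
          return 0;       /* same content already indexed elsewhere */
  /* otherwise insert the DEDUPE_HASH_ITEM, then the DEDUPE_BYTENR_ITEM */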

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 83 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index c38137e..6a80afc 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -405,6 +405,87 @@ out:
 	return 0;
 }
 
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info,
+		      struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_key key;
+	struct btrfs_dedupe_hash_item *hash_item;
+	u64 bytenr;
+	u32 num_bytes;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	btrfs_release_path(path);
+
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	if (ret < 0)
+		goto out;
+	/* Same hash found, don't re-add to save dedupe tree space */
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+
+	/* Insert hash->bytenr item */
+	memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = hash->bytenr;
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+			sizeof(*hash_item) + hash_len);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				   struct btrfs_dedupe_hash_item);
+	btrfs_set_dedupe_hash_len(path->nodes[0], hash_item, hash->num_bytes);
+	write_extent_buffer(path->nodes[0], hash->hash,
+			    (unsigned long)(hash_item + 1), hash_len);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+	btrfs_release_path(path);
+
+	/* Then bytenr->hash item */
+	key.objectid = hash->bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key, hash_len);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	write_extent_buffer(path->nodes[0], hash->hash,
+			btrfs_item_ptr_offset(path->nodes[0], path->slots[0]),
+			hash_len);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info,
 		     struct btrfs_dedupe_hash *hash)
@@ -426,6 +507,8 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_add(dedupe_info, hash);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_add(trans, dedupe_info, hash);
 	return -EINVAL;
 }
 
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband deduplication
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (13 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 14/27] btrfs: dedupe: Add support for adding " Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  2:29   ` kbuild test robot
  2016-03-22  2:48   ` kbuild test robot
  2016-03-22  1:35 ` [PATCH v8 16/27] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
                   ` (13 subsequent siblings)
  28 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Add an ioctl interface for inband deduplication, which includes:
1) enable
2) disable
3) status

We will later add an ioctl to disable inband dedupe for a given file/dir.
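
For reference, a minimal user-space sketch of driving the new ioctl,
assuming a kernel carrying this patchset (the struct and ioctl number are
the ones added to include/uapi/linux/btrfs.h below); the numeric backend
and hash_type values used here, 0 for the in-memory backend and for
SHA256, are assumptions since the real constants are kernel-internal:

  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>
  #include <linux/btrfs.h>

  int main(int argc, char **argv)
  {
          struct btrfs_ioctl_dedupe_args dargs;
          int fd;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);  /* any file/dir on the btrfs mount */
          if (fd < 0)
                  return 1;

          memset(&dargs, 0, sizeof(dargs));
          dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE;
          dargs.blocksize = 128 * 1024;  /* dedupe block size */
          dargs.backend = 0;             /* assumed: in-memory backend */
          dargs.hash_type = 0;           /* assumed: SHA256 */

          if (ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs) < 0) {
                  perror("BTRFS_IOC_DEDUPE_CTL");
                  close(fd);
                  return 1;
          }
          printf("dedupe enabled, limit_nr=%llu, current_nr=%llu\n",
                 (unsigned long long)dargs.limit_nr,
                 (unsigned long long)dargs.current_nr);
          close(fd);
          return 0;
  }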

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/dedupe.c          | 48 ++++++++++++++++++++++++++++++----
 fs/btrfs/dedupe.h          | 10 +++++++-
 fs/btrfs/ioctl.c           | 64 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/sysfs.c           |  2 ++
 include/uapi/linux/btrfs.h | 25 +++++++++++++++++-
 5 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 6a80afc..294cbb5 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -135,12 +135,12 @@ out:
 }
 
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
-			u64 blocksize, u64 limit_nr)
+			u64 blocksize, u64 limit_nr, u64 limit_mem)
 {
 	struct btrfs_dedupe_info *dedupe_info;
 	int create_tree;
 	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
-	u64 limit = limit_nr;
+	u64 limit;
 	int ret = 0;
 
 	/* Sanity check */
@@ -153,11 +153,22 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 		return -EINVAL;
 	if (backend >= BTRFS_DEDUPE_BACKEND_COUNT)
 		return -EINVAL;
+	/* Only one limit is accept */
+	if (limit_nr && limit_mem)
+		return -EINVAL;
 
-	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY && limit_nr == 0)
-		limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
-	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK && limit_nr != 0)
+	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		if (!limit_nr && !limit_mem)
+			limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
+		else if (limit_nr)
+			limit = limit_nr;
+		else
+			limit = limit_mem / (sizeof(struct inmem_hash) +
+					btrfs_dedupe_sizes[type]);
+	}
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
 		limit = 0;
+
 	/* Ondisk backend needs DEDUP RO compat feature */
 	if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE) &&
 	    backend == BTRFS_DEDUPE_BACKEND_ONDISK)
@@ -209,6 +220,33 @@ out:
 	return ret;
 }
 
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !dedupe_info) {
+		dargs->status = 0;
+		dargs->blocksize = 0;
+		dargs->backend = 0;
+		dargs->hash_type = 0;
+		dargs->limit_nr = 0;
+		dargs->current_nr = 0;
+		return;
+	}
+	mutex_lock(&dedupe_info->lock);
+	dargs->status = 1;
+	dargs->blocksize = dedupe_info->blocksize;
+	dargs->backend = dedupe_info->backend;
+	dargs->hash_type = dedupe_info->hash_type;
+	dargs->limit_nr = dedupe_info->limit_nr;
+	dargs->limit_mem = dedupe_info->limit_nr *
+		(sizeof(struct inmem_hash) +
+		 btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	dargs->current_nr = dedupe_info->current_nr;
+	mutex_unlock(&dedupe_info->lock);
+}
+
 int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
 			struct btrfs_root *dedupe_root)
 {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 467ddd5..60479b1 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -104,7 +104,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
  * Called at dedupe enable time.
  */
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
-			u64 blocksize, u64 limit_nr);
+			u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
 
 /*
  * Disable dedupe and invalidate all its dedupe data.
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 053e677..49bca5f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -61,6 +61,7 @@
 #include "qgroup.h"
 #include "tree-log.h"
 #include "compression.h"
+#include "dedupe.h"
 
 #ifdef CONFIG_64BIT
 /* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
@@ -3206,6 +3207,67 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, u64 loff, u64 olen,
 	return olen;
 }
 
+static long btrfs_ioctl_dedupe_ctl(struct btrfs_root *root, void __user *args)
+{
+	struct btrfs_ioctl_dedupe_args *dargs;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	dargs = memdup_user(args, sizeof(*dargs));
+	if (IS_ERR(dargs)) {
+		ret = PTR_ERR(dargs);
+		return ret;
+	}
+
+	if (dargs->cmd >= BTRFS_DEDUPE_CTL_LAST) {
+		ret = -EINVAL;
+		goto out;
+	}
+	switch (dargs->cmd) {
+	case BTRFS_DEDUPE_CTL_ENABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_enable(fs_info, dargs->hash_type,
+					 dargs->backend, dargs->blocksize,
+					 dargs->limit_nr, dargs->limit_mem);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		if (ret < 0)
+			break;
+
+		/* Also copy the result to caller for further use */
+		btrfs_dedupe_status(fs_info, dargs);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	case BTRFS_DEDUPE_CTL_DISABLE:
+		mutex_lock(&fs_info->dedupe_ioctl_lock);
+		ret = btrfs_dedupe_disable(fs_info);
+		mutex_unlock(&fs_info->dedupe_ioctl_lock);
+		break;
+	case BTRFS_DEDUPE_CTL_STATUS:
+		btrfs_dedupe_status(fs_info, dargs);
+		if (copy_to_user(args, dargs, sizeof(*dargs)))
+			ret = -EFAULT;
+		else
+			ret = 0;
+		break;
+	default:
+		/*
+		 * Use this return value to inform progs that kernel
+		 * doesn't support such new command.
+		 */
+		ret = -EOPNOTSUPP;
+		break;
+	}
+out:
+	kfree(dargs);
+	return ret;
+}
+
 static int clone_finish_inode_update(struct btrfs_trans_handle *trans,
 				     struct inode *inode,
 				     u64 endoff,
@@ -5542,6 +5604,8 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_get_fslabel(file, argp);
 	case BTRFS_IOC_SET_FSLABEL:
 		return btrfs_ioctl_set_fslabel(file, argp);
+	case BTRFS_IOC_DEDUPE_CTL:
+		return btrfs_ioctl_dedupe_ctl(root, argp);
 	case BTRFS_IOC_GET_SUPPORTED_FEATURES:
 		return btrfs_ioctl_get_supported_features(argp);
 	case BTRFS_IOC_GET_FEATURES:
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 539e7b5..18686d1 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -203,6 +203,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
+BTRFS_FEAT_ATTR_COMPAT_RO(dedupe, DEDUPE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -215,6 +216,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(dedupe),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..de08f53 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -359,7 +359,7 @@ struct btrfs_ioctl_same_extent_info {
 	__u64 bytes_deduped;	/* out - total # of bytes we were able
 				 * to dedupe from this file */
 	/* status of this dedupe operation:
-	 * 0 if dedup succeeds
+	 * 0 if dedupe succeeds
 	 * < 0 for error
 	 * == BTRFS_SAME_DATA_DIFFERS if data differs
 	 */
@@ -445,6 +445,27 @@ struct btrfs_ioctl_get_dev_stats {
 	__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */
 };
 
+/*
+ * de-duplication control modes
+ * For re-config, re-enable will handle it
+ */
+#define BTRFS_DEDUPE_CTL_ENABLE	1
+#define BTRFS_DEDUPE_CTL_DISABLE 2
+#define BTRFS_DEDUPE_CTL_STATUS	3
+#define BTRFS_DEDUPE_CTL_LAST	4
+struct btrfs_ioctl_dedupe_args {
+	__u16 cmd;		/* In: command(see above macro) */
+	__u64 blocksize;	/* In/Out: For enable/status */
+	__u64 limit_nr;		/* In/Out: For enable/status */
+	__u64 limit_mem;	/* In/Out: For enable/status */
+	__u64 current_nr;	/* Out: For status output */
+	__u16 backend;		/* In/Out: For enable/status */
+	__u16 hash_type;	/* In/Out: For enable/status */
+	u8 status;		/* Out: For status output */
+	/* pad to 512 bytes */
+	u8 __unused[473];
+};
+
 #define BTRFS_QUOTA_CTL_ENABLE	1
 #define BTRFS_QUOTA_CTL_DISABLE	2
 #define BTRFS_QUOTA_CTL_RESCAN__NOTUSED	3
@@ -653,6 +674,8 @@ static inline char *btrfs_err_str(enum btrfs_err_code err_code)
 				    struct btrfs_ioctl_dev_replace_args)
 #define BTRFS_IOC_FILE_EXTENT_SAME _IOWR(BTRFS_IOCTL_MAGIC, 54, \
 					 struct btrfs_ioctl_same_args)
+#define BTRFS_IOC_DEDUPE_CTL	_IOWR(BTRFS_IOCTL_MAGIC, 55, \
+				      struct btrfs_ioctl_dedupe_args)
 #define BTRFS_IOC_GET_FEATURES _IOR(BTRFS_IOCTL_MAGIC, 57, \
 				   struct btrfs_ioctl_feature_flags)
 #define BTRFS_IOC_SET_FEATURES _IOW(BTRFS_IOCTL_MAGIC, 57, \
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 16/27] btrfs: dedupe: add an inode nodedupe flag
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (14 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband deduplication Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 17/27] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce the BTRFS_INODE_NODEDUPE flag, so that we can explicitly
disable online data deduplication for specified files.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/ioctl.c | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index bed9273..b19c1f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2428,6 +2428,7 @@ do {                                                                   \
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
 #define BTRFS_INODE_COMPRESS		(1 << 11)
+#define BTRFS_INODE_NODEDUPE		(1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 49bca5f..3c226b0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -161,7 +161,8 @@ void btrfs_update_iflags(struct inode *inode)
 /*
  * Inherit flags from the parent inode.
  *
- * Currently only the compression flags and the cow flags are inherited.
+ * Currently only the compression flags, dedupe flags and the cow flags
+ * are inherited.
  */
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 {
@@ -186,6 +187,9 @@ void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 			BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
 	}
 
+	if (flags & BTRFS_INODE_NODEDUPE)
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+
 	btrfs_update_iflags(inode);
 }
 
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 17/27] btrfs: dedupe: add a property handler for online dedupe
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (15 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 16/27] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 18/27] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

We use the btrfs extended attribute "btrfs.dedupe" to record the per-file
online dedupe status, so add a dedupe property handler.
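
As a usage illustration (not part of this patch), the property can be set
through the regular xattr interface, e.g. "setfattr -n btrfs.dedupe -v
disable <file>", or directly from C:

  #include <string.h>
  #include <sys/xattr.h>

  int main(void)
  {
          /* mark one file as excluded from inband dedupe */
          return setxattr("/mnt/btrfs/file", "btrfs.dedupe", "disable",
                          strlen("disable"), 0);
  }

Clearing the attribute again re-enables dedupe for that file, which is
the len == 0 case handled in prop_dedupe_apply() below.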

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/props.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 3699212..a430886 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -42,6 +42,11 @@ static int prop_compression_apply(struct inode *inode,
 				  size_t len);
 static const char *prop_compression_extract(struct inode *inode);
 
+static int prop_dedupe_validate(const char *value, size_t len);
+static int prop_dedupe_apply(struct inode *inode, const char *value,
+			     size_t len);
+static const char *prop_dedupe_extract(struct inode *inode);
+
 static struct prop_handler prop_handlers[] = {
 	{
 		.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -50,6 +55,13 @@ static struct prop_handler prop_handlers[] = {
 		.extract = prop_compression_extract,
 		.inheritable = 1
 	},
+	{
+		.xattr_name = XATTR_BTRFS_PREFIX "dedupe",
+		.validate = prop_dedupe_validate,
+		.apply = prop_dedupe_apply,
+		.extract = prop_dedupe_extract,
+		.inheritable = 1
+	},
 };
 
 void __init btrfs_props_init(void)
@@ -426,4 +438,33 @@ static const char *prop_compression_extract(struct inode *inode)
 	return NULL;
 }
 
+static int prop_dedupe_validate(const char *value, size_t len)
+{
+	if (!strncmp("disable", value, len))
+		return 0;
+
+	return -EINVAL;
+}
+
+static int prop_dedupe_apply(struct inode *inode, const char *value, size_t len)
+{
+	if (len == 0) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	if (!strncmp("disable", value, len)) {
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static const char *prop_dedupe_extract(struct inode *inode)
+{
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return "disable";
 
+	return NULL;
+}
-- 
2.7.3




^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v8 18/27] btrfs: dedupe: add per-file online dedupe control
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (16 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 17/27] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space Qu Wenruo
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Introduce inode_need_dedupe() to implement per-file online dedupe control.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 13ae366..979811c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -676,6 +676,18 @@ static void free_async_extent_pages(struct async_extent *async_extent)
 	async_extent->pages = NULL;
 }
 
+static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
+				    struct inode *inode)
+{
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return 0;
+
+	return 1;
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -1635,7 +1647,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+	} else if (!inode_need_compress(inode) &&
+		   !inode_need_dedupe(fs_info, inode)) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1, NULL);
 	} else {
-- 
2.7.3





* [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (17 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 18/27] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-04-22 18:06   ` Josef Bacik
  2016-03-22  1:35 ` [PATCH v8 20/27] btrfs: dedupe: Fix a bug when running inband dedupe with balance Qu Wenruo
                   ` (9 subsequent siblings)
  28 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
to reserve is calculated by the difference between outstanding_extents and
reserved_extents.

When reserve_metadata_bytes() fails to reserve the desired metadata space,
it has already done some reclaim work, such as writing ordered extents.

In that case, outstanding_extents and reserved_extents may already have
changed, and a retry may then be able to reserve enough metadata space.

So this patch calls reserve_metadata_bytes() up to 3 times before
concluding that we have really run out of space.

Such false ENOSPC reports are mainly caused by small file extents and
time-consuming delalloc functions, which mostly affects in-band
de-duplication. (Compression should also be affected, but LZO/zlib is
faster than SHA256, so it is harder to trigger there than with dedupe.)

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dabd721..016d2ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 				 * a new extent is reserved, then deleted
 				 * in one trans, and inc/dec get merged to 0.
 				 *
-				 * In this case, we need to remove its dedup
+				 * In this case, we need to remove its dedupe
 				 * hash.
 				 */
 				btrfs_dedupe_del(trans, fs_info, node->bytenr);
@@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
+	int loops = 0;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit.  We also don't need the delalloc
@@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	    btrfs_transaction_in_commit(root->fs_info))
 		schedule_timeout(1);
 
+	num_bytes = ALIGN(num_bytes, root->sectorsize);
+
+again:
 	if (delalloc_lock)
 		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
-
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
 					 BTRFS_MAX_EXTENT_SIZE - 1,
@@ -5815,6 +5817,23 @@ out_fail:
 	}
 	if (delalloc_lock)
 		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
+	/*
+	 * The number of metadata bytes is calculated by the difference
+	 * between outstanding_extents and reserved_extents. Even when
+	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
+	 * it has already done some work to reclaim metadata space, hence
+	 * both outstanding_extents and reserved_extents may have changed and
+	 * the bytes we try to reserve may also have changed (and be smaller).
+	 * So here we try to reserve again. This is especially useful for
+	 * online dedupe, which can easily eat up almost all metadata space.
+	 *
+	 * XXX: The retry count of 3 is chosen arbitrarily; it is a workaround
+	 * for online dedupe, later we should find a better method to avoid
+	 * the dedupe enospc issue.
+	 */
+	if (unlikely(ret == -ENOSPC && loops++ < 3))
+		goto again;
+
 	return ret;
 }
 
-- 
2.7.3





* [PATCH v8 20/27] btrfs: dedupe: Fix a bug when running inband dedupe with balance
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (18 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 21/27] btrfs: Fix a memory leak in inband dedupe hash Qu Wenruo
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

When running inband dedupe with balance, it's possible that inband dedupe
still increases refs on extents which are in a RO chunk.

This may either cause find_data_references() to give a warning, or make
run_delayed_refs() return -EIO and abort the transaction.

The cause is that the normal dedupe_del() is only called at run_delayed_ref()
time, which is too late for the balance case.

This patch fixes the bug by calling dedupe_del() at the extent searching
time of balance.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/relocation.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 2bd0011..71a5cd0 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -31,6 +31,7 @@
 #include "async-thread.h"
 #include "free-space-cache.h"
 #include "inode-map.h"
+#include "dedupe.h"
 
 /*
  * backref_node, mapping_node and tree_block start with this
@@ -3909,6 +3910,7 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	struct btrfs_trans_handle *trans = NULL;
 	struct btrfs_path *path;
 	struct btrfs_extent_item *ei;
+	struct btrfs_fs_info *fs_info = rc->extent_root->fs_info;
 	u64 flags;
 	u32 item_size;
 	int ret;
@@ -4032,6 +4034,20 @@ restart:
 			}
 		}
 
+		/*
+		 * This data extent will be replaced, but normal
+		 * dedupe_del() will only happen at run_delayed_ref()
+		 * time, which is too late, so delete dedupe hash early
+		 * to prevent its ref from being increased.
+		 */
+		if (rc->stage == MOVE_DATA_EXTENTS &&
+		    (flags & BTRFS_EXTENT_FLAG_DATA)) {
+			ret = btrfs_dedupe_del(trans, fs_info, key.objectid);
+			if (ret < 0) {
+				err = ret;
+				break;
+			}
+		}
 		btrfs_end_transaction_throttle(trans, rc->extent_root);
 		btrfs_btree_balance_dirty(rc->extent_root);
 		trans = NULL;
-- 
2.7.3





* [PATCH v8 21/27] btrfs: Fix a memory leak in inband dedupe hash
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (19 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 20/27] btrfs: dedupe: Fix a bug when running inband dedupe with balance Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 22/27] btrfs: dedupe: Fix metadata balance error when dedupe is enabled Qu Wenruo
                   ` (7 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

We allocate a dedupe hash for the async_extent, but forget to free it.
Fix it by freeing the hash before freeing the async_extent.

Reported-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 979811c..81b19193 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -751,6 +751,7 @@ retry:
 						  WB_SYNC_ALL);
 			else if (ret)
 				unlock_page(async_cow->locked_page);
+			kfree(hash);
 			kfree(async_extent);
 			cond_resched();
 			continue;
@@ -876,6 +877,7 @@ retry:
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		kfree(hash);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -892,6 +894,7 @@ out_free:
 				     PAGE_SET_WRITEBACK | PAGE_END_WRITEBACK |
 				     PAGE_SET_ERROR);
 	free_async_extent_pages(async_extent);
+	kfree(hash);
 	kfree(async_extent);
 	goto again;
 }
-- 
2.7.3





* [PATCH v8 22/27] btrfs: dedupe: Fix metadata balance error when dedupe is enabled
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (20 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 21/27] btrfs: Fix a memory leak in inband dedupe hash Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 23/27] btrfs: dedupe: Avoid submit IO for hash hit extent Qu Wenruo
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

A missing branch in btrfs_get_fs_root() causes dedupe_root to be read from
disk, with the REF_COWS bit set.
This makes btrfs balance treat dedupe_root as an fs root, and reuse the
old dedupe root bytenr to drop the tree ref, causing the following kernel
warning after metadata balancing:

BTRFS error (device sdb6): unable to find ref byte nr 29736960 parent 0
root 11  owner 0 offset 0
------------[ cut here ]------------
WARNING: CPU: 1 PID: 19113 at fs/btrfs/extent-tree.c:6636
__btrfs_free_extent.isra.66+0xb6d/0xd20 [btrfs]()
BTRFS: Transaction aborted (error -2)
Modules linked in: btrfs(O) xor zlib_deflate raid6_pq xfs [last
unloaded: btrfs]
CPU: 1 PID: 19113 Comm: btrfs Tainted: G        W  O    4.5.0-rc5+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox
12/01/2006
 0000000000000000 ffff880035b0ba18 ffffffff813771ff ffff880035b0ba60
 ffffffffa06a810a ffff880035b0ba50 ffffffff810bcb81 ffff88003c45c528
 0000000001c5c000 00000000fffffffe ffff88003dc8c520 0000000000000000
Call Trace:
 [<ffffffff813771ff>] dump_stack+0x67/0x98
 [<ffffffff810bcb81>] warn_slowpath_common+0x81/0xc0
 [<ffffffff810bcc07>] warn_slowpath_fmt+0x47/0x50
 [<ffffffffa06028fd>] __btrfs_free_extent.isra.66+0xb6d/0xd20 [btrfs]
 [<ffffffffa0606d4d>] __btrfs_run_delayed_refs.constprop.71+0x96d/0x1560
[btrfs]
 [<ffffffff81202ad9>] ? cmpxchg_double_slab.isra.68+0x149/0x160
 [<ffffffff81106a1d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa060a5ce>] btrfs_run_delayed_refs+0x8e/0x2d0 [btrfs]
 [<ffffffffa06209fe>] btrfs_commit_transaction+0x3e/0xb50 [btrfs]
 [<ffffffffa069f26e>] ? btrfs_dedupe_disable+0x28e/0x2c0 [btrfs]
 [<ffffffff812035c3>] ? kfree+0x223/0x270
 [<ffffffffa069f27a>] btrfs_dedupe_disable+0x29a/0x2c0 [btrfs]
 [<ffffffffa065e403>] btrfs_ioctl+0x2363/0x2a40 [btrfs]
 [<ffffffff8116b12a>] ? __audit_syscall_entry+0xaa/0xf0
 [<ffffffff81137ce6>] ? current_kernel_time64+0x56/0xa0
 [<ffffffff8122080e>] do_vfs_ioctl+0x8e/0x690
 [<ffffffff8116b12a>] ? __audit_syscall_entry+0xaa/0xf0
 [<ffffffff8122c181>] ? __fget_light+0x61/0x90
 [<ffffffff81220e84>] SyS_ioctl+0x74/0x80
 [<ffffffff8180ad57>] entry_SYSCALL_64_fastpath+0x12/0x6f
---[ end trace 618d5a5bc21d6a7c ]---

Fix it by adding the corresponding branch to btrfs_get_fs_root().

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/disk-io.c    | 5 +++++
 fs/btrfs/relocation.c | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 44d098d..ec9fff3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1679,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
 		return fs_info->free_space_root ? fs_info->free_space_root :
 						  ERR_PTR(-ENOENT);
+	if (location->objectid == BTRFS_DEDUPE_TREE_OBJECTID) {
+		if (fs_info->dedupe_enabled && fs_info->dedupe_info)
+			return fs_info->dedupe_info->dedupe_root;
+		return ERR_PTR(-ENOENT);
+	}
 again:
 	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 71a5cd0..74fd5d9 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -577,7 +577,8 @@ static int is_cowonly_root(u64 root_objectid)
 	    root_objectid == BTRFS_CSUM_TREE_OBJECTID ||
 	    root_objectid == BTRFS_UUID_TREE_OBJECTID ||
 	    root_objectid == BTRFS_QUOTA_TREE_OBJECTID ||
-	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
+	    root_objectid == BTRFS_FREE_SPACE_TREE_OBJECTID ||
+	    root_objectid == BTRFS_DEDUPE_TREE_OBJECTID)
 		return 1;
 	return 0;
 }
-- 
2.7.3





* [PATCH v8 23/27] btrfs: dedupe: Avoid submit IO for hash hit extent
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (21 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 22/27] btrfs: dedupe: Fix metadata balance error when dedupe is enabled Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 24/27] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

Before this patch, even duplicated extents still went through page
writeback, meaning we didn't skip the IO for them.

Such writes would be skipped at the block level anyway, as the block level
will only select the last submitted write request to the same bytenr.

This patch manually skips such IO to reduce dedupe overhead.
After this patch, performance of the dedupe all-miss case is higher than
that of the low compression ratio case.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 81b19193..b22663c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -688,6 +688,38 @@ static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
 	return 1;
 }
 
+static void end_dedupe_extent(struct inode *inode, u64 start,
+			      u32 len, unsigned long page_ops)
+{
+	int i;
+	unsigned nr_pages = len / PAGE_CACHE_SIZE;
+	struct page *page;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = find_get_page(inode->i_mapping,
+				     start >> PAGE_CACHE_SHIFT);
+		/* page should be already locked by caller */
+		if (WARN_ON(!page))
+			continue;
+
+		/* We need to do this by ourselves as we skipped IO */
+		if (page_ops & PAGE_CLEAR_DIRTY)
+			clear_page_dirty_for_io(page);
+		if (page_ops & PAGE_SET_WRITEBACK)
+			set_page_writeback(page);
+
+		end_extent_writepage(page, 0, start,
+				     start + PAGE_CACHE_SIZE - 1);
+		if (page_ops & PAGE_END_WRITEBACK)
+			end_page_writeback(page);
+		if (page_ops & PAGE_UNLOCK)
+			unlock_page(page);
+
+		start += PAGE_CACHE_SIZE;
+		page_cache_release(page);
+	}
+}
+
 /*
  * phase two of compressed writeback.  This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -742,14 +774,24 @@ retry:
 			 * and IO for us.  Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
-				extent_write_locked_range(io_tree,
-						  inode, async_extent->start,
-						  async_extent->start +
-						  async_extent->ram_size - 1,
-						  btrfs_get_extent,
-						  WB_SYNC_ALL);
-			else if (ret)
+			if (!page_started && !ret) {
+				/* Skip IO for dedupe async_extent */
+				if (btrfs_dedupe_hash_hit(hash))
+					end_dedupe_extent(inode,
+						async_extent->start,
+						async_extent->ram_size,
+						PAGE_CLEAR_DIRTY |
+						PAGE_SET_WRITEBACK |
+						PAGE_END_WRITEBACK |
+						PAGE_UNLOCK);
+				else
+					extent_write_locked_range(io_tree,
+						inode, async_extent->start,
+						async_extent->start +
+						async_extent->ram_size - 1,
+						btrfs_get_extent,
+						WB_SYNC_ALL);
+			} else if (ret)
 				unlock_page(async_cow->locked_page);
 			kfree(hash);
 			kfree(async_extent);
-- 
2.7.3





* [PATCH v8 24/27] btrfs: dedupe: Preparation for compress-dedupe co-work
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (22 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 23/27] btrfs: dedupe: Avoid submit IO for hash hit extent Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue Qu Wenruo
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

For dedupe to work with compression, new members recording compression
algorithm and on-disk extent length are needed.

Add them for later compress-dedupe co-work.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h        | 11 ++++++++-
 fs/btrfs/dedupe.c       | 64 +++++++++++++++++++++++++++++++++++++++----------
 fs/btrfs/dedupe.h       |  2 ++
 fs/btrfs/inode.c        |  2 ++
 fs/btrfs/ordered-data.c |  2 ++
 5 files changed, 67 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b19c1f1..88702e1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -984,9 +984,14 @@ struct btrfs_dedupe_status_item {
  * Used for hash <-> bytenr search
  */
 struct btrfs_dedupe_hash_item {
-	/* length of dedupe range */
+	/* length of dedupe range in memory */
 	__le32 len;
 
+	/* length of dedupe range on disk */
+	__le32 disk_len;
+
+	u8 compression;
+
 	/* Hash follows */
 } __attribute__ ((__packed__));
 
@@ -3324,6 +3329,10 @@ BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
 
 /* btrfs_dedupe_hash_item */
 BTRFS_SETGET_FUNCS(dedupe_hash_len, struct btrfs_dedupe_hash_item, len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_disk_len, struct btrfs_dedupe_hash_item,
+		   disk_len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_compression, struct btrfs_dedupe_hash_item,
+		   compression, 8);
 
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 294cbb5..1c89c8f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -31,6 +31,8 @@ struct inmem_hash {
 
 	u64 bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	u8 hash[];
 };
@@ -403,6 +405,8 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
 	/* Copy the data out */
 	ihash->bytenr = hash->bytenr;
 	ihash->num_bytes = hash->num_bytes;
+	ihash->disk_num_bytes = hash->disk_num_bytes;
+	ihash->compression = hash->compression;
 	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
 
 	mutex_lock(&dedupe_info->lock);
@@ -448,7 +452,8 @@ static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, u64 bytenr,
 				int prepare_del);
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
-			      u64 *bytenr_ret, u32 *num_bytes_ret);
+			      u64 *bytenr_ret, u32 *num_bytes_ret,
+			      u32 *disk_num_bytes_ret, u8 *compression);
 static int ondisk_add(struct btrfs_trans_handle *trans,
 		      struct btrfs_dedupe_info *dedupe_info,
 		      struct btrfs_dedupe_hash *hash)
@@ -477,7 +482,8 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
-	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+				 NULL, NULL);
 	if (ret < 0)
 		goto out;
 	/* Same hash found, don't re-add to save dedupe tree space */
@@ -499,6 +505,10 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
 				   struct btrfs_dedupe_hash_item);
 	btrfs_set_dedupe_hash_len(path->nodes[0], hash_item, hash->num_bytes);
+	btrfs_set_dedupe_hash_disk_len(path->nodes[0], hash_item,
+				       hash->disk_num_bytes);
+	btrfs_set_dedupe_hash_compression(path->nodes[0], hash_item,
+					  hash->compression);
 	write_extent_buffer(path->nodes[0], hash->hash,
 			    (unsigned long)(hash_item + 1), hash_len);
 	btrfs_mark_buffer_dirty(path->nodes[0]);
@@ -822,7 +832,8 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
  * Return <0 for error
  */
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
-			      u64 *bytenr_ret, u32 *num_bytes_ret)
+			      u64 *bytenr_ret, u32 *num_bytes_ret,
+			      u32 *disk_num_bytes_ret, u8 *compression_ret)
 {
 	struct btrfs_path *path;
 	struct btrfs_key key;
@@ -878,8 +889,19 @@ static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
 				   hash_len);
 		if (!memcmp(buf, hash, hash_len)) {
 			ret = 1;
-			*bytenr_ret = key.offset;
-			*num_bytes_ret = btrfs_dedupe_hash_len(node, hash_item);
+			if (bytenr_ret)
+				*bytenr_ret = key.offset;
+			if (num_bytes_ret)
+				*num_bytes_ret =
+					btrfs_dedupe_hash_len(node, hash_item);
+			if (disk_num_bytes_ret)
+				*disk_num_bytes_ret =
+					btrfs_dedupe_hash_disk_len(node,
+							hash_item);
+			if (compression_ret)
+				*compression_ret =
+					btrfs_dedupe_hash_compression(node,
+							hash_item);
 			break;
 		}
 	}
@@ -922,7 +944,9 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
 /* Wrapper for different backends, caller needs to hold dedupe_info->lock */
 static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
 				      u8 *hash, u64 *bytenr_ret,
-				      u32 *num_bytes_ret)
+				      u32 *num_bytes_ret,
+				      u32 *disk_num_bytes_ret,
+				      u8 *compression_ret)
 {
 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
 		struct inmem_hash *found_hash;
@@ -933,15 +957,20 @@ static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
 			ret = 1;
 			*bytenr_ret = found_hash->bytenr;
 			*num_bytes_ret = found_hash->num_bytes;
+			*disk_num_bytes_ret = found_hash->disk_num_bytes;
+			*compression_ret = found_hash->compression;
 		} else {
 			ret = 0;
 			*bytenr_ret = 0;
 			*num_bytes_ret = 0;
+			*disk_num_bytes_ret = 0;
+			*compression_ret = 0;
 		}
 		return ret;
 	} else if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
 		return ondisk_search_hash(dedupe_info, hash, bytenr_ret,
-					  num_bytes_ret);
+					  num_bytes_ret, disk_num_bytes_ret,
+					  compression_ret);
 	}
 	return -EINVAL;
 }
@@ -962,6 +991,8 @@ static int generic_search(struct btrfs_dedupe_info *dedupe_info,
 	u64 bytenr;
 	u64 tmp_bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
 	if (!insert_head)
@@ -992,7 +1023,8 @@ static int generic_search(struct btrfs_dedupe_info *dedupe_info,
 
 again:
 	mutex_lock(&dedupe_info->lock);
-	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	ret = generic_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+				  &disk_num_bytes, &compression);
 	if (ret <= 0)
 		goto out;
 
@@ -1008,15 +1040,17 @@ again:
 		 */
 		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
 				insert_dref, insert_head, insert_qrecord,
-				bytenr, num_bytes, 0, root->root_key.objectid,
-				btrfs_ino(inode), file_pos, 0,
-				BTRFS_ADD_DELAYED_REF);
+				bytenr, disk_num_bytes, 0,
+				root->root_key.objectid, btrfs_ino(inode),
+				file_pos, 0, BTRFS_ADD_DELAYED_REF);
 		spin_unlock(&delayed_refs->lock);
 
 		/* add_delayed_data_ref_locked will free unused memory */
 		free_insert = 0;
 		hash->bytenr = bytenr;
 		hash->num_bytes = num_bytes;
+		hash->disk_num_bytes = disk_num_bytes;
+		hash->compression = compression;
 		ret = 1;
 		goto out;
 	}
@@ -1034,7 +1068,7 @@ again:
 	mutex_lock(&dedupe_info->lock);
 	/* Search again to ensure the hash is still here */
 	ret = generic_search_hash(dedupe_info, hash->hash, &tmp_bytenr,
-				  &num_bytes);
+				  &num_bytes, &disk_num_bytes, &compression);
 	if (ret <= 0) {
 		mutex_unlock(&head->mutex);
 		goto out;
@@ -1046,12 +1080,14 @@ again:
 	}
 	hash->bytenr = bytenr;
 	hash->num_bytes = num_bytes;
+	hash->disk_num_bytes = disk_num_bytes;
+	hash->compression = compression;
 
 	/*
 	 * Increase the extent ref right now, to avoid delayed ref run
 	 * Or we may increase ref on non-exist extent.
 	 */
-	btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, 0,
+	btrfs_inc_extent_ref(trans, root, bytenr, disk_num_bytes, 0,
 			     root->root_key.objectid,
 			     btrfs_ino(inode), file_pos);
 	mutex_unlock(&head->mutex);
@@ -1096,6 +1132,8 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	if (ret == 0) {
 		hash->num_bytes = 0;
 		hash->bytenr = 0;
+		hash->disk_num_bytes = 0;
+		hash->compression = 0;
 	}
 	return ret;
 }
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index 60479b1..ac6eeb3 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -53,6 +53,8 @@ static int btrfs_dedupe_sizes[] = { 32 };
 struct btrfs_dedupe_hash {
 	u64 bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	/* last field is a variable length array of dedupe hash */
 	u8 hash[];
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b22663c..35d1ec4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2314,6 +2314,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (hash && hash->bytenr == 0) {
 		hash->bytenr = ins.objectid;
 		hash->num_bytes = ins.offset;
+		hash->disk_num_bytes = hash->num_bytes;
+		hash->compression = BTRFS_COMPRESS_NONE;
 		ret = btrfs_dedupe_add(trans, root->fs_info, hash);
 	}
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index ef24ad1..695c0e2 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -227,6 +227,8 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 		}
 		entry->hash->bytenr = hash->bytenr;
 		entry->hash->num_bytes = hash->num_bytes;
+		entry->hash->disk_num_bytes = hash->disk_num_bytes;
+		entry->hash->compression = hash->compression;
 		memcpy(entry->hash->hash, hash->hash,
 		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
 	}
-- 
2.7.3





* [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (23 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 24/27] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-24 20:35   ` Chris Mason
  2016-03-22  1:35 ` [PATCH v8 26/27] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
                   ` (3 subsequent siblings)
  28 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>

The basic idea is to still calculate the hash before compression, and to add
the members needed for dedupe to record compressed file extents.

Since dedupe supports a dedupe_bs larger than 128K, which is the upper limit
of a compressed file extent, in that case we skip dedupe and prefer
compression, as at that size the dedupe rate is low and the benefit of
compression is more obvious.

The current implementation is far from elegant. A more elegant one would
split every data processing method into its own independent function, and
have a unified function to coordinate them.

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dedupe.h       |   8 ++++
 fs/btrfs/inode.c        | 107 +++++++++++++++++++++++++++++++++++++-----------
 fs/btrfs/ordered-data.c |   5 ++-
 fs/btrfs/ordered-data.h |   3 +-
 4 files changed, 95 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index ac6eeb3..af488b1 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -22,6 +22,7 @@
 #include <linux/btrfs.h>
 #include <linux/wait.h>
 #include <crypto/hash.h>
+#include "compression.h"
 
 /*
  * Dedup storage backend
@@ -89,6 +90,13 @@ static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
 	return (hash && hash->bytenr);
 }
 
+static inline int
+btrfs_dedupe_hash_compressed_hit(struct btrfs_dedupe_hash *hash)
+{
+	/* This check implies hash->bytenr != 0 */
+	return (hash && hash->compression != BTRFS_COMPRESS_NONE);
+}
+
 static inline int btrfs_dedupe_hash_size(u16 type)
 {
 	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 35d1ec4..8a2a76a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -412,13 +412,15 @@ static noinline void compress_file_range(struct inode *inode,
 					struct page *locked_page,
 					u64 start, u64 end,
 					struct async_cow *async_cow,
-					int *num_added)
+					int *num_added,
+					struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 num_bytes;
 	u64 blocksize = root->sectorsize;
 	u64 actual_end;
 	u64 isize = i_size_read(inode);
+	u64 orig_start = start;
 	int ret = 0;
 	struct page **pages = NULL;
 	unsigned long nr_pages;
@@ -618,10 +620,18 @@ cont:
 		/* the async work queues will take care of doing actual
 		 * allocation on disk for these compressed pages,
 		 * and will submit them to the elevator.
+		 *
+		 * And if we generate two or more compressed extents, we
+		 * should not go through the dedupe routine.
+		 * TODO: freeing hash should be done by caller
 		 */
+		if (start + num_bytes < end) {
+			kfree(hash);
+			hash = NULL;
+		}
 		add_async_extent(async_cow, start, num_bytes,
 				 total_compressed, pages, nr_pages_ret,
-				 compress_type, NULL);
+				 compress_type, hash);
 
 		if (start + num_bytes < end) {
 			start += num_bytes;
@@ -645,8 +655,12 @@ cleanup_and_bail_uncompressed:
 		}
 		if (redirty)
 			extent_range_redirty_for_io(inode, start, end);
+		if (orig_start != start) {
+			kfree(hash);
+			hash = NULL;
+		}
 		add_async_extent(async_cow, start, end - start + 1,
-				 0, NULL, 0, BTRFS_COMPRESS_NONE, NULL);
+				 0, NULL, 0, BTRFS_COMPRESS_NONE, hash);
 		*num_added += 1;
 	}
 
@@ -727,7 +741,7 @@ static void end_dedupe_extent(struct inode *inode, u64 start,
  * and send them down to the disk.
  */
 static noinline void submit_compressed_extents(struct inode *inode,
-					      struct async_cow *async_cow)
+					       struct async_cow *async_cow)
 {
 	struct async_extent *async_extent;
 	u64 alloc_hint = 0;
@@ -749,8 +763,13 @@ again:
 		hash = async_extent->hash;
 
 retry:
-		/* did the compression code fall back to uncompressed IO? */
-		if (!async_extent->pages) {
+		/*
+		 * The extent doesn't contain any compressed data (from
+		 * compression or inband dedupe)
+		 * TODO: Merge this branch to compressed branch
+		 */
+		if (!(async_extent->pages ||
+		      btrfs_dedupe_hash_compressed_hit(hash))) {
 			int page_started = 0;
 			unsigned long nr_written = 0;
 
@@ -802,10 +821,15 @@ retry:
 		lock_extent(io_tree, async_extent->start,
 			    async_extent->start + async_extent->ram_size - 1);
 
-		ret = btrfs_reserve_extent(root,
+		if (btrfs_dedupe_hash_hit(hash)) {
+			ins.objectid = hash->bytenr;
+			ins.offset = hash->disk_num_bytes;
+		} else {
+			ret = btrfs_reserve_extent(root,
 					   async_extent->compressed_size,
 					   async_extent->compressed_size,
 					   0, alloc_hint, &ins, 1, 1);
+		}
 		if (ret) {
 			free_async_extent_pages(async_extent);
 
@@ -880,7 +904,8 @@ retry:
 						async_extent->ram_size,
 						ins.offset,
 						BTRFS_ORDERED_COMPRESSED,
-						async_extent->compress_type);
+						async_extent->compress_type,
+						hash);
 		if (ret) {
 			btrfs_drop_extent_cache(inode, async_extent->start,
 						async_extent->start +
@@ -897,12 +922,17 @@ retry:
 				NULL, EXTENT_LOCKED | EXTENT_DELALLOC,
 				PAGE_UNLOCK | PAGE_CLEAR_DIRTY |
 				PAGE_SET_WRITEBACK);
-		ret = btrfs_submit_compressed_write(inode,
-				    async_extent->start,
-				    async_extent->ram_size,
-				    ins.objectid,
-				    ins.offset, async_extent->pages,
-				    async_extent->nr_pages);
+		if (btrfs_dedupe_hash_hit(hash))
+			end_dedupe_extent(inode, async_extent->start,
+					  async_extent->ram_size,
+					  PAGE_END_WRITEBACK);
+		else
+			ret = btrfs_submit_compressed_write(inode,
+					async_extent->start,
+					async_extent->ram_size,
+					ins.objectid,
+					ins.offset, async_extent->pages,
+					async_extent->nr_pages);
 		if (ret) {
 			struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 			struct page *p = async_extent->pages[0];
@@ -1160,6 +1190,7 @@ static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
 	u64 dedupe_bs;
 	u64 cur_offset = start;
 	int ret = 0;
+	int need_compress = inode_need_compress(inode);
 
 	actual_end = min_t(u64, isize, end + 1);
 	/* If dedupe is not enabled, don't split extent into dedupe_bs */
@@ -1177,7 +1208,14 @@ static int hash_file_ranges(struct inode *inode, u64 start, u64 end,
 		u64 len;
 
 		len = min(end + 1 - cur_offset, dedupe_bs);
-		if (len < dedupe_bs)
+		/*
+		 * If dedupe_bs is larger than the compressed extent upper
+		 * limit (128K), we prefer compression over dedupe, as a
+		 * larger dedupe bs leads to a much lower dedupe rate and,
+		 * normally, the gain from compression is more obvious.
+		 */
+		if (len < dedupe_bs || (need_compress == 1
+					&& dedupe_bs > SZ_128K))
 			goto next;
 
 		hash = btrfs_dedupe_alloc_hash(hash_algo);
@@ -1200,10 +1238,24 @@ next:
 		    page_offset(locked_page) <= end)
 			__set_page_dirty_nobuffers(locked_page);
 
-		add_async_extent(async_cow, cur_offset, len, 0, NULL, 0,
-				 BTRFS_COMPRESS_NONE, hash);
+		if (btrfs_dedupe_hash_hit(hash)) {
+			/* Dedupe found */
+			add_async_extent(async_cow, cur_offset, len,
+					 hash->disk_num_bytes, NULL, 0,
+					 hash->compression, hash);
+			(*num_added)++;
+		} else if (need_compress) {
+			/* Compression */
+			compress_file_range(inode, locked_page, cur_offset,
+					    cur_offset + len - 1, async_cow,
+					    num_added, hash);
+		} else {
+			/* No compression, dedupe not found */
+			add_async_extent(async_cow, cur_offset, len, 0, NULL, 0,
+					 BTRFS_COMPRESS_NONE, hash);
+			(*num_added)++;
+		}
 		cur_offset += len;
-		(*num_added)++;
 	}
 out:
 	return ret;
@@ -1215,21 +1267,26 @@ out:
 static noinline void async_cow_start(struct btrfs_work *work)
 {
 	struct async_cow *async_cow;
+	struct inode *inode;
+	struct btrfs_fs_info *fs_info;
 	int num_added = 0;
 	int ret = 0;
+
 	async_cow = container_of(work, struct async_cow, work);
+	inode = async_cow->inode;
+	fs_info = BTRFS_I(inode)->root->fs_info;
 
-	if (inode_need_compress(async_cow->inode))
+	if (!inode_need_dedupe(fs_info, inode))
 		compress_file_range(async_cow->inode, async_cow->locked_page,
 				    async_cow->start, async_cow->end, async_cow,
-				    &num_added);
+				    &num_added, NULL);
 	else
-		ret = hash_file_ranges(async_cow->inode, async_cow->start,
+		ret = hash_file_ranges(inode, async_cow->start,
 				       async_cow->end, async_cow, &num_added);
 	WARN_ON(ret);
 
 	if (num_added == 0) {
-		btrfs_add_delayed_iput(async_cow->inode);
+		btrfs_add_delayed_iput(inode);
 		async_cow->inode = NULL;
 	}
 }
@@ -2313,9 +2370,9 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	/* Add missed hash into dedupe tree */
 	if (hash && hash->bytenr == 0) {
 		hash->bytenr = ins.objectid;
-		hash->num_bytes = ins.offset;
-		hash->disk_num_bytes = hash->num_bytes;
-		hash->compression = BTRFS_COMPRESS_NONE;
+		hash->num_bytes = ram_bytes;
+		hash->disk_num_bytes = ins.offset;
+		hash->compression = compression;
 		ret = btrfs_dedupe_add(trans, root->fs_info, hash);
 	}
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 695c0e2..052f810 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -300,11 +300,12 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int compress_type)
+				      int type, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type, NULL);
+					  compress_type, hash);
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 8a54476..8ff4dcb 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -189,7 +189,8 @@ int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type);
 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int compress_type);
+				      int type, int compress_type,
+				      struct btrfs_dedupe_hash *hash);
 void btrfs_add_ordered_sum(struct inode *inode,
 			   struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
-- 
2.7.3





* [PATCH v8 26/27] btrfs: relocation: Enhance error handling to avoid BUG_ON
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (24 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22  1:35 ` [PATCH v8 27/27] btrfs: dedupe: Fix a space cache delalloc bytes underflow bug Qu Wenruo
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

Since the introduction of the btrfs dedupe tree, it's possible for balance to
race with dedupe disabling.

When this happens, dedupe_enabled makes btrfs_get_fs_root() return
ERR_PTR(-ENOENT).
But due to a bug in the error handling branch, when this happens
backref_cache->nr_nodes is increased while the node is neither added to the
backref_cache nor is nr_nodes decreased again, causing a BUG_ON() in
backref_cache_cleanup():

[ 2611.668810] ------------[ cut here ]------------
[ 2611.669946] kernel BUG at
/home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: 0000 [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034]  [<ffffffffa01f71d3>]
btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706]  [<ffffffffa01cc177>]
btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385]  [<ffffffffa01cdb12>] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966]  [<ffffffffa01d9611>] btrfs_ioctl_balance+0x391/0x3a0
[btrfs]
[ 2611.689587]  [<ffffffffa01ddaf0>] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145]  [<ffffffff81171cda>] ? lru_cache_add+0x3a/0x80
[ 2611.690647]  [<ffffffff81171e4c>] ?
lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310]  [<ffffffff81193f04>] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842]  [<ffffffff811da423>] ? cp_new_stat+0x153/0x180
[ 2611.692342]  [<ffffffff8119913d>] ? __vma_link_rb+0xfd/0x110
[ 2611.692842]  [<ffffffff81199209>] ? vma_link+0xb9/0xc0
[ 2611.693303]  [<ffffffff811e7e81>] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781]  [<ffffffff8104e024>] ? __do_page_fault+0x1b4/0x400
[ 2611.694310]  [<ffffffff811e83c1>] SyS_ioctl+0x41/0x70
[ 2611.694758]  [<ffffffff816dfc6e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0
05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b
0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP  [<ffffffffa01f6fc1>]
relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818]  RSP <ffff88002a81fb30>

This patch calls remove_backref_node() in the error handling branch, and
catches the returned -ENOENT in relocate_tree_blocks() to continue
balancing.

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/relocation.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 74fd5d9..b9822e9 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -888,6 +888,13 @@ again:
 		root = read_fs_root(rc->extent_root->fs_info, key.offset);
 		if (IS_ERR(root)) {
 			err = PTR_ERR(root);
+			/*
+			 * Don't forget to clean up the current node, as it
+			 * may not have been added to backref_cache while
+			 * nr_nodes was already increased.
+			 * That would cause BUG_ON() in backref_cache_cleanup().
+			 */
+			remove_backref_node(&rc->backref_cache, cur);
 			goto out;
 		}
 
@@ -2991,14 +2998,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}
 
 	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);
 
 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root (so far only the dedupe tree) owning this
+			 * tree block is going to be freed and can't be reached.
+			 * Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}
 
 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3006,11 +3020,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);
 
 out_free_path:
-- 
2.7.3





* [PATCH v8 27/27] btrfs: dedupe: Fix a space cache delalloc bytes underflow bug
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (25 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 26/27] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
@ 2016-03-22  1:35 ` Qu Wenruo
  2016-03-22 13:38 ` [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework David Sterba
  2016-03-29 17:22 ` Alex Lyakas
  28 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-22  1:35 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Wang Xiaoguang

Dedupe has a bug that underflows block_group_cache->delalloc_bytes, making
it unable to return to 0.
This causes the free space cache for that block group to never be written
to disk.

And causes the following kernel message at umount:
BTRFS info (device vdc): The free space cache file (1485570048) is
invalid. skip it

Reported-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/extent-tree.c |  8 ++++++--
 fs/btrfs/inode.c       | 11 +++++++++--
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 016d2ec..f6dbef3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6256,8 +6256,12 @@ static int btrfs_update_reserved_bytes(struct btrfs_block_group_cache *cache,
 		cache->reserved -= num_bytes;
 		space_info->bytes_reserved -= num_bytes;
 
-		if (delalloc)
-			cache->delalloc_bytes -= num_bytes;
+		if (delalloc) {
+			if (WARN_ON(num_bytes > cache->delalloc_bytes))
+				cache->delalloc_bytes = 0;
+			else
+				cache->delalloc_bytes -= num_bytes;
+		}
 	}
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8a2a76a..5014ece 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3045,7 +3045,10 @@ static void btrfs_release_delalloc_bytes(struct btrfs_root *root,
 	ASSERT(cache);
 
 	spin_lock(&cache->lock);
-	cache->delalloc_bytes -= len;
+	if (WARN_ON(len > cache->delalloc_bytes))
+		cache->delalloc_bytes = 0;
+	else
+		cache->delalloc_bytes -= len;
 	spin_unlock(&cache->lock);
 
 	btrfs_put_block_group(cache);
@@ -3154,6 +3157,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->file_offset +
 						logical_len);
 	} else {
+		/* Must be checked before the hash is modified */
+		int hash_hit = btrfs_dedupe_hash_hit(ordered_extent->hash);
+
 		BUG_ON(root == root->fs_info->tree_root);
 		ret = insert_reserved_file_extent(trans, inode,
 						ordered_extent->file_offset,
@@ -3163,7 +3169,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						compress_type, 0, 0,
 						BTRFS_FILE_EXTENT_REG,
 						ordered_extent->hash);
-		if (!ret)
+		/* Hash hit case doesn't reserve delalloc bytes */
+		if (!ret && !hash_hit)
 			btrfs_release_delalloc_bytes(root,
 						     ordered_extent->start,
 						     ordered_extent->disk_len);
-- 
2.7.3





* Re: [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication
  2016-03-22  1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
@ 2016-03-22  2:29   ` kbuild test robot
  2016-03-22  2:48   ` kbuild test robot
  1 sibling, 0 replies; 62+ messages in thread
From: kbuild test robot @ 2016-03-22  2:29 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Wang Xiaoguang

[-- Attachment #1: Type: text/plain, Size: 826 bytes --]

Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on next-20160321]
[cannot apply to v4.5]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-Add-inband-write-time-de-duplication-framework/20160322-094812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: i386-randconfig-s1-201612 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "__udivdi3" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 23643 bytes --]


* Re: [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication
  2016-03-22  1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
  2016-03-22  2:29   ` kbuild test robot
@ 2016-03-22  2:48   ` kbuild test robot
  1 sibling, 0 replies; 62+ messages in thread
From: kbuild test robot @ 2016-03-22  2:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: kbuild-all, linux-btrfs, Wang Xiaoguang

[-- Attachment #1: Type: text/plain, Size: 882 bytes --]

Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on next-20160321]
[cannot apply to v4.5]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-Add-inband-write-time-de-duplication-framework/20160322-094812
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: i386-randconfig-s0-201612 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   fs/built-in.o: In function `btrfs_dedupe_enable':
>> (.text+0x45417a): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 24853 bytes --]


* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (26 preceding siblings ...)
  2016-03-22  1:35 ` [PATCH v8 27/27] btrfs: dedupe: Fix a space cache delalloc bytes underflow bug Qu Wenruo
@ 2016-03-22 13:38 ` David Sterba
  2016-03-23  2:25   ` Qu Wenruo
  2016-03-29 17:22 ` Alex Lyakas
  28 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2016-03-22 13:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Mar 22, 2016 at 09:35:25AM +0800, Qu Wenruo wrote:
> This updated version of inband de-duplication has the following features:
> 1) ONE unified dedup framework.
> 2) TWO different back-end with different trade-off

The on-disk format is defined in code, would be good to give some
overview here.

> 3) Support compression with dedupe
> 4) Ioctl interface with persist dedup status

I'd like to see the ioctl specified in more detail. So far there's
enable, disable and status. I'd expect some way to control the in-memory
limits, let it "forget" current hash cache, specify the dedupe chunk
size, maybe sync of the in-memory hash cache to disk.

> 5) Ability to disable dedup for given dirs/files

This would be good to extend to subvolumes.

> TODO:
> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>    Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>    CPU may even be a bottleneck other than IO.
>    But for faster hash, it will definitely cause conflicts, so we need
>    extent comparison before we introduce new dedup algorithm.

If sha256 is slow, we can use a less secure hash that's faster but will
do a full byte-to-byte comparison in case of hash collision, and
recompute sha256 when the blocks are going to disk. I haven't thought
this through, so there are possibly details that could make it unfeasible.

The idea is to move expensive hashing to the slow IO operations and do
fast but not 100% safe hashing on the read/write side where performance
matters.

> 2) Misc end-user related helpers
>    Like handy and easy to implement dedup rate report.
>    And method to query in-memory hash size for those "non-exist" users who
>    want to use 'dedup enable -l' option but didn't ever know how much
>    RAM they have.

That's what we should try to know and define in advance, that's part of the
ioctl interface.

I went through the patches, there are a lot of small things to fix, but
first I want to be sure about the interfaces, ie. on-disk and ioctl.

Then we can start to merge the patchset in smaller batches, the
in-memory deduplication does not have implications on the on-disk
format, so it's "just" the ioctl part.

The patches at the end of the series fix bugs introduced within the same
series, these should be folded to the patches that are buggy.


* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-22 13:38 ` [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework David Sterba
@ 2016-03-23  2:25   ` Qu Wenruo
  2016-03-24 13:42     ` David Sterba
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-23  2:25 UTC (permalink / raw)
  To: dsterba, linux-btrfs

First of all, thank you for your interest in the dedupe patchset.

In fact I'm quite afraid that if no one is interested in the patchset, it
may be delayed again to 4.8.

David Sterba wrote on 2016/03/22 14:38 +0100:
> On Tue, Mar 22, 2016 at 09:35:25AM +0800, Qu Wenruo wrote:
>> This updated version of inband de-duplication has the following features:
>> 1) ONE unified dedup framework.
>> 2) TWO different back-end with different trade-off
>
> The on-disk format is defined in code, would be good to give some
> overview here.

No problem at all.
(Although I'm not sure if it's a good idea to explain it in a mail. Maybe
the wiki would be a better place?)

There are 3 dedupe related on-disk items.

1) dedupe status
    Used by both dedupe backends. Mainly used to record the dedupe
    backend info, allowing btrfs to resume its dedupe setup after umount.

Key contents:
    Objectid             , Type                   , Offset
   (0                    , DEDUPE_STATUS_ITEM_KEY , 0      )

Structure contents:
   dedupe block size:     records dedupe block size
   limit_nr:              In-memory hash limit
   hash_type:             Only SHA256 is possible yet
   backend:               In-memory or on-disk

2) dedupe hash item
    The main item for on-disk dedupe backend.
    It's used for hash -> extent search.
    Duplicated hash won't be inserted into dedupe tree.

Key contents:
    Objectid            , Type                   , Offset
   (Last 64bit of hash  , DEDUPE_HASH_ITEM_KEY   , Bytenr of the extent)

Structure contents:
   len:                   The in-memory length of the extent
                          Should always match dedupe_bs.
   disk_len:              The on-disk length of extent, diffs with len
                          if the extent is compressed.
   compression:           Compression algorithm.
   hash:                  Complete hash (SHA256) of the extent, including
                          the last 64 bits

   The structure is a simplified file extent item, with the hash added and
   the offset removed.

3) dedupe bytenr item
    Helper structure, mainly used for extent -> hash lookup, used by
    extent freeing.
    1 on 1 mapping with dedupe hash item.

Key contents:
    Objectid       , Type                       , Offset
   (Extent bytenr  , DEDUPE_HASH_BYTENR_ITEM_KEY, Last 64 bit of hash)

Structure contents:
   Hash:                 Complete hash(SHA256) of the extent.
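
To make the three key layouts easier to picture, here is a rough C sketch
following the description above (the struct, the item type values and the
helper names are illustrative only, they are not copied from the actual
headers):

#include <stdint.h>
#include <string.h>

/* Illustrative only: mirrors the key layouts described above; the real
 * item type constants live in the btrfs headers. */
struct dedupe_key {
	uint64_t objectid;
	uint8_t  type;		/* placeholder for the real *_ITEM_KEY values */
	uint64_t offset;
};

/* last 64 bits of the SHA256 digest, used as the short index */
static uint64_t hash_tail(const uint8_t sha256[32])
{
	uint64_t tail;

	memcpy(&tail, sha256 + 32 - sizeof(tail), sizeof(tail));
	return tail;
}

/* 1) dedupe status item: a single well-known key */
static struct dedupe_key status_key(void)
{
	return (struct dedupe_key){ .objectid = 0, .type = 1, .offset = 0 };
}

/* 2) dedupe hash item: hash -> extent search */
static struct dedupe_key hash_item_key(const uint8_t sha256[32], uint64_t bytenr)
{
	return (struct dedupe_key){
		.objectid = hash_tail(sha256), .type = 2, .offset = bytenr,
	};
}

/* 3) dedupe bytenr item: extent -> hash search, used when freeing extents */
static struct dedupe_key bytenr_item_key(uint64_t bytenr, const uint8_t sha256[32])
{
	return (struct dedupe_key){
		.objectid = bytenr, .type = 3, .offset = hash_tail(sha256),
	};
}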

>
>> 3) Support compression with dedupe
>> 4) Ioctl interface with persist dedup status
>
> I'd like to see the ioctl specified in more detail. So far there's
> enable, disable and status. I'd expect some way to control the in-memory
> limits, let it "forget" current hash cache, specify the dedupe chunk
> size, maybe sync of the in-memory hash cache to disk.

So current and planned ioctl should be the following, with some details 
related to your in-memory limit control concerns.

1) Enable
    Enable dedupe if it's not enabled already. (disabled -> enabled)
    Or change current dedupe setting to another. (re-configure)

    For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
    will disable dedupe(dropping all hash) and then enable with new
    setting.

    For in-memory backend, if only limit is different from previous
    setting, limit can be changed on the fly without dropping any hash.

2) Disable
    Disable will drop all hash and delete the dedupe tree if it exists.
    Imply a full sync_fs().

3) Status
    Output basic status of current dedupe.
    Including running status(disabled/enabled), dedupe block size, hash
    algorithm, and limit setting for in-memory backend.

4) (PLANNED) In-memory hash size querying
    Allowing userspace to query the in-memory hash structure header size.
    Used by the "btrfs dedupe enable" '-l' option to output a warning if the
    user specifies a memory size larger than 1/4 of the total memory.

5) (PLANNED) Dedupe rate statistics
    Should be handy for users to know the dedupe rate so they can further
    fine-tune their dedupe setup.

So for your "in-memory limit control", just enable it with different limit.
For "dedupe block size change", just enable it with different dedupe_bs.
For "forget hash", just disable it.

And for "write in-memory hash onto disk", not planned and may never do 
it due to the complexity, sorry.
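
For illustration, the argument structure for such an ioctl could look
roughly like the following. The field names and sizes here are assumptions
made for the sake of discussion, not the actual ABI of the patchset.

/*
 * Illustrative sketch of a dedupe ioctl argument structure.
 * Field names and sizes are assumptions, not the real ABI.
 */
struct btrfs_ioctl_dedupe_args {
	__u16 cmd;		/* enable / disable / status */
	__u64 blocksize;	/* dedupe_bs */
	__u64 limit_nr;		/* in-memory backend: max number of hashes */
	__u64 limit_mem;	/* in-memory backend: max memory to use */
	__u16 hash_type;	/* only SHA256 so far */
	__u16 backend;		/* in-memory or on-disk */
	__u8  status;		/* filled in by the status command */
	__u8  reserved[7];	/* padding/future extension; size illustrative */
};

Enable would fill in the configuration fields, disable would only need cmd,
and status would have the kernel fill the same fields back in.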

>
>> 5) Ability to disable dedup for given dirs/files
>
> This would be good to extend to subvolumes.

I'm sorry that I didn't quite understand the difference.
Doesn't a dir include subvolumes?

Or is the xattr for a subvolume only stored in its parent subvolume, and
won't be copied to its snapshot?

>
>> TODO:
>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>>     Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>     CPU may even be a bottleneck other than IO.
>>     But for faster hash, it will definitely cause conflicts, so we need
>>     extent comparison before we introduce new dedup algorithm.
>
> If sha256 is slow, we can use a less secure hash that's faster but will
> do a full byte-to-byte comparison in case of hash collision, and
> recompute sha256 when the blocks are going to disk. I haven't thought
> this through, so there are possibly details that could make unfeasible.

Not exactly. If we were using an unsafe hash, e.g. MD5, we would use MD5
only, for both the in-memory and on-disk backends. No SHA256 at all.

In that case, for the MD5 hit case, we would do a full byte-to-byte
comparison. It may be slow or fast, depending on the cache.

But at least the MD5 miss case should be faster than SHA256.

>
> The idea is to move expensive hashing to the slow IO operations and do
> fast but not 100% safe hashing on the read/write side where performance
> matters.

Yes, although on the read side we don't calculate hashes; we only hash
on the write side.
And in that case, if the weak hash hits, we will need to do a memory
comparison, which may also be slow.
So the performance impact may still exist.

The biggest challenge is that we need to read the (decompressed) extent
contents, even without an inode.
(So there is no address_space and none of the usual working facilities.)

Considering the complexity and the uncertain performance improvement, the
priority of introducing a weak hash is quite low so far, not to mention
the amount of detailed design change it would require.
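
For clarity, the weak-hash idea under discussion would look roughly like
the following. This is only a sketch: xxhash64(), lookup_weak_hash(),
read_extent_data() and struct dedupe_hash_entry are placeholder names,
not real btrfs code.

/*
 * Sketch of "weak hash on the write path, byte-to-byte verify on hit".
 * All helpers here are placeholders for illustration.
 */
static int dedupe_try_match(const u8 *data, u64 len, u64 *match_bytenr)
{
	struct dedupe_hash_entry *entry;
	u8 *existing;
	int match = 0;

	/* cheap hash calculated on every write */
	entry = lookup_weak_hash(xxhash64(data, len));
	if (!entry)
		return 0;	/* miss: only the cheap hash was paid for */

	/* hit: the weak hash may collide, so verify the actual bytes */
	existing = read_extent_data(entry->bytenr, len);
	if (existing && memcmp(existing, data, len) == 0) {
		*match_bytenr = entry->bytenr;
		match = 1;
	}
	kfree(existing);
	return match;
}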

A much easier and more practical enhancement is to use SHA512, as it's
faster than SHA256 on modern 64-bit machines for larger inputs.
For example, for hashing 8K of data, SHA512 is almost 40% faster than SHA256.

>
>> 2) Misc end-user related helpers
>>     Like handy and easy to implement dedup rate report.
>>     And method to query in-memory hash size for those "non-exist" users who
>>     want to use 'dedup enable -l' option but didn't ever know how much
>>     RAM they have.
>
> That's what we should try know and define in advance, that's part of the
> ioctl interface.
>
> I went through the patches, there are a lot of small things to fix, but
> first I want to be sure about the interfaces, ie. on-disk and ioctl.

I hope such small things can be pointed out, allowing me to fix them 
while rebasing.

>
> Then we can start to merge the patchset in smaller batches, the
> in-memory deduplication does not have implications on the on-disk
> format, so it's "just" the ioctl part.

Yes, that was my original plan: first merge the simple in-memory backend
into 4.5/4.6 and then add the on-disk backend in 4.7.

But it turned out that, since we designed the two-backend API from
the beginning, the on-disk backend didn't take much time to implement.

So that's what you see now: a big patchset with both backends
implemented.

>
> The patches at the end of the series fix bugs introduced within the same
> series, these should be folded to the patches that are buggy.

I'll fold them into the next version.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-23  2:25   ` Qu Wenruo
@ 2016-03-24 13:42     ` David Sterba
  2016-03-25  1:38       ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: David Sterba @ 2016-03-24 13:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, clm

On Wed, Mar 23, 2016 at 10:25:51AM +0800, Qu Wenruo wrote:
> Thank you for your interest in dedupe patchset first.
> 
> In fact I'm quite afraid if there is no one interest in the patchset, it 
> may be delayed again to 4.8.

It's not about lack of interest, the high-profile features need time and
input from several people that will supposedly cover all aspects.

> David Sterba wrote on 2016/03/22 14:38 +0100:
> There are 3 dedupe related on-disk items.
> 
> 1) dedupe status
>     Used by both dedupe backends. Mainly used to record the dedupe
>     backend info, allowing btrfs to resume its dedupe setup after umount.
> 
> Key contents:
>     Objectid             , Type                   , Offset
>    (0                    , DEDUPE_STATUS_ITEM_KEY , 0      )

Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
key type. As this is the second user of that item, there's no precedent
how to select the subtype. Right now 0 is for the dev stats item, but
I'd like to leave some space between them, so it should be 256 at best.
The space is 64bit so there's enough room but this also means defining
the on-disk format.
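
(For illustration only, the suggestion above would translate to something
like the following; the name is hypothetical:

	/* dev stats already uses objectid 0 under BTRFS_PERSISTENT_ITEM_KEY,
	 * so leave a gap and put the dedupe status at objectid 256. */
	#define BTRFS_DEDUPE_STATUS_OBJECTID	256ULL

	/* key: (BTRFS_DEDUPE_STATUS_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, 0) */
)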

> >> 4) Ioctl interface with persist dedup status
> >
> > I'd like to see the ioctl specified in more detail. So far there's
> > enable, disable and status. I'd expect some way to control the in-memory
> > limits, let it "forget" current hash cache, specify the dedupe chunk
> > size, maybe sync of the in-memory hash cache to disk.
> 
> So current and planned ioctl should be the following, with some details 
> related to your in-memory limit control concerns.
> 
> 1) Enable
>     Enable dedupe if it's not enabled already. (disabled -> enabled)

Ok, so it should also take a parameter saying which backend is about to be
enabled.

>     Or change current dedupe setting to another. (re-configure)

Doing that in 'enable' sounds confusing, any changes belong to a
separate command.

>     For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
>     will disable dedupe(dropping all hash) and then enable with new
>     setting.
> 
>     For in-memory backend, if only limit is different from previous
>     setting, limit can be changed on the fly without dropping any hash.

This is obviously misplaced in 'enable'.

> 2) Disable
>     Disable will drop all hash and delete the dedupe tree if it exists.
>     Imply a full sync_fs().

That is again combining too many things into one. Say I want to disable
deduplication and want to enable it later. And not lose the whole state
between that. Not to say deleting the dedup tree.

IOW, deleting the tree belongs to a separate command, though in the
userspace tools it could be done in one command, but we're talking about
the kernel ioctls now.

I'm not sure if the sync is required, but it's acceptable for first
implementation.

> 
> 3) Status
>     Output basic status of current dedupe.
>     Including running status(disabled/enabled), dedupe block size, hash
>     algorithm, and limit setting for in-memory backend.

Agreed. So this is basically the settings and static info.

> 4) (PLANNED) In-memory hash size querying
>     Allowing userspace to query in-memory hash structure header size.
>     Used for "btrfs dedupe enable" '-l' option to output warning if user
>     specify memory size larger than 1/4 of the total memory.

And this reflects the run-time status. Ok.

> 5) (PLANNED) Dedeup rate statistics
>     Should be handy for user to know the dedupe rate so they can further
>     fine tuning their dedup setup.

Similar as above, but for a different type of data. Ok.

> So for your "in-memory limit control", just enable it with different limit.
> For "dedupe block size change", just enable it with different dedupe_bs.
> For "forget hash", just disable it.

I can comment once the semantics of 'enable' are split, but basically I
want an interface to control the deduplication cache.

> And for "write in-memory hash onto disk", not planned and may never do 
> it due to the complexity, sorry.

I'm not asking you to do it, definitely not for the initial
implementation, but sync from memory to disk is IMO something that we
can expect users to ask for. The perceived complexity may shift the
implementation to the future, but we should take it into account.

> >> 5) Ability to disable dedup for given dirs/files
> >
> > This would be good to extend to subvolumes.
> 
> I'm sorry that I didn't quite understand the difference.
> Doesn't dir includes subvolume?

If I enable deduplication on the entire subvolume, it will affect all
subdirectories. Not the other way around.

> Or xattr for subvolume is only restored in its parent subvolume, and 
> won't be copied for its snapshot?

The xattrs are copied to the snapshot.

> >> TODO:
> >> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
> >>     Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
> >>     CPU may even be a bottleneck other than IO.
> >>     But for faster hash, it will definitely cause conflicts, so we need
> >>     extent comparison before we introduce new dedup algorithm.
> >
> > If sha256 is slow, we can use a less secure hash that's faster but will
> > do a full byte-to-byte comparison in case of hash collision, and
> > recompute sha256 when the blocks are going to disk. I haven't thought
> > this through, so there are possibly details that could make unfeasible.
> 
> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only 
> for both in-memory and on-disk backend. No SHA256 again.

I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
murmur. As they're both orders of magnitude faster than sha1/md5, we can
actually hash both to reduce the collisions.

> In that case, for MD5 hit case, we will do a full byte-to-byte 
> comparison. It may be slow or fast, depending on the cache.

If the probability of hash collision is low, then the number of needed
byte-to-byte comparisons is also low.

> But at least for MD5 miss case, it should be faster than SHA256.
> 
> > The idea is to move expensive hashing to the slow IO operations and do
> > fast but not 100% safe hashing on the read/write side where performance
> > matters.
> 
> Yes, although on the read side, we don't perform hash, we only do hash 
> at write side.

Oh, so how exactly gets the in-memory deduplication cache filled? My
impression was that we can pre-fill it by reading bunch of files where we
expect the shared data to exist.

The usecase:

Say there's a golden image for a virtual machine, we'll clone it and use
for other VM's, with minor changes. If we first read the golden image
with deduplication enabled, pre-fill the cache, any subsequent writes to
the cloned images will be compared to the cached data. The estimated hit
ratio is medium-to-high.

And this can be extended to anything, not just VMs. Without the option
to fill the in-memory cache, the deduplication would seem pretty useless
to me. The clear benefit is lack of maintaining the persistent storage
of deduplication data.

> And in that case, if weak hash hit, we will need to do memory 
> comparison, which may also be slow.
> So the performance impact may still exist.

Yes the performance hit is there, with statistically low probability.

> The biggest challenge is, we need to read (decompressed) extent 
> contents, even without an inode.
> (So, no address_space and all the working facilities)
> 
> Considering the complexity and uncertain performance improvement, the 
> priority of introducing weak hash is quite low so far, not to mention a 
> lot of detail design change for it.

I disagree.

> A much easier and practical enhancement is, to use SHA512.
> As it's faster than SHA256 on modern 64bit machine for larger size.
> For example, for hashing 8K data, SHA512 is almost 40% faster than SHA256.
> 
> >> 2) Misc end-user related helpers
> >>     Like handy and easy to implement dedup rate report.
> >>     And method to query in-memory hash size for those "non-exist" users who
> >>     want to use 'dedup enable -l' option but didn't ever know how much
> >>     RAM they have.
> >
> > That's what we should try know and define in advance, that's part of the
> > ioctl interface.
> >
> > I went through the patches, there are a lot of small things to fix, but
> > first I want to be sure about the interfaces, ie. on-disk and ioctl.
> 
> I hope such small things can be pointed out, allowing me to fix them 
> while rebasing.

Sure, that's next after we agree on what the deduplication should
actually do, the ioctl interfaces are settled and the on-disk format
changes are agreed on. The code is a good starting point, but pointing
out minor things at this point does not justify the time spent.

> > Then we can start to merge the patchset in smaller batches, the
> > in-memory deduplication does not have implications on the on-disk
> > format, so it's "just" the ioctl part.
> 
> Yes, that's my original plan, first merge simple in-memory backend into 
> 4.5/4.6 and then adding ondisk backend into 4.7.
> 
> But things turned out that, since we designed the two-backends API from 
> the beginning, on-disk backend doesn't take much time to implement.
> 
> So this makes what you see now, a big patchset with both backend 
> implemented.

For the discussions and review phase it's ok to see them both, but it's
unrealistic to expect merging in a particular version without going
through the review heat, especially for something like deduplication.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue
  2016-03-22  1:35 ` [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue Qu Wenruo
@ 2016-03-24 20:35   ` Chris Mason
  2016-03-25  1:44     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Mason @ 2016-03-24 20:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Tue, Mar 22, 2016 at 09:35:50AM +0800, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> 
> The basic idea is also calculate hash before compression, and add needed
> members for dedupe to record compressed file extent.
> 
> Since dedupe support dedupe_bs larger than 128K, which is the up limit
> of compression file extent, in that case we will skip dedupe and prefer
> compression, as in that size dedupe rate is low and compression will be
> more obvious.
> 
> Current implement is far from elegant. The most elegant one should split
> every data processing method into its own and independent function, and
> have a unified function to co-operate them.

I'd leave this one out for now, it looks like we need to refine the
pipeline from dedup -> compression and this is just more to carry around
until the initial support is in.  Can you just decline to dedup
compressed extents for now?

-chris

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-22  1:35 ` [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
@ 2016-03-24 20:58   ` Chris Mason
  2016-03-25  1:59     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Mason @ 2016-03-24 20:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Liu Bo, Wang Xiaoguang

On Tue, Mar 22, 2016 at 09:35:35AM +0800, Qu Wenruo wrote:
> Introduce a new tree, dedupe tree to record on-disk dedupe hash.
> As a persist hash storage instead of in-memeory only implement.
> 
> Unlike Liu Bo's implement, in this version we won't do hack for
> bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM for such
> search case, just like in-memory backend.

Thanks for refreshing this again, I'm starting to go through the disk
format in more detail.

> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h             | 63 +++++++++++++++++++++++++++++++++++++++++++-
>  fs/btrfs/dedupe.h            |  5 ++++
>  fs/btrfs/disk-io.c           |  1 +
>  include/trace/events/btrfs.h |  3 ++-
>  4 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 022ab61..bed9273 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
>  /* tracks free space in block groups. */
>  #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
>  
> +/* on-disk dedupe tree (EXPERIMENTAL) */
> +#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
> +
>  /* device stats in the device tree */
>  #define BTRFS_DEV_STATS_OBJECTID 0ULL
>  
> @@ -508,6 +511,7 @@ struct btrfs_super_block {
>   * ones specified below then we will fail to mount
>   */
>  #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
> +#define BTRFS_FEATURE_COMPAT_RO_DEDUPE		(1ULL << 1)
>  
>  #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
>  #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
> @@ -537,7 +541,8 @@ struct btrfs_super_block {
>  #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
>  
>  #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
> -	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
> +	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
> +	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
>  
>  #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
>  #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
> @@ -959,6 +964,42 @@ struct btrfs_csum_item {
>  	u8 csum;
>  } __attribute__ ((__packed__));
>  
> +/*
> + * Objectid: 0
> + * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
> + * Offset: 0
> + */
> +struct btrfs_dedupe_status_item {
> +	__le64 blocksize;
> +	__le64 limit_nr;
> +	__le16 hash_type;
> +	__le16 backend;
> +} __attribute__ ((__packed__));
> +
> +/*
> + * Objectid: Last 64 bit of the hash
> + * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
> + * Offset: Bytenr of the hash
> + *
> + * Used for hash <-> bytenr search
> + */
> +struct btrfs_dedupe_hash_item {
> +	/* length of dedupe range */
> +	__le32 len;
> +
> +	/* Hash follows */
> +} __attribute__ ((__packed__));

Are you storing the entire hash, or just the parts not represented in
the key?  I'd like to keep the on-disk part as compact as possible for
this part.

> +
> +/*
> + * Objectid: bytenr
> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
> + * offset: Last 64 bit of the hash
> + *
> + * Used for bytenr <-> hash search (for free_extent)
> + * all its content is hash.
> + * So no special item struct is needed.
> + */
> +

Can we do this instead with a backref from the extent?  It'll save us a
huge amount of IO as we delete things.

-chris

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-24 13:42     ` David Sterba
@ 2016-03-25  1:38       ` Qu Wenruo
  2016-04-04 16:55         ` David Sterba
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-25  1:38 UTC (permalink / raw)
  To: dsterba, linux-btrfs, clm



David Sterba wrote on 2016/03/24 14:42 +0100:
> On Wed, Mar 23, 2016 at 10:25:51AM +0800, Qu Wenruo wrote:
>> Thank you for your interest in dedupe patchset first.
>>
>> In fact I'm quite afraid if there is no one interest in the patchset, it
>> may be delayed again to 4.8.
>
> It's not about lack of interest, the high-profile features need time and
> input from several people that will supposedly cover all aspects.
>
>> David Sterba wrote on 2016/03/22 14:38 +0100:
>> There are 3 dedupe related on-disk items.
>>
>> 1) dedupe status
>>      Used by both dedupe backends. Mainly used to record the dedupe
>>      backend info, allowing btrfs to resume its dedupe setup after umount.
>>
>> Key contents:
>>      Objectid             , Type                   , Offset
>>     (0                    , DEDUPE_STATUS_ITEM_KEY , 0      )
>
> Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
> key type. As this is the second user of that item, there's no precendent
> how to select the subtype. Right now 0 is for the dev stats item, but
> I'd like to leave some space between them, so it should be 256 at best.
> The space is 64bit so there's enough room but this also means defining
> the on-disk format.

After checking BTRFS_PERSISTENT_ITEM_KEY, it seems that its value is
larger than the current DEDUPE_BYTENR/HASH_ITEM_KEY values, and given the
objectid used by DEDUPE_HASH_ITEM_KEY, the status item won't be the first
item of the tree.

Although that's not a big problem, for a user running debug-tree it
would be quite annoying to find it located among tons of other hashes.

So personally, if using PERSISTENT_ITEM_KEY, I'd at least prefer to keep
the objectid at 0, and move DEDUPE_BYTENR/HASH_ITEM_KEY to higher values,
to ensure the dedupe status is the first item of the dedupe tree.

>
>>>> 4) Ioctl interface with persist dedup status
>>>
>>> I'd like to see the ioctl specified in more detail. So far there's
>>> enable, disable and status. I'd expect some way to control the in-memory
>>> limits, let it "forget" current hash cache, specify the dedupe chunk
>>> size, maybe sync of the in-memory hash cache to disk.
>>
>> So current and planned ioctl should be the following, with some details
>> related to your in-memory limit control concerns.
>>
>> 1) Enable
>>      Enable dedupe if it's not enabled already. (disabled -> enabled)
>
> Ok, so it should also take a parameter which bckend is about to be
> enabled.

It already does.
It also has limit_nr and limit_mem parameters for the in-memory backend.

>
>>      Or change current dedupe setting to another. (re-configure)
>
> Doing that in 'enable' sounds confusing, any changes belong to a
> separate command.

This depends on the point of view.

For the "enable/config/disable" case, it introduces a state machine for
the end user.

Personally, I don't like a state machine for the end user. Yes, I also hate
merging the play and pause buttons together on a music player.

With a state machine, the user must ensure that dedupe is enabled before
doing any configuration.

For me, the user only needs to care about the result of the operation. The
user can now configure dedupe to their needs without needing to know the
previous setting. From this point of view, "Enable/Disable" is much easier
than "Enable/Config/Disable".

>
>>      For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
>>      will disable dedupe(dropping all hash) and then enable with new
>>      setting.
>>
>>      For in-memory backend, if only limit is different from previous
>>      setting, limit can be changed on the fly without dropping any hash.
>
> This is obviously misplaced in 'enable'.

Then renaming 'enable' to 'configure' or some other proper name would
be better.

The point is, the user only needs to care about what they want to do, not
about the previous setup.

>
>> 2) Disable
>>      Disable will drop all hash and delete the dedupe tree if it exists.
>>      Imply a full sync_fs().
>
> That is again combining too many things into one. Say I want to disable
> deduplication and want to enable it later. And not lose the whole state
> between that. Not to say deleting the dedup tree.
>
> IOW, deleting the tree belongs to a separate command, though in the
> userspace tools it could be done in one command, but we're talking about
> the kernel ioctls now.
>
> I'm not sure if the sync is required, but it's acceptable for first
> implementation.

The design is just to reduce complexity.
If we wanted to keep the hashes but disable dedupe, dedupe would have to
keep handling extent removal while ignoring any new incoming writes.

That would introduce a new state for dedupe, beyond the current simple
enabled/disabled.
So I just don't allow such a mode.

>
>>
>> 3) Status
>>      Output basic status of current dedupe.
>>      Including running status(disabled/enabled), dedupe block size, hash
>>      algorithm, and limit setting for in-memory backend.
>
> Agreed. So this is basically the settings and static info.
>
>> 4) (PLANNED) In-memory hash size querying
>>      Allowing userspace to query in-memory hash structure header size.
>>      Used for "btrfs dedupe enable" '-l' option to output warning if user
>>      specify memory size larger than 1/4 of the total memory.
>
> And this reflects the run-time status. Ok.
>
>> 5) (PLANNED) Dedeup rate statistics
>>      Should be handy for user to know the dedupe rate so they can further
>>      fine tuning their dedup setup.
>
> Similar as above, but for a different type of data. Ok.
>
>> So for your "in-memory limit control", just enable it with different limit.
>> For "dedupe block size change", just enable it with different dedupe_bs.
>> For "forget hash", just disable it.
>
> I can comment once the semantics of 'enable' are split, but basically I
> want an interface to control the deduplication cache.

So it would be better to rename 'enable', then.

The current 'enable' provides the interface to control the limit or the
dedupe hashes.

I'm not sure further control is needed.

>
>> And for "write in-memory hash onto disk", not planned and may never do
>> it due to the complexity, sorry.
>
> I'm not asking you to do it, definetelly not for the initial
> implementation, but sync from memory to disk is IMO something that we
> can expect users to ask for. The percieved complexity may shift
> implementation to the future, but we should take it into account.

OK, I'll keep it in mind.

>
>>>> 5) Ability to disable dedup for given dirs/files
>>>
>>> This would be good to extend to subvolumes.
>>
>> I'm sorry that I didn't quite understand the difference.
>> Doesn't dir includes subvolume?
>
> If I enable deduplication on the entire subvolume, it will affect all
> subdirectories. Not the other way around.

It can be done by setting 'dedupe disable' on all other subvolumes.
But that's not practical yet.

So maybe introduce a new state for the default dedupe behavior?
The current default when dedupe is enabled is to dedupe unless prohibited.
If the default behavior could be "don't dedupe unless allowed", then it
would be much easier to do.

>
>> Or xattr for subvolume is only restored in its parent subvolume, and
>> won't be copied for its snapshot?
>
> The xattrs are copied to the snapshot.
>
>>>> TODO:
>>>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>>>>      Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>>>      CPU may even be a bottleneck other than IO.
>>>>      But for faster hash, it will definitely cause conflicts, so we need
>>>>      extent comparison before we introduce new dedup algorithm.
>>>
>>> If sha256 is slow, we can use a less secure hash that's faster but will
>>> do a full byte-to-byte comparison in case of hash collision, and
>>> recompute sha256 when the blocks are going to disk. I haven't thought
>>> this through, so there are possibly details that could make unfeasible.
>>
>> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only
>> for both in-memory and on-disk backend. No SHA256 again.
>
> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
> murmur. As they're both order-of-magnitutes faster than sha1/md5, we can
> actually hash both to reduce the collisions.

I don't quite like the idea of using 2 hashes instead of 1.
Yes, some programs like rsync use this method, but it also involves a
lot of details, like the order in which to store them on disk.

>
>> In that case, for MD5 hit case, we will do a full byte-to-byte
>> comparison. It may be slow or fast, depending on the cache.
>
> If the probability of hash collision is low, so the number of needed
> byte-to-byte comparisions is also low.

Considering the common use-case of dedupe, a hash hit should be the common
case.

In that case, each hash hit will lead to a byte-to-byte comparison, which
will significantly impact dedupe performance.

On the other hand, if the dedupe hit rate is low, then why use dedupe at all?

>
>> But at least for MD5 miss case, it should be faster than SHA256.
>>
>>> The idea is to move expensive hashing to the slow IO operations and do
>>> fast but not 100% safe hashing on the read/write side where performance
>>> matters.
>>
>> Yes, although on the read side, we don't perform hash, we only do hash
>> at write side.
>
> Oh, so how exactly gets the in-memory deduplication cache filled? My
> impression was that we can pre-fill it by reading bunch of files where we
> expect the shared data to exist.

Yes, we used that method back in the first version of the
in-memory implementation.

But it causes a lot of CPU usage, and most of it is just wasted.

Don't forget that, in the common dedupe use-case, the dedupe rate should be
high; I'll use 50% as an example.

This means 50% of your reads will point to shared extents, but
100% of reads will need to calculate a hash, and 50% of those hashes are
already in the hash pool.
So that CPU time is just wasted.

>
> The usecase:
>
> Say there's a golden image for a virtual machine,

Not to nitpick, but I thought VM images were not a good use-case for btrfs.
And normally users would set nodatacow for them, which bypasses dedupe.

> we'll clone it and use
> for other VM's, with minor changes. If we first read the golden image
> with deduplication enabled, pre-fill the cache, any subsequent writes to
> the cloned images will be compared to the cached data. The estimated hit
> ratio is medium-to-high.

And the performance drop is large enough that most users would notice it,
and CPU usage will be so high (up to 8 cores 100% used) that almost no
spare CPU time can be allocated for VM use.

>
> And this can be extended to anything, not just VMs. Without the option
> to fill the in-memory cache, the deduplication would seem pretty useless
> to me. The clear benefit is lack of maintaining the persistent storage
> of deduplication data.

I originally planned an ioctl to fill the hashes manually.
But now I think a re-write would be good enough.
Maybe I could add a pseudo 'dedupe fill' command to btrfs-progs, which
would just read out the data and re-write it, roughly as sketched below.
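
A minimal userspace sketch of that idea (illustrative only, not actual
btrfs-progs code):

/*
 * Read a file and rewrite its contents in place, so the write path
 * hashes the data and fills the in-memory dedupe cache.
 * Illustrative sketch only.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static int dedupe_fill(const char *path, size_t block)
{
	int fd = open(path, O_RDWR);
	char *buf = malloc(block);
	off_t off = 0;
	ssize_t ret;
	int err = -1;

	if (fd < 0 || !buf)
		goto out;
	while ((ret = pread(fd, buf, block, off)) > 0) {
		if (pwrite(fd, buf, ret, off) != ret)
			goto out;
		off += ret;
	}
	err = fsync(fd);
out:
	free(buf);
	if (fd >= 0)
		close(fd);
	return err;
}

It would be called with the configured dedupe block size for each file whose
hashes should be pre-filled, e.g. dedupe_fill("/images/golden.img", 128 * 1024).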

>
>> And in that case, if weak hash hit, we will need to do memory
>> comparison, which may also be slow.
>> So the performance impact may still exist.
>
> Yes the performance hit is there, with statistically low probability.
>
>> The biggest challenge is, we need to read (decompressed) extent
>> contents, even without an inode.
>> (So, no address_space and all the working facilities)
>>
>> Considering the complexity and uncertain performance improvement, the
>> priority of introducing weak hash is quite low so far, not to mention a
>> lot of detail design change for it.
>
> I disagree.

As explained above, a hash hit is the common case in dedupe use-cases, and
since we would have to do a byte-to-byte comparison in that common-case
routine, the overhead is hard to ignore.

>
>> A much easier and practical enhancement is, to use SHA512.
>> As it's faster than SHA256 on modern 64bit machine for larger size.
>> For example, for hashing 8K data, SHA512 is almost 40% faster than SHA256.
>>
>>>> 2) Misc end-user related helpers
>>>>      Like handy and easy to implement dedup rate report.
>>>>      And method to query in-memory hash size for those "non-exist" users who
>>>>      want to use 'dedup enable -l' option but didn't ever know how much
>>>>      RAM they have.
>>>
>>> That's what we should try know and define in advance, that's part of the
>>> ioctl interface.
>>>
>>> I went through the patches, there are a lot of small things to fix, but
>>> first I want to be sure about the interfaces, ie. on-disk and ioctl.
>>
>> I hope such small things can be pointed out, allowing me to fix them
>> while rebasing.
>
> Sure, that's next after we agree on what the deduplication should
> actually, the ioctls interefaces are settled and the on-disk format
> changes are agreed on. The code is a good starting point, but pointing
> out minor things at this point does not justify the time spent.
>

That's OK.

>>> Then we can start to merge the patchset in smaller batches, the
>>> in-memory deduplication does not have implications on the on-disk
>>> format, so it's "just" the ioctl part.
>>
>> Yes, that's my original plan, first merge simple in-memory backend into
>> 4.5/4.6 and then adding ondisk backend into 4.7.
>>
>> But things turned out that, since we designed the two-backends API from
>> the beginning, on-disk backend doesn't take much time to implement.
>>
>> So this makes what you see now, a big patchset with both backend
>> implemented.
>
> For the discussions and review phase it's ok to see them both, but it's
> unrealistic to expect merging in a particular version without going
> through the review heat, especially for something like deduplication.
>
>
In fact, I didn't expect dedupe to receive this much heat.

I originally expected dedupe to be an interesting but not so
practical feature, just like ZFS dedupe.
(I may be totally wrong; please point it out if there is some well-known
use-case of ZFS dedupe.)

I was expecting dedupe to be a good entry point to expose existing bugs,
and to raise attention for better delayed_ref and delalloc implementations.


Since it's considered a high-profile feature, I'm OK with slowing down the
rush to merge and polishing the interface/code further.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue
  2016-03-24 20:35   ` Chris Mason
@ 2016-03-25  1:44     ` Qu Wenruo
  2016-03-25 15:12       ` Chris Mason
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-25  1:44 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs, Wang Xiaoguang



Chris Mason wrote on 2016/03/24 16:35 -0400:
> On Tue, Mar 22, 2016 at 09:35:50AM +0800, Qu Wenruo wrote:
>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>
>> The basic idea is also calculate hash before compression, and add needed
>> members for dedupe to record compressed file extent.
>>
>> Since dedupe support dedupe_bs larger than 128K, which is the up limit
>> of compression file extent, in that case we will skip dedupe and prefer
>> compression, as in that size dedupe rate is low and compression will be
>> more obvious.
>>
>> Current implement is far from elegant. The most elegant one should split
>> every data processing method into its own and independent function, and
>> have a unified function to co-operate them.
>
> I'd leave this one out for now, it looks like we need to refine the
> pipeline from dedup -> compression and this is just more to carry around
> until the initial support is in.  Can you just decline to dedup
> compressed extents for now?

Yes, no problem at all.
Although this patch seems to work well, I have also planned to rework the
current run_delalloc_range() to make it more flexible and clear.

So the main objective of the patch is more about raising attention for such
further rework.

And now it has achieved its goal.

Thanks,
Qu
>
> -chris
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-24 20:58   ` Chris Mason
@ 2016-03-25  1:59     ` Qu Wenruo
  2016-03-25 15:11       ` Chris Mason
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-25  1:59 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs, Liu Bo, Wang Xiaoguang



Chris Mason wrote on 2016/03/24 16:58 -0400:
> On Tue, Mar 22, 2016 at 09:35:35AM +0800, Qu Wenruo wrote:
>> Introduce a new tree, dedupe tree to record on-disk dedupe hash.
>> As a persist hash storage instead of in-memeory only implement.
>>
>> Unlike Liu Bo's implement, in this version we won't do hack for
>> bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM for such
>> search case, just like in-memory backend.
>
> Thanks for refreshing this again, I'm starting to go through the disk
> format in more detail.
>
>>
>> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/btrfs/ctree.h             | 63 +++++++++++++++++++++++++++++++++++++++++++-
>>   fs/btrfs/dedupe.h            |  5 ++++
>>   fs/btrfs/disk-io.c           |  1 +
>>   include/trace/events/btrfs.h |  3 ++-
>>   4 files changed, 70 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 022ab61..bed9273 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
>>   /* tracks free space in block groups. */
>>   #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
>>
>> +/* on-disk dedupe tree (EXPERIMENTAL) */
>> +#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
>> +
>>   /* device stats in the device tree */
>>   #define BTRFS_DEV_STATS_OBJECTID 0ULL
>>
>> @@ -508,6 +511,7 @@ struct btrfs_super_block {
>>    * ones specified below then we will fail to mount
>>    */
>>   #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
>> +#define BTRFS_FEATURE_COMPAT_RO_DEDUPE		(1ULL << 1)
>>
>>   #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
>>   #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
>> @@ -537,7 +541,8 @@ struct btrfs_super_block {
>>   #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
>>
>>   #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
>> -	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
>> +	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
>> +	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
>>
>>   #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
>>   #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
>> @@ -959,6 +964,42 @@ struct btrfs_csum_item {
>>   	u8 csum;
>>   } __attribute__ ((__packed__));
>>
>> +/*
>> + * Objectid: 0
>> + * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
>> + * Offset: 0
>> + */
>> +struct btrfs_dedupe_status_item {
>> +	__le64 blocksize;
>> +	__le64 limit_nr;
>> +	__le16 hash_type;
>> +	__le16 backend;
>> +} __attribute__ ((__packed__));
>> +
>> +/*
>> + * Objectid: Last 64 bit of the hash
>> + * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
>> + * Offset: Bytenr of the hash
>> + *
>> + * Used for hash <-> bytenr search
>> + */
>> +struct btrfs_dedupe_hash_item {
>> +	/* length of dedupe range */
>> +	__le32 len;
>> +
>> +	/* Hash follows */
>> +} __attribute__ ((__packed__));
>
> Are you storing the entire hash, or just the parts not represented in
> the key?  I'd like to keep the on-disk part as compact as possible for
> this part.

Currently, it's the entire hash.

More details can be found in another mail:
http://article.gmane.org/gmane.comp.file-systems.btrfs/54432

Although it's OK with me to truncate the duplicated last 8 bytes (64 bits),
I still quite like the current implementation, as a single memcpy() is simpler.

>
>> +
>> +/*
>> + * Objectid: bytenr
>> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
>> + * offset: Last 64 bit of the hash
>> + *
>> + * Used for bytenr <-> hash search (for free_extent)
>> + * all its content is hash.
>> + * So no special item struct is needed.
>> + */
>> +
>
> Can we do this instead with a backref from the extent?  It'll save us a
> huge amount of IO as we delete things.

That's the original implementation from Liu Bo.

The problem is, it changes the data backref rules (originally, only an
EXTENT_DATA item can cause a data backref), and would make dedupe INCOMPAT
rather than the current RO_COMPAT.
So I really don't like to change the data backref rule.


If we only want to reduce on-disk space, just dropping the hash and making
DEDUPE_BYTENR_ITEM carry no data would be good enough.

As (bytenr, DEDUPE_BYTENR_ITEM) alone can locate the hash uniquely.

In fact no code really checks the hash of a dedupe bytenr item; callers all
just swap objectid and offset, reset the type, and search for the
corresponding DEDUPE_HASH_ITEM.

So it's OK to omit the hash.
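
Roughly, that lookup is just the following key manipulation (a sketch;
the helper name is made up, and the key layout follows the item
descriptions in this thread):

/*
 * Turn the key of a dedupe bytenr item into the key of the matching
 * dedupe hash item: swap objectid and offset, reset the type.
 */
static void dedupe_bytenr_key_to_hash_key(const struct btrfs_key *bytenr_key,
					  struct btrfs_key *hash_key)
{
	hash_key->objectid = bytenr_key->offset;   /* last 64 bits of hash */
	hash_key->type = BTRFS_DEDUPE_HASH_ITEM_KEY;
	hash_key->offset = bytenr_key->objectid;   /* extent bytenr */
}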

Thanks,
Qu

>
> -chris
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-25  1:59     ` Qu Wenruo
@ 2016-03-25 15:11       ` Chris Mason
  2016-03-26 13:11         ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Mason @ 2016-03-25 15:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Liu Bo, Wang Xiaoguang

On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2016/03/24 16:58 -0400:
> >Are you storing the entire hash, or just the parts not represented in
> >the key?  I'd like to keep the on-disk part as compact as possible for
> >this part.
> 
> Currently, it's entire hash.
> 
> More detailed can be checked in another mail.
> 
> Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
> I still quite like current implementation, as one memcpy() is simpler.

[ sorry FB makes urls look ugly, so I delete them from replys ;) ]

Right, I saw that but wanted to reply to the specific patch.  One of the
lessons learned from the extent allocation tree and file extent items is
they are just too big.  Lets save those bytes, it'll add up.

> 
> >
> >>+
> >>+/*
> >>+ * Objectid: bytenr
> >>+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
> >>+ * offset: Last 64 bit of the hash
> >>+ *
> >>+ * Used for bytenr <-> hash search (for free_extent)
> >>+ * all its content is hash.
> >>+ * So no special item struct is needed.
> >>+ */
> >>+
> >
> >Can we do this instead with a backref from the extent?  It'll save us a
> >huge amount of IO as we delete things.
> 
> That's the original implementation from Liu Bo.
> 
> The problem is, it changes the data backref rules(originally, only
> EXTENT_DATA item can cause data backref), and will make dedupe INCOMPACT
> other than current RO_COMPACT.
> So I really don't like to change the data backref rule.

Let me reread this part, the cost of maintaining the second index is
dramatically higher than adding a backref.  I do agree that it's nice
to be able to delete the dedup trees without impacting the rest, but
over the long term I think we'll regret the added balances.

> 
> If only want to reduce ondisk space, just trashing the hash and making
> DEDUPE_BYTENR_ITEM have no data would be good enough.
> 
> As (bytenr, DEDEUPE_BYTENR_ITEM) can locate the hash uniquely.

For the second index, the big problem is the cost of the btree
operations.  We're already pretty expensive in terms of the cost of
deleting an extent; with dedup it's 2x higher, and with dedup + extra index,
it's 3x higher.

> 
> In fact no code really checked the hash for dedupe bytenr item, they all
> just swap objectid and offset, reset the type and do search for
> DEDUPE_HASH_ITEM.
> 
> So it's OK to emit the hash.

If we have to go with the second index, I do agree here.

-chris

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue
  2016-03-25  1:44     ` Qu Wenruo
@ 2016-03-25 15:12       ` Chris Mason
  0 siblings, 0 replies; 62+ messages in thread
From: Chris Mason @ 2016-03-25 15:12 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

On Fri, Mar 25, 2016 at 09:44:31AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2016/03/24 16:35 -0400:
> >On Tue, Mar 22, 2016 at 09:35:50AM +0800, Qu Wenruo wrote:
> >>From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> >>
> >>The basic idea is also calculate hash before compression, and add needed
> >>members for dedupe to record compressed file extent.
> >>
> >>Since dedupe support dedupe_bs larger than 128K, which is the up limit
> >>of compression file extent, in that case we will skip dedupe and prefer
> >>compression, as in that size dedupe rate is low and compression will be
> >>more obvious.
> >>
> >>Current implement is far from elegant. The most elegant one should split
> >>every data processing method into its own and independent function, and
> >>have a unified function to co-operate them.
> >
> >I'd leave this one out for now, it looks like we need to refine the
> >pipeline from dedup -> compression and this is just more to carry around
> >until the initial support is in.  Can you just decline to dedup
> >compressed extents for now?
> 
> Yes, completely no problem.
> Although this patch seems works well yet, but I also have planned to rework
> current run_delloc_range() to make it more flex and clear.
> 
> So the main object of the patch is more about raising attention for such
> further re-work.
> 
> And now it has achieved its goal.

Thanks, I do really like how you had compression in mind all along.

-chris


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-25 15:11       ` Chris Mason
@ 2016-03-26 13:11         ` Qu Wenruo
  2016-03-28 14:09           ` Chris Mason
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-03-26 13:11 UTC (permalink / raw)
  To: Chris Mason, Qu Wenruo, linux-btrfs, Liu Bo, Wang Xiaoguang



On 03/25/2016 11:11 PM, Chris Mason wrote:
> On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
>>
>>
>> Chris Mason wrote on 2016/03/24 16:58 -0400:
>>> Are you storing the entire hash, or just the parts not represented in
>>> the key?  I'd like to keep the on-disk part as compact as possible for
>>> this part.
>>
>> Currently, it's entire hash.
>>
>> More detailed can be checked in another mail.
>>
>> Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
>> I still quite like current implementation, as one memcpy() is simpler.
>
> [ sorry FB makes urls look ugly, so I delete them from replys ;) ]
>
> Right, I saw that but wanted to reply to the specific patch.  One of the
> lessons learned from the extent allocation tree and file extent items is
> they are just too big.  Lets save those bytes, it'll add up.

OK, I'll drop the duplicated last 8 bytes.

I'll also remove the "length" member, as it can always be fetched from
dedupe_info->block_size.

The length itself was used to verify whether we are in the transaction that
switches to a new dedupe size, but since we now use a full sync_fs(), such
behavior is not needed any more.
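
So the on-disk hash item would shrink to roughly the following (sketch
only, for illustration):

/*
 * Reduced on-disk hash item: no "length" member (implied by
 * dedupe_info->block_size) and only the first (hash_size - 8) bytes
 * of the hash, since the last 64 bits already live in the key.
 */
struct btrfs_dedupe_hash_item {
	u8 hash[0];	/* hash_size - 8 bytes of hash data */
} __attribute__ ((__packed__));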


>
>>
>>>
>>>> +
>>>> +/*
>>>> + * Objectid: bytenr
>>>> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
>>>> + * offset: Last 64 bit of the hash
>>>> + *
>>>> + * Used for bytenr <-> hash search (for free_extent)
>>>> + * all its content is hash.
>>>> + * So no special item struct is needed.
>>>> + */
>>>> +
>>>
>>> Can we do this instead with a backref from the extent?  It'll save us a
>>> huge amount of IO as we delete things.
>>
>> That's the original implementation from Liu Bo.
>>
>> The problem is, it changes the data backref rules(originally, only
>> EXTENT_DATA item can cause data backref), and will make dedupe INCOMPACT
>> other than current RO_COMPACT.
>> So I really don't like to change the data backref rule.
>
> Let me reread this part, the cost of maintaining the second index is
> dramatically higher than adding a backref.  I do agree that's its nice
> to be able to delete the dedup trees without impacting the rest, but
> over the long term I think we'll regret the added balances.

Thanks for pointing out the problem. Yes, I hadn't even considered this fact.

But, on the other hand, such a removal only happens when we remove the
*last* reference to the extent.
So, for the medium to high dedupe rate case, this routine is not that
frequent, which reduces the impact.
(Which is quite different from the non-dedupe case.)

And for the low dedupe rate case, why use dedupe anyway? In that case,
compression would be much more appropriate if the user just wants to reduce
disk usage, IMO.


Another reason I don't want to touch the delayed-ref code is that it has
already caused us quite a lot of pain.
We have been fighting with delayed refs from the beginning.
The delayed refs, especially the ability to run delayed refs
asynchronously, are the biggest problem we have met.

And that's why we added the ability to increase a data ref while holding
delayed_refs->lock in patch 5, and then use a long lock-and-try-inc
method to search the hash in patch 6.

Any modification to delayed refs can easily lead to new bugs (yes, I have
proved it several times myself).
So I chose to use the current method.

>
>>
>> If only want to reduce ondisk space, just trashing the hash and making
>> DEDUPE_BYTENR_ITEM have no data would be good enough.
>>
>> As (bytenr, DEDEUPE_BYTENR_ITEM) can locate the hash uniquely.
>
> For the second index, the big problem is the cost of the btree
> operations.  We're already pretty expensive in terms of the cost of
> deleting an extent, with dedup its 2x higher, with dedup + extra index,
> its 3x higher.

The good news is, we only delete the hash and bytenr items and their refs
at the last de-reference.
And in the normal (medium to high dedupe rate) case, that's not a frequent
operation IMHO.

Thanks,
Qu

>
>>
>> In fact no code really checked the hash for dedupe bytenr item, they all
>> just swap objectid and offset, reset the type and do search for
>> DEDUPE_HASH_ITEM.
>>
>> So it's OK to emit the hash.
>
> If we have to go with the second index, I do agree here.
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-26 13:11         ` Qu Wenruo
@ 2016-03-28 14:09           ` Chris Mason
  2016-03-29  1:47             ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Mason @ 2016-03-28 14:09 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Liu Bo, Wang Xiaoguang

On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:
> 
> 
> On 03/25/2016 11:11 PM, Chris Mason wrote:
> >On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Chris Mason wrote on 2016/03/24 16:58 -0400:
> >>>Are you storing the entire hash, or just the parts not represented in
> >>>the key?  I'd like to keep the on-disk part as compact as possible for
> >>>this part.
> >>
> >>Currently, it's entire hash.
> >>
> >>More detailed can be checked in another mail.
> >>
> >>Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
> >>I still quite like current implementation, as one memcpy() is simpler.
> >
> >[ sorry FB makes urls look ugly, so I delete them from replys ;) ]
> >
> >Right, I saw that but wanted to reply to the specific patch.  One of the
> >lessons learned from the extent allocation tree and file extent items is
> >they are just too big.  Lets save those bytes, it'll add up.
> 
> OK, I'll reduce the duplicated last 8 bytes.
> 
> And also, removing the "length" member, as it can be always fetched from
> dedupe_info->block_size.

This would mean dedup_info->block_size is a write once field.  I'm ok
with that (just like metadata blocksize) but we should make sure the
ioctls etc don't allow changing it.

> 
> The length itself is used to verify if we are at the transaction to a new
> dedupe size, but later we use full sync_fs(), such behavior is not needed
> any more.
> 
> 
> >
> >>
> >>>
> >>>>+
> >>>>+/*
> >>>>+ * Objectid: bytenr
> >>>>+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
> >>>>+ * offset: Last 64 bit of the hash
> >>>>+ *
> >>>>+ * Used for bytenr <-> hash search (for free_extent)
> >>>>+ * all its content is hash.
> >>>>+ * So no special item struct is needed.
> >>>>+ */
> >>>>+
> >>>
> >>>Can we do this instead with a backref from the extent?  It'll save us a
> >>>huge amount of IO as we delete things.
> >>
> >>That's the original implementation from Liu Bo.
> >>
> >>The problem is, it changes the data backref rules(originally, only
> >>EXTENT_DATA item can cause data backref), and will make dedupe INCOMPACT
> >>other than current RO_COMPACT.
> >>So I really don't like to change the data backref rule.
> >
> >Let me reread this part, the cost of maintaining the second index is
> >dramatically higher than adding a backref.  I do agree that's its nice
> >to be able to delete the dedup trees without impacting the rest, but
> >over the long term I think we'll regret the added balances.
> 
> Thanks for pointing the problem. Yes, I didn't even consider this fact.
> 
> But, on the other hand. such remove only happens when we remove the *last*
> reference of the extent.
> So, for medium to high dedupe rate case, such routine is not that frequent,
> which will reduce the impact.
> (Which is quite different for non-dedupe case)

It's both addition and removal, and the efficiency hit does depend on
what level of sharing you're able to achieve.  But what we don't want is
for metadata usage to explode as people make small non-duplicate changes
to their FS.   If that happens, we'll only end up using dedup in back up
farms and other highly limited use cases.

I do agree that delayed refs are error prone, but that's a good reason
to fix delayed refs, not to recreate the backrefs of the extent
allocation tree in a new dedicated tree.

-chris


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  2016-03-28 14:09           ` Chris Mason
@ 2016-03-29  1:47             ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-29  1:47 UTC (permalink / raw)
  To: Chris Mason, Qu Wenruo, linux-btrfs, Liu Bo, Wang Xiaoguang



Chris Mason wrote on 2016/03/28 10:09 -0400:
> On Sat, Mar 26, 2016 at 09:11:53PM +0800, Qu Wenruo wrote:
>>
>>
>> On 03/25/2016 11:11 PM, Chris Mason wrote:
>>> On Fri, Mar 25, 2016 at 09:59:39AM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> Chris Mason wrote on 2016/03/24 16:58 -0400:
>>>>> Are you storing the entire hash, or just the parts not represented in
>>>>> the key?  I'd like to keep the on-disk part as compact as possible for
>>>>> this part.
>>>>
>>>> Currently, it's entire hash.
>>>>
>>>> More detailed can be checked in another mail.
>>>>
>>>> Although it's OK to truncate the last duplicated 8 bytes(64bit) for me,
>>>> I still quite like current implementation, as one memcpy() is simpler.
>>>
>>> [ sorry FB makes urls look ugly, so I delete them from replys ;) ]
>>>
>>> Right, I saw that but wanted to reply to the specific patch.  One of the
>>> lessons learned from the extent allocation tree and file extent items is
>>> they are just too big.  Lets save those bytes, it'll add up.
>>
>> OK, I'll reduce the duplicated last 8 bytes.
>>
>> And also, removing the "length" member, as it can be always fetched from
>> dedupe_info->block_size.
>
> This would mean dedup_info->block_size is a write once field.  I'm ok
> with that (just like metadata blocksize) but we should make sure the
> ioctls etc don't allow changing it.

Not a problem; currently a block_size change is done by completely disabling
dedupe (implying a sync_fs), then re-enabling it with the new block_size.

So it would be OK.

>
>>
>> The length itself is used to verify if we are at the transaction to a new
>> dedupe size, but later we use full sync_fs(), such behavior is not needed
>> any more.
>>
>>
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +/*
>>>>>> + * Objectid: bytenr
>>>>>> + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
>>>>>> + * offset: Last 64 bit of the hash
>>>>>> + *
>>>>>> + * Used for bytenr <-> hash search (for free_extent)
>>>>>> + * all its content is hash.
>>>>>> + * So no special item struct is needed.
>>>>>> + */
>>>>>> +
>>>>>
>>>>> Can we do this instead with a backref from the extent?  It'll save us a
>>>>> huge amount of IO as we delete things.
>>>>
>>>> That's the original implementation from Liu Bo.
>>>>
>>>> The problem is, it changes the data backref rules(originally, only
>>>> EXTENT_DATA item can cause data backref), and will make dedupe INCOMPACT
>>>> other than current RO_COMPACT.
>>>> So I really don't like to change the data backref rule.
>>>
>>> Let me reread this part, the cost of maintaining the second index is
>>> dramatically higher than adding a backref.  I do agree that's its nice
>>> to be able to delete the dedup trees without impacting the rest, but
>>> over the long term I think we'll regret the added balances.
>>
>> Thanks for pointing the problem. Yes, I didn't even consider this fact.
>>
>> But, on the other hand. such remove only happens when we remove the *last*
>> reference of the extent.
>> So, for medium to high dedupe rate case, such routine is not that frequent,
>> which will reduce the impact.
>> (Which is quite different for non-dedupe case)
>
> It's both addition and removal, and the efficiency hit does depend on
> what level of sharing you're able to achieve.  But what we don't want is
> for metadata usage to explode as people make small non-duplicate changes
> to their FS.  If that happens, we'll only end up using dedup in backup
> farms and other highly limited use cases.

Right, the current dedupe-specific backref does bring unavoidable 
metadata overhead.

[[People are making a trade-off by using a non-default feature]]
IMHO dedupe is not a generic feature; just like compression and 
possibly encryption, people choose it with the trade-offs in mind.

For example, compression can achieve quite high performance for easily 
compressible data, but can also get quite low performance for not so 
compressible data, like ISO files or videos.
(In my test with a 2-core VM, virtio-blk on HDD, dd'ing an ISO into a 
btrfs file gives about 90MB/s with the default mount options, while 
with compression it's only about 40~50MB/s.)

If we combine all the overhead together (not only metadata overhead), 
almost all current transparent data processing methods only benefit 
specific use cases while reducing generic performance.

So the increased metadata overhead is acceptable to me, especially 
since the main overhead is CPU time spent on SHA256.

And we have workarounds, from setting the dedupe disable prop to 
setting a larger dedupe block_size, to avoid small non-dedupe writes 
filling the dedupe tree.


>
> I do agree that delayed refs are error prone, but that's a good reason
> not fix delayed refs, not to recreate the backrefs of the extent
> allocation tree in a new dedicated tree.

[[We need an approach generic to both backends]]
Another thing I want to mention is that dedupe now contains 2 different 
backends, so we'd better choose an approach that doesn't split the 
backends across different incompat/ro_compat flags.

If we use the backref method, the ondisk backend will definitely make 
dedupe INCOMPAT, affecting the in-memory backend even though it is 
completely backward-compatible.

Or we split the dedupe flag into DEDUPE_ONDISK and DEDUPE_INMEMORY, 
where the former is INCOMPAT while the latter is at most RO_COMPAT (if 
using the dedupe tree).


[[A cleaner layout is less bug-prone]]
The last point in favour of the dedupe-specific backref is to limit the 
impact of possible bugs, which for me is more important than performance.

The current implementation confines a dedupe backref bug to dedupe only, 
while a new bug in the extent backrefs would impact almost all btrfs 
functionality.

Thanks,
Qu

>
> -chris
>
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
                   ` (27 preceding siblings ...)
  2016-03-22 13:38 ` [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework David Sterba
@ 2016-03-29 17:22 ` Alex Lyakas
  2016-03-30  0:34   ` Qu Wenruo
  28 siblings, 1 reply; 62+ messages in thread
From: Alex Lyakas @ 2016-03-29 17:22 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Greetings Qu Wenruo,

I have reviewed the dedup patchset found in the github account you
mentioned. I have several questions. Please note that I am by no means
criticizing your design or code. I just want to make sure that my
understanding of the code is proper.

1) You mentioned in several emails that at some point byte-to-byte
comparison is to be performed. However, I do not see this in the code.
It seems that generic_search() only looks for the hash value match. If
there is a match, it goes ahead and adds a delayed ref.
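
To make question 1) concrete, what I expected to find somewhere in the
hit path was something along these lines (purely illustrative, the
helpers below do not exist in the patchset, which is exactly what I am
asking about):

        /* illustrative only: no such verification exists in generic_search() today */
        if (hash_matches(found_hash, wanted_hash)) {
                /* read back the data of the candidate extent and byte-compare it */
                if (memcmp(existing_data, new_data, dedupe_blocksize) != 0)
                        return 0;    /* hash collision: fall back to normal COW */
                /* only then add the delayed ref to reuse the existing extent */
        }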

2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
mutex and proceed with the normal COW. What happens if there are
several IO streams to different files writing an identical block, but
we don't have such block in our dedup DB? Then all
btrfs_dedupe_search() calls will not find a match, so all streams will
allocate space for their block (which are all identical). At some
point, they will call insert_reserved_file_extent() and will call
btrfs_dedupe_add(). Since there is a global mutex, the first stream
will insert the dedup hash entries into the DB, and all other streams
will find that such hash entry already exists. So the end result is
that we have the hash entry in the DB, but still we have multiple
copies of the same block allocated, due to timing issues. Is this
correct?
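
In other words, the interleaving I have in mind for question 2) is
roughly the following (H is the hash of the identical block):

        stream A: search(H) -> miss, allocate extent E1, add(H): inserted, DB maps H -> E1
        stream B: search(H) -> miss, allocate extent E2, add(H): already present, E2 stays allocated
        stream C: search(H) -> miss, allocate extent E3, add(H): already present, E3 stays allocated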

3) generic_search() competes with __btrfs_free_extent(). Meaning that
generic_search() wants to add a delayed ref to an existing extent,
whereas __btrfs_free_extent() wants to delete an entry from the dedup
DB. The race is resolved as follows:
- generic_search attempts to lock the delayed ref head
- if it succeeds to lock, then __btrfs_free_extent() is not running
right now. So we can add a delayed ref. Later, when delayed ref head
will be run, it will figure out what needs to be done (free the extent
or not)
- if we fail to lock, then there is a delayed ref processing for this
bytenr. We drop all locks and redo the search from the top. If
__btrfs_free_extent() has deleted the dedup hash meanwhile, we will
not find it, and proceed with normal COW.
Is my understanding correct?
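
Put as pseudo-code, my reading of the generic_search() flow for
question 3) is roughly the following (the helper names are my
approximations, not the exact patch code):

        again:
                hash = lookup_hash_in_dedupe_db(dedupe_info, wanted_hash);
                if (!hash)
                        return 0;        /* miss: caller proceeds with normal COW */

                head = find_delayed_ref_head(trans, hash->bytenr);
                if (head && !mutex_trylock(&head->mutex)) {
                        /* __btrfs_free_extent() may be processing this bytenr */
                        drop_all_locks_and_path();
                        goto again;      /* redo the search; the hash may be gone by now */
                }

                /* safe to add the delayed data ref so the existing extent is reused */
                add_delayed_data_ref(hash->bytenr, ...);
                if (head)
                        mutex_unlock(&head->mutex);
                return 1;                /* hit */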

I also have a few nitpicks on the code; I will reply to the relevant patches.

Thanks for doing this work,
Alex.



On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160322
>
> This updated version of inband de-duplication has the following features:
> 1) ONE unified dedup framework.
>    Most of its code is hidden quietly in dedup.c and export the minimal
>    interfaces for its caller.
>    Reviewer and further developer would benefit from the unified
>    framework.
>
> 2) TWO different back-end with different trade-off
>    One is the improved version of previous Fujitsu in-memory only dedup.
>    The other one is enhanced dedup implementation from Liu Bo.
>    Changed its tree structure to handle bytenr -> hash search for
>    deleting hash, without the hideous data backref hack.
>
> 3) Support compression with dedupe
>    Now dedupe can work with compression.
>    Means that, a dedupe miss case can be compressed, and dedupe hit case
>    can also reuse compressed file extents.
>
> 4) Ioctl interface with persist dedup status
>    Advised by David, now we use ioctl to enable/disable dedup.
>
>    And we now have dedup status, recorded in the first item of dedup
>    tree.
>    Just like quota, once enabled, no extra ioctl is needed for next
>    mount.
>
> 5) Ability to disable dedup for given dirs/files
>    It works just like the compression prop method, by adding a new
>    xattr.
>
> TODO:
> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>    Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>    CPU may even be a bottleneck other than IO.
>    But for faster hash, it will definitely cause conflicts, so we need
>    extent comparison before we introduce new dedup algorithm.
>
> 2) Misc end-user related helpers
>    Like handy and easy to implement dedup rate report.
>    And method to query in-memory hash size for those "non-exist" users who
>    want to use 'dedup enable -l' option but didn't ever know how much
>    RAM they have.
>
> Changelog:
> v2:
>   Totally reworked to handle multiple backends
> v3:
>   Fix a stupid but deadly on-disk backend bug
>   Add handle for multiple hash on same bytenr corner case to fix abort
>   trans error
>   Increase dedup rate by enhancing delayed ref handler for both backend.
>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>   Increase dedup block size up limit to 8M.
> v4:
>   Add dedup prop for disabling dedup for given files/dirs.
>   Merge inmem_search() and ondisk_search() into generic_search() to save
>   some code
>   Fix another delayed_ref related bug.
>   Use the same mutex for both inmem and ondisk backend.
>   Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>   rate.
> v5:
>   Reuse compress routine for much simpler dedup function.
>   Slightly improved performance due to above modification.
>   Fix race between dedup enable/disable
>   Fix for false ENOSPC report
> v6:
>   Further enable/disable race window fix.
>   Minor format change according to checkpatch.
> v7:
>   Fix one concurrency bug with balance.
>   Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>   btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>   and wrong parameter.
>   Rebased to integration-4.6.
> v8:
>   Rename 'dedup' to 'dedupe'.
>   Add support to allow dedupe and compression work at the same time.
>   Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>   who exposed most of them.
>   Small dedupe hit case performance improvement.
>
> Qu Wenruo (12):
>   btrfs: delayed-ref: Add support for increasing data ref under spinlock
>   btrfs: dedupe: Inband in-memory only de-duplication implement
>   btrfs: dedupe: Add basic tree structure for on-disk dedupe method
>   btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
>   btrfs: dedupe: Add support for on-disk hash search
>   btrfs: dedupe: Add support to delete hash for on-disk backend
>   btrfs: dedupe: Add support for adding hash for on-disk backend
>   btrfs: Fix a memory leak in inband dedupe hash
>   btrfs: dedupe: Fix metadata balance error when dedupe is enabled
>   btrfs: dedupe: Preparation for compress-dedupe co-work
>   btrfs: relocation: Enhance error handling to avoid BUG_ON
>   btrfs: dedupe: Fix a space cache delalloc bytes underflow bug
>
> Wang Xiaoguang (15):
>   btrfs: dedupe: Introduce dedupe framework and its header
>   btrfs: dedupe: Introduce function to initialize dedupe info
>   btrfs: dedupe: Introduce function to add hash into in-memory tree
>   btrfs: dedupe: Introduce function to remove hash from in-memory tree
>   btrfs: dedupe: Introduce function to search for an existing hash
>   btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>   btrfs: ordered-extent: Add support for dedupe
>   btrfs: dedupe: Add ioctl for inband dedupelication
>   btrfs: dedupe: add an inode nodedupe flag
>   btrfs: dedupe: add a property handler for online dedupe
>   btrfs: dedupe: add per-file online dedupe control
>   btrfs: try more times to alloc metadata reserve space
>   btrfs: dedupe: Fix a bug when running inband dedupe with balance
>   btrfs: dedupe: Avoid submit IO for hash hit extent
>   btrfs: dedupe: Add support for compression and dedpue
>
>  fs/btrfs/Makefile            |    2 +-
>  fs/btrfs/ctree.h             |   78 ++-
>  fs/btrfs/dedupe.c            | 1188 ++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/dedupe.h            |  181 +++++++
>  fs/btrfs/delayed-ref.c       |   30 +-
>  fs/btrfs/delayed-ref.h       |    8 +
>  fs/btrfs/disk-io.c           |   28 +-
>  fs/btrfs/disk-io.h           |    1 +
>  fs/btrfs/extent-tree.c       |   49 +-
>  fs/btrfs/inode.c             |  338 ++++++++++--
>  fs/btrfs/ioctl.c             |   70 ++-
>  fs/btrfs/ordered-data.c      |   49 +-
>  fs/btrfs/ordered-data.h      |   16 +-
>  fs/btrfs/props.c             |   41 ++
>  fs/btrfs/relocation.c        |   41 +-
>  fs/btrfs/sysfs.c             |    2 +
>  include/trace/events/btrfs.h |    3 +-
>  include/uapi/linux/btrfs.h   |   25 +-
>  18 files changed, 2073 insertions(+), 77 deletions(-)
>  create mode 100644 fs/btrfs/dedupe.c
>  create mode 100644 fs/btrfs/dedupe.h
>
> --
> 2.7.3
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  2016-03-22  1:35 ` [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
@ 2016-03-29 17:31   ` Alex Lyakas
  2016-03-30  0:26     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Alex Lyakas @ 2016-03-29 17:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Wang Xiaoguang

Hi Qu, Wang,

On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
> Since we will introduce a new on-disk based dedupe method, introduce new
> interfaces to resume previous dedupe setup.
>
> And since we introduce a new tree for status, also add disable handler
> for it.
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/dedupe.c  | 269 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/dedupe.h  |  13 +++
>  fs/btrfs/disk-io.c |  21 ++++-
>  fs/btrfs/disk-io.h |   1 +
>  4 files changed, 283 insertions(+), 21 deletions(-)
>
> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
> index 7ef2c37..1112fec 100644
> --- a/fs/btrfs/dedupe.c
> +++ b/fs/btrfs/dedupe.c
> @@ -21,6 +21,8 @@
>  #include "transaction.h"
>  #include "delayed-ref.h"
>  #include "qgroup.h"
> +#include "disk-io.h"
> +#include "locking.h"
>
>  struct inmem_hash {
>         struct rb_node hash_node;
> @@ -41,10 +43,103 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
>                         GFP_NOFS);
>  }
>
> +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
> +                           u16 backend, u64 blocksize, u64 limit)
> +{
> +       struct btrfs_dedupe_info *dedupe_info;
> +
> +       dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
> +       if (!dedupe_info)
> +               return -ENOMEM;
> +
> +       dedupe_info->hash_type = type;
> +       dedupe_info->backend = backend;
> +       dedupe_info->blocksize = blocksize;
> +       dedupe_info->limit_nr = limit;
> +
> +       /* only support SHA256 yet */
> +       dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
> +       if (IS_ERR(dedupe_info->dedupe_driver)) {
> +               int ret;
> +
> +               ret = PTR_ERR(dedupe_info->dedupe_driver);
> +               kfree(dedupe_info);
> +               return ret;
> +       }
> +
> +       dedupe_info->hash_root = RB_ROOT;
> +       dedupe_info->bytenr_root = RB_ROOT;
> +       dedupe_info->current_nr = 0;
> +       INIT_LIST_HEAD(&dedupe_info->lru_list);
> +       mutex_init(&dedupe_info->lock);
> +
> +       *ret_info = dedupe_info;
> +       return 0;
> +}
> +
> +static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
> +                           struct btrfs_dedupe_info *dedupe_info)
> +{
> +       struct btrfs_root *dedupe_root;
> +       struct btrfs_key key;
> +       struct btrfs_path *path;
> +       struct btrfs_dedupe_status_item *status;
> +       struct btrfs_trans_handle *trans;
> +       int ret;
> +
> +       path = btrfs_alloc_path();
> +       if (!path)
> +               return -ENOMEM;
> +
> +       trans = btrfs_start_transaction(fs_info->tree_root, 2);
> +       if (IS_ERR(trans)) {
> +               ret = PTR_ERR(trans);
> +               goto out;
> +       }
> +       dedupe_root = btrfs_create_tree(trans, fs_info,
> +                                      BTRFS_DEDUPE_TREE_OBJECTID);
> +       if (IS_ERR(dedupe_root)) {
> +               ret = PTR_ERR(dedupe_root);
> +               btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +               goto out;
> +       }
> +       dedupe_info->dedupe_root = dedupe_root;
> +
> +       key.objectid = 0;
> +       key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
> +       key.offset = 0;
> +
> +       ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
> +                                     sizeof(*status));
> +       if (ret < 0) {
> +               btrfs_abort_transaction(trans, fs_info->tree_root, ret);
> +               goto out;
> +       }
> +
> +       status = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +                               struct btrfs_dedupe_status_item);
> +       btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
> +                                        dedupe_info->blocksize);
> +       btrfs_set_dedupe_status_limit(path->nodes[0], status,
> +                       dedupe_info->limit_nr);
> +       btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
> +                       dedupe_info->hash_type);
> +       btrfs_set_dedupe_status_backend(path->nodes[0], status,
> +                       dedupe_info->backend);
> +       btrfs_mark_buffer_dirty(path->nodes[0]);
> +out:
> +       btrfs_free_path(path);
> +       if (ret == 0)
> +               btrfs_commit_transaction(trans, fs_info->tree_root);
> +       return ret;
> +}
> +
>  int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>                         u64 blocksize, u64 limit_nr)
>  {
>         struct btrfs_dedupe_info *dedupe_info;
> +       int create_tree;
> +       u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
>         u64 limit = limit_nr;
>         int ret = 0;
>
> @@ -63,6 +158,14 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>                 limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
>         if (backend == BTRFS_DEDUPE_BACKEND_ONDISK && limit_nr != 0)
>                 limit = 0;
> +       /* Ondisk backend needs DEDUP RO compat feature */
> +       if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE) &&
> +           backend == BTRFS_DEDUPE_BACKEND_ONDISK)
> +               return -EOPNOTSUPP;
> +
> +       /* Meaningless and unable to enable dedupe for RO fs */
> +       if (fs_info->sb->s_flags & MS_RDONLY)
> +               return -EROFS;
>
>         dedupe_info = fs_info->dedupe_info;
>         if (dedupe_info) {
> @@ -81,29 +184,71 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>                 return 0;
>         }
>
> +       dedupe_info = NULL;
>  enable:
> -       dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
> -       if (dedupe_info)
> +       create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE;
> +
> +       ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
> +       if (ret < 0)
> +               return ret;
> +       if (create_tree) {
> +               ret = init_dedupe_tree(fs_info, dedupe_info);
> +               if (ret < 0)
> +                       goto out;
> +       }
> +
> +       fs_info->dedupe_info = dedupe_info;
I think this leaks memory. If previously we had a valid
fs_info->dedupe_info, it will remain allocated.


> +       /* We must ensure dedupe_enabled is modified after dedupe_info */
> +       smp_wmb();
> +       fs_info->dedupe_enabled = 1;
> +out:
> +       if (ret < 0) {
> +               crypto_free_shash(dedupe_info->dedupe_driver);
> +               kfree(dedupe_info);
> +       }
> +       return ret;
> +}
> +
> +int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
> +                       struct btrfs_root *dedupe_root)
> +{
> +       struct btrfs_dedupe_info *dedupe_info;
> +       struct btrfs_dedupe_status_item *status;
> +       struct btrfs_key key;
> +       struct btrfs_path *path;
> +       u64 blocksize;
> +       u64 limit;
> +       u16 type;
> +       u16 backend;
> +       int ret = 0;
> +
> +       path = btrfs_alloc_path();
> +       if (!path)
>                 return -ENOMEM;
>
> -       dedupe_info->hash_type = type;
> -       dedupe_info->backend = backend;
> -       dedupe_info->blocksize = blocksize;
> -       dedupe_info->limit_nr = limit;
> +       key.objectid = 0;
> +       key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
> +       key.offset = 0;
>
> -       /* Only support SHA256 yet */
> -       dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
> -       if (IS_ERR(dedupe_info->dedupe_driver)) {
> -               btrfs_err(fs_info, "failed to init sha256 driver");
> -               ret = PTR_ERR(dedupe_info->dedupe_driver);
> +       ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
> +       if (ret > 0) {
> +               ret = -ENOENT;
> +               goto out;
> +       } else if (ret < 0) {
>                 goto out;
>         }
>
> -       dedupe_info->hash_root = RB_ROOT;
> -       dedupe_info->bytenr_root = RB_ROOT;
> -       dedupe_info->current_nr = 0;
> -       INIT_LIST_HEAD(&dedupe_info->lru_list);
> -       mutex_init(&dedupe_info->lock);
> +       status = btrfs_item_ptr(path->nodes[0], path->slots[0],
> +                               struct btrfs_dedupe_status_item);
> +       blocksize = btrfs_dedupe_status_blocksize(path->nodes[0], status);
> +       limit = btrfs_dedupe_status_limit(path->nodes[0], status);
> +       type = btrfs_dedupe_status_hash_type(path->nodes[0], status);
> +       backend = btrfs_dedupe_status_backend(path->nodes[0], status);
> +
> +       ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
> +       if (ret < 0)
> +               goto out;
> +       dedupe_info->dedupe_root = dedupe_root;
>
>         fs_info->dedupe_info = dedupe_info;
>         /* We must ensure dedupe_enabled is modified after dedupe_info */
> @@ -111,11 +256,36 @@ enable:
>         fs_info->dedupe_enabled = 1;
>
>  out:
> -       if (ret < 0)
> -               kfree(dedupe_info);
> +       btrfs_free_path(path);
>         return ret;
>  }
>
> +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info);
> +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
> +{
> +       struct btrfs_dedupe_info *dedupe_info;
> +
> +       fs_info->dedupe_enabled = 0;
> +
> +       /* same as disable */
> +       smp_wmb();
> +       dedupe_info = fs_info->dedupe_info;
> +       fs_info->dedupe_info = NULL;
> +
> +       if (!dedupe_info)
> +               return 0;
> +
> +       if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
> +               inmem_destroy(dedupe_info);
> +       if (dedupe_info->dedupe_root) {
> +               free_root_extent_buffers(dedupe_info->dedupe_root);
> +               kfree(dedupe_info->dedupe_root);
> +       }
> +       crypto_free_shash(dedupe_info->dedupe_driver);
> +       kfree(dedupe_info);
> +       return 0;
> +}
> +
>  static int inmem_insert_hash(struct rb_root *root,
>                              struct inmem_hash *hash, int hash_len)
>  {
> @@ -325,6 +495,65 @@ static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
>         mutex_unlock(&dedupe_info->lock);
>  }
>
> +static int remove_dedupe_tree(struct btrfs_root *dedupe_root)
> +{
> +       struct btrfs_trans_handle *trans;
> +       struct btrfs_fs_info *fs_info = dedupe_root->fs_info;
> +       struct btrfs_path *path;
> +       struct btrfs_key key;
> +       struct extent_buffer *node;
> +       int ret;
> +       int nr;
> +
> +       path = btrfs_alloc_path();
> +       if (!path)
> +               return -ENOMEM;
> +       trans = btrfs_start_transaction(fs_info->tree_root, 2);
> +       if (IS_ERR(trans)) {
> +               ret = PTR_ERR(trans);
> +               goto out;
> +       }
> +
> +       path->leave_spinning = 1;
> +       key.objectid = 0;
> +       key.offset = 0;
> +       key.type = 0;
> +
> +       while (1) {
> +               ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
> +               if (ret < 0)
> +                       goto out;
> +               node = path->nodes[0];
> +               nr = btrfs_header_nritems(node);
> +               if (nr == 0) {
> +                       btrfs_release_path(path);
> +                       break;
> +               }
> +               path->slots[0] = 0;
> +               ret = btrfs_del_items(trans, dedupe_root, path, 0, nr);
> +               if (ret)
> +                       goto out;
> +               btrfs_release_path(path);
> +       }
> +
> +       ret = btrfs_del_root(trans, fs_info->tree_root, &dedupe_root->root_key);
> +       if (ret)
> +               goto out;
> +
> +       list_del(&dedupe_root->dirty_list);
> +       btrfs_tree_lock(dedupe_root->node);
> +       clean_tree_block(trans, fs_info, dedupe_root->node);
> +       btrfs_tree_unlock(dedupe_root->node);
> +       btrfs_free_tree_block(trans, dedupe_root, dedupe_root->node, 0, 1);
> +       free_extent_buffer(dedupe_root->node);
> +       free_extent_buffer(dedupe_root->commit_root);
> +       kfree(dedupe_root);
> +       ret = btrfs_commit_transaction(trans, fs_info->tree_root);
> +out:
> +       btrfs_free_path(path);
> +       return ret;
> +}
> +
>  int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>  {
>         struct btrfs_dedupe_info *dedupe_info;
> @@ -358,10 +587,12 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>         /* now we are OK to clean up everything */
>         if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
>                 inmem_destroy(dedupe_info);
> +       if (dedupe_info->dedupe_root)
> +               ret = remove_dedupe_tree(dedupe_info->dedupe_root);
>
>         crypto_free_shash(dedupe_info->dedupe_driver);
>         kfree(dedupe_info);
> -       return 0;
> +       return ret;
>  }
>
>  /*
> diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
> index 537f0b8..120e630 100644
> --- a/fs/btrfs/dedupe.h
> +++ b/fs/btrfs/dedupe.h
> @@ -112,6 +112,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>   */
>  int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
>
> + /*
> + * Restore previous dedupe setup from disk
> + * Called at mount time
> + */
> +int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
> +                      struct btrfs_root *dedupe_root);
> +
> +/*
> + * Cleanup current btrfs_dedupe_info
> + * Called in umount time
> + */
> +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
> +
>  /*
>   * Calculate hash for dedup.
>   * Caller must ensure [start, start + dedupe_bs) has valid data.
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 57ae928..44d098d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -51,6 +51,7 @@
>  #include "sysfs.h"
>  #include "qgroup.h"
>  #include "compression.h"
> +#include "dedupe.h"
>
>  #ifdef CONFIG_X86
>  #include <asm/cpufeature.h>
> @@ -2156,7 +2157,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>         btrfs_destroy_workqueue(fs_info->extent_workers);
>  }
>
> -static void free_root_extent_buffers(struct btrfs_root *root)
> +void free_root_extent_buffers(struct btrfs_root *root)
>  {
>         if (root) {
>                 free_extent_buffer(root->node);
> @@ -2490,7 +2491,21 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
>                 fs_info->free_space_root = root;
>         }
>
> -       return 0;
> +       location.objectid = BTRFS_DEDUPE_TREE_OBJECTID;
> +       root = btrfs_read_tree_root(tree_root, &location);
> +       if (IS_ERR(root)) {
> +               ret = PTR_ERR(root);
> +               if (ret != -ENOENT)
> +                       return ret;
> +               return 0;
> +       }
> +       set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
> +       ret = btrfs_dedupe_resume(fs_info, root);
> +       if (ret < 0) {
> +               free_root_extent_buffers(root);
> +               kfree(root);
> +       }
> +       return ret;
>  }
>
>  int open_ctree(struct super_block *sb,
> @@ -3885,6 +3900,8 @@ void close_ctree(struct btrfs_root *root)
>
>         btrfs_free_qgroup_config(fs_info);
>
> +       btrfs_dedupe_cleanup(fs_info);
> +
>         if (percpu_counter_sum(&fs_info->delalloc_bytes)) {
>                 btrfs_info(fs_info, "at unmount delalloc count %lld",
>                        percpu_counter_sum(&fs_info->delalloc_bytes));
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 8e79d00..42c4ff2 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -70,6 +70,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
>  int btrfs_init_fs_root(struct btrfs_root *root);
>  int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
>                          struct btrfs_root *root);
> +void free_root_extent_buffers(struct btrfs_root *root);
>  void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info);
>
>  struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
> --
> 2.7.3
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  2016-03-29 17:31   ` Alex Lyakas
@ 2016-03-30  0:26     ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-30  0:26 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs, Wang Xiaoguang



Alex Lyakas wrote on 2016/03/29 19:31 +0200:
> Hi Qu, Wang,
>
> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> Since we will introduce a new on-disk based dedupe method, introduce new
>> interfaces to resume previous dedupe setup.
>>
>> And since we introduce a new tree for status, also add disable handler
>> for it.
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>   fs/btrfs/dedupe.c  | 269 +++++++++++++++++++++++++++++++++++++++++++++++++----
>>   fs/btrfs/dedupe.h  |  13 +++
>>   fs/btrfs/disk-io.c |  21 ++++-
>>   fs/btrfs/disk-io.h |   1 +
>>   4 files changed, 283 insertions(+), 21 deletions(-)
>>
>> diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
>> index 7ef2c37..1112fec 100644
>> --- a/fs/btrfs/dedupe.c
>> +++ b/fs/btrfs/dedupe.c
>> @@ -21,6 +21,8 @@
>>   #include "transaction.h"
>>   #include "delayed-ref.h"
>>   #include "qgroup.h"
>> +#include "disk-io.h"
>> +#include "locking.h"
>>
>>   struct inmem_hash {
>>          struct rb_node hash_node;
>> @@ -41,10 +43,103 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
>>                          GFP_NOFS);
>>   }
>>
>> +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
>> +                           u16 backend, u64 blocksize, u64 limit)
>> +{
>> +       struct btrfs_dedupe_info *dedupe_info;
>> +
>> +       dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
>> +       if (!dedupe_info)
>> +               return -ENOMEM;
>> +
>> +       dedupe_info->hash_type = type;
>> +       dedupe_info->backend = backend;
>> +       dedupe_info->blocksize = blocksize;
>> +       dedupe_info->limit_nr = limit;
>> +
>> +       /* only support SHA256 yet */
>> +       dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
>> +       if (IS_ERR(dedupe_info->dedupe_driver)) {
>> +               int ret;
>> +
>> +               ret = PTR_ERR(dedupe_info->dedupe_driver);
>> +               kfree(dedupe_info);
>> +               return ret;
>> +       }
>> +
>> +       dedupe_info->hash_root = RB_ROOT;
>> +       dedupe_info->bytenr_root = RB_ROOT;
>> +       dedupe_info->current_nr = 0;
>> +       INIT_LIST_HEAD(&dedupe_info->lru_list);
>> +       mutex_init(&dedupe_info->lock);
>> +
>> +       *ret_info = dedupe_info;
>> +       return 0;
>> +}
>> +
>> +static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
>> +                           struct btrfs_dedupe_info *dedupe_info)
>> +{
>> +       struct btrfs_root *dedupe_root;
>> +       struct btrfs_key key;
>> +       struct btrfs_path *path;
>> +       struct btrfs_dedupe_status_item *status;
>> +       struct btrfs_trans_handle *trans;
>> +       int ret;
>> +
>> +       path = btrfs_alloc_path();
>> +       if (!path)
>> +               return -ENOMEM;
>> +
>> +       trans = btrfs_start_transaction(fs_info->tree_root, 2);
>> +       if (IS_ERR(trans)) {
>> +               ret = PTR_ERR(trans);
>> +               goto out;
>> +       }
>> +       dedupe_root = btrfs_create_tree(trans, fs_info,
>> +                                      BTRFS_DEDUPE_TREE_OBJECTID);
>> +       if (IS_ERR(dedupe_root)) {
>> +               ret = PTR_ERR(dedupe_root);
>> +               btrfs_abort_transaction(trans, fs_info->tree_root, ret);
>> +               goto out;
>> +       }
>> +       dedupe_info->dedupe_root = dedupe_root;
>> +
>> +       key.objectid = 0;
>> +       key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
>> +       key.offset = 0;
>> +
>> +       ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
>> +                                     sizeof(*status));
>> +       if (ret < 0) {
>> +               btrfs_abort_transaction(trans, fs_info->tree_root, ret);
>> +               goto out;
>> +       }
>> +
>> +       status = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +                               struct btrfs_dedupe_status_item);
>> +       btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
>> +                                        dedupe_info->blocksize);
>> +       btrfs_set_dedupe_status_limit(path->nodes[0], status,
>> +                       dedupe_info->limit_nr);
>> +       btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
>> +                       dedupe_info->hash_type);
>> +       btrfs_set_dedupe_status_backend(path->nodes[0], status,
>> +                       dedupe_info->backend);
>> +       btrfs_mark_buffer_dirty(path->nodes[0]);
>> +out:
>> +       btrfs_free_path(path);
>> +       if (ret == 0)
>> +               btrfs_commit_transaction(trans, fs_info->tree_root);
>> +       return ret;
>> +}
>> +
>>   int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>>                          u64 blocksize, u64 limit_nr)
>>   {
>>          struct btrfs_dedupe_info *dedupe_info;
>> +       int create_tree;
>> +       u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
>>          u64 limit = limit_nr;
>>          int ret = 0;
>>
>> @@ -63,6 +158,14 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>>                  limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT;
>>          if (backend == BTRFS_DEDUPE_BACKEND_ONDISK && limit_nr != 0)
>>                  limit = 0;
>> +       /* Ondisk backend needs DEDUP RO compat feature */
>> +       if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE) &&
>> +           backend == BTRFS_DEDUPE_BACKEND_ONDISK)
>> +               return -EOPNOTSUPP;
>> +
>> +       /* Meaningless and unable to enable dedupe for RO fs */
>> +       if (fs_info->sb->s_flags & MS_RDONLY)
>> +               return -EROFS;
>>
>>          dedupe_info = fs_info->dedupe_info;
>>          if (dedupe_info) {
>> @@ -81,29 +184,71 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>>                  return 0;
>>          }
>>
>> +       dedupe_info = NULL;
>>   enable:
>> -       dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS);
>> -       if (dedupe_info)
>> +       create_tree = compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE;
>> +
>> +       ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
>> +       if (ret < 0)
>> +               return ret;
>> +       if (create_tree) {
>> +               ret = init_dedupe_tree(fs_info, dedupe_info);
>> +               if (ret < 0)
>> +                       goto out;
>> +       }
>> +
>> +       fs_info->dedupe_info = dedupe_info;
> I think this leaks memory. If previously we had a valid
> fs_info->dedupe_info, it will remain allocated.

Please check the previous patch, or the final dedupe.c.

Just before the 'enable:' label:
------

         dedupe_info = fs_info->dedupe_info;
         if (dedupe_info) {
                 /* Check if we are re-enable for different dedupe config */
                 if (dedupe_info->blocksize != blocksize ||
                     dedupe_info->hash_type != type ||
                     dedupe_info->backend != backend) {
                         btrfs_dedupe_disable(fs_info);
                         goto enable;
                 }

                 /* On-fly limit change is OK */
                 mutex_lock(&dedupe_info->lock);
                 fs_info->dedupe_info->limit_nr = limit;
                 mutex_unlock(&dedupe_info->lock);
                 return 0;
         }

         dedupe_info = NULL;

------

For any existing dedupe_info, btrfs_dedupe_enable() will either disable 
it first or just modify the limit on the fly.

So there is no leak.

Thanks for the review.
Qu
>
>
>> +       /* We must ensure dedupe_enabled is modified after dedupe_info */
>> +       smp_wmb();
>> +       fs_info->dedupe_enabled = 1;
>> +out:
>> +       if (ret < 0) {
>> +               crypto_free_shash(dedupe_info->dedupe_driver);
>> +               kfree(dedupe_info);
>> +       }
>> +       return ret;
>> +}
>> +
>> +int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
>> +                       struct btrfs_root *dedupe_root)
>> +{
>> +       struct btrfs_dedupe_info *dedupe_info;
>> +       struct btrfs_dedupe_status_item *status;
>> +       struct btrfs_key key;
>> +       struct btrfs_path *path;
>> +       u64 blocksize;
>> +       u64 limit;
>> +       u16 type;
>> +       u16 backend;
>> +       int ret = 0;
>> +
>> +       path = btrfs_alloc_path();
>> +       if (!path)
>>                  return -ENOMEM;
>>
>> -       dedupe_info->hash_type = type;
>> -       dedupe_info->backend = backend;
>> -       dedupe_info->blocksize = blocksize;
>> -       dedupe_info->limit_nr = limit;
>> +       key.objectid = 0;
>> +       key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
>> +       key.offset = 0;
>>
>> -       /* Only support SHA256 yet */
>> -       dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0);
>> -       if (IS_ERR(dedupe_info->dedupe_driver)) {
>> -               btrfs_err(fs_info, "failed to init sha256 driver");
>> -               ret = PTR_ERR(dedupe_info->dedupe_driver);
>> +       ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
>> +       if (ret > 0) {
>> +               ret = -ENOENT;
>> +               goto out;
>> +       } else if (ret < 0) {
>>                  goto out;
>>          }
>>
>> -       dedupe_info->hash_root = RB_ROOT;
>> -       dedupe_info->bytenr_root = RB_ROOT;
>> -       dedupe_info->current_nr = 0;
>> -       INIT_LIST_HEAD(&dedupe_info->lru_list);
>> -       mutex_init(&dedupe_info->lock);
>> +       status = btrfs_item_ptr(path->nodes[0], path->slots[0],
>> +                               struct btrfs_dedupe_status_item);
>> +       blocksize = btrfs_dedupe_status_blocksize(path->nodes[0], status);
>> +       limit = btrfs_dedupe_status_limit(path->nodes[0], status);
>> +       type = btrfs_dedupe_status_hash_type(path->nodes[0], status);
>> +       backend = btrfs_dedupe_status_backend(path->nodes[0], status);
>> +
>> +       ret = init_dedupe_info(&dedupe_info, type, backend, blocksize, limit);
>> +       if (ret < 0)
>> +               goto out;
>> +       dedupe_info->dedupe_root = dedupe_root;
>>
>>          fs_info->dedupe_info = dedupe_info;
>>          /* We must ensure dedupe_enabled is modified after dedupe_info */
>> @@ -111,11 +256,36 @@ enable:
>>          fs_info->dedupe_enabled = 1;
>>
>>   out:
>> -       if (ret < 0)
>> -               kfree(dedupe_info);
>> +       btrfs_free_path(path);
>>          return ret;
>>   }
>>
>> +static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info);
>> +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
>> +{
>> +       struct btrfs_dedupe_info *dedupe_info;
>> +
>> +       fs_info->dedupe_enabled = 0;
>> +
>> +       /* same as disable */
>> +       smp_wmb();
>> +       dedupe_info = fs_info->dedupe_info;
>> +       fs_info->dedupe_info = NULL;
>> +
>> +       if (!dedupe_info)
>> +               return 0;
>> +
>> +       if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
>> +               inmem_destroy(dedupe_info);
>> +       if (dedupe_info->dedupe_root) {
>> +               free_root_extent_buffers(dedupe_info->dedupe_root);
>> +               kfree(dedupe_info->dedupe_root);
>> +       }
>> +       crypto_free_shash(dedupe_info->dedupe_driver);
>> +       kfree(dedupe_info);
>> +       return 0;
>> +}
>> +
>>   static int inmem_insert_hash(struct rb_root *root,
>>                               struct inmem_hash *hash, int hash_len)
>>   {
>> @@ -325,6 +495,65 @@ static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
>>          mutex_unlock(&dedupe_info->lock);
>>   }
>>
>> +static int remove_dedupe_tree(struct btrfs_root *dedupe_root)
>> +{
>> +       struct btrfs_trans_handle *trans;
>> +       struct btrfs_fs_info *fs_info = dedupe_root->fs_info;
>> +       struct btrfs_path *path;
>> +       struct btrfs_key key;
>> +       struct extent_buffer *node;
>> +       int ret;
>> +       int nr;
>> +
>> +       path = btrfs_alloc_path();
>> +       if (!path)
>> +               return -ENOMEM;
>> +       trans = btrfs_start_transaction(fs_info->tree_root, 2);
>> +       if (IS_ERR(trans)) {
>> +               ret = PTR_ERR(trans);
>> +               goto out;
>> +       }
>> +
>> +       path->leave_spinning = 1;
>> +       key.objectid = 0;
>> +       key.offset = 0;
>> +       key.type = 0;
>> +
>> +       while (1) {
>> +               ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
>> +               if (ret < 0)
>> +                       goto out;
>> +               node = path->nodes[0];
>> +               nr = btrfs_header_nritems(node);
>> +               if (nr == 0) {
>> +                       btrfs_release_path(path);
>> +                       break;
>> +               }
>> +               path->slots[0] = 0;
>> +               ret = btrfs_del_items(trans, dedupe_root, path, 0, nr);
>> +               if (ret)
>> +                       goto out;
>> +               btrfs_release_path(path);
>> +       }
>> +
>> +       ret = btrfs_del_root(trans, fs_info->tree_root, &dedupe_root->root_key);
>> +       if (ret)
>> +               goto out;
>> +
>> +       list_del(&dedupe_root->dirty_list);
>> +       btrfs_tree_lock(dedupe_root->node);
>> +       clean_tree_block(trans, fs_info, dedupe_root->node);
>> +       btrfs_tree_unlock(dedupe_root->node);
>> +       btrfs_free_tree_block(trans, dedupe_root, dedupe_root->node, 0, 1);
>> +       free_extent_buffer(dedupe_root->node);
>> +       free_extent_buffer(dedupe_root->commit_root);
>> +       kfree(dedupe_root);
>> +       ret = btrfs_commit_transaction(trans, fs_info->tree_root);
>> +out:
>> +       btrfs_free_path(path);
>> +       return ret;
>> +}
>> +
>>   int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>>   {
>>          struct btrfs_dedupe_info *dedupe_info;
>> @@ -358,10 +587,12 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
>>          /* now we are OK to clean up everything */
>>          if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
>>                  inmem_destroy(dedupe_info);
>> +       if (dedupe_info->dedupe_root)
>> +               ret = remove_dedupe_tree(dedupe_info->dedupe_root);
>>
>>          crypto_free_shash(dedupe_info->dedupe_driver);
>>          kfree(dedupe_info);
>> -       return 0;
>> +       return ret;
>>   }
>>
>>   /*
>> diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
>> index 537f0b8..120e630 100644
>> --- a/fs/btrfs/dedupe.h
>> +++ b/fs/btrfs/dedupe.h
>> @@ -112,6 +112,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
>>    */
>>   int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
>>
>> + /*
>> + * Restore previous dedupe setup from disk
>> + * Called at mount time
>> + */
>> +int btrfs_dedupe_resume(struct btrfs_fs_info *fs_info,
>> +                      struct btrfs_root *dedupe_root);
>> +
>> +/*
>> + * Cleanup current btrfs_dedupe_info
>> + * Called in umount time
>> + */
>> +int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
>> +
>>   /*
>>    * Calculate hash for dedup.
>>    * Caller must ensure [start, start + dedupe_bs) has valid data.
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 57ae928..44d098d 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -51,6 +51,7 @@
>>   #include "sysfs.h"
>>   #include "qgroup.h"
>>   #include "compression.h"
>> +#include "dedupe.h"
>>
>>   #ifdef CONFIG_X86
>>   #include <asm/cpufeature.h>
>> @@ -2156,7 +2157,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
>>          btrfs_destroy_workqueue(fs_info->extent_workers);
>>   }
>>
>> -static void free_root_extent_buffers(struct btrfs_root *root)
>> +void free_root_extent_buffers(struct btrfs_root *root)
>>   {
>>          if (root) {
>>                  free_extent_buffer(root->node);
>> @@ -2490,7 +2491,21 @@ static int btrfs_read_roots(struct btrfs_fs_info *fs_info,
>>                  fs_info->free_space_root = root;
>>          }
>>
>> -       return 0;
>> +       location.objectid = BTRFS_DEDUPE_TREE_OBJECTID;
>> +       root = btrfs_read_tree_root(tree_root, &location);
>> +       if (IS_ERR(root)) {
>> +               ret = PTR_ERR(root);
>> +               if (ret != -ENOENT)
>> +                       return ret;
>> +               return 0;
>> +       }
>> +       set_bit(BTRFS_ROOT_TRACK_DIRTY, &root->state);
>> +       ret = btrfs_dedupe_resume(fs_info, root);
>> +       if (ret < 0) {
>> +               free_root_extent_buffers(root);
>> +               kfree(root);
>> +       }
>> +       return ret;
>>   }
>>
>>   int open_ctree(struct super_block *sb,
>> @@ -3885,6 +3900,8 @@ void close_ctree(struct btrfs_root *root)
>>
>>          btrfs_free_qgroup_config(fs_info);
>>
>> +       btrfs_dedupe_cleanup(fs_info);
>> +
>>          if (percpu_counter_sum(&fs_info->delalloc_bytes)) {
>>                  btrfs_info(fs_info, "at unmount delalloc count %lld",
>>                         percpu_counter_sum(&fs_info->delalloc_bytes));
>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>> index 8e79d00..42c4ff2 100644
>> --- a/fs/btrfs/disk-io.h
>> +++ b/fs/btrfs/disk-io.h
>> @@ -70,6 +70,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
>>   int btrfs_init_fs_root(struct btrfs_root *root);
>>   int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
>>                           struct btrfs_root *root);
>> +void free_root_extent_buffers(struct btrfs_root *root);
>>   void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info);
>>
>>   struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
>> --
>> 2.7.3
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-29 17:22 ` Alex Lyakas
@ 2016-03-30  0:34   ` Qu Wenruo
  2016-03-30 10:36     ` Alex Lyakas
  2016-04-03  8:22     ` Alex Lyakas
  0 siblings, 2 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-03-30  0:34 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs



Alex Lyakas wrote on 2016/03/29 19:22 +0200:
> Greetings Qu Wenruo,
>
> I have reviewed the dedup patchset found in the github account you
> mentioned. I have several questions. Please note that I am by no means
> criticizing your design or code. I just want to make sure that
> my understanding of the code is proper.

It's OK to criticize the design or code, and that's how review works.

>
> 1) You mentioned in several emails that at some point byte-to-byte
> comparison is to be performed. However, I do not see this in the code.
> It seems that generic_search() only looks for the hash value match. If
> there is a match, it goes ahead and adds a delayed ref.

I mentioned byte-to-byte comparison as "not to be implemented any time 
soon".

Considering the lack of a facility to read out extent contents without 
an inode structure, it's not going to be done any time soon.

>
> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
> mutex and proceed with the normal COW. What happens if there are
> several IO streams to different files writing an identical block, but
> we don't have such block in our dedup DB? Then all
> btrfs_dedupe_search() calls will not find a match, so all streams will
> allocate space for their block (which are all identical). At some
> point, they will call insert_reserved_file_extent() and will call
> btrfs_dedupe_add(). Since there is a global mutex, the first stream
> will insert the dedup hash entries into the DB, and all other streams
> will find that such hash entry already exists. So the end result is
> that we have the hash entry in the DB, but still we have multiple
> copies of the same block allocated, due to timing issues. Is this
> correct?

That's right, and that's also unavoidable during the hash initialization stage.

>
> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
> generic_search() wants to add a delayed ref to an existing extent,
> whereas __btrfs_free_extent() wants to delete an entry from the dedup
> DB. The race is resolved as follows:
> - generic_search attempts to lock the delayed ref head
> - if it succeeds to lock, then __btrfs_free_extent() is not running
> right now. So we can add a delayed ref. Later, when delayed ref head
> will be run, it will figure out what needs to be done (free the extent
> or not)
> - if we fail to lock, then there is a delayed ref processing for this
> bytenr. We drop all locks and redo the search from the top. If
> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
> not find it, and proceed with normal COW.
> Is my understanding correct?

Yes that's correct.

>
> I have also few nitpicks on the code, will reply to relevant patches.

Feel free to comment.

Thanks,
Qu
>
> Thanks for doing this work,
> Alex.
>
>
>
> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>> This patchset can be fetched from github:
>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>
>> This updated version of inband de-duplication has the following features:
>> 1) ONE unified dedup framework.
>>     Most of its code is hidden quietly in dedup.c and export the minimal
>>     interfaces for its caller.
>>     Reviewer and further developer would benefit from the unified
>>     framework.
>>
>> 2) TWO different back-end with different trade-off
>>     One is the improved version of previous Fujitsu in-memory only dedup.
>>     The other one is enhanced dedup implementation from Liu Bo.
>>     Changed its tree structure to handle bytenr -> hash search for
>>     deleting hash, without the hideous data backref hack.
>>
>> 3) Support compression with dedupe
>>     Now dedupe can work with compression.
>>     Means that, a dedupe miss case can be compressed, and dedupe hit case
>>     can also reuse compressed file extents.
>>
>> 4) Ioctl interface with persist dedup status
>>     Advised by David, now we use ioctl to enable/disable dedup.
>>
>>     And we now have dedup status, recorded in the first item of dedup
>>     tree.
>>     Just like quota, once enabled, no extra ioctl is needed for next
>>     mount.
>>
>> 5) Ability to disable dedup for given dirs/files
>>     It works just like the compression prop method, by adding a new
>>     xattr.
>>
>> TODO:
>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>>     Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>     CPU may even be a bottleneck other than IO.
>>     But for faster hash, it will definitely cause conflicts, so we need
>>     extent comparison before we introduce new dedup algorithm.
>>
>> 2) Misc end-user related helpers
>>     Like handy and easy to implement dedup rate report.
>>     And method to query in-memory hash size for those "non-exist" users who
>>     want to use 'dedup enable -l' option but didn't ever know how much
>>     RAM they have.
>>
>> Changelog:
>> v2:
>>    Totally reworked to handle multiple backends
>> v3:
>>    Fix a stupid but deadly on-disk backend bug
>>    Add handle for multiple hash on same bytenr corner case to fix abort
>>    trans error
>>    Increase dedup rate by enhancing delayed ref handler for both backend.
>>    Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>>    Increase dedup block size up limit to 8M.
>> v4:
>>    Add dedup prop for disabling dedup for given files/dirs.
>>    Merge inmem_search() and ondisk_search() into generic_search() to save
>>    some code
>>    Fix another delayed_ref related bug.
>>    Use the same mutex for both inmem and ondisk backend.
>>    Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>>    rate.
>> v5:
>>    Reuse compress routine for much simpler dedup function.
>>    Slightly improved performance due to above modification.
>>    Fix race between dedup enable/disable
>>    Fix for false ENOSPC report
>> v6:
>>    Further enable/disable race window fix.
>>    Minor format change according to checkpatch.
>> v7:
>>    Fix one concurrency bug with balance.
>>    Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>>    btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>>    and wrong parameter.
>>    Rebased to integration-4.6.
>> v8:
>>    Rename 'dedup' to 'dedupe'.
>>    Add support to allow dedupe and compression work at the same time.
>>    Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>>    who exposed most of them.
>>    Small dedupe hit case performance improvement.
>>
>> Qu Wenruo (12):
>>    btrfs: delayed-ref: Add support for increasing data ref under spinlock
>>    btrfs: dedupe: Inband in-memory only de-duplication implement
>>    btrfs: dedupe: Add basic tree structure for on-disk dedupe method
>>    btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
>>    btrfs: dedupe: Add support for on-disk hash search
>>    btrfs: dedupe: Add support to delete hash for on-disk backend
>>    btrfs: dedupe: Add support for adding hash for on-disk backend
>>    btrfs: Fix a memory leak in inband dedupe hash
>>    btrfs: dedupe: Fix metadata balance error when dedupe is enabled
>>    btrfs: dedupe: Preparation for compress-dedupe co-work
>>    btrfs: relocation: Enhance error handling to avoid BUG_ON
>>    btrfs: dedupe: Fix a space cache delalloc bytes underflow bug
>>
>> Wang Xiaoguang (15):
>>    btrfs: dedupe: Introduce dedupe framework and its header
>>    btrfs: dedupe: Introduce function to initialize dedupe info
>>    btrfs: dedupe: Introduce function to add hash into in-memory tree
>>    btrfs: dedupe: Introduce function to remove hash from in-memory tree
>>    btrfs: dedupe: Introduce function to search for an existing hash
>>    btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>>    btrfs: ordered-extent: Add support for dedupe
>>    btrfs: dedupe: Add ioctl for inband dedupelication
>>    btrfs: dedupe: add an inode nodedupe flag
>>    btrfs: dedupe: add a property handler for online dedupe
>>    btrfs: dedupe: add per-file online dedupe control
>>    btrfs: try more times to alloc metadata reserve space
>>    btrfs: dedupe: Fix a bug when running inband dedupe with balance
>>    btrfs: dedupe: Avoid submit IO for hash hit extent
>>    btrfs: dedupe: Add support for compression and dedpue
>>
>>   fs/btrfs/Makefile            |    2 +-
>>   fs/btrfs/ctree.h             |   78 ++-
>>   fs/btrfs/dedupe.c            | 1188 ++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/dedupe.h            |  181 +++++++
>>   fs/btrfs/delayed-ref.c       |   30 +-
>>   fs/btrfs/delayed-ref.h       |    8 +
>>   fs/btrfs/disk-io.c           |   28 +-
>>   fs/btrfs/disk-io.h           |    1 +
>>   fs/btrfs/extent-tree.c       |   49 +-
>>   fs/btrfs/inode.c             |  338 ++++++++++--
>>   fs/btrfs/ioctl.c             |   70 ++-
>>   fs/btrfs/ordered-data.c      |   49 +-
>>   fs/btrfs/ordered-data.h      |   16 +-
>>   fs/btrfs/props.c             |   41 ++
>>   fs/btrfs/relocation.c        |   41 +-
>>   fs/btrfs/sysfs.c             |    2 +
>>   include/trace/events/btrfs.h |    3 +-
>>   include/uapi/linux/btrfs.h   |   25 +-
>>   18 files changed, 2073 insertions(+), 77 deletions(-)
>>   create mode 100644 fs/btrfs/dedupe.c
>>   create mode 100644 fs/btrfs/dedupe.h
>>
>> --
>> 2.7.3
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-30  0:34   ` Qu Wenruo
@ 2016-03-30 10:36     ` Alex Lyakas
  2016-04-03  8:22     ` Alex Lyakas
  1 sibling, 0 replies; 62+ messages in thread
From: Alex Lyakas @ 2016-03-30 10:36 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

Thanks for your comments, Qu.

Alex.



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-30  0:34   ` Qu Wenruo
  2016-03-30 10:36     ` Alex Lyakas
@ 2016-04-03  8:22     ` Alex Lyakas
  2016-04-05  3:51       ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: Alex Lyakas @ 2016-04-03  8:22 UTC (permalink / raw)
  To: Qu Wenruo, Xiaoguang Wang; +Cc: linux-btrfs

Hello Qu, Wang,

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.

Reviewing the code again, it seems that I still lack understanding.
What is special about the dedup code adding a delayed data ref versus
other places doing that? In other places, we do not insist on locking
the delayed ref head, but in dedup we do. For example,
__btrfs_drop_extents calls btrfs_inc_extent_ref, without locking the
ref head. I know that one of your purposes was to draw attention to
delayed ref processing, so you have succeeded.
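
Just so we are comparing the same thing, here is the lock-and-retry flow as
I currently read it, written as a tiny stand-alone model (a pthread trylock
stands in for locking the delayed ref head, and a flag stands in for the
hash still being present in the dedup DB; this is obviously a paraphrase,
not the btrfs code itself):

  /* Toy model of my reading of generic_search(); not actual btrfs code. */
  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  static pthread_mutex_t ref_head = PTHREAD_MUTEX_INITIALIZER; /* delayed ref head */
  static bool hash_present = true;       /* entry still in the dedup DB */

  /* Returns true for a dedup hit, false when we fall back to normal COW. */
  static bool dedupe_search_model(void)
  {
          for (;;) {
                  if (!hash_present)     /* hash deleted by __btrfs_free_extent() */
                          return false;  /* -> proceed with normal COW */

                  if (pthread_mutex_trylock(&ref_head) == 0) {
                          /* No delayed ref processing is running for this
                           * bytenr, so it is safe to add a delayed ref. */
                          printf("add delayed data ref\n");
                          pthread_mutex_unlock(&ref_head);
                          return true;
                  }
                  /* Head is busy: drop locks and redo the search from the top. */
          }
  }

  int main(void)
  {
          printf("hit: %d\n", dedupe_search_model());
          return 0;
  }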

Thanks,
Alex.





^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-03-25  1:38       ` Qu Wenruo
@ 2016-04-04 16:55         ` David Sterba
  2016-04-05  3:08           ` Qu Wenruo
  2016-04-06  3:47           ` Nicholas D Steeves
  0 siblings, 2 replies; 62+ messages in thread
From: David Sterba @ 2016-04-04 16:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs, clm

On Fri, Mar 25, 2016 at 09:38:50AM +0800, Qu Wenruo wrote:
> > Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
> > key type. As this is the second user of that item, there's no precendent
> > how to select the subtype. Right now 0 is for the dev stats item, but
> > I'd like to leave some space between them, so it should be 256 at best.
> > The space is 64bit so there's enough room but this also means defining
> > the on-disk format.
> 
> After checking BTRFS_PERSISTENT_ITEM_KEY, it seems that its value is
> larger than the current DEDUPE_BYTENR/HASH_ITEM_KEY, and given how the
> objectid of DEDUPE_HASH_ITEM_KEY is chosen, it won't be the first item
> of the tree.
> 
> Although that's not a big problem, but for user using debug-tree, it 
> would be quite annoying to find it located among tons of other hashes.

You can alternatively store it in the tree_root, but I don't know how
frequently it's supposed to be changed.

> So personally, if using PERSISTENT_ITEM_KEY, at least I prefer to keep 
> objectid to 0, and modify DEDUPE_BYTENR/HASH_ITEM_KEY to higher value, 
> to ensure dedupe status to be the first item of dedupe tree.

0 is unfortunately taken by BTRFS_DEV_STATS_OBJECTID, but I don't see a
problem with the ordering. DEDUPE_BYTENR/HASH_ITEM_KEY store a large
number in the objectid: either part of a hash, which is unlikely to be
almost all zeros, or a bytenr, which will be larger than 1MB.

> >>>> 4) Ioctl interface with persist dedup status
> >>>
> >>> I'd like to see the ioctl specified in more detail. So far there's
> >>> enable, disable and status. I'd expect some way to control the in-memory
> >>> limits, let it "forget" current hash cache, specify the dedupe chunk
> >>> size, maybe sync of the in-memory hash cache to disk.
> >>
> >> So current and planned ioctl should be the following, with some details
> >> related to your in-memory limit control concerns.
> >>
> >> 1) Enable
> >>      Enable dedupe if it's not enabled already. (disabled -> enabled)
> >
> > Ok, so it should also take a parameter which bckend is about to be
> > enabled.
> 
> It already has.
> It also has limit_nr and limit_mem parameter for in-memory backend.
> 
> >
> >>      Or change current dedupe setting to another. (re-configure)
> >
> > Doing that in 'enable' sounds confusing, any changes belong to a
> > separate command.
> 
> This depends the aspect of view.
> 
> For "Enable/config/disable" case, it will introduce a state machine for 
> end-user.

Yes, that's exactly my point.

> Personally, I don't like a state machine for the end user. Yes, I also hate
> merging the play and pause buttons together on a music player.

I don't see how this reference is relevant; we're not designing a music player.

> If using state machine, user must ensure the dedupe is enabled before 
> doing any configuration.

For user convenience we can copy the configuration options to the dedup
enable subcommand, but it will still do separate enable and configure
ioctl calls.

> For me, user only need to care the result of the operation. User can now 
> configure dedupe to their need without need to know previous setting.
>  From this aspect of view, "Enable/Disable" is much easier than 
> "Enable/Config/Disable".

Getting the usability right is hard, and that's why we're having this
discussion. What suits you does not suit others; we have different
habits and expectations, and there are existing usage patterns. We'd better
stick to something that's not too surprising yet still flexible enough
to cover broad needs. I'm leaving this open, but I strongly disagree
with the current interface proposal.

> >>      For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
> >>      will disable dedupe(dropping all hash) and then enable with new
> >>      setting.
> >>
> >>      For in-memory backend, if only limit is different from previous
> >>      setting, limit can be changed on the fly without dropping any hash.
> >
> > This is obviously misplaced in 'enable'.
> 
> Then, changing the 'enable' to 'configure' or other proper naming would 
> be better.
> 
> The point is, user only need to care what they want to do, not previous 
> setup.
> 
> >
> >> 2) Disable
> >>      Disable will drop all hash and delete the dedupe tree if it exists.
> >>      Imply a full sync_fs().
> >
> > That is again combining too many things into one. Say I want to disable
> > deduplication and want to enable it later. And not lose the whole state
> > between that. Not to say deleting the dedup tree.
> >
> > IOW, deleting the tree belongs to a separate command, though in the
> > userspace tools it could be done in one command, but we're talking about
> > the kernel ioctls now.
> >
> > I'm not sure if the sync is required, but it's acceptable for first
> > implementation.
> 
> The design is just to to reduce complexity.
> If want to keep hash but disable dedupe, it will make dedupe only handle 
> extent remove, but ignore any new coming write.
> 
> It will introduce a new state for dedupe, other than current simple 
> enabled/disabled.
> So I just don't allow such mode.
> 
> >
> >>
> >> 3) Status
> >>      Output basic status of current dedupe.
> >>      Including running status(disabled/enabled), dedupe block size, hash
> >>      algorithm, and limit setting for in-memory backend.
> >
> > Agreed. So this is basically the settings and static info.
> >
> >> 4) (PLANNED) In-memory hash size querying
> >>      Allowing userspace to query in-memory hash structure header size.
> >>      Used for "btrfs dedupe enable" '-l' option to output warning if user
> >>      specify memory size larger than 1/4 of the total memory.
> >
> > And this reflects the run-time status. Ok.
> >
> >> 5) (PLANNED) Dedeup rate statistics
> >>      Should be handy for user to know the dedupe rate so they can further
> >>      fine tuning their dedup setup.
> >
> > Similar as above, but for a different type of data. Ok.
> >
> >> So for your "in-memory limit control", just enable it with different limit.
> >> For "dedupe block size change", just enable it with different dedupe_bs.
> >> For "forget hash", just disable it.
> >
> > I can comment once the semantics of 'enable' are split, but basically I
> > want an interface to control the deduplication cache.
> 
> So better renaming 'enable'.
> 
> Current 'enable' provides the interface to control the limit or dedupe hash.
> 
> I'm not sure further control is needed.
> 
> >
> >> And for "write in-memory hash onto disk", not planned and may never do
> >> it due to the complexity, sorry.
> >
> > I'm not asking you to do it, definetelly not for the initial
> > implementation, but sync from memory to disk is IMO something that we
> > can expect users to ask for. The percieved complexity may shift
> > implementation to the future, but we should take it into account.
> 
> OK, I'll keep it in mind.
> 
> >
> >>>> 5) Ability to disable dedup for given dirs/files
> >>>
> >>> This would be good to extend to subvolumes.
> >>
> >> I'm sorry that I didn't quite understand the difference.
> >> Doesn't dir includes subvolume?
> >
> > If I enable deduplication on the entire subvolume, it will affect all
> > subdirectories. Not the other way around.
> 
> It can be done by setting 'dedupe disable' on all other subvolumes.
> But it it's not practical yet.

That's opt-in vs opt-out; we'd need a better description of the
use case.

> So maybe introduce a new state for default dedupe behavior?
> Current dedupe enabled default behavior is to dedup unless prohibited.
> If dedupe default behavior can be don't dedupe unless allowed, then it 
> will be much easier to do.
> 
> >
> >> Or xattr for subvolume is only restored in its parent subvolume, and
> >> won't be copied for its snapshot?
> >
> > The xattrs are copied to the snapshot.
> >
> >>>> TODO:
> >>>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
> >>>>      Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
> >>>>      CPU may even be a bottleneck other than IO.
> >>>>      But for faster hash, it will definitely cause conflicts, so we need
> >>>>      extent comparison before we introduce new dedup algorithm.
> >>>
> >>> If sha256 is slow, we can use a less secure hash that's faster but will
> >>> do a full byte-to-byte comparison in case of hash collision, and
> >>> recompute sha256 when the blocks are going to disk. I haven't thought
> >>> this through, so there are possibly details that could make unfeasible.
> >>
> >> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only
> >> for both in-memory and on-disk backend. No SHA256 again.
> >
> > I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
> > murmur. As they're both order-of-magnitutes faster than sha1/md5, we can
> > actually hash both to reduce the collisions.
> 
> Don't quite like the idea to use 2 hash other than 1.
> Yes, some program like rsync uses this method, but this also involves a 
> lot of details, like the order to restore them on disk.

I'm considering fast-but-unsafe hashes for the in-memory backend, where
the speed matters and we cannot hide the slow sha256 calculations behind
the IO (ie. no point to save microseconds if the IO is going to take
milliseconds).
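
To make the idea concrete, a rough user-space sketch of such a two-level
check (FNV-1a below is only a stand-in for a fast hash like xxhash, and a
plain byte compare stands in for the expensive confirmation; none of this
is meant as the actual implementation):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Cheap filter hash -- placeholder for xxhash/murmur. */
  static uint64_t fast_hash(const void *data, size_t len)
  {
          const uint8_t *p = data;
          uint64_t h = 1469598103934665603ULL;
          while (len--)
                  h = (h ^ *p++) * 1099511628211ULL;
          return h;
  }

  struct pool_entry {
          uint64_t fast;          /* cheap hash kept in the in-memory pool */
          uint8_t  data[16];      /* stands in for the SHA-256 + bytenr    */
  };

  /* Most lookups are rejected by the cheap hash; only the rare match pays
   * for the expensive confirmation (byte compare or SHA-256). */
  static int is_duplicate(const struct pool_entry *e, const void *blk, size_t len)
  {
          if (fast_hash(blk, len) != e->fast)
                  return 0;
          return memcmp(e->data, blk, len) == 0;
  }

  int main(void)
  {
          struct pool_entry e = { .fast = fast_hash("abcd", 4) };
          memcpy(e.data, "abcd", 4);
          printf("%d %d\n", is_duplicate(&e, "abcd", 4), is_duplicate(&e, "abcx", 4));
          return 0;
  }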

> >> In that case, for MD5 hit case, we will do a full byte-to-byte
> >> comparison. It may be slow or fast, depending on the cache.
> >
> > If the probability of hash collision is low, so the number of needed
> > byte-to-byte comparisions is also low.
> 
> Considering the common use-case of dedupe, hash hit should be a common case.
> 
> In that case, each hash hit will lead to byte-to-byte comparison, which 
> will significantly impact the dedupe performance.
> 
> On the other hand, if dedupe hit rate is low, then why use dedupe?

Oh right, that would require at least 2 hashes then.

> >> But at least for MD5 miss case, it should be faster than SHA256.
> >>
> >>> The idea is to move expensive hashing to the slow IO operations and do
> >>> fast but not 100% safe hashing on the read/write side where performance
> >>> matters.
> >>
> >> Yes, although on the read side, we don't perform hash, we only do hash
> >> at write side.
> >
> > Oh, so how exactly gets the in-memory deduplication cache filled? My
> > impression was that we can pre-fill it by reading bunch of files where we
> > expect the shared data to exist.
> 
> Yes, we used to do that method aging back to the first version of 
> in-memory implementation.
> 
> But that will cause a lot of CPU usage and most of them are just wasted.

I think this depends on the data.

> Don't forget that, in common dedupe use-case, dedupe rate should be 
> high, I'll use 50% as an exmaple.
> This means, 50% of your read will be pointed to a shared extents. But 
> 100% of read will need to calculate hash, and 50% of them are already in 
> hash pool.
> So the CPU time are just wasted.

I understand the concerns, but I don't understand the example, sorry.

> > The usecase:
> >
> > Say there's a golden image for a virtual machine,
> 
> Not to nitpick, but I though VM images are not good use-case for btrfs.
> And normally user would set nodatacow for it, which will bypass dedupe.

VM on nodatacow. By bypass you mean that it cannot work together or that
it's just not going to be implemented?

> > we'll clone it and use
> > for other VM's, with minor changes. If we first read the golden image
> > with deduplication enabled, pre-fill the cache, any subsequent writes to
> > the cloned images will be compared to the cached data. The estimated hit
> > ratio is medium-to-high.
> 
> And performance is so low that most user would feel, and CPU usage will 
> be so high (up to 8 cores 100% used)that almost no spare CPU time can be 
> allocated for VM use.
> 
> >
> > And this can be extended to anything, not just VMs. Without the option
> > to fill the in-memory cache, the deduplication would seem pretty useless
> > to me. The clear benefit is lack of maintaining the persistent storage
> > of deduplication data.
> 
> I originally planned a ioctl for it to fill hash manually.
> But now I think re-write would be good enough.
> Maybe I could a pseudo 'dedupe fill' command in btrfs-progs, which will 
> just read out the data and re-write it.

Rewriting will take twice the IO and might even fail for ENOSPC
reasons; I don't see that as a viable option.

> >> And in that case, if weak hash hit, we will need to do memory
> >> comparison, which may also be slow.
> >> So the performance impact may still exist.
> >
> > Yes the performance hit is there, with statistically low probability.
> >
> >> The biggest challenge is, we need to read (decompressed) extent
> >> contents, even without an inode.
> >> (So, no address_space and all the working facilities)
> >>
> >> Considering the complexity and uncertain performance improvement, the
> >> priority of introducing weak hash is quite low so far, not to mention a
> >> lot of detail design change for it.
> >
> > I disagree.
> 
> Explained above, hash hit in dedupe use-case is common case, while we 
> must do byte-to-byte comparison in common case routine, it's hard to 
> ignore the overhead.

So this should be solved by the double hashing, pushing the probability
of byte-to-byte comparison low.

> >> A much easier and practical enhancement is, to use SHA512.
> >> As it's faster than SHA256 on modern 64bit machine for larger size.
> >> For example, for hashing 8K data, SHA512 is almost 40% faster than SHA256.
> >>
> >>>> 2) Misc end-user related helpers
> >>>>      Like handy and easy to implement dedup rate report.
> >>>>      And method to query in-memory hash size for those "non-exist" users who
> >>>>      want to use 'dedup enable -l' option but didn't ever know how much
> >>>>      RAM they have.
> >>>
> >>> That's what we should try know and define in advance, that's part of the
> >>> ioctl interface.
> >>>
> >>> I went through the patches, there are a lot of small things to fix, but
> >>> first I want to be sure about the interfaces, ie. on-disk and ioctl.
> >>
> >> I hope such small things can be pointed out, allowing me to fix them
> >> while rebasing.
> >
> > Sure, that's next after we agree on what the deduplication should
> > actually, the ioctls interefaces are settled and the on-disk format
> > changes are agreed on. The code is a good starting point, but pointing
> > out minor things at this point does not justify the time spent.
> 
> That's OK.
> 
> >>> Then we can start to merge the patchset in smaller batches, the
> >>> in-memory deduplication does not have implications on the on-disk
> >>> format, so it's "just" the ioctl part.
> >>
> >> Yes, that's my original plan, first merge simple in-memory backend into
> >> 4.5/4.6 and then adding ondisk backend into 4.7.
> >>
> >> But things turned out that, since we designed the two-backends API from
> >> the beginning, on-disk backend doesn't take much time to implement.
> >>
> >> So this makes what you see now, a big patchset with both backend
> >> implemented.
> >
> > For the discussions and review phase it's ok to see them both, but it's
> > unrealistic to expect merging in a particular version without going
> > through the review heat, especially for something like deduplication.
> >
> >
> In fact, I didn't expect dedupe to receive such heat.

Really? That surprises me :) It modifies the on-disk format, adds ioctls,
can have an impact on performance (which we haven't even measured yet), and
from the users' POV, it's been requested for a long time.

> I originally expect such dedupe to be an interesting but not so 
> practical feature, just like ZFS dedupe.
> (I can be totally wrong, please point it out if there is some well-known 
> use-case of ZFS dedupe)
> 
> I was expecting dedupe to be a good entrance to expose existing bugs, 
> and raise attention for better delayed_ref and delalloc implementation.
> 
> Since it's considered as a high-profile feature, I'm OK to slow down the 
> rush of merge and polish the interface/code further more.

Yeah, as already mentioned, for exposing the bugs we can add code but
hide the ioctls.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-04 16:55         ` David Sterba
@ 2016-04-05  3:08           ` Qu Wenruo
  2016-04-20  2:02             ` Qu Wenruo
  2016-04-06  3:47           ` Nicholas D Steeves
  1 sibling, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-04-05  3:08 UTC (permalink / raw)
  To: dsterba, linux-btrfs, clm



David Sterba wrote on 2016/04/04 18:55 +0200:
> On Fri, Mar 25, 2016 at 09:38:50AM +0800, Qu Wenruo wrote:
>>> Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
>>> key type. As this is the second user of that item, there's no precendent
>>> how to select the subtype. Right now 0 is for the dev stats item, but
>>> I'd like to leave some space between them, so it should be 256 at best.
>>> The space is 64bit so there's enough room but this also means defining
>>> the on-disk format.
>>
>> After checking BTRFS_PERSISTENT_ITEM_KEY, it seems that its value is
>> larger than the current DEDUPE_BYTENR/HASH_ITEM_KEY, and given how the
>> objectid of DEDUPE_HASH_ITEM_KEY is chosen, it won't be the first item
>> of the tree.
>>
>> Although that's not a big problem, but for user using debug-tree, it
>> would be quite annoying to find it located among tons of other hashes.
>
> You can alternatively store it in the tree_root, but I don't know how
> frquently it's supposed to be changed.

Storing it in the tree root sounds pretty good.
Since the status doesn't change until we enable/disable (including
configure), the tree root seems like a good fit.

But we still need to consider the key ordering for the later dedupe rate
statistics. In that case, I'd prefer to store both of them in the dedupe tree.

>
>> So personally, if using PERSISTENT_ITEM_KEY, at least I prefer to keep
>> objectid to 0, and modify DEDUPE_BYTENR/HASH_ITEM_KEY to higher value,
>> to ensure dedupe status to be the first item of dedupe tree.
>
> 0 is unfortnuatelly taken by BTRFS_DEV_STATS_OBJECTID, but I don't see
> problem with the ordering. DEDUPE_BYTENR/HASH_ITEM_KEY store a large
> number in the objectid, either part of a hash, that's unlikely to be
> almost-all zeros and bytenr which will be larger than 1MB.

OK, as long as we can search for the status item with an exact key match,
it shouldn't cause a big problem.
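
To make sure I follow the ordering argument: btrfs keys sort by (objectid,
type, offset), so the objectid dominates. A tiny illustration (the hash-item
type value and the objectids below are made up; only the sort rule itself is
real):

  #include <stdint.h>
  #include <stdio.h>

  struct key { uint64_t objectid; uint8_t type; uint64_t offset; };

  static int key_cmp(struct key a, struct key b)
  {
          if (a.objectid != b.objectid)
                  return a.objectid < b.objectid ? -1 : 1;
          if (a.type != b.type)
                  return a.type < b.type ? -1 : 1;
          if (a.offset != b.offset)
                  return a.offset < b.offset ? -1 : 1;
          return 0;
  }

  int main(void)
  {
          /* Status item: small objectid, PERSISTENT_ITEM-style type.   */
          struct key status = { 0, 249, 0 };
          /* Hash item: objectid taken from part of a SHA-256, so it is
           * effectively random and almost never near zero.             */
          struct key hash = { 0xdeadbeefcafeULL, 230, 0 };

          printf("status sorts %s the hash item\n",
                 key_cmp(status, hash) < 0 ? "before" : "after");
          return 0;
  }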

>
>>>>>> 4) Ioctl interface with persist dedup status
>>>>>
>>>>> I'd like to see the ioctl specified in more detail. So far there's
>>>>> enable, disable and status. I'd expect some way to control the in-memory
>>>>> limits, let it "forget" current hash cache, specify the dedupe chunk
>>>>> size, maybe sync of the in-memory hash cache to disk.
>>>>
>>>> So current and planned ioctl should be the following, with some details
>>>> related to your in-memory limit control concerns.
>>>>
>>>> 1) Enable
>>>>       Enable dedupe if it's not enabled already. (disabled -> enabled)
>>>
>>> Ok, so it should also take a parameter which bckend is about to be
>>> enabled.
>>
>> It already has.
>> It also has limit_nr and limit_mem parameter for in-memory backend.
>>
>>>
>>>>       Or change current dedupe setting to another. (re-configure)
>>>
>>> Doing that in 'enable' sounds confusing, any changes belong to a
>>> separate command.
>>
>> This depends the aspect of view.
>>
>> For "Enable/config/disable" case, it will introduce a state machine for
>> end-user.
>
> Yes, that's exacly my point.
>
>> Personally, I don't like a state machine for the end user. Yes, I also hate
>> merging the play and pause buttons together on a music player.
>
> I don't see this reference relevant, we're not designing a music player.
>
>> If using state machine, user must ensure the dedupe is enabled before
>> doing any configuration.
>
> For user convenience we can copy the configuration options to the dedup
> enable subcommand, but it will still do separate enable and configure
> ioctl calls.

So, that is to say, users who want a state machine can use the
enable-then-configure method, and other users can use the stateless
enable-enable method.

If so, I'm OK with adding a configure ioctl interface.
(It would still be the stateless enable-enable path beneath the stateful
ioctl.)

But in that case, if a user forgets to enable dedupe and calls configure
directly, btrfs won't give any warning and will just enable dedupe.

Will that design be OK for you? Or should enable and configure share most
of their ioctl code, with the configure ioctl doing an extra check?
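
For what it's worth, the second option (one shared implementation, with
configure doing an extra "already enabled" check) is roughly the sketch
below. The struct, fields and function names are illustrative only, not
the actual btrfs_ioctl_dedupe_args or patch code:

  #include <errno.h>
  #include <stdio.h>

  /* Illustrative only -- not the real ioctl argument structure. */
  struct dedupe_args_sketch {
          unsigned long long blocksize;
          unsigned long long limit_nr;
  };

  static int dedupe_enabled;       /* stands in for the on-disk status item */

  static int dedupe_reconfigure(const struct dedupe_args_sketch *args,
                                int must_be_enabled)
  {
          if (must_be_enabled && !dedupe_enabled)
                  return -EINVAL;  /* configure called while disabled */
          dedupe_enabled = 1;      /* apply (or re-apply) the settings */
          printf("blocksize=%llu limit_nr=%llu\n",
                 args->blocksize, args->limit_nr);
          return 0;
  }

  /* 'enable' stays stateless: it always (re)applies the configuration. */
  static int ioctl_dedupe_enable(const struct dedupe_args_sketch *a)
  {
          return dedupe_reconfigure(a, 0);
  }

  /* 'configure' is stateful: it refuses to run while dedupe is disabled. */
  static int ioctl_dedupe_configure(const struct dedupe_args_sketch *a)
  {
          return dedupe_reconfigure(a, 1);
  }

  int main(void)
  {
          struct dedupe_args_sketch a = { 128 * 1024, 32768 };
          printf("configure first: %d\n", ioctl_dedupe_configure(&a));
          printf("enable:          %d\n", ioctl_dedupe_enable(&a));
          printf("configure:       %d\n", ioctl_dedupe_configure(&a));
          return 0;
  }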


>
>> For me, user only need to care the result of the operation. User can now
>> configure dedupe to their need without need to know previous setting.
>>   From this aspect of view, "Enable/Disable" is much easier than
>> "Enable/Config/Disable".
>
> Getting the usability is hard and that's why we're having this
> dicussion. What suites you does not suite others, we have different
> habits, expectations and there are existing usage patterns. We better
> stick to something that's not too surprising yet still flexible enough
> to cover broad needs. I'm leaving this open, but I strongly disagree
> with the current interface proposal.

I'm still open to a new ioctl interface design, as long as we can reuse
most of the current code.

Anyway, just as you pointed out, the stateless one is just my personal taste.

>
>>>>       For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
>>>>       will disable dedupe(dropping all hash) and then enable with new
>>>>       setting.
>>>>
>>>>       For in-memory backend, if only limit is different from previous
>>>>       setting, limit can be changed on the fly without dropping any hash.
>>>
>>> This is obviously misplaced in 'enable'.
>>
>> Then, changing the 'enable' to 'configure' or other proper naming would
>> be better.
>>
>> The point is, user only need to care what they want to do, not previous
>> setup.
>>
>>>
>>>> 2) Disable
>>>>       Disable will drop all hash and delete the dedupe tree if it exists.
>>>>       Imply a full sync_fs().
>>>
>>> That is again combining too many things into one. Say I want to disable
>>> deduplication and want to enable it later. And not lose the whole state
>>> between that. Not to say deleting the dedup tree.
>>>
>>> IOW, deleting the tree belongs to a separate command, though in the
>>> userspace tools it could be done in one command, but we're talking about
>>> the kernel ioctls now.
>>>
>>> I'm not sure if the sync is required, but it's acceptable for first
>>> implementation.
>>
>> The design is just to to reduce complexity.
>> If want to keep hash but disable dedupe, it will make dedupe only handle
>> extent remove, but ignore any new coming write.
>>
>> It will introduce a new state for dedupe, other than current simple
>> enabled/disabled.
>> So I just don't allow such mode.
>>
>>>
>>>>
>>>> 3) Status
>>>>       Output basic status of current dedupe.
>>>>       Including running status(disabled/enabled), dedupe block size, hash
>>>>       algorithm, and limit setting for in-memory backend.
>>>
>>> Agreed. So this is basically the settings and static info.
>>>
>>>> 4) (PLANNED) In-memory hash size querying
>>>>       Allowing userspace to query in-memory hash structure header size.
>>>>       Used for "btrfs dedupe enable" '-l' option to output warning if user
>>>>       specify memory size larger than 1/4 of the total memory.
>>>
>>> And this reflects the run-time status. Ok.
>>>
>>>> 5) (PLANNED) Dedeup rate statistics
>>>>       Should be handy for user to know the dedupe rate so they can further
>>>>       fine tuning their dedup setup.
>>>
>>> Similar as above, but for a different type of data. Ok.
>>>
>>>> So for your "in-memory limit control", just enable it with different limit.
>>>> For "dedupe block size change", just enable it with different dedupe_bs.
>>>> For "forget hash", just disable it.
>>>
>>> I can comment once the semantics of 'enable' are split, but basically I
>>> want an interface to control the deduplication cache.
>>
>> So better renaming 'enable'.
>>
>> Current 'enable' provides the interface to control the limit or dedupe hash.
>>
>> I'm not sure further control is needed.
>>
>>>
>>>> And for "write in-memory hash onto disk", not planned and may never do
>>>> it due to the complexity, sorry.
>>>
>>> I'm not asking you to do it, definetelly not for the initial
>>> implementation, but sync from memory to disk is IMO something that we
>>> can expect users to ask for. The percieved complexity may shift
>>> implementation to the future, but we should take it into account.
>>
>> OK, I'll keep it in mind.
>>
>>>
>>>>>> 5) Ability to disable dedup for given dirs/files
>>>>>
>>>>> This would be good to extend to subvolumes.
>>>>
>>>> I'm sorry that I didn't quite understand the difference.
>>>> Doesn't dir includes subvolume?
>>>
>>> If I enable deduplication on the entire subvolume, it will affect all
>>> subdirectories. Not the other way around.
>>
>> It can be done by setting 'dedupe disable' on all other subvolumes.
>> But it it's not practical yet.
>
> Thtat's opt-in vs opt-out, we'd need a better description of the
> usecase.

Then we're back to the default dedupe behavior idea:
a filesystem-wide default of either dedupe or no dedupe,
plus a per-inode dedupe enable/disable flag.


>
>> So maybe introduce a new state for default dedupe behavior?
>> Current dedupe enabled default behavior is to dedup unless prohibited.
>> If dedupe default behavior can be don't dedupe unless allowed, then it
>> will be much easier to do.
>>
>>>
>>>> Or xattr for subvolume is only restored in its parent subvolume, and
>>>> won't be copied for its snapshot?
>>>
>>> The xattrs are copied to the snapshot.
>>>
>>>>>> TODO:
>>>>>> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>>>>>>       Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>>>>>       CPU may even be a bottleneck other than IO.
>>>>>>       But for faster hash, it will definitely cause conflicts, so we need
>>>>>>       extent comparison before we introduce new dedup algorithm.
>>>>>
>>>>> If sha256 is slow, we can use a less secure hash that's faster but will
>>>>> do a full byte-to-byte comparison in case of hash collision, and
>>>>> recompute sha256 when the blocks are going to disk. I haven't thought
>>>>> this through, so there are possibly details that could make unfeasible.
>>>>
>>>> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only
>>>> for both in-memory and on-disk backend. No SHA256 again.
>>>
>>> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>>> murmur. As they're both order-of-magnitutes faster than sha1/md5, we can
>>> actually hash both to reduce the collisions.
>>
>> Don't quite like the idea to use 2 hash other than 1.
>> Yes, some program like rsync uses this method, but this also involves a
>> lot of details, like the order to restore them on disk.
>
> I'm considering fast-but-unsafe hashes for the in-memory backend, where
> the speed matters and we cannot hide the slow sha256 calculations behind
> the IO (ie. no point to save microseconds if the IO is going to take
> milliseconds).

If it's only for the in-memory backend, I'm OK with that.

The in-memory backend is much like an experimental field for new ideas, as
it won't affect the on-disk format at all.

But the problem is still there.
For the fast-hash hit case, we still need to calculate the slow SHA256 to
ensure it's a real hit.
That's OK and expected.

But for the fast-hash miss case, nothing really changes.
As long as we need to add the hash for the extent into the hash pool, we
still need to calculate the slow SHA256.

We can't insert only the fast hash, or on the next fast-hash hit against
this extent we would have no slow hash to ensure consistency.
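
A toy model of that argument (both hash functions below are trivial
stand-ins, not xxhash or SHA256, and the pool is a plain array): the strong
hash is computed on every write, hit or miss, because the pool entry has to
carry it.

  #include <stdbool.h>
  #include <stdio.h>

  #define POOL_SIZE 256

  static unsigned fast_hash(unsigned block)   { return block & 0xffu; }
  static unsigned strong_hash(unsigned block) { return block * 2654435761u; }

  static unsigned pool_strong[POOL_SIZE];
  static bool     pool_used[POOL_SIZE];
  static unsigned long strong_hash_calls;

  static void write_block(unsigned block)
  {
          unsigned slot = fast_hash(block) % POOL_SIZE;
          unsigned strong;

          /* Needed on every path: to confirm a fast-hash hit, and to fill
           * the pool entry on a miss so later hits can be confirmed. */
          strong = strong_hash(block);
          strong_hash_calls++;

          if (pool_used[slot] && pool_strong[slot] == strong)
                  return;                  /* dedupe hit: reuse the extent */

          /* Miss (or fast-hash collision): normal COW, then insert hash. */
          pool_used[slot] = true;
          pool_strong[slot] = strong;
  }

  int main(void)
  {
          for (unsigned i = 0; i < 1000; i++)
                  write_block(i % 100);    /* stream with lots of duplicates */
          printf("strong hash computed %lu times for 1000 writes\n",
                 strong_hash_calls);
          return 0;
  }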

>
>>>> In that case, for MD5 hit case, we will do a full byte-to-byte
>>>> comparison. It may be slow or fast, depending on the cache.
>>>
>>> If the probability of hash collision is low, so the number of needed
>>> byte-to-byte comparisions is also low.
>>
>> Considering the common use-case of dedupe, hash hit should be a common case.
>>
>> In that case, each hash hit will lead to byte-to-byte comparison, which
>> will significantly impact the dedupe performance.
>>
>> On the other hand, if dedupe hit rate is low, then why use dedupe?
>
> Oh right, that would require at least 2 hashes then.
>
>>>> But at least for MD5 miss case, it should be faster than SHA256.
>>>>
>>>>> The idea is to move expensive hashing to the slow IO operations and do
>>>>> fast but not 100% safe hashing on the read/write side where performance
>>>>> matters.
>>>>
>>>> Yes, although on the read side, we don't perform hash, we only do hash
>>>> at write side.
>>>
>>> Oh, so how exactly gets the in-memory deduplication cache filled? My
>>> impression was that we can pre-fill it by reading bunch of files where we
>>> expect the shared data to exist.
>>
>> Yes, we used to do that method aging back to the first version of
>> in-memory implementation.
>>
>> But that will cause a lot of CPU usage and most of them are just wasted.
>
> I think this depends on the data.
>
>> Don't forget that, in common dedupe use-case, dedupe rate should be
>> high, I'll use 50% as an exmaple.
>> This means, 50% of your read will be pointed to a shared extents. But
>> 100% of read will need to calculate hash, and 50% of them are already in
>> hash pool.
>> So the CPU time are just wasted.
>
> I understand the concerns, but I don't understand the example, sorry.

My poor English again.

Take 4 file extents as an example:
File Ext A: points to extent X
File Ext B: points to extent X
File Ext C: points to extent Y
File Ext D: points to extent Y

They are all read at the same time, so we calculate a hash for all 4 of
them at read time.

But at dedupe hash insert time, only the hashes for extents X and Y are
inserted.

We hashed 4 times, but only inserted 2 hashes into the dedupe hash pool.

>
>>> The usecase:
>>>
>>> Say there's a golden image for a virtual machine,
>>
>> Not to nitpick, but I though VM images are not good use-case for btrfs.
>> And normally user would set nodatacow for it, which will bypass dedupe.
>
> VM on nodatacow. By bypass you mean that it cannot work together or that
> it's just not going to be implemented?

They can't work together, so a nodatacow file won't go through the
dedupe routine.
Dedupe needs datacow to ensure its hashed data won't change.

If an extent were rewritten while its dedupe hash stayed the same, the
next hash hit would cause corruption.

And for an already deduped (shared) extent, data cow will always happen
no matter the mount option or file flag.
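
So on the write path this ends up as a simple early check, conceptually
something like the sketch below (the flag value is a stand-in for the real
BTRFS_INODE_NODATACOW bit, and the helper is illustrative, not the patch
code):

  #include <stdbool.h>
  #include <stdio.h>

  #define SKETCH_INODE_NODATACOW 0x2   /* stand-in for BTRFS_INODE_NODATACOW */

  /* Dedupe only makes sense for datacow writes: nodatacow data can be
   * rewritten in place, which would silently invalidate a stored hash. */
  static bool inode_can_dedupe(unsigned int inode_flags, bool dedupe_enabled)
  {
          if (!dedupe_enabled)
                  return false;
          if (inode_flags & SKETCH_INODE_NODATACOW)
                  return false;
          return true;
  }

  int main(void)
  {
          printf("%d %d\n",
                 inode_can_dedupe(0, true),
                 inode_can_dedupe(SKETCH_INODE_NODATACOW, true));
          return 0;
  }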

>
>>> we'll clone it and use
>>> for other VM's, with minor changes. If we first read the golden image
>>> with deduplication enabled, pre-fill the cache, any subsequent writes to
>>> the cloned images will be compared to the cached data. The estimated hit
>>> ratio is medium-to-high.
>>
>> And performance is so low that most user would feel, and CPU usage will
>> be so high (up to 8 cores 100% used)that almost no spare CPU time can be
>> allocated for VM use.
>>
>>>
>>> And this can be extended to anything, not just VMs. Without the option
>>> to fill the in-memory cache, the deduplication would seem pretty useless
>>> to me. The clear benefit is lack of maintaining the persistent storage
>>> of deduplication data.
>>
>> I originally planned a ioctl for it to fill hash manually.
>> But now I think re-write would be good enough.
>> Maybe I could a pseudo 'dedupe fill' command in btrfs-progs, which will
>> just read out the data and re-write it.
>
> Rewriting will take twice the IO and might even fail due to enospc
> reasons, I don't see that as a viable option.

Then we still need another ioctl to re-fill the hash.

Added to the ioctl TO-DO list.

>
>>>> And in that case, if weak hash hit, we will need to do memory
>>>> comparison, which may also be slow.
>>>> So the performance impact may still exist.
>>>
>>> Yes the performance hit is there, with statistically low probability.
>>>
>>>> The biggest challenge is, we need to read (decompressed) extent
>>>> contents, even without an inode.
>>>> (So, no address_space and all the working facilities)
>>>>
>>>> Considering the complexity and uncertain performance improvement, the
>>>> priority of introducing weak hash is quite low so far, not to mention a
>>>> lot of detail design change for it.
>>>
>>> I disagree.
>>
>> Explained above, hash hit in dedupe use-case is common case, while we
>> must do byte-to-byte comparison in common case routine, it's hard to
>> ignore the overhead.
>
> So this should be solved by the double hashing, pushing the probability
> of byte-to-byte comparision low.

As long as we are going to add the hash into the hash pool, we need both
of the two hashes (explained above).

So nothing changes.

That's the difference from rsync, which doesn't need to add a hash into
its pool. It only needs to make sure the data is identical.

A fast-hash-only case will only happen in a priority-based dedupe use case.

In a simple priority-based setup, only specified (high priority) files can
populate the hash pool.
Other (low priority) files can only be deduped against the high priority
files' extents, but never populate the hash pool.

In that case, a fast hash will help, as low priority files go through the
fast hash calculation and only on a fast-hash hit do they go through the
slow hash, saving some time.
(But still, in a high dedupe-rate case, the savings are small.)

>
>>>> A much easier and practical enhancement is, to use SHA512.
>>>> As it's faster than SHA256 on modern 64bit machine for larger size.
>>>> For example, for hashing 8K data, SHA512 is almost 40% faster than SHA256.
>>>>
>>>>>> 2) Misc end-user related helpers
>>>>>>       Like handy and easy to implement dedup rate report.
>>>>>>       And method to query in-memory hash size for those "non-exist" users who
>>>>>>       want to use 'dedup enable -l' option but didn't ever know how much
>>>>>>       RAM they have.
>>>>>
>>>>> That's what we should try know and define in advance, that's part of the
>>>>> ioctl interface.
>>>>>
>>>>> I went through the patches, there are a lot of small things to fix, but
>>>>> first I want to be sure about the interfaces, ie. on-disk and ioctl.
>>>>
>>>> I hope such small things can be pointed out, allowing me to fix them
>>>> while rebasing.
>>>
>>> Sure, that's next after we agree on what the deduplication should
>>> actually, the ioctls interefaces are settled and the on-disk format
>>> changes are agreed on. The code is a good starting point, but pointing
>>> out minor things at this point does not justify the time spent.
>>
>> That's OK.
>>
>>>>> Then we can start to merge the patchset in smaller batches, the
>>>>> in-memory deduplication does not have implications on the on-disk
>>>>> format, so it's "just" the ioctl part.
>>>>
>>>> Yes, that's my original plan, first merge simple in-memory backend into
>>>> 4.5/4.6 and then adding ondisk backend into 4.7.
>>>>
>>>> But things turned out that, since we designed the two-backends API from
>>>> the beginning, on-disk backend doesn't take much time to implement.
>>>>
>>>> So this makes what you see now, a big patchset with both backend
>>>> implemented.
>>>
>>> For the discussions and review phase it's ok to see them both, but it's
>>> unrealistic to expect merging in a particular version without going
>>> through the review heat, especially for something like deduplication.
>>>
>>>
>> In fact, I didn't expect dedupe to receive such heat.
>
> Really? That surprises me :) It modifies on-disk format, adds ioctls,
> can have impact on performacne (that we even haven't measured yet), and
> from the users POV, it's been requested for a long time.

Don't forget I was pushing for the in-memory only backend at first.

The main reason for the in-memory only backend is that we can try 
whatever we want without affecting the on-disk format.
The very first vision of dedupe was to be a cool but not that useful feature.

But that's the past.
Now that we have an on-disk format change, new and expanding ioctls, and 
slow performance (though still a little better than compression for the 
all-miss case), it will naturally receive much heat.

Thanks,
Qu
>
>> I originally expect such dedupe to be an interesting but not so
>> practical feature, just like ZFS dedupe.
>> (I can be totally wrong, please point it out if there is some well-known
>> use-case of ZFS dedupe)
>>
>> I was expecting dedupe to be a good entrance to expose existing bugs,
>> and raise attention for better delayed_ref and delalloc implementation.
>>
>> Since it's considered as a high-profile feature, I'm OK to slow down the
>> rush of merge and polish the interface/code further more.
>
> Yeah, as already mentioned, for exposing the bugs we can add code but
> hide the ioctls.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-03  8:22     ` Alex Lyakas
@ 2016-04-05  3:51       ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-04-05  3:51 UTC (permalink / raw)
  To: Alex Lyakas, Xiaoguang Wang; +Cc: linux-btrfs



Alex Lyakas wrote on 2016/04/03 10:22 +0200:
> Hello Qu, Wang,
>
> On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>>
>>> Greetings Qu Wenruo,
>>>
>>> I have reviewed the dedup patchset found in the github account you
>>> mentioned. I have several questions. Please note that by all means I
>>> am not criticizing your design or code. I just want to make sure that
>>> my understanding of the code is proper.
>>
>>
>> It's OK to criticize the design or code, and that's how review works.
>>
>>>
>>> 1) You mentioned in several emails that at some point byte-to-byte
>>> comparison is to be performed. However, I do not see this in the code.
>>> It seems that generic_search() only looks for the hash value match. If
>>> there is a match, it goes ahead and adds a delayed ref.
>>
>>
>> I mentioned byte-to-byte comparison as, "not to be implemented in any time
>> soon".
>>
>> Considering the lack of facility to read out extent contents without any
>> inode structure, it's not going to be done in any time soon.
>>
>>>
>>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>>> mutex and proceed with the normal COW. What happens if there are
>>> several IO streams to different files writing an identical block, but
>>> we don't have such block in our dedup DB? Then all
>>> btrfs_dedupe_search() calls will not find a match, so all streams will
>>> allocate space for their block (which are all identical). At some
>>> point, they will call insert_reserved_file_extent() and will call
>>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>>> will insert the dedup hash entries into the DB, and all other streams
>>> will find that such hash entry already exists. So the end result is
>>> that we have the hash entry in the DB, but still we have multiple
>>> copies of the same block allocated, due to timing issues. Is this
>>> correct?
>>
>>
>> That's right, and that's also unavoidable for the hash initializing stage.
>>
>>>
>>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>>> generic_search() wants to add a delayed ref to an existing extent,
>>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>>> DB. The race is resolved as follows:
>>> - generic_search attempts to lock the delayed ref head
>>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>>> right now. So we can add a delayed ref. Later, when delayed ref head
>>> will be run, it will figure out what needs to be done (free the extent
>>> or not)
>>> - if we fail to lock, then there is a delayed ref processing for this
>>> bytenr. We drop all locks and redo the search from the top. If
>>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>>> not find it, and proceed with normal COW.
>>> Is my understanding correct?
>>
>>
>> Yes that's correct.
>
> Reviewing the code again, it seems that I still lack understanding.
> What is special about the dedup code adding a delayed data ref versus
> other places doing that? In other places, we do not insist on locking
> the delayed ref head, but in dedup we do. For example,
> __btrfs_drop_extents calls btrfs_inc_extent_ref, without locking the
> ref head. I know that one of your purposes was to draw attention to
> delayed ref processing, so you have succeeded.

In the patchset, the delayed_ref related part is not only there to draw 
attention, it resolves real problems.

For example, there is a case where an extent has a ref in the extent 
tree while it is about to be freed, which means there is a DROP ref in 
delayed_refs:

For extent A:
Extent tree                | Delayed refs
1                          | -1 (Drop ref)

We call dedupe_del() only at __btrfs_free_extent() time, which means 
that until the delayed refs are run, we still have the hash for extent A.

If we don't lock the delayed_ref_head, the following case may happen:

Dedupe routine               | run_delayed_refs()
dedupe_search()              |
  |- found hash              |
  |                          | btrfs_delayed_ref_lock()
  |                          | |- run_one_delayed_ref()
  |                          | |  |- __btrfs_free_extent()
  |                          | btrfs_delayed_ref_unlock()
  |- btrfs_inc_extent_ref()  |

In that case, we would increase the ref count of an extent that no 
longer exists.
The next run_delayed_refs() then returns -ENOENT and aborts the 
transaction.
We have hit that problem several times in our tests.

If we lock the delayed ref head, we ensure the delayed refs of that 
extent won't be run concurrently: we increase the extent ref either 
before run_one_delayed_ref() or after it.

If we run before the delayed ref of that extent is processed, 
btrfs_inc_extent_ref() happens first, __btrfs_free_extent() is never 
reached, and the extent stays allocated.

If we run after the delayed ref, the hash is already gone, so we get a 
hash miss and simply continue writing the data to disk.


If we can't find a delayed_ref_head at all, there are no delayed refs 
for that data extent yet.
In that case we insert the delayed_data_ref directly while holding 
delayed_refs->lock, to avoid any possible concurrency.


BTW, other callers don't need to keep any data in sync with the extent 
tree or delayed refs.
Currently only dedupe has asynchronous hash and extent add/remove, and 
needs such an ugly hack.
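
To put that rule in code form, a rough pseudo-C sketch (not the actual 
generic_search() code; the helper names are made up, only the 
trylock/retry flow reflects the patchset) looks like this:

/* Pseudo-C sketch of the lock/retry rule; helper names are invented. */
static int dedupe_try_reuse_extent(struct dedupe_info *info, const u8 *hash_val)
{
	struct dedupe_hash *hash;
	struct ref_head *head;

again:
	hash = dedupe_hash_lookup(info, hash_val);
	if (!hash)
		return 0;		/* miss: fall back to normal CoW */

	head = find_ref_head_for(hash->bytenr);
	if (head && !try_lock_ref_head(head)) {
		/*
		 * Delayed refs for this bytenr are being run and may free
		 * the extent. Drop everything and redo the search; if the
		 * hash gets deleted meanwhile we simply miss next time.
		 */
		drop_locks_and_refs(info, hash);
		goto again;
	}

	if (!head) {
		/*
		 * No ref head yet: insert the delayed data ref directly
		 * while holding delayed_refs->lock, so nothing can run a
		 * delayed ref for this extent underneath us.
		 */
		add_delayed_data_ref_locked(hash->bytenr);
	} else {
		add_delayed_data_ref(hash->bytenr);	/* extent can't be freed now */
		unlock_ref_head(head);
	}
	return 1;			/* hit: reuse hash->bytenr */
}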

Thanks,
Qu


>
> Thanks,
> Alex.
>
>
>
>
>>
>>>
>>> I have also few nitpicks on the code, will reply to relevant patches.
>>
>>
>> Feel free to comment.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thanks for doing this work,
>>> Alex.
>>>
>>>
>>>
>>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>> This patchset can be fetched from github:
>>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>>
>>>> This updated version of inband de-duplication has the following features:
>>>> 1) ONE unified dedup framework.
>>>>      Most of its code is hidden quietly in dedup.c and export the minimal
>>>>      interfaces for its caller.
>>>>      Reviewer and further developer would benefit from the unified
>>>>      framework.
>>>>
>>>> 2) TWO different back-end with different trade-off
>>>>      One is the improved version of previous Fujitsu in-memory only dedup.
>>>>      The other one is enhanced dedup implementation from Liu Bo.
>>>>      Changed its tree structure to handle bytenr -> hash search for
>>>>      deleting hash, without the hideous data backref hack.
>>>>
>>>> 3) Support compression with dedupe
>>>>      Now dedupe can work with compression.
>>>>      Means that, a dedupe miss case can be compressed, and dedupe hit case
>>>>      can also reuse compressed file extents.
>>>>
>>>> 4) Ioctl interface with persist dedup status
>>>>      Advised by David, now we use ioctl to enable/disable dedup.
>>>>
>>>>      And we now have dedup status, recorded in the first item of dedup
>>>>      tree.
>>>>      Just like quota, once enabled, no extra ioctl is needed for next
>>>>      mount.
>>>>
>>>> 5) Ability to disable dedup for given dirs/files
>>>>      It works just like the compression prop method, by adding a new
>>>>      xattr.
>>>>
>>>> TODO:
>>>> 1) Add extent-by-extent comparison for faster but more conflicting
>>>> algorithm
>>>>      Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>>>>      CPU may even be a bottleneck other than IO.
>>>>      But for faster hash, it will definitely cause conflicts, so we need
>>>>      extent comparison before we introduce new dedup algorithm.
>>>>
>>>> 2) Misc end-user related helpers
>>>>      Like handy and easy to implement dedup rate report.
>>>>      And method to query in-memory hash size for those "non-exist" users
>>>> who
>>>>      want to use 'dedup enable -l' option but didn't ever know how much
>>>>      RAM they have.
>>>>
>>>> Changelog:
>>>> v2:
>>>>     Totally reworked to handle multiple backends
>>>> v3:
>>>>     Fix a stupid but deadly on-disk backend bug
>>>>     Add handle for multiple hash on same bytenr corner case to fix abort
>>>>     trans error
>>>>     Increase dedup rate by enhancing delayed ref handler for both backend.
>>>>     Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>>>>     Increase dedup block size up limit to 8M.
>>>> v4:
>>>>     Add dedup prop for disabling dedup for given files/dirs.
>>>>     Merge inmem_search() and ondisk_search() into generic_search() to save
>>>>     some code
>>>>     Fix another delayed_ref related bug.
>>>>     Use the same mutex for both inmem and ondisk backend.
>>>>     Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
>>>>     rate.
>>>> v5:
>>>>     Reuse compress routine for much simpler dedup function.
>>>>     Slightly improved performance due to above modification.
>>>>     Fix race between dedup enable/disable
>>>>     Fix for false ENOSPC report
>>>> v6:
>>>>     Further enable/disable race window fix.
>>>>     Minor format change according to checkpatch.
>>>> v7:
>>>>     Fix one concurrency bug with balance.
>>>>     Slightly modify return value from -EINVAL to -EOPNOTSUPP for
>>>>     btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
>>>>     and wrong parameter.
>>>>     Rebased to integration-4.6.
>>>> v8:
>>>>     Rename 'dedup' to 'dedupe'.
>>>>     Add support to allow dedupe and compression work at the same time.
>>>>     Fix several balance related bugs. Special thanks to Satoru Takeuchi,
>>>>     who exposed most of them.
>>>>     Small dedupe hit case performance improvement.
>>>>
>>>> Qu Wenruo (12):
>>>>     btrfs: delayed-ref: Add support for increasing data ref under spinlock
>>>>     btrfs: dedupe: Inband in-memory only de-duplication implement
>>>>     btrfs: dedupe: Add basic tree structure for on-disk dedupe method
>>>>     btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
>>>>     btrfs: dedupe: Add support for on-disk hash search
>>>>     btrfs: dedupe: Add support to delete hash for on-disk backend
>>>>     btrfs: dedupe: Add support for adding hash for on-disk backend
>>>>     btrfs: Fix a memory leak in inband dedupe hash
>>>>     btrfs: dedupe: Fix metadata balance error when dedupe is enabled
>>>>     btrfs: dedupe: Preparation for compress-dedupe co-work
>>>>     btrfs: relocation: Enhance error handling to avoid BUG_ON
>>>>     btrfs: dedupe: Fix a space cache delalloc bytes underflow bug
>>>>
>>>> Wang Xiaoguang (15):
>>>>     btrfs: dedupe: Introduce dedupe framework and its header
>>>>     btrfs: dedupe: Introduce function to initialize dedupe info
>>>>     btrfs: dedupe: Introduce function to add hash into in-memory tree
>>>>     btrfs: dedupe: Introduce function to remove hash from in-memory tree
>>>>     btrfs: dedupe: Introduce function to search for an existing hash
>>>>     btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
>>>>     btrfs: ordered-extent: Add support for dedupe
>>>>     btrfs: dedupe: Add ioctl for inband dedupelication
>>>>     btrfs: dedupe: add an inode nodedupe flag
>>>>     btrfs: dedupe: add a property handler for online dedupe
>>>>     btrfs: dedupe: add per-file online dedupe control
>>>>     btrfs: try more times to alloc metadata reserve space
>>>>     btrfs: dedupe: Fix a bug when running inband dedupe with balance
>>>>     btrfs: dedupe: Avoid submit IO for hash hit extent
>>>>     btrfs: dedupe: Add support for compression and dedpue
>>>>
>>>>    fs/btrfs/Makefile            |    2 +-
>>>>    fs/btrfs/ctree.h             |   78 ++-
>>>>    fs/btrfs/dedupe.c            | 1188
>>>> ++++++++++++++++++++++++++++++++++++++++++
>>>>    fs/btrfs/dedupe.h            |  181 +++++++
>>>>    fs/btrfs/delayed-ref.c       |   30 +-
>>>>    fs/btrfs/delayed-ref.h       |    8 +
>>>>    fs/btrfs/disk-io.c           |   28 +-
>>>>    fs/btrfs/disk-io.h           |    1 +
>>>>    fs/btrfs/extent-tree.c       |   49 +-
>>>>    fs/btrfs/inode.c             |  338 ++++++++++--
>>>>    fs/btrfs/ioctl.c             |   70 ++-
>>>>    fs/btrfs/ordered-data.c      |   49 +-
>>>>    fs/btrfs/ordered-data.h      |   16 +-
>>>>    fs/btrfs/props.c             |   41 ++
>>>>    fs/btrfs/relocation.c        |   41 +-
>>>>    fs/btrfs/sysfs.c             |    2 +
>>>>    include/trace/events/btrfs.h |    3 +-
>>>>    include/uapi/linux/btrfs.h   |   25 +-
>>>>    18 files changed, 2073 insertions(+), 77 deletions(-)
>>>>    create mode 100644 fs/btrfs/dedupe.c
>>>>    create mode 100644 fs/btrfs/dedupe.h
>>>>
>>>> --
>>>> 2.7.3
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-04 16:55         ` David Sterba
  2016-04-05  3:08           ` Qu Wenruo
@ 2016-04-06  3:47           ` Nicholas D Steeves
  2016-04-06  5:22             ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: Nicholas D Steeves @ 2016-04-06  3:47 UTC (permalink / raw)
  To: Btrfs BTRFS

On 4 April 2016 at 12:55, David Sterba <dsterba@suse.cz> wrote:
>> >> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only
>> >> for both in-memory and on-disk backend. No SHA256 again.
>> >
>> > I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>> > murmur. As they're both order-of-magnitutes faster than sha1/md5, we can
>> > actually hash both to reduce the collisions.
>>
>> Don't quite like the idea to use 2 hash other than 1.
>> Yes, some program like rsync uses this method, but this also involves a
>> lot of details, like the order to restore them on disk.
>
> I'm considering fast-but-unsafe hashes for the in-memory backend, where
> the speed matters and we cannot hide the slow sha256 calculations behind
> the IO (ie. no point to save microseconds if the IO is going to take
> milliseconds).
>
>> >> In that case, for MD5 hit case, we will do a full byte-to-byte
>> >> comparison. It may be slow or fast, depending on the cache.
>> >
>> > If the probability of hash collision is low, so the number of needed
>> > byte-to-byte comparisions is also low.

It is unlikely that I will use dedupe, but I imagine your work will
apply to the following wishlist:

1. Allow disabling of memory-backend hash via a kernel argument,
sysctl, or mount option for those of us who have ECC RAM.
    * page_cache never gets pushed to swap, so this should be safe, no?
2. Implementing an intelligent cache so that it's possible to offset
the cost of hashing the most actively read data.  I'm guessing there's
already some sort of weighted cache eviction algorithm in place, but I
don't yet know how to look into it, let alone enough to leverage it...
    * on the topic of leaning on the cache, I've been thinking about
ways to optimize reads, while minimizing seeks on multi-spindle raid1
btrfs volumes.  I'm guessing that someone will commit a solution
before I manage to teach myself enough about filesystems to contribute
something useful.

That's it, in terms of features I want ;-)

It's probably a well-known fact, but sha512 is roughly 40 to 50%
faster than sha256, and 40 to 50% slower than sha1 on my 1200-series
Xeon v3 (Haswell), for 8192-byte blocks.

Wish I could do more right now!
Nicholas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-06  3:47           ` Nicholas D Steeves
@ 2016-04-06  5:22             ` Qu Wenruo
  2016-04-22 22:14               ` Nicholas D Steeves
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-04-06  5:22 UTC (permalink / raw)
  To: Nicholas D Steeves, Btrfs BTRFS



Nicholas D Steeves wrote on 2016/04/05 23:47 -0400:
> On 4 April 2016 at 12:55, David Sterba <dsterba@suse.cz> wrote:
>>>>> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5 only
>>>>> for both in-memory and on-disk backend. No SHA256 again.
>>>>
>>>> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>>>> murmur. As they're both order-of-magnitutes faster than sha1/md5, we can
>>>> actually hash both to reduce the collisions.
>>>
>>> Don't quite like the idea to use 2 hash other than 1.
>>> Yes, some program like rsync uses this method, but this also involves a
>>> lot of details, like the order to restore them on disk.
>>
>> I'm considering fast-but-unsafe hashes for the in-memory backend, where
>> the speed matters and we cannot hide the slow sha256 calculations behind
>> the IO (ie. no point to save microseconds if the IO is going to take
>> milliseconds).
>>
>>>>> In that case, for MD5 hit case, we will do a full byte-to-byte
>>>>> comparison. It may be slow or fast, depending on the cache.
>>>>
>>>> If the probability of hash collision is low, so the number of needed
>>>> byte-to-byte comparisions is also low.
>
> It is unlikely that I will use dedupe, but I imagine your work will
> apply to the following wishlist:
>
> 1. Allow disabling of memory-backend hash via a kernel argument,
> sysctl, or mount option for those of us who have ECC RAM.
>      * page_cache never gets pushed to swap, so this should be safe, no?

Why not use the current ioctl to disable dedupe?

And why is it related to ECC RAM? To avoid memory corruption that would 
eventually lead to file corruption?
If so, it makes sense.

Also, I didn't get the point about page_cache.
For the hash pool we don't use the page cache; we just use kmalloc, 
which won't be swapped out.
The file page cache is not affected at all.
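
For reference, the in-memory entries are plain slab allocations roughly 
like the sketch below (field names are illustrative, the real structure 
in dedupe.c differs in detail), so they live in kernel memory that is 
never swapped, independent of the page cache:

/* Illustrative only; not the exact structure from dedupe.c. */
struct inmem_hash_entry {
	struct rb_node hash_node;	/* indexed by hash for lookups    */
	struct rb_node bytenr_node;	/* indexed by bytenr for deletion */
	struct list_head lru;		/* last-recent-use eviction list  */
	u64 bytenr;
	u32 num_bytes;
	u8 hash[32];			/* SHA-256 of the extent data     */
};

static struct inmem_hash_entry *alloc_hash_entry(void)
{
	/* kmalloc'ed with GFP_NOFS: not page cache, never swapped out. */
	return kmalloc(sizeof(struct inmem_hash_entry), GFP_NOFS);
}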


> 2. Implementing an intelligent cache so that it's possible to offset
> the cost of hashing the most actively read data.  I'm guessing there's
> already some sort of weighed cache eviction algorithm in place, but I
> don't yet know how to look into it, let alone enough to leverage it...

I'm not quite a fan of such an intelligent but complicated cache design.
The main problem is that we would be putting policy into kernel space.

Currently, either use the last-recent-use in-memory backend, or use the 
all-in on-disk backend.
Users who want more precise control over which files/dirs shouldn't go 
through dedupe have the btrfs prop to set a per-file flag to avoid 
dedupe (see the sketch below).
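
With that prop set, the write path only needs a cheap flag test before 
entering the dedupe routine; the flag name below is a placeholder for 
whatever the 'inode nodedupe flag' patch actually defines:

/* Placeholder flag name; see the per-file dedupe control patches. */
static bool inode_wants_dedupe(struct inode *inode)
{
	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
		return false;	/* dedupe disabled for this file via btrfs prop */
	return true;
}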

>      * on the topic of leaning on the cache, I've been thinking about
> ways to optimize reads, while minimizing seeks on multi-spindle raid1
> btrfs volumes.  I'm guessing that someone will commit a solution
> before I manage to teach myself enough about filesystems to contribute
> something useful.
>
> That's it, in terms of features I want ;-)
>
> It's probably a well-known fact, but sha512 is roughly 40 to 50%
> faster than sha256, and 40 to 50% slower than sha1 on my 1200-series
> Xeon v3 (Haswell), for 8192 size blocks.

Sadly I didn't know that until recently. :(
Otherwise I would have implemented the SHA512 hash algorithm instead of SHA256.

Anyway, it's not that hard to add a new hash algorithm.

Thanks for your comments.
Qu

>
> Wish I could do more right now!
> Nicholas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-05  3:08           ` Qu Wenruo
@ 2016-04-20  2:02             ` Qu Wenruo
  2016-04-20 19:14               ` Chris Mason
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-04-20  2:02 UTC (permalink / raw)
  To: dsterba, linux-btrfs, clm

Hi David,

Any new comment about the ondisk format and ioctl interface?

Thanks,
Qu

Qu Wenruo wrote on 2016/04/05 11:08 +0800:
>
>
> David Sterba wrote on 2016/04/04 18:55 +0200:
>> On Fri, Mar 25, 2016 at 09:38:50AM +0800, Qu Wenruo wrote:
>>>> Please use the newly added BTRFS_PERSISTENT_ITEM_KEY instead of a new
>>>> key type. As this is the second user of that item, there's no
>>>> precendent
>>>> how to select the subtype. Right now 0 is for the dev stats item, but
>>>> I'd like to leave some space between them, so it should be 256 at best.
>>>> The space is 64bit so there's enough room but this also means defining
>>>> the on-disk format.
>>>
>>> After checking BTRFS_PERSISENT_ITEM_KEY, it seems that its value is
>>> larger than current DEDUPE_BYTENR/HASH_ITEM_KEY, and since the objectid
>>> of DEDUPE_HASH_ITEM_KEY, it won't be the first item of the tree.
>>>
>>> Although that's not a big problem, but for user using debug-tree, it
>>> would be quite annoying to find it located among tons of other hashes.
>>
>> You can alternatively store it in the tree_root, but I don't know how
>> frquently it's supposed to be changed.
>
> Storing it in the tree root sounds pretty good.
> Since that status doesn't change until we enable/disable (including
> configure), the tree root seems fine.
>
> But we still need to consider the key order of the later dedupe rate
> statistics.
> In that case, I hope to store them both in the dedupe tree.
>
>>
>>> So personally, if using PERSISTENT_ITEM_KEY, at least I prefer to keep
>>> objectid to 0, and modify DEDUPE_BYTENR/HASH_ITEM_KEY to higher value,
>>> to ensure dedupe status to be the first item of dedupe tree.
>>
>> 0 is unfortnuatelly taken by BTRFS_DEV_STATS_OBJECTID, but I don't see
>> problem with the ordering. DEDUPE_BYTENR/HASH_ITEM_KEY store a large
>> number in the objectid, either part of a hash, that's unlikely to be
>> almost-all zeros and bytenr which will be larger than 1MB.
>
> OK, as long as we can search the status item with exactly match key, it
> shouldn't cause big problem.
>
>>
>>>>>>> 4) Ioctl interface with persist dedup status
>>>>>>
>>>>>> I'd like to see the ioctl specified in more detail. So far there's
>>>>>> enable, disable and status. I'd expect some way to control the
>>>>>> in-memory
>>>>>> limits, let it "forget" current hash cache, specify the dedupe chunk
>>>>>> size, maybe sync of the in-memory hash cache to disk.
>>>>>
>>>>> So current and planned ioctl should be the following, with some
>>>>> details
>>>>> related to your in-memory limit control concerns.
>>>>>
>>>>> 1) Enable
>>>>>       Enable dedupe if it's not enabled already. (disabled -> enabled)
>>>>
>>>> Ok, so it should also take a parameter which bckend is about to be
>>>> enabled.
>>>
>>> It already has.
>>> It also has limit_nr and limit_mem parameter for in-memory backend.
>>>
>>>>
>>>>>       Or change current dedupe setting to another. (re-configure)
>>>>
>>>> Doing that in 'enable' sounds confusing, any changes belong to a
>>>> separate command.
>>>
>>> This depends the aspect of view.
>>>
>>> For "Enable/config/disable" case, it will introduce a state machine for
>>> end-user.
>>
>> Yes, that's exacly my point.
>>
>>> Personally, I doesn't state machine for end user. Yes, I also hate
>>> merging play and pause button together on music player.
>>
>> I don't see this reference relevant, we're not designing a music player.
>>
>>> If using state machine, user must ensure the dedupe is enabled before
>>> doing any configuration.
>>
>> For user convenience we can copy the configuration options to the dedup
>> enable subcommand, but it will still do separate enable and configure
>> ioctl calls.
>
> So, that is to say, one user can assume there is a state machine and use
> the enable-then-configure method,
> and another user can use the stateless enable-enable method.
>
> If so, I'm OK with adding a configure ioctl interface.
> (It is still the stateless enable-enable code beneath the stateful ioctl.)
>
> But in that case, if a user forgets to enable dedupe and calls configure
> directly, btrfs won't give any warning and will just enable dedupe.
>
> Will that design be OK for you? Or should enable and configure share most
> of the ioctl code, with the configure ioctl doing an extra check?
>
>
>>
>>> For me, user only need to care the result of the operation. User can now
>>> configure dedupe to their need without need to know previous setting.
>>>   From this aspect of view, "Enable/Disable" is much easier than
>>> "Enable/Config/Disable".
>>
>> Getting the usability is hard and that's why we're having this
>> dicussion. What suites you does not suite others, we have different
>> habits, expectations and there are existing usage patterns. We better
>> stick to something that's not too surprising yet still flexible enough
>> to cover broad needs. I'm leaving this open, but I strongly disagree
>> with the current interface proposal.
>
> I'm still open to new ioctl interface design, as long as we can re-use
> most of current code.
>
> Anyway, just as you pointed, the stateless one is just my personal taste.
>
>>
>>>>>       For dedupe_bs/backend/hash algorithm(only SHA256 yet) change, it
>>>>>       will disable dedupe(dropping all hash) and then enable with new
>>>>>       setting.
>>>>>
>>>>>       For in-memory backend, if only limit is different from previous
>>>>>       setting, limit can be changed on the fly without dropping any
>>>>> hash.
>>>>
>>>> This is obviously misplaced in 'enable'.
>>>
>>> Then, changing the 'enable' to 'configure' or other proper naming would
>>> be better.
>>>
>>> The point is, user only need to care what they want to do, not previous
>>> setup.
>>>
>>>>
>>>>> 2) Disable
>>>>>       Disable will drop all hash and delete the dedupe tree if it
>>>>> exists.
>>>>>       Imply a full sync_fs().
>>>>
>>>> That is again combining too many things into one. Say I want to disable
>>>> deduplication and want to enable it later. And not lose the whole state
>>>> between that. Not to say deleting the dedup tree.
>>>>
>>>> IOW, deleting the tree belongs to a separate command, though in the
>>>> userspace tools it could be done in one command, but we're talking
>>>> about
>>>> the kernel ioctls now.
>>>>
>>>> I'm not sure if the sync is required, but it's acceptable for first
>>>> implementation.
>>>
>>> The design is just to to reduce complexity.
>>> If want to keep hash but disable dedupe, it will make dedupe only handle
>>> extent remove, but ignore any new coming write.
>>>
>>> It will introduce a new state for dedupe, other than current simple
>>> enabled/disabled.
>>> So I just don't allow such mode.
>>>
>>>>
>>>>>
>>>>> 3) Status
>>>>>       Output basic status of current dedupe.
>>>>>       Including running status(disabled/enabled), dedupe block
>>>>> size, hash
>>>>>       algorithm, and limit setting for in-memory backend.
>>>>
>>>> Agreed. So this is basically the settings and static info.
>>>>
>>>>> 4) (PLANNED) In-memory hash size querying
>>>>>       Allowing userspace to query in-memory hash structure header
>>>>> size.
>>>>>       Used for "btrfs dedupe enable" '-l' option to output warning
>>>>> if user
>>>>>       specify memory size larger than 1/4 of the total memory.
>>>>
>>>> And this reflects the run-time status. Ok.
>>>>
>>>>> 5) (PLANNED) Dedeup rate statistics
>>>>>       Should be handy for user to know the dedupe rate so they can
>>>>> further
>>>>>       fine tuning their dedup setup.
>>>>
>>>> Similar as above, but for a different type of data. Ok.
>>>>
>>>>> So for your "in-memory limit control", just enable it with
>>>>> different limit.
>>>>> For "dedupe block size change", just enable it with different
>>>>> dedupe_bs.
>>>>> For "forget hash", just disable it.
>>>>
>>>> I can comment once the semantics of 'enable' are split, but basically I
>>>> want an interface to control the deduplication cache.
>>>
>>> So better renaming 'enable'.
>>>
>>> Current 'enable' provides the interface to control the limit or
>>> dedupe hash.
>>>
>>> I'm not sure further control is needed.
>>>
>>>>
>>>>> And for "write in-memory hash onto disk", not planned and may never do
>>>>> it due to the complexity, sorry.
>>>>
>>>> I'm not asking you to do it, definetelly not for the initial
>>>> implementation, but sync from memory to disk is IMO something that we
>>>> can expect users to ask for. The percieved complexity may shift
>>>> implementation to the future, but we should take it into account.
>>>
>>> OK, I'll keep it in mind.
>>>
>>>>
>>>>>>> 5) Ability to disable dedup for given dirs/files
>>>>>>
>>>>>> This would be good to extend to subvolumes.
>>>>>
>>>>> I'm sorry that I didn't quite understand the difference.
>>>>> Doesn't dir includes subvolume?
>>>>
>>>> If I enable deduplication on the entire subvolume, it will affect all
>>>> subdirectories. Not the other way around.
>>>
>>> It can be done by setting 'dedupe disable' on all other subvolumes.
>>> But it it's not practical yet.
>>
>> Thtat's opt-in vs opt-out, we'd need a better description of the
>> usecase.
>
> Then still the default dedupe behavior idea.
> Default to disable or enable.
> And per-inode dedupe enable/disable flag.
>
>
>>
>>> So maybe introduce a new state for default dedupe behavior?
>>> Current dedupe enabled default behavior is to dedup unless prohibited.
>>> If dedupe default behavior can be don't dedupe unless allowed, then it
>>> will be much easier to do.
>>>
>>>>
>>>>> Or xattr for subvolume is only restored in its parent subvolume, and
>>>>> won't be copied for its snapshot?
>>>>
>>>> The xattrs are copied to the snapshot.
>>>>
>>>>>>> TODO:
>>>>>>> 1) Add extent-by-extent comparison for faster but more
>>>>>>> conflicting algorithm
>>>>>>>       Current SHA256 hash is quite slow, and for some old(5 years
>>>>>>> ago) CPU,
>>>>>>>       CPU may even be a bottleneck other than IO.
>>>>>>>       But for faster hash, it will definitely cause conflicts, so
>>>>>>> we need
>>>>>>>       extent comparison before we introduce new dedup algorithm.
>>>>>>
>>>>>> If sha256 is slow, we can use a less secure hash that's faster but
>>>>>> will
>>>>>> do a full byte-to-byte comparison in case of hash collision, and
>>>>>> recompute sha256 when the blocks are going to disk. I haven't thought
>>>>>> this through, so there are possibly details that could make
>>>>>> unfeasible.
>>>>>
>>>>> Not exactly. If we are using unsafe hash, e.g MD5, we will use MD5
>>>>> only
>>>>> for both in-memory and on-disk backend. No SHA256 again.
>>>>
>>>> I'm proposing unsafe but fast, which MD5 is not. Look for xxhash or
>>>> murmur. As they're both order-of-magnitutes faster than sha1/md5, we
>>>> can
>>>> actually hash both to reduce the collisions.
>>>
>>> Don't quite like the idea to use 2 hash other than 1.
>>> Yes, some program like rsync uses this method, but this also involves a
>>> lot of details, like the order to restore them on disk.
>>
>> I'm considering fast-but-unsafe hashes for the in-memory backend, where
>> the speed matters and we cannot hide the slow sha256 calculations behind
>> the IO (ie. no point to save microseconds if the IO is going to take
>> milliseconds).
>
> If only for in-memory backend, I'm OK.
>
> In-memory backend is much like an experimental field for new ideas, as
> it won't affect on-disk format at all.
>
> But the problem is still there.
> For the fast hash hit case, we still need to calculate the slow SHA256 to
> ensure it's a complete hit.
> That's OK and expected.
>
> But for the fast hash miss case, nothing really changes.
> As long as we need to add the hash for the extent into the hash pool, we
> still need to calculate the slow SHA256.
>
> We can't insert only the fast hash, or on the next fast hash hit against
> this extent we would have no slow hash to ensure consistency.
>
>>
>>>>> In that case, for MD5 hit case, we will do a full byte-to-byte
>>>>> comparison. It may be slow or fast, depending on the cache.
>>>>
>>>> If the probability of hash collision is low, so the number of needed
>>>> byte-to-byte comparisions is also low.
>>>
>>> Considering the common use-case of dedupe, hash hit should be a
>>> common case.
>>>
>>> In that case, each hash hit will lead to byte-to-byte comparison, which
>>> will significantly impact the dedupe performance.
>>>
>>> On the other hand, if dedupe hit rate is low, then why use dedupe?
>>
>> Oh right, that would require at least 2 hashes then.
>>
>>>>> But at least for MD5 miss case, it should be faster than SHA256.
>>>>>
>>>>>> The idea is to move expensive hashing to the slow IO operations
>>>>>> and do
>>>>>> fast but not 100% safe hashing on the read/write side where
>>>>>> performance
>>>>>> matters.
>>>>>
>>>>> Yes, although on the read side, we don't perform hash, we only do hash
>>>>> at write side.
>>>>
>>>> Oh, so how exactly gets the in-memory deduplication cache filled? My
>>>> impression was that we can pre-fill it by reading bunch of files
>>>> where we
>>>> expect the shared data to exist.
>>>
>>> Yes, we used to do that method aging back to the first version of
>>> in-memory implementation.
>>>
>>> But that will cause a lot of CPU usage and most of them are just wasted.
>>
>> I think this depends on the data.
>>
>>> Don't forget that, in common dedupe use-case, dedupe rate should be
>>> high, I'll use 50% as an exmaple.
>>> This means, 50% of your read will be pointed to a shared extents. But
>>> 100% of read will need to calculate hash, and 50% of them are already in
>>> hash pool.
>>> So the CPU time are just wasted.
>>
>> I understand the concerns, but I don't understand the example, sorry.
>
> My poor English again.
>
> Take 4 file extents as example.
> File Ext A: points to extent X
> File Ext B: points to extent X
> File Ext C: points to extent Y
> File Ext D: points to extent Y
>
> They are all read at the same time, then we calculate hash for all 4 of
> them at read time.
>
> But at dedupe hash insert time, only hash for extent X and Y is inserted.
>
> We hashed 4 times, but only inserted 2 hashes into dedupe hash pool.
>
>>
>>>> The usecase:
>>>>
>>>> Say there's a golden image for a virtual machine,
>>>
>>> Not to nitpick, but I though VM images are not good use-case for btrfs.
>>> And normally user would set nodatacow for it, which will bypass dedupe.
>>
>> VM on nodatacow. By bypass you mean that it cannot work together or that
>> it's just not going to be implemented?
>
> They can't work together, so a nodatacow file won't go through the
> dedupe routine.
> Dedupe needs datacow to ensure its hashed data won't change.
>
> If an extent is re-written while its dedupe hash is not updated, the next
> hash hit will cause corruption.
>
> And for already deduped (shared) extents, data cow will always happen, no
> matter the mount option or file flag.
>
>>
>>>> we'll clone it and use
>>>> for other VM's, with minor changes. If we first read the golden image
>>>> with deduplication enabled, pre-fill the cache, any subsequent
>>>> writes to
>>>> the cloned images will be compared to the cached data. The estimated
>>>> hit
>>>> ratio is medium-to-high.
>>>
>>> And performance is so low that most user would feel, and CPU usage will
>>> be so high (up to 8 cores 100% used)that almost no spare CPU time can be
>>> allocated for VM use.
>>>
>>>>
>>>> And this can be extended to anything, not just VMs. Without the option
>>>> to fill the in-memory cache, the deduplication would seem pretty
>>>> useless
>>>> to me. The clear benefit is lack of maintaining the persistent storage
>>>> of deduplication data.
>>>
>>> I originally planned a ioctl for it to fill hash manually.
>>> But now I think re-write would be good enough.
> Maybe I could add a pseudo 'dedupe fill' command in btrfs-progs, which will
>>> just read out the data and re-write it.
>>
>> Rewriting will take twice the IO and might even fail due to enospc
>> reasons, I don't see that as a viable option.
>
> Then we still need another ioctl for it to re-fill hash.
>
> Added to ioctl TO-DO list.
>
>>
>>>>> And in that case, if weak hash hit, we will need to do memory
>>>>> comparison, which may also be slow.
>>>>> So the performance impact may still exist.
>>>>
>>>> Yes the performance hit is there, with statistically low probability.
>>>>
>>>>> The biggest challenge is, we need to read (decompressed) extent
>>>>> contents, even without an inode.
>>>>> (So, no address_space and all the working facilities)
>>>>>
>>>>> Considering the complexity and uncertain performance improvement, the
>>>>> priority of introducing weak hash is quite low so far, not to
>>>>> mention a
>>>>> lot of detail design change for it.
>>>>
>>>> I disagree.
>>>
>>> Explained above, hash hit in dedupe use-case is common case, while we
>>> must do byte-to-byte comparison in common case routine, it's hard to
>>> ignore the overhead.
>>
>> So this should be solved by the double hashing, pushing the probability
>> of byte-to-byte comparision low.
>
> As long as we are going to add the hash into hash pool, we need all of
> the two hashes.(explained above)
>
> So nothing changed.
>
> That's the difference with rsync, which doesn't need to add hash into
> its pool. It only needs to make sure they are identical.
>
> Such fast hash only case will only happen in priority based dedupe use
> case.
>
> In a simple priority based case, only specified(high priority) files can
> populate the hash pool.
> Other(low priority) files can only be deduped using high priority files'
> extent, but never populate hash pool.
>
> In that case, fast hash will work, as low priority files will go through
> fast hash calculation but only when fast hash hit, they will go through
> slow hash, and save some time.
> (But still, in high dedupe-rate case, the saved some is still small)
>
>>
>>>>> A much easier and practical enhancement is, to use SHA512.
>>>>> As it's faster than SHA256 on modern 64bit machine for larger size.
>>>>> For example, for hashing 8K data, SHA512 is almost 40% faster than
>>>>> SHA256.
>>>>>
>>>>>>> 2) Misc end-user related helpers
>>>>>>>       Like handy and easy to implement dedup rate report.
>>>>>>>       And method to query in-memory hash size for those
>>>>>>> "non-exist" users who
>>>>>>>       want to use 'dedup enable -l' option but didn't ever know
>>>>>>> how much
>>>>>>>       RAM they have.
>>>>>>
>>>>>> That's what we should try know and define in advance, that's part
>>>>>> of the
>>>>>> ioctl interface.
>>>>>>
>>>>>> I went through the patches, there are a lot of small things to
>>>>>> fix, but
>>>>>> first I want to be sure about the interfaces, ie. on-disk and ioctl.
>>>>>
>>>>> I hope such small things can be pointed out, allowing me to fix them
>>>>> while rebasing.
>>>>
>>>> Sure, that's next after we agree on what the deduplication should
>>>> actually, the ioctls interefaces are settled and the on-disk format
>>>> changes are agreed on. The code is a good starting point, but pointing
>>>> out minor things at this point does not justify the time spent.
>>>
>>> That's OK.
>>>
>>>>>> Then we can start to merge the patchset in smaller batches, the
>>>>>> in-memory deduplication does not have implications on the on-disk
>>>>>> format, so it's "just" the ioctl part.
>>>>>
>>>>> Yes, that's my original plan, first merge simple in-memory backend
>>>>> into
>>>>> 4.5/4.6 and then adding ondisk backend into 4.7.
>>>>>
>>>>> But things turned out that, since we designed the two-backends API
>>>>> from
>>>>> the beginning, on-disk backend doesn't take much time to implement.
>>>>>
>>>>> So this makes what you see now, a big patchset with both backend
>>>>> implemented.
>>>>
>>>> For the discussions and review phase it's ok to see them both, but it's
>>>> unrealistic to expect merging in a particular version without going
>>>> through the review heat, especially for something like deduplication.
>>>>
>>>>
>>> In fact, I didn't expect dedupe to receive such heat.
>>
>> Really? That surprises me :) It modifies on-disk format, adds ioctls,
>> can have impact on performacne (that we even haven't measured yet), and
>> from the users POV, it's been requested for a long time.
>
> Don't forget I was pushing for in-memory only backend at first.
>
> The main reason for in-memory only backend is that we can do whatever we
> want to try and won't affect the on-disk format.
> The very first vision of dedupe is to be a cool but not that useful
> feature.
>
> But that's the past.
> Since now we have on-disk format change, new and expanding ioctls, slow
> performance (but still a little better for all miss case compared to
> compression), it will receive much heat.
>
> Thanks,
> Qu
>>
>>> I originally expect such dedupe to be an interesting but not so
>>> practical feature, just like ZFS dedupe.
>>> (I can be totally wrong, please point it out if there is some well-known
>>> use-case of ZFS dedupe)
>>>
>>> I was expecting dedupe to be a good entrance to expose existing bugs,
>>> and raise attention for better delayed_ref and delalloc implementation.
>>>
>>> Since it's considered as a high-profile feature, I'm OK to slow down the
>>> rush of merge and polish the interface/code further more.
>>
>> Yeah, as already mentioned, for exposing the bugs we can add code but
>> hide the ioctls.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-20  2:02             ` Qu Wenruo
@ 2016-04-20 19:14               ` Chris Mason
  0 siblings, 0 replies; 62+ messages in thread
From: Chris Mason @ 2016-04-20 19:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Wed, Apr 20, 2016 at 10:02:27AM +0800, Qu Wenruo wrote:
> Hi David,
> 
> Any new comment about the ondisk format and ioctl interface?

Hi Qu,

I'm at LSF this week but will dig through again on the way home.
Thanks!

-chris

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
  2016-03-22  1:35 ` [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space Qu Wenruo
@ 2016-04-22 18:06   ` Josef Bacik
  2016-04-25  0:54     ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Josef Bacik @ 2016-04-22 18:06 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 03/21/2016 09:35 PM, Qu Wenruo wrote:
> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>
> In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
> to reserve is calculated by the difference between outstanding_extents and
> reserved_extents.
>
> When reserve_metadata_bytes() fails to reserve desited metadata space,
> it has already done some reclaim work, such as write ordered extents.
>
> In that case, outstanding_extents and reserved_extents may already
> changed, and we may reserve enough metadata space then.
>
> So this patch will try to call reserve_metadata_bytes() at most 3 times
> to ensure we really run out of space.
>
> Such false ENOSPC is mainly caused by small file extents and time
> consuming delalloc functions, which mainly affects in-band
> de-duplication. (Compress should also be affected, but LZO/zlib is
> faster than SHA256, so still harder to trigger than dedupe).
>
> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
> ---
>   fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
>   1 file changed, 22 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dabd721..016d2ec 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
>   				 * a new extent is revered, then deleted
>   				 * in one tran, and inc/dec get merged to 0.
>   				 *
> -				 * In this case, we need to remove its dedup
> +				 * In this case, we need to remove its dedupe
>   				 * hash.
>   				 */
>   				btrfs_dedupe_del(trans, fs_info, node->bytenr);
> @@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>   	bool delalloc_lock = true;
>   	u64 to_free = 0;
>   	unsigned dropped;
> +	int loops = 0;
>
>   	/* If we are a free space inode we need to not flush since we will be in
>   	 * the middle of a transaction commit.  We also don't need the delalloc
> @@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
>   	    btrfs_transaction_in_commit(root->fs_info))
>   		schedule_timeout(1);
>
> +	num_bytes = ALIGN(num_bytes, root->sectorsize);
> +
> +again:
>   	if (delalloc_lock)
>   		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>
> -	num_bytes = ALIGN(num_bytes, root->sectorsize);
> -
>   	spin_lock(&BTRFS_I(inode)->lock);
>   	nr_extents = (unsigned)div64_u64(num_bytes +
>   					 BTRFS_MAX_EXTENT_SIZE - 1,
> @@ -5815,6 +5817,23 @@ out_fail:
>   	}
>   	if (delalloc_lock)
>   		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
> +	/*
> +	 * The number of metadata bytes is calculated by the difference
> +	 * between outstanding_extents and reserved_extents. Sometimes though
> +	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
> +	 * indeed it has already done some work to reclaim metadata space, hence
> +	 * both outstanding_extents and reserved_extents would have changed and
> +	 * the bytes we try to reserve would also has changed(may be smaller).
> +	 * So here we try to reserve again. This is much useful for online
> +	 * dedupe, which will easily eat almost all meta space.
> +	 *
> +	 * XXX: Indeed here 3 is arbitrarily choosed, it's a good workaround for
> +	 * online dedupe, later we should find a better method to avoid dedupe
> +	 * enospc issue.
> +	 */
> +	if (unlikely(ret == -ENOSPC && loops++ < 3))
> +		goto again;
> +
>   	return ret;
>   }
>
>

NAK, we aren't going to just arbitrarily retry to make our metadata 
reservation.  Dropping reserved metadata space by completing ordered 
extents should free enough to make our current reservation, and in fact 
this only accounts for the disparity, so should be an accurate count 
most of the time.  I can see a case for detecting that the disparity no 
longer exists and retrying in that case (we free enough ordered extents 
that we are no longer trying to reserve ours + overflow but now only 
ours) and retry in _that specific case_, but we need to limit it to this 
case only.  Thanks,
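
A rough sketch of that narrower condition (names are illustrative, 
loosely following btrfs_delalloc_reserve_metadata(); none of these 
helpers or variables exist as written): recompute what is needed after 
reclaim and only go around again when the overflow part is gone:

	/*
	 * Illustrative sketch of the narrower retry: only loop when reclaim
	 * has removed the disparity, i.e. the bytes we need now cover just
	 * our own outstanding extents, not the earlier overflow.
	 */
	if (ret == -ENOSPC && !retried) {
		u64 needed_now = calc_current_need(inode, num_bytes);

		if (needed_now < originally_needed &&
		    needed_now == own_reservation_bytes) {
			retried = true;
			goto again;
		}
	}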

Josef

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-06  5:22             ` Qu Wenruo
@ 2016-04-22 22:14               ` Nicholas D Steeves
  2016-04-25  1:25                 ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Nicholas D Steeves @ 2016-04-22 22:14 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi Qu,

On 6 April 2016 at 01:22, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>
>
> Nicholas D Steeves wrote on 2016/04/05 23:47 -0400:
>>
>> It is unlikely that I will use dedupe, but I imagine your work will
>> apply to the following wishlist:
>>
>> 1. Allow disabling of memory-backend hash via a kernel argument,
>> sysctl, or mount option for those of us who have ECC RAM.
>>      * page_cache never gets pushed to swap, so this should be safe, no?
>
> And why it's related to ECC RAM? To avoid memory corruption which will
> finally lead to file corruption?
> If so, it makes sense.

Yes, my assumption is that a system with ECC will either correct the
error, or that an uncorrectable event will trigger the same error
handling procedure as if the software checksum failed.

> Also I didn't get the point when you mention page_cache.
> For hash pool, we didn't use page cache. We just use kmalloc, which won't be
> swapped out.
> For file page cache, it's not affected at all.

My apologies, I'm still very new to this, and my "point" only
demonstrates my lack of understanding.  Thank you for directing me to
the kmalloc-related sections.

>> 2. Implementing an intelligent cache so that it's possible to offset
>> the cost of hashing the most actively read data.  I'm guessing there's
>> already some sort of weighed cache eviction algorithm in place, but I
>> don't yet know how to look into it, let alone enough to leverage it...
>
>
> I'm not quite a fan of such an intelligent but complicated cache design.
> The main problem is that we would be putting policy into kernel space.
>
> Currently, either use last-recent-use in-memory backend, or use all-in
> ondisk backend.
> For user want more precious control on which file/dir shouldn't go through
> dedupe, they have the btrfs prop to set per-file flag to avoid dedupe.

I'm looking into a project for some (hopefully) safe,
low-hanging-fruit read optimisations, and read that

Qu Wenruo wrote on 2016/04/05 11:08 +0800:
> In-memory backend is much like an experimental field for new ideas,
> as it won't affect on-disk format at all."

Do you think that the last-recent-use in-memory backend could be used in
this way?  Specifically, I'm wondering whether the even|odd PID method of
choosing which disk to read from could be replaced with the following
method for rotational disks:

The last-recent-use in-memory backend stores the value of the last
allocation group (and/or transaction ID, or something else), along with
which disk did the IO.  I imagine it's possible to minimize seeks by
choosing the disk with the smallest absolute difference between
requested_location and its last-recent-use_location, with a simple
static_cast.
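
A minimal sketch of that idea (purely hypothetical, not existing btrfs
code): remember the last physical offset each mirror served and pick the
mirror whose head is presumably closest to the requested offset:

/* Hypothetical sketch; not existing btrfs code. */
struct mirror_state {
	u64 last_physical;	/* last offset this device served */
};

static int pick_mirror(struct mirror_state *mirrors, int num_mirrors,
		       u64 requested_physical)
{
	u64 best_dist = U64_MAX;
	int best = 0;
	int i;

	for (i = 0; i < num_mirrors; i++) {
		u64 pos = mirrors[i].last_physical;
		u64 dist = pos > requested_physical ?
			   pos - requested_physical : requested_physical - pos;

		if (dist < best_dist) {
			best_dist = dist;
			best = i;
		}
	}
	mirrors[best].last_physical = requested_physical;
	return best;
}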

Would the addition of that value pair (recent-use_location, disk) keep
things simple and maybe prove to be useful, or is the last-recent-use
in-memory backend the wrong place for it?

Thank you for taking the time to reply,
Nicholas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
  2016-04-22 18:06   ` Josef Bacik
@ 2016-04-25  0:54     ` Qu Wenruo
  2016-04-25 14:05       ` Josef Bacik
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2016-04-25  0:54 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs; +Cc: Wang Xiaoguang



Josef Bacik wrote on 2016/04/22 14:06 -0400:
> On 03/21/2016 09:35 PM, Qu Wenruo wrote:
>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>
>> In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try
>> to reserve is calculated by the difference between outstanding_extents
>> and
>> reserved_extents.
>>
>> When reserve_metadata_bytes() fails to reserve desited metadata space,
>> it has already done some reclaim work, such as write ordered extents.
>>
>> In that case, outstanding_extents and reserved_extents may already
>> changed, and we may reserve enough metadata space then.
>>
>> So this patch will try to call reserve_metadata_bytes() at most 3 times
>> to ensure we really run out of space.
>>
>> Such false ENOSPC is mainly caused by small file extents and time
>> consuming delalloc functions, which mainly affects in-band
>> de-duplication. (Compress should also be affected, but LZO/zlib is
>> faster than SHA256, so still harder to trigger than dedupe).
>>
>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>> ---
>>   fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index dabd721..016d2ec 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct
>> btrfs_trans_handle *trans,
>>                    * a new extent is revered, then deleted
>>                    * in one tran, and inc/dec get merged to 0.
>>                    *
>> -                 * In this case, we need to remove its dedup
>> +                 * In this case, we need to remove its dedupe
>>                    * hash.
>>                    */
>>                   btrfs_dedupe_del(trans, fs_info, node->bytenr);
>> @@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode
>> *inode, u64 num_bytes)
>>       bool delalloc_lock = true;
>>       u64 to_free = 0;
>>       unsigned dropped;
>> +    int loops = 0;
>>
>>       /* If we are a free space inode we need to not flush since we
>> will be in
>>        * the middle of a transaction commit.  We also don't need the
>> delalloc
>> @@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct
>> inode *inode, u64 num_bytes)
>>           btrfs_transaction_in_commit(root->fs_info))
>>           schedule_timeout(1);
>>
>> +    num_bytes = ALIGN(num_bytes, root->sectorsize);
>> +
>> +again:
>>       if (delalloc_lock)
>>           mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>>
>> -    num_bytes = ALIGN(num_bytes, root->sectorsize);
>> -
>>       spin_lock(&BTRFS_I(inode)->lock);
>>       nr_extents = (unsigned)div64_u64(num_bytes +
>>                        BTRFS_MAX_EXTENT_SIZE - 1,
>> @@ -5815,6 +5817,23 @@ out_fail:
>>       }
>>       if (delalloc_lock)
>>           mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
>> +    /*
>> +     * The number of metadata bytes is calculated by the difference
>> +     * between outstanding_extents and reserved_extents. Sometimes
>> though
>> +     * reserve_metadata_bytes() fails to reserve the wanted metadata
>> bytes,
>> +     * indeed it has already done some work to reclaim metadata
>> space, hence
>> +     * both outstanding_extents and reserved_extents would have
>> changed and
>> +     * the bytes we try to reserve would also has changed(may be
>> smaller).
>> +     * So here we try to reserve again. This is much useful for online
>> +     * dedupe, which will easily eat almost all meta space.
>> +     *
>> +     * XXX: Indeed here 3 is arbitrarily choosed, it's a good
>> workaround for
>> +     * online dedupe, later we should find a better method to avoid
>> dedupe
>> +     * enospc issue.
>> +     */
>> +    if (unlikely(ret == -ENOSPC && loops++ < 3))
>> +        goto again;
>> +
>>       return ret;
>>   }
>>
>>
>
> NAK, we aren't going to just arbitrarily retry to make our metadata
> reservation.  Dropping reserved metadata space by completing ordered
> extents should free enough to make our current reservation, and in fact
> this only accounts for the disparity, so should be an accurate count
> most of the time.  I can see a case for detecting that the disparity no
> longer exists and retrying in that case (we free enough ordered extents
> that we are no longer trying to reserve ours + overflow but now only
> ours) and retry in _that specific case_, but we need to limit it to this
> case only.  Thanks,

Would it be OK to retry only for the dedupe enabled case?

Currently it's only a workaround and we are still digging for the root
cause, but as a workaround I assume it is good enough for the dedupe
enabled case.
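
For reference, if I understand that narrower case correctly, it would look
roughly like the following at the end of btrfs_delalloc_reserve_metadata()
(only a sketch: the "retried" flag and the exact condition are illustrative,
not a tested change):
------
	/*
	 * Sketch only, with "bool retried = false;" declared at the top of
	 * the function: retry the reservation once, but only when the
	 * reclaim done inside reserve_metadata_bytes() completed enough
	 * ordered extents that the old disparity is gone, i.e. we would
	 * now be reserving only for our own new extents.
	 */
	if (ret == -ENOSPC && !retried) {
		unsigned outstanding, reserved;

		spin_lock(&BTRFS_I(inode)->lock);
		outstanding = BTRFS_I(inode)->outstanding_extents;
		reserved = BTRFS_I(inode)->reserved_extents;
		spin_unlock(&BTRFS_I(inode)->lock);

		/* nr_extents is what this reservation needs for itself */
		if (outstanding <= reserved + nr_extents) {
			retried = true;
			goto again;
		}
	}
------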

Thanks,
Qu
>
> Josef
>
>




* Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
  2016-04-22 22:14               ` Nicholas D Steeves
@ 2016-04-25  1:25                 ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-04-25  1:25 UTC (permalink / raw)
  To: Nicholas D Steeves, Btrfs BTRFS



Nicholas D Steeves wrote on 2016/04/22 18:14 -0400:
> Hi Qu,
>
> On 6 April 2016 at 01:22, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> Nicholas D Steeves wrote on 2016/04/05 23:47 -0400:
>>>
>>> It is unlikely that I will use dedupe, but I imagine your work will
>>> apply to the following wishlist:
>>>
>>> 1. Allow disabling of memory-backend hash via a kernel argument,
>>> sysctl, or mount option for those of us have ECC RAM.
>>>      * page_cache never gets pushed to swap, so this should be safe, no?
>>
>> And why it's related to ECC RAM? To avoid memory corruption which will
>> finally lead to file corruption?
>> If so, it makes sense.
>
> Yes, my assumption is that a system with ECC will either correct the
> error, or that an uncorrectable event will trigger the same error
> handling procedure as if the software checksum failed.
>
>> Also I didn't get the point when you mention page_cache.
>> For hash pool, we didn't use page cache. We just use kmalloc, which won't be
>> swapped out.
>> For file page cache, it's not affected at all.
>
> My apologies, I'm still very new to this, and my "point" only
> demonstrates my lack of understanding.  Thank you for directing me to
> the kmalloc-related sections.
>
>>> 2. Implementing an intelligent cache so that it's possible to offset
>>> the cost of hashing the most actively read data.  I'm guessing there's
>>> already some sort of weighted cache eviction algorithm in place, but I
>>> don't yet know how to look into it, let alone enough to leverage it...
>>
>>
>> I'm not quite a fan of such an intelligent but complicated cache design.
>> The main problem is that we would be putting policy into kernel space.
>>
>> Currently, either use the last-recently-used in-memory backend, or the
>> all-in on-disk backend.
>> For users who want more precise control over which files/dirs shouldn't go
>> through dedupe, there is the btrfs prop to set a per-file flag to avoid
>> dedupe.
>
> I'm looking into a project for some (hopefully) safe,
> low-hanging-fruit read optimisations, and read that
>
> Qu Wenruo wrote on 2016/04/05 11:08 +0800:
>> In-memory backend is much like an experimental field for new ideas,
>> as it won't affect on-disk format at all."
>
> Do you think the last-recent-use in-memory backend could be used in
> this way?  Specifically, I'm wondering whether the even|odd PID method
> of choosing which disk to read from could be replaced with the
> following method for rotational disks:
>
> The last-recent-use in-memory backend stores the value of the last
> allocation group (and/or transaction ID, or something else), with an
> attached value of which disk did the IO.  I imagine it's possible to
> minimize seeks by choosing the disk with the smallest absolute
> difference between requested_location and last-recent-use_location,
> computed with a simple static_cast.

By "allocation group", did you mean chunk or block group?

>
> Would the addition of that value pair (recent-use_location, disk) keep
> things simple and maybe prove to be useful, or is last-recent-use
> in-memory the wrong place for it?

Maybe I missed something, but it doesn't seem to have anything to do
with inband dedupe.
It looks more like a RAID read optimization.

And I'm not familiar with btrfs RAID, but it seems that btrfs doesn't
have anything smart for balancing bio requests.
So it may make sense.

But you also mentioned "each disk"; if you are going to do it on a
per-disk basis, it may not make much sense, as we already have the block
level scheduler, which will merge/re-order bios to improve performance.

It would be better if you could provide a clearer view of what you are
going to do.
For example, at the RAID level or at the block device level.
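
If you mean the first one, i.e. replacing the current pid-based mirror
choice in the RAID1 read path, I imagine it as something roughly like the
sketch below (all names and the per-device "last position" tracking are
hypothetical, just to illustrate the idea, not existing btrfs code):
------
/*
 * Illustrative only: pick the mirror whose last served physical offset
 * is closest to the requested offset, instead of picking by PID parity.
 */
static int pick_closest_mirror(u64 *last_pos, int num_mirrors, u64 req_pos)
{
	u64 best_dist = (u64)-1;
	int best = 0;
	int i;

	for (i = 0; i < num_mirrors; i++) {
		u64 dist = (req_pos > last_pos[i]) ?
			   req_pos - last_pos[i] : last_pos[i] - req_pos;

		if (dist < best_dist) {
			best_dist = dist;
			best = i;
		}
	}
	/* remember roughly where this mirror's head will end up */
	last_pos[best] = req_pos;
	return best;
}
------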

Thanks,
Qu

>
> Thank you for taking the time to reply,
> Nicholas







* Re: [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
  2016-04-25  0:54     ` Qu Wenruo
@ 2016-04-25 14:05       ` Josef Bacik
  2016-04-26  0:50         ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Josef Bacik @ 2016-04-25 14:05 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Wang Xiaoguang

On 04/24/2016 08:54 PM, Qu Wenruo wrote:
>
>
> Josef Bacik wrote on 2016/04/22 14:06 -0400:
>> On 03/21/2016 09:35 PM, Qu Wenruo wrote:
>>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>
>>> In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we
>>> try
>>> to reserve is calculated by the difference between outstanding_extents
>>> and
>>> reserved_extents.
>>>
>>> When reserve_metadata_bytes() fails to reserve desited metadata space,
>>> it has already done some reclaim work, such as write ordered extents.
>>>
>>> In that case, outstanding_extents and reserved_extents may already
>>> changed, and we may reserve enough metadata space then.
>>>
>>> So this patch will try to call reserve_metadata_bytes() at most 3 times
>>> to ensure we really run out of space.
>>>
>>> Such false ENOSPC is mainly caused by small file extents and time
>>> consuming delalloc functions, which mainly affects in-band
>>> de-duplication. (Compress should also be affected, but LZO/zlib is
>>> faster than SHA256, so still harder to trigger than dedupe).
>>>
>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>> ---
>>>   fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
>>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index dabd721..016d2ec 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct
>>> btrfs_trans_handle *trans,
>>>                    * a new extent is revered, then deleted
>>>                    * in one tran, and inc/dec get merged to 0.
>>>                    *
>>> -                 * In this case, we need to remove its dedup
>>> +                 * In this case, we need to remove its dedupe
>>>                    * hash.
>>>                    */
>>>                   btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>> @@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode
>>> *inode, u64 num_bytes)
>>>       bool delalloc_lock = true;
>>>       u64 to_free = 0;
>>>       unsigned dropped;
>>> +    int loops = 0;
>>>
>>>       /* If we are a free space inode we need to not flush since we
>>> will be in
>>>        * the middle of a transaction commit.  We also don't need the
>>> delalloc
>>> @@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct
>>> inode *inode, u64 num_bytes)
>>>           btrfs_transaction_in_commit(root->fs_info))
>>>           schedule_timeout(1);
>>>
>>> +    num_bytes = ALIGN(num_bytes, root->sectorsize);
>>> +
>>> +again:
>>>       if (delalloc_lock)
>>>           mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>>>
>>> -    num_bytes = ALIGN(num_bytes, root->sectorsize);
>>> -
>>>       spin_lock(&BTRFS_I(inode)->lock);
>>>       nr_extents = (unsigned)div64_u64(num_bytes +
>>>                        BTRFS_MAX_EXTENT_SIZE - 1,
>>> @@ -5815,6 +5817,23 @@ out_fail:
>>>       }
>>>       if (delalloc_lock)
>>>           mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
>>> +    /*
>>> +     * The number of metadata bytes is calculated by the difference
>>> +     * between outstanding_extents and reserved_extents. Sometimes
>>> though
>>> +     * reserve_metadata_bytes() fails to reserve the wanted metadata
>>> bytes,
>>> +     * indeed it has already done some work to reclaim metadata
>>> space, hence
>>> +     * both outstanding_extents and reserved_extents would have
>>> changed and
>>> +     * the bytes we try to reserve would also has changed(may be
>>> smaller).
>>> +     * So here we try to reserve again. This is much useful for online
>>> +     * dedupe, which will easily eat almost all meta space.
>>> +     *
>>> +     * XXX: Indeed here 3 is arbitrarily choosed, it's a good
>>> workaround for
>>> +     * online dedupe, later we should find a better method to avoid
>>> dedupe
>>> +     * enospc issue.
>>> +     */
>>> +    if (unlikely(ret == -ENOSPC && loops++ < 3))
>>> +        goto again;
>>> +
>>>       return ret;
>>>   }
>>>
>>>
>>
>> NAK, we aren't going to just arbitrarily retry to make our metadata
>> reservation.  Dropping reserved metadata space by completing ordered
>> extents should free enough to make our current reservation, and in fact
>> this only accounts for the disparity, so should be an accurate count
>> most of the time.  I can see a case for detecting that the disparity no
>> longer exists and retrying in that case (we free enough ordered extents
>> that we are no longer trying to reserve ours + overflow but now only
>> ours) and retry in _that specific case_, but we need to limit it to this
>> case only.  Thanks,
>
> Would it be OK to retry only for the dedupe enabled case?
>
> Currently it's only a workaround and we are still digging for the root
> cause, but as a workaround I assume it is good enough for the dedupe
> enabled case.
>

No, we're not going to leave things in a known broken state to come back
to later; that just makes it so we forget about them and they sit there
forever.  Thanks,

Josef



* Re: [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space
  2016-04-25 14:05       ` Josef Bacik
@ 2016-04-26  0:50         ` Qu Wenruo
  0 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2016-04-26  0:50 UTC (permalink / raw)
  To: Josef Bacik, linux-btrfs; +Cc: Wang Xiaoguang



Josef Bacik wrote on 2016/04/25 10:05 -0400:
> On 04/24/2016 08:54 PM, Qu Wenruo wrote:
>>
>>
>> Josef Bacik wrote on 2016/04/22 14:06 -0400:
>>> On 03/21/2016 09:35 PM, Qu Wenruo wrote:
>>>> From: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>>
>>>> In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we
>>>> try
>>>> to reserve is calculated by the difference between outstanding_extents
>>>> and
>>>> reserved_extents.
>>>>
>>>> When reserve_metadata_bytes() fails to reserve desited metadata space,
>>>> it has already done some reclaim work, such as write ordered extents.
>>>>
>>>> In that case, outstanding_extents and reserved_extents may already
>>>> changed, and we may reserve enough metadata space then.
>>>>
>>>> So this patch will try to call reserve_metadata_bytes() at most 3 times
>>>> to ensure we really run out of space.
>>>>
>>>> Such false ENOSPC is mainly caused by small file extents and time
>>>> consuming delalloc functions, which mainly affects in-band
>>>> de-duplication. (Compress should also be affected, but LZO/zlib is
>>>> faster than SHA256, so still harder to trigger than dedupe).
>>>>
>>>> Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
>>>> ---
>>>>   fs/btrfs/extent-tree.c | 25 ++++++++++++++++++++++---
>>>>   1 file changed, 22 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index dabd721..016d2ec 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct
>>>> btrfs_trans_handle *trans,
>>>>                    * a new extent is revered, then deleted
>>>>                    * in one tran, and inc/dec get merged to 0.
>>>>                    *
>>>> -                 * In this case, we need to remove its dedup
>>>> +                 * In this case, we need to remove its dedupe
>>>>                    * hash.
>>>>                    */
>>>>                   btrfs_dedupe_del(trans, fs_info, node->bytenr);
>>>> @@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode
>>>> *inode, u64 num_bytes)
>>>>       bool delalloc_lock = true;
>>>>       u64 to_free = 0;
>>>>       unsigned dropped;
>>>> +    int loops = 0;
>>>>
>>>>       /* If we are a free space inode we need to not flush since we
>>>> will be in
>>>>        * the middle of a transaction commit.  We also don't need the
>>>> delalloc
>>>> @@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct
>>>> inode *inode, u64 num_bytes)
>>>>           btrfs_transaction_in_commit(root->fs_info))
>>>>           schedule_timeout(1);
>>>>
>>>> +    num_bytes = ALIGN(num_bytes, root->sectorsize);
>>>> +
>>>> +again:
>>>>       if (delalloc_lock)
>>>>           mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
>>>>
>>>> -    num_bytes = ALIGN(num_bytes, root->sectorsize);
>>>> -
>>>>       spin_lock(&BTRFS_I(inode)->lock);
>>>>       nr_extents = (unsigned)div64_u64(num_bytes +
>>>>                        BTRFS_MAX_EXTENT_SIZE - 1,
>>>> @@ -5815,6 +5817,23 @@ out_fail:
>>>>       }
>>>>       if (delalloc_lock)
>>>>           mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
>>>> +    /*
>>>> +     * The number of metadata bytes is calculated by the difference
>>>> +     * between outstanding_extents and reserved_extents. Sometimes
>>>> though
>>>> +     * reserve_metadata_bytes() fails to reserve the wanted metadata
>>>> bytes,
>>>> +     * indeed it has already done some work to reclaim metadata
>>>> space, hence
>>>> +     * both outstanding_extents and reserved_extents would have
>>>> changed and
>>>> +     * the bytes we try to reserve would also has changed(may be
>>>> smaller).
>>>> +     * So here we try to reserve again. This is much useful for online
>>>> +     * dedupe, which will easily eat almost all meta space.
>>>> +     *
>>>> +     * XXX: Indeed here 3 is arbitrarily choosed, it's a good
>>>> workaround for
>>>> +     * online dedupe, later we should find a better method to avoid
>>>> dedupe
>>>> +     * enospc issue.
>>>> +     */
>>>> +    if (unlikely(ret == -ENOSPC && loops++ < 3))
>>>> +        goto again;
>>>> +
>>>>       return ret;
>>>>   }
>>>>
>>>>
>>>
>>> NAK, we aren't going to just arbitrarily retry to make our metadata
>>> reservation.  Dropping reserved metadata space by completing ordered
>>> extents should free enough to make our current reservation, and in fact
>>> this only accounts for the disparity, so should be an accurate count
>>> most of the time.  I can see a case for detecting that the disparity no
>>> longer exists and retrying in that case (we free enough ordered extents
>>> that we are no longer trying to reserve ours + overflow but now only
>>> ours) and retry in _that specific case_, but we need to limit it to this
>>> case only.  Thanks,
>>
>> Would it be OK to retry only for the dedupe enabled case?
>>
>> Currently it's only a workaround and we are still digging for the root
>> cause, but as a workaround I assume it is good enough for the dedupe
>> enabled case.
>>
>
> No, we're not going to leave things in a known broken state to come back
> to later; that just makes it so we forget about them and they sit there
> forever.  Thanks,
>
> Josef

OK, we'll investigate it and find the best fix.

BTW, we also found that extent-tree.c is using the same 3-loop code
(and that's why we chose the same method):
------
         loops = 0;
         while (delalloc_bytes && loops < 3) {
                 max_reclaim = min(delalloc_bytes, to_reclaim);
                 nr_pages = max_reclaim >> PAGE_CACHE_SHIFT;
                 btrfs_writeback_inodes_sb_nr(root, nr_pages, items);
                 /*
                  * We need to wait for the async pages to actually start
                  * before we do anything.
                  */
                 max_reclaim = atomic_read(&root->fs_info->async_delalloc_pages);
                 if (!max_reclaim)
                         goto skip_async;
------

Any idea why it's still there?

Thanks,
Qu




end of thread, other threads:[~2016-04-26  0:51 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-22  1:35 [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 01/27] btrfs: dedupe: Introduce dedupe framework and its header Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 02/27] btrfs: dedupe: Introduce function to initialize dedupe info Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 03/27] btrfs: dedupe: Introduce function to add hash into in-memory tree Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 04/27] btrfs: dedupe: Introduce function to remove hash from " Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 05/27] btrfs: delayed-ref: Add support for increasing data ref under spinlock Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 06/27] btrfs: dedupe: Introduce function to search for an existing hash Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 07/27] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 08/27] btrfs: ordered-extent: Add support for dedupe Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 09/27] btrfs: dedupe: Inband in-memory only de-duplication implement Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 10/27] btrfs: dedupe: Add basic tree structure for on-disk dedupe method Qu Wenruo
2016-03-24 20:58   ` Chris Mason
2016-03-25  1:59     ` Qu Wenruo
2016-03-25 15:11       ` Chris Mason
2016-03-26 13:11         ` Qu Wenruo
2016-03-28 14:09           ` Chris Mason
2016-03-29  1:47             ` Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 11/27] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info Qu Wenruo
2016-03-29 17:31   ` Alex Lyakas
2016-03-30  0:26     ` Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 12/27] btrfs: dedupe: Add support for on-disk hash search Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 13/27] btrfs: dedupe: Add support to delete hash for on-disk backend Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 14/27] btrfs: dedupe: Add support for adding " Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 15/27] btrfs: dedupe: Add ioctl for inband dedupelication Qu Wenruo
2016-03-22  2:29   ` kbuild test robot
2016-03-22  2:48   ` kbuild test robot
2016-03-22  1:35 ` [PATCH v8 16/27] btrfs: dedupe: add an inode nodedupe flag Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 17/27] btrfs: dedupe: add a property handler for online dedupe Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 18/27] btrfs: dedupe: add per-file online dedupe control Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 19/27] btrfs: try more times to alloc metadata reserve space Qu Wenruo
2016-04-22 18:06   ` Josef Bacik
2016-04-25  0:54     ` Qu Wenruo
2016-04-25 14:05       ` Josef Bacik
2016-04-26  0:50         ` Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 20/27] btrfs: dedupe: Fix a bug when running inband dedupe with balance Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 21/27] btrfs: Fix a memory leak in inband dedupe hash Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 22/27] btrfs: dedupe: Fix metadata balance error when dedupe is enabled Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 23/27] btrfs: dedupe: Avoid submit IO for hash hit extent Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 24/27] btrfs: dedupe: Preparation for compress-dedupe co-work Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 25/27] btrfs: dedupe: Add support for compression and dedpue Qu Wenruo
2016-03-24 20:35   ` Chris Mason
2016-03-25  1:44     ` Qu Wenruo
2016-03-25 15:12       ` Chris Mason
2016-03-22  1:35 ` [PATCH v8 26/27] btrfs: relocation: Enhance error handling to avoid BUG_ON Qu Wenruo
2016-03-22  1:35 ` [PATCH v8 27/27] btrfs: dedupe: Fix a space cache delalloc bytes underflow bug Qu Wenruo
2016-03-22 13:38 ` [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework David Sterba
2016-03-23  2:25   ` Qu Wenruo
2016-03-24 13:42     ` David Sterba
2016-03-25  1:38       ` Qu Wenruo
2016-04-04 16:55         ` David Sterba
2016-04-05  3:08           ` Qu Wenruo
2016-04-20  2:02             ` Qu Wenruo
2016-04-20 19:14               ` Chris Mason
2016-04-06  3:47           ` Nicholas D Steeves
2016-04-06  5:22             ` Qu Wenruo
2016-04-22 22:14               ` Nicholas D Steeves
2016-04-25  1:25                 ` Qu Wenruo
2016-03-29 17:22 ` Alex Lyakas
2016-03-30  0:34   ` Qu Wenruo
2016-03-30 10:36     ` Alex Lyakas
2016-04-03  8:22     ` Alex Lyakas
2016-04-05  3:51       ` Qu Wenruo
