* [PATCH 0/3] btrfs: qgroup fixes for btrfs_drop_snapshot V3
@ 2014-07-07 22:09 Mark Fasheh
  2014-07-07 22:09 ` btrfs: add trace for qgroup accounting Mark Fasheh
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-07-07 22:09 UTC
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Mark Fasheh

Hi, the following patches try to fix a long-standing issue with qgroups
and snapshot deletion. The core problem is that btrfs_drop_snapshot will
skip shared extents during its tree walk. This results in an inconsistent
qgroup state once the drop is processed.
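
For context, the skip happens in do_walk_down(); a condensed sketch of the
pre-patch logic, not the verbatim kernel code (compare the context lines in
patch 2):

	/* DROP_REFERENCE stage: a block shared with another root
	 * (refs > 1) is skipped entirely, so nothing below it was
	 * ever qgroup-accounted before this series. */
	if (wc->stage == DROP_REFERENCE) {
		if (wc->refs[level - 1] > 1) {
			if (level == 1 &&
			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
				goto skip;
			/* ... further checks, then goto skip ... */
		}
	}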

The first patch adds some tracing which I found very useful in debugging
qgroup operations. The second patch is the actual fix for the problem. A
third patch, from Josef, is also included; we need it because it fixes at
least one set of inconsistencies qgroups can reach via drop_snapshot.

With this version of the patch series, I can no longer reproduce
qgroup inconsistencies via drop_snapshot on my test disks.

Changes from last patch set:

- search on bytenr and root, but not seq, in btrfs_qgroup_record_ref when
  we're looking for existing qgroup operations.

Changes before that (V1-V2):

- remove extra extent_buffer_uptodate call from account_shared_subtree()

- catch return values for the accounting calls now and do the right thing
  (log an error and tell the user to rescan)

- remove the loop on roots in qgroup_subtree_accounting and just use the
  nnodes member to make our first decision.

- Don't queue up the subtree root for a change (the code in drop_snapshot
  handles qgroup updates for this block).

- only walk subtrees if we're actually in the DROP_REFERENCE stage and we're
  going to call btrfs_free_extent

- account leaf items for level zero blocks that we are dropping in
  walk_up_proc


General qgroups TODO:

- We need an xfstest for the drop_snapshot case; otherwise I'm
  concerned that we can easily regress from bugs introduced via
  seemingly unrelated patches. This stuff can be fragile.
   - I already have a script that creates and removes a level 1 tree to
     introduce an inconsistency. I think adapting that is probably a good
     first step. The script can be found at:

     http://zeniv.linux.org.uk/~mfasheh/create-btrfs-trees.sh

     Please don't make fun of my poor shell scripting skills  :)

- qgroup items are not deleted after drop_snapshot. They stay orphaned on
  disk, often with nonzero values in their count fields. This is something
  for another patch. Josef and I have some ideas for how to deal with this
  (the first one is sketched after this list):
   - Just zero them out at the end of drop_snapshot (maybe in the future we
     could actually then delete them from disk?)
   - update btrfs_subtree_accounting() to remove bytes from the
     being-deleted qgroups so they wind up as zero on disk (this is
     preferable but might not be practical)

- we need at least a rescan to be kicked off when adding parent qgroups.
  Otherwise, the newly added groups start with the wrong information. Quite
  possibly the rescan itself might need to be updated (I haven't tested this
  enough).

- qgroup hierarchies in general don't seem fully implemented yet. Once we
  fix the previous items, the code that updates their counts will probably
  need some love.
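
A minimal sketch of the zero-them-out idea mentioned above, using only
helpers that already appear in this series (find_qgroup_rb(),
qgroup_dirty()); the function name is hypothetical and the locking mirrors
qgroup_subtree_accounting():

	/* hypothetical: zero the counts of a dropped subvolume's
	 * qgroup at the end of btrfs_drop_snapshot() */
	static void qgroup_zero_counts(struct btrfs_fs_info *fs_info,
				       u64 root_objectid)
	{
		struct btrfs_qgroup *qg;

		spin_lock(&fs_info->qgroup_lock);
		qg = find_qgroup_rb(fs_info, root_objectid);
		if (qg) {
			qg->rfer = qg->rfer_cmpr = 0;
			qg->excl = qg->excl_cmpr = 0;
			qgroup_dirty(fs_info, qg);
		}
		spin_unlock(&fs_info->qgroup_lock);
	}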

Please review, thanks. Diffstat follows,
	--Mark

 fs/btrfs/ctree.c             |   20 +--
 fs/btrfs/ctree.h             |    4 
 fs/btrfs/extent-tree.c       |  285 +++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/qgroup.c            |  168 +++++++++++++++++++++++++
 fs/btrfs/qgroup.h            |    1 
 fs/btrfs/super.c             |    1 
 include/trace/events/btrfs.h |   57 ++++++++
 7 files changed, 511 insertions(+), 25 deletions(-)


* btrfs: add trace for qgroup accounting
  2014-07-07 22:09 [PATCH 0/3] btrfs: qgroup fixes for btrfs_drop_snapshot V3 Mark Fasheh
@ 2014-07-07 22:09 ` Mark Fasheh
  2014-07-07 22:09 ` btrfs: qgroup: account shared subtrees during snapshot delete Mark Fasheh
  2014-07-07 22:10 ` [PATCH] Btrfs: __btrfs_mod_ref should always use no_quota Mark Fasheh
  2 siblings, 0 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-07-07 22:09 UTC
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Mark Fasheh

We want this to debug qgroup changes on live systems.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/qgroup.c            |  3 +++
 fs/btrfs/super.c             |  1 +
 include/trace/events/btrfs.h | 56 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 60 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index cf5aead..a9f0f05 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1290,6 +1290,7 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
 	INIT_LIST_HEAD(&oper->elem.list);
 	oper->elem.seq = 0;
+	trace_btrfs_qgroup_record_ref(oper);
 	ret = insert_qgroup_oper(fs_info, oper);
 	if (ret) {
 		/* Shouldn't happen so have an assert for developers */
@@ -1909,6 +1910,8 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 
 	ASSERT(is_fstree(oper->ref_root));
 
+	trace_btrfs_qgroup_account(oper);
+
 	switch (oper->type) {
 	case BTRFS_QGROUP_OPER_ADD_EXCL:
 	case BTRFS_QGROUP_OPER_SUB_EXCL:
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4662d92..ca7836c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -60,6 +60,7 @@
 #include "backref.h"
 #include "tests/btrfs-tests.h"
 
+#include "qgroup.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 4ee4e30..b8774b3 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -23,6 +23,7 @@ struct map_lookup;
 struct extent_buffer;
 struct btrfs_work;
 struct __btrfs_workqueue;
+struct btrfs_qgroup_operation;
 
 #define show_ref_type(type)						\
 	__print_symbolic(type,						\
@@ -1119,6 +1120,61 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 	TP_ARGS(wq)
 );
 
+#define show_oper_type(type)						\
+	__print_symbolic(type,						\
+		{ BTRFS_QGROUP_OPER_ADD_EXCL, 	"OPER_ADD_EXCL" },	\
+		{ BTRFS_QGROUP_OPER_ADD_SHARED, "OPER_ADD_SHARED" },	\
+		{ BTRFS_QGROUP_OPER_SUB_EXCL, 	"OPER_SUB_EXCL" },	\
+		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" })
+
+DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper),
+
+	TP_STRUCT__entry(
+		__field(	u64,  ref_root		)
+		__field(	u64,  bytenr		)
+		__field(	u64,  num_bytes		)
+		__field(	u64,  seq		)
+		__field(	int,  type		)
+		__field(	u64,  elem_seq		)
+	),
+
+	TP_fast_assign(
+		__entry->ref_root	= oper->ref_root;
+		__entry->bytenr		= oper->bytenr;
+		__entry->num_bytes	= oper->num_bytes;
+		__entry->seq 		= oper->seq;
+		__entry->type		= oper->type;
+		__entry->elem_seq	= oper->elem.seq;
+	),
+
+	TP_printk("ref_root = %llu, bytenr = %llu, num_bytes = %llu, "
+		  "seq = %llu, elem.seq = %llu, type = %s",
+		  (unsigned long long)__entry->ref_root,
+		  (unsigned long long)__entry->bytenr,
+		  (unsigned long long)__entry->num_bytes,
+		  (unsigned long long)__entry->seq,
+		  (unsigned long long)__entry->elem_seq,
+		  show_oper_type(__entry->type))
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_account,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper)
+);
+
+DEFINE_EVENT(btrfs_qgroup_oper, btrfs_qgroup_record_ref,
+
+	TP_PROTO(struct btrfs_qgroup_operation *oper),
+
+	TP_ARGS(oper)
+);
+
 #endif /* _TRACE_BTRFS_H */
 
 /* This part must be outside protection */
-- 
1.8.4.5
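
As a usage note, the two events added here (btrfs_qgroup_record_ref and
btrfs_qgroup_account) can be enabled like any other btrfs tracepoint. A
minimal C sketch, assuming tracefs is mounted at the conventional
/sys/kernel/debug/tracing (echo 1 > .../enable from a shell does the same):

#include <stdio.h>
#include <stdlib.h>

#define TRACEFS "/sys/kernel/debug/tracing"

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(val, f);
	fclose(f);
}

int main(void)
{
	char line[1024];
	FILE *tp;

	write_str(TRACEFS "/events/btrfs/btrfs_qgroup_record_ref/enable", "1");
	write_str(TRACEFS "/events/btrfs/btrfs_qgroup_account/enable", "1");

	/* trace_pipe blocks until events arrive; each event is printed
	 * with the TP_printk format above, e.g. "ref_root = 5, ..." */
	tp = fopen(TRACEFS "/trace_pipe", "r");
	if (!tp) {
		perror("trace_pipe");
		return 1;
	}
	while (fgets(line, sizeof(line), tp))
		fputs(line, stdout);
	fclose(tp);
	return 0;
}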



* btrfs: qgroup: account shared subtrees during snapshot delete
  2014-07-07 22:09 [PATCH 0/3] btrfs: qgroup fixes for btrfs_drop_snapshot V3 Mark Fasheh
  2014-07-07 22:09 ` btrfs: add trace for qgroup accounting Mark Fasheh
@ 2014-07-07 22:09 ` Mark Fasheh
  2014-07-07 22:10 ` [PATCH] Btrfs: __btrfs_mod_ref should always use no_quota Mark Fasheh
  2 siblings, 0 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-07-07 22:09 UTC
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Mark Fasheh

During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on its
exclusive counts.
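
Condensed, the decision the new handler makes (qgroup_subtree_accounting()
in the diff below) looks like this; error handling, the tree_mod_seq
bookkeeping, and the parent-qgroup walk are omitted here:

	struct ulist *roots = NULL;
	struct ulist_node *unode;
	struct ulist_iterator uiter;
	struct btrfs_qgroup *qg;

	btrfs_find_all_roots(trans, fs_info, oper->bytenr, elem.seq, &roots);
	if (roots->nnodes != 1)
		goto out;	/* still shared by several roots, nothing to do */

	ULIST_ITER_INIT(&uiter);
	unode = ulist_next(roots, &uiter);
	if (unode->val == oper->ref_root)
		goto out;	/* our own refs to the extent aren't all gone yet */

	/* exactly one other root remains: it now owns the extent
	 * exclusively, so add the bytes to its exclusive counts */
	qg = find_qgroup_rb(fs_info, unode->val);
	qg->excl += oper->num_bytes;
	qg->excl_cmpr += oper->num_bytes;
	qgroup_dirty(fs_info, qg);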

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Josef Bacik <jbacik@fb.com>
---
 fs/btrfs/extent-tree.c       | 261 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.c            | 165 +++++++++++++++++++++++++++
 fs/btrfs/qgroup.h            |   1 +
 include/trace/events/btrfs.h |   3 +-
 4 files changed, 429 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46f39bf..3f43e9a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7324,6 +7324,220 @@ reada:
 	wc->reada_slot = slot;
 }
 
+static int account_leaf_items(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root,
+			      struct extent_buffer *eb)
+{
+	int nr = btrfs_header_nritems(eb);
+	int i, extent_type, ret;
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *fi;
+	u64 bytenr, num_bytes;
+
+	for (i = 0; i < nr; i++) {
+		btrfs_item_key_to_cpu(eb, &key, i);
+
+		if (key.type != BTRFS_EXTENT_DATA_KEY)
+			continue;
+
+		fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
+		/* filter out non qgroup-accountable extents  */
+		extent_type = btrfs_file_extent_type(eb, fi);
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+			continue;
+
+		bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+		if (!bytenr)
+			continue;
+
+		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+					      root->objectid,
+					      bytenr, num_bytes,
+					      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/*
+ * Walk up the tree from the bottom, freeing leaves and any interior
+ * nodes which have had all slots visited. If a node (leaf or
+ * interior) is freed, the node above it will have its slot
+ * incremented. The root node will never be freed.
+ *
+ * At the end of this function, we should have a path which has all
+ * slots incremented to the next position for a search. If we need to
+ * read a new node it will be NULL and the node above it will have the
+ * correct slot selected for a later read.
+ *
+ * If we increment the root node's slot counter past the number of
+ * elements, 1 is returned to signal completion of the search.
+ */
+static int adjust_slots_upwards(struct btrfs_root *root,
+				struct btrfs_path *path, int root_level)
+{
+	int level = 0;
+	int nr, slot;
+	struct extent_buffer *eb;
+
+	if (root_level == 0)
+		return 1;
+
+	while (level <= root_level) {
+		eb = path->nodes[level];
+		nr = btrfs_header_nritems(eb);
+		path->slots[level]++;
+		slot = path->slots[level];
+		if (slot >= nr || level == 0) {
+			/*
+			 * Don't free the root -  we will detect this
+			 * condition after our loop and return a
+			 * positive value for caller to stop walking the tree.
+			 */
+			if (level != root_level) {
+				btrfs_tree_unlock_rw(eb, path->locks[level]);
+				path->locks[level] = 0;
+
+				free_extent_buffer(eb);
+				path->nodes[level] = NULL;
+				path->slots[level] = 0;
+			}
+		} else {
+			/*
+			 * We have a valid slot to walk back down
+			 * from. Stop here so caller can process these
+			 * new nodes.
+			 */
+			break;
+		}
+
+		level++;
+	}
+
+	eb = path->nodes[root_level];
+	if (path->slots[root_level] >= btrfs_header_nritems(eb))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * root_eb is the subtree root and is locked before this function is called.
+ */
+static int account_shared_subtree(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct extent_buffer *root_eb,
+				  u64 root_gen,
+				  int root_level)
+{
+	int ret = 0;
+	int level;
+	struct extent_buffer *eb = root_eb;
+	struct btrfs_path *path = NULL;
+
+	BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
+	BUG_ON(root_eb == NULL);
+
+	if (!root->fs_info->quota_enabled)
+		return 0;
+
+	if (!extent_buffer_uptodate(root_eb)) {
+		ret = btrfs_read_buffer(root_eb, root_gen);
+		if (ret)
+			goto out;
+	}
+
+	if (root_level == 0) {
+		ret = account_leaf_items(trans, root, root_eb);
+		goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Walk down the tree.  Missing extent blocks are filled in as
+	 * we go. Metadata is accounted every time we read a new
+	 * extent block.
+	 *
+	 * When we reach a leaf, we account for file extent items in it,
+	 * walk back up the tree (adjusting slot pointers as we go)
+	 * and restart the search process.
+	 */
+	extent_buffer_get(root_eb); /* For path */
+	path->nodes[root_level] = root_eb;
+	path->slots[root_level] = 0;
+	path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
+walk_down:
+	level = root_level;
+	while (level >= 0) {
+		if (path->nodes[level] == NULL) {
+			int child_bsize = btrfs_level_size(root, level);
+			int parent_slot;
+			u64 child_gen;
+			u64 child_bytenr;
+
+			/* We need to get child blockptr/gen from
+			 * parent before we can read it. */
+			eb = path->nodes[level + 1];
+			parent_slot = path->slots[level + 1];
+			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+
+			eb = read_tree_block(root, child_bytenr, child_bsize,
+					     child_gen);
+			if (!eb || !extent_buffer_uptodate(eb)) {
+				ret = -EIO;
+				goto out;
+			}
+
+			path->nodes[level] = eb;
+			path->slots[level] = 0;
+
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+						root->objectid,
+						child_bytenr,
+						child_bsize,
+						BTRFS_QGROUP_OPER_SUB_SUBTREE,
+						0);
+			if (ret)
+				goto out;
+
+		}
+
+		if (level == 0) {
+			ret = account_leaf_items(trans, root, path->nodes[level]);
+			if (ret)
+				goto out;
+
+			/* Nonzero return here means we completed our search */
+			ret = adjust_slots_upwards(root, path, root_level);
+			if (ret)
+				break;
+
+			/* Restart search with new slots */
+			goto walk_down;
+		}
+
+		level--;
+	}
+
+	ret = 0;
+out:
+	btrfs_free_path(path);
+
+	return ret;
+}
+
 /*
  * helper to process tree block while walking down the tree.
  *
@@ -7427,6 +7641,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 	int level = wc->level;
 	int reada = 0;
 	int ret = 0;
+	bool need_account = false;
 
 	generation = btrfs_node_ptr_generation(path->nodes[level],
 					       path->slots[level]);
@@ -7472,6 +7687,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 
 	if (wc->stage == DROP_REFERENCE) {
 		if (wc->refs[level - 1] > 1) {
+			need_account = true;
 			if (level == 1 &&
 			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
 				goto skip;
@@ -7535,6 +7751,16 @@ skip:
 			parent = 0;
 		}
 
+		if (need_account) {
+			ret = account_shared_subtree(trans, root, next,
+						     generation, level - 1);
+			if (ret) {
+				printk_ratelimited(KERN_ERR "BTRFS: %s Error "
+					"%d accounting shared subtree. Quota "
+					"is out of sync, rescan required.\n",
+					root->fs_info->sb->s_id, ret);
+			}
+		}
 		ret = btrfs_free_extent(trans, root, bytenr, blocksize, parent,
 				root->root_key.objectid, level - 1, 0, 0);
 		BUG_ON(ret); /* -ENOMEM */
@@ -7621,6 +7847,13 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
 				ret = btrfs_dec_ref(trans, root, eb, 0,
 						    wc->for_reloc);
 			BUG_ON(ret); /* -ENOMEM */
+			ret = account_leaf_items(trans, root, eb);
+			if (ret) {
+				printk_ratelimited(KERN_ERR "BTRFS: %s Error "
+					"%d accounting leaf items. Quota "
+					"is out of sync, rescan required.\n",
+					root->fs_info->sb->s_id, ret);
+			}
 		}
 		/* make block locked assertion in clean_tree_block happy */
 		if (!path->locks[level] &&
@@ -7746,6 +7979,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	int level;
 	bool root_dropped = false;
 
+	btrfs_debug(root->fs_info, "Drop subvolume %llu", root->objectid);
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		err = -ENOMEM;
@@ -7871,6 +8106,24 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 				goto out_end_trans;
 			}
 
+			/*
+			 * Qgroup update accounting is run from
+			 * delayed ref handling. This usually works
+			 * out because delayed refs are normally the
+			 * only way qgroup updates are added. However,
+			 * we may have added updates during our tree
+			 * walk so run qgroups here to make sure we
+			 * don't lose any updates.
+			 */
+			ret = btrfs_delayed_qgroup_accounting(trans,
+							      root->fs_info);
+			if (ret)
+				printk_ratelimited(KERN_ERR "BTRFS: Failure %d "
+						   "running qgroup updates "
+						   "during snapshot delete. "
+						   "Quota is out of sync, "
+						   "rescan required.\n", ret);
+
 			btrfs_end_transaction_throttle(trans, tree_root);
 			if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
 				pr_debug("BTRFS: drop snapshot early exit\n");
@@ -7924,6 +8177,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	}
 	root_dropped = true;
 out_end_trans:
+	ret = btrfs_delayed_qgroup_accounting(trans, root->fs_info);
+	if (ret)
+		printk_ratelimited(KERN_ERR "BTRFS: Failure %d "
+				   "running qgroup updates "
+				   "during snapshot delete. "
+				   "Quota is out of sync, "
+				   "rescan required.\n", ret);
+
 	btrfs_end_transaction_throttle(trans, tree_root);
 out_free:
 	kfree(wc);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a9f0f05..9b0cb93 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1201,6 +1201,50 @@ out:
 	mutex_unlock(&fs_info->qgroup_ioctl_lock);
 	return ret;
 }
+
+static int comp_oper_exist(struct btrfs_qgroup_operation *oper1,
+			   struct btrfs_qgroup_operation *oper2)
+{
+	/*
+	 * Ignore seq and type here, we're looking for any operation
+	 * at all related to this extent on that root.
+	 */
+	if (oper1->bytenr < oper2->bytenr)
+		return -1;
+	if (oper1->bytenr > oper2->bytenr)
+		return 1;
+	if (oper1->ref_root < oper2->ref_root)
+		return -1;
+	if (oper1->ref_root > oper2->ref_root)
+		return 1;
+	return 0;
+}
+
+static int qgroup_oper_exists(struct btrfs_fs_info *fs_info,
+			      struct btrfs_qgroup_operation *oper)
+{
+	struct rb_node *n;
+	struct btrfs_qgroup_operation *cur;
+	int cmp;
+
+	spin_lock(&fs_info->qgroup_op_lock);
+	n = fs_info->qgroup_op_tree.rb_node;
+	while (n) {
+		cur = rb_entry(n, struct btrfs_qgroup_operation, n);
+		cmp = comp_oper_exist(cur, oper);
+		if (cmp < 0) {
+			n = n->rb_right;
+		} else if (cmp) {
+			n = n->rb_left;
+		} else {
+			spin_unlock(&fs_info->qgroup_op_lock);
+			return -EEXIST;
+		}
+	}
+	spin_unlock(&fs_info->qgroup_op_lock);
+	return 0;
+}
+
 static int comp_oper(struct btrfs_qgroup_operation *oper1,
 		     struct btrfs_qgroup_operation *oper2)
 {
@@ -1290,7 +1334,25 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
 	INIT_LIST_HEAD(&oper->elem.list);
 	oper->elem.seq = 0;
+
 	trace_btrfs_qgroup_record_ref(oper);
+
+	if (type == BTRFS_QGROUP_OPER_SUB_SUBTREE) {
+		/*
+		 * If any operation for this bytenr/ref_root combo
+		 * exists, then we know it's not exclusively owned and
+		 * shouldn't be queued up.
+		 *
+		 * This also catches the case where we have a cloned
+		 * extent that gets queued up multiple times during
+		 * drop snapshot.
+		 */
+		if (qgroup_oper_exists(fs_info, oper)) {
+			kfree(oper);
+			return 0;
+		}
+	}
+
 	ret = insert_qgroup_oper(fs_info, oper);
 	if (ret) {
 		/* Shouldn't happen so have an assert for developers */
@@ -1883,6 +1945,106 @@ out:
 }
 
 /*
+ * Process a reference to a shared subtree. This type of operation is
+ * queued during snapshot removal when we encounter extents which are
+ * shared between more than one root.
+ */
+static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
+				     struct btrfs_fs_info *fs_info,
+				     struct btrfs_qgroup_operation *oper)
+{
+	struct ulist *roots = NULL;
+	struct ulist_node *unode;
+	struct ulist_iterator uiter;
+	struct btrfs_qgroup_list *glist;
+	struct ulist *parents;
+	int ret = 0;
+	struct btrfs_qgroup *qg;
+	u64 root_obj = 0;
+	struct seq_list elem = {};
+
+	parents = ulist_alloc(GFP_NOFS);
+	if (!parents)
+		return -ENOMEM;
+
+	btrfs_get_tree_mod_seq(fs_info, &elem);
+	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr,
+				   elem.seq, &roots);
+	btrfs_put_tree_mod_seq(fs_info, &elem);
+	if (ret < 0)
+		return ret;
+
+	if (roots->nnodes != 1)
+		goto out;
+
+	ULIST_ITER_INIT(&uiter);
+	unode = ulist_next(roots, &uiter); /* Only want 1 so no need to loop */
+	/*
+	 * If we find our ref root then that means all refs
+	 * this extent has to the root have not yet been
+	 * deleted. In that case, we do nothing and let the
+	 * last ref for this bytenr drive our update.
+	 *
+	 * This can happen for example if an extent is
+	 * referenced multiple times in a snapshot (clone,
+	 * etc). If we are in the middle of snapshot removal,
+	 * queued updates for such an extent will find the
+	 * root if we have not yet finished removing the
+	 * snapshot.
+	 */
+	if (unode->val == oper->ref_root)
+		goto out;
+
+	root_obj = unode->val;
+	BUG_ON(!root_obj);
+
+	spin_lock(&fs_info->qgroup_lock);
+	qg = find_qgroup_rb(fs_info, root_obj);
+	if (!qg)
+		goto out_unlock;
+
+	qg->excl += oper->num_bytes;
+	qg->excl_cmpr += oper->num_bytes;
+	qgroup_dirty(fs_info, qg);
+
+	/*
+	 * Adjust counts for parent groups. First we find all
+	 * parents, then in the 2nd loop we do the adjustment
+	 * while adding parents of the parents to our ulist.
+	 */
+	list_for_each_entry(glist, &qg->groups, next_group) {
+		ret = ulist_add(parents, glist->group->qgroupid,
+				ptr_to_u64(glist->group), GFP_ATOMIC);
+		if (ret < 0)
+			goto out_unlock;
+	}
+
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(parents, &uiter))) {
+		qg = u64_to_ptr(unode->aux);
+		qg->excl += oper->num_bytes;
+		qg->excl_cmpr += oper->num_bytes;
+		qgroup_dirty(fs_info, qg);
+
+		/* Add any parents of the parents */
+		list_for_each_entry(glist, &qg->groups, next_group) {
+			ret = ulist_add(parents, glist->group->qgroupid,
+					ptr_to_u64(glist->group), GFP_ATOMIC);
+			if (ret < 0)
+				goto out_unlock;
+		}
+	}
+
+out_unlock:
+	spin_unlock(&fs_info->qgroup_lock);
+
+out:
+	ulist_free(roots);
+	ulist_free(parents);
+	return ret;
+}
+
+/*
  * btrfs_qgroup_account_ref is called for every ref that is added to or deleted
  * from the fs. First, all roots referencing the extent are searched, and
  * then the space is accounted accordingly to the different roots. The
@@ -1921,6 +2083,9 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 	case BTRFS_QGROUP_OPER_SUB_SHARED:
 		ret = qgroup_shared_accounting(trans, fs_info, oper);
 		break;
+	case BTRFS_QGROUP_OPER_SUB_SUBTREE:
+		ret = qgroup_subtree_accounting(trans, fs_info, oper);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 5952ff1..18cc68c 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -44,6 +44,7 @@ enum btrfs_qgroup_operation_type {
 	BTRFS_QGROUP_OPER_ADD_SHARED,
 	BTRFS_QGROUP_OPER_SUB_EXCL,
 	BTRFS_QGROUP_OPER_SUB_SHARED,
+	BTRFS_QGROUP_OPER_SUB_SUBTREE,
 };
 
 struct btrfs_qgroup_operation {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b8774b3..96daac6 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1125,7 +1125,8 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 		{ BTRFS_QGROUP_OPER_ADD_EXCL, 	"OPER_ADD_EXCL" },	\
 		{ BTRFS_QGROUP_OPER_ADD_SHARED, "OPER_ADD_SHARED" },	\
 		{ BTRFS_QGROUP_OPER_SUB_EXCL, 	"OPER_SUB_EXCL" },	\
-		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" })
+		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" },	\
+		{ BTRFS_QGROUP_OPER_SUB_SUBTREE,"OPER_SUB_SUBTREE" })
 
 DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
 
-- 
1.8.4.5



* [PATCH] Btrfs: __btrfs_mod_ref should always use no_quota
  2014-07-07 22:09 [PATCH 0/3] btrfs: qgroup fixes for btrfs_drop_snapshot V3 Mark Fasheh
  2014-07-07 22:09 ` btrfs: add trace for qgroup accounting Mark Fasheh
  2014-07-07 22:09 ` btrfs: qgroup: account shared subtrees during snapshot delete Mark Fasheh
@ 2014-07-07 22:10 ` Mark Fasheh
  2 siblings, 0 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-07-07 22:10 UTC
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Mark Fasheh

From: Josef Bacik <jbacik@fb.com>

Earlier I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
understand how snapshot delete was using it and assumed that we needed the
quota operations there.  With Mark's work this has turned out not to be the
case; we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
argument and make __btrfs_mod_ref call its process function with no_quota
always set.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/ctree.c       | 20 ++++++++++----------
 fs/btrfs/ctree.h       |  4 ++--
 fs/btrfs/extent-tree.c | 24 +++++++++++-------------
 3 files changed, 23 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d99d965..d9e0ce0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -280,9 +280,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 
 	WARN_ON(btrfs_header_generation(buf) > trans->transid);
 	if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
-		ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+		ret = btrfs_inc_ref(trans, root, cow, 1);
 	else
-		ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+		ret = btrfs_inc_ref(trans, root, cow, 0);
 
 	if (ret)
 		return ret;
@@ -1035,14 +1035,14 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 		if ((owner == root->root_key.objectid ||
 		     root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) &&
 		    !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
-			ret = btrfs_inc_ref(trans, root, buf, 1, 1);
+			ret = btrfs_inc_ref(trans, root, buf, 1);
 			BUG_ON(ret); /* -ENOMEM */
 
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID) {
-				ret = btrfs_dec_ref(trans, root, buf, 0, 1);
+				ret = btrfs_dec_ref(trans, root, buf, 0);
 				BUG_ON(ret); /* -ENOMEM */
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 				BUG_ON(ret); /* -ENOMEM */
 			}
 			new_flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
@@ -1050,9 +1050,9 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID)
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 			else
-				ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 0);
 			BUG_ON(ret); /* -ENOMEM */
 		}
 		if (new_flags != 0) {
@@ -1069,11 +1069,11 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 		if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
 			if (root->root_key.objectid ==
 			    BTRFS_TREE_RELOC_OBJECTID)
-				ret = btrfs_inc_ref(trans, root, cow, 1, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 1);
 			else
-				ret = btrfs_inc_ref(trans, root, cow, 0, 1);
+				ret = btrfs_inc_ref(trans, root, cow, 0);
 			BUG_ON(ret); /* -ENOMEM */
-			ret = btrfs_dec_ref(trans, root, buf, 1, 1);
+			ret = btrfs_dec_ref(trans, root, buf, 1);
 			BUG_ON(ret); /* -ENOMEM */
 		}
 		clean_tree_block(trans, root, buf);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4896d7a..56f280f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3307,9 +3307,9 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 			 struct btrfs_key *ins, int is_data);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota);
+		  struct extent_buffer *buf, int full_backref);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota);
+		  struct extent_buffer *buf, int full_backref);
 int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				u64 bytenr, u64 num_bytes, u64 flags,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3f43e9a..e0e8c3f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2977,7 +2977,7 @@ out:
 static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct extent_buffer *buf,
-			   int full_backref, int inc, int no_quota)
+			   int full_backref, int inc)
 {
 	u64 bytenr;
 	u64 num_bytes;
@@ -3031,7 +3031,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 			key.offset -= btrfs_file_extent_offset(buf, fi);
 			ret = process_func(trans, root, bytenr, num_bytes,
 					   parent, ref_root, key.objectid,
-					   key.offset, no_quota);
+					   key.offset, 1);
 			if (ret)
 				goto fail;
 		} else {
@@ -3039,7 +3039,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
 			num_bytes = btrfs_level_size(root, level - 1);
 			ret = process_func(trans, root, bytenr, num_bytes,
 					   parent, ref_root, level - 1, 0,
-					   no_quota);
+					   1);
 			if (ret)
 				goto fail;
 		}
@@ -3050,15 +3050,15 @@ fail:
 }
 
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota)
+		  struct extent_buffer *buf, int full_backref)
 {
-	return __btrfs_mod_ref(trans, root, buf, full_backref, 1, no_quota);
+	return __btrfs_mod_ref(trans, root, buf, full_backref, 1);
 }
 
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		  struct extent_buffer *buf, int full_backref, int no_quota)
+		  struct extent_buffer *buf, int full_backref)
 {
-	return __btrfs_mod_ref(trans, root, buf, full_backref, 0, no_quota);
+	return __btrfs_mod_ref(trans, root, buf, full_backref, 0);
 }
 
 static int write_one_cache_group(struct btrfs_trans_handle *trans,
@@ -7592,9 +7592,9 @@ static noinline int walk_down_proc(struct btrfs_trans_handle *trans,
 	/* wc->stage == UPDATE_BACKREF */
 	if (!(wc->flags[level] & flag)) {
 		BUG_ON(!path->locks[level]);
-		ret = btrfs_inc_ref(trans, root, eb, 1, wc->for_reloc);
+		ret = btrfs_inc_ref(trans, root, eb, 1);
 		BUG_ON(ret); /* -ENOMEM */
-		ret = btrfs_dec_ref(trans, root, eb, 0, wc->for_reloc);
+		ret = btrfs_dec_ref(trans, root, eb, 0);
 		BUG_ON(ret); /* -ENOMEM */
 		ret = btrfs_set_disk_extent_flags(trans, root, eb->start,
 						  eb->len, flag,
@@ -7841,11 +7841,9 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
 	if (wc->refs[level] == 1) {
 		if (level == 0) {
 			if (wc->flags[level] & BTRFS_BLOCK_FLAG_FULL_BACKREF)
-				ret = btrfs_dec_ref(trans, root, eb, 1,
-						    wc->for_reloc);
+				ret = btrfs_dec_ref(trans, root, eb, 1);
 			else
-				ret = btrfs_dec_ref(trans, root, eb, 0,
-						    wc->for_reloc);
+				ret = btrfs_dec_ref(trans, root, eb, 0);
 			BUG_ON(ret); /* -ENOMEM */
 			ret = account_leaf_items(trans, root, eb);
 			if (ret) {
-- 
1.8.4.5



* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2015-02-01 20:51 btrfs: qgroup: account shared subtrees during snapshot delete Dan Carpenter
@ 2015-02-06 22:16 ` Mark Fasheh
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Fasheh @ 2015-02-06 22:16 UTC
  To: Dan Carpenter; +Cc: linux-btrfs

On Sun, Feb 01, 2015 at 11:51:19PM +0300, Dan Carpenter wrote:
> Hello Mark Fasheh,
> 
> The patch 1152651a0817: "btrfs: qgroup: account shared subtrees
> during snapshot delete" from Jul 17, 2014, leads to the following
> static checker warning:

What checker are you using?


> 	fs/btrfs/extent-tree.c:7642 account_shared_subtree()
> 	error: off-by-one overflow 'path->nodes' size 8.  index range = '1-8'
> 
> fs/btrfs/extent-tree.c
>   7611          BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
> 
> At first I thought that I could just change this > to >= to fix this
> warning.

That change would be appropriate as far as I can tell.



>   7612          BUG_ON(root_eb == NULL);
>   7613  
>   7614          if (!root->fs_info->quota_enabled)
>   7615                  return 0;
>   7616  
>   7617          if (!extent_buffer_uptodate(root_eb)) {
>   7618                  ret = btrfs_read_buffer(root_eb, root_gen);
>   7619                  if (ret)
>   7620                          goto out;
>   7621          }
>   7622  
>   7623          if (root_level == 0) {
>   7624                  ret = account_leaf_items(trans, root, root_eb);
>   7625                  goto out;
>   7626          }
>   7627  
>   7628          path = btrfs_alloc_path();
>   7629          if (!path)
>   7630                  return -ENOMEM;
>   7631  
>   7632          /*
>   7633           * Walk down the tree.  Missing extent blocks are filled in as
>   7634           * we go. Metadata is accounted every time we read a new
>   7635           * extent block.
>   7636           *
>   7637           * When we reach a leaf, we account for file extent items in it,
>   7638           * walk back up the tree (adjusting slot pointers as we go)
>   7639           * and restart the search process.
>   7640           */
>   7641          extent_buffer_get(root_eb); /* For path */
>   7642          path->nodes[root_level] = root_eb;
> 
> ->nodes[] has BTRFS_MAX_LEVEL elements.
> 
>   7643          path->slots[root_level] = 0;
>   7644          path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
>   7645  walk_down:
>   7646          level = root_level;
>   7647          while (level >= 0) {
>   7648                  if (path->nodes[level] == NULL) {
>   7649                          int parent_slot;
>   7650                          u64 child_gen;
>   7651                          u64 child_bytenr;
>   7652  
>   7653                          /* We need to get child blockptr/gen from
>   7654                           * parent before we can read it. */
>   7655                          eb = path->nodes[level + 1];
>                                          ^^^^^^^^^^^^^^^^^^
> But when I changed that, it introduced a warning here because we
> add one.  I'm not sure what to do.

This code here is correct, and we should not be able to over/underflow the
nodes[] array so long as we guard against root_level taking an unreasonable
value. If you see above, we set path->nodes[root_level] to a non-NULL
value. Since it is always non-NULL, the code you are worried about here will
never be executed for root_level.
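
Concretely, the tightened check would be (nodes[] has BTRFS_MAX_LEVEL
entries, so valid indices are 0 .. BTRFS_MAX_LEVEL - 1):

	BUG_ON(root_level < 0 || root_level >= BTRFS_MAX_LEVEL);

and the level + 1 access stays in bounds because path->nodes[root_level]
is assigned before the walk_down loop:

	/* only reachable for level < root_level, so the index below
	 * is at most root_level, i.e. within bounds */
	if (path->nodes[level] == NULL)
		eb = path->nodes[level + 1];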

Regarding changing it, I really don't care so long as you don't break the
code. It was extremely difficult and time-consuming to test this and get it
all correct (well, as correct as it is).

Thanks,
	--Mark

--
Mark Fasheh


* re: btrfs: qgroup: account shared subtrees during snapshot delete
@ 2015-02-01 20:51 Dan Carpenter
  2015-02-06 22:16 ` Mark Fasheh
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Carpenter @ 2015-02-01 20:51 UTC
  To: mfasheh; +Cc: linux-btrfs

Hello Mark Fasheh,

The patch 1152651a0817: "btrfs: qgroup: account shared subtrees
during snapshot delete" from Jul 17, 2014, leads to the following
static checker warning:

	fs/btrfs/extent-tree.c:7642 account_shared_subtree()
	error: off-by-one overflow 'path->nodes' size 8.  index range = '1-8'

fs/btrfs/extent-tree.c
  7611          BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);

At first I thought that I could just change this > to >= to fix this
warning.

  7612          BUG_ON(root_eb == NULL);
  7613  
  7614          if (!root->fs_info->quota_enabled)
  7615                  return 0;
  7616  
  7617          if (!extent_buffer_uptodate(root_eb)) {
  7618                  ret = btrfs_read_buffer(root_eb, root_gen);
  7619                  if (ret)
  7620                          goto out;
  7621          }
  7622  
  7623          if (root_level == 0) {
  7624                  ret = account_leaf_items(trans, root, root_eb);
  7625                  goto out;
  7626          }
  7627  
  7628          path = btrfs_alloc_path();
  7629          if (!path)
  7630                  return -ENOMEM;
  7631  
  7632          /*
  7633           * Walk down the tree.  Missing extent blocks are filled in as
  7634           * we go. Metadata is accounted every time we read a new
  7635           * extent block.
  7636           *
  7637           * When we reach a leaf, we account for file extent items in it,
  7638           * walk back up the tree (adjusting slot pointers as we go)
  7639           * and restart the search process.
  7640           */
  7641          extent_buffer_get(root_eb); /* For path */
  7642          path->nodes[root_level] = root_eb;

->nodes[] has BTRFS_MAX_LEVEL elements.

  7643          path->slots[root_level] = 0;
  7644          path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
  7645  walk_down:
  7646          level = root_level;
  7647          while (level >= 0) {
  7648                  if (path->nodes[level] == NULL) {
  7649                          int parent_slot;
  7650                          u64 child_gen;
  7651                          u64 child_bytenr;
  7652  
  7653                          /* We need to get child blockptr/gen from
  7654                           * parent before we can read it. */
  7655                          eb = path->nodes[level + 1];
                                         ^^^^^^^^^^^^^^^^^^
But when I changed that, it introduced a warning here because we
add one.  I'm not sure what to do.

  7656                          parent_slot = path->slots[level + 1];
  7657                          child_bytenr = btrfs_node_blockptr(eb, parent_slot);
  7658                          child_gen = btrfs_node_ptr_generation(eb, parent_slot);
  7659  
  7660                          eb = read_tree_block(root, child_bytenr, child_gen);
  7661                          if (!eb || !extent_buffer_uptodate(eb)) {
  7662                                  ret = -EIO;
  7663                                  goto out;
  7664                          }
  7665  
  7666                          path->nodes[level] = eb;
  7667                          path->slots[level] = 0;
  7668  

regards,
dan carpenter


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-20 15:29           ` Mark Fasheh
  2014-06-20 15:44             ` Josef Bacik
@ 2014-06-23 14:49             ` David Sterba
  1 sibling, 0 replies; 15+ messages in thread
From: David Sterba @ 2014-06-23 14:49 UTC
  To: Mark Fasheh; +Cc: dsterba, Josef Bacik, linux-btrfs, Chris Mason

On Fri, Jun 20, 2014 at 08:29:58AM -0700, Mark Fasheh wrote:
> +static int account_shared_subtree(struct btrfs_trans_handle *trans,
> +				  struct btrfs_root *root,
> +				  struct extent_buffer *root_eb,
> +				  u64 root_gen,
> +				  int root_level)
> +{
> +	int ret;
> +	int level;
> +	struct extent_buffer *eb = root_eb;
> +	struct btrfs_path *path = NULL;
> +
> +	BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
> +	BUG_ON(root_eb == NULL);

The caller is ready for errors, can you please turn these to an EIO?
(Separate patch is fine)
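
That conversion would be roughly (using >=, which also matches the nodes[]
bound Dan Carpenter flags elsewhere in this thread):

	if (root_level < 0 || root_level >= BTRFS_MAX_LEVEL)
		return -EIO;
	if (!root_eb)
		return -EIO;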

> +static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
> +				     struct btrfs_fs_info *fs_info,
> +				     struct btrfs_qgroup_operation *oper)
> +{
> +	struct ulist *roots = NULL;
> +	struct ulist_node *unode;
> +	struct ulist_iterator uiter;
> +	struct btrfs_qgroup_list *glist;
> +	struct ulist *parents;
> +	int ret = 0;
> +	struct btrfs_qgroup *qg;
> +	u64 root_obj = 0;
> +	struct seq_list elem = {};
> +
> +	parents = ulist_alloc(GFP_NOFS);
> +	if (!parents)
> +		return -ENOMEM;
> +
> +	btrfs_get_tree_mod_seq(fs_info, &elem);
> +	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr,
> +				   elem.seq, &roots);
> +	btrfs_put_tree_mod_seq(fs_info, &elem);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (roots->nnodes != 1)
> +		goto out;
> +
> +	ULIST_ITER_INIT(&uiter);
> +	unode = ulist_next(roots, &uiter); /* Only want 1 so no need to loop */
> +	/*
> +	 * If we find our ref root then that means all refs
> +	 * this extent has to the root have not yet been
> +	 * deleted. In that case, we do nothing and let the
> +	 * last ref for this bytenr drive our update.
> +	 *
> +	 * This can happen for example if an extent is
> +	 * referenced multiple times in a snapshot (clone,
> +	 * etc). If we are in the middle of snapshot removal,
> +	 * queued updates for such an extent will find the
> +	 * root if we have not yet finished removing the
> +	 * snapshot.
> +	 */
> +	if (unode->val == oper->ref_root)
> +		goto out;
> +
> +	root_obj = unode->val;
> +	BUG_ON(!root_obj);

Same here, although when called through btrfs_qgroup_account, the error
code is not properly checked in btrfs_delayed_qgroup_accounting, line 1944:

1933 int btrfs_delayed_qgroup_accounting(struct btrfs_trans_handle *trans,
1934                                     struct btrfs_fs_info *fs_info)
1935 {
1936         struct btrfs_qgroup_operation *oper;
1937         int ret = 0;
1938
1939         while (!list_empty(&trans->qgroup_ref_list)) {
1940                 oper = list_first_entry(&trans->qgroup_ref_list,
1941                                         struct btrfs_qgroup_operation, list);
1942                 list_del_init(&oper->list);
1943                 if (!ret || !trans->aborted)
1944                         ret = btrfs_qgroup_account(trans, fs_info, oper);
1945                 spin_lock(&fs_info->qgroup_op_lock);
1946                 rb_erase(&oper->n, &fs_info->qgroup_op_tree);
1947                 spin_unlock(&fs_info->qgroup_op_lock);
1948                 btrfs_put_tree_mod_seq(fs_info, &oper->elem);
1949                 kfree(oper);
1950         }
1951         return ret;
1952 }
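
Presumably the intent on line 1943 was to stop accounting once either an
error has been seen or the transaction has aborted, i.e.:

	if (!ret && !trans->aborted)
		ret = btrfs_qgroup_account(trans, fs_info, oper);

As written, with ||, a later successful call can overwrite an earlier
error stored in ret.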


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-20 15:44             ` Josef Bacik
@ 2014-06-20 17:18               ` Mark Fasheh
  0 siblings, 0 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-06-20 17:18 UTC
  To: Josef Bacik; +Cc: dsterba, linux-btrfs, Chris Mason

On Fri, Jun 20, 2014 at 08:44:51AM -0700, Josef Bacik wrote:
> On 06/20/2014 08:29 AM, Mark Fasheh wrote:
>> On Fri, Jun 20, 2014 at 01:25:34PM +0200, David Sterba wrote:
>>> On Thu, Jun 19, 2014 at 04:17:25PM -0700, Josef Bacik wrote:
>>>>>> We don't pay attention to the return value, we should probably abort the
>>>>>> transaction if there is a problem.
>>>>>
>>>>> Abort or log an error and continue? I ask because technically we could
>>>>> continue with the subvolume drop but obviously qgroup state will need to be
>>>>> fixed via a future rescan. I guess the question is which is more 'friendly'
>>>>> to the user.
>>>>
>>>> I'd be ok with log an error and tell the user to rescan.  Thanks,
>>>
>>> I agree.
>>
>> Great, I went ahead and did that. Below is patch #2 with all review comments
>> implemented.
>>
>
> Looks good, you can add
>
> Reviewed-by: Josef Bacik <jbacik@fb.com>
>
> However I'd like to see an xfstest or two for this case so we're sure to be
> testing everything properly.  Once we have that we can merge it.  Thanks,

Adding a test is no problem - I can just import a little script I
have to make a shared tree. FYI though - btrfsck will still report some
inconsistency until we fix the last bit of accounting problems, so I guess it
won't really 'pass' at first.
	--Mark

--
Mark Fasheh


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-20 15:29           ` Mark Fasheh
@ 2014-06-20 15:44             ` Josef Bacik
  2014-06-20 17:18               ` Mark Fasheh
  2014-06-23 14:49             ` David Sterba
  1 sibling, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2014-06-20 15:44 UTC
  To: Mark Fasheh, dsterba, linux-btrfs, Chris Mason

On 06/20/2014 08:29 AM, Mark Fasheh wrote:
> On Fri, Jun 20, 2014 at 01:25:34PM +0200, David Sterba wrote:
>> On Thu, Jun 19, 2014 at 04:17:25PM -0700, Josef Bacik wrote:
>>>>> We don't pay attention to the return value, we should probably abort the
>>>>> transaction if there is a problem.
>>>>
>>>> Abort or log an error and continue? I ask because technically we could
>>>> continue with the subvolume drop but obviously qgroup state will need to be
>>>> fixed via a future rescan. I guess the question is which is more 'friendly'
>>>> to the user.
>>>
>>> I'd be ok with log an error and tell the user to rescan.  Thanks,
>>
>> I agree.
>
> Great, I went ahead and did that. Below is patch #2 with all review comments
> implemented.
>

Looks good, you can add

Reviewed-by: Josef Bacik <jbacik@fb.com>

However I'd like to see an xfstest or two for this case so we're sure to be
testing everything properly.  Once we have that we can merge it.  Thanks,

Josef


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-20 11:25         ` David Sterba
@ 2014-06-20 15:29           ` Mark Fasheh
  2014-06-20 15:44             ` Josef Bacik
  2014-06-23 14:49             ` David Sterba
  0 siblings, 2 replies; 15+ messages in thread
From: Mark Fasheh @ 2014-06-20 15:29 UTC
  To: dsterba, Josef Bacik, linux-btrfs, Chris Mason

On Fri, Jun 20, 2014 at 01:25:34PM +0200, David Sterba wrote:
> On Thu, Jun 19, 2014 at 04:17:25PM -0700, Josef Bacik wrote:
> > >>We don't pay attention to the return value, we should probably abort the
> > >>transaction if there is a problem.
> > >
> > >Abort or log an error and continue? I ask because technically we could
> > >continue with the subvolume drop but obviously qgroup state will need to be
> > >fixed via a future rescan. I guess the question is which is more 'friendly'
> > >to the user.
> > 
> > I'd be ok with log an error and tell the user to rescan.  Thanks,
> 
> I agree.

Great, I went ahead and did that. Below is patch #2 with all review comments
implemented.

Thanks, Mark

--
Mark Fasheh

From: Mark Fasheh <mfasheh@suse.de>

btrfs: qgroup: account shared subtrees during snapshot delete

During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent and in the case that we find
only one reference left, we go ahead and do the math on its
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46f39bf..c867c43 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7324,6 +7324,229 @@ reada:
 	wc->reada_slot = slot;
 }
 
+static int account_leaf_items(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root,
+			      struct extent_buffer *eb)
+{
+	int nr = btrfs_header_nritems(eb);
+	int i, extent_type, ret;
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *fi;
+	u64 bytenr, num_bytes;
+
+	for (i = 0; i < nr; i++) {
+		btrfs_item_key_to_cpu(eb, &key, i);
+
+		if (key.type != BTRFS_EXTENT_DATA_KEY)
+			continue;
+
+		fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
+		/* filter out non qgroup-accountable extents  */
+		extent_type = btrfs_file_extent_type(eb, fi);
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+			continue;
+
+		bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+		if (!bytenr)
+			continue;
+
+		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+					      root->objectid,
+					      bytenr, num_bytes,
+					      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/*
+ * Walk up the tree from the bottom, freeing leaves and any interior
+ * nodes which have had all slots visited. If a node (leaf or
+ * interior) is freed, the node above it will have its slot
+ * incremented. The root node will never be freed.
+ *
+ * At the end of this function, we should have a path which has all
+ * slots incremented to the next position for a search. If we need to
+ * read a new node it will be NULL and the node above it will have the
+ * correct slot selected for a later read.
+ *
+ * If we increment the root node's slot counter past the number of
+ * elements, 1 is returned to signal completion of the search.
+ */
+static int adjust_slots_upwards(struct btrfs_root *root,
+				struct btrfs_path *path, int root_level)
+{
+	int level = 0;
+	int nr, slot;
+	struct extent_buffer *eb;
+
+	if (root_level == 0)
+		return 1;
+
+	while (level <= root_level) {
+		eb = path->nodes[level];
+		nr = btrfs_header_nritems(eb);
+		path->slots[level]++;
+		slot = path->slots[level];
+		if (slot >= nr || level == 0) {
+			/*
+			 * Don't free the root -  we will detect this
+			 * condition after our loop and return a
+			 * positive value for caller to stop walking the tree.
+			 */
+			if (level != root_level) {
+				btrfs_tree_unlock_rw(eb, path->locks[level]);
+				path->locks[level] = 0;
+
+				free_extent_buffer(eb);
+				path->nodes[level] = NULL;
+				path->slots[level] = 0;
+			}
+		} else {
+			/*
+			 * We have a valid slot to walk back down
+			 * from. Stop here so caller can process these
+			 * new nodes.
+			 */
+			break;
+		}
+
+		level++;
+	}
+
+	eb = path->nodes[root_level];
+	if (path->slots[root_level] >= btrfs_header_nritems(eb))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * root_eb is the subtree root and is locked before this function is called.
+ */
+static int account_shared_subtree(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct extent_buffer *root_eb,
+				  u64 root_gen,
+				  int root_level)
+{
+	int ret;
+	int level;
+	struct extent_buffer *eb = root_eb;
+	struct btrfs_path *path = NULL;
+
+	BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
+	BUG_ON(root_eb == NULL);
+
+	if (!root->fs_info->quota_enabled)
+		return 0;
+
+	if (!extent_buffer_uptodate(root_eb)) {
+		ret = btrfs_read_buffer(root_eb, root_gen);
+		if (ret)
+			goto out;
+	}
+
+	if (root_level == 0) {
+		ret = account_leaf_items(trans, root, root_eb);
+		if (ret)
+			goto out;
+		goto out_done;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Walk down the tree.  Missing extent blocks are filled in as
+	 * we go. Metadata is accounted every time we read a new
+	 * extent block.
+	 *
+	 * When we reach a leaf, we account for file extent items in it,
+	 * walk back up the tree (adjusting slot pointers as we go)
+	 * and restart the search process.
+	 */
+	extent_buffer_get(root_eb); /* For path */
+	path->nodes[root_level] = root_eb;
+	path->slots[root_level] = 0;
+	path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
+walk_down:
+	level = root_level;
+	while (level >= 0) {
+		if (path->nodes[level] == NULL) {
+			int child_bsize = btrfs_level_size(root, level);
+			int parent_slot;
+			u64 child_gen;
+			u64 child_bytenr;
+
+			/* We need to get child blockptr/gen from
+			 * parent before we can read it. */
+			eb = path->nodes[level + 1];
+			parent_slot = path->slots[level + 1];
+			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+
+			eb = read_tree_block(root, child_bytenr, child_bsize,
+					     child_gen);
+			if (!eb || !extent_buffer_uptodate(eb)) {
+				ret = -EIO;
+				goto out;
+			}
+
+			path->nodes[level] = eb;
+			path->slots[level] = 0;
+
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+						root->objectid,
+						child_bytenr,
+						child_bsize,
+						BTRFS_QGROUP_OPER_SUB_SUBTREE,
+						0);
+			if (ret)
+				goto out;
+
+		}
+
+		if (level == 0) {
+			ret = account_leaf_items(trans, root, path->nodes[level]);
+			if (ret)
+				goto out;
+
+			/* Nonzero return here means we completed our search */
+			ret = adjust_slots_upwards(root, path, root_level);
+			if (ret)
+				break;
+
+			/* Restart search with new slots */
+			goto walk_down;
+		}
+
+		level--;
+	}
+
+out_done:
+	ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+				      root->objectid, root_eb->start,
+				      btrfs_level_size(root, root_level),
+				      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+	if (ret)
+		goto out;
+
+out:
+	btrfs_free_path(path);
+
+	return ret;
+}
+
 /*
  * helper to process tree block while walking down the tree.
  *
@@ -7472,6 +7695,15 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 
 	if (wc->stage == DROP_REFERENCE) {
 		if (wc->refs[level - 1] > 1) {
+			ret = account_shared_subtree(trans, root, next,
+						     generation, level - 1);
+			if (ret) {
+				printk_ratelimited(KERN_ERR "BTRFS: %s Error "
+					"%d accounting shared subtree. Quota "
+					"is out of sync, rescan required.\n",
+					root->fs_info->sb->s_id, ret);
+			}
+
 			if (level == 1 &&
 			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
 				goto skip;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a9f0f05..231eb78 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1202,7 +1202,7 @@ out:
 	return ret;
 }
 static int comp_oper(struct btrfs_qgroup_operation *oper1,
-		     struct btrfs_qgroup_operation *oper2)
+		     struct btrfs_qgroup_operation *oper2, int for_insert)
 {
 	if (oper1->bytenr < oper2->bytenr)
 		return -1;
@@ -1216,10 +1216,37 @@ static int comp_oper(struct btrfs_qgroup_operation *oper1,
 		return -1;
 	if (oper1->ref_root > oper2->ref_root)
 		return 1;
-	if (oper1->type < oper2->type)
-		return -1;
-	if (oper1->type > oper2->type)
-		return 1;
+	if (for_insert) {
+		if (oper1->type < oper2->type)
+			return -1;
+		if (oper1->type > oper2->type)
+			return 1;
+	}
+	return 0;
+}
+
+static int qgroup_oper_exists(struct btrfs_fs_info *fs_info,
+			      struct btrfs_qgroup_operation *oper)
+{
+	struct rb_node *n;
+	struct btrfs_qgroup_operation *cur;
+	int cmp;
+
+	spin_lock(&fs_info->qgroup_op_lock);
+	n = fs_info->qgroup_op_tree.rb_node;
+	while (n) {
+		cur = rb_entry(n, struct btrfs_qgroup_operation, n);
+		cmp = comp_oper(cur, oper, 0);
+		if (cmp < 0) {
+			n = n->rb_right;
+		} else if (cmp) {
+			n = n->rb_left;
+		} else {
+			spin_unlock(&fs_info->qgroup_op_lock);
+			return -EEXIST;
+		}
+	}
+	spin_unlock(&fs_info->qgroup_op_lock);
 	return 0;
 }
 
@@ -1236,7 +1263,7 @@ static int insert_qgroup_oper(struct btrfs_fs_info *fs_info,
 	while (*p) {
 		parent = *p;
 		cur = rb_entry(parent, struct btrfs_qgroup_operation, n);
-		cmp = comp_oper(cur, oper);
+		cmp = comp_oper(cur, oper, 1);
 		if (cmp < 0) {
 			p = &(*p)->rb_right;
 		} else if (cmp) {
@@ -1290,7 +1317,21 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
 	INIT_LIST_HEAD(&oper->elem.list);
 	oper->elem.seq = 0;
+
 	trace_btrfs_qgroup_record_ref(oper);
+
+	if (type == BTRFS_QGROUP_OPER_SUB_SUBTREE) {
+		/*
+		 * If any operation for this bytenr/ref_root combo
+		 * exists, then we know it's not exclusively owned and
+		 * shouldn't be queued up.
+		 */
+		if (qgroup_oper_exists(fs_info, oper)) {
+			kfree(oper);
+			return 0;
+		}
+	}
+
 	ret = insert_qgroup_oper(fs_info, oper);
 	if (ret) {
 		/* Shouldn't happen so have an assert for developers */
@@ -1883,6 +1924,106 @@ out:
 }
 
 /*
+ * Process a reference to a shared subtree. This type of operation is
+ * queued during snapshot removal when we encounter extents which are
+ * shared between more than one root.
+ */
+static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
+				     struct btrfs_fs_info *fs_info,
+				     struct btrfs_qgroup_operation *oper)
+{
+	struct ulist *roots = NULL;
+	struct ulist_node *unode;
+	struct ulist_iterator uiter;
+	struct btrfs_qgroup_list *glist;
+	struct ulist *parents;
+	int ret = 0;
+	struct btrfs_qgroup *qg;
+	u64 root_obj = 0;
+	struct seq_list elem = {};
+
+	parents = ulist_alloc(GFP_NOFS);
+	if (!parents)
+		return -ENOMEM;
+
+	btrfs_get_tree_mod_seq(fs_info, &elem);
+	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr,
+				   elem.seq, &roots);
+	btrfs_put_tree_mod_seq(fs_info, &elem);
+	if (ret < 0)
+		return ret;
+
+	if (roots->nnodes != 1)
+		goto out;
+
+	ULIST_ITER_INIT(&uiter);
+	unode = ulist_next(roots, &uiter); /* Only want 1 so no need to loop */
+	/*
+	 * If we find our ref root then that means all refs
+	 * this extent has to the root have not yet been
+	 * deleted. In that case, we do nothing and let the
+	 * last ref for this bytenr drive our update.
+	 *
+	 * This can happen for example if an extent is
+	 * referenced multiple times in a snapshot (clone,
+	 * etc). If we are in the middle of snapshot removal,
+	 * queued updates for such an extent will find the
+	 * root if we have not yet finished removing the
+	 * snapshot.
+	 */
+	if (unode->val == oper->ref_root)
+		goto out;
+
+	root_obj = unode->val;
+	BUG_ON(!root_obj);
+
+	spin_lock(&fs_info->qgroup_lock);
+	qg = find_qgroup_rb(fs_info, root_obj);
+	if (!qg)
+		goto out_unlock;
+
+	qg->excl += oper->num_bytes;
+	qg->excl_cmpr += oper->num_bytes;
+	qgroup_dirty(fs_info, qg);
+
+	/*
+	 * Adjust counts for parent groups. First we find all
+	 * parents, then in the 2nd loop we do the adjustment
+	 * while adding parents of the parents to our ulist.
+	 */
+	list_for_each_entry(glist, &qg->groups, next_group) {
+		ret = ulist_add(parents, glist->group->qgroupid,
+				ptr_to_u64(glist->group), GFP_ATOMIC);
+		if (ret < 0)
+			goto out_unlock;
+	}
+
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(parents, &uiter))) {
+		qg = u64_to_ptr(unode->aux);
+		qg->excl += oper->num_bytes;
+		qg->excl_cmpr += oper->num_bytes;
+		qgroup_dirty(fs_info, qg);
+
+		/* Add any parents of the parents */
+		list_for_each_entry(glist, &qg->groups, next_group) {
+			ret = ulist_add(parents, glist->group->qgroupid,
+					ptr_to_u64(glist->group), GFP_ATOMIC);
+			if (ret < 0)
+				goto out_unlock;
+		}
+	}
+
+out_unlock:
+	spin_unlock(&fs_info->qgroup_lock);
+
+out:
+	ulist_free(roots);
+	ulist_free(parents);
+	return ret;
+}
+
+/*
  * btrfs_qgroup_account_ref is called for every ref that is added to or deleted
  * from the fs. First, all roots referencing the extent are searched, and
  * then the space is accounted accordingly to the different roots. The
@@ -1921,6 +2062,9 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 	case BTRFS_QGROUP_OPER_SUB_SHARED:
 		ret = qgroup_shared_accounting(trans, fs_info, oper);
 		break;
+	case BTRFS_QGROUP_OPER_SUB_SUBTREE:
+		ret = qgroup_subtree_accounting(trans, fs_info, oper);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 5952ff1..18cc68c 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -44,6 +44,7 @@ enum btrfs_qgroup_operation_type {
 	BTRFS_QGROUP_OPER_ADD_SHARED,
 	BTRFS_QGROUP_OPER_SUB_EXCL,
 	BTRFS_QGROUP_OPER_SUB_SHARED,
+	BTRFS_QGROUP_OPER_SUB_SUBTREE,
 };
 
 struct btrfs_qgroup_operation {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b8774b3..96daac6 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1125,7 +1125,8 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 		{ BTRFS_QGROUP_OPER_ADD_EXCL, 	"OPER_ADD_EXCL" },	\
 		{ BTRFS_QGROUP_OPER_ADD_SHARED, "OPER_ADD_SHARED" },	\
 		{ BTRFS_QGROUP_OPER_SUB_EXCL, 	"OPER_SUB_EXCL" },	\
-		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" })
+		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" },	\
+		{ BTRFS_QGROUP_OPER_SUB_SUBTREE,"OPER_SUB_SUBTREE" })
 
 DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
 

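A note on the comp_oper() change shown above: inserts key the operation rbtree on the full (bytenr, ref_root, type) tuple, while lookups pass for_insert == 0 and stop at (bytenr, ref_root), so a search matches any operation already queued for that extent/root pair. A minimal standalone sketch of the two-mode comparator pattern, with simplified types rather than the kernel's:

#include <stdint.h>

struct oper {
	uint64_t bytenr;
	uint64_t ref_root;
	int type;
};

/*
 * for_insert == 1: compare on the full (bytenr, ref_root, type)
 * key, so distinct operation types on one extent can coexist in
 * the tree.
 * for_insert == 0: type is ignored, so the lookup matches any
 * queued operation for this bytenr/ref_root pair.
 */
int comp_oper(const struct oper *a, const struct oper *b, int for_insert)
{
	if (a->bytenr != b->bytenr)
		return a->bytenr < b->bytenr ? -1 : 1;
	if (a->ref_root != b->ref_root)
		return a->ref_root < b->ref_root ? -1 : 1;
	if (for_insert && a->type != b->type)
		return a->type < b->type ? -1 : 1;
	return 0;
}

This is what lets qgroup_oper_exists() answer "is anything queued at all for this extent and root?" without knowing the type of the earlier operation.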

* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-19 23:17       ` Josef Bacik
@ 2014-06-20 11:25         ` David Sterba
  2014-06-20 15:29           ` Mark Fasheh
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2014-06-20 11:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Mark Fasheh, linux-btrfs, Chris Mason

On Thu, Jun 19, 2014 at 04:17:25PM -0700, Josef Bacik wrote:
> >>We don't pay attention to the return value, we should probably abort the
> >>transaction if there is a problem.
> >
> >Abort or log an error and continue? I ask because technically we could
> >continue with the subvolume drop but obviously qgroup state will need to be
> >fixed via a future rescan. I guess the question is which is more 'friendly'
> >to the user.
> 
> I'd be ok with logging an error and telling the user to rescan. Thanks,

I agree.


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-19 23:16     ` Mark Fasheh
@ 2014-06-19 23:17       ` Josef Bacik
  2014-06-20 11:25         ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2014-06-19 23:17 UTC (permalink / raw)
  To: Mark Fasheh; +Cc: linux-btrfs, Chris Mason



On 06/19/2014 04:16 PM, Mark Fasheh wrote:
> Thanks for the review, Josef. I will implement everything you mentioned. I
> have one question below though:
>
> On Thu, Jun 19, 2014 at 03:25:12PM -0700, Josef Bacik wrote:
>> On 06/19/2014 02:49 PM, Mark Fasheh wrote:
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 46f39bf..672d2a4 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -7472,6 +7703,9 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>>>
>>>    	if (wc->stage == DROP_REFERENCE) {
>>>    		if (wc->refs[level - 1] > 1) {
>>> +			account_shared_subtree(trans, root, next, generation,
>>> +					       level - 1);
>>> +
>>
>> We don't pay attention to the return value, we should probably abort the
>> transaction if there is a problem.
>
> Abort or log an error and continue? I ask because technically we could
> continue with the subvolume drop but obviously qgroup state will need to be
> fixed via a future rescan. I guess the question is which is more 'friendly'
> to the user.

I'd be ok with logging an error and telling the user to rescan. Thanks,

Josef


* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-19 22:25   ` Josef Bacik
@ 2014-06-19 23:16     ` Mark Fasheh
  2014-06-19 23:17       ` Josef Bacik
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Fasheh @ 2014-06-19 23:16 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, Chris Mason

Thanks for the review, Josef. I will implement everything you mentioned. I
have one question below though:

On Thu, Jun 19, 2014 at 03:25:12PM -0700, Josef Bacik wrote:
> On 06/19/2014 02:49 PM, Mark Fasheh wrote:
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 46f39bf..672d2a4 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7472,6 +7703,9 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>>
>>   	if (wc->stage == DROP_REFERENCE) {
>>   		if (wc->refs[level - 1] > 1) {
>> +			account_shared_subtree(trans, root, next, generation,
>> +					       level - 1);
>> +
>
> We don't pay attention to the return value, we should probably abort the
> transaction if there is a problem.

Abort or log an error and continue? I ask because technically we could
continue with the subvolume drop but obviously qgroup state will need to be
fixed via a future rescan. I guess the question is which is more 'friendly'
to the user.
	--Mark

--
Mark Fasheh

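Roughly, the two options being weighed look like this at the account_shared_subtree() call site in do_walk_down(). This is a sketch only - the message text is illustrative, only one of the two branches would be kept, and the thread settles on the log-and-continue variant, which is what the V3 hunk earlier in this page implements:

	ret = account_shared_subtree(trans, root, next,
				     generation, level - 1);
	if (ret) {
		/*
		 * Option A: treat the failure as fatal. Aborting the
		 * transaction forces the filesystem read-only.
		 */
		btrfs_abort_transaction(trans, root, ret);

		/*
		 * Option B: log and let the drop continue; qgroup
		 * counts are stale until the user runs
		 * 'btrfs quota rescan'.
		 */
		printk_ratelimited(KERN_ERR
			"BTRFS: error %d accounting shared subtree, quota is out of sync, rescan required\n",
			ret);
	}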

* Re: btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-19 21:49 ` btrfs: qgroup: account shared subtrees during snapshot delete Mark Fasheh
@ 2014-06-19 22:25   ` Josef Bacik
  2014-06-19 23:16     ` Mark Fasheh
  0 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2014-06-19 22:25 UTC (permalink / raw)
  To: Mark Fasheh, linux-btrfs; +Cc: Chris Mason

On 06/19/2014 02:49 PM, Mark Fasheh wrote:
> During its tree walk, btrfs_drop_snapshot() will skip any shared
> subtrees it encounters. This is incorrect when we have qgroups
> turned on, as those subtrees need to have their contents
> accounted. In particular, the case we're concerned with is when
> removing our snapshot root leaves the subtree with only one root
> reference.
>
> In those cases we need to find the last remaining root and add
> each extent in the subtree to the corresponding qgroup exclusive
> counts.
>
> This patch implements the shared subtree walk and a new qgroup
> operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
> this type is encountered during qgroup accounting, we search for
> any root references to that extent, and in the case that we find
> only one reference left, we go ahead and do the math on its
> exclusive counts.
>
> Signed-off-by: Mark Fasheh <mfasheh@suse.de>
> ---
>   fs/btrfs/extent-tree.c       | 234 +++++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/qgroup.c            | 166 ++++++++++++++++++++++++++++--
>   fs/btrfs/qgroup.h            |   1 +
>   include/trace/events/btrfs.h |   3 +-
>   4 files changed, 397 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 46f39bf..672d2a4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7324,6 +7324,237 @@ reada:
>   	wc->reada_slot = slot;
>   }
>
> +static int account_leaf_items(struct btrfs_trans_handle *trans,
> +			      struct btrfs_root *root,
> +			      struct extent_buffer *eb)
> +{
> +	int nr = btrfs_header_nritems(eb);
> +	int i, extent_type, ret;
> +	struct btrfs_key key;
> +	struct btrfs_file_extent_item *fi;
> +	u64 bytenr, num_bytes;
> +
> +	for (i = 0; i < nr; i++) {
> +		btrfs_item_key_to_cpu(eb, &key, i);
> +
> +		if (key.type != BTRFS_EXTENT_DATA_KEY)
> +			continue;
> +
> +		fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
> +		/* filter out non qgroup-accountable extents  */
> +		extent_type = btrfs_file_extent_type(eb, fi);
> +
> +		if (extent_type == BTRFS_FILE_EXTENT_INLINE)
> +			continue;
> +
> +		bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
> +		if (!bytenr)
> +			continue;
> +
> +		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
> +
> +		ret = btrfs_qgroup_record_ref(trans, root->fs_info,
> +					      root->objectid,
> +					      bytenr, num_bytes,
> +					      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
> +		if (ret)
> +			return ret;
> +	}
> +	return 0;
> +}
> +
> +/*
> + * Walk up the tree from the bottom, freeing leaves and any interior
> + * nodes which have had all slots visited. If a node (leaf or
> + * interior) is freed, the node above it will have its slot
> + * incremented. The root node will never be freed.
> + *
> + * At the end of this function, we should have a path which has all
> + * slots incremented to the next position for a search. If we need to
> + * read a new node it will be NULL and the node above it will have the
> + * correct slot selected for a later read.
> + *
> + * If we increment the root node's slot counter past the number of
> + * elements, 1 is returned to signal completion of the search.
> + */
> +static int adjust_slots_upwards(struct btrfs_root *root,
> +				struct btrfs_path *path, int root_level)
> +{
> +	int level = 0;
> +	int nr, slot;
> +	struct extent_buffer *eb;
> +
> +	if (root_level == 0)
> +		return 1;
> +
> +	while (level <= root_level) {
> +		eb = path->nodes[level];
> +		nr = btrfs_header_nritems(eb);
> +		path->slots[level]++;
> +		slot = path->slots[level];
> +		if (slot >= nr || level == 0) {
> +			/*
> +			 * Don't free the root - we will detect this
> +			 * condition after our loop and return a
> +			 * positive value for the caller to stop walking the tree.
> +			 */
> +			if (level != root_level) {
> +				btrfs_tree_unlock_rw(eb, path->locks[level]);
> +				path->locks[level] = 0;
> +
> +				free_extent_buffer(eb);
> +				path->nodes[level] = NULL;
> +				path->slots[level] = 0;
> +			}
> +		} else {
> +			/*
> +			 * We have a valid slot to walk back down
> +			 * from. Stop here so caller can process these
> +			 * new nodes.
> +			 */
> +			break;
> +		}
> +
> +		level++;
> +	}
> +
> +	eb = path->nodes[root_level];
> +	if (path->slots[root_level] >= btrfs_header_nritems(eb))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +/*
> + * root_eb is the subtree root and is locked before this function is called.
> + */
> +static int account_shared_subtree(struct btrfs_trans_handle *trans,
> +				  struct btrfs_root *root,
> +				  struct extent_buffer *root_eb,
> +				  u64 root_gen,
> +				  int root_level)
> +{
> +	int ret;
> +	int level;
> +	struct extent_buffer *eb = root_eb;
> +	struct btrfs_path *path = NULL;
> +
> +	BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
> +	BUG_ON(root_eb == NULL);
> +
> +	if (!root->fs_info->quota_enabled)
> +		return 0;
> +
> +	if (!extent_buffer_uptodate(root_eb)) {
> +		ret = btrfs_read_buffer(root_eb, root_gen);
> +		if (ret)
> +			goto out;
> +	}
> +	/* XXX: Is this check necessary? */
> +	if (!extent_buffer_uptodate(root_eb)) {
> +		ret = -EIO;
> +		goto out;
> +	}

Nope, don't need it.

> +
> +	if (root_level == 0) {
> +		ret = account_leaf_items(trans, root, root_eb);
> +		if (ret)
> +			goto out;
> +		goto out_done;
> +	}
> +
> +	path = btrfs_alloc_path();
> +	if (!path)
> +		return -ENOMEM;
> +
> +	/*
> +	 * Walk down the tree.  Missing extent blocks are filled in as
> +	 * we go. Metadata is accounted every time we read a new
> +	 * extent block.
> +	 *
> +	 * When we reach a leaf, we account for file extent items in it,
> +	 * walk back up the tree (adjusting slot pointers as we go)
> +	 * and restart the search process.
> +	 */
> +	extent_buffer_get(root_eb); /* For path */
> +	path->nodes[root_level] = root_eb;
> +	path->slots[root_level] = 0;
> +	path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
> +walk_down:
> +	level = root_level;
> +	while (level >= 0) {
> +		if (path->nodes[level] == NULL) {
> +			int child_bsize = btrfs_level_size(root, level);
> +			int parent_slot;
> +			u64 child_gen;
> +			u64 child_bytenr;
> +
> +			/* We need to get child blockptr/gen from
> +			 * parent before we can read it. */
> +			eb = path->nodes[level + 1];
> +			parent_slot = path->slots[level + 1];
> +			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
> +			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
> +
> +			eb = read_tree_block(root, child_bytenr, child_bsize,
> +					     child_gen);
> +			if (!eb || !extent_buffer_uptodate(eb)) {
> +				ret = -EIO;
> +				goto out;
> +			}
> +
> +			path->nodes[level] = eb;
> +			path->slots[level] = 0;
> +
> +			btrfs_tree_read_lock(eb);
> +			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
> +			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
> +
> +			ret = btrfs_qgroup_record_ref(trans, root->fs_info,
> +						root->objectid,
> +						child_bytenr,
> +						child_bsize,
> +						BTRFS_QGROUP_OPER_SUB_SUBTREE,
> +						0);
> +			if (ret)
> +				goto out;
> +
> +		}
> +
> +		if (level == 0) {
> +			ret = account_leaf_items(trans, root, path->nodes[level]);
> +			if (ret)
> +				goto out;
> +
> +			/* Nonzero return here means we completed our search */
> +			ret = adjust_slots_upwards(root, path, root_level);
> +			if (ret)
> +				break;
> +
> +			/* Restart search with new slots */
> +			goto walk_down;
> +		}
> +
> +		level--;
> +	}
> +
> +out_done:
> +	ret = btrfs_qgroup_record_ref(trans, root->fs_info,
> +				      root->objectid, root_eb->start,
> +				      btrfs_level_size(root, root_level),
> +				      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
> +	if (ret)
> +		goto out;
> +
> +	ret = 0;

Don't need this bit either.

> +out:
> +
> +	if (path)
> +		btrfs_free_path(path);

btrfs_free_path() already has a null check in there.

> +
> +	return ret;
> +}
> +
>   /*
>    * helper to process tree block while walking down the tree.
>    *
> @@ -7472,6 +7703,9 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
>
>   	if (wc->stage == DROP_REFERENCE) {
>   		if (wc->refs[level - 1] > 1) {
> +			account_shared_subtree(trans, root, next, generation,
> +					       level - 1);
> +

We don't pay attention to the return value, we should probably abort the
transaction if there is a problem.

>   			if (level == 1 &&
>   			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
>   				goto skip;
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index a9f0f05..05c36e5 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1202,7 +1202,7 @@ out:
>   	return ret;
>   }
>   static int comp_oper(struct btrfs_qgroup_operation *oper1,
> -		     struct btrfs_qgroup_operation *oper2)
> +		     struct btrfs_qgroup_operation *oper2, int for_insert)
>   {
>   	if (oper1->bytenr < oper2->bytenr)
>   		return -1;
> @@ -1216,10 +1216,37 @@ static int comp_oper(struct btrfs_qgroup_operation *oper1,
>   		return -1;
>   	if (oper1->ref_root > oper2->ref_root)
>   		return 1;
> -	if (oper1->type < oper2->type)
> -		return -1;
> -	if (oper1->type > oper2->type)
> -		return 1;
> +	if (for_insert) {
> +		if (oper1->type < oper2->type)
> +			return -1;
> +		if (oper1->type > oper2->type)
> +			return 1;
> +	}
> +	return 0;
> +}
> +
> +static int qgroup_oper_exists(struct btrfs_fs_info *fs_info,
> +			      struct btrfs_qgroup_operation *oper)
> +{
> +	struct rb_node *n;
> +	struct btrfs_qgroup_operation *cur;
> +	int cmp;
> +
> +	spin_lock(&fs_info->qgroup_op_lock);
> +	n = fs_info->qgroup_op_tree.rb_node;
> +	while (n) {
> +		cur = rb_entry(n, struct btrfs_qgroup_operation, n);
> +		cmp = comp_oper(cur, oper, 0);
> +		if (cmp < 0) {
> +			n = n->rb_right;
> +		} else if (cmp) {
> +			n = n->rb_left;
> +		} else {
> +			spin_unlock(&fs_info->qgroup_op_lock);
> +			return -EEXIST;
> +		}
> +	}
> +	spin_unlock(&fs_info->qgroup_op_lock);
>   	return 0;
>   }
>
> @@ -1236,7 +1263,7 @@ static int insert_qgroup_oper(struct btrfs_fs_info *fs_info,
>   	while (*p) {
>   		parent = *p;
>   		cur = rb_entry(parent, struct btrfs_qgroup_operation, n);
> -		cmp = comp_oper(cur, oper);
> +		cmp = comp_oper(cur, oper, 1);
>   		if (cmp < 0) {
>   			p = &(*p)->rb_right;
>   		} else if (cmp) {
> @@ -1290,7 +1317,25 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
>   	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
>   	INIT_LIST_HEAD(&oper->elem.list);
>   	oper->elem.seq = 0;
> +
>   	trace_btrfs_qgroup_record_ref(oper);
> +
> +	if (type == BTRFS_QGROUP_OPER_SUB_SUBTREE) {
> +		/*
> +		 * If any operation for this bytenr/ref_root combo
> +		 * exists, then we know it's not exclusively owned and
> +		 * shouldn't be queued up.
> +		 *
> +		 * XXX: Do other operations need to search for
> +		 * SUB_SHARED opers queued up on this bytenr? What do
> +		 * they do in that case?
> +		 */

No, because other operations will be for other roots, so it's ok.

> +		if (qgroup_oper_exists(fs_info, oper)) {
> +			kfree(oper);
> +			return 0;
> +		}
> +	}
> +
>   	ret = insert_qgroup_oper(fs_info, oper);
>   	if (ret) {
>   		/* Shouldn't happen so have an assert for developers */
> @@ -1883,6 +1928,112 @@ out:
>   }
>
>   /*
> + * Process a reference to a shared subtree. This type of operation is
> + * queued during snapshot removal when we encounter extents which are
> + * shared between more than one root.
> + */
> +static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
> +				     struct btrfs_fs_info *fs_info,
> +				     struct btrfs_qgroup_operation *oper)
> +{
> +	struct ulist *roots = NULL;
> +	struct ulist_node *unode;
> +	struct ulist_iterator uiter;
> +	struct btrfs_qgroup_list *glist;
> +	struct ulist *parents;
> +	int ret = 0;
> +	int found;
> +	struct btrfs_qgroup *qg;
> +	u64 root_obj = 0;
> +	struct seq_list elem = {};
> +
> +	parents = ulist_alloc(GFP_NOFS);
> +	if (!parents)
> +		return -ENOMEM;
> +
> +	btrfs_get_tree_mod_seq(fs_info, &elem);
> +	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr,
> +				   elem.seq, &roots);
> +	btrfs_put_tree_mod_seq(fs_info, &elem);
> +	if (ret < 0)
> +		return ret;
> +
> +	found = 0;
> +	ULIST_ITER_INIT(&uiter);
> +	while ((unode = ulist_next(roots, &uiter))) {
> +		/*
> +		 * If we find our ref root then that means all refs
> +		 * this extent has to the root have not yet been
> +		 * deleted. In that case, we do nothing and let the
> +		 * last ref for this bytenr drive our update.
> +		 *
> +		 * This can happen for example if an extent is
> +		 * referenced multiple times in a snapshot (clone,
> +		 * etc). If we are in the middle of snapshot removal,
> +		 * queued updates for such an extent will find the
> +		 * root if we have not yet finished removing the
> +		 * snapshot.
> +		 */
> +		if (unode->val == oper->ref_root)
> +			goto out;
> +
> +		if (!found)
> +			root_obj = unode->val;
> +		found++;
> +	}

One thing you could do here instead is just

if (roots->nnodes > 1)
	goto out;

and then do the while ((unode = ulist_next(roots, &uiter))) bit down here, that
way you don't needlessly iterate through all the roots if you don't have to.
Imagine having 5k snapshots and you delete one and then you have to loop through
all of the other 4999 snapshots for every block you look at.  Thanks,

Josef
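
Sketched out, the early-out Josef suggests would look like the following; variable names follow the patch above, and the V3 patch earlier in this page adopts exactly this shape, with the check tightened to roots->nnodes != 1:

	/*
	 * Bail out early: with more than one root still holding a
	 * reference, nothing becomes exclusively owned here, so
	 * there is no need to walk the ulist at all.
	 */
	if (roots->nnodes != 1)
		goto out;

	ULIST_ITER_INIT(&uiter);
	unode = ulist_next(roots, &uiter);	/* exactly one node */

	/*
	 * The single remaining reference may still be our own
	 * ref_root if the snapshot drop has not finished; in that
	 * case the last ref for this bytenr drives the update.
	 */
	if (unode->val == oper->ref_root)
		goto out;

	root_obj = unode->val;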


* btrfs: qgroup: account shared subtrees during snapshot delete
  2014-06-19 21:49 [PATCH 0/2] btrfs: qgroup fixes for btrfs_drop_snapshot Mark Fasheh
@ 2014-06-19 21:49 ` Mark Fasheh
  2014-06-19 22:25   ` Josef Bacik
  0 siblings, 1 reply; 15+ messages in thread
From: Mark Fasheh @ 2014-06-19 21:49 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Chris Mason, Josef Bacik, Mark Fasheh

During its tree walk, btrfs_drop_snapshot() will skip any shared
subtrees it encounters. This is incorrect when we have qgroups
turned on, as those subtrees need to have their contents
accounted. In particular, the case we're concerned with is when
removing our snapshot root leaves the subtree with only one root
reference.

In those cases we need to find the last remaining root and add
each extent in the subtree to the corresponding qgroup exclusive
counts.

This patch implements the shared subtree walk and a new qgroup
operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
this type is encountered during qgroup accounting, we search for
any root references to that extent, and in the case that we find
only one reference left, we go ahead and do the math on its
exclusive counts.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
---
 fs/btrfs/extent-tree.c       | 234 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.c            | 166 ++++++++++++++++++++++++++++--
 fs/btrfs/qgroup.h            |   1 +
 include/trace/events/btrfs.h |   3 +-
 4 files changed, 397 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 46f39bf..672d2a4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7324,6 +7324,237 @@ reada:
 	wc->reada_slot = slot;
 }
 
+static int account_leaf_items(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root,
+			      struct extent_buffer *eb)
+{
+	int nr = btrfs_header_nritems(eb);
+	int i, extent_type, ret;
+	struct btrfs_key key;
+	struct btrfs_file_extent_item *fi;
+	u64 bytenr, num_bytes;
+
+	for (i = 0; i < nr; i++) {
+		btrfs_item_key_to_cpu(eb, &key, i);
+
+		if (key.type != BTRFS_EXTENT_DATA_KEY)
+			continue;
+
+		fi = btrfs_item_ptr(eb, i, struct btrfs_file_extent_item);
+		/* filter out non qgroup-accountable extents  */
+		extent_type = btrfs_file_extent_type(eb, fi);
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE)
+			continue;
+
+		bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+		if (!bytenr)
+			continue;
+
+		num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+
+		ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+					      root->objectid,
+					      bytenr, num_bytes,
+					      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+/*
+ * Walk up the tree from the bottom, freeing leaves and any interior
+ * nodes which have had all slots visited. If a node (leaf or
+ * interior) is freed, the node above it will have its slot
+ * incremented. The root node will never be freed.
+ *
+ * At the end of this function, we should have a path which has all
+ * slots incremented to the next position for a search. If we need to
+ * read a new node it will be NULL and the node above it will have the
+ * correct slot selected for a later read.
+ *
+ * If we increment the root node's slot counter past the number of
+ * elements, 1 is returned to signal completion of the search.
+ */
+static int adjust_slots_upwards(struct btrfs_root *root,
+				struct btrfs_path *path, int root_level)
+{
+	int level = 0;
+	int nr, slot;
+	struct extent_buffer *eb;
+
+	if (root_level == 0)
+		return 1;
+
+	while (level <= root_level) {
+		eb = path->nodes[level];
+		nr = btrfs_header_nritems(eb);
+		path->slots[level]++;
+		slot = path->slots[level];
+		if (slot >= nr || level == 0) {
+			/*
+			 * Don't free the root - we will detect this
+			 * condition after our loop and return a
+			 * positive value for the caller to stop walking the tree.
+			 */
+			if (level != root_level) {
+				btrfs_tree_unlock_rw(eb, path->locks[level]);
+				path->locks[level] = 0;
+
+				free_extent_buffer(eb);
+				path->nodes[level] = NULL;
+				path->slots[level] = 0;
+			}
+		} else {
+			/*
+			 * We have a valid slot to walk back down
+			 * from. Stop here so caller can process these
+			 * new nodes.
+			 */
+			break;
+		}
+
+		level++;
+	}
+
+	eb = path->nodes[root_level];
+	if (path->slots[root_level] >= btrfs_header_nritems(eb))
+		return 1;
+
+	return 0;
+}
+
+/*
+ * root_eb is the subtree root and is locked before this function is called.
+ */
+static int account_shared_subtree(struct btrfs_trans_handle *trans,
+				  struct btrfs_root *root,
+				  struct extent_buffer *root_eb,
+				  u64 root_gen,
+				  int root_level)
+{
+	int ret;
+	int level;
+	struct extent_buffer *eb = root_eb;
+	struct btrfs_path *path = NULL;
+
+	BUG_ON(root_level < 0 || root_level > BTRFS_MAX_LEVEL);
+	BUG_ON(root_eb == NULL);
+
+	if (!root->fs_info->quota_enabled)
+		return 0;
+
+	if (!extent_buffer_uptodate(root_eb)) {
+		ret = btrfs_read_buffer(root_eb, root_gen);
+		if (ret)
+			goto out;
+	}
+	/* XXX: Is this check necessary? */
+	if (!extent_buffer_uptodate(root_eb)) {
+		ret = -EIO;
+		goto out;
+	}
+
+	if (root_level == 0) {
+		ret = account_leaf_items(trans, root, root_eb);
+		if (ret)
+			goto out;
+		goto out_done;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	/*
+	 * Walk down the tree.  Missing extent blocks are filled in as
+	 * we go. Metadata is accounted every time we read a new
+	 * extent block.
+	 *
+	 * When we reach a leaf, we account for file extent items in it,
+	 * walk back up the tree (adjusting slot pointers as we go)
+	 * and restart the search process.
+	 */
+	extent_buffer_get(root_eb); /* For path */
+	path->nodes[root_level] = root_eb;
+	path->slots[root_level] = 0;
+	path->locks[root_level] = 0; /* so release_path doesn't try to unlock */
+walk_down:
+	level = root_level;
+	while (level >= 0) {
+		if (path->nodes[level] == NULL) {
+			int child_bsize = btrfs_level_size(root, level);
+			int parent_slot;
+			u64 child_gen;
+			u64 child_bytenr;
+
+			/* We need to get child blockptr/gen from
+			 * parent before we can read it. */
+			eb = path->nodes[level + 1];
+			parent_slot = path->slots[level + 1];
+			child_bytenr = btrfs_node_blockptr(eb, parent_slot);
+			child_gen = btrfs_node_ptr_generation(eb, parent_slot);
+
+			eb = read_tree_block(root, child_bytenr, child_bsize,
+					     child_gen);
+			if (!eb || !extent_buffer_uptodate(eb)) {
+				ret = -EIO;
+				goto out;
+			}
+
+			path->nodes[level] = eb;
+			path->slots[level] = 0;
+
+			btrfs_tree_read_lock(eb);
+			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
+			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+
+			ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+						root->objectid,
+						child_bytenr,
+						child_bsize,
+						BTRFS_QGROUP_OPER_SUB_SUBTREE,
+						0);
+			if (ret)
+				goto out;
+
+		}
+
+		if (level == 0) {
+			ret = account_leaf_items(trans, root, path->nodes[level]);
+			if (ret)
+				goto out;
+
+			/* Nonzero return here means we completed our search */
+			ret = adjust_slots_upwards(root, path, root_level);
+			if (ret)
+				break;
+
+			/* Restart search with new slots */
+			goto walk_down;
+		}
+
+		level--;
+	}
+
+out_done:
+	ret = btrfs_qgroup_record_ref(trans, root->fs_info,
+				      root->objectid, root_eb->start,
+				      btrfs_level_size(root, root_level),
+				      BTRFS_QGROUP_OPER_SUB_SUBTREE, 0);
+	if (ret)
+		goto out;
+
+	ret = 0;
+out:
+
+	if (path)
+		btrfs_free_path(path);
+
+	return ret;
+}
+
 /*
  * helper to process tree block while walking down the tree.
  *
@@ -7472,6 +7703,9 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 
 	if (wc->stage == DROP_REFERENCE) {
 		if (wc->refs[level - 1] > 1) {
+			account_shared_subtree(trans, root, next, generation,
+					       level - 1);
+
 			if (level == 1 &&
 			    (wc->flags[0] & BTRFS_BLOCK_FLAG_FULL_BACKREF))
 				goto skip;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a9f0f05..05c36e5 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1202,7 +1202,7 @@ out:
 	return ret;
 }
 static int comp_oper(struct btrfs_qgroup_operation *oper1,
-		     struct btrfs_qgroup_operation *oper2)
+		     struct btrfs_qgroup_operation *oper2, int for_insert)
 {
 	if (oper1->bytenr < oper2->bytenr)
 		return -1;
@@ -1216,10 +1216,37 @@ static int comp_oper(struct btrfs_qgroup_operation *oper1,
 		return -1;
 	if (oper1->ref_root > oper2->ref_root)
 		return 1;
-	if (oper1->type < oper2->type)
-		return -1;
-	if (oper1->type > oper2->type)
-		return 1;
+	if (for_insert) {
+		if (oper1->type < oper2->type)
+			return -1;
+		if (oper1->type > oper2->type)
+			return 1;
+	}
+	return 0;
+}
+
+static int qgroup_oper_exists(struct btrfs_fs_info *fs_info,
+			      struct btrfs_qgroup_operation *oper)
+{
+	struct rb_node *n;
+	struct btrfs_qgroup_operation *cur;
+	int cmp;
+
+	spin_lock(&fs_info->qgroup_op_lock);
+	n = fs_info->qgroup_op_tree.rb_node;
+	while (n) {
+		cur = rb_entry(n, struct btrfs_qgroup_operation, n);
+		cmp = comp_oper(cur, oper, 0);
+		if (cmp < 0) {
+			n = n->rb_right;
+		} else if (cmp) {
+			n = n->rb_left;
+		} else {
+			spin_unlock(&fs_info->qgroup_op_lock);
+			return -EEXIST;
+		}
+	}
+	spin_unlock(&fs_info->qgroup_op_lock);
 	return 0;
 }
 
@@ -1236,7 +1263,7 @@ static int insert_qgroup_oper(struct btrfs_fs_info *fs_info,
 	while (*p) {
 		parent = *p;
 		cur = rb_entry(parent, struct btrfs_qgroup_operation, n);
-		cmp = comp_oper(cur, oper);
+		cmp = comp_oper(cur, oper, 1);
 		if (cmp < 0) {
 			p = &(*p)->rb_right;
 		} else if (cmp) {
@@ -1290,7 +1317,25 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 	oper->seq = atomic_inc_return(&fs_info->qgroup_op_seq);
 	INIT_LIST_HEAD(&oper->elem.list);
 	oper->elem.seq = 0;
+
 	trace_btrfs_qgroup_record_ref(oper);
+
+	if (type == BTRFS_QGROUP_OPER_SUB_SUBTREE) {
+		/*
+		 * If any operation for this bytenr/ref_root combo
+		 * exists, then we know it's not exclusively owned and
+		 * shouldn't be queued up.
+		 *
+		 * XXX: Do other operations need to search for
+		 * SUB_SHARED opers queued up on this bytenr? What do
+		 * they do in that case?
+		 */
+		if (qgroup_oper_exists(fs_info, oper)) {
+			kfree(oper);
+			return 0;
+		}
+	}
+
 	ret = insert_qgroup_oper(fs_info, oper);
 	if (ret) {
 		/* Shouldn't happen so have an assert for developers */
@@ -1883,6 +1928,112 @@ out:
 }
 
 /*
+ * Process a reference to a shared subtree. This type of operation is
+ * queued during snapshot removal when we encounter extents which are
+ * shared between more than one root.
+ */
+static int qgroup_subtree_accounting(struct btrfs_trans_handle *trans,
+				     struct btrfs_fs_info *fs_info,
+				     struct btrfs_qgroup_operation *oper)
+{
+	struct ulist *roots = NULL;
+	struct ulist_node *unode;
+	struct ulist_iterator uiter;
+	struct btrfs_qgroup_list *glist;
+	struct ulist *parents;
+	int ret = 0;
+	int found;
+	struct btrfs_qgroup *qg;
+	u64 root_obj = 0;
+	struct seq_list elem = {};
+
+	parents = ulist_alloc(GFP_NOFS);
+	if (!parents)
+		return -ENOMEM;
+
+	btrfs_get_tree_mod_seq(fs_info, &elem);
+	ret = btrfs_find_all_roots(trans, fs_info, oper->bytenr,
+				   elem.seq, &roots);
+	btrfs_put_tree_mod_seq(fs_info, &elem);
+	if (ret < 0)
+		return ret;
+
+	found = 0;
+	ULIST_ITER_INIT(&uiter);
+	while ((unode = ulist_next(roots, &uiter))) {
+		/*
+		 * If we find our ref root then that means all refs
+		 * this extent has to the root have not yet been
+		 * deleted. In that case, we do nothing and let the
+		 * last ref for this bytenr drive our update.
+		 *
+		 * This can happen for example if an extent is
+		 * referenced multiple times in a snapshot (clone,
+		 * etc). If we are in the middle of snapshot removal,
+		 * queued updates for such an extent will find the
+		 * root if we have not yet finished removing the
+		 * snapshot.
+		 */
+		if (unode->val == oper->ref_root)
+			goto out;
+
+		if (!found)
+			root_obj = unode->val;
+		found++;
+	}
+
+	if (found == 1) {
+		BUG_ON(!root_obj);
+
+		spin_lock(&fs_info->qgroup_lock);
+		qg = find_qgroup_rb(fs_info, root_obj);
+		if (!qg)
+			goto out_unlock;
+
+		qg->excl += oper->num_bytes;
+		qg->excl_cmpr += oper->num_bytes;
+		qgroup_dirty(fs_info, qg);
+
+		/*
+		 * Adjust counts for parent groups. First we find all
+		 * parents, then in the 2nd loop we do the adjustment
+		 * while adding parents of the parents to our ulist.
+		 */
+		list_for_each_entry(glist, &qg->groups, next_group) {
+			ret = ulist_add(parents, glist->group->qgroupid,
+					ptr_to_u64(glist->group), GFP_ATOMIC);
+			if (ret < 0)
+				goto out_unlock;
+		}
+
+		ULIST_ITER_INIT(&uiter);
+		while ((unode = ulist_next(parents, &uiter))) {
+			qg = u64_to_ptr(unode->aux);
+			qg->excl += oper->num_bytes;
+			qg->excl_cmpr += oper->num_bytes;
+			qgroup_dirty(fs_info, qg);
+
+			/* Add any parents of the parents */
+			list_for_each_entry(glist, &qg->groups, next_group) {
+				ret = ulist_add(parents, glist->group->qgroupid,
+						ptr_to_u64(glist->group),
+						GFP_ATOMIC);
+				if (ret < 0)
+					goto out_unlock;
+			}
+		}
+
+out_unlock:
+		spin_unlock(&fs_info->qgroup_lock);
+	}
+
+out:
+	ulist_free(roots);
+	ulist_free(parents);
+	return ret;
+}
+
+/*
  * btrfs_qgroup_account_ref is called for every ref that is added to or deleted
  * from the fs. First, all roots referencing the extent are searched, and
  * then the space is accounted accordingly to the different roots. The
@@ -1921,6 +2072,9 @@ static int btrfs_qgroup_account(struct btrfs_trans_handle *trans,
 	case BTRFS_QGROUP_OPER_SUB_SHARED:
 		ret = qgroup_shared_accounting(trans, fs_info, oper);
 		break;
+	case BTRFS_QGROUP_OPER_SUB_SUBTREE:
+		ret = qgroup_subtree_accounting(trans, fs_info, oper);
+		break;
 	default:
 		ASSERT(0);
 	}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 5952ff1..18cc68c 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -44,6 +44,7 @@ enum btrfs_qgroup_operation_type {
 	BTRFS_QGROUP_OPER_ADD_SHARED,
 	BTRFS_QGROUP_OPER_SUB_EXCL,
 	BTRFS_QGROUP_OPER_SUB_SHARED,
+	BTRFS_QGROUP_OPER_SUB_SUBTREE,
 };
 
 struct btrfs_qgroup_operation {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index b8774b3..96daac6 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1125,7 +1125,8 @@ DEFINE_EVENT(btrfs__workqueue_done, btrfs_workqueue_destroy,
 		{ BTRFS_QGROUP_OPER_ADD_EXCL, 	"OPER_ADD_EXCL" },	\
 		{ BTRFS_QGROUP_OPER_ADD_SHARED, "OPER_ADD_SHARED" },	\
 		{ BTRFS_QGROUP_OPER_SUB_EXCL, 	"OPER_SUB_EXCL" },	\
-		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" })
+		{ BTRFS_QGROUP_OPER_SUB_SHARED,	"OPER_SUB_SHARED" },	\
+		{ BTRFS_QGROUP_OPER_SUB_SUBTREE,"OPER_SUB_SUBTREE" })
 
 DECLARE_EVENT_CLASS(btrfs_qgroup_oper,
 
-- 
1.8.4.5


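To make the control flow of account_shared_subtree() and adjust_slots_upwards() above easier to follow, here is a simplified, self-contained model of the same walk: an explicit-stack depth-first traversal where every level keeps a slot cursor, leaves are consumed whole, and exhausted levels are popped until a level with a remaining slot is found. This is toy userspace C with a hypothetical node layout, not the kernel code:

#include <stdio.h>

#define MAX_LEVEL 8

struct node {
	int nritems;		/* children, or items in a leaf */
	struct node **child;	/* NULL for leaves */
};

/* stands in for account_leaf_items() */
static void visit_leaf(struct node *leaf)
{
	printf("leaf with %d items\n", leaf->nritems);
}

/*
 * Model of adjust_slots_upwards(): advance the cursors, popping
 * fully-visited levels (and always popping the leaf, which was
 * consumed whole). Returns 1 once the root's cursor runs past
 * its last slot.
 */
static int advance(struct node **nodes, int *slots, int root_level)
{
	int level;

	for (level = 0; level <= root_level; level++) {
		slots[level]++;
		if (level == 0 || slots[level] >= nodes[level]->nritems) {
			if (level == root_level)
				break;	/* never pop the root */
			nodes[level] = NULL;
			slots[level] = 0;
		} else {
			break;	/* valid slot to descend from */
		}
	}
	return slots[root_level] >= nodes[root_level]->nritems;
}

static void walk_subtree(struct node *root, int root_level)
{
	/* assumes root_level > 0; the kernel handles a lone leaf
	 * before ever building a path */
	struct node *nodes[MAX_LEVEL] = { NULL };
	int slots[MAX_LEVEL] = { 0 };
	int level;

	nodes[root_level] = root;
walk_down:
	level = root_level;
	while (level >= 0) {
		if (!nodes[level]) {
			/* fetch the child the parent's cursor points
			 * at; this is where the kernel reads and
			 * accounts a new extent block */
			nodes[level] =
				nodes[level + 1]->child[slots[level + 1]];
			slots[level] = 0;
		}
		if (level == 0) {
			visit_leaf(nodes[0]);
			if (advance(nodes, slots, root_level))
				return;		/* search complete */
			goto walk_down;
		}
		level--;
	}
}

int main(void)
{
	struct node l1 = { 3, NULL }, l2 = { 2, NULL };
	struct node *kids[] = { &l1, &l2 };
	struct node root = { 2, kids };

	walk_subtree(&root, 1);	/* visits both leaves, then stops */
	return 0;
}

The kernel version additionally handles locking, reads blocks from disk on demand, and records a qgroup operation for every block it pulls in; the model keeps only the cursor mechanics.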
