* [PATCH 00/17] improve the block time during the transaction commit
@ 2013-05-15  7:48 Miao Xie
  2013-05-15  7:48 ` [PATCH 01/17] Btrfs: fix accessing a freed tree root Miao Xie
                   ` (16 more replies)
  0 siblings, 17 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

This patchset addresses the problem that a transaction may be blocked for a
long time while it is being committed if there is heavy I/O.

In this patchset,
- 0001-0005, 0007, 0011-0012 are assorted fixes or code cleanup patches.
- 0006, 0008-0010 introduce per-subvolume delalloc inode list and ordered
  extent list, which can reduce the flush time when we create snapshots.
- 0013-0016 improve the block time during the transaction commit by removing
  the while loop at the beginning of the transaction commit.
- 0017 improves the readability of the code.

Miao Xie (17):
  Btrfs: fix accessing a freed tree root
  Btrfs: fix unprotected root node of the subvolume's inode rb-tree
  Btrfs: pause the space balance when remounting to R/O
  Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
  Btrfs: cleanup the similar code of the fs root read
  Btrfs: introduce grab/put functions for the root of the fs/file tree
  Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
  Btrfs: introduce per-subvolume delalloc inode list
  Btrfs: introduce per-subvolume ordered extent list
  Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
  Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
  Btrfs: remove the code for the impossible case in cleanup_transaction()
  Btrfs: don't wait for all the writers circularly during the transaction commit
  Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
  Btrfs: remove unnecessary variant ->num_joined in btrfs_transaction structure
  Btrfs: remove the time check in btrfs_commit_transaction()
  Btrfs: make the state of the transaction more readable

 fs/btrfs/ctree.h        |  55 +++++--
 fs/btrfs/dev-replace.c  |   6 +-
 fs/btrfs/disk-io.c      | 425 +++++++++++++++++++++++++++---------------------
 fs/btrfs/disk-io.h      |  32 +++-
 fs/btrfs/extent-tree.c  |  20 +--
 fs/btrfs/inode.c        | 180 ++++++++++++++------
 fs/btrfs/ioctl.c        |   6 +
 fs/btrfs/ordered-data.c | 109 +++++++++----
 fs/btrfs/ordered-data.h |   2 +
 fs/btrfs/relocation.c   |   9 +-
 fs/btrfs/root-tree.c    | 170 ++++++-------------
 fs/btrfs/super.c        |   3 +-
 fs/btrfs/transaction.c  | 271 ++++++++++++++++--------------
 fs/btrfs/transaction.h  |  49 ++++--
 fs/btrfs/tree-log.c     |   3 +-
 fs/btrfs/volumes.c      |  13 +-
 fs/btrfs/volumes.h      |   1 +
 17 files changed, 791 insertions(+), 563 deletions(-)

-- 
1.8.1.4



* [PATCH 01/17] Btrfs: fix accessing a freed tree root
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 02/17] Btrfs: fix unprotected root node of the subvolume's inode rb-tree Miao Xie
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.
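
As a rough illustration (a minimal userspace sketch with hypothetical names,
not btrfs code), the ordering the fix enforces is: tear down any state that
still dereferences the root before the step that can hand the root to the
cleaner.

#include <stdio.h>
#include <stdlib.h>

struct root { int alive; };
struct inode { struct root *root; void *delayed_node; };

static void remove_delayed_node(struct inode *inode)
{
	if (inode->delayed_node) {
		if (!inode->root->alive)
			fprintf(stderr, "bug: root already destroyed\n");
		free(inode->delayed_node);
		inode->delayed_node = NULL;
	}
}

static void inode_tree_del(struct inode *inode)
{
	/* dropping the last inode lets the cleaner destroy the root */
	inode->root->alive = 0;
}

static void evict_inode(struct inode *inode)
{
	remove_delayed_node(inode);	/* new order: root is still alive */
}

static void destroy_inode(struct inode *inode)
{
	inode_tree_del(inode);
	/* the old order called remove_delayed_node() here, after the
	 * root may already have been handed to the cleaner */
}

int main(void)
{
	struct root r = { 1 };
	struct inode i = { &r, malloc(16) };

	evict_inode(&i);
	destroy_inode(&i);
	return 0;
}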

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1669c3b..7f6e78a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4723,6 +4723,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_end_transaction(trans, root);
 	btrfs_btree_balance_dirty(root);
 no_delete:
+	btrfs_remove_delayed_node(inode);
 	clear_inode(inode);
 	return;
 }
@@ -7978,7 +7979,6 @@ void btrfs_destroy_inode(struct inode *inode)
 	inode_tree_del(inode);
 	btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
-	btrfs_remove_delayed_node(inode);
 	call_rcu(&inode->i_rcu, btrfs_i_callback);
 }
 
-- 
1.8.1.4



* [PATCH 02/17] Btrfs: fix unprotected root node of the subvolume's inode rb-tree
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
  2013-05-15  7:48 ` [PATCH 01/17] Btrfs: fix accessing a freed tree root Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 03/17] Btrfs: pause the space balance when remounting to R/O Miao Xie
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.
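
For illustration, a minimal userspace sketch of the pattern (a pthread mutex
standing in for the spinlock; hypothetical names, not btrfs code): the tree's
root pointer is loaded only after the lock protecting the tree is taken,
because concurrent insertions may rebalance the tree and rewrite it.

#include <pthread.h>
#include <stddef.h>

struct node { struct node *left, *right; long key; };

struct tree {
	pthread_mutex_t lock;
	struct node *root;	/* rewritten whenever the tree is rebalanced */
};

static struct node *tree_lookup(struct tree *t, long key)
{
	struct node *n;

	pthread_mutex_lock(&t->lock);
	n = t->root;		/* loading this before the lock would race
				 * with concurrent insertion/rebalancing,
				 * which is the bug being fixed */
	while (n && n->key != key)
		n = key < n->key ? n->left : n->right;
	pthread_mutex_unlock(&t->lock);
	return n;
}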

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/inode.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7f6e78a..bf5c399 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4839,14 +4839,13 @@ static void inode_tree_add(struct inode *inode)
 	struct rb_node **p;
 	struct rb_node *parent;
 	u64 ino = btrfs_ino(inode);
-again:
-	p = &root->inode_tree.rb_node;
-	parent = NULL;
 
 	if (inode_unhashed(inode))
 		return;
-
+again:
+	parent = NULL;
 	spin_lock(&root->inode_lock);
+	p = &root->inode_tree.rb_node;
 	while (*p) {
 		parent = *p;
 		entry = rb_entry(parent, struct btrfs_inode, rb_node);
-- 
1.8.1.4



* [PATCH 03/17] Btrfs: pause the space balance when remounting to R/O
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
  2013-05-15  7:48 ` [PATCH 01/17] Btrfs: fix accessing a freed tree root Miao Xie
  2013-05-15  7:48 ` [PATCH 02/17] Btrfs: fix unprotected root node of the subvolume's inode rb-tree Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 04/17] Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() Miao Xie
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a4807ce..f0857e0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1263,6 +1263,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 
 		btrfs_dev_replace_suspend_for_unmount(fs_info);
 		btrfs_scrub_cancel(fs_info);
+		btrfs_pause_balance(fs_info);
 
 		ret = btrfs_commit_super(root);
 		if (ret)
-- 
1.8.1.4



* [PATCH 04/17] Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (2 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 03/17] Btrfs: pause the space balance when remounting to R/O Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 05/17] Btrfs: cleanup the similar code of the fs root read Miao Xie
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/disk-io.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2a9ae38..8c1e4fb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1513,7 +1513,6 @@ struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
 	}
 
 	root->commit_root = btrfs_root_node(root);
-	BUG_ON(!root->node); /* -ENOMEM */
 out:
 	if (location->objectid != BTRFS_TREE_LOG_OBJECTID) {
 		root->ref_cows = 1;
-- 
1.8.1.4



* [PATCH 05/17] Btrfs: cleanup the similar code of the fs root read
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (3 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 04/17] Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree Miao Xie
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some work is done twice unnecessarily. For example, we are
sure that all the roots found by
  btrfs_find_orphan_roots()
have their orphan items, so there is no need to check for the orphan
item again.

So clean this up.
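
For illustration, a rough userspace sketch of the unified lookup semantics
(hypothetical data layout, not the real btree code): an exact search when the
offset is known, and a "highest offset for this objectid" search when the
offset is (u64)-1.

#include <stdint.h>
#include <stddef.h>

struct key { uint64_t objectid; uint64_t offset; };

/* items[] plays the role of the sorted root items in a btree leaf */
static const struct key *find_root_key(const struct key *items, size_t n,
				       const struct key *search)
{
	const struct key *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (items[i].objectid != search->objectid)
			continue;
		if (search->offset != UINT64_MAX) {
			if (items[i].offset == search->offset)
				return &items[i];	/* exact match */
		} else if (!best || items[i].offset > best->offset) {
			best = &items[i];	/* highest offset so far */
		}
	}
	return best;	/* NULL means not found, like ret > 0 above */
}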

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |   6 +-
 fs/btrfs/disk-io.c     | 282 +++++++++++++++++++++++++------------------------
 fs/btrfs/disk-io.h     |  11 +-
 fs/btrfs/extent-tree.c |   6 +-
 fs/btrfs/relocation.c  |   5 +-
 fs/btrfs/root-tree.c   | 170 ++++++++++-------------------
 fs/btrfs/tree-log.c    |   3 +-
 fs/btrfs/volumes.c     |  13 ++-
 fs/btrfs/volumes.h     |   1 +
 9 files changed, 228 insertions(+), 269 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f9a48fd..845b77f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3369,9 +3369,9 @@ int __must_check btrfs_update_root(struct btrfs_trans_handle *trans,
 				   struct btrfs_root_item *item);
 void btrfs_read_root_item(struct extent_buffer *eb, int slot,
 			  struct btrfs_root_item *item);
-int btrfs_find_last_root(struct btrfs_root *root, u64 objectid, struct
-			 btrfs_root_item *item, struct btrfs_key *key);
-int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid);
+int btrfs_find_root(struct btrfs_root *root, struct btrfs_key *search_key,
+		    struct btrfs_path *path, struct btrfs_root_item *root_item,
+		    struct btrfs_key *root_key);
 int btrfs_find_orphan_roots(struct btrfs_root *tree_root);
 void btrfs_set_root_node(struct btrfs_root_item *item,
 			 struct extent_buffer *node);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8c1e4fb..42d6ba2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1234,39 +1234,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	spin_lock_init(&root->root_item_lock);
 }
 
-static int __must_check find_and_setup_root(struct btrfs_root *tree_root,
-					    struct btrfs_fs_info *fs_info,
-					    u64 objectid,
-					    struct btrfs_root *root)
-{
-	int ret;
-	u32 blocksize;
-	u64 generation;
-
-	__setup_root(tree_root->nodesize, tree_root->leafsize,
-		     tree_root->sectorsize, tree_root->stripesize,
-		     root, fs_info, objectid);
-	ret = btrfs_find_last_root(tree_root, objectid,
-				   &root->root_item, &root->root_key);
-	if (ret > 0)
-		return -ENOENT;
-	else if (ret < 0)
-		return ret;
-
-	generation = btrfs_root_generation(&root->root_item);
-	blocksize = btrfs_level_size(root, btrfs_root_level(&root->root_item));
-	root->commit_root = NULL;
-	root->node = read_tree_block(root, btrfs_root_bytenr(&root->root_item),
-				     blocksize, generation);
-	if (!root->node || !btrfs_buffer_uptodate(root->node, generation, 0)) {
-		free_extent_buffer(root->node);
-		root->node = NULL;
-		return -EIO;
-	}
-	root->commit_root = btrfs_root_node(root);
-	return 0;
-}
-
 static struct btrfs_root *btrfs_alloc_root(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root = kzalloc(sizeof(*root), GFP_NOFS);
@@ -1451,70 +1418,73 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
-					       struct btrfs_key *location)
+struct btrfs_root *btrfs_read_tree_root(struct btrfs_root *tree_root,
+					struct btrfs_key *key)
 {
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info = tree_root->fs_info;
 	struct btrfs_path *path;
-	struct extent_buffer *l;
 	u64 generation;
 	u32 blocksize;
-	int ret = 0;
-	int slot;
+	int ret;
 
-	root = btrfs_alloc_root(fs_info);
-	if (!root)
+	path = btrfs_alloc_path();
+	if (!path)
 		return ERR_PTR(-ENOMEM);
-	if (location->offset == (u64)-1) {
-		ret = find_and_setup_root(tree_root, fs_info,
-					  location->objectid, root);
-		if (ret) {
-			kfree(root);
-			return ERR_PTR(ret);
-		}
-		goto out;
+
+	root = btrfs_alloc_root(fs_info);
+	if (!root) {
+		ret = -ENOMEM;
+		goto alloc_fail;
 	}
 
 	__setup_root(tree_root->nodesize, tree_root->leafsize,
 		     tree_root->sectorsize, tree_root->stripesize,
-		     root, fs_info, location->objectid);
+		     root, fs_info, key->objectid);
 
-	path = btrfs_alloc_path();
-	if (!path) {
-		kfree(root);
-		return ERR_PTR(-ENOMEM);
-	}
-	ret = btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
-	if (ret == 0) {
-		l = path->nodes[0];
-		slot = path->slots[0];
-		btrfs_read_root_item(l, slot, &root->root_item);
-		memcpy(&root->root_key, location, sizeof(*location));
-	}
-	btrfs_free_path(path);
+	ret = btrfs_find_root(tree_root, key, path,
+			      &root->root_item, &root->root_key);
 	if (ret) {
-		kfree(root);
 		if (ret > 0)
 			ret = -ENOENT;
-		return ERR_PTR(ret);
+		goto find_fail;
 	}
 
 	generation = btrfs_root_generation(&root->root_item);
 	blocksize = btrfs_level_size(root, btrfs_root_level(&root->root_item));
 	root->node = read_tree_block(root, btrfs_root_bytenr(&root->root_item),
 				     blocksize, generation);
-	if (!root->node || !extent_buffer_uptodate(root->node)) {
-		ret = (!root->node) ? -ENOMEM : -EIO;
-
-		free_extent_buffer(root->node);
-		kfree(root);
-		return ERR_PTR(ret);
+	if (!root->node) {
+		ret = -ENOMEM;
+		goto find_fail;
+	} else if (!btrfs_buffer_uptodate(root->node, generation, 0)) {
+		ret = -EIO;
+		goto read_fail;
 	}
-
 	root->commit_root = btrfs_root_node(root);
 out:
-	if (location->objectid != BTRFS_TREE_LOG_OBJECTID) {
+	btrfs_free_path(path);
+	return root;
+
+read_fail:
+	free_extent_buffer(root->node);
+find_fail:
+	kfree(root);
+alloc_fail:
+	root = ERR_PTR(ret);
+	goto out;
+}
+
+struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
+				      struct btrfs_key *location)
+{
+	struct btrfs_root *root;
+
+	root = btrfs_read_tree_root(tree_root, location);
+	if (IS_ERR(root))
+		return root;
+
+	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 		root->ref_cows = 1;
 		btrfs_check_and_init_root_item(&root->root_item);
 	}
@@ -1522,6 +1492,66 @@ out:
 	return root;
 }
 
+int btrfs_init_fs_root(struct btrfs_root *root)
+{
+	int ret;
+
+	root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS);
+	root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned),
+					GFP_NOFS);
+	if (!root->free_ino_pinned || !root->free_ino_ctl) {
+		ret = -ENOMEM;
+		goto fail;
+	}
+
+	btrfs_init_free_ino_ctl(root);
+	mutex_init(&root->fs_commit_mutex);
+	spin_lock_init(&root->cache_lock);
+	init_waitqueue_head(&root->cache_wait);
+
+	ret = get_anon_bdev(&root->anon_dev);
+	if (ret)
+		goto fail;
+	return 0;
+fail:
+	kfree(root->free_ino_ctl);
+	kfree(root->free_ino_pinned);
+	return ret;
+}
+
+struct btrfs_root *btrfs_lookup_fs_root(struct btrfs_fs_info *fs_info,
+					u64 root_id)
+{
+	struct btrfs_root *root;
+
+	spin_lock(&fs_info->fs_roots_radix_lock);
+	root = radix_tree_lookup(&fs_info->fs_roots_radix,
+				 (unsigned long)root_id);
+	spin_unlock(&fs_info->fs_roots_radix_lock);
+	return root;
+}
+
+int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
+			 struct btrfs_root *root)
+{
+	int ret;
+
+	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
+	if (ret)
+		return ret;
+
+	spin_lock(&fs_info->fs_roots_radix_lock);
+	ret = radix_tree_insert(&fs_info->fs_roots_radix,
+				(unsigned long)root->root_key.objectid,
+				root);
+	if (ret == 0)
+		root->in_radix = 1;
+	spin_unlock(&fs_info->fs_roots_radix_lock);
+	radix_tree_preload_end();
+
+	return ret;
+}
+
 struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
 					      struct btrfs_key *location)
 {
@@ -1542,58 +1572,30 @@ struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
 		return fs_info->quota_root ? fs_info->quota_root :
 					     ERR_PTR(-ENOENT);
 again:
-	spin_lock(&fs_info->fs_roots_radix_lock);
-	root = radix_tree_lookup(&fs_info->fs_roots_radix,
-				 (unsigned long)location->objectid);
-	spin_unlock(&fs_info->fs_roots_radix_lock);
+	root = btrfs_lookup_fs_root(fs_info, location->objectid);
 	if (root)
 		return root;
 
-	root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location);
+	root = btrfs_read_fs_root(fs_info->tree_root, location);
 	if (IS_ERR(root))
 		return root;
 
-	root->free_ino_ctl = kzalloc(sizeof(*root->free_ino_ctl), GFP_NOFS);
-	root->free_ino_pinned = kzalloc(sizeof(*root->free_ino_pinned),
-					GFP_NOFS);
-	if (!root->free_ino_pinned || !root->free_ino_ctl) {
-		ret = -ENOMEM;
+	if (btrfs_root_refs(&root->root_item) == 0) {
+		ret = -ENOENT;
 		goto fail;
 	}
 
-	btrfs_init_free_ino_ctl(root);
-	mutex_init(&root->fs_commit_mutex);
-	spin_lock_init(&root->cache_lock);
-	init_waitqueue_head(&root->cache_wait);
-
-	ret = get_anon_bdev(&root->anon_dev);
+	ret = btrfs_init_fs_root(root);
 	if (ret)
 		goto fail;
 
-	if (btrfs_root_refs(&root->root_item) == 0) {
-		ret = -ENOENT;
-		goto fail;
-	}
-
 	ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid);
 	if (ret < 0)
 		goto fail;
 	if (ret == 0)
 		root->orphan_item_inserted = 1;
 
-	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
-	if (ret)
-		goto fail;
-
-	spin_lock(&fs_info->fs_roots_radix_lock);
-	ret = radix_tree_insert(&fs_info->fs_roots_radix,
-				(unsigned long)root->root_key.objectid,
-				root);
-	if (ret == 0)
-		root->in_radix = 1;
-
-	spin_unlock(&fs_info->fs_roots_radix_lock);
-	radix_tree_preload_end();
+	ret = btrfs_insert_fs_root(fs_info, root);
 	if (ret) {
 		if (ret == -EEXIST) {
 			free_fs_root(root);
@@ -1601,10 +1603,6 @@ again:
 		}
 		goto fail;
 	}
-
-	ret = btrfs_find_dead_roots(fs_info->tree_root,
-				    root->root_key.objectid);
-	WARN_ON(ret);
 	return root;
 fail:
 	free_fs_root(root);
@@ -2047,7 +2045,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
 		list_del(&gang[0]->root_list);
 
 		if (gang[0]->in_radix) {
-			btrfs_free_fs_root(fs_info, gang[0]);
+			btrfs_drop_and_free_fs_root(fs_info, gang[0]);
 		} else {
 			free_extent_buffer(gang[0]->node);
 			free_extent_buffer(gang[0]->commit_root);
@@ -2062,7 +2060,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
 		if (!ret)
 			break;
 		for (i = 0; i < ret; i++)
-			btrfs_free_fs_root(fs_info, gang[i]);
+			btrfs_drop_and_free_fs_root(fs_info, gang[i]);
 	}
 }
 
@@ -2094,14 +2092,8 @@ int open_ctree(struct super_block *sb,
 	int backup_index = 0;
 
 	tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info);
-	extent_root = fs_info->extent_root = btrfs_alloc_root(fs_info);
-	csum_root = fs_info->csum_root = btrfs_alloc_root(fs_info);
 	chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info);
-	dev_root = fs_info->dev_root = btrfs_alloc_root(fs_info);
-	quota_root = fs_info->quota_root = btrfs_alloc_root(fs_info);
-
-	if (!tree_root || !extent_root || !csum_root ||
-	    !chunk_root || !dev_root || !quota_root) {
+	if (!tree_root || !chunk_root) {
 		err = -ENOMEM;
 		goto fail;
 	}
@@ -2651,33 +2643,44 @@ retry_root_backup:
 	btrfs_set_root_node(&tree_root->root_item, tree_root->node);
 	tree_root->commit_root = btrfs_root_node(tree_root);
 
-	ret = find_and_setup_root(tree_root, fs_info,
-				  BTRFS_EXTENT_TREE_OBJECTID, extent_root);
-	if (ret)
+	location.objectid = BTRFS_EXTENT_TREE_OBJECTID;
+	location.type = BTRFS_ROOT_ITEM_KEY;
+	location.offset = 0;
+
+	extent_root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(extent_root)) {
+		ret = PTR_ERR(extent_root);
 		goto recovery_tree_root;
+	}
 	extent_root->track_dirty = 1;
+	fs_info->extent_root = extent_root;
 
-	ret = find_and_setup_root(tree_root, fs_info,
-				  BTRFS_DEV_TREE_OBJECTID, dev_root);
-	if (ret)
+	location.objectid = BTRFS_DEV_TREE_OBJECTID;
+	dev_root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(dev_root)) {
+		ret = PTR_ERR(dev_root);
 		goto recovery_tree_root;
+	}
 	dev_root->track_dirty = 1;
+	fs_info->dev_root = dev_root;
+	btrfs_init_devices_late(fs_info);
 
-	ret = find_and_setup_root(tree_root, fs_info,
-				  BTRFS_CSUM_TREE_OBJECTID, csum_root);
-	if (ret)
+	location.objectid = BTRFS_CSUM_TREE_OBJECTID;
+	csum_root = btrfs_read_tree_root(tree_root, &location);
+	if (IS_ERR(csum_root)) {
+		ret = PTR_ERR(csum_root);
 		goto recovery_tree_root;
+	}
 	csum_root->track_dirty = 1;
+	fs_info->csum_root = csum_root;
 
-	ret = find_and_setup_root(tree_root, fs_info,
-				  BTRFS_QUOTA_TREE_OBJECTID, quota_root);
-	if (ret) {
-		kfree(quota_root);
-		quota_root = fs_info->quota_root = NULL;
-	} else {
+	location.objectid = BTRFS_QUOTA_TREE_OBJECTID;
+	quota_root = btrfs_read_tree_root(tree_root, &location);
+	if (!IS_ERR(quota_root)) {
 		quota_root->track_dirty = 1;
 		fs_info->quota_enabled = 1;
 		fs_info->pending_quota_state = 1;
+		fs_info->quota_root = quota_root;
 	}
 
 	fs_info->generation = generation;
@@ -2830,7 +2833,7 @@ retry_root_backup:
 
 	location.objectid = BTRFS_FS_TREE_OBJECTID;
 	location.type = BTRFS_ROOT_ITEM_KEY;
-	location.offset = (u64)-1;
+	location.offset = 0;
 
 	fs_info->fs_root = btrfs_read_fs_root_no_name(fs_info, &location);
 	if (!fs_info->fs_root)
@@ -3379,7 +3382,9 @@ int write_ctree_super(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
-void btrfs_free_fs_root(struct btrfs_fs_info *fs_info, struct btrfs_root *root)
+/* Drop a fs root from the radix tree and free it. */
+void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
+				  struct btrfs_root *root)
 {
 	spin_lock(&fs_info->fs_roots_radix_lock);
 	radix_tree_delete(&fs_info->fs_roots_radix,
@@ -3413,6 +3418,11 @@ static void free_fs_root(struct btrfs_root *root)
 	kfree(root);
 }
 
+void btrfs_free_fs_root(struct btrfs_root *root)
+{
+	free_fs_root(root);
+}
+
 int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
 {
 	u64 root_objectid = 0;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index be69ce1..534d583 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -63,14 +63,19 @@ struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
 int btrfs_commit_super(struct btrfs_root *root);
 struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
 					    u64 bytenr, u32 blocksize);
-struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
-					       struct btrfs_key *location);
+struct btrfs_root *btrfs_read_fs_root(struct btrfs_root *tree_root,
+				      struct btrfs_key *location);
+int btrfs_init_fs_root(struct btrfs_root *root);
+int btrfs_insert_fs_root(struct btrfs_fs_info *fs_info,
+			 struct btrfs_root *root);
 struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
 					      struct btrfs_key *location);
 int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info);
 void btrfs_btree_balance_dirty(struct btrfs_root *root);
 void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
-void btrfs_free_fs_root(struct btrfs_fs_info *fs_info, struct btrfs_root *root);
+void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
+				 struct btrfs_root *root);
+void btrfs_free_fs_root(struct btrfs_root *root);
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7f7c7a6..08e42c8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7441,8 +7441,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	}
 
 	if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
-		ret = btrfs_find_last_root(tree_root, root->root_key.objectid,
-					   NULL, NULL);
+		ret = btrfs_find_root(tree_root, &root->root_key, path,
+				      NULL, NULL);
 		if (ret < 0) {
 			btrfs_abort_transaction(trans, tree_root, ret);
 			err = ret;
@@ -7459,7 +7459,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	}
 
 	if (root->in_radix) {
-		btrfs_free_fs_root(tree_root->fs_info, root);
+		btrfs_drop_and_free_fs_root(tree_root->fs_info, root);
 	} else {
 		free_extent_buffer(root->node);
 		free_extent_buffer(root->commit_root);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 704a1b8..baaa66b 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1355,8 +1355,7 @@ static struct btrfs_root *create_reloc_root(struct btrfs_trans_handle *trans,
 	BUG_ON(ret);
 	kfree(root_item);
 
-	reloc_root = btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
-						 &root_key);
+	reloc_root = btrfs_read_fs_root(root->fs_info->tree_root, &root_key);
 	BUG_ON(IS_ERR(reloc_root));
 	reloc_root->last_trans = trans->transid;
 	return reloc_root;
@@ -4271,7 +4270,7 @@ int btrfs_recover_relocation(struct btrfs_root *root)
 		    key.type != BTRFS_ROOT_ITEM_KEY)
 			break;
 
-		reloc_root = btrfs_read_fs_root_no_radix(root, &key);
+		reloc_root = btrfs_read_fs_root(root, &key);
 		if (IS_ERR(reloc_root)) {
 			err = PTR_ERR(reloc_root);
 			goto out;
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 5bf1ed5..79e6832 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -64,52 +64,59 @@ void btrfs_read_root_item(struct extent_buffer *eb, int slot,
 }
 
 /*
- * lookup the root with the highest offset for a given objectid.  The key we do
- * find is copied into 'key'.  If we find something return 0, otherwise 1, < 0
- * on error.
+ * btrfs_find_root - lookup the root by the key.
+ * root: the root of the root tree
+ * search_key: the key to search
+ * path: the path we search
+ * root_item: the root item of the tree we look for
+ * root_key: the real key of the tree we look for
+ *
+ * If ->offset of 'search_key' is -1ULL, it means we are not sure of the
+ * offset of the search key, so just look up the root with the highest
+ * offset for the given objectid.
+ *
+ * If we find something return 0, otherwise > 0, < 0 on error.
  */
-int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
-			struct btrfs_root_item *item, struct btrfs_key *key)
+int btrfs_find_root(struct btrfs_root *root, struct btrfs_key *search_key,
+		    struct btrfs_path *path, struct btrfs_root_item *root_item,
+		    struct btrfs_key *root_key)
 {
-	struct btrfs_path *path;
-	struct btrfs_key search_key;
 	struct btrfs_key found_key;
 	struct extent_buffer *l;
 	int ret;
 	int slot;
 
-	search_key.objectid = objectid;
-	search_key.type = BTRFS_ROOT_ITEM_KEY;
-	search_key.offset = (u64)-1;
-
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-	ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+	ret = btrfs_search_slot(NULL, root, search_key, path, 0, 0);
 	if (ret < 0)
-		goto out;
+		return ret;
 
-	BUG_ON(ret == 0);
-	if (path->slots[0] == 0) {
-		ret = 1;
-		goto out;
+	if (search_key->offset != -1ULL) {	/* the search key is exact */
+		if (ret > 0)
+			goto out;
+	} else {
+		BUG_ON(ret == 0);		/* Logical error */
+		if (path->slots[0] == 0)
+			goto out;
+		path->slots[0]--;
+		ret = 0;
 	}
+
 	l = path->nodes[0];
-	slot = path->slots[0] - 1;
+	slot = path->slots[0];
+
 	btrfs_item_key_to_cpu(l, &found_key, slot);
-	if (found_key.objectid != objectid ||
+	if (found_key.objectid != search_key->objectid ||
 	    found_key.type != BTRFS_ROOT_ITEM_KEY) {
 		ret = 1;
 		goto out;
 	}
-	if (item)
-		btrfs_read_root_item(l, slot, item);
-	if (key)
-		memcpy(key, &found_key, sizeof(found_key));
 
-	ret = 0;
+	if (root_item)
+		btrfs_read_root_item(l, slot, root_item);
+	if (root_key)
+		memcpy(root_key, &found_key, sizeof(found_key));
 out:
-	btrfs_free_path(path);
+	btrfs_release_path(path);
 	return ret;
 }
 
@@ -212,86 +219,6 @@ int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	return btrfs_insert_item(trans, root, key, item, sizeof(*item));
 }
 
-/*
- * at mount time we want to find all the old transaction snapshots that were in
- * the process of being deleted if we crashed.  This is any root item with an
- * offset lower than the latest root.  They need to be queued for deletion to
- * finish what was happening when we crashed.
- */
-int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid)
-{
-	struct btrfs_root *dead_root;
-	struct btrfs_root_item *ri;
-	struct btrfs_key key;
-	struct btrfs_key found_key;
-	struct btrfs_path *path;
-	int ret;
-	u32 nritems;
-	struct extent_buffer *leaf;
-	int slot;
-
-	key.objectid = objectid;
-	btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
-	key.offset = 0;
-	path = btrfs_alloc_path();
-	if (!path)
-		return -ENOMEM;
-
-again:
-	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
-	if (ret < 0)
-		goto err;
-	while (1) {
-		leaf = path->nodes[0];
-		nritems = btrfs_header_nritems(leaf);
-		slot = path->slots[0];
-		if (slot >= nritems) {
-			ret = btrfs_next_leaf(root, path);
-			if (ret)
-				break;
-			leaf = path->nodes[0];
-			nritems = btrfs_header_nritems(leaf);
-			slot = path->slots[0];
-		}
-		btrfs_item_key_to_cpu(leaf, &key, slot);
-		if (btrfs_key_type(&key) != BTRFS_ROOT_ITEM_KEY)
-			goto next;
-
-		if (key.objectid < objectid)
-			goto next;
-
-		if (key.objectid > objectid)
-			break;
-
-		ri = btrfs_item_ptr(leaf, slot, struct btrfs_root_item);
-		if (btrfs_disk_root_refs(leaf, ri) != 0)
-			goto next;
-
-		memcpy(&found_key, &key, sizeof(key));
-		key.offset++;
-		btrfs_release_path(path);
-		dead_root =
-			btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
-						    &found_key);
-		if (IS_ERR(dead_root)) {
-			ret = PTR_ERR(dead_root);
-			goto err;
-		}
-
-		ret = btrfs_add_dead_root(dead_root);
-		if (ret)
-			goto err;
-		goto again;
-next:
-		slot++;
-		path->slots[0]++;
-	}
-	ret = 0;
-err:
-	btrfs_free_path(path);
-	return ret;
-}
-
 int btrfs_find_orphan_roots(struct btrfs_root *tree_root)
 {
 	struct extent_buffer *leaf;
@@ -340,20 +267,29 @@ int btrfs_find_orphan_roots(struct btrfs_root *tree_root)
 		root_key.objectid = key.offset;
 		key.offset++;
 
-		root = btrfs_read_fs_root_no_name(tree_root->fs_info,
-						  &root_key);
-		if (!IS_ERR(root))
+		root = btrfs_read_fs_root(tree_root, &root_key);
+		if (IS_ERR(root)) {
+			err = PTR_ERR(root);
+			break;
+		}
+
+		if (btrfs_root_refs(&root->root_item) == 0) {
+			btrfs_add_dead_root(root);
 			continue;
+		}
 
-		ret = PTR_ERR(root);
-		if (ret != -ENOENT) {
-			err = ret;
+		err = btrfs_init_fs_root(root);
+		if (err) {
+			btrfs_free_fs_root(root);
 			break;
 		}
 
-		ret = btrfs_find_dead_roots(tree_root, root_key.objectid);
-		if (ret) {
-			err = ret;
+		root->orphan_item_inserted = 1;
+
+		err = btrfs_insert_fs_root(root->fs_info, root);
+		if (err) {
+			BUG_ON(err == -EEXIST);
+			btrfs_free_fs_root(root);
 			break;
 		}
 	}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c276ac9..a59724e 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4016,8 +4016,7 @@ again:
 		if (found_key.objectid != BTRFS_TREE_LOG_OBJECTID)
 			break;
 
-		log = btrfs_read_fs_root_no_radix(log_root_tree,
-						  &found_key);
+		log = btrfs_read_fs_root(log_root_tree, &found_key);
 		if (IS_ERR(log)) {
 			ret = PTR_ERR(log);
 			btrfs_error(fs_info, ret,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a191bac..746d092 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5397,7 +5397,6 @@ static struct btrfs_device *add_missing_dev(struct btrfs_root *root,
 		return NULL;
 	list_add(&device->dev_list,
 		 &fs_devices->devices);
-	device->dev_root = root->fs_info->dev_root;
 	device->devid = devid;
 	device->work.func = pending_bios_fn;
 	device->fs_devices = fs_devices;
@@ -5623,7 +5622,6 @@ static int read_one_dev(struct btrfs_root *root,
 	}
 
 	fill_device_from_item(leaf, dev_item, device);
-	device->dev_root = root->fs_info->dev_root;
 	device->in_fs_metadata = 1;
 	if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
@@ -5781,6 +5779,17 @@ error:
 	return ret;
 }
 
+void btrfs_init_devices_late(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	list_for_each_entry(device, &fs_devices->devices, dev_list)
+		device->dev_root = fs_info->dev_root;
+	mutex_unlock(&fs_devices->device_list_mutex);
+}
+
 static void __btrfs_reset_dev_stats(struct btrfs_device *dev)
 {
 	int i;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 845ccbb..7772c2e 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -301,6 +301,7 @@ int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 void btrfs_dev_stat_inc_and_print(struct btrfs_device *dev, int index);
 int btrfs_get_dev_stats(struct btrfs_root *root,
 			struct btrfs_ioctl_get_dev_stats *stats);
+void btrfs_init_devices_late(struct btrfs_fs_info *fs_info);
 int btrfs_init_dev_stats(struct btrfs_fs_info *fs_info);
 int btrfs_run_dev_stats(struct btrfs_trans_handle *trans,
 			struct btrfs_fs_info *fs_info);
-- 
1.8.1.4



* [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (4 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 05/17] Btrfs: cleanup the similar code of the fs root read Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-16  3:36   ` Liu Bo
  2013-05-15  7:48 ` [PATCH 07/17] Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context Miao Xie
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

The grab/put functions will be used in the next patch, which needs to
grab the root object and ensure it is not freed. We use a reference
counter instead of the srcu lock to avoid blocking the memory reclaim
task, which invokes synchronize_srcu().
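
As a rough illustration (a userspace sketch with C11 atomics standing in for
the kernel's atomic_t; the names are hypothetical, not the btrfs API), the
grab/put scheme works like this:

#include <stdatomic.h>
#include <stdlib.h>

struct fs_root {
	atomic_int refs;	/* set to 1 when the root is set up */
	/* ... tree state ... */
};

/* take a reference only if the root is still live (refs > 0) */
static struct fs_root *grab_fs_root(struct fs_root *root)
{
	int old = atomic_load(&root->refs);

	while (old > 0) {
		if (atomic_compare_exchange_weak(&root->refs, &old, old + 1))
			return root;	/* behaves like atomic_inc_not_zero() */
	}
	return NULL;			/* already on its way to being freed */
}

static void put_fs_root(struct fs_root *root)
{
	if (atomic_fetch_sub(&root->refs, 1) == 1)
		free(root);		/* last reference dropped */
}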

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/disk-io.c     |  5 +++--
 fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
 fs/btrfs/extent-tree.c |  2 +-
 4 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 845b77f..958ce6c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1739,6 +1739,7 @@ struct btrfs_root {
 	int force_cow;
 
 	spinlock_t root_item_lock;
+	atomic_t refs;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 42d6ba2..642c861 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_writers, 0);
 	atomic_set(&root->log_batch, 0);
 	atomic_set(&root->orphan_inodes, 0);
+	atomic_set(&root->refs, 1);
 	root->log_transid = 0;
 	root->last_log_commit = 0;
 	extent_io_tree_init(&root->dirty_log_pages,
@@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
 		} else {
 			free_extent_buffer(gang[0]->node);
 			free_extent_buffer(gang[0]->commit_root);
-			kfree(gang[0]);
+			btrfs_put_fs_root(gang[0]);
 		}
 	}
 
@@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
 	kfree(root->free_ino_ctl);
 	kfree(root->free_ino_pinned);
 	kfree(root->name);
-	kfree(root);
+	btrfs_put_fs_root(root);
 }
 
 void btrfs_free_fs_root(struct btrfs_root *root)
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 534d583..b71acd6e 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
 void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
 				 struct btrfs_root *root);
 void btrfs_free_fs_root(struct btrfs_root *root);
+
+/*
+ * This function is used to grab the root and avoid it being freed while
+ * we access it. But it doesn't ensure that the tree is not dropped.
+ *
+ * If you want to ensure the whole tree is safe, you should use
+ * 	fs_info->subvol_srcu
+ */
+static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
+{
+	if (atomic_inc_not_zero(&root->refs))
+		return root;
+	return NULL;
+}
+
+static inline void btrfs_put_fs_root(struct btrfs_root *root)
+{
+	if (atomic_dec_and_test(&root->refs))
+		kfree(root);
+}
+
 void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
 int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
 			  int atomic);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 08e42c8..08f9862 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	} else {
 		free_extent_buffer(root->node);
 		free_extent_buffer(root->commit_root);
-		kfree(root);
+		btrfs_put_fs_root(root);
 	}
 out_end_trans:
 	btrfs_end_transaction_throttle(trans, tree_root);
-- 
1.8.1.4



* [PATCH 07/17] Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (5 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 08/17] Btrfs: introduce per-subvolume delalloc inode list Miao Xie
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.
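
For illustration, a minimal userspace sketch of the pattern the fix uses (a
POSIX mutex standing in for the spinlock; hypothetical names, not btrfs
code): unlink the entry while the lock is held, drop the lock around the
call that may sleep, then retake it before touching the list again.

#include <pthread.h>
#include <unistd.h>

struct entry { struct entry *next; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *pending;

static void might_sleep_operation(struct entry *e)
{
	(void)e;
	usleep(1000);	/* stands in for btrfs_invalidate_inodes() */
}

static void drain_pending(void)
{
	pthread_mutex_lock(&lock);
	while (pending) {
		struct entry *e = pending;

		pending = e->next;		/* unlink while locked */
		pthread_mutex_unlock(&lock);	/* never sleep while locked */

		might_sleep_operation(e);

		pthread_mutex_lock(&lock);	/* retake before the next entry */
	}
	pthread_mutex_unlock(&lock);
}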

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/disk-io.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 642c861..724a0da 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3685,8 +3685,11 @@ static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
 					 ordered_operations);
 
 		list_del_init(&btrfs_inode->ordered_operations);
+		spin_unlock(&root->fs_info->ordered_extent_lock);
 
 		btrfs_invalidate_inodes(btrfs_inode->root);
+
+		spin_lock(&root->fs_info->ordered_extent_lock);
 	}
 
 	spin_unlock(&root->fs_info->ordered_extent_lock);
@@ -3808,8 +3811,11 @@ static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
 		list_del_init(&btrfs_inode->delalloc_inodes);
 		clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
 			  &btrfs_inode->runtime_flags);
+		spin_unlock(&root->fs_info->delalloc_lock);
 
 		btrfs_invalidate_inodes(btrfs_inode->root);
+
+		spin_lock(&root->fs_info->delalloc_lock);
 	}
 
 	spin_unlock(&root->fs_info->delalloc_lock);
-- 
1.8.1.4



* [PATCH 08/17] Btrfs: introduce per-subvolume delalloc inode list
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (6 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 07/17] Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 09/17] Btrfs: introduce per-subvolume ordered extent list Miao Xie
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

When we create a snapshot, we currently have to flush all delalloc
inodes in the fs, though just flushing the inodes in the source tree
would be enough. So we introduce a per-subvolume delalloc inode list.
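
As a simplified illustration (hypothetical userspace types, not the btrfs
structures), the bookkeeping is two-level: each root counts its own delalloc
inodes, and the root joins or leaves the fs-wide delalloc list exactly when
that count crosses zero.

#include <pthread.h>
#include <stdint.h>
#include <stdbool.h>

struct fs_info {
	pthread_mutex_t delalloc_root_lock;
	uint64_t nr_delalloc_roots;	/* stands in for the roots list */
};

struct root {
	struct fs_info *fs_info;
	pthread_mutex_t delalloc_lock;
	uint64_t nr_delalloc_inodes;
	bool on_delalloc_roots;		/* linked into fs_info's list? */
};

static void add_delalloc_inode(struct root *root)
{
	pthread_mutex_lock(&root->delalloc_lock);
	if (++root->nr_delalloc_inodes == 1) {	/* first dirty inode */
		pthread_mutex_lock(&root->fs_info->delalloc_root_lock);
		root->fs_info->nr_delalloc_roots++;
		root->on_delalloc_roots = true;
		pthread_mutex_unlock(&root->fs_info->delalloc_root_lock);
	}
	pthread_mutex_unlock(&root->delalloc_lock);
}

static void del_delalloc_inode(struct root *root)
{
	pthread_mutex_lock(&root->delalloc_lock);
	if (--root->nr_delalloc_inodes == 0) {	/* last dirty inode gone */
		pthread_mutex_lock(&root->fs_info->delalloc_root_lock);
		root->fs_info->nr_delalloc_roots--;
		root->on_delalloc_roots = false;
		pthread_mutex_unlock(&root->fs_info->delalloc_root_lock);
	}
	pthread_mutex_unlock(&root->delalloc_lock);
}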

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  22 ++++---
 fs/btrfs/dev-replace.c |   2 +-
 fs/btrfs/disk-io.c     |  49 ++++++++++++---
 fs/btrfs/extent-tree.c |   6 +-
 fs/btrfs/inode.c       | 167 ++++++++++++++++++++++++++++++++++++-------------
 fs/btrfs/relocation.c  |   2 +-
 fs/btrfs/transaction.c |   2 +-
 7 files changed, 183 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 958ce6c..067233f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1449,13 +1449,9 @@ struct btrfs_fs_info {
 	 */
 	struct list_head ordered_extents;
 
-	spinlock_t delalloc_lock;
-	/*
-	 * all of the inodes that have delalloc bytes.  It is possible for
-	 * this list to be empty even when there is still dirty data=ordered
-	 * extents waiting to finish IO.
-	 */
-	struct list_head delalloc_inodes;
+	spinlock_t delalloc_root_lock;
+	/* all fs/file tree roots that have delalloc inodes. */
+	struct list_head delalloc_roots;
 
 	/*
 	 * there is a pool of worker threads for checksumming during writes
@@ -1740,6 +1736,16 @@ struct btrfs_root {
 
 	spinlock_t root_item_lock;
 	atomic_t refs;
+
+	spinlock_t delalloc_lock;
+	/*
+	 * all of the inodes that have delalloc bytes.  It is possible for
+	 * this list to be empty even when there is still dirty data=ordered
+	 * extents waiting to finish IO.
+	 */
+	struct list_head delalloc_inodes;
+	struct list_head delalloc_root;
+	u64 nr_delalloc_inodes;
 };
 
 struct btrfs_ioctl_defrag_range_args {
@@ -3543,6 +3549,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 			       u32 min_type);
 
 int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput);
+int btrfs_start_all_delalloc_inodes(struct btrfs_fs_info *fs_info,
+				    int delay_iput);
 int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
 			      struct extent_state **cached_state);
 int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 7ba7b39..aca77ad 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -465,7 +465,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 * flush all outstanding I/O and inode extent mappings before the
 	 * copy operation is declared as being finished
 	 */
-	ret = btrfs_start_delalloc_inodes(root, 0);
+	ret = btrfs_start_all_delalloc_inodes(root->fs_info, 0);
 	if (ret) {
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 724a0da..13dddba 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1191,6 +1191,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->objectid = objectid;
 	root->last_trans = 0;
 	root->highest_objectid = 0;
+	root->nr_delalloc_inodes = 0;
 	root->name = NULL;
 	root->inode_tree = RB_ROOT;
 	INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
@@ -1199,10 +1200,13 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 
 	INIT_LIST_HEAD(&root->dirty_list);
 	INIT_LIST_HEAD(&root->root_list);
+	INIT_LIST_HEAD(&root->delalloc_inodes);
+	INIT_LIST_HEAD(&root->delalloc_root);
 	INIT_LIST_HEAD(&root->logged_list[0]);
 	INIT_LIST_HEAD(&root->logged_list[1]);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
+	spin_lock_init(&root->delalloc_lock);
 	spin_lock_init(&root->accounting_lock);
 	spin_lock_init(&root->log_extents_lock[0]);
 	spin_lock_init(&root->log_extents_lock[1]);
@@ -2137,9 +2141,9 @@ int open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->trans_list);
 	INIT_LIST_HEAD(&fs_info->dead_roots);
 	INIT_LIST_HEAD(&fs_info->delayed_iputs);
-	INIT_LIST_HEAD(&fs_info->delalloc_inodes);
+	INIT_LIST_HEAD(&fs_info->delalloc_roots);
 	INIT_LIST_HEAD(&fs_info->caching_block_groups);
-	spin_lock_init(&fs_info->delalloc_lock);
+	spin_lock_init(&fs_info->delalloc_root_lock);
 	spin_lock_init(&fs_info->trans_lock);
 	spin_lock_init(&fs_info->fs_roots_radix_lock);
 	spin_lock_init(&fs_info->delayed_iput_lock);
@@ -3801,24 +3805,49 @@ static void btrfs_destroy_delalloc_inodes(struct btrfs_root *root)
 
 	INIT_LIST_HEAD(&splice);
 
-	spin_lock(&root->fs_info->delalloc_lock);
-	list_splice_init(&root->fs_info->delalloc_inodes, &splice);
+	spin_lock(&root->delalloc_lock);
+	list_splice_init(&root->delalloc_inodes, &splice);
 
 	while (!list_empty(&splice)) {
-		btrfs_inode = list_entry(splice.next, struct btrfs_inode,
-				    delalloc_inodes);
+		btrfs_inode = list_first_entry(&splice, struct btrfs_inode,
+					       delalloc_inodes);
 
 		list_del_init(&btrfs_inode->delalloc_inodes);
 		clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
 			  &btrfs_inode->runtime_flags);
-		spin_unlock(&root->fs_info->delalloc_lock);
+		spin_unlock(&root->delalloc_lock);
 
 		btrfs_invalidate_inodes(btrfs_inode->root);
 
-		spin_lock(&root->fs_info->delalloc_lock);
+		spin_lock(&root->delalloc_lock);
 	}
 
-	spin_unlock(&root->fs_info->delalloc_lock);
+	spin_unlock(&root->delalloc_lock);
+}
+
+static void btrfs_destroy_all_delalloc_inodes(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root;
+	struct list_head splice;
+
+	INIT_LIST_HEAD(&splice);
+
+	spin_lock(&fs_info->delalloc_root_lock);
+	list_splice_init(&fs_info->delalloc_roots, &splice);
+	while (!list_empty(&splice)) {
+		root = list_first_entry(&splice, struct btrfs_root,
+					 delalloc_root);
+		list_del_init(&root->delalloc_root);
+		root = btrfs_grab_fs_root(root);
+		BUG_ON(!root);
+		spin_unlock(&fs_info->delalloc_root_lock);
+
+		btrfs_destroy_delalloc_inodes(root);
+		btrfs_put_fs_root(root);
+
+		spin_lock(&fs_info->delalloc_root_lock);
+	}
+	spin_unlock(&fs_info->delalloc_root_lock);
 }
 
 static int btrfs_destroy_marked_extents(struct btrfs_root *root,
@@ -3972,7 +4001,7 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 		btrfs_destroy_delayed_inodes(root);
 		btrfs_assert_delayed_root_empty(root);
 
-		btrfs_destroy_delalloc_inodes(root);
+		btrfs_destroy_all_delalloc_inodes(root->fs_info);
 
 		spin_lock(&root->fs_info->trans_lock);
 		root->fs_info->running_transaction = NULL;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 08f9862..455117a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3897,7 +3897,7 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
 		 * the filesystem is readonly(all dirty pages are written to
 		 * the disk).
 		 */
-		btrfs_start_delalloc_inodes(root, 0);
+		btrfs_start_all_delalloc_inodes(root->fs_info, 0);
 		if (!current->journal_info)
 			btrfs_wait_ordered_extents(root, 0);
 	}
@@ -5026,14 +5026,14 @@ static int update_block_group(struct btrfs_root *root,
 	int factor;
 
 	/* block accounting for super block */
-	spin_lock(&info->delalloc_lock);
+	spin_lock(&info->delalloc_root_lock);
 	old_val = btrfs_super_bytes_used(info->super_copy);
 	if (alloc)
 		old_val += num_bytes;
 	else
 		old_val -= num_bytes;
 	btrfs_set_super_bytes_used(info->super_copy, old_val);
-	spin_unlock(&info->delalloc_lock);
+	spin_unlock(&info->delalloc_root_lock);
 
 	while (total) {
 		cache = btrfs_lookup_block_group(info, bytenr);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bf5c399..a270083 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1524,6 +1524,46 @@ static void btrfs_merge_extent_hook(struct inode *inode,
 	spin_unlock(&BTRFS_I(inode)->lock);
 }
 
+static void btrfs_add_delalloc_inodes(struct btrfs_root *root,
+				      struct inode *inode)
+{
+	spin_lock(&root->delalloc_lock);
+	if (list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
+		list_add_tail(&BTRFS_I(inode)->delalloc_inodes,
+			      &root->delalloc_inodes);
+		set_bit(BTRFS_INODE_IN_DELALLOC_LIST,
+			&BTRFS_I(inode)->runtime_flags);
+		root->nr_delalloc_inodes++;
+		if (root->nr_delalloc_inodes == 1) {
+			spin_lock(&root->fs_info->delalloc_root_lock);
+			BUG_ON(!list_empty(&root->delalloc_root));
+			list_add_tail(&root->delalloc_root,
+				      &root->fs_info->delalloc_roots);
+			spin_unlock(&root->fs_info->delalloc_root_lock);
+		}
+	}
+	spin_unlock(&root->delalloc_lock);
+}
+
+static void btrfs_del_delalloc_inode(struct btrfs_root *root,
+				     struct inode *inode)
+{
+	spin_lock(&root->delalloc_lock);
+	if (!list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
+		list_del_init(&BTRFS_I(inode)->delalloc_inodes);
+		clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
+			  &BTRFS_I(inode)->runtime_flags);
+		root->nr_delalloc_inodes--;
+		if (!root->nr_delalloc_inodes) {
+			spin_lock(&root->fs_info->delalloc_root_lock);
+			BUG_ON(list_empty(&root->delalloc_root));
+			list_del_init(&root->delalloc_root);
+			spin_unlock(&root->fs_info->delalloc_root_lock);
+		}
+	}
+	spin_unlock(&root->delalloc_lock);
+}
+
 /*
  * extent_io.c set_bit_hook, used to track delayed allocation
  * bytes in this file, and to maintain the list of inodes that
@@ -1556,16 +1596,8 @@ static void btrfs_set_bit_hook(struct inode *inode,
 		spin_lock(&BTRFS_I(inode)->lock);
 		BTRFS_I(inode)->delalloc_bytes += len;
 		if (do_list && !test_bit(BTRFS_INODE_IN_DELALLOC_LIST,
-					 &BTRFS_I(inode)->runtime_flags)) {
-			spin_lock(&root->fs_info->delalloc_lock);
-			if (list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
-				list_add_tail(&BTRFS_I(inode)->delalloc_inodes,
-					      &root->fs_info->delalloc_inodes);
-				set_bit(BTRFS_INODE_IN_DELALLOC_LIST,
-					&BTRFS_I(inode)->runtime_flags);
-			}
-			spin_unlock(&root->fs_info->delalloc_lock);
-		}
+					 &BTRFS_I(inode)->runtime_flags))
+			btrfs_add_delalloc_inodes(root, inode);
 		spin_unlock(&BTRFS_I(inode)->lock);
 	}
 }
@@ -1608,15 +1640,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 		BTRFS_I(inode)->delalloc_bytes -= len;
 		if (do_list && BTRFS_I(inode)->delalloc_bytes == 0 &&
 		    test_bit(BTRFS_INODE_IN_DELALLOC_LIST,
-			     &BTRFS_I(inode)->runtime_flags)) {
-			spin_lock(&root->fs_info->delalloc_lock);
-			if (!list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
-				list_del_init(&BTRFS_I(inode)->delalloc_inodes);
-				clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
-					  &BTRFS_I(inode)->runtime_flags);
-			}
-			spin_unlock(&root->fs_info->delalloc_lock);
-		}
+			     &BTRFS_I(inode)->runtime_flags))
+			btrfs_del_delalloc_inode(root, inode);
 		spin_unlock(&BTRFS_I(inode)->lock);
 	}
 }
@@ -8319,7 +8344,7 @@ void btrfs_wait_and_free_delalloc_work(struct btrfs_delalloc_work *work)
  * some fairly slow code that needs optimization. This walks the list
  * of all the inodes with pending delalloc and forces them to disk.
  */
-int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
+static int __start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
 {
 	struct btrfs_inode *binode;
 	struct inode *inode;
@@ -8328,30 +8353,23 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
 	struct list_head splice;
 	int ret = 0;
 
-	if (root->fs_info->sb->s_flags & MS_RDONLY)
-		return -EROFS;
-
 	INIT_LIST_HEAD(&works);
 	INIT_LIST_HEAD(&splice);
 
-	spin_lock(&root->fs_info->delalloc_lock);
-	list_splice_init(&root->fs_info->delalloc_inodes, &splice);
+	spin_lock(&root->delalloc_lock);
+	list_splice_init(&root->delalloc_inodes, &splice);
 	while (!list_empty(&splice)) {
 		binode = list_entry(splice.next, struct btrfs_inode,
 				    delalloc_inodes);
 
-		list_del_init(&binode->delalloc_inodes);
-
+		list_move_tail(&binode->delalloc_inodes,
+			       &root->delalloc_inodes);
 		inode = igrab(&binode->vfs_inode);
 		if (!inode) {
-			clear_bit(BTRFS_INODE_IN_DELALLOC_LIST,
-				  &binode->runtime_flags);
+			cond_resched_lock(&root->delalloc_lock);
 			continue;
 		}
-
-		list_add_tail(&binode->delalloc_inodes,
-			      &root->fs_info->delalloc_inodes);
-		spin_unlock(&root->fs_info->delalloc_lock);
+		spin_unlock(&root->delalloc_lock);
 
 		work = btrfs_alloc_delalloc_work(inode, 0, delay_iput);
 		if (unlikely(!work)) {
@@ -8363,16 +8381,39 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
 				   &work->work);
 
 		cond_resched();
-		spin_lock(&root->fs_info->delalloc_lock);
+		spin_lock(&root->delalloc_lock);
 	}
-	spin_unlock(&root->fs_info->delalloc_lock);
+	spin_unlock(&root->delalloc_lock);
 
 	list_for_each_entry_safe(work, next, &works, list) {
 		list_del_init(&work->list);
 		btrfs_wait_and_free_delalloc_work(work);
 	}
+	return 0;
+out:
+	list_for_each_entry_safe(work, next, &works, list) {
+		list_del_init(&work->list);
+		btrfs_wait_and_free_delalloc_work(work);
+	}
+
+	if (!list_empty_careful(&splice)) {
+		spin_lock(&root->delalloc_lock);
+		list_splice_tail(&splice, &root->delalloc_inodes);
+		spin_unlock(&root->delalloc_lock);
+	}
+	return ret;
+}
 
-	/* the filemap_flush will queue IO into the worker threads, but
+int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
+{
+	int ret;
+
+	if (root->fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	ret = __start_delalloc_inodes(root, delay_iput);
+	/*
+	 * the filemap_flush will queue IO into the worker threads, but
 	 * we have to make sure the IO is actually started and that
 	 * ordered extents get created before we return
 	 */
@@ -8384,17 +8425,55 @@ int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput)
 		    atomic_read(&root->fs_info->async_delalloc_pages) == 0));
 	}
 	atomic_dec(&root->fs_info->async_submit_draining);
-	return 0;
-out:
-	list_for_each_entry_safe(work, next, &works, list) {
-		list_del_init(&work->list);
-		btrfs_wait_and_free_delalloc_work(work);
+	return ret;
+}
+
+int btrfs_start_all_delalloc_inodes(struct btrfs_fs_info *fs_info,
+				    int delay_iput)
+{
+	struct btrfs_root *root;
+	struct list_head splice;
+	int ret;
+
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	INIT_LIST_HEAD(&splice);
+
+	spin_lock(&fs_info->delalloc_root_lock);
+	list_splice_init(&fs_info->delalloc_roots, &splice);
+	while (!list_empty(&splice)) {
+		root = list_first_entry(&splice, struct btrfs_root,
+					delalloc_root);
+		root = btrfs_grab_fs_root(root);
+		BUG_ON(!root);
+		list_move_tail(&root->delalloc_root,
+			       &fs_info->delalloc_roots);
+		spin_unlock(&fs_info->delalloc_root_lock);
+
+		ret = __start_delalloc_inodes(root, delay_iput);
+		btrfs_put_fs_root(root);
+		if (ret)
+			goto out;
+
+		spin_lock(&fs_info->delalloc_root_lock);
 	}
+	spin_unlock(&fs_info->delalloc_root_lock);
 
+	atomic_inc(&fs_info->async_submit_draining);
+	while (atomic_read(&fs_info->nr_async_submits) ||
+	      atomic_read(&fs_info->async_delalloc_pages)) {
+		wait_event(fs_info->async_submit_wait,
+		   (atomic_read(&fs_info->nr_async_submits) == 0 &&
+		    atomic_read(&fs_info->async_delalloc_pages) == 0));
+	}
+	atomic_dec(&fs_info->async_submit_draining);
+	return 0;
+out:
 	if (!list_empty_careful(&splice)) {
-		spin_lock(&root->fs_info->delalloc_lock);
-		list_splice_tail(&splice, &root->fs_info->delalloc_inodes);
-		spin_unlock(&root->fs_info->delalloc_lock);
+		spin_lock(&fs_info->delalloc_root_lock);
+		list_splice_tail(&splice, &fs_info->delalloc_roots);
+		spin_unlock(&fs_info->delalloc_root_lock);
 	}
 	return ret;
 }
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index baaa66b..247fc71 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4153,7 +4153,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 	       (unsigned long long)rc->block_group->key.objectid,
 	       (unsigned long long)rc->block_group->flags);
 
-	ret = btrfs_start_delalloc_inodes(fs_info->tree_root, 0);
+	ret = btrfs_start_all_delalloc_inodes(fs_info, 0);
 	if (ret < 0) {
 		err = ret;
 		goto out;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index f157752..4b63111 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1502,7 +1502,7 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
 	}
 
 	if (flush_on_commit || snap_pending) {
-		ret = btrfs_start_delalloc_inodes(root, 1);
+		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
 		if (ret)
 			return ret;
 		btrfs_wait_ordered_extents(root, 1);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 09/17] Btrfs: introduce per-subvolume ordered extent list
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (7 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 08/17] Btrfs: introduce per-subvolume delalloc inode list Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Miao Xie
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

The reason we introduce a per-subvolume ordered extent list is the same
as for the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h        |  25 ++++++++---
 fs/btrfs/dev-replace.c  |   4 +-
 fs/btrfs/disk-io.c      |  45 +++++++++++++++-----
 fs/btrfs/extent-tree.c  |   6 +--
 fs/btrfs/inode.c        |   4 +-
 fs/btrfs/ordered-data.c | 109 +++++++++++++++++++++++++++++++++---------------
 fs/btrfs/ordered-data.h |   2 +
 fs/btrfs/relocation.c   |   2 +-
 fs/btrfs/super.c        |   2 +-
 fs/btrfs/transaction.c  |   2 +-
 10 files changed, 143 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 067233f..a7e71ff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1437,17 +1437,18 @@ struct btrfs_fs_info {
 	atomic_t open_ioctl_trans;
 
 	/*
-	 * this is used by the balancing code to wait for all the pending
-	 * ordered extents
+	 * this is used to protect the following list -- ordered_roots.
 	 */
-	spinlock_t ordered_extent_lock;
+	spinlock_t ordered_root_lock;
 
 	/*
-	 * all of the data=ordered extents pending writeback
+	 * all fs/file tree roots in which there are data=ordered extents
+	 * pending writeback are added into this list.
+	 *
 	 * these can span multiple transactions and basically include
 	 * every dirty data page that isn't from nodatacow
 	 */
-	struct list_head ordered_extents;
+	struct list_head ordered_roots;
 
 	spinlock_t delalloc_root_lock;
 	/* all fs/file tree roots that have delalloc inodes. */
@@ -1746,6 +1747,20 @@ struct btrfs_root {
 	struct list_head delalloc_inodes;
 	struct list_head delalloc_root;
 	u64 nr_delalloc_inodes;
+	/*
+	 * this is used by the balancing code to wait for all the pending
+	 * ordered extents
+	 */
+	spinlock_t ordered_extent_lock;
+
+	/*
+	 * all of the data=ordered extents pending writeback
+	 * these can span multiple transactions and basically include
+	 * every dirty data page that isn't from nodatacow
+	 */
+	struct list_head ordered_extents;
+	struct list_head ordered_root;
+	u64 nr_ordered_extents;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index aca77ad..4254da8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -395,7 +395,7 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
 	btrfs_dev_replace_unlock(dev_replace);
 
-	btrfs_wait_ordered_extents(root, 0);
+	btrfs_wait_all_ordered_extents(root->fs_info, 0);
 
 	/* force writing the updated state information to disk */
 	trans = btrfs_start_transaction(root, 0);
@@ -470,7 +470,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 		return ret;
 	}
-	btrfs_wait_ordered_extents(root, 0);
+	btrfs_wait_all_ordered_extents(root->fs_info, 0);
 
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 13dddba..44d5a86 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1192,6 +1192,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->last_trans = 0;
 	root->highest_objectid = 0;
 	root->nr_delalloc_inodes = 0;
+	root->nr_ordered_extents = 0;
 	root->name = NULL;
 	root->inode_tree = RB_ROOT;
 	INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
@@ -1202,11 +1203,14 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	INIT_LIST_HEAD(&root->root_list);
 	INIT_LIST_HEAD(&root->delalloc_inodes);
 	INIT_LIST_HEAD(&root->delalloc_root);
+	INIT_LIST_HEAD(&root->ordered_extents);
+	INIT_LIST_HEAD(&root->ordered_root);
 	INIT_LIST_HEAD(&root->logged_list[0]);
 	INIT_LIST_HEAD(&root->logged_list[1]);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
 	spin_lock_init(&root->delalloc_lock);
+	spin_lock_init(&root->ordered_extent_lock);
 	spin_lock_init(&root->accounting_lock);
 	spin_lock_init(&root->log_extents_lock[0]);
 	spin_lock_init(&root->log_extents_lock[1]);
@@ -2190,8 +2194,8 @@ int open_ctree(struct super_block *sb,
 	fs_info->thread_pool_size = min_t(unsigned long,
 					  num_online_cpus() + 2, 8);
 
-	INIT_LIST_HEAD(&fs_info->ordered_extents);
-	spin_lock_init(&fs_info->ordered_extent_lock);
+	INIT_LIST_HEAD(&fs_info->ordered_roots);
+	spin_lock_init(&fs_info->ordered_root_lock);
 	fs_info->delayed_root = kmalloc(sizeof(struct btrfs_delayed_root),
 					GFP_NOFS);
 	if (!fs_info->delayed_root) {
@@ -3681,7 +3685,7 @@ static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
 	INIT_LIST_HEAD(&splice);
 
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
-	spin_lock(&root->fs_info->ordered_extent_lock);
+	spin_lock(&root->fs_info->ordered_root_lock);
 
 	list_splice_init(&t->ordered_operations, &splice);
 	while (!list_empty(&splice)) {
@@ -3689,14 +3693,14 @@ static void btrfs_destroy_ordered_operations(struct btrfs_transaction *t,
 					 ordered_operations);
 
 		list_del_init(&btrfs_inode->ordered_operations);
-		spin_unlock(&root->fs_info->ordered_extent_lock);
+		spin_unlock(&root->fs_info->ordered_root_lock);
 
 		btrfs_invalidate_inodes(btrfs_inode->root);
 
-		spin_lock(&root->fs_info->ordered_extent_lock);
+		spin_lock(&root->fs_info->ordered_root_lock);
 	}
 
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+	spin_unlock(&root->fs_info->ordered_root_lock);
 	mutex_unlock(&root->fs_info->ordered_operations_mutex);
 }
 
@@ -3704,15 +3708,36 @@ static void btrfs_destroy_ordered_extents(struct btrfs_root *root)
 {
 	struct btrfs_ordered_extent *ordered;
 
-	spin_lock(&root->fs_info->ordered_extent_lock);
+	spin_lock(&root->ordered_extent_lock);
 	/*
 	 * This will just short circuit the ordered completion stuff which will
 	 * make sure the ordered extent gets properly cleaned up.
 	 */
-	list_for_each_entry(ordered, &root->fs_info->ordered_extents,
+	list_for_each_entry(ordered, &root->ordered_extents,
 			    root_extent_list)
 		set_bit(BTRFS_ORDERED_IOERR, &ordered->flags);
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+	spin_unlock(&root->ordered_extent_lock);
+}
+
+static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root;
+	struct list_head splice;
+
+	INIT_LIST_HEAD(&splice);
+
+	spin_lock(&fs_info->ordered_root_lock);
+	list_splice_init(&fs_info->ordered_roots, &splice);
+	while (!list_empty(&splice)) {
+		root = list_first_entry(&splice, struct btrfs_root,
+					ordered_root);
+		list_del_init(&root->ordered_root);
+
+		btrfs_destroy_ordered_extents(root);
+
+		cond_resched_lock(&fs_info->ordered_root_lock);
+	}
+	spin_unlock(&fs_info->ordered_root_lock);
 }
 
 int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
@@ -3975,7 +4000,7 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 
 		btrfs_destroy_ordered_operations(t, root);
 
-		btrfs_destroy_ordered_extents(root);
+		btrfs_destroy_all_ordered_extents(root->fs_info);
 
 		btrfs_destroy_delayed_refs(t, root);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 455117a..3149616 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3899,7 +3899,7 @@ static void btrfs_writeback_inodes_sb_nr(struct btrfs_root *root,
 		 */
 		btrfs_start_all_delalloc_inodes(root->fs_info, 0);
 		if (!current->journal_info)
-			btrfs_wait_ordered_extents(root, 0);
+			btrfs_wait_all_ordered_extents(root->fs_info, 0);
 	}
 }
 
@@ -3929,7 +3929,7 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 	if (delalloc_bytes == 0) {
 		if (trans)
 			return;
-		btrfs_wait_ordered_extents(root, 0);
+		btrfs_wait_all_ordered_extents(root->fs_info, 0);
 		return;
 	}
 
@@ -3957,7 +3957,7 @@ static void shrink_delalloc(struct btrfs_root *root, u64 to_reclaim, u64 orig,
 
 		loops++;
 		if (wait_ordered && !trans) {
-			btrfs_wait_ordered_extents(root, 0);
+			btrfs_wait_all_ordered_extents(root->fs_info, 0);
 		} else {
 			time_left = schedule_timeout_killable(1);
 			if (time_left)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a270083..69ca278 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7975,9 +7975,9 @@ void btrfs_destroy_inode(struct inode *inode)
 	 */
 	smp_mb();
 	if (!list_empty(&BTRFS_I(inode)->ordered_operations)) {
-		spin_lock(&root->fs_info->ordered_extent_lock);
+		spin_lock(&root->fs_info->ordered_root_lock);
 		list_del_init(&BTRFS_I(inode)->ordered_operations);
-		spin_unlock(&root->fs_info->ordered_extent_lock);
+		spin_unlock(&root->fs_info->ordered_root_lock);
 	}
 
 	if (test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 1ddd728..665c640 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -24,6 +24,7 @@
 #include "transaction.h"
 #include "btrfs_inode.h"
 #include "extent_io.h"
+#include "disk-io.h"
 
 static struct kmem_cache *btrfs_ordered_extent_cache;
 
@@ -184,6 +185,7 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
 				      int type, int dio, int compress_type)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ordered_inode_tree *tree;
 	struct rb_node *node;
 	struct btrfs_ordered_extent *entry;
@@ -227,10 +229,18 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 		ordered_data_tree_panic(inode, -EEXIST, file_offset);
 	spin_unlock_irq(&tree->lock);
 
-	spin_lock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+	spin_lock(&root->ordered_extent_lock);
 	list_add_tail(&entry->root_extent_list,
-		      &BTRFS_I(inode)->root->fs_info->ordered_extents);
-	spin_unlock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+		      &root->ordered_extents);
+	root->nr_ordered_extents++;
+	if (root->nr_ordered_extents == 1) {
+		spin_lock(&root->fs_info->ordered_root_lock);
+		BUG_ON(!list_empty(&root->ordered_root));
+		list_add_tail(&root->ordered_root,
+			      &root->fs_info->ordered_roots);
+		spin_unlock(&root->fs_info->ordered_root_lock);
+	}
+	spin_unlock(&root->ordered_extent_lock);
 
 	return 0;
 }
@@ -516,8 +526,9 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 	set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
 	spin_unlock_irq(&tree->lock);
 
-	spin_lock(&root->fs_info->ordered_extent_lock);
+	spin_lock(&root->ordered_extent_lock);
 	list_del_init(&entry->root_extent_list);
+	root->nr_ordered_extents--;
 
 	trace_btrfs_ordered_extent_remove(inode, entry);
 
@@ -530,7 +541,14 @@ void btrfs_remove_ordered_extent(struct inode *inode,
 	    !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
 		list_del_init(&BTRFS_I(inode)->ordered_operations);
 	}
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+
+	if (!root->nr_ordered_extents) {
+		spin_lock(&root->fs_info->ordered_root_lock);
+		BUG_ON(list_empty(&root->ordered_root));
+		list_del_init(&root->ordered_root);
+		spin_unlock(&root->fs_info->ordered_root_lock);
+	}
+	spin_unlock(&root->ordered_extent_lock);
 	wake_up(&entry->wait);
 }
 
@@ -550,7 +568,6 @@ static void btrfs_run_ordered_extent_work(struct btrfs_work *work)
 void btrfs_wait_ordered_extents(struct btrfs_root *root, int delay_iput)
 {
 	struct list_head splice, works;
-	struct list_head *cur;
 	struct btrfs_ordered_extent *ordered, *next;
 	struct inode *inode;
 
@@ -558,35 +575,34 @@ void btrfs_wait_ordered_extents(struct btrfs_root *root, int delay_iput)
 	INIT_LIST_HEAD(&works);
 
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
-	spin_lock(&root->fs_info->ordered_extent_lock);
-	list_splice_init(&root->fs_info->ordered_extents, &splice);
+	spin_lock(&root->ordered_extent_lock);
+	list_splice_init(&root->ordered_extents, &splice);
 	while (!list_empty(&splice)) {
-		cur = splice.next;
-		ordered = list_entry(cur, struct btrfs_ordered_extent,
-				     root_extent_list);
-		list_del_init(&ordered->root_extent_list);
-		atomic_inc(&ordered->refs);
-
+		ordered = list_first_entry(&splice, struct btrfs_ordered_extent,
+					   root_extent_list);
+		list_move_tail(&ordered->root_extent_list,
+			       &root->ordered_extents);
 		/*
 		 * the inode may be getting freed (in sys_unlink path).
 		 */
 		inode = igrab(ordered->inode);
+		if (!inode) {
+			cond_resched_lock(&root->ordered_extent_lock);
+			continue;
+		}
 
-		spin_unlock(&root->fs_info->ordered_extent_lock);
+		atomic_inc(&ordered->refs);
+		spin_unlock(&root->ordered_extent_lock);
 
-		if (inode) {
-			ordered->flush_work.func = btrfs_run_ordered_extent_work;
-			list_add_tail(&ordered->work_list, &works);
-			btrfs_queue_worker(&root->fs_info->flush_workers,
-					   &ordered->flush_work);
-		} else {
-			btrfs_put_ordered_extent(ordered);
-		}
+		ordered->flush_work.func = btrfs_run_ordered_extent_work;
+		list_add_tail(&ordered->work_list, &works);
+		btrfs_queue_worker(&root->fs_info->flush_workers,
+				   &ordered->flush_work);
 
 		cond_resched();
-		spin_lock(&root->fs_info->ordered_extent_lock);
+		spin_lock(&root->ordered_extent_lock);
 	}
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+	spin_unlock(&root->ordered_extent_lock);
 
 	list_for_each_entry_safe(ordered, next, &works, work_list) {
 		list_del_init(&ordered->work_list);
@@ -604,6 +620,33 @@ void btrfs_wait_ordered_extents(struct btrfs_root *root, int delay_iput)
 	mutex_unlock(&root->fs_info->ordered_operations_mutex);
 }
 
+void btrfs_wait_all_ordered_extents(struct btrfs_fs_info *fs_info,
+				    int delay_iput)
+{
+	struct btrfs_root *root;
+	struct list_head splice;
+
+	INIT_LIST_HEAD(&splice);
+
+	spin_lock(&fs_info->ordered_root_lock);
+	list_splice_init(&fs_info->ordered_roots, &splice);
+	while (!list_empty(&splice)) {
+		root = list_first_entry(&splice, struct btrfs_root,
+					ordered_root);
+		root = btrfs_grab_fs_root(root);
+		BUG_ON(!root);
+		list_move_tail(&root->ordered_root,
+			       &fs_info->ordered_roots);
+		spin_unlock(&fs_info->ordered_root_lock);
+
+		btrfs_wait_ordered_extents(root, delay_iput);
+		btrfs_put_fs_root(root);
+
+		spin_lock(&fs_info->ordered_root_lock);
+	}
+	spin_unlock(&fs_info->ordered_root_lock);
+}
+
 /*
  * this is used during transaction commit to write all the inodes
  * added to the ordered operation list.  These files must be fully on
@@ -629,7 +672,7 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
 	INIT_LIST_HEAD(&works);
 
 	mutex_lock(&root->fs_info->ordered_operations_mutex);
-	spin_lock(&root->fs_info->ordered_extent_lock);
+	spin_lock(&root->fs_info->ordered_root_lock);
 	list_splice_init(&cur_trans->ordered_operations, &splice);
 	while (!list_empty(&splice)) {
 		btrfs_inode = list_entry(splice.next, struct btrfs_inode,
@@ -648,17 +691,17 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
 		if (!wait)
 			list_add_tail(&BTRFS_I(inode)->ordered_operations,
 				      &cur_trans->ordered_operations);
-		spin_unlock(&root->fs_info->ordered_extent_lock);
+		spin_unlock(&root->fs_info->ordered_root_lock);
 
 		work = btrfs_alloc_delalloc_work(inode, wait, 1);
 		if (!work) {
-			spin_lock(&root->fs_info->ordered_extent_lock);
+			spin_lock(&root->fs_info->ordered_root_lock);
 			if (list_empty(&BTRFS_I(inode)->ordered_operations))
 				list_add_tail(&btrfs_inode->ordered_operations,
 					      &splice);
 			list_splice_tail(&splice,
 					 &cur_trans->ordered_operations);
-			spin_unlock(&root->fs_info->ordered_extent_lock);
+			spin_unlock(&root->fs_info->ordered_root_lock);
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -667,9 +710,9 @@ int btrfs_run_ordered_operations(struct btrfs_trans_handle *trans,
 				   &work->work);
 
 		cond_resched();
-		spin_lock(&root->fs_info->ordered_extent_lock);
+		spin_lock(&root->fs_info->ordered_root_lock);
 	}
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+	spin_unlock(&root->fs_info->ordered_root_lock);
 out:
 	list_for_each_entry_safe(work, next, &works, list) {
 		list_del_init(&work->list);
@@ -1055,12 +1098,12 @@ void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
 	if (last_mod < root->fs_info->last_trans_committed)
 		return;
 
-	spin_lock(&root->fs_info->ordered_extent_lock);
+	spin_lock(&root->fs_info->ordered_root_lock);
 	if (list_empty(&BTRFS_I(inode)->ordered_operations)) {
 		list_add_tail(&BTRFS_I(inode)->ordered_operations,
 			      &cur_trans->ordered_operations);
 	}
-	spin_unlock(&root->fs_info->ordered_extent_lock);
+	spin_unlock(&root->fs_info->ordered_root_lock);
 }
 
 int __init ordered_data_init(void)
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 58b0e3b..d082d43 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -204,6 +204,8 @@ void btrfs_add_ordered_operation(struct btrfs_trans_handle *trans,
 				 struct btrfs_root *root,
 				 struct inode *inode);
 void btrfs_wait_ordered_extents(struct btrfs_root *root, int delay_iput);
+void btrfs_wait_all_ordered_extents(struct btrfs_fs_info *fs_info,
+				    int delay_iput);
 void btrfs_get_logged_extents(struct btrfs_root *log, struct inode *inode);
 void btrfs_wait_logged_extents(struct btrfs_root *log, u64 transid);
 void btrfs_free_logged_extents(struct btrfs_root *log, u64 transid);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 247fc71..f04e703 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4158,7 +4158,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 		err = ret;
 		goto out;
 	}
-	btrfs_wait_ordered_extents(fs_info->tree_root, 0);
+	btrfs_wait_all_ordered_extents(fs_info, 0);
 
 	while (1) {
 		mutex_lock(&fs_info->cleaner_mutex);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f0857e0..d36d6e6 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -866,7 +866,7 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 		return 0;
 	}
 
-	btrfs_wait_ordered_extents(root, 1);
+	btrfs_wait_all_ordered_extents(fs_info, 0);
 
 	trans = btrfs_attach_transaction_barrier(root);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 4b63111..2b17213 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1505,7 +1505,7 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
 		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
 		if (ret)
 			return ret;
-		btrfs_wait_ordered_extents(root, 1);
+		btrfs_wait_all_ordered_extents(root->fs_info, 1);
 	}
 
 	ret = btrfs_run_delayed_items(trans, root);
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (8 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 09/17] Btrfs: introduce per-subvolume ordered extent list Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-16  3:20   ` Liu Bo
  2013-05-15  7:48 ` [PATCH 11/17] Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction Miao Xie
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

Before applying this patch, we had to flush all the delalloc inodes in
the fs when we wanted to create a snapshot. That wasted time and made
the transaction commit be blocked for a long time, which meant some
other user operations would also be blocked for a long time.

This patch improves the situation: we just flush the delalloc inodes
that are in the source trees before snapshot creation, so the
transaction commit will complete quickly.
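
To make the intended ordering concrete, here is a minimal user-space
sketch (not the kernel code; the function names are simplified stand-ins
for btrfs_start_delalloc_inodes()/btrfs_wait_ordered_extents()): writeback
is started and waited for on the source subvolume only, before the
pending snapshot is queued.

  #include <stdio.h>

  struct subvol { const char *name; };

  /* stand-in: start writeback of the dirty (delalloc) data of one subvolume */
  static int start_delalloc(struct subvol *s)
  {
  	printf("flush delalloc of %s\n", s->name);
  	return 0;
  }

  /* stand-in: wait until the ordered extents of that subvolume complete */
  static void wait_ordered(struct subvol *s)
  {
  	printf("wait ordered extents of %s\n", s->name);
  }

  static int create_snapshot_sketch(struct subvol *src)
  {
  	int ret = start_delalloc(src);	/* only the source tree, not the whole fs */
  	if (ret)
  		return ret;
  	wait_ordered(src);
  	/* ...then queue the pending snapshot and commit the transaction... */
  	return 0;
  }

  int main(void)
  {
  	struct subvol src = { "subvol1" };
  	return create_snapshot_sketch(&src);
  }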

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ioctl.c       |  6 ++++++
 fs/btrfs/transaction.c | 10 +---------
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0de4a2f..2677dcc 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -555,6 +555,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 	if (!root->ref_cows)
 		return -EINVAL;
 
+	ret = btrfs_start_delalloc_inodes(root, 0);
+	if (ret)
+		return ret;
+
+	btrfs_wait_ordered_extents(root, 0);
+
 	pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_NOFS);
 	if (!pending_snapshot)
 		return -ENOMEM;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2b17213..bc22be9 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1491,17 +1491,9 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
 					  struct btrfs_root *root)
 {
 	int flush_on_commit = btrfs_test_opt(root, FLUSHONCOMMIT);
-	int snap_pending = 0;
 	int ret;
 
-	if (!flush_on_commit) {
-		spin_lock(&root->fs_info->trans_lock);
-		if (!list_empty(&trans->transaction->pending_snapshots))
-			snap_pending = 1;
-		spin_unlock(&root->fs_info->trans_lock);
-	}
-
-	if (flush_on_commit || snap_pending) {
+	if (flush_on_commit) {
 		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
 		if (ret)
 			return ret;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 11/17] Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (9 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 12/17] Btrfs: remove the code for the impossible case in cleanup_transaction() Miao Xie
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

When we umount a fs with serious errors, we invoke btrfs_cleanup_transaction()
to clean up the residual transactions. At this time, it is impossible to start
a new transaction, so we needn't set trans_no_join to 1, and we also needn't
clear the running transaction every time we destroy a residual transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/disk-io.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 44d5a86..6bb3f3d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3992,7 +3992,7 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 
 	spin_lock(&root->fs_info->trans_lock);
 	list_splice_init(&root->fs_info->trans_list, &list);
-	root->fs_info->trans_no_join = 1;
+	root->fs_info->running_transaction = NULL;
 	spin_unlock(&root->fs_info->trans_lock);
 
 	while (!list_empty(&list)) {
@@ -4028,10 +4028,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 
 		btrfs_destroy_all_delalloc_inodes(root->fs_info);
 
-		spin_lock(&root->fs_info->trans_lock);
-		root->fs_info->running_transaction = NULL;
-		spin_unlock(&root->fs_info->trans_lock);
-
 		btrfs_destroy_marked_extents(root, &t->dirty_pages,
 					     EXTENT_DIRTY);
 
@@ -4044,9 +4040,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 		kmem_cache_free(btrfs_transaction_cachep, t);
 	}
 
-	spin_lock(&root->fs_info->trans_lock);
-	root->fs_info->trans_no_join = 0;
-	spin_unlock(&root->fs_info->trans_lock);
 	mutex_unlock(&root->fs_info->transaction_kthread_mutex);
 
 	return 0;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 12/17] Btrfs: remove the code for the impossible case in cleanup_transaction()
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (10 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 11/17] Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 13/17] Btrfs: don't wait for all the writers circularly during the transaction commit Miao Xie
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

If the transaction is removed from the transaction list, it means the
transaction has been committed successfully, so it is impossible for
cleanup_transaction() to be called on it; if that happens, there is
something wrong with the code logic. Thus, we use BUG_ON() instead of
the original handling.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bc22be9..cf8706c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1450,11 +1450,12 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 
 	spin_lock(&root->fs_info->trans_lock);
 
-	if (list_empty(&cur_trans->list)) {
-		spin_unlock(&root->fs_info->trans_lock);
-		btrfs_end_transaction(trans, root);
-		return;
-	}
+	/*
+	 * If the transaction is removed from the list, it means this
+	 * transaction has been committed successfully, so it is impossible
+	 * to call the cleanup function.
+	 */
+	BUG_ON(list_empty(&cur_trans->list));
 
 	list_del_init(&cur_trans->list);
 	if (cur_trans == root->fs_info->running_transaction) {
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 13/17] Btrfs: don't wait for all the writers circularly during the transaction commit
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (11 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 12/17] Btrfs: remove the code for the impossible case in cleanup_transaction() Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 14/17] Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set Miao Xie
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

btrfs_commit_transaction has the following loop before it commits the
transaction.

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
	 (should_grow && cur_trans->num_joined != joined));

This is used to prevent TRANS_START from getting in the way of a
committing transaction. But it does not prevent TRANS_JOIN, that is,
we would stay in this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need to join the current transaction to do some useful stuff,
we can not block TRANS_JOIN here. So we introduce an external writer
counter, which is used to count the external writers
(TRANS_USERSPACE/TRANS_START/TRANS_ATTACH). If the external writer
counter is zero, we can break the above loop.

In order to make the code clearer, we don't use an enum to define the
type of the transaction handle; we use a bitmask instead.
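
The following is a minimal user-space sketch of that idea (simplified
stand-in names, not the kernel implementation): only handle types that
carry the "external writer" bits bump the counter the committer waits
on; plain JOINs do not.

  #include <stdatomic.h>
  #include <stdio.h>

  /* simplified stand-ins for the __TRANS_* type bits */
  #define EX_TRANS_USERSPACE	(1U << 8)
  #define EX_TRANS_START		(1U << 9)
  #define EX_TRANS_ATTACH		(1U << 10)
  #define EX_TRANS_JOIN		(1U << 11)

  #define EX_TRANS_EXTWRITERS	(EX_TRANS_USERSPACE | EX_TRANS_START | \
  				 EX_TRANS_ATTACH)

  static atomic_uint num_extwriters;

  static void handle_start(unsigned int type)
  {
  	/* only external writers are counted */
  	if (type & EX_TRANS_EXTWRITERS)
  		atomic_fetch_add(&num_extwriters, 1);
  }

  static void handle_end(unsigned int type)
  {
  	if (type & EX_TRANS_EXTWRITERS)
  		atomic_fetch_sub(&num_extwriters, 1);
  }

  int main(void)
  {
  	handle_start(EX_TRANS_START);	/* counted */
  	handle_start(EX_TRANS_JOIN);	/* not counted */
  	handle_end(EX_TRANS_START);
  	/* the committer can stop waiting once this reads 0 */
  	printf("external writers = %u\n", atomic_load(&num_extwriters));
  	return 0;
  }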

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 55 ++++++++++++++++++++++++++++++++++++++------------
 fs/btrfs/transaction.h | 31 ++++++++++++++++++++--------
 2 files changed, 65 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cf8706c..fd319b2 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -51,17 +51,41 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 }
 
 static inline int can_join_transaction(struct btrfs_transaction *trans,
-				       int type)
+				       unsigned int type)
 {
 	return !(trans->in_commit &&
-		 type != TRANS_JOIN &&
-		 type != TRANS_JOIN_NOLOCK);
+		 (type & TRANS_EXTWRITERS));
+}
+
+static inline void extwriter_counter_inc(struct btrfs_transaction *trans,
+					 unsigned int type)
+{
+	if (type & TRANS_EXTWRITERS)
+		atomic_inc(&trans->num_extwriters);
+}
+
+static inline void extwriter_counter_dec(struct btrfs_transaction *trans,
+					 unsigned int type)
+{
+	if (type & TRANS_EXTWRITERS)
+		atomic_dec(&trans->num_extwriters);
+}
+
+static inline void extwriter_counter_init(struct btrfs_transaction *trans,
+					  unsigned int type)
+{
+	atomic_set(&trans->num_extwriters, ((type & TRANS_EXTWRITERS) ? 1 : 0));
+}
+
+static inline int extwriter_counter_read(struct btrfs_transaction *trans)
+{
+	return atomic_read(&trans->num_extwriters);
 }
 
 /*
  * either allocate a new transaction or hop into the existing one
  */
-static noinline int join_transaction(struct btrfs_root *root, int type)
+static noinline int join_transaction(struct btrfs_root *root, unsigned int type)
 {
 	struct btrfs_transaction *cur_trans;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -99,6 +123,7 @@ loop:
 		}
 		atomic_inc(&cur_trans->use_count);
 		atomic_inc(&cur_trans->num_writers);
+		extwriter_counter_inc(cur_trans, type);
 		cur_trans->num_joined++;
 		spin_unlock(&fs_info->trans_lock);
 		return 0;
@@ -131,6 +156,7 @@ loop:
 	}
 
 	atomic_set(&cur_trans->num_writers, 1);
+	extwriter_counter_init(cur_trans, type);
 	cur_trans->num_joined = 0;
 	init_waitqueue_head(&cur_trans->writer_wait);
 	init_waitqueue_head(&cur_trans->commit_wait);
@@ -307,7 +333,7 @@ static int may_wait_transaction(struct btrfs_root *root, int type)
 }
 
 static struct btrfs_trans_handle *
-start_transaction(struct btrfs_root *root, u64 num_items, int type,
+start_transaction(struct btrfs_root *root, u64 num_items, unsigned int type,
 		  enum btrfs_reserve_flush_enum flush)
 {
 	struct btrfs_trans_handle *h;
@@ -320,7 +346,7 @@ start_transaction(struct btrfs_root *root, u64 num_items, int type,
 		return ERR_PTR(-EROFS);
 
 	if (current->journal_info) {
-		WARN_ON(type != TRANS_JOIN && type != TRANS_JOIN_NOLOCK);
+		WARN_ON(type & TRANS_EXTWRITERS);
 		h = current->journal_info;
 		h->use_count++;
 		WARN_ON(h->use_count > 2);
@@ -366,7 +392,7 @@ again:
 	 * If we are ATTACH, it means we just want to catch the current
 	 * transaction and commit it, so we needn't do sb_start_intwrite(). 
 	 */
-	if (type < TRANS_JOIN_NOLOCK)
+	if (type & __TRANS_FREEZABLE)
 		sb_start_intwrite(root->fs_info->sb);
 
 	if (may_wait_transaction(root, type))
@@ -429,7 +455,7 @@ got_it:
 	return h;
 
 join_fail:
-	if (type < TRANS_JOIN_NOLOCK)
+	if (type & __TRANS_FREEZABLE)
 		sb_end_intwrite(root->fs_info->sb);
 	kmem_cache_free(btrfs_trans_handle_cachep, h);
 alloc_fail:
@@ -677,12 +703,13 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		}
 	}
 
-	if (trans->type < TRANS_JOIN_NOLOCK)
+	if (trans->type & __TRANS_FREEZABLE)
 		sb_end_intwrite(root->fs_info->sb);
 
 	WARN_ON(cur_trans != info->running_transaction);
 	WARN_ON(atomic_read(&cur_trans->num_writers) < 1);
 	atomic_dec(&cur_trans->num_writers);
+	extwriter_counter_dec(cur_trans, trans->type);
 
 	smp_mb();
 	if (waitqueue_active(&cur_trans->writer_wait))
@@ -1625,6 +1652,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		spin_unlock(&root->fs_info->trans_lock);
 	}
 
+	extwriter_counter_dec(cur_trans, trans->type);
+
 	if (!btrfs_test_opt(root, SSD) &&
 	    (now < cur_trans->start_time || now - cur_trans->start_time < 1))
 		should_grow = 1;
@@ -1641,13 +1670,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		prepare_to_wait(&cur_trans->writer_wait, &wait,
 				TASK_UNINTERRUPTIBLE);
 
-		if (atomic_read(&cur_trans->num_writers) > 1)
-			schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+		if (extwriter_counter_read(cur_trans) > 0)
+			schedule();
 		else if (should_grow)
 			schedule_timeout(1);
 
 		finish_wait(&cur_trans->writer_wait, &wait);
-	} while (atomic_read(&cur_trans->num_writers) > 1 ||
+	} while (extwriter_counter_read(cur_trans) > 0 ||
 		 (should_grow && cur_trans->num_joined != joined));
 
 	ret = btrfs_flush_all_pending_stuffs(trans, root);
@@ -1831,7 +1860,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	put_transaction(cur_trans);
 	put_transaction(cur_trans);
 
-	if (trans->type < TRANS_JOIN_NOLOCK)
+	if (trans->type & __TRANS_FREEZABLE)
 		sb_end_intwrite(root->fs_info->sb);
 
 	trace_btrfs_transaction_commit(root);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 24c9733..5cc77b0 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -25,6 +25,12 @@
 struct btrfs_transaction {
 	u64 transid;
 	/*
+	 * total external writers(USERSPACE/START/ATTACH) in this
+	 * transaction, it must be zero before the transaction is
+	 * being committed
+	 */
+	atomic_t num_extwriters;
+	/*
 	 * total writers in this transaction, it must be zero before the
 	 * transaction can end
 	 */
@@ -48,13 +54,22 @@ struct btrfs_transaction {
 	int aborted;
 };
 
-enum btrfs_trans_type {
-	TRANS_START,
-	TRANS_JOIN,
-	TRANS_USERSPACE,
-	TRANS_JOIN_NOLOCK,
-	TRANS_ATTACH,
-};
+#define __TRANS_FREEZABLE	(1U << 0)
+
+#define __TRANS_USERSPACE	(1U << 8)
+#define __TRANS_START		(1U << 9)
+#define __TRANS_ATTACH		(1U << 10)
+#define __TRANS_JOIN		(1U << 11)
+#define __TRANS_JOIN_NOLOCK	(1U << 12)
+
+#define TRANS_USERSPACE		(__TRANS_USERSPACE | __TRANS_FREEZABLE)
+#define TRANS_START		(__TRANS_START | __TRANS_FREEZABLE)
+#define TRANS_ATTACH		(__TRANS_ATTACH)
+#define TRANS_JOIN		(__TRANS_JOIN | __TRANS_FREEZABLE)
+#define TRANS_JOIN_NOLOCK	(__TRANS_JOIN_NOLOCK)
+
+#define TRANS_EXTWRITERS	(__TRANS_USERSPACE | __TRANS_START |	\
+				 __TRANS_ATTACH)
 
 struct btrfs_trans_handle {
 	u64 transid;
@@ -70,7 +85,7 @@ struct btrfs_trans_handle {
 	short aborted;
 	short adding_csums;
 	bool allocating_chunk;
-	enum btrfs_trans_type type;
+	unsigned int type;
 	/*
 	 * this root is only needed to validate that the root passed to
 	 * start_transaction is the same as the one passed to end_transaction.
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 14/17] Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (12 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 13/17] Btrfs: don't wait for all the writers circularly during the transaction commit Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 15/17] Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure Miao Xie
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

It is unnecessary to flush the delalloc inodes again and again because
we don't care about the dirty pages that are introduced after the flush;
they will be flushed during the transaction commit.
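
As an illustration only (a simplified user-space sketch with hypothetical
names, not the kernel helpers added below), the flush is now kicked off
once before waiting for the writers and its completion is waited for once
afterwards, instead of being repeated on every loop iteration:

  #include <stdio.h>

  static int flushoncommit = 1;	/* stand-in for the mount option test */

  static int start_delalloc_flush(void)
  {
  	if (!flushoncommit)
  		return 0;
  	printf("start writeback once, before waiting for the writers\n");
  	return 0;
  }

  static void wait_delalloc_flush(void)
  {
  	if (flushoncommit)
  		printf("wait for ordered extents once, after the writers are gone\n");
  }

  int main(void)
  {
  	int ret = start_delalloc_flush();
  	if (ret)
  		return ret;
  	/* ...wait until the external writers drop to zero... */
  	wait_delalloc_flush();
  	/* ...do the actual commit... */
  	return 0;
  }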

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index fd319b2..265db57 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1518,16 +1518,8 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
 					  struct btrfs_root *root)
 {
-	int flush_on_commit = btrfs_test_opt(root, FLUSHONCOMMIT);
 	int ret;
 
-	if (flush_on_commit) {
-		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
-		if (ret)
-			return ret;
-		btrfs_wait_all_ordered_extents(root->fs_info, 1);
-	}
-
 	ret = btrfs_run_delayed_items(trans, root);
 	if (ret)
 		return ret;
@@ -1551,6 +1543,19 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info)
+{
+	if (btrfs_test_opt(fs_info->tree_root, FLUSHONCOMMIT))
+		return btrfs_start_all_delalloc_inodes(fs_info, 1);
+	return 0;
+}
+
+static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
+{
+	if (btrfs_test_opt(fs_info->tree_root, FLUSHONCOMMIT))
+		btrfs_wait_all_ordered_extents(fs_info, 1);
+}
+
 /*
  * btrfs_transaction state sequence:
  *    in_commit = 0, blocked = 0  (initial)
@@ -1654,6 +1659,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	extwriter_counter_dec(cur_trans, trans->type);
 
+	ret = btrfs_start_delalloc_flush(root->fs_info);
+	if (ret)
+		goto cleanup_transaction;
+
 	if (!btrfs_test_opt(root, SSD) &&
 	    (now < cur_trans->start_time || now - cur_trans->start_time < 1))
 		should_grow = 1;
@@ -1683,6 +1692,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto cleanup_transaction;
 
+	btrfs_wait_delalloc_flush(root->fs_info);
 	/*
 	 * Ok now we need to make sure to block out any other joins while we
 	 * commit the transaction.  We could have started a join before setting
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 15/17] Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (13 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 14/17] Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 16/17] Btrfs: remove the time check in btrfs_commit_transaction() Miao Xie
  2013-05-15  7:48 ` [PATCH 17/17] Btrfs: make the state of the transaction more readable Miao Xie
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

We used ->num_joined to track whether some writers joined the current
transaction while the committer was sleeping. If some writers joined, we
had to continue the while loop to do some necessary stuff, such as
flushing the ordered operations. But that is unnecessary because we do
it after the while loop anyway.

Besides that, tracking ->num_joined would make the committer drop into
the while loop when there are lots of internal writers (TRANS_JOIN).

So we remove ->num_joined and no longer track whether some writers join
the current transaction while the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 8 +-------
 fs/btrfs/transaction.h | 2 --
 2 files changed, 1 insertion(+), 9 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 265db57..75e7b15 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -124,7 +124,6 @@ loop:
 		atomic_inc(&cur_trans->use_count);
 		atomic_inc(&cur_trans->num_writers);
 		extwriter_counter_inc(cur_trans, type);
-		cur_trans->num_joined++;
 		spin_unlock(&fs_info->trans_lock);
 		return 0;
 	}
@@ -157,7 +156,6 @@ loop:
 
 	atomic_set(&cur_trans->num_writers, 1);
 	extwriter_counter_init(cur_trans, type);
-	cur_trans->num_joined = 0;
 	init_waitqueue_head(&cur_trans->writer_wait);
 	init_waitqueue_head(&cur_trans->commit_wait);
 	cur_trans->in_commit = 0;
@@ -1566,7 +1564,6 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
-	unsigned long joined = 0;
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
 	DEFINE_WAIT(wait);
@@ -1668,8 +1665,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		should_grow = 1;
 
 	do {
-		joined = cur_trans->num_joined;
-
 		WARN_ON(cur_trans != trans->transaction);
 
 		ret = btrfs_flush_all_pending_stuffs(trans, root);
@@ -1685,8 +1680,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			schedule_timeout(1);
 
 		finish_wait(&cur_trans->writer_wait, &wait);
-	} while (extwriter_counter_read(cur_trans) > 0 ||
-		 (should_grow && cur_trans->num_joined != joined));
+	} while (extwriter_counter_read(cur_trans) > 0);
 
 	ret = btrfs_flush_all_pending_stuffs(trans, root);
 	if (ret)
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 5cc77b0..0fc45e2 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -37,8 +37,6 @@ struct btrfs_transaction {
 	atomic_t num_writers;
 	atomic_t use_count;
 
-	unsigned long num_joined;
-
 	spinlock_t commit_lock;
 	int in_commit;
 	int commit_done;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 16/17] Btrfs: remove the time check in btrfs_commit_transaction()
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (14 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 15/17] Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-15  7:48 ` [PATCH 17/17] Btrfs: make the state of the transaction more readable Miao Xie
  16 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

We checked the commit time to avoid committing the transaction too
frequently, but it is unnecessary because:
- It made the transaction commit spend more time and delayed the
  operations of the external writers (TRANS_START/TRANS_USERSPACE).
- Except for the cases where we have to commit the transaction, such as
  snapshot creation, btrfs doesn't commit the transaction on its own
  initiative.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/transaction.c | 29 ++++++-----------------------
 1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 75e7b15..5e75ff4 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1566,10 +1566,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
 	struct btrfs_transaction *prev_trans = NULL;
-	DEFINE_WAIT(wait);
 	int ret;
-	int should_grow = 0;
-	unsigned long now = get_seconds();
 
 	ret = btrfs_run_ordered_operations(trans, root, 0);
 	if (ret) {
@@ -1660,28 +1657,14 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	if (ret)
 		goto cleanup_transaction;
 
-	if (!btrfs_test_opt(root, SSD) &&
-	    (now < cur_trans->start_time || now - cur_trans->start_time < 1))
-		should_grow = 1;
-
-	do {
-		WARN_ON(cur_trans != trans->transaction);
-
-		ret = btrfs_flush_all_pending_stuffs(trans, root);
-		if (ret)
-			goto cleanup_transaction;
-
-		prepare_to_wait(&cur_trans->writer_wait, &wait,
-				TASK_UNINTERRUPTIBLE);
-
-		if (extwriter_counter_read(cur_trans) > 0)
-			schedule();
-		else if (should_grow)
-			schedule_timeout(1);
+	ret = btrfs_flush_all_pending_stuffs(trans, root);
+	if (ret)
+		goto cleanup_transaction;
 
-		finish_wait(&cur_trans->writer_wait, &wait);
-	} while (extwriter_counter_read(cur_trans) > 0);
+	wait_event(cur_trans->writer_wait,
+		   extwriter_counter_read(cur_trans) == 0);
 
+	/* some pending stuffs might be added after the previous flush. */
 	ret = btrfs_flush_all_pending_stuffs(trans, root);
 	if (ret)
 		goto cleanup_transaction;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 17/17] Btrfs: make the state of the transaction more readable
  2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
                   ` (15 preceding siblings ...)
  2013-05-15  7:48 ` [PATCH 16/17] Btrfs: remove the time check in btrfs_commit_transaction() Miao Xie
@ 2013-05-15  7:48 ` Miao Xie
  2013-05-17  3:53   ` [PATCH V2 " Miao Xie
  16 siblings, 1 reply; 33+ messages in thread
From: Miao Xie @ 2013-05-15  7:48 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

We used 3 variables to track the state of the transaction; it was complex
and wasted memory. Besides that, it was hard to understand which types of
transaction handles should be blocked in each transaction state, so the
developers often made mistakes.

This patch improves the above problem. In this patch, we define 6 states
for the transaction,
  enum btrfs_trans_state {
	TRANS_STATE_RUNNING		= 0,
	TRANS_STATE_BLOCKED		= 1,
	TRANS_STATE_COMMIT_START	= 2,
	TRANS_STATE_COMMIT_DOING	= 3,
	TRANS_STATE_UNBLOCKED		= 4,
	TRANS_STATE_COMPLETED		= 5,
	TRANS_STATE_MAX			= 6,
  }
and just use 1 variable to track the state.

In order to make the blocked handle types for each state clearer,
we introduce an array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]		= 0U,
	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START),
	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH),
	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN),
	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
  }
It is very intuitive.
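
A minimal user-space sketch of how such a table is consulted when a
handle tries to join (simplified names and only two states, not the
kernel code): the handle type's bit is tested against the mask of the
transaction's current state.

  #include <stdio.h>

  enum ex_state { EX_RUNNING, EX_COMMIT_DOING, EX_STATE_MAX };

  #define EX_TRANS_START	(1U << 0)
  #define EX_TRANS_JOIN	(1U << 1)

  static const unsigned int ex_blocked[EX_STATE_MAX] = {
  	[EX_RUNNING]		= 0U,
  	[EX_COMMIT_DOING]	= EX_TRANS_START | EX_TRANS_JOIN,
  };

  static int ex_can_join(enum ex_state state, unsigned int type)
  {
  	/* a handle may join only if its bit is clear for the current state */
  	return !(ex_blocked[state] & type);
  }

  int main(void)
  {
  	printf("%d\n", ex_can_join(EX_RUNNING, EX_TRANS_JOIN));	/* 1 */
  	printf("%d\n", ex_can_join(EX_COMMIT_DOING, EX_TRANS_JOIN));	/* 0 */
  	return 0;
  }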

Besides that, because we remove ->in_commit from the transaction structure,
the lock ->commit_lock which was used to protect it is unnecessary, so we
remove ->commit_lock as well.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |   1 -
 fs/btrfs/disk-io.c     |  36 ++++++------
 fs/btrfs/transaction.c | 156 ++++++++++++++++++++++++++-----------------------
 fs/btrfs/transaction.h |  16 +++--
 4 files changed, 114 insertions(+), 95 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a7e71ff..bf92302 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1496,7 +1496,6 @@ struct btrfs_fs_info {
 	int closing;
 	int log_root_recovering;
 	int enospc_unlink;
-	int trans_no_join;
 
 	u64 total_pinned;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6bb3f3d..530e3c0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1747,7 +1747,7 @@ static int transaction_kthread(void *arg)
 		}
 
 		now = get_seconds();
-		if (!cur->blocked &&
+		if (cur->state < TRANS_STATE_BLOCKED &&
 		    (now < cur->start_time || now - cur->start_time < 30)) {
 			spin_unlock(&root->fs_info->trans_lock);
 			delay = HZ * 5;
@@ -2183,7 +2183,6 @@ int open_ctree(struct super_block *sb,
 	fs_info->max_inline = 8192 * 1024;
 	fs_info->metadata_ratio = 0;
 	fs_info->defrag_inodes = RB_ROOT;
-	fs_info->trans_no_join = 0;
 	fs_info->free_chunk_space = 0;
 	fs_info->tree_mod_log = RB_ROOT;
 
@@ -3956,19 +3955,14 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
 				cur_trans->dirty_pages.dirty_bytes);
 
-	/* FIXME: cleanup wait for commit */
-	cur_trans->in_commit = 1;
-	cur_trans->blocked = 1;
+	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
 	btrfs_evict_pending_snapshots(cur_trans);
 
-	cur_trans->blocked = 0;
+	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	wake_up(&root->fs_info->transaction_wait);
 
-	cur_trans->commit_done = 1;
-	wake_up(&cur_trans->commit_wait);
-
 	btrfs_destroy_delayed_inodes(root);
 	btrfs_assert_delayed_root_empty(root);
 
@@ -3977,6 +3971,9 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(root,
 				    root->fs_info->pinned_extents);
 
+	cur_trans->state =TRANS_STATE_COMPLETED;
+	wake_up(&cur_trans->commit_wait);
+
 	/*
 	memset(cur_trans, 0, sizeof(*cur_trans));
 	kmem_cache_free(btrfs_transaction_cachep, cur_trans);
@@ -4004,25 +4001,23 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 
 		btrfs_destroy_delayed_refs(t, root);
 
-		/* FIXME: cleanup wait for commit */
-		t->in_commit = 1;
-		t->blocked = 1;
+		/*
+		 *  FIXME: cleanup wait for commit
+		 *  We needn't acquire the lock here, because we are during
+		 *  the umount, there is no other task which will change it.
+		 */
+		t->state = TRANS_STATE_COMMIT_START;
 		smp_mb();
 		if (waitqueue_active(&root->fs_info->transaction_blocked_wait))
 			wake_up(&root->fs_info->transaction_blocked_wait);
 
 		btrfs_evict_pending_snapshots(t);
 
-		t->blocked = 0;
+		t->state = TRANS_STATE_UNBLOCKED;
 		smp_mb();
 		if (waitqueue_active(&root->fs_info->transaction_wait))
 			wake_up(&root->fs_info->transaction_wait);
 
-		t->commit_done = 1;
-		smp_mb();
-		if (waitqueue_active(&t->commit_wait))
-			wake_up(&t->commit_wait);
-
 		btrfs_destroy_delayed_inodes(root);
 		btrfs_assert_delayed_root_empty(root);
 
@@ -4034,6 +4029,11 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 		btrfs_destroy_pinned_extent(root,
 					    root->fs_info->pinned_extents);
 
+		t->state = TRANS_STATE_COMPLETED;
+		smp_mb();
+		if (waitqueue_active(&t->commit_wait))
+			wake_up(&t->commit_wait);
+
 		atomic_set(&t->use_count, 0);
 		list_del_init(&t->list);
 		memset(t, 0, sizeof(*t));
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5e75ff4..e25d972 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -34,6 +34,29 @@
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
+static unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
+	[TRANS_STATE_RUNNING]		= 0U,
+	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
+					   __TRANS_START),
+	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH),
+	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN),
+	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK),
+	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK),
+};
+
 static void put_transaction(struct btrfs_transaction *transaction)
 {
 	WARN_ON(atomic_read(&transaction->use_count) == 0);
@@ -50,13 +73,6 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
 }
 
-static inline int can_join_transaction(struct btrfs_transaction *trans,
-				       unsigned int type)
-{
-	return !(trans->in_commit &&
-		 (type & TRANS_EXTWRITERS));
-}
-
 static inline void extwriter_counter_inc(struct btrfs_transaction *trans,
 					 unsigned int type)
 {
@@ -98,26 +114,13 @@ loop:
 		return -EROFS;
 	}
 
-	if (fs_info->trans_no_join) {
-		/* 
-		 * If we are JOIN_NOLOCK we're already committing a current
-		 * transaction, we just need a handle to deal with something
-		 * when committing the transaction, such as inode cache and
-		 * space cache. It is a special case.
-		 */
-		if (type != TRANS_JOIN_NOLOCK) {
-			spin_unlock(&fs_info->trans_lock);
-			return -EBUSY;
-		}
-	}
-
 	cur_trans = fs_info->running_transaction;
 	if (cur_trans) {
 		if (cur_trans->aborted) {
 			spin_unlock(&fs_info->trans_lock);
 			return cur_trans->aborted;
 		}
-		if (!can_join_transaction(cur_trans, type)) {
+		if (btrfs_blocked_trans_types[cur_trans->state] & type) {
 			spin_unlock(&fs_info->trans_lock);
 			return -EBUSY;
 		}
@@ -136,6 +139,12 @@ loop:
 	if (type == TRANS_ATTACH)
 		return -ENOENT;
 
+	/*
+	 * JOIN_NOLOCK only happens during the transaction commit, so
+	 * it is impossible that ->running_transaction is NULL
+	 */
+	BUG_ON(type == TRANS_JOIN_NOLOCK);
+
 	cur_trans = kmem_cache_alloc(btrfs_transaction_cachep, GFP_NOFS);
 	if (!cur_trans)
 		return -ENOMEM;
@@ -144,7 +153,7 @@ loop:
 	if (fs_info->running_transaction) {
 		/*
 		 * someone started a transaction after we unlocked.  Make sure
-		 * to redo the trans_no_join checks above
+		 * to redo the checks above
 		 */
 		kmem_cache_free(btrfs_transaction_cachep, cur_trans);
 		goto loop;
@@ -158,14 +167,12 @@ loop:
 	extwriter_counter_init(cur_trans, type);
 	init_waitqueue_head(&cur_trans->writer_wait);
 	init_waitqueue_head(&cur_trans->commit_wait);
-	cur_trans->in_commit = 0;
-	cur_trans->blocked = 0;
+	cur_trans->state = TRANS_STATE_RUNNING;
 	/*
 	 * One for this trans handle, one so it will live on until we
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
-	cur_trans->commit_done = 0;
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.root = RB_ROOT;
@@ -188,7 +195,6 @@ loop:
 			"creating a fresh transaction\n");
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 
-	spin_lock_init(&cur_trans->commit_lock);
 	spin_lock_init(&cur_trans->delayed_refs.lock);
 	atomic_set(&cur_trans->delayed_refs.procs_running_refs, 0);
 	atomic_set(&cur_trans->delayed_refs.ref_seq, 0);
@@ -293,6 +299,12 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+static inline int is_transaction_blocked(struct btrfs_transaction *trans)
+{
+	return (trans->state >= TRANS_STATE_BLOCKED &&
+		trans->state < TRANS_STATE_UNBLOCKED);
+}
+
 /* wait for commit against the current transaction to become unblocked
  * when this is done, it is safe to start a new transaction, but the current
  * transaction might not be fully on disk.
@@ -303,12 +315,12 @@ static void wait_current_trans(struct btrfs_root *root)
 
 	spin_lock(&root->fs_info->trans_lock);
 	cur_trans = root->fs_info->running_transaction;
-	if (cur_trans && cur_trans->blocked) {
+	if (cur_trans && is_transaction_blocked(cur_trans)) {
 		atomic_inc(&cur_trans->use_count);
 		spin_unlock(&root->fs_info->trans_lock);
 
 		wait_event(root->fs_info->transaction_wait,
-			   !cur_trans->blocked);
+			   cur_trans->state >= TRANS_STATE_UNBLOCKED);
 		put_transaction(cur_trans);
 	} else {
 		spin_unlock(&root->fs_info->trans_lock);
@@ -432,7 +444,8 @@ again:
 	INIT_LIST_HEAD(&h->new_bgs);
 
 	smp_mb();
-	if (cur_trans->blocked && may_wait_transaction(root, type)) {
+	if (cur_trans->state >= TRANS_STATE_BLOCKED &&
+	    may_wait_transaction(root, type)) {
 		btrfs_commit_transaction(h, root);
 		goto again;
 	}
@@ -536,7 +549,7 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root)
 static noinline void wait_for_commit(struct btrfs_root *root,
 				    struct btrfs_transaction *commit)
 {
-	wait_event(commit->commit_wait, commit->commit_done);
+	wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED);
 }
 
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
@@ -572,8 +585,8 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
 		spin_lock(&root->fs_info->trans_lock);
 		list_for_each_entry_reverse(t, &root->fs_info->trans_list,
 					    list) {
-			if (t->in_commit) {
-				if (t->commit_done)
+			if (t->state >= TRANS_STATE_COMMIT_START) {
+				if (t->state == TRANS_STATE_COMPLETED)
 					break;
 				cur_trans = t;
 				atomic_inc(&cur_trans->use_count);
@@ -614,7 +627,8 @@ int btrfs_should_end_transaction(struct btrfs_trans_handle *trans,
 	int err;
 
 	smp_mb();
-	if (cur_trans->blocked || cur_trans->delayed_refs.flushing)
+	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
+	    cur_trans->delayed_refs.flushing)
 		return 1;
 
 	updates = trans->delayed_ref_updates;
@@ -682,12 +696,12 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		btrfs_create_pending_block_groups(trans, root);
 
 	if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
-	    should_end_transaction(trans, root)) {
-		trans->transaction->blocked = 1;
-		smp_wmb();
-	}
+	    should_end_transaction(trans, root) &&
+	    ACCESS_ONCE(cur_trans->state) == TRANS_STATE_RUNNING)
+		cmpxchg(&cur_trans->state, TRANS_STATE_RUNNING,
+			TRANS_STATE_BLOCKED);
 
-	if (lock && cur_trans->blocked && !cur_trans->in_commit) {
+	if (lock && ACCESS_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) {
 		if (throttle) {
 			/*
 			 * We may race with somebody else here so end up having
@@ -1343,20 +1357,26 @@ static void update_super_roots(struct btrfs_root *root)
 
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
 {
+	struct btrfs_transaction *trans;
 	int ret = 0;
+
 	spin_lock(&info->trans_lock);
-	if (info->running_transaction)
-		ret = info->running_transaction->in_commit;
+	trans = info->running_transaction;
+	if (trans)
+		ret = (trans->state >= TRANS_STATE_COMMIT_START);
 	spin_unlock(&info->trans_lock);
 	return ret;
 }
 
 int btrfs_transaction_blocked(struct btrfs_fs_info *info)
 {
+	struct btrfs_transaction *trans;
 	int ret = 0;
+
 	spin_lock(&info->trans_lock);
-	if (info->running_transaction)
-		ret = info->running_transaction->blocked;
+	trans = info->running_transaction;
+	if (trans)
+		ret = is_transaction_blocked(trans);
 	spin_unlock(&info->trans_lock);
 	return ret;
 }
@@ -1368,7 +1388,8 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info)
 static void wait_current_trans_commit_start(struct btrfs_root *root,
 					    struct btrfs_transaction *trans)
 {
-	wait_event(root->fs_info->transaction_blocked_wait, trans->in_commit);
+	wait_event(root->fs_info->transaction_blocked_wait,
+		   trans->state >= TRANS_STATE_COMMIT_START);
 }
 
 /*
@@ -1379,7 +1400,7 @@ static void wait_current_trans_commit_start_and_unblock(struct btrfs_root *root,
 					 struct btrfs_transaction *trans)
 {
 	wait_event(root->fs_info->transaction_wait,
-		   trans->commit_done || (trans->in_commit && !trans->blocked));
+		   trans->state >= TRANS_STATE_UNBLOCKED);
 }
 
 /*
@@ -1484,18 +1505,22 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 
 	list_del_init(&cur_trans->list);
 	if (cur_trans == root->fs_info->running_transaction) {
-		root->fs_info->trans_no_join = 1;
+		cur_trans->state = TRANS_STATE_COMMIT_DOING;
 		spin_unlock(&root->fs_info->trans_lock);
 		wait_event(cur_trans->writer_wait,
 			   atomic_read(&cur_trans->num_writers) == 1);
 
 		spin_lock(&root->fs_info->trans_lock);
-		root->fs_info->running_transaction = NULL;
 	}
 	spin_unlock(&root->fs_info->trans_lock);
 
 	btrfs_cleanup_one_transaction(trans->transaction, root);
 
+	spin_lock(&root->fs_info->trans_lock);
+	if (cur_trans == root->fs_info->running_transaction)
+		root->fs_info->running_transaction = NULL;
+	spin_unlock(&root->fs_info->trans_lock);
+
 	put_transaction(cur_trans);
 	put_transaction(cur_trans);
 
@@ -1507,10 +1532,6 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 		current->journal_info = NULL;
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
-
-	spin_lock(&root->fs_info->trans_lock);
-	root->fs_info->trans_no_join = 0;
-	spin_unlock(&root->fs_info->trans_lock);
 }
 
 static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
@@ -1554,13 +1575,6 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 		btrfs_wait_all_ordered_extents(fs_info, 1);
 }
 
-/*
- * btrfs_transaction state sequence:
- *    in_commit = 0, blocked = 0  (initial)
- *    in_commit = 1, blocked = 1
- *    blocked = 0
- *    commit_done = 1
- */
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
@@ -1615,9 +1629,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	spin_lock(&cur_trans->commit_lock);
-	if (cur_trans->in_commit) {
-		spin_unlock(&cur_trans->commit_lock);
+	spin_lock(&root->fs_info->trans_lock);
+	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+		spin_unlock(&root->fs_info->trans_lock);
 		atomic_inc(&cur_trans->use_count);
 		ret = btrfs_end_transaction(trans, root);
 
@@ -1628,16 +1642,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	trans->transaction->in_commit = 1;
-	trans->transaction->blocked = 1;
-	spin_unlock(&cur_trans->commit_lock);
+	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
-	spin_lock(&root->fs_info->trans_lock);
 	if (cur_trans->list.prev != &root->fs_info->trans_list) {
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
-		if (!prev_trans->commit_done) {
+		if (prev_trans->state != TRANS_STATE_COMPLETED) {
 			atomic_inc(&prev_trans->use_count);
 			spin_unlock(&root->fs_info->trans_lock);
 
@@ -1673,10 +1684,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	/*
 	 * Ok now we need to make sure to block out any other joins while we
 	 * commit the transaction.  We could have started a join before setting
-	 * no_join so make sure to wait for num_writers to == 1 again.
+	 * COMMIT_DOING so make sure to wait for num_writers to == 1 again.
 	 */
 	spin_lock(&root->fs_info->trans_lock);
-	root->fs_info->trans_no_join = 1;
+	cur_trans->state = TRANS_STATE_COMMIT_DOING;
 	spin_unlock(&root->fs_info->trans_lock);
 	wait_event(cur_trans->writer_wait,
 		   atomic_read(&cur_trans->num_writers) == 1);
@@ -1803,10 +1814,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
 
-	trans->transaction->blocked = 0;
 	spin_lock(&root->fs_info->trans_lock);
+	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	root->fs_info->running_transaction = NULL;
-	root->fs_info->trans_no_join = 0;
 	spin_unlock(&root->fs_info->trans_lock);
 	mutex_unlock(&root->fs_info->reloc_mutex);
 
@@ -1834,10 +1844,12 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	btrfs_finish_extent_commit(trans, root);
 
-	cur_trans->commit_done = 1;
-
 	root->fs_info->last_trans_committed = cur_trans->transid;
-
+	/*
+	 * We needn't acquire the lock here because there is no other task
+	 * which can change it.
+	 */
+	cur_trans->state = TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 
 	spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0fc45e2..66d2a6c 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -22,6 +22,16 @@
 #include "delayed-ref.h"
 #include "ctree.h"
 
+enum btrfs_trans_state {
+	TRANS_STATE_RUNNING		= 0,
+	TRANS_STATE_BLOCKED		= 1,
+	TRANS_STATE_COMMIT_START	= 2,
+	TRANS_STATE_COMMIT_DOING	= 3,
+	TRANS_STATE_UNBLOCKED		= 4,
+	TRANS_STATE_COMPLETED		= 5,
+	TRANS_STATE_MAX			= 6,
+};
+
 struct btrfs_transaction {
 	u64 transid;
 	/*
@@ -37,10 +47,8 @@ struct btrfs_transaction {
 	atomic_t num_writers;
 	atomic_t use_count;
 
-	spinlock_t commit_lock;
-	int in_commit;
-	int commit_done;
-	int blocked;
+	/* Be protected by fs_info->trans_lock when we want to change it. */
+	enum btrfs_trans_state state;
 	struct list_head list;
 	struct extent_io_tree dirty_pages;
 	unsigned long start_time;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread
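
A minimal, self-contained sketch of the scheme the patch above moves to: one
state field plus a per-state mask of blocked handle types, instead of the
in_commit/blocked/commit_done flags.  The state names mirror the patch and the
per-state masks follow the table given later in the v2 changelog; the
__TRANS_* bit values and the rest of the program are invented purely for
illustration.

	#include <stdio.h>

	/* simplified copy of the states introduced by the patch */
	enum btrfs_trans_state {
		TRANS_STATE_RUNNING,
		TRANS_STATE_BLOCKED,
		TRANS_STATE_COMMIT_START,
		TRANS_STATE_COMMIT_DOING,
		TRANS_STATE_UNBLOCKED,
		TRANS_STATE_COMPLETED,
		TRANS_STATE_MAX,
	};

	/* handle types; these bit values are made up for the sketch */
	#define __TRANS_USERSPACE	(1U << 0)
	#define __TRANS_START		(1U << 1)
	#define __TRANS_ATTACH		(1U << 2)
	#define __TRANS_JOIN		(1U << 3)
	#define __TRANS_JOIN_NOLOCK	(1U << 4)

	/* which handle types may not join the transaction in each state */
	static const unsigned int blocked_types[TRANS_STATE_MAX] = {
		[TRANS_STATE_RUNNING]	   = 0U,
		[TRANS_STATE_BLOCKED]	   = __TRANS_USERSPACE | __TRANS_START,
		[TRANS_STATE_COMMIT_START] = __TRANS_USERSPACE | __TRANS_START |
					     __TRANS_ATTACH,
		[TRANS_STATE_COMMIT_DOING] = __TRANS_USERSPACE | __TRANS_START |
					     __TRANS_ATTACH | __TRANS_JOIN,
		[TRANS_STATE_UNBLOCKED]	   = __TRANS_USERSPACE | __TRANS_START |
					     __TRANS_ATTACH | __TRANS_JOIN |
					     __TRANS_JOIN_NOLOCK,
		[TRANS_STATE_COMPLETED]	   = __TRANS_USERSPACE | __TRANS_START |
					     __TRANS_ATTACH | __TRANS_JOIN |
					     __TRANS_JOIN_NOLOCK,
	};

	/* one lookup replaces the old blocked/in_commit/commit_done checks */
	static int must_wait(enum btrfs_trans_state state, unsigned int type)
	{
		return (blocked_types[state] & type) != 0;
	}

	int main(void)
	{
		enum btrfs_trans_state s;

		for (s = TRANS_STATE_RUNNING; s < TRANS_STATE_MAX; s++)
			printf("state %d: START blocked=%d, JOIN blocked=%d\n",
			       s, must_wait(s, __TRANS_START),
			       must_wait(s, __TRANS_JOIN));
		return 0;
	}

Because the states are strictly ordered, checks such as the patch's
is_transaction_blocked() reduce to simple range comparisons on that single
field.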

* Re: [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
  2013-05-15  7:48 ` [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Miao Xie
@ 2013-05-16  3:20   ` Liu Bo
  2013-05-16  4:00     ` Miao Xie
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16  3:20 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, alex.btrfs

On Wed, May 15, 2013 at 03:48:24PM +0800, Miao Xie wrote:
> Before applying this patch, we need flush all the delalloc inodes in
> the fs when we want to create a snapshot, it wastes time, and make
> the transaction commit be blocked for a long time. It means some other
> user operation would also be blocked for a long time.
> 
> This patch improves this problem, we just flush the delalloc inodes that
> in the source trees before snapshot creation, so the transaction commit
> will complete quickly.
> 
> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
>  fs/btrfs/ioctl.c       |  6 ++++++
>  fs/btrfs/transaction.c | 10 +---------
>  2 files changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0de4a2f..2677dcc 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -555,6 +555,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  	if (!root->ref_cows)
>  		return -EINVAL;
>  
> +	ret = btrfs_start_delalloc_inodes(root, 0);
> +	if (ret)
> +		return ret;
> +
> +	btrfs_wait_ordered_extents(root, 0);
> +

Does this look too radical?  Does this snapshot creation ioctl block all
writes on its src root?
No, we can only be sure that no new ordered extents are being added once
trans_no_join is set, and only then is it safe to create the pending snapshots.

thanks,
liubo

>  	pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_NOFS);
>  	if (!pending_snapshot)
>  		return -ENOMEM;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2b17213..bc22be9 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1491,17 +1491,9 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
>  					  struct btrfs_root *root)
>  {
>  	int flush_on_commit = btrfs_test_opt(root, FLUSHONCOMMIT);
> -	int snap_pending = 0;
>  	int ret;
>  
> -	if (!flush_on_commit) {
> -		spin_lock(&root->fs_info->trans_lock);
> -		if (!list_empty(&trans->transaction->pending_snapshots))
> -			snap_pending = 1;
> -		spin_unlock(&root->fs_info->trans_lock);
> -	}
> -
> -	if (flush_on_commit || snap_pending) {
> +	if (flush_on_commit) {
>  		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
>  		if (ret)
>  			return ret;
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-15  7:48 ` [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree Miao Xie
@ 2013-05-16  3:36   ` Liu Bo
  2013-05-16  4:31     ` Miao Xie
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16  3:36 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, alex.btrfs

On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
> The grab/put funtions will be used in the next patch, which need grab
> the root object and ensure it is not freed. We use reference counter
> instead of the srcu lock is to aovid blocking the memory reclaim task,
> which invokes synchronize_srcu().
> 

I don't think this is necessary; we put the 'kfree(root)' there because we
really need to free the root at the very end, when there should be no inodes
still linked on it (we should have cleaned all inodes out of it).

So when we flush delalloc inodes and wait for ordered extents to finish,
the root should be valid; otherwise someone is doing something wrong.

And even with this grab_fs_root to keep the root from being freed, it's just
the struct that remains in memory; all its members, like root->node,
commit_root and root->inode_tree, are already NULL or empty.

thanks,
liubo

> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h       |  1 +
>  fs/btrfs/disk-io.c     |  5 +++--
>  fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
>  fs/btrfs/extent-tree.c |  2 +-
>  4 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 845b77f..958ce6c 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1739,6 +1739,7 @@ struct btrfs_root {
>  	int force_cow;
>  
>  	spinlock_t root_item_lock;
> +	atomic_t refs;
>  };
>  
>  struct btrfs_ioctl_defrag_range_args {
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 42d6ba2..642c861 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>  	atomic_set(&root->log_writers, 0);
>  	atomic_set(&root->log_batch, 0);
>  	atomic_set(&root->orphan_inodes, 0);
> +	atomic_set(&root->refs, 1);
>  	root->log_transid = 0;
>  	root->last_log_commit = 0;
>  	extent_io_tree_init(&root->dirty_log_pages,
> @@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
>  		} else {
>  			free_extent_buffer(gang[0]->node);
>  			free_extent_buffer(gang[0]->commit_root);
> -			kfree(gang[0]);
> +			btrfs_put_fs_root(gang[0]);
>  		}
>  	}
>  
> @@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
>  	kfree(root->free_ino_ctl);
>  	kfree(root->free_ino_pinned);
>  	kfree(root->name);
> -	kfree(root);
> +	btrfs_put_fs_root(root);
>  }
>  
>  void btrfs_free_fs_root(struct btrfs_root *root)
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 534d583..b71acd6e 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
>  void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
>  				 struct btrfs_root *root);
>  void btrfs_free_fs_root(struct btrfs_root *root);
> +
> +/*
> + * This function is used to grab the root, and avoid it is freed when we
> + * access it. But it doesn't ensure that the tree is not dropped.
> + *
> + * If you want to ensure the whole tree is safe, you should use
> + * 	fs_info->subvol_srcu
> + */
> +static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
> +{
> +	if (atomic_inc_not_zero(&root->refs))
> +		return root;
> +	return NULL;
> +}
> +
> +static inline void btrfs_put_fs_root(struct btrfs_root *root)
> +{
> +	if (atomic_dec_and_test(&root->refs))
> +		kfree(root);
> +}
> +
>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>  			  int atomic);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 08e42c8..08f9862 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>  	} else {
>  		free_extent_buffer(root->node);
>  		free_extent_buffer(root->commit_root);
> -		kfree(root);
> +		btrfs_put_fs_root(root);
>  	}
>  out_end_trans:
>  	btrfs_end_transaction_throttle(trans, tree_root);
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
  2013-05-16  3:20   ` Liu Bo
@ 2013-05-16  4:00     ` Miao Xie
  0 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-16  4:00 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, alex.btrfs

On Thu, 16 May 2013 11:20:39 +0800, Liu Bo wrote:
> On Wed, May 15, 2013 at 03:48:24PM +0800, Miao Xie wrote:
>> Before applying this patch, we need flush all the delalloc inodes in
>> the fs when we want to create a snapshot, it wastes time, and make
>> the transaction commit be blocked for a long time. It means some other
>> user operation would also be blocked for a long time.
>>
>> This patch improves this problem, we just flush the delalloc inodes that
>> in the source trees before snapshot creation, so the transaction commit
>> will complete quickly.
>>
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>>  fs/btrfs/ioctl.c       |  6 ++++++
>>  fs/btrfs/transaction.c | 10 +---------
>>  2 files changed, 7 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 0de4a2f..2677dcc 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -555,6 +555,12 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>>  	if (!root->ref_cows)
>>  		return -EINVAL;
>>  
>> +	ret = btrfs_start_delalloc_inodes(root, 0);
>> +	if (ret)
>> +		return ret;
>> +
>> +	btrfs_wait_ordered_extents(root, 0);
>> +
> 
> Does this look too radical?  Does this snapshot creation ioctl block all
> writes on its src root?

I don't think it is radical; what I think is stupid is flushing delalloc inodes during
the transaction commit, especially flushing all the inodes, including those of the roots
which are not going to be snapshotted, because it will block user operations (such as
mkdir, rmdir, create and so on) for a long time if there are lots of dirty pages.

And the snapshot creation now doesn't block writes to the source root at all; there is
no appreciable difference between this approach and the background flusher.

> No, we can only be sure that there is no ordered extents being added until
> setting trans_no_join, and then it's safe to create pending snapshots.

Actually, we cannot prevent new ordered extents from being added before trans_no_join is set.
But from the user's point of view, the 1st case below must be handled correctly, while the 2nd
one can be ignored, because the write in the 2nd case can be regarded as happening after the
snapshot creation.

1st case:
	Task
	write data into a file
	make a snapshot

2nd case:
	Task0				Task1
	make a snapshot
	  flush delalloc inodes
					write data into a file
	  commit transaction
	    create_pending_snapshot

Thanks
Miao

> 
> thanks,
> liubo
> 
>>  	pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_NOFS);
>>  	if (!pending_snapshot)
>>  		return -ENOMEM;
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index 2b17213..bc22be9 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -1491,17 +1491,9 @@ static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
>>  					  struct btrfs_root *root)
>>  {
>>  	int flush_on_commit = btrfs_test_opt(root, FLUSHONCOMMIT);
>> -	int snap_pending = 0;
>>  	int ret;
>>  
>> -	if (!flush_on_commit) {
>> -		spin_lock(&root->fs_info->trans_lock);
>> -		if (!list_empty(&trans->transaction->pending_snapshots))
>> -			snap_pending = 1;
>> -		spin_unlock(&root->fs_info->trans_lock);
>> -	}
>> -
>> -	if (flush_on_commit || snap_pending) {
>> +	if (flush_on_commit) {
>>  		ret = btrfs_start_all_delalloc_inodes(root->fs_info, 1);
>>  		if (ret)
>>  			return ret;
>> -- 
>> 1.8.1.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  3:36   ` Liu Bo
@ 2013-05-16  4:31     ` Miao Xie
  2013-05-16  5:15       ` Liu Bo
  0 siblings, 1 reply; 33+ messages in thread
From: Miao Xie @ 2013-05-16  4:31 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, alex.btrfs

On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
> On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
>> The grab/put funtions will be used in the next patch, which need grab
>> the root object and ensure it is not freed. We use reference counter
>> instead of the srcu lock is to aovid blocking the memory reclaim task,
>> which invokes synchronize_srcu().
>>
> 
> I don't think this is necessary, we put 'kfree(root)' because we really
> need to free them at the very end time, when there should be no inodes
> linking on the root(we should have cleaned all inodes out from it).
> 
> So when we flush delalloc inodes and wait for ordered extents to finish,
> the root should be valid, otherwise someone is doing wrong things.
> 
> And even with this grab_fs_root to avoid freeing root, it's just the
> root that remains in memory, all its attributes, like root->node,
> commit_root, root->inode_tree, are already NULL or empty.

Please consider the case:
	Task1			Task2					Cleaner
	get the root
				flush all delalloc inodes
				drop subvolume
				iput the last inode
				  move the root into the dead list
									drop subvolume
									kfree(root)
If Task1 accesses the root now, oops will happen.

I introduced these two functions just to protect access to the root object itself, not
its members, so don't worry about the members. (Please see the first sentence of the
changelog.)
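
A small, self-contained userspace model of the lifetime rule these two helpers
enforce (a reader either pins the root or gets NULL back, and the last put
frees the object), written with C11 atomics; only the names are borrowed from
the patch, and the struct is reduced to a refcount plus a payload:

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct fs_root {
		atomic_int refs;	/* starts at 1, the long-lived reference */
		int payload;
	};

	/* models btrfs_grab_fs_root(): pin the root unless refs already hit 0 */
	static struct fs_root *grab_fs_root(struct fs_root *root)
	{
		int old = atomic_load(&root->refs);

		while (old != 0) {
			if (atomic_compare_exchange_weak(&root->refs, &old,
							 old + 1))
				return root;	/* pinned successfully */
		}
		return NULL;			/* too late, it is being freed */
	}

	/* models btrfs_put_fs_root(): the last reference frees the object */
	static void put_fs_root(struct fs_root *root)
	{
		if (atomic_fetch_sub(&root->refs, 1) == 1)
			free(root);
	}

	int main(void)
	{
		struct fs_root *root = malloc(sizeof(*root));
		struct fs_root *pinned;

		atomic_init(&root->refs, 1);
		root->payload = 42;

		pinned = grab_fs_root(root);	/* "Task1" takes a reference */
		put_fs_root(root);		/* "Cleaner" drops the original one */

		if (pinned) {
			/* still safe: not freed until the last reference is put */
			printf("payload: %d\n", pinned->payload);
			put_fs_root(pinned);
		}
		return 0;
	}

In the race pictured above, Task1 would either get NULL back from the grab
(the cleaner has already dropped the last reference) or keep the object alive
until its own put, so the free can never run underneath it.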

Thanks
Miao

> 
> thanks,
> liubo
> 
>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>> ---
>>  fs/btrfs/ctree.h       |  1 +
>>  fs/btrfs/disk-io.c     |  5 +++--
>>  fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
>>  fs/btrfs/extent-tree.c |  2 +-
>>  4 files changed, 26 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 845b77f..958ce6c 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1739,6 +1739,7 @@ struct btrfs_root {
>>  	int force_cow;
>>  
>>  	spinlock_t root_item_lock;
>> +	atomic_t refs;
>>  };
>>  
>>  struct btrfs_ioctl_defrag_range_args {
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 42d6ba2..642c861 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>>  	atomic_set(&root->log_writers, 0);
>>  	atomic_set(&root->log_batch, 0);
>>  	atomic_set(&root->orphan_inodes, 0);
>> +	atomic_set(&root->refs, 1);
>>  	root->log_transid = 0;
>>  	root->last_log_commit = 0;
>>  	extent_io_tree_init(&root->dirty_log_pages,
>> @@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
>>  		} else {
>>  			free_extent_buffer(gang[0]->node);
>>  			free_extent_buffer(gang[0]->commit_root);
>> -			kfree(gang[0]);
>> +			btrfs_put_fs_root(gang[0]);
>>  		}
>>  	}
>>  
>> @@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
>>  	kfree(root->free_ino_ctl);
>>  	kfree(root->free_ino_pinned);
>>  	kfree(root->name);
>> -	kfree(root);
>> +	btrfs_put_fs_root(root);
>>  }
>>  
>>  void btrfs_free_fs_root(struct btrfs_root *root)
>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>> index 534d583..b71acd6e 100644
>> --- a/fs/btrfs/disk-io.h
>> +++ b/fs/btrfs/disk-io.h
>> @@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
>>  void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
>>  				 struct btrfs_root *root);
>>  void btrfs_free_fs_root(struct btrfs_root *root);
>> +
>> +/*
>> + * This function is used to grab the root, and avoid it is freed when we
>> + * access it. But it doesn't ensure that the tree is not dropped.
>> + *
>> + * If you want to ensure the whole tree is safe, you should use
>> + * 	fs_info->subvol_srcu
>> + */
>> +static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
>> +{
>> +	if (atomic_inc_not_zero(&root->refs))
>> +		return root;
>> +	return NULL;
>> +}
>> +
>> +static inline void btrfs_put_fs_root(struct btrfs_root *root)
>> +{
>> +	if (atomic_dec_and_test(&root->refs))
>> +		kfree(root);
>> +}
>> +
>>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>>  			  int atomic);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 08e42c8..08f9862 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>  	} else {
>>  		free_extent_buffer(root->node);
>>  		free_extent_buffer(root->commit_root);
>> -		kfree(root);
>> +		btrfs_put_fs_root(root);
>>  	}
>>  out_end_trans:
>>  	btrfs_end_transaction_throttle(trans, tree_root);
>> -- 
>> 1.8.1.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  4:31     ` Miao Xie
@ 2013-05-16  5:15       ` Liu Bo
  2013-05-16  5:34         ` Miao Xie
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16  5:15 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, alex.btrfs

On Thu, May 16, 2013 at 12:31:11PM +0800, Miao Xie wrote:
> On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
> > On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
> >> The grab/put funtions will be used in the next patch, which need grab
> >> the root object and ensure it is not freed. We use reference counter
> >> instead of the srcu lock is to aovid blocking the memory reclaim task,
> >> which invokes synchronize_srcu().
> >>
> > 
> > I don't think this is necessary, we put 'kfree(root)' because we really
> > need to free them at the very end time, when there should be no inodes
> > linking on the root(we should have cleaned all inodes out from it).
> > 
> > So when we flush delalloc inodes and wait for ordered extents to finish,
> > the root should be valid, otherwise someone is doing wrong things.
> > 
> > And even with this grab_fs_root to avoid freeing root, it's just the
> > root that remains in memory, all its attributes, like root->node,
> > commit_root, root->inode_tree, are already NULL or empty.
> 
> Please consider the case:
> 	Task1			Task2					Cleaner
> 	get the root
> 				flush all delalloc inodes
> 				drop subvolume
> 				iput the last inode
> 				  move the root into the dead list
> 									drop subvolume
> 									kfree(root)
> If Task1 accesses the root now, oops will happen.

Then it's task1's fault: why is it not protected by a subvol_srcu read section
when it's possible that someone like task2 sets the root's refs to 0?

synchronize_srcu(subvol_srcu) before adding the root to the dead root list
exists exactly for this race, so why do we need another mechanism?

thanks,
liubo

> 
> I introduced there two functions just to protect the access of the root object, not its
> attributes, so don't worry about the attributes. (Please see the first sentence of the
> changelog.)
> 
> Thanks
> Miao
> 
> > 
> > thanks,
> > liubo
> > 
> >> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> >> ---
> >>  fs/btrfs/ctree.h       |  1 +
> >>  fs/btrfs/disk-io.c     |  5 +++--
> >>  fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
> >>  fs/btrfs/extent-tree.c |  2 +-
> >>  4 files changed, 26 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >> index 845b77f..958ce6c 100644
> >> --- a/fs/btrfs/ctree.h
> >> +++ b/fs/btrfs/ctree.h
> >> @@ -1739,6 +1739,7 @@ struct btrfs_root {
> >>  	int force_cow;
> >>  
> >>  	spinlock_t root_item_lock;
> >> +	atomic_t refs;
> >>  };
> >>  
> >>  struct btrfs_ioctl_defrag_range_args {
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index 42d6ba2..642c861 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
> >>  	atomic_set(&root->log_writers, 0);
> >>  	atomic_set(&root->log_batch, 0);
> >>  	atomic_set(&root->orphan_inodes, 0);
> >> +	atomic_set(&root->refs, 1);
> >>  	root->log_transid = 0;
> >>  	root->last_log_commit = 0;
> >>  	extent_io_tree_init(&root->dirty_log_pages,
> >> @@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
> >>  		} else {
> >>  			free_extent_buffer(gang[0]->node);
> >>  			free_extent_buffer(gang[0]->commit_root);
> >> -			kfree(gang[0]);
> >> +			btrfs_put_fs_root(gang[0]);
> >>  		}
> >>  	}
> >>  
> >> @@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
> >>  	kfree(root->free_ino_ctl);
> >>  	kfree(root->free_ino_pinned);
> >>  	kfree(root->name);
> >> -	kfree(root);
> >> +	btrfs_put_fs_root(root);
> >>  }
> >>  
> >>  void btrfs_free_fs_root(struct btrfs_root *root)
> >> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> >> index 534d583..b71acd6e 100644
> >> --- a/fs/btrfs/disk-io.h
> >> +++ b/fs/btrfs/disk-io.h
> >> @@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
> >>  void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
> >>  				 struct btrfs_root *root);
> >>  void btrfs_free_fs_root(struct btrfs_root *root);
> >> +
> >> +/*
> >> + * This function is used to grab the root, and avoid it is freed when we
> >> + * access it. But it doesn't ensure that the tree is not dropped.
> >> + *
> >> + * If you want to ensure the whole tree is safe, you should use
> >> + * 	fs_info->subvol_srcu
> >> + */
> >> +static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
> >> +{
> >> +	if (atomic_inc_not_zero(&root->refs))
> >> +		return root;
> >> +	return NULL;
> >> +}
> >> +
> >> +static inline void btrfs_put_fs_root(struct btrfs_root *root)
> >> +{
> >> +	if (atomic_dec_and_test(&root->refs))
> >> +		kfree(root);
> >> +}
> >> +
> >>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
> >>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> >>  			  int atomic);
> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> index 08e42c8..08f9862 100644
> >> --- a/fs/btrfs/extent-tree.c
> >> +++ b/fs/btrfs/extent-tree.c
> >> @@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> >>  	} else {
> >>  		free_extent_buffer(root->node);
> >>  		free_extent_buffer(root->commit_root);
> >> -		kfree(root);
> >> +		btrfs_put_fs_root(root);
> >>  	}
> >>  out_end_trans:
> >>  	btrfs_end_transaction_throttle(trans, tree_root);
> >> -- 
> >> 1.8.1.4
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  5:15       ` Liu Bo
@ 2013-05-16  5:34         ` Miao Xie
  2013-05-16  6:15           ` Liu Bo
  0 siblings, 1 reply; 33+ messages in thread
From: Miao Xie @ 2013-05-16  5:34 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, alex.btrfs

On Thu, 16 May 2013 13:15:57 +0800, Liu Bo wrote:
> On Thu, May 16, 2013 at 12:31:11PM +0800, Miao Xie wrote:
>> On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
>>> On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
>>>> The grab/put funtions will be used in the next patch, which need grab
>>>> the root object and ensure it is not freed. We use reference counter
>>>> instead of the srcu lock is to aovid blocking the memory reclaim task,
>>>> which invokes synchronize_srcu().
>>>>
>>>
>>> I don't think this is necessary, we put 'kfree(root)' because we really
>>> need to free them at the very end time, when there should be no inodes
>>> linking on the root(we should have cleaned all inodes out from it).
>>>
>>> So when we flush delalloc inodes and wait for ordered extents to finish,
>>> the root should be valid, otherwise someone is doing wrong things.
>>>
>>> And even with this grab_fs_root to avoid freeing root, it's just the
>>> root that remains in memory, all its attributes, like root->node,
>>> commit_root, root->inode_tree, are already NULL or empty.
>>
>> Please consider the case:
>> 	Task1			Task2					Cleaner
>> 	get the root
>> 				flush all delalloc inodes
>> 				drop subvolume
>> 				iput the last inode
>> 				  move the root into the dead list
>> 									drop subvolume
>> 									kfree(root)
>> If Task1 accesses the root now, oops will happen.
> 
> Then it's task1's fault, why it is not protected by subvol_srcu section when
> it's possible that someone like task2 sets root's refs to 0?
> 
> synchronize_srcu(subvol_srcu) before adding root into dead root list is
> just for this race case, why do we need another?

Please read my changelog.

Miao

> 
> thanks,
> liubo
> 
>>
>> I introduced there two functions just to protect the access of the root object, not its
>> attributes, so don't worry about the attributes. (Please see the first sentence of the
>> changelog.)
>>
>> Thanks
>> Miao
>>
>>>
>>> thanks,
>>> liubo
>>>
>>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
>>>> ---
>>>>  fs/btrfs/ctree.h       |  1 +
>>>>  fs/btrfs/disk-io.c     |  5 +++--
>>>>  fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
>>>>  fs/btrfs/extent-tree.c |  2 +-
>>>>  4 files changed, 26 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>>> index 845b77f..958ce6c 100644
>>>> --- a/fs/btrfs/ctree.h
>>>> +++ b/fs/btrfs/ctree.h
>>>> @@ -1739,6 +1739,7 @@ struct btrfs_root {
>>>>  	int force_cow;
>>>>  
>>>>  	spinlock_t root_item_lock;
>>>> +	atomic_t refs;
>>>>  };
>>>>  
>>>>  struct btrfs_ioctl_defrag_range_args {
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 42d6ba2..642c861 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
>>>>  	atomic_set(&root->log_writers, 0);
>>>>  	atomic_set(&root->log_batch, 0);
>>>>  	atomic_set(&root->orphan_inodes, 0);
>>>> +	atomic_set(&root->refs, 1);
>>>>  	root->log_transid = 0;
>>>>  	root->last_log_commit = 0;
>>>>  	extent_io_tree_init(&root->dirty_log_pages,
>>>> @@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
>>>>  		} else {
>>>>  			free_extent_buffer(gang[0]->node);
>>>>  			free_extent_buffer(gang[0]->commit_root);
>>>> -			kfree(gang[0]);
>>>> +			btrfs_put_fs_root(gang[0]);
>>>>  		}
>>>>  	}
>>>>  
>>>> @@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
>>>>  	kfree(root->free_ino_ctl);
>>>>  	kfree(root->free_ino_pinned);
>>>>  	kfree(root->name);
>>>> -	kfree(root);
>>>> +	btrfs_put_fs_root(root);
>>>>  }
>>>>  
>>>>  void btrfs_free_fs_root(struct btrfs_root *root)
>>>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
>>>> index 534d583..b71acd6e 100644
>>>> --- a/fs/btrfs/disk-io.h
>>>> +++ b/fs/btrfs/disk-io.h
>>>> @@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
>>>>  void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
>>>>  				 struct btrfs_root *root);
>>>>  void btrfs_free_fs_root(struct btrfs_root *root);
>>>> +
>>>> +/*
>>>> + * This function is used to grab the root, and avoid it is freed when we
>>>> + * access it. But it doesn't ensure that the tree is not dropped.
>>>> + *
>>>> + * If you want to ensure the whole tree is safe, you should use
>>>> + * 	fs_info->subvol_srcu
>>>> + */
>>>> +static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
>>>> +{
>>>> +	if (atomic_inc_not_zero(&root->refs))
>>>> +		return root;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>> +static inline void btrfs_put_fs_root(struct btrfs_root *root)
>>>> +{
>>>> +	if (atomic_dec_and_test(&root->refs))
>>>> +		kfree(root);
>>>> +}
>>>> +
>>>>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
>>>>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
>>>>  			  int atomic);
>>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>>> index 08e42c8..08f9862 100644
>>>> --- a/fs/btrfs/extent-tree.c
>>>> +++ b/fs/btrfs/extent-tree.c
>>>> @@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>>>  	} else {
>>>>  		free_extent_buffer(root->node);
>>>>  		free_extent_buffer(root->commit_root);
>>>> -		kfree(root);
>>>> +		btrfs_put_fs_root(root);
>>>>  	}
>>>>  out_end_trans:
>>>>  	btrfs_end_transaction_throttle(trans, tree_root);
>>>> -- 
>>>> 1.8.1.4
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  5:34         ` Miao Xie
@ 2013-05-16  6:15           ` Liu Bo
  2013-05-16  7:22             ` Miao Xie
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16  6:15 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, alex.btrfs

On Thu, May 16, 2013 at 01:34:11PM +0800, Miao Xie wrote:
> On Thu, 16 May 2013 13:15:57 +0800, Liu Bo wrote:
> > On Thu, May 16, 2013 at 12:31:11PM +0800, Miao Xie wrote:
> >> On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
> >>> On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
> >>>> The grab/put funtions will be used in the next patch, which need grab
> >>>> the root object and ensure it is not freed. We use reference counter
> >>>> instead of the srcu lock is to aovid blocking the memory reclaim task,
> >>>> which invokes synchronize_srcu().
> >>>>
> >>>
> >>> I don't think this is necessary, we put 'kfree(root)' because we really
> >>> need to free them at the very end time, when there should be no inodes
> >>> linking on the root(we should have cleaned all inodes out from it).
> >>>
> >>> So when we flush delalloc inodes and wait for ordered extents to finish,
> >>> the root should be valid, otherwise someone is doing wrong things.
> >>>
> >>> And even with this grab_fs_root to avoid freeing root, it's just the
> >>> root that remains in memory, all its attributes, like root->node,
> >>> commit_root, root->inode_tree, are already NULL or empty.
> >>
> >> Please consider the case:
> >> 	Task1			Task2					Cleaner
> >> 	get the root
> >> 				flush all delalloc inodes
> >> 				drop subvolume
> >> 				iput the last inode
> >> 				  move the root into the dead list
> >> 									drop subvolume
> >> 									kfree(root)
> >> If Task1 accesses the root now, oops will happen.
> > 
> > Then it's task1's fault, why it is not protected by subvol_srcu section when
> > it's possible that someone like task2 sets root's refs to 0?
> > 
> > synchronize_srcu(subvol_srcu) before adding root into dead root list is
> > just for this race case, why do we need another?
> 
> Please read my changelog.

'The memory reclaim task' in the changelog refers to
	iput
	  -> inode_tree_del
, right?

I don't like special cases; this get/put is different from our usual pattern:
if (atomic_dec_and_test(refs)) {
	kfree(A->a);
	kfree(A->b);
	kfree(A);
}

According to the pictured case, task1 may also access root->something.

I must say that the patch itself looks harmless, but the reason for it is not good enough.

thanks,
liubo

> 
> Miao
> 
> > 
> > thanks,
> > liubo
> > 
> >>
> >> I introduced there two functions just to protect the access of the root object, not its
> >> attributes, so don't worry about the attributes. (Please see the first sentence of the
> >> changelog.)
> >>
> >> Thanks
> >> Miao
> >>
> >>>
> >>> thanks,
> >>> liubo
> >>>
> >>>> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
> >>>> ---
> >>>>  fs/btrfs/ctree.h       |  1 +
> >>>>  fs/btrfs/disk-io.c     |  5 +++--
> >>>>  fs/btrfs/disk-io.h     | 21 +++++++++++++++++++++
> >>>>  fs/btrfs/extent-tree.c |  2 +-
> >>>>  4 files changed, 26 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>>> index 845b77f..958ce6c 100644
> >>>> --- a/fs/btrfs/ctree.h
> >>>> +++ b/fs/btrfs/ctree.h
> >>>> @@ -1739,6 +1739,7 @@ struct btrfs_root {
> >>>>  	int force_cow;
> >>>>  
> >>>>  	spinlock_t root_item_lock;
> >>>> +	atomic_t refs;
> >>>>  };
> >>>>  
> >>>>  struct btrfs_ioctl_defrag_range_args {
> >>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >>>> index 42d6ba2..642c861 100644
> >>>> --- a/fs/btrfs/disk-io.c
> >>>> +++ b/fs/btrfs/disk-io.c
> >>>> @@ -1216,6 +1216,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
> >>>>  	atomic_set(&root->log_writers, 0);
> >>>>  	atomic_set(&root->log_batch, 0);
> >>>>  	atomic_set(&root->orphan_inodes, 0);
> >>>> +	atomic_set(&root->refs, 1);
> >>>>  	root->log_transid = 0;
> >>>>  	root->last_log_commit = 0;
> >>>>  	extent_io_tree_init(&root->dirty_log_pages,
> >>>> @@ -2049,7 +2050,7 @@ static void del_fs_roots(struct btrfs_fs_info *fs_info)
> >>>>  		} else {
> >>>>  			free_extent_buffer(gang[0]->node);
> >>>>  			free_extent_buffer(gang[0]->commit_root);
> >>>> -			kfree(gang[0]);
> >>>> +			btrfs_put_fs_root(gang[0]);
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> @@ -3415,7 +3416,7 @@ static void free_fs_root(struct btrfs_root *root)
> >>>>  	kfree(root->free_ino_ctl);
> >>>>  	kfree(root->free_ino_pinned);
> >>>>  	kfree(root->name);
> >>>> -	kfree(root);
> >>>> +	btrfs_put_fs_root(root);
> >>>>  }
> >>>>  
> >>>>  void btrfs_free_fs_root(struct btrfs_root *root)
> >>>> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> >>>> index 534d583..b71acd6e 100644
> >>>> --- a/fs/btrfs/disk-io.h
> >>>> +++ b/fs/btrfs/disk-io.h
> >>>> @@ -76,6 +76,27 @@ void btrfs_btree_balance_dirty_nodelay(struct btrfs_root *root);
> >>>>  void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
> >>>>  				 struct btrfs_root *root);
> >>>>  void btrfs_free_fs_root(struct btrfs_root *root);
> >>>> +
> >>>> +/*
> >>>> + * This function is used to grab the root, and avoid it is freed when we
> >>>> + * access it. But it doesn't ensure that the tree is not dropped.
> >>>> + *
> >>>> + * If you want to ensure the whole tree is safe, you should use
> >>>> + * 	fs_info->subvol_srcu
> >>>> + */
> >>>> +static inline struct btrfs_root *btrfs_grab_fs_root(struct btrfs_root *root)
> >>>> +{
> >>>> +	if (atomic_inc_not_zero(&root->refs))
> >>>> +		return root;
> >>>> +	return NULL;
> >>>> +}
> >>>> +
> >>>> +static inline void btrfs_put_fs_root(struct btrfs_root *root)
> >>>> +{
> >>>> +	if (atomic_dec_and_test(&root->refs))
> >>>> +		kfree(root);
> >>>> +}
> >>>> +
> >>>>  void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
> >>>>  int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid,
> >>>>  			  int atomic);
> >>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >>>> index 08e42c8..08f9862 100644
> >>>> --- a/fs/btrfs/extent-tree.c
> >>>> +++ b/fs/btrfs/extent-tree.c
> >>>> @@ -7463,7 +7463,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> >>>>  	} else {
> >>>>  		free_extent_buffer(root->node);
> >>>>  		free_extent_buffer(root->commit_root);
> >>>> -		kfree(root);
> >>>> +		btrfs_put_fs_root(root);
> >>>>  	}
> >>>>  out_end_trans:
> >>>>  	btrfs_end_transaction_throttle(trans, tree_root);
> >>>> -- 
> >>>> 1.8.1.4
> >>>>
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>>> the body of a message to majordomo@vger.kernel.org
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>> the body of a message to majordomo@vger.kernel.org
> >>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  6:15           ` Liu Bo
@ 2013-05-16  7:22             ` Miao Xie
  2013-05-16 11:54               ` Chris Mason
  2013-05-16 12:05               ` Liu Bo
  0 siblings, 2 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-16  7:22 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, alex.btrfs

On thu, 16 May 2013 14:15:52 +0800, Liu Bo wrote:
> On Thu, May 16, 2013 at 01:34:11PM +0800, Miao Xie wrote:
>> On Thu, 16 May 2013 13:15:57 +0800, Liu Bo wrote:
>>> On Thu, May 16, 2013 at 12:31:11PM +0800, Miao Xie wrote:
>>>> On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
>>>>> On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
>>>>>> The grab/put funtions will be used in the next patch, which need grab
>>>>>> the root object and ensure it is not freed. We use reference counter
>>>>>> instead of the srcu lock is to aovid blocking the memory reclaim task,
>>>>>> which invokes synchronize_srcu().
>>>>>>
>>>>>
>>>>> I don't think this is necessary, we put 'kfree(root)' because we really
>>>>> need to free them at the very end time, when there should be no inodes
>>>>> linking on the root(we should have cleaned all inodes out from it).
>>>>>
>>>>> So when we flush delalloc inodes and wait for ordered extents to finish,
>>>>> the root should be valid, otherwise someone is doing wrong things.
>>>>>
>>>>> And even with this grab_fs_root to avoid freeing root, it's just the
>>>>> root that remains in memory, all its attributes, like root->node,
>>>>> commit_root, root->inode_tree, are already NULL or empty.
>>>>
>>>> Please consider the case:
>>>> 	Task1			Task2					Cleaner
>>>> 	get the root
>>>> 				flush all delalloc inodes
>>>> 				drop subvolume
>>>> 				iput the last inode
>>>> 				  move the root into the dead list
>>>> 									drop subvolume
>>>> 									kfree(root)
>>>> If Task1 accesses the root now, oops will happen.
>>>
>>> Then it's task1's fault, why it is not protected by subvol_srcu section when
>>> it's possible that someone like task2 sets root's refs to 0?
>>>
>>> synchronize_srcu(subvol_srcu) before adding root into dead root list is
>>> just for this race case, why do we need another?
>>
>> Please read my changelog.
> 
> 'The memory reclaim task' in the changelog refers to
> 	iput
> 	  -> inode_tree_del
> , right?
> 
> I don't like special cases, this get/put is different from our usual way:
> if (atomic_dec_and_test(refs)) {
> 	kfree(A->a);
> 	kfree(A->b);
> 	kfree(A);
> }
> 
> According to the pictured case, task1 may also access root->something.

"->something" refers to the member variables of the root object, so the problem you
worry about cannot happen unless somebody misuses this mechanism in the future. But
you cannot ascribe such misuse to this patch.

> I must say that the patch itself looks harmless, the reason is not good enough.

I don't agree with you.
It is very bad if the memory reclaim task is blocked for a long time. We should avoid
this problem.

Thanks
Miao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  7:22             ` Miao Xie
@ 2013-05-16 11:54               ` Chris Mason
  2013-05-16 14:31                 ` Liu Bo
  2013-05-16 12:05               ` Liu Bo
  1 sibling, 1 reply; 33+ messages in thread
From: Chris Mason @ 2013-05-16 11:54 UTC (permalink / raw)
  To: miaox, Miao Xie, bo.li.liu; +Cc: linux-btrfs, alex.btrfs

Quoting Miao Xie (2013-05-16 03:22:37)
> > I must say that the patch itself looks harmless, the reason is not good enough.
> 
> I don't agree with you.
> It is perishing that The memory reclaim task is blocked for a long time. We should avoid
> this problem.

synchronize_rcu and friends can take a very very long time.  I like this
patch as a way to avoid them; it just keeps the whole kernel moving.

-chris

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16  7:22             ` Miao Xie
  2013-05-16 11:54               ` Chris Mason
@ 2013-05-16 12:05               ` Liu Bo
  1 sibling, 0 replies; 33+ messages in thread
From: Liu Bo @ 2013-05-16 12:05 UTC (permalink / raw)
  To: Miao Xie; +Cc: linux-btrfs, alex.btrfs

On Thu, May 16, 2013 at 03:22:37PM +0800, Miao Xie wrote:
> On thu, 16 May 2013 14:15:52 +0800, Liu Bo wrote:
> > On Thu, May 16, 2013 at 01:34:11PM +0800, Miao Xie wrote:
> >> On Thu, 16 May 2013 13:15:57 +0800, Liu Bo wrote:
> >>> On Thu, May 16, 2013 at 12:31:11PM +0800, Miao Xie wrote:
> >>>> On thu, 16 May 2013 11:36:46 +0800, Liu Bo wrote:
> >>>>> On Wed, May 15, 2013 at 03:48:20PM +0800, Miao Xie wrote:
> >>>>>> The grab/put funtions will be used in the next patch, which need grab
> >>>>>> the root object and ensure it is not freed. We use reference counter
> >>>>>> instead of the srcu lock is to aovid blocking the memory reclaim task,
> >>>>>> which invokes synchronize_srcu().
> >>>>>>
> >>>>>
> >>>>> I don't think this is necessary, we put 'kfree(root)' because we really
> >>>>> need to free them at the very end time, when there should be no inodes
> >>>>> linking on the root(we should have cleaned all inodes out from it).
> >>>>>
> >>>>> So when we flush delalloc inodes and wait for ordered extents to finish,
> >>>>> the root should be valid, otherwise someone is doing wrong things.
> >>>>>
> >>>>> And even with this grab_fs_root to avoid freeing root, it's just the
> >>>>> root that remains in memory, all its attributes, like root->node,
> >>>>> commit_root, root->inode_tree, are already NULL or empty.
> >>>>
> >>>> Please consider the case:
> >>>> 	Task1			Task2					Cleaner
> >>>> 	get the root
> >>>> 				flush all delalloc inodes
> >>>> 				drop subvolume
> >>>> 				iput the last inode
> >>>> 				  move the root into the dead list
> >>>> 									drop subvolume
> >>>> 									kfree(root)
> >>>> If Task1 accesses the root now, oops will happen.
> >>>
> >>> Then it's task1's fault, why it is not protected by subvol_srcu section when
> >>> it's possible that someone like task2 sets root's refs to 0?
> >>>
> >>> synchronize_srcu(subvol_srcu) before adding root into dead root list is
> >>> just for this race case, why do we need another?
> >>
> >> Please read my changelog.
> > 
> > 'The memory reclaim task' in the changelog refers to
> > 	iput
> > 	  -> inode_tree_del
> > , right?
> > 
> > I don't like special cases, this get/put is different from our usual way:
> > if (atomic_dec_and_test(refs)) {
> > 	kfree(A->a);
> > 	kfree(A->b);
> > 	kfree(A);
> > }
> > 
> > According to the pictured case, task1 may also access root->something.
> 
> "->something" are the member variants of root object, so the problem you worry about
> can not happen unless somebody misuse this mechanism in the future. But you can not
> ascribe the misuse to this patch.
> 
> > I must say that the patch itself looks harmless, the reason is not good enough.
> 
> I don't agree with you.
> It is perishing that The memory reclaim task is blocked for a long time. We should avoid
> this problem.

So the real question is: why do you think it will be blocked for a long time
by subvol_srcu?  Have you already seen btrfs behave like this somehow?

I played with subvol_srcu for a defrag bug some time ago; on the read side
we use subvol_srcu like this:
	index = srcu_read_lock(&fs_info->subvol_srcu);
	get root and check btrfs_root_refs();
	inode = btrfs_iget(inode);
	srcu_read_unlock(&fs_info->subvol_srcu, index);

and yes if we've deleted the subvol/snap, this section will nicely bail out
after checking btrfs_root_refs().

Even if we now have thousands of these srcu read sections, would that take a long
time?
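
Spelled out a little more fully, and only as a sketch (the root lookup and the
error handling are elided, and the key passed to btrfs_iget() is assumed to be
set up elsewhere), that read side is roughly:

	int index;
	struct inode *inode;

	index = srcu_read_lock(&fs_info->subvol_srcu);

	/* look up the root from the objectid (lookup helper elided) */
	if (!root || btrfs_root_refs(&root->root_item) == 0) {
		/* the subvolume is gone or being deleted, bail out */
		srcu_read_unlock(&fs_info->subvol_srcu, index);
		return -ENOENT;
	}

	inode = btrfs_iget(fs_info->sb, &key, root, NULL);

	srcu_read_unlock(&fs_info->subvol_srcu, index);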

thanks,
liubo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16 11:54               ` Chris Mason
@ 2013-05-16 14:31                 ` Liu Bo
  2013-05-16 14:34                   ` Chris Mason
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16 14:31 UTC (permalink / raw)
  To: Chris Mason; +Cc: miaox, linux-btrfs, alex.btrfs

On Thu, May 16, 2013 at 07:54:17AM -0400, Chris Mason wrote:
> Quoting Miao Xie (2013-05-16 03:22:37)
> > > I must say that the patch itself looks harmless, the reason is not good enough.
> > 
> > I don't agree with you.
> > It is perishing that The memory reclaim task is blocked for a long time. We should avoid
> > this problem.
> 
> synchronize_rcu and friends can take a very very long time.  I like this
> patch as a way to avoid them, it's just keeps the whole kernel moving.
> 
> -chris

Okay, that teaches me another lesson, thanks Miao :)

thanks,
liubo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16 14:31                 ` Liu Bo
@ 2013-05-16 14:34                   ` Chris Mason
  2013-05-16 14:57                     ` Liu Bo
  0 siblings, 1 reply; 33+ messages in thread
From: Chris Mason @ 2013-05-16 14:34 UTC (permalink / raw)
  To: bo.li.liu, Liu Bo; +Cc: miaox, linux-btrfs, alex.btrfs

Quoting Liu Bo (2013-05-16 10:31:39)
> On Thu, May 16, 2013 at 07:54:17AM -0400, Chris Mason wrote:
> > Quoting Miao Xie (2013-05-16 03:22:37)
> > > > I must say that the patch itself looks harmless, the reason is not good enough.
> > > 
> > > I don't agree with you.
> > > It is perishing that The memory reclaim task is blocked for a long time. We should avoid
> > > this problem.
> > 
> > synchronize_rcu and friends can take a very very long time.  I like this
> > patch as a way to avoid them, it's just keeps the whole kernel moving.
> > 
> > -chris
> 
> Okay, that teaches me another lesson, thanks Miao :)

Actually, using the rcu api doesn't have a huge impact.  It's just the
synchronize_rcu variants that should be avoided ;)

-chris
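
A generic (non-btrfs, not compilable on its own) illustration of that
difference; free_old_cb is a placeholder callback that would do the kfree()
via container_of():

	/* blocking form: the updater sleeps for a full grace period */
	rcu_assign_pointer(head, NULL);
	synchronize_rcu();			/* can take a long time under load */
	kfree(old);

	/* deferred form: freeing happens from a callback after the grace period */
	rcu_assign_pointer(head, NULL);
	call_rcu(&old->rcu_head, free_old_cb);	/* returns immediately */

The srcu variants (synchronize_srcu() and call_srcu()) follow the same
pattern, which is what the question about call_srcu() further down the thread
is getting at.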


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16 14:34                   ` Chris Mason
@ 2013-05-16 14:57                     ` Liu Bo
  2013-05-17  1:50                       ` Miao Xie
  0 siblings, 1 reply; 33+ messages in thread
From: Liu Bo @ 2013-05-16 14:57 UTC (permalink / raw)
  To: Chris Mason; +Cc: miaox, linux-btrfs, alex.btrfs

On Thu, May 16, 2013 at 10:34:55AM -0400, Chris Mason wrote:
> Quoting Liu Bo (2013-05-16 10:31:39)
> > On Thu, May 16, 2013 at 07:54:17AM -0400, Chris Mason wrote:
> > > Quoting Miao Xie (2013-05-16 03:22:37)
> > > > > I must say that the patch itself looks harmless, the reason is not good enough.
> > > > 
> > > > I don't agree with you.
> > > > It is perishing that The memory reclaim task is blocked for a long time. We should avoid
> > > > this problem.
> > > 
> > > synchronize_rcu and friends can take a very very long time.  I like this
> > > patch as a way to avoid them, it's just keeps the whole kernel moving.
> > > 
> > > -chris
> > 
> > Okay, that teaches me another lesson, thanks Miao :)
> 
> Actually using the rcu api isn't a huge impact.  It's just the
> synchronize_rcu variants that should be avoided ;)
> 
> -chris
> 

Now that it's so slow, I wonder why not use call_srcu() instead?

thanks,
liubo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree
  2013-05-16 14:57                     ` Liu Bo
@ 2013-05-17  1:50                       ` Miao Xie
  0 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-17  1:50 UTC (permalink / raw)
  To: bo.li.liu; +Cc: Chris Mason, linux-btrfs, alex.btrfs

On Thu, 16 May 2013 22:57:07 +0800, Liu Bo wrote:
> On Thu, May 16, 2013 at 10:34:55AM -0400, Chris Mason wrote:
>> Quoting Liu Bo (2013-05-16 10:31:39)
>>> On Thu, May 16, 2013 at 07:54:17AM -0400, Chris Mason wrote:
>>>> Quoting Miao Xie (2013-05-16 03:22:37)
>>>>>> I must say that the patch itself looks harmless, the reason is not good enough.
>>>>>
>>>>> I don't agree with you.
>>>>> It is perishing that The memory reclaim task is blocked for a long time. We should avoid
>>>>> this problem.
>>>>
>>>> synchronize_rcu and friends can take a very very long time.  I like this
>>>> patch as a way to avoid them, it's just keeps the whole kernel moving.
>>>>
>>>> -chris
>>>
>>> Okay, that teaches me another lesson, thanks Miao :)
>>
>> Actually using the rcu api isn't a huge impact.  It's just the
>> synchronize_rcu variants that should be avoided ;)
>>
>> -chris
>>
> 
> Now that it's so slow, I wonder why not use call_srcu() instead?

I have considered this approach, but there is a problem: someone may insert
a new inode after we call call_srcu(). So I made this patch as an interim
solution. My next step is to remove synchronize_srcu() from tree_del_inode().

Thanks
Miao

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH V2 17/17] Btrfs: make the state of the transaction more readable
  2013-05-15  7:48 ` [PATCH 17/17] Btrfs: make the state of the transaction more readable Miao Xie
@ 2013-05-17  3:53   ` Miao Xie
  0 siblings, 0 replies; 33+ messages in thread
From: Miao Xie @ 2013-05-17  3:53 UTC (permalink / raw)
  To: linux-btrfs; +Cc: alex.btrfs

We used three variables to track the state of the transaction, which was
complex and wasted memory. Besides that, it was hard to understand which
types of transaction handles should be blocked in each transaction state,
so developers often made mistakes.

This patch improves the above problem. In this patch, we define 6 states
for the transaction:
  enum btrfs_trans_state {
	TRANS_STATE_RUNNING		= 0,
	TRANS_STATE_BLOCKED		= 1,
	TRANS_STATE_COMMIT_START	= 2,
	TRANS_STATE_COMMIT_DOING	= 3,
	TRANS_STATE_UNBLOCKED		= 4,
	TRANS_STATE_COMPLETED		= 5,
	TRANS_STATE_MAX			= 6,
  }
and just use one variable to track those states.

In order to make the blocked handle types for each state clearer,
we introduce an array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]		= 0U,
	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START),
	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH),
	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN),
	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
  }
This is much more intuitive.
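
With this table, the check in start_transaction() that decides whether a
handle of a given type may join the running transaction is reduced to a
single lookup (this is the same code as in the hunk below):

	if (btrfs_blocked_trans_types[cur_trans->state] & type) {
		spin_unlock(&fs_info->trans_lock);
		return -EBUSY;
	}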

Besides that, since we remove ->in_commit from the transaction structure,
the lock ->commit_lock which was used to protect it becomes unnecessary, so
we remove ->commit_lock as well.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
---
Changelog v1 -> v2:
- remove the misuse of cmpxchg in __btrfs_end_transaction()
---
 fs/btrfs/ctree.h       |   1 -
 fs/btrfs/disk-io.c     |  36 ++++++------
 fs/btrfs/transaction.c | 157 +++++++++++++++++++++++++++----------------------
 fs/btrfs/transaction.h |  16 +++--
 4 files changed, 116 insertions(+), 94 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a7e71ff..bf92302 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1496,7 +1496,6 @@ struct btrfs_fs_info {
 	int closing;
 	int log_root_recovering;
 	int enospc_unlink;
-	int trans_no_join;
 
 	u64 total_pinned;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6bb3f3d..530e3c0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1747,7 +1747,7 @@ static int transaction_kthread(void *arg)
 		}
 
 		now = get_seconds();
-		if (!cur->blocked &&
+		if (cur->state < TRANS_STATE_BLOCKED &&
 		    (now < cur->start_time || now - cur->start_time < 30)) {
 			spin_unlock(&root->fs_info->trans_lock);
 			delay = HZ * 5;
@@ -2183,7 +2183,6 @@ int open_ctree(struct super_block *sb,
 	fs_info->max_inline = 8192 * 1024;
 	fs_info->metadata_ratio = 0;
 	fs_info->defrag_inodes = RB_ROOT;
-	fs_info->trans_no_join = 0;
 	fs_info->free_chunk_space = 0;
 	fs_info->tree_mod_log = RB_ROOT;
 
@@ -3956,19 +3955,14 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
 				cur_trans->dirty_pages.dirty_bytes);
 
-	/* FIXME: cleanup wait for commit */
-	cur_trans->in_commit = 1;
-	cur_trans->blocked = 1;
+	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
 	btrfs_evict_pending_snapshots(cur_trans);
 
-	cur_trans->blocked = 0;
+	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	wake_up(&root->fs_info->transaction_wait);
 
-	cur_trans->commit_done = 1;
-	wake_up(&cur_trans->commit_wait);
-
 	btrfs_destroy_delayed_inodes(root);
 	btrfs_assert_delayed_root_empty(root);
 
@@ -3977,6 +3971,9 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(root,
 				    root->fs_info->pinned_extents);
 
+	cur_trans->state = TRANS_STATE_COMPLETED;
+	wake_up(&cur_trans->commit_wait);
+
 	/*
 	memset(cur_trans, 0, sizeof(*cur_trans));
 	kmem_cache_free(btrfs_transaction_cachep, cur_trans);
@@ -4004,25 +4001,23 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 
 		btrfs_destroy_delayed_refs(t, root);
 
-		/* FIXME: cleanup wait for commit */
-		t->in_commit = 1;
-		t->blocked = 1;
+		/*
+		 *  FIXME: cleanup wait for commit
+		 *  We needn't acquire the lock here, because we are in the
+		 *  umount path and there is no other task which will change it.
+		 */
+		t->state = TRANS_STATE_COMMIT_START;
 		smp_mb();
 		if (waitqueue_active(&root->fs_info->transaction_blocked_wait))
 			wake_up(&root->fs_info->transaction_blocked_wait);
 
 		btrfs_evict_pending_snapshots(t);
 
-		t->blocked = 0;
+		t->state = TRANS_STATE_UNBLOCKED;
 		smp_mb();
 		if (waitqueue_active(&root->fs_info->transaction_wait))
 			wake_up(&root->fs_info->transaction_wait);
 
-		t->commit_done = 1;
-		smp_mb();
-		if (waitqueue_active(&t->commit_wait))
-			wake_up(&t->commit_wait);
-
 		btrfs_destroy_delayed_inodes(root);
 		btrfs_assert_delayed_root_empty(root);
 
@@ -4034,6 +4029,11 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 		btrfs_destroy_pinned_extent(root,
 					    root->fs_info->pinned_extents);
 
+		t->state = TRANS_STATE_COMPLETED;
+		smp_mb();
+		if (waitqueue_active(&t->commit_wait))
+			wake_up(&t->commit_wait);
+
 		atomic_set(&t->use_count, 0);
 		list_del_init(&t->list);
 		memset(t, 0, sizeof(*t));
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 5e75ff4..eec8686 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -34,6 +34,29 @@
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
+static unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
+	[TRANS_STATE_RUNNING]		= 0U,
+	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
+					   __TRANS_START),
+	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH),
+	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN),
+	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK),
+	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
+					   __TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK),
+};
+
 static void put_transaction(struct btrfs_transaction *transaction)
 {
 	WARN_ON(atomic_read(&transaction->use_count) == 0);
@@ -50,13 +73,6 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
 }
 
-static inline int can_join_transaction(struct btrfs_transaction *trans,
-				       unsigned int type)
-{
-	return !(trans->in_commit &&
-		 (type & TRANS_EXTWRITERS));
-}
-
 static inline void extwriter_counter_inc(struct btrfs_transaction *trans,
 					 unsigned int type)
 {
@@ -98,26 +114,13 @@ loop:
 		return -EROFS;
 	}
 
-	if (fs_info->trans_no_join) {
-		/* 
-		 * If we are JOIN_NOLOCK we're already committing a current
-		 * transaction, we just need a handle to deal with something
-		 * when committing the transaction, such as inode cache and
-		 * space cache. It is a special case.
-		 */
-		if (type != TRANS_JOIN_NOLOCK) {
-			spin_unlock(&fs_info->trans_lock);
-			return -EBUSY;
-		}
-	}
-
 	cur_trans = fs_info->running_transaction;
 	if (cur_trans) {
 		if (cur_trans->aborted) {
 			spin_unlock(&fs_info->trans_lock);
 			return cur_trans->aborted;
 		}
-		if (!can_join_transaction(cur_trans, type)) {
+		if (btrfs_blocked_trans_types[cur_trans->state] & type) {
 			spin_unlock(&fs_info->trans_lock);
 			return -EBUSY;
 		}
@@ -136,6 +139,12 @@ loop:
 	if (type == TRANS_ATTACH)
 		return -ENOENT;
 
+	/*
+	 * JOIN_NOLOCK only happens during the transaction commit, so
+	 * it is impossible that ->running_transaction is NULL
+	 */
+	BUG_ON(type == TRANS_JOIN_NOLOCK);
+
 	cur_trans = kmem_cache_alloc(btrfs_transaction_cachep, GFP_NOFS);
 	if (!cur_trans)
 		return -ENOMEM;
@@ -144,7 +153,7 @@ loop:
 	if (fs_info->running_transaction) {
 		/*
 		 * someone started a transaction after we unlocked.  Make sure
-		 * to redo the trans_no_join checks above
+		 * to redo the checks above
 		 */
 		kmem_cache_free(btrfs_transaction_cachep, cur_trans);
 		goto loop;
@@ -158,14 +167,12 @@ loop:
 	extwriter_counter_init(cur_trans, type);
 	init_waitqueue_head(&cur_trans->writer_wait);
 	init_waitqueue_head(&cur_trans->commit_wait);
-	cur_trans->in_commit = 0;
-	cur_trans->blocked = 0;
+	cur_trans->state = TRANS_STATE_RUNNING;
 	/*
 	 * One for this trans handle, one so it will live on until we
 	 * commit the transaction.
 	 */
 	atomic_set(&cur_trans->use_count, 2);
-	cur_trans->commit_done = 0;
 	cur_trans->start_time = get_seconds();
 
 	cur_trans->delayed_refs.root = RB_ROOT;
@@ -188,7 +195,6 @@ loop:
 			"creating a fresh transaction\n");
 	atomic64_set(&fs_info->tree_mod_seq, 0);
 
-	spin_lock_init(&cur_trans->commit_lock);
 	spin_lock_init(&cur_trans->delayed_refs.lock);
 	atomic_set(&cur_trans->delayed_refs.procs_running_refs, 0);
 	atomic_set(&cur_trans->delayed_refs.ref_seq, 0);
@@ -293,6 +299,12 @@ int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
+static inline int is_transaction_blocked(struct btrfs_transaction *trans)
+{
+	return (trans->state >= TRANS_STATE_BLOCKED &&
+		trans->state < TRANS_STATE_UNBLOCKED);
+}
+
 /* wait for commit against the current transaction to become unblocked
  * when this is done, it is safe to start a new transaction, but the current
  * transaction might not be fully on disk.
@@ -303,12 +315,12 @@ static void wait_current_trans(struct btrfs_root *root)
 
 	spin_lock(&root->fs_info->trans_lock);
 	cur_trans = root->fs_info->running_transaction;
-	if (cur_trans && cur_trans->blocked) {
+	if (cur_trans && is_transaction_blocked(cur_trans)) {
 		atomic_inc(&cur_trans->use_count);
 		spin_unlock(&root->fs_info->trans_lock);
 
 		wait_event(root->fs_info->transaction_wait,
-			   !cur_trans->blocked);
+			   cur_trans->state >= TRANS_STATE_UNBLOCKED);
 		put_transaction(cur_trans);
 	} else {
 		spin_unlock(&root->fs_info->trans_lock);
@@ -432,7 +444,8 @@ again:
 	INIT_LIST_HEAD(&h->new_bgs);
 
 	smp_mb();
-	if (cur_trans->blocked && may_wait_transaction(root, type)) {
+	if (cur_trans->state >= TRANS_STATE_BLOCKED &&
+	    may_wait_transaction(root, type)) {
 		btrfs_commit_transaction(h, root);
 		goto again;
 	}
@@ -536,7 +549,7 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root)
 static noinline void wait_for_commit(struct btrfs_root *root,
 				    struct btrfs_transaction *commit)
 {
-	wait_event(commit->commit_wait, commit->commit_done);
+	wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED);
 }
 
 int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
@@ -572,8 +585,8 @@ int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
 		spin_lock(&root->fs_info->trans_lock);
 		list_for_each_entry_reverse(t, &root->fs_info->trans_list,
 					    list) {
-			if (t->in_commit) {
-				if (t->commit_done)
+			if (t->state >= TRANS_STATE_COMMIT_START) {
+				if (t->state == TRANS_STATE_COMPLETED)
 					break;
 				cur_trans = t;
 				atomic_inc(&cur_trans->use_count);
@@ -614,7 +627,8 @@ int btrfs_should_end_transaction(struct btrfs_trans_handle *trans,
 	int err;
 
 	smp_mb();
-	if (cur_trans->blocked || cur_trans->delayed_refs.flushing)
+	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
+	    cur_trans->delayed_refs.flushing)
 		return 1;
 
 	updates = trans->delayed_ref_updates;
@@ -682,12 +696,15 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 		btrfs_create_pending_block_groups(trans, root);
 
 	if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
-	    should_end_transaction(trans, root)) {
-		trans->transaction->blocked = 1;
-		smp_wmb();
+	    should_end_transaction(trans, root) &&
+	    ACCESS_ONCE(cur_trans->state) == TRANS_STATE_RUNNING) {
+		spin_lock(&info->trans_lock);
+		if (cur_trans->state == TRANS_STATE_RUNNING)
+			cur_trans->state = TRANS_STATE_BLOCKED;
+		spin_unlock(&info->trans_lock);
 	}
 
-	if (lock && cur_trans->blocked && !cur_trans->in_commit) {
+	if (lock && ACCESS_ONCE(cur_trans->state) == TRANS_STATE_BLOCKED) {
 		if (throttle) {
 			/*
 			 * We may race with somebody else here so end up having
@@ -1343,20 +1360,26 @@ static void update_super_roots(struct btrfs_root *root)
 
 int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
 {
+	struct btrfs_transaction *trans;
 	int ret = 0;
+
 	spin_lock(&info->trans_lock);
-	if (info->running_transaction)
-		ret = info->running_transaction->in_commit;
+	trans = info->running_transaction;
+	if (trans)
+		ret = (trans->state >= TRANS_STATE_COMMIT_START);
 	spin_unlock(&info->trans_lock);
 	return ret;
 }
 
 int btrfs_transaction_blocked(struct btrfs_fs_info *info)
 {
+	struct btrfs_transaction *trans;
 	int ret = 0;
+
 	spin_lock(&info->trans_lock);
-	if (info->running_transaction)
-		ret = info->running_transaction->blocked;
+	trans = info->running_transaction;
+	if (trans)
+		ret = is_transaction_blocked(trans);
 	spin_unlock(&info->trans_lock);
 	return ret;
 }
@@ -1368,7 +1391,8 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info)
 static void wait_current_trans_commit_start(struct btrfs_root *root,
 					    struct btrfs_transaction *trans)
 {
-	wait_event(root->fs_info->transaction_blocked_wait, trans->in_commit);
+	wait_event(root->fs_info->transaction_blocked_wait,
+		   trans->state >= TRANS_STATE_COMMIT_START);
 }
 
 /*
@@ -1379,7 +1403,7 @@ static void wait_current_trans_commit_start_and_unblock(struct btrfs_root *root,
 					 struct btrfs_transaction *trans)
 {
 	wait_event(root->fs_info->transaction_wait,
-		   trans->commit_done || (trans->in_commit && !trans->blocked));
+		   trans->state >= TRANS_STATE_UNBLOCKED);
 }
 
 /*
@@ -1484,18 +1508,22 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 
 	list_del_init(&cur_trans->list);
 	if (cur_trans == root->fs_info->running_transaction) {
-		root->fs_info->trans_no_join = 1;
+		cur_trans->state = TRANS_STATE_COMMIT_DOING;
 		spin_unlock(&root->fs_info->trans_lock);
 		wait_event(cur_trans->writer_wait,
 			   atomic_read(&cur_trans->num_writers) == 1);
 
 		spin_lock(&root->fs_info->trans_lock);
-		root->fs_info->running_transaction = NULL;
 	}
 	spin_unlock(&root->fs_info->trans_lock);
 
 	btrfs_cleanup_one_transaction(trans->transaction, root);
 
+	spin_lock(&root->fs_info->trans_lock);
+	if (cur_trans == root->fs_info->running_transaction)
+		root->fs_info->running_transaction = NULL;
+	spin_unlock(&root->fs_info->trans_lock);
+
 	put_transaction(cur_trans);
 	put_transaction(cur_trans);
 
@@ -1507,10 +1535,6 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans,
 		current->journal_info = NULL;
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
-
-	spin_lock(&root->fs_info->trans_lock);
-	root->fs_info->trans_no_join = 0;
-	spin_unlock(&root->fs_info->trans_lock);
 }
 
 static int btrfs_flush_all_pending_stuffs(struct btrfs_trans_handle *trans,
@@ -1554,13 +1578,6 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 		btrfs_wait_all_ordered_extents(fs_info, 1);
 }
 
-/*
- * btrfs_transaction state sequence:
- *    in_commit = 0, blocked = 0  (initial)
- *    in_commit = 1, blocked = 1
- *    blocked = 0
- *    commit_done = 1
- */
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root)
 {
@@ -1615,9 +1632,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	spin_lock(&cur_trans->commit_lock);
-	if (cur_trans->in_commit) {
-		spin_unlock(&cur_trans->commit_lock);
+	spin_lock(&root->fs_info->trans_lock);
+	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+		spin_unlock(&root->fs_info->trans_lock);
 		atomic_inc(&cur_trans->use_count);
 		ret = btrfs_end_transaction(trans, root);
 
@@ -1628,16 +1645,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 		return ret;
 	}
 
-	trans->transaction->in_commit = 1;
-	trans->transaction->blocked = 1;
-	spin_unlock(&cur_trans->commit_lock);
+	cur_trans->state = TRANS_STATE_COMMIT_START;
 	wake_up(&root->fs_info->transaction_blocked_wait);
 
-	spin_lock(&root->fs_info->trans_lock);
 	if (cur_trans->list.prev != &root->fs_info->trans_list) {
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
-		if (!prev_trans->commit_done) {
+		if (prev_trans->state != TRANS_STATE_COMPLETED) {
 			atomic_inc(&prev_trans->use_count);
 			spin_unlock(&root->fs_info->trans_lock);
 
@@ -1673,10 +1687,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	/*
 	 * Ok now we need to make sure to block out any other joins while we
 	 * commit the transaction.  We could have started a join before setting
-	 * no_join so make sure to wait for num_writers to == 1 again.
+	 * COMMIT_DOING so make sure to wait for num_writers to == 1 again.
 	 */
 	spin_lock(&root->fs_info->trans_lock);
-	root->fs_info->trans_no_join = 1;
+	cur_trans->state = TRANS_STATE_COMMIT_DOING;
 	spin_unlock(&root->fs_info->trans_lock);
 	wait_event(cur_trans->writer_wait,
 		   atomic_read(&cur_trans->num_writers) == 1);
@@ -1803,10 +1817,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 	memcpy(root->fs_info->super_for_commit, root->fs_info->super_copy,
 	       sizeof(*root->fs_info->super_copy));
 
-	trans->transaction->blocked = 0;
 	spin_lock(&root->fs_info->trans_lock);
+	cur_trans->state = TRANS_STATE_UNBLOCKED;
 	root->fs_info->running_transaction = NULL;
-	root->fs_info->trans_no_join = 0;
 	spin_unlock(&root->fs_info->trans_lock);
 	mutex_unlock(&root->fs_info->reloc_mutex);
 
@@ -1834,10 +1847,12 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
 
 	btrfs_finish_extent_commit(trans, root);
 
-	cur_trans->commit_done = 1;
-
 	root->fs_info->last_trans_committed = cur_trans->transid;
-
+	/*
+	 * We needn't acquire the lock here because there is no other task
+	 * which can change it.
+	 */
+	cur_trans->state = TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 
 	spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0fc45e2..66d2a6c 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -22,6 +22,16 @@
 #include "delayed-ref.h"
 #include "ctree.h"
 
+enum btrfs_trans_state {
+	TRANS_STATE_RUNNING		= 0,
+	TRANS_STATE_BLOCKED		= 1,
+	TRANS_STATE_COMMIT_START	= 2,
+	TRANS_STATE_COMMIT_DOING	= 3,
+	TRANS_STATE_UNBLOCKED		= 4,
+	TRANS_STATE_COMPLETED		= 5,
+	TRANS_STATE_MAX			= 6,
+};
+
 struct btrfs_transaction {
 	u64 transid;
 	/*
@@ -37,10 +47,8 @@ struct btrfs_transaction {
 	atomic_t num_writers;
 	atomic_t use_count;
 
-	spinlock_t commit_lock;
-	int in_commit;
-	int commit_done;
-	int blocked;
+	/* Protected by fs_info->trans_lock when we want to change it. */
+	enum btrfs_trans_state state;
 	struct list_head list;
 	struct extent_io_tree dirty_pages;
 	unsigned long start_time;
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2013-05-17  4:12 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-15  7:48 [PATCH 00/17] improve the block time during the transaction commit Miao Xie
2013-05-15  7:48 ` [PATCH 01/17] Btrfs: fix accessing a freed tree root Miao Xie
2013-05-15  7:48 ` [PATCH 02/17] Btrfs: fix unprotected root node of the subvolume's inode rb-tree Miao Xie
2013-05-15  7:48 ` [PATCH 03/17] Btrfs: pause the space balance when remounting to R/O Miao Xie
2013-05-15  7:48 ` [PATCH 04/17] Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() Miao Xie
2013-05-15  7:48 ` [PATCH 05/17] Btrfs: cleanup the similar code of the fs root read Miao Xie
2013-05-15  7:48 ` [PATCH 06/17] Btrfs: introduce grab/put functions for the root of the fs/file tree Miao Xie
2013-05-16  3:36   ` Liu Bo
2013-05-16  4:31     ` Miao Xie
2013-05-16  5:15       ` Liu Bo
2013-05-16  5:34         ` Miao Xie
2013-05-16  6:15           ` Liu Bo
2013-05-16  7:22             ` Miao Xie
2013-05-16 11:54               ` Chris Mason
2013-05-16 14:31                 ` Liu Bo
2013-05-16 14:34                   ` Chris Mason
2013-05-16 14:57                     ` Liu Bo
2013-05-17  1:50                       ` Miao Xie
2013-05-16 12:05               ` Liu Bo
2013-05-15  7:48 ` [PATCH 07/17] Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context Miao Xie
2013-05-15  7:48 ` [PATCH 08/17] Btrfs: introduce per-subvolume delalloc inode list Miao Xie
2013-05-15  7:48 ` [PATCH 09/17] Btrfs: introduce per-subvolume ordered extent list Miao Xie
2013-05-15  7:48 ` [PATCH 10/17] Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Miao Xie
2013-05-16  3:20   ` Liu Bo
2013-05-16  4:00     ` Miao Xie
2013-05-15  7:48 ` [PATCH 11/17] Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction Miao Xie
2013-05-15  7:48 ` [PATCH 12/17] Btrfs: remove the code for the impossible case in cleanup_transaction() Miao Xie
2013-05-15  7:48 ` [PATCH 13/17] Btrfs: don't wait for all the writers circularly during the transaction commit Miao Xie
2013-05-15  7:48 ` [PATCH 14/17] Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set Miao Xie
2013-05-15  7:48 ` [PATCH 15/17] Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure Miao Xie
2013-05-15  7:48 ` [PATCH 16/17] Btrfs: remove the time check in btrfs_commit_transaction() Miao Xie
2013-05-15  7:48 ` [PATCH 17/17] Btrfs: make the state of the transaction more readable Miao Xie
2013-05-17  3:53   ` [PATCH V2 " Miao Xie

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.