* [PATCH 0/2] address lock contention of btree root
@ 2018-08-16 21:07 Liu Bo
  2018-08-16 21:07 ` [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking Liu Bo
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-16 21:07 UTC (permalink / raw)
  To: linux-btrfs

The lock contention on btree nodes (esp. root node) is apparently a
bottleneck when there're multiple readers and writers concurrently
trying to access them.  Unfortunately this is by design and not easy
to fix without some complex changes; however, there is still some room
for improvement.

With a steady fsmark workload in which 16 threads create 1,600K files,
we could see that a good amount of overhead comes from switching the
path between spinning mode and blocking mode in btrfs_search_slot().

Patch 1 provides more details about the overhead and test results from
fsmark and dbench.
Patch 2 kills leave_spinning due to the behaviour change from patch 1.

Liu Bo (2):
  Btrfs: kill btrfs_clear_path_blocking
  Btrfs: kill leave_spinning

 fs/btrfs/backref.c            |  3 --
 fs/btrfs/ctree.c              | 73 +++++--------------------------------------
 fs/btrfs/ctree.h              |  3 --
 fs/btrfs/delayed-inode.c      |  7 -----
 fs/btrfs/dir-item.c           |  1 -
 fs/btrfs/export.c             |  1 -
 fs/btrfs/extent-tree.c        |  7 -----
 fs/btrfs/extent_io.c          |  1 -
 fs/btrfs/file-item.c          |  4 ---
 fs/btrfs/free-space-tree.c    |  2 --
 fs/btrfs/inode-item.c         |  6 ----
 fs/btrfs/inode.c              |  8 -----
 fs/btrfs/ioctl.c              |  3 --
 fs/btrfs/qgroup.c             |  2 --
 fs/btrfs/super.c              |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 ---
 16 files changed, 7 insertions(+), 120 deletions(-)

-- 
1.8.3.1


* [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking
  2018-08-16 21:07 [PATCH 0/2] address lock contention of btree root Liu Bo
@ 2018-08-16 21:07 ` Liu Bo
  2018-08-17  7:24   ` Nikolay Borisov
  2018-08-16 21:07 ` [PATCH 2/2] Btrfs: kill leave_spinning Liu Bo
  2018-08-21 17:54 ` [PATCH 0/2] address lock contention of btree root Chris Mason
  2 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2018-08-16 21:07 UTC (permalink / raw)
  To: linux-btrfs

Btrfs's btree locking has two modes, spinning mode and blocking mode.
While searching the btree, locks are always acquired in spinning mode
and then converted to blocking mode if necessary, and in some hot paths
we may switch the locks back to spinning mode via
btrfs_clear_path_blocking().

When acquiring locks, both readers and writers need to wait for any
blocking readers and writers to complete before doing
read_lock()/write_lock().

The problem is that btrfs_clear_path_blocking() needs to first switch
the nodes in the path to blocking mode (via btrfs_set_path_blocking())
to make lockdep happy before doing its actual job of clearing the
blocking state.

Switching from spinning mode to blocking mode consists of

step 1) bumping up the blocking readers/writers counter and
step 2) read_unlock()/write_unlock().

This causes a serious ping-pong effect when there is a great number of
concurrent readers/writers, as waiters are woken up only to go back to
sleep immediately.
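
For reference, the spinning-to-blocking switch boils down to something
like the following (a simplified sketch of btrfs_set_lock_blocking_rw()
in fs/btrfs/locking.c; lockdep annotations, assertions and the nested
lock case are omitted):

	if (rw == BTRFS_WRITE_LOCK) {
		if (atomic_read(&eb->blocking_writers) == 0) {
			/* step 1) bump up the blocking writers counter */
			atomic_inc(&eb->blocking_writers);
			/* step 2) drop the spinning lock */
			write_unlock(&eb->lock);
		}
	} else if (rw == BTRFS_READ_LOCK) {
		atomic_inc(&eb->blocking_readers);
		read_unlock(&eb->lock);
	}

Each such round trip wakes up tasks sleeping in
btrfs_tree_lock()/btrfs_tree_read_lock(), and with enough concurrency
most of them find a non-zero blocking counter and go straight back to
sleep, hence the ping-pong.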

1) Killing this kind of ping-pong results in a big improvement in my 1600k
files creation script,

MNT=/mnt/btrfs
mkfs.btrfs -f /dev/sdf
mount /dev/sdf $MNT
time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
        -d  $MNT/0  -d  $MNT/1 \
        -d  $MNT/2  -d  $MNT/3 \
        -d  $MNT/4  -d  $MNT/5 \
        -d  $MNT/6  -d  $MNT/7 \
        -d  $MNT/8  -d  $MNT/9 \
        -d  $MNT/10  -d  $MNT/11 \
        -d  $MNT/12  -d  $MNT/13 \
        -d  $MNT/14  -d  $MNT/15

w/o patch:
real    2m27.307s
user    0m12.839s
sys     13m42.831s

w/ patch:
real    1m2.273s
user    0m15.802s
sys     8m16.495s

2) dbench also proves the improvement,
dbench -t 120 -D /mnt/btrfs 16

w/o patch:
Throughput 158.363 MB/sec

w/ patch:
Throughput 449.52 MB/sec

3) xfstests didn't show any additional failures.

One thing to note is that callers may set leave_spinning to have all
nodes in the path stay in spinning mode, which means the callers are
prepared not to sleep before releasing the path; however, it causes
them no problem to end up with blocking locks instead, IOW, we can
just get rid of leave_spinning.
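
As an illustration of the contract being dropped, a typical
leave_spinning user follows this pattern (a trimmed sketch
reconstructed from the call sites removed in patch 2, e.g. the
dir-item.c hunk; error handling omitted):

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	/* hint: hand the path back with spinning locks */
	path->leave_spinning = 1;

	ret = btrfs_insert_empty_item(trans, root, path, &key, data_size);
	/* ... fill in the item, typically without sleeping ... */
	btrfs_release_path(path);

Such callers do not need to sleep between the search and the release,
but handing them blocking locks instead is harmless, which is why the
hint can go away.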

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
 fs/btrfs/ctree.h         |  2 --
 fs/btrfs/delayed-inode.c |  3 ---
 3 files changed, 4 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d436fb4c002e..8b31caa60b0a 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
 	}
 }
 
-/*
- * reset all the locked nodes in the patch to spinning locks.
- *
- * held is used to keep lockdep happy, when lockdep is enabled
- * we set held to a blocking lock before we go around and
- * retake all the spinlocks in the path.  You can safely use NULL
- * for held
- */
-noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
-					struct extent_buffer *held, int held_rw)
-{
-	int i;
-
-	if (held) {
-		btrfs_set_lock_blocking_rw(held, held_rw);
-		if (held_rw == BTRFS_WRITE_LOCK)
-			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
-		else if (held_rw == BTRFS_READ_LOCK)
-			held_rw = BTRFS_READ_LOCK_BLOCKING;
-	}
-	btrfs_set_path_blocking(p);
-
-	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
-		if (p->nodes[i] && p->locks[i]) {
-			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
-			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
-				p->locks[i] = BTRFS_WRITE_LOCK;
-			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
-				p->locks[i] = BTRFS_READ_LOCK;
-		}
-	}
-
-	if (held)
-		btrfs_clear_lock_blocking_rw(held, held_rw);
-}
-
 /* this also releases the path */
 void btrfs_free_path(struct btrfs_path *p)
 {
@@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
 		}
 	}
 
-	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
 	btrfs_tree_read_unlock_blocking(eb);
 	free_extent_buffer(eb);
 
@@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = split_node(trans, root, p, level);
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		BUG_ON(sret > 0);
 		if (sret) {
@@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
 		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = balance_level(trans, root, p, level);
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		if (sret) {
 			ret = sret;
@@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		}
 cow_done:
 		p->nodes[level] = b;
-		btrfs_clear_path_blocking(p, NULL, 0);
+		/*
+		 * Leave the path with blocking locks to avoid massive
+		 * lock context switches; this is done on purpose.
+		 */
 
 		/*
 		 * we have a lock on b and as long as we aren't changing
@@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					if (!err) {
 						btrfs_set_path_blocking(p);
 						btrfs_tree_lock(b);
-						btrfs_clear_path_blocking(p, b,
-								  BTRFS_WRITE_LOCK);
 					}
 					p->locks[level] = BTRFS_WRITE_LOCK;
 				} else {
@@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					if (!err) {
 						btrfs_set_path_blocking(p);
 						btrfs_tree_read_lock(b);
-						btrfs_clear_path_blocking(p, b,
-								  BTRFS_READ_LOCK);
 					}
 					p->locks[level] = BTRFS_READ_LOCK;
 				}
@@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 				btrfs_set_path_blocking(p);
 				err = split_leaf(trans, root, key,
 						 p, ins_len, ret == 0);
-				btrfs_clear_path_blocking(p, NULL, 0);
 
 				BUG_ON(err > 0);
 				if (err) {
@@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 	while (b) {
 		level = btrfs_header_level(b);
 		p->nodes[level] = b;
-		btrfs_clear_path_blocking(p, NULL, 0);
 
 		/*
 		 * we have a lock on b and as long as we aren't changing
@@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 			if (!err) {
 				btrfs_set_path_blocking(p);
 				btrfs_tree_read_lock(b);
-				btrfs_clear_path_blocking(p, b,
-							  BTRFS_READ_LOCK);
 			}
 			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
 			if (!b) {
@@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 		path->locks[level - 1] = BTRFS_READ_LOCK;
 		path->nodes[level - 1] = cur;
 		unlock_up(path, level, 1, 0, NULL);
-		btrfs_clear_path_blocking(path, NULL, 0);
 	}
 out:
 	path->keep_locks = keep_locks;
@@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				btrfs_tree_read_lock(next);
-				btrfs_clear_path_blocking(path, next,
-							  BTRFS_READ_LOCK);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
@@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				btrfs_tree_read_lock(next);
-				btrfs_clear_path_blocking(path, next,
-							  BTRFS_READ_LOCK);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 318be7864072..1aeed3c0e949 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 struct btrfs_path *btrfs_alloc_path(void);
 void btrfs_free_path(struct btrfs_path *p);
 void btrfs_set_path_blocking(struct btrfs_path *p);
-void btrfs_clear_path_blocking(struct btrfs_path *p,
-			       struct extent_buffer *held, int held_rw);
 void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
 
 int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index f51b509f2d9b..db9f45082c61 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
 		i++;
 	}
 
-	/* reset all the locked nodes in the patch to spinning locks. */
-	btrfs_clear_path_blocking(path, NULL, 0);
-
 	/* insert the keys of the items */
 	setup_items_for_insert(root, path, keys, data_size,
 			       total_data_size, total_size, nitems);
-- 
1.8.3.1


* [PATCH 2/2] Btrfs: kill leave_spinning
  2018-08-16 21:07 [PATCH 0/2] address lock contention of btree root Liu Bo
  2018-08-16 21:07 ` [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking Liu Bo
@ 2018-08-16 21:07 ` Liu Bo
  2018-08-21 22:19   ` Liu Bo
  2018-08-21 17:54 ` [PATCH 0/2] address lock contention of btree root Chris Mason
  2 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2018-08-16 21:07 UTC (permalink / raw)
  To: linux-btrfs

As btrfs_clear_path_blocking() turned out to be a major source of lock
contention, we've killed it.  Without it, btrfs_search_slot() and
btrfs_search_old_slot() are no longer able to return a path in
spinning mode, so let's kill leave_spinning, too.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/btrfs/backref.c            |  3 ---
 fs/btrfs/ctree.c              | 16 +++-------------
 fs/btrfs/ctree.h              |  1 -
 fs/btrfs/delayed-inode.c      |  4 ----
 fs/btrfs/dir-item.c           |  1 -
 fs/btrfs/export.c             |  1 -
 fs/btrfs/extent-tree.c        |  7 -------
 fs/btrfs/extent_io.c          |  1 -
 fs/btrfs/file-item.c          |  4 ----
 fs/btrfs/free-space-tree.c    |  2 --
 fs/btrfs/inode-item.c         |  6 ------
 fs/btrfs/inode.c              |  8 --------
 fs/btrfs/ioctl.c              |  3 ---
 fs/btrfs/qgroup.c             |  2 --
 fs/btrfs/super.c              |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 ----
 16 files changed, 3 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ae750b1574a2..70c629b90710 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1613,13 +1613,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 	s64 bytes_left = ((s64)size) - 1;
 	struct extent_buffer *eb = eb_in;
 	struct btrfs_key found_key;
-	int leave_spinning = path->leave_spinning;
 	struct btrfs_inode_ref *iref;
 
 	if (bytes_left >= 0)
 		dest[bytes_left] = '\0';
 
-	path->leave_spinning = 1;
 	while (1) {
 		bytes_left -= name_len;
 		if (bytes_left >= 0)
@@ -1665,7 +1663,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 	}
 
 	btrfs_release_path(path);
-	path->leave_spinning = leave_spinning;
 
 	if (ret)
 		return ERR_PTR(ret);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 8b31caa60b0a..d2df7cfbec06 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2875,14 +2875,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	}
 	ret = 1;
 done:
-	/*
-	 * we don't really know what they plan on doing with the path
-	 * from here on, so for now just mark it as blocking
-	 */
-	if (!p->leave_spinning)
-		btrfs_set_path_blocking(p);
 	if (ret < 0 && !p->skip_release_on_error)
 		btrfs_release_path(p);
+
+	/* path is supposed to be in blocking mode from now on. */
 	return ret;
 }
 
@@ -2987,11 +2983,10 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 	}
 	ret = 1;
 done:
-	if (!p->leave_spinning)
-		btrfs_set_path_blocking(p);
 	if (ret < 0)
 		btrfs_release_path(p);
 
+	/* path is supposed to be in blocking mode from now on. */
 	return ret;
 }
 
@@ -5628,7 +5623,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	struct btrfs_key key;
 	u32 nritems;
 	int ret;
-	int old_spinning = path->leave_spinning;
 	int next_rw_lock = 0;
 
 	nritems = btrfs_header_nritems(path->nodes[0]);
@@ -5643,7 +5637,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	btrfs_release_path(path);
 
 	path->keep_locks = 1;
-	path->leave_spinning = 1;
 
 	if (time_seq)
 		ret = btrfs_search_old_slot(root, &key, path, time_seq);
@@ -5780,9 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	ret = 0;
 done:
 	unlock_up(path, 0, 1, 0, NULL);
-	path->leave_spinning = old_spinning;
-	if (!old_spinning)
-		btrfs_set_path_blocking(path);
 
 	return ret;
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1aeed3c0e949..e8bddf21b7f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -339,7 +339,6 @@ struct btrfs_path {
 	unsigned int search_for_split:1;
 	unsigned int keep_locks:1;
 	unsigned int skip_locking:1;
-	unsigned int leave_spinning:1;
 	unsigned int search_commit_root:1;
 	unsigned int need_commit_sem:1;
 	unsigned int skip_release_on_error:1;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index db9f45082c61..e6fbcdbc313e 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1138,7 +1138,6 @@ static int __btrfs_run_delayed_items(struct btrfs_trans_handle *trans, int nr)
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1203,7 +1202,6 @@ int btrfs_commit_inode_delayed_items(struct btrfs_trans_handle *trans,
 		btrfs_release_delayed_node(delayed_node);
 		return -ENOMEM;
 	}
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &delayed_node->root->fs_info->delayed_block_rsv;
@@ -1248,7 +1246,6 @@ int btrfs_commit_inode_delayed_inode(struct btrfs_inode *inode)
 		ret = -ENOMEM;
 		goto trans_out;
 	}
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1317,7 +1314,6 @@ static void btrfs_async_run_delayed_root(struct btrfs_work *work)
 		if (!delayed_node)
 			break;
 
-		path->leave_spinning = 1;
 		root = delayed_node->root;
 
 		trans = btrfs_join_transaction(root);
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index a678b07fcf01..4f27d6ba2f1e 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -127,7 +127,6 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	btrfs_cpu_key_to_disk(&disk_key, location);
 
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 1f3755b3a37a..7f410edd92e4 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -245,7 +245,6 @@ static int btrfs_get_name(struct dentry *parent, char *name,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	if (ino == BTRFS_FIRST_FREE_OBJECTID) {
 		key.objectid = BTRFS_I(inode)->root->root_key.objectid;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index de6f75f5547b..8d9ca923db93 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2116,7 +2116,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	path->reada = READA_FORWARD;
-	path->leave_spinning = 1;
 	/* this will setup the path even if it fails to insert the back ref */
 	ret = insert_inline_extent_backref(trans, path, bytenr, num_bytes,
 					   parent, root_objectid, owner,
@@ -2141,7 +2140,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
-	path->leave_spinning = 1;
 	/* now insert the actual backref */
 	ret = insert_extent_backref(trans, path, bytenr, parent, root_objectid,
 				    owner, offset, refs_to_add);
@@ -2251,7 +2249,6 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 
 again:
 	path->reada = READA_FORWARD;
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(trans, fs_info->extent_root, &key, path, 0, 1);
 	if (ret < 0) {
 		err = ret;
@@ -6700,7 +6697,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 
 	path->reada = READA_FORWARD;
-	path->leave_spinning = 1;
 
 	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
 	BUG_ON(!is_data && refs_to_drop != 1);
@@ -6743,7 +6739,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				goto out;
 			}
 			btrfs_release_path(path);
-			path->leave_spinning = 1;
 
 			key.objectid = bytenr;
 			key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -7896,7 +7891,6 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      ins, size);
 	if (ret) {
@@ -7985,7 +7979,6 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      &extent_key, size);
 	if (ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 628f1aef34b0..481a4bbc49fd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4444,7 +4444,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	start = round_down(start, btrfs_inode_sectorsize(inode));
 	len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index ba74827beb32..8e25953ed901 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -45,7 +45,6 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 	file_key.offset = pos;
 	file_key.type = BTRFS_EXTENT_DATA_KEY;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &file_key,
 				      sizeof(*item));
 	if (ret < 0)
@@ -598,7 +597,6 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		key.offset = end_byte - 1;
 		key.type = BTRFS_EXTENT_CSUM_KEY;
 
-		path->leave_spinning = 1;
 		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 		if (ret > 0) {
 			if (path->slots[0] == 0)
@@ -874,10 +872,8 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 	} else {
 		ins_size = csum_size;
 	}
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &file_key,
 				      ins_size);
-	path->leave_spinning = 0;
 	if (ret < 0)
 		goto fail_unlock;
 	if (WARN_ON(ret != 0))
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index d6736595ec57..bccc33bb453c 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1188,8 +1188,6 @@ static int clear_free_space_tree(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	key.objectid = 0;
 	key.type = 0;
 	key.offset = 0;
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index a8956a3c9e05..97a7b4de1033 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -129,8 +129,6 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret > 0)
 		ret = -ENOENT;
@@ -203,8 +201,6 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret > 0) {
 		ret = -ENOENT;
@@ -278,7 +274,6 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
@@ -335,7 +330,6 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9357a19d2bff..8b135a46835f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -191,7 +191,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		key.type = BTRFS_EXTENT_DATA_KEY;
 
 		datasize = btrfs_file_extent_calc_inline_size(cur_size);
-		path->leave_spinning = 1;
 		ret = btrfs_insert_empty_item(trans, root, path, &key,
 					      datasize);
 		if (ret)
@@ -2236,7 +2235,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		ins.offset = file_pos;
 		ins.type = BTRFS_EXTENT_DATA_KEY;
 
-		path->leave_spinning = 1;
 		ret = btrfs_insert_empty_item(trans, root, path, &ins,
 					      sizeof(*fi));
 		if (ret)
@@ -2668,7 +2666,6 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 	key.type = BTRFS_EXTENT_DATA_KEY;
 	key.offset = start;
 
-	path->leave_spinning = 1;
 	if (merge) {
 		struct btrfs_file_extent_item *fi;
 		u64 extent_len;
@@ -2739,7 +2736,6 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
 	ret = 1;
 out_free_path:
 	btrfs_release_path(path);
-	path->leave_spinning = 0;
 	btrfs_end_transaction(trans);
 out_unlock:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, lock_end,
@@ -3833,7 +3829,6 @@ static noinline int btrfs_update_inode_item(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_lookup_inode(trans, root, path, &BTRFS_I(inode)->location,
 				 1);
 	if (ret) {
@@ -3924,7 +3919,6 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	path->leave_spinning = 1;
 	di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
 				    name, name_len, -1);
 	if (IS_ERR(di)) {
@@ -4584,7 +4578,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret < 0)
 		goto out;
@@ -6300,7 +6293,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_items(trans, root, path, key, sizes, nitems);
 	if (ret != 0)
 		goto fail_unlock;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d3a5d2a41e5f..1de051f529f9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3899,7 +3899,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		 * note the key will change type as we walk through the
 		 * tree.
 		 */
-		path->leave_spinning = 1;
 		ret = btrfs_search_slot(NULL, BTRFS_I(src)->root, &key, path,
 				0, 0);
 		if (ret < 0)
@@ -3981,7 +3980,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 					   size);
 
 			btrfs_release_path(path);
-			path->leave_spinning = 0;
 
 			memcpy(&new_key, &key, sizeof(new_key));
 			new_key.objectid = btrfs_ino(BTRFS_I(inode));
@@ -4371,7 +4369,6 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
 		ret = -ENOMEM;
 		goto out;
 	}
-	path->leave_spinning = 1;
 
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4353bb69bb86..58e859593bbb 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -844,8 +844,6 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	key.objectid = 0;
 	key.offset = 0;
 	key.type = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6601c9aa5e35..4aa0b4f61ffd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1015,7 +1015,6 @@ static char *get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
 		ret = -ENOMEM;
 		goto err;
 	}
-	path->leave_spinning = 1;
 
 	name = kmalloc(PATH_MAX, GFP_KERNEL);
 	if (!name) {
@@ -1143,7 +1142,6 @@ static int get_default_subvol_objectid(struct btrfs_fs_info *fs_info, u64 *objec
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	/*
 	 * Find the "default" dir item which points to the root item that we
diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index 412b910b04cc..91233a0d4d62 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -36,7 +36,6 @@ static int insert_normal_tree_ref(struct btrfs_root *root, u64 bytenr,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(&trans, root, path, &ins, size);
 	if (ret) {
 		test_err("couldn't insert ref %d", ret);
@@ -86,7 +85,6 @@ static int add_tree_ref(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
 	if (ret) {
 		test_err("couldn't find extent ref");
@@ -135,7 +133,6 @@ static int remove_extent_item(struct btrfs_root *root, u64 bytenr,
 		test_err("couldn't allocate path");
 		return -ENOMEM;
 	}
-	path->leave_spinning = 1;
 
 	ret = btrfs_search_slot(&trans, root, &key, path, -1, 1);
 	if (ret) {
@@ -170,7 +167,6 @@ static int remove_extent_ref(struct btrfs_root *root, u64 bytenr,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
 	if (ret) {
 		test_err("couldn't find extent ref");
-- 
1.8.3.1


* Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking
  2018-08-16 21:07 ` [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking Liu Bo
@ 2018-08-17  7:24   ` Nikolay Borisov
  2018-08-20  6:07     ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-08-17  7:24 UTC (permalink / raw)
  To: Liu Bo, linux-btrfs



On 17.08.2018 00:07, Liu Bo wrote:
> Btrfs's btree locking has two modes, spinning mode and blocking mode.
> While searching the btree, locks are always acquired in spinning mode
> and then converted to blocking mode if necessary, and in some hot paths
> we may switch the locks back to spinning mode via
> btrfs_clear_path_blocking().
> 
> When acquiring locks, both readers and writers need to wait for any
> blocking readers and writers to complete before doing
> read_lock()/write_lock().
> 
> The problem is that btrfs_clear_path_blocking() needs to first switch
> the nodes in the path to blocking mode (via btrfs_set_path_blocking())
> to make lockdep happy before doing its actual job of clearing the
> blocking state.
> 
> Switching from spinning mode to blocking mode consists of
> 
> step 1) bumping up the blocking readers/writers counter and
> step 2) read_unlock()/write_unlock().
> 
> This causes a serious ping-pong effect when there is a great number of
> concurrent readers/writers, as waiters are woken up only to go back to
> sleep immediately.

I think this ping-pong needs to be explained in a bit more detail.

Looking at the code it seems the issue happens when we have a path
locked in spinning mode (via btrfs_tree_lock) - in this case we have
blocking_readers/writes == 0 and write_lock acquired. Then we could
potentially have multiple requests to lock the tree (in this case the
root) and since the current holder is spinning then btrfs_tree_lock
callers will block on the write_lock. Subsequently when the holder
switches to blocking he calls
btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
increments the blocking reader/writes and calls the unlock at which
point a "thundering herd" problem occurs on the lock. Am I correct?

Furthermore I think the problem occurs not in btrfs_clear_path_blocking
but rather in the initial call to btrfs_set_path_blocking.

The way I see it, the following sequence of operations occurs:

1. Call btrfs_set_path_blocking: increments
blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
the lock types for the nodes in the path are
BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING

2. Some time later we call btrfs_clear_path_blocking
which also calls btrfs_set_path_blocking, however this time this
function doesn't do anything because the lock types in path struct are
already blocking and none of the 2 conditions inside
btrfs_set_lock_blocking_rw will match hence this code won't be executed.

Did I miss something ?

> 
> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> files creation script,
> 
> MNT=/mnt/btrfs
> mkfs.btrfs -f /dev/sdf
> mount /dev/sdf $MNT
> time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
>         -d  $MNT/0  -d  $MNT/1 \
>         -d  $MNT/2  -d  $MNT/3 \
>         -d  $MNT/4  -d  $MNT/5 \
>         -d  $MNT/6  -d  $MNT/7 \
>         -d  $MNT/8  -d  $MNT/9 \
>         -d  $MNT/10  -d  $MNT/11 \
>         -d  $MNT/12  -d  $MNT/13 \
>         -d  $MNT/14  -d  $MNT/15
> 
> w/o patch:
> real    2m27.307s
> user    0m12.839s
> sys     13m42.831s
> 
> w/ patch:
> real    1m2.273s
> user    0m15.802s
> sys     8m16.495s
> 
> 2) dbench also proves the improvement,
> dbench -t 120 -D /mnt/btrfs 16
> 
> w/o patch:
> Throughput 158.363 MB/sec
> 
> w/ patch:
> Throughput 449.52 MB/sec
> 
> 3) xfstests didn't show any additional failures.
> 
> One thing to note is that callers may set leave_spinning to have all
> nodes in the path stay in spinning mode, which means the callers are
> prepared not to sleep before releasing the path; however, it causes
> them no problem to end up with blocking locks instead, IOW, we can
> just get rid of leave_spinning.
> 
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
>  fs/btrfs/ctree.h         |  2 --
>  fs/btrfs/delayed-inode.c |  3 ---
>  3 files changed, 4 insertions(+), 58 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index d436fb4c002e..8b31caa60b0a 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
>  	}
>  }
>  
> -/*
> - * reset all the locked nodes in the patch to spinning locks.
> - *
> - * held is used to keep lockdep happy, when lockdep is enabled
> - * we set held to a blocking lock before we go around and
> - * retake all the spinlocks in the path.  You can safely use NULL
> - * for held
> - */
> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> -					struct extent_buffer *held, int held_rw)
> -{
> -	int i;
> -
> -	if (held) {
> -		btrfs_set_lock_blocking_rw(held, held_rw);
> -		if (held_rw == BTRFS_WRITE_LOCK)
> -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> -		else if (held_rw == BTRFS_READ_LOCK)
> -			held_rw = BTRFS_READ_LOCK_BLOCKING;
> -	}
> -	btrfs_set_path_blocking(p);
> -
> -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> -		if (p->nodes[i] && p->locks[i]) {
> -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> -				p->locks[i] = BTRFS_WRITE_LOCK;
> -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> -				p->locks[i] = BTRFS_READ_LOCK;
> -		}
> -	}
> -
> -	if (held)
> -		btrfs_clear_lock_blocking_rw(held, held_rw);
> -}
> -
>  /* this also releases the path */
>  void btrfs_free_path(struct btrfs_path *p)
>  {
> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
>  		}
>  	}
>  
> -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
>  	btrfs_tree_read_unlock_blocking(eb);
>  	free_extent_buffer(eb);
>  
> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>  		btrfs_set_path_blocking(p);
>  		reada_for_balance(fs_info, p, level);
>  		sret = split_node(trans, root, p, level);
> -		btrfs_clear_path_blocking(p, NULL, 0);
>  
>  		BUG_ON(sret > 0);
>  		if (sret) {
> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>  		btrfs_set_path_blocking(p);
>  		reada_for_balance(fs_info, p, level);
>  		sret = balance_level(trans, root, p, level);
> -		btrfs_clear_path_blocking(p, NULL, 0);
>  
>  		if (sret) {
>  			ret = sret;
> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  		}
>  cow_done:
>  		p->nodes[level] = b;
> -		btrfs_clear_path_blocking(p, NULL, 0);
> +		/*
> +		 * Leave the path with blocking locks to avoid massive
> +		 * lock context switches; this is done on purpose.
> +		 */
>  
>  		/*
>  		 * we have a lock on b and as long as we aren't changing
> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  					if (!err) {
>  						btrfs_set_path_blocking(p);
>  						btrfs_tree_lock(b);
> -						btrfs_clear_path_blocking(p, b,
> -								  BTRFS_WRITE_LOCK);
>  					}
>  					p->locks[level] = BTRFS_WRITE_LOCK;
>  				} else {
> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  					if (!err) {
>  						btrfs_set_path_blocking(p);
>  						btrfs_tree_read_lock(b);
> -						btrfs_clear_path_blocking(p, b,
> -								  BTRFS_READ_LOCK);
>  					}
>  					p->locks[level] = BTRFS_READ_LOCK;
>  				}
> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>  				btrfs_set_path_blocking(p);
>  				err = split_leaf(trans, root, key,
>  						 p, ins_len, ret == 0);
> -				btrfs_clear_path_blocking(p, NULL, 0);
>  
>  				BUG_ON(err > 0);
>  				if (err) {
> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>  	while (b) {
>  		level = btrfs_header_level(b);
>  		p->nodes[level] = b;
> -		btrfs_clear_path_blocking(p, NULL, 0);
>  
>  		/*
>  		 * we have a lock on b and as long as we aren't changing
> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>  			if (!err) {
>  				btrfs_set_path_blocking(p);
>  				btrfs_tree_read_lock(b);
> -				btrfs_clear_path_blocking(p, b,
> -							  BTRFS_READ_LOCK);
>  			}
>  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
>  			if (!b) {
> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
>  		path->locks[level - 1] = BTRFS_READ_LOCK;
>  		path->nodes[level - 1] = cur;
>  		unlock_up(path, level, 1, 0, NULL);
> -		btrfs_clear_path_blocking(path, NULL, 0);
>  	}
>  out:
>  	path->keep_locks = keep_locks;
> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>  			if (!ret) {
>  				btrfs_set_path_blocking(path);
>  				btrfs_tree_read_lock(next);
> -				btrfs_clear_path_blocking(path, next,
> -							  BTRFS_READ_LOCK);
>  			}
>  			next_rw_lock = BTRFS_READ_LOCK;
>  		}
> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>  			if (!ret) {
>  				btrfs_set_path_blocking(path);
>  				btrfs_tree_read_lock(next);
> -				btrfs_clear_path_blocking(path, next,
> -							  BTRFS_READ_LOCK);
>  			}
>  			next_rw_lock = BTRFS_READ_LOCK;
>  		}
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 318be7864072..1aeed3c0e949 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
>  struct btrfs_path *btrfs_alloc_path(void);
>  void btrfs_free_path(struct btrfs_path *p);
>  void btrfs_set_path_blocking(struct btrfs_path *p);
> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> -			       struct extent_buffer *held, int held_rw);
>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
>  
>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index f51b509f2d9b..db9f45082c61 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
>  		i++;
>  	}
>  
> -	/* reset all the locked nodes in the patch to spinning locks. */
> -	btrfs_clear_path_blocking(path, NULL, 0);
> -
>  	/* insert the keys of the items */
>  	setup_items_for_insert(root, path, keys, data_size,
>  			       total_data_size, total_size, nitems);
> 


* Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking
  2018-08-17  7:24   ` Nikolay Borisov
@ 2018-08-20  6:07     ` Liu Bo
  2018-08-20  7:31       ` Nikolay Borisov
  0 siblings, 1 reply; 16+ messages in thread
From: Liu Bo @ 2018-08-20  6:07 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> 
> 
> On 17.08.2018 00:07, Liu Bo wrote:
> > Btrfs's btree locking has two modes, spinning mode and blocking mode.
> > While searching the btree, locks are always acquired in spinning mode
> > and then converted to blocking mode if necessary, and in some hot paths
> > we may switch the locks back to spinning mode via
> > btrfs_clear_path_blocking().
> > 
> > When acquiring locks, both readers and writers need to wait for any
> > blocking readers and writers to complete before doing
> > read_lock()/write_lock().
> > 
> > The problem is that btrfs_clear_path_blocking() needs to first switch
> > the nodes in the path to blocking mode (via btrfs_set_path_blocking())
> > to make lockdep happy before doing its actual job of clearing the
> > blocking state.
> > 
> > Switching from spinning mode to blocking mode consists of
> > 
> > step 1) bumping up the blocking readers/writers counter and
> > step 2) read_unlock()/write_unlock().
> > 
> > This causes a serious ping-pong effect when there is a great number of
> > concurrent readers/writers, as waiters are woken up only to go back to
> > sleep immediately.
> 
> I think this ping-pong needs to be explained in a bit more detail.
> 
> Looking at the code it seems the issue happens when we have a path
> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> blocking_readers/writes == 0 and write_lock acquired. Then we could
> potentially have multiple requests to lock the tree (in this case the
> root) and since the current holder is spinning then btrfs_tree_lock
> callers will block on the write_lock. Subsequently when the holder
> switches to blocking he calls
> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
> increments the blocking readers/writers counter and calls the unlock at which
> point a "thundering herd" problem occurs on the lock. Am I correct?
>

That's correct, but the "thundering herd" problem is not really the
problem that this patch is trying to address, see more below.

> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> but rather in the initial call to btrfs_set_path_blocking.
> 
> The way I see it, the following sequence of operations occurs:
> 
> 1. Call btrfs_set_path_blocking: increments
> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> the lock types for the nodes in the path are
> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> 
> 2. Some time later we call btrfs_clear_path_blocking
> which also calls btrfs_set_path_blocking, however this time this
> function doesn't do anything because the lock types in path struct are
> already blocking and none of the 2 conditions inside
> btrfs_set_lock_blocking_rw will match hence this code won't be executed.
> 
> Did I miss something ?
> 

Thanks for sharing your thoughts.

We have to set the path blocking as the subsequent code needs to
sleep; it's the cost we have to pay by design, and I'm OK with it.

The problem is that btrfs_clear_path_blocking() always tries to set
the path to blocking before clearing the blocking state, as it needs
to ensure the spin_locks on each level are acquired in the right
order.  And if all the nodes in the path are already in spinning mode
(say should_cow_block() decides not to COW), setting the path blocking
doesn't really make sense and causes others waiting on the lock to do
a wake-up and a go-to-sleep.
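
To put it in code, the hot loop in btrfs_search_slot() before this
patch did roughly the following at each level (a heavily trimmed
sketch of the hunk the patch touches):

	if (should_cow_block(trans, root, b)) {
		btrfs_set_path_blocking(p);	/* COW may sleep */
		err = btrfs_cow_block(trans, root, b, ...);
		...
	}
cow_done:
	p->nodes[level] = b;
	btrfs_clear_path_blocking(p, NULL, 0);

When should_cow_block() returns false, every lock is still spinning,
yet btrfs_clear_path_blocking() first flips the whole path to blocking
and then back to spinning, giving the waiters a pointless wake-up each
time.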

My trace showed that there could be up to 50 switches for a single
lock waiter, and the difference in finish time also proved how big the
impact is.

thanks,
-liubo
> > 
> > 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> > files creation script,
> > 
> > MNT=/mnt/btrfs
> > mkfs.btrfs -f /dev/sdf
> > mount /dev/sdf $MNT
> > time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
> >         -d  $MNT/0  -d  $MNT/1 \
> >         -d  $MNT/2  -d  $MNT/3 \
> >         -d  $MNT/4  -d  $MNT/5 \
> >         -d  $MNT/6  -d  $MNT/7 \
> >         -d  $MNT/8  -d  $MNT/9 \
> >         -d  $MNT/10  -d  $MNT/11 \
> >         -d  $MNT/12  -d  $MNT/13 \
> >         -d  $MNT/14  -d  $MNT/15
> > 
> > w/o patch:
> > real    2m27.307s
> > user    0m12.839s
> > sys     13m42.831s
> > 
> > w/ patch:
> > real    1m2.273s
> > user    0m15.802s
> > sys     8m16.495s
> > 
> > 2) dbench also proves the improvement,
> > dbench -t 120 -D /mnt/btrfs 16
> > 
> > w/o patch:
> > Throughput 158.363 MB/sec
> > 
> > w/ patch:
> > Throughput 449.52 MB/sec
> > 
> > 3) xfstests didn't show any additional failures.
> > 
> > One thing to note is that callers may set leave_spinning to have all
> > nodes in the path stay in spinning mode, which means the callers are
> > prepared not to sleep before releasing the path; however, it causes
> > them no problem to end up with blocking locks instead, IOW, we can
> > just get rid of leave_spinning.
> > 
> > Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> > ---
> >  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
> >  fs/btrfs/ctree.h         |  2 --
> >  fs/btrfs/delayed-inode.c |  3 ---
> >  3 files changed, 4 insertions(+), 58 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > index d436fb4c002e..8b31caa60b0a 100644
> > --- a/fs/btrfs/ctree.c
> > +++ b/fs/btrfs/ctree.c
> > @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> >  	}
> >  }
> >  
> > -/*
> > - * reset all the locked nodes in the patch to spinning locks.
> > - *
> > - * held is used to keep lockdep happy, when lockdep is enabled
> > - * we set held to a blocking lock before we go around and
> > - * retake all the spinlocks in the path.  You can safely use NULL
> > - * for held
> > - */
> > -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> > -					struct extent_buffer *held, int held_rw)
> > -{
> > -	int i;
> > -
> > -	if (held) {
> > -		btrfs_set_lock_blocking_rw(held, held_rw);
> > -		if (held_rw == BTRFS_WRITE_LOCK)
> > -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> > -		else if (held_rw == BTRFS_READ_LOCK)
> > -			held_rw = BTRFS_READ_LOCK_BLOCKING;
> > -	}
> > -	btrfs_set_path_blocking(p);
> > -
> > -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> > -		if (p->nodes[i] && p->locks[i]) {
> > -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> > -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> > -				p->locks[i] = BTRFS_WRITE_LOCK;
> > -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> > -				p->locks[i] = BTRFS_READ_LOCK;
> > -		}
> > -	}
> > -
> > -	if (held)
> > -		btrfs_clear_lock_blocking_rw(held, held_rw);
> > -}
> > -
> >  /* this also releases the path */
> >  void btrfs_free_path(struct btrfs_path *p)
> >  {
> > @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> >  		}
> >  	}
> >  
> > -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> >  	btrfs_tree_read_unlock_blocking(eb);
> >  	free_extent_buffer(eb);
> >  
> > @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >  		btrfs_set_path_blocking(p);
> >  		reada_for_balance(fs_info, p, level);
> >  		sret = split_node(trans, root, p, level);
> > -		btrfs_clear_path_blocking(p, NULL, 0);
> >  
> >  		BUG_ON(sret > 0);
> >  		if (sret) {
> > @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >  		btrfs_set_path_blocking(p);
> >  		reada_for_balance(fs_info, p, level);
> >  		sret = balance_level(trans, root, p, level);
> > -		btrfs_clear_path_blocking(p, NULL, 0);
> >  
> >  		if (sret) {
> >  			ret = sret;
> > @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >  		}
> >  cow_done:
> >  		p->nodes[level] = b;
> > -		btrfs_clear_path_blocking(p, NULL, 0);
> > +		/*
> > +		 * Leave the path with blocking locks to avoid massive
> > +		 * lock context switches; this is done on purpose.
> > +		 */
> >  
> >  		/*
> >  		 * we have a lock on b and as long as we aren't changing
> > @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >  					if (!err) {
> >  						btrfs_set_path_blocking(p);
> >  						btrfs_tree_lock(b);
> > -						btrfs_clear_path_blocking(p, b,
> > -								  BTRFS_WRITE_LOCK);
> >  					}
> >  					p->locks[level] = BTRFS_WRITE_LOCK;
> >  				} else {
> > @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >  					if (!err) {
> >  						btrfs_set_path_blocking(p);
> >  						btrfs_tree_read_lock(b);
> > -						btrfs_clear_path_blocking(p, b,
> > -								  BTRFS_READ_LOCK);
> >  					}
> >  					p->locks[level] = BTRFS_READ_LOCK;
> >  				}
> > @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >  				btrfs_set_path_blocking(p);
> >  				err = split_leaf(trans, root, key,
> >  						 p, ins_len, ret == 0);
> > -				btrfs_clear_path_blocking(p, NULL, 0);
> >  
> >  				BUG_ON(err > 0);
> >  				if (err) {
> > @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >  	while (b) {
> >  		level = btrfs_header_level(b);
> >  		p->nodes[level] = b;
> > -		btrfs_clear_path_blocking(p, NULL, 0);
> >  
> >  		/*
> >  		 * we have a lock on b and as long as we aren't changing
> > @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >  			if (!err) {
> >  				btrfs_set_path_blocking(p);
> >  				btrfs_tree_read_lock(b);
> > -				btrfs_clear_path_blocking(p, b,
> > -							  BTRFS_READ_LOCK);
> >  			}
> >  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> >  			if (!b) {
> > @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> >  		path->locks[level - 1] = BTRFS_READ_LOCK;
> >  		path->nodes[level - 1] = cur;
> >  		unlock_up(path, level, 1, 0, NULL);
> > -		btrfs_clear_path_blocking(path, NULL, 0);
> >  	}
> >  out:
> >  	path->keep_locks = keep_locks;
> > @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >  			if (!ret) {
> >  				btrfs_set_path_blocking(path);
> >  				btrfs_tree_read_lock(next);
> > -				btrfs_clear_path_blocking(path, next,
> > -							  BTRFS_READ_LOCK);
> >  			}
> >  			next_rw_lock = BTRFS_READ_LOCK;
> >  		}
> > @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >  			if (!ret) {
> >  				btrfs_set_path_blocking(path);
> >  				btrfs_tree_read_lock(next);
> > -				btrfs_clear_path_blocking(path, next,
> > -							  BTRFS_READ_LOCK);
> >  			}
> >  			next_rw_lock = BTRFS_READ_LOCK;
> >  		}
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 318be7864072..1aeed3c0e949 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> >  struct btrfs_path *btrfs_alloc_path(void);
> >  void btrfs_free_path(struct btrfs_path *p);
> >  void btrfs_set_path_blocking(struct btrfs_path *p);
> > -void btrfs_clear_path_blocking(struct btrfs_path *p,
> > -			       struct extent_buffer *held, int held_rw);
> >  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
> >  
> >  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> > index f51b509f2d9b..db9f45082c61 100644
> > --- a/fs/btrfs/delayed-inode.c
> > +++ b/fs/btrfs/delayed-inode.c
> > @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> >  		i++;
> >  	}
> >  
> > -	/* reset all the locked nodes in the patch to spinning locks. */
> > -	btrfs_clear_path_blocking(path, NULL, 0);
> > -
> >  	/* insert the keys of the items */
> >  	setup_items_for_insert(root, path, keys, data_size,
> >  			       total_data_size, total_size, nitems);
> > 


* Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking
  2018-08-20  6:07     ` Liu Bo
@ 2018-08-20  7:31       ` Nikolay Borisov
  2018-08-21 17:24         ` Liu Bo
  0 siblings, 1 reply; 16+ messages in thread
From: Nikolay Borisov @ 2018-08-20  7:31 UTC (permalink / raw)
  To: bo.liu; +Cc: linux-btrfs



On 20.08.2018 09:07, Liu Bo wrote:
> On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
>>
>>
>> On 17.08.2018 00:07, Liu Bo wrote:
>>> Btrfs's btree locking has two modes, spinning mode and blocking mode.
>>> While searching the btree, locks are always acquired in spinning mode
>>> and then converted to blocking mode if necessary, and in some hot paths
>>> we may switch the locks back to spinning mode via
>>> btrfs_clear_path_blocking().
>>>
>>> When acquiring locks, both readers and writers need to wait for any
>>> blocking readers and writers to complete before doing
>>> read_lock()/write_lock().
>>>
>>> The problem is that btrfs_clear_path_blocking() needs to first switch
>>> the nodes in the path to blocking mode (via btrfs_set_path_blocking())
>>> to make lockdep happy before doing its actual job of clearing the
>>> blocking state.
>>>
>>> Switching from spinning mode to blocking mode consists of
>>>
>>> step 1) bumping up the blocking readers/writers counter and
>>> step 2) read_unlock()/write_unlock().
>>>
>>> This causes a serious ping-pong effect when there is a great number of
>>> concurrent readers/writers, as waiters are woken up only to go back to
>>> sleep immediately.
>>
>> I think this ping-pong needs to be explained in a bit more detail.
>>
>> Looking at the code it seems the issue happens when we have a path
>> locked in spinning mode (via btrfs_tree_lock) - in this case we have
>> blocking_readers/writes == 0 and write_lock acquired. Then we could
>> potentially have multiple requests to lock the tree (in this case the
>> root) and since the current holder is spinning then btrfs_tree_lock
>> callers will block on the write_lock. Subsequently when the holder
>> switches to blocking he calls
>> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw which of course
>> increments the blocking readers/writers counter and calls the unlock at which
>> point a "thundering herd" problem occurs on the lock. Am I correct?
>>
> 
> That's correct, but the "thundering herd" problem is not really the
> problem that this patch is trying to address, see more below.
> 
>> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
>> but rather in the initial call to btrfs_set_path_blocking.
>>
>> The way I see it, the following sequence of operations occurs:
>>
>> 1. Call btrfs_set_path_blocking: increments
>> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
>> the lock types for the nodes in the path are
>> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
>>
>> 2. Some time later we call btrfs_clear_path_blocking
>> which also calls btrfs_set_path_blocking, however this time this
>> function doesn't do anything because the lock types in path struct are
>> already blocking and none of the 2 conditions inside
>> btrfs_set_lock_blocking_rw will match hence this code won't be executed.
>>
>> Did I miss something ?
>>
> 
> Thanks for sharing your thoughts.
> 
> We have to set the path blocking as the subsequent code needs to
> sleep; it's the cost we have to pay by design, and I'm OK with it.
> 
> The problem is that btrfs_clear_path_blocking() always tries to set
> the path to blocking before clearing the blocking state, as it needs
> to ensure the spin_locks on each level are acquired in the right
> order.  And if all the nodes in the path are already in spinning mode
> (say should_cow_block() decides not to COW), setting the path blocking
> doesn't really make sense and causes others waiting on the lock to do
> a wake-up and a go-to-sleep.

Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
agree this could happen, but it can't in any of the other hunks you
touched, since they all call btrfs_clear_path_blocking AFTER
btrfs_set_path_blocking was called. So the problem can't stem from
them, since they take the perf hit at the time btrfs_set_path_blocking
is called. IMO this is important information for context and needs to
be included in the changelog. In short, the only problematic code is
the one in btrfs_search_slot.

Another, less intrusive, idea is to touch only btrfs_search_slot by
introducing a boolean flag to indicate that we have set the path to
blocking/have COWed blocks.

Then the fix would be as simple as:

if (cowed/path_blocking/whatever)
    btrfs_clear_path_blocking(p, NULL, 0);
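
I.e. something along these lines (untested, with 'cowed' being a
hypothetical local in btrfs_search_slot()):

	bool cowed = false;
	...
	if (should_cow_block(trans, root, b)) {
		btrfs_set_path_blocking(p);
		err = btrfs_cow_block(trans, root, b, ...);
		...
		cowed = true;
	}
cow_done:
	p->nodes[level] = b;
	if (cowed)
		btrfs_clear_path_blocking(p, NULL, 0);

so the blocking/spinning round trip is only paid when we actually went
blocking for the COW.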

Yours is also a valid approach, though it seems more heavyweight.
Having said that, I'm all in favor of removing code so long as it does
not introduce any regressions :)


> 
> My trace showed that there could be up to 50 switches for a single
> lock waiter, and the difference in finish time also proved how big
> the impact is.
> 
> thanks,
> -liubo
>>>
>>> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
>>> files creation script,
>>>
>>> MNT=/mnt/btrfs
>>> mkfs.btrfs -f /dev/sdf
>>> mount /dev/sdf $MNT
>>> time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
>>>         -d  $MNT/0  -d  $MNT/1 \
>>>         -d  $MNT/2  -d  $MNT/3 \
>>>         -d  $MNT/4  -d  $MNT/5 \
>>>         -d  $MNT/6  -d  $MNT/7 \
>>>         -d  $MNT/8  -d  $MNT/9 \
>>>         -d  $MNT/10  -d  $MNT/11 \
>>>         -d  $MNT/12  -d  $MNT/13 \
>>>         -d  $MNT/14  -d  $MNT/15
>>>
>>> w/o patch:
>>> real    2m27.307s
>>> user    0m12.839s
>>> sys     13m42.831s
>>>
>>> w/ patch:
>>> real    1m2.273s
>>> user    0m15.802s
>>> sys     8m16.495s
>>>
>>> 2) dbench also proves the improvement,
>>> dbench -t 120 -D /mnt/btrfs 16
>>>
>>> w/o patch:
>>> Throughput 158.363 MB/sec
>>>
>>> w/ patch:
>>> Throughput 449.52 MB/sec
>>>
>>> 3) xfstests didn't show any additional failures.
>>>
>>> One thing to note is that callers may set leave_spinning to have all
>>> nodes in the path stay in spinning mode, which means the callers are
>>> prepared not to sleep before releasing the path; however, it causes
>>> them no problem to end up with blocking locks instead, IOW, we can
>>> just get rid of leave_spinning.
>>>
>>> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
>>> ---
>>>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
>>>  fs/btrfs/ctree.h         |  2 --
>>>  fs/btrfs/delayed-inode.c |  3 ---
>>>  3 files changed, 4 insertions(+), 58 deletions(-)
>>>
>>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
>>> index d436fb4c002e..8b31caa60b0a 100644
>>> --- a/fs/btrfs/ctree.c
>>> +++ b/fs/btrfs/ctree.c
>>> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
>>>  	}
>>>  }
>>>  
>>> -/*
>>> - * reset all the locked nodes in the patch to spinning locks.
>>> - *
>>> - * held is used to keep lockdep happy, when lockdep is enabled
>>> - * we set held to a blocking lock before we go around and
>>> - * retake all the spinlocks in the path.  You can safely use NULL
>>> - * for held
>>> - */
>>> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
>>> -					struct extent_buffer *held, int held_rw)
>>> -{
>>> -	int i;
>>> -
>>> -	if (held) {
>>> -		btrfs_set_lock_blocking_rw(held, held_rw);
>>> -		if (held_rw == BTRFS_WRITE_LOCK)
>>> -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
>>> -		else if (held_rw == BTRFS_READ_LOCK)
>>> -			held_rw = BTRFS_READ_LOCK_BLOCKING;
>>> -	}
>>> -	btrfs_set_path_blocking(p);
>>> -
>>> -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
>>> -		if (p->nodes[i] && p->locks[i]) {
>>> -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
>>> -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
>>> -				p->locks[i] = BTRFS_WRITE_LOCK;
>>> -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
>>> -				p->locks[i] = BTRFS_READ_LOCK;
>>> -		}
>>> -	}
>>> -
>>> -	if (held)
>>> -		btrfs_clear_lock_blocking_rw(held, held_rw);
>>> -}
>>> -
>>>  /* this also releases the path */
>>>  void btrfs_free_path(struct btrfs_path *p)
>>>  {
>>> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
>>>  		}
>>>  	}
>>>  
>>> -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
>>>  	btrfs_tree_read_unlock_blocking(eb);
>>>  	free_extent_buffer(eb);
>>>  
>>> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>>>  		btrfs_set_path_blocking(p);
>>>  		reada_for_balance(fs_info, p, level);
>>>  		sret = split_node(trans, root, p, level);
>>> -		btrfs_clear_path_blocking(p, NULL, 0);
>>>  
>>>  		BUG_ON(sret > 0);
>>>  		if (sret) {
>>> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
>>>  		btrfs_set_path_blocking(p);
>>>  		reada_for_balance(fs_info, p, level);
>>>  		sret = balance_level(trans, root, p, level);
>>> -		btrfs_clear_path_blocking(p, NULL, 0);
>>>  
>>>  		if (sret) {
>>>  			ret = sret;
>>> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>>  		}
>>>  cow_done:
>>>  		p->nodes[level] = b;
>>> -		btrfs_clear_path_blocking(p, NULL, 0);
>>> +		/*
>>> +		 * Leave the path with blocking locks to avoid massive
>>> +		 * lock context switches; this is done on purpose.
>>> +		 */
>>>  
>>>  		/*
>>>  		 * we have a lock on b and as long as we aren't changing
>>> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>>  					if (!err) {
>>>  						btrfs_set_path_blocking(p);
>>>  						btrfs_tree_lock(b);
>>> -						btrfs_clear_path_blocking(p, b,
>>> -								  BTRFS_WRITE_LOCK);
>>>  					}
>>>  					p->locks[level] = BTRFS_WRITE_LOCK;
>>>  				} else {
>>> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>>  					if (!err) {
>>>  						btrfs_set_path_blocking(p);
>>>  						btrfs_tree_read_lock(b);
>>> -						btrfs_clear_path_blocking(p, b,
>>> -								  BTRFS_READ_LOCK);
>>>  					}
>>>  					p->locks[level] = BTRFS_READ_LOCK;
>>>  				}
>>> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>>  				btrfs_set_path_blocking(p);
>>>  				err = split_leaf(trans, root, key,
>>>  						 p, ins_len, ret == 0);
>>> -				btrfs_clear_path_blocking(p, NULL, 0);
>>>  
>>>  				BUG_ON(err > 0);
>>>  				if (err) {
>>> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>>>  	while (b) {
>>>  		level = btrfs_header_level(b);
>>>  		p->nodes[level] = b;
>>> -		btrfs_clear_path_blocking(p, NULL, 0);
>>>  
>>>  		/*
>>>  		 * we have a lock on b and as long as we aren't changing
>>> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>>>  			if (!err) {
>>>  				btrfs_set_path_blocking(p);
>>>  				btrfs_tree_read_lock(b);
>>> -				btrfs_clear_path_blocking(p, b,
>>> -							  BTRFS_READ_LOCK);
>>>  			}
>>>  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
>>>  			if (!b) {
>>> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
>>>  		path->locks[level - 1] = BTRFS_READ_LOCK;
>>>  		path->nodes[level - 1] = cur;
>>>  		unlock_up(path, level, 1, 0, NULL);
>>> -		btrfs_clear_path_blocking(path, NULL, 0);
>>>  	}
>>>  out:
>>>  	path->keep_locks = keep_locks;
>>> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>>>  			if (!ret) {
>>>  				btrfs_set_path_blocking(path);
>>>  				btrfs_tree_read_lock(next);
>>> -				btrfs_clear_path_blocking(path, next,
>>> -							  BTRFS_READ_LOCK);
>>>  			}
>>>  			next_rw_lock = BTRFS_READ_LOCK;
>>>  		}
>>> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>>>  			if (!ret) {
>>>  				btrfs_set_path_blocking(path);
>>>  				btrfs_tree_read_lock(next);
>>> -				btrfs_clear_path_blocking(path, next,
>>> -							  BTRFS_READ_LOCK);
>>>  			}
>>>  			next_rw_lock = BTRFS_READ_LOCK;
>>>  		}
>>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>>> index 318be7864072..1aeed3c0e949 100644
>>> --- a/fs/btrfs/ctree.h
>>> +++ b/fs/btrfs/ctree.h
>>> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
>>>  struct btrfs_path *btrfs_alloc_path(void);
>>>  void btrfs_free_path(struct btrfs_path *p);
>>>  void btrfs_set_path_blocking(struct btrfs_path *p);
>>> -void btrfs_clear_path_blocking(struct btrfs_path *p,
>>> -			       struct extent_buffer *held, int held_rw);
>>>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
>>>  
>>>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
>>> index f51b509f2d9b..db9f45082c61 100644
>>> --- a/fs/btrfs/delayed-inode.c
>>> +++ b/fs/btrfs/delayed-inode.c
>>> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
>>>  		i++;
>>>  	}
>>>  
>>> -	/* reset all the locked nodes in the patch to spinning locks. */
>>> -	btrfs_clear_path_blocking(path, NULL, 0);
>>> -
>>>  	/* insert the keys of the items */
>>>  	setup_items_for_insert(root, path, keys, data_size,
>>>  			       total_data_size, total_size, nitems);
>>>
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking
  2018-08-20  7:31       ` Nikolay Borisov
@ 2018-08-21 17:24         ` Liu Bo
  0 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-21 17:24 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Mon, Aug 20, 2018 at 10:31:49AM +0300, Nikolay Borisov wrote:
> 
> 
> On 20.08.2018 09:07, Liu Bo wrote:
> > On Fri, Aug 17, 2018 at 10:24:58AM +0300, Nikolay Borisov wrote:
> >>
> >>
> >> On 17.08.2018 00:07, Liu Bo wrote:
> >>> Btrfs's btree locking has two modes, spinning mode and blocking mode,
> >>> while searching btree, locking is always acquired in spinning mode and
> >>> then converted to blocking mode if necessary, and in some hot paths we may
> >>> switch the locking back to spinning mode by btrfs_clear_path_blocking().
> >>>
> >>> When acquiring locks, both of reader and writer need to wait for blocking
> >>> readers and writers to complete before doing read_lock()/write_lock().
> >>>
> >>> The problem is that btrfs_clear_path_blocking() needs to switch nodes
> >>> in the path to blocking mode at first (by btrfs_set_path_blocking) to
> >>> make lockdep happy before doing its actual clearing blocking job.
> >>>
> >>> When switching to blocking mode from spinning mode, it consists of
> >>>
> >>> step 1) bumping up blocking readers counter and
> >>> step 2) read_unlock()/write_unlock(),
> >>>
> >>> this has caused serious ping-pong effect if there're a great amount of
> >>> concurrent readers/writers, as waiters will be woken up and go to
> >>> sleep immediately.
> >>
> >> I think this ping-pong needs to be explained in a bit more detail.
> >>
> >> Looking at the code, it seems the issue happens when we have a path
> >> locked in spinning mode (via btrfs_tree_lock) - in this case we have
> >> blocking_readers/blocking_writers == 0 and the write_lock acquired.
> >> Then we could potentially have multiple requests to lock the tree (in
> >> this case the root), and since the current holder is spinning, the
> >> btrfs_tree_lock callers will block on the write_lock. Subsequently,
> >> when the holder switches to blocking, it calls
> >> btrfs_set_path_blocking->btrfs_set_lock_blocking_rw, which of course
> >> increments blocking_readers/blocking_writers and calls the unlock, at
> >> which point a "thundering herd" problem occurs on the lock. Am I
> >> correct?
> >>
> > 
> > That's correct, but the "thundering herd" problem is not really the
> > problem that this patch is trying to address, see more below.
> > 
> >> Furthermore I think the problem occurs not in btrfs_clear_path_blocking
> >> but rather in the initial call to btrfs_set_path_blocking.
> >>
> >> The way I see it the following sequence of operations occur:
> >>
> >> 1. Call btrfs_set_path_blocking: increments
> >> blocking_writers/blocking_readers, calls unlock: ping-pong happens. Now
> >> the lock types for the nodes in the path are
> >> BTRFS_READ_LOCK_BLOCKING/BTRFS_WRITE_LOCK_BLOCKING
> >>
> >> 2. Some time later we call btrfs_clear_path_blocking, which also
> >> calls btrfs_set_path_blocking; however, this time the function doesn't
> >> do anything, because the lock types in the path struct are already
> >> blocking and none of the 2 conditions inside
> >> btrfs_set_lock_blocking_rw will match, hence this code won't be
> >> executed.
> >>
> >> Did I miss something?
> >>
> > 
> > Thanks for sharing your thoughts.
> > 
> > We have to set the path blocking as the subsequent code needs to sleep;
> > it's the cost we have to pay by design, and I'm OK with it.
> > 
> > The problem is that btrfs_clear_path_blocking() always tries to set the
> > path to blocking before clearing path blocking, as it needs to ensure
> > the spinlocks on each level are acquired in the right order.  And if all
> > the nodes in the path are already in spinning mode (say that
> > should_cow_block() decides not to COW), setting the path blocking
> > doesn't really make sense and just causes others waiting on the lock to
> > do a wake-up and a go-to-sleep.
> 
> Ok, for the case of btrfs_clear_path_blocking in btrfs_search_slot I
> agree this could happen, but it can't in any of the other hunks you
> touched, since they all call btrfs_clear_path_blocking AFTER
> btrfs_set_path_blocking was called. So the problem can't stem from them,
> since they take the perf hit at the time btrfs_set_path_blocking is
> called. IMO this is important context and needs to be included in the
> changelog. In short, the only problematic code is the one in
> btrfs_search_slot.
> 
> Another less intrusive idea is to touch only btrfs_search_slot by
> introducing a boolean flag to indicate we have set the path to
> blocking/COWed blocks.
> 
> Then the fix will be a simple:
> 
> if (cowed/path_blocking/whatever)
>     btrfs_clear_path_blocking(p, NULL, 0);

if (cow)
    btrfs_clear_path_blocking();

I did exactly this at first, and it ended up with 1m20s, while killing
all the btrfs_clear_path_blocking() calls gave a stable 1m2s, so I
confirmed that even switching the path back from blocking to spinning
also hurts a little bit.
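
To show where the ping-pong comes from, the spinning -> blocking switch
on the holder side plus the retry loop on the waiter side is roughly
(simplified from locking.c, read side and assertions elided, so not the
exact code):

	/* holder, btrfs_set_lock_blocking_rw(), write side */
	atomic_inc(&eb->blocking_writers);	/* step 1 */
	write_unlock(&eb->lock);		/* step 2, waiters get the rwlock */

	/* waiter, btrfs_tree_lock() */
again:
	wait_event(eb->write_lock_wq,
		   atomic_read(&eb->blocking_writers) == 0);
	write_lock(&eb->lock);
	if (atomic_read(&eb->blocking_writers)) {
		/* woken up only to go right back to sleep */
		write_unlock(&eb->lock);
		goto again;
	}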

> 
> Yours is also a valid approach, albeit a more heavyweight one. Having
> said that, I'm all in favor of removing code so long as it does not
> introduce any regressions :)

As all the callers that were running in spinning lock context should
be ready to cope with blocking lock context, it'd be fine IMO.

Thanks for the comments again.

thanks,
-liubo

> 
> 
> > 
> > My trace showed that there could be up to 50 switches for a single
> > lock waiter, and the difference in the finish time also proved how
> > big the impact is.
> > thanks,
> > -liubo
> >>>
> >>> 1) Killing this kind of ping-pong results in a big improvement in my 1600k
> >>> files creation script,
> >>>
> >>> MNT=/mnt/btrfs
> >>> mkfs.btrfs -f /dev/sdf
> >>> mount /dev/sdf $MNT
> >>> time fsmark  -D  10000  -S0  -n  100000  -s  0  -L  1 -l /tmp/fs_log.txt \
> >>>         -d  $MNT/0  -d  $MNT/1 \
> >>>         -d  $MNT/2  -d  $MNT/3 \
> >>>         -d  $MNT/4  -d  $MNT/5 \
> >>>         -d  $MNT/6  -d  $MNT/7 \
> >>>         -d  $MNT/8  -d  $MNT/9 \
> >>>         -d  $MNT/10  -d  $MNT/11 \
> >>>         -d  $MNT/12  -d  $MNT/13 \
> >>>         -d  $MNT/14  -d  $MNT/15
> >>>
> >>> w/o patch:
> >>> real    2m27.307s
> >>> user    0m12.839s
> >>> sys     13m42.831s
> >>>
> >>> w/ patch:
> >>> real    1m2.273s
> >>> user    0m15.802s
> >>> sys     8m16.495s
> >>>
> >>> 2) dbench also proves the improvement,
> >>> dbench -t 120 -D /mnt/btrfs 16
> >>>
> >>> w/o patch:
> >>> Throughput 158.363 MB/sec
> >>>
> >>> w/ patch:
> >>> Throughput 449.52 MB/sec
> >>>
> >>> 3) xfstests didn't show any additional failures.
> >>>
> >>> One thing to note is that callers may set leave_spinning to have all
> >>> nodes in the path stay in spinning mode, which means those callers are
> >>> prepared to not sleep before releasing the path; however, handing them
> >>> blocking locks instead won't cause any problem, IOW, we can just get
> >>> rid of leave_spinning.
> >>>
> >>> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> >>> ---
> >>>  fs/btrfs/ctree.c         | 57 ++++--------------------------------------------
> >>>  fs/btrfs/ctree.h         |  2 --
> >>>  fs/btrfs/delayed-inode.c |  3 ---
> >>>  3 files changed, 4 insertions(+), 58 deletions(-)
> >>>
> >>> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> >>> index d436fb4c002e..8b31caa60b0a 100644
> >>> --- a/fs/btrfs/ctree.c
> >>> +++ b/fs/btrfs/ctree.c
> >>> @@ -52,42 +52,6 @@ noinline void btrfs_set_path_blocking(struct btrfs_path *p)
> >>>  	}
> >>>  }
> >>>  
> >>> -/*
> >>> - * reset all the locked nodes in the patch to spinning locks.
> >>> - *
> >>> - * held is used to keep lockdep happy, when lockdep is enabled
> >>> - * we set held to a blocking lock before we go around and
> >>> - * retake all the spinlocks in the path.  You can safely use NULL
> >>> - * for held
> >>> - */
> >>> -noinline void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -					struct extent_buffer *held, int held_rw)
> >>> -{
> >>> -	int i;
> >>> -
> >>> -	if (held) {
> >>> -		btrfs_set_lock_blocking_rw(held, held_rw);
> >>> -		if (held_rw == BTRFS_WRITE_LOCK)
> >>> -			held_rw = BTRFS_WRITE_LOCK_BLOCKING;
> >>> -		else if (held_rw == BTRFS_READ_LOCK)
> >>> -			held_rw = BTRFS_READ_LOCK_BLOCKING;
> >>> -	}
> >>> -	btrfs_set_path_blocking(p);
> >>> -
> >>> -	for (i = BTRFS_MAX_LEVEL - 1; i >= 0; i--) {
> >>> -		if (p->nodes[i] && p->locks[i]) {
> >>> -			btrfs_clear_lock_blocking_rw(p->nodes[i], p->locks[i]);
> >>> -			if (p->locks[i] == BTRFS_WRITE_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_WRITE_LOCK;
> >>> -			else if (p->locks[i] == BTRFS_READ_LOCK_BLOCKING)
> >>> -				p->locks[i] = BTRFS_READ_LOCK;
> >>> -		}
> >>> -	}
> >>> -
> >>> -	if (held)
> >>> -		btrfs_clear_lock_blocking_rw(held, held_rw);
> >>> -}
> >>> -
> >>>  /* this also releases the path */
> >>>  void btrfs_free_path(struct btrfs_path *p)
> >>>  {
> >>> @@ -1306,7 +1270,6 @@ static struct tree_mod_elem *__tree_mod_log_oldest_root(
> >>>  		}
> >>>  	}
> >>>  
> >>> -	btrfs_clear_path_blocking(path, NULL, BTRFS_READ_LOCK);
> >>>  	btrfs_tree_read_unlock_blocking(eb);
> >>>  	free_extent_buffer(eb);
> >>>  
> >>> @@ -2483,7 +2446,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = split_node(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		BUG_ON(sret > 0);
> >>>  		if (sret) {
> >>> @@ -2504,7 +2466,6 @@ noinline void btrfs_unlock_up_safe(struct btrfs_path *path, int level)
> >>>  		btrfs_set_path_blocking(p);
> >>>  		reada_for_balance(fs_info, p, level);
> >>>  		sret = balance_level(trans, root, p, level);
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		if (sret) {
> >>>  			ret = sret;
> >>> @@ -2789,7 +2750,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  		}
> >>>  cow_done:
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>> +		/*
> >>> +		 * Leave the path with blocking locks to avoid massive
> >>> +		 * lock context switches; this is done on purpose.
> >>> +		 */
> >>>  
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -2871,8 +2835,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  					if (!err) {
> >>>  						btrfs_set_path_blocking(p);
> >>>  						btrfs_tree_lock(b);
> >>> -						btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_WRITE_LOCK);
> >>>  					}
> >>>  					p->locks[level] = BTRFS_WRITE_LOCK;
> >>>  				} else {
> >>> @@ -2880,8 +2842,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  					if (!err) {
> >>>  						btrfs_set_path_blocking(p);
> >>>  						btrfs_tree_read_lock(b);
> >>> -						btrfs_clear_path_blocking(p, b,
> >>> -								  BTRFS_READ_LOCK);
> >>>  					}
> >>>  					p->locks[level] = BTRFS_READ_LOCK;
> >>>  				}
> >>> @@ -2900,7 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>>  				btrfs_set_path_blocking(p);
> >>>  				err = split_leaf(trans, root, key,
> >>>  						 p, ins_len, ret == 0);
> >>> -				btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  				BUG_ON(err > 0);
> >>>  				if (err) {
> >>> @@ -2967,7 +2926,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  	while (b) {
> >>>  		level = btrfs_header_level(b);
> >>>  		p->nodes[level] = b;
> >>> -		btrfs_clear_path_blocking(p, NULL, 0);
> >>>  
> >>>  		/*
> >>>  		 * we have a lock on b and as long as we aren't changing
> >>> @@ -3013,8 +2971,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
> >>>  			if (!err) {
> >>>  				btrfs_set_path_blocking(p);
> >>>  				btrfs_tree_read_lock(b);
> >>> -				btrfs_clear_path_blocking(p, b,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			b = tree_mod_log_rewind(fs_info, p, b, time_seq);
> >>>  			if (!b) {
> >>> @@ -5198,7 +5154,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
> >>>  		path->locks[level - 1] = BTRFS_READ_LOCK;
> >>>  		path->nodes[level - 1] = cur;
> >>>  		unlock_up(path, level, 1, 0, NULL);
> >>> -		btrfs_clear_path_blocking(path, NULL, 0);
> >>>  	}
> >>>  out:
> >>>  	path->keep_locks = keep_locks;
> >>> @@ -5783,8 +5738,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> @@ -5820,8 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
> >>>  			if (!ret) {
> >>>  				btrfs_set_path_blocking(path);
> >>>  				btrfs_tree_read_lock(next);
> >>> -				btrfs_clear_path_blocking(path, next,
> >>> -							  BTRFS_READ_LOCK);
> >>>  			}
> >>>  			next_rw_lock = BTRFS_READ_LOCK;
> >>>  		}
> >>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> >>> index 318be7864072..1aeed3c0e949 100644
> >>> --- a/fs/btrfs/ctree.h
> >>> +++ b/fs/btrfs/ctree.h
> >>> @@ -2876,8 +2876,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
> >>>  struct btrfs_path *btrfs_alloc_path(void);
> >>>  void btrfs_free_path(struct btrfs_path *p);
> >>>  void btrfs_set_path_blocking(struct btrfs_path *p);
> >>> -void btrfs_clear_path_blocking(struct btrfs_path *p,
> >>> -			       struct extent_buffer *held, int held_rw);
> >>>  void btrfs_unlock_up_safe(struct btrfs_path *p, int level);
> >>>  
> >>>  int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
> >>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> >>> index f51b509f2d9b..db9f45082c61 100644
> >>> --- a/fs/btrfs/delayed-inode.c
> >>> +++ b/fs/btrfs/delayed-inode.c
> >>> @@ -762,9 +762,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
> >>>  		i++;
> >>>  	}
> >>>  
> >>> -	/* reset all the locked nodes in the patch to spinning locks. */
> >>> -	btrfs_clear_path_blocking(path, NULL, 0);
> >>> -
> >>>  	/* insert the keys of the items */
> >>>  	setup_items_for_insert(root, path, keys, data_size,
> >>>  			       total_data_size, total_size, nitems);
> >>>
> > 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-16 21:07 [PATCH 0/2] address lock contention of btree root Liu Bo
  2018-08-16 21:07 ` [PATCH 1/2] Btrfs: kill btrfs_clear_path_blocking Liu Bo
  2018-08-16 21:07 ` [PATCH 2/2] Btrfs: kill leave_spinning Liu Bo
@ 2018-08-21 17:54 ` Chris Mason
  2018-08-21 18:15   ` Liu Bo
  2 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2018-08-21 17:54 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

On 16 Aug 2018, at 17:07, Liu Bo wrote:

> The lock contention on btree nodes (esp. root node) is apparently a
> bottleneck when there're multiple readers and writers concurrently
> trying to access them.  Unfortunately this is by design and it's not
> easy to fix it unless with some complex changes, however, there is
> still some room.
>
> With a stable workload based on fsmark which has 16 threads creating
> 1,600K files, we could see that a good amount of overhead comes from
> switching path between spinning mode and blocking mode in
> btrfs_search_slot().
>
> Patch 1 provides more details about the overhead and test results from
> fsmark and dbench.
> Patch 2 kills leave_spinning due to the behaviour change from patch 1.

This is really interesting, do you have numbers about how often we are 
able to stay spinning?

IOW, do we end up blocking every time?

-chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 17:54 ` [PATCH 0/2] address lock contention of btree root Chris Mason
@ 2018-08-21 18:15   ` Liu Bo
  2018-08-21 18:35     ` Holger Hoffstätte
  2018-08-21 18:48     ` Chris Mason
  0 siblings, 2 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-21 18:15 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> On 16 Aug 2018, at 17:07, Liu Bo wrote:
> 
> > The lock contention on btree nodes (esp. root node) is apparently a
> > bottleneck when there're multiple readers and writers concurrently
> > trying to access them.  Unfortunately this is by design and it's not
> > easy to fix it unless with some complex changes, however, there is
> > still some room.
> > 
> > With a stable workload based on fsmark which has 16 threads creating
> > 1,600K files, we could see that a good amount of overhead comes from
> > switching path between spinning mode and blocking mode in
> > btrfs_search_slot().
> > 
> > Patch 1 provides more details about the overhead and test results from
> > fsmark and dbench.
> > Patch 2 kills leave_spinning due to the behaviour change from patch 1.
> 
> This is really interesting, do you have numbers about how often we are able
> to stay spinning?
>

I didn't gather how long we stay spinning, but I was tracking

(1) how long a read lock or write lock can wait until it gets the
lock; with the vanilla kernel, a read lock could wait up to 10ms while
a write lock could wait up to 200ms.

(2) how many switches occur while readers/writers wait for blocking
locks; for both readers and writers, there can be up to 50 switches.

but I guess it highly depends on the test environment; I'm using a
24-core VM and the test dispatches 16 threads to do the job.

> IOW, do we end up blocking every time?

With leave_spinning set, we don't; this part works fine.

I just realized that patch 2 can result in a softlockup, as
btrfs_search_slot() may return a path with all nodes held in spinning
mode, and if the callers want to sleep, we're in trouble.  I've
removed patch 2 and am re-running the tests (xfstests, fsmark and
dbench).

thanks,
-liubo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 18:15   ` Liu Bo
@ 2018-08-21 18:35     ` Holger Hoffstätte
  2018-08-21 21:59       ` Liu Bo
  2018-08-21 18:48     ` Chris Mason
  1 sibling, 1 reply; 16+ messages in thread
From: Holger Hoffstätte @ 2018-08-21 18:35 UTC (permalink / raw)
  To: bo.liu; +Cc: linux-btrfs

On 08/21/18 20:15, Liu Bo wrote:
> I just realized that patch 2 can result in a softlockup, as
> btrfs_search_slot() may return a path with all nodes held in spinning
> mode, and if the callers want to sleep, we're in trouble.  I've
> removed patch 2 and am re-running the tests (xfstests, fsmark and
> dbench).

You mean like this, when trying to balance? :)
Got it only once so far, subsequent attempts worked. Otherwise everything
seems fine.

-h

kernel: BTRFS info (device sdc1): relocating block group 4128424067072 flags data
kernel: BTRFS info (device sdc1): found 1706 extents
kernel: INFO: rcu_sched self-detected stall on CPU
kernel: ^I3-....: (17999 ticks this GP) idle=f5e/1/4611686018427387906 softirq=269430/269430 fqs=5999
kernel: ^I (t=18000 jiffies g=232869 c=232868 q=4365)
kernel: NMI backtrace for cpu 3
kernel: CPU: 3 PID: 4287 Comm: kworker/u8:0 Not tainted 4.18.3 #1
kernel: Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
kernel: Call Trace:
kernel:  <IRQ>
kernel:  dump_stack+0x46/0x60
kernel:  nmi_cpu_backtrace.cold.0+0x13/0x57
kernel:  ? lapic_can_unplug_cpu.cold.5+0x34/0x34
kernel:  nmi_trigger_cpumask_backtrace+0x8f/0x91
kernel:  rcu_dump_cpu_stacks+0x87/0xb2
kernel:  rcu_check_callbacks.cold.59+0x2ac/0x430
kernel:  ? tick_sched_handle.isra.6+0x40/0x40
kernel:  update_process_times+0x28/0x60
kernel:  tick_sched_handle.isra.6+0x35/0x40
kernel:  tick_sched_timer+0x3b/0x80
kernel:  __hrtimer_run_queues+0xfe/0x270
kernel:  hrtimer_interrupt+0xf4/0x210
kernel:  smp_apic_timer_interrupt+0x56/0x110
kernel:  apic_timer_interrupt+0xf/0x20
kernel:  </IRQ>
kernel: RIP: 0010:queued_write_lock_slowpath+0x4a/0x80
kernel: Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01 00 00 ba ff 00 00 00 b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03 <3d> 00 01 00 00 75 f5 89 c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43
kernel: RSP: 0018:ffffc9000040fc40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
kernel: RAX: 0000000000000300 RBX: ffff88071abf86f0 RCX: 0000000000000100
kernel: RDX: 00000000000000ff RSI: ffff880000000000 RDI: ffff88071abf86f0
kernel: RBP: 0000000000000000 R08: ffff88071abf8690 R09: ffff880724ebeb58
kernel: R10: ffff880724ebeb80 R11: 0000000000000000 R12: 0000000000000001
kernel: R13: ffff8803d0db5f54 R14: 0000160000000000 R15: 0000000000000006
kernel:  btrfs_try_tree_write_lock+0x23/0x60 [btrfs]
kernel:  btrfs_search_slot+0x2df/0x970 [btrfs]
kernel:  btrfs_mark_extent_written+0xb0/0xac0 [btrfs]
kernel:  ? kmem_cache_alloc+0x1a5/0x1b0
kernel:  btrfs_finish_ordered_io+0x2e2/0x7a0 [btrfs]
kernel:  normal_work_helper+0xad/0x2c0 [btrfs]
kernel:  process_one_work+0x1e3/0x390
kernel:  worker_thread+0x2d/0x3c0
kernel:  ? process_one_work+0x390/0x390
kernel:  kthread+0x111/0x130
kernel:  ? kthread_flush_work_fn+0x10/0x10
kernel:  ret_from_fork+0x1f/0x30

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 18:15   ` Liu Bo
  2018-08-21 18:35     ` Holger Hoffstätte
@ 2018-08-21 18:48     ` Chris Mason
  2018-08-21 22:23       ` Liu Bo
  2018-08-22 21:38       ` Liu Bo
  1 sibling, 2 replies; 16+ messages in thread
From: Chris Mason @ 2018-08-21 18:48 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

On 21 Aug 2018, at 14:15, Liu Bo wrote:

> On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
>> On 16 Aug 2018, at 17:07, Liu Bo wrote:
>>
>>> The lock contention on btree nodes (esp. root node) is apparently a
>>> bottleneck when there're multiple readers and writers concurrently
>>> trying to access them.  Unfortunately this is by design and it's not
>>> easy to fix it unless with some complex changes, however, there is
>>> still some room.
>>>
>>> With a stable workload based on fsmark which has 16 threads creating
>>> 1,600K files, we could see that a good amount of overhead comes from
>>> switching path between spinning mode and blocking mode in
>>> btrfs_search_slot().
>>>
>>> Patch 1 provides more details about the overhead and test results
>>> from fsmark and dbench.
>>> Patch 2 kills leave_spinning due to the behaviour change from patch 1.
>>
>> This is really interesting, do you have numbers about how often we
>> are able to stay spinning?
>>
>
> I didn't gather how long we stay spinning,

I'm less worried about length of time spinning than I am how often we're 
able to call btrfs_search_slot() without ever blocking.  If one caller 
ends up going into blocking mode, it can cascade into all of the other 
callers, which can slow things down in low-to-medium contention 
workloads.

The flip side is that maybe the adaptive spinning in the mutexes is good
enough now and we can just delete the spinning completely.

> but I was tracking
>
> (1) how long a read lock or write lock can wait until it gets the
> lock; with the vanilla kernel, a read lock could wait up to 10ms while
> a write lock could wait up to 200ms.

Nice, please add the stats to the patch descriptions, both before and 
after.  I'd love to see a histogram like you can get out of bcc's 
argdist.py.

-chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 18:35     ` Holger Hoffstätte
@ 2018-08-21 21:59       ` Liu Bo
  0 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-21 21:59 UTC (permalink / raw)
  To: Holger Hoffstätte; +Cc: linux-btrfs

On Tue, Aug 21, 2018 at 08:35:30PM +0200, Holger Hoffstätte wrote:
> On 08/21/18 20:15, Liu Bo wrote:
> > I just realized that patch 2 can result in a softlockup, as
> > btrfs_search_slot() may return a path with all nodes held in spinning
> > mode, and if the callers want to sleep, we're in trouble.  I've
> > removed patch 2 and am re-running the tests (xfstests, fsmark and
> > dbench).
> 
> You mean like this, when trying to balance? :)
> Got it only once so far, subsequent attempts worked. Otherwise everything
> seems fine.

Exactly, it didn't occur every time though, but thanks a lot for
testing it thoroughly.

thanks,
-liubo


> 
> -h
> 
> kernel: BTRFS info (device sdc1): relocating block group 4128424067072 flags data
> kernel: BTRFS info (device sdc1): found 1706 extents
> kernel: INFO: rcu_sched self-detected stall on CPU
> kernel: ^I3-....: (17999 ticks this GP) idle=f5e/1/4611686018427387906 softirq=269430/269430 fqs=5999
> kernel: ^I (t=18000 jiffies g=232869 c=232868 q=4365)
> kernel: NMI backtrace for cpu 3
> kernel: CPU: 3 PID: 4287 Comm: kworker/u8:0 Not tainted 4.18.3 #1
> kernel: Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013
> kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> kernel: Call Trace:
> kernel:  <IRQ>
> kernel:  dump_stack+0x46/0x60
> kernel:  nmi_cpu_backtrace.cold.0+0x13/0x57
> kernel:  ? lapic_can_unplug_cpu.cold.5+0x34/0x34
> kernel:  nmi_trigger_cpumask_backtrace+0x8f/0x91
> kernel:  rcu_dump_cpu_stacks+0x87/0xb2
> kernel:  rcu_check_callbacks.cold.59+0x2ac/0x430
> kernel:  ? tick_sched_handle.isra.6+0x40/0x40
> kernel:  update_process_times+0x28/0x60
> kernel:  tick_sched_handle.isra.6+0x35/0x40
> kernel:  tick_sched_timer+0x3b/0x80
> kernel:  __hrtimer_run_queues+0xfe/0x270
> kernel:  hrtimer_interrupt+0xf4/0x210
> kernel:  smp_apic_timer_interrupt+0x56/0x110
> kernel:  apic_timer_interrupt+0xf/0x20
> kernel:  </IRQ>
> kernel: RIP: 0010:queued_write_lock_slowpath+0x4a/0x80
> kernel: Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01 00 00 ba ff 00 00 00 b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03 <3d> 00 01 00 00 75 f5 89 c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43
> kernel: RSP: 0018:ffffc9000040fc40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> kernel: RAX: 0000000000000300 RBX: ffff88071abf86f0 RCX: 0000000000000100
> kernel: RDX: 00000000000000ff RSI: ffff880000000000 RDI: ffff88071abf86f0
> kernel: RBP: 0000000000000000 R08: ffff88071abf8690 R09: ffff880724ebeb58
> kernel: R10: ffff880724ebeb80 R11: 0000000000000000 R12: 0000000000000001
> kernel: R13: ffff8803d0db5f54 R14: 0000160000000000 R15: 0000000000000006
> kernel:  btrfs_try_tree_write_lock+0x23/0x60 [btrfs]
> kernel:  btrfs_search_slot+0x2df/0x970 [btrfs]
> kernel:  btrfs_mark_extent_written+0xb0/0xac0 [btrfs]
> kernel:  ? kmem_cache_alloc+0x1a5/0x1b0
> kernel:  btrfs_finish_ordered_io+0x2e2/0x7a0 [btrfs]
> kernel:  normal_work_helper+0xad/0x2c0 [btrfs]
> kernel:  process_one_work+0x1e3/0x390
> kernel:  worker_thread+0x2d/0x3c0
> kernel:  ? process_one_work+0x390/0x390
> kernel:  kthread+0x111/0x130
> kernel:  ? kthread_flush_work_fn+0x10/0x10
> kernel:  ret_from_fork+0x1f/0x30

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] Btrfs: kill leave_spinning
  2018-08-16 21:07 ` [PATCH 2/2] Btrfs: kill leave_spinning Liu Bo
@ 2018-08-21 22:19   ` Liu Bo
  0 siblings, 0 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-21 22:19 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

Please kindly ignore this patch, as it did introduce a soft lockup when
btrfs_search_slot() returns a spinning path and its callers go to
sleep before releasing the path.
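
The problematic pattern is roughly (just a sketch):

	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
	/*
	 * Without the btrfs_set_path_blocking() at done:, the locks in
	 * @path may still be spinning ones here (e.g. nothing got COWed),
	 * so if the caller now sleeps (IO, memory allocation, ...),
	 * everyone else keeps spinning on eb->lock, which is what the
	 * queued_write_lock_slowpath() stall in Holger's trace shows.
	 */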

thanks,
liubo


On Thu, Aug 16, 2018 at 2:07 PM, Liu Bo <bo.liu@linux.alibaba.com> wrote:
> As btrfs_clear_path_blocking() turned out to be a major source of lock
> contention, we've killed it, and without it btrfs_search_slot() and
> btrfs_search_old_slot() are no longer able to return a path in spinning
> mode, so let's kill leave_spinning, too.
>
> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
> ---
>  fs/btrfs/backref.c            |  3 ---
>  fs/btrfs/ctree.c              | 16 +++-------------
>  fs/btrfs/ctree.h              |  1 -
>  fs/btrfs/delayed-inode.c      |  4 ----
>  fs/btrfs/dir-item.c           |  1 -
>  fs/btrfs/export.c             |  1 -
>  fs/btrfs/extent-tree.c        |  7 -------
>  fs/btrfs/extent_io.c          |  1 -
>  fs/btrfs/file-item.c          |  4 ----
>  fs/btrfs/free-space-tree.c    |  2 --
>  fs/btrfs/inode-item.c         |  6 ------
>  fs/btrfs/inode.c              |  8 --------
>  fs/btrfs/ioctl.c              |  3 ---
>  fs/btrfs/qgroup.c             |  2 --
>  fs/btrfs/super.c              |  2 --
>  fs/btrfs/tests/qgroup-tests.c |  4 ----
>  16 files changed, 3 insertions(+), 62 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index ae750b1574a2..70c629b90710 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -1613,13 +1613,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
>         s64 bytes_left = ((s64)size) - 1;
>         struct extent_buffer *eb = eb_in;
>         struct btrfs_key found_key;
> -       int leave_spinning = path->leave_spinning;
>         struct btrfs_inode_ref *iref;
>
>         if (bytes_left >= 0)
>                 dest[bytes_left] = '\0';
>
> -       path->leave_spinning = 1;
>         while (1) {
>                 bytes_left -= name_len;
>                 if (bytes_left >= 0)
> @@ -1665,7 +1663,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
>         }
>
>         btrfs_release_path(path);
> -       path->leave_spinning = leave_spinning;
>
>         if (ret)
>                 return ERR_PTR(ret);
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 8b31caa60b0a..d2df7cfbec06 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -2875,14 +2875,10 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>         }
>         ret = 1;
>  done:
> -       /*
> -        * we don't really know what they plan on doing with the path
> -        * from here on, so for now just mark it as blocking
> -        */
> -       if (!p->leave_spinning)
> -               btrfs_set_path_blocking(p);
>         if (ret < 0 && !p->skip_release_on_error)
>                 btrfs_release_path(p);
> +
> +       /* path is supposed to be in blocking mode from now on. */
>         return ret;
>  }
>
> @@ -2987,11 +2983,10 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
>         }
>         ret = 1;
>  done:
> -       if (!p->leave_spinning)
> -               btrfs_set_path_blocking(p);
>         if (ret < 0)
>                 btrfs_release_path(p);
>
> +       /* path is supposed to be in blocking mode from now on. */
>         return ret;
>  }
>
> @@ -5628,7 +5623,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>         struct btrfs_key key;
>         u32 nritems;
>         int ret;
> -       int old_spinning = path->leave_spinning;
>         int next_rw_lock = 0;
>
>         nritems = btrfs_header_nritems(path->nodes[0]);
> @@ -5643,7 +5637,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>         btrfs_release_path(path);
>
>         path->keep_locks = 1;
> -       path->leave_spinning = 1;
>
>         if (time_seq)
>                 ret = btrfs_search_old_slot(root, &key, path, time_seq);
> @@ -5780,9 +5773,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
>         ret = 0;
>  done:
>         unlock_up(path, 0, 1, 0, NULL);
> -       path->leave_spinning = old_spinning;
> -       if (!old_spinning)
> -               btrfs_set_path_blocking(path);
>
>         return ret;
>  }
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1aeed3c0e949..e8bddf21b7f7 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -339,7 +339,6 @@ struct btrfs_path {
>         unsigned int search_for_split:1;
>         unsigned int keep_locks:1;
>         unsigned int skip_locking:1;
> -       unsigned int leave_spinning:1;
>         unsigned int search_commit_root:1;
>         unsigned int need_commit_sem:1;
>         unsigned int skip_release_on_error:1;
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index db9f45082c61..e6fbcdbc313e 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -1138,7 +1138,6 @@ static int __btrfs_run_delayed_items(struct btrfs_trans_handle *trans, int nr)
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -       path->leave_spinning = 1;
>
>         block_rsv = trans->block_rsv;
>         trans->block_rsv = &fs_info->delayed_block_rsv;
> @@ -1203,7 +1202,6 @@ int btrfs_commit_inode_delayed_items(struct btrfs_trans_handle *trans,
>                 btrfs_release_delayed_node(delayed_node);
>                 return -ENOMEM;
>         }
> -       path->leave_spinning = 1;
>
>         block_rsv = trans->block_rsv;
>         trans->block_rsv = &delayed_node->root->fs_info->delayed_block_rsv;
> @@ -1248,7 +1246,6 @@ int btrfs_commit_inode_delayed_inode(struct btrfs_inode *inode)
>                 ret = -ENOMEM;
>                 goto trans_out;
>         }
> -       path->leave_spinning = 1;
>
>         block_rsv = trans->block_rsv;
>         trans->block_rsv = &fs_info->delayed_block_rsv;
> @@ -1317,7 +1314,6 @@ static void btrfs_async_run_delayed_root(struct btrfs_work *work)
>                 if (!delayed_node)
>                         break;
>
> -               path->leave_spinning = 1;
>                 root = delayed_node->root;
>
>                 trans = btrfs_join_transaction(root);
> diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
> index a678b07fcf01..4f27d6ba2f1e 100644
> --- a/fs/btrfs/dir-item.c
> +++ b/fs/btrfs/dir-item.c
> @@ -127,7 +127,6 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -       path->leave_spinning = 1;
>
>         btrfs_cpu_key_to_disk(&disk_key, location);
>
> diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
> index 1f3755b3a37a..7f410edd92e4 100644
> --- a/fs/btrfs/export.c
> +++ b/fs/btrfs/export.c
> @@ -245,7 +245,6 @@ static int btrfs_get_name(struct dentry *parent, char *name,
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -       path->leave_spinning = 1;
>
>         if (ino == BTRFS_FIRST_FREE_OBJECTID) {
>                 key.objectid = BTRFS_I(inode)->root->root_key.objectid;
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index de6f75f5547b..8d9ca923db93 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2116,7 +2116,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>                 return -ENOMEM;
>
>         path->reada = READA_FORWARD;
> -       path->leave_spinning = 1;
>         /* this will setup the path even if it fails to insert the back ref */
>         ret = insert_inline_extent_backref(trans, path, bytenr, num_bytes,
>                                            parent, root_objectid, owner,
> @@ -2141,7 +2140,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
>         btrfs_release_path(path);
>
>         path->reada = READA_FORWARD;
> -       path->leave_spinning = 1;
>         /* now insert the actual backref */
>         ret = insert_extent_backref(trans, path, bytenr, parent, root_objectid,
>                                     owner, offset, refs_to_add);
> @@ -2251,7 +2249,6 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
>
>  again:
>         path->reada = READA_FORWARD;
> -       path->leave_spinning = 1;
>         ret = btrfs_search_slot(trans, fs_info->extent_root, &key, path, 0, 1);
>         if (ret < 0) {
>                 err = ret;
> @@ -6700,7 +6697,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>                 return -ENOMEM;
>
>         path->reada = READA_FORWARD;
> -       path->leave_spinning = 1;
>
>         is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
>         BUG_ON(!is_data && refs_to_drop != 1);
> @@ -6743,7 +6739,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
>                                 goto out;
>                         }
>                         btrfs_release_path(path);
> -                       path->leave_spinning = 1;
>
>                         key.objectid = bytenr;
>                         key.type = BTRFS_EXTENT_ITEM_KEY;
> @@ -7896,7 +7891,6 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
>                                       ins, size);
>         if (ret) {
> @@ -7985,7 +7979,6 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
>                 return -ENOMEM;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
>                                       &extent_key, size);
>         if (ret) {
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 628f1aef34b0..481a4bbc49fd 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4444,7 +4444,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -       path->leave_spinning = 1;
>
>         start = round_down(start, btrfs_inode_sectorsize(inode));
>         len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index ba74827beb32..8e25953ed901 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -45,7 +45,6 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
>         file_key.offset = pos;
>         file_key.type = BTRFS_EXTENT_DATA_KEY;
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(trans, root, path, &file_key,
>                                       sizeof(*item));
>         if (ret < 0)
> @@ -598,7 +597,6 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
>                 key.offset = end_byte - 1;
>                 key.type = BTRFS_EXTENT_CSUM_KEY;
>
> -               path->leave_spinning = 1;
>                 ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>                 if (ret > 0) {
>                         if (path->slots[0] == 0)
> @@ -874,10 +872,8 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
>         } else {
>                 ins_size = csum_size;
>         }
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(trans, root, path, &file_key,
>                                       ins_size);
> -       path->leave_spinning = 0;
>         if (ret < 0)
>                 goto fail_unlock;
>         if (WARN_ON(ret != 0))
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index d6736595ec57..bccc33bb453c 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -1188,8 +1188,6 @@ static int clear_free_space_tree(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
> -
>         key.objectid = 0;
>         key.type = 0;
>         key.offset = 0;
> diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
> index a8956a3c9e05..97a7b4de1033 100644
> --- a/fs/btrfs/inode-item.c
> +++ b/fs/btrfs/inode-item.c
> @@ -129,8 +129,6 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
> -
>         ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>         if (ret > 0)
>                 ret = -ENOENT;
> @@ -203,8 +201,6 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
> -
>         ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>         if (ret > 0) {
>                 ret = -ENOENT;
> @@ -278,7 +274,6 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(trans, root, path, &key,
>                                       ins_len);
>         if (ret == -EEXIST) {
> @@ -335,7 +330,6 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
>         path->skip_release_on_error = 1;
>         ret = btrfs_insert_empty_item(trans, root, path, &key,
>                                       ins_len);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9357a19d2bff..8b135a46835f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -191,7 +191,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
>                 key.type = BTRFS_EXTENT_DATA_KEY;
>
>                 datasize = btrfs_file_extent_calc_inline_size(cur_size);
> -               path->leave_spinning = 1;
>                 ret = btrfs_insert_empty_item(trans, root, path, &key,
>                                               datasize);
>                 if (ret)
> @@ -2236,7 +2235,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
>                 ins.offset = file_pos;
>                 ins.type = BTRFS_EXTENT_DATA_KEY;
>
> -               path->leave_spinning = 1;
>                 ret = btrfs_insert_empty_item(trans, root, path, &ins,
>                                               sizeof(*fi));
>                 if (ret)
> @@ -2668,7 +2666,6 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
>         key.type = BTRFS_EXTENT_DATA_KEY;
>         key.offset = start;
>
> -       path->leave_spinning = 1;
>         if (merge) {
>                 struct btrfs_file_extent_item *fi;
>                 u64 extent_len;
> @@ -2739,7 +2736,6 @@ static noinline int relink_extent_backref(struct btrfs_path *path,
>         ret = 1;
>  out_free_path:
>         btrfs_release_path(path);
> -       path->leave_spinning = 0;
>         btrfs_end_transaction(trans);
>  out_unlock:
>         unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start, lock_end,
> @@ -3833,7 +3829,6 @@ static noinline int btrfs_update_inode_item(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
>         ret = btrfs_lookup_inode(trans, root, path, &BTRFS_I(inode)->location,
>                                  1);
>         if (ret) {
> @@ -3924,7 +3919,6 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
>                 goto out;
>         }
>
> -       path->leave_spinning = 1;
>         di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
>                                     name, name_len, -1);
>         if (IS_ERR(di)) {
> @@ -4584,7 +4578,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
>                 goto out;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
>         if (ret < 0)
>                 goto out;
> @@ -6300,7 +6293,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
>                 goto fail;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_items(trans, root, path, key, sizes, nitems);
>         if (ret != 0)
>                 goto fail_unlock;
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index d3a5d2a41e5f..1de051f529f9 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3899,7 +3899,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>                  * note the key will change type as we walk through the
>                  * tree.
>                  */
> -               path->leave_spinning = 1;
>                 ret = btrfs_search_slot(NULL, BTRFS_I(src)->root, &key, path,
>                                 0, 0);
>                 if (ret < 0)
> @@ -3981,7 +3980,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
>                                            size);
>
>                         btrfs_release_path(path);
> -                       path->leave_spinning = 0;
>
>                         memcpy(&new_key, &key, sizeof(new_key));
>                         new_key.objectid = btrfs_ino(BTRFS_I(inode));
> @@ -4371,7 +4369,6 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
>                 ret = -ENOMEM;
>                 goto out;
>         }
> -       path->leave_spinning = 1;
>
>         trans = btrfs_start_transaction(root, 1);
>         if (IS_ERR(trans)) {
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 4353bb69bb86..58e859593bbb 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -844,8 +844,6 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
>         if (!path)
>                 return -ENOMEM;
>
> -       path->leave_spinning = 1;
> -
>         key.objectid = 0;
>         key.offset = 0;
>         key.type = 0;
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 6601c9aa5e35..4aa0b4f61ffd 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1015,7 +1015,6 @@ static char *get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
>                 ret = -ENOMEM;
>                 goto err;
>         }
> -       path->leave_spinning = 1;
>
>         name = kmalloc(PATH_MAX, GFP_KERNEL);
>         if (!name) {
> @@ -1143,7 +1142,6 @@ static int get_default_subvol_objectid(struct btrfs_fs_info *fs_info, u64 *objec
>         path = btrfs_alloc_path();
>         if (!path)
>                 return -ENOMEM;
> -       path->leave_spinning = 1;
>
>         /*
>          * Find the "default" dir item which points to the root item that we
> diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
> index 412b910b04cc..91233a0d4d62 100644
> --- a/fs/btrfs/tests/qgroup-tests.c
> +++ b/fs/btrfs/tests/qgroup-tests.c
> @@ -36,7 +36,6 @@ static int insert_normal_tree_ref(struct btrfs_root *root, u64 bytenr,
>                 return -ENOMEM;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_insert_empty_item(&trans, root, path, &ins, size);
>         if (ret) {
>                 test_err("couldn't insert ref %d", ret);
> @@ -86,7 +85,6 @@ static int add_tree_ref(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
>                 return -ENOMEM;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
>         if (ret) {
>                 test_err("couldn't find extent ref");
> @@ -135,7 +133,6 @@ static int remove_extent_item(struct btrfs_root *root, u64 bytenr,
>                 test_err("couldn't allocate path");
>                 return -ENOMEM;
>         }
> -       path->leave_spinning = 1;
>
>         ret = btrfs_search_slot(&trans, root, &key, path, -1, 1);
>         if (ret) {
> @@ -170,7 +167,6 @@ static int remove_extent_ref(struct btrfs_root *root, u64 bytenr,
>                 return -ENOMEM;
>         }
>
> -       path->leave_spinning = 1;
>         ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
>         if (ret) {
>                 test_err("couldn't find extent ref");
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 18:48     ` Chris Mason
@ 2018-08-21 22:23       ` Liu Bo
  2018-08-22 21:38       ` Liu Bo
  1 sibling, 0 replies; 16+ messages in thread
From: Liu Bo @ 2018-08-21 22:23 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, Aug 21, 2018 at 02:48:47PM -0400, Chris Mason wrote:
> On 21 Aug 2018, at 14:15, Liu Bo wrote:
> 
> > On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> > > On 16 Aug 2018, at 17:07, Liu Bo wrote:
> > > 
> > > > The lock contention on btree nodes (esp. root node) is apparently a
> > > > bottleneck when there're multiple readers and writers concurrently
> > > > trying to access them.  Unfortunately this is by design and it's not
> > > > easy to fix it unless with some complex changes, however, there is
> > > > still some room.
> > > > 
> > > > With a stable workload based on fsmark which has 16 threads creating
> > > > 1,600K files, we could see that a good amount of overhead comes from
> > > > switching path between spinning mode and blocking mode in
> > > > btrfs_search_slot().
> > > > 
> > > > Patch 1 provides more details about the overhead and test
> > > > results from fsmark and dbench.
> > > > Patch 2 kills leave_spinning due to the behaviour change from
> > > > patch 1.
> > > 
> > > This is really interesting, do you have numbers about how often we
> > > are able to stay spinning?
> > > 
> > 
> > I didn't gather how long we stay spinning,
> 
> I'm less worried about length of time spinning than I am how often we're
> able to call btrfs_search_slot() without ever blocking.  If one caller ends
> up going into blocking mode, it can cascade into all of the other callers,
> which can slow things down in low-to-medium contention workloads.

Right, these blocking cases could be COW, split_node, split_leaf, etc.
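
For anyone following along, a rough sketch of that pattern (abridged and
simplified, not the verbatim code; the condition names here are made up):

  /*
   * Abridged sketch of the spin -> blocking transition in
   * btrfs_search_slot() (simplified from a 4.18-era fs/btrfs/ctree.c;
   * error handling and most branches omitted).
   */
  if (cow) {
          /* COW may allocate and dirty blocks, which can sleep */
          btrfs_set_path_blocking(p);
          err = btrfs_cow_block(trans, root, b, p->nodes[level + 1],
                                p->slots[level + 1], &b);
  }
  /* ... */
  if (node_needs_split) {
          btrfs_set_path_blocking(p);
          err = split_node(trans, root, p, level); /* split_leaf() for leaves */
  }

Once one caller flips its path to blocking like this, everyone queued on
those nodes takes the slow path too, which is the cascade you describe.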

> 
> The flip side is that maybe the adaptive spinning in the mutexes is good
> enough now and we can just delete the spinning completely.
>

Looks like it.

> > but I was tracking
> > 
> > (1) how long a read lock or write lock can wait until it gets the
> > lock, with vanilla kernel, for read lock, it could be up to 10ms while
> > for write lock, it could be up to 200ms.
> 
> Nice, please add the stats to the patch descriptions, both before and after.
> I'd love to see a histogram like you can get out of bcc's argdist.py.

Just gave another run of testing with bcc's funclatency tool: for the
write lock, the worst latency could be over 500ms, though the kprobe
itself may add some overhead to the measured latency.
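
For the record, the measurement was along these lines (an assumed
example, not the exact command I ran; btrfs_tree_lock is my guess at the
traced symbol for the write lock, and -m asks funclatency for a
millisecond histogram):

  # histogram of btrfs tree write-lock latency, in ms, over 30 seconds
  /usr/share/bcc/tools/funclatency -m -d 30 btrfs_tree_lock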

thanks,
-liubo

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-21 18:48     ` Chris Mason
  2018-08-21 22:23       ` Liu Bo
@ 2018-08-22 21:38       ` Liu Bo
  2018-08-23  5:46         ` Nikolay Borisov
  1 sibling, 1 reply; 16+ messages in thread
From: Liu Bo @ 2018-08-22 21:38 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On Tue, Aug 21, 2018 at 02:48:47PM -0400, Chris Mason wrote:
> On 21 Aug 2018, at 14:15, Liu Bo wrote:
> 
> > On Tue, Aug 21, 2018 at 01:54:11PM -0400, Chris Mason wrote:
> > > On 16 Aug 2018, at 17:07, Liu Bo wrote:
> > > 
> > > > The lock contention on btree nodes (esp. root node) is apparently a
> > > > bottleneck when there're multiple readers and writers concurrently
> > > > trying to access them.  Unfortunately this is by design and it's not
> > > > easy to fix it unless with some complex changes, however, there is
> > > > still some room.
> > > > 
> > > > With a stable workload based on fsmark which has 16 threads creating
> > > > 1,600K files, we could see that a good amount of overhead comes from
> > > > switching path between spinning mode and blocking mode in
> > > > btrfs_search_slot().
> > > > 
> > > > Patch 1 provides more details about the overhead and test
> > > > results from
> > > > fsmark and dbench.
> > > > Patch 2 kills leave_spinning due to the behaviour change from
> > > > patch 1.
> > > 
> > > This is really interesting, do you have numbers about how often we
> > > are able
> > > to stay spinning?
> > > 
> > 
> > I didn't gather how long we stay spinning,
> 
> I'm less worried about length of time spinning than I am how often we're
> able to call btrfs_search_slot() without ever blocking.  If one caller ends
> up going into blocking mode, it can cascade into all of the other callers,
> which can slow things down in low-to-medium contention workloads.
> 
> The flip side is that maybe the adaptive spinning in the mutexes is good
> enough now and we can just delete the spinning completely.
>

hmm, looks like the current mutex with adaptive spinning doesn't offer
a read/write version, meaning we're not able to simply drop the rwlock.

thanks,
-liubo

> > but I was tracking
> > 
> > (1) how long a read lock or write lock can wait until it gets the
> > lock, with vanilla kernel, for read lock, it could be up to 10ms while
> > for write lock, it could be up to 200ms.
> 
> Nice, please add the stats to the patch descriptions, both before and after.
> I'd love to see a histogram like you can get out of bcc's argdist.py.
> 
> -chris

* Re: [PATCH 0/2] address lock contention of btree root
  2018-08-22 21:38       ` Liu Bo
@ 2018-08-23  5:46         ` Nikolay Borisov
  0 siblings, 0 replies; 16+ messages in thread
From: Nikolay Borisov @ 2018-08-23  5:46 UTC (permalink / raw)
  To: bo.liu, Chris Mason; +Cc: linux-btrfs



On 23.08.2018 00:38, Liu Bo wrote:
> hmm, looks like the current mutex with adaptive spinning doesn't offer
> a read/write version, meaning we're not able to simply drop the rwlock.


What about rw_semaphores?
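
A minimal sketch of what that would look like at the call sites
(illustrative only, assuming the per-extent_buffer rwlock could be
swapped for an rw_semaphore; the names below are made up for the demo):

  #include <linux/rwsem.h>

  static DECLARE_RWSEM(demo_lock); /* sleeping rwlock; writers can spin optimistically */

  static void demo(void)
  {
          down_read(&demo_lock);   /* shared: concurrent readers, may sleep */
          up_read(&demo_lock);

          down_write(&demo_lock);  /* exclusive: single writer, may sleep */
          up_write(&demo_lock);
  }

Unlike the spinning read_lock()/write_lock() we have now, both paths can
sleep, so callers that take the lock in non-sleeping context would need
auditing first.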
