linux-btrfs.vger.kernel.org archive mirror
* [PATCH 00/17] Convert to an rwsem for our tree locking
@ 2020-08-10 15:42 Josef Bacik
  2020-08-10 15:42 ` [PATCH 01/17] btrfs: drop path before adding new uuid tree entry Josef Bacik
                   ` (17 more replies)
  0 siblings, 18 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

tl;dr:
- Cleaner code.
- dbench 200 3x as fast.
- fio with < nr_cpus is 10% slower.
- fio with 2 * nr_cpus is 10% faster.
- fs_mark is about the same (results have a 10% variance between runs).

When Chris originally wrote the locking code for Btrfs, the performance of
rwsem was atrocious compared to his locking scheme.  In btrfs we can
usually spin on most accesses, because generally we won't need to block.
If we do need to block we can simply flip over to a blocking lock.  This
flexibility has meant that our locking scheme generally outperforms the
standard locking primitives.  We also have a unique use case (the space
cache) where we need to be able to recurse a tree lock.  It is safe to do
this because we're just reading, and we would already be holding the write
lock, but this is not something that the generic locking infrastructure
provides.

There are a few downsides:

1) The code is complicated.  This generally isn't an issue because we don't
really have to touch the code that often.

2) We lose lockdep help.  As you can see, the first 4 patches in this
series fix existing deadlocks that were caught by having lockdep properly
working with our tree locking.

3) We actually perform really poorly in the highly contended case.
Because we don't do the things proper locks do, like handling
writer/reader starvation, we have a series of waitqueues that all get
woken up when we unlock a blocking lock.  This means our context switch
counts can be stupidly high, as we'll wake everybody up and whoever wins
the race gets the lock.  Implementing proper exclusive waitqueues is a
possibility, but it would further complicate the code and likely introduce
problems we'd spend the next 6 months figuring out.
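
To make the wakeup problem concrete, unlocking a blocking write lock boils
down to something like this (heavily simplified sketch, not the exact
code):

    /* Drop the blocking write lock ... */
    eb->blocking_writers = 0;
    /*
     * ... and wake every waiter sleeping on the queue.  They all race
     * for the lock, one wins, and the rest context switch right back
     * to sleep.
     */
    wake_up(&eb->write_lock_wq);

An rwsem, by contrast, only wakes the waiter(s) that can actually take the
lock next, which is where the context switch savings shown below come
from.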

It's #3 that actually brought me to this patch series, as changing to an
rwsem significantly improved a use case I was investigating.  However, the
rwsem isn't without drawbacks: in less highly contended cases we can be
slower than baseline, around 9-10% on my test box.  In the very highly
contended cases (> nr_cpus) we are the same, if not better.

Filipe and I have discussed this a bit over the last few days, and we're
both on the fence about it currently.  On one hand we could likely make up
the lost performance in other areas, and there are certainly things that
could be investigated to see why exactly we're slower with the rwsem and
maybe figure out a way to fix that.  We also gain a lot with this new
code.  But 9-10% isn't nothing, so it needs to be taken into
consideration.

The following is the set of benchmarks that I've run on my machine with 80
CPUs, 256 GiB of RAM, and a 2 TiB NVMe drive.  If there's anything else
people would like to see I can easily do more A/B testing.

fio ran N threads against 1 GiB files with a 64 KiB block size and an
fsync after every write.  fs_mark just created empty files with 16
threads.

                        PATCHED         UNPATCHED       % DIFF
dbench 200              699.011 MB/s    213.568 MB/s    +227%
fs_mark                 223 seconds     197 seconds     -11.65%
fio 64 threads          562 MB/s        624 MB/s        -9.9%
fio 100 threads         566 MB/s        566 MB/s        0.0%
fio 160 threads         593 MB/s        576 MB/s        +2.95%

Another thing not shown by the raw numbers is the number of context
switches.  Because of the nature of our locking we wake up everything
constantly, so our context switch counts for the fio jobs are consistently
2-3x higher with the old scheme vs the rwsem.  This is why my particular
workload performed so poorly: the context switching was actually quite
painful for that workload.

The diffstat is as follows

 fs/btrfs/backref.c            |  13 +-
 fs/btrfs/ctree.c              | 168 ++++++---------
 fs/btrfs/ctree.h              |  10 +-
 fs/btrfs/delayed-inode.c      |  11 -
 fs/btrfs/dir-item.c           |   1 -
 fs/btrfs/disk-io.c            |  13 +-
 fs/btrfs/export.c             |   1 -
 fs/btrfs/extent-tree.c        |  41 ++--
 fs/btrfs/extent_io.c          |  24 +--
 fs/btrfs/extent_io.h          |  27 +--
 fs/btrfs/file-item.c          |   4 -
 fs/btrfs/file.c               |   3 +-
 fs/btrfs/free-space-tree.c    |   2 -
 fs/btrfs/inode-item.c         |   6 -
 fs/btrfs/inode.c              |  13 +-
 fs/btrfs/ioctl.c              |  31 ++-
 fs/btrfs/locking.c            | 469 +++++++-----------------------------------
 fs/btrfs/locking.h            |  80 ++++++-
 fs/btrfs/print-tree.c         |  11 +-
 fs/btrfs/qgroup.c             |  11 +-
 fs/btrfs/ref-verify.c         |   6 +-
 fs/btrfs/reflink.c            |   3 -
 fs/btrfs/relocation.c         |  15 +-
 fs/btrfs/scrub.c              | 122 ++++++-----
 fs/btrfs/super.c              |   2 -
 fs/btrfs/tests/qgroup-tests.c |   4 -
 fs/btrfs/transaction.c        |   7 +-
 fs/btrfs/tree-defrag.c        |   1 -
 fs/btrfs/tree-log.c           |   3 -
 fs/btrfs/volumes.c            |   8 +-
 30 files changed, 357 insertions(+), 753 deletions(-)

Thanks,

Josef




* [PATCH 01/17] btrfs: drop path before adding new uuid tree entry
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 16:28   ` Filipe Manana
  2020-08-10 15:42 ` [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl Josef Bacik
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With the conversion to an rwsem I got the following lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
------------------------------------------------------
btrfs-uuid/7955 is trying to acquire lock:
ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

but task is already holding lock:
ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (btrfs-uuid-00){++++}-{3:3}:
       down_read_nested+0x3e/0x140
       __btrfs_tree_read_lock+0x39/0x180
       __btrfs_read_lock_root_node+0x3a/0x50
       btrfs_search_slot+0x4bd/0x990
       btrfs_uuid_tree_add+0x89/0x2d0
       btrfs_uuid_scan_kthread+0x330/0x390
       kthread+0x133/0x150
       ret_from_fork+0x1f/0x30

-> #0 (btrfs-root-00){++++}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       down_read_nested+0x3e/0x140
       __btrfs_tree_read_lock+0x39/0x180
       __btrfs_read_lock_root_node+0x3a/0x50
       btrfs_search_slot+0x4bd/0x990
       btrfs_find_root+0x45/0x1b0
       btrfs_read_tree_root+0x61/0x100
       btrfs_get_root_ref.part.50+0x143/0x630
       btrfs_uuid_tree_iterate+0x207/0x314
       btrfs_uuid_rescan_kthread+0x12/0x50
       kthread+0x133/0x150
       ret_from_fork+0x1f/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(btrfs-uuid-00);
                               lock(btrfs-root-00);
                               lock(btrfs-uuid-00);
  lock(btrfs-root-00);

 *** DEADLOCK ***

1 lock held by btrfs-uuid/7955:
 #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

stack backtrace:
CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 lock_acquire+0x9e/0x360
 ? __btrfs_tree_read_lock+0x39/0x180
 ? btrfs_root_node+0x1c/0x1d0
 down_read_nested+0x3e/0x140
 ? __btrfs_tree_read_lock+0x39/0x180
 __btrfs_tree_read_lock+0x39/0x180
 __btrfs_read_lock_root_node+0x3a/0x50
 btrfs_search_slot+0x4bd/0x990
 btrfs_find_root+0x45/0x1b0
 btrfs_read_tree_root+0x61/0x100
 btrfs_get_root_ref.part.50+0x143/0x630
 btrfs_uuid_tree_iterate+0x207/0x314
 ? btree_readpage+0x20/0x20
 btrfs_uuid_rescan_kthread+0x12/0x50
 kthread+0x133/0x150
 ? kthread_create_on_node+0x60/0x60
 ret_from_fork+0x1f/0x30

This problem exists because we have two different rescan threads:
btrfs_uuid_scan_kthread, which creates the uuid tree, and
btrfs_uuid_tree_iterate, which goes through and updates or deletes any out
of date roots.  The problem is they each take these locks in a different
order.  btrfs_uuid_scan_kthread() reads the tree_root and then inserts
entries into the uuid_root.  btrfs_uuid_tree_iterate() scans the
uuid_root, but then does a btrfs_get_fs_root() which can read from the
tree_root.

It's actually easy enough to not be holding the path in
btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
it further down and re-start the search when we loop.  So simply move
the path release before we add our entry to the uuid tree.

This also fixes a problem where we're holding a path open after we do
btrfs_end_transaction(), which has its own problems.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d7670e2a9f39..3ac44dad58bb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4462,6 +4462,7 @@ int btrfs_uuid_scan_kthread(void *data)
 			goto skip;
 		}
 update_tree:
+		btrfs_release_path(path);
 		if (!btrfs_is_empty_uuid(root_item.uuid)) {
 			ret = btrfs_uuid_tree_add(trans, root_item.uuid,
 						  BTRFS_UUID_KEY_SUBVOL,
@@ -4486,6 +4487,7 @@ int btrfs_uuid_scan_kthread(void *data)
 		}
 
 skip:
+		btrfs_release_path(path);
 		if (trans) {
 			ret = btrfs_end_transaction(trans);
 			trans = NULL;
@@ -4493,7 +4495,6 @@ int btrfs_uuid_scan_kthread(void *data)
 				break;
 		}
 
-		btrfs_release_path(path);
 		if (key.offset < (u64)-1) {
 			key.offset++;
 		} else if (key.type < BTRFS_ROOT_ITEM_KEY) {
-- 
2.24.1



* [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
  2020-08-10 15:42 ` [PATCH 01/17] btrfs: drop path before adding new uuid tree entry Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 16:45   ` Filipe Manana
  2020-08-10 15:42 ` [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices Josef Bacik
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When I converted the locking over to using an rwsem I got the following
lockdep splat:

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
------------------------------------------------------
compsize/11122 is trying to acquire lock:
ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90

but task is already holding lock:
ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (btrfs-fs-00){++++}-{3:3}:
       down_write_nested+0x3b/0x70
       __btrfs_tree_lock+0x24/0x120
       btrfs_search_slot+0x756/0x990
       btrfs_lookup_inode+0x3a/0xb4
       __btrfs_update_delayed_inode+0x93/0x270
       btrfs_async_run_delayed_root+0x168/0x230
       btrfs_work_helper+0xd4/0x570
       process_one_work+0x2ad/0x5f0
       worker_thread+0x3a/0x3d0
       kthread+0x133/0x150
       ret_from_fork+0x1f/0x30

-> #1 (&delayed_node->mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_delayed_update_inode+0x50/0x440
       btrfs_update_inode+0x8a/0xf0
       btrfs_dirty_inode+0x5b/0xd0
       touch_atime+0xa1/0xd0
       btrfs_file_mmap+0x3f/0x60
       mmap_region+0x3a4/0x640
       do_mmap+0x376/0x580
       vm_mmap_pgoff+0xd5/0x120
       ksys_mmap_pgoff+0x193/0x230
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&mm->mmap_lock#2){++++}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       copy_to_sk.isra.32+0x121/0x300
       search_ioctl+0x106/0x200
       btrfs_ioctl_tree_search_v2+0x7b/0xf0
       btrfs_ioctl+0x106f/0x30a0
       ksys_ioctl+0x83/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
  &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(btrfs-fs-00);
                               lock(&delayed_node->mutex);
                               lock(btrfs-fs-00);
  lock(&mm->mmap_lock#2);

 *** DEADLOCK ***

1 lock held by compsize/11122:
 #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180

stack backtrace:
CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 lock_acquire+0x9e/0x360
 ? __might_fault+0x3e/0x90
 ? find_held_lock+0x72/0x90
 __might_fault+0x68/0x90
 ? __might_fault+0x3e/0x90
 _copy_to_user+0x1e/0x80
 copy_to_sk.isra.32+0x121/0x300
 ? btrfs_search_forward+0x2a6/0x360
 search_ioctl+0x106/0x200
 btrfs_ioctl_tree_search_v2+0x7b/0xf0
 btrfs_ioctl+0x106f/0x30a0
 ? __do_sys_newfstat+0x5a/0x70
 ? ksys_ioctl+0x83/0xc0
 ksys_ioctl+0x83/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x50/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The problem is we're doing a copy_to_user() while holding tree locks,
which can deadlock if we have to page fault during the copy_to_user().
This exists even without my locking changes, so it needs to be fixed.
Rework the search ioctl to pre-fault the userspace buffer and then use
copy_to_user_nofault() for the copying.
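
Pieced together from the hunks below, the resulting pattern is roughly
(sketch only, error handling trimmed):

    while (1) {
            /* Fault the user buffer in while we hold no tree locks. */
            ret = fault_in_pages_writeable(ubuf, *buf_size - sk_offset);
            if (ret)
                    break;

            /* Tree locks are taken in here ... */
            ret = btrfs_search_forward(root, &key, path, sk->min_transid);

            /*
             * ... and copy_to_sk() only uses the _nofault copy variants.
             * If one of them fails we bail out with ret == 0, drop the
             * locks, and the next loop iteration pre-faults the pages
             * before trying again.
             */
    }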

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c |  8 ++++----
 fs/btrfs/extent_io.h |  6 +++---
 fs/btrfs/ioctl.c     | 27 ++++++++++++++++++++-------
 3 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 617ea38e6fd7..c15ab6c1897f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5653,9 +5653,9 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
 	}
 }
 
-int read_extent_buffer_to_user(const struct extent_buffer *eb,
-			       void __user *dstv,
-			       unsigned long start, unsigned long len)
+int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
+				       void __user *dstv,
+				       unsigned long start, unsigned long len)
 {
 	size_t cur;
 	size_t offset;
@@ -5675,7 +5675,7 @@ int read_extent_buffer_to_user(const struct extent_buffer *eb,
 
 		cur = min(len, (PAGE_SIZE - offset));
 		kaddr = page_address(page);
-		if (copy_to_user(dst, kaddr + offset, cur)) {
+		if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
 			ret = -EFAULT;
 			break;
 		}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 00a88f2eb5ab..30794ae58498 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -241,9 +241,9 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
 void read_extent_buffer(const struct extent_buffer *eb, void *dst,
 			unsigned long start,
 			unsigned long len);
-int read_extent_buffer_to_user(const struct extent_buffer *eb,
-			       void __user *dst, unsigned long start,
-			       unsigned long len);
+int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
+				       void __user *dst, unsigned long start,
+				       unsigned long len);
 void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *src);
 void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
 		const void *src);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index bd3511c5ca81..39448556f635 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2086,9 +2086,14 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		sh.len = item_len;
 		sh.transid = found_transid;
 
-		/* copy search result header */
-		if (copy_to_user(ubuf + *sk_offset, &sh, sizeof(sh))) {
-			ret = -EFAULT;
+		/*
+		 * copy search result header, if we fault loop again so we can
+		 * fault in the pages and -EFAULT there if there's a problem,
+		 * otherwise we'll fault and then copy the buffer in properly
+		 * this next time through
+		 */
+		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
+			ret = 0;
 			goto out;
 		}
 
@@ -2096,10 +2101,14 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 
 		if (item_len) {
 			char __user *up = ubuf + *sk_offset;
-			/* copy the item */
-			if (read_extent_buffer_to_user(leaf, up,
-						       item_off, item_len)) {
-				ret = -EFAULT;
+			/* copy the item, same behavior as above, but reset the
+			 * sk_offset so we copy the full thing again.
+			*/
+			if (read_extent_buffer_to_user_nofault(leaf, up,
+							       item_off,
+							       item_len)) {
+				ret = 0;
+				*sk_offset -= sizeof(sh);
 				goto out;
 			}
 
@@ -2184,6 +2193,10 @@ static noinline int search_ioctl(struct inode *inode,
 	key.offset = sk->min_offset;
 
 	while (1) {
+		ret = fault_in_pages_writeable(ubuf, *buf_size - sk_offset);
+		if (ret)
+			break;
+
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
 		if (ret != 0) {
 			if (ret > 0)
-- 
2.24.1



* [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
  2020-08-10 15:42 ` [PATCH 01/17] btrfs: drop path before adding new uuid tree entry Josef Bacik
  2020-08-10 15:42 ` [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-11 10:53   ` Filipe Manana
  2020-08-14 14:11   ` David Sterba
  2020-08-10 15:42 ` [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks Josef Bacik
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The following lockdep splat was seen in testing:

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
------------------------------------------------------
fsstress/8739 is trying to acquire lock:
ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70

but task is already holding lock:
ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #10 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x129/0x210
       btrfs_page_mkwrite+0x6a/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

-> #9 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       perf_read+0x141/0x2c0
       vfs_read+0xad/0x1b0
       ksys_read+0x5f/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #8 (&cpuctx_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x88/0x150
       perf_event_init+0x1db/0x20b
       start_kernel+0x3ae/0x53c
       secondary_startup_64+0xa4/0xb0

-> #7 (pmus_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x4f/0x150
       cpuhp_invoke_callback+0xb1/0x900
       _cpu_up.constprop.26+0x9f/0x130
       cpu_up+0x7b/0xc0
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x71
       kernel_init_freeable+0x110/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #6 (cpu_hotplug_lock){++++}-{0:0}:
       cpus_read_lock+0x39/0xb0
       kmem_cache_create_usercopy+0x28/0x230
       kmem_cache_create+0x12/0x20
       bioset_init+0x15e/0x2b0
       init_bio+0xa3/0xaa
       do_one_initcall+0x5a/0x2e0
       kernel_init_freeable+0x1f4/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #5 (bio_slab_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       bioset_init+0xbc/0x2b0
       __blk_alloc_queue+0x6f/0x2d0
       blk_mq_init_queue_data+0x1b/0x70
       loop_add+0x110/0x290 [loop]
       fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
       do_one_initcall+0x5a/0x2e0
       do_init_module+0x5a/0x220
       load_module+0x2459/0x26e0
       __do_sys_finit_module+0xba/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #4 (loop_ctl_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       lo_open+0x18/0x50 [loop]
       __blkdev_get+0xec/0x570
       blkdev_get+0xe8/0x150
       do_dentry_open+0x167/0x410
       path_openat+0x7c9/0xa80
       do_filp_open+0x93/0x100
       do_sys_openat2+0x22a/0x2e0
       do_sys_open+0x4b/0x80
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       blkdev_put+0x1d/0x120
       close_fs_devices.part.31+0x84/0x130
       btrfs_close_devices+0x44/0xb0
       close_ctree+0x296/0x2b2
       generic_shutdown_super+0x69/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_run_dev_stats+0x49/0x480
       commit_cowonly_roots+0xb5/0x2a0
       btrfs_commit_transaction+0x516/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_commit_transaction+0x4bb/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       __mutex_lock+0x9f/0x930
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       file_update_time+0xc8/0x110
       btrfs_page_mkwrite+0x10c/0x4a0
       do_page_mkwrite+0x4d/0xc0
       handle_mm_fault+0x103c/0x1730
       exc_page_fault+0x340/0x660
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(sb_pagefaults);
                               lock(&mm->mmap_lock#2);
                               lock(sb_pagefaults);
  lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by fsstress/8739:
 #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0

stack backtrace:
CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 ? btrfs_get_alloc_profile+0x150/0x210
 lock_acquire+0x9e/0x360
 ? btrfs_record_root_in_trans+0x43/0x70
 __mutex_lock+0x9f/0x930
 ? btrfs_record_root_in_trans+0x43/0x70
 ? lock_acquire+0x9e/0x360
 ? join_transaction+0x5d/0x450
 ? find_held_lock+0x2d/0x90
 ? btrfs_record_root_in_trans+0x43/0x70
 ? join_transaction+0x3d5/0x450
 ? btrfs_record_root_in_trans+0x43/0x70
 btrfs_record_root_in_trans+0x43/0x70
 start_transaction+0xd1/0x5d0
 btrfs_dirty_inode+0x42/0xd0
 file_update_time+0xc8/0x110
 btrfs_page_mkwrite+0x10c/0x4a0
 ? handle_mm_fault+0x5e/0x1730
 do_page_mkwrite+0x4d/0xc0
 ? __do_fault+0x32/0x150
 handle_mm_fault+0x103c/0x1730
 exc_page_fault+0x340/0x660
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7faa6c9969c4

The fix is similar to that of

  btrfs: open device without device_list_mutex

where we're holding the device_list_mutex and then grab the bd_mutex,
which pulls in a bunch of dependencies under the bd_mutex.  We only ever
call btrfs_close_devices() on mount failure or unmount, so we're safe to
not hold the device_list_mutex here.  We're already holding the
uuid_mutex, which keeps us safe from any external modification of the
fs_devices.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/volumes.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3ac44dad58bb..97ec9c0a91aa 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1155,11 +1155,8 @@ static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
 	if (--fs_devices->opened > 0)
 		return 0;
 
-	mutex_lock(&fs_devices->device_list_mutex);
-	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
+	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list)
 		btrfs_close_one_device(device);
-	}
-	mutex_unlock(&fs_devices->device_list_mutex);
 
 	WARN_ON(fs_devices->open_devices);
 	WARN_ON(fs_devices->rw_devices);
-- 
2.24.1



* [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (2 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-11 11:22   ` Filipe Manana
  2020-08-10 15:42 ` [PATCH 05/17] btrfs: set the correct lockdep class for new nodes Josef Bacik
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

I got the following lockdep splat while testing

======================================================
WARNING: possible circular locking dependency detected
5.8.0-rc7-00172-g021118712e59 #932 Not tainted
------------------------------------------------------
btrfs/229626 is trying to acquire lock:
ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450

but task is already holding lock:
ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_scrub_dev+0x11c/0x630
       btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
       btrfs_ioctl+0x2799/0x30a0
       ksys_ioctl+0x83/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_run_dev_stats+0x49/0x480
       commit_cowonly_roots+0xb5/0x2a0
       btrfs_commit_transaction+0x516/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_commit_transaction+0x4bb/0xa60
       sync_filesystem+0x6b/0x90
       generic_shutdown_super+0x22/0x100
       kill_anon_super+0xe/0x30
       btrfs_kill_super+0x12/0x20
       deactivate_locked_super+0x29/0x60
       cleanup_mnt+0xb8/0x140
       task_work_run+0x6d/0xb0
       __prepare_exit_to_usermode+0x1cc/0x1e0
       do_syscall_64+0x5c/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       btrfs_record_root_in_trans+0x43/0x70
       start_transaction+0xd1/0x5d0
       btrfs_dirty_inode+0x42/0xd0
       touch_atime+0xa1/0xd0
       btrfs_file_mmap+0x3f/0x60
       mmap_region+0x3a4/0x640
       do_mmap+0x376/0x580
       vm_mmap_pgoff+0xd5/0x120
       ksys_mmap_pgoff+0x193/0x230
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #3 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x68/0x90
       _copy_to_user+0x1e/0x80
       perf_read+0x141/0x2c0
       vfs_read+0xad/0x1b0
       ksys_read+0x5f/0xe0
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

-> #2 (&cpuctx_mutex){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x88/0x150
       perf_event_init+0x1db/0x20b
       start_kernel+0x3ae/0x53c
       secondary_startup_64+0xa4/0xb0

-> #1 (pmus_lock){+.+.}-{3:3}:
       __mutex_lock+0x9f/0x930
       perf_event_init_cpu+0x4f/0x150
       cpuhp_invoke_callback+0xb1/0x900
       _cpu_up.constprop.26+0x9f/0x130
       cpu_up+0x7b/0xc0
       bringup_nonboot_cpus+0x4f/0x60
       smp_init+0x26/0x71
       kernel_init_freeable+0x110/0x258
       kernel_init+0xa/0x103
       ret_from_fork+0x1f/0x30

-> #0 (cpu_hotplug_lock){++++}-{0:0}:
       __lock_acquire+0x1272/0x2310
       lock_acquire+0x9e/0x360
       cpus_read_lock+0x39/0xb0
       alloc_workqueue+0x378/0x450
       __btrfs_alloc_workqueue+0x15d/0x200
       btrfs_alloc_workqueue+0x51/0x160
       scrub_workers_get+0x5a/0x170
       btrfs_scrub_dev+0x18c/0x630
       btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
       btrfs_ioctl+0x2799/0x30a0
       ksys_ioctl+0x83/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x50/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

other info that might help us debug this:

Chain exists of:
  cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&fs_info->scrub_lock);
                               lock(&fs_devs->device_list_mutex);
                               lock(&fs_info->scrub_lock);
  lock(cpu_hotplug_lock);

 *** DEADLOCK ***

2 locks held by btrfs/229626:
 #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
 #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630

stack backtrace:
CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
Call Trace:
 dump_stack+0x78/0xa0
 check_noncircular+0x165/0x180
 __lock_acquire+0x1272/0x2310
 lock_acquire+0x9e/0x360
 ? alloc_workqueue+0x378/0x450
 cpus_read_lock+0x39/0xb0
 ? alloc_workqueue+0x378/0x450
 alloc_workqueue+0x378/0x450
 ? rcu_read_lock_sched_held+0x52/0x80
 __btrfs_alloc_workqueue+0x15d/0x200
 btrfs_alloc_workqueue+0x51/0x160
 scrub_workers_get+0x5a/0x170
 btrfs_scrub_dev+0x18c/0x630
 ? start_transaction+0xd1/0x5d0
 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
 btrfs_ioctl+0x2799/0x30a0
 ? do_sigaction+0x102/0x250
 ? lockdep_hardirqs_on_prepare+0xca/0x160
 ? _raw_spin_unlock_irq+0x24/0x30
 ? trace_hardirqs_on+0x1c/0xe0
 ? _raw_spin_unlock_irq+0x24/0x30
 ? do_sigaction+0x102/0x250
 ? ksys_ioctl+0x83/0xc0
 ksys_ioctl+0x83/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x50/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because we're allocating the scrub workqueues under the scrub
and device list mutexes, which brings in a whole host of other
dependencies.  Fix this by moving the actual allocation outside of the
scrub lock, and only take the lock once we're ready to actually assign the
workqueues to the fs_info.  We'll now have to clean up the workqueues in a
few more places, so I've added a helper to do the refcount dance and
safely free them.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/scrub.c | 122 +++++++++++++++++++++++++++--------------------
 1 file changed, 70 insertions(+), 52 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 5a6cb9db512e..7f7098e3110e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3716,50 +3716,84 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
 	return 0;
 }
 
+static void scrub_workers_put(struct btrfs_fs_info *fs_info)
+{
+	if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt,
+					&fs_info->scrub_lock)) {
+		struct btrfs_workqueue *scrub_workers = NULL;
+		struct btrfs_workqueue *scrub_wr_comp = NULL;
+		struct btrfs_workqueue *scrub_parity = NULL;
+
+		scrub_workers = fs_info->scrub_workers;
+		scrub_wr_comp = fs_info->scrub_wr_completion_workers;
+		scrub_parity = fs_info->scrub_parity_workers;
+
+		fs_info->scrub_workers = NULL;
+		fs_info->scrub_wr_completion_workers = NULL;
+		fs_info->scrub_parity_workers = NULL;
+		mutex_unlock(&fs_info->scrub_lock);
+
+		btrfs_destroy_workqueue(scrub_workers);
+		btrfs_destroy_workqueue(scrub_wr_comp);
+		btrfs_destroy_workqueue(scrub_parity);
+	}
+}
+
 /*
  * get a reference count on fs_info->scrub_workers. start worker if necessary
  */
 static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
 						int is_dev_replace)
 {
+	struct btrfs_workqueue *scrub_workers = NULL;
+	struct btrfs_workqueue *scrub_wr_comp = NULL;
+	struct btrfs_workqueue *scrub_parity = NULL;
 	unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
 	int max_active = fs_info->thread_pool_size;
+	int ret = -ENOMEM;
 
-	lockdep_assert_held(&fs_info->scrub_lock);
+	if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
+		return 0;
 
-	if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
-		ASSERT(fs_info->scrub_workers == NULL);
-		fs_info->scrub_workers = btrfs_alloc_workqueue(fs_info, "scrub",
-				flags, is_dev_replace ? 1 : max_active, 4);
-		if (!fs_info->scrub_workers)
-			goto fail_scrub_workers;
-
-		ASSERT(fs_info->scrub_wr_completion_workers == NULL);
-		fs_info->scrub_wr_completion_workers =
-			btrfs_alloc_workqueue(fs_info, "scrubwrc", flags,
-					      max_active, 2);
-		if (!fs_info->scrub_wr_completion_workers)
-			goto fail_scrub_wr_completion_workers;
+	scrub_workers = btrfs_alloc_workqueue(fs_info, "scrub", flags,
+					      is_dev_replace ? 1 : max_active,
+					      4);
+	if (!scrub_workers)
+		goto fail_scrub_workers;
 
-		ASSERT(fs_info->scrub_parity_workers == NULL);
-		fs_info->scrub_parity_workers =
-			btrfs_alloc_workqueue(fs_info, "scrubparity", flags,
+	scrub_wr_comp = btrfs_alloc_workqueue(fs_info, "scrubwrc", flags,
 					      max_active, 2);
-		if (!fs_info->scrub_parity_workers)
-			goto fail_scrub_parity_workers;
+	if (!scrub_wr_comp)
+		goto fail_scrub_wr_completion_workers;
 
+	scrub_parity = btrfs_alloc_workqueue(fs_info, "scrubparity", flags,
+					     max_active, 2);
+	if (!scrub_parity)
+		goto fail_scrub_parity_workers;
+
+	mutex_lock(&fs_info->scrub_lock);
+	if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
+		ASSERT(fs_info->scrub_workers == NULL &&
+		       fs_info->scrub_wr_completion_workers == NULL &&
+		       fs_info->scrub_parity_workers == NULL);
+		fs_info->scrub_workers = scrub_workers;
+		fs_info->scrub_wr_completion_workers = scrub_wr_comp;
+		fs_info->scrub_parity_workers = scrub_parity;
 		refcount_set(&fs_info->scrub_workers_refcnt, 1);
-	} else {
-		refcount_inc(&fs_info->scrub_workers_refcnt);
+		mutex_unlock(&fs_info->scrub_lock);
+		return 0;
 	}
-	return 0;
+	refcount_inc(&fs_info->scrub_workers_refcnt);
+	mutex_unlock(&fs_info->scrub_lock);
 
+	ret = 0;
+	btrfs_destroy_workqueue(scrub_parity);
 fail_scrub_parity_workers:
-	btrfs_destroy_workqueue(fs_info->scrub_wr_completion_workers);
+	btrfs_destroy_workqueue(scrub_wr_comp);
 fail_scrub_wr_completion_workers:
-	btrfs_destroy_workqueue(fs_info->scrub_workers);
+	btrfs_destroy_workqueue(scrub_workers);
 fail_scrub_workers:
-	return -ENOMEM;
+	return ret;
 }
 
 int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
@@ -3770,9 +3804,6 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	int ret;
 	struct btrfs_device *dev;
 	unsigned int nofs_flag;
-	struct btrfs_workqueue *scrub_workers = NULL;
-	struct btrfs_workqueue *scrub_wr_comp = NULL;
-	struct btrfs_workqueue *scrub_parity = NULL;
 
 	if (btrfs_fs_closing(fs_info))
 		return -EAGAIN;
@@ -3819,13 +3850,17 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 	if (IS_ERR(sctx))
 		return PTR_ERR(sctx);
 
+	ret = scrub_workers_get(fs_info, is_dev_replace);
+	if (ret)
+		goto out_free_ctx;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	dev = btrfs_find_device(fs_info->fs_devices, devid, NULL, NULL, true);
 	if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
 		     !is_dev_replace)) {
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		ret = -ENODEV;
-		goto out_free_ctx;
+		goto out;
 	}
 
 	if (!is_dev_replace && !readonly &&
@@ -3834,7 +3869,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
 				rcu_str_deref(dev->name));
 		ret = -EROFS;
-		goto out_free_ctx;
+		goto out;
 	}
 
 	mutex_lock(&fs_info->scrub_lock);
@@ -3843,7 +3878,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		ret = -EIO;
-		goto out_free_ctx;
+		goto out;
 	}
 
 	down_read(&fs_info->dev_replace.rwsem);
@@ -3854,17 +3889,10 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 		mutex_unlock(&fs_info->scrub_lock);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		ret = -EINPROGRESS;
-		goto out_free_ctx;
+		goto out;
 	}
 	up_read(&fs_info->dev_replace.rwsem);
 
-	ret = scrub_workers_get(fs_info, is_dev_replace);
-	if (ret) {
-		mutex_unlock(&fs_info->scrub_lock);
-		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
-		goto out_free_ctx;
-	}
-
 	sctx->readonly = readonly;
 	dev->scrub_ctx = sctx;
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
@@ -3917,24 +3945,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
 
 	mutex_lock(&fs_info->scrub_lock);
 	dev->scrub_ctx = NULL;
-	if (refcount_dec_and_test(&fs_info->scrub_workers_refcnt)) {
-		scrub_workers = fs_info->scrub_workers;
-		scrub_wr_comp = fs_info->scrub_wr_completion_workers;
-		scrub_parity = fs_info->scrub_parity_workers;
-
-		fs_info->scrub_workers = NULL;
-		fs_info->scrub_wr_completion_workers = NULL;
-		fs_info->scrub_parity_workers = NULL;
-	}
 	mutex_unlock(&fs_info->scrub_lock);
 
-	btrfs_destroy_workqueue(scrub_workers);
-	btrfs_destroy_workqueue(scrub_wr_comp);
-	btrfs_destroy_workqueue(scrub_parity);
+	scrub_workers_put(fs_info);
 	scrub_put_ctx(sctx);
 
 	return ret;
-
+out:
+	scrub_workers_put(fs_info);
 out_free_ctx:
 	scrub_free_ctx(sctx);
 
-- 
2.24.1



* [PATCH 05/17] btrfs: set the correct lockdep class for new nodes
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (3 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-11 11:25   ` Filipe Manana
  2020-08-10 15:42 ` [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers Josef Bacik
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When flipping over to the rw_semaphore I noticed I'd get a lockdep splat
in replace_path(), which is weird because we're swapping the reloc root
with the actual target root.  It turns out this is because we're using
root->root_key.objectid as the root id for the newly allocated tree block
when setting the lockdep class; however, we need to be using the actual
owner of this new block, which is passed in via the owner parameter.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa7d83051587..84029851f094 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4522,7 +4522,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		return ERR_PTR(-EUCLEAN);
 	}
 
-	btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
+	btrfs_set_buffer_lockdep_class(owner, buf, level);
 	btrfs_tree_lock(buf);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
-- 
2.24.1



* [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (4 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 05/17] btrfs: set the correct lockdep class for new nodes Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-11 11:28   ` Filipe Manana
  2020-08-10 15:42 ` [PATCH 07/17] btrfs: rename eb->lock_nested to eb->lock_recursed Josef Bacik
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

These are special extent buffers that get rewound in order to look up the
state of the tree at a specific point in time.  As such they do not go
through the normal initialization paths that set their lockdep class, so
handle them appropriately when they are created and before they are
locked.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 70e49d8d4f6c..cbacf700b68b 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1297,6 +1297,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 	btrfs_tree_read_unlock_blocking(eb);
 	free_extent_buffer(eb);
 
+	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb_rewin),
+				       eb_rewin, btrfs_header_level(eb_rewin));
 	btrfs_tree_read_lock(eb_rewin);
 	__tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
 	WARN_ON(btrfs_header_nritems(eb_rewin) >
@@ -1370,7 +1372,6 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 
 	if (!eb)
 		return NULL;
-	btrfs_tree_read_lock(eb);
 	if (old_root) {
 		btrfs_set_header_bytenr(eb, eb->start);
 		btrfs_set_header_backref_rev(eb, BTRFS_MIXED_BACKREF_REV);
@@ -1378,6 +1379,9 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 		btrfs_set_header_level(eb, old_root->level);
 		btrfs_set_header_generation(eb, old_generation);
 	}
+	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb), eb,
+				       btrfs_header_level(eb));
+	btrfs_tree_read_lock(eb);
 	if (tm)
 		__tree_mod_log_rewind(fs_info, eb, time_seq, tm);
 	else
-- 
2.24.1



* [PATCH 07/17] btrfs: rename eb->lock_nested to eb->lock_recursed
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (5 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 08/17] btrfs: introduce path->recurse Josef Bacik
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Nested locking, to lockdep and everything else, refers to lock hierarchy
within the same lock map.  This is how we indicate that the same locks for
different objects are ok to take in a specific order; for our use case
that would be taking the lock on a leaf and then taking a lock on an
adjacent leaf.

What ->lock_nested _actually_ refers to is whether we already happen to be
holding the write lock on the extent buffer and we're allowing a read lock
to be taken on that same extent buffer, which is recursion.  Rename this
so we don't get confused when we switch to an rwsem and have to start
using the _nested helpers.
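
To make the distinction concrete (illustrative only, not code from this
series):

    /*
     * "Nesting": the same lock class on two different extent buffers,
     * taken in a defined order, e.g. a leaf and the adjacent leaf.
     */
    btrfs_tree_read_lock(leaf);
    btrfs_tree_read_lock(next_leaf);

    /*
     * "Recursion": the task that already holds the write lock takes a
     * read lock on the very same extent buffer.  This is what
     * ->lock_recursed (formerly ->lock_nested) tracks.
     */
    btrfs_tree_lock(eb);
    btrfs_tree_read_lock(eb);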

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c |  2 +-
 fs/btrfs/extent_io.h |  2 +-
 fs/btrfs/locking.c   | 22 +++++++++++-----------
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c15ab6c1897f..16b4e7655a96 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4988,7 +4988,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	rwlock_init(&eb->lock);
 	atomic_set(&eb->blocking_readers, 0);
 	eb->blocking_writers = 0;
-	eb->lock_nested = false;
+	eb->lock_recursed = false;
 	init_waitqueue_head(&eb->write_lock_wq);
 	init_waitqueue_head(&eb->read_lock_wq);
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 30794ae58498..9e1e22f1586a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -102,7 +102,7 @@ struct extent_buffer {
 
 	int blocking_writers;
 	atomic_t blocking_readers;
-	bool lock_nested;
+	bool lock_recursed;
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	short log_index;
 
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index f75612e18a82..8e3d107a6192 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -57,7 +57,7 @@
  * performance reasons.
  *
  *
- * Lock nesting
+ * Lock recursion
  * ------------
  *
  * A write operation on a tree might indirectly start a look up on the same
@@ -201,7 +201,7 @@ void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
 	 * lock, but it won't change to or away from us.  If we have the write
 	 * lock, we are the owner and it'll never change.
 	 */
-	if (eb->lock_nested && current->pid == eb->lock_owner)
+	if (eb->lock_recursed && current->pid == eb->lock_owner)
 		return;
 	btrfs_assert_tree_read_locked(eb);
 	atomic_inc(&eb->blocking_readers);
@@ -225,7 +225,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 	 * lock, but it won't change to or away from us.  If we have the write
 	 * lock, we are the owner and it'll never change.
 	 */
-	if (eb->lock_nested && current->pid == eb->lock_owner)
+	if (eb->lock_recursed && current->pid == eb->lock_owner)
 		return;
 	if (eb->blocking_writers == 0) {
 		btrfs_assert_spinning_writers_put(eb);
@@ -263,8 +263,8 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 			 * depends on this as it may be called on a partly
 			 * (write-)locked tree.
 			 */
-			BUG_ON(eb->lock_nested);
-			eb->lock_nested = true;
+			BUG_ON(eb->lock_recursed);
+			eb->lock_recursed = true;
 			read_unlock(&eb->lock);
 			trace_btrfs_tree_read_lock(eb, start_ns);
 			return;
@@ -362,11 +362,11 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
 	/*
 	 * if we're nested, we have the write lock.  No new locking
 	 * is needed as long as we are the lock owner.
-	 * The write unlock will do a barrier for us, and the lock_nested
+	 * The write unlock will do a barrier for us, and the lock_recursed
 	 * field only matters to the lock owner.
 	 */
-	if (eb->lock_nested && current->pid == eb->lock_owner) {
-		eb->lock_nested = false;
+	if (eb->lock_recursed && current->pid == eb->lock_owner) {
+		eb->lock_recursed = false;
 		return;
 	}
 	btrfs_assert_tree_read_locked(eb);
@@ -388,11 +388,11 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
 	/*
 	 * if we're nested, we have the write lock.  No new locking
 	 * is needed as long as we are the lock owner.
-	 * The write unlock will do a barrier for us, and the lock_nested
+	 * The write unlock will do a barrier for us, and the lock_recursed
 	 * field only matters to the lock owner.
 	 */
-	if (eb->lock_nested && current->pid == eb->lock_owner) {
-		eb->lock_nested = false;
+	if (eb->lock_recursed && current->pid == eb->lock_owner) {
+		eb->lock_recursed = false;
 		return;
 	}
 	btrfs_assert_tree_read_locked(eb);
-- 
2.24.1



* [PATCH 08/17] btrfs: introduce path->recurse
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (6 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 07/17] btrfs: rename eb->lock_nested to eb->lock_recursed Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 09/17] btrfs: add nesting tags to the locking helpers Josef Bacik
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Our current tree locking stuff allows us to recurse with read locks if
we're already holding the write lock.  This is necessary for the space
cache inode, as we could be holding a lock on the root_tree root when we
need to cache a block group, and thus need to be able to read down the
root_tree to read in the inode cache.

We can get away with this in our current locking, but we won't be able to
with an rwsem.  Handle this by purposefully annotating the places where we
require recursion, so that in the future we can maybe come up with a way
to avoid the recursion.  In the case of the free space inode, this will be
superseded by the free space tree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   |  8 ++++----
 fs/btrfs/ctree.h   |  3 +--
 fs/btrfs/inode.c   |  2 ++
 fs/btrfs/locking.c | 13 ++++++++++---
 fs/btrfs/locking.h | 10 ++++++++++
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index cbacf700b68b..34d40b19bb71 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2601,7 +2601,7 @@ static struct extent_buffer *btrfs_search_slot_get_root(struct btrfs_root *root,
 		 * We don't know the level of the root node until we actually
 		 * have it read locked
 		 */
-		b = btrfs_read_lock_root_node(root);
+		b = __btrfs_read_lock_root_node(root, p->recurse);
 		level = btrfs_header_level(b);
 		if (level > write_lock_level)
 			goto out;
@@ -2875,7 +2875,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			} else {
 				if (!btrfs_tree_read_lock_atomic(b)) {
 					btrfs_set_path_blocking(p);
-					btrfs_tree_read_lock(b);
+					__btrfs_tree_read_lock(b, p->recurse);
 				}
 				p->locks[level] = BTRFS_READ_LOCK;
 			}
@@ -5379,7 +5379,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			}
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_read_lock(next);
+				__btrfs_tree_read_lock(next, path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
@@ -5414,7 +5414,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			ret = btrfs_try_tree_read_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				btrfs_tree_read_lock(next);
+				__btrfs_tree_read_lock(next, path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9c7e466f27a9..e7a28033eaef 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -374,6 +374,7 @@ struct btrfs_path {
 	unsigned int search_commit_root:1;
 	unsigned int need_commit_sem:1;
 	unsigned int skip_release_on_error:1;
+	unsigned int recurse:1;
 };
 #define BTRFS_MAX_EXTENT_ITEM_SIZE(r) ((BTRFS_LEAF_DATA_SIZE(r->fs_info) >> 4) - \
 					sizeof(struct btrfs_item))
@@ -2651,8 +2652,6 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info,
 			     struct btrfs_path *path,
 			     const struct btrfs_key *new_key);
 struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
-struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
-struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root);
 int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
 			struct btrfs_key *key, int lowest_level,
 			u64 min_trans);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96064eb41d55..6c547aa905d7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6591,6 +6591,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	 */
 	path->leave_spinning = 1;
 
+	path->recurse = btrfs_is_free_space_inode(inode);
+
 	ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
 	if (ret < 0) {
 		err = ret;
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 8e3d107a6192..d07818733833 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -244,7 +244,7 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
  *
  * The rwlock is held upon exit.
  */
-void btrfs_tree_read_lock(struct extent_buffer *eb)
+void __btrfs_tree_read_lock(struct extent_buffer *eb, bool recurse)
 {
 	u64 start_ns = 0;
 
@@ -263,6 +263,7 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 			 * depends on this as it may be called on a partly
 			 * (write-)locked tree.
 			 */
+			WARN_ON(!recurse);
 			BUG_ON(eb->lock_recursed);
 			eb->lock_recursed = true;
 			read_unlock(&eb->lock);
@@ -279,6 +280,11 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 	trace_btrfs_tree_read_lock(eb, start_ns);
 }
 
+void btrfs_tree_read_lock(struct extent_buffer *eb)
+{
+	__btrfs_tree_read_lock(eb, false);
+}
+
 /*
  * Lock extent buffer for read, optimistically expecting that there are no
  * contending blocking writers. If there are, don't wait.
@@ -552,13 +558,14 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
  *
  * Return: root extent buffer with read lock held
  */
-struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root)
+struct extent_buffer *__btrfs_read_lock_root_node(struct btrfs_root *root,
+						  bool recurse)
 {
 	struct extent_buffer *eb;
 
 	while (1) {
 		eb = btrfs_root_node(root);
-		btrfs_tree_read_lock(eb);
+		__btrfs_tree_read_lock(eb, recurse);
 		if (eb == root->node)
 			break;
 		btrfs_tree_read_unlock(eb);
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index d715846c10b8..587f144adafe 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -21,6 +21,7 @@ struct btrfs_path;
 void btrfs_tree_lock(struct extent_buffer *eb);
 void btrfs_tree_unlock(struct extent_buffer *eb);
 
+void __btrfs_tree_read_lock(struct extent_buffer *eb, bool recurse);
 void btrfs_tree_read_lock(struct extent_buffer *eb);
 void btrfs_tree_read_unlock(struct extent_buffer *eb);
 void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb);
@@ -29,6 +30,15 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb);
 int btrfs_try_tree_read_lock(struct extent_buffer *eb);
 int btrfs_try_tree_write_lock(struct extent_buffer *eb);
 int btrfs_tree_read_lock_atomic(struct extent_buffer *eb);
+struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
+struct extent_buffer *__btrfs_read_lock_root_node(struct btrfs_root *root,
+						  bool recurse);
+
+static inline struct extent_buffer *
+btrfs_read_lock_root_node(struct btrfs_root *root)
+{
+	return __btrfs_read_lock_root_node(root, false);
+}
 
 #ifdef CONFIG_BTRFS_DEBUG
 static inline void btrfs_assert_tree_locked(struct extent_buffer *eb) {
-- 
2.24.1



* [PATCH 09/17] btrfs: add nesting tags to the locking helpers
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (7 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 08/17] btrfs: introduce path->recurse Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-14 14:41   ` David Sterba
  2020-08-10 15:42 ` [PATCH 10/17] btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks Josef Bacik
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We will need these when we switch to an rwsem, so plumb in the
infrastructure here to use later on.  I violate the 80 character limit
in a few places here because it will all be cleaned up later.
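
For now the nesting tag is just plumbed through and unused; the existing
helpers become thin wrappers that pass BTRFS_NESTING_NORMAL, so the
behaviour is unchanged (condensed from the hunks below):

  void btrfs_tree_lock(struct extent_buffer *eb)
  {
          __btrfs_tree_lock(eb, BTRFS_NESTING_NORMAL);
  }

  void btrfs_tree_read_lock(struct extent_buffer *eb)
  {
          __btrfs_tree_read_lock(eb, BTRFS_NESTING_NORMAL, false);
  }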

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   | 11 ++++++++---
 fs/btrfs/locking.c | 14 ++++++++++----
 fs/btrfs/locking.h |  8 +++++++-
 3 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 34d40b19bb71..fefc61402118 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2875,7 +2875,8 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			} else {
 				if (!btrfs_tree_read_lock_atomic(b)) {
 					btrfs_set_path_blocking(p);
-					__btrfs_tree_read_lock(b, p->recurse);
+					__btrfs_tree_read_lock(b, BTRFS_NESTING_NORMAL,
+							       p->recurse);
 				}
 				p->locks[level] = BTRFS_READ_LOCK;
 			}
@@ -5379,7 +5380,9 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			}
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				__btrfs_tree_read_lock(next, path->recurse);
+				__btrfs_tree_read_lock(next,
+						       BTRFS_NESTING_NORMAL,
+						       path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
@@ -5414,7 +5417,9 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			ret = btrfs_try_tree_read_lock(next);
 			if (!ret) {
 				btrfs_set_path_blocking(path);
-				__btrfs_tree_read_lock(next, path->recurse);
+				__btrfs_tree_read_lock(next,
+						       BTRFS_NESTING_NORMAL,
+						       path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index d07818733833..133a6dd5f29b 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -244,7 +244,8 @@ void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
  *
  * The rwlock is held upon exit.
  */
-void __btrfs_tree_read_lock(struct extent_buffer *eb, bool recurse)
+void __btrfs_tree_read_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest,
+			    bool recurse)
 {
 	u64 start_ns = 0;
 
@@ -282,7 +283,7 @@ void __btrfs_tree_read_lock(struct extent_buffer *eb, bool recurse)
 
 void btrfs_tree_read_lock(struct extent_buffer *eb)
 {
-	__btrfs_tree_read_lock(eb, false);
+	__btrfs_tree_read_lock(eb, BTRFS_NESTING_NORMAL, false);
 }
 
 /*
@@ -415,7 +416,7 @@ void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
  *
  * The rwlock is held for write upon exit.
  */
-void btrfs_tree_lock(struct extent_buffer *eb)
+void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest)
 	__acquires(&eb->lock)
 {
 	u64 start_ns = 0;
@@ -440,6 +441,11 @@ void btrfs_tree_lock(struct extent_buffer *eb)
 	trace_btrfs_tree_lock(eb, start_ns);
 }
 
+void btrfs_tree_lock(struct extent_buffer *eb)
+{
+	__btrfs_tree_lock(eb, BTRFS_NESTING_NORMAL);
+}
+
 /*
  * Release the write lock, either blocking or spinning (ie. there's no need
  * for an explicit blocking unlock, like btrfs_tree_read_unlock_blocking).
@@ -565,7 +571,7 @@ struct extent_buffer *__btrfs_read_lock_root_node(struct btrfs_root *root,
 
 	while (1) {
 		eb = btrfs_root_node(root);
-		__btrfs_tree_read_lock(eb, recurse);
+		__btrfs_tree_read_lock(eb, BTRFS_NESTING_NORMAL, recurse);
 		if (eb == root->node)
 			break;
 		btrfs_tree_read_unlock(eb);
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 587f144adafe..476b5552e8b2 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -16,12 +16,18 @@
 #define BTRFS_WRITE_LOCK_BLOCKING 3
 #define BTRFS_READ_LOCK_BLOCKING 4
 
+enum btrfs_lock_nesting {
+	BTRFS_NESTING_NORMAL = 0,
+};
+
 struct btrfs_path;
 
+void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest);
 void btrfs_tree_lock(struct extent_buffer *eb);
 void btrfs_tree_unlock(struct extent_buffer *eb);
 
-void __btrfs_tree_read_lock(struct extent_buffer *eb, bool recurse);
+void __btrfs_tree_read_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest,
+			    bool recurse);
 void btrfs_tree_read_lock(struct extent_buffer *eb);
 void btrfs_tree_read_unlock(struct extent_buffer *eb);
 void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb);
-- 
2.24.1



* [PATCH 10/17] btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (8 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 09/17] btrfs: add nesting tags to the locking helpers Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 11/17] btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT Josef Bacik
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

When we COW a block we are holding a lock on the original block, and
then we lock the new COW block.  Because our lockdep maps are based on
root + level, this will make lockdep complain.  We need a way to
indicate a subclass for locking the COW'ed block, so plumb our enum
btrfs_lock_nesting through from btrfs_cow_block down to
btrfs_init_new_buffer, and then introduce BTRFS_NESTING_COW to be used
for cow'ing blocks.

The reason I've added all this extra infrastructure is that there will
be a need for different nesting classes in follow-up patches.
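
The plumbing itself is mechanical: the nesting tag travels from the COW
entry point down to where the freshly allocated block is first locked,
roughly (condensed from the hunks below):

  btrfs_cow_block(trans, root, buf, parent, slot, &cow, BTRFS_NESTING_COW)
    __btrfs_cow_block(..., nest)
      alloc_tree_block_no_bg_flush(..., nest)
        btrfs_alloc_tree_block(..., nest)
          btrfs_init_new_buffer(..., nest)
            __btrfs_tree_lock(buf, nest)    <- new block locked with the COW subclass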

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c       | 52 ++++++++++++++++++++++++++----------------
 fs/btrfs/ctree.h       |  6 +++--
 fs/btrfs/disk-io.c     |  5 ++--
 fs/btrfs/extent-tree.c | 12 ++++++----
 fs/btrfs/ioctl.c       |  3 ++-
 fs/btrfs/locking.h     |  8 +++++++
 fs/btrfs/relocation.c  | 11 +++++----
 fs/btrfs/transaction.c |  5 ++--
 8 files changed, 66 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index fefc61402118..a9699232d5bc 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -198,7 +198,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 		btrfs_node_key(buf, &disk_key, 0);
 
 	cow = btrfs_alloc_tree_block(trans, root, 0, new_root_objectid,
-			&disk_key, level, buf->start, 0);
+			&disk_key, level, buf->start, 0, BTRFS_NESTING_NORMAL);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -957,7 +957,8 @@ static struct extent_buffer *alloc_tree_block_no_bg_flush(
 					  const struct btrfs_disk_key *disk_key,
 					  int level,
 					  u64 hint,
-					  u64 empty_size)
+					  u64 empty_size,
+					  enum btrfs_lock_nesting nest)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *ret;
@@ -986,7 +987,7 @@ static struct extent_buffer *alloc_tree_block_no_bg_flush(
 
 	ret = btrfs_alloc_tree_block(trans, root, parent_start,
 				     root->root_key.objectid, disk_key, level,
-				     hint, empty_size);
+				     hint, empty_size, nest);
 	trans->can_flush_pending_bgs = true;
 
 	return ret;
@@ -1009,7 +1010,8 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 			     struct extent_buffer *buf,
 			     struct extent_buffer *parent, int parent_slot,
 			     struct extent_buffer **cow_ret,
-			     u64 search_start, u64 empty_size)
+			     u64 search_start, u64 empty_size,
+			     enum btrfs_lock_nesting nest)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_disk_key disk_key;
@@ -1040,7 +1042,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 		parent_start = parent->start;
 
 	cow = alloc_tree_block_no_bg_flush(trans, root, parent_start, &disk_key,
-					   level, search_start, empty_size);
+					   level, search_start, empty_size, nest);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -1446,7 +1448,8 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans,
 noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, struct extent_buffer *buf,
 		    struct extent_buffer *parent, int parent_slot,
-		    struct extent_buffer **cow_ret)
+		    struct extent_buffer **cow_ret,
+		    enum btrfs_lock_nesting nest)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 search_start;
@@ -1485,7 +1488,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 	 */
 	btrfs_qgroup_trace_subtree_after_cow(trans, root, buf);
 	ret = __btrfs_cow_block(trans, root, buf, parent,
-				 parent_slot, cow_ret, search_start, 0);
+				 parent_slot, cow_ret, search_start, 0, nest);
 
 	trace_btrfs_cow_block(root, buf, *cow_ret);
 
@@ -1657,7 +1660,8 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 		err = __btrfs_cow_block(trans, root, cur, parent, i,
 					&cur, search_start,
 					min(16 * blocksize,
-					    (end_slot - i) * blocksize));
+					    (end_slot - i) * blocksize),
+					BTRFS_NESTING_COW);
 		if (err) {
 			btrfs_tree_unlock(cur);
 			free_extent_buffer(cur);
@@ -1855,7 +1859,8 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 		btrfs_tree_lock(child);
 		btrfs_set_lock_blocking_write(child);
-		ret = btrfs_cow_block(trans, root, child, mid, 0, &child);
+		ret = btrfs_cow_block(trans, root, child, mid, 0, &child,
+				      BTRFS_NESTING_COW);
 		if (ret) {
 			btrfs_tree_unlock(child);
 			free_extent_buffer(child);
@@ -1894,7 +1899,8 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		btrfs_tree_lock(left);
 		btrfs_set_lock_blocking_write(left);
 		wret = btrfs_cow_block(trans, root, left,
-				       parent, pslot - 1, &left);
+				       parent, pslot - 1, &left,
+				       BTRFS_NESTING_COW);
 		if (wret) {
 			ret = wret;
 			goto enospc;
@@ -1909,7 +1915,8 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		btrfs_tree_lock(right);
 		btrfs_set_lock_blocking_write(right);
 		wret = btrfs_cow_block(trans, root, right,
-				       parent, pslot + 1, &right);
+				       parent, pslot + 1, &right,
+				       BTRFS_NESTING_COW);
 		if (wret) {
 			ret = wret;
 			goto enospc;
@@ -2077,7 +2084,8 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 			wret = 1;
 		} else {
 			ret = btrfs_cow_block(trans, root, left, parent,
-					      pslot - 1, &left);
+					      pslot - 1, &left,
+					      BTRFS_NESTING_COW);
 			if (ret)
 				wret = 1;
 			else {
@@ -2132,7 +2140,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		} else {
 			ret = btrfs_cow_block(trans, root, right,
 					      parent, pslot + 1,
-					      &right);
+					      &right, BTRFS_NESTING_COW);
 			if (ret)
 				wret = 1;
 			else {
@@ -2740,11 +2748,13 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			btrfs_set_path_blocking(p);
 			if (last_level)
 				err = btrfs_cow_block(trans, root, b, NULL, 0,
-						      &b);
+						      &b,
+						      BTRFS_NESTING_COW);
 			else
 				err = btrfs_cow_block(trans, root, b,
 						      p->nodes[level + 1],
-						      p->slots[level + 1], &b);
+						      p->slots[level + 1], &b,
+						      BTRFS_NESTING_COW);
 			if (err) {
 				ret = err;
 				goto done;
@@ -3332,7 +3342,8 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 		btrfs_node_key(lower, &lower_key, 0);
 
 	c = alloc_tree_block_no_bg_flush(trans, root, 0, &lower_key, level,
-					 root->node->start, 0);
+					 root->node->start, 0,
+					 BTRFS_NESTING_NORMAL);
 	if (IS_ERR(c))
 		return PTR_ERR(c);
 
@@ -3462,7 +3473,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 	btrfs_node_key(c, &disk_key, mid);
 
 	split = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, level,
-					     c->start, 0);
+					     c->start, 0, BTRFS_NESTING_NORMAL);
 	if (IS_ERR(split))
 		return PTR_ERR(split);
 
@@ -3740,7 +3751,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
 
 	/* cow and double check */
 	ret = btrfs_cow_block(trans, root, right, upper,
-			      slot + 1, &right);
+			      slot + 1, &right, BTRFS_NESTING_COW);
 	if (ret)
 		goto out_unlock;
 
@@ -3975,7 +3986,8 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
 
 	/* cow and double check */
 	ret = btrfs_cow_block(trans, root, left,
-			      path->nodes[1], slot - 1, &left);
+			      path->nodes[1], slot - 1, &left,
+			      BTRFS_NESTING_COW);
 	if (ret) {
 		/* we hit -ENOSPC, but it isn't fatal here */
 		if (ret == -ENOSPC)
@@ -4238,7 +4250,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
 		btrfs_item_key(l, &disk_key, mid);
 
 	right = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, 0,
-					     l->start, 0);
+					     l->start, 0, BTRFS_NESTING_NORMAL);
 	if (IS_ERR(right))
 		return PTR_ERR(right);
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e7a28033eaef..ceecccce6133 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2525,7 +2525,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     u64 parent, u64 root_objectid,
 					     const struct btrfs_disk_key *key,
 					     int level, u64 hint,
-					     u64 empty_size);
+					     u64 empty_size,
+					     enum btrfs_lock_nesting nest);
 void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct extent_buffer *buf,
@@ -2664,7 +2665,8 @@ struct extent_buffer *btrfs_read_node_slot(struct extent_buffer *parent,
 int btrfs_cow_block(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, struct extent_buffer *buf,
 		    struct extent_buffer *parent, int parent_slot,
-		    struct extent_buffer **cow_ret);
+		    struct extent_buffer **cow_ret,
+		    enum btrfs_lock_nesting nest);
 int btrfs_copy_root(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root,
 		      struct extent_buffer *buf,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c850d7f44fbe..e618127c3d89 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1209,7 +1209,8 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = 0;
 
-	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0);
+	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
+				      BTRFS_NESTING_NORMAL);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
 		leaf = NULL;
@@ -1281,7 +1282,7 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	 */
 
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
-			NULL, 0, 0, 0);
+			NULL, 0, 0, 0, BTRFS_NESTING_NORMAL);
 	if (IS_ERR(leaf)) {
 		btrfs_put_root(root);
 		return ERR_CAST(leaf);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 84029851f094..e5baf90f1174 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4500,7 +4500,8 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 
 static struct extent_buffer *
 btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
-		      u64 bytenr, int level, u64 owner)
+		      u64 bytenr, int level, u64 owner,
+		      enum btrfs_lock_nesting nest)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct extent_buffer *buf;
@@ -4523,7 +4524,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	}
 
 	btrfs_set_buffer_lockdep_class(owner, buf, level);
-	btrfs_tree_lock(buf);
+	__btrfs_tree_lock(buf, nest);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
 
@@ -4569,7 +4570,8 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 					     u64 parent, u64 root_objectid,
 					     const struct btrfs_disk_key *key,
 					     int level, u64 hint,
-					     u64 empty_size)
+					     u64 empty_size,
+					     enum btrfs_lock_nesting nest)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_key ins;
@@ -4585,7 +4587,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 	if (btrfs_is_testing(fs_info)) {
 		buf = btrfs_init_new_buffer(trans, root, root->alloc_bytenr,
-					    level, root_objectid);
+					    level, root_objectid, nest);
 		if (!IS_ERR(buf))
 			root->alloc_bytenr += blocksize;
 		return buf;
@@ -4602,7 +4604,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 		goto out_unuse;
 
 	buf = btrfs_init_new_buffer(trans, root, ins.objectid, level,
-				    root_objectid);
+				    root_objectid, nest);
 	if (IS_ERR(buf)) {
 		ret = PTR_ERR(buf);
 		goto out_free_reserved;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 39448556f635..c0d4ab44361d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -628,7 +628,8 @@ static noinline int create_subvol(struct inode *dir,
 	if (ret)
 		goto fail;
 
-	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0);
+	leaf = btrfs_alloc_tree_block(trans, root, 0, objectid, NULL, 0, 0, 0,
+				      BTRFS_NESTING_NORMAL);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
 		goto fail;
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 476b5552e8b2..9e0975aa82b3 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -18,6 +18,14 @@
 
 enum btrfs_lock_nesting {
 	BTRFS_NESTING_NORMAL = 0,
+
+	/*
+	 * When we COW a block we are holding the lock on the original block,
+	 * and since our lockdep maps are rootid+level, this confuses lockdep
+	 * when we lock the newly allocated COW'd block.  Handle this by having
+	 * a subclass for COW'ed blocks so that lockdep doesn't complain.
+	 */
+	BTRFS_NESTING_COW,
 };
 
 struct btrfs_path;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 4ba1ab9cc76d..3602806d71bd 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1206,7 +1206,8 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 	}
 
 	if (cow) {
-		ret = btrfs_cow_block(trans, dest, eb, NULL, 0, &eb);
+		ret = btrfs_cow_block(trans, dest, eb, NULL, 0, &eb,
+				      BTRFS_NESTING_COW);
 		BUG_ON(ret);
 	}
 	btrfs_set_lock_blocking_write(eb);
@@ -1274,7 +1275,8 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 			btrfs_tree_lock(eb);
 			if (cow) {
 				ret = btrfs_cow_block(trans, dest, eb, parent,
-						      slot, &eb);
+						      slot, &eb,
+						      BTRFS_NESTING_COW);
 				BUG_ON(ret);
 			}
 			btrfs_set_lock_blocking_write(eb);
@@ -1781,7 +1783,8 @@ static noinline_for_stack int merge_reloc_root(struct reloc_control *rc,
 	 * relocated and the block is tree root.
 	 */
 	leaf = btrfs_lock_root_node(root);
-	ret = btrfs_cow_block(trans, root, leaf, NULL, 0, &leaf);
+	ret = btrfs_cow_block(trans, root, leaf, NULL, 0, &leaf,
+			      BTRFS_NESTING_COW);
 	btrfs_tree_unlock(leaf);
 	free_extent_buffer(leaf);
 	if (ret < 0)
@@ -2308,7 +2311,7 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 
 		if (!node->eb) {
 			ret = btrfs_cow_block(trans, root, eb, upper->eb,
-					      slot, &eb);
+					      slot, &eb, BTRFS_NESTING_COW);
 			btrfs_tree_unlock(eb);
 			free_extent_buffer(eb);
 			if (ret < 0) {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 20c6ac1a5de7..45bf7da3ce1c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1182,7 +1182,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
 
 	eb = btrfs_lock_root_node(fs_info->tree_root);
 	ret = btrfs_cow_block(trans, fs_info->tree_root, eb, NULL,
-			      0, &eb);
+			      0, &eb, BTRFS_NESTING_COW);
 	btrfs_tree_unlock(eb);
 	free_extent_buffer(eb);
 
@@ -1587,7 +1587,8 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 	btrfs_set_root_otransid(new_root_item, trans->transid);
 
 	old = btrfs_lock_root_node(root);
-	ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
+	ret = btrfs_cow_block(trans, root, old, NULL, 0, &old,
+			      BTRFS_NESTING_COW);
 	if (ret) {
 		btrfs_tree_unlock(old);
 		free_extent_buffer(old);
-- 
2.24.1



* [PATCH 11/17] btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (9 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 10/17] btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 12/17] btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW Josef Bacik
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Our lockdep maps are based on rootid+level; however, in some cases we
will lock adjacent blocks on the same level, namely when searching
forward or during split/balance.  Because of this lockdep will complain,
so we need a separate subclass to indicate to lockdep that these are
different locks.

lock leaf -> BTRFS_NESTING_NORMAL
  cow leaf -> BTRFS_NESTING_COW
    split leaf
       lock left -> BTRFS_NESTING_LEFT
       lock right -> BTRFS_NESTING_RIGHT

The above graph illustrates the need for this new nesting subclass.
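
These enum values become the lockdep subclass once the rw_semaphore
conversion lands later in the series; a rough sketch of where they end
up (write side, tracepoints omitted):

  void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest)
  {
          down_write_nested(&eb->lock, nest);
          eb->lock_owner = current->pid;
  }

so BTRFS_NESTING_LEFT and BTRFS_NESTING_RIGHT give the adjacent blocks
their own subclasses, and lockdep stops treating them as recursive
locking of the same class.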

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   | 16 ++++++++--------
 fs/btrfs/locking.h | 12 ++++++++++++
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index a9699232d5bc..ea5412cd0f62 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1896,7 +1896,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		left = NULL;
 
 	if (left) {
-		btrfs_tree_lock(left);
+		__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
 		btrfs_set_lock_blocking_write(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left,
@@ -1912,7 +1912,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		right = NULL;
 
 	if (right) {
-		btrfs_tree_lock(right);
+		__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
 		btrfs_set_lock_blocking_write(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right,
@@ -2076,7 +2076,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 	if (left) {
 		u32 left_nr;
 
-		btrfs_tree_lock(left);
+		__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
 		btrfs_set_lock_blocking_write(left);
 
 		left_nr = btrfs_header_nritems(left);
@@ -2131,7 +2131,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 	if (right) {
 		u32 right_nr;
 
-		btrfs_tree_lock(right);
+		__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
 		btrfs_set_lock_blocking_write(right);
 
 		right_nr = btrfs_header_nritems(right);
@@ -3742,7 +3742,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
 	if (IS_ERR(right))
 		return 1;
 
-	btrfs_tree_lock(right);
+	__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
 	btrfs_set_lock_blocking_write(right);
 
 	free_space = btrfs_leaf_free_space(right);
@@ -3975,7 +3975,7 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
 	if (IS_ERR(left))
 		return 1;
 
-	btrfs_tree_lock(left);
+	__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
 	btrfs_set_lock_blocking_write(left);
 
 	free_space = btrfs_leaf_free_space(left);
@@ -5393,7 +5393,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				__btrfs_tree_read_lock(next,
-						       BTRFS_NESTING_NORMAL,
+						       BTRFS_NESTING_RIGHT,
 						       path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
@@ -5430,7 +5430,7 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 			if (!ret) {
 				btrfs_set_path_blocking(path);
 				__btrfs_tree_read_lock(next,
-						       BTRFS_NESTING_NORMAL,
+						       BTRFS_NESTING_RIGHT,
 						       path->recurse);
 			}
 			next_rw_lock = BTRFS_READ_LOCK;
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 9e0975aa82b3..f93e3e10ddbd 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -26,6 +26,18 @@ enum btrfs_lock_nesting {
 	 * a subclass for COW'ed blocks so that lockdep doesn't complain.
 	 */
 	BTRFS_NESTING_COW,
+
+	/*
+	 * Oftentimes we need to lock adjacent nodes on the same level while
+	 * still holding the lock on the original node we searched to, such as
+	 * for searching forward or for split/balance.
+	 *
+	 * Because of this we need to indicate to lockdep that this is
+	 * acceptable by having a different subclass for each of these
+	 * operations.
+	 */
+	BTRFS_NESTING_LEFT,
+	BTRFS_NESTING_RIGHT,
 };
 
 struct btrfs_path;
-- 
2.24.1



* [PATCH 12/17] btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (10 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 11/17] btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 13/17] btrfs: introduce BTRFS_NESTING_SPLIT for split blocks Josef Bacik
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

For similar reasons to BTRFS_NESTING_COW, we need BTRFS_NESTING_LEFT_COW
and BTRFS_NESTING_RIGHT_COW.  The pattern is this:

lock leaf -> BTRFS_NESTING_NORMAL
  cow leaf -> BTRFS_NESTING_COW
    split leaf
      lock left -> BTRFS_NESTING_LEFT
        cow left -> BTRFS_NESTING_LEFT_COW

We need this in order to indicate to lockdep that these locks are
discrete and are being taken in a safe order.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   | 12 ++++++------
 fs/btrfs/locking.h |  8 ++++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ea5412cd0f62..6b63b3bcacd4 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1900,7 +1900,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		btrfs_set_lock_blocking_write(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left,
-				       BTRFS_NESTING_COW);
+				       BTRFS_NESTING_LEFT_COW);
 		if (wret) {
 			ret = wret;
 			goto enospc;
@@ -1916,7 +1916,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		btrfs_set_lock_blocking_write(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right,
-				       BTRFS_NESTING_COW);
+				       BTRFS_NESTING_RIGHT_COW);
 		if (wret) {
 			ret = wret;
 			goto enospc;
@@ -2085,7 +2085,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		} else {
 			ret = btrfs_cow_block(trans, root, left, parent,
 					      pslot - 1, &left,
-					      BTRFS_NESTING_COW);
+					      BTRFS_NESTING_LEFT_COW);
 			if (ret)
 				wret = 1;
 			else {
@@ -2140,7 +2140,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		} else {
 			ret = btrfs_cow_block(trans, root, right,
 					      parent, pslot + 1,
-					      &right, BTRFS_NESTING_COW);
+					      &right, BTRFS_NESTING_RIGHT_COW);
 			if (ret)
 				wret = 1;
 			else {
@@ -3751,7 +3751,7 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
 
 	/* cow and double check */
 	ret = btrfs_cow_block(trans, root, right, upper,
-			      slot + 1, &right, BTRFS_NESTING_COW);
+			      slot + 1, &right, BTRFS_NESTING_RIGHT_COW);
 	if (ret)
 		goto out_unlock;
 
@@ -3987,7 +3987,7 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
 	/* cow and double check */
 	ret = btrfs_cow_block(trans, root, left,
 			      path->nodes[1], slot - 1, &left,
-			      BTRFS_NESTING_COW);
+			      BTRFS_NESTING_LEFT_COW);
 	if (ret) {
 		/* we hit -ENOSPC, but it isn't fatal here */
 		if (ret == -ENOSPC)
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index f93e3e10ddbd..31a87477b889 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -38,6 +38,14 @@ enum btrfs_lock_nesting {
 	 */
 	BTRFS_NESTING_LEFT,
 	BTRFS_NESTING_RIGHT,
+
+	/*
+	 * When splitting we will be holding a lock on the left/right node when
+	 * we need to cow that node, thus we need a new set of subclasses for
+	 * these two operations.
+	 */
+	BTRFS_NESTING_LEFT_COW,
+	BTRFS_NESTING_RIGHT_COW,
 };
 
 struct btrfs_path;
-- 
2.24.1



* [PATCH 13/17] btrfs: introduce BTRFS_NESTING_SPLIT for split blocks
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (11 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 12/17] btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots Josef Bacik
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

If we are splitting a leaf/node, we could do something like the
following

lock(leaf)  BTRFS_NESTING_NORMAL
  lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
    push from leaf -> left
      reset path to point to left
        split left
          allocate new block, lock block BTRFS_NESTING_SPLIT

At the point where we lock the new block we need a different nesting
level, because we have already used either BTRFS_NESTING_LEFT or
BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
adjacent leaves.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   | 4 ++--
 fs/btrfs/locking.h | 9 +++++++++
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 6b63b3bcacd4..82dac6510a86 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -3473,7 +3473,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 	btrfs_node_key(c, &disk_key, mid);
 
 	split = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, level,
-					     c->start, 0, BTRFS_NESTING_NORMAL);
+					     c->start, 0, BTRFS_NESTING_SPLIT);
 	if (IS_ERR(split))
 		return PTR_ERR(split);
 
@@ -4250,7 +4250,7 @@ static noinline int split_leaf(struct btrfs_trans_handle *trans,
 		btrfs_item_key(l, &disk_key, mid);
 
 	right = alloc_tree_block_no_bg_flush(trans, root, 0, &disk_key, 0,
-					     l->start, 0, BTRFS_NESTING_NORMAL);
+					     l->start, 0, BTRFS_NESTING_SPLIT);
 	if (IS_ERR(right))
 		return PTR_ERR(right);
 
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 31a87477b889..4f5586fed25a 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -46,6 +46,15 @@ enum btrfs_lock_nesting {
 	 */
 	BTRFS_NESTING_LEFT_COW,
 	BTRFS_NESTING_RIGHT_COW,
+
+	/*
+	 * When splitting we may push nodes to the left or right, but still use
+	 * the subsequent nodes in our path, keeping our locks on those adjacent
+	 * blocks.  Thus when we go to allocate a new split block we've already
+	 * used up all of our available subclasses, so this subclass exists to
+	 * handle this case where we need to allocate a new split block.
+	 */
+	BTRFS_NESTING_SPLIT,
 };
 
 struct btrfs_path;
-- 
2.24.1



* [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (12 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 13/17] btrfs: introduce BTRFS_NESTING_SPLIT for split blocks Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-14 14:45   ` David Sterba
  2020-08-10 15:42 ` [PATCH 15/17] btrfs: change our extent buffer lock to a rw_semaphore Josef Bacik
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

The way we add new roots is confusing for lockdep from a locking
perspective.  We generally have the rule that we lock things in order
from highest level to lowest, but in the case of adding a new level to
the tree we actually allocate a new block for the root, which makes the
locking go in reverse.  A similar issue exists for snapshotting: we COW
the original root to become the root of a new tree, even though both
blocks are at the same level.  Address this by using
BTRFS_NESTING_NEW_ROOT for these operations.
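
As an illustration of the first case, the inverted order in
insert_new_root() (see the hunk below) is roughly:

  insert_new_root
    root->node is already locked in the path             <- old root, level N
    c = alloc_tree_block_no_bg_flush(..., BTRFS_NESTING_NEW_ROOT)
      __btrfs_tree_lock(c, BTRFS_NESTING_NEW_ROOT)        <- new root, level N + 1

Without the separate subclass, lockdep would see the higher level block
being locked while a lower level block of the same tree is held and
complain.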

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/ctree.c   |  5 +++--
 fs/btrfs/locking.h | 14 ++++++++++++++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 82dac6510a86..b0d6d86f449d 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -198,7 +198,8 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 		btrfs_node_key(buf, &disk_key, 0);
 
 	cow = btrfs_alloc_tree_block(trans, root, 0, new_root_objectid,
-			&disk_key, level, buf->start, 0, BTRFS_NESTING_NORMAL);
+				     &disk_key, level, buf->start, 0,
+				     BTRFS_NESTING_NEW_ROOT);
 	if (IS_ERR(cow))
 		return PTR_ERR(cow);
 
@@ -3343,7 +3344,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 
 	c = alloc_tree_block_no_bg_flush(trans, root, 0, &lower_key, level,
 					 root->node->start, 0,
-					 BTRFS_NESTING_NORMAL);
+					 BTRFS_NESTING_NEW_ROOT);
 	if (IS_ERR(c))
 		return PTR_ERR(c);
 
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index 4f5586fed25a..c845a0c07a1c 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -16,6 +16,11 @@
 #define BTRFS_WRITE_LOCK_BLOCKING 3
 #define BTRFS_READ_LOCK_BLOCKING 4
 
+/*
+ * We are limited in number of subclasses by MAX_LOCKDEP_SUBCLASSES, which at
+ * the time of this patch is 8, which is how many we're using here.  Keep this
+ * in mind if you decide you want to add another subclass.
+ */
 enum btrfs_lock_nesting {
 	BTRFS_NESTING_NORMAL = 0,
 
@@ -55,6 +60,15 @@ enum btrfs_lock_nesting {
 	 * handle this case where we need to allocate a new split block.
 	 */
 	BTRFS_NESTING_SPLIT,
+
+	/*
+	 * When promoting a new block to a root we need to have a special
+	 * subclass so we don't confuse lockdep, as it will appear that we are
+	 * locking a higher level node before a lower level one.  Copying also
+	 * has this problem as it appears we're locking the same block again
+	 * when we make a snapshot of an existing root.
+	 */
+	BTRFS_NESTING_NEW_ROOT,
 };
 
 struct btrfs_path;
-- 
2.24.1



* [PATCH 15/17] btrfs: change our extent buffer lock to a rw_semaphore
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (13 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 16/17] btrfs: remove all of the blocking helpers Josef Bacik
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree.  For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon.  At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.

However now the opposite is the case.  There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order.  This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases.  rw_semaphores actually do this properly, and also have
adaptive spinning that works relatively well.

The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.

So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything.  This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions.  The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree.  However on my test system which is an 80 core single
socket system with 256gib of RAM and a 2tib NVME drive I get the
following results (with all debug options off)

dbench 200 baseline
Throughput 216.056 MB/sec  200 clients  200 procs  max_latency=1471.197 ms

dbench 200 with patch
Throughput 737.188 MB/sec  200 clients  200 procs  max_latency=714.346 ms

Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there aren't
enough tasks to really stress the locking:

fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16

baseline
  Average Files/sec:     160166.7
  p50 Files/sec: 165832
  p90 Files/sec: 123886
  p99 Files/sec: 123495

  real    3m26.527s
  user    2m19.223s
  sys     48m21.856s

patched
  Average Files/sec:     164135.7
  p50 Files/sec: 171095
  p90 Files/sec: 122889
  p99 Files/sec: 113819

  real    3m29.660s
  user    2m19.990s
  sys     44m12.259s
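
What remains of the lock implementation after the conversion is a thin
wrapper around the rwsem; for example the read-side try-lock and the
write unlock reduce to the following (condensed from the hunks below,
tracepoints omitted):

  int btrfs_try_tree_read_lock(struct extent_buffer *eb)
  {
          if (down_read_trylock(&eb->lock)) {
                  eb->lock_owner = current->pid;
                  return 1;
          }
          return 0;
  }

  void btrfs_tree_unlock(struct extent_buffer *eb)
  {
          eb->lock_owner = 0;
          up_write(&eb->lock);
  }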

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c  |  13 +-
 fs/btrfs/extent_io.h  |  21 +--
 fs/btrfs/locking.c    | 378 +++++++-----------------------------------
 fs/btrfs/locking.h    |   2 +-
 fs/btrfs/print-tree.c |  11 +-
 5 files changed, 67 insertions(+), 358 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 16b4e7655a96..ab44e696c249 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4985,12 +4985,8 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	eb->len = len;
 	eb->fs_info = fs_info;
 	eb->bflags = 0;
-	rwlock_init(&eb->lock);
-	atomic_set(&eb->blocking_readers, 0);
-	eb->blocking_writers = 0;
+	init_rwsem(&eb->lock);
 	eb->lock_recursed = false;
-	init_waitqueue_head(&eb->write_lock_wq);
-	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list,
 			     &fs_info->allocated_ebs);
@@ -5006,13 +5002,6 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 		> MAX_INLINE_EXTENT_BUFFER_SIZE);
 	BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
 
-#ifdef CONFIG_BTRFS_DEBUG
-	eb->spinning_writers = 0;
-	atomic_set(&eb->spinning_readers, 0);
-	atomic_set(&eb->read_locks, 0);
-	eb->write_locks = 0;
-#endif
-
 	return eb;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 9e1e22f1586a..3f5dcc982fb8 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -99,31 +99,14 @@ struct extent_buffer {
 	int read_mirror;
 	struct rcu_head rcu_head;
 	pid_t lock_owner;
-
-	int blocking_writers;
-	atomic_t blocking_readers;
 	bool lock_recursed;
+	struct rw_semaphore lock;
+
 	/* >= 0 if eb belongs to a log tree, -1 otherwise */
 	short log_index;
 
-	/* protects write locks */
-	rwlock_t lock;
-
-	/* readers use lock_wq while they wait for the write
-	 * lock holders to unlock
-	 */
-	wait_queue_head_t write_lock_wq;
-
-	/* writers use read_lock_wq while they wait for readers
-	 * to unlock
-	 */
-	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
 #ifdef CONFIG_BTRFS_DEBUG
-	int spinning_writers;
-	atomic_t spinning_readers;
-	atomic_t read_locks;
-	int write_locks;
 	struct list_head leak_list;
 #endif
 };
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index 133a6dd5f29b..fa2f50241693 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -17,45 +17,8 @@
  * Extent buffer locking
  * =====================
  *
- * The locks use a custom scheme that allows to do more operations than are
- * available fromt current locking primitives. The building blocks are still
- * rwlock and wait queues.
- *
- * Required semantics:
- *
- * - reader/writer exclusion
- * - writer/writer exclusion
- * - reader/reader sharing
- * - spinning lock semantics
- * - blocking lock semantics
- * - try-lock semantics for readers and writers
- * - one level nesting, allowing read lock to be taken by the same thread that
- *   already has write lock
- *
- * The extent buffer locks (also called tree locks) manage access to eb data
- * related to the storage in the b-tree (keys, items, but not the individual
- * members of eb).
- * We want concurrency of many readers and safe updates. The underlying locking
- * is done by read-write spinlock and the blocking part is implemented using
- * counters and wait queues.
- *
- * spinning semantics - the low-level rwlock is held so all other threads that
- *                      want to take it are spinning on it.
- *
- * blocking semantics - the low-level rwlock is not held but the counter
- *                      denotes how many times the blocking lock was held;
- *                      sleeping is possible
- *
- * Write lock always allows only one thread to access the data.
- *
- *
- * Debugging
- * ---------
- *
- * There are additional state counters that are asserted in various contexts,
- * removed from non-debug build to reduce extent_buffer size and for
- * performance reasons.
- *
+ * We currently use a rw_semaphore for the tree locking, and the semantics are
+ * exactly the same.  The only exception is recursion, which is described below.
  *
  * Lock recursion
  * ------------
@@ -75,115 +38,8 @@
  *           btrfs_lookup_file_extent
  *             btrfs_search_slot
  *
- *
- * Locking pattern - spinning
- * --------------------------
- *
- * The simple locking scenario, the +--+ denotes the spinning section.
- *
- * +- btrfs_tree_lock
- * | - extent_buffer::rwlock is held
- * | - no heavy operations should happen, eg. IO, memory allocations, large
- * |   structure traversals
- * +- btrfs_tree_unock
-*
-*
- * Locking pattern - blocking
- * --------------------------
- *
- * The blocking write uses the following scheme.  The +--+ denotes the spinning
- * section.
- *
- * +- btrfs_tree_lock
- * |
- * +- btrfs_set_lock_blocking_write
- *
- *   - allowed: IO, memory allocations, etc.
- *
- * -- btrfs_tree_unlock - note, no explicit unblocking necessary
- *
- *
- * Blocking read is similar.
- *
- * +- btrfs_tree_read_lock
- * |
- * +- btrfs_set_lock_blocking_read
- *
- *  - heavy operations allowed
- *
- * +- btrfs_tree_read_unlock_blocking
- * |
- * +- btrfs_tree_read_unlock
- *
  */
 
-#ifdef CONFIG_BTRFS_DEBUG
-static inline void btrfs_assert_spinning_writers_get(struct extent_buffer *eb)
-{
-	WARN_ON(eb->spinning_writers);
-	eb->spinning_writers++;
-}
-
-static inline void btrfs_assert_spinning_writers_put(struct extent_buffer *eb)
-{
-	WARN_ON(eb->spinning_writers != 1);
-	eb->spinning_writers--;
-}
-
-static inline void btrfs_assert_no_spinning_writers(struct extent_buffer *eb)
-{
-	WARN_ON(eb->spinning_writers);
-}
-
-static inline void btrfs_assert_spinning_readers_get(struct extent_buffer *eb)
-{
-	atomic_inc(&eb->spinning_readers);
-}
-
-static inline void btrfs_assert_spinning_readers_put(struct extent_buffer *eb)
-{
-	WARN_ON(atomic_read(&eb->spinning_readers) == 0);
-	atomic_dec(&eb->spinning_readers);
-}
-
-static inline void btrfs_assert_tree_read_locks_get(struct extent_buffer *eb)
-{
-	atomic_inc(&eb->read_locks);
-}
-
-static inline void btrfs_assert_tree_read_locks_put(struct extent_buffer *eb)
-{
-	atomic_dec(&eb->read_locks);
-}
-
-static inline void btrfs_assert_tree_read_locked(struct extent_buffer *eb)
-{
-	BUG_ON(!atomic_read(&eb->read_locks));
-}
-
-static inline void btrfs_assert_tree_write_locks_get(struct extent_buffer *eb)
-{
-	eb->write_locks++;
-}
-
-static inline void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb)
-{
-	eb->write_locks--;
-}
-
-#else
-static void btrfs_assert_spinning_writers_get(struct extent_buffer *eb) { }
-static void btrfs_assert_spinning_writers_put(struct extent_buffer *eb) { }
-static void btrfs_assert_no_spinning_writers(struct extent_buffer *eb) { }
-static void btrfs_assert_spinning_readers_put(struct extent_buffer *eb) { }
-static void btrfs_assert_spinning_readers_get(struct extent_buffer *eb) { }
-static void btrfs_assert_tree_read_locked(struct extent_buffer *eb) { }
-static void btrfs_assert_tree_read_locks_get(struct extent_buffer *eb) { }
-static void btrfs_assert_tree_read_locks_put(struct extent_buffer *eb) { }
-static void btrfs_assert_tree_write_locks_get(struct extent_buffer *eb) { }
-static void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb) { }
-#endif
-
 /*
  * Mark already held read lock as blocking. Can be nested in write lock by the
  * same thread.
@@ -195,18 +51,6 @@ static void btrfs_assert_tree_write_locks_put(struct extent_buffer *eb) { }
  */
 void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
 {
-	trace_btrfs_set_lock_blocking_read(eb);
-	/*
-	 * No lock is required.  The lock owner may change if we have a read
-	 * lock, but it won't change to or away from us.  If we have the write
-	 * lock, we are the owner and it'll never change.
-	 */
-	if (eb->lock_recursed && current->pid == eb->lock_owner)
-		return;
-	btrfs_assert_tree_read_locked(eb);
-	atomic_inc(&eb->blocking_readers);
-	btrfs_assert_spinning_readers_put(eb);
-	read_unlock(&eb->lock);
 }
 
 /*
@@ -219,30 +63,20 @@ void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
  */
 void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
 {
-	trace_btrfs_set_lock_blocking_write(eb);
-	/*
-	 * No lock is required.  The lock owner may change if we have a read
-	 * lock, but it won't change to or away from us.  If we have the write
-	 * lock, we are the owner and it'll never change.
-	 */
-	if (eb->lock_recursed && current->pid == eb->lock_owner)
-		return;
-	if (eb->blocking_writers == 0) {
-		btrfs_assert_spinning_writers_put(eb);
-		btrfs_assert_tree_locked(eb);
-		WRITE_ONCE(eb->blocking_writers, 1);
-		write_unlock(&eb->lock);
-	}
 }
 
 /*
- * Lock the extent buffer for read. Wait for any writers (spinning or blocking).
- * Can be nested in write lock by the same thread.
+ * __btrfs_tree_read_lock: Lock the extent buffer for read.
+ * @eb - the eb to be locked.
+ * @nest - the nesting level to be used for lockdep.
+ * @recurse - if this lock is able to be recursed.
  *
- * Use when the locked section does only lightweight actions and busy waiting
- * would be cheaper than making other threads do the wait/wake loop.
+ * This takes the read lock on the extent buffer, using the specified nesting
+ * level for lockdep purposes.
  *
- * The rwlock is held upon exit.
+ * If you specify recurse = true, then we will allow this to be taken if we
+ * currently own the lock already.  This should only be used in specific
+ * usecases, and the subsequent unlock will not change the state of the lock.
  */
 void __btrfs_tree_read_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest,
 			    bool recurse)
@@ -251,33 +85,33 @@ void __btrfs_tree_read_lock(struct extent_buffer *eb, enum btrfs_lock_nesting ne
 
 	if (trace_btrfs_tree_read_lock_enabled())
 		start_ns = ktime_get_ns();
-again:
-	read_lock(&eb->lock);
-	BUG_ON(eb->blocking_writers == 0 &&
-	       current->pid == eb->lock_owner);
-	if (eb->blocking_writers) {
-		if (current->pid == eb->lock_owner) {
-			/*
-			 * This extent is already write-locked by our thread.
-			 * We allow an additional read lock to be added because
-			 * it's for the same thread. btrfs_find_all_roots()
-			 * depends on this as it may be called on a partly
-			 * (write-)locked tree.
-			 */
-			WARN_ON(!recurse);
-			BUG_ON(eb->lock_recursed);
-			eb->lock_recursed = true;
-			read_unlock(&eb->lock);
-			trace_btrfs_tree_read_lock(eb, start_ns);
-			return;
+
+	if (unlikely(recurse)) {
+		/* First see if we can grab the lock outright. */
+		if (down_read_trylock(&eb->lock))
+			goto out;
+
+		/*
+		 * Ok still doesn't necessarily mean we are already holding the
+		 * lock, check the owner.
+		 */
+		if (eb->lock_owner != current->pid) {
+			down_read_nested(&eb->lock, nest);
+			goto out;
 		}
-		read_unlock(&eb->lock);
-		wait_event(eb->write_lock_wq,
-			   READ_ONCE(eb->blocking_writers) == 0);
-		goto again;
+
+		/*
+		 * Ok we have actually recursed, but we should only be recursing
+		 * once, so blow up if we're already recursed, otherwise set
+		 * ->lock_recursed and carry on.
+		 */
+		BUG_ON(eb->lock_recursed);
+		eb->lock_recursed = true;
+		goto out;
 	}
-	btrfs_assert_tree_read_locks_get(eb);
-	btrfs_assert_spinning_readers_get(eb);
+	down_read_nested(&eb->lock, nest);
+out:
+	eb->lock_owner = current->pid;
 	trace_btrfs_tree_read_lock(eb, start_ns);
 }
 
@@ -294,74 +128,42 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
  */
 int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
 {
-	if (READ_ONCE(eb->blocking_writers))
-		return 0;
-
-	read_lock(&eb->lock);
-	/* Refetch value after lock */
-	if (READ_ONCE(eb->blocking_writers)) {
-		read_unlock(&eb->lock);
-		return 0;
-	}
-	btrfs_assert_tree_read_locks_get(eb);
-	btrfs_assert_spinning_readers_get(eb);
-	trace_btrfs_tree_read_lock_atomic(eb);
-	return 1;
+	return btrfs_try_tree_read_lock(eb);
 }
 
 /*
- * Try-lock for read. Don't block or wait for contending writers.
+ * Try-lock for read.
  *
  * Retrun 1 if the rwlock has been taken, 0 otherwise
  */
 int btrfs_try_tree_read_lock(struct extent_buffer *eb)
 {
-	if (READ_ONCE(eb->blocking_writers))
-		return 0;
-
-	if (!read_trylock(&eb->lock))
-		return 0;
-
-	/* Refetch value after lock */
-	if (READ_ONCE(eb->blocking_writers)) {
-		read_unlock(&eb->lock);
-		return 0;
+	if (down_read_trylock(&eb->lock)) {
+		eb->lock_owner = current->pid;
+		trace_btrfs_try_tree_read_lock(eb);
+		return 1;
 	}
-	btrfs_assert_tree_read_locks_get(eb);
-	btrfs_assert_spinning_readers_get(eb);
-	trace_btrfs_try_tree_read_lock(eb);
-	return 1;
+	return 0;
 }
 
 /*
- * Try-lock for write. May block until the lock is uncontended, but does not
- * wait until it is free.
+ * Try-lock for write.
  *
  * Retrun 1 if the rwlock has been taken, 0 otherwise
  */
 int btrfs_try_tree_write_lock(struct extent_buffer *eb)
 {
-	if (READ_ONCE(eb->blocking_writers) || atomic_read(&eb->blocking_readers))
-		return 0;
-
-	write_lock(&eb->lock);
-	/* Refetch value after lock */
-	if (READ_ONCE(eb->blocking_writers) || atomic_read(&eb->blocking_readers)) {
-		write_unlock(&eb->lock);
-		return 0;
+	if (down_write_trylock(&eb->lock)) {
+		eb->lock_owner = current->pid;
+		trace_btrfs_try_tree_write_lock(eb);
+		return 1;
 	}
-	btrfs_assert_tree_write_locks_get(eb);
-	btrfs_assert_spinning_writers_get(eb);
-	eb->lock_owner = current->pid;
-	trace_btrfs_try_tree_write_lock(eb);
-	return 1;
+	return 0;
 }
 
 /*
- * Release read lock. Must be used only if the lock is in spinning mode.  If
- * the read lock is nested, must pair with read lock before the write unlock.
- *
- * The rwlock is not held upon exit.
+ * Release read lock.  If the read lock was recursed then the lock stays in the
+ * original state that it was before it was recursively locked.
  */
 void btrfs_tree_read_unlock(struct extent_buffer *eb)
 {
@@ -376,10 +178,8 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
 		eb->lock_recursed = false;
 		return;
 	}
-	btrfs_assert_tree_read_locked(eb);
-	btrfs_assert_spinning_readers_put(eb);
-	btrfs_assert_tree_read_locks_put(eb);
-	read_unlock(&eb->lock);
+	eb->lock_owner = 0;
+	up_read(&eb->lock);
 }
 
 /*
@@ -391,30 +191,15 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
  */
 void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
 {
-	trace_btrfs_tree_read_unlock_blocking(eb);
-	/*
-	 * if we're nested, we have the write lock.  No new locking
-	 * is needed as long as we are the lock owner.
-	 * The write unlock will do a barrier for us, and the lock_recursed
-	 * field only matters to the lock owner.
-	 */
-	if (eb->lock_recursed && current->pid == eb->lock_owner) {
-		eb->lock_recursed = false;
-		return;
-	}
-	btrfs_assert_tree_read_locked(eb);
-	WARN_ON(atomic_read(&eb->blocking_readers) == 0);
-	/* atomic_dec_and_test implies a barrier */
-	if (atomic_dec_and_test(&eb->blocking_readers))
-		cond_wake_up_nomb(&eb->read_lock_wq);
-	btrfs_assert_tree_read_locks_put(eb);
+	btrfs_tree_read_unlock(eb);
 }
 
 /*
- * Lock for write. Wait for all blocking and spinning readers and writers. This
- * starts context where reader lock could be nested by the same thread.
+ * __btrfs_tree_lock: Lock for write.
+ * @eb - the eb to lock.
+ * @nest - the nesting to use for the lock.
  *
- * The rwlock is held for write upon exit.
+ * Returns with the eb->lock write locked.
  */
 void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest)
 	__acquires(&eb->lock)
@@ -424,19 +209,7 @@ void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest)
 	if (trace_btrfs_tree_lock_enabled())
 		start_ns = ktime_get_ns();
 
-	WARN_ON(eb->lock_owner == current->pid);
-again:
-	wait_event(eb->read_lock_wq, atomic_read(&eb->blocking_readers) == 0);
-	wait_event(eb->write_lock_wq, READ_ONCE(eb->blocking_writers) == 0);
-	write_lock(&eb->lock);
-	/* Refetch value after lock */
-	if (atomic_read(&eb->blocking_readers) ||
-	    READ_ONCE(eb->blocking_writers)) {
-		write_unlock(&eb->lock);
-		goto again;
-	}
-	btrfs_assert_spinning_writers_get(eb);
-	btrfs_assert_tree_write_locks_get(eb);
+	down_write_nested(&eb->lock, nest);
 	eb->lock_owner = current->pid;
 	trace_btrfs_tree_lock(eb, start_ns);
 }
@@ -447,42 +220,13 @@ void btrfs_tree_lock(struct extent_buffer *eb)
 }
 
 /*
- * Release the write lock, either blocking or spinning (ie. there's no need
- * for an explicit blocking unlock, like btrfs_tree_read_unlock_blocking).
- * This also ends the context for nesting, the read lock must have been
- * released already.
- *
- * Tasks blocked and waiting are woken, rwlock is not held upon exit.
+ * Release the write lock.
  */
 void btrfs_tree_unlock(struct extent_buffer *eb)
 {
-	/*
-	 * This is read both locked and unlocked but always by the same thread
-	 * that already owns the lock so we don't need to use READ_ONCE
-	 */
-	int blockers = eb->blocking_writers;
-
-	BUG_ON(blockers > 1);
-
-	btrfs_assert_tree_locked(eb);
 	trace_btrfs_tree_unlock(eb);
 	eb->lock_owner = 0;
-	btrfs_assert_tree_write_locks_put(eb);
-
-	if (blockers) {
-		btrfs_assert_no_spinning_writers(eb);
-		/* Unlocked write */
-		WRITE_ONCE(eb->blocking_writers, 0);
-		/*
-		 * We need to order modifying blocking_writers above with
-		 * actually waking up the sleepers to ensure they see the
-		 * updated value of blocking_writers
-		 */
-		cond_wake_up(&eb->write_lock_wq);
-	} else {
-		btrfs_assert_spinning_writers_put(eb);
-		write_unlock(&eb->lock);
-	}
+	up_write(&eb->lock);
 }
 
 /*
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index c845a0c07a1c..babb8e23e16e 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -99,7 +99,7 @@ btrfs_read_lock_root_node(struct btrfs_root *root)
 
 #ifdef CONFIG_BTRFS_DEBUG
 static inline void btrfs_assert_tree_locked(struct extent_buffer *eb) {
-	BUG_ON(!eb->write_locks);
+	lockdep_assert_held(&eb->lock);
 }
 #else
 static inline void btrfs_assert_tree_locked(struct extent_buffer *eb) { }
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
index 61f44e78e3c9..6668dbe2b9d2 100644
--- a/fs/btrfs/print-tree.c
+++ b/fs/btrfs/print-tree.c
@@ -151,15 +151,8 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
 static void print_eb_refs_lock(struct extent_buffer *eb)
 {
 #ifdef CONFIG_BTRFS_DEBUG
-	btrfs_info(eb->fs_info,
-"refs %u lock (w:%d r:%d bw:%d br:%d sw:%d sr:%d) lock_owner %u current %u",
-		   atomic_read(&eb->refs), eb->write_locks,
-		   atomic_read(&eb->read_locks),
-		   eb->blocking_writers,
-		   atomic_read(&eb->blocking_readers),
-		   eb->spinning_writers,
-		   atomic_read(&eb->spinning_readers),
-		   eb->lock_owner, current->pid);
+	btrfs_info(eb->fs_info, "refs %u lock_owner %u current %u",
+		   atomic_read(&eb->refs), eb->lock_owner, current->pid);
 #endif
 }
 
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 16/17] btrfs: remove all of the blocking helpers
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (14 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 15/17] btrfs: change our extent buffer lock to a rw_semaphore Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-10 15:42 ` [PATCH 17/17] btrfs: rip out path->leave_spinning Josef Bacik
  2020-08-13 14:26 ` [PATCH 00/17] Convert to an rwsem for our tree locking David Sterba
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning.  Remove these helpers and all the places they are
called.
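
For illustration (do_sleeping_work() below is just a stand-in for whatever the
caller does while holding the lock, the locking helpers are the real ones this
series touches), a typical read side goes from the old convert-to-blocking
dance to a plain lock/unlock pair:

	/* before: spin lock, flip to blocking around work that may sleep */
	btrfs_tree_read_lock(eb);
	btrfs_set_lock_blocking_read(eb);
	ret = do_sleeping_work(eb);
	btrfs_tree_read_unlock_blocking(eb);

	/* after: the rwsem can simply be held across sleeping sections */
	btrfs_tree_read_lock(eb);
	ret = do_sleeping_work(eb);
	btrfs_tree_read_unlock(eb);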

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/backref.c       | 10 ++---
 fs/btrfs/ctree.c         | 91 ++++++----------------------------------
 fs/btrfs/delayed-inode.c |  7 ----
 fs/btrfs/disk-io.c       |  8 +---
 fs/btrfs/extent-tree.c   | 19 +++------
 fs/btrfs/file.c          |  3 +-
 fs/btrfs/inode.c         |  1 -
 fs/btrfs/locking.c       | 74 --------------------------------
 fs/btrfs/locking.h       | 11 +----
 fs/btrfs/qgroup.c        |  9 ++--
 fs/btrfs/ref-verify.c    |  6 +--
 fs/btrfs/relocation.c    |  4 --
 fs/btrfs/transaction.c   |  2 -
 fs/btrfs/tree-defrag.c   |  1 -
 fs/btrfs/tree-log.c      |  3 --
 15 files changed, 30 insertions(+), 219 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index ea10f7bc99ab..5b994d4377d6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1330,14 +1330,12 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 					goto out;
 				}
 
-				if (!path->skip_locking) {
+				if (!path->skip_locking)
 					btrfs_tree_read_lock(eb);
-					btrfs_set_lock_blocking_read(eb);
-				}
 				ret = find_extent_in_eb(eb, bytenr,
 							*extent_item_pos, &eie, ignore_offset);
 				if (!path->skip_locking)
-					btrfs_tree_read_unlock_blocking(eb);
+					btrfs_tree_read_unlock(eb);
 				free_extent_buffer(eb);
 				if (ret < 0)
 					goto out;
@@ -1674,7 +1672,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 					   name_off, name_len);
 		if (eb != eb_in) {
 			if (!path->skip_locking)
-				btrfs_tree_read_unlock_blocking(eb);
+				btrfs_tree_read_unlock(eb);
 			free_extent_buffer(eb);
 		}
 		ret = btrfs_find_item(fs_root, path, parent, 0,
@@ -1694,8 +1692,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
 		if (eb != eb_in) {
-			if (!path->skip_locking)
-				btrfs_set_lock_blocking_read(eb);
 			path->nodes[0] = NULL;
 			path->locks[0] = 0;
 		}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index b0d6d86f449d..45215eaee492 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1272,14 +1272,11 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 	if (!tm)
 		return eb;
 
-	btrfs_set_path_blocking(path);
-	btrfs_set_lock_blocking_read(eb);
-
 	if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
 		BUG_ON(tm->slot != 0);
 		eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
 		if (!eb_rewin) {
-			btrfs_tree_read_unlock_blocking(eb);
+			btrfs_tree_read_unlock(eb);
 			free_extent_buffer(eb);
 			return NULL;
 		}
@@ -1291,13 +1288,13 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 	} else {
 		eb_rewin = btrfs_clone_extent_buffer(eb);
 		if (!eb_rewin) {
-			btrfs_tree_read_unlock_blocking(eb);
+			btrfs_tree_read_unlock(eb);
 			free_extent_buffer(eb);
 			return NULL;
 		}
 	}
 
-	btrfs_tree_read_unlock_blocking(eb);
+	btrfs_tree_read_unlock(eb);
 	free_extent_buffer(eb);
 
 	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb_rewin),
@@ -1367,9 +1364,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 		free_extent_buffer(eb_root);
 		eb = alloc_dummy_extent_buffer(fs_info, logical);
 	} else {
-		btrfs_set_lock_blocking_read(eb_root);
 		eb = btrfs_clone_extent_buffer(eb_root);
-		btrfs_tree_read_unlock_blocking(eb_root);
+		btrfs_tree_read_unlock(eb_root);
 		free_extent_buffer(eb_root);
 	}
 
@@ -1477,10 +1473,6 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
 
 	search_start = buf->start & ~((u64)SZ_1G - 1);
 
-	if (parent)
-		btrfs_set_lock_blocking_write(parent);
-	btrfs_set_lock_blocking_write(buf);
-
 	/*
 	 * Before CoWing this block for later modification, check if it's
 	 * the subtree root and do the delayed subtree trace if needed.
@@ -1598,8 +1590,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 	if (parent_nritems <= 1)
 		return 0;
 
-	btrfs_set_lock_blocking_write(parent);
-
 	for (i = start_slot; i <= end_slot; i++) {
 		struct btrfs_key first_key;
 		int close = 1;
@@ -1657,7 +1647,6 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans,
 			search_start = last_block;
 
 		btrfs_tree_lock(cur);
-		btrfs_set_lock_blocking_write(cur);
 		err = __btrfs_cow_block(trans, root, cur, parent, i,
 					&cur, search_start,
 					min(16 * blocksize,
@@ -1829,8 +1818,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 	mid = path->nodes[level];
 
-	WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK &&
-		path->locks[level] != BTRFS_WRITE_LOCK_BLOCKING);
+	WARN_ON(path->locks[level] != BTRFS_WRITE_LOCK);
 	WARN_ON(btrfs_header_generation(mid) != trans->transid);
 
 	orig_ptr = btrfs_node_blockptr(mid, orig_slot);
@@ -1859,7 +1847,6 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		}
 
 		btrfs_tree_lock(child);
-		btrfs_set_lock_blocking_write(child);
 		ret = btrfs_cow_block(trans, root, child, mid, 0, &child,
 				      BTRFS_NESTING_COW);
 		if (ret) {
@@ -1898,7 +1885,6 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 	if (left) {
 		__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
-		btrfs_set_lock_blocking_write(left);
 		wret = btrfs_cow_block(trans, root, left,
 				       parent, pslot - 1, &left,
 				       BTRFS_NESTING_LEFT_COW);
@@ -1914,7 +1900,6 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 
 	if (right) {
 		__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
-		btrfs_set_lock_blocking_write(right);
 		wret = btrfs_cow_block(trans, root, right,
 				       parent, pslot + 1, &right,
 				       BTRFS_NESTING_RIGHT_COW);
@@ -2078,7 +2063,6 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		u32 left_nr;
 
 		__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
-		btrfs_set_lock_blocking_write(left);
 
 		left_nr = btrfs_header_nritems(left);
 		if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 1) {
@@ -2133,7 +2117,6 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 		u32 right_nr;
 
 		__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
-		btrfs_set_lock_blocking_write(right);
 
 		right_nr = btrfs_header_nritems(right);
 		if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(fs_info) - 1) {
@@ -2393,14 +2376,6 @@ read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
 			return 0;
 		}
 
-		/* the pages were up to date, but we failed
-		 * the generation number check.  Do a full
-		 * read for the generation number that is correct.
-		 * We must do this without dropping locks so
-		 * we can trust our generation number
-		 */
-		btrfs_set_path_blocking(p);
-
 		/* now we're allowed to do a blocking uptodate check */
 		ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key);
 		if (!ret) {
@@ -2420,7 +2395,6 @@ read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
 	 * out which blocks to read.
 	 */
 	btrfs_unlock_up_safe(p, level + 1);
-	btrfs_set_path_blocking(p);
 
 	if (p->reada != READA_NONE)
 		reada_for_search(fs_info, p, level, slot, key->objectid);
@@ -2474,7 +2448,6 @@ setup_nodes_for_search(struct btrfs_trans_handle *trans,
 			goto again;
 		}
 
-		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = split_node(trans, root, p, level);
 
@@ -2494,7 +2467,6 @@ setup_nodes_for_search(struct btrfs_trans_handle *trans,
 			goto again;
 		}
 
-		btrfs_set_path_blocking(p);
 		reada_for_balance(fs_info, p, level);
 		sret = balance_level(trans, root, p, level);
 
@@ -2746,7 +2718,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 				goto again;
 			}
 
-			btrfs_set_path_blocking(p);
 			if (last_level)
 				err = btrfs_cow_block(trans, root, b, NULL, 0,
 						      &b,
@@ -2816,7 +2787,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 					goto again;
 				}
 
-				btrfs_set_path_blocking(p);
 				err = split_leaf(trans, root, key,
 						 p, ins_len, ret == 0);
 
@@ -2878,17 +2848,11 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		if (!p->skip_locking) {
 			level = btrfs_header_level(b);
 			if (level <= write_lock_level) {
-				if (!btrfs_try_tree_write_lock(b)) {
-					btrfs_set_path_blocking(p);
-					btrfs_tree_lock(b);
-				}
+				btrfs_tree_lock(b);
 				p->locks[level] = BTRFS_WRITE_LOCK;
 			} else {
-				if (!btrfs_tree_read_lock_atomic(b)) {
-					btrfs_set_path_blocking(p);
-					__btrfs_tree_read_lock(b, BTRFS_NESTING_NORMAL,
-							       p->recurse);
-				}
+				__btrfs_tree_read_lock(b, BTRFS_NESTING_NORMAL,
+						       p->recurse);
 				p->locks[level] = BTRFS_READ_LOCK;
 			}
 			p->nodes[level] = b;
@@ -2896,12 +2860,6 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	}
 	ret = 1;
 done:
-	/*
-	 * we don't really know what they plan on doing with the path
-	 * from here on, so for now just mark it as blocking
-	 */
-	if (!p->leave_spinning)
-		btrfs_set_path_blocking(p);
 	if (ret < 0 && !p->skip_release_on_error)
 		btrfs_release_path(p);
 	return ret;
@@ -2993,10 +2951,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 		}
 
 		level = btrfs_header_level(b);
-		if (!btrfs_tree_read_lock_atomic(b)) {
-			btrfs_set_path_blocking(p);
-			btrfs_tree_read_lock(b);
-		}
+		btrfs_tree_read_lock(b);
 		b = tree_mod_log_rewind(fs_info, p, b, time_seq);
 		if (!b) {
 			ret = -ENOMEM;
@@ -3007,8 +2962,6 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 	}
 	ret = 1;
 done:
-	if (!p->leave_spinning)
-		btrfs_set_path_blocking(p);
 	if (ret < 0)
 		btrfs_release_path(p);
 
@@ -3371,7 +3324,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 	add_root_to_dirty_list(root);
 	atomic_inc(&c->refs);
 	path->nodes[level] = c;
-	path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+	path->locks[level] = BTRFS_WRITE_LOCK;
 	path->slots[level] = 0;
 	return 0;
 }
@@ -3744,7 +3697,6 @@ static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
 		return 1;
 
 	__btrfs_tree_lock(right, BTRFS_NESTING_RIGHT);
-	btrfs_set_lock_blocking_write(right);
 
 	free_space = btrfs_leaf_free_space(right);
 	if (free_space < data_size)
@@ -3977,7 +3929,6 @@ static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
 		return 1;
 
 	__btrfs_tree_lock(left, BTRFS_NESTING_LEFT);
-	btrfs_set_lock_blocking_write(left);
 
 	free_space = btrfs_leaf_free_space(left);
 	if (free_space < data_size) {
@@ -4358,7 +4309,6 @@ static noinline int setup_leaf_for_split(struct btrfs_trans_handle *trans,
 			goto err;
 	}
 
-	btrfs_set_path_blocking(path);
 	ret = split_leaf(trans, root, &key, path, ins_len, 1);
 	if (ret)
 		goto err;
@@ -4388,8 +4338,6 @@ static noinline int split_item(struct btrfs_path *path,
 	leaf = path->nodes[0];
 	BUG_ON(btrfs_leaf_free_space(leaf) < sizeof(struct btrfs_item));
 
-	btrfs_set_path_blocking(path);
-
 	item = btrfs_item_nr(path->slots[0]);
 	orig_offset = btrfs_item_offset(leaf, item);
 	item_size = btrfs_item_size(leaf, item);
@@ -4955,7 +4903,6 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		if (leaf == root->node) {
 			btrfs_set_header_level(leaf, 0);
 		} else {
-			btrfs_set_path_blocking(path);
 			btrfs_clean_tree_block(leaf);
 			btrfs_del_leaf(trans, root, path, leaf);
 		}
@@ -4977,7 +4924,6 @@ int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 			slot = path->slots[1];
 			atomic_inc(&leaf->refs);
 
-			btrfs_set_path_blocking(path);
 			wret = push_leaf_left(trans, root, path, 1, 1,
 					      1, (u32)-1);
 			if (wret < 0 && wret != -ENOSPC)
@@ -5148,7 +5094,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 		 */
 		if (slot >= nritems) {
 			path->slots[level] = slot;
-			btrfs_set_path_blocking(path);
 			sret = btrfs_find_next_key(root, path, min_key, level,
 						  min_trans);
 			if (sret == 0) {
@@ -5165,7 +5110,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 			ret = 0;
 			goto out;
 		}
-		btrfs_set_path_blocking(path);
 		cur = btrfs_read_node_slot(cur, slot);
 		if (IS_ERR(cur)) {
 			ret = PTR_ERR(cur);
@@ -5182,7 +5126,6 @@ int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
 	path->keep_locks = keep_locks;
 	if (ret == 0) {
 		btrfs_unlock_up_safe(path, path->lowest_level + 1);
-		btrfs_set_path_blocking(path);
 		memcpy(min_key, &found_key, sizeof(found_key));
 	}
 	return ret;
@@ -5392,7 +5335,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 				goto again;
 			}
 			if (!ret) {
-				btrfs_set_path_blocking(path);
 				__btrfs_tree_read_lock(next,
 						       BTRFS_NESTING_RIGHT,
 						       path->recurse);
@@ -5427,13 +5369,8 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 		}
 
 		if (!path->skip_locking) {
-			ret = btrfs_try_tree_read_lock(next);
-			if (!ret) {
-				btrfs_set_path_blocking(path);
-				__btrfs_tree_read_lock(next,
-						       BTRFS_NESTING_RIGHT,
-						       path->recurse);
-			}
+			__btrfs_tree_read_lock(next, BTRFS_NESTING_RIGHT,
+					       path->recurse);
 			next_rw_lock = BTRFS_READ_LOCK;
 		}
 	}
@@ -5441,8 +5378,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 done:
 	unlock_up(path, 0, 1, 0, NULL);
 	path->leave_spinning = old_spinning;
-	if (!old_spinning)
-		btrfs_set_path_blocking(path);
 
 	return ret;
 }
@@ -5464,7 +5399,6 @@ int btrfs_previous_item(struct btrfs_root *root,
 
 	while (1) {
 		if (path->slots[0] == 0) {
-			btrfs_set_path_blocking(path);
 			ret = btrfs_prev_leaf(root, path);
 			if (ret != 0)
 				return ret;
@@ -5506,7 +5440,6 @@ int btrfs_previous_extent_item(struct btrfs_root *root,
 
 	while (1) {
 		if (path->slots[0] == 0) {
-			btrfs_set_path_blocking(path);
 			ret = btrfs_prev_leaf(root, path);
 			if (ret != 0)
 				return ret;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index bf1595a42a98..b00b7ed47fc0 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -741,13 +741,6 @@ static int btrfs_batch_insert_items(struct btrfs_root *root,
 		goto out;
 	}
 
-	/*
-	 * we need allocate some memory space, but it might cause the task
-	 * to sleep, so we set all locked nodes in the path to blocking locks
-	 * first.
-	 */
-	btrfs_set_path_blocking(path);
-
 	keys = kmalloc_array(nitems, sizeof(struct btrfs_key), GFP_NOFS);
 	if (!keys) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e618127c3d89..c53a0f5d8c27 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -296,10 +296,8 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 	if (atomic)
 		return -EAGAIN;
 
-	if (need_lock) {
+	if (need_lock)
 		btrfs_tree_read_lock(eb);
-		btrfs_set_lock_blocking_read(eb);
-	}
 
 	lock_extent_bits(io_tree, eb->start, eb->start + eb->len - 1,
 			 &cached_state);
@@ -328,7 +326,7 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 	unlock_extent_cached(io_tree, eb->start, eb->start + eb->len - 1,
 			     &cached_state);
 	if (need_lock)
-		btrfs_tree_read_unlock_blocking(eb);
+		btrfs_tree_read_unlock(eb);
 	return ret;
 }
 
@@ -1071,8 +1069,6 @@ void btrfs_clean_tree_block(struct extent_buffer *buf)
 			percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 						 -buf->len,
 						 fs_info->dirty_metadata_batch);
-			/* ugh, clear_extent_buffer_dirty needs to lock the page */
-			btrfs_set_lock_blocking_write(buf);
 			clear_extent_buffer_dirty(buf);
 		}
 	}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e5baf90f1174..548ee1742987 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4528,7 +4528,6 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
 
-	btrfs_set_lock_blocking_write(buf);
 	set_extent_buffer_uptodate(buf);
 
 	memzero_extent_buffer(buf, 0, sizeof(struct btrfs_header));
@@ -4917,7 +4916,6 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 		reada = 1;
 	}
 	btrfs_tree_lock(next);
-	btrfs_set_lock_blocking_write(next);
 
 	ret = btrfs_lookup_extent_info(trans, fs_info, bytenr, level - 1, 1,
 				       &wc->refs[level - 1],
@@ -4977,7 +4975,6 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 			return -EIO;
 		}
 		btrfs_tree_lock(next);
-		btrfs_set_lock_blocking_write(next);
 	}
 
 	level--;
@@ -4989,7 +4986,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 	}
 	path->nodes[level] = next;
 	path->slots[level] = 0;
-	path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+	path->locks[level] = BTRFS_WRITE_LOCK;
 	wc->level = level;
 	if (wc->level == 1)
 		wc->reada_slot = 0;
@@ -5117,8 +5114,7 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
 		if (!path->locks[level]) {
 			BUG_ON(level == 0);
 			btrfs_tree_lock(eb);
-			btrfs_set_lock_blocking_write(eb);
-			path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+			path->locks[level] = BTRFS_WRITE_LOCK;
 
 			ret = btrfs_lookup_extent_info(trans, fs_info,
 						       eb->start, level, 1,
@@ -5161,8 +5157,7 @@ static noinline int walk_up_proc(struct btrfs_trans_handle *trans,
 		if (!path->locks[level] &&
 		    btrfs_header_generation(eb) == trans->transid) {
 			btrfs_tree_lock(eb);
-			btrfs_set_lock_blocking_write(eb);
-			path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+			path->locks[level] = BTRFS_WRITE_LOCK;
 		}
 		btrfs_clean_tree_block(eb);
 	}
@@ -5330,9 +5325,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
 		level = btrfs_header_level(root->node);
 		path->nodes[level] = btrfs_lock_root_node(root);
-		btrfs_set_lock_blocking_write(path->nodes[level]);
 		path->slots[level] = 0;
-		path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+		path->locks[level] = BTRFS_WRITE_LOCK;
 		memset(&wc->update_progress, 0,
 		       sizeof(wc->update_progress));
 	} else {
@@ -5360,8 +5354,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		level = btrfs_header_level(root->node);
 		while (1) {
 			btrfs_tree_lock(path->nodes[level]);
-			btrfs_set_lock_blocking_write(path->nodes[level]);
-			path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+			path->locks[level] = BTRFS_WRITE_LOCK;
 
 			ret = btrfs_lookup_extent_info(trans, fs_info,
 						path->nodes[level]->start,
@@ -5548,7 +5541,7 @@ int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
 	level = btrfs_header_level(node);
 	path->nodes[level] = node;
 	path->slots[level] = 0;
-	path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;
+	path->locks[level] = BTRFS_WRITE_LOCK;
 
 	wc->refs[parent_level] = 1;
 	wc->flags[parent_level] = BTRFS_BLOCK_FLAG_FULL_BACKREF;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 841c516079a9..67aacc548857 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1042,8 +1042,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	 * write lock.
 	 */
 	if (!ret && replace_extent && leafs_visited == 1 &&
-	    (path->locks[0] == BTRFS_WRITE_LOCK_BLOCKING ||
-	     path->locks[0] == BTRFS_WRITE_LOCK) &&
+	    path->locks[0] == BTRFS_WRITE_LOCK &&
 	    btrfs_leaf_free_space(leaf) >=
 	    sizeof(struct btrfs_item) + extent_item_size) {
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6c547aa905d7..ca242de7cb69 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6694,7 +6694,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 		em->orig_start = em->start;
 		ptr = btrfs_file_extent_inline_start(item) + extent_offset;
 
-		btrfs_set_path_blocking(path);
 		if (!PageUptodate(page)) {
 			if (btrfs_file_extent_compression(leaf, item) !=
 			    BTRFS_COMPRESS_NONE) {
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
index fa2f50241693..6e4fd4335d04 100644
--- a/fs/btrfs/locking.c
+++ b/fs/btrfs/locking.c
@@ -40,31 +40,6 @@
  *
  */
 
-/*
- * Mark already held read lock as blocking. Can be nested in write lock by the
- * same thread.
- *
- * Use when there are potentially long operations ahead so other thread waiting
- * on the lock will not actively spin but sleep instead.
- *
- * The rwlock is released and blocking reader counter is increased.
- */
-void btrfs_set_lock_blocking_read(struct extent_buffer *eb)
-{
-}
-
-/*
- * Mark already held write lock as blocking.
- *
- * Use when there are potentially long operations ahead so other threads
- * waiting on the lock will not actively spin but sleep instead.
- *
- * The rwlock is released and blocking writers is set.
- */
-void btrfs_set_lock_blocking_write(struct extent_buffer *eb)
-{
-}
-
 /*
  * __btrfs_tree_read_lock: Lock the extent buffer for read.
  * @eb - the eb to be locked.
@@ -120,17 +95,6 @@ void btrfs_tree_read_lock(struct extent_buffer *eb)
 	__btrfs_tree_read_lock(eb, BTRFS_NESTING_NORMAL, false);
 }
 
-/*
- * Lock extent buffer for read, optimistically expecting that there are no
- * contending blocking writers. If there are, don't wait.
- *
- * Return 1 if the rwlock has been taken, 0 otherwise
- */
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb)
-{
-	return btrfs_try_tree_read_lock(eb);
-}
-
 /*
  * Try-lock for read.
  *
@@ -182,18 +146,6 @@ void btrfs_tree_read_unlock(struct extent_buffer *eb)
 	up_read(&eb->lock);
 }
 
-/*
- * Release read lock, previously set to blocking by a pairing call to
- * btrfs_set_lock_blocking_read(). Can be nested in write lock by the same
- * thread.
- *
- * State of rwlock is unchanged, last reader wakes waiting threads.
- */
-void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb)
-{
-	btrfs_tree_read_unlock(eb);
-}
-
 /*
  * __btrfs_tree_lock: Lock for write.
  * @eb - the eb to lock.
@@ -229,32 +181,6 @@ void btrfs_tree_unlock(struct extent_buffer *eb)
 	up_write(&eb->lock);
 }
 
-/*
- * Set all locked nodes in the path to blocking locks.  This should be done
- * before scheduling
- */
-void btrfs_set_path_blocking(struct btrfs_path *p)
-{
-	int i;
-
-	for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
-		if (!p->nodes[i] || !p->locks[i])
-			continue;
-		/*
-		 * If we currently have a spinning reader or writer lock this
-		 * will bump the count of blocking holders and drop the
-		 * spinlock.
-		 */
-		if (p->locks[i] == BTRFS_READ_LOCK) {
-			btrfs_set_lock_blocking_read(p->nodes[i]);
-			p->locks[i] = BTRFS_READ_LOCK_BLOCKING;
-		} else if (p->locks[i] == BTRFS_WRITE_LOCK) {
-			btrfs_set_lock_blocking_write(p->nodes[i]);
-			p->locks[i] = BTRFS_WRITE_LOCK_BLOCKING;
-		}
-	}
-}
-
 /*
  * This releases any locks held in the path starting at level and going all the
  * way up to the root.
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
index babb8e23e16e..6acd2d41cb4d 100644
--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -13,8 +13,6 @@
 
 #define BTRFS_WRITE_LOCK 1
 #define BTRFS_READ_LOCK 2
-#define BTRFS_WRITE_LOCK_BLOCKING 3
-#define BTRFS_READ_LOCK_BLOCKING 4
 
 /*
  * We are limited in number of subclasses by MAX_LOCKDEP_SUBCLASSES, which at
@@ -81,12 +79,8 @@ void __btrfs_tree_read_lock(struct extent_buffer *eb, enum btrfs_lock_nesting ne
 			    bool recurse);
 void btrfs_tree_read_lock(struct extent_buffer *eb);
 void btrfs_tree_read_unlock(struct extent_buffer *eb);
-void btrfs_tree_read_unlock_blocking(struct extent_buffer *eb);
-void btrfs_set_lock_blocking_read(struct extent_buffer *eb);
-void btrfs_set_lock_blocking_write(struct extent_buffer *eb);
 int btrfs_try_tree_read_lock(struct extent_buffer *eb);
 int btrfs_try_tree_write_lock(struct extent_buffer *eb);
-int btrfs_tree_read_lock_atomic(struct extent_buffer *eb);
 struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
 struct extent_buffer *__btrfs_read_lock_root_node(struct btrfs_root *root,
 						  bool recurse);
@@ -105,15 +99,12 @@ static inline void btrfs_assert_tree_locked(struct extent_buffer *eb) {
 static inline void btrfs_assert_tree_locked(struct extent_buffer *eb) { }
 #endif
 
-void btrfs_set_path_blocking(struct btrfs_path *p);
 void btrfs_unlock_up_safe(struct btrfs_path *path, int level);
 
 static inline void btrfs_tree_unlock_rw(struct extent_buffer *eb, int rw)
 {
-	if (rw == BTRFS_WRITE_LOCK || rw == BTRFS_WRITE_LOCK_BLOCKING)
+	if (rw == BTRFS_WRITE_LOCK)
 		btrfs_tree_unlock(eb);
-	else if (rw == BTRFS_READ_LOCK_BLOCKING)
-		btrfs_tree_read_unlock_blocking(eb);
 	else if (rw == BTRFS_READ_LOCK)
 		btrfs_tree_read_unlock(eb);
 	else
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index c0f350c3a0cf..0c19d12fb20d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1902,8 +1902,7 @@ static int qgroup_trace_extent_swap(struct btrfs_trans_handle* trans,
 			src_path->nodes[cur_level] = eb;
 
 			btrfs_tree_read_lock(eb);
-			btrfs_set_lock_blocking_read(eb);
-			src_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+			src_path->locks[cur_level] = BTRFS_READ_LOCK;
 		}
 
 		src_path->slots[cur_level] = dst_path->slots[cur_level];
@@ -2043,8 +2042,7 @@ static int qgroup_trace_new_subtree_blocks(struct btrfs_trans_handle* trans,
 		dst_path->slots[cur_level] = 0;
 
 		btrfs_tree_read_lock(eb);
-		btrfs_set_lock_blocking_read(eb);
-		dst_path->locks[cur_level] = BTRFS_READ_LOCK_BLOCKING;
+		dst_path->locks[cur_level] = BTRFS_READ_LOCK;
 		need_cleanup = true;
 	}
 
@@ -2218,8 +2216,7 @@ int btrfs_qgroup_trace_subtree(struct btrfs_trans_handle *trans,
 			path->slots[level] = 0;
 
 			btrfs_tree_read_lock(eb);
-			btrfs_set_lock_blocking_read(eb);
-			path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+			path->locks[level] = BTRFS_READ_LOCK;
 
 			ret = btrfs_qgroup_trace_extent(trans, child_bytenr,
 							fs_info->nodesize,
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index 7f03dbe5b609..d79c82928ba8 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -575,10 +575,9 @@ static int walk_down_tree(struct btrfs_root *root, struct btrfs_path *path,
 				return -EIO;
 			}
 			btrfs_tree_read_lock(eb);
-			btrfs_set_lock_blocking_read(eb);
 			path->nodes[level-1] = eb;
 			path->slots[level-1] = 0;
-			path->locks[level-1] = BTRFS_READ_LOCK_BLOCKING;
+			path->locks[level-1] = BTRFS_READ_LOCK;
 		} else {
 			ret = process_leaf(root, path, bytenr, num_bytes);
 			if (ret)
@@ -1000,11 +999,10 @@ int btrfs_build_ref_tree(struct btrfs_fs_info *fs_info)
 		return -ENOMEM;
 
 	eb = btrfs_read_lock_root_node(fs_info->extent_root);
-	btrfs_set_lock_blocking_read(eb);
 	level = btrfs_header_level(eb);
 	path->nodes[level] = eb;
 	path->slots[level] = 0;
-	path->locks[level] = BTRFS_READ_LOCK_BLOCKING;
+	path->locks[level] = BTRFS_READ_LOCK;
 
 	while (1) {
 		/*
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 3602806d71bd..3dd3eda97e38 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1196,7 +1196,6 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 	btrfs_node_key_to_cpu(path->nodes[lowest_level], &key, slot);
 
 	eb = btrfs_lock_root_node(dest);
-	btrfs_set_lock_blocking_write(eb);
 	level = btrfs_header_level(eb);
 
 	if (level < lowest_level) {
@@ -1210,7 +1209,6 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 				      BTRFS_NESTING_COW);
 		BUG_ON(ret);
 	}
-	btrfs_set_lock_blocking_write(eb);
 
 	if (next_key) {
 		next_key->objectid = (u64)-1;
@@ -1279,7 +1277,6 @@ int replace_path(struct btrfs_trans_handle *trans, struct reloc_control *rc,
 						      BTRFS_NESTING_COW);
 				BUG_ON(ret);
 			}
-			btrfs_set_lock_blocking_write(eb);
 
 			btrfs_tree_unlock(parent);
 			free_extent_buffer(parent);
@@ -2307,7 +2304,6 @@ static int do_relocation(struct btrfs_trans_handle *trans,
 			goto next;
 		}
 		btrfs_tree_lock(eb);
-		btrfs_set_lock_blocking_write(eb);
 
 		if (!node->eb) {
 			ret = btrfs_cow_block(trans, root, eb, upper->eb,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 45bf7da3ce1c..51747f3622a3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1596,8 +1596,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	btrfs_set_lock_blocking_write(old);
-
 	ret = btrfs_copy_root(trans, root, old, &tmp, objectid);
 	/* clean up in any case */
 	btrfs_tree_unlock(old);
diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
index d3f28b8f4ff9..7c45d960b53c 100644
--- a/fs/btrfs/tree-defrag.c
+++ b/fs/btrfs/tree-defrag.c
@@ -52,7 +52,6 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
 		u32 nritems;
 
 		root_node = btrfs_lock_root_node(root);
-		btrfs_set_lock_blocking_write(root_node);
 		nritems = btrfs_header_nritems(root_node);
 		root->defrag_max.objectid = 0;
 		/* from above we know this is not a leaf */
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 696dd861cc3c..76e1e8c23ea8 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2736,7 +2736,6 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 
 				if (trans) {
 					btrfs_tree_lock(next);
-					btrfs_set_lock_blocking_write(next);
 					btrfs_clean_tree_block(next);
 					btrfs_wait_tree_block_writeback(next);
 					btrfs_tree_unlock(next);
@@ -2805,7 +2804,6 @@ static noinline int walk_up_log_tree(struct btrfs_trans_handle *trans,
 
 				if (trans) {
 					btrfs_tree_lock(next);
-					btrfs_set_lock_blocking_write(next);
 					btrfs_clean_tree_block(next);
 					btrfs_wait_tree_block_writeback(next);
 					btrfs_tree_unlock(next);
@@ -2887,7 +2885,6 @@ static int walk_log_tree(struct btrfs_trans_handle *trans,
 
 			if (trans) {
 				btrfs_tree_lock(next);
-				btrfs_set_lock_blocking_write(next);
 				btrfs_clean_tree_block(next);
 				btrfs_wait_tree_block_writeback(next);
 				btrfs_tree_unlock(next);
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 17/17] btrfs: rip out path->leave_spinning
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (15 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 16/17] btrfs: remove all of the blocking helpers Josef Bacik
@ 2020-08-10 15:42 ` Josef Bacik
  2020-08-13 14:26 ` [PATCH 00/17] Convert to an rwsem for our tree locking David Sterba
  17 siblings, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-10 15:42 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We no longer distinguish between blocking and spinning, so rip out all
of this code.
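
For illustration, a typical caller now just allocates a path and searches,
with no leave_spinning twiddling anywhere (trans/root/key below are whatever
the caller already has on hand):

	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	/* no more path->leave_spinning = 1 before the search */
	ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
	if (ret < 0)
		goto out;
	/* ... use path->nodes[0] / path->slots[0] ... */
out:
	btrfs_free_path(path);
	return ret;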

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/backref.c            |  3 ---
 fs/btrfs/ctree.c              |  3 ---
 fs/btrfs/ctree.h              |  1 -
 fs/btrfs/delayed-inode.c      |  4 ----
 fs/btrfs/dir-item.c           |  1 -
 fs/btrfs/export.c             |  1 -
 fs/btrfs/extent-tree.c        |  8 --------
 fs/btrfs/extent_io.c          |  1 -
 fs/btrfs/file-item.c          |  4 ----
 fs/btrfs/free-space-tree.c    |  2 --
 fs/btrfs/inode-item.c         |  6 ------
 fs/btrfs/inode.c              | 12 ------------
 fs/btrfs/ioctl.c              |  1 -
 fs/btrfs/qgroup.c             |  2 --
 fs/btrfs/reflink.c            |  3 ---
 fs/btrfs/super.c              |  2 --
 fs/btrfs/tests/qgroup-tests.c |  4 ----
 17 files changed, 58 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 5b994d4377d6..9635c9368b39 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1658,13 +1658,11 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 	s64 bytes_left = ((s64)size) - 1;
 	struct extent_buffer *eb = eb_in;
 	struct btrfs_key found_key;
-	int leave_spinning = path->leave_spinning;
 	struct btrfs_inode_ref *iref;
 
 	if (bytes_left >= 0)
 		dest[bytes_left] = '\0';
 
-	path->leave_spinning = 1;
 	while (1) {
 		bytes_left -= name_len;
 		if (bytes_left >= 0)
@@ -1708,7 +1706,6 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 	}
 
 	btrfs_release_path(path);
-	path->leave_spinning = leave_spinning;
 
 	if (ret)
 		return ERR_PTR(ret);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 45215eaee492..3481de544d29 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5227,7 +5227,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	struct btrfs_key key;
 	u32 nritems;
 	int ret;
-	int old_spinning = path->leave_spinning;
 	int next_rw_lock = 0;
 
 	nritems = btrfs_header_nritems(path->nodes[0]);
@@ -5242,7 +5241,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	btrfs_release_path(path);
 
 	path->keep_locks = 1;
-	path->leave_spinning = 1;
 
 	if (time_seq)
 		ret = btrfs_search_old_slot(root, &key, path, time_seq);
@@ -5377,7 +5375,6 @@ int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
 	ret = 0;
 done:
 	unlock_up(path, 0, 1, 0, NULL);
-	path->leave_spinning = old_spinning;
 
 	return ret;
 }
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ceecccce6133..0e797ac2e2d6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -370,7 +370,6 @@ struct btrfs_path {
 	unsigned int search_for_split:1;
 	unsigned int keep_locks:1;
 	unsigned int skip_locking:1;
-	unsigned int leave_spinning:1;
 	unsigned int search_commit_root:1;
 	unsigned int need_commit_sem:1;
 	unsigned int skip_release_on_error:1;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index b00b7ed47fc0..1f00daa8671b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1149,7 +1149,6 @@ static int __btrfs_run_delayed_items(struct btrfs_trans_handle *trans, int nr)
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1214,7 +1213,6 @@ int btrfs_commit_inode_delayed_items(struct btrfs_trans_handle *trans,
 		btrfs_release_delayed_node(delayed_node);
 		return -ENOMEM;
 	}
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &delayed_node->root->fs_info->delayed_block_rsv;
@@ -1259,7 +1257,6 @@ int btrfs_commit_inode_delayed_inode(struct btrfs_inode *inode)
 		ret = -ENOMEM;
 		goto trans_out;
 	}
-	path->leave_spinning = 1;
 
 	block_rsv = trans->block_rsv;
 	trans->block_rsv = &fs_info->delayed_block_rsv;
@@ -1328,7 +1325,6 @@ static void btrfs_async_run_delayed_root(struct btrfs_work *work)
 		if (!delayed_node)
 			break;
 
-		path->leave_spinning = 1;
 		root = delayed_node->root;
 
 		trans = btrfs_join_transaction(root);
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 863367c2c620..98b63ebed539 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -127,7 +127,6 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, const char *name,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	btrfs_cpu_key_to_disk(&disk_key, location);
 
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 1a8d419d9e1f..1d4c2397d0d6 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -222,7 +222,6 @@ static int btrfs_get_name(struct dentry *parent, char *name,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	if (ino == BTRFS_FIRST_FREE_OBJECTID) {
 		key.objectid = BTRFS_I(inode)->root->root_key.objectid;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 548ee1742987..c95799af95f1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1447,7 +1447,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	/* this will setup the path even if it fails to insert the back ref */
 	ret = insert_inline_extent_backref(trans, path, bytenr, num_bytes,
 					   parent, root_objectid, owner,
@@ -1471,7 +1470,6 @@ static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_release_path(path);
 
-	path->leave_spinning = 1;
 	/* now insert the actual backref */
 	if (owner < BTRFS_FIRST_FREE_OBJECTID) {
 		BUG_ON(refs_to_add != 1);
@@ -1587,7 +1585,6 @@ static int run_delayed_extent_op(struct btrfs_trans_handle *trans,
 	}
 
 again:
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(trans, fs_info->extent_root, &key, path, 0, 1);
 	if (ret < 0) {
 		err = ret;
@@ -2959,8 +2956,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	is_data = owner_objectid >= BTRFS_FIRST_FREE_OBJECTID;
 	BUG_ON(!is_data && refs_to_drop != 1);
 
@@ -3002,7 +2997,6 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				goto out;
 			}
 			btrfs_release_path(path);
-			path->leave_spinning = 1;
 
 			key.objectid = bytenr;
 			key.type = BTRFS_EXTENT_ITEM_KEY;
@@ -4292,7 +4286,6 @@ static int alloc_reserved_file_extent(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      ins, size);
 	if (ret) {
@@ -4377,7 +4370,6 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
 				      &extent_key, size);
 	if (ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ab44e696c249..9a2dbae195cd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4725,7 +4725,6 @@ int extent_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	roots = ulist_alloc(GFP_KERNEL);
 	tmp_ulist = ulist_alloc(GFP_KERNEL);
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 7d5ec71615b8..e2eb1037220e 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -142,7 +142,6 @@ int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 	file_key.offset = pos;
 	file_key.type = BTRFS_EXTENT_DATA_KEY;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &file_key,
 				      sizeof(*item));
 	if (ret < 0)
@@ -706,7 +705,6 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		key.offset = end_byte - 1;
 		key.type = BTRFS_EXTENT_CSUM_KEY;
 
-		path->leave_spinning = 1;
 		ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 		if (ret > 0) {
 			if (path->slots[0] == 0)
@@ -991,10 +989,8 @@ int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 	} else {
 		ins_size = csum_size;
 	}
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &file_key,
 				      ins_size);
-	path->leave_spinning = 0;
 	if (ret < 0)
 		goto out;
 	if (WARN_ON(ret != 0))
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index 8b1f5c8897b7..752891e0d1bb 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1191,8 +1191,6 @@ static int clear_free_space_tree(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	key.objectid = 0;
 	key.type = 0;
 	key.offset = 0;
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index 668701832845..37f36ffdaf6b 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -119,8 +119,6 @@ static int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret > 0)
 		ret = -ENOENT;
@@ -193,8 +191,6 @@ int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
 	if (ret > 0) {
 		ret = -ENOENT;
@@ -270,7 +266,6 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
@@ -327,7 +322,6 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	path->skip_release_on_error = 1;
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ca242de7cb69..c51c3faaf962 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -196,7 +196,6 @@ static int insert_inline_extent(struct btrfs_trans_handle *trans,
 		key.type = BTRFS_EXTENT_DATA_KEY;
 
 		datasize = btrfs_file_extent_calc_inline_size(cur_size);
-		path->leave_spinning = 1;
 		ret = btrfs_insert_empty_item(trans, root, path, &key,
 					      datasize);
 		if (ret)
@@ -2498,7 +2497,6 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 		ins.offset = file_pos;
 		ins.type = BTRFS_EXTENT_DATA_KEY;
 
-		path->leave_spinning = 1;
 		ret = btrfs_insert_empty_item(trans, root, path, &ins,
 					      sizeof(*stack_fi));
 		if (ret)
@@ -3480,7 +3478,6 @@ static noinline int btrfs_update_inode_item(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
 	ret = btrfs_lookup_inode(trans, root, path, &BTRFS_I(inode)->location,
 				 1);
 	if (ret) {
@@ -3569,7 +3566,6 @@ static int __btrfs_unlink_inode(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	path->leave_spinning = 1;
 	di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
 				    name, name_len, -1);
 	if (IS_ERR_OR_NULL(di)) {
@@ -6025,7 +6021,6 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
 		goto fail;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_items(trans, root, path, key, sizes, nitems);
 	if (ret != 0)
 		goto fail_unlock;
@@ -6584,13 +6579,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 
 	/* Chances are we'll be called again, so go ahead and do readahead */
 	path->reada = READA_FORWARD;
-
-	/*
-	 * Unless we're going to uncompress the inline extent, no sleep would
-	 * happen.
-	 */
-	path->leave_spinning = 1;
-
 	path->recurse = btrfs_is_free_space_inode(inode);
 
 	ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c0d4ab44361d..d2a8326fb096 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3382,7 +3382,6 @@ static long btrfs_ioctl_default_subvol(struct file *file, void __user *argp)
 		ret = -ENOMEM;
 		goto out_free;
 	}
-	path->leave_spinning = 1;
 
 	trans = btrfs_start_transaction(root, 1);
 	if (IS_ERR(trans)) {
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 0c19d12fb20d..0e24990b6140 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -893,8 +893,6 @@ static int btrfs_clean_quota_tree(struct btrfs_trans_handle *trans,
 	if (!path)
 		return -ENOMEM;
 
-	path->leave_spinning = 1;
-
 	key.objectid = 0;
 	key.offset = 0;
 	key.type = 0;
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 5cd02514cf4d..fa640bb107f9 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -347,7 +347,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		u64 drop_start;
 
 		/* Note the key will change type as we walk through the tree */
-		path->leave_spinning = 1;
 		ret = btrfs_search_slot(NULL, BTRFS_I(src)->root, &key, path,
 				0, 0);
 		if (ret < 0)
@@ -417,7 +416,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 				   size);
 
 		btrfs_release_path(path);
-		path->leave_spinning = 0;
 
 		memcpy(&new_key, &key, sizeof(new_key));
 		new_key.objectid = btrfs_ino(BTRFS_I(inode));
@@ -531,7 +529,6 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 		 * mixing buffered and direct IO writes against this file.
 		 */
 		btrfs_release_path(path);
-		path->leave_spinning = 0;
 
 		ret = btrfs_punch_hole_range(inode, path, last_dest_end,
 				destoff + len - 1, NULL, &trans);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 53639de3a064..99fbc25f667b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1126,7 +1126,6 @@ char *btrfs_get_subvol_name_from_objectid(struct btrfs_fs_info *fs_info,
 		ret = -ENOMEM;
 		goto err;
 	}
-	path->leave_spinning = 1;
 
 	name = kmalloc(PATH_MAX, GFP_KERNEL);
 	if (!name) {
@@ -1255,7 +1254,6 @@ static int get_default_subvol_objectid(struct btrfs_fs_info *fs_info, u64 *objec
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-	path->leave_spinning = 1;
 
 	/*
 	 * Find the "default" dir item which points to the root item that we
diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index ce1ca8e73c2d..f3137285a9e2 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -36,7 +36,6 @@ static int insert_normal_tree_ref(struct btrfs_root *root, u64 bytenr,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_insert_empty_item(&trans, root, path, &ins, size);
 	if (ret) {
 		test_err("couldn't insert ref %d", ret);
@@ -86,7 +85,6 @@ static int add_tree_ref(struct btrfs_root *root, u64 bytenr, u64 num_bytes,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
 	if (ret) {
 		test_err("couldn't find extent ref");
@@ -135,7 +133,6 @@ static int remove_extent_item(struct btrfs_root *root, u64 bytenr,
 		test_std_err(TEST_ALLOC_ROOT);
 		return -ENOMEM;
 	}
-	path->leave_spinning = 1;
 
 	ret = btrfs_search_slot(&trans, root, &key, path, -1, 1);
 	if (ret) {
@@ -170,7 +167,6 @@ static int remove_extent_ref(struct btrfs_root *root, u64 bytenr,
 		return -ENOMEM;
 	}
 
-	path->leave_spinning = 1;
 	ret = btrfs_search_slot(&trans, root, &key, path, 0, 1);
 	if (ret) {
 		test_err("couldn't find extent ref");
-- 
2.24.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/17] btrfs: drop path before adding new uuid tree entry
  2020-08-10 15:42 ` [PATCH 01/17] btrfs: drop path before adding new uuid tree entry Josef Bacik
@ 2020-08-10 16:28   ` Filipe Manana
  2020-08-10 16:30     ` Filipe Manana
  2020-08-11 14:35     ` Josef Bacik
  0 siblings, 2 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-10 16:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> With the conversion to the rwsemaphore I got the following lockdep splat
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
> ------------------------------------------------------
> btrfs-uuid/7955 is trying to acquire lock:
> ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>
> but task is already holding lock:
> ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (btrfs-uuid-00){++++}-{3:3}:
>        down_read_nested+0x3e/0x140
>        __btrfs_tree_read_lock+0x39/0x180
>        __btrfs_read_lock_root_node+0x3a/0x50
>        btrfs_search_slot+0x4bd/0x990
>        btrfs_uuid_tree_add+0x89/0x2d0
>        btrfs_uuid_scan_kthread+0x330/0x390
>        kthread+0x133/0x150
>        ret_from_fork+0x1f/0x30
>
> -> #0 (btrfs-root-00){++++}-{3:3}:
>        __lock_acquire+0x1272/0x2310
>        lock_acquire+0x9e/0x360
>        down_read_nested+0x3e/0x140
>        __btrfs_tree_read_lock+0x39/0x180
>        __btrfs_read_lock_root_node+0x3a/0x50
>        btrfs_search_slot+0x4bd/0x990
>        btrfs_find_root+0x45/0x1b0
>        btrfs_read_tree_root+0x61/0x100
>        btrfs_get_root_ref.part.50+0x143/0x630
>        btrfs_uuid_tree_iterate+0x207/0x314
>        btrfs_uuid_rescan_kthread+0x12/0x50
>        kthread+0x133/0x150
>        ret_from_fork+0x1f/0x30
>
> other info that might help us debug this:
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(btrfs-uuid-00);
>                                lock(btrfs-root-00);
>                                lock(btrfs-uuid-00);
>   lock(btrfs-root-00);
>
>  *** DEADLOCK ***
>
> 1 lock held by btrfs-uuid/7955:
>  #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>
> stack backtrace:
> CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
> Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
> Call Trace:
>  dump_stack+0x78/0xa0
>  check_noncircular+0x165/0x180
>  __lock_acquire+0x1272/0x2310
>  lock_acquire+0x9e/0x360
>  ? __btrfs_tree_read_lock+0x39/0x180
>  ? btrfs_root_node+0x1c/0x1d0
>  down_read_nested+0x3e/0x140
>  ? __btrfs_tree_read_lock+0x39/0x180
>  __btrfs_tree_read_lock+0x39/0x180
>  __btrfs_read_lock_root_node+0x3a/0x50
>  btrfs_search_slot+0x4bd/0x990
>  btrfs_find_root+0x45/0x1b0
>  btrfs_read_tree_root+0x61/0x100
>  btrfs_get_root_ref.part.50+0x143/0x630
>  btrfs_uuid_tree_iterate+0x207/0x314
>  ? btree_readpage+0x20/0x20
>  btrfs_uuid_rescan_kthread+0x12/0x50
>  kthread+0x133/0x150
>  ? kthread_create_on_node+0x60/0x60
>  ret_from_fork+0x1f/0x30
>
> This problem exists because we have two different rescan threads,
> btrfs_uuid_scan_kthread which creates the uuid tree, and
> btrfs_uuid_tree_iterate that goes through and updates or deletes any out
> of date roots.  The problem is they both do things in a different order.
> btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
> into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
> then does a btrfs_get_fs_root() which can read from the tree_root.

Ok, the get_fs_root() call is through btrfs_check_uuid_tree_entry().

>
> It's actually easy enough to not be holding the path in
> btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
> it further down and re-start the search when we loop.  So simply move
> the path release before we add our entry to the uuid tree.
>
> This also fixes a problem where we're holding a path open after we do
> btrfs_end_transaction(), which has its own problems.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d7670e2a9f39..3ac44dad58bb 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4462,6 +4462,7 @@ int btrfs_uuid_scan_kthread(void *data)
>                         goto skip;
>                 }
>  update_tree:
> +               btrfs_release_path(path);

We can actually do the release right after reading from the eb, so we
avoid 3 path releases in different places.
The end result would be:

https://pastebin.com/2gH54HBw
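
Roughly, just a sketch of the idea (not the actual diff):

		eb = path->nodes[0];
		slot = path->slots[0];
		read_extent_buffer(eb, &root_item,
				   btrfs_item_ptr_offset(eb, slot),
				   sizeof(root_item));
		/* done with the eb, drop the path right here */
		btrfs_release_path(path);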

Either way, it looks good to me.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>                 if (!btrfs_is_empty_uuid(root_item.uuid)) {
>                         ret = btrfs_uuid_tree_add(trans, root_item.uuid,
>                                                   BTRFS_UUID_KEY_SUBVOL,
> @@ -4486,6 +4487,7 @@ int btrfs_uuid_scan_kthread(void *data)
>                 }
>
>  skip:
> +               btrfs_release_path(path);
>                 if (trans) {
>                         ret = btrfs_end_transaction(trans);
>                         trans = NULL;
> @@ -4493,7 +4495,6 @@ int btrfs_uuid_scan_kthread(void *data)
>                                 break;
>                 }
>
> -               btrfs_release_path(path);
>                 if (key.offset < (u64)-1) {
>                         key.offset++;
>                 } else if (key.type < BTRFS_ROOT_ITEM_KEY) {
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/17] btrfs: drop path before adding new uuid tree entry
  2020-08-10 16:28   ` Filipe Manana
@ 2020-08-10 16:30     ` Filipe Manana
  2020-08-11 14:35     ` Josef Bacik
  1 sibling, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-10 16:30 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 5:28 PM Filipe Manana <fdmanana@gmail.com> wrote:
>
> On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > With the conversion to the rwsemaphore I got the following lockdep splat
> >
> > ======================================================
> > WARNING: possible circular locking dependency detected
> > 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
> > ------------------------------------------------------
> > btrfs-uuid/7955 is trying to acquire lock:
> > ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
> >
> > but task is already holding lock:
> > ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #1 (btrfs-uuid-00){++++}-{3:3}:
> >        down_read_nested+0x3e/0x140
> >        __btrfs_tree_read_lock+0x39/0x180
> >        __btrfs_read_lock_root_node+0x3a/0x50
> >        btrfs_search_slot+0x4bd/0x990
> >        btrfs_uuid_tree_add+0x89/0x2d0
> >        btrfs_uuid_scan_kthread+0x330/0x390
> >        kthread+0x133/0x150
> >        ret_from_fork+0x1f/0x30
> >
> > -> #0 (btrfs-root-00){++++}-{3:3}:
> >        __lock_acquire+0x1272/0x2310
> >        lock_acquire+0x9e/0x360
> >        down_read_nested+0x3e/0x140
> >        __btrfs_tree_read_lock+0x39/0x180
> >        __btrfs_read_lock_root_node+0x3a/0x50
> >        btrfs_search_slot+0x4bd/0x990
> >        btrfs_find_root+0x45/0x1b0
> >        btrfs_read_tree_root+0x61/0x100
> >        btrfs_get_root_ref.part.50+0x143/0x630
> >        btrfs_uuid_tree_iterate+0x207/0x314
> >        btrfs_uuid_rescan_kthread+0x12/0x50
> >        kthread+0x133/0x150
> >        ret_from_fork+0x1f/0x30
> >
> > other info that might help us debug this:
> >
> >  Possible unsafe locking scenario:
> >
> >        CPU0                    CPU1
> >        ----                    ----
> >   lock(btrfs-uuid-00);
> >                                lock(btrfs-root-00);
> >                                lock(btrfs-uuid-00);
> >   lock(btrfs-root-00);
> >
> >  *** DEADLOCK ***
> >
> > 1 lock held by btrfs-uuid/7955:
> >  #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
> >
> > stack backtrace:
> > CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
> > Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
> > Call Trace:
> >  dump_stack+0x78/0xa0
> >  check_noncircular+0x165/0x180
> >  __lock_acquire+0x1272/0x2310
> >  lock_acquire+0x9e/0x360
> >  ? __btrfs_tree_read_lock+0x39/0x180
> >  ? btrfs_root_node+0x1c/0x1d0
> >  down_read_nested+0x3e/0x140
> >  ? __btrfs_tree_read_lock+0x39/0x180
> >  __btrfs_tree_read_lock+0x39/0x180
> >  __btrfs_read_lock_root_node+0x3a/0x50
> >  btrfs_search_slot+0x4bd/0x990
> >  btrfs_find_root+0x45/0x1b0
> >  btrfs_read_tree_root+0x61/0x100
> >  btrfs_get_root_ref.part.50+0x143/0x630
> >  btrfs_uuid_tree_iterate+0x207/0x314
> >  ? btree_readpage+0x20/0x20
> >  btrfs_uuid_rescan_kthread+0x12/0x50
> >  kthread+0x133/0x150
> >  ? kthread_create_on_node+0x60/0x60
> >  ret_from_fork+0x1f/0x30
> >
> > This problem exists because we have two different rescan threads,
> > btrfs_uuid_scan_kthread which creates the uuid tree, and
> > btrfs_uuid_tree_iterate that goes through and updates or deletes any out
> > of date roots.  The problem is they both do things in a different order.
> > btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
> > into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
> > then does a btrfs_get_fs_root() which can read from the tree_root.
>
> Ok, the get_fs_root() is through  btrfs_check_uuid_tree_entry().
>
> >
> > It's actually easy enough to not be holding the path in
> > btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
> > it further down and re-start the search when we loop.  So simply move
> > the path release before we add our entry to the uuid tree.
> >
> > This also fixes a problem where we're holding a path open after we do
> > btrfs_end_transaction(), which has its own problems.
> >
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> >  fs/btrfs/volumes.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index d7670e2a9f39..3ac44dad58bb 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -4462,6 +4462,7 @@ int btrfs_uuid_scan_kthread(void *data)
> >                         goto skip;
> >                 }
> >  update_tree:
> > +               btrfs_release_path(path);
>
> We can actually do the release right after reading from the eb, so we
> avoid 3 path releases in different places.
> The end result would be:
>
> https://pastebin.com/2gH54HBw

Well, it should be https://pastebin.com/t5nDBhVR

>
> Either way, it looks good to me.
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>
> >                 if (!btrfs_is_empty_uuid(root_item.uuid)) {
> >                         ret = btrfs_uuid_tree_add(trans, root_item.uuid,
> >                                                   BTRFS_UUID_KEY_SUBVOL,
> > @@ -4486,6 +4487,7 @@ int btrfs_uuid_scan_kthread(void *data)
> >                 }
> >
> >  skip:
> > +               btrfs_release_path(path);
> >                 if (trans) {
> >                         ret = btrfs_end_transaction(trans);
> >                         trans = NULL;
> > @@ -4493,7 +4495,6 @@ int btrfs_uuid_scan_kthread(void *data)
> >                                 break;
> >                 }
> >
> > -               btrfs_release_path(path);
> >                 if (key.offset < (u64)-1) {
> >                         key.offset++;
> >                 } else if (key.type < BTRFS_ROOT_ITEM_KEY) {
> > --
> > 2.24.1
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl
  2020-08-10 15:42 ` [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl Josef Bacik
@ 2020-08-10 16:45   ` Filipe Manana
  0 siblings, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-10 16:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> When I converted the locking over to using a rwsem I got the following
> lockdep splat
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted
> ------------------------------------------------------
> compsize/11122 is trying to acquire lock:
> ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90
>
> but task is already holding lock:
> ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (btrfs-fs-00){++++}-{3:3}:
>        down_write_nested+0x3b/0x70
>        __btrfs_tree_lock+0x24/0x120
>        btrfs_search_slot+0x756/0x990
>        btrfs_lookup_inode+0x3a/0xb4
>        __btrfs_update_delayed_inode+0x93/0x270
>        btrfs_async_run_delayed_root+0x168/0x230
>        btrfs_work_helper+0xd4/0x570
>        process_one_work+0x2ad/0x5f0
>        worker_thread+0x3a/0x3d0
>        kthread+0x133/0x150
>        ret_from_fork+0x1f/0x30
>
> -> #1 (&delayed_node->mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_delayed_update_inode+0x50/0x440
>        btrfs_update_inode+0x8a/0xf0
>        btrfs_dirty_inode+0x5b/0xd0
>        touch_atime+0xa1/0xd0
>        btrfs_file_mmap+0x3f/0x60
>        mmap_region+0x3a4/0x640
>        do_mmap+0x376/0x580
>        vm_mmap_pgoff+0xd5/0x120
>        ksys_mmap_pgoff+0x193/0x230
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #0 (&mm->mmap_lock#2){++++}-{3:3}:
>        __lock_acquire+0x1272/0x2310
>        lock_acquire+0x9e/0x360
>        __might_fault+0x68/0x90
>        _copy_to_user+0x1e/0x80
>        copy_to_sk.isra.32+0x121/0x300
>        search_ioctl+0x106/0x200
>        btrfs_ioctl_tree_search_v2+0x7b/0xf0
>        btrfs_ioctl+0x106f/0x30a0
>        ksys_ioctl+0x83/0xc0
>        __x64_sys_ioctl+0x16/0x20
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> other info that might help us debug this:
>
> Chain exists of:
>   &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(btrfs-fs-00);
>                                lock(&delayed_node->mutex);
>                                lock(btrfs-fs-00);
>   lock(&mm->mmap_lock#2);
>
>  *** DEADLOCK ***
>
> 1 lock held by compsize/11122:
>  #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>
> stack backtrace:
> CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922
> Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
> Call Trace:
>  dump_stack+0x78/0xa0
>  check_noncircular+0x165/0x180
>  __lock_acquire+0x1272/0x2310
>  lock_acquire+0x9e/0x360
>  ? __might_fault+0x3e/0x90
>  ? find_held_lock+0x72/0x90
>  __might_fault+0x68/0x90
>  ? __might_fault+0x3e/0x90
>  _copy_to_user+0x1e/0x80
>  copy_to_sk.isra.32+0x121/0x300
>  ? btrfs_search_forward+0x2a6/0x360
>  search_ioctl+0x106/0x200
>  btrfs_ioctl_tree_search_v2+0x7b/0xf0
>  btrfs_ioctl+0x106f/0x30a0
>  ? __do_sys_newfstat+0x5a/0x70
>  ? ksys_ioctl+0x83/0xc0
>  ksys_ioctl+0x83/0xc0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x50/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The problem is we're doing a copy_to_user() while holding tree locks,
> which can deadlock if we have to do a page fault for the copy_to_user().
> This exists even without my locking changes, so it needs to be fixed.
> Rework the search ioctl to do the pre-fault and then
> copy_to_user_nofault for the copying.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/extent_io.c |  8 ++++----
>  fs/btrfs/extent_io.h |  6 +++---
>  fs/btrfs/ioctl.c     | 27 ++++++++++++++++++++-------
>  3 files changed, 27 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 617ea38e6fd7..c15ab6c1897f 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5653,9 +5653,9 @@ void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
>         }
>  }
>
> -int read_extent_buffer_to_user(const struct extent_buffer *eb,
> -                              void __user *dstv,
> -                              unsigned long start, unsigned long len)
> +int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
> +                                      void __user *dstv,
> +                                      unsigned long start, unsigned long len)
>  {
>         size_t cur;
>         size_t offset;
> @@ -5675,7 +5675,7 @@ int read_extent_buffer_to_user(const struct extent_buffer *eb,
>
>                 cur = min(len, (PAGE_SIZE - offset));
>                 kaddr = page_address(page);
> -               if (copy_to_user(dst, kaddr + offset, cur)) {
> +               if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
>                         ret = -EFAULT;
>                         break;
>                 }
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index 00a88f2eb5ab..30794ae58498 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -241,9 +241,9 @@ int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
>  void read_extent_buffer(const struct extent_buffer *eb, void *dst,
>                         unsigned long start,
>                         unsigned long len);
> -int read_extent_buffer_to_user(const struct extent_buffer *eb,
> -                              void __user *dst, unsigned long start,
> -                              unsigned long len);
> +int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
> +                                      void __user *dst, unsigned long start,
> +                                      unsigned long len);
>  void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *src);
>  void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
>                 const void *src);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index bd3511c5ca81..39448556f635 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2086,9 +2086,14 @@ static noinline int copy_to_sk(struct btrfs_path *path,
>                 sh.len = item_len;
>                 sh.transid = found_transid;
>
> -               /* copy search result header */
> -               if (copy_to_user(ubuf + *sk_offset, &sh, sizeof(sh))) {
> -                       ret = -EFAULT;
> +               /*
> +                * copy search result header, if we fault loop again so we can
> +                * fault in the pages and -EFAULT there if there's a problem,
> +                * otherwise we'll fault and then copy the buffer in properly
> +                * this next time through
> +                */
> +               if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> +                       ret = 0;
>                         goto out;
>                 }
>
> @@ -2096,10 +2101,14 @@ static noinline int copy_to_sk(struct btrfs_path *path,
>
>                 if (item_len) {
>                         char __user *up = ubuf + *sk_offset;
> -                       /* copy the item */
> -                       if (read_extent_buffer_to_user(leaf, up,
> -                                                      item_off, item_len)) {
> -                               ret = -EFAULT;
> +                       /* copy the item, same behavior as above, but reset the
> +                        * sk_offset so we copy the full thing again.
> +                       */
> +                       if (read_extent_buffer_to_user_nofault(leaf, up,
> +                                                              item_off,
> +                                                              item_len)) {
> +                               ret = 0;
> +                               *sk_offset -= sizeof(sh);
>                                 goto out;
>                         }
>
> @@ -2184,6 +2193,10 @@ static noinline int search_ioctl(struct inode *inode,
>         key.offset = sk->min_offset;
>
>         while (1) {
> +               ret = fault_in_pages_writeable(ubuf, *buf_size - sk_offset);
> +               if (ret)
> +                       break;
> +
>                 ret = btrfs_search_forward(root, &key, path, sk->min_transid);
>                 if (ret != 0) {
>                         if (ret > 0)
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices
  2020-08-10 15:42 ` [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices Josef Bacik
@ 2020-08-11 10:53   ` Filipe Manana
  2020-08-14 14:11   ` David Sterba
  1 sibling, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-11 10:53 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> The following lockdep splat
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted
> ------------------------------------------------------
> fsstress/8739 is trying to acquire lock:
> ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70
>
> but task is already holding lock:
> ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #10 (sb_pagefaults){.+.+}-{0:0}:
>        __sb_start_write+0x129/0x210
>        btrfs_page_mkwrite+0x6a/0x4a0
>        do_page_mkwrite+0x4d/0xc0
>        handle_mm_fault+0x103c/0x1730
>        exc_page_fault+0x340/0x660
>        asm_exc_page_fault+0x1e/0x30
>
> -> #9 (&mm->mmap_lock#2){++++}-{3:3}:
>        __might_fault+0x68/0x90
>        _copy_to_user+0x1e/0x80
>        perf_read+0x141/0x2c0
>        vfs_read+0xad/0x1b0
>        ksys_read+0x5f/0xe0
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #8 (&cpuctx_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        perf_event_init_cpu+0x88/0x150
>        perf_event_init+0x1db/0x20b
>        start_kernel+0x3ae/0x53c
>        secondary_startup_64+0xa4/0xb0
>
> -> #7 (pmus_lock){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        perf_event_init_cpu+0x4f/0x150
>        cpuhp_invoke_callback+0xb1/0x900
>        _cpu_up.constprop.26+0x9f/0x130
>        cpu_up+0x7b/0xc0
>        bringup_nonboot_cpus+0x4f/0x60
>        smp_init+0x26/0x71
>        kernel_init_freeable+0x110/0x258
>        kernel_init+0xa/0x103
>        ret_from_fork+0x1f/0x30
>
> -> #6 (cpu_hotplug_lock){++++}-{0:0}:
>        cpus_read_lock+0x39/0xb0
>        kmem_cache_create_usercopy+0x28/0x230
>        kmem_cache_create+0x12/0x20
>        bioset_init+0x15e/0x2b0
>        init_bio+0xa3/0xaa
>        do_one_initcall+0x5a/0x2e0
>        kernel_init_freeable+0x1f4/0x258
>        kernel_init+0xa/0x103
>        ret_from_fork+0x1f/0x30
>
> -> #5 (bio_slab_lock){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        bioset_init+0xbc/0x2b0
>        __blk_alloc_queue+0x6f/0x2d0
>        blk_mq_init_queue_data+0x1b/0x70
>        loop_add+0x110/0x290 [loop]
>        fq_codel_tcf_block+0x12/0x20 [sch_fq_codel]
>        do_one_initcall+0x5a/0x2e0
>        do_init_module+0x5a/0x220
>        load_module+0x2459/0x26e0
>        __do_sys_finit_module+0xba/0xe0
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #4 (loop_ctl_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        lo_open+0x18/0x50 [loop]
>        __blkdev_get+0xec/0x570
>        blkdev_get+0xe8/0x150
>        do_dentry_open+0x167/0x410
>        path_openat+0x7c9/0xa80
>        do_filp_open+0x93/0x100
>        do_sys_openat2+0x22a/0x2e0
>        do_sys_open+0x4b/0x80
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        blkdev_put+0x1d/0x120
>        close_fs_devices.part.31+0x84/0x130
>        btrfs_close_devices+0x44/0xb0
>        close_ctree+0x296/0x2b2
>        generic_shutdown_super+0x69/0x100
>        kill_anon_super+0xe/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x29/0x60
>        cleanup_mnt+0xb8/0x140
>        task_work_run+0x6d/0xb0
>        __prepare_exit_to_usermode+0x1cc/0x1e0
>        do_syscall_64+0x5c/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_run_dev_stats+0x49/0x480
>        commit_cowonly_roots+0xb5/0x2a0
>        btrfs_commit_transaction+0x516/0xa60
>        sync_filesystem+0x6b/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0xe/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x29/0x60
>        cleanup_mnt+0xb8/0x140
>        task_work_run+0x6d/0xb0
>        __prepare_exit_to_usermode+0x1cc/0x1e0
>        do_syscall_64+0x5c/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_commit_transaction+0x4bb/0xa60
>        sync_filesystem+0x6b/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0xe/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x29/0x60
>        cleanup_mnt+0xb8/0x140
>        task_work_run+0x6d/0xb0
>        __prepare_exit_to_usermode+0x1cc/0x1e0
>        do_syscall_64+0x5c/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
>        __lock_acquire+0x1272/0x2310
>        lock_acquire+0x9e/0x360
>        __mutex_lock+0x9f/0x930
>        btrfs_record_root_in_trans+0x43/0x70
>        start_transaction+0xd1/0x5d0
>        btrfs_dirty_inode+0x42/0xd0
>        file_update_time+0xc8/0x110
>        btrfs_page_mkwrite+0x10c/0x4a0
>        do_page_mkwrite+0x4d/0xc0
>        handle_mm_fault+0x103c/0x1730
>        exc_page_fault+0x340/0x660
>        asm_exc_page_fault+0x1e/0x30
>
> other info that might help us debug this:
>
> Chain exists of:
>   &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(sb_pagefaults);
>                                lock(&mm->mmap_lock#2);
>                                lock(sb_pagefaults);
>   lock(&fs_info->reloc_mutex);
>
>  *** DEADLOCK ***
>
> 3 locks held by fsstress/8739:
>  #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660
>  #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0
>  #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0
>
> stack backtrace:
> CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929
> Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
> Call Trace:
>  dump_stack+0x78/0xa0
>  check_noncircular+0x165/0x180
>  __lock_acquire+0x1272/0x2310
>  ? btrfs_get_alloc_profile+0x150/0x210
>  lock_acquire+0x9e/0x360
>  ? btrfs_record_root_in_trans+0x43/0x70
>  __mutex_lock+0x9f/0x930
>  ? btrfs_record_root_in_trans+0x43/0x70
>  ? lock_acquire+0x9e/0x360
>  ? join_transaction+0x5d/0x450
>  ? find_held_lock+0x2d/0x90
>  ? btrfs_record_root_in_trans+0x43/0x70
>  ? join_transaction+0x3d5/0x450
>  ? btrfs_record_root_in_trans+0x43/0x70
>  btrfs_record_root_in_trans+0x43/0x70
>  start_transaction+0xd1/0x5d0
>  btrfs_dirty_inode+0x42/0xd0
>  file_update_time+0xc8/0x110
>  btrfs_page_mkwrite+0x10c/0x4a0
>  ? handle_mm_fault+0x5e/0x1730
>  do_page_mkwrite+0x4d/0xc0
>  ? __do_fault+0x32/0x150
>  handle_mm_fault+0x103c/0x1730
>  exc_page_fault+0x340/0x660
>  ? asm_exc_page_fault+0x8/0x30
>  asm_exc_page_fault+0x1e/0x30
> RIP: 0033:0x7faa6c9969c4
>
> Was seen in testing.  The fix is similar to that of
>
>   btrfs: open device without device_list_mutex
>
> where we hold the device_list_mutex and then grab the bd_mutex,
> which pulls in a bunch of dependencies under the bd_mutex.  We only ever
> call btrfs_close_devices() on mount failure or unmount, so we're safe to
> not have the device_list_mutex here.  We're already holding the
> uuid_mutex which keeps us safe from any external modification of the
> fs_devices.

We aren't holding the uuid_mutex at btrfs_rm_dev_replace_free_srcdev().
Will it not be a problem?

Thanks.

>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 3ac44dad58bb..97ec9c0a91aa 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1155,11 +1155,8 @@ static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
>         if (--fs_devices->opened > 0)
>                 return 0;
>
> -       mutex_lock(&fs_devices->device_list_mutex);
> -       list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
> +       list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list)
>                 btrfs_close_one_device(device);
> -       }
> -       mutex_unlock(&fs_devices->device_list_mutex);
>
>         WARN_ON(fs_devices->open_devices);
>         WARN_ON(fs_devices->rw_devices);
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks
  2020-08-10 15:42 ` [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks Josef Bacik
@ 2020-08-11 11:22   ` Filipe Manana
  0 siblings, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-11 11:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> I got the following lockdep splat while testing
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 5.8.0-rc7-00172-g021118712e59 #932 Not tainted
> ------------------------------------------------------
> btrfs/229626 is trying to acquire lock:
> ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450
>
> but task is already holding lock:
> ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_scrub_dev+0x11c/0x630
>        btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
>        btrfs_ioctl+0x2799/0x30a0
>        ksys_ioctl+0x83/0xc0
>        __x64_sys_ioctl+0x16/0x20
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_run_dev_stats+0x49/0x480
>        commit_cowonly_roots+0xb5/0x2a0
>        btrfs_commit_transaction+0x516/0xa60
>        sync_filesystem+0x6b/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0xe/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x29/0x60
>        cleanup_mnt+0xb8/0x140
>        task_work_run+0x6d/0xb0
>        __prepare_exit_to_usermode+0x1cc/0x1e0
>        do_syscall_64+0x5c/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_commit_transaction+0x4bb/0xa60
>        sync_filesystem+0x6b/0x90
>        generic_shutdown_super+0x22/0x100
>        kill_anon_super+0xe/0x30
>        btrfs_kill_super+0x12/0x20
>        deactivate_locked_super+0x29/0x60
>        cleanup_mnt+0xb8/0x140
>        task_work_run+0x6d/0xb0
>        __prepare_exit_to_usermode+0x1cc/0x1e0
>        do_syscall_64+0x5c/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        btrfs_record_root_in_trans+0x43/0x70
>        start_transaction+0xd1/0x5d0
>        btrfs_dirty_inode+0x42/0xd0
>        touch_atime+0xa1/0xd0
>        btrfs_file_mmap+0x3f/0x60
>        mmap_region+0x3a4/0x640
>        do_mmap+0x376/0x580
>        vm_mmap_pgoff+0xd5/0x120
>        ksys_mmap_pgoff+0x193/0x230
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
>        __might_fault+0x68/0x90
>        _copy_to_user+0x1e/0x80
>        perf_read+0x141/0x2c0
>        vfs_read+0xad/0x1b0
>        ksys_read+0x5f/0xe0
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> -> #2 (&cpuctx_mutex){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        perf_event_init_cpu+0x88/0x150
>        perf_event_init+0x1db/0x20b
>        start_kernel+0x3ae/0x53c
>        secondary_startup_64+0xa4/0xb0
>
> -> #1 (pmus_lock){+.+.}-{3:3}:
>        __mutex_lock+0x9f/0x930
>        perf_event_init_cpu+0x4f/0x150
>        cpuhp_invoke_callback+0xb1/0x900
>        _cpu_up.constprop.26+0x9f/0x130
>        cpu_up+0x7b/0xc0
>        bringup_nonboot_cpus+0x4f/0x60
>        smp_init+0x26/0x71
>        kernel_init_freeable+0x110/0x258
>        kernel_init+0xa/0x103
>        ret_from_fork+0x1f/0x30
>
> -> #0 (cpu_hotplug_lock){++++}-{0:0}:
>        __lock_acquire+0x1272/0x2310
>        lock_acquire+0x9e/0x360
>        cpus_read_lock+0x39/0xb0
>        alloc_workqueue+0x378/0x450
>        __btrfs_alloc_workqueue+0x15d/0x200
>        btrfs_alloc_workqueue+0x51/0x160
>        scrub_workers_get+0x5a/0x170
>        btrfs_scrub_dev+0x18c/0x630
>        btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
>        btrfs_ioctl+0x2799/0x30a0
>        ksys_ioctl+0x83/0xc0
>        __x64_sys_ioctl+0x16/0x20
>        do_syscall_64+0x50/0x90
>        entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> other info that might help us debug this:
>
> Chain exists of:
>   cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock
>
>  Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&fs_info->scrub_lock);
>                                lock(&fs_devs->device_list_mutex);
>                                lock(&fs_info->scrub_lock);
>   lock(cpu_hotplug_lock);
>
>  *** DEADLOCK ***
>
> 2 locks held by btrfs/229626:
>  #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630
>  #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630
>
> stack backtrace:
> CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932
> Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
> Call Trace:
>  dump_stack+0x78/0xa0
>  check_noncircular+0x165/0x180
>  __lock_acquire+0x1272/0x2310
>  lock_acquire+0x9e/0x360
>  ? alloc_workqueue+0x378/0x450
>  cpus_read_lock+0x39/0xb0
>  ? alloc_workqueue+0x378/0x450
>  alloc_workqueue+0x378/0x450
>  ? rcu_read_lock_sched_held+0x52/0x80
>  __btrfs_alloc_workqueue+0x15d/0x200
>  btrfs_alloc_workqueue+0x51/0x160
>  scrub_workers_get+0x5a/0x170
>  btrfs_scrub_dev+0x18c/0x630
>  ? start_transaction+0xd1/0x5d0
>  btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4
>  btrfs_ioctl+0x2799/0x30a0
>  ? do_sigaction+0x102/0x250
>  ? lockdep_hardirqs_on_prepare+0xca/0x160
>  ? _raw_spin_unlock_irq+0x24/0x30
>  ? trace_hardirqs_on+0x1c/0xe0
>  ? _raw_spin_unlock_irq+0x24/0x30
>  ? do_sigaction+0x102/0x250
>  ? ksys_ioctl+0x83/0xc0
>  ksys_ioctl+0x83/0xc0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x50/0x90
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> This happens because we're allocating the scrub workqueues under the
> scrub and device list mutex, which brings in a whole host of other
> dependencies.

Because the work queue allocation is done with GFP_KERNEL, it can
trigger reclaim, which can lead to a transaction commit, which in
turn needs the device_list_mutex, so it can also lead to a deadlock.
That is a different problem for which this fix is a solution, and it
may be worth mentioning in the changelog as well.
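
As a rough illustration of that second problem (the reclaim -> commit step is
the assumption here, the rest matches the splat above):

	btrfs_scrub_dev()                    /* device_list_mutex + scrub_lock held */
	  scrub_workers_get()
	    btrfs_alloc_workqueue()          /* GFP_KERNEL allocation */
	      -> direct reclaim
	        -> transaction commit
	          -> btrfs_run_dev_stats()
	            -> mutex_lock(&fs_devs->device_list_mutex)  /* already held -> deadlock */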


> Fix this by moving the actual allocation outside of the
> scrub lock, and then only take the lock once we're ready to actually
> assign them to the fs_info.  We'll now have to cleanup the workqueues in
> a few more places, so I've added a helper to do the refcount dance to
> safely free the workqueues.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/scrub.c | 122 +++++++++++++++++++++++++++--------------------
>  1 file changed, 70 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 5a6cb9db512e..7f7098e3110e 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3716,50 +3716,84 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx,
>         return 0;
>  }
>
> +static void scrub_workers_put(struct btrfs_fs_info *fs_info)
> +{
> +       if (refcount_dec_and_mutex_lock(&fs_info->scrub_workers_refcnt,
> +                                       &fs_info->scrub_lock)) {
> +               struct btrfs_workqueue *scrub_workers = NULL;
> +               struct btrfs_workqueue *scrub_wr_comp = NULL;
> +               struct btrfs_workqueue *scrub_parity = NULL;
> +
> +               scrub_workers = fs_info->scrub_workers;
> +               scrub_wr_comp = fs_info->scrub_wr_completion_workers;
> +               scrub_parity = fs_info->scrub_parity_workers;
> +
> +               fs_info->scrub_workers = NULL;
> +               fs_info->scrub_wr_completion_workers = NULL;
> +               fs_info->scrub_parity_workers = NULL;
> +               mutex_unlock(&fs_info->scrub_lock);
> +
> +               btrfs_destroy_workqueue(scrub_workers);
> +               btrfs_destroy_workqueue(scrub_wr_comp);
> +               btrfs_destroy_workqueue(scrub_parity);
> +       }
> +}
> +
>  /*
>   * get a reference count on fs_info->scrub_workers. start worker if necessary
>   */
>  static noinline_for_stack int scrub_workers_get(struct btrfs_fs_info *fs_info,
>                                                 int is_dev_replace)
>  {
> +       struct btrfs_workqueue *scrub_workers = NULL;
> +       struct btrfs_workqueue *scrub_wr_comp = NULL;
> +       struct btrfs_workqueue *scrub_parity = NULL;
>         unsigned int flags = WQ_FREEZABLE | WQ_UNBOUND;
>         int max_active = fs_info->thread_pool_size;
> +       int ret = -ENOMEM;
>
> -       lockdep_assert_held(&fs_info->scrub_lock);
> +       if (refcount_inc_not_zero(&fs_info->scrub_workers_refcnt))
> +               return 0;
>
> -       if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
> -               ASSERT(fs_info->scrub_workers == NULL);
> -               fs_info->scrub_workers = btrfs_alloc_workqueue(fs_info, "scrub",
> -                               flags, is_dev_replace ? 1 : max_active, 4);
> -               if (!fs_info->scrub_workers)
> -                       goto fail_scrub_workers;
> -
> -               ASSERT(fs_info->scrub_wr_completion_workers == NULL);
> -               fs_info->scrub_wr_completion_workers =
> -                       btrfs_alloc_workqueue(fs_info, "scrubwrc", flags,
> -                                             max_active, 2);
> -               if (!fs_info->scrub_wr_completion_workers)
> -                       goto fail_scrub_wr_completion_workers;
> +       scrub_workers = btrfs_alloc_workqueue(fs_info, "scrub", flags,
> +                                             is_dev_replace ? 1 : max_active,
> +                                             4);
> +       if (!scrub_workers)
> +               goto fail_scrub_workers;
>
> -               ASSERT(fs_info->scrub_parity_workers == NULL);
> -               fs_info->scrub_parity_workers =
> -                       btrfs_alloc_workqueue(fs_info, "scrubparity", flags,
> +       scrub_wr_comp = btrfs_alloc_workqueue(fs_info, "scrubwrc", flags,
>                                               max_active, 2);
> -               if (!fs_info->scrub_parity_workers)
> -                       goto fail_scrub_parity_workers;
> +       if (!scrub_wr_comp)
> +               goto fail_scrub_wr_completion_workers;
>
> +       scrub_parity = btrfs_alloc_workqueue(fs_info, "scrubparity", flags,
> +                                            max_active, 2);
> +       if (!scrub_parity)
> +               goto fail_scrub_parity_workers;
> +
> +       mutex_lock(&fs_info->scrub_lock);
> +       if (refcount_read(&fs_info->scrub_workers_refcnt) == 0) {
> +               ASSERT(fs_info->scrub_workers == NULL &&
> +                      fs_info->scrub_wr_completion_workers == NULL &&
> +                      fs_info->scrub_parity_workers == NULL);
> +               fs_info->scrub_workers = scrub_workers;
> +               fs_info->scrub_wr_completion_workers = scrub_wr_comp;
> +               fs_info->scrub_parity_workers = scrub_parity;
>                 refcount_set(&fs_info->scrub_workers_refcnt, 1);
> -       } else {
> -               refcount_inc(&fs_info->scrub_workers_refcnt);
> +               mutex_unlock(&fs_info->scrub_lock);
> +               return 0;
>         }
> -       return 0;
> +       refcount_inc(&fs_info->scrub_workers_refcnt);
> +       mutex_unlock(&fs_info->scrub_lock);
>
> +       ret = 0;
> +       btrfs_destroy_workqueue(scrub_parity);
>  fail_scrub_parity_workers:
> -       btrfs_destroy_workqueue(fs_info->scrub_wr_completion_workers);
> +       btrfs_destroy_workqueue(scrub_wr_comp);
>  fail_scrub_wr_completion_workers:
> -       btrfs_destroy_workqueue(fs_info->scrub_workers);
> +       btrfs_destroy_workqueue(scrub_workers);
>  fail_scrub_workers:
> -       return -ENOMEM;
> +       return ret;
>  }
>
>  int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
> @@ -3770,9 +3804,6 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>         int ret;
>         struct btrfs_device *dev;
>         unsigned int nofs_flag;
> -       struct btrfs_workqueue *scrub_workers = NULL;
> -       struct btrfs_workqueue *scrub_wr_comp = NULL;
> -       struct btrfs_workqueue *scrub_parity = NULL;
>
>         if (btrfs_fs_closing(fs_info))
>                 return -EAGAIN;
> @@ -3819,13 +3850,17 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>         if (IS_ERR(sctx))
>                 return PTR_ERR(sctx);
>
> +       ret = scrub_workers_get(fs_info, is_dev_replace);
> +       if (ret)
> +               goto out_free_ctx;
> +
>         mutex_lock(&fs_info->fs_devices->device_list_mutex);
>         dev = btrfs_find_device(fs_info->fs_devices, devid, NULL, NULL, true);
>         if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
>                      !is_dev_replace)) {
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>                 ret = -ENODEV;
> -               goto out_free_ctx;
> +               goto out;
>         }
>
>         if (!is_dev_replace && !readonly &&
> @@ -3834,7 +3869,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>                 btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
>                                 rcu_str_deref(dev->name));
>                 ret = -EROFS;
> -               goto out_free_ctx;
> +               goto out;
>         }
>
>         mutex_lock(&fs_info->scrub_lock);
> @@ -3843,7 +3878,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>                 mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>                 ret = -EIO;
> -               goto out_free_ctx;
> +               goto out;
>         }
>
>         down_read(&fs_info->dev_replace.rwsem);
> @@ -3854,17 +3889,10 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>                 mutex_unlock(&fs_info->scrub_lock);
>                 mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>                 ret = -EINPROGRESS;
> -               goto out_free_ctx;
> +               goto out;
>         }
>         up_read(&fs_info->dev_replace.rwsem);
>
> -       ret = scrub_workers_get(fs_info, is_dev_replace);
> -       if (ret) {
> -               mutex_unlock(&fs_info->scrub_lock);
> -               mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> -               goto out_free_ctx;
> -       }
> -
>         sctx->readonly = readonly;
>         dev->scrub_ctx = sctx;
>         mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> @@ -3917,24 +3945,14 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
>
>         mutex_lock(&fs_info->scrub_lock);
>         dev->scrub_ctx = NULL;
> -       if (refcount_dec_and_test(&fs_info->scrub_workers_refcnt)) {
> -               scrub_workers = fs_info->scrub_workers;
> -               scrub_wr_comp = fs_info->scrub_wr_completion_workers;
> -               scrub_parity = fs_info->scrub_parity_workers;
> -
> -               fs_info->scrub_workers = NULL;
> -               fs_info->scrub_wr_completion_workers = NULL;
> -               fs_info->scrub_parity_workers = NULL;
> -       }
>         mutex_unlock(&fs_info->scrub_lock);
>
> -       btrfs_destroy_workqueue(scrub_workers);
> -       btrfs_destroy_workqueue(scrub_wr_comp);
> -       btrfs_destroy_workqueue(scrub_parity);
> +       scrub_workers_put(fs_info);
>         scrub_put_ctx(sctx);
>
>         return ret;
> -
> +out:
> +       scrub_workers_put(fs_info);
>  out_free_ctx:
>         scrub_free_ctx(sctx);
>
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 05/17] btrfs: set the correct lockdep class for new nodes
  2020-08-10 15:42 ` [PATCH 05/17] btrfs: set the correct lockdep class for new nodes Josef Bacik
@ 2020-08-11 11:25   ` Filipe Manana
  0 siblings, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-11 11:25 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> When flipping over to the rw_semaphore I noticed I'd get a lockdep splat
> in replace_path(), which is weird because we're swapping the reloc root
> with the actual target root.  Turns out this is because we're using the
> root->root_key.objectid as the root id for the newly allocated tree
> block when setting the lockdep class; however, we need to be using the
> actual owner of this new block, which is saved in owner.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index fa7d83051587..84029851f094 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4522,7 +4522,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
>                 return ERR_PTR(-EUCLEAN);
>         }
>
> -       btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
> +       btrfs_set_buffer_lockdep_class(owner, buf, level);
>         btrfs_tree_lock(buf);
>         btrfs_clean_tree_block(buf);
>         clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers
  2020-08-10 15:42 ` [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers Josef Bacik
@ 2020-08-11 11:28   ` Filipe Manana
  0 siblings, 0 replies; 30+ messages in thread
From: Filipe Manana @ 2020-08-11 11:28 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> These are special extent buffers that get rewound in order to lookup
> the state of the tree at a specific point in time.  As such they do not
> go through the normal initialization paths that set their lockdep class,
> so handle them appropriately when they are created and before they are
> locked.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Looks good, thanks.

> ---
>  fs/btrfs/ctree.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 70e49d8d4f6c..cbacf700b68b 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1297,6 +1297,8 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
>         btrfs_tree_read_unlock_blocking(eb);
>         free_extent_buffer(eb);
>
> +       btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb_rewin),
> +                                      eb_rewin, btrfs_header_level(eb_rewin));
>         btrfs_tree_read_lock(eb_rewin);
>         __tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
>         WARN_ON(btrfs_header_nritems(eb_rewin) >
> @@ -1370,7 +1372,6 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
>
>         if (!eb)
>                 return NULL;
> -       btrfs_tree_read_lock(eb);
>         if (old_root) {
>                 btrfs_set_header_bytenr(eb, eb->start);
>                 btrfs_set_header_backref_rev(eb, BTRFS_MIXED_BACKREF_REV);
> @@ -1378,6 +1379,9 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
>                 btrfs_set_header_level(eb, old_root->level);
>                 btrfs_set_header_generation(eb, old_generation);
>         }
> +       btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb), eb,
> +                                      btrfs_header_level(eb));
> +       btrfs_tree_read_lock(eb);
>         if (tm)
>                 __tree_mod_log_rewind(fs_info, eb, time_seq, tm);
>         else
> --
> 2.24.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 01/17] btrfs: drop path before adding new uuid tree entry
  2020-08-10 16:28   ` Filipe Manana
  2020-08-10 16:30     ` Filipe Manana
@ 2020-08-11 14:35     ` Josef Bacik
  1 sibling, 0 replies; 30+ messages in thread
From: Josef Bacik @ 2020-08-11 14:35 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, kernel-team

On 8/10/20 12:28 PM, Filipe Manana wrote:
> On Mon, Aug 10, 2020 at 4:44 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> With the conversion to the rwsemaphore I got the following lockdep splat
>>
>> ======================================================
>> WARNING: possible circular locking dependency detected
>> 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted
>> ------------------------------------------------------
>> btrfs-uuid/7955 is trying to acquire lock:
>> ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>>
>> but task is already holding lock:
>> ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (btrfs-uuid-00){++++}-{3:3}:
>>         down_read_nested+0x3e/0x140
>>         __btrfs_tree_read_lock+0x39/0x180
>>         __btrfs_read_lock_root_node+0x3a/0x50
>>         btrfs_search_slot+0x4bd/0x990
>>         btrfs_uuid_tree_add+0x89/0x2d0
>>         btrfs_uuid_scan_kthread+0x330/0x390
>>         kthread+0x133/0x150
>>         ret_from_fork+0x1f/0x30
>>
>> -> #0 (btrfs-root-00){++++}-{3:3}:
>>         __lock_acquire+0x1272/0x2310
>>         lock_acquire+0x9e/0x360
>>         down_read_nested+0x3e/0x140
>>         __btrfs_tree_read_lock+0x39/0x180
>>         __btrfs_read_lock_root_node+0x3a/0x50
>>         btrfs_search_slot+0x4bd/0x990
>>         btrfs_find_root+0x45/0x1b0
>>         btrfs_read_tree_root+0x61/0x100
>>         btrfs_get_root_ref.part.50+0x143/0x630
>>         btrfs_uuid_tree_iterate+0x207/0x314
>>         btrfs_uuid_rescan_kthread+0x12/0x50
>>         kthread+0x133/0x150
>>         ret_from_fork+0x1f/0x30
>>
>> other info that might help us debug this:
>>
>>   Possible unsafe locking scenario:
>>
>>         CPU0                    CPU1
>>         ----                    ----
>>    lock(btrfs-uuid-00);
>>                                 lock(btrfs-root-00);
>>                                 lock(btrfs-uuid-00);
>>    lock(btrfs-root-00);
>>
>>   *** DEADLOCK ***
>>
>> 1 lock held by btrfs-uuid/7955:
>>   #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
>>
>> stack backtrace:
>> CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925
>> Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
>> Call Trace:
>>   dump_stack+0x78/0xa0
>>   check_noncircular+0x165/0x180
>>   __lock_acquire+0x1272/0x2310
>>   lock_acquire+0x9e/0x360
>>   ? __btrfs_tree_read_lock+0x39/0x180
>>   ? btrfs_root_node+0x1c/0x1d0
>>   down_read_nested+0x3e/0x140
>>   ? __btrfs_tree_read_lock+0x39/0x180
>>   __btrfs_tree_read_lock+0x39/0x180
>>   __btrfs_read_lock_root_node+0x3a/0x50
>>   btrfs_search_slot+0x4bd/0x990
>>   btrfs_find_root+0x45/0x1b0
>>   btrfs_read_tree_root+0x61/0x100
>>   btrfs_get_root_ref.part.50+0x143/0x630
>>   btrfs_uuid_tree_iterate+0x207/0x314
>>   ? btree_readpage+0x20/0x20
>>   btrfs_uuid_rescan_kthread+0x12/0x50
>>   kthread+0x133/0x150
>>   ? kthread_create_on_node+0x60/0x60
>>   ret_from_fork+0x1f/0x30
>>
>> This problem exists because we have two different rescan threads,
>> btrfs_uuid_scan_kthread which creates the uuid tree, and
>> btrfs_uuid_tree_iterate that goes through and updates or deletes any out
>> of date roots.  The problem is they both do things in a different order.
>> btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries
>> into the uuid_root.  btrfs_uuid_tree_iterate() scans the uuid_root, but
>> then does a btrfs_get_fs_root() which can read from the tree_root.
> 
> Ok, the get_fs_root() is through  btrfs_check_uuid_tree_entry().
> 
>>
>> It's actually easy enough to not be holding the path in
>> btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop
>> it further down and re-start the search when we loop.  So simply move
>> the path release before we add our entry to the uuid tree.
>>
>> This also fixes a problem where we're holding a path open after we do
>> btrfs_end_transaction(), which has its own problems.
>>
>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>> ---
>>   fs/btrfs/volumes.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index d7670e2a9f39..3ac44dad58bb 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -4462,6 +4462,7 @@ int btrfs_uuid_scan_kthread(void *data)
>>                          goto skip;
>>                  }
>>   update_tree:
>> +               btrfs_release_path(path);
> 
> We can actually do the release right after reading from the eb, so we
> avoid 3 path releases in different places.
> The end result would be:
> 
> https://pastebin.com/2gH54HBw
> 
> Either way, it looks good to me.

We need it at the skip: label because in some cases we go straight to skip
without releasing the path, and then deadlock.  Fun things I discovered in
testing ;).
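
For context, a minimal sketch of the resulting loop shape (heavily simplified
from the hunks above; "nothing_to_update" is just a placeholder for the real
checks):

	while (1) {
		ret = btrfs_search_forward(root, &key, path, 0);
		/* ... read root_item out of the leaf ... */
		if (nothing_to_update)
			goto skip;		/* never reaches the release below */
update_tree:
		btrfs_release_path(path);	/* drop tree_root locks before the uuid root */
		if (!btrfs_is_empty_uuid(root_item.uuid))
			ret = btrfs_uuid_tree_add(trans, root_item.uuid,
						  BTRFS_UUID_KEY_SUBVOL, ...);
skip:
		btrfs_release_path(path);	/* also covers the goto skip case */
		if (trans) {
			ret = btrfs_end_transaction(trans);
			trans = NULL;
		}
		/* advance key and loop */
	}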

Josef

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 00/17] Convert to an rwsem for our tree locking
  2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
                   ` (16 preceding siblings ...)
  2020-08-10 15:42 ` [PATCH 17/17] btrfs: rip out path->leave_spinning Josef Bacik
@ 2020-08-13 14:26 ` David Sterba
  17 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2020-08-13 14:26 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 11:42:25AM -0400, Josef Bacik wrote:

The lockdep fixes are independent so I'd like to take them first, that's
patches 1-6. I haven't looked at the rest yet.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices
  2020-08-10 15:42 ` [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices Josef Bacik
  2020-08-11 10:53   ` Filipe Manana
@ 2020-08-14 14:11   ` David Sterba
  1 sibling, 0 replies; 30+ messages in thread
From: David Sterba @ 2020-08-14 14:11 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 11:42:28AM -0400, Josef Bacik wrote:
> not have the device_list_mutex here.  We're already holding the
> uuid_mutex which keeps us safe from any external modification of the
> fs_devices.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/volumes.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 3ac44dad58bb..97ec9c0a91aa 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1155,11 +1155,8 @@ static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
>  	if (--fs_devices->opened > 0)
>  		return 0;
>  
> -	mutex_lock(&fs_devices->device_list_mutex);
> -	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list) {
> +	list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list)
>  		btrfs_close_one_device(device);
> -	}
> -	mutex_unlock(&fs_devices->device_list_mutex);

As Filipe pointed out, the uuid_mutex is not held in the dev replace
function, so I suggest adding

	lockdep_assert_held(&uuid_mutex);

when locks are removed and we rely on callers to make sure the right
locks are taken.
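
For illustration, with the hunk above that would look roughly like this (only
the lockdep_assert_held() line is new, the rest is from the patch context):

	static int close_fs_devices(struct btrfs_fs_devices *fs_devices)
	{
		struct btrfs_device *device, *tmp;

		lockdep_assert_held(&uuid_mutex);

		if (--fs_devices->opened > 0)
			return 0;

		list_for_each_entry_safe(device, tmp, &fs_devices->devices, dev_list)
			btrfs_close_one_device(device);

		WARN_ON(fs_devices->open_devices);
		...
	}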

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 09/17] btrfs: add nesting tags to the locking helpers
  2020-08-10 15:42 ` [PATCH 09/17] btrfs: add nesting tags to the locking helpers Josef Bacik
@ 2020-08-14 14:41   ` David Sterba
  0 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2020-08-14 14:41 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 11:42:34AM -0400, Josef Bacik wrote:
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -16,12 +16,18 @@
>  #define BTRFS_WRITE_LOCK_BLOCKING 3
>  #define BTRFS_READ_LOCK_BLOCKING 4
>  
> +enum btrfs_lock_nesting {
> +	BTRFS_NESTING_NORMAL = 0,

It should be safe to use the enum auto numbering, which starts from 0.
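
I.e. something like (names as in the later patches of this series):

	enum btrfs_lock_nesting {
		BTRFS_NESTING_NORMAL,
		BTRFS_NESTING_COW,
		...
	};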

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots
  2020-08-10 15:42 ` [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots Josef Bacik
@ 2020-08-14 14:45   ` David Sterba
  0 siblings, 0 replies; 30+ messages in thread
From: David Sterba @ 2020-08-14 14:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Mon, Aug 10, 2020 at 11:42:39AM -0400, Josef Bacik wrote:
> --- a/fs/btrfs/locking.h
> +++ b/fs/btrfs/locking.h
> @@ -16,6 +16,11 @@
>  #define BTRFS_WRITE_LOCK_BLOCKING 3
>  #define BTRFS_READ_LOCK_BLOCKING 4
>  
> +/*
> + * We are limited in number of subclasses by MAX_LOCKDEP_SUBCLASSES, which at
> + * the time of this patch is 8, which is how many we're using here.  Keep this
> + * in mind if you decide you want to add another subclass.
> + */
>  enum btrfs_lock_nesting {
>  	BTRFS_NESTING_NORMAL = 0,
>  
> @@ -55,6 +60,15 @@ enum btrfs_lock_nesting {
>  	 * handle this case where we need to allocate a new split block.
>  	 */
>  	BTRFS_NESTING_SPLIT,
> +
> +	/*
> +	 * When promoting a new block to a root we need to have a special
> +	 * subclass so we don't confuse lockdep, as it will appear that we are
> +	 * locking a higher level node before a lower level one.  Copying also
> +	 * has this problem as it appears we're locking the same block again
> +	 * when we make a snapshot of an existing root.
> +	 */
> +	BTRFS_NESTING_NEW_ROOT,
>  };

To prevent accidental additions (and the comment about the maximum count
going unread), I suggest using a static_assert with one extra definition,
like:

--- a/fs/btrfs/locking.h
+++ b/fs/btrfs/locking.h
@@ -14,11 +14,6 @@
 #define BTRFS_WRITE_LOCK 1
 #define BTRFS_READ_LOCK 2
 
-/*
- * We are limited in number of subclasses by MAX_LOCKDEP_SUBCLASSES, which at
- * the time of this patch is 8, which is how many we're using here.  Keep this
- * in mind if you decide you want to add another subclass.
- */
 enum btrfs_lock_nesting {
        BTRFS_NESTING_NORMAL = 0,
 
@@ -67,8 +62,19 @@ enum btrfs_lock_nesting {
         * when we make a snapshot of an existing root.
         */
        BTRFS_NESTING_NEW_ROOT,
+
+       /*
+        * We are limited in number of subclasses by MAX_LOCKDEP_SUBCLASSES,
+        * which at the time of this patch is 8, which is how many we're using
+        * here.  Keep this in mind if you decide you want to add another
+        * subclass.
+        */
+       BTRFS_NESTING_MAX
 };
 
+static_assert(BTRFS_NESTING_MAX <= MAX_LOCKDEP_SUBCLASSES,
+             "too many lock subclasses defined");
+
 struct btrfs_path;
 
 void __btrfs_tree_lock(struct extent_buffer *eb, enum btrfs_lock_nesting nest);
---

^ permalink raw reply	[flat|nested] 30+ messages in thread


Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-10 15:42 [PATCH 00/17] Convert to an rwsem for our tree locking Josef Bacik
2020-08-10 15:42 ` [PATCH 01/17] btrfs: drop path before adding new uuid tree entry Josef Bacik
2020-08-10 16:28   ` Filipe Manana
2020-08-10 16:30     ` Filipe Manana
2020-08-11 14:35     ` Josef Bacik
2020-08-10 15:42 ` [PATCH 02/17] btrfs: fix potential deadlock in the search ioctl Josef Bacik
2020-08-10 16:45   ` Filipe Manana
2020-08-10 15:42 ` [PATCH 03/17] btrfs: do not hold device_list_mutex when closing devices Josef Bacik
2020-08-11 10:53   ` Filipe Manana
2020-08-14 14:11   ` David Sterba
2020-08-10 15:42 ` [PATCH 04/17] btrfs: allocate scrub workqueues outside of locks Josef Bacik
2020-08-11 11:22   ` Filipe Manana
2020-08-10 15:42 ` [PATCH 05/17] btrfs: set the correct lockdep class for new nodes Josef Bacik
2020-08-11 11:25   ` Filipe Manana
2020-08-10 15:42 ` [PATCH 06/17] btrfs: set the lockdep class for log tree extent buffers Josef Bacik
2020-08-11 11:28   ` Filipe Manana
2020-08-10 15:42 ` [PATCH 07/17] btrfs: rename eb->lock_nested to eb->lock_recursed Josef Bacik
2020-08-10 15:42 ` [PATCH 08/17] btrfs: introduce path->recurse Josef Bacik
2020-08-10 15:42 ` [PATCH 09/17] btrfs: add nesting tags to the locking helpers Josef Bacik
2020-08-14 14:41   ` David Sterba
2020-08-10 15:42 ` [PATCH 10/17] btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks Josef Bacik
2020-08-10 15:42 ` [PATCH 11/17] btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT Josef Bacik
2020-08-10 15:42 ` [PATCH 12/17] btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW Josef Bacik
2020-08-10 15:42 ` [PATCH 13/17] btrfs: introduce BTRFS_NESTING_SPLIT for split blocks Josef Bacik
2020-08-10 15:42 ` [PATCH 14/17] btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots Josef Bacik
2020-08-14 14:45   ` David Sterba
2020-08-10 15:42 ` [PATCH 15/17] btrfs: change our extent buffer lock to a rw_semaphore Josef Bacik
2020-08-10 15:42 ` [PATCH 16/17] btrfs: remove all of the blocking helpers Josef Bacik
2020-08-10 15:42 ` [PATCH 17/17] btrfs: rip out path->leave_spinning Josef Bacik
2020-08-13 14:26 ` [PATCH 00/17] Convert to an rwsem for our tree locking David Sterba
