* [PATCH 0/6] Deadlock fix and cleanups
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Hello,

There's a problem with how we do our extent locking: if we encounter any
contention in the range we're trying to lock, we leave whatever portion we
managed to lock still locked.  This creates dependency issues when the
contended area is blocked on something that itself depends on an operation
needing the range we successfully locked.  The exact scenario we encountered
is described in patch 1.  We need to change this behavior to simply wait on
the contended area and then attempt to re-lock our entire range, clearing
anything we locked before we hit contention.

The followup patches are optimizations and cleanups around caching the extent
state in more places.  Additionally I've added caching of the extent state of
the contended area to hopefully reduce the pain of needing to unlock and then
wait on the contended area.
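
For reference, here is roughly where lock_extent() ends up at the end of the
series (patches 1 and 5 combined, comments added):

int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
		struct extent_state **cached_state)
{
	struct extent_state *failed_state = NULL;
	int err;
	u64 failed_start;

	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
			       &failed_state, cached_state, NULL, GFP_NOFS);
	while (err == -EEXIST) {
		/* Drop everything we locked before hitting contention. */
		if (failed_start != start)
			clear_extent_bit(tree, start, failed_start - 1,
					 EXTENT_LOCKED, cached_state);

		/* Wait on the cached contended state directly, then retry. */
		wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED,
				&failed_state);
		err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
				       &failed_start, &failed_state,
				       cached_state, NULL, GFP_NOFS);
	}
	return err;
}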

The first patch needs to be backported to stable, but because of the
extent-io-tree.c move it won't apply directly.  Once it lands in Linus's
tree I'll manually backport it and send it to the stable branches.

I've run this through a sanity test on xfstests, and I ran it through two
tests on fsperf that I felt would hammer on the extent locking code the most
and be the most likely to run into lock contention.  I used the new function
profiling stuff to grab latency numbers for lock_extent(), though of course
there's a lot of variance here.  The only values that fell outside the
standard deviation were some of the maximum latency numbers; everything else
was within it.  Again, these are mostly just for information; deadlock fixing
comes before performance.  Thanks,

Josef

bufferedrandwrite16g results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms            10090.03       8453.23       3505.07   -16.22%
bg_count                    20.80         20.80          0.45     0.00%
commits                      6.20             6          1.10    -3.23%
elapsed                    140.60        139.60         18.06    -0.71%
end_state_mount_ns    42066858.80   45104291.80    6317588.83     7.22%
end_state_umount_ns      6.26e+09      6.23e+09      1.76e+08    -0.39%
lock_extent_calls     10795318.60   10713477.60     233933.51    -0.76%
lock_extent_ns_max     3945976.80       7027641    1676910.27    78.10%
lock_extent_ns_mean       2258.36       2187.89        212.76    -3.12%
lock_extent_ns_min         503.60        500.80          7.44    -0.56%
lock_extent_ns_p50        1964.80       1953.60        190.31    -0.57%
lock_extent_ns_p95        4257.40       4153.20        409.06    -2.45%
lock_extent_ns_p99        6967.20       6429.20       1166.93    -7.72%
max_commit_ms            24686.20         25927       5930.26     5.03%
sys_cpu                     46.68         45.52          6.83    -2.48%
write_bw_bytes           1.25e+08      1.24e+08   15352552.51    -0.61%
write_clat_ns_mean       23568.91      21840.81       4061.83    -7.33%
write_clat_ns_p50        13734.40      13683.20       1268.43    -0.37%
write_clat_ns_p99           33152      30873.60       4236.59    -6.87%
write_io_kbytes          16777216      16777216             0     0.00%
write_iops               30413.83      30228.55       3748.18    -0.61%
write_lat_ns_max         1.72e+10      1.77e+10      9.25e+09     2.64%
write_lat_ns_mean        23645.69      21934.65       4057.93    -7.24%
write_lat_ns_min          6049.40       5877.60         80.29    -2.84%

randwrite2xram results
      metric           baseline       current        stdev            diff
================================================================================
avg_commit_ms            21196.15      32607.37       2286.20    53.84%
bg_count                    43.80         39.80          6.46    -9.13%
commits                     11.20          9.80          1.10   -12.50%
elapsed                    329.20           350          4.97     6.32%
end_state_mount_ns    61846504.40   57392390.60    7914609.45    -7.20%
end_state_umount_ns      1.74e+10      1.80e+10      2.35e+09     3.44%
lock_extent_calls     26193512.60      24046630    4169768.34    -8.20%
lock_extent_ns_max    23699711.60   17524496.80   13253697.58   -26.06%
lock_extent_ns_mean       1871.04       1866.95         26.60    -0.22%
lock_extent_ns_min         495.60           501          5.41     1.09%
lock_extent_ns_p50        1681.80       1675.40         22.13    -0.38%
lock_extent_ns_p95        3487.60          3492         45.35     0.13%
lock_extent_ns_p99        4416.60       4431.80         44.77     0.34%
max_commit_ms               53482     116910.40       8707.34   118.60%
sys_cpu                      9.88          8.32          1.75   -15.85%
write_bw_bytes           1.05e+08   89841897.40   20713472.34   -14.84%
write_clat_ns_mean      158731.16     234418.93      30030.49    47.68%
write_clat_ns_p50        14732.80      14873.60        448.91     0.96%
write_clat_ns_p99        75622.40      82892.80      12865.88     9.61%
write_io_kbytes       31930975.20   28774377.60    5985362.29    -9.89%
write_iops               25756.55      21934.06       5057.00   -14.84%
write_lat_ns_max         3.49e+10      5.87e+10      6.68e+09    68.41%
write_lat_ns_mean       158828.62     234533.46      30028.84    47.66%
write_lat_ns_min             7809       7923.40        371.87     1.46%

Josef Bacik (6):
  btrfs: unlock locked extent area if we have contention
  btrfs: add a cached_state to try_lock_extent
  btrfs: use cached_state for btrfs_check_nocow_lock
  btrfs: use a cached_state everywhere in relocation
  btrfs: cache the failed state when locking extents
  btrfs: add cached_state to read_extent_buffer_subpage

 fs/btrfs/extent-io-tree.c | 68 +++++++++++++++++++++++++++------------
 fs/btrfs/extent-io-tree.h |  6 ++--
 fs/btrfs/extent_io.c      | 17 +++++++---
 fs/btrfs/file.c           | 12 ++++---
 fs/btrfs/inode.c          |  3 +-
 fs/btrfs/ordered-data.c   |  7 ++--
 fs/btrfs/ordered-data.h   |  3 +-
 fs/btrfs/relocation.c     | 40 +++++++++++++++--------
 8 files changed, 106 insertions(+), 50 deletions(-)

-- 
2.26.3



* [PATCH 1/6] btrfs: unlock locked extent area if we have contention
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

In production we hit the following deadlock:

task 1			task 2			task 3
------			------			------
fiemap(file)		falloc(file)		fsync(file)
						  write(0, 1mib)
						  btrfs_commit_transaction()
						    wait_on(!pending_ordered)
			  lock(512mib, 1gib)
			  start_transaction
			    wait_on_transaction

  lock(0, 1gib)
    wait_extent_bit(512mib)

task 4
------
finish_ordered_extent(0, 1mib)
  lock(0, 1mib)
  **DEADLOCK**

This occurs because when task 1 takes its lock, it locks everything from
0-512mib and then waits for the 512mib chunk to unlock.  Task 2 will
never unlock because it's waiting on the transaction commit to happen,
the transaction commit is waiting on the outstanding ordered extents,
and the ordered extent thread is blocked waiting on the 0-1mib range to
unlock.

To fix this we have to clear anything we've locked so far, wait for the
extent_state that we contended on, and then try to re-lock the entire
range.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-io-tree.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 618275af19c4..83cb0378096f 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1641,16 +1641,17 @@ int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 	int err;
 	u64 failed_start;
 
-	while (1) {
+	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
+			       cached_state, NULL, GFP_NOFS);
+	while (err == -EEXIST) {
+		if (failed_start != start)
+			clear_extent_bit(tree, start, failed_start - 1,
+					 EXTENT_LOCKED, cached_state);
+
+		wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
 		err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
 				       &failed_start, cached_state, NULL,
 				       GFP_NOFS);
-		if (err == -EEXIST) {
-			wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
-			start = failed_start;
-		} else
-			break;
-		WARN_ON(start > end);
 	}
 	return err;
 }
-- 
2.26.3



* [PATCH 2/6] btrfs: add a cached_state to try_lock_extent
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

With nowait becoming more pervasive throughout our codebase, go ahead and
add a cached_state to try_lock_extent().  This allows us to clear the
locked area faster if we hit contention, and gives us the same
optimization on unlock if we do manage to lock the range.
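
The resulting caller pattern, as used by the later patches in this series,
looks roughly like this for a nowait path (names here are generic):

	struct extent_state *cached_state = NULL;

	if (!try_lock_extent(&inode->io_tree, start, end, &cached_state))
		return -EAGAIN;

	/* ... do the nowait work on [start, end] ... */

	/* The cached state spares unlock_extent() another tree search. */
	unlock_extent(&inode->io_tree, start, end, &cached_state);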

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-io-tree.c | 7 ++++---
 fs/btrfs/extent-io-tree.h | 3 ++-
 fs/btrfs/extent_io.c      | 3 ++-
 fs/btrfs/file.c           | 3 ++-
 fs/btrfs/inode.c          | 3 ++-
 fs/btrfs/ordered-data.c   | 2 +-
 fs/btrfs/relocation.c     | 2 +-
 7 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 83cb0378096f..1b0a45b51f4c 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1615,17 +1615,18 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 				  changeset);
 }
 
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+		    struct extent_state **cached)
 {
 	int err;
 	u64 failed_start;
 
 	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
-			       NULL, NULL, GFP_NOFS);
+			       cached, NULL, GFP_NOFS);
 	if (err == -EEXIST) {
 		if (failed_start > start)
 			clear_extent_bit(tree, start, failed_start - 1,
-					 EXTENT_LOCKED, NULL);
+					 EXTENT_LOCKED, cached);
 		return 0;
 	}
 	return 1;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index a855f40dd61d..786be8f38f0b 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -106,7 +106,8 @@ void extent_io_tree_release(struct extent_io_tree *tree);
 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 		struct extent_state **cached);
 
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+		    struct extent_state **cached);
 
 int __init extent_state_init_cachep(void);
 void __cold extent_state_free_cachep(void);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1eae68fbae21..e29e8aafc3b7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4962,7 +4962,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
 
 	if (wait == WAIT_NONE) {
-		if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1))
+		if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+				     NULL))
 			return -EAGAIN;
 	} else {
 		ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 176b432035ae..64bf29848723 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1302,7 +1302,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
 		struct btrfs_ordered_extent *ordered;
 
 		if (nowait) {
-			if (!try_lock_extent(&inode->io_tree, start_pos, last_pos)) {
+			if (!try_lock_extent(&inode->io_tree, start_pos, last_pos,
+					     cached_state)) {
 				for (i = 0; i < num_pages; i++) {
 					unlock_page(pages[i]);
 					put_page(pages[i]);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 45ebef8d3ea8..c5630a3b8011 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7255,7 +7255,8 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 	while (1) {
 		if (nowait) {
-			if (!try_lock_extent(io_tree, lockstart, lockend))
+			if (!try_lock_extent(io_tree, lockstart, lockend,
+					     cached_state))
 				return -EAGAIN;
 		} else {
 			lock_extent(io_tree, lockstart, lockend, cached_state);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e54f8280031f..b648c9d4ea0f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1073,7 +1073,7 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
 {
 	struct btrfs_ordered_extent *ordered;
 
-	if (!try_lock_extent(&inode->io_tree, start, end))
+	if (!try_lock_extent(&inode->io_tree, start, end, NULL))
 		return false;
 
 	ordered = btrfs_lookup_ordered_range(inode, start, end - start + 1);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 666a37a0ee89..e81a21082e58 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1120,7 +1120,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 				WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
 				end--;
 				ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
-						      key.offset, end);
+						      key.offset, end, NULL);
 				if (!ret)
 					continue;
 
-- 
2.26.3



* [PATCH 3/6] btrfs: use cached_state for btrfs_check_nocow_lock
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Now that try_lock_extent() takes a cached_state, plumb the cached_state
through btrfs_try_lock_ordered_range() and then use a cached_state
everywhere in btrfs_check_nocow_lock() to avoid extra searches on the
extent_io_tree.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/file.c         | 9 ++++++---
 fs/btrfs/ordered-data.c | 7 ++++---
 fs/btrfs/ordered-data.h | 3 ++-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 64bf29848723..c390d5805917 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1373,6 +1373,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_root *root = inode->root;
+	struct extent_state *cached_state = NULL;
 	u64 lockstart, lockend;
 	u64 num_bytes;
 	int ret;
@@ -1389,12 +1390,14 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 	num_bytes = lockend - lockstart + 1;
 
 	if (nowait) {
-		if (!btrfs_try_lock_ordered_range(inode, lockstart, lockend)) {
+		if (!btrfs_try_lock_ordered_range(inode, lockstart, lockend,
+						  &cached_state)) {
 			btrfs_drew_write_unlock(&root->snapshot_lock);
 			return -EAGAIN;
 		}
 	} else {
-		btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend, NULL);
+		btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend,
+						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
 			NULL, NULL, NULL, nowait, false);
@@ -1403,7 +1406,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 	else
 		*write_bytes = min_t(size_t, *write_bytes ,
 				     num_bytes - pos + lockstart);
-	unlock_extent(&inode->io_tree, lockstart, lockend, NULL);
+	unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 
 	return ret;
 }
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b648c9d4ea0f..de2b716d3e7b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1069,11 +1069,12 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
  * Return true if btrfs_lock_ordered_range does not return any extents,
  * otherwise false.
  */
-bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
+bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
+				  struct extent_state **cached_state)
 {
 	struct btrfs_ordered_extent *ordered;
 
-	if (!try_lock_extent(&inode->io_tree, start, end, NULL))
+	if (!try_lock_extent(&inode->io_tree, start, end, cached_state))
 		return false;
 
 	ordered = btrfs_lookup_ordered_range(inode, start, end - start + 1);
@@ -1081,7 +1082,7 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
 		return true;
 
 	btrfs_put_ordered_extent(ordered);
-	unlock_extent(&inode->io_tree, start, end, NULL);
+	unlock_extent(&inode->io_tree, start, end, cached_state);
 
 	return false;
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index f59f2dbdb25e..89f82b78f590 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -206,7 +206,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
 void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					u64 end,
 					struct extent_state **cached_state);
-bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end);
+bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
+				  struct extent_state **cached_state);
 int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
 			       u64 post);
 int __init ordered_data_init(void);
-- 
2.26.3



* [PATCH 4/6] btrfs: use a cached_state everywhere in relocation
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

None of the relocation code uses the cached state, even though it follows
the normal

lock_extent()
// do something
unlock_extent()

pattern everywhere.  Fix this by plumbing a cached state through all of
these functions to allow for fewer tree searches.
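
With the cached state plumbed through, each site ends up following the
pattern below (this is roughly the prealloc loop from this patch):

	struct extent_state *cached_state = NULL;

	lock_extent(&inode->io_tree, start, end, &cached_state);
	/* do something; the extent_state found on lock is cached */
	unlock_extent(&inode->io_tree, start, end, &cached_state);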

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/relocation.c | 40 ++++++++++++++++++++++++++--------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index e81a21082e58..9dcf9b39c798 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1113,6 +1113,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 				inode = find_next_inode(root, key.objectid);
 			}
 			if (inode && btrfs_ino(BTRFS_I(inode)) == key.objectid) {
+				struct extent_state *cached_state = NULL;
+
 				end = key.offset +
 				      btrfs_file_extent_num_bytes(leaf, fi);
 				WARN_ON(!IS_ALIGNED(key.offset,
@@ -1120,14 +1122,15 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
 				WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
 				end--;
 				ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
-						      key.offset, end, NULL);
+						      key.offset, end,
+						      &cached_state);
 				if (!ret)
 					continue;
 
 				btrfs_drop_extent_map_range(BTRFS_I(inode),
 							    key.offset, end, true);
 				unlock_extent(&BTRFS_I(inode)->io_tree,
-					      key.offset, end, NULL);
+					      key.offset, end, &cached_state);
 			}
 		}
 
@@ -1516,6 +1519,8 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 
 	objectid = min_key->objectid;
 	while (1) {
+		struct extent_state *cached_state = NULL;
+
 		cond_resched();
 		iput(inode);
 
@@ -1566,9 +1571,9 @@ static int invalidate_extent_cache(struct btrfs_root *root,
 		}
 
 		/* the lock_extent waits for read_folio to complete */
-		lock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+		lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
 		btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, true);
-		unlock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+		unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
 	}
 	return 0;
 }
@@ -2863,19 +2868,21 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 
 	btrfs_inode_lock(&inode->vfs_inode, 0);
 	for (nr = 0; nr < cluster->nr; nr++) {
+		struct extent_state *cached_state = NULL;
+
 		start = cluster->boundary[nr] - offset;
 		if (nr + 1 < cluster->nr)
 			end = cluster->boundary[nr + 1] - 1 - offset;
 		else
 			end = cluster->end - offset;
 
-		lock_extent(&inode->io_tree, start, end, NULL);
+		lock_extent(&inode->io_tree, start, end, &cached_state);
 		num_bytes = end + 1 - start;
 		ret = btrfs_prealloc_file_range(&inode->vfs_inode, 0, start,
 						num_bytes, num_bytes,
 						end + 1, &alloc_hint);
 		cur_offset = end + 1;
-		unlock_extent(&inode->io_tree, start, end, NULL);
+		unlock_extent(&inode->io_tree, start, end, &cached_state);
 		if (ret)
 			break;
 	}
@@ -2891,6 +2898,7 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 				u64 start, u64 end, u64 block_start)
 {
 	struct extent_map *em;
+	struct extent_state *cached_state = NULL;
 	int ret = 0;
 
 	em = alloc_extent_map();
@@ -2903,9 +2911,9 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 	em->block_start = block_start;
 	set_bit(EXTENT_FLAG_PINNED, &em->flags);
 
-	lock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+	lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
 	ret = btrfs_replace_extent_map_range(BTRFS_I(inode), em, false);
-	unlock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+	unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
 	free_extent_map(em);
 
 	return ret;
@@ -2983,6 +2991,7 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 	 */
 	cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
 	while (cur <= page_end) {
+		struct extent_state *cached_state = NULL;
 		u64 extent_start = cluster->boundary[*cluster_nr] - offset;
 		u64 extent_end = get_cluster_boundary_end(cluster,
 						*cluster_nr) - offset;
@@ -2998,13 +3007,15 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 			goto release_page;
 
 		/* Mark the range delalloc and dirty for later writeback */
-		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end, NULL);
+		lock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end,
+			    &cached_state);
 		ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
-						clamped_end, 0, NULL);
+						clamped_end, 0, &cached_state);
 		if (ret) {
-			clear_extent_bits(&BTRFS_I(inode)->io_tree,
-					clamped_start, clamped_end,
-					EXTENT_LOCKED | EXTENT_BOUNDARY);
+			clear_extent_bit(&BTRFS_I(inode)->io_tree,
+					 clamped_start, clamped_end,
+					 EXTENT_LOCKED | EXTENT_BOUNDARY,
+					 &cached_state);
 			btrfs_delalloc_release_metadata(BTRFS_I(inode),
 							clamped_len, true);
 			btrfs_delalloc_release_extents(BTRFS_I(inode),
@@ -3031,7 +3042,8 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
 					boundary_start, boundary_end,
 					EXTENT_BOUNDARY);
 		}
-		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end, NULL);
+		unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end,
+			      &cached_state);
 		btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
 		cur += clamped_len;
 
-- 
2.26.3



* [PATCH 5/6] btrfs: cache the failed state when locking extents
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

Currently if we fail to lock a range we'll return the start of the range
that we failed to lock.  We'll then search down to this range and wait
on any extent states in this range.

However we can avoid this search altogether if we simply cache the
extent_state that had the contention.  We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search.  In the most optimistic case we can avoid all searches; more
likely we'll avoid the initial search but still have to search after we
wait on the failed state; in the worst case we must search both times,
which is what currently happens.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent-io-tree.c | 52 +++++++++++++++++++++++++++++----------
 fs/btrfs/extent-io-tree.h |  3 ++-
 fs/btrfs/extent_io.c      |  3 ++-
 3 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 1b0a45b51f4c..e455439f9e86 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -714,7 +714,8 @@ static void wait_on_state(struct extent_io_tree *tree,
  * The range [start, end] is inclusive.
  * The tree lock is taken by this function
  */
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
+void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
+		     struct extent_state **cached_state)
 {
 	struct extent_state *state;
 
@@ -722,6 +723,16 @@ void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
 
 	spin_lock(&tree->lock);
 again:
+	/*
+	 * Maintain cached_state, as we may not remove it from the tree if there
+	 * are more bits than the bits we're waiting on set on this state.
+	 */
+	if (cached_state && *cached_state) {
+		state = *cached_state;
+		if (extent_state_in_tree(state) &&
+		    state->start <= start && state->end > start)
+			goto process_node;
+	}
 	while (1) {
 		/*
 		 * This search will find all the extents that end after our
@@ -752,6 +763,12 @@ void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
 		}
 	}
 out:
+	/* This state is no longer useful, clear it and free it up. */
+	if (cached_state && *cached_state) {
+		state = *cached_state;
+		*cached_state = NULL;
+		free_extent_state(state);
+	}
 	spin_unlock(&tree->lock);
 }
 
@@ -939,13 +956,17 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
  * sleeping, so the gfp mask is used to indicate what is allowed.
  *
  * If any of the exclusive bits are set, this will fail with -EEXIST if some
- * part of the range already has the desired bits set.  The start of the
- * existing range is returned in failed_start in this case.
+ * part of the range already has the desired bits set.  The extent_state of the
+ * existing range is returned in failed_state in this case, and the start of the
+ * existing range is returned in failed_start.  failed_state is used as an
+ * optimization for wait_extent_bit, failed_start must be used as the source of
+ * truth as failed_state may have changed since we returned.
  *
  * [start, end] is inclusive This takes the tree lock.
  */
 static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 			    u32 bits, u64 *failed_start,
+			    struct extent_state **failed_state,
 			    struct extent_state **cached_state,
 			    struct extent_changeset *changeset, gfp_t mask)
 {
@@ -964,7 +985,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (exclusive_bits)
 		ASSERT(failed_start);
 	else
-		ASSERT(failed_start == NULL);
+		ASSERT(failed_start == NULL && failed_state == NULL);
 again:
 	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
@@ -1012,6 +1033,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (state->start == start && state->end <= end) {
 		if (state->state & exclusive_bits) {
 			*failed_start = state->start;
+			cache_state(state, failed_state);
 			err = -EEXIST;
 			goto out;
 		}
@@ -1047,6 +1069,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (state->start < start) {
 		if (state->state & exclusive_bits) {
 			*failed_start = start;
+			cache_state(state, failed_state);
 			err = -EEXIST;
 			goto out;
 		}
@@ -1125,6 +1148,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (state->start <= end && state->end > end) {
 		if (state->state & exclusive_bits) {
 			*failed_start = start;
+			cache_state(state, failed_state);
 			err = -EEXIST;
 			goto out;
 		}
@@ -1162,8 +1186,8 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 		   u32 bits, struct extent_state **cached_state, gfp_t mask)
 {
-	return __set_extent_bit(tree, start, end, bits, NULL, cached_state,
-				NULL, mask);
+	return __set_extent_bit(tree, start, end, bits, NULL, NULL,
+				cached_state, NULL, mask);
 }
 
 /*
@@ -1598,8 +1622,8 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
 	 */
 	ASSERT(!(bits & EXTENT_LOCKED));
 
-	return __set_extent_bit(tree, start, end, bits, NULL, NULL, changeset,
-				GFP_NOFS);
+	return __set_extent_bit(tree, start, end, bits, NULL, NULL, NULL,
+				changeset, GFP_NOFS);
 }
 
 int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1622,7 +1646,7 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 	u64 failed_start;
 
 	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
-			       cached, NULL, GFP_NOFS);
+			       NULL, cached, NULL, GFP_NOFS);
 	if (err == -EEXIST) {
 		if (failed_start > start)
 			clear_extent_bit(tree, start, failed_start - 1,
@@ -1639,20 +1663,22 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 		struct extent_state **cached_state)
 {
+	struct extent_state *failed_state = NULL;
 	int err;
 	u64 failed_start;
 
 	err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
-			       cached_state, NULL, GFP_NOFS);
+			       &failed_state, cached_state, NULL, GFP_NOFS);
 	while (err == -EEXIST) {
 		if (failed_start != start)
 			clear_extent_bit(tree, start, failed_start - 1,
 					 EXTENT_LOCKED, cached_state);
 
-		wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
+		wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED,
+				&failed_state);
 		err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
-				       &failed_start, cached_state, NULL,
-				       GFP_NOFS);
+				       &failed_start, &failed_state,
+				       cached_state, NULL, GFP_NOFS);
 	}
 	return err;
 }
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 786be8f38f0b..c71aa29f719d 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -235,6 +235,7 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
 bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
 			       u64 *end, u64 max_bytes,
 			       struct extent_state **cached_state);
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits);
+void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
+		     struct extent_state **cached_state);
 
 #endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e29e8aafc3b7..21b437f4e36e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5004,7 +5004,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	if (ret || wait != WAIT_COMPLETE)
 		return ret;
 
-	wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1, EXTENT_LOCKED);
+	wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
+			EXTENT_LOCKED, NULL);
 	if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		ret = -EIO;
 	return ret;
-- 
2.26.3



* [PATCH 6/6] btrfs: add cached_state to read_extent_buffer_subpage
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
  To: linux-btrfs, kernel-team

We don't use a cached state here at all, which generally makes sense as
async reads are going to unlock at endio time.  However, for blocking
reads we will call wait_extent_bit() for our range.  Since lock_extent()
returns the cached_state for the start of the range, this is a helpful
optimization for this case: we'll have the exact state we want to wait
on.  Add a cached state here and simply throw it away if we're a
non-blocking read; otherwise we'll get a small improvement by
eliminating some tree searches.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/extent_io.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 21b437f4e36e..c1021e9ca079 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4952,6 +4952,7 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct extent_io_tree *io_tree;
 	struct page *page = eb->pages[0];
+	struct extent_state *cached_state = NULL;
 	struct btrfs_bio_ctrl bio_ctrl = {
 		.mirror_num = mirror_num,
 	};
@@ -4963,10 +4964,11 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 
 	if (wait == WAIT_NONE) {
 		if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
-				     NULL))
+				     &cached_state))
 			return -EAGAIN;
 	} else {
-		ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
+		ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+				  &cached_state);
 		if (ret < 0)
 			return ret;
 	}
@@ -4976,7 +4978,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 	    PageUptodate(page) ||
 	    btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
 		set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
-		unlock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
+		unlock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+			      &cached_state);
 		return ret;
 	}
 
@@ -5001,11 +5004,13 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
 		atomic_dec(&eb->io_pages);
 	}
 	submit_one_bio(&bio_ctrl);
-	if (ret || wait != WAIT_COMPLETE)
+	if (ret || wait != WAIT_COMPLETE) {
+		free_extent_state(cached_state);
 		return ret;
+	}
 
 	wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
-			EXTENT_LOCKED, NULL);
+			EXTENT_LOCKED, &cached_state);
 	if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
 		ret = -EIO;
 	return ret;
-- 
2.26.3



* Re: [PATCH 0/6] Deadlock fix and cleanups
From: Filipe Manana @ 2022-10-03 11:03 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Sep 30, 2022 at 10:08 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> [...]
The whole patchset looks good to me, so

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


* Re: [PATCH 0/6] Deadlock fix and cleanups
From: David Sterba @ 2022-10-07 16:21 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs, kernel-team

On Fri, Sep 30, 2022 at 04:45:07PM -0400, Josef Bacik wrote:
> Hello,
>
> [...]
>
> The first patch needs to be backported to stable, but because of the
> extent-io-tree.c move it won't apply directly.  Once it lands in Linus's
> tree I'll manually backport it and send it to the stable branches.

I want to add the CC: stable tag to the patch, so I did the backport to
see how far it applies.

https://github.com/kdave/btrfs-devel/commit/3405ac932abaa81ce2e1cf0f8cc9b311f7a8287d

This is for 5.15+; before that the code looks different and the fix may
still be applicable but needs a closer look.  I'll mark it for 5.15+ and
any other backports can be sent separately.

> [...]  Again, these are mostly just for information; deadlock fixing
> comes before performance.  Thanks,

Yeah, the deadlock fix will go to 6.1; the other patches are simple but
not really fixes, so I'll probably leave them for 6.2 unless there's
another reason for 6.1.

> [...]
>
> Josef Bacik (6):
>   btrfs: unlock locked extent area if we have contention
>   btrfs: add a cached_state to try_lock_extent
>   btrfs: use cached_state for btrfs_check_nocow_lock
>   btrfs: use a cached_state everywhere in relocation
>   btrfs: cache the failed state when locking extents
>   btrfs: add cached_state to read_extent_buffer_subpage

Added to misc-next, thanks.

