* [PATCH 0/6] Deadlock fix and cleanups
@ 2022-09-30 20:45 Josef Bacik
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Hello,
There's a problem with how we do our extent locking: if we encounter any
contention in the range we're trying to lock, we leave locked any area we
had already managed to lock. This creates dependency problems when the
contended area is blocked on something that itself depends on an operation
needing to lock the area we successfully locked. The exact scenario we
encountered is described in patch 1. Change this behavior to simply wait
on the contended area and then attempt to re-lock the entire range,
clearing anything that was locked before we encountered the contention.
The followup patches are optimizations and cleanups around caching the extent
state in more places. Additionally I've added caching of the extent state of
the contended area to hopefully reduce the pain of needing to unlock and then
wait on the contended area.
The first patch needs to go to stable, but because of the extent-io-tree.c
move it won't apply there directly. Once it lands in Linus's tree I'll
manually backport the patch and send it to the stable branches.
I've run this through a sanity run of xfstests, and through two tests on
fsperf that I felt would hammer on the extent locking code the hardest and
be the most likely to run into lock contention. I used the new function
profiling support to grab latency numbers for lock_extent(), though of
course there's a lot of variance here. The only numbers that fell outside
the standard deviation were some of the maximum latencies; everything else
was within the standard deviation. Again, these numbers are mostly for
information; fixing the deadlock comes before performance. Thanks,
Josef
bufferedrandwrite16g results
metric baseline current stdev diff
================================================================================
avg_commit_ms 10090.03 8453.23 3505.07 -16.22%
bg_count 20.80 20.80 0.45 0.00%
commits 6.20 6 1.10 -3.23%
elapsed 140.60 139.60 18.06 -0.71%
end_state_mount_ns 42066858.80 45104291.80 6317588.83 7.22%
end_state_umount_ns 6.26e+09 6.23e+09 1.76e+08 -0.39%
lock_extent_calls 10795318.60 10713477.60 233933.51 -0.76%
lock_extent_ns_max 3945976.80 7027641 1676910.27 78.10%
lock_extent_ns_mean 2258.36 2187.89 212.76 -3.12%
lock_extent_ns_min 503.60 500.80 7.44 -0.56%
lock_extent_ns_p50 1964.80 1953.60 190.31 -0.57%
lock_extent_ns_p95 4257.40 4153.20 409.06 -2.45%
lock_extent_ns_p99 6967.20 6429.20 1166.93 -7.72%
max_commit_ms 24686.20 25927 5930.26 5.03%
sys_cpu 46.68 45.52 6.83 -2.48%
write_bw_bytes 1.25e+08 1.24e+08 15352552.51 -0.61%
write_clat_ns_mean 23568.91 21840.81 4061.83 -7.33%
write_clat_ns_p50 13734.40 13683.20 1268.43 -0.37%
write_clat_ns_p99 33152 30873.60 4236.59 -6.87%
write_io_kbytes 16777216 16777216 0 0.00%
write_iops 30413.83 30228.55 3748.18 -0.61%
write_lat_ns_max 1.72e+10 1.77e+10 9.25e+09 2.64%
write_lat_ns_mean 23645.69 21934.65 4057.93 -7.24%
write_lat_ns_min 6049.40 5877.60 80.29 -2.84%
randwrite2xram results
metric baseline current stdev diff
================================================================================
avg_commit_ms 21196.15 32607.37 2286.20 53.84%
bg_count 43.80 39.80 6.46 -9.13%
commits 11.20 9.80 1.10 -12.50%
elapsed 329.20 350 4.97 6.32%
end_state_mount_ns 61846504.40 57392390.60 7914609.45 -7.20%
end_state_umount_ns 1.74e+10 1.80e+10 2.35e+09 3.44%
lock_extent_calls 26193512.60 24046630 4169768.34 -8.20%
lock_extent_ns_max 23699711.60 17524496.80 13253697.58 -26.06%
lock_extent_ns_mean 1871.04 1866.95 26.60 -0.22%
lock_extent_ns_min 495.60 501 5.41 1.09%
lock_extent_ns_p50 1681.80 1675.40 22.13 -0.38%
lock_extent_ns_p95 3487.60 3492 45.35 0.13%
lock_extent_ns_p99 4416.60 4431.80 44.77 0.34%
max_commit_ms 53482 116910.40 8707.34 118.60%
sys_cpu 9.88 8.32 1.75 -15.85%
write_bw_bytes 1.05e+08 89841897.40 20713472.34 -14.84%
write_clat_ns_mean 158731.16 234418.93 30030.49 47.68%
write_clat_ns_p50 14732.80 14873.60 448.91 0.96%
write_clat_ns_p99 75622.40 82892.80 12865.88 9.61%
write_io_kbytes 31930975.20 28774377.60 5985362.29 -9.89%
write_iops 25756.55 21934.06 5057.00 -14.84%
write_lat_ns_max 3.49e+10 5.87e+10 6.68e+09 68.41%
write_lat_ns_mean 158828.62 234533.46 30028.84 47.66%
write_lat_ns_min 7809 7923.40 371.87 1.46%
Josef Bacik (6):
btrfs: unlock locked extent area if we have contention
btrfs: add a cached_state to try_lock_extent
btrfs: use cached_state for btrfs_check_nocow_lock
btrfs: use a cached_state everywhere in relocation
btrfs: cache the failed state when locking extents
btrfs: add cached_state to read_extent_buffer_subpage
fs/btrfs/extent-io-tree.c | 68 +++++++++++++++++++++++++++------------
fs/btrfs/extent-io-tree.h | 6 ++--
fs/btrfs/extent_io.c | 17 +++++++---
fs/btrfs/file.c | 12 ++++---
fs/btrfs/inode.c | 3 +-
fs/btrfs/ordered-data.c | 7 ++--
fs/btrfs/ordered-data.h | 3 +-
fs/btrfs/relocation.c | 40 +++++++++++++++--------
8 files changed, 106 insertions(+), 50 deletions(-)
--
2.26.3
* [PATCH 1/6] btrfs: unlock locked extent area if we have contention
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
In production we hit the following deadlock:
task 1 task 2 task 3
------ ------ ------
fiemap(file) falloc(file) fsync(file)
write(0, 1mib)
btrfs_commit_transaction()
wait_on(!pending_ordered)
lock(512mib, 1gib)
start_transaction
wait_on_transaction
lock(0, 1gib)
wait_extent_bit(512mib)
task 4
------
finish_ordered_extent(0, 1mib)
lock(0, 1mib)
**DEADLOCK**
This occurs because when task 1 takes its lock, it locks everything from
0-512mib and then waits for the 512mib chunk to unlock. Task 2 will never
unlock because it's waiting on the transaction commit to happen, the
transaction commit is waiting for the outstanding ordered extents, and the
ordered extent thread is blocked waiting on the 0-1mib range to unlock.
To fix this we have to clear anything we've locked so far, wait for the
extent_state that we contended on, and then try to re-lock the entire
range again.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent-io-tree.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 618275af19c4..83cb0378096f 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1641,16 +1641,17 @@ int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
int err;
u64 failed_start;
- while (1) {
+ err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
+ cached_state, NULL, GFP_NOFS);
+ while (err == -EEXIST) {
+ if (failed_start != start)
+ clear_extent_bit(tree, start, failed_start - 1,
+ EXTENT_LOCKED, cached_state);
+
+ wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
&failed_start, cached_state, NULL,
GFP_NOFS);
- if (err == -EEXIST) {
- wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
- start = failed_start;
- } else
- break;
- WARN_ON(start > end);
}
return err;
}
--
2.26.3
* [PATCH 2/6] btrfs: add a cached_state to try_lock_extent
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
With nowait becoming more pervasive throughout our codebase, go ahead and
add a cached_state to try_lock_extent(). This allows us to be faster
about clearing the locked area if we hit contention, and gives us the
same optimization for unlock when we are able to lock the range.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent-io-tree.c | 7 ++++---
fs/btrfs/extent-io-tree.h | 3 ++-
fs/btrfs/extent_io.c | 3 ++-
fs/btrfs/file.c | 3 ++-
fs/btrfs/inode.c | 3 ++-
fs/btrfs/ordered-data.c | 2 +-
fs/btrfs/relocation.c | 2 +-
7 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 83cb0378096f..1b0a45b51f4c 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -1615,17 +1615,18 @@ int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
changeset);
}
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached)
{
int err;
u64 failed_start;
err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
- NULL, NULL, GFP_NOFS);
+ cached, NULL, GFP_NOFS);
if (err == -EEXIST) {
if (failed_start > start)
clear_extent_bit(tree, start, failed_start - 1,
- EXTENT_LOCKED, NULL);
+ EXTENT_LOCKED, cached);
return 0;
}
return 1;
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index a855f40dd61d..786be8f38f0b 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -106,7 +106,8 @@ void extent_io_tree_release(struct extent_io_tree *tree);
int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached);
-int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+ struct extent_state **cached);
int __init extent_state_init_cachep(void);
void __cold extent_state_free_cachep(void);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 1eae68fbae21..e29e8aafc3b7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4962,7 +4962,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
io_tree = &BTRFS_I(fs_info->btree_inode)->io_tree;
if (wait == WAIT_NONE) {
- if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1))
+ if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+ NULL))
return -EAGAIN;
} else {
ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 176b432035ae..64bf29848723 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1302,7 +1302,8 @@ lock_and_cleanup_extent_if_need(struct btrfs_inode *inode, struct page **pages,
struct btrfs_ordered_extent *ordered;
if (nowait) {
- if (!try_lock_extent(&inode->io_tree, start_pos, last_pos)) {
+ if (!try_lock_extent(&inode->io_tree, start_pos, last_pos,
+ cached_state)) {
for (i = 0; i < num_pages; i++) {
unlock_page(pages[i]);
put_page(pages[i]);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 45ebef8d3ea8..c5630a3b8011 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7255,7 +7255,8 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
while (1) {
if (nowait) {
- if (!try_lock_extent(io_tree, lockstart, lockend))
+ if (!try_lock_extent(io_tree, lockstart, lockend,
+ cached_state))
return -EAGAIN;
} else {
lock_extent(io_tree, lockstart, lockend, cached_state);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index e54f8280031f..b648c9d4ea0f 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1073,7 +1073,7 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
{
struct btrfs_ordered_extent *ordered;
- if (!try_lock_extent(&inode->io_tree, start, end))
+ if (!try_lock_extent(&inode->io_tree, start, end, NULL))
return false;
ordered = btrfs_lookup_ordered_range(inode, start, end - start + 1);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 666a37a0ee89..e81a21082e58 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1120,7 +1120,7 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
end--;
ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
- key.offset, end);
+ key.offset, end, NULL);
if (!ret)
continue;
--
2.26.3
* [PATCH 3/6] btrfs: use cached_state for btrfs_check_nocow_lock
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Now that try_lock_extent() takes a cached_state, plumb the cached_state
through btrfs_try_lock_ordered_range() and then use a cached_state
everywhere in btrfs_check_nocow_lock() to avoid extra searches on the
extent_io_tree.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/file.c | 9 ++++++---
fs/btrfs/ordered-data.c | 7 ++++---
fs/btrfs/ordered-data.h | 3 ++-
3 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 64bf29848723..c390d5805917 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1373,6 +1373,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct btrfs_root *root = inode->root;
+ struct extent_state *cached_state = NULL;
u64 lockstart, lockend;
u64 num_bytes;
int ret;
@@ -1389,12 +1390,14 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
num_bytes = lockend - lockstart + 1;
if (nowait) {
- if (!btrfs_try_lock_ordered_range(inode, lockstart, lockend)) {
+ if (!btrfs_try_lock_ordered_range(inode, lockstart, lockend,
+ &cached_state)) {
btrfs_drew_write_unlock(&root->snapshot_lock);
return -EAGAIN;
}
} else {
- btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend, NULL);
+ btrfs_lock_and_flush_ordered_range(inode, lockstart, lockend,
+ &cached_state);
}
ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
NULL, NULL, NULL, nowait, false);
@@ -1403,7 +1406,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
else
*write_bytes = min_t(size_t, *write_bytes ,
num_bytes - pos + lockstart);
- unlock_extent(&inode->io_tree, lockstart, lockend, NULL);
+ unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
return ret;
}
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index b648c9d4ea0f..de2b716d3e7b 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1069,11 +1069,12 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
* Return true if btrfs_lock_ordered_range does not return any extents,
* otherwise false.
*/
-bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
+bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state)
{
struct btrfs_ordered_extent *ordered;
- if (!try_lock_extent(&inode->io_tree, start, end, NULL))
+ if (!try_lock_extent(&inode->io_tree, start, end, cached_state))
return false;
ordered = btrfs_lookup_ordered_range(inode, start, end - start + 1);
@@ -1081,7 +1082,7 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
return true;
btrfs_put_ordered_extent(ordered);
- unlock_extent(&inode->io_tree, start, end, NULL);
+ unlock_extent(&inode->io_tree, start, end, cached_state);
return false;
}
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index f59f2dbdb25e..89f82b78f590 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -206,7 +206,8 @@ void btrfs_wait_ordered_roots(struct btrfs_fs_info *fs_info, u64 nr,
void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
u64 end,
struct extent_state **cached_state);
-bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end);
+bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
+ struct extent_state **cached_state);
int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
u64 post);
int __init ordered_data_init(void);
--
2.26.3
* [PATCH 4/6] btrfs: use a cached_state everywhere in relocation
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
All of the relocation code avoids using the cached state, despite
everywhere using the normal

	lock_extent()
	// do something
	unlock_extent()

pattern. Fix this by plumbing a cached state through all of these
functions to allow for fewer tree searches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/relocation.c | 40 ++++++++++++++++++++++++++--------------
1 file changed, 26 insertions(+), 14 deletions(-)
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index e81a21082e58..9dcf9b39c798 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1113,6 +1113,8 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
inode = find_next_inode(root, key.objectid);
}
if (inode && btrfs_ino(BTRFS_I(inode)) == key.objectid) {
+ struct extent_state *cached_state = NULL;
+
end = key.offset +
btrfs_file_extent_num_bytes(leaf, fi);
WARN_ON(!IS_ALIGNED(key.offset,
@@ -1120,14 +1122,15 @@ int replace_file_extents(struct btrfs_trans_handle *trans,
WARN_ON(!IS_ALIGNED(end, fs_info->sectorsize));
end--;
ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
- key.offset, end, NULL);
+ key.offset, end,
+ &cached_state);
if (!ret)
continue;
btrfs_drop_extent_map_range(BTRFS_I(inode),
key.offset, end, true);
unlock_extent(&BTRFS_I(inode)->io_tree,
- key.offset, end, NULL);
+ key.offset, end, &cached_state);
}
}
@@ -1516,6 +1519,8 @@ static int invalidate_extent_cache(struct btrfs_root *root,
objectid = min_key->objectid;
while (1) {
+ struct extent_state *cached_state = NULL;
+
cond_resched();
iput(inode);
@@ -1566,9 +1571,9 @@ static int invalidate_extent_cache(struct btrfs_root *root,
}
/* the lock_extent waits for read_folio to complete */
- lock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+ lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
btrfs_drop_extent_map_range(BTRFS_I(inode), start, end, true);
- unlock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+ unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
}
return 0;
}
@@ -2863,19 +2868,21 @@ static noinline_for_stack int prealloc_file_extent_cluster(
btrfs_inode_lock(&inode->vfs_inode, 0);
for (nr = 0; nr < cluster->nr; nr++) {
+ struct extent_state *cached_state = NULL;
+
start = cluster->boundary[nr] - offset;
if (nr + 1 < cluster->nr)
end = cluster->boundary[nr + 1] - 1 - offset;
else
end = cluster->end - offset;
- lock_extent(&inode->io_tree, start, end, NULL);
+ lock_extent(&inode->io_tree, start, end, &cached_state);
num_bytes = end + 1 - start;
ret = btrfs_prealloc_file_range(&inode->vfs_inode, 0, start,
num_bytes, num_bytes,
end + 1, &alloc_hint);
cur_offset = end + 1;
- unlock_extent(&inode->io_tree, start, end, NULL);
+ unlock_extent(&inode->io_tree, start, end, &cached_state);
if (ret)
break;
}
@@ -2891,6 +2898,7 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
u64 start, u64 end, u64 block_start)
{
struct extent_map *em;
+ struct extent_state *cached_state = NULL;
int ret = 0;
em = alloc_extent_map();
@@ -2903,9 +2911,9 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
em->block_start = block_start;
set_bit(EXTENT_FLAG_PINNED, &em->flags);
- lock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+ lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
ret = btrfs_replace_extent_map_range(BTRFS_I(inode), em, false);
- unlock_extent(&BTRFS_I(inode)->io_tree, start, end, NULL);
+ unlock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
free_extent_map(em);
return ret;
@@ -2983,6 +2991,7 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
*/
cur = max(page_start, cluster->boundary[*cluster_nr] - offset);
while (cur <= page_end) {
+ struct extent_state *cached_state = NULL;
u64 extent_start = cluster->boundary[*cluster_nr] - offset;
u64 extent_end = get_cluster_boundary_end(cluster,
*cluster_nr) - offset;
@@ -2998,13 +3007,15 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
goto release_page;
/* Mark the range delalloc and dirty for later writeback */
- lock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end, NULL);
+ lock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end,
+ &cached_state);
ret = btrfs_set_extent_delalloc(BTRFS_I(inode), clamped_start,
- clamped_end, 0, NULL);
+ clamped_end, 0, &cached_state);
if (ret) {
- clear_extent_bits(&BTRFS_I(inode)->io_tree,
- clamped_start, clamped_end,
- EXTENT_LOCKED | EXTENT_BOUNDARY);
+ clear_extent_bit(&BTRFS_I(inode)->io_tree,
+ clamped_start, clamped_end,
+ EXTENT_LOCKED | EXTENT_BOUNDARY,
+ &cached_state);
btrfs_delalloc_release_metadata(BTRFS_I(inode),
clamped_len, true);
btrfs_delalloc_release_extents(BTRFS_I(inode),
@@ -3031,7 +3042,8 @@ static int relocate_one_page(struct inode *inode, struct file_ra_state *ra,
boundary_start, boundary_end,
EXTENT_BOUNDARY);
}
- unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end, NULL);
+ unlock_extent(&BTRFS_I(inode)->io_tree, clamped_start, clamped_end,
+ &cached_state);
btrfs_delalloc_release_extents(BTRFS_I(inode), clamped_len);
cur += clamped_len;
--
2.26.3
* [PATCH 5/6] btrfs: cache the failed state when locking extents
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
Currently if we fail to lock a range, we return the start of the range we
failed to lock. We then search down to this range and wait on any extent
states in it.
However, we can avoid this search altogether if we simply cache the
extent_state that had the contention. We can pass this into
wait_extent_bit() and start from that extent_state without doing the
search. In the most optimistic case we avoid all searches; more likely we
avoid the initial search and have to perform the search after we wait on
the failed state; worst case we must search both times, which is what
currently happens.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent-io-tree.c | 52 +++++++++++++++++++++++++++++----------
fs/btrfs/extent-io-tree.h | 3 ++-
fs/btrfs/extent_io.c | 3 ++-
3 files changed, 43 insertions(+), 15 deletions(-)
diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 1b0a45b51f4c..e455439f9e86 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -714,7 +714,8 @@ static void wait_on_state(struct extent_io_tree *tree,
* The range [start, end] is inclusive.
* The tree lock is taken by this function
*/
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
+void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
+ struct extent_state **cached_state)
{
struct extent_state *state;
@@ -722,6 +723,16 @@ void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
spin_lock(&tree->lock);
again:
+ /*
+ * Maintain cached_state, as we may not remove it from the tree if there
+ * are more bits than the bits we're waiting on set on this state.
+ */
+ if (cached_state && *cached_state) {
+ state = *cached_state;
+ if (extent_state_in_tree(state) &&
+ state->start <= start && state->end > start)
+ goto process_node;
+ }
while (1) {
/*
* This search will find all the extents that end after our
@@ -752,6 +763,12 @@ void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits)
}
}
out:
+ /* This state is no longer useful, clear it and free it up. */
+ if (cached_state && *cached_state) {
+ state = *cached_state;
+ *cached_state = NULL;
+ free_extent_state(state);
+ }
spin_unlock(&tree->lock);
}
@@ -939,13 +956,17 @@ bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
* sleeping, so the gfp mask is used to indicate what is allowed.
*
* If any of the exclusive bits are set, this will fail with -EEXIST if some
- * part of the range already has the desired bits set. The start of the
- * existing range is returned in failed_start in this case.
+ * part of the range already has the desired bits set. The extent_state of the
+ * existing range is returned in failed_state in this case, and the start of the
+ * existing range is returned in failed_start. failed_state is used as an
+ * optimization for wait_extent_bit, failed_start must be used as the source of
+ * truth as failed_state may have changed since we returned.
*
* [start, end] is inclusive This takes the tree lock.
*/
static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
u32 bits, u64 *failed_start,
+ struct extent_state **failed_state,
struct extent_state **cached_state,
struct extent_changeset *changeset, gfp_t mask)
{
@@ -964,7 +985,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (exclusive_bits)
ASSERT(failed_start);
else
- ASSERT(failed_start == NULL);
+ ASSERT(failed_start == NULL && failed_state == NULL);
again:
if (!prealloc && gfpflags_allow_blocking(mask)) {
/*
@@ -1012,6 +1033,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (state->start == start && state->end <= end) {
if (state->state & exclusive_bits) {
*failed_start = state->start;
+ cache_state(state, failed_state);
err = -EEXIST;
goto out;
}
@@ -1047,6 +1069,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (state->start < start) {
if (state->state & exclusive_bits) {
*failed_start = start;
+ cache_state(state, failed_state);
err = -EEXIST;
goto out;
}
@@ -1125,6 +1148,7 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
if (state->start <= end && state->end > end) {
if (state->state & exclusive_bits) {
*failed_start = start;
+ cache_state(state, failed_state);
err = -EEXIST;
goto out;
}
@@ -1162,8 +1186,8 @@ static int __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
u32 bits, struct extent_state **cached_state, gfp_t mask)
{
- return __set_extent_bit(tree, start, end, bits, NULL, cached_state,
- NULL, mask);
+ return __set_extent_bit(tree, start, end, bits, NULL, NULL,
+ cached_state, NULL, mask);
}
/*
@@ -1598,8 +1622,8 @@ int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
*/
ASSERT(!(bits & EXTENT_LOCKED));
- return __set_extent_bit(tree, start, end, bits, NULL, NULL, changeset,
- GFP_NOFS);
+ return __set_extent_bit(tree, start, end, bits, NULL, NULL, NULL,
+ changeset, GFP_NOFS);
}
int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
@@ -1622,7 +1646,7 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
u64 failed_start;
err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
- cached, NULL, GFP_NOFS);
+ NULL, cached, NULL, GFP_NOFS);
if (err == -EEXIST) {
if (failed_start > start)
clear_extent_bit(tree, start, failed_start - 1,
@@ -1639,20 +1663,22 @@ int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
int lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
struct extent_state **cached_state)
{
+ struct extent_state *failed_state = NULL;
int err;
u64 failed_start;
err = __set_extent_bit(tree, start, end, EXTENT_LOCKED, &failed_start,
- cached_state, NULL, GFP_NOFS);
+ &failed_state, cached_state, NULL, GFP_NOFS);
while (err == -EEXIST) {
if (failed_start != start)
clear_extent_bit(tree, start, failed_start - 1,
EXTENT_LOCKED, cached_state);
- wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
+ wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED,
+ &failed_state);
err = __set_extent_bit(tree, start, end, EXTENT_LOCKED,
- &failed_start, cached_state, NULL,
- GFP_NOFS);
+ &failed_start, &failed_state,
+ cached_state, NULL, GFP_NOFS);
}
return err;
}
diff --git a/fs/btrfs/extent-io-tree.h b/fs/btrfs/extent-io-tree.h
index 786be8f38f0b..c71aa29f719d 100644
--- a/fs/btrfs/extent-io-tree.h
+++ b/fs/btrfs/extent-io-tree.h
@@ -235,6 +235,7 @@ int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
u64 *end, u64 max_bytes,
struct extent_state **cached_state);
-void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits);
+void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
+ struct extent_state **cached_state);
#endif /* BTRFS_EXTENT_IO_TREE_H */
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index e29e8aafc3b7..21b437f4e36e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5004,7 +5004,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
if (ret || wait != WAIT_COMPLETE)
return ret;
- wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1, EXTENT_LOCKED);
+ wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
+ EXTENT_LOCKED, NULL);
if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
ret = -EIO;
return ret;
--
2.26.3
* [PATCH 6/6] btrfs: add cached_state to read_extent_buffer_subpage
2022-10-03 11:03 ` [PATCH 0/6] Deadlock fix and cleanups Filipe Manana
2022-10-07 16:21 ` David Sterba
From: Josef Bacik @ 2022-09-30 20:45 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We don't use a cached state here at all, which generally makes sense
because async reads will unlock at endio time. However, for blocking
reads we will call wait_extent_bit() on our range. Since lock_extent()
returns the cached_state for the start of the range, this is a helpful
optimization to have for this case: we'll have the exact state we want to
wait on. Add a cached state here, and simply throw it away for a
non-blocking read; otherwise we get a small improvement by eliminating
some tree searches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/extent_io.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 21b437f4e36e..c1021e9ca079 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4952,6 +4952,7 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
struct btrfs_fs_info *fs_info = eb->fs_info;
struct extent_io_tree *io_tree;
struct page *page = eb->pages[0];
+ struct extent_state *cached_state = NULL;
struct btrfs_bio_ctrl bio_ctrl = {
.mirror_num = mirror_num,
};
@@ -4963,10 +4964,11 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
if (wait == WAIT_NONE) {
if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
- NULL))
+ &cached_state))
return -EAGAIN;
} else {
- ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
+ ret = lock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+ &cached_state);
if (ret < 0)
return ret;
}
@@ -4976,7 +4978,8 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
PageUptodate(page) ||
btrfs_subpage_test_uptodate(fs_info, page, eb->start, eb->len)) {
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
- unlock_extent(io_tree, eb->start, eb->start + eb->len - 1, NULL);
+ unlock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+ &cached_state);
return ret;
}
@@ -5001,11 +5004,13 @@ static int read_extent_buffer_subpage(struct extent_buffer *eb, int wait,
atomic_dec(&eb->io_pages);
}
submit_one_bio(&bio_ctrl);
- if (ret || wait != WAIT_COMPLETE)
+ if (ret || wait != WAIT_COMPLETE) {
+ free_extent_state(cached_state);
return ret;
+ }
wait_extent_bit(io_tree, eb->start, eb->start + eb->len - 1,
- EXTENT_LOCKED, NULL);
+ EXTENT_LOCKED, &cached_state);
if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
ret = -EIO;
return ret;
--
2.26.3
* Re: [PATCH 0/6] Deadlock fix and cleanups
2022-09-30 20:45 [PATCH 0/6] Deadlock fix and cleanups Josef Bacik
` (5 preceding siblings ...)
2022-09-30 20:45 ` [PATCH 6/6] btrfs: add cached_state to read_extent_buffer_subpage Josef Bacik
@ 2022-10-03 11:03 ` Filipe Manana
2022-10-07 16:21 ` David Sterba
7 siblings, 0 replies; 9+ messages in thread
From: Filipe Manana @ 2022-10-03 11:03 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Fri, Sep 30, 2022 at 10:08 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> Hello,
>
> There's a problem with how we do our extent locking, namely that if we encounter
> any contention in the range we're trying to lock we'll leave any area we were
> able to lock locked. This creates dependency issues when the contended area is
> blocked on some other thing that depends on an operation that would need the
> range lock for the area that was successfully locked. The exact scenario we
> encountered is described in patch 1. We need to change this behavior to simply
> wait on the contended area and then attempt to re-lock our entire range,
> clearing anything that was locked before we encountered contention.
>
> The followup patches are optimizations and cleanups around caching the extent
> state in more places. Additionally I've added caching of the extent state of
> the contended area to hopefully reduce the pain of needing to unlock and then
> wait on the contended area.
>
> The first patch needs to be backported to stable because of the extent-io-tree.c
> move. Once it lands in Linus I'll manually backport the patch and send it back
> to the stable branches.
>
> I've run this through a sanity test on xfstests, and I ran it through 2 tests on
> fsperf that I felt would hammer on the extent locking code the most and be most
> likely to run into lock contention. I used the new function profiling stuff to
> grab latency numbers for lock_extent(), but of course there's a lot of variance
> here. The only thing that fell outside of the standard deviation were some of
> the maximum latency numbers, but generally everything was within the standard
> deviation. Again these are mostly just for information, deadlock fixing comes
> before performance. Thanks,
>
> Josef
>
> bufferedrandwrite16g results
> metric baseline current stdev diff
> ================================================================================
> avg_commit_ms 10090.03 8453.23 3505.07 -16.22%
> bg_count 20.80 20.80 0.45 0.00%
> commits 6.20 6 1.10 -3.23%
> elapsed 140.60 139.60 18.06 -0.71%
> end_state_mount_ns 42066858.80 45104291.80 6317588.83 7.22%
> end_state_umount_ns 6.26e+09 6.23e+09 1.76e+08 -0.39%
> lock_extent_calls 10795318.60 10713477.60 233933.51 -0.76%
> lock_extent_ns_max 3945976.80 7027641 1676910.27 78.10%
> lock_extent_ns_mean 2258.36 2187.89 212.76 -3.12%
> lock_extent_ns_min 503.60 500.80 7.44 -0.56%
> lock_extent_ns_p50 1964.80 1953.60 190.31 -0.57%
> lock_extent_ns_p95 4257.40 4153.20 409.06 -2.45%
> lock_extent_ns_p99 6967.20 6429.20 1166.93 -7.72%
> max_commit_ms 24686.20 25927 5930.26 5.03%
> sys_cpu 46.68 45.52 6.83 -2.48%
> write_bw_bytes 1.25e+08 1.24e+08 15352552.51 -0.61%
> write_clat_ns_mean 23568.91 21840.81 4061.83 -7.33%
> write_clat_ns_p50 13734.40 13683.20 1268.43 -0.37%
> write_clat_ns_p99 33152 30873.60 4236.59 -6.87%
> write_io_kbytes 16777216 16777216 0 0.00%
> write_iops 30413.83 30228.55 3748.18 -0.61%
> write_lat_ns_max 1.72e+10 1.77e+10 9.25e+09 2.64%
> write_lat_ns_mean 23645.69 21934.65 4057.93 -7.24%
> write_lat_ns_min 6049.40 5877.60 80.29 -2.84%
>
> randwrite2xram results
> metric baseline current stdev diff
> ================================================================================
> avg_commit_ms 21196.15 32607.37 2286.20 53.84%
> bg_count 43.80 39.80 6.46 -9.13%
> commits 11.20 9.80 1.10 -12.50%
> elapsed 329.20 350 4.97 6.32%
> end_state_mount_ns 61846504.40 57392390.60 7914609.45 -7.20%
> end_state_umount_ns 1.74e+10 1.80e+10 2.35e+09 3.44%
> lock_extent_calls 26193512.60 24046630 4169768.34 -8.20%
> lock_extent_ns_max 23699711.60 17524496.80 13253697.58 -26.06%
> lock_extent_ns_mean 1871.04 1866.95 26.60 -0.22%
> lock_extent_ns_min 495.60 501 5.41 1.09%
> lock_extent_ns_p50 1681.80 1675.40 22.13 -0.38%
> lock_extent_ns_p95 3487.60 3492 45.35 0.13%
> lock_extent_ns_p99 4416.60 4431.80 44.77 0.34%
> max_commit_ms 53482 116910.40 8707.34 118.60%
> sys_cpu 9.88 8.32 1.75 -15.85%
> write_bw_bytes 1.05e+08 89841897.40 20713472.34 -14.84%
> write_clat_ns_mean 158731.16 234418.93 30030.49 47.68%
> write_clat_ns_p50 14732.80 14873.60 448.91 0.96%
> write_clat_ns_p99 75622.40 82892.80 12865.88 9.61%
> write_io_kbytes 31930975.20 28774377.60 5985362.29 -9.89%
> write_iops 25756.55 21934.06 5057.00 -14.84%
> write_lat_ns_max 3.49e+10 5.87e+10 6.68e+09 68.41%
> write_lat_ns_mean 158828.62 234533.46 30028.84 47.66%
> write_lat_ns_min 7809 7923.40 371.87 1.46%
>
> Josef Bacik (6):
> btrfs: unlock locked extent area if we have contention
> btrfs: add a cached_state to try_lock_extent
> btrfs: use cached_state for btrfs_check_nocow_lock
> btrfs: use a cached_state everywhere in relocation
> btrfs: cache the failed state when locking extents
> btrfs: add cached_state to read_extent_buffer_subpage
The whole patchset looks good to me, so
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Thanks.
>
> fs/btrfs/extent-io-tree.c | 68 +++++++++++++++++++++++++++------------
> fs/btrfs/extent-io-tree.h | 6 ++--
> fs/btrfs/extent_io.c | 17 +++++++---
> fs/btrfs/file.c | 12 ++++---
> fs/btrfs/inode.c | 3 +-
> fs/btrfs/ordered-data.c | 7 ++--
> fs/btrfs/ordered-data.h | 3 +-
> fs/btrfs/relocation.c | 40 +++++++++++++++--------
> 8 files changed, 106 insertions(+), 50 deletions(-)
>
> --
> 2.26.3
>
--
Filipe David Manana,
“Whether you think you can, or you think you can't — you're right.”
* Re: [PATCH 0/6] Deadlock fix and cleanups
2022-09-30 20:45 [PATCH 0/6] Deadlock fix and cleanups Josef Bacik
` (6 preceding siblings ...)
2022-10-03 11:03 ` [PATCH 0/6] Deadlock fix and cleanups Filipe Manana
@ 2022-10-07 16:21 ` David Sterba
7 siblings, 0 replies; 9+ messages in thread
From: David Sterba @ 2022-10-07 16:21 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs, kernel-team
On Fri, Sep 30, 2022 at 04:45:07PM -0400, Josef Bacik wrote:
> Hello,
>
> There's a problem with how we do our extent locking, namely that if we encounter
> any contention in the range we're trying to lock we'll leave any area we were
> able to lock locked. This creates dependency issues when the contended area is
> blocked on some other thing that depends on an operation that would need the
> range lock for the area that was successfully locked. The exact scenario we
> encountered is described in patch 1. We need to change this behavior to simply
> wait on the contended area and then attempt to re-lock our entire range,
> clearing anything that was locked before we encountered contention.
>
> The followup patches are optimizations and cleanups around caching the extent
> state in more places. Additionally I've added caching of the extent state of
> the contended area to hopefully reduce the pain of needing to unlock and then
> wait on the contended area.
>
> The first patch needs to be backported to stable because of the extent-io-tree.c
> move. Once it lands in Linus I'll manually backport the patch and send it back
> to the stable branches.
I want to add the CC: stable tag to the patch so I did the backport to
see how far it applies.
https://github.com/kdave/btrfs-devel/commit/3405ac932abaa81ce2e1cf0f8cc9b311f7a8287d
This is for 5.15+; before then the code looks different and maybe the fix
is still applicable but needs a closer look. I'll mark it for 5.15+ and any
other backports can be sent separately.
> I've run this through a sanity test on xfstests, and I ran it through 2 tests on
> fsperf that I felt would hammer on the extent locking code the most and be most
> likely to run into lock contention. I used the new function profiling stuff to
> grab latency numbers for lock_extent(), but of course there's a lot of variance
> here. The only thing that fell outside of the standard deviation were some of
> the maximum latency numbers, but generally everything was within the standard
> deviation. Again these are mostly just for information, deadlock fixing comes
> before performance. Thanks,
Yeah the deadlock fix will go to 6.1, the other patches are simple but
not really fixes so I'll probably leave them for 6.2 unless there's
another reason for 6.1.
> Josef
>
> bufferedrandwrite16g results
> metric baseline current stdev diff
> ================================================================================
> avg_commit_ms 10090.03 8453.23 3505.07 -16.22%
> bg_count 20.80 20.80 0.45 0.00%
> commits 6.20 6 1.10 -3.23%
> elapsed 140.60 139.60 18.06 -0.71%
> end_state_mount_ns 42066858.80 45104291.80 6317588.83 7.22%
> end_state_umount_ns 6.26e+09 6.23e+09 1.76e+08 -0.39%
> lock_extent_calls 10795318.60 10713477.60 233933.51 -0.76%
> lock_extent_ns_max 3945976.80 7027641 1676910.27 78.10%
> lock_extent_ns_mean 2258.36 2187.89 212.76 -3.12%
> lock_extent_ns_min 503.60 500.80 7.44 -0.56%
> lock_extent_ns_p50 1964.80 1953.60 190.31 -0.57%
> lock_extent_ns_p95 4257.40 4153.20 409.06 -2.45%
> lock_extent_ns_p99 6967.20 6429.20 1166.93 -7.72%
> max_commit_ms 24686.20 25927 5930.26 5.03%
> sys_cpu 46.68 45.52 6.83 -2.48%
> write_bw_bytes 1.25e+08 1.24e+08 15352552.51 -0.61%
> write_clat_ns_mean 23568.91 21840.81 4061.83 -7.33%
> write_clat_ns_p50 13734.40 13683.20 1268.43 -0.37%
> write_clat_ns_p99 33152 30873.60 4236.59 -6.87%
> write_io_kbytes 16777216 16777216 0 0.00%
> write_iops 30413.83 30228.55 3748.18 -0.61%
> write_lat_ns_max 1.72e+10 1.77e+10 9.25e+09 2.64%
> write_lat_ns_mean 23645.69 21934.65 4057.93 -7.24%
> write_lat_ns_min 6049.40 5877.60 80.29 -2.84%
>
> randwrite2xram results
> metric baseline current stdev diff
> ================================================================================
> avg_commit_ms 21196.15 32607.37 2286.20 53.84%
> bg_count 43.80 39.80 6.46 -9.13%
> commits 11.20 9.80 1.10 -12.50%
> elapsed 329.20 350 4.97 6.32%
> end_state_mount_ns 61846504.40 57392390.60 7914609.45 -7.20%
> end_state_umount_ns 1.74e+10 1.80e+10 2.35e+09 3.44%
> lock_extent_calls 26193512.60 24046630 4169768.34 -8.20%
> lock_extent_ns_max 23699711.60 17524496.80 13253697.58 -26.06%
> lock_extent_ns_mean 1871.04 1866.95 26.60 -0.22%
> lock_extent_ns_min 495.60 501 5.41 1.09%
> lock_extent_ns_p50 1681.80 1675.40 22.13 -0.38%
> lock_extent_ns_p95 3487.60 3492 45.35 0.13%
> lock_extent_ns_p99 4416.60 4431.80 44.77 0.34%
> max_commit_ms 53482 116910.40 8707.34 118.60%
> sys_cpu 9.88 8.32 1.75 -15.85%
> write_bw_bytes 1.05e+08 89841897.40 20713472.34 -14.84%
> write_clat_ns_mean 158731.16 234418.93 30030.49 47.68%
> write_clat_ns_p50 14732.80 14873.60 448.91 0.96%
> write_clat_ns_p99 75622.40 82892.80 12865.88 9.61%
> write_io_kbytes 31930975.20 28774377.60 5985362.29 -9.89%
> write_iops 25756.55 21934.06 5057.00 -14.84%
> write_lat_ns_max 3.49e+10 5.87e+10 6.68e+09 68.41%
> write_lat_ns_mean 158828.62 234533.46 30028.84 47.66%
> write_lat_ns_min 7809 7923.40 371.87 1.46%
>
> Josef Bacik (6):
> btrfs: unlock locked extent area if we have contention
> btrfs: add a cached_state to try_lock_extent
> btrfs: use cached_state for btrfs_check_nocow_lock
> btrfs: use a cached_state everywhere in relocation
> btrfs: cache the failed state when locking extents
> btrfs: add cached_state to read_extent_buffer_subpage
Added to misc-next, thanks.
Thread overview: 9+ messages
2022-09-30 20:45 [PATCH 0/6] Deadlock fix and cleanups Josef Bacik
2022-09-30 20:45 ` [PATCH 1/6] btrfs: unlock locked extent area if we have contention Josef Bacik
2022-09-30 20:45 ` [PATCH 2/6] btrfs: add a cached_state to try_lock_extent Josef Bacik
2022-09-30 20:45 ` [PATCH 3/6] btrfs: use cached_state for btrfs_check_nocow_lock Josef Bacik
2022-09-30 20:45 ` [PATCH 4/6] btrfs: use a cached_state everywhere in relocation Josef Bacik
2022-09-30 20:45 ` [PATCH 5/6] btrfs: cache the failed state when locking extents Josef Bacik
2022-09-30 20:45 ` [PATCH 6/6] btrfs: add cached_state to read_extent_buffer_subpage Josef Bacik
2022-10-03 11:03 ` [PATCH 0/6] Deadlock fix and cleanups Filipe Manana
2022-10-07 16:21 ` David Sterba