* [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks
@ 2022-03-15 10:47 fdmanana
  2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana
  ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The first patch fixes a bug in fallocate (more details in its changelog),
while the remaining ones just remove some code and logic that is no longer
needed. The removals/clean ups are independent of the bug fix.

Filipe Manana (6):
  btrfs: only reserve the needed data space amount during fallocate
  btrfs: remove useless dio wait call when doing fallocate zero range
  btrfs: remove inode_dio_wait() calls when starting reflink operations
  btrfs: remove ordered extent check and wait during fallocate
  btrfs: remove ordered extent check and wait during hole punching and
    zero range
  btrfs: add and use helper to assert an inode range is clean

 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/file.c    | 169 ++++++++++++++++++---------------------------
 fs/btrfs/inode.c   |  37 ++++++++++
 fs/btrfs/reflink.c |  23 +++---
 4 files changed, 115 insertions(+), 115 deletions(-)

-- 
2.33.0

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate
  2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
@ 2022-03-15 10:47 ` fdmanana
  2022-03-15 10:47 ` [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
  ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space than we need, which can result in several
undesired behaviours:

1) We fail with -ENOSPC. For example the passed range has a length of 1G,
   but there's only one hole with a size of 1M in that range;

2) We temporarily reserve excessive data space that could be used by
   other operations happening concurrently;

3) By reserving much more data space than we need, we can end up doing
   expensive things like triggering delalloc for other inodes, waiting
   for the ordered extents to complete, triggering transaction commits,
   allocating new block groups, etc.

Example:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f -b 1g $DEV
mount $DEV $MNT

# Create a file with a size of 600M and two holes, one at [200M, 201M[
# and another at [401M, 402M[
xfs_io -f -c "pwrite -S 0xab 0 200M" \
          -c "pwrite -S 0xcd 201M 200M" \
          -c "pwrite -S 0xef 402M 198M" \
          $MNT/foobar

# Now call fallocate against the whole file range, see if it fails
# with -ENOSPC or not - it shouldn't since we only need to allocate
# 2M of data space.
xfs_io -c "falloc 0 600M" $MNT/foobar

umount $MNT

$ ./test.sh
(...)
wrote 209715200/209715200 bytes at offset 0
200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
wrote 209715200/209715200 bytes at offset 210763776
200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
wrote 207618048/207618048 bytes at offset 421527552
198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
fallocate: No space left on device
$

So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc in
the range because we've taken the inode's VFS lock in exclusive mode, we
have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.

This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which is
referenced by the Link tag below.
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 69 ++++++++++++++++++++++++++----------------------- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 380054c94e4b..b7c0db1000cd 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode, u64 alloc_hint = 0; u64 locked_end; u64 actual_end = 0; + u64 data_space_needed = 0; + u64 data_space_reserved = 0; + u64 qgroup_reserved = 0; struct extent_map *em; int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode)); int ret; @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_PUNCH_HOLE) return btrfs_punch_hole(file, offset, len); - /* - * Only trigger disk allocation, don't trigger qgroup reserve - * - * For qgroup space, it will be checked later. 
- */ - if (!(mode & FALLOC_FL_ZERO_RANGE)) { - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), - alloc_end - alloc_start); - if (ret < 0) - return ret; - } - btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode, if (em->block_start == EXTENT_MAP_HOLE || (cur_offset >= inode->i_size && !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { + const u64 range_len = last_byte - cur_offset; + ret = add_falloc_range(&reserve_list, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { free_extent_map(em); break; } ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { - cur_offset = last_byte; free_extent_map(em); break; } - } else { - /* - * Do not need to reserve unwritten extent for this - * range, free reserved data space first, otherwise - * it'll result in false ENOSPC error. - */ - btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, cur_offset, - last_byte - cur_offset); + qgroup_reserved += range_len; + data_space_needed += range_len; } free_extent_map(em); cur_offset = last_byte; } + if (!ret && data_space_needed > 0) { + /* + * We are safe to reserve space here as we can't have delalloc + * in the range, see above. + */ + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), + data_space_needed); + if (!ret) + data_space_reserved = data_space_needed; + } + /* * If ret is still 0, means we're OK to fallocate. * Or just cleanup the list and exit. */ list_for_each_entry_safe(range, tmp, &reserve_list, list) { - if (!ret) + if (!ret) { ret = btrfs_prealloc_file_range(inode, mode, range->start, range->len, i_blocksize(inode), offset + len, &alloc_hint); - else + /* + * btrfs_prealloc_file_range() releases space even + * if it returns an error. 
+ */ + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (data_space_reserved > 0) { btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, range->start, - range->len); + data_reserved, range->start, + range->len); + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (qgroup_reserved > 0) { + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, + range->start, range->len); + qgroup_reserved -= range->len; + } list_del(&range->list); kfree(range); } @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode, &cached_state); out: btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); - /* Let go of our reservation. */ - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, - cur_offset, alloc_end - cur_offset); extent_changeset_free(data_reserved); return ret; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range
  2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana
@ 2022-03-15 10:47 ` fdmanana
  2022-03-15 10:47 ` [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana
  ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When starting a fallocate zero range operation, before getting the first
extent map for the range, we make a call to inode_dio_wait().

This logic was needed in the past because direct IO writes within the
i_size boundary did not take the inode's VFS lock. This was because that
lock used to be a mutex, then some years ago it was switched to a rw
semaphore (by commit 9902af79c01a8e ("parallel lookups: actual switch to
rwsem")), and then btrfs was changed to take the VFS inode's lock in
shared mode for writes that don't cross the i_size boundary (done in
commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within
EOF")). The lockless direct IO writes could result in a race with the
zero range operation, resulting in the latter getting a stale extent map
for the range.

So remove this no longer needed call to inode_dio_wait(), as fallocate
takes the inode's VFS lock in exclusive mode and direct IO writes within
i_size take that same lock in shared mode.
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b7c0db1000cd..2f57f7d9d9cb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3237,8 +3237,6 @@ static int btrfs_zero_range(struct inode *inode, u64 bytes_to_reserve = 0; bool space_reserved = false; - inode_dio_wait(inode); - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, alloc_start, alloc_end - alloc_start); if (IS_ERR(em)) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations
  2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana
  2022-03-15 10:47 ` [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
@ 2022-03-15 10:47 ` fdmanana
  2022-03-15 10:47 ` [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate fdmanana
  ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When starting a reflink operation we have these calls to inode_dio_wait()
which used to be needed because direct IO writes that don't cross the
i_size boundary did not take the inode's VFS lock, so we could race with
them and end up with ordered extents in the target range after calling
btrfs_wait_ordered_range().

However that is not the case anymore, because the inode's VFS lock was
changed from a mutex to a rw semaphore, by commit 9902af79c01a8e
("parallel lookups: actual switch to rwsem"), and several years later we
started taking the inode's VFS lock in shared mode for direct IO writes
that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use
shared lock for direct writes within EOF")).

So remove those inode_dio_wait() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/reflink.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 04a88bfe4fcf..bbd5da25c475 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -771,7 +771,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); u64 bs = BTRFS_I(inode_out)->root->fs_info->sb->s_blocksize; - bool same_inode = inode_out == inode_in; u64 wb_len; int ret; @@ -809,15 +808,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, else wb_len = ALIGN(*len, bs); - /* - * Since we don't lock ranges, wait for ongoing lockless dio writes (as - * any in progress could create its ordered extents after we wait for - * existing ordered extents below). - */ - inode_dio_wait(inode_in); - if (!same_inode) - inode_dio_wait(inode_out); - /* * Workaround to make sure NOCOW buffered write reach disk as NOCOW. * -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate 2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (2 preceding siblings ...) 2022-03-15 10:47 ` [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana @ 2022-03-15 10:47 ` fdmanana 2022-03-15 10:47 ` [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana ` (3 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For fallocate() we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing a plain fallocate. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 42 ++++++++---------------------------------- 1 file changed, 8 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2f57f7d9d9cb..a7fd05c1d52f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3474,8 +3474,12 @@ static long btrfs_fallocate(struct file *file, int mode, } /* - * wait for ordered IO before we have any locks. We'll loop again - * below with the locks held. + * We have locked the inode at the VFS level (in exclusive mode) and we + * have locked the i_mmap_lock lock (in exclusive mode). Now before + * locking the file range, flush all delalloc in the range and wait for + * all ordered extents in the range to complete. After this we can lock + * the file range and, due to the previous locking we did, we know there + * can't be more delalloc or ordered extents in the range. */ ret = btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); @@ -3489,38 +3493,8 @@ static long btrfs_fallocate(struct file *file, int mode, } locked_end = alloc_end - 1; - while (1) { - struct btrfs_ordered_extent *ordered; - - /* the extent lock is ordered inside the running - * transaction - */ - lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, - locked_end, &cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - locked_end); - - if (ordered && - ordered->file_offset + ordered->num_bytes > alloc_start && - ordered->file_offset < alloc_end) { - btrfs_put_ordered_extent(ordered); - unlock_extent_cached(&BTRFS_I(inode)->io_tree, - alloc_start, locked_end, - &cached_state); - /* - * we can't wait on the range with the transaction - * running or with the extent lock held - */ - ret = btrfs_wait_ordered_range(inode, alloc_start, - alloc_end - alloc_start); - if (ret) - goto out; - } else { - if (ordered) - btrfs_put_ordered_extent(ordered); - break; - } - } + lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, + 
&cached_state); /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range 2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (3 preceding siblings ...) 2022-03-15 10:47 ` [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate fdmanana @ 2022-03-15 10:47 ` fdmanana 2022-03-15 10:47 ` [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean fdmanana ` (2 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For hole punching and zero range we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing hole punching and zero range operations. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 54 +++++++++++++++++-------------------------------- 1 file changed, 18 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a7fd05c1d52f..9f8db8a12a2f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2570,10 +2570,10 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len) return ret; } -static int btrfs_punch_hole_lock_range(struct inode *inode, - const u64 lockstart, - const u64 lockend, - struct extent_state **cached_state) +static void btrfs_punch_hole_lock_range(struct inode *inode, + const u64 lockstart, + const u64 lockend, + struct extent_state **cached_state) { /* * For subpage case, if the range is not at page boundary, we could @@ -2587,40 +2587,27 @@ static int btrfs_punch_hole_lock_range(struct inode *inode, const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1; while (1) { - struct btrfs_ordered_extent *ordered; - int ret; - truncate_pagecache_range(inode, lockstart, lockend); lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - lockend); - /* - * We need to make sure we have no ordered extents in this range - * and nobody raced in and read a page in this range, if we did - * we need to try again. + * We can't have ordered extents in the range, nor dirty/writeback + * pages, because we have locked the inode's VFS lock in exclusive + * mode, we have locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range and we have waited + * for any ordered extents in the range to complete. + * We can race with anyone reading pages from this range, so after + * locking the range check if we have pages in the range, and if + * we do, unlock the range and retry. 
*/ - if ((!ordered || - (ordered->file_offset + ordered->num_bytes <= lockstart || - ordered->file_offset > lockend)) && - !filemap_range_has_page(inode->i_mapping, - page_lockstart, page_lockend)) { - if (ordered) - btrfs_put_ordered_extent(ordered); + if (!filemap_range_has_page(inode->i_mapping, page_lockstart, + page_lockend)) break; - } - if (ordered) - btrfs_put_ordered_extent(ordered); + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ret = btrfs_wait_ordered_range(inode, lockstart, - lockend - lockstart + 1); - if (ret) - return ret; } - return 0; } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3072,10 +3059,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len) goto out_only_mutex; } - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out_only_mutex; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, &cached_state); path = btrfs_alloc_path(); if (!path) { @@ -3366,10 +3350,8 @@ static int btrfs_zero_range(struct inode *inode, if (ret < 0) goto out; space_reserved = true; - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, + &cached_state); ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, alloc_start, bytes_to_reserve); if (ret) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean
  2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  ` (4 preceding siblings ...)
  2022-03-15 10:47 ` [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana
@ 2022-03-15 10:47 ` fdmanana
  2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  7 siblings, 0 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 10:47 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We have four different scenarios where we don't expect to find ordered
extents after locking a file range:

1) During plain fallocate;
2) During hole punching;
3) During zero range;
4) During reflinks (both cloning and deduplication).

This is because in all these cases we follow the pattern:

1) Lock the inode's VFS lock in exclusive mode;

2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
   mmap writes;

3) Flush delalloc in the file range and wait for all ordered extents to
   complete - both done through btrfs_wait_ordered_range();

4) Lock the file range in the inode's io_tree.

So add a helper that asserts that we don't have ordered extents for a
given range. Make the four scenarios listed above use this helper after
locking the respective file range.
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 4 ++++ fs/btrfs/inode.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/reflink.c | 13 +++++++++++-- 4 files changed, 53 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4db17bd05a21..c58baad426f8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3327,6 +3327,7 @@ void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags); void btrfs_update_inode_bytes(struct btrfs_inode *inode, const u64 add_bytes, const u64 del_bytes); +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end); /* ioctl.c */ long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 9f8db8a12a2f..e98fac5322e9 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2608,6 +2608,8 @@ static void btrfs_punch_hole_lock_range(struct inode *inode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); } + + btrfs_assert_inode_range_clean(BTRFS_I(inode), lockstart, lockend); } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3478,6 +3480,8 @@ static long btrfs_fallocate(struct file *file, int mode, lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); + btrfs_assert_inode_range_clean(BTRFS_I(inode), alloc_start, locked_end); + /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); while (cur_offset < alloc_end) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2e7143ff5523..50c7e23877a9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -11306,6 +11306,43 @@ void btrfs_update_inode_bytes(struct btrfs_inode *inode, spin_unlock(&inode->lock); } +/* + * Verify that there are no ordered extents for a given file range. + * + * @inode: The target inode. + * @start: Start offset of the file range, should be sector size + * aligned. 
+ * @end: End offset (inclusive) of the file range, its value +1 + * should be sector size aligned. + * + * This should typically be used for cases where we locked an inode's VFS lock in + * exclusive mode, we have also locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range, we have waited for all ordered + * extents in the range to complete and finally we have locked the file range in + * the inode's io_tree. + */ +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end) +{ + struct btrfs_root *root = inode->root; + struct btrfs_ordered_extent *ordered; + + if (!IS_ENABLED(CONFIG_BTRFS_ASSERT)) + return; + + ordered = btrfs_lookup_first_ordered_range(inode, start, + end + 1 - start); + if (ordered) { + btrfs_err(root->fs_info, +"found unexpected ordered extent in file range [%llu, %llu] for inode %llu root %llu (ordered range [%llu, %llu])", + start, end, btrfs_ino(inode), root->root_key.objectid, + ordered->file_offset, + ordered->file_offset + ordered->num_bytes - 1); + btrfs_put_ordered_extent(ordered); + } + + ASSERT(ordered == NULL); +} + static const struct inode_operations btrfs_dir_inode_operations = { .getattr = btrfs_getattr, .lookup = btrfs_lookup, diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index bbd5da25c475..85d8f0d5d3e0 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -614,14 +614,23 @@ static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1, static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, struct inode *inode2, u64 loff2, u64 len) { + u64 range1_end = loff1 + len - 1; + u64 range2_end = loff2 + len - 1; + if (inode1 < inode2) { swap(inode1, inode2); swap(loff1, loff2); + swap(range1_end, range2_end); } else if (inode1 == inode2 && loff2 < loff1) { swap(loff1, loff2); + swap(range1_end, range2_end); } - lock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1); - lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 
1); + + lock_extent(&BTRFS_I(inode1)->io_tree, loff1, range1_end); + lock_extent(&BTRFS_I(inode2)->io_tree, loff2, range2_end); + + btrfs_assert_inode_range_clean(BTRFS_I(inode1), loff1, range1_end); + btrfs_assert_inode_range_clean(BTRFS_I(inode2), loff2, range2_end); } static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2) -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks
  2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  ` (5 preceding siblings ...)
  2022-03-15 10:47 ` [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean fdmanana
@ 2022-03-15 12:18 ` fdmanana
  2022-03-15 12:18 ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana
  ` (6 more replies)
  2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
  7 siblings, 7 replies; 26+ messages in thread

From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The first patch fixes a bug in fallocate (more details in its changelog),
while the remaining ones just remove some code and logic that is no longer
needed. The removals/clean ups are independent of the bug fix.

V2: Added one extra patch, 5/7, which was missing in v1. Fixed a typo in
    the changelog of the first patch as well.

Filipe Manana (7):
  btrfs: only reserve the needed data space amount during fallocate
  btrfs: remove useless dio wait call when doing fallocate zero range
  btrfs: remove inode_dio_wait() calls when starting reflink operations
  btrfs: remove ordered extent check and wait during fallocate
  btrfs: lock the inode first before flushing range when punching hole
  btrfs: remove ordered extent check and wait during hole punching and
    zero range
  btrfs: add and use helper to assert an inode range is clean

 fs/btrfs/ctree.h   |   1 +
 fs/btrfs/file.c    | 172 ++++++++++++++++++---------------------------
 fs/btrfs/inode.c   |  37 ++++++++++
 fs/btrfs/reflink.c |  23 +++---
 4 files changed, 117 insertions(+), 116 deletions(-)

-- 
2.33.0

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate
  2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana
@ 2022-03-15 12:18 ` fdmanana
  2022-03-15 13:16   ` Wang Yugui
  2022-03-15 12:18 ` [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana
  ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread

From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw)
To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space than we need, which can result in several
undesired behaviours:

1) We fail with -ENOSPC. For example the passed range has a length of 1G,
   but there's only one hole with a size of 1M in that range;

2) We temporarily reserve excessive data space that could be used by
   other operations happening concurrently;

3) By reserving much more data space than we need, we can end up doing
   expensive things like triggering delalloc for other inodes, waiting
   for the ordered extents to complete, triggering transaction commits,
   allocating new block groups, etc.

Example:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

mkfs.btrfs -f -b 1g $DEV
mount $DEV $MNT

# Create a file with a size of 600M and two holes, one at [200M, 201M[
# and another at [401M, 402M[
xfs_io -f -c "pwrite -S 0xab 0 200M" \
          -c "pwrite -S 0xcd 201M 200M" \
          -c "pwrite -S 0xef 402M 198M" \
          $MNT/foobar

# Now call fallocate against the whole file range, see if it fails
# with -ENOSPC or not - it shouldn't since we only need to allocate
# 2M of data space.
xfs_io -c "falloc 0 600M" $MNT/foobar

umount $MNT

$ ./test.sh
(...)
wrote 209715200/209715200 bytes at offset 0
200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
wrote 209715200/209715200 bytes at offset 210763776
200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
wrote 207618048/207618048 bytes at offset 421527552
198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
fallocate: No space left on device
$

So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc in
the range because we've taken the inode's VFS lock in exclusive mode, we
have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.

This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which is
referenced by the Link tag below.
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 69 ++++++++++++++++++++++++++----------------------- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 380054c94e4b..b7c0db1000cd 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode, u64 alloc_hint = 0; u64 locked_end; u64 actual_end = 0; + u64 data_space_needed = 0; + u64 data_space_reserved = 0; + u64 qgroup_reserved = 0; struct extent_map *em; int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode)); int ret; @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_PUNCH_HOLE) return btrfs_punch_hole(file, offset, len); - /* - * Only trigger disk allocation, don't trigger qgroup reserve - * - * For qgroup space, it will be checked later. 
- */ - if (!(mode & FALLOC_FL_ZERO_RANGE)) { - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), - alloc_end - alloc_start); - if (ret < 0) - return ret; - } - btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode, if (em->block_start == EXTENT_MAP_HOLE || (cur_offset >= inode->i_size && !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { + const u64 range_len = last_byte - cur_offset; + ret = add_falloc_range(&reserve_list, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { free_extent_map(em); break; } ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { - cur_offset = last_byte; free_extent_map(em); break; } - } else { - /* - * Do not need to reserve unwritten extent for this - * range, free reserved data space first, otherwise - * it'll result in false ENOSPC error. - */ - btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, cur_offset, - last_byte - cur_offset); + qgroup_reserved += range_len; + data_space_needed += range_len; } free_extent_map(em); cur_offset = last_byte; } + if (!ret && data_space_needed > 0) { + /* + * We are safe to reserve space here as we can't have delalloc + * in the range, see above. + */ + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), + data_space_needed); + if (!ret) + data_space_reserved = data_space_needed; + } + /* * If ret is still 0, means we're OK to fallocate. * Or just cleanup the list and exit. */ list_for_each_entry_safe(range, tmp, &reserve_list, list) { - if (!ret) + if (!ret) { ret = btrfs_prealloc_file_range(inode, mode, range->start, range->len, i_blocksize(inode), offset + len, &alloc_hint); - else + /* + * btrfs_prealloc_file_range() releases space even + * if it returns an error. 
+ */ + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (data_space_reserved > 0) { btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, range->start, - range->len); + data_reserved, range->start, + range->len); + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (qgroup_reserved > 0) { + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, + range->start, range->len); + qgroup_reserved -= range->len; + } list_del(&range->list); kfree(range); } @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode, &cached_state); out: btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); - /* Let go of our reservation. */ - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, - cur_offset, alloc_end - cur_offset); extent_changeset_free(data_reserved); return ret; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
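[Editorial note] As a rough illustration of the accounting this patch introduces, here is a small Python sketch that reserves only the sum of the hole sizes instead of the whole fallocate range. The names used here (Extent, plan_reservation) are made up for the example and are not btrfs APIs; plain byte ranges stand in for extent maps.

```python
# Toy model of the new reservation logic: walk the extent layout of the
# requested range, collect only the holes, and reserve the sum of their
# lengths, instead of reserving the full range length up front.

from dataclasses import dataclass

@dataclass
class Extent:
    start: int   # byte offset of an allocated extent (hypothetical)
    length: int  # byte length of the extent

def plan_reservation(extents, alloc_start, alloc_end):
    """Return (holes, data_space_needed) for [alloc_start, alloc_end)."""
    holes = []
    cur = alloc_start
    for ext in sorted(extents, key=lambda e: e.start):
        if ext.start > cur:                      # gap before this extent
            holes.append((cur, min(ext.start, alloc_end) - cur))
        cur = max(cur, ext.start + ext.length)
        if cur >= alloc_end:
            break
    if cur < alloc_end:                          # trailing hole
        holes.append((cur, alloc_end - cur))
    return holes, sum(length for _, length in holes)

# The commit message's example: a 600M file with two 1M holes only
# needs 2M of data space reserved, not 600M.
M = 1024 * 1024
extents = [Extent(0, 200 * M), Extent(201 * M, 200 * M), Extent(402 * M, 198 * M)]
holes, needed = plan_reservation(extents, 0, 600 * M)
```

With this layout, `needed` comes out to 2M, matching the reasoning in the commit message: reserving per hole is what lets the test script above succeed on a 1G filesystem.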
* Re: [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate 2022-03-15 12:18 ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana @ 2022-03-15 13:16 ` Wang Yugui 2022-03-15 13:45 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Wang Yugui @ 2022-03-15 13:16 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs Hi, > TODO: test case when the file with/without snapshot > > From: Filipe Manana <fdmanana@suse.com> > > During a plain fallocate, we always start by reserving an amount of data > space that matches the length of the range passed to fallocate. When we > already have extents allocated in that range, we may end up trying to > reserve a lot more data space then we need, which can result in two > undesired behaviours: Need we check whether the allocated extents are exclusive or shared? Best Regards Wang Yugui (wangyugui@e16-tech.com) 2022/03/15 > > 1) We fail with -ENOSPC. For example the passed range has a length > of 1G, but there's only one hole with a size of 1M in that range; > > 2) We temporarily reserve excessive data space that could be used by > other operations happening concurrently; > > 3) By reserving much more data space then we need, we can end up > doing expensive things like triggering dellaloc for other inodes, > waiting for the ordered extents to complete, trigger transaction > commits, allocate new block groups, etc. > > Example: > > $ cat test.sh > #!/bin/bash > > DEV=/dev/sdj > MNT=/mnt/sdj > > mkfs.btrfs -f -b 1g $DEV > mount $DEV $MNT > > # Create a file with a size of 600M and two holes, one at [200M, 201M[ > # and another at [401M, 402M[ > xfs_io -f -c "pwrite -S 0xab 0 200M" \ > -c "pwrite -S 0xcd 201M 200M" \ > -c "pwrite -S 0xef 402M 198M" \ > $MNT/foobar > > # Now call fallocate against the whole file range, see if it fails > # with -ENOSPC or not - it shouldn't since we only need to allocate > # 2M of data space. 
> xfs_io -c "falloc 0 600M" $MNT/foobar > > umount $MNT > > $ ./test.sh > (...) > wrote 209715200/209715200 bytes at offset 0 > 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec) > wrote 209715200/209715200 bytes at offset 210763776 > 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec) > wrote 207618048/207618048 bytes at offset 421527552 > 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec) > fallocate: No space left on device > $ > > So fix this by not allocating an amount of data space that matches the > length of the range passed to fallocate. Instead allocate an amount of > data space that corresponds to the sum of the sizes of each hole found > in the range. This reservation now happens after we have locked the file > range, which is safe since we know at this point there's no delalloc > in the range because we've taken the inode's VFS lock in exclusive mode, > we have taken the inode's i_mmap_lock in exclusive mode, we have flushed > delalloc and waited for all ordered extents in the range to complete. > > This type of failure actually seems to happen in pratice with systemd, > and we had at least one report about this in a very long thread which > is referenced by the Link tag below. 
> > Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > fs/btrfs/file.c | 69 ++++++++++++++++++++++++++----------------------- > 1 file changed, 37 insertions(+), 32 deletions(-) > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > index 380054c94e4b..b7c0db1000cd 100644 > --- a/fs/btrfs/file.c > +++ b/fs/btrfs/file.c > @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode, > u64 alloc_hint = 0; > u64 locked_end; > u64 actual_end = 0; > + u64 data_space_needed = 0; > + u64 data_space_reserved = 0; > + u64 qgroup_reserved = 0; > struct extent_map *em; > int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode)); > int ret; > @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode, > if (mode & FALLOC_FL_PUNCH_HOLE) > return btrfs_punch_hole(file, offset, len); > > - /* > - * Only trigger disk allocation, don't trigger qgroup reserve > - * > - * For qgroup space, it will be checked later. 
> - */ > - if (!(mode & FALLOC_FL_ZERO_RANGE)) { > - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), > - alloc_end - alloc_start); > - if (ret < 0) > - return ret; > - } > - > btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); > > if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { > @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode, > if (em->block_start == EXTENT_MAP_HOLE || > (cur_offset >= inode->i_size && > !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { > + const u64 range_len = last_byte - cur_offset; > + > ret = add_falloc_range(&reserve_list, cur_offset, > - last_byte - cur_offset); > + range_len); > if (ret < 0) { > free_extent_map(em); > break; > } > ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), > &data_reserved, cur_offset, > - last_byte - cur_offset); > + range_len); > if (ret < 0) { > - cur_offset = last_byte; > free_extent_map(em); > break; > } > - } else { > - /* > - * Do not need to reserve unwritten extent for this > - * range, free reserved data space first, otherwise > - * it'll result in false ENOSPC error. > - */ > - btrfs_free_reserved_data_space(BTRFS_I(inode), > - data_reserved, cur_offset, > - last_byte - cur_offset); > + qgroup_reserved += range_len; > + data_space_needed += range_len; > } > free_extent_map(em); > cur_offset = last_byte; > } > > + if (!ret && data_space_needed > 0) { > + /* > + * We are safe to reserve space here as we can't have delalloc > + * in the range, see above. > + */ > + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), > + data_space_needed); > + if (!ret) > + data_space_reserved = data_space_needed; > + } > + > /* > * If ret is still 0, means we're OK to fallocate. > * Or just cleanup the list and exit. 
> */ > list_for_each_entry_safe(range, tmp, &reserve_list, list) { > - if (!ret) > + if (!ret) { > ret = btrfs_prealloc_file_range(inode, mode, > range->start, > range->len, i_blocksize(inode), > offset + len, &alloc_hint); > - else > + /* > + * btrfs_prealloc_file_range() releases space even > + * if it returns an error. > + */ > + data_space_reserved -= range->len; > + qgroup_reserved -= range->len; > + } else if (data_space_reserved > 0) { > btrfs_free_reserved_data_space(BTRFS_I(inode), > - data_reserved, range->start, > - range->len); > + data_reserved, range->start, > + range->len); > + data_space_reserved -= range->len; > + qgroup_reserved -= range->len; > + } else if (qgroup_reserved > 0) { > + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, > + range->start, range->len); > + qgroup_reserved -= range->len; > + } > list_del(&range->list); > kfree(range); > } > @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode, > &cached_state); > out: > btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); > - /* Let go of our reservation. */ > - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) > - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, > - cur_offset, alloc_end - cur_offset); > extent_changeset_free(data_reserved); > return ret; > } > -- > 2.33.0 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate 2022-03-15 13:16 ` Wang Yugui @ 2022-03-15 13:45 ` Filipe Manana 0 siblings, 0 replies; 26+ messages in thread From: Filipe Manana @ 2022-03-15 13:45 UTC (permalink / raw) To: Wang Yugui; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 1:32 PM Wang Yugui <wangyugui@e16-tech.com> wrote: > > Hi, > > > TODO: test case when the file with/without snapshot > > > > From: Filipe Manana <fdmanana@suse.com> > > > > During a plain fallocate, we always start by reserving an amount of data > > space that matches the length of the range passed to fallocate. When we > > already have extents allocated in that range, we may end up trying to > > reserve a lot more data space then we need, which can result in two > > undesired behaviours: > > Need we check whether the allocated extents are exclusive or shared? I don't get it. Why? A plain fallocate only allocates extents for holes, regions without extents. If a region has an extent, shared or not, it doesn't touch that region. > > Best Regards > Wang Yugui (wangyugui@e16-tech.com) > 2022/03/15 > > > > > 1) We fail with -ENOSPC. For example the passed range has a length > > of 1G, but there's only one hole with a size of 1M in that range; > > > > 2) We temporarily reserve excessive data space that could be used by > > other operations happening concurrently; > > > > 3) By reserving much more data space then we need, we can end up > > doing expensive things like triggering dellaloc for other inodes, > > waiting for the ordered extents to complete, trigger transaction > > commits, allocate new block groups, etc. 
> > > > Example: > > > > $ cat test.sh > > #!/bin/bash > > > > DEV=/dev/sdj > > MNT=/mnt/sdj > > > > mkfs.btrfs -f -b 1g $DEV > > mount $DEV $MNT > > > > # Create a file with a size of 600M and two holes, one at [200M, 201M[ > > # and another at [401M, 402M[ > > xfs_io -f -c "pwrite -S 0xab 0 200M" \ > > -c "pwrite -S 0xcd 201M 200M" \ > > -c "pwrite -S 0xef 402M 198M" \ > > $MNT/foobar > > > > # Now call fallocate against the whole file range, see if it fails > > # with -ENOSPC or not - it shouldn't since we only need to allocate > > # 2M of data space. > > xfs_io -c "falloc 0 600M" $MNT/foobar > > > > umount $MNT > > > > $ ./test.sh > > (...) > > wrote 209715200/209715200 bytes at offset 0 > > 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec) > > wrote 209715200/209715200 bytes at offset 210763776 > > 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec) > > wrote 207618048/207618048 bytes at offset 421527552 > > 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec) > > fallocate: No space left on device > > $ > > > > So fix this by not allocating an amount of data space that matches the > > length of the range passed to fallocate. Instead allocate an amount of > > data space that corresponds to the sum of the sizes of each hole found > > in the range. This reservation now happens after we have locked the file > > range, which is safe since we know at this point there's no delalloc > > in the range because we've taken the inode's VFS lock in exclusive mode, > > we have taken the inode's i_mmap_lock in exclusive mode, we have flushed > > delalloc and waited for all ordered extents in the range to complete. > > > > This type of failure actually seems to happen in pratice with systemd, > > and we had at least one report about this in a very long thread which > > is referenced by the Link tag below. 
> > > > Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > fs/btrfs/file.c | 69 ++++++++++++++++++++++++++----------------------- > > 1 file changed, 37 insertions(+), 32 deletions(-) > > > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c > > index 380054c94e4b..b7c0db1000cd 100644 > > --- a/fs/btrfs/file.c > > +++ b/fs/btrfs/file.c > > @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode, > > u64 alloc_hint = 0; > > u64 locked_end; > > u64 actual_end = 0; > > + u64 data_space_needed = 0; > > + u64 data_space_reserved = 0; > > + u64 qgroup_reserved = 0; > > struct extent_map *em; > > int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode)); > > int ret; > > @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode, > > if (mode & FALLOC_FL_PUNCH_HOLE) > > return btrfs_punch_hole(file, offset, len); > > > > - /* > > - * Only trigger disk allocation, don't trigger qgroup reserve > > - * > > - * For qgroup space, it will be checked later. 
> > - */ > > - if (!(mode & FALLOC_FL_ZERO_RANGE)) { > > - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), > > - alloc_end - alloc_start); > > - if (ret < 0) > > - return ret; > > - } > > - > > btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); > > > > if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { > > @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode, > > if (em->block_start == EXTENT_MAP_HOLE || > > (cur_offset >= inode->i_size && > > !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { > > + const u64 range_len = last_byte - cur_offset; > > + > > ret = add_falloc_range(&reserve_list, cur_offset, > > - last_byte - cur_offset); > > + range_len); > > if (ret < 0) { > > free_extent_map(em); > > break; > > } > > ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), > > &data_reserved, cur_offset, > > - last_byte - cur_offset); > > + range_len); > > if (ret < 0) { > > - cur_offset = last_byte; > > free_extent_map(em); > > break; > > } > > - } else { > > - /* > > - * Do not need to reserve unwritten extent for this > > - * range, free reserved data space first, otherwise > > - * it'll result in false ENOSPC error. > > - */ > > - btrfs_free_reserved_data_space(BTRFS_I(inode), > > - data_reserved, cur_offset, > > - last_byte - cur_offset); > > + qgroup_reserved += range_len; > > + data_space_needed += range_len; > > } > > free_extent_map(em); > > cur_offset = last_byte; > > } > > > > + if (!ret && data_space_needed > 0) { > > + /* > > + * We are safe to reserve space here as we can't have delalloc > > + * in the range, see above. > > + */ > > + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), > > + data_space_needed); > > + if (!ret) > > + data_space_reserved = data_space_needed; > > + } > > + > > /* > > * If ret is still 0, means we're OK to fallocate. > > * Or just cleanup the list and exit. 
> > */ > > list_for_each_entry_safe(range, tmp, &reserve_list, list) { > > - if (!ret) > > + if (!ret) { > > ret = btrfs_prealloc_file_range(inode, mode, > > range->start, > > range->len, i_blocksize(inode), > > offset + len, &alloc_hint); > > - else > > + /* > > + * btrfs_prealloc_file_range() releases space even > > + * if it returns an error. > > + */ > > + data_space_reserved -= range->len; > > + qgroup_reserved -= range->len; > > + } else if (data_space_reserved > 0) { > > btrfs_free_reserved_data_space(BTRFS_I(inode), > > - data_reserved, range->start, > > - range->len); > > + data_reserved, range->start, > > + range->len); > > + data_space_reserved -= range->len; > > + qgroup_reserved -= range->len; > > + } else if (qgroup_reserved > 0) { > > + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, > > + range->start, range->len); > > + qgroup_reserved -= range->len; > > + } > > list_del(&range->list); > > kfree(range); > > } > > @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode, > > &cached_state); > > out: > > btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); > > - /* Let go of our reservation. */ > > - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) > > - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, > > - cur_offset, alloc_end - cur_offset); > > extent_changeset_free(data_reserved); > > return ret; > > } > > -- > > 2.33.0 > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 12:18 ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana ` (4 subsequent siblings) 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When starting a fallocate zero range operation, before getting the first extent map for the range, we make a call to inode_dio_wait(). This logic was needed in the past because direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (by commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (done in commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")). The lockless direct IO writes could result in a race with the zero range operation, resulting in the latter getting a stale extent map for the range. So remove this no longer needed call to inode_dio_wait(), as fallocate takes the inode's VFS lock in exclusive mode and direct IO writes within i_size take that same lock in shared mode. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b7c0db1000cd..2f57f7d9d9cb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3237,8 +3237,6 @@ static int btrfs_zero_range(struct inode *inode, u64 bytes_to_reserve = 0; bool space_reserved = false; - inode_dio_wait(inode); - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, alloc_start, alloc_end - alloc_start); if (IS_ERR(em)) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 12:18 ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana 2022-03-15 12:18 ` [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana ` (3 subsequent siblings) 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When starting a reflink operation we have these calls to inode_dio_wait() which used to be needed because direct IO writes that don't cross the i_size boundary did not take the inode's VFS lock, so we could race with them and end up with ordered extents in the target range after calling btrfs_wait_ordered_range(). However that is not the case anymore, because the inode's VFS lock was changed from a mutex to a rw semaphore, by commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem"), and several years later we started to take the inode's VFS lock in shared mode for direct IO writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")). So remove those inode_dio_wait() calls. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/reflink.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 04a88bfe4fcf..bbd5da25c475 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -771,7 +771,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); u64 bs = BTRFS_I(inode_out)->root->fs_info->sb->s_blocksize; - bool same_inode = inode_out == inode_in; u64 wb_len; int ret; @@ -809,15 +808,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, else wb_len = ALIGN(*len, bs); - /* - * Since we don't lock ranges, wait for ongoing lockless dio writes (as - * any in progress could create its ordered extents after we wait for - * existing ordered extents below). - */ - inode_dio_wait(inode_in); - if (!same_inode) - inode_dio_wait(inode_out); - /* * Workaround to make sure NOCOW buffered write reach disk as NOCOW. * -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (2 preceding siblings ...) 2022-03-15 12:18 ` [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana ` (2 subsequent siblings) 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For fallocate() we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing a plain fallocate. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 42 ++++++++---------------------------- 1 file changed, 8 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2f57f7d9d9cb..a7fd05c1d52f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3474,8 +3474,12 @@ static long btrfs_fallocate(struct file *file, int mode, } /* - * wait for ordered IO before we have any locks. We'll loop again - * below with the locks held. + * We have locked the inode at the VFS level (in exclusive mode) and we + * have locked the i_mmap_lock (in exclusive mode). Now before + * locking the file range, flush all delalloc in the range and wait for + * all ordered extents in the range to complete. After this we can lock + * the file range and, due to the previous locking we did, we know there + * can't be more delalloc or ordered extents in the range. */ ret = btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); @@ -3489,38 +3493,8 @@ static long btrfs_fallocate(struct file *file, int mode, } locked_end = alloc_end - 1; - while (1) { - struct btrfs_ordered_extent *ordered; - - /* the extent lock is ordered inside the running - * transaction - */ - lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, - locked_end, &cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - locked_end); - - if (ordered && - ordered->file_offset + ordered->num_bytes > alloc_start && - ordered->file_offset < alloc_end) { - btrfs_put_ordered_extent(ordered); - unlock_extent_cached(&BTRFS_I(inode)->io_tree, - alloc_start, locked_end, - &cached_state); - /* - * we can't wait on the range with the transaction - * running or with the extent lock held - */ - ret = btrfs_wait_ordered_range(inode, alloc_start, - alloc_end - alloc_start); - if (ret) - goto out; - } else { - if (ordered) - btrfs_put_ordered_extent(ordered); - break; - } - } + lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, + 
&cached_state); /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
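[Editorial note] The invariant this patch relies on (both inode locks held exclusively, then one flush-and-wait, after which no new ordered extents can appear before the extent lock is taken) can be sketched as a toy Python model. FakeInode and its methods are illustrative stand-ins for the btrfs structures, not kernel APIs.

```python
# Toy model of why the lookup-and-retry loop became unnecessary: writers
# need the inode's VFS lock, so while fallocate holds it exclusively no
# new ordered extents can be created, and one flush-and-wait suffices.
import threading

class FakeInode:
    """Illustrative stand-in for a btrfs inode (hypothetical)."""
    def __init__(self):
        self.vfs_lock = threading.Lock()     # inode VFS lock (exclusive here)
        self.mmap_lock = threading.Lock()    # i_mmap_lock
        self.ordered_extents = []            # pending ordered IO

    def write(self, offset, length):
        # Writers take the VFS lock, so they block while fallocate holds it.
        with self.vfs_lock:
            self.ordered_extents.append((offset, length))

    def flush_and_wait(self):
        # Models btrfs_wait_ordered_range(): all pending ordered IO completes.
        self.ordered_extents.clear()

def fallocate(inode):
    with inode.vfs_lock, inode.mmap_lock:
        inode.flush_and_wait()
        # With both locks held, no writer can have queued a new ordered
        # extent between the wait above and locking the file range, so
        # the old ordered-extent retry loop is not needed.
        assert not inode.ordered_extents
        return "allocated"

inode = FakeInode()
inode.write(0, 4096)          # leaves one pending ordered extent
result = fallocate(inode)     # flushes once, then proceeds without retrying
```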
* [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (3 preceding siblings ...) 2022-03-15 12:18 ` [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana 2022-03-15 12:18 ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When doing hole punching we are flushing delalloc and waiting for ordered extents to complete before locking the inode (VFS lock and the btrfs specific i_mmap_lock). This is fine because even if a write happens after we call btrfs_wait_ordered_range() and before we lock the inode (call btrfs_inode_lock()), we will notice the write at btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered extent. We can however make this simpler by locking the inode first and then calling btrfs_wait_ordered_range(), which will allow us to remove the ordered extent lookup logic from btrfs_punch_hole_lock_range() in the next patch. This makes the behaviour the same as plain fallocate, reflinks, and other places. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a7fd05c1d52f..a78a90cfc082 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2976,11 +2976,12 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len) bool truncated_block = false; bool updated_inode = false; + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); + ret = btrfs_wait_ordered_range(inode, offset, len); if (ret) return ret; - btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); ino_size = round_up(inode->i_size, fs_info->sectorsize); ret = find_first_non_hole(BTRFS_I(inode), &offset, &len); if (ret < 0) -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (4 preceding siblings ...) 2022-03-15 12:18 ` [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For hole punching and zero range we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing hole punching and zero range operations. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 54 +++++++++++++++++-------------------------------- 1 file changed, 18 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a78a90cfc082..b9e43734d1ed 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2570,10 +2570,10 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len) return ret; } -static int btrfs_punch_hole_lock_range(struct inode *inode, - const u64 lockstart, - const u64 lockend, - struct extent_state **cached_state) +static void btrfs_punch_hole_lock_range(struct inode *inode, + const u64 lockstart, + const u64 lockend, + struct extent_state **cached_state) { /* * For subpage case, if the range is not at page boundary, we could @@ -2587,40 +2587,27 @@ static int btrfs_punch_hole_lock_range(struct inode *inode, const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1; while (1) { - struct btrfs_ordered_extent *ordered; - int ret; - truncate_pagecache_range(inode, lockstart, lockend); lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - lockend); - /* - * We need to make sure we have no ordered extents in this range - * and nobody raced in and read a page in this range, if we did - * we need to try again. + * We can't have ordered extents in the range, nor dirty/writeback + * pages, because we have locked the inode's VFS lock in exclusive + * mode, we have locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range and we have waited + * for any ordered extents in the range to complete. + * We can race with anyone reading pages from this range, so after + * locking the range check if we have pages in the range, and if + * we do, unlock the range and retry. 
*/ - if ((!ordered || - (ordered->file_offset + ordered->num_bytes <= lockstart || - ordered->file_offset > lockend)) && - !filemap_range_has_page(inode->i_mapping, - page_lockstart, page_lockend)) { - if (ordered) - btrfs_put_ordered_extent(ordered); + if (!filemap_range_has_page(inode->i_mapping, page_lockstart, + page_lockend)) break; - } - if (ordered) - btrfs_put_ordered_extent(ordered); + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ret = btrfs_wait_ordered_range(inode, lockstart, - lockend - lockstart + 1); - if (ret) - return ret; } - return 0; } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3073,10 +3060,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len) goto out_only_mutex; } - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out_only_mutex; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, &cached_state); path = btrfs_alloc_path(); if (!path) { @@ -3367,10 +3351,8 @@ static int btrfs_zero_range(struct inode *inode, if (ret < 0) goto out; space_reserved = true; - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, + &cached_state); ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, alloc_start, bytes_to_reserve); if (ret) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (5 preceding siblings ...) 2022-03-15 12:18 ` [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana @ 2022-03-15 12:18 ` fdmanana 2022-03-15 12:18 ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 6 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 12:18 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> We have four different scenarios where we don't expect to find ordered extents after locking a file range: 1) During plain fallocate; 2) During hole punching; 3) During zero range; 4) During reflinks (both cloning and deduplication). This is because in all these cases we follow the pattern: 1) Lock the inode's VFS lock in exclusive mode; 2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with mmap writes; 3) Flush delalloc in a file range and wait for all ordered extents to complete - both done through btrfs_wait_ordered_range(); 4) Lock the file range in the inode's io_tree. So add a helper that asserts that we don't have ordered extents for a given range. Make the four scenarios listed above use this helper after locking the respective file range. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 4 ++++ fs/btrfs/inode.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/reflink.c | 13 +++++++++++-- 4 files changed, 53 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4db17bd05a21..c58baad426f8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3327,6 +3327,7 @@ void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags); void btrfs_update_inode_bytes(struct btrfs_inode *inode, const u64 add_bytes, const u64 del_bytes); +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end); /* ioctl.c */ long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b9e43734d1ed..f3bb2aa0514a 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2608,6 +2608,8 @@ static void btrfs_punch_hole_lock_range(struct inode *inode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); } + + btrfs_assert_inode_range_clean(BTRFS_I(inode), lockstart, lockend); } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3479,6 +3481,8 @@ static long btrfs_fallocate(struct file *file, int mode, lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); + btrfs_assert_inode_range_clean(BTRFS_I(inode), alloc_start, locked_end); + /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); while (cur_offset < alloc_end) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2e7143ff5523..50c7e23877a9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -11306,6 +11306,43 @@ void btrfs_update_inode_bytes(struct btrfs_inode *inode, spin_unlock(&inode->lock); } +/* + * Verify that there are no ordered extents for a given file range. + * + * @inode: The target inode. + * @start: Start offset of the file range, should be sector size + * aligned. 
+ * @end: End offset (inclusive) of the file range, its value +1 + * should be sector size aligned. + * + * This should typically be used for cases where we locked an inode's VFS lock in + * exclusive mode, we have also locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range, we have waited for all ordered + * extents in the range to complete and finally we have locked the file range in + * the inode's io_tree. + */ +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end) +{ + struct btrfs_root *root = inode->root; + struct btrfs_ordered_extent *ordered; + + if (!IS_ENABLED(CONFIG_BTRFS_ASSERT)) + return; + + ordered = btrfs_lookup_first_ordered_range(inode, start, + end + 1 - start); + if (ordered) { + btrfs_err(root->fs_info, +"found unexpected ordered extent in file range [%llu, %llu] for inode %llu root %llu (ordered range [%llu, %llu])", + start, end, btrfs_ino(inode), root->root_key.objectid, + ordered->file_offset, + ordered->file_offset + ordered->num_bytes - 1); + btrfs_put_ordered_extent(ordered); + } + + ASSERT(ordered == NULL); +} + static const struct inode_operations btrfs_dir_inode_operations = { .getattr = btrfs_getattr, .lookup = btrfs_lookup, diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index bbd5da25c475..85d8f0d5d3e0 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -614,14 +614,23 @@ static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1, static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, struct inode *inode2, u64 loff2, u64 len) { + u64 range1_end = loff1 + len - 1; + u64 range2_end = loff2 + len - 1; + if (inode1 < inode2) { swap(inode1, inode2); swap(loff1, loff2); + swap(range1_end, range2_end); } else if (inode1 == inode2 && loff2 < loff1) { swap(loff1, loff2); + swap(range1_end, range2_end); } - lock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1); - lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 
1); + + lock_extent(&BTRFS_I(inode1)->io_tree, loff1, range1_end); + lock_extent(&BTRFS_I(inode2)->io_tree, loff2, range2_end); + + btrfs_assert_inode_range_clean(BTRFS_I(inode1), loff1, range1_end); + btrfs_assert_inode_range_clean(BTRFS_I(inode2), loff2, range2_end); } static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2) -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks 2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (6 preceding siblings ...) 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana ` (7 more replies) 7 siblings, 8 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> The first patch fixes a bug in fallocate (more details on its changelog), while the remaining just remove some code and logic that is no longer needed. The removals/clean ups are independent of the bug fix. V3: Added missing inode unlock in error path (patch 5/7). V2: Added one extra patch, 5/7, which was missing in v1. Fixed a typo in the changelog of the first patch as well. Filipe Manana (7): btrfs: only reserve the needed data space amount during fallocate btrfs: remove useless dio wait call when doing fallocate zero range btrfs: remove inode_dio_wait() calls when starting reflink operations btrfs: remove ordered extent check and wait during fallocate btrfs: lock the inode first before flushing range when punching hole btrfs: remove ordered extent check and wait during hole punching and zero range btrfs: add and use helper to assert an inode range is clean fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 174 ++++++++++++++++++--------------------------- fs/btrfs/inode.c | 37 ++++++++++ fs/btrfs/reflink.c | 23 +++--- 4 files changed, 118 insertions(+), 117 deletions(-) -- 2.33.0 ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana ` (6 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> During a plain fallocate, we always start by reserving an amount of data space that matches the length of the range passed to fallocate. When we already have extents allocated in that range, we may end up trying to reserve a lot more data space than we need, which can result in several undesired behaviours: 1) We fail with -ENOSPC. For example the passed range has a length of 1G, but there's only one hole with a size of 1M in that range; 2) We temporarily reserve excessive data space that could be used by other operations happening concurrently; 3) By reserving much more data space than we need, we can end up doing expensive things like triggering delalloc for other inodes, waiting for the ordered extents to complete, triggering transaction commits, allocating new block groups, etc. Example: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj mkfs.btrfs -f -b 1g $DEV mount $DEV $MNT # Create a file with a size of 600M and two holes, one at [200M, 201M[ # and another at [401M, 402M[ xfs_io -f -c "pwrite -S 0xab 0 200M" \ -c "pwrite -S 0xcd 201M 200M" \ -c "pwrite -S 0xef 402M 198M" \ $MNT/foobar # Now call fallocate against the whole file range, see if it fails # with -ENOSPC or not - it shouldn't since we only need to allocate # 2M of data space. xfs_io -c "falloc 0 600M" $MNT/foobar umount $MNT $ ./test.sh (...) 
wrote 209715200/209715200 bytes at offset 0 200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec) wrote 209715200/209715200 bytes at offset 210763776 200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec) wrote 207618048/207618048 bytes at offset 421527552 198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec) fallocate: No space left on device $ So fix this by not allocating an amount of data space that matches the length of the range passed to fallocate. Instead allocate an amount of data space that corresponds to the sum of the sizes of each hole found in the range. This reservation now happens after we have locked the file range, which is safe since we know at this point there's no delalloc in the range because we've taken the inode's VFS lock in exclusive mode, we have taken the inode's i_mmap_lock in exclusive mode, we have flushed delalloc and waited for all ordered extents in the range to complete. This type of failure actually seems to happen in practice with systemd, and we had at least one report about this in a very long thread which is referenced by the Link tag below. 
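To make the changed accounting concrete, here is a small userspace sketch (plain Python, not btrfs code; the function name and extent representation are invented for illustration) of what the patch computes: instead of reserving data space for the whole fallocate range, it sums only the lengths of the holes found while walking the range.

```python
def data_space_needed(extents, file_len):
    """Sum the lengths of the holes in [0, file_len), given a sorted
    list of (start, length) allocated extents.  This models what the
    patched btrfs_fallocate() reserves instead of the full range length."""
    needed = 0
    cur = 0
    for start, length in extents:
        if start > cur:           # gap before this extent is a hole
            needed += start - cur
        cur = max(cur, start + length)
    if cur < file_len:            # trailing hole at the end of the range
        needed += file_len - cur
    return needed

M = 1024 * 1024
# The layout from the test script above: a 600M file with two 1M holes,
# at [200M, 201M) and at [401M, 402M).
extents = [(0, 200 * M), (201 * M, 200 * M), (402 * M, 198 * M)]
print(data_space_needed(extents, 600 * M) // M)  # 2
```

On the 1G filesystem from the test script, reserving 2M instead of 600M is the difference between success and a spurious -ENOSPC.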
Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 69 ++++++++++++++++++++++++++----------------------- 1 file changed, 37 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 380054c94e4b..b7c0db1000cd 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3417,6 +3417,9 @@ static long btrfs_fallocate(struct file *file, int mode, u64 alloc_hint = 0; u64 locked_end; u64 actual_end = 0; + u64 data_space_needed = 0; + u64 data_space_reserved = 0; + u64 qgroup_reserved = 0; struct extent_map *em; int blocksize = btrfs_inode_sectorsize(BTRFS_I(inode)); int ret; @@ -3437,18 +3440,6 @@ static long btrfs_fallocate(struct file *file, int mode, if (mode & FALLOC_FL_PUNCH_HOLE) return btrfs_punch_hole(file, offset, len); - /* - * Only trigger disk allocation, don't trigger qgroup reserve - * - * For qgroup space, it will be checked later. 
- */ - if (!(mode & FALLOC_FL_ZERO_RANGE)) { - ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), - alloc_end - alloc_start); - if (ret < 0) - return ret; - } - btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) { @@ -3548,48 +3539,66 @@ static long btrfs_fallocate(struct file *file, int mode, if (em->block_start == EXTENT_MAP_HOLE || (cur_offset >= inode->i_size && !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) { + const u64 range_len = last_byte - cur_offset; + ret = add_falloc_range(&reserve_list, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { free_extent_map(em); break; } ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, cur_offset, - last_byte - cur_offset); + range_len); if (ret < 0) { - cur_offset = last_byte; free_extent_map(em); break; } - } else { - /* - * Do not need to reserve unwritten extent for this - * range, free reserved data space first, otherwise - * it'll result in false ENOSPC error. - */ - btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, cur_offset, - last_byte - cur_offset); + qgroup_reserved += range_len; + data_space_needed += range_len; } free_extent_map(em); cur_offset = last_byte; } + if (!ret && data_space_needed > 0) { + /* + * We are safe to reserve space here as we can't have delalloc + * in the range, see above. + */ + ret = btrfs_alloc_data_chunk_ondemand(BTRFS_I(inode), + data_space_needed); + if (!ret) + data_space_reserved = data_space_needed; + } + /* * If ret is still 0, means we're OK to fallocate. * Or just cleanup the list and exit. */ list_for_each_entry_safe(range, tmp, &reserve_list, list) { - if (!ret) + if (!ret) { ret = btrfs_prealloc_file_range(inode, mode, range->start, range->len, i_blocksize(inode), offset + len, &alloc_hint); - else + /* + * btrfs_prealloc_file_range() releases space even + * if it returns an error. 
+ */ + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (data_space_reserved > 0) { btrfs_free_reserved_data_space(BTRFS_I(inode), - data_reserved, range->start, - range->len); + data_reserved, range->start, + range->len); + data_space_reserved -= range->len; + qgroup_reserved -= range->len; + } else if (qgroup_reserved > 0) { + btrfs_qgroup_free_data(BTRFS_I(inode), data_reserved, + range->start, range->len); + qgroup_reserved -= range->len; + } list_del(&range->list); kfree(range); } @@ -3606,10 +3615,6 @@ static long btrfs_fallocate(struct file *file, int mode, &cached_state); out: btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP); - /* Let go of our reservation. */ - if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE)) - btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved, - cur_offset, alloc_end - cur_offset); extent_changeset_free(data_reserved); return ret; } -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 15:22 ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana ` (5 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When starting a fallocate zero range operation, before getting the first extent map for the range, we make a call to inode_dio_wait(). This logic was needed in the past because direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (by commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (done in commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")). The lockless direct IO writes could result in a race with the zero range operation, resulting in the latter getting a stale extent map for the range. So remove this no longer needed call to inode_dio_wait(), as fallocate takes the inode's VFS lock in exclusive mode and direct IO writes within i_size take that same lock in shared mode. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b7c0db1000cd..2f57f7d9d9cb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3237,8 +3237,6 @@ static int btrfs_zero_range(struct inode *inode, u64 bytes_to_reserve = 0; bool space_reserved = false; - inode_dio_wait(inode); - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, alloc_start, alloc_end - alloc_start); if (IS_ERR(em)) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 15:22 ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana 2022-03-15 15:22 ` [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana ` (4 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When starting a reflink operation we have these calls to inode_dio_wait() which used to be needed because direct IO writes that don't cross the i_size boundary did not take the inode's VFS lock, so we could race with them and end up with ordered extents in target range after calling btrfs_wait_ordered_range(). However that is not the case anymore, because the inode's VFS lock was changed from a mutex to a rw semaphore, by commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem"), and several years later we started to lock the inode's VFS lock in shared mode for direct IO writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")). So remove those inode_dio_wait() calls. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/reflink.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index 04a88bfe4fcf..bbd5da25c475 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -771,7 +771,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct inode *inode_in = file_inode(file_in); struct inode *inode_out = file_inode(file_out); u64 bs = BTRFS_I(inode_out)->root->fs_info->sb->s_blocksize; - bool same_inode = inode_out == inode_in; u64 wb_len; int ret; @@ -809,15 +808,6 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, else wb_len = ALIGN(*len, bs); - /* - * Since we don't lock ranges, wait for ongoing lockless dio writes (as - * any in progress could create its ordered extents after we wait for - * existing ordered extents below). - */ - inode_dio_wait(inode_in); - if (!same_inode) - inode_dio_wait(inode_out); - /* * Workaround to make sure NOCOW buffered write reach disk as NOCOW. * -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (2 preceding siblings ...) 2022-03-15 15:22 ` [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana ` (3 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For fallocate() we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing a plain fallocate. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 42 ++++++++---------------------------------- 1 file changed, 8 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 2f57f7d9d9cb..a7fd05c1d52f 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3474,8 +3474,12 @@ static long btrfs_fallocate(struct file *file, int mode, } /* - * wait for ordered IO before we have any locks. We'll loop again - * below with the locks held. + * We have locked the inode at the VFS level (in exclusive mode) and we + * have locked the i_mmap_lock lock (in exclusive mode). Now before + * locking the file range, flush all delalloc in the range and wait for + * all ordered extents in the range to complete. After this we can lock + * the file range and, due to the previous locking we did, we know there + * can't be more delalloc or ordered extents in the range. */ ret = btrfs_wait_ordered_range(inode, alloc_start, alloc_end - alloc_start); @@ -3489,38 +3493,8 @@ static long btrfs_fallocate(struct file *file, int mode, } locked_end = alloc_end - 1; - while (1) { - struct btrfs_ordered_extent *ordered; - - /* the extent lock is ordered inside the running - * transaction - */ - lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, - locked_end, &cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - locked_end); - - if (ordered && - ordered->file_offset + ordered->num_bytes > alloc_start && - ordered->file_offset < alloc_end) { - btrfs_put_ordered_extent(ordered); - unlock_extent_cached(&BTRFS_I(inode)->io_tree, - alloc_start, locked_end, - &cached_state); - /* - * we can't wait on the range with the transaction - * running or with the extent lock held - */ - ret = btrfs_wait_ordered_range(inode, alloc_start, - alloc_end - alloc_start); - if (ret) - goto out; - } else { - if (ordered) - btrfs_put_ordered_extent(ordered); - break; - } - } + lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, + 
&cached_state); /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (3 preceding siblings ...) 2022-03-15 15:22 ` [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana ` (2 subsequent siblings) 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> When doing hole punching we are flushing delalloc and waiting for ordered extents to complete before locking the inode (VFS lock and the btrfs specific i_mmap_lock). This is fine because even if a write happens after we call btrfs_wait_ordered_range() and before we lock the inode (call btrfs_inode_lock()), we will notice the write at btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered extent. We can however make this simpler by first locking the inode and then calling btrfs_wait_ordered_range(), which will allow us to remove the ordered extent lookup logic from btrfs_punch_hole_lock_range() in the next patch. It also makes the behaviour the same as plain fallocate, zero range and reflinks. 
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index a7fd05c1d52f..56c7961d165b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2976,11 +2976,12 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len) bool truncated_block = false; bool updated_inode = false; + btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); + ret = btrfs_wait_ordered_range(inode, offset, len); if (ret) - return ret; + goto out_only_mutex; - btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP); ino_size = round_up(inode->i_size, fs_info->sectorsize); ret = find_first_non_hole(BTRFS_I(inode), &offset, &len); if (ret < 0) -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (4 preceding siblings ...) 2022-03-15 15:22 ` [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-15 15:22 ` [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 2022-03-16 16:29 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks David Sterba 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> For hole punching and zero range we have this loop that checks if we have ordered extents after locking the file range, and if so unlock the range, wait for ordered extents, and retry until we don't find more ordered extents. This logic was needed in the past because: 1) Direct IO writes within the i_size boundary did not take the inode's VFS lock. This was because that lock used to be a mutex, then some years ago it was switched to a rw semaphore (commit 9902af79c01a8e ("parallel lookups: actual switch to rwsem")), and then btrfs was changed to take the VFS inode's lock in shared mode for writes that don't cross the i_size boundary (commit e9adabb9712ef9 ("btrfs: use shared lock for direct writes within EOF")); 2) We could race with memory mapped writes, because memory mapped writes don't acquire the inode's VFS lock. We don't have that race anymore, as we have a rw semaphore to synchronize memory mapped writes with fallocate (and reflinking too). That change happened with commit 8d9b4a162a37ce ("btrfs: exclude mmap from happening during all fallocate operations"). So stop looking for ordered extents after locking the file range when doing hole punching and zero range operations. 
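After this patch the only retry condition left in btrfs_punch_hole_lock_range() is a racing reader repopulating the page cache. Below is a toy userspace model (plain Python, not btrfs code; the function name, the set-based "page cache" and the loop bound are invented for illustration) of the simplified lock-and-retry loop: truncate the pages, take the range lock, and retry only if pages reappeared in between.

```python
def lock_range_model(page_cache, lockstart, lockend, max_tries=10):
    """Model of the patched btrfs_punch_hole_lock_range(): ordered extents
    can no longer show up in the range, so the loop only needs to handle
    pages that a concurrent reader raced in.  page_cache is a set of page
    indexes standing in for the inode's address space."""
    for _ in range(max_tries):
        # truncate_pagecache_range(): drop cached pages in the range
        page_cache.difference_update(range(lockstart, lockend + 1))
        # lock_extent_bits(): once "locked", readers can't add new pages,
        # but one may have raced in between the two steps above, so check.
        if not any(lockstart <= p <= lockend for p in page_cache):
            return True            # range is clean, keep the lock
        # unlock_extent_cached() and retry
    return False

cache = {1, 5, 9, 20}
assert lock_range_model(cache, 0, 10)   # succeeds
assert 20 in cache and 5 not in cache   # pages outside the range survive
```

The kernel loop has no retry bound; the model bounds it only so the sketch always terminates.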
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/file.c | 54 +++++++++++++++++-------------------------------- 1 file changed, 18 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 56c7961d165b..3019a5182766 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2570,10 +2570,10 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len) return ret; } -static int btrfs_punch_hole_lock_range(struct inode *inode, - const u64 lockstart, - const u64 lockend, - struct extent_state **cached_state) +static void btrfs_punch_hole_lock_range(struct inode *inode, + const u64 lockstart, + const u64 lockend, + struct extent_state **cached_state) { /* * For subpage case, if the range is not at page boundary, we could @@ -2587,40 +2587,27 @@ static int btrfs_punch_hole_lock_range(struct inode *inode, const u64 page_lockend = round_down(lockend + 1, PAGE_SIZE) - 1; while (1) { - struct btrfs_ordered_extent *ordered; - int ret; - truncate_pagecache_range(inode, lockstart, lockend); lock_extent_bits(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), - lockend); - /* - * We need to make sure we have no ordered extents in this range - * and nobody raced in and read a page in this range, if we did - * we need to try again. + * We can't have ordered extents in the range, nor dirty/writeback + * pages, because we have locked the inode's VFS lock in exclusive + * mode, we have locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range and we have waited + * for any ordered extents in the range to complete. + * We can race with anyone reading pages from this range, so after + * locking the range check if we have pages in the range, and if + * we do, unlock the range and retry. 
*/ - if ((!ordered || - (ordered->file_offset + ordered->num_bytes <= lockstart || - ordered->file_offset > lockend)) && - !filemap_range_has_page(inode->i_mapping, - page_lockstart, page_lockend)) { - if (ordered) - btrfs_put_ordered_extent(ordered); + if (!filemap_range_has_page(inode->i_mapping, page_lockstart, + page_lockend)) break; - } - if (ordered) - btrfs_put_ordered_extent(ordered); + unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); - ret = btrfs_wait_ordered_range(inode, lockstart, - lockend - lockstart + 1); - if (ret) - return ret; } - return 0; } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3073,10 +3060,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len) goto out_only_mutex; } - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out_only_mutex; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, &cached_state); path = btrfs_alloc_path(); if (!path) { @@ -3367,10 +3351,8 @@ static int btrfs_zero_range(struct inode *inode, if (ret < 0) goto out; space_reserved = true; - ret = btrfs_punch_hole_lock_range(inode, lockstart, lockend, - &cached_state); - if (ret) - goto out; + btrfs_punch_hole_lock_range(inode, lockstart, lockend, + &cached_state); ret = btrfs_qgroup_reserve_data(BTRFS_I(inode), &data_reserved, alloc_start, bytes_to_reserve); if (ret) { -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
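The simplified loop that this patch leaves behind can be modeled in userspace. This is an illustrative sketch only, not kernel code: `page_map`, `truncate_range`, `range_has_page`, `racing_reads_left` and `lock_range_model` are hypothetical stand-ins for the page cache, truncate_pagecache_range(), filemap_range_has_page() and the locking loop, used just to show why the retry terminates once readers stop racing in.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_PAGES 16

/* true means a page is cached at that index; stands in for the page cache */
static bool page_map[NR_PAGES];
/* how many more times a hypothetical reader races in and re-caches a page */
static int racing_reads_left;

static void truncate_range(int first, int last)
{
	for (int i = first; i <= last; i++)
		page_map[i] = false;
}

static bool range_has_page(int first, int last)
{
	for (int i = first; i <= last; i++)
		if (page_map[i])
			return true;
	return false;
}

/*
 * Model of the simplified btrfs_punch_hole_lock_range() loop: truncate
 * the page cache for the range, take the range lock, and retry only if
 * a reader raced in and brought a page back.  Returns the number of
 * iterations the loop needed.
 */
static int lock_range_model(int first, int last)
{
	int iterations = 0;

	while (1) {
		iterations++;
		truncate_range(first, last);
		/* lock_extent_bits() would be taken here */
		if (racing_reads_left > 0) {
			/* simulate a read racing in before the check */
			racing_reads_left--;
			page_map[first] = true;
		}
		if (!range_has_page(first, last))
			break;
		/* unlock_extent_cached() would be dropped here, then retry */
	}
	return iterations;
}
```

With no ordered extents or dirty pages possible, only racing readers can force a retry, which is why the ordered extent lookup and wait could be dropped.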
* [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (5 preceding siblings ...) 2022-03-15 15:22 ` [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana @ 2022-03-15 15:22 ` fdmanana 2022-03-16 16:29 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks David Sterba 7 siblings, 0 replies; 26+ messages in thread From: fdmanana @ 2022-03-15 15:22 UTC (permalink / raw) To: linux-btrfs From: Filipe Manana <fdmanana@suse.com> We have four different scenarios where we don't expect to find ordered extents after locking a file range: 1) During plain fallocate; 2) During hole punching; 3) During zero range; 4) During reflinks (both cloning and deduplication). This is because in all these cases we follow the pattern: 1) Lock the inode's VFS lock in exclusive mode; 2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with mmap writes; 3) Flush delalloc in a file range and wait for all ordered extents to complete - both done through btrfs_wait_ordered_range(); 4) Lock the file range in the inode's io_tree. So add a helper that asserts that we don't have ordered extents for a given range. Make the four scenarios listed above use this helper after locking the respective file range. 
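The check the new helper performs reduces to a standard interval-intersection test against the locked range. The struct and function below are a hypothetical userspace model, not btrfs APIs: an ordered extent covering [file_offset, file_offset + num_bytes - 1] intersects the query range [start, end] iff it starts at or before `end` and ends at or after `start`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an ordered extent's file range */
struct ordered_model {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * Model of what btrfs_assert_inode_range_clean() verifies: none of the
 * ordered extents in @o intersects the inclusive range [@start, @end].
 */
static bool range_is_clean(const struct ordered_model *o, size_t n,
			   uint64_t start, uint64_t end)
{
	for (size_t i = 0; i < n; i++) {
		uint64_t o_end = o[i].file_offset + o[i].num_bytes - 1;

		if (o[i].file_offset <= end && o_end >= start)
			return false;
	}
	return true;
}
```

In the kernel the lookup walks the inode's ordered extent tree instead of an array, but the overlap predicate is the same idea.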
Signed-off-by: Filipe Manana <fdmanana@suse.com> --- fs/btrfs/ctree.h | 1 + fs/btrfs/file.c | 4 ++++ fs/btrfs/inode.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/reflink.c | 13 +++++++++++-- 4 files changed, 53 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 4db17bd05a21..c58baad426f8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3327,6 +3327,7 @@ void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags); void btrfs_update_inode_bytes(struct btrfs_inode *inode, const u64 add_bytes, const u64 del_bytes); +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end); /* ioctl.c */ long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 3019a5182766..0197ede06750 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2608,6 +2608,8 @@ static void btrfs_punch_hole_lock_range(struct inode *inode, unlock_extent_cached(&BTRFS_I(inode)->io_tree, lockstart, lockend, cached_state); } + + btrfs_assert_inode_range_clean(BTRFS_I(inode), lockstart, lockend); } static int btrfs_insert_replace_extent(struct btrfs_trans_handle *trans, @@ -3479,6 +3481,8 @@ static long btrfs_fallocate(struct file *file, int mode, lock_extent_bits(&BTRFS_I(inode)->io_tree, alloc_start, locked_end, &cached_state); + btrfs_assert_inode_range_clean(BTRFS_I(inode), alloc_start, locked_end); + /* First, check if we exceed the qgroup limit */ INIT_LIST_HEAD(&reserve_list); while (cur_offset < alloc_end) { diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2e7143ff5523..50c7e23877a9 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -11306,6 +11306,43 @@ void btrfs_update_inode_bytes(struct btrfs_inode *inode, spin_unlock(&inode->lock); } +/* + * Verify that there are no ordered extents for a given file range. + * + * @inode: The target inode. + * @start: Start offset of the file range, should be sector size + * aligned. 
+ * @end: End offset (inclusive) of the file range, its value +1 + * should be sector size aligned. + * + * This should typically be used for cases where we locked an inode's VFS lock in + * exclusive mode, we have also locked the inode's i_mmap_lock in exclusive mode, + * we have flushed all delalloc in the range, we have waited for all ordered + * extents in the range to complete and finally we have locked the file range in + * the inode's io_tree. + */ +void btrfs_assert_inode_range_clean(struct btrfs_inode *inode, u64 start, u64 end) +{ + struct btrfs_root *root = inode->root; + struct btrfs_ordered_extent *ordered; + + if (!IS_ENABLED(CONFIG_BTRFS_ASSERT)) + return; + + ordered = btrfs_lookup_first_ordered_range(inode, start, + end + 1 - start); + if (ordered) { + btrfs_err(root->fs_info, +"found unexpected ordered extent in file range [%llu, %llu] for inode %llu root %llu (ordered range [%llu, %llu])", + start, end, btrfs_ino(inode), root->root_key.objectid, + ordered->file_offset, + ordered->file_offset + ordered->num_bytes - 1); + btrfs_put_ordered_extent(ordered); + } + + ASSERT(ordered == NULL); +} + static const struct inode_operations btrfs_dir_inode_operations = { .getattr = btrfs_getattr, .lookup = btrfs_lookup, diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c index bbd5da25c475..85d8f0d5d3e0 100644 --- a/fs/btrfs/reflink.c +++ b/fs/btrfs/reflink.c @@ -614,14 +614,23 @@ static void btrfs_double_extent_unlock(struct inode *inode1, u64 loff1, static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1, struct inode *inode2, u64 loff2, u64 len) { + u64 range1_end = loff1 + len - 1; + u64 range2_end = loff2 + len - 1; + if (inode1 < inode2) { swap(inode1, inode2); swap(loff1, loff2); + swap(range1_end, range2_end); } else if (inode1 == inode2 && loff2 < loff1) { swap(loff1, loff2); + swap(range1_end, range2_end); } - lock_extent(&BTRFS_I(inode1)->io_tree, loff1, loff1 + len - 1); - lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 
1); + + lock_extent(&BTRFS_I(inode1)->io_tree, loff1, range1_end); + lock_extent(&BTRFS_I(inode2)->io_tree, loff2, range2_end); + + btrfs_assert_inode_range_clean(BTRFS_I(inode1), loff1, range1_end); + btrfs_assert_inode_range_clean(BTRFS_I(inode2), loff2, range2_end); } static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2) -- 2.33.0 ^ permalink raw reply related [flat|nested] 26+ messages in thread
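The swap logic in btrfs_double_extent_lock() above is the classic deadlock-avoidance rule of taking two locks in one globally consistent order. Below is an illustrative userspace sketch of just that ordering step; `struct inode_model` and `order_double_lock` are made-up names, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

struct inode_model { int id; };

/*
 * Mirror of the ordering rule in btrfs_double_extent_lock(): when the
 * two inodes differ, the one with the higher address is locked first;
 * for a same-inode reflink, the lower file offset is locked first.
 * Because every caller uses the same order, two concurrent operations
 * on the same pair of ranges can never deadlock ABBA-style.
 */
static void order_double_lock(struct inode_model **inode1, uint64_t *loff1,
			      struct inode_model **inode2, uint64_t *loff2)
{
	if (*inode1 < *inode2) {
		struct inode_model *tmp_inode = *inode1;
		uint64_t tmp_off = *loff1;

		*inode1 = *inode2;
		*inode2 = tmp_inode;
		*loff1 = *loff2;
		*loff2 = tmp_off;
	} else if (*inode1 == *inode2 && *loff2 < *loff1) {
		uint64_t tmp_off = *loff1;

		*loff1 = *loff2;
		*loff2 = tmp_off;
	}
	/* caller now locks (*inode1, *loff1) first, then (*inode2, *loff2) */
}
```

The patch keeps this ordering intact: the range-end values are swapped alongside the offsets so that each assert runs against the range that was actually locked.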
* Re: [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana ` (6 preceding siblings ...) 2022-03-15 15:22 ` [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana @ 2022-03-16 16:29 ` David Sterba 7 siblings, 0 replies; 26+ messages in thread From: David Sterba @ 2022-03-16 16:29 UTC (permalink / raw) To: fdmanana; +Cc: linux-btrfs On Tue, Mar 15, 2022 at 03:22:34PM +0000, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > The first patch fixes a bug in fallocate (more details on its changelog), > while the remaining just remove some code and logic that is no longer > needed. The removals/clean ups are independent of the bug fix. > > V3: Added missing inode unlock in error path (patch 5/7). > V2: Added one extra patch, 5/7, which was missing in v1. Fixed a typo > in the changelog of the first patch as well. > > Filipe Manana (7): > btrfs: only reserve the needed data space amount during fallocate > btrfs: remove useless dio wait call when doing fallocate zero range > btrfs: remove inode_dio_wait() calls when starting reflink operations > btrfs: remove ordered extent check and wait during fallocate > btrfs: lock the inode first before flushing range when punching hole > btrfs: remove ordered extent check and wait during hole punching and > zero range > btrfs: add and use helper to assert an inode range is clean Added to misc-next, thanks. ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2022-03-16 16:33 UTC | newest] Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-03-15 10:47 [PATCH 0/6] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 10:47 ` [PATCH 1/6] btrfs: only reserve the needed data space amount during fallocate fdmanana 2022-03-15 10:47 ` [PATCH 2/6] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana 2022-03-15 10:47 ` [PATCH 3/6] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana 2022-03-15 10:47 ` [PATCH 4/6] btrfs: remove ordered extent check and wait during fallocate fdmanana 2022-03-15 10:47 ` [PATCH 5/6] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana 2022-03-15 10:47 ` [PATCH 6/6] btrfs: add and use helper to assert an inode range is clean fdmanana 2022-03-15 12:18 ` [PATCH v2 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks fdmanana 2022-03-15 12:18 ` [PATCH v2 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana 2022-03-15 13:16 ` Wang Yugui 2022-03-15 13:45 ` Filipe Manana 2022-03-15 12:18 ` [PATCH v2 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana 2022-03-15 12:18 ` [PATCH v2 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana 2022-03-15 12:18 ` [PATCH v2 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana 2022-03-15 12:18 ` [PATCH v2 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana 2022-03-15 12:18 ` [PATCH v2 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana 2022-03-15 12:18 ` [PATCH v2 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 2022-03-15 15:22 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for 
fallocate and reflinks fdmanana 2022-03-15 15:22 ` [PATCH v3 1/7] btrfs: only reserve the needed data space amount during fallocate fdmanana 2022-03-15 15:22 ` [PATCH v3 2/7] btrfs: remove useless dio wait call when doing fallocate zero range fdmanana 2022-03-15 15:22 ` [PATCH v3 3/7] btrfs: remove inode_dio_wait() calls when starting reflink operations fdmanana 2022-03-15 15:22 ` [PATCH v3 4/7] btrfs: remove ordered extent check and wait during fallocate fdmanana 2022-03-15 15:22 ` [PATCH v3 5/7] btrfs: lock the inode first before flushing range when punching hole fdmanana 2022-03-15 15:22 ` [PATCH v3 6/7] btrfs: remove ordered extent check and wait during hole punching and zero range fdmanana 2022-03-15 15:22 ` [PATCH v3 7/7] btrfs: add and use helper to assert an inode range is clean fdmanana 2022-03-16 16:29 ` [PATCH v3 0/7] btrfs: one fallocate fix and removing outdated code for fallocate and reflinks David Sterba