linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0)
@ 2022-06-16 15:45 Naohiro Aota
  2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Naohiro Aota @ 2022-06-16 15:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

This series is a revisit of patch "btrfs: ensure pages are unlocked on
cow_file_range() failure" [1].

[1] https://lore.kernel.org/linux-btrfs/20211213034338.949507-1-naohiro.aota@wdc.com/

We have a hang report like below which is caused by a locked page.

[  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[  726.329839]       Not tainted 5.16.0-rc1+ #1
[  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
[  726.331608] Call Trace:
[  726.331611]  <TASK>
[  726.331614]  __schedule+0x2e5/0x9d0
[  726.331622]  schedule+0x58/0xd0
[  726.331626]  io_schedule+0x3f/0x70
[  726.331629]  __folio_lock+0x125/0x200
[  726.331634]  ? find_get_entries+0x1bc/0x240
[  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.331642]  truncate_inode_pages_range+0x5b2/0x770
[  726.331649]  truncate_inode_pages_final+0x44/0x50
[  726.331653]  btrfs_evict_inode+0x67/0x480
[  726.331658]  evict+0xd0/0x180
[  726.331661]  iput+0x13f/0x200
[  726.331664]  do_unlinkat+0x1c0/0x2b0
[  726.331668]  __x64_sys_unlink+0x23/0x30
[  726.331670]  do_syscall_64+0x3b/0xc0
[  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae

When cow_file_range(unlock = 0), we never unlock a page after an ordered
extent is allocated for the page. When the allocation loop fails after some
ordered extents are allocated, we have three things to do: (1) unlock the
pages in the successfully allocated ordered extents, (2) clean-up the
allocated ordered extents, and (3) propagate the error to the userland.

However, current code fails to do (1) in the all cow_file_range(unlock = 0)
case and cause the above hang. Also, it fails to do (2) and (3) in
submit_uncompressed_range() case.

This series addresses these three issues on error handling of
cow_file_range(unlock = 0) case.

To test the series, I applied the following two patches to stress
submit_uncompressed_range() and to inject an error.

The following first diff forces the compress_type to be
BTRFS_COMPRESS_NONE, so that all "compress" writes go through
submit_uncompressed_range path.

    diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
    index 7a54f964ff37..eb12c47d02b8 100644
    --- a/fs/btrfs/inode.c
    +++ b/fs/btrfs/inode.c
    @@ -706,6 +706,7 @@ static noinline int compress_file_range(struct async_chunk *async_chunk)
     			compress_type = BTRFS_I(inode)->defrag_compress;
     		else if (BTRFS_I(inode)->prop_compress)
     			compress_type = BTRFS_I(inode)->prop_compress;
    +		compress_type = BTRFS_COMPRESS_NONE;
     
     		/*
     		 * we need to call clear_page_dirty_for_io on each

The following second diff limits the allocation size at most 256KB to run
the loop many times. Also, it fails the allocation at the specific offset
(fail_offset).

    diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
    index eb12c47d02b8..1247690e7021 100644
    --- a/fs/btrfs/inode.c
    +++ b/fs/btrfs/inode.c
    @@ -1155,6 +1155,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
     	bool extent_reserved = false;
     	int ret = 0;
     
    +	u64 fail_offset = SZ_1M + (u64)SZ_256K * 0;
    +
     	if (btrfs_is_free_space_inode(inode)) {
     		ret = -EINVAL;
     		goto out_unlock;
    @@ -1239,9 +1241,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
     
     	while (num_bytes > 0) {
     		cur_alloc_size = num_bytes;
    -		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
    -					   min_alloc_size, 0, alloc_hint,
    -					   &ins, 1, 1);
    +		cur_alloc_size = min_t(u64, SZ_256K, num_bytes);
    +		if (start != fail_offset)
    +			ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
    +						   min_alloc_size, 0, alloc_hint,
    +						   &ins, 1, 1);
    +		else
    +			ret = -ENOSPC;
     		if (ret < 0)
     			goto out_unlock;
     		cur_alloc_size = ins.offset;

I ran the following script with these patches + the series applied changing
the fail_offset from "1MB + 256KB * 0" to "1MB + 256KB * 15" step by 256KB,
and confirmed the above three error handlings are properly done.

    run() {
    	local mkfs_opts=$1
    	local mount_opts=$2
    
    	for x in $(seq 100); do
    		echo $x
    		mkfs.btrfs -f -d single -m single ${mkfs_opts} /dev/nullb0
    		mount ${mount_opts} /dev/nullb0 /mnt/test
    		dd if=/dev/zero of=/mnt/test/file bs=1M count=4 seek=1 oflag=sync 2>&1 | tee /tmp/err
    		# check error propagation
    		grep -q 'No space left' /tmp/err || exit 1
    		sync
    		umount /mnt/test
    		dmesg | grep -q WARN && exit 1
    	done
    }
    
    run ""         ""
    run ""         "-o compress-force"
    run "-O zoned" ""
    run "-O zoned" "-o compress-force"

Patch 1 addresses the (1) by unlocking the pages in the error case. Also,
it adds a figure to clarify what to do for each range in the error case.

Patches 2 and 3 do (2) and (3) by fixing the error case of
submit_uncompressed_range().

Patch 4 is a refactoring patch to replace unnecessary "goto out"s with
direct return.

Naohiro Aota (4):
  btrfs: ensure pages are unlocked on cow_file_range() failure
  btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page
  btrfs: fix error handling of fallbacked uncompress write
  btrfs: replace unnecessary goto with direct return

 fs/btrfs/inode.c | 132 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 103 insertions(+), 29 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure
  2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
@ 2022-06-16 15:45 ` Naohiro Aota
  2022-06-17 10:11   ` Filipe Manana
  2022-06-16 15:45 ` [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page Naohiro Aota
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2022-06-16 15:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

There is a hung_task report on zoned btrfs like below.

https://github.com/naota/linux/issues/59

[  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[  726.329839]       Not tainted 5.16.0-rc1+ #1
[  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
[  726.331608] Call Trace:
[  726.331611]  <TASK>
[  726.331614]  __schedule+0x2e5/0x9d0
[  726.331622]  schedule+0x58/0xd0
[  726.331626]  io_schedule+0x3f/0x70
[  726.331629]  __folio_lock+0x125/0x200
[  726.331634]  ? find_get_entries+0x1bc/0x240
[  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.331642]  truncate_inode_pages_range+0x5b2/0x770
[  726.331649]  truncate_inode_pages_final+0x44/0x50
[  726.331653]  btrfs_evict_inode+0x67/0x480
[  726.331658]  evict+0xd0/0x180
[  726.331661]  iput+0x13f/0x200
[  726.331664]  do_unlinkat+0x1c0/0x2b0
[  726.331668]  __x64_sys_unlink+0x23/0x30
[  726.331670]  do_syscall_64+0x3b/0xc0
[  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  726.331677] RIP: 0033:0x7fb9490a171b
[  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[  726.331693]  </TASK>

While we debug the issue, we found running fstests generic/551 on 5GB
non-zoned null_blk device in the emulated zoned mode also had a
similar hung issue.

Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.

The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the give region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.

Considering the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked on error case.

In summary, cow_file_range now behaves like this:

- page_started == 1 (return value)
  - All the pages are unlocked. IO is started.
- unlock == 1
  - All the pages except @locked_page are unlocked in any case
- unlock == 0
  - On success, all the pages are locked for writing out them
  - On failure, all the pages except @locked_page are unlocked

Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1247690e7021..0c3d9998470f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1134,6 +1134,28 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
  * *page_started is set to one if we unlock locked_page and do everything
  * required to start IO on it.  It may be clean and already done with
  * IO when we return.
+ *
+ * When unlock == 1, we unlock the pages in successfully allocated regions.
+ * When unlock == 0, we leave them locked for writing them out.
+ *
+ * However, we unlock all the pages except @locked_page in case of failure.
+ *
+ * In summary, page locking state will be as follow:
+ *
+ * - page_started == 1 (return value)
+ *     - All the pages are unlocked. IO is started.
+ *     - Note that this can happen only on success
+ * - unlock == 1
+ *     - All the pages except @locked_page are unlocked in any case
+ * - unlock == 0
+ *     - On success, all the pages are locked for writing out them
+ *     - On failure, all the pages except @locked_page are unlocked
+ *
+ * When a failure happens in the second or later iteration of the
+ * while-loop, the ordered extents created in previous iterations are kept
+ * intact. So, the caller must clean them up by calling
+ * btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
+ * example.
  */
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
@@ -1143,6 +1165,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	u64 alloc_hint = 0;
+	u64 orig_start = start;
 	u64 num_bytes;
 	unsigned long ram_size;
 	u64 cur_alloc_size = 0;
@@ -1336,18 +1359,44 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_unlock:
+	/*
+	 * Now, we have three regions to clean up, as shown below.
+	 *
+	 * |-------(1)----|---(2)---|-------------(3)----------|
+	 * `- orig_start  `- start  `- start + cur_alloc_size  `- end
+	 *
+	 * We process each region below.
+	 */
+
 	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
 		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
 	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
+
 	/*
-	 * If we reserved an extent for our delalloc range (or a subrange) and
-	 * failed to create the respective ordered extent, then it means that
-	 * when we reserved the extent we decremented the extent's size from
-	 * the data space_info's bytes_may_use counter and incremented the
-	 * space_info's bytes_reserved counter by the same amount. We must make
-	 * sure extent_clear_unlock_delalloc() does not try to decrement again
-	 * the data space_info's bytes_may_use counter, therefore we do not pass
-	 * it the flag EXTENT_CLEAR_DATA_RESV.
+	 * For the range (1). We have already instantiated the ordered extents
+	 * for this region. They are cleaned up by
+	 * btrfs_cleanup_ordered_extents() in e.g,
+	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
+	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
+	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
+	 * function.
+	 *
+	 * However, in case of unlock == 0, we still need to unlock the pages
+	 * (except @locked_page) to ensure all the pages are unlocked.
+	 */
+	if (!unlock && orig_start < start)
+		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
+					     locked_page, 0, page_ops);
+
+	/*
+	 * For the range (2). If we reserved an extent for our delalloc range
+	 * (or a subrange) and failed to create the respective ordered extent,
+	 * then it means that when we reserved the extent we decremented the
+	 * extent's size from the data space_info's bytes_may_use counter and
+	 * incremented the space_info's bytes_reserved counter by the same
+	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
+	 * to decrement again the data space_info's bytes_may_use counter,
+	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
 	 */
 	if (extent_reserved) {
 		extent_clear_unlock_delalloc(inode, start,
@@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		if (start >= end)
 			goto out;
 	}
+
+	/*
+	 * For the range (3). We never touched the region. In addition to the
+	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
+	 * space_info's bytes_may_use counter, reserved in e.g,
+	 * btrfs_check_data_free_space().
+	 */
 	extent_clear_unlock_delalloc(inode, start, end, locked_page,
 				     clear_bits | EXTENT_CLEAR_DATA_RESV,
 				     page_ops);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page
  2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
  2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
@ 2022-06-16 15:45 ` Naohiro Aota
  2022-06-17 10:15   ` Filipe Manana
  2022-06-16 15:45 ` [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write Naohiro Aota
  2022-06-16 15:45 ` [PATCH 4/4] btrfs: replace unnecessary goto with direct return Naohiro Aota
  3 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2022-06-16 15:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
is not usable for submit_uncompressed_range() which can habe NULL
locked_page.

This commit supports locked_page == NULL case. Also, it rewrites redundant
"page_offset(locked_page)".

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0c3d9998470f..4e1100f84a88 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -195,11 +195,14 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 {
 	unsigned long index = offset >> PAGE_SHIFT;
 	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
-	u64 page_start = page_offset(locked_page);
-	u64 page_end = page_start + PAGE_SIZE - 1;
-
+	u64 page_start, page_end;
 	struct page *page;
 
+	if (locked_page) {
+		page_start = page_offset(locked_page);
+		page_end = page_start + PAGE_SIZE - 1;
+	}
+
 	while (index <= end_index) {
 		/*
 		 * For locked page, we will call end_extent_writepage() on it
@@ -212,7 +215,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		 * btrfs_mark_ordered_io_finished() would skip the accounting
 		 * for the page range, and the ordered extent will never finish.
 		 */
-		if (index == (page_offset(locked_page) >> PAGE_SHIFT)) {
+		if (locked_page && index == (page_start >> PAGE_SHIFT)) {
 			index++;
 			continue;
 		}
@@ -231,17 +234,20 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
 		put_page(page);
 	}
 
-	/* The locked page covers the full range, nothing needs to be done */
-	if (bytes + offset <= page_offset(locked_page) + PAGE_SIZE)
-		return;
-	/*
-	 * In case this page belongs to the delalloc range being instantiated
-	 * then skip it, since the first page of a range is going to be
-	 * properly cleaned up by the caller of run_delalloc_range
-	 */
-	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
-		bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
-		offset = page_offset(locked_page) + PAGE_SIZE;
+	if (locked_page) {
+		/* The locked page covers the full range, nothing needs to be done */
+		if (bytes + offset <= page_start + PAGE_SIZE)
+			return;
+		/*
+		 * In case this page belongs to the delalloc range being
+		 * instantiated then skip it, since the first page of a range is
+		 * going to be properly cleaned up by the caller of
+		 * run_delalloc_range
+		 */
+		if (page_start >= offset && page_end <= (offset + bytes - 1)) {
+			bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
+			offset = page_offset(locked_page) + PAGE_SIZE;
+		}
 	}
 
 	return __endio_write_update_ordered(inode, offset, bytes, false);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write
  2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
  2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
  2022-06-16 15:45 ` [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page Naohiro Aota
@ 2022-06-16 15:45 ` Naohiro Aota
  2022-06-17 10:21   ` Filipe Manana
  2022-06-16 15:45 ` [PATCH 4/4] btrfs: replace unnecessary goto with direct return Naohiro Aota
  3 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2022-06-16 15:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but remains the ordered extents intact. Thus, we need to
call btrfs_cleanup_ordered_extents() to finish the created ordered extents.

Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never process the region on the
locked_page.

Furthermore, we need to set the mapping as error if locked_page is
unavailable before unlocking the pages, so that the errno is properly
propagated to the userland.

CC: stable@vger.kernel.org
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4e1100f84a88..cae15924fc99 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -934,8 +934,18 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
 		goto out;
 	}
 	if (ret < 0) {
-		if (locked_page)
+		btrfs_cleanup_ordered_extents(inode, locked_page, start, end - start + 1);
+		if (locked_page) {
+			u64 page_start = page_offset(locked_page);
+			u64 page_end = page_start + PAGE_SIZE - 1;
+
+			btrfs_page_set_error(inode->root->fs_info, locked_page,
+					     page_start, PAGE_SIZE);
+			set_page_writeback(locked_page);
+			end_page_writeback(locked_page);
+			end_extent_writepage(locked_page, ret, page_start, page_end);
 			unlock_page(locked_page);
+		}
 		goto out;
 	}
 
@@ -1390,9 +1400,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	 * However, in case of unlock == 0, we still need to unlock the pages
 	 * (except @locked_page) to ensure all the pages are unlocked.
 	 */
-	if (!unlock && orig_start < start)
+	if (!unlock && orig_start < start) {
+		if (!locked_page)
+			mapping_set_error(inode->vfs_inode.i_mapping, ret);
 		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
 					     locked_page, 0, page_ops);
+	}
 
 	/*
 	 * For the range (2). If we reserved an extent for our delalloc range
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/4] btrfs: replace unnecessary goto with direct return
  2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
                   ` (2 preceding siblings ...)
  2022-06-16 15:45 ` [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write Naohiro Aota
@ 2022-06-16 15:45 ` Naohiro Aota
  2022-06-17 10:13   ` Filipe Manana
  3 siblings, 1 reply; 13+ messages in thread
From: Naohiro Aota @ 2022-06-16 15:45 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

The "goto out;"s in cow_file_range() just results in a simple "return
ret;" which are not really useful. Replace them with proper direct
"return"s. It also makes the success path vs failure path stands out.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/inode.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cae15924fc99..055c573e2eb3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1253,7 +1253,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			 * inline extent or a compressed extent.
 			 */
 			unlock_page(locked_page);
-			goto out;
+			return 0;
 		} else if (ret < 0) {
 			goto out_unlock;
 		}
@@ -1366,8 +1366,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		if (ret)
 			goto out_unlock;
 	}
-out:
-	return ret;
+	return 0;
 
 out_drop_extent_cache:
 	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
@@ -1425,7 +1424,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 					     page_ops);
 		start += cur_alloc_size;
 		if (start >= end)
-			goto out;
+			return ret;
 	}
 
 	/*
@@ -1437,7 +1436,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	extent_clear_unlock_delalloc(inode, start, end, locked_page,
 				     clear_bits | EXTENT_CLEAR_DATA_RESV,
 				     page_ops);
-	goto out;
+	return ret;
 }
 
 /*
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure
  2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
@ 2022-06-17 10:11   ` Filipe Manana
  2022-06-20  2:22     ` Naohiro Aota
  0 siblings, 1 reply; 13+ messages in thread
From: Filipe Manana @ 2022-06-17 10:11 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 12:45:39AM +0900, Naohiro Aota wrote:
> There is a hung_task report on zoned btrfs like below.
> 
> https://github.com/naota/linux/issues/59
> 
> [  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
> [  726.329839]       Not tainted 5.16.0-rc1+ #1
> [  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
> [  726.331608] Call Trace:
> [  726.331611]  <TASK>
> [  726.331614]  __schedule+0x2e5/0x9d0
> [  726.331622]  schedule+0x58/0xd0
> [  726.331626]  io_schedule+0x3f/0x70
> [  726.331629]  __folio_lock+0x125/0x200
> [  726.331634]  ? find_get_entries+0x1bc/0x240
> [  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
> [  726.331642]  truncate_inode_pages_range+0x5b2/0x770
> [  726.331649]  truncate_inode_pages_final+0x44/0x50
> [  726.331653]  btrfs_evict_inode+0x67/0x480
> [  726.331658]  evict+0xd0/0x180
> [  726.331661]  iput+0x13f/0x200
> [  726.331664]  do_unlinkat+0x1c0/0x2b0
> [  726.331668]  __x64_sys_unlink+0x23/0x30
> [  726.331670]  do_syscall_64+0x3b/0xc0
> [  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  726.331677] RIP: 0033:0x7fb9490a171b
> [  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
> [  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
> [  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
> [  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
> [  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
> [  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
> [  726.331693]  </TASK>
> 
> While we debug the issue, we found running fstests generic/551 on 5GB
> non-zoned null_blk device in the emulated zoned mode also had a
> similar hung issue.
> 
> Also, we can reproduce the same symptom with an error injected
> cow_file_range() setup.
> 
> The hang occurs when cow_file_range() fails in the middle of
> allocation. cow_file_range() called from do_allocation_zoned() can
> split the give region ([start, end]) for allocation depending on
> current block group usages. When btrfs can allocate bytes for one part
> of the split regions but fails for the other region (e.g. because of
> -ENOSPC), we return the error leaving the pages in the succeeded regions
> locked. Technically, this occurs only when @unlock == 0. Otherwise, we
> unlock the pages in an allocated region after creating an ordered
> extent.
> 
> Considering the callers of cow_file_range(unlock=0) won't write out
> the pages, we can unlock the pages on error exit from
> cow_file_range(). So, we can ensure all the pages except @locked_page
> are unlocked on error case.
> 
> In summary, cow_file_range now behaves like this:
> 
> - page_started == 1 (return value)
>   - All the pages are unlocked. IO is started.
> - unlock == 1
>   - All the pages except @locked_page are unlocked in any case
> - unlock == 0
>   - On success, all the pages are locked for writing out them
>   - On failure, all the pages except @locked_page are unlocked
> 
> Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
> CC: stable@vger.kernel.org # 5.12+
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 64 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 1247690e7021..0c3d9998470f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1134,6 +1134,28 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
>   * *page_started is set to one if we unlock locked_page and do everything
>   * required to start IO on it.  It may be clean and already done with
>   * IO when we return.
> + *
> + * When unlock == 1, we unlock the pages in successfully allocated regions.
> + * When unlock == 0, we leave them locked for writing them out.
> + *
> + * However, we unlock all the pages except @locked_page in case of failure.
> + *
> + * In summary, page locking state will be as follow:
> + *
> + * - page_started == 1 (return value)
> + *     - All the pages are unlocked. IO is started.
> + *     - Note that this can happen only on success
> + * - unlock == 1
> + *     - All the pages except @locked_page are unlocked in any case
> + * - unlock == 0
> + *     - On success, all the pages are locked for writing out them
> + *     - On failure, all the pages except @locked_page are unlocked
> + *
> + * When a failure happens in the second or later iteration of the
> + * while-loop, the ordered extents created in previous iterations are kept
> + * intact. So, the caller must clean them up by calling
> + * btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
> + * example.
>   */
>  static noinline int cow_file_range(struct btrfs_inode *inode,
>  				   struct page *locked_page,
> @@ -1143,6 +1165,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	struct btrfs_root *root = inode->root;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>  	u64 alloc_hint = 0;
> +	u64 orig_start = start;
>  	u64 num_bytes;
>  	unsigned long ram_size;
>  	u64 cur_alloc_size = 0;
> @@ -1336,18 +1359,44 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>  	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
>  out_unlock:
> +	/*
> +	 * Now, we have three regions to clean up, as shown below.
> +	 *
> +	 * |-------(1)----|---(2)---|-------------(3)----------|
> +	 * `- orig_start  `- start  `- start + cur_alloc_size  `- end
> +	 *
> +	 * We process each region below.
> +	 */
> +
>  	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
>  		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
>  	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> +
>  	/*
> -	 * If we reserved an extent for our delalloc range (or a subrange) and
> -	 * failed to create the respective ordered extent, then it means that
> -	 * when we reserved the extent we decremented the extent's size from
> -	 * the data space_info's bytes_may_use counter and incremented the
> -	 * space_info's bytes_reserved counter by the same amount. We must make
> -	 * sure extent_clear_unlock_delalloc() does not try to decrement again
> -	 * the data space_info's bytes_may_use counter, therefore we do not pass
> -	 * it the flag EXTENT_CLEAR_DATA_RESV.
> +	 * For the range (1). We have already instantiated the ordered extents
> +	 * for this region. They are cleaned up by
> +	 * btrfs_cleanup_ordered_extents() in e.g,
> +	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
> +	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
> +	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
> +	 * function.
> +	 *
> +	 * However, in case of unlock == 0, we still need to unlock the pages
> +	 * (except @locked_page) to ensure all the pages are unlocked.
> +	 */
> +	if (!unlock && orig_start < start)
> +		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> +					     locked_page, 0, page_ops);
> +
> +	/*
> +	 * For the range (2). If we reserved an extent for our delalloc range
> +	 * (or a subrange) and failed to create the respective ordered extent,
> +	 * then it means that when we reserved the extent we decremented the
> +	 * extent's size from the data space_info's bytes_may_use counter and
> +	 * incremented the space_info's bytes_reserved counter by the same
> +	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
> +	 * to decrement again the data space_info's bytes_may_use counter,
> +	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
>  	 */
>  	if (extent_reserved) {
>  		extent_clear_unlock_delalloc(inode, start,
> @@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		if (start >= end)
>  			goto out;
>  	}
> +
> +	/*
> +	 * For the range (3). We never touched the region. In addition to the
> +	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
> +	 * space_info's bytes_may_use counter, reserved in e.g,
> +	 * btrfs_check_data_free_space().

This says "in .e.g.", but in fact there's only one place where we increment the
bytes_may_use counter for data, which is at btrfs_check_data_free_space(). So
just remove the "in e.g." part because it gives the idea that the increment/reservation
can happen in more than one place.

Other than that, it looks good, thanks.

> +	 */
>  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
>  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
>  				     page_ops);
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] btrfs: replace unnecessary goto with direct return
  2022-06-16 15:45 ` [PATCH 4/4] btrfs: replace unnecessary goto with direct return Naohiro Aota
@ 2022-06-17 10:13   ` Filipe Manana
  2022-06-20  2:22     ` Naohiro Aota
  0 siblings, 1 reply; 13+ messages in thread
From: Filipe Manana @ 2022-06-17 10:13 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 12:45:42AM +0900, Naohiro Aota wrote:
> The "goto out;"s in cow_file_range() just results in a simple "return
> ret;" which are not really useful. Replace them with proper direct
> "return"s. It also makes the success path vs failure path stands out.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>

Reviewed-by: Filipe Manana <fdmanana@suse.com>

The subject is too generic, just adding "... at cow_file_range()" would make
it much more clear without being too long.

Thanks.

> ---
>  fs/btrfs/inode.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index cae15924fc99..055c573e2eb3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1253,7 +1253,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  			 * inline extent or a compressed extent.
>  			 */
>  			unlock_page(locked_page);
> -			goto out;
> +			return 0;
>  		} else if (ret < 0) {
>  			goto out_unlock;
>  		}
> @@ -1366,8 +1366,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  		if (ret)
>  			goto out_unlock;
>  	}
> -out:
> -	return ret;
> +	return 0;
>  
>  out_drop_extent_cache:
>  	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> @@ -1425,7 +1424,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  					     page_ops);
>  		start += cur_alloc_size;
>  		if (start >= end)
> -			goto out;
> +			return ret;
>  	}
>  
>  	/*
> @@ -1437,7 +1436,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
>  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
>  				     page_ops);
> -	goto out;
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page
  2022-06-16 15:45 ` [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page Naohiro Aota
@ 2022-06-17 10:15   ` Filipe Manana
  2022-06-20  2:28     ` Naohiro Aota
  0 siblings, 1 reply; 13+ messages in thread
From: Filipe Manana @ 2022-06-17 10:15 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 12:45:40AM +0900, Naohiro Aota wrote:
> btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
> is not usable for submit_uncompressed_range() which can habe NULL
> locked_page.
> 
> This commit supports locked_page == NULL case. Also, it rewrites redundant
> "page_offset(locked_page)".

The patch looks fine, but I don't see any patch in the patchset that actually
makes submit_uncompressed_range() use btrfs_cleanup_ordered_extents().
Did you forgot that, or am I missing something?

Thanks.

> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/inode.c | 36 +++++++++++++++++++++---------------
>  1 file changed, 21 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0c3d9998470f..4e1100f84a88 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -195,11 +195,14 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
>  {
>  	unsigned long index = offset >> PAGE_SHIFT;
>  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
> -	u64 page_start = page_offset(locked_page);
> -	u64 page_end = page_start + PAGE_SIZE - 1;
> -
> +	u64 page_start, page_end;
>  	struct page *page;
>  
> +	if (locked_page) {
> +		page_start = page_offset(locked_page);
> +		page_end = page_start + PAGE_SIZE - 1;
> +	}
> +
>  	while (index <= end_index) {
>  		/*
>  		 * For locked page, we will call end_extent_writepage() on it
> @@ -212,7 +215,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
>  		 * btrfs_mark_ordered_io_finished() would skip the accounting
>  		 * for the page range, and the ordered extent will never finish.
>  		 */
> -		if (index == (page_offset(locked_page) >> PAGE_SHIFT)) {
> +		if (locked_page && index == (page_start >> PAGE_SHIFT)) {
>  			index++;
>  			continue;
>  		}
> @@ -231,17 +234,20 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
>  		put_page(page);
>  	}
>  
> -	/* The locked page covers the full range, nothing needs to be done */
> -	if (bytes + offset <= page_offset(locked_page) + PAGE_SIZE)
> -		return;
> -	/*
> -	 * In case this page belongs to the delalloc range being instantiated
> -	 * then skip it, since the first page of a range is going to be
> -	 * properly cleaned up by the caller of run_delalloc_range
> -	 */
> -	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> -		bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
> -		offset = page_offset(locked_page) + PAGE_SIZE;
> +	if (locked_page) {
> +		/* The locked page covers the full range, nothing needs to be done */
> +		if (bytes + offset <= page_start + PAGE_SIZE)
> +			return;
> +		/*
> +		 * In case this page belongs to the delalloc range being
> +		 * instantiated then skip it, since the first page of a range is
> +		 * going to be properly cleaned up by the caller of
> +		 * run_delalloc_range
> +		 */
> +		if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> +			bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
> +			offset = page_offset(locked_page) + PAGE_SIZE;
> +		}
>  	}
>  
>  	return __endio_write_update_ordered(inode, offset, bytes, false);
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write
  2022-06-16 15:45 ` [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write Naohiro Aota
@ 2022-06-17 10:21   ` Filipe Manana
  2022-06-20  2:26     ` Naohiro Aota
  0 siblings, 1 reply; 13+ messages in thread
From: Filipe Manana @ 2022-06-17 10:21 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 12:45:41AM +0900, Naohiro Aota wrote:
> When cow_file_range() fails in the middle of the allocation loop, it
> unlocks the pages but remains the ordered extents intact. Thus, we need to

s/remains/leaves/

> call btrfs_cleanup_ordered_extents() to finish the created ordered extents.
> 
> Also, we need to call end_extent_writepage() if locked_page is available
> because btrfs_cleanup_ordered_extents() never process the region on the
> locked_page.
> 
> Furthermore, we need to set the mapping as error if locked_page is
> unavailable before unlocking the pages, so that the errno is properly
> propagated to the userland.
> 
> CC: stable@vger.kernel.org

It would be better to specify a version here.

The delalloc paths for compression were (a bit heavily) refactored last year
in preparation for the subpage sector size support, so blindly adding this
to any stable releases might introduce regressions (assuming the patch does
not fail to apply).

> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/inode.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4e1100f84a88..cae15924fc99 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -934,8 +934,18 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
>  		goto out;
>  	}
>  	if (ret < 0) {
> -		if (locked_page)
> +		btrfs_cleanup_ordered_extents(inode, locked_page, start, end - start + 1);
> +		if (locked_page) {
> +			u64 page_start = page_offset(locked_page);
> +			u64 page_end = page_start + PAGE_SIZE - 1;
> +
> +			btrfs_page_set_error(inode->root->fs_info, locked_page,
> +					     page_start, PAGE_SIZE);
> +			set_page_writeback(locked_page);
> +			end_page_writeback(locked_page);
> +			end_extent_writepage(locked_page, ret, page_start, page_end);
>  			unlock_page(locked_page);
> +		}
>  		goto out;
>  	}

So as I commented in the previous patch, something is missing here: the call to
btrfs_cleanup_ordered_extents() at submit_uncompressed_range() in case cow_file_range()
returns an error.

Otherwise it looks fine.
Thanks.

>  
> @@ -1390,9 +1400,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>  	 * However, in case of unlock == 0, we still need to unlock the pages
>  	 * (except @locked_page) to ensure all the pages are unlocked.
>  	 */
> -	if (!unlock && orig_start < start)
> +	if (!unlock && orig_start < start) {
> +		if (!locked_page)
> +			mapping_set_error(inode->vfs_inode.i_mapping, ret);
>  		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
>  					     locked_page, 0, page_ops);
> +	}
>  
>  	/*
>  	 * For the range (2). If we reserved an extent for our delalloc range
> -- 
> 2.35.1
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure
  2022-06-17 10:11   ` Filipe Manana
@ 2022-06-20  2:22     ` Naohiro Aota
  0 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2022-06-20  2:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 11:11:49AM +0100, Filipe Manana wrote:
> On Fri, Jun 17, 2022 at 12:45:39AM +0900, Naohiro Aota wrote:
> > There is a hung_task report on zoned btrfs like below.
> > 
> > https://github.com/naota/linux/issues/59
> > 
> > [  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
> > [  726.329839]       Not tainted 5.16.0-rc1+ #1
> > [  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
> > [  726.331608] Call Trace:
> > [  726.331611]  <TASK>
> > [  726.331614]  __schedule+0x2e5/0x9d0
> > [  726.331622]  schedule+0x58/0xd0
> > [  726.331626]  io_schedule+0x3f/0x70
> > [  726.331629]  __folio_lock+0x125/0x200
> > [  726.331634]  ? find_get_entries+0x1bc/0x240
> > [  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
> > [  726.331642]  truncate_inode_pages_range+0x5b2/0x770
> > [  726.331649]  truncate_inode_pages_final+0x44/0x50
> > [  726.331653]  btrfs_evict_inode+0x67/0x480
> > [  726.331658]  evict+0xd0/0x180
> > [  726.331661]  iput+0x13f/0x200
> > [  726.331664]  do_unlinkat+0x1c0/0x2b0
> > [  726.331668]  __x64_sys_unlink+0x23/0x30
> > [  726.331670]  do_syscall_64+0x3b/0xc0
> > [  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  726.331677] RIP: 0033:0x7fb9490a171b
> > [  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
> > [  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
> > [  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
> > [  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
> > [  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
> > [  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
> > [  726.331693]  </TASK>
> > 
> > While we debug the issue, we found running fstests generic/551 on 5GB
> > non-zoned null_blk device in the emulated zoned mode also had a
> > similar hung issue.
> > 
> > Also, we can reproduce the same symptom with an error injected
> > cow_file_range() setup.
> > 
> > The hang occurs when cow_file_range() fails in the middle of
> > allocation. cow_file_range() called from do_allocation_zoned() can
> > split the give region ([start, end]) for allocation depending on
> > current block group usages. When btrfs can allocate bytes for one part
> > of the split regions but fails for the other region (e.g. because of
> > -ENOSPC), we return the error leaving the pages in the succeeded regions
> > locked. Technically, this occurs only when @unlock == 0. Otherwise, we
> > unlock the pages in an allocated region after creating an ordered
> > extent.
> > 
> > Considering the callers of cow_file_range(unlock=0) won't write out
> > the pages, we can unlock the pages on error exit from
> > cow_file_range(). So, we can ensure all the pages except @locked_page
> > are unlocked on error case.
> > 
> > In summary, cow_file_range now behaves like this:
> > 
> > - page_started == 1 (return value)
> >   - All the pages are unlocked. IO is started.
> > - unlock == 1
> >   - All the pages except @locked_page are unlocked in any case
> > - unlock == 0
> >   - On success, all the pages are locked for writing out them
> >   - On failure, all the pages except @locked_page are unlocked
> > 
> > Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems")
> > CC: stable@vger.kernel.org # 5.12+
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/inode.c | 72 ++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 64 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 1247690e7021..0c3d9998470f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -1134,6 +1134,28 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
> >   * *page_started is set to one if we unlock locked_page and do everything
> >   * required to start IO on it.  It may be clean and already done with
> >   * IO when we return.
> > + *
> > + * When unlock == 1, we unlock the pages in successfully allocated regions.
> > + * When unlock == 0, we leave them locked for writing them out.
> > + *
> > + * However, we unlock all the pages except @locked_page in case of failure.
> > + *
> > + * In summary, page locking state will be as follow:
> > + *
> > + * - page_started == 1 (return value)
> > + *     - All the pages are unlocked. IO is started.
> > + *     - Note that this can happen only on success
> > + * - unlock == 1
> > + *     - All the pages except @locked_page are unlocked in any case
> > + * - unlock == 0
> > + *     - On success, all the pages are locked for writing out them
> > + *     - On failure, all the pages except @locked_page are unlocked
> > + *
> > + * When a failure happens in the second or later iteration of the
> > + * while-loop, the ordered extents created in previous iterations are kept
> > + * intact. So, the caller must clean them up by calling
> > + * btrfs_cleanup_ordered_extents(). See btrfs_run_delalloc_range() for
> > + * example.
> >   */
> >  static noinline int cow_file_range(struct btrfs_inode *inode,
> >  				   struct page *locked_page,
> > @@ -1143,6 +1165,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  	struct btrfs_root *root = inode->root;
> >  	struct btrfs_fs_info *fs_info = root->fs_info;
> >  	u64 alloc_hint = 0;
> > +	u64 orig_start = start;
> >  	u64 num_bytes;
> >  	unsigned long ram_size;
> >  	u64 cur_alloc_size = 0;
> > @@ -1336,18 +1359,44 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> >  	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
> >  out_unlock:
> > +	/*
> > +	 * Now, we have three regions to clean up, as shown below.
> > +	 *
> > +	 * |-------(1)----|---(2)---|-------------(3)----------|
> > +	 * `- orig_start  `- start  `- start + cur_alloc_size  `- end
> > +	 *
> > +	 * We process each region below.
> > +	 */
> > +
> >  	clear_bits = EXTENT_LOCKED | EXTENT_DELALLOC | EXTENT_DELALLOC_NEW |
> >  		EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV;
> >  	page_ops = PAGE_UNLOCK | PAGE_START_WRITEBACK | PAGE_END_WRITEBACK;
> > +
> >  	/*
> > -	 * If we reserved an extent for our delalloc range (or a subrange) and
> > -	 * failed to create the respective ordered extent, then it means that
> > -	 * when we reserved the extent we decremented the extent's size from
> > -	 * the data space_info's bytes_may_use counter and incremented the
> > -	 * space_info's bytes_reserved counter by the same amount. We must make
> > -	 * sure extent_clear_unlock_delalloc() does not try to decrement again
> > -	 * the data space_info's bytes_may_use counter, therefore we do not pass
> > -	 * it the flag EXTENT_CLEAR_DATA_RESV.
> > +	 * For the range (1). We have already instantiated the ordered extents
> > +	 * for this region. They are cleaned up by
> > +	 * btrfs_cleanup_ordered_extents() in e.g,
> > +	 * btrfs_run_delalloc_range(). EXTENT_LOCKED | EXTENT_DELALLOC are
> > +	 * already cleared in the above loop. And, EXTENT_DELALLOC_NEW |
> > +	 * EXTENT_DEFRAG | EXTENT_CLEAR_META_RESV are handled by the cleanup
> > +	 * function.
> > +	 *
> > +	 * However, in case of unlock == 0, we still need to unlock the pages
> > +	 * (except @locked_page) to ensure all the pages are unlocked.
> > +	 */
> > +	if (!unlock && orig_start < start)
> > +		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> > +					     locked_page, 0, page_ops);
> > +
> > +	/*
> > +	 * For the range (2). If we reserved an extent for our delalloc range
> > +	 * (or a subrange) and failed to create the respective ordered extent,
> > +	 * then it means that when we reserved the extent we decremented the
> > +	 * extent's size from the data space_info's bytes_may_use counter and
> > +	 * incremented the space_info's bytes_reserved counter by the same
> > +	 * amount. We must make sure extent_clear_unlock_delalloc() does not try
> > +	 * to decrement again the data space_info's bytes_may_use counter,
> > +	 * therefore we do not pass it the flag EXTENT_CLEAR_DATA_RESV.
> >  	 */
> >  	if (extent_reserved) {
> >  		extent_clear_unlock_delalloc(inode, start,
> > @@ -1359,6 +1408,13 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  		if (start >= end)
> >  			goto out;
> >  	}
> > +
> > +	/*
> > +	 * For the range (3). We never touched the region. In addition to the
> > +	 * clear_bits above, we add EXTENT_CLEAR_DATA_RESV to release the data
> > +	 * space_info's bytes_may_use counter, reserved in e.g,
> > +	 * btrfs_check_data_free_space().
> 
> This says "in .e.g.", but in fact there's only one place where we increment the
> bytes_may_use counter for data, which is at btrfs_check_data_free_space(). So
> just remove the "in e.g." part because it gives the idea that the increment/reservation
> can happen in more than one place.

We also have btrfs_alloc_data_chunk_ondemand() in btrfs_fallocate() or
btrfs_do_encoded_write(). But, yeah, for the normal write path calling
cow_file_range(), btrfs_check_data_free_space() is the only place. I'll
update the comment.

Thanks.

> Other than that, it looks good, thanks.
> 
> > +	 */
> >  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
> >  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
> >  				     page_ops);
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/4] btrfs: replace unnecessary goto with direct return
  2022-06-17 10:13   ` Filipe Manana
@ 2022-06-20  2:22     ` Naohiro Aota
  0 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2022-06-20  2:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 11:13:18AM +0100, Filipe Manana wrote:
> On Fri, Jun 17, 2022 at 12:45:42AM +0900, Naohiro Aota wrote:
> > The "goto out;"s in cow_file_range() just results in a simple "return
> > ret;" which are not really useful. Replace them with proper direct
> > "return"s. It also makes the success path vs failure path stands out.
> > 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> 
> The subject is too generic, just adding "... at cow_file_range()" would make
> it much more clear without being too long.

Yeah, that sounds better. I'll rephrase the subject.

> Thanks.
> 
> > ---
> >  fs/btrfs/inode.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index cae15924fc99..055c573e2eb3 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -1253,7 +1253,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  			 * inline extent or a compressed extent.
> >  			 */
> >  			unlock_page(locked_page);
> > -			goto out;
> > +			return 0;
> >  		} else if (ret < 0) {
> >  			goto out_unlock;
> >  		}
> > @@ -1366,8 +1366,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  		if (ret)
> >  			goto out_unlock;
> >  	}
> > -out:
> > -	return ret;
> > +	return 0;
> >  
> >  out_drop_extent_cache:
> >  	btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);
> > @@ -1425,7 +1424,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  					     page_ops);
> >  		start += cur_alloc_size;
> >  		if (start >= end)
> > -			goto out;
> > +			return ret;
> >  	}
> >  
> >  	/*
> > @@ -1437,7 +1436,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  	extent_clear_unlock_delalloc(inode, start, end, locked_page,
> >  				     clear_bits | EXTENT_CLEAR_DATA_RESV,
> >  				     page_ops);
> > -	goto out;
> > +	return ret;
> >  }
> >  
> >  /*
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write
  2022-06-17 10:21   ` Filipe Manana
@ 2022-06-20  2:26     ` Naohiro Aota
  0 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2022-06-20  2:26 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 11:21:44AM +0100, Filipe Manana wrote:
> On Fri, Jun 17, 2022 at 12:45:41AM +0900, Naohiro Aota wrote:
> > When cow_file_range() fails in the middle of the allocation loop, it
> > unlocks the pages but remains the ordered extents intact. Thus, we need to
> 
> s/remains/leaves/

Will fix.

> 
> > call btrfs_cleanup_ordered_extents() to finish the created ordered extents.
> > 
> > Also, we need to call end_extent_writepage() if locked_page is available
> > because btrfs_cleanup_ordered_extents() never process the region on the
> > locked_page.
> > 
> > Furthermore, we need to set the mapping as error if locked_page is
> > unavailable before unlocking the pages, so that the errno is properly
> > propagated to the userland.
> > 
> > CC: stable@vger.kernel.org
> 
> It would be better to specify a version here.
>
> The delalloc paths for compression were (a bit heavily) refactored last year
> in preparation for the subpage sector size support, so blindly adding this
> to any stable releases might introduce regressions (assuming the patch does
> not fail to apply).

Yeah ... I could not find the exact commit. It looks like there is no
ordered extent clean up from the beginning.

I'll check if the modified cow_file_range() cause a hang on the stable
versions.

> 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/inode.c | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 4e1100f84a88..cae15924fc99 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -934,8 +934,18 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
> >  		goto out;
> >  	}
> >  	if (ret < 0) {
> > -		if (locked_page)
> > +		btrfs_cleanup_ordered_extents(inode, locked_page, start, end - start + 1);
> > +		if (locked_page) {
> > +			u64 page_start = page_offset(locked_page);
> > +			u64 page_end = page_start + PAGE_SIZE - 1;
> > +
> > +			btrfs_page_set_error(inode->root->fs_info, locked_page,
> > +					     page_start, PAGE_SIZE);
> > +			set_page_writeback(locked_page);
> > +			end_page_writeback(locked_page);
> > +			end_extent_writepage(locked_page, ret, page_start, page_end);
> >  			unlock_page(locked_page);
> > +		}
> >  		goto out;
> >  	}
> 
> So as I commented in the previous patch, something is missing here: the call to
> btrfs_cleanup_ordered_extents() at submit_uncompressed_range() in case cow_file_range()
> returns an error.
> 
> Otherwise it looks fine.
> Thanks.

btrfs_cleanup_ordered_extents() is called at the first line of the added
lines. We need to call it regardless of locked_page != NULL or not, because
submit_uncompressed_range() can be called with locked_page = NULL.

> >  
> > @@ -1390,9 +1400,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
> >  	 * However, in case of unlock == 0, we still need to unlock the pages
> >  	 * (except @locked_page) to ensure all the pages are unlocked.
> >  	 */
> > -	if (!unlock && orig_start < start)
> > +	if (!unlock && orig_start < start) {
> > +		if (!locked_page)
> > +			mapping_set_error(inode->vfs_inode.i_mapping, ret);
> >  		extent_clear_unlock_delalloc(inode, orig_start, start - 1,
> >  					     locked_page, 0, page_ops);
> > +	}
> >  
> >  	/*
> >  	 * For the range (2). If we reserved an extent for our delalloc range
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page
  2022-06-17 10:15   ` Filipe Manana
@ 2022-06-20  2:28     ` Naohiro Aota
  0 siblings, 0 replies; 13+ messages in thread
From: Naohiro Aota @ 2022-06-20  2:28 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Fri, Jun 17, 2022 at 11:15:15AM +0100, Filipe Manana wrote:
> On Fri, Jun 17, 2022 at 12:45:40AM +0900, Naohiro Aota wrote:
> > btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
> > is not usable for submit_uncompressed_range() which can habe NULL
> > locked_page.
> > 
> > This commit supports locked_page == NULL case. Also, it rewrites redundant
> > "page_offset(locked_page)".
> 
> The patch looks fine, but I don't see any patch in the patchset that actually
> makes submit_uncompressed_range() use btrfs_cleanup_ordered_extents().
> Did you forgot that, or am I missing something?

As commented in the reply, you looks like to miss the first line of added
code. :-)

> Thanks.
> 
> > 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/inode.c | 36 +++++++++++++++++++++---------------
> >  1 file changed, 21 insertions(+), 15 deletions(-)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 0c3d9998470f..4e1100f84a88 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -195,11 +195,14 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
> >  {
> >  	unsigned long index = offset >> PAGE_SHIFT;
> >  	unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
> > -	u64 page_start = page_offset(locked_page);
> > -	u64 page_end = page_start + PAGE_SIZE - 1;
> > -
> > +	u64 page_start, page_end;
> >  	struct page *page;
> >  
> > +	if (locked_page) {
> > +		page_start = page_offset(locked_page);
> > +		page_end = page_start + PAGE_SIZE - 1;
> > +	}
> > +
> >  	while (index <= end_index) {
> >  		/*
> >  		 * For locked page, we will call end_extent_writepage() on it
> > @@ -212,7 +215,7 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
> >  		 * btrfs_mark_ordered_io_finished() would skip the accounting
> >  		 * for the page range, and the ordered extent will never finish.
> >  		 */
> > -		if (index == (page_offset(locked_page) >> PAGE_SHIFT)) {
> > +		if (locked_page && index == (page_start >> PAGE_SHIFT)) {
> >  			index++;
> >  			continue;
> >  		}
> > @@ -231,17 +234,20 @@ static inline void btrfs_cleanup_ordered_extents(struct btrfs_inode *inode,
> >  		put_page(page);
> >  	}
> >  
> > -	/* The locked page covers the full range, nothing needs to be done */
> > -	if (bytes + offset <= page_offset(locked_page) + PAGE_SIZE)
> > -		return;
> > -	/*
> > -	 * In case this page belongs to the delalloc range being instantiated
> > -	 * then skip it, since the first page of a range is going to be
> > -	 * properly cleaned up by the caller of run_delalloc_range
> > -	 */
> > -	if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> > -		bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
> > -		offset = page_offset(locked_page) + PAGE_SIZE;
> > +	if (locked_page) {
> > +		/* The locked page covers the full range, nothing needs to be done */
> > +		if (bytes + offset <= page_start + PAGE_SIZE)
> > +			return;
> > +		/*
> > +		 * In case this page belongs to the delalloc range being
> > +		 * instantiated then skip it, since the first page of a range is
> > +		 * going to be properly cleaned up by the caller of
> > +		 * run_delalloc_range
> > +		 */
> > +		if (page_start >= offset && page_end <= (offset + bytes - 1)) {
> > +			bytes = offset + bytes - page_offset(locked_page) - PAGE_SIZE;
> > +			offset = page_offset(locked_page) + PAGE_SIZE;
> > +		}
> >  	}
> >  
> >  	return __endio_write_update_ordered(inode, offset, bytes, false);
> > -- 
> > 2.35.1
> > 

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-06-20  2:28 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-16 15:45 [PATCH 0/4] btrfs: fix error handling of cow_file_range(unlock = 0) Naohiro Aota
2022-06-16 15:45 ` [PATCH 1/4] btrfs: ensure pages are unlocked on cow_file_range() failure Naohiro Aota
2022-06-17 10:11   ` Filipe Manana
2022-06-20  2:22     ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 2/4] btrfs: extend btrfs_cleanup_ordered_extens for NULL locked_page Naohiro Aota
2022-06-17 10:15   ` Filipe Manana
2022-06-20  2:28     ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 3/4] btrfs: fix error handling of fallbacked uncompress write Naohiro Aota
2022-06-17 10:21   ` Filipe Manana
2022-06-20  2:26     ` Naohiro Aota
2022-06-16 15:45 ` [PATCH 4/4] btrfs: replace unnecessary goto with direct return Naohiro Aota
2022-06-17 10:13   ` Filipe Manana
2022-06-20  2:22     ` Naohiro Aota

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).