linux-btrfs.vger.kernel.org archive mirror
* Re: [PATCH] btrfs: zlib: do not do unnecessary page copying for compression
  2024-05-27 16:25  1% ` Zaslonko Mikhail
@ 2024-05-27 22:09  1%   ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-27 22:09 UTC (permalink / raw)
  To: Zaslonko Mikhail, Qu Wenruo, linux-btrfs; +Cc: linux-s390, Vasily Gorbik



On 2024/5/28 01:55, Zaslonko Mikhail wrote:
> Hello Qu,
>
> I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
> https://lwn.net/Articles/808809/
>
> The workspace buffer size was indeed enlarged for performance reasons.
>
> Please see my comments below.
>
> On 27.05.2024 11:24, Qu Wenruo wrote:
>> [BUG]
>> In function zlib_compress_folios(), we handle the input by:
>>
>> - If there are multiple pages left
>>    We copy the page content into workspace->buf, and use workspace->buf
>>    as input for compression.
>>
>>    But on x86_64 (which doesn't support dfltcc), that buffer size is just
>>    one page, so we're wasting our CPU time copying the page for no
>>    benefit.
>>
>> - If there is only one page left
>>    We use the mapped page address as input for compression.
>>
>> The problem is, this means we will copy the whole input range except the
>> last page (up to 124K of copying), without much obvious benefit.
>>
>> Meanwhile the cost is pretty obvious.
>
> Actually, the behavior for kernels w/o dfltcc support (currently available on s390
> only) should not be affected.
> We copy input pages to the workspace->buf only if the buffer size is larger than 1 page.
> At least it worked this way after my original btrfs zlib patch:
> https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@linux.ibm.com/
>
> Has this behavior somehow changed after your page->folio conversion performed for btrfs?
> https://lore.kernel.org/all/cover.1706521511.git.wqu@suse.com/

My bad, I forgot that buf_size for non-s390 systems is fixed to one
page, so the page copying path is never taken on x86_64.

But I'm still wondering: if we do not use 4 pages as the buffer, how
much of a performance penalty would there be?

One of the objectives is to prepare for the incoming sector-perfect
subpage compression support, thus I'm re-checking the existing
compression code and preparing to change it to be subpage compatible.

If we can simplify the behavior without too large a performance
penalty, can we consider just using a single page as the buffer?

Thanks,
Qu


* Re: [syzbot] [overlayfs] possible deadlock in ovl_copy_up_flags
  @ 2024-05-27 21:36  1% ` syzbot
  0 siblings, 0 replies; 200+ results
From: syzbot @ 2024-05-27 21:36 UTC (permalink / raw)
  To: amir73il, brauner, clm, dsterba, jack, josef, linux-btrfs,
	linux-fsdevel, linux-kernel, linux-unionfs, miklos, mszeredi,
	syzkaller-bugs, viro

Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+85e58cdf5b3136471d4b@syzkaller.appspotmail.com

Tested on:

commit:         f74ee925 ovl: tmpfile copy-up fix
git tree:       git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs.git
console output: https://syzkaller.appspot.com/x/log.txt?x=142c4e2c980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b9016f104992d69c
dashboard link: https://syzkaller.appspot.com/bug?extid=85e58cdf5b3136471d4b
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.


* Re: [PATCH] fs: btrfs: add MODULE_DESCRIPTION()
  2024-05-27 17:56  1% [PATCH] fs: btrfs: add MODULE_DESCRIPTION() Jeff Johnson
@ 2024-05-27 20:05  1% ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-27 20:05 UTC (permalink / raw)
  To: Jeff Johnson
  Cc: Chris Mason, Josef Bacik, David Sterba, linux-btrfs,
	linux-kernel, kernel-janitors

On Mon, May 27, 2024 at 10:56:59AM -0700, Jeff Johnson wrote:
> Fix the 'make W=1' warning:
> WARNING: modpost: missing MODULE_DESCRIPTION() in fs/btrfs/btrfs.o
> 
> Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>

Added to for-next, thanks.


* [PATCH] fs: btrfs: add MODULE_DESCRIPTION()
@ 2024-05-27 17:56  1% Jeff Johnson
  2024-05-27 20:05  1% ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Jeff Johnson @ 2024-05-27 17:56 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: linux-btrfs, linux-kernel, kernel-janitors, Jeff Johnson

Fix the 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in fs/btrfs/btrfs.o

Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
---
 fs/btrfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f05cce7c8b8d..649913e13bc0 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2590,6 +2590,7 @@ static int __init init_btrfs_fs(void)
 late_initcall(init_btrfs_fs);
 module_exit(exit_btrfs_fs)
 
+MODULE_DESCRIPTION("B-Tree File System (BTRFS)");
 MODULE_LICENSE("GPL");
 MODULE_SOFTDEP("pre: crc32c");
 MODULE_SOFTDEP("pre: xxhash64");

---
base-commit: 2bfcfd584ff5ccc8bb7acde19b42570414bf880b
change-id: 20240527-md-fs-btrfs-9333db7e7fea



* Re: [PATCH] btrfs: zlib: do not do unnecessary page copying for compression
  2024-05-27  9:24  1% [PATCH] btrfs: zlib: do not do unnecessary page copying for compression Qu Wenruo
@ 2024-05-27 16:25  1% ` Zaslonko Mikhail
  2024-05-27 22:09  1%   ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Zaslonko Mikhail @ 2024-05-27 16:25 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: linux-s390, Vasily Gorbik

Hello Qu,

I remember implementing btrfs zlib changes related to s390 dfltcc compression support a while ago:
https://lwn.net/Articles/808809/

The workspace buffer size was indeed enlarged for performance reasons.

Please see my comments below.

On 27.05.2024 11:24, Qu Wenruo wrote:
> [BUG]
> In function zlib_compress_folios(), we handle the input by:
> 
> - If there are multiple pages left
>   We copy the page content into workspace->buf, and use workspace->buf
>   as input for compression.
> 
>   But on x86_64 (which doesn't support dfltcc), that buffer size is just
>   one page, so we're wasting our CPU time copying the page for no
>   benefit.
> 
> - If there is only one page left
>   We use the mapped page address as input for compression.
> 
> The problem is, this means we will copy the whole input range except the
> last page (up to 124K of copying), without much obvious benefit.
> 
> Meanwhile the cost is pretty obvious.

Actually, the behavior for kernels w/o dfltcc support (currently available on s390
only) should not be affected. 
We copy input pages to the workspace->buf only if the buffer size is larger than 1 page.
At least it worked this way after my original btrfs zlib patch:
https://lwn.net/ml/linux-kernel/20200108105103.29028-1-zaslonko@linux.ibm.com/

Has this behavior somehow changed after your page->folio conversion performed for btrfs? 
https://lore.kernel.org/all/cover.1706521511.git.wqu@suse.com/

Am I missing something?

> 
> [POSSIBLE REASON]
> The possible reason may be related to the support of S390 hardware zlib
> compression acceleration.
> 
> As we allocate 4 pages (4 * 4K) as workspace input buffer just for s390.
> 
> [FIX]
> I checked the dfltcc code, and there seems to be no requirement on the
> input buffer size.
> The function dfltcc_can_deflate() only checks:
> 
> - If the compression settings are supported
>   Only level/w_bits/strategy/level_mask is checked.
> 
> - If the hardware supports it
> 
> No mention at all of the input buffer size, thus I believe there is no
> need to waste time doing the page copying.
> 
> Maybe the hardware acceleration is so good on s390 that it can offset
> the page copying cost, but it's definitely a penalty for non-s390
> systems.
> 
> So fix the problem by:
> 
> - Use the same buffer size
>   No matter if dfltcc support is enabled or not
> 
> - Always use page address as input
> 
> Cc: linux-s390@vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/zlib.c | 67 +++++++++++--------------------------------------
>  1 file changed, 14 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
> index d9e5c88a0f85..9c88a841a060 100644
> --- a/fs/btrfs/zlib.c
> +++ b/fs/btrfs/zlib.c
> @@ -65,21 +65,8 @@ struct list_head *zlib_alloc_workspace(unsigned int level)
>  			zlib_inflate_workspacesize());
>  	workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN);
>  	workspace->level = level;
> -	workspace->buf = NULL;
> -	/*
> -	 * In case of s390 zlib hardware support, allocate lager workspace
> -	 * buffer. If allocator fails, fall back to a single page buffer.
> -	 */
> -	if (zlib_deflate_dfltcc_enabled()) {
> -		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
> -					 __GFP_NOMEMALLOC | __GFP_NORETRY |
> -					 __GFP_NOWARN | GFP_NOIO);
> -		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
> -	}
> -	if (!workspace->buf) {
> -		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -		workspace->buf_size = PAGE_SIZE;
> -	}
> +	workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +	workspace->buf_size = PAGE_SIZE;
>  	if (!workspace->strm.workspace || !workspace->buf)
>  		goto fail;
>  
> @@ -103,7 +90,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  	struct folio *in_folio = NULL;
>  	struct folio *out_folio = NULL;
>  	unsigned long bytes_left;
> -	unsigned int in_buf_folios;
>  	unsigned long len = *total_out;
>  	unsigned long nr_dest_folios = *out_folios;
>  	const unsigned long max_out = nr_dest_folios * PAGE_SIZE;
> @@ -130,7 +116,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  	folios[0] = out_folio;
>  	nr_folios = 1;
>  
> -	workspace->strm.next_in = workspace->buf;
>  	workspace->strm.avail_in = 0;
>  	workspace->strm.next_out = cfolio_out;
>  	workspace->strm.avail_out = PAGE_SIZE;
> @@ -142,43 +127,19 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
>  		 */
>  		if (workspace->strm.avail_in == 0) {
>  			bytes_left = len - workspace->strm.total_in;
> -			in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
> -					    workspace->buf_size / PAGE_SIZE);

	Doesn't this always set *in_buf_folios* to 1 when there is no dfltcc support (single-page workspace buffer)?

> -			if (in_buf_folios > 1) {
> -				int i;
> -
> -				for (i = 0; i < in_buf_folios; i++) {
> -					if (data_in) {
> -						kunmap_local(data_in);
> -						folio_put(in_folio);
> -						data_in = NULL;
> -					}
> -					ret = btrfs_compress_filemap_get_folio(mapping,
> -							start, &in_folio);
> -					if (ret < 0)
> -						goto out;
> -					data_in = kmap_local_folio(in_folio, 0);
> -					copy_page(workspace->buf + i * PAGE_SIZE,
> -						  data_in);
> -					start += PAGE_SIZE;
> -				}
> -				workspace->strm.next_in = workspace->buf;
> -			} else {
> -				if (data_in) {
> -					kunmap_local(data_in);
> -					folio_put(in_folio);
> -					data_in = NULL;
> -				}
> -				ret = btrfs_compress_filemap_get_folio(mapping,
> -						start, &in_folio);
> -				if (ret < 0)
> -					goto out;
> -				data_in = kmap_local_folio(in_folio, 0);
> -				start += PAGE_SIZE;
> -				workspace->strm.next_in = data_in;
> +			if (data_in) {
> +				kunmap_local(data_in);
> +				folio_put(in_folio);
> +				data_in = NULL;
>  			}
> -			workspace->strm.avail_in = min(bytes_left,
> -						       (unsigned long) workspace->buf_size);
> +			ret = btrfs_compress_filemap_get_folio(mapping,
> +					start, &in_folio);
> +			if (ret < 0)
> +				goto out;
> +			data_in = kmap_local_folio(in_folio, 0);
> +			start += PAGE_SIZE;
> +			workspace->strm.next_in = data_in;
> +			workspace->strm.avail_in = min(bytes_left, PAGE_SIZE);
>  		}
>  
>  		ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);

Thanks,
Mikhail


* Re: [PATCH] btrfs: qgroup: use kmem cache to alloc struct btrfs_qgroup_extent_record
  2024-05-27 10:13  1% [PATCH] btrfs: qgroup: use kmem cache to alloc struct btrfs_qgroup_extent_record Junchao Sun
@ 2024-05-27 15:27  1% ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-27 15:27 UTC (permalink / raw)
  To: Junchao Sun; +Cc: linux-btrfs, clm, josef, dsterba

On Mon, May 27, 2024 at 06:13:34PM +0800, Junchao Sun wrote:
> Fixes a TODO in the qgroup code by utilizing a kmem cache to accelerate
> the allocation of struct btrfs_qgroup_extent_record.

The TODO is almost 9 years old, so it should be evaluated whether it's
still applicable.

> This patch has passed the check -g qgroup tests using xfstests.

Changing kmalloc to a kmem cache should be justified, with an
explanation of why it's done. I'm not sure we need it given that it's
been working fine so far. Also, quotas can be enabled or disabled
during a single mount; it's not necessary to create the dedicated kmem
cache and leave it unused if quotas are disabled. Here using the
generic slab is convenient.
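
For reference, a minimal sketch of the two allocation styles being
weighed here (kernel-style and illustrative only, not the actual btrfs
call sites):

	/* Generic slab: no setup cost, served from the shared kmalloc-<size> caches. */
	record = kzalloc(sizeof(*record), GFP_NOFS);
	kfree(record);

	/* Dedicated cache: one-time setup at module init, objects packed per type.
	 * This pays off only for hot allocations, and the cache sits unused
	 * whenever quotas are disabled. */
	cache = KMEM_CACHE(btrfs_qgroup_extent_record, 0);
	record = kmem_cache_zalloc(cache, GFP_NOFS);
	kmem_cache_free(cache, record);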

If you think there is a reason to use kmem cache please let us know.
Otherwise it would be better to delete the TODO line.


* [PATCH] btrfs: qgroup: use kmem cache to alloc struct btrfs_qgroup_extent_record
@ 2024-05-27 10:13  1% Junchao Sun
  2024-05-27 15:27  1% ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Junchao Sun @ 2024-05-27 10:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: clm, josef, dsterba, Junchao Sun

Fixes a TODO in the qgroup code by utilizing a kmem cache to accelerate
the allocation of struct btrfs_qgroup_extent_record.

This patch has passed the check -g qgroup tests using xfstests.

Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
---
 fs/btrfs/delayed-ref.c |  6 +++---
 fs/btrfs/qgroup.c      | 21 ++++++++++++++++++---
 fs/btrfs/qgroup.h      |  6 +++++-
 fs/btrfs/super.c       |  3 +++
 4 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index e44e62cf76bc..d2d6bda6ccf7 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -916,7 +916,7 @@ add_delayed_ref_head(struct btrfs_trans_handle *trans,
 	if (qrecord) {
 		if (btrfs_qgroup_trace_extent_nolock(trans->fs_info,
 					delayed_refs, qrecord))
-			kfree(qrecord);
+			kmem_cache_free(btrfs_qgroup_extent_record_cachep, qrecord);
 		else
 			qrecord_inserted = true;
 	}
@@ -1088,7 +1088,7 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	}
 
 	if (btrfs_qgroup_full_accounting(fs_info) && !generic_ref->skip_qgroup) {
-		record = kzalloc(sizeof(*record), GFP_NOFS);
+		record = kmem_cache_zalloc(btrfs_qgroup_extent_record_cachep, GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
 			kmem_cache_free(btrfs_delayed_ref_head_cachep, head_ref);
@@ -1191,7 +1191,7 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	}
 
 	if (btrfs_qgroup_full_accounting(fs_info) && !generic_ref->skip_qgroup) {
-		record = kzalloc(sizeof(*record), GFP_NOFS);
+		record = kmem_cache_zalloc(btrfs_qgroup_extent_record_cachep, GFP_NOFS);
 		if (!record) {
 			kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 			kmem_cache_free(btrfs_delayed_ref_head_cachep,
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 40e5f7f2fcb7..5f72909bfcf2 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -30,6 +30,7 @@
 #include "root-tree.h"
 #include "tree-checker.h"
 
+struct kmem_cache *btrfs_qgroup_extent_record_cachep;
 enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info)
 {
 	if (!test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags))
@@ -2024,7 +2025,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 
 	if (!btrfs_qgroup_full_accounting(fs_info) || bytenr == 0 || num_bytes == 0)
 		return 0;
-	record = kzalloc(sizeof(*record), GFP_NOFS);
+	record = kmem_cache_zalloc(btrfs_qgroup_extent_record_cachep, GFP_NOFS);
 	if (!record)
 		return -ENOMEM;
 
@@ -2985,7 +2986,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		ulist_free(new_roots);
 		new_roots = NULL;
 		rb_erase(node, &delayed_refs->dirty_extent_root);
-		kfree(record);
+		kmem_cache_free(btrfs_qgroup_extent_record_cachep, record);
 
 	}
 	trace_qgroup_num_dirty_extents(fs_info, trans->transid,
@@ -4783,7 +4784,7 @@ void btrfs_qgroup_destroy_extent_records(struct btrfs_transaction *trans)
 	root = &trans->delayed_refs.dirty_extent_root;
 	rbtree_postorder_for_each_entry_safe(entry, next, root, node) {
 		ulist_free(entry->old_roots);
-		kfree(entry);
+		kmem_cache_free(btrfs_qgroup_extent_record_cachep, entry);
 	}
 	*root = RB_ROOT;
 }
@@ -4845,3 +4846,17 @@ int btrfs_record_squota_delta(struct btrfs_fs_info *fs_info,
 	spin_unlock(&fs_info->qgroup_lock);
 	return ret;
 }
+
+void __cold btrfs_qgroup_exit(void)
+{
+	kmem_cache_destroy(btrfs_qgroup_extent_record_cachep);
+}
+
+int __init btrfs_qgroup_init(void)
+{
+	btrfs_qgroup_extent_record_cachep = KMEM_CACHE(btrfs_qgroup_extent_record, 0);
+	if (!btrfs_qgroup_extent_record_cachep)
+		return -ENOMEM;
+
+	return 0;
+}
\ No newline at end of file
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 706640be0ec2..3975c32ac23e 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -123,7 +123,6 @@ struct btrfs_inode;
 
 /*
  * Record a dirty extent, and info qgroup to update quota on it
- * TODO: Use kmem cache to alloc it.
  */
 struct btrfs_qgroup_extent_record {
 	struct rb_node node;
@@ -312,6 +311,11 @@ enum btrfs_qgroup_mode {
 	BTRFS_QGROUP_MODE_SIMPLE
 };
 
+extern struct kmem_cache *btrfs_qgroup_extent_record_cachep;
+
+void __cold btrfs_qgroup_exit(void);
+int __init btrfs_qgroup_init(void);
+
 enum btrfs_qgroup_mode btrfs_qgroup_mode(struct btrfs_fs_info *fs_info);
 bool btrfs_qgroup_enabled(struct btrfs_fs_info *fs_info);
 bool btrfs_qgroup_full_accounting(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 7e44ccaf348f..0fe383ef816b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2506,6 +2506,9 @@ static const struct init_sequence mod_init_seq[] = {
 	}, {
 		.init_func = btrfs_delayed_ref_init,
 		.exit_func = btrfs_delayed_ref_exit,
+	}, {
+		.init_func = btrfs_qgroup_init,
+		.exit_func = btrfs_qgroup_exit,
 	}, {
 		.init_func = btrfs_prelim_ref_init,
 		.exit_func = btrfs_prelim_ref_exit,
-- 
2.39.2



* Re: RIP: + BUG: with 6.8.11 and BTRFS
  @ 2024-05-27  9:32  1%   ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-27  9:32 UTC (permalink / raw)
  To: Toralf Förster, Linux Kernel, linux-btrfs



On 2024/5/27 00:16, Toralf Förster wrote:
> On 5/26/24 11:08, Toralf Förster wrote:
>>
>> I upgraded yesterday from kernel 6.8.10 to 6.8.11.
>>
>> The system does not recover from reboot at the moment.
>
> It recovered eventually; I switched to 6.9.2, which runs fine so far.
> But these are new log messages:

That looks exactly like the one Linus recently reported
(https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/)

Unfortunately he is reproducing it with the latest master, so I'm not
sure if v6.9 is any better.

Meanwhile, if you can reproduce the problem reliably, I can craft several
debug patches for you to test, but I'm afraid it's not that reproducible...

Thanks,
Qu
>
> May 26 13:44:06 mr-fox kernel: WARNING: stack recursion on stack type 4
> May 26 13:44:06 mr-fox kernel: WARNING: can't access registers at
> syscall_return_via_sysret+0x64/0xc2
> May 26 13:44:06 mr-fox sSMTP[29464]: Creating SSL connection to host
> May 26 13:44:06 mr-fox sSMTP[29464]: SSL connection using
> TLS_AES_256_GCM_SHA384
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (2635 >
> 2500), lowering kernel.perf_event_max_sample_rate to 75750
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (3323 >
> 3293), lowering kernel.perf_event_max_sample_rate to 60000
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (4168 >
> 4153), lowering kernel.perf_event_max_sample_rate to 47750
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (5273 >
> 5210), lowering kernel.perf_event_max_sample_rate to 37750
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (6600 >
> 6591), lowering kernel.perf_event_max_sample_rate to 30250
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (8318 >
> 8250), lowering kernel.perf_event_max_sample_rate to 24000
> May 26 13:44:07 mr-fox kernel: perf: interrupt took too long (10415 >
> 10397), lowering kernel.perf_event_max_sample_rate to 19000
> May 26 13:44:09 mr-fox kernel: perf: interrupt took too long (13048 >
> 13018), lowering kernel.perf_event_max_sample_rate to 15250
>
> --
> Toralf
>
>


* [PATCH] btrfs: zlib: do not do unnecessary page copying for compression
@ 2024-05-27  9:24  1% Qu Wenruo
  2024-05-27 16:25  1% ` Zaslonko Mikhail
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-27  9:24 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-s390

[BUG]
In function zlib_compress_folios(), we handle the input by:

- If there are multiple pages left
  We copy the page content into workspace->buf, and use workspace->buf
  as input for compression.

  But on x86_64 (which doesn't support dfltcc), that buffer size is just
  one page, so we're wasting our CPU time copying the page for no
  benefit.

- If there is only one page left
  We use the mapped page address as input for compression.

The problem is, this means we will copy the whole input range except the
last page (up to 124K of copying), without much obvious benefit.

Meanwhile the cost is pretty obvious.
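
For concreteness, a back-of-the-envelope sketch of the worst case (the
128K compression range and 4K page size are the usual values, not taken
from this patch):

	#include <stdio.h>

	#define PAGE_SZ		4096UL
	#define MAX_RANGE	(128 * 1024UL)	/* btrfs compresses at most 128K at a time */

	int main(void)
	{
		/* every page except the last one is copied into workspace->buf */
		unsigned long copied = MAX_RANGE - PAGE_SZ;

		printf("worst-case copy per range: %lu bytes (%luK)\n",
		       copied, copied / 1024);	/* 126976 bytes (124K) */
		return 0;
	}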

[POSSIBLE REASON]
The possible reason may be related to the support of S390 hardware zlib
compression acceleration.

As we allocate 4 pages (4 * 4K) as workspace input buffer just for s390.

[FIX]
I checked the dfltcc code, and there seems to be no requirement on the
input buffer size.
The function dfltcc_can_deflate() only checks:

- If the compression settings are supported
  Only level/w_bits/strategy/level_mask is checked.

- If the hardware supports it

No mention at all of the input buffer size, thus I believe there is no
need to waste time doing the page copying.

Maybe the hardware acceleration is so good on s390 that it can offset
the page copying cost, but it's definitely a penalty for non-s390
systems.

So fix the problem by:

- Use the same buffer size
  No matter if dfltcc support is enabled or not

- Always use page address as input

Cc: linux-s390@vger.kernel.org
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/zlib.c | 67 +++++++++++--------------------------------------
 1 file changed, 14 insertions(+), 53 deletions(-)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index d9e5c88a0f85..9c88a841a060 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -65,21 +65,8 @@ struct list_head *zlib_alloc_workspace(unsigned int level)
 			zlib_inflate_workspacesize());
 	workspace->strm.workspace = kvzalloc(workspacesize, GFP_KERNEL | __GFP_NOWARN);
 	workspace->level = level;
-	workspace->buf = NULL;
-	/*
-	 * In case of s390 zlib hardware support, allocate lager workspace
-	 * buffer. If allocator fails, fall back to a single page buffer.
-	 */
-	if (zlib_deflate_dfltcc_enabled()) {
-		workspace->buf = kmalloc(ZLIB_DFLTCC_BUF_SIZE,
-					 __GFP_NOMEMALLOC | __GFP_NORETRY |
-					 __GFP_NOWARN | GFP_NOIO);
-		workspace->buf_size = ZLIB_DFLTCC_BUF_SIZE;
-	}
-	if (!workspace->buf) {
-		workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-		workspace->buf_size = PAGE_SIZE;
-	}
+	workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	workspace->buf_size = PAGE_SIZE;
 	if (!workspace->strm.workspace || !workspace->buf)
 		goto fail;
 
@@ -103,7 +90,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 	struct folio *in_folio = NULL;
 	struct folio *out_folio = NULL;
 	unsigned long bytes_left;
-	unsigned int in_buf_folios;
 	unsigned long len = *total_out;
 	unsigned long nr_dest_folios = *out_folios;
 	const unsigned long max_out = nr_dest_folios * PAGE_SIZE;
@@ -130,7 +116,6 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 	folios[0] = out_folio;
 	nr_folios = 1;
 
-	workspace->strm.next_in = workspace->buf;
 	workspace->strm.avail_in = 0;
 	workspace->strm.next_out = cfolio_out;
 	workspace->strm.avail_out = PAGE_SIZE;
@@ -142,43 +127,19 @@ int zlib_compress_folios(struct list_head *ws, struct address_space *mapping,
 		 */
 		if (workspace->strm.avail_in == 0) {
 			bytes_left = len - workspace->strm.total_in;
-			in_buf_folios = min(DIV_ROUND_UP(bytes_left, PAGE_SIZE),
-					    workspace->buf_size / PAGE_SIZE);
-			if (in_buf_folios > 1) {
-				int i;
-
-				for (i = 0; i < in_buf_folios; i++) {
-					if (data_in) {
-						kunmap_local(data_in);
-						folio_put(in_folio);
-						data_in = NULL;
-					}
-					ret = btrfs_compress_filemap_get_folio(mapping,
-							start, &in_folio);
-					if (ret < 0)
-						goto out;
-					data_in = kmap_local_folio(in_folio, 0);
-					copy_page(workspace->buf + i * PAGE_SIZE,
-						  data_in);
-					start += PAGE_SIZE;
-				}
-				workspace->strm.next_in = workspace->buf;
-			} else {
-				if (data_in) {
-					kunmap_local(data_in);
-					folio_put(in_folio);
-					data_in = NULL;
-				}
-				ret = btrfs_compress_filemap_get_folio(mapping,
-						start, &in_folio);
-				if (ret < 0)
-					goto out;
-				data_in = kmap_local_folio(in_folio, 0);
-				start += PAGE_SIZE;
-				workspace->strm.next_in = data_in;
+			if (data_in) {
+				kunmap_local(data_in);
+				folio_put(in_folio);
+				data_in = NULL;
 			}
-			workspace->strm.avail_in = min(bytes_left,
-						       (unsigned long) workspace->buf_size);
+			ret = btrfs_compress_filemap_get_folio(mapping,
+					start, &in_folio);
+			if (ret < 0)
+				goto out;
+			data_in = kmap_local_folio(in_folio, 0);
+			start += PAGE_SIZE;
+			workspace->strm.next_in = data_in;
+			workspace->strm.avail_in = min(bytes_left, PAGE_SIZE);
 		}
 
 		ret = zlib_deflate(&workspace->strm, Z_SYNC_FLUSH);
-- 
2.45.1



* [RESEND PATCH 4/4] crash: Remove duplicate included header
  @ 2024-05-26 21:23  1% ` Thorsten Blum
  0 siblings, 0 replies; 200+ results
From: Thorsten Blum @ 2024-05-26 21:23 UTC (permalink / raw)
  To: bhe
  Cc: amir73il, clm, dhowells, dsterba, dyoung, jlayton, josef, kexec,
	linux-btrfs, linux-fsdevel, linux-kernel, linux-unionfs, miklos,
	netfs, thorsten.blum, vgoyal

Remove duplicate included header file linux/kexec.h

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Acked-by: Baoquan He <bhe@redhat.com>
---
 kernel/crash_reserve.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index 5b2722a93a48..d3b4cd12bdd1 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -13,7 +13,6 @@
 #include <linux/memory.h>
 #include <linux/cpuhotplug.h>
 #include <linux/memblock.h>
-#include <linux/kexec.h>
 #include <linux/kmemleak.h>
 
 #include <asm/page.h>
-- 
2.45.1



* [RESEND PATCH 2/4] fscache: Remove duplicate included header
  @ 2024-05-26 21:21  1% ` Thorsten Blum
  0 siblings, 0 replies; 200+ results
From: Thorsten Blum @ 2024-05-26 21:21 UTC (permalink / raw)
  To: thorsten.blum
  Cc: amir73il, bhe, clm, dhowells, dsterba, dyoung, jlayton, josef,
	kexec, linux-btrfs, linux-fsdevel, linux-kernel, linux-unionfs,
	miklos, netfs, vgoyal

Remove duplicate included header file linux/uio.h

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
---
 fs/netfs/fscache_io.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/netfs/fscache_io.c b/fs/netfs/fscache_io.c
index 38637e5c9b57..b1722a82c03d 100644
--- a/fs/netfs/fscache_io.c
+++ b/fs/netfs/fscache_io.c
@@ -9,7 +9,6 @@
 #include <linux/uio.h>
 #include <linux/bvec.h>
 #include <linux/slab.h>
-#include <linux/uio.h>
 #include "internal.h"
 
 /**
-- 
2.45.1



* Re: [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
  2024-05-24 20:58  1% ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
@ 2024-05-26 11:47  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-26 11:47 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: fstests, linux-btrfs, kernel-team

On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> This is a regression test for a Btrfs bug, but there's nothing
> Btrfs-specific about it. Since it's a race, we just try to make the race
> happen in a loop and pass if it doesn't crash after all of our attempts.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Changes from v1 [1]:
>
> - Added missing groups and requires.
> - Simplified $XFS_IO_PROG calls.
> - Removed -i flag from $XFS_IO_PROG to make race reproduce more
>   reliably.
> - Removed all of the file creation and dump-tree parsing since the only
>   file on a fresh filesystem is guaranteed to be at the end of a leaf
>   anyways.
> - Rewrote to be a generic test.
>
> 1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/
>
>  tests/generic/745     | 44 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/745.out |  2 ++
>  2 files changed, 46 insertions(+)
>  create mode 100755 tests/generic/745
>  create mode 100644 tests/generic/745.out
>
> diff --git a/tests/generic/745 b/tests/generic/745
> new file mode 100755
> index 00000000..925adba9
> --- /dev/null
> +++ b/tests/generic/745

Btw, generic/745 already exists in the for-next branch (development is
based against that branch nowadays).

> @@ -0,0 +1,44 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) Meta Platforms, Inc. and affiliates.
> +#
> +# FS QA Test 745
> +#
> +# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
> +# prealloc while extending i_size, then fdatasync. This is a regression test
> +# for a Btrfs crash.
> +#
> +. ./common/preamble
> +. ./common/attr
> +_begin_fstest auto quick log preallocrw dangerous
> +
> +_supported_fs generic
> +_require_scratch
> +_require_attrs
> +_require_xfs_io_command falloc -k

Since this is now a generic test and we're using direct IO, also:

_require_odirect

> +_fixed_by_kernel_commit XXXXXXXXXXXX \
> +       "btrfs: fix crash on racing fsync and size-extending write into prealloc"

Because it's now a generic test, it should be:

[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit ....

Otherwise it looks good to me, so with that:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +
> +# -i slows down xfs_io startup and makes the race much less reliable.
> +export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
> +
> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +_scratch_mount
> +
> +blksz=$(_get_block_size "$SCRATCH_MNT")
> +
> +# On Btrfs, since this is the only file on the filesystem, its metadata is at
> +# the end of a B-tree leaf. We want an ordered extent completion to add an
> +# extent item at the end of the leaf while we're logging prealloc extents
> +# beyond i_size after an xattr was set.
> +for ((i = 0; i < 5000; i++)); do
> +       $XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
> +       $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
> +       $XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
> +done
> +
> +# If it didn't crash, we're good.
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/generic/745.out b/tests/generic/745.out
> new file mode 100644
> index 00000000..fce6b7f5
> --- /dev/null
> +++ b/tests/generic/745.out
> @@ -0,0 +1,2 @@
> +QA output created by 745
> +Silence is golden
> --
> 2.45.1
>
>


* Re: [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc
  2024-05-24 20:58  1% [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Omar Sandoval
  2024-05-24 20:58  1% ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
@ 2024-05-26 11:41  1% ` Filipe Manana
  1 sibling, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-26 11:41 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Fri, May 24, 2024 at 9:58 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We have been seeing crashes on duplicate keys in
> btrfs_set_item_key_safe():
>
>   BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
>   ------------[ cut here ]------------
>   kernel BUG at fs/btrfs/ctree.c:2620!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
>   RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
>
> With the following stack trace:
>
>   #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
>   #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
>   #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
>   #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
>   #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
>   #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
>   #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
>   #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
>   #8  vfs_fsync_range (fs/sync.c:188:9)
>   #9  vfs_fsync (fs/sync.c:202:9)
>   #10 do_fsync (fs/sync.c:212:9)
>   #11 __do_sys_fdatasync (fs/sync.c:225:9)
>   #12 __se_sys_fdatasync (fs/sync.c:223:1)
>   #13 __x64_sys_fdatasync (fs/sync.c:223:1)
>   #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
>   #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
>   #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
>
> So we're logging a changed extent from fsync, which is splitting an
> extent in the log tree. But this split part already exists in the tree,
> triggering the BUG().
>
> This is the state of the log tree at the time of the crash, dumped with
> drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
> to get more details than btrfs_print_leaf() gives us:
>
>   >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
>   leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
>   leaf 33439744 flags 0x100000000000000
>   fs uuid e5bd3946-400c-4223-8923-190ef1f18677
>   chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
>           item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
>                   generation 7 transid 9 size 8192 nbytes 8473563889606862198
>                   block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                   sequence 204 flags 0x10(PREALLOC)
>                   atime 1716417703.220000000 (2024-05-22 15:41:43)
>                   ctime 1716417704.983333333 (2024-05-22 15:41:44)
>                   mtime 1716417704.983333333 (2024-05-22 15:41:44)
>                   otime 17592186044416.000000000 (559444-03-08 01:40:16)
>           item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
>                   index 195 namelen 3 name: 193
>           item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
>                   location key (0 UNKNOWN.0 0) type XATTR
>                   transid 7 data_len 1 name_len 6
>                   name: user.a
>                   data a
>           item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
>                   generation 9 type 1 (regular)
>                   extent data disk byte 303144960 nr 12288
>                   extent data offset 0 nr 4096 ram 12288
>                   extent compression 0 (none)
>           item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 4096 nr 8192
>           item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 8192 nr 4096
>   ...
>
> So the real problem happened earlier: notice that items 4 (4k-12k) and 5
> (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
> item 5 starts at i_size.
>
> Here is the state of the filesystem tree at the time of the crash:
>
>   >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
>   >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
>   >>> print_extent_buffer(nodes[0])
>   leaf 30425088 level 0 items 184 generation 9 owner 5
>   leaf 30425088 flags 0x100000000000000
>   fs uuid e5bd3946-400c-4223-8923-190ef1f18677
>   chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
>         ...
>           item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
>                   generation 7 transid 7 size 4096 nbytes 12288
>                   block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                   sequence 6 flags 0x10(PREALLOC)
>                   atime 1716417703.220000000 (2024-05-22 15:41:43)
>                   ctime 1716417703.220000000 (2024-05-22 15:41:43)
>                   mtime 1716417703.220000000 (2024-05-22 15:41:43)
>                   otime 1716417703.220000000 (2024-05-22 15:41:43)
>           item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
>                   index 195 namelen 3 name: 193
>           item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
>                   location key (0 UNKNOWN.0 0) type XATTR
>                   transid 7 data_len 1 name_len 6
>                   name: user.a
>                   data a
>           item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
>                   generation 9 type 1 (regular)
>                   extent data disk byte 303144960 nr 12288
>                   extent data offset 0 nr 8192 ram 12288
>                   extent compression 0 (none)
>           item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 8192 nr 4096
>
> Item 5 in the log tree corresponds to item 183 in the filesystem tree,
> but nothing matches item 4. Furthermore, item 183 is the last item in
> the leaf.
>
> btrfs_log_prealloc_extents() is responsible for logging prealloc extents
> beyond i_size. It first truncates any previously logged prealloc extents
> that start beyond i_size. Then, it walks the filesystem tree and copies
> the prealloc extent items to the log tree.
>
> If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
> unlocks the tree and does another search. However, while the filesystem
> tree is unlocked, an ordered extent completion may modify the tree. In
> particular, it may insert an extent item that overlaps with an extent
> item that was already copied to the log tree.
>
> This may manifest in several ways depending on the exact scenario,
> including an EEXIST error that is silently translated to a full sync,
> overlapping items in the log tree, or this crash. This particular crash
> is triggered by the following sequence of events:
>
> - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
>   prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
>   the last item in its B-tree leaf.
> - The file is fsync'd, which copies its inode item and both extent items
>   to the log tree.
> - An xattr is set on the file, which sets the
>   BTRFS_INODE_COPY_EVERYTHING flag.
> - The range 4k-8k in the file is written using direct I/O. i_size is
>   extended to 8k, but the ordered extent is still in flight.
> - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
>   calls copy_inode_items_to_log(), which calls
>   btrfs_log_prealloc_extents().
> - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
>   filesystem tree. Since it starts before i_size, it skips it. Since it
>   is the last item in its B-tree leaf, it calls btrfs_next_leaf().
> - btrfs_next_leaf() unlocks the path.
> - The ordered extent completion runs, which converts the 4k-8k part of
>   the prealloc extent to written and inserts the remaining prealloc part
>   from 8k-12k.
> - btrfs_next_leaf() does a search and finds the new prealloc extent
>   8k-12k.
> - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
>   the log tree. Note that it overlaps with the 4k-12k prealloc extent
>   that was copied to the log tree by the first fsync.
> - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
>   extent that was written.
> - This tries to drop the range 4k-8k in the log tree, which requires
>   adjusting the start of the 4k-12k prealloc extent in the log tree to
>   8k.
> - btrfs_set_item_key_safe() sees that there is already an extent
>   starting at 8k in the log tree and calls BUG().
>
> Fix this by detecting when we're about to insert an overlapping file
> extent item in the log tree and truncating the part that would overlap.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>

Perfect, thanks!

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> ---
> Changes from v1 [1]:
>
> - Change commit subject to not mention direct I/O since this can also
>   happen with buffered I/O.
> - Reformat min() call to be on one line.
> - Use btrfs_file_extent_end().
> - Rebase on for-next.
>
> 1: https://lore.kernel.org/linux-btrfs/101430650a35b55b7a32d895fd292226d13346eb.1716486455.git.osandov@fb.com/
>
>  fs/btrfs/tree-log.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 5146387b416b..26a2e5aa08e9 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4860,18 +4860,23 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
>                         path->slots[0]++;
>                         continue;
>                 }
> -               if (!dropped_extents) {
> -                       /*
> -                        * Avoid logging extent items logged in past fsync calls
> -                        * and leading to duplicate keys in the log tree.
> -                        */
> +               /*
> +                * Avoid overlapping items in the log tree. The first time we
> +                * get here, get rid of everything from a past fsync. After
> +                * that, if the current extent starts before the end of the last
> +                * extent we copied, truncate the last one. This can happen if
> +                * an ordered extent completion modifies the subvolume tree
> +                * while btrfs_next_leaf() has the tree unlocked.
> +                */
> +               if (!dropped_extents || key.offset < truncate_offset) {
>                         ret = truncate_inode_items(trans, root->log_root, inode,
> -                                                  truncate_offset,
> +                                                  min(key.offset, truncate_offset),
>                                                    BTRFS_EXTENT_DATA_KEY);
>                         if (ret)
>                                 goto out;
>                         dropped_extents = true;
>                 }
> +               truncate_offset = btrfs_file_extent_end(path);
>                 if (ins_nr == 0)
>                         start_slot = slot;
>                 ins_nr++;
> --
> 2.45.1
>
>


* [PATCH] btrfs-progs: convert: Add 64 bit block numbers support
@ 2024-05-25 10:31  1% Srivathsa Dara
  0 siblings, 0 replies; 200+ results
From: Srivathsa Dara @ 2024-05-25 10:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: rajesh.sivaramasubramaniom, junxiao.bi, clm, josef, dsterba

In ext4, the number of blocks can be greater than 2^32. Therefore, if
btrfs-convert is used on filesystems larger than 16TiB (starting from
16TiB, the number of blocks overflows 32 bits), it fails to convert.

Fix it by using 64-bit block numbers.
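
A quick standalone sketch of where the 16TiB threshold comes from
(assuming the common 4KiB block size):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t block_size = 4096;		/* common ext4 block size */
		uint64_t blocks = UINT64_C(1) << 32;	/* first count that needs more than 32 bits */

		/* 2^32 blocks * 4KiB = 2^44 bytes = 16TiB */
		printf("32-bit block-number limit: %llu bytes\n",
		       (unsigned long long)(blocks * block_size));
		return 0;
	}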

Signed-off-by: Srivathsa Dara <srivathsa.d.dara@oracle.com>
---
 convert/source-ext2.c | 6 +++---
 convert/source-ext2.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/convert/source-ext2.c b/convert/source-ext2.c
index 2186b252..afa48606 100644
--- a/convert/source-ext2.c
+++ b/convert/source-ext2.c
@@ -288,8 +288,8 @@ error:
 	return -1;
 }
 
-static int ext2_block_iterate_proc(ext2_filsys fs, blk_t *blocknr,
-			        e2_blkcnt_t blockcnt, blk_t ref_block,
+static int ext2_block_iterate_proc(ext2_filsys fs, blk64_t *blocknr,
+			        e2_blkcnt_t blockcnt, blk64_t ref_block,
 			        int ref_offset, void *priv_data)
 {
 	int ret;
@@ -323,7 +323,7 @@ static int ext2_create_file_extents(struct btrfs_trans_handle *trans,
 	init_blk_iterate_data(&data, trans, root, btrfs_inode, objectid,
 			convert_flags & CONVERT_FLAG_DATACSUM);
 
-	err = ext2fs_block_iterate2(ext2_fs, ext2_ino, BLOCK_FLAG_DATA_ONLY,
+	err = ext2fs_block_iterate3(ext2_fs, ext2_ino, BLOCK_FLAG_DATA_ONLY,
 				    NULL, ext2_block_iterate_proc, &data);
 	if (err)
 		goto error;
diff --git a/convert/source-ext2.h b/convert/source-ext2.h
index d204aac5..73c39e23 100644
--- a/convert/source-ext2.h
+++ b/convert/source-ext2.h
@@ -46,7 +46,7 @@ struct btrfs_trans_handle;
 #define ext2fs_get_block_bitmap_range2 ext2fs_get_block_bitmap_range
 #define ext2fs_inode_data_blocks2 ext2fs_inode_data_blocks
 #define ext2fs_read_ext_attr2 ext2fs_read_ext_attr
-#define ext2fs_blocks_count(s)		((s)->s_blocks_count)
+#define ext2fs_blocks_count(s)		(((blk64_t)(s)->s_blocks_count_hi << 32) | (s)->s_blocks_count)
 #define EXT2FS_CLUSTER_RATIO(fs)	(1)
 #define EXT2_CLUSTERS_PER_GROUP(s)	(EXT2_BLOCKS_PER_GROUP(s))
 #define EXT2FS_B2C(fs, blk)		(blk)
-- 
2.39.3



* [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash
  2024-05-24 20:58  1% [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Omar Sandoval
@ 2024-05-24 20:58  1% ` Omar Sandoval
  2024-05-26 11:47  1%   ` Filipe Manana
  2024-05-26 11:41  1% ` [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Filipe Manana
  1 sibling, 1 reply; 200+ results
From: Omar Sandoval @ 2024-05-24 20:58 UTC (permalink / raw)
  To: fstests, linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

This is a regression test for a Btrfs bug, but there's nothing
Btrfs-specific about it. Since it's a race, we just try to make the race
happen in a loop and pass if it doesn't crash after all of our attempts.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1 [1]:

- Added missing groups and requires.
- Simplified $XFS_IO_PROG calls.
- Removed -i flag from $XFS_IO_PROG to make race reproduce more
  reliably.
- Removed all of the file creation and dump-tree parsing since the only
  file on a fresh filesystem is guaranteed to be at the end of a leaf
  anyways.
- Rewrote to be a generic test.

1: https://lore.kernel.org/linux-btrfs/297da2b53ce9b697d82d89afd322b2cc0d0f392d.1716492850.git.osandov@osandov.com/

 tests/generic/745     | 44 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/745.out |  2 ++
 2 files changed, 46 insertions(+)
 create mode 100755 tests/generic/745
 create mode 100644 tests/generic/745.out

diff --git a/tests/generic/745 b/tests/generic/745
new file mode 100755
index 00000000..925adba9
--- /dev/null
+++ b/tests/generic/745
@@ -0,0 +1,44 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+# FS QA Test 745
+#
+# Repeatedly prealloc beyond i_size, set an xattr, direct write into the
+# prealloc while extending i_size, then fdatasync. This is a regression test
+# for a Btrfs crash.
+#
+. ./common/preamble
+. ./common/attr
+_begin_fstest auto quick log preallocrw dangerous
+
+_supported_fs generic
+_require_scratch
+_require_attrs
+_require_xfs_io_command falloc -k
+_fixed_by_kernel_commit XXXXXXXXXXXX \
+	"btrfs: fix crash on racing fsync and size-extending write into prealloc"
+
+# -i slows down xfs_io startup and makes the race much less reliable.
+export XFS_IO_PROG="$(echo "$XFS_IO_PROG" | sed 's/ -i\b//')"
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount
+
+blksz=$(_get_block_size "$SCRATCH_MNT")
+
+# On Btrfs, since this is the only file on the filesystem, its metadata is at
+# the end of a B-tree leaf. We want an ordered extent completion to add an
+# extent item at the end of the leaf while we're logging prealloc extents
+# beyond i_size after an xattr was set.
+for ((i = 0; i < 5000; i++)); do
+	$XFS_IO_PROG -ftd -c "falloc -k 0 $((blksz * 3))" -c "pwrite -q -w 0 $blksz" "$SCRATCH_MNT/file"
+	$SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/file"
+	$XFS_IO_PROG -d -c "pwrite -q -w $blksz $blksz" "$SCRATCH_MNT/file"
+done
+
+# If it didn't crash, we're good.
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/745.out b/tests/generic/745.out
new file mode 100644
index 00000000..fce6b7f5
--- /dev/null
+++ b/tests/generic/745.out
@@ -0,0 +1,2 @@
+QA output created by 745
+Silence is golden
-- 
2.45.1



* [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc
@ 2024-05-24 20:58  1% Omar Sandoval
  2024-05-24 20:58  1% ` [PATCH fstests v2] generic: test Btrfs fsync vs. size-extending prealloc write crash Omar Sandoval
  2024-05-26 11:41  1% ` [PATCH v2] btrfs: fix crash on racing fsync and size-extending write into prealloc Filipe Manana
  0 siblings, 2 replies; 200+ results
From: Omar Sandoval @ 2024-05-24 20:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Changes from v1 [1]:

- Change commit subject to not mention direct I/O since this can also
  happen with buffered I/O.
- Reformat min() call to be on one line.
- Use btrfs_file_extent_end().
- Rebase on for-next.

1: https://lore.kernel.org/linux-btrfs/101430650a35b55b7a32d895fd292226d13346eb.1716486455.git.osandov@fb.com/

 fs/btrfs/tree-log.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5146387b416b..26a2e5aa08e9 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4860,18 +4860,23 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
 			path->slots[0]++;
 			continue;
 		}
-		if (!dropped_extents) {
-			/*
-			 * Avoid logging extent items logged in past fsync calls
-			 * and leading to duplicate keys in the log tree.
-			 */
+		/*
+		 * Avoid overlapping items in the log tree. The first time we
+		 * get here, get rid of everything from a past fsync. After
+		 * that, if the current extent starts before the end of the last
+		 * extent we copied, truncate the last one. This can happen if
+		 * an ordered extent completion modifies the subvolume tree
+		 * while btrfs_next_leaf() has the tree unlocked.
+		 */
+		if (!dropped_extents || key.offset < truncate_offset) {
 			ret = truncate_inode_items(trans, root->log_root, inode,
-						   truncate_offset,
+						   min(key.offset, truncate_offset),
 						   BTRFS_EXTENT_DATA_KEY);
 			if (ret)
 				goto out;
 			dropped_extents = true;
 		}
+		truncate_offset = btrfs_file_extent_end(path);
 		if (ins_nr == 0)
 			start_slot = slot;
 		ins_nr++;
-- 
2.45.1



* [PATCH v5 3/3] btrfs: reserve new relocation block-group after successful relocation
  2024-05-24 16:29  1% [PATCH v5 0/3] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 1/3] btrfs: don't try to relocate the data relocation block-group Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 2/3] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
@ 2024-05-24 16:29  1% ` Johannes Thumshirn
  2 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-24 16:29 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

After we've committed a relocation transaction, we know we have just freed
up space. Set it as hint for the next relocation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/relocation.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 39e2db9af64f..29d235003ff1 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3811,6 +3811,13 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	ret = btrfs_commit_transaction(trans);
 	if (ret && !err)
 		err = ret;
+
+	/*
+	 * We know we have just freed space, set it as hint for the
+	 * next relocation.
+	 */
+	if (!err)
+		btrfs_reserve_relocation_bg(fs_info);
 out_free:
 	ret = clean_dirty_subvols(rc);
 	if (ret < 0 && !err)

-- 
2.43.0



* [PATCH v5 2/3] btrfs: zoned: reserve relocation block-group on mount
  2024-05-24 16:29  1% [PATCH v5 0/3] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 1/3] btrfs: don't try to relocate the data relocation block-group Johannes Thumshirn
@ 2024-05-24 16:29  1% ` Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 3/3] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  2 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-24 16:29 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reserve one zone as a data relocation target on each mount. If we already
find one empty block group, there's no need to force a chunk allocation,
but we can use this empty data block group as our relocation target.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c |  9 +++++++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/zoned.c       | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  3 +++
 4 files changed, 82 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9a01bbad45f6..167ded78af89 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1500,6 +1500,15 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 			btrfs_put_block_group(block_group);
 			continue;
 		}
+
+		spin_lock(&fs_info->relocation_bg_lock);
+		if (block_group->start == fs_info->data_reloc_bg) {
+			btrfs_put_block_group(block_group);
+			spin_unlock(&fs_info->relocation_bg_lock);
+			continue;
+		}
+		spin_unlock(&fs_info->relocation_bg_lock);
+
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 78d3966232ae..16bb52bcb69e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3547,6 +3547,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	}
 	btrfs_discard_resume(fs_info);
 
+	btrfs_reserve_relocation_bg(fs_info);
+
 	if (fs_info->uuid_root &&
 	    (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
 	     fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c52a0063f7db..f4962935efef 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -17,6 +17,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -2637,3 +2638,70 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
 	}
 	spin_unlock(&fs_info->zone_active_bgs_lock);
 }
+
+static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
+{
+	struct btrfs_block_group *bg;
+
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		list_for_each_entry(bg, &sinfo->block_groups[i], list) {
+			if (bg->flags != flags)
+				continue;
+			if (bg->used == 0)
+				return bg->start;
+		}
+	}
+
+	return 0;
+}
+
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group *bg;
+	u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
+	u64 bytenr = 0;
+
+	lockdep_assert_not_held(&fs_info->relocation_bg_lock);
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	if (fs_info->data_reloc_bg)
+		return;
+
+	bytenr = find_empty_block_group(sinfo, flags);
+	if (!bytenr) {
+		int ret;
+
+		trans = btrfs_join_transaction(tree_root);
+		if (IS_ERR(trans))
+			return;
+
+		ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
+		btrfs_end_transaction(trans);
+		if (ret)
+			return;
+
+		bytenr = find_empty_block_group(sinfo, flags);
+		if (!bytenr)
+			return;
+
+	}
+
+	bg = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg)
+		return;
+
+	if (!btrfs_zone_activate(bg))
+		bytenr = 0;
+
+	btrfs_put_block_group(bg);
+
+	spin_lock(&fs_info->relocation_bg_lock);
+	if (!fs_info->data_reloc_bg)
+		fs_info->data_reloc_bg = bytenr;
+	spin_unlock(&fs_info->relocation_bg_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index ff605beb84ef..56c1c19d52bc 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -95,6 +95,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 				struct btrfs_space_info *space_info, bool do_finish);
 void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
@@ -264,6 +265,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 
 static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
 
+static inline void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

-- 
2.43.0



* [PATCH v5 1/3] btrfs: don't try to relocate the data relocation block-group
  2024-05-24 16:29  1% [PATCH v5 0/3] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
@ 2024-05-24 16:29  1% ` Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 2/3] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 3/3] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  2 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-24 16:29 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

When relocating block-groups, either via auto reclaim or manual
balance, skip the data relocation block-group.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c | 2 ++
 fs/btrfs/relocation.c  | 7 +++++++
 fs/btrfs/volumes.c     | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9910bae89966..9a01bbad45f6 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1921,6 +1921,8 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 				div64_u64(zone_unusable * 100, bg->length));
 		trace_btrfs_reclaim_block_group(bg);
 		ret = btrfs_relocate_chunk(fs_info, bg->start);
+		if (ret == -EBUSY)
+			ret = 0;
 		if (ret) {
 			btrfs_dec_block_group_ro(bg);
 			btrfs_err(fs_info, "error relocating chunk %llu",
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 5f1a909a1d91..39e2db9af64f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4037,6 +4037,13 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 	int rw = 0;
 	int err = 0;
 
+	spin_lock(&fs_info->relocation_bg_lock);
+	if (group_start == fs_info->data_reloc_bg) {
+		spin_unlock(&fs_info->relocation_bg_lock);
+		return -EBUSY;
+	}
+	spin_unlock(&fs_info->relocation_bg_lock);
+
 	/*
 	 * This only gets set if we had a half-deleted snapshot on mount.  We
 	 * cannot allow relocation to start while we're still trying to clean up
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3f70f727dacf..75da3a32885b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3367,6 +3367,8 @@ int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
 	btrfs_scrub_pause(fs_info);
 	ret = btrfs_relocate_block_group(fs_info, chunk_offset);
 	btrfs_scrub_continue(fs_info);
+	if (ret == -EBUSY)
+		return 0;
 	if (ret) {
 		/*
 		 * If we had a transaction abort, stop all running scrubs.

-- 
2.43.0



* [PATCH v5 0/3] btrfs: zoned: always set aside a zone for relocation
@ 2024-05-24 16:29  1% Johannes Thumshirn
  2024-05-24 16:29  1% ` [PATCH v5 1/3] btrfs: don't try to relocate the data relocation block-group Johannes Thumshirn
                   ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-24 16:29 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

For zoned filesystems we heavily rely on relocation for garbage collection,
as we cannot do any in-place updates of disk blocks.

But there can be situations where we're running out of space for doing the
relocation.

To solve this, always have a zone reserved for relocation.

This is a subset of another approach to this problem I've submitted in
https://lore.kernel.org/r/20240328-hans-v1-0-4cd558959407@kernel.org

---
Changes in v5:
- Split out one patch to skip relocation of the data relocation bg
- Link to v4: https://lore.kernel.org/r/20240523-zoned-gc-v4-0-23ed9f61afa0@kernel.org

Changes in v4:
- Skip data_reloc_bg in delete_unused_bgs() and reclaim_bgs_work()
- Link to v3: https://lore.kernel.org/r/20240521-zoned-gc-v3-0-7db9742454c7@kernel.org

Changes in v3:
- Rename btrfs_reserve_relocation_zone -> btrfs_reserve_relocation_bg
- Bail out if we already have a relocation bg set
- Link to v2: https://lore.kernel.org/r/20240515-zoned-gc-v2-0-20c7cb9763cd@kernel.org

Changes in v2:
- Incorporate Naohiro's review
- Link to v1: https://lore.kernel.org/r/20240514-zoned-gc-v1-0-109f1a6c7447@kernel.org

---
Johannes Thumshirn (3):
      btrfs: don't try to relocate the data relocation block-group
      btrfs: zoned: reserve relocation block-group on mount
      btrfs: reserve new relocation block-group after successful relocation

 fs/btrfs/block-group.c | 11 ++++++++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/relocation.c  | 14 +++++++++++
 fs/btrfs/volumes.c     |  2 ++
 fs/btrfs/zoned.c       | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  3 +++
 6 files changed, 100 insertions(+)
---
base-commit: 2aabf192868a0f6d9ee3e35f9b0a8d97c77a46da
change-id: 20240514-zoned-gc-2ce793459eb7

Best regards,
-- 
Johannes Thumshirn <jth@kernel.org>



* Re: [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-23 15:21  1% ` [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
  2024-05-24  8:31  1%   ` Naohiro Aota
@ 2024-05-24 14:07  1%   ` Filipe Manana
  1 sibling, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-24 14:07 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Naohiro Aota, Johannes Thumshirn

On Thu, May 23, 2024 at 4:32 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Reserve one zone as a data relocation target on each mount. If we already
> find one empty block group, there's no need to force a chunk allocation,
> but we can use this empty data block group as our relocation target.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-group.c | 17 +++++++++++++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/zoned.c       | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/zoned.h       |  3 +++
>  4 files changed, 89 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 9910bae89966..1195f6721c90 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1500,6 +1500,15 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>                         btrfs_put_block_group(block_group);
>                         continue;
>                 }
> +
> +               spin_lock(&fs_info->relocation_bg_lock);
> +               if (block_group->start == fs_info->data_reloc_bg) {
> +                       btrfs_put_block_group(block_group);
> +                       spin_unlock(&fs_info->relocation_bg_lock);
> +                       continue;
> +               }
> +               spin_unlock(&fs_info->relocation_bg_lock);
> +
>                 spin_unlock(&fs_info->unused_bgs_lock);
>
>                 btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> @@ -1835,6 +1844,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>                                       bg_list);
>                 list_del_init(&bg->bg_list);
>
> +               spin_lock(&fs_info->relocation_bg_lock);
> +               if (bg->start == fs_info->data_reloc_bg) {
> +                       btrfs_put_block_group(bg);
> +                       spin_unlock(&fs_info->relocation_bg_lock);
> +                       continue;
> +               }
> +               spin_unlock(&fs_info->relocation_bg_lock);

Ok, so the reclaim task and cleaner kthread will not remove the
reserved block group.

But there's nothing preventing someone running balance manually, which
will delete the block group.

E.g. block group X is empty and reserved as the data relocation bg.
The balance ioctl is invoked, it goes through all block groups for relocation.
It happens that it first finds bg X. Deletes bg X.

Now there's no more reserved bg for data relocation, and other tasks
can come in and use the freed space and fill all of it or most of it.

Shouldn't we prevent the data reloc bg from being a target of a manual
relocation too?
E.g. have btrfs_relocate_chunk() do nothing if the bg is the data reloc bg.
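
A rough (untested) sketch of what I mean, reusing the existing
fs_info->data_reloc_bg and relocation_bg_lock, e.g. at the start of
btrfs_relocate_block_group():

	spin_lock(&fs_info->relocation_bg_lock);
	if (group_start == fs_info->data_reloc_bg) {
		spin_unlock(&fs_info->relocation_bg_lock);
		return -EBUSY;
	}
	spin_unlock(&fs_info->relocation_bg_lock);

with the callers treating -EBUSY as "skip this block group" rather than
as a fatal error.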

Thanks.

> +
>                 space_info = bg->space_info;
>                 spin_unlock(&fs_info->unused_bgs_lock);
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 78d3966232ae..16bb52bcb69e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3547,6 +3547,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>         }
>         btrfs_discard_resume(fs_info);
>
> +       btrfs_reserve_relocation_bg(fs_info);
> +
>         if (fs_info->uuid_root &&
>             (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
>              fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index c52a0063f7db..d291cf4f565e 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -17,6 +17,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "bio.h"
> +#include "transaction.h"
>
>  /* Maximum number of zones to report per blkdev_report_zones() call */
>  #define BTRFS_REPORT_NR_ZONES   4096
> @@ -2637,3 +2638,69 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
>         }
>         spin_unlock(&fs_info->zone_active_bgs_lock);
>  }
> +
> +static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
> +{
> +       struct btrfs_block_group *bg;
> +
> +       for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +               list_for_each_entry(bg, &sinfo->block_groups[i], list) {
> +                       if (bg->flags != flags)
> +                               continue;
> +                       if (bg->used == 0)
> +                               return bg->start;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
> +{
> +       struct btrfs_root *tree_root = fs_info->tree_root;
> +       struct btrfs_space_info *sinfo = fs_info->data_sinfo;
> +       struct btrfs_trans_handle *trans;
> +       struct btrfs_block_group *bg;
> +       u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
> +       u64 bytenr = 0;
> +
> +       lockdep_assert_not_held(&fs_info->relocation_bg_lock);
> +
> +       if (!btrfs_is_zoned(fs_info))
> +               return;
> +
> +       if (fs_info->data_reloc_bg)
> +               return;
> +
> +       bytenr = find_empty_block_group(sinfo, flags);
> +       if (!bytenr) {
> +               int ret;
> +
> +               trans = btrfs_join_transaction(tree_root);
> +               if (IS_ERR(trans))
> +                       return;
> +
> +               ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
> +               btrfs_end_transaction(trans);
> +               if (ret)
> +                       return;
> +
> +               bytenr = find_empty_block_group(sinfo, flags);
> +               if (!bytenr)
> +                       return;
> +
> +       }
> +
> +       bg = btrfs_lookup_block_group(fs_info, bytenr);
> +       if (!bg)
> +               return;
> +
> +       if (!btrfs_zone_activate(bg))
> +               bytenr = 0;
> +
> +       btrfs_put_block_group(bg);
> +
> +       spin_lock(&fs_info->relocation_bg_lock);
> +       fs_info->data_reloc_bg = bytenr;
> +       spin_unlock(&fs_info->relocation_bg_lock);
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index ff605beb84ef..56c1c19d52bc 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -95,6 +95,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
>  int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
>                                 struct btrfs_space_info *space_info, bool do_finish);
>  void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
> +void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info);
>  #else /* CONFIG_BLK_DEV_ZONED */
>
>  static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
> @@ -264,6 +265,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
>
>  static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
>
> +static inline void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info) { }
> +
>  #endif
>
>  static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>
> --
> 2.43.0
>
>


* Re: [PATCH fstests] btrfs: add regression test for fsync vs. size-extending direct I/O into prealloc crash
  2024-05-23 19:34  1% ` [PATCH fstests] btrfs: add regression test for fsync vs. size-extending direct I/O into prealloc crash Omar Sandoval
@ 2024-05-24 13:24  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-24 13:24 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Thu, May 23, 2024 at 9:43 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> Since this is a race, we just try to make the race happen in a loop and
> pass if it doesn't crash after all of our attempts.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
>  tests/btrfs/312     | 66 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/312.out |  2 ++
>  2 files changed, 68 insertions(+)
>  create mode 100755 tests/btrfs/312
>  create mode 100644 tests/btrfs/312.out
>
> diff --git a/tests/btrfs/312 b/tests/btrfs/312
> new file mode 100755
> index 00000000..aaca0e3e
> --- /dev/null
> +++ b/tests/btrfs/312
> @@ -0,0 +1,66 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) Meta Platforms, Inc. and affiliates.
> +#
> +# FS QA Test 312
> +#
> +# Repeatedly fsync after size-extending direct I/O into a preallocated extent.
> +#
> +. ./common/preamble
> +_begin_fstest dangerous log prealloc

Can also add it to the "quick" group.
And since this is writing into a prealloc extent, the correct group is
"preallocrw".

> +
> +_supported_fs btrfs
> +_require_scratch
> +_require_btrfs_command inspect-internal dump-tree
> +_require_btrfs_command inspect-internal inode-resolve

Missing a _require_attrs because $SETFATTR_PROG is needed/used.

Also missing a:

_require_xfs_io_command falloc -k

> +_fixed_by_kernel_commit XXXXXXXXXXXX \
> +       "btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc"
> +
> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +_scratch_mount
> +
> +sectorsize=$(_scratch_btrfs_sectorsize)
> +
> +# Create a bunch of files so that we hopefully get one whose items are at the
> +# end of a leaf.
> +for ((i = 0; i < 1000; i++)); do
> +       $XFS_IO_PROG -c "open -f -d $SCRATCH_MNT/$i" -c "falloc -k 0 $((sectorsize * 3))" -c "pwrite -q 0 $sectorsize"

You can pass -d to $XFS_IO_PROG and make this a bit shorter:

$XFS_IO_PROG -f -d -c "falloc -k 0 $((sectorsize * 3))" -c "pwrite -q
0 $sectorsize"  "$SCRATCH_MNT/$i"

> +       $SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/$i"
> +done
> +touch "$SCRATCH_MNT/$i"

Why is this touch needed here?
I can trigger the bug without it, which is confusing; if it's really
needed then please add a comment explaining why.

So now this works with the default leaf size of 16K on x86.
But what about the case where we run with MKFS_OPTIONS="-n 64K", or on a PPC
machine where the default leaf/node size is 64K?

The 1000 iterations aren't enough, so I would suggest making the test
always create a fs with a 64K node size and adjusting the iterations
to be enough to trigger the bug.

> +
> +_scratch_unmount
> +
> +ino=$($BTRFS_UTIL_PROG inspect-internal dump-tree "$SCRATCH_DEV" -t 5 |
> +      $AWK_PROG -v sectorsize="$sectorsize" '
> +match($0, /^leaf [0-9]+ items ([0-9]+)/, arr) {
> +       nritems = arr[1]
> +}
> +match($0, /item ([0-9]+) key \(([0-9]+) EXTENT_DATA ([0-9]+)\)/, arr) {
> +       if (arr[1] == nritems - 1 && arr[3] == sectorsize) {
> +               print arr[2]
> +               exit
> +       }
> +}
> +')
> +
> +if [ -z "$ino" ]; then
> +       _fail "Extent at end of leaf not found"
> +fi
> +
> +_scratch_mount
> +path=$($BTRFS_UTIL_PROG inspect-internal inode-resolve "$ino" "$SCRATCH_MNT")
> +
> +# Try repeatedly to reproduce the race of an ordered extent finishing while
> +# we're logging prealloc extents beyond i_size.
> +for ((i = 0; i < 1000; i++)); do
> +       $XFS_IO_PROG -c "open -t -d $path" -c "falloc -k 0 $((sectorsize * 3))" -c "pwrite -q -w 0 $sectorsize"
> +       $SETFATTR_PROG -n user.a -v a "$path"
> +       $XFS_IO_PROG -c "open -d $path" -c "pwrite -q -w $sectorsize $sectorsize" || exit 1

the || exit 1 is odd here.
Normally we do || _fail "some reason" or just don't do anything at
all, as a golden output mismatch due to an error will make the test
fail.

Also don't forget to cc the fstests mailing list.

Thanks.

> +done
> +
> +# If it didn't crash, we're good.
> +
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/312.out b/tests/btrfs/312.out
> new file mode 100644
> index 00000000..6e72aa94
> --- /dev/null
> +++ b/tests/btrfs/312.out
> @@ -0,0 +1,2 @@
> +QA output created by 312
> +Silence is golden
> --
> 2.45.1
>
>


* Re: [PATCH] btrfs/741: add commit ID in _fixed_by_kernel_commit
  2024-05-24  4:26  1% [PATCH] btrfs/741: add commit ID in _fixed_by_kernel_commit Anand Jain
@ 2024-05-24 13:17  1% ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-24 13:17 UTC (permalink / raw)
  To: Anand Jain; +Cc: fstests, linux-btrfs

On Fri, May 24, 2024 at 12:26:59PM +0800, Anand Jain wrote:
> Now that the kernel patch is merged in v6.9, replace the placeholder with
> the actual commit ID.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Reviewed-by: David Sterba <dsterba@suse.com>


* Re: [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc
  2024-05-23 19:34  1% [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc Omar Sandoval
  2024-05-23 19:34  1% ` [PATCH fstests] btrfs: add regression test for fsync vs. size-extending direct I/O into prealloc crash Omar Sandoval
@ 2024-05-24 13:05  1% ` Filipe Manana
  1 sibling, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-24 13:05 UTC (permalink / raw)
  To: Omar Sandoval; +Cc: linux-btrfs, kernel-team

On Thu, May 23, 2024 at 8:34 PM Omar Sandoval <osandov@osandov.com> wrote:
>
> From: Omar Sandoval <osandov@fb.com>
>
> We have been seeing crashes on duplicate keys in
> btrfs_set_item_key_safe():
>
>   BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
>   ------------[ cut here ]------------
>   kernel BUG at fs/btrfs/ctree.c:2620!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
>   RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
>
> With the following stack trace:
>
>   #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
>   #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
>   #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
>   #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
>   #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
>   #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
>   #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
>   #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
>   #8  vfs_fsync_range (fs/sync.c:188:9)
>   #9  vfs_fsync (fs/sync.c:202:9)
>   #10 do_fsync (fs/sync.c:212:9)
>   #11 __do_sys_fdatasync (fs/sync.c:225:9)
>   #12 __se_sys_fdatasync (fs/sync.c:223:1)
>   #13 __x64_sys_fdatasync (fs/sync.c:223:1)
>   #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
>   #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
>   #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
>
> So we're logging a changed extent from fsync, which is splitting an
> extent in the log tree. But this split part already exists in the tree,
> triggering the BUG().
>
> This is the state of the log tree at the time of the crash, dumped with
> drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
> to get more details than btrfs_print_leaf() gives us:
>
>   >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
>   leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
>   leaf 33439744 flags 0x100000000000000
>   fs uuid e5bd3946-400c-4223-8923-190ef1f18677
>   chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
>           item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
>                   generation 7 transid 9 size 8192 nbytes 8473563889606862198
>                   block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                   sequence 204 flags 0x10(PREALLOC)
>                   atime 1716417703.220000000 (2024-05-22 15:41:43)
>                   ctime 1716417704.983333333 (2024-05-22 15:41:44)
>                   mtime 1716417704.983333333 (2024-05-22 15:41:44)
>                   otime 17592186044416.000000000 (559444-03-08 01:40:16)
>           item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
>                   index 195 namelen 3 name: 193
>           item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
>                   location key (0 UNKNOWN.0 0) type XATTR
>                   transid 7 data_len 1 name_len 6
>                   name: user.a
>                   data a
>           item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
>                   generation 9 type 1 (regular)
>                   extent data disk byte 303144960 nr 12288
>                   extent data offset 0 nr 4096 ram 12288
>                   extent compression 0 (none)
>           item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 4096 nr 8192
>           item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 8192 nr 4096
>   ...
>
> So the real problem happened earlier: notice that items 4 (4k-12k) and 5
> (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
> item 5 starts at i_size.
>
> Here is the state of the filesystem tree at the time of the crash:
>
>   >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
>   >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
>   >>> print_extent_buffer(nodes[0])
>   leaf 30425088 level 0 items 184 generation 9 owner 5
>   leaf 30425088 flags 0x100000000000000
>   fs uuid e5bd3946-400c-4223-8923-190ef1f18677
>   chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
>         ...
>           item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
>                   generation 7 transid 7 size 4096 nbytes 12288
>                   block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                   sequence 6 flags 0x10(PREALLOC)
>                   atime 1716417703.220000000 (2024-05-22 15:41:43)
>                   ctime 1716417703.220000000 (2024-05-22 15:41:43)
>                   mtime 1716417703.220000000 (2024-05-22 15:41:43)
>                   otime 1716417703.220000000 (2024-05-22 15:41:43)
>           item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
>                   index 195 namelen 3 name: 193
>           item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
>                   location key (0 UNKNOWN.0 0) type XATTR
>                   transid 7 data_len 1 name_len 6
>                   name: user.a
>                   data a
>           item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
>                   generation 9 type 1 (regular)
>                   extent data disk byte 303144960 nr 12288
>                   extent data offset 0 nr 8192 ram 12288
>                   extent compression 0 (none)
>           item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
>                   generation 9 type 2 (prealloc)
>                   prealloc data disk byte 303144960 nr 12288
>                   prealloc data offset 8192 nr 4096
>
> Item 5 in the log tree corresponds to item 183 in the filesystem tree,
> but nothing matches item 4. Furthermore, item 183 is the last item in
> the leaf.
>
> btrfs_log_prealloc_extents() is responsible for logging prealloc extents
> beyond i_size. It first truncates any previously logged prealloc extents
> that start beyond i_size. Then, it walks the filesystem tree and copies
> the prealloc extent items to the log tree.
>
> If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
> unlocks the tree and does another search. However, while the filesystem
> tree is unlocked, an ordered extent completion may modify the tree. In
> particular, it may insert an extent item that overlaps with an extent
> item that was already copied to the log tree.
>
> This may manifest in several ways depending on the exact scenario,
> including an EEXIST error that is silently translated to a full sync,
> overlapping items in the log tree, or this crash. This particular crash
> is triggered by the following sequence of events:
>
> - Initially, the file has i_size=4k, a regular extent from 0-4k, and a
>   prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
>   the last item in its B-tree leaf.
> - The file is fsync'd, which copies its inode item and both extent items
>   to the log tree.
> - An xattr is set on the file, which sets the
>   BTRFS_INODE_COPY_EVERYTHING flag.
> - The range 4k-8k in the file is written using direct I/O. i_size is
>   extended to 8k, but the ordered extent is still in flight.
> - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
>   calls copy_inode_items_to_log(), which calls
>   btrfs_log_prealloc_extents().
> - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
>   filesystem tree. Since it starts before i_size, it skips it. Since it
>   is the last item in its B-tree leaf, it calls btrfs_next_leaf().
> - btrfs_next_leaf() unlocks the path.
> - The ordered extent completion runs, which converts the 4k-8k part of
>   the prealloc extent to written and inserts the remaining prealloc part
>   from 8k-12k.
> - btrfs_next_leaf() does a search and finds the new prealloc extent
>   8k-12k.
> - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
>   the log tree. Note that it overlaps with the 4k-12k prealloc extent
>   that was copied to the log tree by the first fsync.
> - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
>   extent that was written.
> - This tries to drop the range 4k-8k in the log tree, which requires
>   adjusting the start of the 4k-12k prealloc extent in the log tree to
>   8k.
> - btrfs_set_item_key_safe() sees that there is already an extent
>   starting at 8k in the log tree and calls BUG().

This is all correct. Thanks for the detailed explanation; this is one
more instance of the tricky cases involving the last item and
btrfs_next_leaf().

So you mention direct IO, but that's only because the issue happened to
be triggered with direct IO; there's really nothing specific to direct
IO here, and it can happen with buffered IO too.
So I suggest dropping the "direct IO" part in the subject and just
saying something like "... size extending write into prealloc".

Then don't forget to update the subject in the test case for the
_fixed_by_kernel_commit call.

>
> Fix this by detecting when we're about to insert an overlapping file
> extent item in the log tree and truncating the part that would overlap.
>
> Signed-off-by: Omar Sandoval <osandov@fb.com>
> ---
> Hi,
>
> I'm not sure if this is the best way to fix the problem, but hopefully
> the commit message has enough detail to brainstorm a better solution if
> not. I've also included an fstest that reproduces the issue.

That seems reasonable, and I would fix it myself in a very similar way.

>
> Based on misc-next.

Btw, nowadays we use the "for-next" branch at
https://github.com/btrfs/linux/commits/for-next/
However, this applies just fine in that for-next branch.

A few comments below.

>
> Thanks,
> Omar
>
>
>  fs/btrfs/tree-log.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 51a167559ae8..a7efd23acf50 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4783,6 +4783,7 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
>         bool dropped_extents = false;
>         u64 truncate_offset = i_size;
>         struct extent_buffer *leaf;
> +       struct btrfs_file_extent_item *ei;
>         int slot;
>         int ins_nr = 0;
>         int start_slot = 0;
> @@ -4811,8 +4812,6 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
>                 goto out;
>
>         if (ret == 0) {
> -               struct btrfs_file_extent_item *ei;
> -
>                 leaf = path->nodes[0];
>                 slot = path->slots[0];
>                 ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
> @@ -4863,18 +4862,26 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
>                         path->slots[0]++;
>                         continue;
>                 }
> -               if (!dropped_extents) {
> -                       /*
> -                        * Avoid logging extent items logged in past fsync calls
> -                        * and leading to duplicate keys in the log tree.
> -                        */
> +               /*
> +                * Avoid overlapping items in the log tree. The first time we
> +                * get here, get rid of everything from a past fsync. After
> +                * that, if the current extent starts before the end of the last
> +                * extent we copied, truncate the last one. This can happen if
> +                * an ordered extent completion modifies the subvolume tree
> +                * while btrfs_next_leaf() has the tree unlocked.
> +                */
> +               if (!dropped_extents || key.offset < truncate_offset) {
>                         ret = truncate_inode_items(trans, root->log_root, inode,
> -                                                  truncate_offset,
> +                                                  min(key.offset,
> +                                                      truncate_offset),

For readability you can keep the min() expression in a single line.
That would result in an 84-character wide line, which is tolerated nowadays.
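
I.e. something like:

		ret = truncate_inode_items(trans, root->log_root, inode,
					   min(key.offset, truncate_offset),
					   BTRFS_EXTENT_DATA_KEY);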

>                                                    BTRFS_EXTENT_DATA_KEY);
>                         if (ret)
>                                 goto out;
>                         dropped_extents = true;
>                 }
> +               ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
> +               truncate_offset = (key.offset +
> +                                  btrfs_file_extent_num_bytes(leaf, ei));

The parentheses here are a bit odd and unnecessary.
You can also use btrfs_file_extent_end(path) instead of the expression.
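
I.e.:

		truncate_offset = btrfs_file_extent_end(path);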

Thanks.

>                 if (ins_nr == 0)
>                         start_slot = slot;
>                 ins_nr++;
> --
> 2.45.1
>
>


* Re: [PATCH v3 03/11] btrfs: introduce new members for extent_map
  2024-05-23 23:19  2%     ` Qu Wenruo
@ 2024-05-24 10:59  1%       ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-24 10:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Filipe Manana, Qu Wenruo, linux-btrfs, David Sterba

On Fri, May 24, 2024 at 08:49:25AM +0930, Qu Wenruo wrote:
> 
> 
> On 2024/5/24 02:23, Filipe Manana wrote:
> [...]
> >> @@ -832,10 +897,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
> >>                                          split->orig_start = em->orig_start;
> >>                                  }
> >>                          } else {
> >> +                               split->disk_num_bytes = 0;
> >> +                               split->offset = 0;
> >>                                  split->ram_bytes = split->len;
> >>                                  split->orig_start = split->start;
> >>                                  split->block_len = 0;
> >> -                               split->disk_num_bytes = 0;
> >
> > Why move the assignment of ->disk_num_bytes ?
> > This is sort of distracting, doing unnecessary changes.
> 
> It's to group the newer members together, and to follow the new trend to
> put them in disk_bytenr disk_num_bytes offset ram_bytes order.
> 
> I know with structures, there is really no need to keep any order
> between the member assignment, but with fixed ordering, it would be
> better in the long run.

I agree this pays off in the long run. The most prominent example is
the ordering of the btrfs_key initialization: if it's always
objectid/type/offset it does not slow down reading, it's enough to
read the values. Admittedly for the extent_map it's not the same because
there are more members. The important thing is to keep the same order
everywhere.
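
I.e. always:

	key.objectid = objectid;
	key.type = BTRFS_EXTENT_DATA_KEY;
	key.offset = offset;

so the reader only has to scan the values, never the member names.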


* Re: [PATCH v4 2/2] btrfs: reserve new relocation block-group after successful relocation
  2024-05-23 15:21  1% ` [PATCH v4 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
@ 2024-05-24  8:33  1%   ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-24  8:33 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Filipe Manana, Johannes Thumshirn

On Thu, May 23, 2024 at 05:21:59PM GMT, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> After we've committed a relocation transaction, we know we have just freed
> up space. Set it as hint for the next relocation.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

Regards,

> ---
>  fs/btrfs/relocation.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 5f1a909a1d91..02a9ebf96a95 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3811,6 +3811,13 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
>  	ret = btrfs_commit_transaction(trans);
>  	if (ret && !err)
>  		err = ret;
> +
> +	/*
> +	 * We know we have just freed space, set it as hint for the
> +	 * next relocation.
> +	 */
> +	if (!err)
> +		btrfs_reserve_relocation_bg(fs_info);
>  out_free:
>  	ret = clean_dirty_subvols(rc);
>  	if (ret < 0 && !err)
> 
> -- 
> 2.43.0
> 


* Re: [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-23 15:21  1% ` [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
@ 2024-05-24  8:31  1%   ` Naohiro Aota
  2024-05-24 14:07  1%   ` Filipe Manana
  1 sibling, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-24  8:31 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Filipe Manana, Johannes Thumshirn

On Thu, May 23, 2024 at 05:21:58PM GMT, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Reserve one zone as a data relocation target on each mount. If we already
> find one empty block group, there's no need to force a chunk allocation,
> but we can use this empty data block group as our relocation target.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/block-group.c | 17 +++++++++++++
>  fs/btrfs/disk-io.c     |  2 ++
>  fs/btrfs/zoned.c       | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/zoned.h       |  3 +++
>  4 files changed, 89 insertions(+)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 9910bae89966..1195f6721c90 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1500,6 +1500,15 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
>  			btrfs_put_block_group(block_group);
>  			continue;
>  		}
> +
> +		spin_lock(&fs_info->relocation_bg_lock);
> +		if (block_group->start == fs_info->data_reloc_bg) {
> +			btrfs_put_block_group(block_group);
> +			spin_unlock(&fs_info->relocation_bg_lock);
> +			continue;
> +		}
> +		spin_unlock(&fs_info->relocation_bg_lock);
> +
>  		spin_unlock(&fs_info->unused_bgs_lock);
>  
>  		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
> @@ -1835,6 +1844,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
>  				      bg_list);
>  		list_del_init(&bg->bg_list);
>  
> +		spin_lock(&fs_info->relocation_bg_lock);
> +		if (bg->start == fs_info->data_reloc_bg) {
> +			btrfs_put_block_group(bg);
> +			spin_unlock(&fs_info->relocation_bg_lock);
> +			continue;
> +		}
> +		spin_unlock(&fs_info->relocation_bg_lock);
> +
>  		space_info = bg->space_info;
>  		spin_unlock(&fs_info->unused_bgs_lock);
>  
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 78d3966232ae..16bb52bcb69e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3547,6 +3547,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>  	}
>  	btrfs_discard_resume(fs_info);
>  
> +	btrfs_reserve_relocation_bg(fs_info);
> +
>  	if (fs_info->uuid_root &&
>  	    (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
>  	     fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index c52a0063f7db..d291cf4f565e 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -17,6 +17,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "bio.h"
> +#include "transaction.h"
>  
>  /* Maximum number of zones to report per blkdev_report_zones() call */
>  #define BTRFS_REPORT_NR_ZONES   4096
> @@ -2637,3 +2638,69 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
>  	}
>  	spin_unlock(&fs_info->zone_active_bgs_lock);
>  }
> +
> +static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
> +{
> +	struct btrfs_block_group *bg;
> +
> +	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +		list_for_each_entry(bg, &sinfo->block_groups[i], list) {
> +			if (bg->flags != flags)
> +				continue;
> +			if (bg->used == 0)
> +				return bg->start;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
> +{
> +	struct btrfs_root *tree_root = fs_info->tree_root;
> +	struct btrfs_space_info *sinfo = fs_info->data_sinfo;
> +	struct btrfs_trans_handle *trans;
> +	struct btrfs_block_group *bg;
> +	u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
> +	u64 bytenr = 0;
> +
> +	lockdep_assert_not_held(&fs_info->relocation_bg_lock);
> +
> +	if (!btrfs_is_zoned(fs_info))
> +		return;
> +
> +	if (fs_info->data_reloc_bg)
> +		return;
> +
> +	bytenr = find_empty_block_group(sinfo, flags);
> +	if (!bytenr) {
> +		int ret;
> +
> +		trans = btrfs_join_transaction(tree_root);
> +		if (IS_ERR(trans))
> +			return;
> +
> +		ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
> +		btrfs_end_transaction(trans);
> +		if (ret)
> +			return;
> +
> +		bytenr = find_empty_block_group(sinfo, flags);
> +		if (!bytenr)
> +			return;
> +
> +	}
> +
> +	bg = btrfs_lookup_block_group(fs_info, bytenr);
> +	if (!bg)
> +		return;
> +
> +	if (!btrfs_zone_activate(bg))
> +		bytenr = 0;
> +
> +	btrfs_put_block_group(bg);
> +
> +	spin_lock(&fs_info->relocation_bg_lock);
> +	fs_info->data_reloc_bg = bytenr;

Since the above check and the chunk allocation are outside of the lock, some
other thread can already have set a new data_reloc_bg in the meantime. So,
setting this under "if (!fs_info->data_reloc_bg)" would be safe.
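
I.e. an (untested) sketch:

	spin_lock(&fs_info->relocation_bg_lock);
	if (!fs_info->data_reloc_bg)
		fs_info->data_reloc_bg = bytenr;
	spin_unlock(&fs_info->relocation_bg_lock);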

With that:

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

Regards,


* [PATCH] btrfs/741: add commit ID in _fixed_by_kernel_commit
@ 2024-05-24  4:26  1% Anand Jain
  2024-05-24 13:17  1% ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Anand Jain @ 2024-05-24  4:26 UTC (permalink / raw)
  To: fstests; +Cc: linux-btrfs

Now that the kernel patch is merged in v6.9, replace the placeholder with
the actual commit ID.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 tests/generic/741 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/generic/741 b/tests/generic/741
index f8f9a7be7619..ad1592a10553 100755
--- a/tests/generic/741
+++ b/tests/generic/741
@@ -31,7 +31,7 @@ _require_test
 _require_scratch
 _require_dm_target flakey
 
-[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit XXXXXXXXXXXX \
+[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit 2f1aeab9fca1 \
 			"btrfs: return accurate error code on open failure"
 
 _scratch_mkfs >> $seqres.full
-- 
2.39.3



* Re: [PATCH] fstests: mkfs the scratch device if we have missing profiles
  @ 2024-05-24  3:51  1% ` Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-24  3:51 UTC (permalink / raw)
  To: Josef Bacik, fstests, linux-btrfs, kernel-team

On 5/8/24 04:08, Josef Bacik wrote:
> I have a btrfs config where I specifically exclude raid56 testing, and
> this resulted in btrfs/011 failing with an inconsistent file system.
> This happens because the last test we run does a btrfs device replace of
> the $SCRATCH_DEV, leaving it with no valid file system.  We then skip
> the remaining profiles and exit, but then we go to check the device on
> $SCRATCH_DEV and it fails because there is no file system.
> 
> Fix this to re-make the scratch device if we skip any of the raid
> profiles.  This only happens in the case of some idiot user configuring
> their testing in a special way, in normal runs of this test we'll never
> re-make the fs.
> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Applied.

Thanks, Anand

> ---
>   tests/btrfs/011 | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/tests/btrfs/011 b/tests/btrfs/011
> index d8b5a978..b8c14f3b 100755
> --- a/tests/btrfs/011
> +++ b/tests/btrfs/011
> @@ -257,6 +257,12 @@ for t in "-m single -d single:1 no 64" \
>   	workout_option=${t#*:}
>   	if [[ "${_btrfs_profile_configs[@]}" =~ "${mkfs_option/ -M}"( |$) ]]; then
>   		workout "$mkfs_option" $workout_option
> +	else
> +		# If we have limited the profile configs we could leave
> +		# $SCRATCH_DEV in an inconsistent state (because it was
> +		# replaced), so mkfs the scratch device to make sure we don't
> +		# trip the fs check at the end.
> +		_scratch_mkfs > /dev/null 2>&1
>   	fi
>   done
>   



* Re: [PATCH v4 0/6] part3 trivial adjustments for return variable coding style
  2024-05-23 17:18  1%   ` David Sterba
@ 2024-05-24  3:09  1%     ` Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-24  3:09 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 5/24/24 01:18, David Sterba wrote:
> On Tue, May 21, 2024 at 08:10:03PM +0200, David Sterba wrote:
>> On Wed, May 22, 2024 at 01:11:06AM +0800, Anand Jain wrote:
>>> This is v4 of part 3 of the series, containing renaming with optimization of the
>>> return variable.
>>>
>>> v3 part3:
>>>    https://lore.kernel.org/linux-btrfs/cover.1715783315.git.anand.jain@oracle.com/
>>> v2 part2:
>>>    https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
>>> v1:
>>>    https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
>>>
>>> Anand Jain (6):
>>>    btrfs: rename err to ret in btrfs_cleanup_fs_roots()
>>>    btrfs: rename ret to err in btrfs_recover_relocation()
>>>    btrfs: rename ret to ret2 in btrfs_recover_relocation()
>>>    btrfs: rename err to ret in btrfs_recover_relocation()
>>>    btrfs: rename err to ret in btrfs_drop_snapshot()
>>>    btrfs: rename err to ret in btrfs_find_orphan_roots()
>>
>> 1-5 look ok to me, for patch 6 there's the ret = 0 reset question sent
>> to v3.
> 
> You can add 1-5 to for-next with
> 
> Reviewed-by: David Sterba <dsterba@suse.com>
> 

Pushed 1-5.

> and only resend 6.

IMO, in patch 6, the

  if (ret > 1)
     ret = 0;

section is already simple and typical for the ret > 1 cases.
Could you pls check my response in v3.

Thanks, Anand




* Re: [PATCH v3 07/11] btrfs: remove extent_map::block_start member
  2024-05-23 17:56  1%   ` Filipe Manana
@ 2024-05-23 23:23  1%     ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23 23:23 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, David Sterba



On 2024/5/24 03:26, Filipe Manana wrote:
>> @@ -2703,7 +2700,7 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
>>                  if (IS_ERR(em))
>>                          return PTR_ERR(em);
>>
>> -               if (em->block_start != EXTENT_MAP_HOLE)
>> +               if (extent_map_block_start(em) != EXTENT_MAP_HOLE)
> This should be:   if (em->disk_bytenr != EXTENT_MAP_HOLE)

That's fine, extent_map_block_start() would handle it correctly: for
any disk_bytenr >= EXTENT_MAP_LAST_BYTE, it returns the disk_bytenr
directly.

But yes, we can save one if() check, and I will update it.
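
For reference, a minimal sketch of the behavior assumed here (an
illustration, not a verbatim copy of the helper from the series):

/*
 * Sketch only: special values (EXTENT_MAP_HOLE/EXTENT_MAP_INLINE, i.e.
 * anything >= EXTENT_MAP_LAST_BYTE) are returned as-is, otherwise the
 * block start is derived from disk_bytenr plus the extent offset.
 */
static inline u64 extent_map_block_start(const struct extent_map *em)
{
	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE)
		return em->disk_bytenr;
	if (extent_map_is_compressed(em))
		return em->disk_bytenr;
	return em->disk_bytenr + em->offset;
}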

Thanks,
Qu

>
> Everything else looks fine. Thanks.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 03/11] btrfs: introduce new members for extent_map
  2024-05-23 16:53  1%   ` Filipe Manana
@ 2024-05-23 23:19  2%     ` Qu Wenruo
  2024-05-24 10:59  1%       ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-23 23:19 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs, David Sterba



On 2024/5/24 02:23, Filipe Manana wrote:
[...]
>> @@ -832,10 +897,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>>                                          split->orig_start = em->orig_start;
>>                                  }
>>                          } else {
>> +                               split->disk_num_bytes = 0;
>> +                               split->offset = 0;
>>                                  split->ram_bytes = split->len;
>>                                  split->orig_start = split->start;
>>                                  split->block_len = 0;
>> -                               split->disk_num_bytes = 0;
>
> Why move the assignment of ->disk_num_bytes?
> This is sort of distracting, making unnecessary changes.

It's to group the newer members together, and to follow the new
convention of assigning them in disk_bytenr / disk_num_bytes / offset /
ram_bytes order.

I know that with structures there is really no need to keep any
particular order between the member assignments, but a fixed ordering
is better in the long run.

And unfortunately the cost is that the first patch, which does the
reordering of the members, is harder to review.
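
To illustrate (an editor's sketch, not code from the patchset), the
fixed assignment order looks like:

/* Illustrative helper only, showing the consistent member order. */
static void assign_ondisk_members(struct extent_map *em, u64 disk_bytenr,
				  u64 disk_num_bytes, u64 offset,
				  u64 ram_bytes)
{
	em->disk_bytenr = disk_bytenr;
	em->disk_num_bytes = disk_num_bytes;
	em->offset = offset;
	em->ram_bytes = ram_bytes;
}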

>
>>                          }
>>
>>                          if (extent_map_in_tree(em)) {
>> @@ -989,10 +1055,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>>          /* First, replace the em with a new extent_map starting from * em->start */
>>          split_pre->start = em->start;
>>          split_pre->len = pre;
>> +       split_pre->disk_bytenr = new_logical;
>
> We are already setting disk_bytenr to the same value a few lines below.

Sorry, I didn't see any location touching disk_bytenr, either inside the
patch or elsewhere, especially since disk_bytenr is a new member.

>
>> +       split_pre->disk_num_bytes = split_pre->len;
>> +       split_pre->offset = 0;
>>          split_pre->orig_start = split_pre->start;
>>          split_pre->block_start = new_logical;
>>          split_pre->block_len = split_pre->len;
>> -       split_pre->disk_num_bytes = split_pre->block_len;
>
> Here, where split_pre->block_len has the same value as split_pre->len.
> This sort of apparently accidental change makes it harder to review.

Again, to keep a consistent order of members.

>
>>          split_pre->ram_bytes = split_pre->len;
>>          split_pre->flags = flags;
>>          split_pre->generation = em->generation;
>> @@ -1007,10 +1075,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>>          /* Insert the middle extent_map. */
>>          split_mid->start = em->start + pre;
>>          split_mid->len = em->len - pre;
>> +       split_mid->disk_bytenr = em->block_start + pre;
>
> Same here.
>
>> +       split_mid->disk_num_bytes = split_mid->len;
>> +       split_mid->offset = 0;
>>          split_mid->orig_start = split_mid->start;
>>          split_mid->block_start = em->block_start + pre;
>>          split_mid->block_len = split_mid->len;
>> -       split_mid->disk_num_bytes = split_mid->block_len;
>
> Which relates to this.
>
> Otherwise it looks fine, and could be fixed up when cherry picked to for-next.

So although it's indeed harder to review, we would have a very
consistent order when assigning those members.

Thankfully this is only a one-time pain; there should be no more
member-order-related problems.

Thanks,
Qu

>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.


^ permalink raw reply	[relevance 2%]

* [PATCH fstests] btrfs: add regression test for fsync vs. size-extending direct I/O into prealloc crash
  2024-05-23 19:34  1% [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc Omar Sandoval
@ 2024-05-23 19:34  1% ` Omar Sandoval
  2024-05-24 13:24  1%   ` Filipe Manana
  2024-05-24 13:05  1% ` [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc Filipe Manana
  1 sibling, 1 reply; 200+ results
From: Omar Sandoval @ 2024-05-23 19:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

Since this is a race, we just try to make the race happen in a loop and
pass if it doesn't crash after all of our attempts.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
 tests/btrfs/312     | 66 +++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/312.out |  2 ++
 2 files changed, 68 insertions(+)
 create mode 100755 tests/btrfs/312
 create mode 100644 tests/btrfs/312.out

diff --git a/tests/btrfs/312 b/tests/btrfs/312
new file mode 100755
index 00000000..aaca0e3e
--- /dev/null
+++ b/tests/btrfs/312
@@ -0,0 +1,66 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+#
+# FS QA Test 312
+#
+# Repeatedly fsync after size-extending direct I/O into a preallocated extent.
+#
+. ./common/preamble
+_begin_fstest dangerous log prealloc
+
+_supported_fs btrfs
+_require_scratch
+_require_btrfs_command inspect-internal dump-tree
+_require_btrfs_command inspect-internal inode-resolve
+_fixed_by_kernel_commit XXXXXXXXXXXX \
+	"btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc"
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount
+
+sectorsize=$(_scratch_btrfs_sectorsize)
+
+# Create a bunch of files so that we hopefully get one whose items are at the
+# end of a leaf.
+for ((i = 0; i < 1000; i++)); do
+	$XFS_IO_PROG -c "open -f -d $SCRATCH_MNT/$i" -c "falloc -k 0 $((sectorsize * 3))" -c "pwrite -q 0 $sectorsize"
+	$SETFATTR_PROG -n user.a -v a "$SCRATCH_MNT/$i"
+done
+touch "$SCRATCH_MNT/$i"
+
+_scratch_unmount
+
+ino=$($BTRFS_UTIL_PROG inspect-internal dump-tree "$SCRATCH_DEV" -t 5 |
+      $AWK_PROG -v sectorsize="$sectorsize" '
+match($0, /^leaf [0-9]+ items ([0-9]+)/, arr) {
+	nritems = arr[1]
+}
+match($0, /item ([0-9]+) key \(([0-9]+) EXTENT_DATA ([0-9]+)\)/, arr) {
+	if (arr[1] == nritems - 1 && arr[3] == sectorsize) {
+		print arr[2]
+		exit
+	}
+}
+')
+
+if [ -z "$ino" ]; then
+	_fail "Extent at end of leaf not found"
+fi
+
+_scratch_mount
+path=$($BTRFS_UTIL_PROG inspect-internal inode-resolve "$ino" "$SCRATCH_MNT")
+
+# Try repeatedly to reproduce the race of an ordered extent finishing while
+# we're logging prealloc extents beyond i_size.
+for ((i = 0; i < 1000; i++)); do
+	$XFS_IO_PROG -c "open -t -d $path" -c "falloc -k 0 $((sectorsize * 3))" -c "pwrite -q -w 0 $sectorsize"
+	$SETFATTR_PROG -n user.a -v a "$path"
+	$XFS_IO_PROG -c "open -d $path" -c "pwrite -q -w $sectorsize $sectorsize" || exit 1
+done
+
+# If it didn't crash, we're good.
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/312.out b/tests/btrfs/312.out
new file mode 100644
index 00000000..6e72aa94
--- /dev/null
+++ b/tests/btrfs/312.out
@@ -0,0 +1,2 @@
+QA output created by 312
+Silence is golden
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc
@ 2024-05-23 19:34  1% Omar Sandoval
  2024-05-23 19:34  1% ` [PATCH fstests] btrfs: add regression test for fsync vs. size-extending direct I/O into prealloc crash Omar Sandoval
  2024-05-24 13:05  1% ` [PATCH] btrfs: fix crash on racing fsync and size-extending direct I/O into prealloc Filipe Manana
  0 siblings, 2 replies; 200+ results
From: Omar Sandoval @ 2024-05-23 19:34 UTC (permalink / raw)
  To: linux-btrfs; +Cc: kernel-team

From: Omar Sandoval <osandov@fb.com>

We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():

  BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2620!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]

With the following stack trace:

  #0  btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
  #1  btrfs_drop_extents (fs/btrfs/file.c:411:4)
  #2  log_one_extent (fs/btrfs/tree-log.c:4732:9)
  #3  btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
  #4  btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
  #5  btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
  #6  btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
  #7  btrfs_sync_file (fs/btrfs/file.c:1933:8)
  #8  vfs_fsync_range (fs/sync.c:188:9)
  #9  vfs_fsync (fs/sync.c:202:9)
  #10 do_fsync (fs/sync.c:212:9)
  #11 __do_sys_fdatasync (fs/sync.c:225:9)
  #12 __se_sys_fdatasync (fs/sync.c:223:1)
  #13 __x64_sys_fdatasync (fs/sync.c:223:1)
  #14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
  #15 do_syscall_64 (arch/x86/entry/common.c:83:7)
  #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)

So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().

This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:

  >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
  leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
  leaf 33439744 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
          item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
                  generation 7 transid 9 size 8192 nbytes 8473563889606862198
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 204 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417704.983333333 (2024-05-22 15:41:44)
                  mtime 1716417704.983333333 (2024-05-22 15:41:44)
                  otime 17592186044416.000000000 (559444-03-08 01:40:16)
          item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
                  index 195 namelen 3 name: 193
          item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 4096 ram 12288
                  extent compression 0 (none)
          item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 4096 nr 8192
          item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096
  ...

So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.

Here is the state of the filesystem tree at the time of the crash:

  >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
  >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
  >>> print_extent_buffer(nodes[0])
  leaf 30425088 level 0 items 184 generation 9 owner 5
  leaf 30425088 flags 0x100000000000000
  fs uuid e5bd3946-400c-4223-8923-190ef1f18677
  chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
  	...
          item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
                  generation 7 transid 7 size 4096 nbytes 12288
                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
                  sequence 6 flags 0x10(PREALLOC)
                  atime 1716417703.220000000 (2024-05-22 15:41:43)
                  ctime 1716417703.220000000 (2024-05-22 15:41:43)
                  mtime 1716417703.220000000 (2024-05-22 15:41:43)
                  otime 1716417703.220000000 (2024-05-22 15:41:43)
          item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
                  index 195 namelen 3 name: 193
          item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
                  location key (0 UNKNOWN.0 0) type XATTR
                  transid 7 data_len 1 name_len 6
                  name: user.a
                  data a
          item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
                  generation 9 type 1 (regular)
                  extent data disk byte 303144960 nr 12288
                  extent data offset 0 nr 8192 ram 12288
                  extent compression 0 (none)
          item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
                  generation 9 type 2 (prealloc)
                  prealloc data disk byte 303144960 nr 12288
                  prealloc data offset 8192 nr 4096

Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.

btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.

If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.
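
Sketched in simplified form (this is not the literal tree-log.c code,
and copy_item_to_log() is a hypothetical stand-in), the loop pattern at
fault looks like:

static int log_prealloc_sketch(struct btrfs_trans_handle *trans,
			       struct btrfs_root *root,
			       struct btrfs_inode *inode,
			       struct btrfs_path *path)
{
	int ret = 0;

	while (1) {
		if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
			/*
			 * Drops the path's locks and searches again, so
			 * an ordered extent completion can insert an item
			 * overlapping one we already copied to the log.
			 */
			ret = btrfs_next_leaf(root, path);
			if (ret < 0)
				break;
			if (ret > 0) {
				ret = 0;
				break;
			}
			continue;
		}
		copy_item_to_log(trans, inode, path); /* hypothetical */
		path->slots[0]++;
	}
	return ret;
}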

This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:

- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
  prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
  the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
  to the log tree.
- An xattr is set on the file, which sets the
  BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
  extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
  calls copy_inode_items_to_log(), which calls
  btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
  filesystem tree. Since it starts before i_size, it skips it. Since it
  is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
  the prealloc extent to written and inserts the remaining prealloc part
  from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
  8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
  the log tree. Note that it overlaps with the 4k-12k prealloc extent
  that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
  extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
  adjusting the start of the 4k-12k prealloc extent in the log tree to
  8k.
- btrfs_set_item_key_safe() sees that there is already an extent
  starting at 8k in the log tree and calls BUG().
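
The same sequence as a standalone sketch (illustrative only: the file
path, mount point, and 4k sector size are assumptions, and the fstest
accompanying this patch drives the race more reliably in a loop):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdlib.h>
#include <string.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/file", O_CREAT | O_RDWR | O_DIRECT, 0600);
	void *buf;

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0, 4096);

	fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 12288); /* prealloc 0-12k */
	pwrite(fd, buf, 4096, 0);	/* regular extent 0-4k, i_size 4k */
	fdatasync(fd);			/* first fsync logs both extents */
	fsetxattr(fd, "user.a", "a", 1, 0); /* sets COPY_EVERYTHING */
	pwrite(fd, buf, 4096, 4096);	/* direct I/O 4k-8k, extends i_size */
	fdatasync(fd);			/* races with OE completion */

	free(buf);
	close(fd);
	return 0;
}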

Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.

Signed-off-by: Omar Sandoval <osandov@fb.com>
---
Hi,

I'm not sure if this is the best way to fix the problem, but hopefully
the commit message has enough detail to brainstorm a better solution if
not. I've also included an fstest that reproduces the issue.

Based on misc-next.

Thanks,
Omar


 fs/btrfs/tree-log.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 51a167559ae8..a7efd23acf50 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4783,6 +4783,7 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
 	bool dropped_extents = false;
 	u64 truncate_offset = i_size;
 	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *ei;
 	int slot;
 	int ins_nr = 0;
 	int start_slot = 0;
@@ -4811,8 +4812,6 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
 		goto out;
 
 	if (ret == 0) {
-		struct btrfs_file_extent_item *ei;
-
 		leaf = path->nodes[0];
 		slot = path->slots[0];
 		ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
@@ -4863,18 +4862,26 @@ static int btrfs_log_prealloc_extents(struct btrfs_trans_handle *trans,
 			path->slots[0]++;
 			continue;
 		}
-		if (!dropped_extents) {
-			/*
-			 * Avoid logging extent items logged in past fsync calls
-			 * and leading to duplicate keys in the log tree.
-			 */
+		/*
+		 * Avoid overlapping items in the log tree. The first time we
+		 * get here, get rid of everything from a past fsync. After
+		 * that, if the current extent starts before the end of the last
+		 * extent we copied, truncate the last one. This can happen if
+		 * an ordered extent completion modifies the subvolume tree
+		 * while btrfs_next_leaf() has the tree unlocked.
+		 */
+		if (!dropped_extents || key.offset < truncate_offset) {
 			ret = truncate_inode_items(trans, root->log_root, inode,
-						   truncate_offset,
+						   min(key.offset,
+						       truncate_offset),
 						   BTRFS_EXTENT_DATA_KEY);
 			if (ret)
 				goto out;
 			dropped_extents = true;
 		}
+		ei = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+		truncate_offset = (key.offset +
+				   btrfs_file_extent_num_bytes(leaf, ei));
 		if (ins_nr == 0)
 			start_slot = slot;
 		ins_nr++;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* Re: [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (11 preceding siblings ...)
  2024-05-23 10:23  1% ` [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Johannes Thumshirn
@ 2024-05-23 18:26  2% ` Filipe Manana
  12 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-23 18:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [CHANGELOG]
> v3:
> - Rebased to the latest for-next
>   There is a small conflict with the extent map tree member changes,
>   no big deal.
>
> - Fix an error: the original code checks
>   btrfs_file_extent_disk_bytenr(), while the newer code was checking
>   disk_num_bytes, which is wrong.
>
> - Various commit message/comment updates
>   Mostly some grammar fixes and removal of rants on the btrfs_file_extent
>   member mismatches for btrfs_alloc_ordered_extent().
>   However a comment is still left inside btrfs_alloc_ordered_extent()
>   for NOCOW/PREALLOC as a reminder for further cleanup.

I went through each new patch version, and it looks good.

I replied to some individual patches with minor things that can be
fixed at commit time when adding to for-next in case you don't send a
new version.
For patch 7/11 there's one issue.

Otherwise looks great, so after the 7/11 fix you can add:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks!

>
> v2:
> - Rebased to the latest for-next
>   There is a conflict with extent locking, and maybe some other
>   hidden conflicts for NOCOW/PREALLOC?
>   Previously the patchset passed the fstests auto group, but after
>   merging with other patches, it always crashes at btrfs/060.
>
> - Fix an error in the final cleanup patch
>   It's the NOCOW/PREALLOC shenanigans again: in the buffered NOCOW path,
>   we have to use the old inaccurate numbers for NOCOW/PREALLOC OEs.
>
> - Split the final cleanup into 4 patches
>   Most cleanups are very straightforward, but the cleanup for
>   btrfs_alloc_ordered_extent() needs extra special handling for
>   NOCOW/PREALLOC.
>
> v1:
> - Rebased to the latest for-next
>   To resolve the conflicts with the recently introduced extent map
>   shrinker
>
> - A new cleanup patch to remove the recursive header inclusion
>
> - Use a new structure to pass the file extent item related members
>   around
>
> - Add a new comment on why we're intentionally passing incorrect
>   numbers for NOCOW/PREALLOC ordered extents inside
>   btrfs_create_dio_extent()
>
> [REPO]
> https://github.com/adam900710/linux/tree/em_cleanup
>
> This series introduces two new members (disk_bytenr/offset) to
> extent_map, removes three old members
> (block_start/block_len/orig_start), and finally renames one member
> (orig_block_len -> disk_num_bytes).
>
> This should save us one u64 for extent_map, although with the recent
> extent map shrinker, the saving is not that useful.
>
> But to make the migration safe, I introduce extra sanity checks for
> extent_map, and cross-check both the old and new members.
>
> The extra sanity checks already exposed one bug (thankfully harmless)
> causing em::block_start to be incorrect.
>
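
As an illustration of the kind of cross-check meant here (an editor's
sketch; the actual checks added in patch 4 may differ):

static void validate_extent_map(const struct extent_map *em)
{
	/* Hypothetical cross-check between the old and new members. */
	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE &&
	    !extent_map_is_compressed(em))
		WARN_ON(em->block_start != em->disk_bytenr + em->offset);
}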
> But so far, the patchset is fine for a default fstests run.
>
> Furthermore, since we already have overly long parameter lists for
> extent_map/ordered_extent/can_nocow_extent, here is a new structure,
> btrfs_file_extent, a memory-access-friendly structure to represent a
> btrfs_file_extent_item.
>
> With the help of that structure, we can represent a file extent item
> without a super long parameter list.
>
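
For reference, the structure in question, as it is defined in this
series (the definition is quoted verbatim in patch 9 later in this
thread):

/* Details about the target file extent item of a write operation. */
struct btrfs_file_extent {
	u64 disk_bytenr;
	u64 disk_num_bytes;
	u64 num_bytes;
	u64 ram_bytes;
	u64 offset;
	u8 compression;
};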
> The patchset first renames orig_block_len to disk_num_bytes. It then
> introduces the new members, the extra sanity checks, and the new
> btrfs_file_extent structure, and uses that to remove the 3 older members
> from extent_map.
>
> After all the above work is done, use btrfs_file_extent to further clean up
> can_nocow_file_extent_args()/btrfs_alloc_ordered_extent()/create_io_em()/
> btrfs_create_dio_extent().
>
> The cleanup is in fact pretty tricky: the current code base never
> expects correct numbers for NOCOW/PREALLOC OEs, thus we have to keep the
> old but incorrect numbers just for NOCOW/PREALLOC.
>
> I will address the NOCOW/PREALLOC shenanigans in the future, but
> after the huge cleanup across multiple core structures.
>
> Qu Wenruo (11):
>   btrfs: rename extent_map::orig_block_len to disk_num_bytes
>   btrfs: export the expected file extent through can_nocow_extent()
>   btrfs: introduce new members for extent_map
>   btrfs: introduce extra sanity checks for extent maps
>   btrfs: remove extent_map::orig_start member
>   btrfs: remove extent_map::block_len member
>   btrfs: remove extent_map::block_start member
>   btrfs: cleanup duplicated parameters related to
>     can_nocow_file_extent_args
>   btrfs: cleanup duplicated parameters related to
>     btrfs_alloc_ordered_extent
>   btrfs: cleanup duplicated parameters related to create_io_em()
>   btrfs: cleanup duplicated parameters related to
>     btrfs_create_dio_extent()
>
>  fs/btrfs/btrfs_inode.h            |   4 +-
>  fs/btrfs/compression.c            |   7 +-
>  fs/btrfs/defrag.c                 |  14 +-
>  fs/btrfs/extent_io.c              |  10 +-
>  fs/btrfs/extent_map.c             | 192 +++++++++++++------
>  fs/btrfs/extent_map.h             |  51 +++--
>  fs/btrfs/file-item.c              |  23 +--
>  fs/btrfs/file.c                   |  18 +-
>  fs/btrfs/inode.c                  | 308 +++++++++++++-----------------
>  fs/btrfs/ordered-data.c           |  34 +++-
>  fs/btrfs/ordered-data.h           |  19 +-
>  fs/btrfs/relocation.c             |   5 +-
>  fs/btrfs/tests/extent-map-tests.c | 114 ++++++-----
>  fs/btrfs/tests/inode-tests.c      | 177 ++++++++---------
>  fs/btrfs/tree-log.c               |  23 ++-
>  fs/btrfs/zoned.c                  |   4 +-
>  include/trace/events/btrfs.h      |  18 +-
>  17 files changed, 541 insertions(+), 480 deletions(-)
>
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 2%]

* Re: [PATCH v3 03/11] btrfs: introduce new members for extent_map
  2024-05-23  5:03  1% ` [PATCH v3 03/11] btrfs: introduce new members for extent_map Qu Wenruo
  2024-05-23 16:53  1%   ` Filipe Manana
@ 2024-05-23 18:21  1%   ` Filipe Manana
  1 sibling, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-23 18:21 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Introduce two new members for extent_map:
>
> - disk_bytenr
> - offset
>
> Both match the members with the same name inside
> btrfs_file_extent_items.
>
> For now this patch only touches those members when:
>
> - Reading btrfs_file_extent_items from disk
> - Inserting new holes
> - Merging two extent maps
>   With the new disk_bytenr and disk_num_bytes, merging becomes a
>   little more complex, as we have 3 different cases:
>
>   * Both extent maps are referring to the same data extents
>     |<----- data extent A ----->|
>        |<- em 1 ->|<- em 2 ->|
>
>   * Both extent maps are referring to different data extents
>     |<-- data extent A -->|<-- data extent B -->|
>                |<- em 1 ->|<- em 2 ->|
>
>   * One of the extent maps is referring to a merged and larger data
>     extent that covers both extent maps
>
>     This is not really a valid case outside of some selftests,
>     so that test case will be removed.
>
>   A new helper, merge_ondisk_extents(), is introduced to handle the
>   above valid cases.
>
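
As a worked example of case 2 above (an editor's userspace
demonstration mirroring the merge arithmetic, not code from the patch):

#include <stdint.h>
#include <stdio.h>

struct em {
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
};

/* Same arithmetic as merge_ondisk_extents(), on a reduced struct. */
static void merge_ondisk(struct em *prev, const struct em *next)
{
	uint64_t bytenr = prev->disk_bytenr < next->disk_bytenr ?
			  prev->disk_bytenr : next->disk_bytenr;
	uint64_t prev_end = prev->disk_bytenr + prev->disk_num_bytes;
	uint64_t next_end = next->disk_bytenr + next->disk_num_bytes;
	uint64_t end = prev_end > next_end ? prev_end : next_end;

	/* Compute the new offset before disk_bytenr is overwritten. */
	prev->offset = prev->disk_bytenr + prev->offset - bytenr;
	prev->disk_bytenr = bytenr;
	prev->disk_num_bytes = end - bytenr;
}

int main(void)
{
	/* Case 2: adjacent data extents A (4k-8k) and B (8k-12k). */
	struct em prev = { 4096, 4096, 0 };
	struct em next = { 8192, 4096, 0 };

	merge_ondisk(&prev, &next);
	/* Prints: bytenr=4096 num_bytes=8192 offset=0 */
	printf("bytenr=%llu num_bytes=%llu offset=%llu\n",
	       (unsigned long long)prev.disk_bytenr,
	       (unsigned long long)prev.disk_num_bytes,
	       (unsigned long long)prev.offset);
	return 0;
}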
> To properly assign values for those new members, a new btrfs_file_extent
> parameter is introduced to all the involved call sites.
>
> - For NOCOW writes, the btrfs_file_extent is exposed from
>   can_nocow_file_extent().
>
> - For other writes, the members can be easily calculated, as most of
>   them have 0 offset and utilize the whole on-disk data extent.
>   The exception is encoded writes, but thankfully that interface
>   provides the offset directly along with all other needed info.
>
> For now, the old members (block_start/block_len/orig_start) co-exist
> with the new members (disk_bytenr/offset), while all the critical code
> still uses only the old members.
>
> The cleanup will happen later, after all the older and newer members
> are properly validated.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/defrag.c     |  4 +++
>  fs/btrfs/extent_map.c | 78 ++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/extent_map.h | 17 ++++++++++
>  fs/btrfs/file-item.c  |  9 ++++-
>  fs/btrfs/file.c       |  1 +
>  fs/btrfs/inode.c      | 57 +++++++++++++++++++++++++++----
>  6 files changed, 155 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 407ccec3e57e..242c5469f4ba 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -709,6 +709,10 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
>                         em->start = start;
>                         em->orig_start = start;
>                         em->block_start = EXTENT_MAP_HOLE;
> +                       em->disk_bytenr = EXTENT_MAP_HOLE;
> +                       em->disk_num_bytes = 0;
> +                       em->ram_bytes = 0;
> +                       em->offset = 0;
>                         em->len = key.offset - start;
>                         break;
>                 }
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index a9d60d1eade9..c7d2393692e6 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -229,6 +229,60 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
>         return next->block_start == prev->block_start;
>  }
>
> +/*
> + * Handle the ondisk data extents merge for @prev and @next.
> + *
> + * Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes.
> + * For now only uncompressed regular extent can be merged.
> + *
> + * @prev and @next will be both updated to point to the new merged range.
> + * Thus one of them should be removed by the caller.
> + */
> +static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next)
> +{
> +       u64 new_disk_bytenr;
> +       u64 new_disk_num_bytes;
> +       u64 new_offset;
> +
> +       /* @prev and @next should not be compressed. */
> +       ASSERT(!extent_map_is_compressed(prev));
> +       ASSERT(!extent_map_is_compressed(next));
> +
> +       /*
> +        * There are two different cases where @prev and @next can be merged.
> +        *
> +        * 1) They are referring to the same data extent
> +        * |<----- data extent A ----->|
> +        *    |<- prev ->|<- next ->|
> +        *
> +        * 2) They are referring to different data extents but still adjacent
> +        *
> +        * |<-- data extent A -->|<-- data extent B -->|
> +        *            |<- prev ->|<- next ->|
> +        *
> +        * The calculation here always merges the data extents first, then updates
> +        * @offset using the new data extents.
> +        *
> +        * For case 1), the merged data extent would be the same.
> +        * For case 2), we just merge the two data extents into one.
> +        */
> +       new_disk_bytenr = min(prev->disk_bytenr, next->disk_bytenr);
> +       new_disk_num_bytes = max(prev->disk_bytenr + prev->disk_num_bytes,
> +                                next->disk_bytenr + next->disk_num_bytes) -
> +                            new_disk_bytenr;
> +       new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr;
> +
> +       prev->disk_bytenr = new_disk_bytenr;
> +       prev->disk_num_bytes = new_disk_num_bytes;
> +       prev->ram_bytes = new_disk_num_bytes;
> +       prev->offset = new_offset;
> +
> +       next->disk_bytenr = new_disk_bytenr;
> +       next->disk_num_bytes = new_disk_num_bytes;
> +       next->ram_bytes = new_disk_num_bytes;
> +       next->offset = new_offset;
> +}
> +
>  static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>  {
>         struct extent_map_tree *tree = &inode->extent_tree;
> @@ -260,6 +314,9 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>                         em->block_len += merge->block_len;
>                         em->block_start = merge->block_start;
>                         em->generation = max(em->generation, merge->generation);
> +
> +                       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> +                               merge_ondisk_extents(merge, em);
>                         em->flags |= EXTENT_FLAG_MERGED;
>
>                         rb_erase(&merge->rb_node, &tree->root);
> @@ -275,6 +332,8 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>         if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
>                 em->len += merge->len;
>                 em->block_len += merge->block_len;
> +               if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> +                       merge_ondisk_extents(em, merge);
>                 rb_erase(&merge->rb_node, &tree->root);
>                 RB_CLEAR_NODE(&merge->rb_node);
>                 em->generation = max(em->generation, merge->generation);
> @@ -562,6 +621,7 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
>             !extent_map_is_compressed(em)) {
>                 em->block_start += start_diff;
>                 em->block_len = em->len;
> +               em->offset += start_diff;
>         }
>         return add_extent_mapping(inode, em, 0);
>  }
> @@ -785,14 +845,18 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                                         split->block_len = em->block_len;
>                                 else
>                                         split->block_len = split->len;
> +                               split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = max(split->block_len,
>                                                             em->disk_num_bytes);
> +                               split->offset = em->offset;
>                                 split->ram_bytes = em->ram_bytes;
>                         } else {
>                                 split->orig_start = split->start;
>                                 split->block_len = 0;
>                                 split->block_start = em->block_start;
> +                               split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = 0;
> +                               split->offset = 0;
>                                 split->ram_bytes = split->len;
>                         }
>
> @@ -813,13 +877,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                         split->start = end;
>                         split->len = em_end - end;
>                         split->block_start = em->block_start;
> +                       split->disk_bytenr = em->disk_bytenr;
>                         split->flags = flags;
>                         split->generation = gen;
>
>                         if (em->block_start < EXTENT_MAP_LAST_BYTE) {
>                                 split->disk_num_bytes = max(em->block_len,
>                                                             em->disk_num_bytes);
> -
> +                               split->offset = em->offset + end - em->start;
>                                 split->ram_bytes = em->ram_bytes;
>                                 if (compressed) {
>                                         split->block_len = em->block_len;
> @@ -832,10 +897,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                                         split->orig_start = em->orig_start;
>                                 }
>                         } else {
> +                               split->disk_num_bytes = 0;
> +                               split->offset = 0;
>                                 split->ram_bytes = split->len;
>                                 split->orig_start = split->start;
>                                 split->block_len = 0;
> -                               split->disk_num_bytes = 0;
>                         }
>
>                         if (extent_map_in_tree(em)) {
> @@ -989,10 +1055,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         /* First, replace the em with a new extent_map starting from * em->start */
>         split_pre->start = em->start;
>         split_pre->len = pre;
> +       split_pre->disk_bytenr = new_logical;
> +       split_pre->disk_num_bytes = split_pre->len;
> +       split_pre->offset = 0;
>         split_pre->orig_start = split_pre->start;
>         split_pre->block_start = new_logical;
>         split_pre->block_len = split_pre->len;
> -       split_pre->disk_num_bytes = split_pre->block_len;
>         split_pre->ram_bytes = split_pre->len;
>         split_pre->flags = flags;
>         split_pre->generation = em->generation;
> @@ -1007,10 +1075,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         /* Insert the middle extent_map. */
>         split_mid->start = em->start + pre;
>         split_mid->len = em->len - pre;
> +       split_mid->disk_bytenr = em->block_start + pre;
> +       split_mid->disk_num_bytes = split_mid->len;
> +       split_mid->offset = 0;
>         split_mid->orig_start = split_mid->start;
>         split_mid->block_start = em->block_start + pre;
>         split_mid->block_len = split_mid->len;
> -       split_mid->disk_num_bytes = split_mid->block_len;
>         split_mid->ram_bytes = split_mid->len;
>         split_mid->flags = flags;
>         split_mid->generation = em->generation;
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 2b7bbffd594b..0b1a8e409377 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -70,12 +70,29 @@ struct extent_map {
>          */
>         u64 orig_start;
>
> +       /*
> +        * The bytenr of the full on-disk extent.
> +        *
> +        * For regular extents it's btrfs_file_extent_item::disk_bytenr.
> +        * For holes it's EXTENT_MAP_HOLE and for inline extents it's
> +        * EXTENT_MAP_INLINE.
> +        */
> +       u64 disk_bytenr;
> +
>         /*
>          * The full on-disk extent length, matching
>          * btrfs_file_extent_item::disk_num_bytes.
>          */
>         u64 disk_num_bytes;
>
> +       /*
> +        * Offset inside the decompressed extent.
> +        *
> +        * For regular extents it's btrfs_file_extent_item::offset.
> +        * For holes and inline extents it's 0.
> +        */
> +       u64 offset;
> +
>         /*
>          * The decompressed size of the whole on-disk extent, matching
>          * btrfs_file_extent_item::ram_bytes.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 430dce44ebd2..1298afea9503 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -1295,12 +1295,17 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 em->len = btrfs_file_extent_end(path) - extent_start;
>                 em->orig_start = extent_start -
>                         btrfs_file_extent_offset(leaf, fi);
> -               em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
>                 bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
>                 if (bytenr == 0) {
>                         em->block_start = EXTENT_MAP_HOLE;
> +                       em->disk_bytenr = EXTENT_MAP_HOLE;
> +                       em->disk_num_bytes = 0;
> +                       em->offset = 0;
>                         return;
>                 }
> +               em->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +               em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +               em->offset = btrfs_file_extent_offset(leaf, fi);
>                 if (compress_type != BTRFS_COMPRESS_NONE) {
>                         extent_map_set_compression(em, compress_type);
>                         em->block_start = bytenr;
> @@ -1317,8 +1322,10 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 ASSERT(extent_start == 0);
>
>                 em->block_start = EXTENT_MAP_INLINE;
> +               em->disk_bytenr = EXTENT_MAP_INLINE;
>                 em->start = 0;
>                 em->len = fs_info->sectorsize;
> +               em->offset = 0;
>                 /*
>                  * Initialize orig_start and block_len with the same values
>                  * as in inode.c:btrfs_get_extent().
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 7c42565da70c..5133c6705d74 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2350,6 +2350,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
>                 hole_em->orig_start = offset;
>
>                 hole_em->block_start = EXTENT_MAP_HOLE;
> +               hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                 hole_em->block_len = 0;
>                 hole_em->disk_num_bytes = 0;
>                 hole_em->generation = trans->transid;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8ac489fb5e39..7afcdea27782 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -141,6 +141,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>                                        u64 len, u64 orig_start, u64 block_start,
>                                        u64 block_len, u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
> +                                      struct btrfs_file_extent *file_extent,

Can be made const.

>                                        int type);
>
>  static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
> @@ -1152,6 +1153,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         struct btrfs_root *root = inode->root;
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_ordered_extent *ordered;
> +       struct btrfs_file_extent file_extent;
>         struct btrfs_key ins;
>         struct page *locked_page = NULL;
>         struct extent_state *cached = NULL;
> @@ -1198,6 +1200,13 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         lock_extent(io_tree, start, end, &cached);
>
>         /* Here we're doing allocation and writeback of the compressed pages */
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.ram_bytes = async_extent->ram_size;
> +       file_extent.num_bytes = async_extent->ram_size;
> +       file_extent.offset = 0;
> +       file_extent.compression = async_extent->compress_type;
> +
>         em = create_io_em(inode, start,
>                           async_extent->ram_size,       /* len */
>                           start,                        /* orig_start */
> @@ -1206,6 +1215,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>                           ins.offset,                   /* orig_block_len */
>                           async_extent->ram_size,       /* ram_bytes */
>                           async_extent->compress_type,
> +                         &file_extent,
>                           BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
> @@ -1395,6 +1405,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>
>         while (num_bytes > 0) {
>                 struct btrfs_ordered_extent *ordered;
> +               struct btrfs_file_extent file_extent;
>
>                 cur_alloc_size = num_bytes;
>                 ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
> @@ -1431,6 +1442,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                 extent_reserved = true;
>
>                 ram_size = ins.offset;
> +               file_extent.disk_bytenr = ins.objectid;
> +               file_extent.disk_num_bytes = ins.offset;
> +               file_extent.num_bytes = ins.offset;
> +               file_extent.ram_bytes = ins.offset;
> +               file_extent.offset = 0;
> +               file_extent.compression = BTRFS_COMPRESS_NONE;
>
>                 lock_extent(&inode->io_tree, start, start + ram_size - 1,
>                             &cached);
> @@ -1442,6 +1459,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                                   ins.offset, /* orig_block_len */
>                                   ram_size, /* ram_bytes */
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
> +                                 &file_extent,
>                                   BTRFS_ORDERED_REGULAR /* type */);
>                 if (IS_ERR(em)) {
>                         unlock_extent(&inode->io_tree, start,
> @@ -2180,6 +2198,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                                           nocow_args.num_bytes, /* block_len */
>                                           nocow_args.disk_num_bytes, /* orig_block_len */
>                                           ram_bytes, BTRFS_COMPRESS_NONE,
> +                                         &nocow_args.file_extent,
>                                           BTRFS_ORDERED_PREALLOC);
>                         if (IS_ERR(em)) {
>                                 unlock_extent(&inode->io_tree, cur_offset,
> @@ -5012,6 +5031,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>                         hole_em->orig_start = cur_offset;
>
>                         hole_em->block_start = EXTENT_MAP_HOLE;
> +                       hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                         hole_em->block_len = 0;
>                         hole_em->disk_num_bytes = 0;
>                         hole_em->ram_bytes = hole_size;
> @@ -6880,6 +6900,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>         }
>         em->start = EXTENT_MAP_HOLE;
>         em->orig_start = EXTENT_MAP_HOLE;
> +       em->disk_bytenr = EXTENT_MAP_HOLE;
>         em->len = (u64)-1;
>         em->block_len = (u64)-1;
>
> @@ -7045,7 +7066,8 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                                                   const u64 block_len,
>                                                   const u64 orig_block_len,
>                                                   const u64 ram_bytes,
> -                                                 const int type)
> +                                                 const int type,
> +                                                 struct btrfs_file_extent *file_extent)

Can be made const too.

>  {
>         struct extent_map *em = NULL;
>         struct btrfs_ordered_extent *ordered;
> @@ -7054,7 +7076,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                 em = create_io_em(inode, start, len, orig_start, block_start,
>                                   block_len, orig_block_len, ram_bytes,
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
> -                                 type);
> +                                 file_extent, type);
>                 if (IS_ERR(em))
>                         goto out;
>         }
> @@ -7085,6 +7107,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>  {
>         struct btrfs_root *root = inode->root;
>         struct btrfs_fs_info *fs_info = root->fs_info;
> +       struct btrfs_file_extent file_extent;
>         struct extent_map *em;
>         struct btrfs_key ins;
>         u64 alloc_hint;
> @@ -7103,9 +7126,16 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>         if (ret)
>                 return ERR_PTR(ret);
>
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.num_bytes = ins.offset;
> +       file_extent.ram_bytes = ins.offset;
> +       file_extent.offset = 0;
> +       file_extent.compression = BTRFS_COMPRESS_NONE;
>         em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset, start,
>                                      ins.objectid, ins.offset, ins.offset,
> -                                    ins.offset, BTRFS_ORDERED_REGULAR);
> +                                    ins.offset, BTRFS_ORDERED_REGULAR,
> +                                    &file_extent);
>         btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>         if (IS_ERR(em))
>                 btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
> @@ -7348,6 +7378,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>                                        u64 len, u64 orig_start, u64 block_start,
>                                        u64 block_len, u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
> +                                      struct btrfs_file_extent *file_extent,
>                                        int type)
>  {
>         struct extent_map *em;
> @@ -7405,9 +7436,11 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>         em->len = len;
>         em->block_len = block_len;
>         em->block_start = block_start;
> +       em->disk_bytenr = file_extent->disk_bytenr;
>         em->disk_num_bytes = disk_num_bytes;
>         em->ram_bytes = ram_bytes;
>         em->generation = -1;
> +       em->offset = file_extent->offset;
>         em->flags |= EXTENT_FLAG_PINNED;
>         if (type == BTRFS_ORDERED_COMPRESSED)
>                 extent_map_set_compression(em, compress_type);
> @@ -7431,6 +7464,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  {
>         const bool nowait = (iomap_flags & IOMAP_NOWAIT);
>         struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +       struct btrfs_file_extent file_extent;
>         struct extent_map *em = *map;
>         int type;
>         u64 block_start, orig_start, orig_block_len, ram_bytes;
> @@ -7461,7 +7495,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 block_start = em->block_start + (start - em->start);
>
>                 if (can_nocow_extent(inode, start, &len, &orig_start,
> -                                    &orig_block_len, &ram_bytes, NULL, false, false) == 1) {
> +                                    &orig_block_len, &ram_bytes,
> +                                    &file_extent, false, false) == 1) {
>                         bg = btrfs_inc_nocow_writers(fs_info, block_start);
>                         if (bg)
>                                 can_nocow = true;
> @@ -7489,7 +7524,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
>                                               orig_start, block_start,
>                                               len, orig_block_len,
> -                                             ram_bytes, type);
> +                                             ram_bytes, type,
> +                                             &file_extent);
>                 btrfs_dec_nocow_writers(bg);
>                 if (type == BTRFS_ORDERED_PREALLOC) {
>                         free_extent_map(em);
> @@ -9629,6 +9665,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>                 em->orig_start = cur_offset;
>                 em->len = ins.offset;
>                 em->block_start = ins.objectid;
> +               em->disk_bytenr = ins.objectid;
> +               em->offset = 0;
>                 em->block_len = ins.offset;
>                 em->disk_num_bytes = ins.offset;
>                 em->ram_bytes = ins.offset;
> @@ -10195,6 +10233,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         struct extent_changeset *data_reserved = NULL;
>         struct extent_state *cached_state = NULL;
>         struct btrfs_ordered_extent *ordered;
> +       struct btrfs_file_extent file_extent;
>         int compression;
>         size_t orig_count;
>         u64 start, end;
> @@ -10370,10 +10409,16 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>                 goto out_delalloc_release;
>         extent_reserved = true;
>
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.num_bytes = num_bytes;
> +       file_extent.ram_bytes = ram_bytes;
> +       file_extent.offset = encoded->unencoded_offset;
> +       file_extent.compression = compression;
>         em = create_io_em(inode, start, num_bytes,
>                           start - encoded->unencoded_offset, ins.objectid,
>                           ins.offset, ins.offset, ram_bytes, compression,
> -                         BTRFS_ORDERED_COMPRESSED);
> +                         &file_extent, BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
>                 goto out_free_reserved;
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
  2024-05-23  5:03  1% ` [PATCH v3 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent Qu Wenruo
@ 2024-05-23 18:17  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-23 18:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> All parameters after @filepos of btrfs_alloc_ordered_extent() can be
> replaced with the btrfs_file_extent structure.
>
> This patch does the cleanup; some points to note:
>
> - Move btrfs_file_extent structure to ordered-data.h
>   The structure is needed by both btrfs_alloc_ordered_extent() and
>   can_nocow_extent(), and since btrfs_inode.h includes
>   ordered-data.h, we need to move the structure to ordered-data.h.
>
> - Move the special handling of NOCOW/PREALLOC into
>   btrfs_alloc_ordered_extent()
>   This is to allow btrfs_split_ordered_extent() to properly split them
>   for DIO.
>   For now just move the handling into btrfs_alloc_ordered_extent() to
>   simplify the callers.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h  | 14 -----------
>  fs/btrfs/inode.c        | 56 ++++++++---------------------------------
>  fs/btrfs/ordered-data.c | 34 ++++++++++++++++++++-----
>  fs/btrfs/ordered-data.h | 19 +++++++++++---
>  4 files changed, 54 insertions(+), 69 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index dbc85efdf68a..97ce56a60672 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -514,20 +514,6 @@ int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
>                             u32 pgoff, u8 *csum, const u8 * const csum_expected);
>  bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
>                         u32 bio_offset, struct bio_vec *bv);
> -
> -/*
> - * This represents details about the target file extent item of a write
> - * operation.
> - */
> -struct btrfs_file_extent {
> -       u64 disk_bytenr;
> -       u64 disk_num_bytes;
> -       u64 num_bytes;
> -       u64 ram_bytes;
> -       u64 offset;
> -       u8 compression;
> -};
> -
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>                               struct btrfs_file_extent *file_extent,
>                               bool nowait, bool strict);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 445c19d96d10..35f03149b777 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1220,14 +1220,8 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         }
>         free_extent_map(em);
>
> -       ordered = btrfs_alloc_ordered_extent(inode, start,      /* file_offset */
> -                                      async_extent->ram_size,  /* num_bytes */
> -                                      async_extent->ram_size,  /* ram_bytes */
> -                                      ins.objectid,            /* disk_bytenr */
> -                                      ins.offset,              /* disk_num_bytes */
> -                                      0,                       /* offset */
> -                                      1 << BTRFS_ORDERED_COMPRESSED,
> -                                      async_extent->compress_type);
> +       ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
> +                                      1 << BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(ordered)) {
>                 btrfs_drop_extent_map_range(inode, start, end, false);
>                 ret = PTR_ERR(ordered);
> @@ -1463,10 +1457,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                 }
>                 free_extent_map(em);
>
> -               ordered = btrfs_alloc_ordered_extent(inode, start, ram_size,
> -                                       ram_size, ins.objectid, cur_alloc_size,
> -                                       0, 1 << BTRFS_ORDERED_REGULAR,
> -                                       BTRFS_COMPRESS_NONE);
> +               ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
> +                                                    1 << BTRFS_ORDERED_REGULAR);
>                 if (IS_ERR(ordered)) {
>                         unlock_extent(&inode->io_tree, start,
>                                       start + ram_size - 1, &cached);
> @@ -2191,15 +2183,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                 }
>
>                 ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
> -                               nocow_args.file_extent.num_bytes,
> -                               nocow_args.file_extent.num_bytes,
> -                               nocow_args.file_extent.disk_bytenr +
> -                               nocow_args.file_extent.offset,
> -                               nocow_args.file_extent.num_bytes, 0,
> +                               &nocow_args.file_extent,
>                                 is_prealloc
>                                 ? (1 << BTRFS_ORDERED_PREALLOC)
> -                               : (1 << BTRFS_ORDERED_NOCOW),
> -                               BTRFS_COMPRESS_NONE);
> +                               : (1 << BTRFS_ORDERED_NOCOW));
>                 btrfs_dec_nocow_writers(nocow_bg);
>                 if (IS_ERR(ordered)) {
>                         if (is_prealloc) {
> @@ -7054,29 +7041,9 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                         goto out;
>         }
>
> -       /*
> -        * For regular writes, file_extent->offset is always 0,
> -        * thus we really only need file_extent->disk_bytenr, every other length
> -        * (disk_num_bytes/ram_bytes) should match @len and
> -        * file_extent->num_bytes.
> -        *
> -        * For NOCOW, we don't really care about the numbers except
> -        * @start and @len, as we won't insert a file extent
> -        * item at all.
> -        *
> -        * For PREALLOC, we do not use ordered extent members, but
> -        * btrfs_mark_extent_written() handles everything.
> -        *
> -        * So here we always passing 0 as offset for the ordered extent,
> -        * or btrfs_split_ordered_extent() can not handle it correctly.
> -        */
> -       ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
> -                                            file_extent->disk_bytenr +
> -                                            file_extent->offset,
> -                                            len, 0,
> +       ordered = btrfs_alloc_ordered_extent(inode, start, file_extent,
>                                              (1 << type) |
> -                                            (1 << BTRFS_ORDERED_DIRECT),
> -                                            BTRFS_COMPRESS_NONE);
> +                                            (1 << BTRFS_ORDERED_DIRECT));
>         if (IS_ERR(ordered)) {
>                 if (em) {
>                         free_extent_map(em);
> @@ -10396,12 +10363,9 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         }
>         free_extent_map(em);
>
> -       ordered = btrfs_alloc_ordered_extent(inode, start, num_bytes, ram_bytes,
> -                                      ins.objectid, ins.offset,
> -                                      encoded->unencoded_offset,
> +       ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
>                                        (1 << BTRFS_ORDERED_ENCODED) |
> -                                      (1 << BTRFS_ORDERED_COMPRESSED),
> -                                      compression);
> +                                      (1 << BTRFS_ORDERED_COMPRESSED));
>         if (IS_ERR(ordered)) {
>                 btrfs_drop_extent_map_range(inode, start, end, false);
>                 ret = PTR_ERR(ordered);
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index d446d89c2c34..5c2fb0a7c5c8 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -264,17 +264,39 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
>   */
>  struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
>                         struct btrfs_inode *inode, u64 file_offset,
> -                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> -                       u64 disk_num_bytes, u64 offset, unsigned long flags,
> -                       int compress_type)
> +                       struct btrfs_file_extent *file_extent,

Btw, this can be made const.
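
A minimal sketch of the suggested prototype:

        struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
                        struct btrfs_inode *inode, u64 file_offset,
                        const struct btrfs_file_extent *file_extent,
                        unsigned long flags);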

> +                       unsigned long flags)
>  {
>         struct btrfs_ordered_extent *entry;
>
>         ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
>
> -       entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes,
> -                                    disk_bytenr, disk_num_bytes, offset, flags,
> -                                    compress_type);
> +       /*
> +        * For regular writes, we just use the members in @file_extent.
> +        *
> +        * For NOCOW, we don't really care about the numbers except
> +        * @file_offset and file_extent->num_bytes, as we won't insert a
> +        * file extent item at all.
> +        *
> +        * For PREALLOC, we do not use ordered extent members, but
> +        * btrfs_mark_extent_written() handles everything.
> +        *
> +        * So here we always pass 0 as the offset for NOCOW/PREALLOC ordered
> +        * extents, otherwise btrfs_split_ordered_extent() cannot handle them
> +        * correctly.
> +        */
> +       if (flags & ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC)))
> +               entry = alloc_ordered_extent(inode, file_offset,
> +                               file_extent->num_bytes, file_extent->num_bytes,
> +                               file_extent->disk_bytenr + file_extent->offset,
> +                               file_extent->num_bytes, 0, flags,
> +                               file_extent->compression);
> +       else
> +               entry = alloc_ordered_extent(inode, file_offset,
> +                               file_extent->num_bytes, file_extent->ram_bytes,
> +                               file_extent->disk_bytenr,
> +                               file_extent->disk_num_bytes,
> +                               file_extent->offset, flags,
> +                               file_extent->compression);
>         if (!IS_ERR(entry))
>                 insert_ordered_extent(entry);
>         return entry;
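
Assuming the alloc_ordered_extent() arguments map onto the same-named
btrfs_ordered_extent members, the NOCOW/PREALLOC branch above boils down
to (illustration only):

        entry->num_bytes      = file_extent->num_bytes;
        entry->ram_bytes      = file_extent->num_bytes;
        entry->disk_bytenr    = file_extent->disk_bytenr + file_extent->offset;
        entry->disk_num_bytes = file_extent->num_bytes;
        entry->offset         = 0;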
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index 2ec329e2f0f3..31e65f2f4990 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -171,11 +171,24 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>  bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>                                     struct btrfs_ordered_extent **cached,
>                                     u64 file_offset, u64 io_size);
> +
> +/*
> + * This represents details about the target file extent item of a write
> + * operation.
> + */
> +struct btrfs_file_extent {
> +       u64 disk_bytenr;
> +       u64 disk_num_bytes;
> +       u64 num_bytes;
> +       u64 ram_bytes;
> +       u64 offset;
> +       u8 compression;
> +};
> +
>  struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
>                         struct btrfs_inode *inode, u64 file_offset,
> -                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> -                       u64 disk_num_bytes, u64 offset, unsigned long flags,
> -                       int compress_type);
> +                       struct btrfs_file_extent *file_extent,

Same here, const.

Otherwise it looks good, thanks.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> +                       unsigned long flags);
>  void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>                            struct btrfs_ordered_sum *sum);
>  struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 07/11] btrfs: remove extent_map::block_start member
  2024-05-23  5:03  1% ` [PATCH v3 07/11] btrfs: remove extent_map::block_start member Qu Wenruo
@ 2024-05-23 17:56  1%   ` Filipe Manana
  2024-05-23 23:23  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-23 17:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The member extent_map::block_start can be calculated from
> extent_map::disk_bytenr + extent_map::offset for regular extents,
> and is otherwise just extent_map::disk_bytenr.
>
> This is already validated by validate_extent_map().
> Now we can remove the member.
>
> However there is a special case in btrfs_create_dio_extent() where,
> for NOCOW/PREALLOC ordered extents, we cannot directly use the
> resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot
> handle them yet.
> So for that call site, we pass file_extent->disk_bytenr +
> file_extent->offset as disk_bytenr for the ordered extent, and 0 for
> offset.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/compression.c            |  3 +-
>  fs/btrfs/defrag.c                 |  9 ++-
>  fs/btrfs/extent_io.c              | 10 ++--
>  fs/btrfs/extent_map.c             | 55 +++++------------
>  fs/btrfs/extent_map.h             | 22 ++++---
>  fs/btrfs/file-item.c              |  4 --
>  fs/btrfs/file.c                   | 11 ++--
>  fs/btrfs/inode.c                  | 80 ++++++++++++++-----------
>  fs/btrfs/relocation.c             |  1 -
>  fs/btrfs/tests/extent-map-tests.c | 48 ++++++---------
>  fs/btrfs/tests/inode-tests.c      | 99 ++++++++++++++++---------------
>  fs/btrfs/tree-log.c               | 17 +++---
>  fs/btrfs/zoned.c                  |  4 +-
>  include/trace/events/btrfs.h      | 11 +---
>  14 files changed, 168 insertions(+), 206 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index cd88432e7072..07b31d1c0926 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -507,7 +507,8 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>                  */
>                 if (!em || cur < em->start ||
>                     (cur + fs_info->sectorsize > extent_map_end(em)) ||
> -                   (em->block_start >> SECTOR_SHIFT) != orig_bio->bi_iter.bi_sector) {
> +                   (extent_map_block_start(em) >> SECTOR_SHIFT) !=
> +                   orig_bio->bi_iter.bi_sector) {
>                         free_extent_map(em);
>                         unlock_extent(tree, cur, page_end, NULL);
>                         unlock_page(page);
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 025e7f853a68..6fb94e897fc5 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -707,7 +707,6 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
>                  */
>                 if (key.offset > start) {
>                         em->start = start;
> -                       em->block_start = EXTENT_MAP_HOLE;
>                         em->disk_bytenr = EXTENT_MAP_HOLE;
>                         em->disk_num_bytes = 0;
>                         em->ram_bytes = 0;
> @@ -828,7 +827,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
>          */
>         next = defrag_lookup_extent(inode, em->start + em->len, newer_than, locked);
>         /* No more em or hole */
> -       if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
> +       if (!next || next->disk_bytenr >= EXTENT_MAP_LAST_BYTE)
>                 goto out;
>         if (next->flags & EXTENT_FLAG_PREALLOC)
>                 goto out;
> @@ -995,12 +994,12 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>                  * This is for users who want to convert inline extents to
>                  * regular ones through max_inline= mount option.
>                  */
> -               if (em->block_start == EXTENT_MAP_INLINE &&
> +               if (em->disk_bytenr == EXTENT_MAP_INLINE &&
>                     em->len <= inode->root->fs_info->max_inline)
>                         goto next;
>
>                 /* Skip holes and preallocated extents. */
> -               if (em->block_start == EXTENT_MAP_HOLE ||
> +               if (em->disk_bytenr == EXTENT_MAP_HOLE ||
>                     (em->flags & EXTENT_FLAG_PREALLOC))
>                         goto next;
>
> @@ -1065,7 +1064,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
>                  * So if an inline extent passed all above checks, just add it
>                  * for defrag, and be converted to regular extents.
>                  */
> -               if (em->block_start == EXTENT_MAP_INLINE)
> +               if (em->disk_bytenr == EXTENT_MAP_INLINE)
>                         goto add;
>
>                 next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index bf50301ee528..063d7954c9ed 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1083,10 +1083,10 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
>                 iosize = min(extent_map_end(em) - cur, end - cur + 1);
>                 iosize = ALIGN(iosize, blocksize);
>                 if (compress_type != BTRFS_COMPRESS_NONE)
> -                       disk_bytenr = em->block_start;
> +                       disk_bytenr = em->disk_bytenr;
>                 else
> -                       disk_bytenr = em->block_start + extent_offset;
> -               block_start = em->block_start;
> +                       disk_bytenr = extent_map_block_start(em) + extent_offset;
> +               block_start = extent_map_block_start(em);
>                 if (em->flags & EXTENT_FLAG_PREALLOC)
>                         block_start = EXTENT_MAP_HOLE;
>
> @@ -1405,8 +1405,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>                 ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
>                 ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));
>
> -               block_start = em->block_start;
> -               disk_bytenr = em->block_start + extent_offset;
> +               block_start = extent_map_block_start(em);
> +               disk_bytenr = extent_map_block_start(em) + extent_offset;
>
>                 ASSERT(!extent_map_is_compressed(em));
>                 ASSERT(block_start != EXTENT_MAP_HOLE);
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 0c100fe47c43..38a1f07581b0 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -192,9 +192,10 @@ static inline u64 extent_map_block_len(const struct extent_map *em)
>
>  static inline u64 extent_map_block_end(const struct extent_map *em)
>  {
> -       if (em->block_start + extent_map_block_len(em) < em->block_start)
> +       if (extent_map_block_start(em) + extent_map_block_len(em) <
> +           extent_map_block_start(em))
>                 return (u64)-1;
> -       return em->block_start + extent_map_block_len(em);
> +       return extent_map_block_start(em) + extent_map_block_len(em);
>  }
>
>  static bool can_merge_extent_map(const struct extent_map *em)
> @@ -229,11 +230,11 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
>         if (prev->flags != next->flags)
>                 return false;
>
> -       if (next->block_start < EXTENT_MAP_LAST_BYTE - 1)
> -               return next->block_start == extent_map_block_end(prev);
> +       if (next->disk_bytenr < EXTENT_MAP_LAST_BYTE - 1)
> +               return extent_map_block_start(next) == extent_map_block_end(prev);
>
>         /* HOLES and INLINE extents. */
> -       return next->block_start == prev->block_start;
> +       return next->disk_bytenr == prev->disk_bytenr;
>  }
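
A concrete case for the hole/inline branch, with illustrative values:

        /* prev: [0, 4K),  disk_bytenr == EXTENT_MAP_HOLE */
        /* next: [4K, 8K), disk_bytenr == EXTENT_MAP_HOLE */
        /* The sentinels match, so the two holes are mergeable. */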
>
>  /*
> @@ -295,10 +296,9 @@ static void dump_extent_map(struct btrfs_fs_info *fs_info,
>  {
>         if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
>                 return;
> -       btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu block_start=%llu flags=0x%x\n",
> +       btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu flags=0x%x\n",
>                 prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
> -               em->ram_bytes, em->offset, em->block_start,
> -               em->flags);
> +               em->ram_bytes, em->offset, em->flags);
>         ASSERT(0);
>  }
>
> @@ -316,15 +316,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info,
>                 if (em->offset + em->len > em->disk_num_bytes &&
>                     !extent_map_is_compressed(em))
>                         dump_extent_map(fs_info, "disk_num_bytes too small", em);
> -
> -               if (extent_map_is_compressed(em)) {
> -                       if (em->block_start != em->disk_bytenr)
> -                               dump_extent_map(fs_info,
> -                               "mismatch block_start/disk_bytenr/offset", em);
> -               } else if (em->block_start != em->disk_bytenr + em->offset) {
> -                       dump_extent_map(fs_info,
> -                               "mismatch block_start/disk_bytenr/offset", em);
> -               }
>         } else if (em->offset) {
>                 dump_extent_map(fs_info,
>                                 "non-zero offset for hole/inline", em);
> @@ -359,7 +350,6 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>                 if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) {
>                         em->start = merge->start;
>                         em->len += merge->len;
> -                       em->block_start = merge->block_start;
>                         em->generation = max(em->generation, merge->generation);
>
>                         if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> @@ -669,11 +659,9 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
>         start_diff = start - em->start;
>         em->start = start;
>         em->len = end - start;
> -       if (em->block_start < EXTENT_MAP_LAST_BYTE &&
> -           !extent_map_is_compressed(em)) {
> -               em->block_start += start_diff;
> +       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE &&
> +           !extent_map_is_compressed(em))
>                 em->offset += start_diff;
> -       }
>         return add_extent_mapping(inode, em, 0);
>  }
>
> @@ -708,7 +696,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
>          * Tree-checker should have rejected any inline extent with non-zero
>          * file offset. Here just do a sanity check.
>          */
> -       if (em->block_start == EXTENT_MAP_INLINE)
> +       if (em->disk_bytenr == EXTENT_MAP_INLINE)
>                 ASSERT(em->start == 0);
>
>         ret = add_extent_mapping(inode, em, 0);
> @@ -842,7 +830,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                 u64 gen;
>                 unsigned long flags;
>                 bool modified;
> -               bool compressed;
>
>                 if (em_end < end) {
>                         next_em = next_extent_map(em);
> @@ -876,7 +863,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                         goto remove_em;
>
>                 gen = em->generation;
> -               compressed = extent_map_is_compressed(em);
>
>                 if (em->start < start) {
>                         if (!split) {
> @@ -888,15 +874,12 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                         split->start = em->start;
>                         split->len = start - em->start;
>
> -                       if (em->block_start < EXTENT_MAP_LAST_BYTE) {
> -                               split->block_start = em->block_start;
> -
> +                       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
>                                 split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = em->disk_num_bytes;
>                                 split->offset = em->offset;
>                                 split->ram_bytes = em->ram_bytes;
>                         } else {
> -                               split->block_start = em->block_start;
>                                 split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = 0;
>                                 split->offset = 0;
> @@ -919,20 +902,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                         }
>                         split->start = end;
>                         split->len = em_end - end;
> -                       split->block_start = em->block_start;
>                         split->disk_bytenr = em->disk_bytenr;
>                         split->flags = flags;
>                         split->generation = gen;
>
> -                       if (em->block_start < EXTENT_MAP_LAST_BYTE) {
> +                       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
>                                 split->disk_num_bytes = em->disk_num_bytes;
>                                 split->offset = em->offset + end - em->start;
>                                 split->ram_bytes = em->ram_bytes;
> -                               if (!compressed) {
> -                                       const u64 diff = end - em->start;
> -
> -                                       split->block_start += diff;
> -                               }
>                         } else {
>                                 split->disk_num_bytes = 0;
>                                 split->offset = 0;
> @@ -1079,7 +1056,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>
>         ASSERT(em->len == len);
>         ASSERT(!extent_map_is_compressed(em));
> -       ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
> +       ASSERT(em->disk_bytenr < EXTENT_MAP_LAST_BYTE);
>         ASSERT(em->flags & EXTENT_FLAG_PINNED);
>         ASSERT(!(em->flags & EXTENT_FLAG_LOGGING));
>         ASSERT(!list_empty(&em->list));
> @@ -1093,7 +1070,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         split_pre->disk_bytenr = new_logical;
>         split_pre->disk_num_bytes = split_pre->len;
>         split_pre->offset = 0;
> -       split_pre->block_start = new_logical;
>         split_pre->ram_bytes = split_pre->len;
>         split_pre->flags = flags;
>         split_pre->generation = em->generation;
> @@ -1108,10 +1084,9 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         /* Insert the middle extent_map. */
>         split_mid->start = em->start + pre;
>         split_mid->len = em->len - pre;
> -       split_mid->disk_bytenr = em->block_start + pre;
> +       split_mid->disk_bytenr = extent_map_block_start(em) + pre;
>         split_mid->disk_num_bytes = split_mid->len;
>         split_mid->offset = 0;
> -       split_mid->block_start = em->block_start + pre;
>         split_mid->ram_bytes = split_mid->len;
>         split_mid->flags = flags;
>         split_mid->generation = em->generation;
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 5312bb542af0..2bcf7149b44c 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -90,18 +90,6 @@ struct extent_map {
>          */
>         u64 ram_bytes;
>
> -       /*
> -        * The on-disk logical bytenr for the file extent.
> -        *
> -        * For compressed extents it matches btrfs_file_extent_item::disk_bytenr.
> -        * For uncompressed extents it matches
> -        * btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset
> -        *
> -        * For holes it is EXTENT_MAP_HOLE and for inline extents it is
> -        * EXTENT_MAP_INLINE.
> -        */
> -       u64 block_start;
> -
>         /*
>          * Generation of the extent map, for merged em it's the highest
>          * generation of all merged ems.
> @@ -162,6 +150,16 @@ static inline int extent_map_in_tree(const struct extent_map *em)
>         return !RB_EMPTY_NODE(&em->rb_node);
>  }
>
> +static inline u64 extent_map_block_start(const struct extent_map *em)
> +{
> +       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
> +               if (extent_map_is_compressed(em))
> +                       return em->disk_bytenr;
> +               return em->disk_bytenr + em->offset;
> +       }
> +       return em->disk_bytenr;
> +}
> +
>  static inline u64 extent_map_end(const struct extent_map *em)
>  {
>         if (em->start + em->len < em->start)
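
With illustrative numbers, the new helper behaves as follows:

        /* regular:     disk_bytenr = 1M, offset = 4K -> returns 1M + 4K */
        /* compressed:  disk_bytenr = 1M, offset = 4K -> returns 1M      */
        /* hole/inline: disk_bytenr is a sentinel >= EXTENT_MAP_LAST_BYTE,
         *              so it is returned unchanged.                     */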
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 397df6588ce2..55703c833f3d 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -1295,7 +1295,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 em->len = btrfs_file_extent_end(path) - extent_start;
>                 bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
>                 if (bytenr == 0) {
> -                       em->block_start = EXTENT_MAP_HOLE;
>                         em->disk_bytenr = EXTENT_MAP_HOLE;
>                         em->disk_num_bytes = 0;
>                         em->offset = 0;
> @@ -1306,10 +1305,8 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 em->offset = btrfs_file_extent_offset(leaf, fi);
>                 if (compress_type != BTRFS_COMPRESS_NONE) {
>                         extent_map_set_compression(em, compress_type);
> -                       em->block_start = bytenr;
>                 } else {
>                         bytenr += btrfs_file_extent_offset(leaf, fi);
> -                       em->block_start = bytenr;
>                         if (type == BTRFS_FILE_EXTENT_PREALLOC)
>                                 em->flags |= EXTENT_FLAG_PREALLOC;
>                 }
> @@ -1317,7 +1314,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 /* Tree-checker has ensured this. */
>                 ASSERT(extent_start == 0);
>
> -               em->block_start = EXTENT_MAP_INLINE;
>                 em->disk_bytenr = EXTENT_MAP_INLINE;
>                 em->start = 0;
>                 em->len = fs_info->sectorsize;
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 7033ea619073..f0cb7b29cab2 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2348,7 +2348,6 @@ static int fill_holes(struct btrfs_trans_handle *trans,
>                 hole_em->len = end - offset;
>                 hole_em->ram_bytes = hole_em->len;
>
> -               hole_em->block_start = EXTENT_MAP_HOLE;
>                 hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                 hole_em->disk_num_bytes = 0;
>                 hole_em->generation = trans->transid;
> @@ -2381,7 +2380,7 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
>                 return PTR_ERR(em);
>
>         /* Hole or vacuum extent(only exists in no-hole mode) */
> -       if (em->block_start == EXTENT_MAP_HOLE) {
> +       if (em->disk_bytenr == EXTENT_MAP_HOLE) {
>                 ret = 1;
>                 *len = em->start + em->len > *start + *len ?
>                        0 : *start + *len - em->start - em->len;
> @@ -3038,7 +3037,7 @@ static int btrfs_zero_range_check_range_boundary(struct btrfs_inode *inode,
>         if (IS_ERR(em))
>                 return PTR_ERR(em);
>
> -       if (em->block_start == EXTENT_MAP_HOLE)
> +       if (em->disk_bytenr == EXTENT_MAP_HOLE)
>                 ret = RANGE_BOUNDARY_HOLE;
>         else if (em->flags & EXTENT_FLAG_PREALLOC)
>                 ret = RANGE_BOUNDARY_PREALLOC_EXTENT;
> @@ -3102,7 +3101,7 @@ static int btrfs_zero_range(struct inode *inode,
>                 ASSERT(IS_ALIGNED(alloc_start, sectorsize));
>                 len = offset + len - alloc_start;
>                 offset = alloc_start;
> -               alloc_hint = em->block_start + em->len;
> +               alloc_hint = extent_map_block_start(em) + em->len;
>         }
>         free_extent_map(em);
>
> @@ -3120,7 +3119,7 @@ static int btrfs_zero_range(struct inode *inode,
>                                                            mode);
>                         goto out;
>                 }
> -               if (len < sectorsize && em->block_start != EXTENT_MAP_HOLE) {
> +               if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
>                         free_extent_map(em);
>                         ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
>                                                    0);
> @@ -3333,7 +3332,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>                 last_byte = min(extent_map_end(em), alloc_end);
>                 actual_end = min_t(u64, extent_map_end(em), offset + len);
>                 last_byte = ALIGN(last_byte, blocksize);
> -               if (em->block_start == EXTENT_MAP_HOLE ||
> +               if (em->disk_bytenr == EXTENT_MAP_HOLE ||
>                     (cur_offset >= inode->i_size &&
>                      !(em->flags & EXTENT_FLAG_PREALLOC))) {
>                         const u64 range_len = last_byte - cur_offset;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 00bb64fdf938..1b78769d1e41 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -138,7 +138,7 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
>                                      u64 end, struct writeback_control *wbc,
>                                      bool pages_dirty);
>  static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -                                      u64 len, u64 block_start,
> +                                      u64 len,
>                                        u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
>                                        struct btrfs_file_extent *file_extent,
> @@ -1209,7 +1209,6 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>
>         em = create_io_em(inode, start,
>                           async_extent->ram_size,       /* len */
> -                         ins.objectid,                 /* block_start */
>                           ins.offset,                   /* orig_block_len */
>                           async_extent->ram_size,       /* ram_bytes */
>                           async_extent->compress_type,
> @@ -1287,15 +1286,15 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
>                  * first block in this inode and use that as a hint.  If that
>                  * block is also bogus then just don't worry about it.
>                  */
> -               if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> +               if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
>                         free_extent_map(em);
>                         em = search_extent_mapping(em_tree, 0, 0);
> -                       if (em && em->block_start < EXTENT_MAP_LAST_BYTE)
> -                               alloc_hint = em->block_start;
> +                       if (em && em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> +                               alloc_hint = extent_map_block_start(em);
>                         if (em)
>                                 free_extent_map(em);
>                 } else {
> -                       alloc_hint = em->block_start;
> +                       alloc_hint = extent_map_block_start(em);
>                         free_extent_map(em);
>                 }
>         }
> @@ -1451,7 +1450,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                             &cached);
>
>                 em = create_io_em(inode, start, ins.offset, /* len */
> -                                 ins.objectid, /* block_start */
>                                   ins.offset, /* orig_block_len */
>                                   ram_size, /* ram_bytes */
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
> @@ -2188,7 +2186,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         struct extent_map *em;
>
>                         em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
> -                                         nocow_args.disk_bytenr, /* block_start */
>                                           nocow_args.disk_num_bytes, /* orig_block_len */
>                                           ram_bytes, BTRFS_COMPRESS_NONE,
>                                           &nocow_args.file_extent,
> @@ -2703,7 +2700,7 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
>                 if (IS_ERR(em))
>                         return PTR_ERR(em);
>
> -               if (em->block_start != EXTENT_MAP_HOLE)
> +               if (extent_map_block_start(em) != EXTENT_MAP_HOLE)

This should be:   if (em->disk_bytenr != EXTENT_MAP_HOLE)

Everything else looks fine. Thanks.

>                         goto next;
>
>                 em_len = em->len;
> @@ -5022,7 +5019,6 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>                         hole_em->start = cur_offset;
>                         hole_em->len = hole_size;
>
> -                       hole_em->block_start = EXTENT_MAP_HOLE;
>                         hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                         hole_em->disk_num_bytes = 0;
>                         hole_em->ram_bytes = hole_size;
> @@ -6879,7 +6875,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>         if (em) {
>                 if (em->start > start || em->start + em->len <= start)
>                         free_extent_map(em);
> -               else if (em->block_start == EXTENT_MAP_INLINE && page)
> +               else if (em->disk_bytenr == EXTENT_MAP_INLINE && page)
>                         free_extent_map(em);
>                 else
>                         goto out;
> @@ -6982,7 +6978,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>                 /* New extent overlaps with existing one */
>                 em->start = start;
>                 em->len = found_key.offset - start;
> -               em->block_start = EXTENT_MAP_HOLE;
> +               em->disk_bytenr = EXTENT_MAP_HOLE;
>                 goto insert;
>         }
>
> @@ -7006,7 +7002,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>                  *
>                  * Other members are not utilized for inline extents.
>                  */
> -               ASSERT(em->block_start == EXTENT_MAP_INLINE);
> +               ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
>                 ASSERT(em->len == fs_info->sectorsize);
>
>                 ret = read_inline_extent(inode, path, page);
> @@ -7017,7 +7013,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  not_found:
>         em->start = start;
>         em->len = len;
> -       em->block_start = EXTENT_MAP_HOLE;
> +       em->disk_bytenr = EXTENT_MAP_HOLE;
>  insert:
>         ret = 0;
>         btrfs_release_path(path);
> @@ -7048,7 +7044,6 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                                                   struct btrfs_dio_data *dio_data,
>                                                   const u64 start,
>                                                   const u64 len,
> -                                                 const u64 block_start,
>                                                   const u64 orig_block_len,
>                                                   const u64 ram_bytes,
>                                                   const int type,
> @@ -7058,15 +7053,34 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>         struct btrfs_ordered_extent *ordered;
>
>         if (type != BTRFS_ORDERED_NOCOW) {
> -               em = create_io_em(inode, start, len, block_start,
> +               em = create_io_em(inode, start, len,
>                                   orig_block_len, ram_bytes,
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
>                                   file_extent, type);
>                 if (IS_ERR(em))
>                         goto out;
>         }
> +
> +       /*
> +        * For regular writes, file_extent->offset is always 0,
> +        * For regular writes, file_extent->offset is always 0,
> +        * thus we really only need file_extent->disk_bytenr; every other
> +        * length (disk_num_bytes/ram_bytes) should match @len and
> +        * file_extent->num_bytes.
> +        *
> +        * For NOCOW, we don't really care about the numbers except
> +        * @start and @len, as we won't insert a file extent
> +        * item at all.
> +        *
> +        * For PREALLOC, we do not use ordered extent members, but
> +        * btrfs_mark_extent_written() handles everything.
> +        *
> +        * So here we always pass 0 as the offset for the ordered extent,
> +        * otherwise btrfs_split_ordered_extent() cannot handle it correctly.
>         ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
> -                                            block_start, len, 0,
> +                                            file_extent->disk_bytenr +
> +                                            file_extent->offset,
> +                                            len, 0,
>                                              (1 << type) |
>                                              (1 << BTRFS_ORDERED_DIRECT),
>                                              BTRFS_COMPRESS_NONE);
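
Note that for the regular write case described in the comment above,
file_extent->offset is always 0, so the new expression collapses to the
old block_start (a sketch of the equivalence):

        /* regular DIO write: offset == 0, hence
         * file_extent->disk_bytenr + file_extent->offset
         *      == file_extent->disk_bytenr == old block_start
         */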
> @@ -7118,7 +7132,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>         file_extent.offset = 0;
>         file_extent.compression = BTRFS_COMPRESS_NONE;
>         em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
> -                                    ins.objectid, ins.offset,
> +                                    ins.offset,
>                                      ins.offset, BTRFS_ORDERED_REGULAR,
>                                      &file_extent);
>         btrfs_dec_block_group_reservations(fs_info, ins.objectid);
> @@ -7358,7 +7372,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>
>  /* The callers of this must take lock_extent() */
>  static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -                                      u64 len, u64 block_start,
> +                                      u64 len,
>                                        u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
>                                        struct btrfs_file_extent *file_extent,
> @@ -7410,7 +7424,6 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>
>         em->start = start;
>         em->len = len;
> -       em->block_start = block_start;
>         em->disk_bytenr = file_extent->disk_bytenr;
>         em->disk_num_bytes = disk_num_bytes;
>         em->ram_bytes = ram_bytes;
> @@ -7461,13 +7474,13 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>          */
>         if ((em->flags & EXTENT_FLAG_PREALLOC) ||
>             ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
> -            em->block_start != EXTENT_MAP_HOLE)) {
> +            em->disk_bytenr != EXTENT_MAP_HOLE)) {
>                 if (em->flags & EXTENT_FLAG_PREALLOC)
>                         type = BTRFS_ORDERED_PREALLOC;
>                 else
>                         type = BTRFS_ORDERED_NOCOW;
>                 len = min(len, em->len - (start - em->start));
> -               block_start = em->block_start + (start - em->start);
> +               block_start = extent_map_block_start(em) + (start - em->start);
>
>                 if (can_nocow_extent(inode, start, &len,
>                                      &orig_block_len, &ram_bytes,
> @@ -7497,7 +7510,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 space_reserved = true;
>
>                 em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
> -                                             block_start,
>                                               orig_block_len,
>                                               ram_bytes, type,
>                                               &file_extent);
> @@ -7700,7 +7712,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>          * the generic code.
>          */
>         if (extent_map_is_compressed(em) ||
> -           em->block_start == EXTENT_MAP_INLINE) {
> +           em->disk_bytenr == EXTENT_MAP_INLINE) {
>                 free_extent_map(em);
>                 /*
>                  * If we are in a NOWAIT context, return -EAGAIN in order to
> @@ -7794,12 +7806,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
>          * We trim the extents (and move the addr) even though iomap code does
>          * that, since we have locked only the parts we are performing I/O in.
>          */
> -       if ((em->block_start == EXTENT_MAP_HOLE) ||
> +       if ((em->disk_bytenr == EXTENT_MAP_HOLE) ||
>             ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) {
>                 iomap->addr = IOMAP_NULL_ADDR;
>                 iomap->type = IOMAP_HOLE;
>         } else {
> -               iomap->addr = em->block_start + (start - em->start);
> +               iomap->addr = extent_map_block_start(em) + (start - em->start);
>                 iomap->type = IOMAP_MAPPED;
>         }
>         iomap->offset = start;
> @@ -9638,7 +9650,6 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>
>                 em->start = cur_offset;
>                 em->len = ins.offset;
> -               em->block_start = ins.objectid;
>                 em->disk_bytenr = ins.objectid;
>                 em->offset = 0;
>                 em->disk_num_bytes = ins.offset;
> @@ -10104,7 +10115,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>                 goto out_unlock_extent;
>         }
>
> -       if (em->block_start == EXTENT_MAP_INLINE) {
> +       if (em->disk_bytenr == EXTENT_MAP_INLINE) {
>                 u64 extent_start = em->start;
>
>                 /*
> @@ -10125,14 +10136,14 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>          */
>         encoded->len = min_t(u64, extent_map_end(em),
>                              inode->vfs_inode.i_size) - iocb->ki_pos;
> -       if (em->block_start == EXTENT_MAP_HOLE ||
> +       if (em->disk_bytenr == EXTENT_MAP_HOLE ||
>             (em->flags & EXTENT_FLAG_PREALLOC)) {
>                 disk_bytenr = EXTENT_MAP_HOLE;
>                 count = min_t(u64, count, encoded->len);
>                 encoded->len = count;
>                 encoded->unencoded_len = count;
>         } else if (extent_map_is_compressed(em)) {
> -               disk_bytenr = em->block_start;
> +               disk_bytenr = em->disk_bytenr;
>                 /*
>                  * Bail if the buffer isn't large enough to return the whole
>                  * compressed extent.
> @@ -10151,7 +10162,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
>                         goto out_em;
>                 encoded->compression = ret;
>         } else {
> -               disk_bytenr = em->block_start + (start - em->start);
> +               disk_bytenr = extent_map_block_start(em) + (start - em->start);
>                 if (encoded->len > count)
>                         encoded->len = count;
>                 /*
> @@ -10389,7 +10400,6 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         file_extent.offset = encoded->unencoded_offset;
>         file_extent.compression = compression;
>         em = create_io_em(inode, start, num_bytes,
> -                         ins.objectid,
>                           ins.offset, ram_bytes, compression,
>                           &file_extent, BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
> @@ -10693,12 +10703,12 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>                         goto out;
>                 }
>
> -               if (em->block_start == EXTENT_MAP_HOLE) {
> +               if (em->disk_bytenr == EXTENT_MAP_HOLE) {
>                         btrfs_warn(fs_info, "swapfile must not have holes");
>                         ret = -EINVAL;
>                         goto out;
>                 }
> -               if (em->block_start == EXTENT_MAP_INLINE) {
> +               if (em->disk_bytenr == EXTENT_MAP_INLINE) {
>                         /*
>                          * It's unlikely we'll ever actually find ourselves
>                          * here, as a file small enough to fit inline won't be
> @@ -10716,7 +10726,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>                         goto out;
>                 }
>
> -               logical_block_start = em->block_start + (start - em->start);
> +               logical_block_start = extent_map_block_start(em) + (start - em->start);
>                 len = min(len, em->len - (start - em->start));
>                 free_extent_map(em);
>                 em = NULL;
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 68fe52ab445d..bcb665613e78 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2912,7 +2912,6 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
>
>         em->start = start;
>         em->len = end + 1 - start;
> -       em->block_start = block_start;
>         em->disk_bytenr = block_start;
>         em->disk_num_bytes = em->len;
>         em->ram_bytes = em->len;
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> index 0dd270d6c506..ebec4ab361b8 100644
> --- a/fs/btrfs/tests/extent-map-tests.c
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -28,8 +28,8 @@ static int free_extent_map_tree(struct btrfs_inode *inode)
>                 if (refcount_read(&em->refs) != 1) {
>                         ret = -EINVAL;
>                         test_err(
> -"em leak: em (start %llu len %llu block_start %llu disk_num_bytes %llu offset %llu) refs %d",
> -                                em->start, em->len, em->block_start,
> +"em leak: em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu offset %llu) refs %d",
> +                                em->start, em->len, em->disk_bytenr,
>                                  em->disk_num_bytes, em->offset,
>                                  refcount_read(&em->refs));
>
> @@ -77,7 +77,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* Add [0, 16K) */
>         em->start = 0;
>         em->len = SZ_16K;
> -       em->block_start = 0;
>         em->disk_bytenr = 0;
>         em->disk_num_bytes = SZ_16K;
>         em->ram_bytes = SZ_16K;
> @@ -100,7 +99,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>
>         em->start = SZ_16K;
>         em->len = SZ_4K;
> -       em->block_start = SZ_32K; /* avoid merging */
>         em->disk_bytenr = SZ_32K; /* avoid merging */
>         em->disk_num_bytes = SZ_4K;
>         em->ram_bytes = SZ_4K;
> @@ -123,7 +121,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* Add [0, 8K), should return [0, 16K) instead. */
>         em->start = start;
>         em->len = len;
> -       em->block_start = start;
>         em->disk_bytenr = start;
>         em->disk_num_bytes = len;
>         em->ram_bytes = len;
> @@ -141,11 +138,11 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>                 goto out;
>         }
>         if (em->start != 0 || extent_map_end(em) != SZ_16K ||
> -           em->block_start != 0 || em->disk_num_bytes != SZ_16K) {
> +           em->disk_bytenr != 0 || em->disk_num_bytes != SZ_16K) {
>                 test_err(
> -"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu",
> +"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu",
>                          start, start + len, ret, em->start, em->len,
> -                        em->block_start, em->disk_num_bytes);
> +                        em->disk_bytenr, em->disk_num_bytes);
>                 ret = -EINVAL;
>         }
>         free_extent_map(em);
> @@ -179,7 +176,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* Add [0, 1K) */
>         em->start = 0;
>         em->len = SZ_1K;
> -       em->block_start = EXTENT_MAP_INLINE;
>         em->disk_bytenr = EXTENT_MAP_INLINE;
>         em->disk_num_bytes = 0;
>         em->ram_bytes = SZ_1K;
> @@ -202,7 +198,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>
>         em->start = SZ_4K;
>         em->len = SZ_4K;
> -       em->block_start = SZ_4K;
>         em->disk_bytenr = SZ_4K;
>         em->disk_num_bytes = SZ_4K;
>         em->ram_bytes = SZ_4K;
> @@ -225,7 +220,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* Add [0, 1K) */
>         em->start = 0;
>         em->len = SZ_1K;
> -       em->block_start = EXTENT_MAP_INLINE;
>         em->disk_bytenr = EXTENT_MAP_INLINE;
>         em->disk_num_bytes = 0;
>         em->ram_bytes = SZ_1K;
> @@ -242,10 +236,10 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>                 goto out;
>         }
>         if (em->start != 0 || extent_map_end(em) != SZ_1K ||
> -           em->block_start != EXTENT_MAP_INLINE) {
> +           em->disk_bytenr != EXTENT_MAP_INLINE) {
>                 test_err(
> -"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu block_start %llu",
> -                        ret, em->start, em->len, em->block_start);
> +"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu",
> +                        ret, em->start, em->len, em->disk_bytenr);
>                 ret = -EINVAL;
>         }
>         free_extent_map(em);
> @@ -275,7 +269,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
>         /* Add [4K, 8K) */
>         em->start = SZ_4K;
>         em->len = SZ_4K;
> -       em->block_start = SZ_4K;
>         em->disk_bytenr = SZ_4K;
>         em->disk_num_bytes = SZ_4K;
>         em->ram_bytes = SZ_4K;
> @@ -298,7 +291,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
>         /* Add [0, 16K) */
>         em->start = 0;
>         em->len = SZ_16K;
> -       em->block_start = 0;
>         em->disk_bytenr = 0;
>         em->disk_num_bytes = SZ_16K;
>         em->ram_bytes = SZ_16K;
> @@ -321,11 +313,11 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
>          * em->start.
>          */
>         if (start < em->start || start + len > extent_map_end(em) ||
> -           em->start != em->block_start) {
> +           em->start != extent_map_block_start(em)) {
>                 test_err(
> -"case3 [%llu %llu): ret %d em (start %llu len %llu block_start %llu block_len %llu)",
> +"case3 [%llu %llu): ret %d em (start %llu len %llu disk_bytenr %llu block_len %llu)",
>                          start, start + len, ret, em->start, em->len,
> -                        em->block_start, em->disk_num_bytes);
> +                        em->disk_bytenr, em->disk_num_bytes);
>                 ret = -EINVAL;
>         }
>         free_extent_map(em);
> @@ -386,7 +378,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         /* Add [0K, 8K) */
>         em->start = 0;
>         em->len = SZ_8K;
> -       em->block_start = 0;
>         em->disk_bytenr = 0;
>         em->disk_num_bytes = SZ_8K;
>         em->ram_bytes = SZ_8K;
> @@ -409,7 +400,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         /* Add [8K, 32K) */
>         em->start = SZ_8K;
>         em->len = 24 * SZ_1K;
> -       em->block_start = SZ_16K; /* avoid merging */
>         em->disk_bytenr = SZ_16K; /* avoid merging */
>         em->disk_num_bytes = 24 * SZ_1K;
>         em->ram_bytes = 24 * SZ_1K;
> @@ -431,7 +421,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         /* Add [0K, 32K) */
>         em->start = 0;
>         em->len = SZ_32K;
> -       em->block_start = 0;
>         em->disk_bytenr = 0;
>         em->disk_num_bytes = SZ_32K;
>         em->ram_bytes = SZ_32K;
> @@ -451,9 +440,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         }
>         if (start < em->start || start + len > extent_map_end(em)) {
>                 test_err(
> -"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu)",
> -                        start, start + len, ret, em->start, em->len, em->block_start,
> -                        em->disk_num_bytes);
> +"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu)",
> +                        start, start + len, ret, em->start, em->len,
> +                        em->disk_bytenr, em->disk_num_bytes);
>                 ret = -EINVAL;
>         }
>         free_extent_map(em);
> @@ -517,7 +506,6 @@ static int add_compressed_extent(struct btrfs_inode *inode,
>
>         em->start = start;
>         em->len = len;
> -       em->block_start = block_start;
>         em->disk_bytenr = block_start;
>         em->disk_num_bytes = SZ_4K;
>         em->ram_bytes = len;
> @@ -740,7 +728,6 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>
>         em->start = SZ_4K;
>         em->len = SZ_4K;
> -       em->block_start = SZ_16K;
>         em->disk_bytenr = SZ_16K;
>         em->disk_num_bytes = SZ_16K;
>         em->ram_bytes = SZ_16K;
> @@ -795,7 +782,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* [0, 16K), pinned */
>         em->start = 0;
>         em->len = SZ_16K;
> -       em->block_start = 0;
>         em->disk_bytenr = 0;
>         em->disk_num_bytes = SZ_4K;
>         em->ram_bytes = SZ_16K;
> @@ -819,7 +805,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         /* [32K, 48K), not pinned */
>         em->start = SZ_32K;
>         em->len = SZ_16K;
> -       em->block_start = SZ_32K;
>         em->disk_bytenr = SZ_32K;
>         em->disk_num_bytes = SZ_16K;
>         em->ram_bytes = SZ_16K;
> @@ -885,8 +870,9 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>                 goto out;
>         }
>
> -       if (em->block_start != SZ_32K + SZ_4K) {
> -               test_err("em->block_start is %llu, expected 36K", em->block_start);
> +       if (extent_map_block_start(em) != SZ_32K + SZ_4K) {
> +               test_err("block start is %llu, expected 36K",
> +                               extent_map_block_start(em));
>                 goto out;
>         }
>
> diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
> index fc390c18ac95..b8f0d67f4cf6 100644
> --- a/fs/btrfs/tests/inode-tests.c
> +++ b/fs/btrfs/tests/inode-tests.c
> @@ -264,8 +264,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_HOLE) {
> -               test_err("expected a hole, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_HOLE) {
> +               test_err("expected a hole, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         free_extent_map(em);
> @@ -283,8 +283,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_INLINE) {
> -               test_err("expected an inline, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_INLINE) {
> +               test_err("expected an inline, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>
> @@ -321,8 +321,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_HOLE) {
> -               test_err("expected a hole, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_HOLE) {
> +               test_err("expected a hole, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != 4) {
> @@ -344,8 +344,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize - 1) {
> @@ -371,8 +371,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -389,7 +389,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("wrong offset, want 0, have %llu", em->offset);
>                 goto out;
>         }
> -       disk_bytenr = em->block_start;
> +       disk_bytenr = extent_map_block_start(em);
>         orig_start = em->start;
>         offset = em->start + em->len;
>         free_extent_map(em);
> @@ -399,8 +399,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_HOLE) {
> -               test_err("expected a hole, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_HOLE) {
> +               test_err("expected a hole, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -421,8 +421,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != 2 * sectorsize) {
> @@ -441,9 +441,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 goto out;
>         }
>         disk_bytenr += (em->start - orig_start);
> -       if (em->block_start != disk_bytenr) {
> +       if (extent_map_block_start(em) != disk_bytenr) {
>                 test_err("wrong block start, want %llu, have %llu",
> -                        disk_bytenr, em->block_start);
> +                        disk_bytenr, extent_map_block_start(em));
>                 goto out;
>         }
>         offset = em->start + em->len;
> @@ -455,8 +455,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -483,8 +483,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -502,7 +502,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("wrong offset, want 0, have %llu", em->offset);
>                 goto out;
>         }
> -       disk_bytenr = em->block_start;
> +       disk_bytenr = extent_map_block_start(em);
>         orig_start = em->start;
>         offset = em->start + em->len;
>         free_extent_map(em);
> @@ -512,8 +512,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_HOLE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_HOLE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -531,9 +531,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                          em->start - orig_start, em->offset);
>                 goto out;
>         }
> -       if (em->block_start != disk_bytenr + em->offset) {
> +       if (extent_map_block_start(em) != disk_bytenr + em->offset) {
>                 test_err("unexpected block start, wanted %llu, have %llu",
> -                        disk_bytenr + em->offset, em->block_start);
> +                        disk_bytenr + em->offset, extent_map_block_start(em));
>                 goto out;
>         }
>         offset = em->start + em->len;
> @@ -544,8 +544,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != 2 * sectorsize) {
> @@ -564,9 +564,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                          em->start, em->offset, orig_start);
>                 goto out;
>         }
> -       if (em->block_start != disk_bytenr + em->offset) {
> +       if (extent_map_block_start(em) != disk_bytenr + em->offset) {
>                 test_err("unexpected block start, wanted %llu, have %llu",
> -                        disk_bytenr + em->offset, em->block_start);
> +                        disk_bytenr + em->offset, extent_map_block_start(em));
>                 goto out;
>         }
>         offset = em->start + em->len;
> @@ -578,8 +578,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != 2 * sectorsize) {
> @@ -611,8 +611,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -635,7 +635,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                          BTRFS_COMPRESS_ZLIB, extent_map_compression(em));
>                 goto out;
>         }
> -       disk_bytenr = em->block_start;
> +       disk_bytenr = extent_map_block_start(em);
>         orig_start = em->start;
>         offset = em->start + em->len;
>         free_extent_map(em);
> @@ -645,8 +645,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -671,9 +671,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != disk_bytenr) {
> +       if (extent_map_block_start(em) != disk_bytenr) {
>                 test_err("block start does not match, want %llu got %llu",
> -                        disk_bytenr, em->block_start);
> +                        disk_bytenr, extent_map_block_start(em));
>                 goto out;
>         }
>         if (em->start != offset || em->len != 2 * sectorsize) {
> @@ -706,8 +706,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -732,8 +732,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_HOLE) {
> -               test_err("expected a hole extent, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_HOLE) {
> +               test_err("expected a hole extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         /*
> @@ -764,8 +764,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
> +               test_err("expected a real extent, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != offset || em->len != sectorsize) {
> @@ -843,8 +843,8 @@ static int test_hole_first(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != EXTENT_MAP_HOLE) {
> -               test_err("expected a hole, got %llu", em->block_start);
> +       if (em->disk_bytenr != EXTENT_MAP_HOLE) {
> +               test_err("expected a hole, got %llu", em->disk_bytenr);
>                 goto out;
>         }
>         if (em->start != 0 || em->len != sectorsize) {
> @@ -865,8 +865,9 @@ static int test_hole_first(u32 sectorsize, u32 nodesize)
>                 test_err("got an error when we shouldn't have");
>                 goto out;
>         }
> -       if (em->block_start != sectorsize) {
> -               test_err("expected a real extent, got %llu", em->block_start);
> +       if (extent_map_block_start(em) != sectorsize) {
> +               test_err("expected a real extent, got %llu",
> +                        extent_map_block_start(em));
>                 goto out;
>         }
>         if (em->start != sectorsize || em->len != sectorsize) {
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 1d04f0cb6f53..f1e006a5fc4c 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -4578,6 +4578,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
>  {
>         struct btrfs_ordered_extent *ordered;
>         struct btrfs_root *csum_root;
> +       u64 block_start;
>         u64 csum_offset;
>         u64 csum_len;
>         u64 mod_start = em->start;
> @@ -4587,7 +4588,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
>
>         if (inode->flags & BTRFS_INODE_NODATASUM ||
>             (em->flags & EXTENT_FLAG_PREALLOC) ||
> -           em->block_start == EXTENT_MAP_HOLE)
> +           em->disk_bytenr == EXTENT_MAP_HOLE)
>                 return 0;
>
>         list_for_each_entry(ordered, &ctx->ordered_extents, log_list) {
> @@ -4658,9 +4659,10 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
>         }
>
>         /* block start is already adjusted for the file extent offset. */
> -       csum_root = btrfs_csum_root(trans->fs_info, em->block_start);
> -       ret = btrfs_lookup_csums_list(csum_root, em->block_start + csum_offset,
> -                                     em->block_start + csum_offset +
> +       block_start = extent_map_block_start(em);
> +       csum_root = btrfs_csum_root(trans->fs_info, block_start);
> +       ret = btrfs_lookup_csums_list(csum_root, block_start + csum_offset,
> +                                     block_start + csum_offset +
>                                       csum_len - 1, &ordered_sums, false);
>         if (ret < 0)
>                 return ret;
> @@ -4692,6 +4694,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>         struct btrfs_key key;
>         enum btrfs_compression_type compress_type;
>         u64 extent_offset = em->offset;
> +       u64 block_start = extent_map_block_start(em);
>         u64 block_len;
>         int ret;
>
> @@ -4704,10 +4707,10 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>         block_len = em->disk_num_bytes;
>         compress_type = extent_map_compression(em);
>         if (compress_type != BTRFS_COMPRESS_NONE) {
> -               btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start);
> +               btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start);
>                 btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len);
> -       } else if (em->block_start < EXTENT_MAP_LAST_BYTE) {
> -               btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start -
> +       } else if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
> +               btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start -
>                                                         extent_offset);
>                 btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len);
>         }
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index c52a0063f7db..da9de81f340e 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1773,7 +1773,9 @@ static void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered,
>         write_lock(&em_tree->lock);
>         em = search_extent_mapping(em_tree, ordered->file_offset,
>                                    ordered->num_bytes);
> -       em->block_start = logical;
> +       /* The em should be a new COW extent, thus it should not have an offset. */
> +       ASSERT(em->offset == 0);
> +       em->disk_bytenr = logical;
>         free_extent_map(em);
>         write_unlock(&em_tree->lock);
>  }
> diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
> index a1804239812c..ca0f99689a2d 100644
> --- a/include/trace/events/btrfs.h
> +++ b/include/trace/events/btrfs.h
> @@ -291,7 +291,6 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>                 __field(        u64,  ino               )
>                 __field(        u64,  start             )
>                 __field(        u64,  len               )
> -               __field(        u64,  block_start       )
>                 __field(        u32,  flags             )
>                 __field(        int,  refs              )
>         ),
> @@ -301,18 +300,16 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
>                 __entry->ino            = btrfs_ino(inode);
>                 __entry->start          = map->start;
>                 __entry->len            = map->len;
> -               __entry->block_start    = map->block_start;
>                 __entry->flags          = map->flags;
>                 __entry->refs           = refcount_read(&map->refs);
>         ),
>
>         TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu "
> -                 "block_start=%llu(%s) flags=%s refs=%u",
> +                 "flags=%s refs=%u",
>                   show_root_type(__entry->root_objectid),
>                   __entry->ino,
>                   __entry->start,
>                   __entry->len,
> -                 show_map_type(__entry->block_start),
>                   show_map_flags(__entry->flags),
>                   __entry->refs)
>  );
> @@ -2608,7 +2605,6 @@ TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
>                 __field(        u64,    root_id         )
>                 __field(        u64,    start           )
>                 __field(        u64,    len             )
> -               __field(        u64,    block_start     )
>                 __field(        u32,    flags           )
>         ),
>
> @@ -2617,15 +2613,12 @@ TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
>                 __entry->root_id        = inode->root->root_key.objectid;
>                 __entry->start          = em->start;
>                 __entry->len            = em->len;
> -               __entry->block_start    = em->block_start;
>                 __entry->flags          = em->flags;
>         ),
>
> -       TP_printk_btrfs(
> -"ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
> +       TP_printk_btrfs("ino=%llu root=%llu(%s) start=%llu len=%llu flags=%s",
>                         __entry->ino, show_root_type(__entry->root_id),
>                         __entry->start, __entry->len,
> -                       show_map_type(__entry->block_start),
>                         show_map_flags(__entry->flags))
>  );
>
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v4 0/6] part3 trivial adjustments for return variable coding style
  2024-05-21 18:10  1% ` [PATCH v4 0/6] part3 trivial adjustments for return variable coding style David Sterba
@ 2024-05-23 17:18  1%   ` David Sterba
  2024-05-24  3:09  1%     ` Anand Jain
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-23 17:18 UTC (permalink / raw)
  To: David Sterba; +Cc: Anand Jain, linux-btrfs

On Tue, May 21, 2024 at 08:10:03PM +0200, David Sterba wrote:
> On Wed, May 22, 2024 at 01:11:06AM +0800, Anand Jain wrote:
> > This is v4 of part 3 of the series, containing renaming with optimization of the
> > return variable.
> > 
> > v3 part3:
> >   https://lore.kernel.org/linux-btrfs/cover.1715783315.git.anand.jain@oracle.com/
> > v2 part2:
> >   https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
> > v1:
> >   https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
> > 
> > Anand Jain (6):
> >   btrfs: rename err to ret in btrfs_cleanup_fs_roots()
> >   btrfs: rename ret to err in btrfs_recover_relocation()
> >   btrfs: rename ret to ret2 in btrfs_recover_relocation()
> >   btrfs: rename err to ret in btrfs_recover_relocation()
> >   btrfs: rename err to ret in btrfs_drop_snapshot()
> >   btrfs: rename err to ret in btrfs_find_orphan_roots()
> 
> 1-5 look ok to me, for patch 6 there's the ret = 0 reset question sent
> to v3.

You can add 1-5 to for-next with

Reviewed-by: David Sterba <dsterba@suse.com>

and only resend 6.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (8 preceding siblings ...)
  2024-05-22 22:21  1% ` Qu Wenruo
@ 2024-05-23 17:03  1% ` David Sterba
  9 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-23 17:03 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 22, 2024 at 03:36:28PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> A few places can unnecessarily create an empty transaction and then commit
> it, when the goal is just to catch the current transaction and wait for
> its commit to complete. This results in wasting IO, time and rotation of
> the precious backup roots in the super block. Details in the change logs.
> The patches are all independent, except patch 4 that applies on top of
> patch 3 (but could have been done in any order really, they are independent).
> 
> Filipe Manana (7):
>   btrfs: qgroup: avoid start/commit empty transaction when flushing reservations
>   btrfs: avoid create and commit empty transaction when committing super
>   btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient
>   btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate()
>   btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned()
>   btrfs: add and use helper to commit the current transaction
>   btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()

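If I read patch 6 correctly, the new helper boils down to roughly the
following (a sketch from memory, not the exact code from the patch):

    int btrfs_commit_current_transaction(struct btrfs_root *root)
    {
            struct btrfs_trans_handle *trans;

            trans = btrfs_attach_transaction_barrier(root);
            if (IS_ERR(trans)) {
                    int ret = PTR_ERR(trans);

                    /* -ENOENT here means there is no running transaction. */
                    return (ret == -ENOENT) ? 0 : ret;
            }
            return btrfs_commit_transaction(trans);
    }

That is, attach to the currently running transaction if there is one
and commit it, instead of creating and committing an empty transaction
just to rotate the backup roots.
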
Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 04/11] btrfs: introduce extra sanity checks for extent maps
  2024-05-23  5:03  1% ` [PATCH v3 04/11] btrfs: introduce extra sanity checks for extent maps Qu Wenruo
@ 2024-05-23 16:57  2%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-23 16:57 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Since the extent_map structure has all the needed members to represent
> a file extent directly, we can apply all the file extent sanity checks
> to an extent map.
>
> The new sanity checks cross-check both the old members
> (block_start/block_len/orig_start) and the new members
> (disk_bytenr/disk_num_bytes/offset).
>
> There is a special case for the offset/orig_start/start cross check: we
> only do such a sanity check for compressed extents, as only compressed
> read and encoded write really utilize orig_start.
> This is proven by the later orig_start cleanup patch.
>
> The checks happen at the following times:
>
> - add_extent_mapping()
>   This is for newly added extent maps
>
> - replace_extent_mapping()
>   This is for btrfs_drop_extent_map_range() and split_extent_map()
>
> - try_merge_map()
>
> For a lot of call sites we have to properly populate all the members to
> pass the sanity checks, and the following code needs extra
> modification:
>
> - setup_file_extents() from inode-tests
>   The file extent layout of setup_file_extents() is already so invalid
>   that the tree-checker would reject most of it in the real world.
>
>   However, there is one special unaligned regular extent which has
>   mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1).
>   So instead of dropping the whole test case, here we just unify
>   disk_num_bytes and ram_bytes to 4096 - 1.
>
> - test_case_7() from extent-map-tests
>   An extent is inserted with a 16K length, but its on-disk extent size
>   is only 4K.
>   This means it must be a compressed extent, so set the compressed flag
>   for it.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent_map.c             | 60 +++++++++++++++++++++++++++++++
>  fs/btrfs/relocation.c             |  4 +++
>  fs/btrfs/tests/extent-map-tests.c | 56 ++++++++++++++++++++++++++++-
>  fs/btrfs/tests/inode-tests.c      |  2 +-
>  4 files changed, 120 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index c7d2393692e6..b157f30ac241 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -283,8 +283,62 @@ static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *nex
>         next->offset = new_offset;
>  }
>
> +static void dump_extent_map(struct btrfs_fs_info *fs_info,
> +                           const char *prefix, struct extent_map *em)
> +{
> +       if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
> +               return;
> +       btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu orig_start=%llu block_start=%llu block_len=%llu flags=0x%x\n",
> +               prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
> +               em->ram_bytes, em->offset, em->orig_start, em->block_start,
> +               em->block_len, em->flags);
> +       ASSERT(0);
> +}
> +
> +/* Internal sanity checks for btrfs debug builds. */
> +static void validate_extent_map(struct btrfs_fs_info *fs_info,
> +                               struct extent_map *em)
> +{
> +       if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
> +               return;
> +       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
> +               if (em->disk_num_bytes == 0)
> +                       dump_extent_map(fs_info, "zero disk_num_bytes", em);
> +               if (em->offset + em->len > em->ram_bytes)
> +                       dump_extent_map(fs_info, "ram_bytes too small", em);
> +               if (em->offset + em->len > em->disk_num_bytes &&
> +                   !extent_map_is_compressed(em))
> +                       dump_extent_map(fs_info, "disk_num_bytes too small", em);
> +
> +               if (extent_map_is_compressed(em)) {
> +                       if (em->block_start != em->disk_bytenr)
> +                               dump_extent_map(fs_info,
> +                               "mismatch block_start/disk_bytenr/offset", em);
> +                       if (em->disk_num_bytes != em->block_len)
> +                               dump_extent_map(fs_info,
> +                               "mismatch disk_num_bytes/block_len", em);
> +                       /*
> +                        * Here we only check the start/orig_start/offset for
> +                        * compressed extents as that's the only case where
> +                        * orig_start is utilized.
> +                        */
> +                       if (em->orig_start != em->start - em->offset)
> +                               dump_extent_map(fs_info,
> +                               "mismatch orig_start/offset/start", em);
> +
> +               } else if (em->block_start != em->disk_bytenr + em->offset) {
> +                       dump_extent_map(fs_info,
> +                               "mismatch block_start/disk_bytenr/offset", em);
> +               }
> +       } else if (em->offset) {
> +               dump_extent_map(fs_info,
> +                               "non-zero offset for hole/inline", em);
> +       }

I think I mentioned this before, but since these are all unexpected to
happen and we're in a critical section that can have a lot of
concurrency, adding unlikely() here would be good to have.
You can do that afterwards in a separate patch. I know some of these
checks get removed in later patches.
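
Something like the following is what I have in mind, as an untested
sketch only (reusing the helpers from this patch):

    if (unlikely(em->disk_num_bytes == 0))
            dump_extent_map(fs_info, "zero disk_num_bytes", em);
    if (unlikely(em->offset + em->len > em->ram_bytes))
            dump_extent_map(fs_info, "ram_bytes too small", em);

The remaining branches would get the same treatment.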

Reviewed-by: Filipe Manana <fdmanana@suse.com>

> +}
> +
>  static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>  {
> +       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         struct extent_map_tree *tree = &inode->extent_tree;
>         struct extent_map *merge = NULL;
>         struct rb_node *rb;
> @@ -319,6 +373,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>                                 merge_ondisk_extents(merge, em);
>                         em->flags |= EXTENT_FLAG_MERGED;
>
> +                       validate_extent_map(fs_info, em);
>                         rb_erase(&merge->rb_node, &tree->root);
>                         RB_CLEAR_NODE(&merge->rb_node);
>                         free_extent_map(merge);
> @@ -334,6 +389,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>                 em->block_len += merge->block_len;
>                 if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
>                         merge_ondisk_extents(em, merge);
> +               validate_extent_map(fs_info, em);
>                 rb_erase(&merge->rb_node, &tree->root);
>                 RB_CLEAR_NODE(&merge->rb_node);
>                 em->generation = max(em->generation, merge->generation);
> @@ -445,6 +501,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
>
>         lockdep_assert_held_write(&tree->lock);
>
> +       validate_extent_map(fs_info, em);
>         ret = tree_insert(&tree->root, em);
>         if (ret)
>                 return ret;
> @@ -548,10 +605,13 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
>                                    struct extent_map *new,
>                                    int modified)
>  {
> +       struct btrfs_fs_info *fs_info = inode->root->fs_info;
>         struct extent_map_tree *tree = &inode->extent_tree;
>
>         lockdep_assert_held_write(&tree->lock);
>
> +       validate_extent_map(fs_info, new);
> +
>         WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
>         ASSERT(extent_map_in_tree(cur));
>         if (!(cur->flags & EXTENT_FLAG_LOGGING))
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 5f1a909a1d91..151ed1ebd291 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -2911,9 +2911,13 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
>                 return -ENOMEM;
>
>         em->start = start;
> +       em->orig_start = start;
>         em->len = end + 1 - start;
>         em->block_len = em->len;
>         em->block_start = block_start;
> +       em->disk_bytenr = block_start;
> +       em->disk_num_bytes = em->len;
> +       em->ram_bytes = em->len;
>         em->flags |= EXTENT_FLAG_PINNED;
>
>         lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> index c511a1297956..e73ac7a0869c 100644
> --- a/fs/btrfs/tests/extent-map-tests.c
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -78,6 +78,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         em->len = SZ_16K;
>         em->block_start = 0;
>         em->block_len = SZ_16K;
> +       em->disk_bytenr = 0;
> +       em->disk_num_bytes = SZ_16K;
> +       em->ram_bytes = SZ_16K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -96,9 +99,13 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         }
>
>         em->start = SZ_16K;
> +       em->orig_start = SZ_16K;
>         em->len = SZ_4K;
>         em->block_start = SZ_32K; /* avoid merging */
>         em->block_len = SZ_4K;
> +       em->disk_bytenr = SZ_32K; /* avoid merging */
> +       em->disk_num_bytes = SZ_4K;
> +       em->ram_bytes = SZ_4K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -117,9 +124,13 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>
>         /* Add [0, 8K), should return [0, 16K) instead. */
>         em->start = start;
> +       em->orig_start = start;
>         em->len = len;
>         em->block_start = start;
>         em->block_len = len;
> +       em->disk_bytenr = start;
> +       em->disk_num_bytes = len;
> +       em->ram_bytes = len;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -174,6 +185,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         em->len = SZ_1K;
>         em->block_start = EXTENT_MAP_INLINE;
>         em->block_len = (u64)-1;
> +       em->disk_bytenr = EXTENT_MAP_INLINE;
> +       em->disk_num_bytes = 0;
> +       em->ram_bytes = SZ_1K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -192,9 +206,13 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         }
>
>         em->start = SZ_4K;
> +       em->orig_start = SZ_4K;
>         em->len = SZ_4K;
>         em->block_start = SZ_4K;
>         em->block_len = SZ_4K;
> +       em->disk_bytenr = SZ_4K;
> +       em->disk_num_bytes = SZ_4K;
> +       em->ram_bytes = SZ_4K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -216,6 +234,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         em->len = SZ_1K;
>         em->block_start = EXTENT_MAP_INLINE;
>         em->block_len = (u64)-1;
> +       em->disk_bytenr = EXTENT_MAP_INLINE;
> +       em->disk_num_bytes = 0;
> +       em->ram_bytes = SZ_1K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -262,9 +283,13 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
>
>         /* Add [4K, 8K) */
>         em->start = SZ_4K;
> +       em->orig_start = SZ_4K;
>         em->len = SZ_4K;
>         em->block_start = SZ_4K;
>         em->block_len = SZ_4K;
> +       em->disk_bytenr = SZ_4K;
> +       em->disk_num_bytes = SZ_4K;
> +       em->ram_bytes = SZ_4K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -286,6 +311,9 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
>         em->len = SZ_16K;
>         em->block_start = 0;
>         em->block_len = SZ_16K;
> +       em->disk_bytenr = 0;
> +       em->disk_num_bytes = SZ_16K;
> +       em->ram_bytes = SZ_16K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, start, len);
>         write_unlock(&em_tree->lock);
> @@ -372,6 +400,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         em->len = SZ_8K;
>         em->block_start = 0;
>         em->block_len = SZ_8K;
> +       em->disk_bytenr = 0;
> +       em->disk_num_bytes = SZ_8K;
> +       em->ram_bytes = SZ_8K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -390,9 +421,13 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>
>         /* Add [8K, 32K) */
>         em->start = SZ_8K;
> +       em->orig_start = SZ_8K;
>         em->len = 24 * SZ_1K;
>         em->block_start = SZ_16K; /* avoid merging */
>         em->block_len = 24 * SZ_1K;
> +       em->disk_bytenr = SZ_16K; /* avoid merging */
> +       em->disk_num_bytes = 24 * SZ_1K;
> +       em->ram_bytes = 24 * SZ_1K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -410,9 +445,13 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
>         }
>         /* Add [0K, 32K) */
>         em->start = 0;
> +       em->orig_start = 0;
>         em->len = SZ_32K;
>         em->block_start = 0;
>         em->block_len = SZ_32K;
> +       em->disk_bytenr = 0;
> +       em->disk_num_bytes = SZ_32K;
> +       em->ram_bytes = SZ_32K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, start, len);
>         write_unlock(&em_tree->lock);
> @@ -494,9 +533,13 @@ static int add_compressed_extent(struct btrfs_inode *inode,
>         }
>
>         em->start = start;
> +       em->orig_start = start;
>         em->len = len;
>         em->block_start = block_start;
>         em->block_len = SZ_4K;
> +       em->disk_bytenr = block_start;
> +       em->disk_num_bytes = SZ_4K;
> +       em->ram_bytes = len;
>         em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
> @@ -715,9 +758,13 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         }
>
>         em->start = SZ_4K;
> +       em->orig_start = SZ_4K;
>         em->len = SZ_4K;
>         em->block_start = SZ_16K;
>         em->block_len = SZ_16K;
> +       em->disk_bytenr = SZ_16K;
> +       em->disk_num_bytes = SZ_16K;
> +       em->ram_bytes = SZ_16K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, 0, SZ_8K);
>         write_unlock(&em_tree->lock);
> @@ -771,7 +818,10 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>         em->len = SZ_16K;
>         em->block_start = 0;
>         em->block_len = SZ_4K;
> -       em->flags |= EXTENT_FLAG_PINNED;
> +       em->disk_bytenr = 0;
> +       em->disk_num_bytes = SZ_4K;
> +       em->ram_bytes = SZ_16K;
> +       em->flags |= (EXTENT_FLAG_PINNED | EXTENT_FLAG_COMPRESS_ZLIB);
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> @@ -790,9 +840,13 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>
>         /* [32K, 48K), not pinned */
>         em->start = SZ_32K;
> +       em->orig_start = SZ_32K;
>         em->len = SZ_16K;
>         em->block_start = SZ_32K;
>         em->block_len = SZ_16K;
> +       em->disk_bytenr = SZ_32K;
> +       em->disk_num_bytes = SZ_16K;
> +       em->ram_bytes = SZ_16K;
>         write_lock(&em_tree->lock);
>         ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>         write_unlock(&em_tree->lock);
> diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
> index 99da9d34b77a..0895c6e06812 100644
> --- a/fs/btrfs/tests/inode-tests.c
> +++ b/fs/btrfs/tests/inode-tests.c
> @@ -117,7 +117,7 @@ static void setup_file_extents(struct btrfs_root *root, u32 sectorsize)
>
>         /* Now for a regular extent */
>         insert_extent(root, offset, sectorsize - 1, sectorsize - 1, 0,
> -                     disk_bytenr, sectorsize, BTRFS_FILE_EXTENT_REG, 0, slot);
> +                     disk_bytenr, sectorsize - 1, BTRFS_FILE_EXTENT_REG, 0, slot);
>         slot++;
>         disk_bytenr += sectorsize;
>         offset += sectorsize - 1;
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 2%]

* Re: [PATCH v3 03/11] btrfs: introduce new members for extent_map
  2024-05-23  5:03  1% ` [PATCH v3 03/11] btrfs: introduce new members for extent_map Qu Wenruo
@ 2024-05-23 16:53  1%   ` Filipe Manana
  2024-05-23 23:19  2%     ` Qu Wenruo
  2024-05-23 18:21  1%   ` Filipe Manana
  1 sibling, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-23 16:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, David Sterba

On Thu, May 23, 2024 at 6:04 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Introduce two new members for extent_map:
>
> - disk_bytenr
> - offset
>
> Both match the members of the same name inside
> btrfs_file_extent_item.
>
> For now this patch only touches those members when:
>
> - Reading btrfs_file_extent_items from disk
> - Inserting new holes
> - Merging two extent maps
>   With the new disk_bytenr and disk_num_bytes, merging becomes a
>   little more complex, as we have 3 different cases:
>
>   * Both extent maps are referring to the same data extents
>     |<----- data extent A ----->|
>        |<- em 1 ->|<- em 2 ->|
>
>   * Both extent maps are referring to different data extents
>     |<-- data extent A -->|<-- data extent B -->|
>                |<- em 1 ->|<- em 2 ->|
>
>   * One of the extent maps is referring to a merged and larger data
>     extent that covers both extent maps
>
>     This is not really a valid case outside of some selftests,
>     so that test case will be removed.
>
>   A new helper merge_ondisk_extents() is introduced to handle
>   the above valid cases.
>
> To properly assign values for those new members, a new btrfs_file_extent
> parameter is introduced to all the involved call sites.
>
> - For NOCOW writes, the btrfs_file_extent is exposed from
>   can_nocow_file_extent().
>
> - For other writes, the members can be easily calculated,
>   as most of them have a 0 offset and utilize the whole on-disk data
>   extent.
>   The exception is encoded write, but thankfully that interface provides
>   the offset directly along with all the other needed info.
>
> For now, the old members (block_start/block_len/orig_start) coexist
> with the new members (disk_bytenr/offset), while all the critical code
> still uses only the old members.
>
> The cleanup will happen later, after both the old and new members are
> properly validated.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/defrag.c     |  4 +++
>  fs/btrfs/extent_map.c | 78 ++++++++++++++++++++++++++++++++++++++++---
>  fs/btrfs/extent_map.h | 17 ++++++++++
>  fs/btrfs/file-item.c  |  9 ++++-
>  fs/btrfs/file.c       |  1 +
>  fs/btrfs/inode.c      | 57 +++++++++++++++++++++++++++----
>  6 files changed, 155 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
> index 407ccec3e57e..242c5469f4ba 100644
> --- a/fs/btrfs/defrag.c
> +++ b/fs/btrfs/defrag.c
> @@ -709,6 +709,10 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
>                         em->start = start;
>                         em->orig_start = start;
>                         em->block_start = EXTENT_MAP_HOLE;
> +                       em->disk_bytenr = EXTENT_MAP_HOLE;
> +                       em->disk_num_bytes = 0;
> +                       em->ram_bytes = 0;
> +                       em->offset = 0;
>                         em->len = key.offset - start;
>                         break;
>                 }
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index a9d60d1eade9..c7d2393692e6 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -229,6 +229,60 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
>         return next->block_start == prev->block_start;
>  }
>
> +/*
> + * Handle the ondisk data extents merge for @prev and @next.
> + *
> + * Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes.
> + * For now only uncompressed regular extent can be merged.
> + *
> + * @prev and @next will be both updated to point to the new merged range.
> + * Thus one of them should be removed by the caller.
> + */
> +static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next)
> +{
> +       u64 new_disk_bytenr;
> +       u64 new_disk_num_bytes;
> +       u64 new_offset;
> +
> +       /* @prev and @next should not be compressed. */
> +       ASSERT(!extent_map_is_compressed(prev));
> +       ASSERT(!extent_map_is_compressed(next));
> +
> +       /*
> +        * There are two different cases where @prev and @next can be merged.
> +        *
> +        * 1) They are referring to the same data extent
> +        * |<----- data extent A ----->|
> +        *    |<- prev ->|<- next ->|
> +        *
> +        * 2) They are referring to different data extents but still adjacent
> +        *
> +        * |<-- data extent A -->|<-- data extent B -->|
> +        *            |<- prev ->|<- next ->|
> +        *
> +        * The calculation here always merge the data extents first, then update
> +        * @offset using the new data extents.
> +        *
> +        * For case 1), the merged data extent would be the same.
> +        * For case 2), we just merge the two data extents into one.
> +        */
> +       new_disk_bytenr = min(prev->disk_bytenr, next->disk_bytenr);
> +       new_disk_num_bytes = max(prev->disk_bytenr + prev->disk_num_bytes,
> +                                next->disk_bytenr + next->disk_num_bytes) -
> +                            new_disk_bytenr;
> +       new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr;
> +
> +       prev->disk_bytenr = new_disk_bytenr;
> +       prev->disk_num_bytes = new_disk_num_bytes;
> +       prev->ram_bytes = new_disk_num_bytes;
> +       prev->offset = new_offset;
> +
> +       next->disk_bytenr = new_disk_bytenr;
> +       next->disk_num_bytes = new_disk_num_bytes;
> +       next->ram_bytes = new_disk_num_bytes;
> +       next->offset = new_offset;
> +}
> +
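
To make the math concrete with made up numbers: say @prev maps file
range [0K, 4K) to disk extent A = [1M, 1M + 4K) with offset 0, and
@next maps [4K, 8K) to the adjacent extent B = [1M + 4K, 1M + 8K).
Then new_disk_bytenr = min(1M, 1M + 4K) = 1M, new_disk_num_bytes =
max(1M + 4K, 1M + 8K) - 1M = 8K and new_offset = 1M + 0 - 1M = 0, so
both ems end up describing a single 8K extent at 1M. Case 1) falls out
the same way since both ems already share one disk extent.
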
>  static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>  {
>         struct extent_map_tree *tree = &inode->extent_tree;
> @@ -260,6 +314,9 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>                         em->block_len += merge->block_len;
>                         em->block_start = merge->block_start;
>                         em->generation = max(em->generation, merge->generation);
> +
> +                       if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> +                               merge_ondisk_extents(merge, em);
>                         em->flags |= EXTENT_FLAG_MERGED;
>
>                         rb_erase(&merge->rb_node, &tree->root);
> @@ -275,6 +332,8 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
>         if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
>                 em->len += merge->len;
>                 em->block_len += merge->block_len;
> +               if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> +                       merge_ondisk_extents(em, merge);
>                 rb_erase(&merge->rb_node, &tree->root);
>                 RB_CLEAR_NODE(&merge->rb_node);
>                 em->generation = max(em->generation, merge->generation);
> @@ -562,6 +621,7 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
>             !extent_map_is_compressed(em)) {
>                 em->block_start += start_diff;
>                 em->block_len = em->len;
> +               em->offset += start_diff;
>         }
>         return add_extent_mapping(inode, em, 0);
>  }
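
The new offset bump mirrors what already happens to block_start above
it: with made up numbers, an em that had offset 4K into its disk extent
and gets its start moved forward by start_diff = 4K ends up with offset
4K + 4K = 8K, while disk_bytenr stays untouched.
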
> @@ -785,14 +845,18 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                                         split->block_len = em->block_len;
>                                 else
>                                         split->block_len = split->len;
> +                               split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = max(split->block_len,
>                                                             em->disk_num_bytes);
> +                               split->offset = em->offset;
>                                 split->ram_bytes = em->ram_bytes;
>                         } else {
>                                 split->orig_start = split->start;
>                                 split->block_len = 0;
>                                 split->block_start = em->block_start;
> +                               split->disk_bytenr = em->disk_bytenr;
>                                 split->disk_num_bytes = 0;
> +                               split->offset = 0;
>                                 split->ram_bytes = split->len;
>                         }
>
> @@ -813,13 +877,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                         split->start = end;
>                         split->len = em_end - end;
>                         split->block_start = em->block_start;
> +                       split->disk_bytenr = em->disk_bytenr;
>                         split->flags = flags;
>                         split->generation = gen;
>
>                         if (em->block_start < EXTENT_MAP_LAST_BYTE) {
>                                 split->disk_num_bytes = max(em->block_len,
>                                                             em->disk_num_bytes);
> -
> +                               split->offset = em->offset + end - em->start;
>                                 split->ram_bytes = em->ram_bytes;
>                                 if (compressed) {
>                                         split->block_len = em->block_len;
> @@ -832,10 +897,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
>                                         split->orig_start = em->orig_start;
>                                 }
>                         } else {
> +                               split->disk_num_bytes = 0;
> +                               split->offset = 0;
>                                 split->ram_bytes = split->len;
>                                 split->orig_start = split->start;
>                                 split->block_len = 0;
> -                               split->disk_num_bytes = 0;

Why move the assignment of ->disk_num_bytes?
This is sort of distracting; it's an unnecessary change.

>                         }
>
>                         if (extent_map_in_tree(em)) {
> @@ -989,10 +1055,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         /* First, replace the em with a new extent_map starting from * em->start */
>         split_pre->start = em->start;
>         split_pre->len = pre;
> +       split_pre->disk_bytenr = new_logical;

We are already setting disk_bytenr to the same value a few lines below.

> +       split_pre->disk_num_bytes = split_pre->len;
> +       split_pre->offset = 0;
>         split_pre->orig_start = split_pre->start;
>         split_pre->block_start = new_logical;
>         split_pre->block_len = split_pre->len;
> -       split_pre->disk_num_bytes = split_pre->block_len;

Here, where slit_pre->block_len has the same value as split->pre_len.
This sort of apparently accidental change makes it harder to review.

>         split_pre->ram_bytes = split_pre->len;
>         split_pre->flags = flags;
>         split_pre->generation = em->generation;
> @@ -1007,10 +1075,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
>         /* Insert the middle extent_map. */
>         split_mid->start = em->start + pre;
>         split_mid->len = em->len - pre;
> +       split_mid->disk_bytenr = em->block_start + pre;

Same here.

> +       split_mid->disk_num_bytes = split_mid->len;
> +       split_mid->offset = 0;
>         split_mid->orig_start = split_mid->start;
>         split_mid->block_start = em->block_start + pre;
>         split_mid->block_len = split_mid->len;
> -       split_mid->disk_num_bytes = split_mid->block_len;

Which relates to this.

Otherwise it looks fine, and could be fixed up when cherry-picked to for-next.

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

>         split_mid->ram_bytes = split_mid->len;
>         split_mid->flags = flags;
>         split_mid->generation = em->generation;
> diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
> index 2b7bbffd594b..0b1a8e409377 100644
> --- a/fs/btrfs/extent_map.h
> +++ b/fs/btrfs/extent_map.h
> @@ -70,12 +70,29 @@ struct extent_map {
>          */
>         u64 orig_start;
>
> +       /*
> +        * The bytenr of the full on-disk extent.
> +        *
> +        * For regular extents it's btrfs_file_extent_item::disk_bytenr.
> +        * For holes it's EXTENT_MAP_HOLE and for inline extents it's
> +        * EXTENT_MAP_INLINE.
> +        */
> +       u64 disk_bytenr;
> +
>         /*
>          * The full on-disk extent length, matching
>          * btrfs_file_extent_item::disk_num_bytes.
>          */
>         u64 disk_num_bytes;
>
> +       /*
> +        * Offset inside the decompressed extent.
> +        *
> +        * For regular extents it's btrfs_file_extent_item::offset.
> +        * For holes and inline extents it's 0.
> +        */
> +       u64 offset;
> +
>         /*
>          * The decompressed size of the whole on-disk extent, matching
>          * btrfs_file_extent_item::ram_bytes.
> diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
> index 430dce44ebd2..1298afea9503 100644
> --- a/fs/btrfs/file-item.c
> +++ b/fs/btrfs/file-item.c
> @@ -1295,12 +1295,17 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 em->len = btrfs_file_extent_end(path) - extent_start;
>                 em->orig_start = extent_start -
>                         btrfs_file_extent_offset(leaf, fi);
> -               em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
>                 bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
>                 if (bytenr == 0) {
>                         em->block_start = EXTENT_MAP_HOLE;
> +                       em->disk_bytenr = EXTENT_MAP_HOLE;
> +                       em->disk_num_bytes = 0;
> +                       em->offset = 0;
>                         return;
>                 }
> +               em->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> +               em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> +               em->offset = btrfs_file_extent_offset(leaf, fi);
>                 if (compress_type != BTRFS_COMPRESS_NONE) {
>                         extent_map_set_compression(em, compress_type);
>                         em->block_start = bytenr;
> @@ -1317,8 +1322,10 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
>                 ASSERT(extent_start == 0);
>
>                 em->block_start = EXTENT_MAP_INLINE;
> +               em->disk_bytenr = EXTENT_MAP_INLINE;
>                 em->start = 0;
>                 em->len = fs_info->sectorsize;
> +               em->offset = 0;
>                 /*
>                  * Initialize orig_start and block_len with the same values
>                  * as in inode.c:btrfs_get_extent().
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 7c42565da70c..5133c6705d74 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2350,6 +2350,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
>                 hole_em->orig_start = offset;
>
>                 hole_em->block_start = EXTENT_MAP_HOLE;
> +               hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                 hole_em->block_len = 0;
>                 hole_em->disk_num_bytes = 0;
>                 hole_em->generation = trans->transid;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8ac489fb5e39..7afcdea27782 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -141,6 +141,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>                                        u64 len, u64 orig_start, u64 block_start,
>                                        u64 block_len, u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
> +                                      struct btrfs_file_extent *file_extent,
>                                        int type);
>
>  static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
> @@ -1152,6 +1153,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         struct btrfs_root *root = inode->root;
>         struct btrfs_fs_info *fs_info = root->fs_info;
>         struct btrfs_ordered_extent *ordered;
> +       struct btrfs_file_extent file_extent;
>         struct btrfs_key ins;
>         struct page *locked_page = NULL;
>         struct extent_state *cached = NULL;
> @@ -1198,6 +1200,13 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         lock_extent(io_tree, start, end, &cached);
>
>         /* Here we're doing allocation and writeback of the compressed pages */
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.ram_bytes = async_extent->ram_size;
> +       file_extent.num_bytes = async_extent->ram_size;
> +       file_extent.offset = 0;
> +       file_extent.compression = async_extent->compress_type;
> +
>         em = create_io_em(inode, start,
>                           async_extent->ram_size,       /* len */
>                           start,                        /* orig_start */
> @@ -1206,6 +1215,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>                           ins.offset,                   /* orig_block_len */
>                           async_extent->ram_size,       /* ram_bytes */
>                           async_extent->compress_type,
> +                         &file_extent,
>                           BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
> @@ -1395,6 +1405,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>
>         while (num_bytes > 0) {
>                 struct btrfs_ordered_extent *ordered;
> +               struct btrfs_file_extent file_extent;
>
>                 cur_alloc_size = num_bytes;
>                 ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
> @@ -1431,6 +1442,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                 extent_reserved = true;
>
>                 ram_size = ins.offset;
> +               file_extent.disk_bytenr = ins.objectid;
> +               file_extent.disk_num_bytes = ins.offset;
> +               file_extent.num_bytes = ins.offset;
> +               file_extent.ram_bytes = ins.offset;
> +               file_extent.offset = 0;
> +               file_extent.compression = BTRFS_COMPRESS_NONE;
>
>                 lock_extent(&inode->io_tree, start, start + ram_size - 1,
>                             &cached);
> @@ -1442,6 +1459,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                                   ins.offset, /* orig_block_len */
>                                   ram_size, /* ram_bytes */
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
> +                                 &file_extent,
>                                   BTRFS_ORDERED_REGULAR /* type */);
>                 if (IS_ERR(em)) {
>                         unlock_extent(&inode->io_tree, start,
> @@ -2180,6 +2198,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                                           nocow_args.num_bytes, /* block_len */
>                                           nocow_args.disk_num_bytes, /* orig_block_len */
>                                           ram_bytes, BTRFS_COMPRESS_NONE,
> +                                         &nocow_args.file_extent,
>                                           BTRFS_ORDERED_PREALLOC);
>                         if (IS_ERR(em)) {
>                                 unlock_extent(&inode->io_tree, cur_offset,
> @@ -5012,6 +5031,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
>                         hole_em->orig_start = cur_offset;
>
>                         hole_em->block_start = EXTENT_MAP_HOLE;
> +                       hole_em->disk_bytenr = EXTENT_MAP_HOLE;
>                         hole_em->block_len = 0;
>                         hole_em->disk_num_bytes = 0;
>                         hole_em->ram_bytes = hole_size;
> @@ -6880,6 +6900,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>         }
>         em->start = EXTENT_MAP_HOLE;
>         em->orig_start = EXTENT_MAP_HOLE;
> +       em->disk_bytenr = EXTENT_MAP_HOLE;
>         em->len = (u64)-1;
>         em->block_len = (u64)-1;
>
> @@ -7045,7 +7066,8 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                                                   const u64 block_len,
>                                                   const u64 orig_block_len,
>                                                   const u64 ram_bytes,
> -                                                 const int type)
> +                                                 const int type,
> +                                                 struct btrfs_file_extent *file_extent)
>  {
>         struct extent_map *em = NULL;
>         struct btrfs_ordered_extent *ordered;
> @@ -7054,7 +7076,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                 em = create_io_em(inode, start, len, orig_start, block_start,
>                                   block_len, orig_block_len, ram_bytes,
>                                   BTRFS_COMPRESS_NONE, /* compress_type */
> -                                 type);
> +                                 file_extent, type);
>                 if (IS_ERR(em))
>                         goto out;
>         }
> @@ -7085,6 +7107,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>  {
>         struct btrfs_root *root = inode->root;
>         struct btrfs_fs_info *fs_info = root->fs_info;
> +       struct btrfs_file_extent file_extent;
>         struct extent_map *em;
>         struct btrfs_key ins;
>         u64 alloc_hint;
> @@ -7103,9 +7126,16 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>         if (ret)
>                 return ERR_PTR(ret);
>
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.num_bytes = ins.offset;
> +       file_extent.ram_bytes = ins.offset;
> +       file_extent.offset = 0;
> +       file_extent.compression = BTRFS_COMPRESS_NONE;
>         em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset, start,
>                                      ins.objectid, ins.offset, ins.offset,
> -                                    ins.offset, BTRFS_ORDERED_REGULAR);
> +                                    ins.offset, BTRFS_ORDERED_REGULAR,
> +                                    &file_extent);
>         btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>         if (IS_ERR(em))
>                 btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
> @@ -7348,6 +7378,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>                                        u64 len, u64 orig_start, u64 block_start,
>                                        u64 block_len, u64 disk_num_bytes,
>                                        u64 ram_bytes, int compress_type,
> +                                      struct btrfs_file_extent *file_extent,
>                                        int type)
>  {
>         struct extent_map *em;
> @@ -7405,9 +7436,11 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>         em->len = len;
>         em->block_len = block_len;
>         em->block_start = block_start;
> +       em->disk_bytenr = file_extent->disk_bytenr;
>         em->disk_num_bytes = disk_num_bytes;
>         em->ram_bytes = ram_bytes;
>         em->generation = -1;
> +       em->offset = file_extent->offset;
>         em->flags |= EXTENT_FLAG_PINNED;
>         if (type == BTRFS_ORDERED_COMPRESSED)
>                 extent_map_set_compression(em, compress_type);
> @@ -7431,6 +7464,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>  {
>         const bool nowait = (iomap_flags & IOMAP_NOWAIT);
>         struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +       struct btrfs_file_extent file_extent;
>         struct extent_map *em = *map;
>         int type;
>         u64 block_start, orig_start, orig_block_len, ram_bytes;
> @@ -7461,7 +7495,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 block_start = em->block_start + (start - em->start);
>
>                 if (can_nocow_extent(inode, start, &len, &orig_start,
> -                                    &orig_block_len, &ram_bytes, NULL, false, false) == 1) {
> +                                    &orig_block_len, &ram_bytes,
> +                                    &file_extent, false, false) == 1) {
>                         bg = btrfs_inc_nocow_writers(fs_info, block_start);
>                         if (bg)
>                                 can_nocow = true;
> @@ -7489,7 +7524,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
>                                               orig_start, block_start,
>                                               len, orig_block_len,
> -                                             ram_bytes, type);
> +                                             ram_bytes, type,
> +                                             &file_extent);
>                 btrfs_dec_nocow_writers(bg);
>                 if (type == BTRFS_ORDERED_PREALLOC) {
>                         free_extent_map(em);
> @@ -9629,6 +9665,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>                 em->orig_start = cur_offset;
>                 em->len = ins.offset;
>                 em->block_start = ins.objectid;
> +               em->disk_bytenr = ins.objectid;
> +               em->offset = 0;
>                 em->block_len = ins.offset;
>                 em->disk_num_bytes = ins.offset;
>                 em->ram_bytes = ins.offset;
> @@ -10195,6 +10233,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         struct extent_changeset *data_reserved = NULL;
>         struct extent_state *cached_state = NULL;
>         struct btrfs_ordered_extent *ordered;
> +       struct btrfs_file_extent file_extent;
>         int compression;
>         size_t orig_count;
>         u64 start, end;
> @@ -10370,10 +10409,16 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>                 goto out_delalloc_release;
>         extent_reserved = true;
>
> +       file_extent.disk_bytenr = ins.objectid;
> +       file_extent.disk_num_bytes = ins.offset;
> +       file_extent.num_bytes = num_bytes;
> +       file_extent.ram_bytes = ram_bytes;
> +       file_extent.offset = encoded->unencoded_offset;
> +       file_extent.compression = compression;
>         em = create_io_em(inode, start, num_bytes,
>                           start - encoded->unencoded_offset, ins.objectid,
>                           ins.offset, ins.offset, ram_bytes, compression,
> -                         BTRFS_ORDERED_COMPRESSED);
> +                         &file_extent, BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
>                 goto out_free_reserved;
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: move fiemap code into its own file
  2024-05-22 20:15  1% [PATCH] btrfs: move fiemap code into its own file fdmanana
  2024-05-23 10:25  1% ` Johannes Thumshirn
@ 2024-05-23 16:33  1% ` David Sterba
  1 sibling, 0 replies; 200+ results
From: David Sterba @ 2024-05-23 16:33 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 22, 2024 at 09:15:59PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently the core of the fiemap code lives in extent_io.c, which does
> not make any sense because it's not related to extent IO at all (and it
> was not related before the big rewrite of fiemap I did some time ago either).
> The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
> an inode operation.
> 
> Since there's a significant amount of fiemap code, move all of it into a
> dedicated file, including its entry point inode.c:btrfs_fiemap().
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] fstests: btrfs/301: handle auto-removed qgroups
    2024-05-21  1:19  1% ` Boris Burkov
@ 2024-05-23 15:43  1% ` Anand Jain
  1 sibling, 0 replies; 200+ results
From: Anand Jain @ 2024-05-23 15:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, fstests

On 07/05/2024 15:06, Qu Wenruo wrote:
> There are always attempts to auto-remove empty qgroups after dropping a
> subvolume.
> 
> For squota mode, not all qgroups can or should be dropped, as there are
> common cases where the dropped subvolume is still referred to by other
> snapshots.
> In that case, the numbers can only be freed when the last referencer
> is dropped.
> 
> The latest kernel attempt would only try to drop empty qgroups for
> squota mode.
> But even with such safe change, the test case still needs to handle
> auto-removed qgroups, by explicitly echoing "0", or later calculation
> would break bash grammar.
> 
> This patch would add extra handling for such removed qgroups, to be
> future proof for qgroup auto-removal behavior change.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Looks good.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

Applied.

Thanks, Anand

> ---
>   tests/btrfs/301 | 12 ++++++++++--
>   1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/btrfs/301 b/tests/btrfs/301
> index db469724..bb18ab04 100755
> --- a/tests/btrfs/301
> +++ b/tests/btrfs/301
> @@ -51,9 +51,17 @@ _require_fio $fio_config
>   get_qgroup_usage()
>   {
>   	local qgroupid=$1
> +	local output
>   
> -	$BTRFS_UTIL_PROG qgroup show --sync --raw $SCRATCH_MNT | \
> -				grep "$qgroupid" | $AWK_PROG '{print $3}'
> +	output=$($BTRFS_UTIL_PROG qgroup show --sync --raw $SCRATCH_MNT | \
> +		 grep "$qgroupid" | $AWK_PROG '{print $3}')
> +	# The qgroup is auto-removed, this can only happen if its numbers are
> +	# already all zeros, so here we only need to explicitly echo "0".
> +	if [ -z "$output" ]; then
> +		echo "0"
> +	else
> +		echo "$output"
> +	fi
>   }
>   
>   get_subvol_usage()


^ permalink raw reply	[relevance 1%]

* [PATCH v4 2/2] btrfs: reserve new relocation block-group after successful relocation
  2024-05-23 15:21  1% [PATCH v4 0/2] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
  2024-05-23 15:21  1% ` [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
@ 2024-05-23 15:21  1% ` Johannes Thumshirn
  2024-05-24  8:33  1%   ` Naohiro Aota
  1 sibling, 1 reply; 200+ results
From: Johannes Thumshirn @ 2024-05-23 15:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

After we've committed a relocation transaction, we know we have just freed
up space. Set it as a hint for the next relocation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/relocation.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 5f1a909a1d91..02a9ebf96a95 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3811,6 +3811,13 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	ret = btrfs_commit_transaction(trans);
 	if (ret && !err)
 		err = ret;
+
+	/*
+	 * We know we have just freed space, set it as hint for the
+	 * next relocation.
+	 */
+	if (!err)
+		btrfs_reserve_relocation_bg(fs_info);
 out_free:
 	ret = clean_dirty_subvols(rc);
 	if (ret < 0 && !err)

-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-23 15:21  1% [PATCH v4 0/2] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
@ 2024-05-23 15:21  1% ` Johannes Thumshirn
  2024-05-24  8:31  1%   ` Naohiro Aota
  2024-05-24 14:07  1%   ` Filipe Manana
  2024-05-23 15:21  1% ` [PATCH v4 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  1 sibling, 2 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-23 15:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reserve one zone as a data relocation target on each mount. If we already
find one empty block group, there's no need to force a chunk allocation,
but we can use this empty data block group as our relocation target.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/block-group.c | 17 +++++++++++++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/zoned.c       | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  3 +++
 4 files changed, 89 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 9910bae89966..1195f6721c90 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1500,6 +1500,15 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 			btrfs_put_block_group(block_group);
 			continue;
 		}
+
+		spin_lock(&fs_info->relocation_bg_lock);
+		if (block_group->start == fs_info->data_reloc_bg) {
+			btrfs_put_block_group(block_group);
+			spin_unlock(&fs_info->relocation_bg_lock);
+			continue;
+		}
+		spin_unlock(&fs_info->relocation_bg_lock);
+
 		spin_unlock(&fs_info->unused_bgs_lock);
 
 		btrfs_discard_cancel_work(&fs_info->discard_ctl, block_group);
@@ -1835,6 +1844,14 @@ void btrfs_reclaim_bgs_work(struct work_struct *work)
 				      bg_list);
 		list_del_init(&bg->bg_list);
 
+		spin_lock(&fs_info->relocation_bg_lock);
+		if (bg->start == fs_info->data_reloc_bg) {
+			btrfs_put_block_group(bg);
+			spin_unlock(&fs_info->relocation_bg_lock);
+			continue;
+		}
+		spin_unlock(&fs_info->relocation_bg_lock);
+
 		space_info = bg->space_info;
 		spin_unlock(&fs_info->unused_bgs_lock);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 78d3966232ae..16bb52bcb69e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3547,6 +3547,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	}
 	btrfs_discard_resume(fs_info);
 
+	btrfs_reserve_relocation_bg(fs_info);
+
 	if (fs_info->uuid_root &&
 	    (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
 	     fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c52a0063f7db..d291cf4f565e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -17,6 +17,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -2637,3 +2638,69 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
 	}
 	spin_unlock(&fs_info->zone_active_bgs_lock);
 }
+
+static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
+{
+	struct btrfs_block_group *bg;
+
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		list_for_each_entry(bg, &sinfo->block_groups[i], list) {
+			if (bg->flags != flags)
+				continue;
+			if (bg->used == 0)
+				return bg->start;
+		}
+	}
+
+	return 0;
+}
+
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group *bg;
+	u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
+	u64 bytenr = 0;
+
+	lockdep_assert_not_held(&fs_info->relocation_bg_lock);
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	if (fs_info->data_reloc_bg)
+		return;
+
+	bytenr = find_empty_block_group(sinfo, flags);
+	if (!bytenr) {
+		int ret;
+
+		trans = btrfs_join_transaction(tree_root);
+		if (IS_ERR(trans))
+			return;
+
+		ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
+		btrfs_end_transaction(trans);
+		if (ret)
+			return;
+
+		bytenr = find_empty_block_group(sinfo, flags);
+		if (!bytenr)
+			return;
+
+	}
+
+	bg = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg)
+		return;
+
+	if (!btrfs_zone_activate(bg))
+		bytenr = 0;
+
+	btrfs_put_block_group(bg);
+
+	spin_lock(&fs_info->relocation_bg_lock);
+	fs_info->data_reloc_bg = bytenr;
+	spin_unlock(&fs_info->relocation_bg_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index ff605beb84ef..56c1c19d52bc 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -95,6 +95,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 				struct btrfs_space_info *space_info, bool do_finish);
 void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 
 static inline int btrfs_get_dev_zone_info_all_devices(struct btrfs_fs_info *fs_info)
@@ -264,6 +265,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 
 static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
 
+static inline void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 0/2] btrfs: zoned: always set aside a zone for relocation
@ 2024-05-23 15:21  1% Johannes Thumshirn
  2024-05-23 15:21  1% ` [PATCH v4 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
  2024-05-23 15:21  1% ` [PATCH v4 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  0 siblings, 2 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-23 15:21 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Filipe Manana, Johannes Thumshirn

For zoned filesystems we heavily rely on relocation for garbage collection,
as we cannot do any in-place updates of disk blocks.

But there can be situations where we run out of space for doing the
relocation.

To solve this, always have a zone reserved for relocation.

This is a subset of another approach to this problem I've submitted in
https://lore.kernel.org/r/20240328-hans-v1-0-4cd558959407@kernel.org

---
Changes in v4:
- Skip data_reloc_bg in delete_unused_bgs() and reclaim_bgs_work()
- Link to v3: https://lore.kernel.org/r/20240521-zoned-gc-v3-0-7db9742454c7@kernel.org

Changes in v3:
- Rename btrfs_reserve_relocation_zone -> btrfs_reserve_relocation_bg
- Bail out if we already have a relocation bg set
- Link to v2: https://lore.kernel.org/r/20240515-zoned-gc-v2-0-20c7cb9763cd@kernel.org

Changes in v2:
- Incorporate Naohiro's review
- Link to v1: https://lore.kernel.org/r/20240514-zoned-gc-v1-0-109f1a6c7447@kernel.org

---
Johannes Thumshirn (2):
      btrfs: zoned: reserve relocation block-group on mount
      btrfs: reserve new relocation block-group after successful relocation

 fs/btrfs/block-group.c | 17 +++++++++++++
 fs/btrfs/disk-io.c     |  2 ++
 fs/btrfs/relocation.c  |  7 ++++++
 fs/btrfs/zoned.c       | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  3 +++
 5 files changed, 96 insertions(+)
---
base-commit: 2aabf192868a0f6d9ee3e35f9b0a8d97c77a46da
change-id: 20240514-zoned-gc-2ce793459eb7

Best regards,
-- 
Johannes Thumshirn <jth@kernel.org>


^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 6/6] btrfs: rename and optimize return variable in btrfs_find_orphan_roots
  2024-05-21 17:59  1%       ` David Sterba
@ 2024-05-23 14:35  1%         ` Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-23 14:35 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

On 22/05/2024 01:59, David Sterba wrote:
> On Wed, May 22, 2024 at 01:10:08AM +0800, Anand Jain wrote:
>>
>>
>> On 5/21/24 23:18, David Sterba wrote:
>>> On Thu, May 16, 2024 at 07:12:15PM +0800, Anand Jain wrote:
>>>> The variable err is the actual return value of this function, and the
>>>> variable ret is a helper variable for err, which actually is not
>>>> needed and can be handled just by err, which is renamed to ret.
>>>>
>>>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>>>> ---
>>>> v3: drop ret2 as there is no need for it.
>>>> v2: n/a
>>>>    fs/btrfs/root-tree.c | 32 ++++++++++++++++----------------
>>>>    1 file changed, 16 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>>>> index 33962671a96c..c11b0bccf513 100644
>>>> --- a/fs/btrfs/root-tree.c
>>>> +++ b/fs/btrfs/root-tree.c
>>>> @@ -220,8 +220,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>>>    	struct btrfs_path *path;
>>>>    	struct btrfs_key key;
>>>>    	struct btrfs_root *root;
>>>> -	int err = 0;
>>>> -	int ret;
>>>> +	int ret = 0;
>>>>    
>>>>    	path = btrfs_alloc_path();
>>>>    	if (!path)
>>>> @@ -235,18 +234,19 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>>>    		u64 root_objectid;
>>>>    
>>>>    		ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
>>>> -		if (ret < 0) {
>>>> -			err = ret;
>>>> +		if (ret < 0)
>>>>    			break;
>>>> -		}
>>>> +		ret = 0;
>>>
>>> Should this be handled when ret > 0? This would be unexpected and
>>> probably means a corruption but simply overwriting the value does not
>>> seem right.
>>>
>>
>> Agreed.
>>
>> +               if (ret > 0)
>> +                       ret = 0;
>>
>> is much neater.
> 
> That's not what I meant. When btrfs_search_slot() returns 1 then the key
> was not found and could be inserted, path points to the slot. This is
> done in many other places, so in the orphan root lookup it should be
> also handled.

For the scenarios where ret > 0 is expected, we generally do varied
tasks. However, here we need to reassign ret = 0; originally, err
remained 0 and the function returned 0.
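
If you mean explicitly handling the not-found case, that would look
like this (just a sketch):

	ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
	if (ret < 0)
		break;
	if (ret > 0) {
		/* Key not found, path points at the insert slot. */
		ret = 0;
	}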

Or, my bad, I didn't understand which usual error handling pattern you 
are referring to.

Thanks, Anand



^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions
  2024-05-22 22:21  1% ` Qu Wenruo
@ 2024-05-23 14:02  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-23 14:02 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 22, 2024 at 11:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/5/23 00:06, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > A few places can unnecessarily create an empty transaction and then commit
> > it, when the goal is just to catch the current transaction and wait for
> > its commit to complete. This results in wasting IO, time and rotation of
> > the precious backup roots in the super block. Details in the change logs.
> > The patches are all independent, except patch 4 that applies on top of
> > patch 3 (but could have been done in any order really, they are independent).
>
> Looks good to me.
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
> Have you considered outputting a warning if we're committing an empty
> transaction (for debug build)?
>
> That would prevent such problem from happening again.

It's not really a bug, just inefficient behaviour with side effects
for this particular type of use case.

An empty transaction can happen in other scenarios too, like:

btrfs_start_transaction()

do something that fails, call btrfs_end_transaction() and return error
to user space

In that case no transaction abort happens since we haven't modified
anything, and if no one else uses that transaction until it's
committed, it's an "empty" transaction.
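
Roughly, such an error path looks like this (a sketch, not any
specific function):

    trans = btrfs_start_transaction(root, 1);
    if (IS_ERR(trans))
            return PTR_ERR(trans);

    ret = do_something(trans);   /* hypothetical step that fails early */
    if (ret) {
            /* Nothing was modified, so no transaction abort happens. */
            btrfs_end_transaction(trans);
            return ret;
    }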

So the warning is not feasible.

Thanks.

>
> Thanks,
> Qu
> >
> > Filipe Manana (7):
> >    btrfs: qgroup: avoid start/commit empty transaction when flushing reservations
> >    btrfs: avoid create and commit empty transaction when committing super
> >    btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient
> >    btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate()
> >    btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned()
> >    btrfs: add and use helper to commit the current transaction
> >    btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()
> >
> >   fs/btrfs/disk-io.c     |  8 +-------
> >   fs/btrfs/qgroup.c      | 31 +++++--------------------------
> >   fs/btrfs/scrub.c       |  6 +-----
> >   fs/btrfs/send.c        | 32 ++++++++------------------------
> >   fs/btrfs/space-info.c  |  9 +--------
> >   fs/btrfs/super.c       | 11 +----------
> >   fs/btrfs/transaction.c | 19 +++++++++++++++++++
> >   fs/btrfs/transaction.h |  1 +
> >   8 files changed, 37 insertions(+), 80 deletions(-)
> >

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: move fiemap code into its own file
  2024-05-22 20:15  1% [PATCH] btrfs: move fiemap code into its own file fdmanana
@ 2024-05-23 10:25  1% ` Johannes Thumshirn
  2024-05-23 16:33  1% ` David Sterba
  1 sibling, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-23 10:25 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (10 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 11/11] btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent() Qu Wenruo
@ 2024-05-23 10:23  1% ` Johannes Thumshirn
  2024-05-23 18:26  2% ` Filipe Manana
  12 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-23 10:23 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[relevance 1%]

* [syzbot] [btrfs?] [overlayfs?] possible deadlock in ovl_copy_up_flags
@ 2024-05-23 10:09  1% syzbot
  0 siblings, 0 replies; 200+ results
From: syzbot @ 2024-05-23 10:09 UTC (permalink / raw)
  To: amir73il, brauner, clm, dsterba, jack, josef, linux-btrfs,
	linux-fsdevel, linux-kernel, linux-unionfs, miklos, mszeredi,
	syzkaller-bugs, viro

Hello,

syzbot found the following issue on:

HEAD commit:    c75962170e49 Add linux-next specific files for 20240517
git tree:       linux-next
console+strace: https://syzkaller.appspot.com/x/log.txt?x=1438a5cc980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=fba88766130220e8
dashboard link: https://syzkaller.appspot.com/bug?extid=85e58cdf5b3136471d4b
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=115f3e58980000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14f4c97c980000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/21696f8048a3/disk-c7596217.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/b8c71f928633/vmlinux-c7596217.xz
kernel image: https://storage.googleapis.com/syzbot-assets/350bfc6c0a6a/bzImage-c7596217.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/7f6a8434331c/mount_0.gz

The issue was bisected to:

commit 9a87907de3597a339cc129229d1a20bc7365ea5f
Author: Miklos Szeredi <mszeredi@redhat.com>
Date:   Thu May 2 18:35:57 2024 +0000

    ovl: implement tmpfile

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=120f89cc980000
final oops:     https://syzkaller.appspot.com/x/report.txt?x=110f89cc980000
console output: https://syzkaller.appspot.com/x/log.txt?x=160f89cc980000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+85e58cdf5b3136471d4b@syzkaller.appspotmail.com
Fixes: 9a87907de359 ("ovl: implement tmpfile")

============================================
WARNING: possible recursive locking detected
6.9.0-next-20240517-syzkaller #0 Not tainted
--------------------------------------------
syz-executor489/5091 is trying to acquire lock:
ffff88802f7f2420 (sb_writers#4){.+.+}-{0:0}, at: ovl_do_copy_up fs/overlayfs/copy_up.c:967 [inline]
ffff88802f7f2420 (sb_writers#4){.+.+}-{0:0}, at: ovl_copy_up_one fs/overlayfs/copy_up.c:1168 [inline]
ffff88802f7f2420 (sb_writers#4){.+.+}-{0:0}, at: ovl_copy_up_flags+0x1110/0x4470 fs/overlayfs/copy_up.c:1223

but task is already holding lock:
ffff88802f7f2420 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90 fs/namespace.c:409

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(sb_writers#4);
  lock(sb_writers#4);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by syz-executor489/5091:
 #0: ffff8880241fe420 (sb_writers#9){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90 fs/namespace.c:409
 #1: ffff88802f7f2420 (sb_writers#4){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90 fs/namespace.c:409
 #2: ffff88807f0ea808 (&ovl_i_lock_key[depth]){+.+.}-{3:3}, at: ovl_inode_lock_interruptible fs/overlayfs/overlayfs.h:657 [inline]
 #2: ffff88807f0ea808 (&ovl_i_lock_key[depth]){+.+.}-{3:3}, at: ovl_copy_up_start+0x53/0x310 fs/overlayfs/util.c:719

stack backtrace:
CPU: 1 PID: 5091 Comm: syz-executor489 Not tainted 6.9.0-next-20240517-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 check_deadlock kernel/locking/lockdep.c:3062 [inline]
 validate_chain+0x15c1/0x58e0 kernel/locking/lockdep.c:3856
 __lock_acquire+0x1346/0x1fd0 kernel/locking/lockdep.c:5137
 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5754
 percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
 __sb_start_write include/linux/fs.h:1655 [inline]
 sb_start_write include/linux/fs.h:1791 [inline]
 ovl_start_write+0x11d/0x290 fs/overlayfs/util.c:31
 ovl_do_copy_up fs/overlayfs/copy_up.c:967 [inline]
 ovl_copy_up_one fs/overlayfs/copy_up.c:1168 [inline]
 ovl_copy_up_flags+0x1110/0x4470 fs/overlayfs/copy_up.c:1223
 ovl_create_tmpfile fs/overlayfs/dir.c:1317 [inline]
 ovl_tmpfile+0x262/0x6d0 fs/overlayfs/dir.c:1373
 vfs_tmpfile+0x396/0x510 fs/namei.c:3701
 do_tmpfile+0x156/0x340 fs/namei.c:3764
 path_openat+0x2ab8/0x3280 fs/namei.c:3798
 do_filp_open+0x235/0x490 fs/namei.c:3834
 do_sys_openat2+0x13e/0x1d0 fs/open.c:1405
 do_sys_open fs/open.c:1420 [inline]
 __do_sys_open fs/open.c:1428 [inline]
 __se_sys_open fs/open.c:1424 [inline]
 __x64_sys_open+0x225/0x270 fs/open.c:1424
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fab92feaba9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd714aed18 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 00007fab92feaba9
RDX: 0000000000000000 RSI: 0000000000410202 RDI: 0000000020000040
RBP: 00007fab930635f0 R08: 000055557e7894c0 R09: 000055557e7894c0
R10: 000055557e7894c0 R11: 0000000000000246 R12: 00007ffd714aed40
R13: 00007ffd714aef68 R14: 431bde82d7b634db R15: 00007fab9303303b
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[relevance 1%]

* [PATCH] btrfs-progs: doc: btrfs device assembly and verification
@ 2024-05-23  9:40  1% Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-23  9:40 UTC (permalink / raw)
  To: linux-btrfs

Create a document on how devices are assembled and their verification
steps.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 .../Device-assembly-and-verification.rst      |  4 +
 .../ch-device-assembly-and-verification.rst   | 83 +++++++++++++++++++
 Documentation/index.rst                       |  1 +
 3 files changed, 88 insertions(+)
 create mode 100644 Documentation/Device-assembly-and-verification.rst
 create mode 100644 Documentation/ch-device-assembly-and-verification.rst

diff --git a/Documentation/Device-assembly-and-verification.rst b/Documentation/Device-assembly-and-verification.rst
new file mode 100644
index 000000000000..411db54a70ce
--- /dev/null
+++ b/Documentation/Device-assembly-and-verification.rst
@@ -0,0 +1,4 @@
+Btrfs Device Assembly and Verification
+======================================
+
+.. include:: ch-device-assembly-and-verification.rst
diff --git a/Documentation/ch-device-assembly-and-verification.rst b/Documentation/ch-device-assembly-and-verification.rst
new file mode 100644
index 000000000000..65ab9352d59e
--- /dev/null
+++ b/Documentation/ch-device-assembly-and-verification.rst
@@ -0,0 +1,83 @@
+.. _Btrfs Device Assembly and Verification:
+
+Btrfs supports volume management without any external configuration file. Let's look at the on-disk parameters that help bring independent devices together, at the various ways in which the dynamic assembly of devices can get confused, and at how we handle them.
+
+To begin, `mkfs.btrfs` creates the on-disk super-blocks (`struct btrfs_super_block`) and some basic root trees, which help the kernel validate the assembled devices at mount time. At this stage, these devices are read into the kernel using the `BTRFS_IOC_SCAN_DEV` ioctl at `/dev/btrfs/control`. Additionally, the command `btrfs device scan` can read all the Btrfs filesystem devices into the kernel as well. The actual identification of the devices and their assembly as a volume happens inside the kernel.
+
+The `list_head fs_uuids` in the kernel points to the list of all Btrfs fsids/filesystems known to the kernel, with each entry representing one fsid, declared as `struct btrfs_fs_devices` (typically known as `fs_devices`). Furthermore, when all the devices are registered, the volume is formed on the list `btrfs_fs_devices::devlist`, which holds a linked list of `struct btrfs_device` to maintain the information for each device belonging to that filesystem.
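+
+Schematically (fields abbreviated; see the kernel's fs/btrfs/volumes.h for the real definitions):
+
+.. code-block:: c
+
+    /* One entry per known fsid, linked into the global fs_uuids list. */
+    struct btrfs_fs_devices {
+            u8 fsid[BTRFS_FSID_SIZE];
+            struct list_head fs_list;    /* entry in the fs_uuids list */
+            struct list_head devices;    /* the per-filesystem device list */
+            /* ... */
+    };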
+
+Each of the devices in a Btrfs filesystem is distinguished using `btrfs_device::uuid` and `btrfs_device::devid`. The `btrfs_device::uuid` was unique in the kernel until kernel v6.7. The `temp_fsid` feature for single-device filesystems knows how to handle identical single devices.
+
+So, when all the devices are registered, during the mount process, we need just one of the devices to mount the whole volume. However, if you prefer to specify the devices manually instead of using the kernel's automatic assembly, you can do so using the command `btrfs device scan --forget` to clear the kernel's known assembly, and then specify the devices in the mount option, `mount -o device=/dev/<dev1>,device=/dev/<dev2> <mnt>`.
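+
+For example (device names are placeholders):
+
+.. code-block:: bash
+
+    btrfs device scan --forget
+    mount -o device=/dev/<dev1>,device=/dev/<dev2> /dev/<dev1> <mnt>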
+
+Generation Number
+-----------------
+
+With the `struct btrfs_device::generation` number, we select the device that has the most recent transaction commit. Generally, in a healthy volume, all the devices will have the same generation number.
+
+If there are multiple devices with the same fsid, uuid, and devid, the device with the larger generation number is always picked. This avoids older or reappearing devices being joined as part of the volume.
+
+Once the devices are assembled, a device with the largest generation is picked by the mount thread to read the metadata at the root tree.
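+
+Schematically, the selection works like this (simplified, not the exact kernel code):
+
+.. code-block:: c
+
+    struct btrfs_device *latest = NULL;
+
+    list_for_each_entry(device, &fs_devices->devices, dev_list) {
+            if (!latest || device->generation > latest->generation)
+                    latest = device;
+    }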
+
+So far, we have identified the devices based on what each device declared through its super-block.
+
+Now, let us look at how we verify each of these devices through the mount thread.
+
+sys_chunk_array
+---------------
+
+As part of the struct btrfs_super_block, we also have an array of metadata chunk information, defined as below:
+
+.. code-block:: c
+
+    #define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
+    btrfs_super_block::sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
+
+Each element of this array is of type CHUNK_ITEM and contains information about the metadata block group profile and the identification of those devices.
+
+.. code-block:: bash
+
+    sys_chunk_array[2048]:
+      item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 22020096)
+        length 8388608 owner 2 stripe_len 65536 type SYSTEM|DUP
+        io_align 65536 io_width 65536 sector_size 4096
+        num_stripes 2 sub_stripes 1
+            stripe 0 devid 1 offset 22020096
+            dev_uuid e9d99243-2b93-4917-9f5f-ed22507ec806
+            stripe 1 devid 1 offset 30408704
+            dev_uuid e9d99243-2b93-4917-9f5f-ed22507ec806
+
+The devices required for the metadata are listed in the system chunk array. The UUID and devid are taken from here and matched against the devices in btrfs_fs_devices::devlist. A matched device then gets the state BTRFS_DEV_STATE_IN_FS_METADATA. Only the devices where the metadata is placed are found here.
+
+btrfs_read_chunk_tree
+---------------------
+
+The chunk tree root is loaded from btrfs_super_block::chunk_root, and all the device items are found there. The device items are of type struct btrfs_dev_item, against which the devices in fs_devices::devlist are checked by devid, uuid, and fsid/metadata_uuid. The devices which fail to match are removed from the list. At this point, we also determine whether any device is missing and whether the -o degraded option is present to override that and continue with a degraded mount. If a device is missing, an entry for it with the expected devid and uuid as per the dev_item is added, and the rest of the devices get the BTRFS_DEV_STATE_IN_FS_METADATA state.
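+
+Conceptually, the per-item check is (simplified pseudo-C, not the actual kernel helpers):
+
+.. code-block:: c
+
+    /* For each DEV_ITEM found in the chunk tree: */
+    device = find_device(fs_devices, devid, dev_uuid, fsid);
+    if (!device && !btrfs_test_opt(fs_info, DEGRADED))
+            return -ENOENT;  /* a device is missing and -o degraded is not set */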
+
+btrfs_verify_dev_extents
+------------------------
+
+Various device verifications, such as physical sizes, are done at this stage, including verification of each dev extent against its chunk mapping.
+
+btrfs_read_block_groups
+-----------------------
+
+To find out the block-group profiles being used, all the block groups are searched in the extent-tree or in a separate block-group-tree (since kernel v6.1). This provides us with a list of block groups that are already created. It can be visualized at:
+
+    /sys/fs/btrfs/<fsid>/allocation/<type>/<bg>
+
+For this reason, the mount time was proportional to the size of the extent tree, and it wasn't scalable on larger filesystems before v6.1. This issue was resolved by using a separate block-group-tree.
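+
+The separate tree can be enabled at mkfs time via the block-group-tree feature (available in recent btrfs-progs):
+
+.. code-block:: bash
+
+    mkfs.btrfs -O block-group-tree /dev/<dev>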
+
+Missing device
+--------------
+
+Missing device identification happens at multiple levels. First, in read_sys_array, where all the devices required for the metadata are identified, and then in the chunk tree read, where all the device items are read, providing a complete list of all the devices. However, we don't yet know if we need all these devices to mount the volume in a degraded mode, that is, to activate the RAID fault tolerance. This is determined when we read all the chunks in the filesystem chunk tree, because these chunks can have different block-group profiles, and the number of devices required to reconstruct the data or to read from the mirror copy varies.
+
+A missing device might reappear later, lacking the latest generation number. The filesystem will continue to work in a degraded state if the redundancy level allows. If the device reappears, it will be scanned; however, it won't join the allocation as of now. A remount followed by a balance will be necessary so that the blocks missing on that device are copied.
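+
+For example, after a degraded mount, the missing device can either be replaced or the data rebalanced (device names and devid are placeholders):
+
+.. code-block:: bash
+
+    mount -o degraded /dev/<dev1> <mnt>
+    btrfs replace start <missing-devid> /dev/<new-dev> <mnt>
+    # or, if the missing device reappeared after a rescan:
+    btrfs balance start <mnt>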
+
+Device paths
+------------
+
+During boot, we also allow the user to update the device path without going through the device open and close cycle, because systems without an initrd use a temporary device path (/dev/root) for the initial bootstrap, which must be updated to the final device path when the system block layer is ready.
+
+Also, at some point, we might mount a subvolume, in which case the device path is scanned again. So, it is essential to let a matching device path be scanned again and return success.
diff --git a/Documentation/index.rst b/Documentation/index.rst
index d65be265a178..723f4a55ce93 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -48,6 +48,7 @@ is in the :doc:`manual pages<man-index>`.
    Convert
    Deduplication
    Defragmentation
+   Device-assembly-and-verification
    Inline-files
    Qgroups
    Reflink
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  2024-05-23  7:05  2% [PATCH v6 0/5] btrfs: subpage + zoned fixes Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking Qu Wenruo
@ 2024-05-23  7:05  1% ` Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 4/5] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
  4 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs

If we have a subpage range like this for a 16K page with 4K sectorsize:

    0     4K     8K     12K     16K
    |/////|      |//////|       |

    |/////| = dirty range

Currently writepage_delalloc() would go through the following steps:

- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K, 12K)

So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().

But there is a special pitfall for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to the locked page.

This patch would address the page unlocking problem of run_delalloc_cow(),
by changing the workflow to the following one:

- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)

So that run_delalloc_cow() can only unlock the full page once the
last lock user has released it.

To do that, this patch would:

- Utilize the subpage locked bitmap
  So for every delalloc range we find, call
  btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
  and later btrfs_folio_end_all_writers() when the page is fully unlocked.

  So we know there is a delalloc range that needs to be run later.

- Save the @delalloc_end as @last_delalloc_end inside
  writepage_delalloc()
  Since the subpage locked bitmap only covers ranges inside the page,
  while a delalloc range can end beyond the page boundary,
  we have to save @last_delalloc_end just in case it's beyond our
  page boundary.

Although there is one extra point to notice:

- We need to handle errors from a previous iteration
  Since we can have multiple locked delalloc ranges, we have to call
  btrfs_run_delalloc_range() multiple times.
  If we hit an error halfway, we still need to unlock the remaining
  ranges.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 104 ++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/subpage.c   |   6 +++
 2 files changed, 103 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 938061e0ce01..338067ce724a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1226,13 +1226,23 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 		struct page *page, struct writeback_control *wbc)
 {
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
+	struct folio *folio = page_folio(page);
+	const bool is_subpage = btrfs_is_subpage(fs_info, page->mapping);
 	const u64 page_start = page_offset(page);
 	const u64 page_end = page_start + PAGE_SIZE - 1;
+	/*
+	 * Saves the last found delalloc end. As the delalloc end can go beyond
+	 * page boundary, thus we can not rely on subpage bitmap to locate
+	 * the last delalloc end.
+	 */
+	u64 last_delalloc_end = 0;
 	u64 delalloc_start = page_start;
 	u64 delalloc_end = page_end;
 	u64 delalloc_to_write = 0;
 	int ret = 0;
 
+	/* Lock all (subpage) delalloc ranges inside the page first. */
 	while (delalloc_start < page_end) {
 		delalloc_end = page_end;
 		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
@@ -1240,15 +1250,94 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
 			delalloc_start = delalloc_end + 1;
 			continue;
 		}
-
-		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
-					       delalloc_end, wbc);
-		if (ret < 0)
-			return ret;
-
+		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
+					    min(delalloc_end, page_end) + 1 -
+					    delalloc_start);
+		last_delalloc_end = delalloc_end;
 		delalloc_start = delalloc_end + 1;
 	}
+	delalloc_start = page_start;
 
+	if (!last_delalloc_end)
+		goto out;
+
+	/* Run the delalloc ranges for above locked ranges. */
+	while (delalloc_start < page_end) {
+		u64 found_start;
+		u32 found_len;
+		bool found;
+
+		if (!is_subpage) {
+			/*
+			 * For non-subpage case, the found delalloc range must
+			 * cover this page and there must be only one locked
+			 * delalloc range.
+			 */
+			found_start = page_start;
+			found_len = last_delalloc_end + 1 - found_start;
+			found = true;
+		} else {
+			found = btrfs_subpage_find_writer_locked(fs_info, folio,
+					delalloc_start, &found_start, &found_len);
+		}
+		if (!found)
+			break;
+		/*
+		 * The subpage range covers the last sector, the delalloc range may
+		 * end beyond the page boundary, use the saved delalloc_end
+		 * instead.
+		 */
+		if (found_start + found_len >= page_end)
+			found_len = last_delalloc_end + 1 - found_start;
+
+		if (likely(ret >= 0)) {
+			/* No errors hit so far, run the current delalloc range. */
+			ret = btrfs_run_delalloc_range(inode, page, found_start,
+						       found_start + found_len - 1, wbc);
+		} else {
+			/*
+			 * We hit error during previous delalloc range, has to cleanup
+			 * the remaining locked ranges.
+			 */
+			unlock_extent(&inode->io_tree, found_start,
+				      found_start + found_len - 1, NULL);
+			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
+					      found_start + found_len - 1);
+		}
+
+		/*
+		 * We can hit btrfs_run_delalloc_range() with >0 return value.
+		 *
+		 * This happens when either the IO is already done and page
+		 * unlocked (inline) or the IO submission and page unlock would
+		 * be handled async (compression).
+		 *
+		 * Inline is only possible for regular sectorsize for now.
+		 *
+		 * Compression is possible for both subpage and regular cases,
+		 * but even for subpage compression only happens for page aligned
+		 * range, thus the found delalloc range must go beyond current
+		 * page.
+		 */
+		if (ret > 0)
+			ASSERT(!is_subpage || found_start + found_len >= page_end);
+
+		/*
+		 * Above btrfs_run_delalloc_range() may have unlocked the page,
+		 * Thus for the last range, we can not touch the page anymore.
+		 */
+		if (found_start + found_len >= last_delalloc_end + 1)
+			break;
+
+		delalloc_start = found_start + found_len;
+	}
+	if (ret < 0)
+		return ret;
+out:
+	if (last_delalloc_end)
+		delalloc_end = last_delalloc_end;
+	else
+		delalloc_end = page_end;
 	/*
 	 * delalloc_end is already one less than the total length, so
 	 * we don't subtract one from PAGE_SIZE
@@ -1520,7 +1609,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 					       PAGE_SIZE, !ret);
 		mapping_set_error(page->mapping, ret);
 	}
-	unlock_page(page);
+
+	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
 	ASSERT(ret <= 0);
 	return ret;
 }
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 8bf83dd3313d..fe99a8ea94c0 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -868,6 +868,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
 void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
 				 struct folio *folio)
 {
+	struct btrfs_subpage *subpage = folio_get_private(folio);
 	u64 folio_start = folio_pos(folio);
 	u64 cur = folio_start;
 
@@ -877,6 +878,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
 		return;
 	}
 
+	/* The page has no new delalloc range locked on it. Just plain unlock. */
+	if (atomic_read(&subpage->writers) == 0) {
+		folio_unlock(folio);
+		return;
+	}
 	while (cur < folio_start + PAGE_SIZE) {
 		u64 found_start;
 		u32 found_len;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
  2024-05-23  7:05  2% [PATCH v6 0/5] btrfs: subpage + zoned fixes Qu Wenruo
                   ` (3 preceding siblings ...)
  2024-05-23  7:05  1% ` [PATCH v6 4/5] btrfs: do not clear page dirty inside extent_write_locked_range() Qu Wenruo
@ 2024-05-23  7:05  1% ` Qu Wenruo
  4 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Josef Bacik, Johannes Thumshirn, Naohiro Aota

When extent_write_locked_range() generated an inline extent, it would
set and finish the writeback for the whole page.

Although it's currently safe since subpage disables inline creation,
for the sake of consistency, let it use the subpage helpers to set and
clear the writeback flags.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2174c0e0fb15..1aac7b8fa7e2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2336,6 +2336,7 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 		u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
 		u32 cur_len = cur_end + 1 - cur;
 		struct page *page;
+		struct folio *folio;
 		int nr = 0;
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
@@ -2350,8 +2351,9 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		/* Make sure the mapping tag for page dirty gets cleared. */
 		if (nr == 0) {
-			set_page_writeback(page);
-			end_page_writeback(page);
+			folio = page_folio(page);
+			btrfs_folio_set_writeback(fs_info, folio, cur, cur_len);
+			btrfs_folio_clear_writeback(fs_info, folio, cur, cur_len);
 		}
 		if (ret) {
 			btrfs_mark_ordered_io_finished(BTRFS_I(inode), page,
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
  2024-05-23  7:05  2% [PATCH v6 0/5] btrfs: subpage + zoned fixes Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
@ 2024-05-23  7:05  1% ` Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc() Qu Wenruo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Naohiro Aota

Three new helpers are introduced for the incoming subpage delalloc locking
change.

- btrfs_folio_set_writer_lock()
  This is to mark the specified range with the subpage specific writer
  lock. After calling this, the subpage range can be properly unlocked
  by btrfs_folio_end_writer_lock().

- btrfs_subpage_find_writer_locked()
  This is to find a writer locked subpage range in a page.
  With the help of btrfs_folio_set_writer_lock(), this allows us to
  record and find previously locked subpage ranges without extra
  memory allocation.

- btrfs_folio_end_all_writers()
  This is for the locked_page of __extent_writepage(), as there may be
  multiple subpage delalloc ranges locked.
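
Below is a minimal usage sketch (illustrative only, not part of this
patch) of how the three helpers compose for a folio locked by plain
folio_lock(); fs_info, folio and the range values are assumed to come
from the caller:

	u64 found_start;
	u32 found_len;
	bool found;

	/* Record a delalloc range as writer-locked in the subpage bitmap. */
	btrfs_folio_set_writer_lock(fs_info, folio, folio_pos(folio),
				    fs_info->sectorsize);

	/* Later, find the locked range again without extra allocation. */
	found = btrfs_subpage_find_writer_locked(fs_info, folio,
						 folio_pos(folio),
						 &found_start, &found_len);
	if (found) {
		/* found_start/found_len describe the first locked range. */
	}

	/* Finally, end every writer-locked range and unlock the folio. */
	btrfs_folio_end_all_writers(fs_info, folio);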

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/subpage.c | 122 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/subpage.h |   7 +++
 2 files changed, 129 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 2697e528eab2..8bf83dd3313d 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -775,6 +775,128 @@ void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
 	btrfs_folio_end_writer_lock(fs_info, folio, start, len);
 }
 
+/*
+ * This is for a folio already locked by plain lock_page()/folio_lock(), which
+ * doesn't have any subpage awareness.
+ *
+ * This would populate the involved subpage ranges so that subpage helpers can
+ * properly unlock them.
+ */
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio, u64 start, u32 len)
+{
+	struct btrfs_subpage *subpage;
+	unsigned long flags;
+	unsigned int start_bit;
+	unsigned int nbits;
+	int ret;
+
+	ASSERT(folio_test_locked(folio));
+	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
+		return;
+
+	subpage = folio_get_private(folio);
+	start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
+	nbits = len >> fs_info->sectorsize_bits;
+	spin_lock_irqsave(&subpage->lock, flags);
+	/* Target range should not yet be locked. */
+	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	bitmap_set(subpage->bitmaps, start_bit, nbits);
+	ret = atomic_add_return(nbits, &subpage->writers);
+	ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
+	spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+/*
+ * Find any subpage writer locked range inside @folio, starting at file offset
+ * @search_start.
+ * The caller should ensure the folio is locked.
+ *
+ * Return true and update @found_start_ret and @found_len_ret to the first
+ * writer locked range.
+ * Return false if there is no writer locked range.
+ */
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+				      struct folio *folio, u64 search_start,
+				      u64 *found_start_ret, u32 *found_len_ret)
+{
+	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
+	struct btrfs_subpage *subpage = folio_get_private(folio);
+	const unsigned int len = PAGE_SIZE - offset_in_page(search_start);
+	const unsigned int start_bit = subpage_calc_start_bit(fs_info, folio,
+						locked, search_start, len);
+	const unsigned int locked_bitmap_start = subpage_info->locked_offset;
+	const unsigned int locked_bitmap_end = locked_bitmap_start +
+					       subpage_info->bitmap_nr_bits;
+	unsigned long flags;
+	int first_zero;
+	int first_set;
+	bool found = false;
+
+	ASSERT(folio_test_locked(folio));
+	spin_lock_irqsave(&subpage->lock, flags);
+	first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end,
+				  start_bit);
+	if (first_set >= locked_bitmap_end)
+		goto out;
+
+	found = true;
+
+	*found_start_ret = folio_pos(folio) +
+		((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
+	/*
+	 * Since @first_set is ensured to be smaller than locked_bitmap_end
+	 * here, @found_start_ret should be inside the folio.
+	 */
+	ASSERT(*found_start_ret < folio_pos(folio) + PAGE_SIZE);
+
+	first_zero = find_next_zero_bit(subpage->bitmaps,
+					locked_bitmap_end, first_set);
+	*found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
+out:
+	spin_unlock_irqrestore(&subpage->lock, flags);
+	return found;
+}
+
+/*
+ * Unlike btrfs_folio_end_writer_lock() which unlocks a specified subpage range,
+ * this would end all writer locked ranges of a page.
+ *
+ * This is for the locked page of __extent_writepage(), as the locked page
+ * can contain several locked subpage ranges.
+ */
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio)
+{
+	u64 folio_start = folio_pos(folio);
+	u64 cur = folio_start;
+
+	ASSERT(folio_test_locked(folio));
+	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+		folio_unlock(folio);
+		return;
+	}
+
+	while (cur < folio_start + PAGE_SIZE) {
+		u64 found_start;
+		u32 found_len;
+		bool found;
+		bool last;
+
+		found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
+							 &found_start, &found_len);
+		if (!found)
+			break;
+		last = btrfs_subpage_end_and_test_writer(fs_info, folio,
+							 found_start, found_len);
+		if (last) {
+			folio_unlock(folio);
+			break;
+		}
+		cur = found_start + found_len;
+	}
+}
+
 #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst)		\
 	bitmap_cut(dst, subpage->bitmaps, 0,				\
 		   subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index 4b363d9453af..9f19850d59f2 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
 				  struct folio *folio, u64 start, u32 len);
 void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
 				 struct folio *folio, u64 start, u32 len);
+void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio, u64 start, u32 len);
+bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
+				      struct folio *folio, u64 search_start,
+				      u64 *found_start_ret, u32 *found_len_ret);
+void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
+				 struct folio *folio);
 
 /*
  * Template for subpage related operations.
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 1/5] btrfs: make __extent_writepage_io() to write specified range only
  2024-05-23  7:05  2% [PATCH v6 0/5] btrfs: subpage + zoned fixes Qu Wenruo
@ 2024-05-23  7:05  1% ` Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking Qu Wenruo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn, Naohiro Aota

Function __extent_writepage_io() is designed to find all dirty ranges
of a page and add them into the bio_ctrl for submission.
It requires all the dirty ranges to be covered by an ordered extent.

It gets called in two locations, but one call site is not subpage aware:

- __extent_writepage()
  It gets called when writepage_delalloc() returned 0, which means
  writepage_delalloc() has handled delalloc for all subpage sectors
  inside the page.

  So this call site is OK.

- extent_write_locked_range()
  This call site is utilized by zoned support, and in this case, we may
  only run the delalloc range for a subset of the page, like this (64K
  page size):

  0     16K     32K     48K     64K
  |/////|       |///////|       |

  In the above case, if extent_write_locked_range() is only triggered
  for the range [0, 16K), __extent_writepage_io() would still try to
  submit the dirty range [32K, 48K), then fail to find any ordered
  extent for it and trigger various ASSERT()s.

Fix this problem by:

- Introducing @start and @len parameters to specify the range

  For the first call site, we just pass the whole page, and the behavior
  is not touched, since run_delalloc_range() for the page should have
  created all ordered extents for the page.

  For the second call site, we would avoid touching anything beyond the
  range, thus avoiding any dirty range which is not yet covered by a
  delalloc range (see the sketch after this list).

- Making btrfs_folio_assert_not_dirty() subpage aware
  The only caller is inside __extent_writepage_io(), and since that
  caller now accepts a subpage range, we should also check the subpage
  range rather than the whole page.
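
For the zoned case above, one iteration of the extent_write_locked_range()
loop now ends up doing roughly the following (a sketch using the variable
names of that function, assuming the 64K page with only [0, 16K) as
delalloc):

	u64 cur = start;	/* start of the 16K delalloc range */
	u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
	u32 cur_len = cur_end + 1 - cur;	/* 16K here, not the full 64K */

	ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
				    &bio_ctrl, i_size, &nr);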

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 18 +++++++++++-------
 fs/btrfs/subpage.c   | 22 ++++++++++++++++------
 fs/btrfs/subpage.h   |  3 ++-
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf50301ee528..938061e0ce01 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1339,20 +1339,23 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
  * < 0 if there were errors (page still locked)
  */
 static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
-				 struct page *page,
+				 struct page *page, u64 start, u32 len,
 				 struct btrfs_bio_ctrl *bio_ctrl,
 				 loff_t i_size,
 				 int *nr_ret)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
-	u64 cur = page_offset(page);
-	u64 end = cur + PAGE_SIZE - 1;
+	u64 cur = start;
+	u64 end = start + len - 1;
 	u64 extent_offset;
 	u64 block_start;
 	struct extent_map *em;
 	int ret = 0;
 	int nr = 0;
 
+	ASSERT(start >= page_offset(page) &&
+	       start + len <= page_offset(page) + PAGE_SIZE);
+
 	ret = btrfs_writepage_cow_fixup(page);
 	if (ret) {
 		/* Fixup worker will requeue */
@@ -1441,7 +1444,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		nr++;
 	}
 
-	btrfs_folio_assert_not_dirty(fs_info, page_folio(page));
+	btrfs_folio_assert_not_dirty(fs_info, page_folio(page), start, len);
 	*nr_ret = nr;
 	return 0;
 
@@ -1499,7 +1502,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
 	if (ret)
 		goto done;
 
-	ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
+	ret = __extent_writepage_io(BTRFS_I(inode), page, page_offset(page),
+				    PAGE_SIZE, bio_ctrl, i_size, &nr);
 	if (ret == 1)
 		return 0;
 
@@ -2251,8 +2255,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 			clear_page_dirty_for_io(page);
 		}
 
-		ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
-					    i_size, &nr);
+		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
+					    &bio_ctrl, i_size, &nr);
 		if (ret == 1)
 			goto next_page;
 
diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 9127704236ab..2697e528eab2 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -703,19 +703,29 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
  * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
  * is cleared.
  */
-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio)
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				  struct folio *folio, u64 start, u32 len)
 {
-	struct btrfs_subpage *subpage = folio_get_private(folio);
+	struct btrfs_subpage *subpage;
+	unsigned int start_bit;
+	unsigned int nbits;
+	unsigned long flags;
 
 	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
 		return;
 
-	ASSERT(!folio_test_dirty(folio));
-	if (!btrfs_is_subpage(fs_info, folio->mapping))
+	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
+		ASSERT(!folio_test_dirty(folio));
 		return;
+	}
 
-	ASSERT(folio_test_private(folio) && folio_get_private(folio));
-	ASSERT(subpage_test_bitmap_all_zero(fs_info, subpage, dirty));
+	start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
+	nbits = len >> fs_info->sectorsize_bits;
+	subpage = folio_get_private(folio);
+	ASSERT(subpage);
+	spin_lock_irqsave(&subpage->lock, flags);
+	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
+	spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
 /*
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index b6dc013b0fdc..4b363d9453af 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -156,7 +156,8 @@ DECLARE_BTRFS_SUBPAGE_OPS(checked);
 bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
 					struct folio *folio, u64 start, u32 len);
 
-void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
+void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
+				  struct folio *folio, u64 start, u32 len);
 void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
 			       struct folio *folio, u64 start, u32 len);
 void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 4/5] btrfs: do not clear page dirty inside extent_write_locked_range()
  2024-05-23  7:05  2% [PATCH v6 0/5] btrfs: subpage + zoned fixes Qu Wenruo
                   ` (2 preceding siblings ...)
  2024-05-23  7:05  1% ` [PATCH v6 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc() Qu Wenruo
@ 2024-05-23  7:05  1% ` Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly Qu Wenruo
  4 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Johannes Thumshirn

[BUG]
For the subpage + zoned case, the following workload can lead to a
reserved data space (rsv) leak at unmount time:

 # mkfs.btrfs -f -s 4k $dev
 # mount $dev $mnt
 # fsstress -w -n 8 -d $mnt -s 1709539240
 0/0: fiemap - no filename
 0/1: copyrange read - no filename
 0/2: write - no filename
 0/3: rename - no source filename
 0/4: creat f0 x:0 0 0
 0/4: creat add id=0,parent=-1
 0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
 0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
 0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
 0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
 # umount $mnt

The dmesg output would include the following rsv leak detection
warnings (all call traces skipped):

 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
 ---[ end trace 0000000000000000 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
 ---[ end trace 0000000000000000 ]---
 ------------[ cut here ]------------
 WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
 ---[ end trace 0000000000000000 ]---
 BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
 ------------[ cut here ]------------
 WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
 ---[ end trace 0000000000000000 ]---
 BTRFS info (device sda): space_info DATA has 268218368 free, is not full
 BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
 BTRFS info (device sda): global_block_rsv: size 0 reserved 0
 BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
 BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
 BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
 BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
 ------------[ cut here ]------------
 WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
 ---[ end trace 0000000000000000 ]---
 BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
 BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
 BTRFS info (device sda): global_block_rsv: size 0 reserved 0
 BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
 BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
 BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
 BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0

Above, $dev is a tcmu-runner emulated zoned HDD with a max zone append
size of 64K, and the system has a 64K page size.

[CAUSE]
I have added several trace_printk() calls to show the events (headers skipped):

 > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864

The above lines show that our buffered write has dirtied 3 pages of
inode 259 of root 5:

  704K             768K              832K              896K
  I           |////I/////////////////I///////////|     I
              756K                               868K

  |///| is the dirtied range recorded in the subpage bitmaps, and 'I'
  is a page boundary.

  Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
  flag set.

 > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400

Then the direct IO write starts. Since the range [680K, 780K) covers
the beginning of the above dirty range, btrfs needs to write back the
two pages at 704K and 768K.

 > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
 > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536

Now the above 2 lines show that we're writing back the dirty range
[756K, 756K + 64K).
We only write back 64K because the zoned device has a max zone append
size of 64K.

 > extent_write_locked_range: r/i=5/259 clear dirty for page=786432

!!! The above line shows the root cause. !!!

We're calling clear_page_dirty_for_io() inside extent_write_locked_range(),
for the page at 768K.
This is because extent_write_locked_range() can go beyond the currently
locked page; here we hit the page at 768K and clear its page dirty flag.

In fact this leads to a desync between the subpage dirty and page dirty
flags.
We have the page dirty flag cleared, but the subpage range [820K, 832K)
is still dirty.

After the writeback of the range [756K, 820K), the dirty flags look
like this, as page 768K no longer has its dirty flag set.

  704K             768K              832K              896K
  I                I      |          I/////////////|   I
                          820K                     868K

This means we will no longer write back the range [820K, 832K), thus
the reserved data/metadata space would never be properly released.

 > extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432

Now even if we try to start writeback for page 768K, since the page is
not dirty, we completely skip it in extent_write_cache_pages().

 > btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0

Now the direct IO has finished.

 > cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
 > extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864

Now we write back the remaining dirty range, which is [832K, 868K),
causing the range [820K, 832K) to never be submitted, thus leaking the
reserved space.

This bug only affects the subpage + zoned case.
For the non-subpage + zoned case, we have exactly one sector per page,
thus no such partial dirty case.

For the subpage + non-zoned case, we never go into run_delalloc_cow(),
and normally all the dirty subpage ranges would be properly submitted
inside __extent_writepage_io().

[FIX]
Just do not clear the page dirty flag at all inside
extent_write_locked_range().
__extent_writepage_io() does a more accurate, subpage compatible
clearing of the page and subpage dirty flags anyway.
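
As a rough sketch of what that more accurate clearing looks like (an
assumption about the existing per-sector logic in
__extent_writepage_io(), not code added by this patch):

	/* @cur and @iosize cover one submitted sector inside the page. */
	btrfs_folio_clear_dirty(fs_info, page_folio(page), cur, iosize);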

Now the correct trace would look like this:

 > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
 > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864

The page dirtying part still covers the same 3 pages.

 > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
 > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
 > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536

And the writeback for the first 64K is still correct.

 > cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
 > extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152

Now with the fix, we can properly write back the range [820K, 832K)
and properly release the reserved data/metadata space.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 338067ce724a..2174c0e0fb15 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2340,10 +2340,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
 
 		page = find_get_page(mapping, cur >> PAGE_SHIFT);
 		ASSERT(PageLocked(page));
-		if (pages_dirty && page != locked_page) {
+		if (pages_dirty && page != locked_page)
 			ASSERT(PageDirty(page));
-			clear_page_dirty_for_io(page);
-		}
 
 		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
 					    &bio_ctrl, i_size, &nr);
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v6 0/5] btrfs: subpage + zoned fixes
@ 2024-05-23  7:05  2% Qu Wenruo
  2024-05-23  7:05  1% ` [PATCH v6 1/5] btrfs: make __extent_writepage_io() to write specified range only Qu Wenruo
                   ` (4 more replies)
  0 siblings, 5 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  7:05 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v6:
- Use unsigned int for bitmap related members

- One extra ASSERT() to make sure our bit range never exceeds the bitmap

- One extra ASSERT() for btrfs_run_delalloc_range() returning >0 case

- "dealloc" typo fix

- Small changes inside the writepage_delalloc() main loop to make it a
  little easier to read

v5:
- Enhance the commit message on why we should not clear page dirty
  inside extent_write_locked_range()

- Reorder the patches so that there is no temporary list based solution
  for delalloc ranges

v4:
- Rebased to the latest for-next branch
  Thankfully no conflicts at all.

- Include all the previous preparation patches
  It turns out I split the preparation into other series and even got
  myself confused.

- Use the correct commit message from my local branch
  It turns out Josef is totally correct: the problem I described in
  "btrfs: do not clear page dirty inside extent_write_locked_range()"
  is really confusing, as it has direct IO involved, and my local
  branch was already using a much better commit message that I just
  forgot about.
 
v3:
- Use the minimal fsstress workload with trace_printk() output to
  explain the bug better

v2:
- Update the commit message for the first patch,
  as there was something wrong with the ASCII art of the memory layout.

[REPO]
https://github.com/adam900710/linux/tree/subpage_delalloc

If running subpage with zoned devices (TCMU emulated HDD, 64K or 16K
page size with 4K sectorsize), btrfs can easily hit various bugs:

- ASSERT()s related to submitting a page range which has no OE coverage

- Various reserved space leaks, with some OEs never finishing

This is caused by two major reasons:

- run_delalloc_cow() is not subpage compatible
  There are several different problems involved:

  * extent_write_locked_range() would try to submit dirty pages beyond
    the specified subpage range,
    thus hitting ASSERT()s because a dirty range has no corresponding OE

  * extent_write_locked_range() would unlock the whole page while we're
    only triggered for a subpage range,
    thus causing the page to be unlocked unexpectedly

  This would be addressed by patches 1~3, by:

  * Limiting the submission range to follow the subpage ranges

  * Making the page unlocking part also subpage compatible, and always
    locking all delalloc subpage ranges covering the current page.

- Some dirty ranges are not submitted, thus their OEs would never finish
  This happens due to the mismatch that extent_write_locked_range() can
  clear the full page dirty flag even if we're only submitting part of
  the dirty ranges, causing the page dirty flag to desync from the
  subpage dirty flags.

  Then later __extent_writepage_io() would skip the non-dirty page, as
  the check only looks at the full page dirty flag, not the subpage
  bitmaps.

  This would be addressed by patches 4~5.


Qu Wenruo (5):
  btrfs: make __extent_writepage_io() to write specified range only
  btrfs: subpage: introduce helpers to handle subpage delalloc locking
  btrfs: lock subpage ranges in one go for writepage_delalloc()
  btrfs: do not clear page dirty inside extent_write_locked_range()
  btrfs: make extent_write_locked_range() to handle subpage writeback
    correctly

 fs/btrfs/extent_io.c | 132 +++++++++++++++++++++++++++++++------
 fs/btrfs/subpage.c   | 150 +++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/subpage.h   |  10 ++-
 3 files changed, 266 insertions(+), 26 deletions(-)

-- 
2.45.1


^ permalink raw reply	[relevance 2%]

* [PATCH v3 11/11] btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (9 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 10/11] btrfs: cleanup duplicated parameters related to create_io_em() Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23 10:23  1% ` [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Johannes Thumshirn
  2024-05-23 18:26  2% ` Filipe Manana
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs

The following 3 parameters can be cleaned up using the
btrfs_file_extent structure:

- len
  btrfs_file_extent::num_bytes

- orig_block_len
  btrfs_file_extent::disk_num_bytes

- ram_bytes
  btrfs_file_extent::ram_bytes

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ecafaa181201..0ec275b24adc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7004,11 +7004,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
-						  const u64 len,
-						  const u64 orig_block_len,
-						  const u64 ram_bytes,
-						  const int type,
-						  struct btrfs_file_extent *file_extent)
+						  struct btrfs_file_extent *file_extent,
+						  const int type)
 {
 	struct extent_map *em = NULL;
 	struct btrfs_ordered_extent *ordered;
@@ -7026,7 +7023,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (em) {
 			free_extent_map(em);
 			btrfs_drop_extent_map_range(inode, start,
-						    start + len - 1, false);
+					start + file_extent->num_bytes - 1, false);
 		}
 		em = ERR_CAST(ordered);
 	} else {
@@ -7069,10 +7066,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	file_extent.ram_bytes = ins.offset;
 	file_extent.offset = 0;
 	file_extent.compression = BTRFS_COMPRESS_NONE;
-	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
-				     ins.offset,
-				     ins.offset, BTRFS_ORDERED_REGULAR,
-				     &file_extent);
+	em = btrfs_create_dio_extent(inode, dio_data, start, &file_extent,
+				     BTRFS_ORDERED_REGULAR);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	if (IS_ERR(em))
 		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
@@ -7439,10 +7434,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		}
 		space_reserved = true;
 
-		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
-					      file_extent.disk_num_bytes,
-					      file_extent.ram_bytes, type,
-					      &file_extent);
+		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start,
+					      &file_extent, type);
 		btrfs_dec_nocow_writers(bg);
 		if (type == BTRFS_ORDERED_PREALLOC) {
 			free_extent_map(em);
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 10/11] btrfs: cleanup duplicated parameters related to create_io_em()
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (8 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 11/11] btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent() Qu Wenruo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs

Most parameters of create_io_em() can be replaced by the members with
the same name inside btrfs_file_extent.

Do a straightforward parameter cleanup here.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 55 ++++++++++++------------------------------------
 1 file changed, 14 insertions(+), 41 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 35f03149b777..ecafaa181201 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -138,9 +138,6 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
 				     u64 end, struct writeback_control *wbc,
 				     bool pages_dirty);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len,
-				       u64 disk_num_bytes,
-				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
 				       int type);
 
@@ -1207,13 +1204,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	file_extent.offset = 0;
 	file_extent.compression = async_extent->compress_type;
 
-	em = create_io_em(inode, start,
-			  async_extent->ram_size,	/* len */
-			  ins.offset,			/* orig_block_len */
-			  async_extent->ram_size,	/* ram_bytes */
-			  async_extent->compress_type,
-			  &file_extent,
-			  BTRFS_ORDERED_COMPRESSED);
+	em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
 		goto out_free_reserve;
@@ -1443,12 +1434,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		lock_extent(&inode->io_tree, start, start + ram_size - 1,
 			    &cached);
 
-		em = create_io_em(inode, start, ins.offset, /* len */
-				  ins.offset, /* orig_block_len */
-				  ram_size, /* ram_bytes */
-				  BTRFS_COMPRESS_NONE, /* compress_type */
-				  &file_extent,
-				  BTRFS_ORDERED_REGULAR /* type */);
+		em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_REGULAR);
 		if (IS_ERR(em)) {
 			unlock_extent(&inode->io_tree, start,
 				      start + ram_size - 1, &cached);
@@ -2165,12 +2151,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		if (is_prealloc) {
 			struct extent_map *em;
 
-			em = create_io_em(inode, cur_offset,
-					  nocow_args.file_extent.num_bytes,
-					  nocow_args.file_extent.disk_num_bytes,
-					  nocow_args.file_extent.ram_bytes,
-					  BTRFS_COMPRESS_NONE,
-					  &nocow_args.file_extent,
+			em = create_io_em(inode, cur_offset, &nocow_args.file_extent,
 					  BTRFS_ORDERED_PREALLOC);
 			if (IS_ERR(em)) {
 				unlock_extent(&inode->io_tree, cur_offset,
@@ -7033,10 +7014,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
-		em = create_io_em(inode, start, len,
-				  orig_block_len, ram_bytes,
-				  BTRFS_COMPRESS_NONE, /* compress_type */
-				  file_extent, type);
+		em = create_io_em(inode, start, file_extent, type);
 		if (IS_ERR(em))
 			goto out;
 	}
@@ -7328,9 +7306,6 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 /* The callers of this must take lock_extent() */
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len,
-				       u64 disk_num_bytes,
-				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
 				       int type)
 {
@@ -7352,25 +7327,25 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 	switch (type) {
 	case BTRFS_ORDERED_PREALLOC:
 		/* We're only referring part of a larger preallocated extent. */
-		ASSERT(len <= ram_bytes);
+		ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
 		break;
 	case BTRFS_ORDERED_REGULAR:
 		/* COW results a new extent matching our file extent size. */
-		ASSERT(disk_num_bytes == len);
-		ASSERT(ram_bytes == len);
+		ASSERT(file_extent->disk_num_bytes == file_extent->num_bytes);
+		ASSERT(file_extent->ram_bytes == file_extent->num_bytes);
 
 		/* Since it's a new extent, we should not have any offset. */
 		ASSERT(file_extent->offset == 0);
 		break;
 	case BTRFS_ORDERED_COMPRESSED:
 		/* Must be compressed. */
-		ASSERT(compress_type != BTRFS_COMPRESS_NONE);
+		ASSERT(file_extent->compression != BTRFS_COMPRESS_NONE);
 
 		/*
 		 * Encoded write can make us to refer to part of the
 		 * uncompressed extent.
 		 */
-		ASSERT(len <= ram_bytes);
+		ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
 		break;
 	}
 
@@ -7379,15 +7354,15 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 		return ERR_PTR(-ENOMEM);
 
 	em->start = start;
-	em->len = len;
+	em->len = file_extent->num_bytes;
 	em->disk_bytenr = file_extent->disk_bytenr;
-	em->disk_num_bytes = disk_num_bytes;
-	em->ram_bytes = ram_bytes;
+	em->disk_num_bytes = file_extent->disk_num_bytes;
+	em->ram_bytes = file_extent->ram_bytes;
 	em->generation = -1;
 	em->offset = file_extent->offset;
 	em->flags |= EXTENT_FLAG_PINNED;
 	if (type == BTRFS_ORDERED_COMPRESSED)
-		extent_map_set_compression(em, compress_type);
+		extent_map_set_compression(em, file_extent->compression);
 
 	ret = btrfs_replace_extent_map_range(inode, em, true);
 	if (ret) {
@@ -10354,9 +10329,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	file_extent.ram_bytes = ram_bytes;
 	file_extent.offset = encoded->unencoded_offset;
 	file_extent.compression = compression;
-	em = create_io_em(inode, start, num_bytes,
-			  ins.offset, ram_bytes, compression,
-			  &file_extent, BTRFS_ORDERED_COMPRESSED);
+	em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
 		goto out_free_reserved;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (7 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 08/11] btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23 18:17  1%   ` Filipe Manana
  2024-05-23  5:03  1% ` [PATCH v3 10/11] btrfs: cleanup duplicated parameters related to create_io_em() Qu Wenruo
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs

All parameters after @file_offset of btrfs_alloc_ordered_extent() can
be replaced with the btrfs_file_extent structure.

This patch does the cleanup; meanwhile, some points to note:

- Move the btrfs_file_extent structure to ordered-data.h
  The structure is needed by both btrfs_alloc_ordered_extent() and
  can_nocow_extent(), and since btrfs_inode.h includes ordered-data.h,
  we need to move the structure to ordered-data.h.

- Move the special handling of NOCOW/PREALLOC into
  btrfs_alloc_ordered_extent()
  This is to allow btrfs_split_ordered_extent() to properly split such
  extents for DIO.
  For now just move the handling into btrfs_alloc_ordered_extent() to
  simplify the callers (a caller sketch follows this list).
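
As an illustration of the new calling convention (hypothetical values,
mirroring the cow_file_range() call site touched by this patch), a
caller now fills one structure instead of passing six naked numbers:

	struct btrfs_file_extent file_extent = {
		.disk_bytenr = ins.objectid,
		.disk_num_bytes = ins.offset,
		.num_bytes = ins.offset,
		.ram_bytes = ins.offset,
		.offset = 0,
		.compression = BTRFS_COMPRESS_NONE,
	};

	ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
					     1 << BTRFS_ORDERED_REGULAR);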

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h  | 14 -----------
 fs/btrfs/inode.c        | 56 ++++++++---------------------------------
 fs/btrfs/ordered-data.c | 34 ++++++++++++++++++++-----
 fs/btrfs/ordered-data.h | 19 +++++++++++---
 4 files changed, 54 insertions(+), 69 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index dbc85efdf68a..97ce56a60672 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -514,20 +514,6 @@ int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
-
-/*
- * This represents details about the target file extent item of a write
- * operation.
- */
-struct btrfs_file_extent {
-	u64 disk_bytenr;
-	u64 disk_num_bytes;
-	u64 num_bytes;
-	u64 ram_bytes;
-	u64 offset;
-	u8 compression;
-};
-
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      struct btrfs_file_extent *file_extent,
 			      bool nowait, bool strict);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 445c19d96d10..35f03149b777 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1220,14 +1220,8 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	}
 	free_extent_map(em);
 
-	ordered = btrfs_alloc_ordered_extent(inode, start,	/* file_offset */
-				       async_extent->ram_size,	/* num_bytes */
-				       async_extent->ram_size,	/* ram_bytes */
-				       ins.objectid,		/* disk_bytenr */
-				       ins.offset,		/* disk_num_bytes */
-				       0,			/* offset */
-				       1 << BTRFS_ORDERED_COMPRESSED,
-				       async_extent->compress_type);
+	ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
+				       1 << BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(ordered)) {
 		btrfs_drop_extent_map_range(inode, start, end, false);
 		ret = PTR_ERR(ordered);
@@ -1463,10 +1457,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		}
 		free_extent_map(em);
 
-		ordered = btrfs_alloc_ordered_extent(inode, start, ram_size,
-					ram_size, ins.objectid, cur_alloc_size,
-					0, 1 << BTRFS_ORDERED_REGULAR,
-					BTRFS_COMPRESS_NONE);
+		ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
+						     1 << BTRFS_ORDERED_REGULAR);
 		if (IS_ERR(ordered)) {
 			unlock_extent(&inode->io_tree, start,
 				      start + ram_size - 1, &cached);
@@ -2191,15 +2183,10 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		}
 
 		ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
-				nocow_args.file_extent.num_bytes,
-				nocow_args.file_extent.num_bytes,
-				nocow_args.file_extent.disk_bytenr +
-				nocow_args.file_extent.offset,
-				nocow_args.file_extent.num_bytes, 0,
+				&nocow_args.file_extent,
 				is_prealloc
 				? (1 << BTRFS_ORDERED_PREALLOC)
-				: (1 << BTRFS_ORDERED_NOCOW),
-				BTRFS_COMPRESS_NONE);
+				: (1 << BTRFS_ORDERED_NOCOW));
 		btrfs_dec_nocow_writers(nocow_bg);
 		if (IS_ERR(ordered)) {
 			if (is_prealloc) {
@@ -7054,29 +7041,9 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 			goto out;
 	}
 
-	/*
-	 * For regular writes, file_extent->offset is always 0,
-	 * thus we really only need file_extent->disk_bytenr, every other length
-	 * (disk_num_bytes/ram_bytes) should match @len and
-	 * file_extent->num_bytes.
-	 *
-	 * For NOCOW, we don't really care about the numbers except
-	 * @start and @len, as we won't insert a file extent
-	 * item at all.
-	 *
-	 * For PREALLOC, we do not use ordered extent members, but
-	 * btrfs_mark_extent_written() handles everything.
-	 *
-	 * So here we always passing 0 as offset for the ordered extent,
-	 * or btrfs_split_ordered_extent() can not handle it correctly.
-	 */
-	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
-					     file_extent->disk_bytenr +
-					     file_extent->offset,
-					     len, 0,
+	ordered = btrfs_alloc_ordered_extent(inode, start, file_extent,
 					     (1 << type) |
-					     (1 << BTRFS_ORDERED_DIRECT),
-					     BTRFS_COMPRESS_NONE);
+					     (1 << BTRFS_ORDERED_DIRECT));
 	if (IS_ERR(ordered)) {
 		if (em) {
 			free_extent_map(em);
@@ -10396,12 +10363,9 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	}
 	free_extent_map(em);
 
-	ordered = btrfs_alloc_ordered_extent(inode, start, num_bytes, ram_bytes,
-				       ins.objectid, ins.offset,
-				       encoded->unencoded_offset,
+	ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
 				       (1 << BTRFS_ORDERED_ENCODED) |
-				       (1 << BTRFS_ORDERED_COMPRESSED),
-				       compression);
+				       (1 << BTRFS_ORDERED_COMPRESSED));
 	if (IS_ERR(ordered)) {
 		btrfs_drop_extent_map_range(inode, start, end, false);
 		ret = PTR_ERR(ordered);
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index d446d89c2c34..5c2fb0a7c5c8 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -264,17 +264,39 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
  */
 struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 			struct btrfs_inode *inode, u64 file_offset,
-			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			u64 disk_num_bytes, u64 offset, unsigned long flags,
-			int compress_type)
+			struct btrfs_file_extent *file_extent,
+			unsigned long flags)
 {
 	struct btrfs_ordered_extent *entry;
 
 	ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
 
-	entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes,
-				     disk_bytenr, disk_num_bytes, offset, flags,
-				     compress_type);
+	/*
+	 * For regular writes, we just use the members in @file_extent.
+	 *
+	 * For NOCOW, we don't really care about the numbers except
+	 * @file_offset and file_extent->num_bytes, as we won't insert a
+	 * file extent item at all.
+	 *
+	 * For PREALLOC, we do not use ordered extent members, but
+	 * btrfs_mark_extent_written() handles everything.
+	 *
+	 * So here we always pass 0 as the offset for NOCOW/PREALLOC ordered
+	 * extents, or btrfs_split_ordered_extent() can not handle it correctly.
+	 */
+	if (flags & ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC)))
+		entry = alloc_ordered_extent(inode, file_offset,
+				file_extent->num_bytes, file_extent->num_bytes,
+				file_extent->disk_bytenr + file_extent->offset,
+				file_extent->num_bytes, 0, flags,
+				file_extent->compression);
+	else
+		entry = alloc_ordered_extent(inode, file_offset,
+				file_extent->num_bytes, file_extent->ram_bytes,
+				file_extent->disk_bytenr,
+				file_extent->disk_num_bytes,
+				file_extent->offset, flags,
+				file_extent->compression);
 	if (!IS_ERR(entry))
 		insert_ordered_extent(entry);
 	return entry;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 2ec329e2f0f3..31e65f2f4990 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -171,11 +171,24 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size);
+
+/*
+ * This represents details about the target file extent item of a write
+ * operation.
+ */
+struct btrfs_file_extent {
+	u64 disk_bytenr;
+	u64 disk_num_bytes;
+	u64 num_bytes;
+	u64 ram_bytes;
+	u64 offset;
+	u8 compression;
+};
+
 struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
 			struct btrfs_inode *inode, u64 file_offset,
-			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			u64 disk_num_bytes, u64 offset, unsigned long flags,
-			int compress_type);
+			struct btrfs_file_extent *file_extent,
+			unsigned long flags);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 08/11] btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (6 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 07/11] btrfs: remove extent_map::block_start member Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent Qu Wenruo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs

The following functions and structures can be simplified using the
btrfs_file_extent structure:

- can_nocow_extent()
  No need to return ram_bytes/orig_block_len through the parameter list,
  the @file_extent parameter contains all needed info.

- can_nocow_file_extent_args
  The following members are no longer needed:

  * disk_bytenr
    This one is confusing as it's not really the
    btrfs_file_extent_item::disk_bytenr, but where the IO would land,
    thus it's file_extent::disk_bytenr + file_extent::offset now (see
    the sketch after this list).

  * num_bytes
    Now file_extent::num_bytes.

  * extent_offset
    Now file_extent::offset.

  * disk_num_bytes
    Now file_extent::disk_num_bytes.
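
A one-line sketch of that derivation, as done in the
can_nocow_file_extent() hunk below:

	/* Where the NOCOW I/O actually starts, derived on demand. */
	io_start = args->file_extent.disk_bytenr + args->file_extent.offset;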

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/btrfs_inode.h |  3 +-
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/inode.c       | 84 ++++++++++++++++++------------------------
 3 files changed, 38 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 269ee9ac859e..dbc85efdf68a 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -529,8 +529,7 @@ struct btrfs_file_extent {
 };
 
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
-			      u64 *orig_block_len,
-			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
+			      struct btrfs_file_extent *file_extent,
 			      bool nowait, bool strict);
 
 void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f0cb7b29cab2..eaeefb683b4e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1104,7 +1104,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			NULL, NULL, NULL, nowait, false);
+			       NULL, nowait, false);
 	if (ret <= 0)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 	else
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1b78769d1e41..445c19d96d10 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1860,15 +1860,10 @@ struct can_nocow_file_extent_args {
 	 */
 	bool free_path;
 
-	/* Output fields. Only set when can_nocow_file_extent() returns 1. */
-
-	u64 disk_bytenr;
-	u64 disk_num_bytes;
-	u64 extent_offset;
-	/* Number of bytes that can be written to in NOCOW mode. */
-	u64 num_bytes;
-
-	/* The expected file extent for the NOCOW write. */
+	/*
+	 * Output fields. Only set when can_nocow_file_extent() returns 1.
+	 * The expected file extent for the NOCOW write.
+	 */
 	struct btrfs_file_extent file_extent;
 };
 
@@ -1891,6 +1886,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_file_extent_item *fi;
 	struct btrfs_root *csum_root;
+	u64 io_start;
 	u64 extent_end;
 	u8 extent_type;
 	int can_nocow = 0;
@@ -1903,11 +1899,6 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	if (extent_type == BTRFS_FILE_EXTENT_INLINE)
 		goto out;
 
-	/* Can't access these fields unless we know it's not an inline extent. */
-	args->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
-	args->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
-	args->extent_offset = btrfs_file_extent_offset(leaf, fi);
-
 	if (!(inode->flags & BTRFS_INODE_NODATACOW) &&
 	    extent_type == BTRFS_FILE_EXTENT_REG)
 		goto out;
@@ -1923,7 +1914,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 		goto out;
 
 	/* An explicit hole, must COW. */
-	if (args->disk_bytenr == 0)
+	if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0)
 		goto out;
 
 	/* Compressed/encrypted/encoded extents must be COWed. */
@@ -1948,8 +1939,8 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	btrfs_release_path(path);
 
 	ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
-				    key->offset - args->extent_offset,
-				    args->disk_bytenr, args->strict, path);
+				    key->offset - args->file_extent.offset,
+				    args->file_extent.disk_bytenr, args->strict, path);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
 		goto out;
@@ -1970,21 +1961,18 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	    atomic_read(&root->snapshot_force_cow))
 		goto out;
 
-	args->disk_bytenr += args->extent_offset;
-	args->disk_bytenr += args->start - key->offset;
-	args->num_bytes = min(args->end + 1, extent_end) - args->start;
-
-	args->file_extent.num_bytes = args->num_bytes;
+	args->file_extent.num_bytes = min(args->end + 1, extent_end) - args->start;
 	args->file_extent.offset += args->start - key->offset;
+	io_start = args->file_extent.disk_bytenr + args->file_extent.offset;
 
 	/*
 	 * Force COW if csums exist in the range. This ensures that csums for a
 	 * given extent are either valid or do not exist.
 	 */
 
-	csum_root = btrfs_csum_root(root->fs_info, args->disk_bytenr);
-	ret = btrfs_lookup_csums_list(csum_root, args->disk_bytenr,
-				      args->disk_bytenr + args->num_bytes - 1,
+	csum_root = btrfs_csum_root(root->fs_info, io_start);
+	ret = btrfs_lookup_csums_list(csum_root, io_start,
+				      io_start + args->file_extent.num_bytes - 1,
 				      NULL, nowait);
 	WARN_ON_ONCE(ret > 0 && is_freespace_inode);
 	if (ret != 0)
@@ -2043,7 +2031,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		struct extent_buffer *leaf;
 		struct extent_state *cached_state = NULL;
 		u64 extent_end;
-		u64 ram_bytes;
 		u64 nocow_end;
 		int extent_type;
 		bool is_prealloc;
@@ -2122,7 +2109,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			ret = -EUCLEAN;
 			goto error;
 		}
-		ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
 		extent_end = btrfs_file_extent_end(path);
 
 		/*
@@ -2142,7 +2128,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			goto must_cow;
 
 		ret = 0;
-		nocow_bg = btrfs_inc_nocow_writers(fs_info, nocow_args.disk_bytenr);
+		nocow_bg = btrfs_inc_nocow_writers(fs_info,
+				nocow_args.file_extent.disk_bytenr +
+				nocow_args.file_extent.offset);
 		if (!nocow_bg) {
 must_cow:
 			/*
@@ -2178,16 +2166,18 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			}
 		}
 
-		nocow_end = cur_offset + nocow_args.num_bytes - 1;
+		nocow_end = cur_offset + nocow_args.file_extent.num_bytes - 1;
 		lock_extent(&inode->io_tree, cur_offset, nocow_end, &cached_state);
 
 		is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC;
 		if (is_prealloc) {
 			struct extent_map *em;
 
-			em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
-					  nocow_args.disk_num_bytes, /* orig_block_len */
-					  ram_bytes, BTRFS_COMPRESS_NONE,
+			em = create_io_em(inode, cur_offset,
+					  nocow_args.file_extent.num_bytes,
+					  nocow_args.file_extent.disk_num_bytes,
+					  nocow_args.file_extent.ram_bytes,
+					  BTRFS_COMPRESS_NONE,
 					  &nocow_args.file_extent,
 					  BTRFS_ORDERED_PREALLOC);
 			if (IS_ERR(em)) {
@@ -2201,8 +2191,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 		}
 
 		ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
-				nocow_args.num_bytes, nocow_args.num_bytes,
-				nocow_args.disk_bytenr, nocow_args.num_bytes, 0,
+				nocow_args.file_extent.num_bytes,
+				nocow_args.file_extent.num_bytes,
+				nocow_args.file_extent.disk_bytenr +
+				nocow_args.file_extent.offset,
+				nocow_args.file_extent.num_bytes, 0,
 				is_prealloc
 				? (1 << BTRFS_ORDERED_PREALLOC)
 				: (1 << BTRFS_ORDERED_NOCOW),
@@ -7177,8 +7170,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  *	 any ordered extents.
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
-			      u64 *orig_block_len,
-			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
+			      struct btrfs_file_extent *file_extent,
 			      bool nowait, bool strict)
 {
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -7229,8 +7221,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 
 	fi = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
 	found_type = btrfs_file_extent_type(leaf, fi);
-	if (ram_bytes)
-		*ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
 
 	nocow_args.start = offset;
 	nocow_args.end = offset + *len - 1;
@@ -7248,14 +7238,15 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 	}
 
 	ret = 0;
-	if (btrfs_extent_readonly(fs_info, nocow_args.disk_bytenr))
+	if (btrfs_extent_readonly(fs_info,
+				nocow_args.file_extent.disk_bytenr + nocow_args.file_extent.offset))
 		goto out;
 
 	if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
 	    found_type == BTRFS_FILE_EXTENT_PREALLOC) {
 		u64 range_end;
 
-		range_end = round_up(offset + nocow_args.num_bytes,
+		range_end = round_up(offset + nocow_args.file_extent.num_bytes,
 				     root->fs_info->sectorsize) - 1;
 		ret = test_range_bit_exists(io_tree, offset, range_end, EXTENT_DELALLOC);
 		if (ret) {
@@ -7264,13 +7255,11 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 		}
 	}
 
-	if (orig_block_len)
-		*orig_block_len = nocow_args.disk_num_bytes;
 	if (file_extent)
 		memcpy(file_extent, &nocow_args.file_extent,
 		       sizeof(*file_extent));
 
-	*len = nocow_args.num_bytes;
+	*len = nocow_args.file_extent.num_bytes;
 	ret = 1;
 out:
 	btrfs_free_path(path);
@@ -7455,7 +7444,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	struct btrfs_file_extent file_extent;
 	struct extent_map *em = *map;
 	int type;
-	u64 block_start, orig_block_len, ram_bytes;
+	u64 block_start;
 	struct btrfs_block_group *bg;
 	bool can_nocow = false;
 	bool space_reserved = false;
@@ -7483,7 +7472,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		block_start = extent_map_block_start(em) + (start - em->start);
 
 		if (can_nocow_extent(inode, start, &len,
-				     &orig_block_len, &ram_bytes,
 				     &file_extent, false, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
 			if (bg)
@@ -7510,8 +7498,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		space_reserved = true;
 
 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
-					      orig_block_len,
-					      ram_bytes, type,
+					      file_extent.disk_num_bytes,
+					      file_extent.ram_bytes, type,
 					      &file_extent);
 		btrfs_dec_nocow_writers(bg);
 		if (type == BTRFS_ORDERED_PREALLOC) {
@@ -10731,7 +10719,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		free_extent_map(em);
 		em = NULL;
 
-		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
+		ret = can_nocow_extent(inode, start, &len, NULL, false, true);
 		if (ret < 0) {
 			goto out;
 		} else if (ret) {
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 07/11] btrfs: remove extent_map::block_start member
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (5 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 06/11] btrfs: remove extent_map::block_len member Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23 17:56  1%   ` Filipe Manana
  2024-05-23  5:03  1% ` [PATCH v3 08/11] btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args Qu Wenruo
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

The member extent_map::block_start can be calculated as
extent_map::disk_bytenr + extent_map::offset for regular uncompressed
extents, and is just extent_map::disk_bytenr otherwise (compressed
extents, holes and inline extents).

This relationship is already validated by validate_extent_map(), so we
can now remove the member.
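
To keep that calculation in one place, the patch adds a small helper to
extent_map.h and switches all readers of the old member to it (shown
here for reference):

	static inline u64 extent_map_block_start(const struct extent_map *em)
	{
		/* A real on-disk extent. */
		if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
			/* For compressed extents the block start is disk_bytenr itself. */
			if (extent_map_is_compressed(em))
				return em->disk_bytenr;
			return em->disk_bytenr + em->offset;
		}
		/* Holes and inline extents return their sentinel value. */
		return em->disk_bytenr;
	}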

However there is a special case in btrfs_create_dio_extent(): for
NOCOW/PREALLOC ordered extents we cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->offset as the disk_bytenr of the ordered extent, and 0 as
its offset.
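
Concretely, the ordered extent allocation in btrfs_create_dio_extent()
then ends up as (condensed from the hunk below):

	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
					     file_extent->disk_bytenr +
					     file_extent->offset,
					     len, 0,
					     (1 << type) |
					     (1 << BTRFS_ORDERED_DIRECT),
					     BTRFS_COMPRESS_NONE);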

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/compression.c            |  3 +-
 fs/btrfs/defrag.c                 |  9 ++-
 fs/btrfs/extent_io.c              | 10 ++--
 fs/btrfs/extent_map.c             | 55 +++++------------
 fs/btrfs/extent_map.h             | 22 ++++---
 fs/btrfs/file-item.c              |  4 --
 fs/btrfs/file.c                   | 11 ++--
 fs/btrfs/inode.c                  | 80 ++++++++++++++-----------
 fs/btrfs/relocation.c             |  1 -
 fs/btrfs/tests/extent-map-tests.c | 48 ++++++---------
 fs/btrfs/tests/inode-tests.c      | 99 ++++++++++++++++---------------
 fs/btrfs/tree-log.c               | 17 +++---
 fs/btrfs/zoned.c                  |  4 +-
 include/trace/events/btrfs.h      | 11 +---
 14 files changed, 168 insertions(+), 206 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index cd88432e7072..07b31d1c0926 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -507,7 +507,8 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 		 */
 		if (!em || cur < em->start ||
 		    (cur + fs_info->sectorsize > extent_map_end(em)) ||
-		    (em->block_start >> SECTOR_SHIFT) != orig_bio->bi_iter.bi_sector) {
+		    (extent_map_block_start(em) >> SECTOR_SHIFT) !=
+		    orig_bio->bi_iter.bi_sector) {
 			free_extent_map(em);
 			unlock_extent(tree, cur, page_end, NULL);
 			unlock_page(page);
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 025e7f853a68..6fb94e897fc5 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -707,7 +707,6 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
 		 */
 		if (key.offset > start) {
 			em->start = start;
-			em->block_start = EXTENT_MAP_HOLE;
 			em->disk_bytenr = EXTENT_MAP_HOLE;
 			em->disk_num_bytes = 0;
 			em->ram_bytes = 0;
@@ -828,7 +827,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 	 */
 	next = defrag_lookup_extent(inode, em->start + em->len, newer_than, locked);
 	/* No more em or hole */
-	if (!next || next->block_start >= EXTENT_MAP_LAST_BYTE)
+	if (!next || next->disk_bytenr >= EXTENT_MAP_LAST_BYTE)
 		goto out;
 	if (next->flags & EXTENT_FLAG_PREALLOC)
 		goto out;
@@ -995,12 +994,12 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		 * This is for users who want to convert inline extents to
 		 * regular ones through max_inline= mount option.
 		 */
-		if (em->block_start == EXTENT_MAP_INLINE &&
+		if (em->disk_bytenr == EXTENT_MAP_INLINE &&
 		    em->len <= inode->root->fs_info->max_inline)
 			goto next;
 
 		/* Skip holes and preallocated extents. */
-		if (em->block_start == EXTENT_MAP_HOLE ||
+		if (em->disk_bytenr == EXTENT_MAP_HOLE ||
 		    (em->flags & EXTENT_FLAG_PREALLOC))
 			goto next;
 
@@ -1065,7 +1064,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		 * So if an inline extent passed all above checks, just add it
 		 * for defrag, and be converted to regular extents.
 		 */
-		if (em->block_start == EXTENT_MAP_INLINE)
+		if (em->disk_bytenr == EXTENT_MAP_INLINE)
 			goto add;
 
 		next_mergeable = defrag_check_next_extent(&inode->vfs_inode, em,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf50301ee528..063d7954c9ed 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1083,10 +1083,10 @@ static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
 		iosize = min(extent_map_end(em) - cur, end - cur + 1);
 		iosize = ALIGN(iosize, blocksize);
 		if (compress_type != BTRFS_COMPRESS_NONE)
-			disk_bytenr = em->block_start;
+			disk_bytenr = em->disk_bytenr;
 		else
-			disk_bytenr = em->block_start + extent_offset;
-		block_start = em->block_start;
+			disk_bytenr = extent_map_block_start(em) + extent_offset;
+		block_start = extent_map_block_start(em);
 		if (em->flags & EXTENT_FLAG_PREALLOC)
 			block_start = EXTENT_MAP_HOLE;
 
@@ -1405,8 +1405,8 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		ASSERT(IS_ALIGNED(em->start, fs_info->sectorsize));
 		ASSERT(IS_ALIGNED(em->len, fs_info->sectorsize));
 
-		block_start = em->block_start;
-		disk_bytenr = em->block_start + extent_offset;
+		block_start = extent_map_block_start(em);
+		disk_bytenr = extent_map_block_start(em) + extent_offset;
 
 		ASSERT(!extent_map_is_compressed(em));
 		ASSERT(block_start != EXTENT_MAP_HOLE);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 0c100fe47c43..38a1f07581b0 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -192,9 +192,10 @@ static inline u64 extent_map_block_len(const struct extent_map *em)
 
 static inline u64 extent_map_block_end(const struct extent_map *em)
 {
-	if (em->block_start + extent_map_block_len(em) < em->block_start)
+	if (extent_map_block_start(em) + extent_map_block_len(em) <
+	    extent_map_block_start(em))
 		return (u64)-1;
-	return em->block_start + extent_map_block_len(em);
+	return extent_map_block_start(em) + extent_map_block_len(em);
 }
 
 static bool can_merge_extent_map(const struct extent_map *em)
@@ -229,11 +230,11 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
 	if (prev->flags != next->flags)
 		return false;
 
-	if (next->block_start < EXTENT_MAP_LAST_BYTE - 1)
-		return next->block_start == extent_map_block_end(prev);
+	if (next->disk_bytenr < EXTENT_MAP_LAST_BYTE - 1)
+		return extent_map_block_start(next) == extent_map_block_end(prev);
 
 	/* HOLES and INLINE extents. */
-	return next->block_start == prev->block_start;
+	return next->disk_bytenr == prev->disk_bytenr;
 }
 
 /*
@@ -295,10 +296,9 @@ static void dump_extent_map(struct btrfs_fs_info *fs_info,
 {
 	if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
 		return;
-	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu block_start=%llu flags=0x%x\n",
+	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu flags=0x%x\n",
 		prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
-		em->ram_bytes, em->offset, em->block_start,
-		em->flags);
+		em->ram_bytes, em->offset, em->flags);
 	ASSERT(0);
 }
 
@@ -316,15 +316,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info,
 		if (em->offset + em->len > em->disk_num_bytes &&
 		    !extent_map_is_compressed(em))
 			dump_extent_map(fs_info, "disk_num_bytes too small", em);
-
-		if (extent_map_is_compressed(em)) {
-			if (em->block_start != em->disk_bytenr)
-				dump_extent_map(fs_info,
-				"mismatch block_start/disk_bytenr/offset", em);
-		} else if (em->block_start != em->disk_bytenr + em->offset) {
-			dump_extent_map(fs_info,
-				"mismatch block_start/disk_bytenr/offset", em);
-		}
 	} else if (em->offset) {
 		dump_extent_map(fs_info,
 				"non-zero offset for hole/inline", em);
@@ -359,7 +350,6 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) {
 			em->start = merge->start;
 			em->len += merge->len;
-			em->block_start = merge->block_start;
 			em->generation = max(em->generation, merge->generation);
 
 			if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
@@ -669,11 +659,9 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 	start_diff = start - em->start;
 	em->start = start;
 	em->len = end - start;
-	if (em->block_start < EXTENT_MAP_LAST_BYTE &&
-	    !extent_map_is_compressed(em)) {
-		em->block_start += start_diff;
+	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE &&
+	    !extent_map_is_compressed(em))
 		em->offset += start_diff;
-	}
 	return add_extent_mapping(inode, em, 0);
 }
 
@@ -708,7 +696,7 @@ int btrfs_add_extent_mapping(struct btrfs_inode *inode,
 	 * Tree-checker should have rejected any inline extent with non-zero
 	 * file offset. Here just do a sanity check.
 	 */
-	if (em->block_start == EXTENT_MAP_INLINE)
+	if (em->disk_bytenr == EXTENT_MAP_INLINE)
 		ASSERT(em->start == 0);
 
 	ret = add_extent_mapping(inode, em, 0);
@@ -842,7 +830,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 		u64 gen;
 		unsigned long flags;
 		bool modified;
-		bool compressed;
 
 		if (em_end < end) {
 			next_em = next_extent_map(em);
@@ -876,7 +863,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			goto remove_em;
 
 		gen = em->generation;
-		compressed = extent_map_is_compressed(em);
 
 		if (em->start < start) {
 			if (!split) {
@@ -888,15 +874,12 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			split->start = em->start;
 			split->len = start - em->start;
 
-			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
-				split->block_start = em->block_start;
-
+			if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
 				split->disk_bytenr = em->disk_bytenr;
 				split->disk_num_bytes = em->disk_num_bytes;
 				split->offset = em->offset;
 				split->ram_bytes = em->ram_bytes;
 			} else {
-				split->block_start = em->block_start;
 				split->disk_bytenr = em->disk_bytenr;
 				split->disk_num_bytes = 0;
 				split->offset = 0;
@@ -919,20 +902,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			}
 			split->start = end;
 			split->len = em_end - end;
-			split->block_start = em->block_start;
 			split->disk_bytenr = em->disk_bytenr;
 			split->flags = flags;
 			split->generation = gen;
 
-			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
+			if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
 				split->disk_num_bytes = em->disk_num_bytes;
 				split->offset = em->offset + end - em->start;
 				split->ram_bytes = em->ram_bytes;
-				if (!compressed) {
-					const u64 diff = end - em->start;
-
-					split->block_start += diff;
-				}
 			} else {
 				split->disk_num_bytes = 0;
 				split->offset = 0;
@@ -1079,7 +1056,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 
 	ASSERT(em->len == len);
 	ASSERT(!extent_map_is_compressed(em));
-	ASSERT(em->block_start < EXTENT_MAP_LAST_BYTE);
+	ASSERT(em->disk_bytenr < EXTENT_MAP_LAST_BYTE);
 	ASSERT(em->flags & EXTENT_FLAG_PINNED);
 	ASSERT(!(em->flags & EXTENT_FLAG_LOGGING));
 	ASSERT(!list_empty(&em->list));
@@ -1093,7 +1070,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->disk_bytenr = new_logical;
 	split_pre->disk_num_bytes = split_pre->len;
 	split_pre->offset = 0;
-	split_pre->block_start = new_logical;
 	split_pre->ram_bytes = split_pre->len;
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
@@ -1108,10 +1084,9 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	/* Insert the middle extent_map. */
 	split_mid->start = em->start + pre;
 	split_mid->len = em->len - pre;
-	split_mid->disk_bytenr = em->block_start + pre;
+	split_mid->disk_bytenr = extent_map_block_start(em) + pre;
 	split_mid->disk_num_bytes = split_mid->len;
 	split_mid->offset = 0;
-	split_mid->block_start = em->block_start + pre;
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 5312bb542af0..2bcf7149b44c 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -90,18 +90,6 @@ struct extent_map {
 	 */
 	u64 ram_bytes;
 
-	/*
-	 * The on-disk logical bytenr for the file extent.
-	 *
-	 * For compressed extents it matches btrfs_file_extent_item::disk_bytenr.
-	 * For uncompressed extents it matches
-	 * btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset
-	 *
-	 * For holes it is EXTENT_MAP_HOLE and for inline extents it is
-	 * EXTENT_MAP_INLINE.
-	 */
-	u64 block_start;
-
 	/*
 	 * Generation of the extent map, for merged em it's the highest
 	 * generation of all merged ems.
@@ -162,6 +150,16 @@ static inline int extent_map_in_tree(const struct extent_map *em)
 	return !RB_EMPTY_NODE(&em->rb_node);
 }
 
+static inline u64 extent_map_block_start(const struct extent_map *em)
+{
+	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
+		if (extent_map_is_compressed(em))
+			return em->disk_bytenr;
+		return em->disk_bytenr + em->offset;
+	}
+	return em->disk_bytenr;
+}
+
 static inline u64 extent_map_end(const struct extent_map *em)
 {
 	if (em->start + em->len < em->start)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 397df6588ce2..55703c833f3d 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1295,7 +1295,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->len = btrfs_file_extent_end(path) - extent_start;
 		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
 		if (bytenr == 0) {
-			em->block_start = EXTENT_MAP_HOLE;
 			em->disk_bytenr = EXTENT_MAP_HOLE;
 			em->disk_num_bytes = 0;
 			em->offset = 0;
@@ -1306,10 +1305,8 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->offset = btrfs_file_extent_offset(leaf, fi);
 		if (compress_type != BTRFS_COMPRESS_NONE) {
 			extent_map_set_compression(em, compress_type);
-			em->block_start = bytenr;
 		} else {
 			bytenr += btrfs_file_extent_offset(leaf, fi);
-			em->block_start = bytenr;
 			if (type == BTRFS_FILE_EXTENT_PREALLOC)
 				em->flags |= EXTENT_FLAG_PREALLOC;
 		}
@@ -1317,7 +1314,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		/* Tree-checker has ensured this. */
 		ASSERT(extent_start == 0);
 
-		em->block_start = EXTENT_MAP_INLINE;
 		em->disk_bytenr = EXTENT_MAP_INLINE;
 		em->start = 0;
 		em->len = fs_info->sectorsize;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7033ea619073..f0cb7b29cab2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2348,7 +2348,6 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 		hole_em->len = end - offset;
 		hole_em->ram_bytes = hole_em->len;
 
-		hole_em->block_start = EXTENT_MAP_HOLE;
 		hole_em->disk_bytenr = EXTENT_MAP_HOLE;
 		hole_em->disk_num_bytes = 0;
 		hole_em->generation = trans->transid;
@@ -2381,7 +2380,7 @@ static int find_first_non_hole(struct btrfs_inode *inode, u64 *start, u64 *len)
 		return PTR_ERR(em);
 
 	/* Hole or vacuum extent(only exists in no-hole mode) */
-	if (em->block_start == EXTENT_MAP_HOLE) {
+	if (em->disk_bytenr == EXTENT_MAP_HOLE) {
 		ret = 1;
 		*len = em->start + em->len > *start + *len ?
 		       0 : *start + *len - em->start - em->len;
@@ -3038,7 +3037,7 @@ static int btrfs_zero_range_check_range_boundary(struct btrfs_inode *inode,
 	if (IS_ERR(em))
 		return PTR_ERR(em);
 
-	if (em->block_start == EXTENT_MAP_HOLE)
+	if (em->disk_bytenr == EXTENT_MAP_HOLE)
 		ret = RANGE_BOUNDARY_HOLE;
 	else if (em->flags & EXTENT_FLAG_PREALLOC)
 		ret = RANGE_BOUNDARY_PREALLOC_EXTENT;
@@ -3102,7 +3101,7 @@ static int btrfs_zero_range(struct inode *inode,
 		ASSERT(IS_ALIGNED(alloc_start, sectorsize));
 		len = offset + len - alloc_start;
 		offset = alloc_start;
-		alloc_hint = em->block_start + em->len;
+		alloc_hint = extent_map_block_start(em) + em->len;
 	}
 	free_extent_map(em);
 
@@ -3120,7 +3119,7 @@ static int btrfs_zero_range(struct inode *inode,
 							   mode);
 			goto out;
 		}
-		if (len < sectorsize && em->block_start != EXTENT_MAP_HOLE) {
+		if (len < sectorsize && em->disk_bytenr != EXTENT_MAP_HOLE) {
 			free_extent_map(em);
 			ret = btrfs_truncate_block(BTRFS_I(inode), offset, len,
 						   0);
@@ -3333,7 +3332,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 		last_byte = min(extent_map_end(em), alloc_end);
 		actual_end = min_t(u64, extent_map_end(em), offset + len);
 		last_byte = ALIGN(last_byte, blocksize);
-		if (em->block_start == EXTENT_MAP_HOLE ||
+		if (em->disk_bytenr == EXTENT_MAP_HOLE ||
 		    (cur_offset >= inode->i_size &&
 		     !(em->flags & EXTENT_FLAG_PREALLOC))) {
 			const u64 range_len = last_byte - cur_offset;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 00bb64fdf938..1b78769d1e41 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -138,7 +138,7 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
 				     u64 end, struct writeback_control *wbc,
 				     bool pages_dirty);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len, u64 block_start,
+				       u64 len,
 				       u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
@@ -1209,7 +1209,6 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 
 	em = create_io_em(inode, start,
 			  async_extent->ram_size,	/* len */
-			  ins.objectid,			/* block_start */
 			  ins.offset,			/* orig_block_len */
 			  async_extent->ram_size,	/* ram_bytes */
 			  async_extent->compress_type,
@@ -1287,15 +1286,15 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
 		 * first block in this inode and use that as a hint.  If that
 		 * block is also bogus then just don't worry about it.
 		 */
-		if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
+		if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
 			free_extent_map(em);
 			em = search_extent_mapping(em_tree, 0, 0);
-			if (em && em->block_start < EXTENT_MAP_LAST_BYTE)
-				alloc_hint = em->block_start;
+			if (em && em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
+				alloc_hint = extent_map_block_start(em);
 			if (em)
 				free_extent_map(em);
 		} else {
-			alloc_hint = em->block_start;
+			alloc_hint = extent_map_block_start(em);
 			free_extent_map(em);
 		}
 	}
@@ -1451,7 +1450,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			    &cached);
 
 		em = create_io_em(inode, start, ins.offset, /* len */
-				  ins.objectid, /* block_start */
 				  ins.offset, /* orig_block_len */
 				  ram_size, /* ram_bytes */
 				  BTRFS_COMPRESS_NONE, /* compress_type */
@@ -2188,7 +2186,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 			struct extent_map *em;
 
 			em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
-					  nocow_args.disk_bytenr, /* block_start */
 					  nocow_args.disk_num_bytes, /* orig_block_len */
 					  ram_bytes, BTRFS_COMPRESS_NONE,
 					  &nocow_args.file_extent,
@@ -2703,7 +2700,7 @@ static int btrfs_find_new_delalloc_bytes(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			return PTR_ERR(em);
 
-		if (em->block_start != EXTENT_MAP_HOLE)
+		if (extent_map_block_start(em) != EXTENT_MAP_HOLE)
 			goto next;
 
 		em_len = em->len;
@@ -5022,7 +5019,6 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 			hole_em->start = cur_offset;
 			hole_em->len = hole_size;
 
-			hole_em->block_start = EXTENT_MAP_HOLE;
 			hole_em->disk_bytenr = EXTENT_MAP_HOLE;
 			hole_em->disk_num_bytes = 0;
 			hole_em->ram_bytes = hole_size;
@@ -6879,7 +6875,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	if (em) {
 		if (em->start > start || em->start + em->len <= start)
 			free_extent_map(em);
-		else if (em->block_start == EXTENT_MAP_INLINE && page)
+		else if (em->disk_bytenr == EXTENT_MAP_INLINE && page)
 			free_extent_map(em);
 		else
 			goto out;
@@ -6982,7 +6978,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 		/* New extent overlaps with existing one */
 		em->start = start;
 		em->len = found_key.offset - start;
-		em->block_start = EXTENT_MAP_HOLE;
+		em->disk_bytenr = EXTENT_MAP_HOLE;
 		goto insert;
 	}
 
@@ -7006,7 +7002,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 		 *
 		 * Other members are not utilized for inline extents.
 		 */
-		ASSERT(em->block_start == EXTENT_MAP_INLINE);
+		ASSERT(em->disk_bytenr == EXTENT_MAP_INLINE);
 		ASSERT(em->len == fs_info->sectorsize);
 
 		ret = read_inline_extent(inode, path, page);
@@ -7017,7 +7013,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 not_found:
 	em->start = start;
 	em->len = len;
-	em->block_start = EXTENT_MAP_HOLE;
+	em->disk_bytenr = EXTENT_MAP_HOLE;
 insert:
 	ret = 0;
 	btrfs_release_path(path);
@@ -7048,7 +7044,6 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
 						  const u64 len,
-						  const u64 block_start,
 						  const u64 orig_block_len,
 						  const u64 ram_bytes,
 						  const int type,
@@ -7058,15 +7053,34 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
-		em = create_io_em(inode, start, len, block_start,
+		em = create_io_em(inode, start, len,
 				  orig_block_len, ram_bytes,
 				  BTRFS_COMPRESS_NONE, /* compress_type */
 				  file_extent, type);
 		if (IS_ERR(em))
 			goto out;
 	}
+
+	/*
+	 * For regular writes, file_extent->offset is always 0,
+	 * thus we really only need file_extent->disk_bytenr; every other
+	 * length (disk_num_bytes/ram_bytes) should match @len and
+	 * file_extent->num_bytes.
+	 *
+	 * For NOCOW, we don't really care about the numbers except
+	 * @start and @len, as we won't insert a file extent
+	 * item at all.
+	 *
+	 * For PREALLOC, we do not use ordered extent members, but
+	 * btrfs_mark_extent_written() handles everything.
+	 *
+	 * So here we always pass 0 as the offset for the ordered extent,
+	 * otherwise btrfs_split_ordered_extent() cannot handle it correctly.
+	 */
 	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
-					     block_start, len, 0,
+					     file_extent->disk_bytenr +
+					     file_extent->offset,
+					     len, 0,
 					     (1 << type) |
 					     (1 << BTRFS_ORDERED_DIRECT),
 					     BTRFS_COMPRESS_NONE);
@@ -7118,7 +7132,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	file_extent.offset = 0;
 	file_extent.compression = BTRFS_COMPRESS_NONE;
 	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
-				     ins.objectid, ins.offset,
+				     ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR,
 				     &file_extent);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -7358,7 +7372,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 /* The callers of this must take lock_extent() */
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len, u64 block_start,
+				       u64 len,
 				       u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
@@ -7410,7 +7424,6 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 
 	em->start = start;
 	em->len = len;
-	em->block_start = block_start;
 	em->disk_bytenr = file_extent->disk_bytenr;
 	em->disk_num_bytes = disk_num_bytes;
 	em->ram_bytes = ram_bytes;
@@ -7461,13 +7474,13 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	 */
 	if ((em->flags & EXTENT_FLAG_PREALLOC) ||
 	    ((BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
-	     em->block_start != EXTENT_MAP_HOLE)) {
+	     em->disk_bytenr != EXTENT_MAP_HOLE)) {
 		if (em->flags & EXTENT_FLAG_PREALLOC)
 			type = BTRFS_ORDERED_PREALLOC;
 		else
 			type = BTRFS_ORDERED_NOCOW;
 		len = min(len, em->len - (start - em->start));
-		block_start = em->block_start + (start - em->start);
+		block_start = extent_map_block_start(em) + (start - em->start);
 
 		if (can_nocow_extent(inode, start, &len,
 				     &orig_block_len, &ram_bytes,
@@ -7497,7 +7510,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		space_reserved = true;
 
 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
-					      block_start,
 					      orig_block_len,
 					      ram_bytes, type,
 					      &file_extent);
@@ -7700,7 +7712,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	 * the generic code.
 	 */
 	if (extent_map_is_compressed(em) ||
-	    em->block_start == EXTENT_MAP_INLINE) {
+	    em->disk_bytenr == EXTENT_MAP_INLINE) {
 		free_extent_map(em);
 		/*
 		 * If we are in a NOWAIT context, return -EAGAIN in order to
@@ -7794,12 +7806,12 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	 * We trim the extents (and move the addr) even though iomap code does
 	 * that, since we have locked only the parts we are performing I/O in.
 	 */
-	if ((em->block_start == EXTENT_MAP_HOLE) ||
+	if ((em->disk_bytenr == EXTENT_MAP_HOLE) ||
 	    ((em->flags & EXTENT_FLAG_PREALLOC) && !write)) {
 		iomap->addr = IOMAP_NULL_ADDR;
 		iomap->type = IOMAP_HOLE;
 	} else {
-		iomap->addr = em->block_start + (start - em->start);
+		iomap->addr = extent_map_block_start(em) + (start - em->start);
 		iomap->type = IOMAP_MAPPED;
 	}
 	iomap->offset = start;
@@ -9638,7 +9650,6 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 
 		em->start = cur_offset;
 		em->len = ins.offset;
-		em->block_start = ins.objectid;
 		em->disk_bytenr = ins.objectid;
 		em->offset = 0;
 		em->disk_num_bytes = ins.offset;
@@ -10104,7 +10115,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		goto out_unlock_extent;
 	}
 
-	if (em->block_start == EXTENT_MAP_INLINE) {
+	if (em->disk_bytenr == EXTENT_MAP_INLINE) {
 		u64 extent_start = em->start;
 
 		/*
@@ -10125,14 +10136,14 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	 */
 	encoded->len = min_t(u64, extent_map_end(em),
 			     inode->vfs_inode.i_size) - iocb->ki_pos;
-	if (em->block_start == EXTENT_MAP_HOLE ||
+	if (em->disk_bytenr == EXTENT_MAP_HOLE ||
 	    (em->flags & EXTENT_FLAG_PREALLOC)) {
 		disk_bytenr = EXTENT_MAP_HOLE;
 		count = min_t(u64, count, encoded->len);
 		encoded->len = count;
 		encoded->unencoded_len = count;
 	} else if (extent_map_is_compressed(em)) {
-		disk_bytenr = em->block_start;
+		disk_bytenr = em->disk_bytenr;
 		/*
 		 * Bail if the buffer isn't large enough to return the whole
 		 * compressed extent.
@@ -10151,7 +10162,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 			goto out_em;
 		encoded->compression = ret;
 	} else {
-		disk_bytenr = em->block_start + (start - em->start);
+		disk_bytenr = extent_map_block_start(em) + (start - em->start);
 		if (encoded->len > count)
 			encoded->len = count;
 		/*
@@ -10389,7 +10400,6 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	file_extent.offset = encoded->unencoded_offset;
 	file_extent.compression = compression;
 	em = create_io_em(inode, start, num_bytes,
-			  ins.objectid,
 			  ins.offset, ram_bytes, compression,
 			  &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
@@ -10693,12 +10703,12 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			goto out;
 		}
 
-		if (em->block_start == EXTENT_MAP_HOLE) {
+		if (em->disk_bytenr == EXTENT_MAP_HOLE) {
 			btrfs_warn(fs_info, "swapfile must not have holes");
 			ret = -EINVAL;
 			goto out;
 		}
-		if (em->block_start == EXTENT_MAP_INLINE) {
+		if (em->disk_bytenr == EXTENT_MAP_INLINE) {
 			/*
 			 * It's unlikely we'll ever actually find ourselves
 			 * here, as a file small enough to fit inline won't be
@@ -10716,7 +10726,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 			goto out;
 		}
 
-		logical_block_start = em->block_start + (start - em->start);
+		logical_block_start = extent_map_block_start(em) + (start - em->start);
 		len = min(len, em->len - (start - em->start));
 		free_extent_map(em);
 		em = NULL;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 68fe52ab445d..bcb665613e78 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2912,7 +2912,6 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 
 	em->start = start;
 	em->len = end + 1 - start;
-	em->block_start = block_start;
 	em->disk_bytenr = block_start;
 	em->disk_num_bytes = em->len;
 	em->ram_bytes = em->len;
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 0dd270d6c506..ebec4ab361b8 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -28,8 +28,8 @@ static int free_extent_map_tree(struct btrfs_inode *inode)
 		if (refcount_read(&em->refs) != 1) {
 			ret = -EINVAL;
 			test_err(
-"em leak: em (start %llu len %llu block_start %llu disk_num_bytes %llu offset %llu) refs %d",
-				 em->start, em->len, em->block_start,
+"em leak: em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu offset %llu) refs %d",
+				 em->start, em->len, em->disk_bytenr,
 				 em->disk_num_bytes, em->offset,
 				 refcount_read(&em->refs));
 
@@ -77,7 +77,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* Add [0, 16K) */
 	em->start = 0;
 	em->len = SZ_16K;
-	em->block_start = 0;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -100,7 +99,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	em->start = SZ_16K;
 	em->len = SZ_4K;
-	em->block_start = SZ_32K; /* avoid merging */
 	em->disk_bytenr = SZ_32K; /* avoid merging */
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -123,7 +121,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* Add [0, 8K), should return [0, 16K) instead. */
 	em->start = start;
 	em->len = len;
-	em->block_start = start;
 	em->disk_bytenr = start;
 	em->disk_num_bytes = len;
 	em->ram_bytes = len;
@@ -141,11 +138,11 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 		goto out;
 	}
 	if (em->start != 0 || extent_map_end(em) != SZ_16K ||
-	    em->block_start != 0 || em->disk_num_bytes != SZ_16K) {
+	    em->disk_bytenr != 0 || em->disk_num_bytes != SZ_16K) {
 		test_err(
-"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu",
+"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu",
 			 start, start + len, ret, em->start, em->len,
-			 em->block_start, em->disk_num_bytes);
+			 em->disk_bytenr, em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -179,7 +176,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* Add [0, 1K) */
 	em->start = 0;
 	em->len = SZ_1K;
-	em->block_start = EXTENT_MAP_INLINE;
 	em->disk_bytenr = EXTENT_MAP_INLINE;
 	em->disk_num_bytes = 0;
 	em->ram_bytes = SZ_1K;
@@ -202,7 +198,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	em->start = SZ_4K;
 	em->len = SZ_4K;
-	em->block_start = SZ_4K;
 	em->disk_bytenr = SZ_4K;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -225,7 +220,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* Add [0, 1K) */
 	em->start = 0;
 	em->len = SZ_1K;
-	em->block_start = EXTENT_MAP_INLINE;
 	em->disk_bytenr = EXTENT_MAP_INLINE;
 	em->disk_num_bytes = 0;
 	em->ram_bytes = SZ_1K;
@@ -242,10 +236,10 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 		goto out;
 	}
 	if (em->start != 0 || extent_map_end(em) != SZ_1K ||
-	    em->block_start != EXTENT_MAP_INLINE) {
+	    em->disk_bytenr != EXTENT_MAP_INLINE) {
 		test_err(
-"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu block_start %llu",
-			 ret, em->start, em->len, em->block_start);
+"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu disk_bytenr %llu",
+			 ret, em->start, em->len, em->disk_bytenr);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -275,7 +269,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	/* Add [4K, 8K) */
 	em->start = SZ_4K;
 	em->len = SZ_4K;
-	em->block_start = SZ_4K;
 	em->disk_bytenr = SZ_4K;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -298,7 +291,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	/* Add [0, 16K) */
 	em->start = 0;
 	em->len = SZ_16K;
-	em->block_start = 0;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -321,11 +313,11 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	 * em->start.
 	 */
 	if (start < em->start || start + len > extent_map_end(em) ||
-	    em->start != em->block_start) {
+	    em->start != extent_map_block_start(em)) {
 		test_err(
-"case3 [%llu %llu): ret %d em (start %llu len %llu block_start %llu block_len %llu)",
+"case3 [%llu %llu): ret %d em (start %llu len %llu disk_bytenr %llu block_len %llu)",
 			 start, start + len, ret, em->start, em->len,
-			 em->block_start, em->disk_num_bytes);
+			 em->disk_bytenr, em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -386,7 +378,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	/* Add [0K, 8K) */
 	em->start = 0;
 	em->len = SZ_8K;
-	em->block_start = 0;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_8K;
 	em->ram_bytes = SZ_8K;
@@ -409,7 +400,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	/* Add [8K, 32K) */
 	em->start = SZ_8K;
 	em->len = 24 * SZ_1K;
-	em->block_start = SZ_16K; /* avoid merging */
 	em->disk_bytenr = SZ_16K; /* avoid merging */
 	em->disk_num_bytes = 24 * SZ_1K;
 	em->ram_bytes = 24 * SZ_1K;
@@ -431,7 +421,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	/* Add [0K, 32K) */
 	em->start = 0;
 	em->len = SZ_32K;
-	em->block_start = 0;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_32K;
 	em->ram_bytes = SZ_32K;
@@ -451,9 +440,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	if (start < em->start || start + len > extent_map_end(em)) {
 		test_err(
-"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu)",
-			 start, start + len, ret, em->start, em->len, em->block_start,
-			 em->disk_num_bytes);
+"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu disk_bytenr %llu disk_num_bytes %llu)",
+			 start, start + len, ret, em->start, em->len,
+			 em->disk_bytenr, em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -517,7 +506,6 @@ static int add_compressed_extent(struct btrfs_inode *inode,
 
 	em->start = start;
 	em->len = len;
-	em->block_start = block_start;
 	em->disk_bytenr = block_start;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = len;
@@ -740,7 +728,6 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	em->start = SZ_4K;
 	em->len = SZ_4K;
-	em->block_start = SZ_16K;
 	em->disk_bytenr = SZ_16K;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -795,7 +782,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* [0, 16K), pinned */
 	em->start = 0;
 	em->len = SZ_16K;
-	em->block_start = 0;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_16K;
@@ -819,7 +805,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	/* [32K, 48K), not pinned */
 	em->start = SZ_32K;
 	em->len = SZ_16K;
-	em->block_start = SZ_32K;
 	em->disk_bytenr = SZ_32K;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -885,8 +870,9 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 		goto out;
 	}
 
-	if (em->block_start != SZ_32K + SZ_4K) {
-		test_err("em->block_start is %llu, expected 36K", em->block_start);
+	if (extent_map_block_start(em) != SZ_32K + SZ_4K) {
+		test_err("em->block_start is %llu, expected 36K",
+				extent_map_block_start(em));
 		goto out;
 	}
 
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index fc390c18ac95..b8f0d67f4cf6 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -264,8 +264,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_HOLE) {
-		test_err("expected a hole, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_HOLE) {
+		test_err("expected a hole, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	free_extent_map(em);
@@ -283,8 +283,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_INLINE) {
-		test_err("expected an inline, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_INLINE) {
+		test_err("expected an inline, got %llu", em->disk_bytenr);
 		goto out;
 	}
 
@@ -321,8 +321,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_HOLE) {
-		test_err("expected a hole, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_HOLE) {
+		test_err("expected a hole, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != 4) {
@@ -344,8 +344,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize - 1) {
@@ -371,8 +371,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -389,7 +389,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
-	disk_bytenr = em->block_start;
+	disk_bytenr = extent_map_block_start(em);
 	orig_start = em->start;
 	offset = em->start + em->len;
 	free_extent_map(em);
@@ -399,8 +399,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_HOLE) {
-		test_err("expected a hole, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_HOLE) {
+		test_err("expected a hole, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -421,8 +421,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != 2 * sectorsize) {
@@ -441,9 +441,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		goto out;
 	}
 	disk_bytenr += (em->start - orig_start);
-	if (em->block_start != disk_bytenr) {
+	if (extent_map_block_start(em) != disk_bytenr) {
 		test_err("wrong block start, want %llu, have %llu",
-			 disk_bytenr, em->block_start);
+			 disk_bytenr, extent_map_block_start(em));
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -455,8 +455,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -483,8 +483,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -502,7 +502,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
-	disk_bytenr = em->block_start;
+	disk_bytenr = extent_map_block_start(em);
 	orig_start = em->start;
 	offset = em->start + em->len;
 	free_extent_map(em);
@@ -512,8 +512,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_HOLE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_HOLE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -531,9 +531,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 em->start - orig_start, em->offset);
 		goto out;
 	}
-	if (em->block_start != disk_bytenr + em->offset) {
+	if (extent_map_block_start(em) != disk_bytenr + em->offset) {
 		test_err("unexpected block start, wanted %llu, have %llu",
-			 disk_bytenr + em->offset, em->block_start);
+			 disk_bytenr + em->offset, extent_map_block_start(em));
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -544,8 +544,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != 2 * sectorsize) {
@@ -564,9 +564,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 em->start, em->offset, orig_start);
 		goto out;
 	}
-	if (em->block_start != disk_bytenr + em->offset) {
+	if (extent_map_block_start(em) != disk_bytenr + em->offset) {
 		test_err("unexpected block start, wanted %llu, have %llu",
-			 disk_bytenr + em->offset, em->block_start);
+			 disk_bytenr + em->offset, extent_map_block_start(em));
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -578,8 +578,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != 2 * sectorsize) {
@@ -611,8 +611,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -635,7 +635,7 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 BTRFS_COMPRESS_ZLIB, extent_map_compression(em));
 		goto out;
 	}
-	disk_bytenr = em->block_start;
+	disk_bytenr = extent_map_block_start(em);
 	orig_start = em->start;
 	offset = em->start + em->len;
 	free_extent_map(em);
@@ -645,8 +645,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -671,9 +671,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != disk_bytenr) {
+	if (extent_map_block_start(em) != disk_bytenr) {
 		test_err("block start does not match, want %llu got %llu",
-			 disk_bytenr, em->block_start);
+			 disk_bytenr, extent_map_block_start(em));
 		goto out;
 	}
 	if (em->start != offset || em->len != 2 * sectorsize) {
@@ -706,8 +706,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -732,8 +732,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_HOLE) {
-		test_err("expected a hole extent, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_HOLE) {
+		test_err("expected a hole extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	/*
@@ -764,8 +764,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (em->disk_bytenr >= EXTENT_MAP_LAST_BYTE) {
+		test_err("expected a real extent, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != offset || em->len != sectorsize) {
@@ -843,8 +843,8 @@ static int test_hole_first(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != EXTENT_MAP_HOLE) {
-		test_err("expected a hole, got %llu", em->block_start);
+	if (em->disk_bytenr != EXTENT_MAP_HOLE) {
+		test_err("expected a hole, got %llu", em->disk_bytenr);
 		goto out;
 	}
 	if (em->start != 0 || em->len != sectorsize) {
@@ -865,8 +865,9 @@ static int test_hole_first(u32 sectorsize, u32 nodesize)
 		test_err("got an error when we shouldn't have");
 		goto out;
 	}
-	if (em->block_start != sectorsize) {
-		test_err("expected a real extent, got %llu", em->block_start);
+	if (extent_map_block_start(em) != sectorsize) {
+		test_err("expected a real extent, got %llu",
+			 extent_map_block_start(em));
 		goto out;
 	}
 	if (em->start != sectorsize || em->len != sectorsize) {
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1d04f0cb6f53..f1e006a5fc4c 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4578,6 +4578,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_ordered_extent *ordered;
 	struct btrfs_root *csum_root;
+	u64 block_start;
 	u64 csum_offset;
 	u64 csum_len;
 	u64 mod_start = em->start;
@@ -4587,7 +4588,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 
 	if (inode->flags & BTRFS_INODE_NODATASUM ||
 	    (em->flags & EXTENT_FLAG_PREALLOC) ||
-	    em->block_start == EXTENT_MAP_HOLE)
+	    em->disk_bytenr == EXTENT_MAP_HOLE)
 		return 0;
 
 	list_for_each_entry(ordered, &ctx->ordered_extents, log_list) {
@@ -4658,9 +4659,10 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 	}
 
 	/* block start is already adjusted for the file extent offset. */
-	csum_root = btrfs_csum_root(trans->fs_info, em->block_start);
-	ret = btrfs_lookup_csums_list(csum_root, em->block_start + csum_offset,
-				      em->block_start + csum_offset +
+	block_start = extent_map_block_start(em);
+	csum_root = btrfs_csum_root(trans->fs_info, block_start);
+	ret = btrfs_lookup_csums_list(csum_root, block_start + csum_offset,
+				      block_start + csum_offset +
 				      csum_len - 1, &ordered_sums, false);
 	if (ret < 0)
 		return ret;
@@ -4692,6 +4694,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	struct btrfs_key key;
 	enum btrfs_compression_type compress_type;
 	u64 extent_offset = em->offset;
+	u64 block_start = extent_map_block_start(em);
 	u64 block_len;
 	int ret;
 
@@ -4704,10 +4707,10 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	block_len = em->disk_num_bytes;
 	compress_type = extent_map_compression(em);
 	if (compress_type != BTRFS_COMPRESS_NONE) {
-		btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start);
+		btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start);
 		btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len);
-	} else if (em->block_start < EXTENT_MAP_LAST_BYTE) {
-		btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start -
+	} else if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
+		btrfs_set_stack_file_extent_disk_bytenr(&fi, block_start -
 							extent_offset);
 		btrfs_set_stack_file_extent_disk_num_bytes(&fi, block_len);
 	}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c52a0063f7db..da9de81f340e 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1773,7 +1773,9 @@ static void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered,
 	write_lock(&em_tree->lock);
 	em = search_extent_mapping(em_tree, ordered->file_offset,
 				   ordered->num_bytes);
-	em->block_start = logical;
+	/* The em should be a new COW extent, thus it should not have an offset. */
+	ASSERT(em->offset == 0);
+	em->disk_bytenr = logical;
 	free_extent_map(em);
 	write_unlock(&em_tree->lock);
 }
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index a1804239812c..ca0f99689a2d 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -291,7 +291,6 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__field(	u64,  ino		)
 		__field(	u64,  start		)
 		__field(	u64,  len		)
-		__field(	u64,  block_start	)
 		__field(	u32,  flags		)
 		__field(	int,  refs		)
 	),
@@ -301,18 +300,16 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__entry->ino		= btrfs_ino(inode);
 		__entry->start		= map->start;
 		__entry->len		= map->len;
-		__entry->block_start	= map->block_start;
 		__entry->flags		= map->flags;
 		__entry->refs		= refcount_read(&map->refs);
 	),
 
 	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu "
-		  "block_start=%llu(%s) flags=%s refs=%u",
+		  "flags=%s refs=%u",
 		  show_root_type(__entry->root_objectid),
 		  __entry->ino,
 		  __entry->start,
 		  __entry->len,
-		  show_map_type(__entry->block_start),
 		  show_map_flags(__entry->flags),
 		  __entry->refs)
 );
@@ -2608,7 +2605,6 @@ TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
 		__field(	u64,	root_id		)
 		__field(	u64,	start		)
 		__field(	u64,	len		)
-		__field(	u64,	block_start	)
 		__field(	u32,	flags		)
 	),
 
@@ -2617,15 +2613,12 @@ TRACE_EVENT(btrfs_extent_map_shrinker_remove_em,
 		__entry->root_id	= inode->root->root_key.objectid;
 		__entry->start		= em->start;
 		__entry->len		= em->len;
-		__entry->block_start	= em->block_start;
 		__entry->flags		= em->flags;
 	),
 
-	TP_printk_btrfs(
-"ino=%llu root=%llu(%s) start=%llu len=%llu block_start=%llu(%s) flags=%s",
+	TP_printk_btrfs("ino=%llu root=%llu(%s) start=%llu len=%llu flags=%s",
 			__entry->ino, show_root_type(__entry->root_id),
 			__entry->start, __entry->len,
-			show_map_type(__entry->block_start),
 			show_map_flags(__entry->flags))
 );
 
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 06/11] btrfs: remove extent_map::block_len member
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
                   ` (4 preceding siblings ...)
  2024-05-23  5:03  1% ` [PATCH v3 05/11] btrfs: remove extent_map::orig_start member Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 07/11] btrfs: remove extent_map::block_start member Qu Wenruo
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

The member extent_map::block_len is always either extent_map::len (for
non-compressed extents) or extent_map::disk_num_bytes (for compressed
extents).

Since we already have sanity checks cross-checking the new and old
members, we can now drop the old extent_map::block_len.

Most call sites can pick extent_map::len or extent_map::disk_num_bytes
directly, since most if not all of them have already checked whether
the extent is compressed.
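
Inside extent_map.c, extent_map_block_end() now derives the old value
via a small static helper (shown here for reference):

	static inline u64 extent_map_block_len(const struct extent_map *em)
	{
		/* A compressed extent covers disk_num_bytes on disk. */
		if (extent_map_is_compressed(em))
			return em->disk_num_bytes;
		/* Otherwise the block length equals the file range length. */
		return em->len;
	}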

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/compression.c            |  2 +-
 fs/btrfs/extent_map.c             | 40 +++++++++++-------------------
 fs/btrfs/extent_map.h             |  9 -------
 fs/btrfs/file-item.c              |  7 ------
 fs/btrfs/file.c                   |  1 -
 fs/btrfs/inode.c                  | 36 +++++++++------------------
 fs/btrfs/relocation.c             |  1 -
 fs/btrfs/tests/extent-map-tests.c | 41 ++++++++++---------------------
 fs/btrfs/tree-log.c               |  4 +--
 include/trace/events/btrfs.h      |  5 +---
 10 files changed, 42 insertions(+), 104 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4f6d748aa99e..cd88432e7072 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -585,7 +585,7 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
 	}
 
 	ASSERT(extent_map_is_compressed(em));
-	compressed_len = em->block_len;
+	compressed_len = em->disk_num_bytes;
 
 	cb = alloc_compressed_bio(inode, file_offset, REQ_OP_READ,
 				  end_bbio_compressed_read);
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 91be54f79d21..0c100fe47c43 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -183,11 +183,18 @@ static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
 	return NULL;
 }
 
+static inline u64 extent_map_block_len(const struct extent_map *em)
+{
+	if (extent_map_is_compressed(em))
+		return em->disk_num_bytes;
+	return em->len;
+}
+
 static inline u64 extent_map_block_end(const struct extent_map *em)
 {
-	if (em->block_start + em->block_len < em->block_start)
+	if (em->block_start + extent_map_block_len(em) < em->block_start)
 		return (u64)-1;
-	return em->block_start + em->block_len;
+	return em->block_start + extent_map_block_len(em);
 }
 
 static bool can_merge_extent_map(const struct extent_map *em)
@@ -288,10 +295,10 @@ static void dump_extent_map(struct btrfs_fs_info *fs_info,
 {
 	if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
 		return;
-	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu block_start=%llu block_len=%llu flags=0x%x\n",
+	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu block_start=%llu flags=0x%x\n",
 		prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
 		em->ram_bytes, em->offset, em->block_start,
-		em->block_len, em->flags);
+		em->flags);
 	ASSERT(0);
 }
 
@@ -314,9 +321,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info,
 			if (em->block_start != em->disk_bytenr)
 				dump_extent_map(fs_info,
 				"mismatch block_start/disk_bytenr/offset", em);
-			if (em->disk_num_bytes != em->block_len)
-				dump_extent_map(fs_info,
-				"mismatch disk_num_bytes/block_len", em);
 		} else if (em->block_start != em->disk_bytenr + em->offset) {
 			dump_extent_map(fs_info,
 				"mismatch block_start/disk_bytenr/offset", em);
@@ -355,7 +359,6 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) {
 			em->start = merge->start;
 			em->len += merge->len;
-			em->block_len += merge->block_len;
 			em->block_start = merge->block_start;
 			em->generation = max(em->generation, merge->generation);
 
@@ -376,7 +379,6 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		merge = rb_entry(rb, struct extent_map, rb_node);
 	if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
 		em->len += merge->len;
-		em->block_len += merge->block_len;
 		if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
 			merge_ondisk_extents(em, merge);
 		validate_extent_map(fs_info, em);
@@ -670,7 +672,6 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 	if (em->block_start < EXTENT_MAP_LAST_BYTE &&
 	    !extent_map_is_compressed(em)) {
 		em->block_start += start_diff;
-		em->block_len = em->len;
 		em->offset += start_diff;
 	}
 	return add_extent_mapping(inode, em, 0);
@@ -890,17 +891,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
 				split->block_start = em->block_start;
 
-				if (compressed)
-					split->block_len = em->block_len;
-				else
-					split->block_len = split->len;
 				split->disk_bytenr = em->disk_bytenr;
-				split->disk_num_bytes = max(split->block_len,
-							    em->disk_num_bytes);
+				split->disk_num_bytes = em->disk_num_bytes;
 				split->offset = em->offset;
 				split->ram_bytes = em->ram_bytes;
 			} else {
-				split->block_len = 0;
 				split->block_start = em->block_start;
 				split->disk_bytenr = em->disk_bytenr;
 				split->disk_num_bytes = 0;
@@ -930,23 +925,18 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			split->generation = gen;
 
 			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
-				split->disk_num_bytes = max(em->block_len,
-							    em->disk_num_bytes);
+				split->disk_num_bytes = em->disk_num_bytes;
 				split->offset = em->offset + end - em->start;
 				split->ram_bytes = em->ram_bytes;
-				if (compressed) {
-					split->block_len = em->block_len;
-				} else {
+				if (!compressed) {
 					const u64 diff = end - em->start;
 
-					split->block_len = split->len;
 					split->block_start += diff;
 				}
 			} else {
 				split->disk_num_bytes = 0;
 				split->offset = 0;
 				split->ram_bytes = split->len;
-				split->block_len = 0;
 			}
 
 			if (extent_map_in_tree(em)) {
@@ -1104,7 +1094,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->disk_num_bytes = split_pre->len;
 	split_pre->offset = 0;
 	split_pre->block_start = new_logical;
-	split_pre->block_len = split_pre->len;
 	split_pre->ram_bytes = split_pre->len;
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
@@ -1123,7 +1112,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->disk_num_bytes = split_mid->len;
 	split_mid->offset = 0;
 	split_mid->block_start = em->block_start + pre;
-	split_mid->block_len = split_mid->len;
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 5ae3d56b4351..5312bb542af0 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -102,15 +102,6 @@ struct extent_map {
 	 */
 	u64 block_start;
 
-	/*
-	 * The on-disk length for the file extent.
-	 *
-	 * For compressed extents it matches btrfs_file_extent_item::disk_num_bytes.
-	 * For uncompressed extents it matches extent_map::len.
-	 * For holes and inline extents it's -1 and shouldn't be used.
-	 */
-	u64 block_len;
-
 	/*
 	 * Generation of the extent map, for merged em it's the highest
 	 * generation of all merged ems.
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 06d23951901c..397df6588ce2 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1307,11 +1307,9 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		if (compress_type != BTRFS_COMPRESS_NONE) {
 			extent_map_set_compression(em, compress_type);
 			em->block_start = bytenr;
-			em->block_len = em->disk_num_bytes;
 		} else {
 			bytenr += btrfs_file_extent_offset(leaf, fi);
 			em->block_start = bytenr;
-			em->block_len = em->len;
 			if (type == BTRFS_FILE_EXTENT_PREALLOC)
 				em->flags |= EXTENT_FLAG_PREALLOC;
 		}
@@ -1324,11 +1322,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->start = 0;
 		em->len = fs_info->sectorsize;
 		em->offset = 0;
-		/*
-		 * Initialize block_len with the same values
-		 * as in inode.c:btrfs_get_extent().
-		 */
-		em->block_len = (u64)-1;
 		extent_map_set_compression(em, compress_type);
 	} else {
 		btrfs_err(fs_info,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 707012fc2d43..7033ea619073 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2350,7 +2350,6 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 
 		hole_em->block_start = EXTENT_MAP_HOLE;
 		hole_em->disk_bytenr = EXTENT_MAP_HOLE;
-		hole_em->block_len = 0;
 		hole_em->disk_num_bytes = 0;
 		hole_em->generation = trans->transid;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 066f14c78bc9..00bb64fdf938 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -139,7 +139,7 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
 				     bool pages_dirty);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 block_start,
-				       u64 block_len, u64 disk_num_bytes,
+				       u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
 				       int type);
@@ -1210,7 +1210,6 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	em = create_io_em(inode, start,
 			  async_extent->ram_size,	/* len */
 			  ins.objectid,			/* block_start */
-			  ins.offset,			/* block_len */
 			  ins.offset,			/* orig_block_len */
 			  async_extent->ram_size,	/* ram_bytes */
 			  async_extent->compress_type,
@@ -1453,7 +1452,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 		em = create_io_em(inode, start, ins.offset, /* len */
 				  ins.objectid, /* block_start */
-				  ins.offset, /* block_len */
 				  ins.offset, /* orig_block_len */
 				  ram_size, /* ram_bytes */
 				  BTRFS_COMPRESS_NONE, /* compress_type */
@@ -2191,7 +2189,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 
 			em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
 					  nocow_args.disk_bytenr, /* block_start */
-					  nocow_args.num_bytes, /* block_len */
 					  nocow_args.disk_num_bytes, /* orig_block_len */
 					  ram_bytes, BTRFS_COMPRESS_NONE,
 					  &nocow_args.file_extent,
@@ -5027,7 +5024,6 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 
 			hole_em->block_start = EXTENT_MAP_HOLE;
 			hole_em->disk_bytenr = EXTENT_MAP_HOLE;
-			hole_em->block_len = 0;
 			hole_em->disk_num_bytes = 0;
 			hole_em->ram_bytes = hole_size;
 			hole_em->generation = btrfs_get_fs_generation(fs_info);
@@ -6896,7 +6892,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	em->start = EXTENT_MAP_HOLE;
 	em->disk_bytenr = EXTENT_MAP_HOLE;
 	em->len = (u64)-1;
-	em->block_len = (u64)-1;
 
 	path = btrfs_alloc_path();
 	if (!path) {
@@ -7054,7 +7049,6 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const u64 start,
 						  const u64 len,
 						  const u64 block_start,
-						  const u64 block_len,
 						  const u64 orig_block_len,
 						  const u64 ram_bytes,
 						  const int type,
@@ -7065,14 +7059,14 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 
 	if (type != BTRFS_ORDERED_NOCOW) {
 		em = create_io_em(inode, start, len, block_start,
-				  block_len, orig_block_len, ram_bytes,
+				  orig_block_len, ram_bytes,
 				  BTRFS_COMPRESS_NONE, /* compress_type */
 				  file_extent, type);
 		if (IS_ERR(em))
 			goto out;
 	}
 	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
-					     block_start, block_len, 0,
+					     block_start, len, 0,
 					     (1 << type) |
 					     (1 << BTRFS_ORDERED_DIRECT),
 					     BTRFS_COMPRESS_NONE);
@@ -7124,7 +7118,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	file_extent.offset = 0;
 	file_extent.compression = BTRFS_COMPRESS_NONE;
 	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
-				     ins.objectid, ins.offset, ins.offset,
+				     ins.objectid, ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR,
 				     &file_extent);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -7365,7 +7359,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 /* The callers of this must take lock_extent() */
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 block_start,
-				       u64 block_len, u64 disk_num_bytes,
+				       u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
 				       int type)
@@ -7387,16 +7381,10 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 
 	switch (type) {
 	case BTRFS_ORDERED_PREALLOC:
-		/* Uncompressed extents. */
-		ASSERT(block_len == len);
-
 		/* We're only referring part of a larger preallocated extent. */
-		ASSERT(block_len <= ram_bytes);
+		ASSERT(len <= ram_bytes);
 		break;
 	case BTRFS_ORDERED_REGULAR:
-		/* Uncompressed extents. */
-		ASSERT(block_len == len);
-
 		/* COW results a new extent matching our file extent size. */
 		ASSERT(disk_num_bytes == len);
 		ASSERT(ram_bytes == len);
@@ -7422,7 +7410,6 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 
 	em->start = start;
 	em->len = len;
-	em->block_len = block_len;
 	em->block_start = block_start;
 	em->disk_bytenr = file_extent->disk_bytenr;
 	em->disk_num_bytes = disk_num_bytes;
@@ -7511,7 +7498,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 
 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
 					      block_start,
-					      len, orig_block_len,
+					      orig_block_len,
 					      ram_bytes, type,
 					      &file_extent);
 		btrfs_dec_nocow_writers(bg);
@@ -9654,7 +9641,6 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		em->block_start = ins.objectid;
 		em->disk_bytenr = ins.objectid;
 		em->offset = 0;
-		em->block_len = ins.offset;
 		em->disk_num_bytes = ins.offset;
 		em->ram_bytes = ins.offset;
 		em->flags |= EXTENT_FLAG_PREALLOC;
@@ -10151,12 +10137,12 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		 * Bail if the buffer isn't large enough to return the whole
 		 * compressed extent.
 		 */
-		if (em->block_len > count) {
+		if (em->disk_num_bytes > count) {
 			ret = -ENOBUFS;
 			goto out_em;
 		}
-		disk_io_size = em->block_len;
-		count = em->block_len;
+		disk_io_size = em->disk_num_bytes;
+		count = em->disk_num_bytes;
 		encoded->unencoded_len = em->ram_bytes;
 		encoded->unencoded_offset = iocb->ki_pos - (em->start - em->offset);
 		ret = btrfs_encoded_io_compression_from_extent(fs_info,
@@ -10404,7 +10390,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	file_extent.compression = compression;
 	em = create_io_em(inode, start, num_bytes,
 			  ins.objectid,
-			  ins.offset, ins.offset, ram_bytes, compression,
+			  ins.offset, ram_bytes, compression,
 			  &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 21061a0b2e7c..68fe52ab445d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2912,7 +2912,6 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 
 	em->start = start;
 	em->len = end + 1 - start;
-	em->block_len = em->len;
 	em->block_start = block_start;
 	em->disk_bytenr = block_start;
 	em->disk_num_bytes = em->len;
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index 65c6921ff4a2..0dd270d6c506 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -28,9 +28,10 @@ static int free_extent_map_tree(struct btrfs_inode *inode)
 		if (refcount_read(&em->refs) != 1) {
 			ret = -EINVAL;
 			test_err(
-"em leak: em (start %llu len %llu block_start %llu block_len %llu) refs %d",
+"em leak: em (start %llu len %llu block_start %llu disk_num_bytes %llu offset %llu) refs %d",
 				 em->start, em->len, em->block_start,
-				 em->block_len, refcount_read(&em->refs));
+				 em->disk_num_bytes, em->offset,
+				 refcount_read(&em->refs));
 
 			refcount_set(&em->refs, 1);
 		}
@@ -77,7 +78,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = 0;
 	em->len = SZ_16K;
 	em->block_start = 0;
-	em->block_len = SZ_16K;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -101,7 +101,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = SZ_16K;
 	em->len = SZ_4K;
 	em->block_start = SZ_32K; /* avoid merging */
-	em->block_len = SZ_4K;
 	em->disk_bytenr = SZ_32K; /* avoid merging */
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -125,7 +124,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = start;
 	em->len = len;
 	em->block_start = start;
-	em->block_len = len;
 	em->disk_bytenr = start;
 	em->disk_num_bytes = len;
 	em->ram_bytes = len;
@@ -143,11 +141,11 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 		goto out;
 	}
 	if (em->start != 0 || extent_map_end(em) != SZ_16K ||
-	    em->block_start != 0 || em->block_len != SZ_16K) {
+	    em->block_start != 0 || em->disk_num_bytes != SZ_16K) {
 		test_err(
-"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu block_start %llu block_len %llu",
+"case1 [%llu %llu]: ret %d return a wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu",
 			 start, start + len, ret, em->start, em->len,
-			 em->block_start, em->block_len);
+			 em->block_start, em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -182,7 +180,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = 0;
 	em->len = SZ_1K;
 	em->block_start = EXTENT_MAP_INLINE;
-	em->block_len = (u64)-1;
 	em->disk_bytenr = EXTENT_MAP_INLINE;
 	em->disk_num_bytes = 0;
 	em->ram_bytes = SZ_1K;
@@ -206,7 +203,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
-	em->block_len = SZ_4K;
 	em->disk_bytenr = SZ_4K;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -230,7 +226,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = 0;
 	em->len = SZ_1K;
 	em->block_start = EXTENT_MAP_INLINE;
-	em->block_len = (u64)-1;
 	em->disk_bytenr = EXTENT_MAP_INLINE;
 	em->disk_num_bytes = 0;
 	em->ram_bytes = SZ_1K;
@@ -247,11 +242,10 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 		goto out;
 	}
 	if (em->start != 0 || extent_map_end(em) != SZ_1K ||
-	    em->block_start != EXTENT_MAP_INLINE || em->block_len != (u64)-1) {
+	    em->block_start != EXTENT_MAP_INLINE) {
 		test_err(
-"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu block_start %llu block_len %llu",
-			 ret, em->start, em->len, em->block_start,
-			 em->block_len);
+"case2 [0 1K]: ret %d return a wrong em (start %llu len %llu block_start %llu",
+			 ret, em->start, em->len, em->block_start);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -282,7 +276,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
-	em->block_len = SZ_4K;
 	em->disk_bytenr = SZ_4K;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_4K;
@@ -306,7 +299,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->start = 0;
 	em->len = SZ_16K;
 	em->block_start = 0;
-	em->block_len = SZ_16K;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -329,11 +321,11 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	 * em->start.
 	 */
 	if (start < em->start || start + len > extent_map_end(em) ||
-	    em->start != em->block_start || em->len != em->block_len) {
+	    em->start != em->block_start) {
 		test_err(
 "case3 [%llu %llu): ret %d em (start %llu len %llu block_start %llu block_len %llu)",
 			 start, start + len, ret, em->start, em->len,
-			 em->block_start, em->block_len);
+			 em->block_start, em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -395,7 +387,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->start = 0;
 	em->len = SZ_8K;
 	em->block_start = 0;
-	em->block_len = SZ_8K;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_8K;
 	em->ram_bytes = SZ_8K;
@@ -419,7 +410,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->start = SZ_8K;
 	em->len = 24 * SZ_1K;
 	em->block_start = SZ_16K; /* avoid merging */
-	em->block_len = 24 * SZ_1K;
 	em->disk_bytenr = SZ_16K; /* avoid merging */
 	em->disk_num_bytes = 24 * SZ_1K;
 	em->ram_bytes = 24 * SZ_1K;
@@ -442,7 +432,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->start = 0;
 	em->len = SZ_32K;
 	em->block_start = 0;
-	em->block_len = SZ_32K;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_32K;
 	em->ram_bytes = SZ_32K;
@@ -462,9 +451,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	if (start < em->start || start + len > extent_map_end(em)) {
 		test_err(
-"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu block_start %llu block_len %llu)",
+"case4 [%llu %llu): ret %d, added wrong em (start %llu len %llu block_start %llu disk_num_bytes %llu)",
 			 start, start + len, ret, em->start, em->len, em->block_start,
-			 em->block_len);
+			 em->disk_num_bytes);
 		ret = -EINVAL;
 	}
 	free_extent_map(em);
@@ -529,7 +518,6 @@ static int add_compressed_extent(struct btrfs_inode *inode,
 	em->start = start;
 	em->len = len;
 	em->block_start = block_start;
-	em->block_len = SZ_4K;
 	em->disk_bytenr = block_start;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = len;
@@ -753,7 +741,6 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_16K;
-	em->block_len = SZ_16K;
 	em->disk_bytenr = SZ_16K;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
@@ -809,7 +796,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = 0;
 	em->len = SZ_16K;
 	em->block_start = 0;
-	em->block_len = SZ_4K;
 	em->disk_bytenr = 0;
 	em->disk_num_bytes = SZ_4K;
 	em->ram_bytes = SZ_16K;
@@ -834,7 +820,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->start = SZ_32K;
 	em->len = SZ_16K;
 	em->block_start = SZ_32K;
-	em->block_len = SZ_16K;
 	em->disk_bytenr = SZ_32K;
 	em->disk_num_bytes = SZ_16K;
 	em->ram_bytes = SZ_16K;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index d6a3151d6c37..1d04f0cb6f53 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4651,7 +4651,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 	/* If we're compressed we have to save the entire range of csums. */
 	if (extent_map_is_compressed(em)) {
 		csum_offset = 0;
-		csum_len = max(em->block_len, em->disk_num_bytes);
+		csum_len = em->disk_num_bytes;
 	} else {
 		csum_offset = mod_start - em->start;
 		csum_len = mod_len;
@@ -4701,7 +4701,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	else
 		btrfs_set_stack_file_extent_type(&fi, BTRFS_FILE_EXTENT_REG);
 
-	block_len = max(em->block_len, em->disk_num_bytes);
+	block_len = em->disk_num_bytes;
 	compress_type = extent_map_compression(em);
 	if (compress_type != BTRFS_COMPRESS_NONE) {
 		btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start);
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index cbac7cd11995..a1804239812c 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -292,7 +292,6 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__field(	u64,  start		)
 		__field(	u64,  len		)
 		__field(	u64,  block_start	)
-		__field(	u64,  block_len		)
 		__field(	u32,  flags		)
 		__field(	int,  refs		)
 	),
@@ -303,19 +302,17 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__entry->start		= map->start;
 		__entry->len		= map->len;
 		__entry->block_start	= map->block_start;
-		__entry->block_len	= map->block_len;
 		__entry->flags		= map->flags;
 		__entry->refs		= refcount_read(&map->refs);
 	),
 
 	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu "
-		  "block_start=%llu(%s) block_len=%llu flags=%s refs=%u",
+		  "block_start=%llu(%s) flags=%s refs=%u",
 		  show_root_type(__entry->root_objectid),
 		  __entry->ino,
 		  __entry->start,
 		  __entry->len,
 		  show_map_type(__entry->block_start),
-		  __entry->block_len,
 		  show_map_flags(__entry->flags),
 		  __entry->refs)
 );
-- 
2.45.1


* [PATCH v3 05/11] btrfs: remove extent_map::orig_start member
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.

And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's
just remove the old member from all call sites.
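
For regular/preallocated extents the removal is thus mechanical; a
minimal sketch (hypothetical helper, not part of this patch):

	/*
	 * What the old em->orig_start used to hold, derived from the
	 * new members.  Only meaningful for regular/prealloc extents,
	 * not for holes or inline extents.
	 */
	static inline u64 em_orig_start(const struct extent_map *em)
	{
		ASSERT(em->disk_bytenr < EXTENT_MAP_LAST_BYTE);
		return em->start - em->offset;
	}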

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/btrfs_inode.h            |  2 +-
 fs/btrfs/compression.c            |  2 +-
 fs/btrfs/defrag.c                 |  1 -
 fs/btrfs/extent_map.c             | 21 +-------
 fs/btrfs/extent_map.h             |  9 ----
 fs/btrfs/file-item.c              |  5 +-
 fs/btrfs/file.c                   |  3 +-
 fs/btrfs/inode.c                  | 37 +++++---------
 fs/btrfs/relocation.c             |  1 -
 fs/btrfs/tests/extent-map-tests.c |  9 ----
 fs/btrfs/tests/inode-tests.c      | 84 +++++++++++++------------------
 fs/btrfs/tree-log.c               |  2 +-
 include/trace/events/btrfs.h      |  6 +--
 13 files changed, 56 insertions(+), 126 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9ada4185ff93..269ee9ac859e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -529,7 +529,7 @@ struct btrfs_file_extent {
 };
 
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
-			      u64 *orig_start, u64 *orig_block_len,
+			      u64 *orig_block_len,
 			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
 			      bool nowait, bool strict);
 
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 7b4843df0752..4f6d748aa99e 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -590,7 +590,7 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
 	cb = alloc_compressed_bio(inode, file_offset, REQ_OP_READ,
 				  end_bbio_compressed_read);
 
-	cb->start = em->orig_start;
+	cb->start = em->start - em->offset;
 	em_len = em->len;
 	em_start = em->start;
 
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 242c5469f4ba..025e7f853a68 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -707,7 +707,6 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
 		 */
 		if (key.offset > start) {
 			em->start = start;
-			em->orig_start = start;
 			em->block_start = EXTENT_MAP_HOLE;
 			em->disk_bytenr = EXTENT_MAP_HOLE;
 			em->disk_num_bytes = 0;
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index b157f30ac241..91be54f79d21 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -288,9 +288,9 @@ static void dump_extent_map(struct btrfs_fs_info *fs_info,
 {
 	if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
 		return;
-	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu orig_start=%llu block_start=%llu block_len=%llu flags=0x%x\n",
+	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu block_start=%llu block_len=%llu flags=0x%x\n",
 		prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
-		em->ram_bytes, em->offset, em->orig_start, em->block_start,
+		em->ram_bytes, em->offset, em->block_start,
 		em->block_len, em->flags);
 	ASSERT(0);
 }
@@ -317,15 +317,6 @@ static void validate_extent_map(struct btrfs_fs_info *fs_info,
 			if (em->disk_num_bytes != em->block_len)
 				dump_extent_map(fs_info,
 				"mismatch disk_num_bytes/block_len", em);
-			/*
-			 * Here we only check the start/orig_start/offset for
-			 * compressed extents as that's the only case where
-			 * orig_start is utilized.
-			 */
-			if (em->orig_start != em->start - em->offset)
-				dump_extent_map(fs_info,
-				"mismatch orig_start/offset/start", em);
-
 		} else if (em->block_start != em->disk_bytenr + em->offset) {
 			dump_extent_map(fs_info,
 				"mismatch block_start/disk_bytenr/offset", em);
@@ -363,7 +354,6 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 			merge = rb_entry(rb, struct extent_map, rb_node);
 		if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) {
 			em->start = merge->start;
-			em->orig_start = merge->orig_start;
 			em->len += merge->len;
 			em->block_len += merge->block_len;
 			em->block_start = merge->block_start;
@@ -898,7 +888,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			split->len = start - em->start;
 
 			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
-				split->orig_start = em->orig_start;
 				split->block_start = em->block_start;
 
 				if (compressed)
@@ -911,7 +900,6 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				split->offset = em->offset;
 				split->ram_bytes = em->ram_bytes;
 			} else {
-				split->orig_start = split->start;
 				split->block_len = 0;
 				split->block_start = em->block_start;
 				split->disk_bytenr = em->disk_bytenr;
@@ -948,19 +936,16 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				split->ram_bytes = em->ram_bytes;
 				if (compressed) {
 					split->block_len = em->block_len;
-					split->orig_start = em->orig_start;
 				} else {
 					const u64 diff = end - em->start;
 
 					split->block_len = split->len;
 					split->block_start += diff;
-					split->orig_start = em->orig_start;
 				}
 			} else {
 				split->disk_num_bytes = 0;
 				split->offset = 0;
 				split->ram_bytes = split->len;
-				split->orig_start = split->start;
 				split->block_len = 0;
 			}
 
@@ -1118,7 +1103,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->disk_bytenr = new_logical;
 	split_pre->disk_num_bytes = split_pre->len;
 	split_pre->offset = 0;
-	split_pre->orig_start = split_pre->start;
 	split_pre->block_start = new_logical;
 	split_pre->block_len = split_pre->len;
 	split_pre->ram_bytes = split_pre->len;
@@ -1138,7 +1122,6 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->disk_bytenr = em->block_start + pre;
 	split_mid->disk_num_bytes = split_mid->len;
 	split_mid->offset = 0;
-	split_mid->orig_start = split_mid->start;
 	split_mid->block_start = em->block_start + pre;
 	split_mid->block_len = split_mid->len;
 	split_mid->ram_bytes = split_mid->len;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 0b1a8e409377..5ae3d56b4351 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -61,15 +61,6 @@ struct extent_map {
 	 */
 	u64 len;
 
-	/*
-	 * The file offset of the original file extent before splitting.
-	 *
-	 * This is an in-memory only member, matching
-	 * extent_map::start - btrfs_file_extent_item::offset for
-	 * regular/preallocated extents. EXTENT_MAP_HOLE otherwise.
-	 */
-	u64 orig_start;
-
 	/*
 	 * The bytenr of the full on-disk extent.
 	 *
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 1298afea9503..06d23951901c 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1293,8 +1293,6 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 	    type == BTRFS_FILE_EXTENT_PREALLOC) {
 		em->start = extent_start;
 		em->len = btrfs_file_extent_end(path) - extent_start;
-		em->orig_start = extent_start -
-			btrfs_file_extent_offset(leaf, fi);
 		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
 		if (bytenr == 0) {
 			em->block_start = EXTENT_MAP_HOLE;
@@ -1327,10 +1325,9 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->len = fs_info->sectorsize;
 		em->offset = 0;
 		/*
-		 * Initialize orig_start and block_len with the same values
+		 * Initialize block_len with the same values
 		 * as in inode.c:btrfs_get_extent().
 		 */
-		em->orig_start = EXTENT_MAP_HOLE;
 		em->block_len = (u64)-1;
 		extent_map_set_compression(em, compress_type);
 	} else {
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 5133c6705d74..707012fc2d43 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1104,7 +1104,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			NULL, NULL, NULL, NULL, nowait, false);
+			NULL, NULL, NULL, nowait, false);
 	if (ret <= 0)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 	else
@@ -2347,7 +2347,6 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 		hole_em->start = offset;
 		hole_em->len = end - offset;
 		hole_em->ram_bytes = hole_em->len;
-		hole_em->orig_start = offset;
 
 		hole_em->block_start = EXTENT_MAP_HOLE;
 		hole_em->disk_bytenr = EXTENT_MAP_HOLE;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7afcdea27782..066f14c78bc9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -138,7 +138,7 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
 				     u64 end, struct writeback_control *wbc,
 				     bool pages_dirty);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len, u64 orig_start, u64 block_start,
+				       u64 len, u64 block_start,
 				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
@@ -1209,7 +1209,6 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 
 	em = create_io_em(inode, start,
 			  async_extent->ram_size,	/* len */
-			  start,			/* orig_start */
 			  ins.objectid,			/* block_start */
 			  ins.offset,			/* block_len */
 			  ins.offset,			/* orig_block_len */
@@ -1453,7 +1452,6 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 			    &cached);
 
 		em = create_io_em(inode, start, ins.offset, /* len */
-				  start, /* orig_start */
 				  ins.objectid, /* block_start */
 				  ins.offset, /* block_len */
 				  ins.offset, /* orig_block_len */
@@ -2189,11 +2187,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 
 		is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC;
 		if (is_prealloc) {
-			u64 orig_start = found_key.offset - nocow_args.extent_offset;
 			struct extent_map *em;
 
 			em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
-					  orig_start,
 					  nocow_args.disk_bytenr, /* block_start */
 					  nocow_args.num_bytes, /* block_len */
 					  nocow_args.disk_num_bytes, /* orig_block_len */
@@ -5028,7 +5024,6 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 			}
 			hole_em->start = cur_offset;
 			hole_em->len = hole_size;
-			hole_em->orig_start = cur_offset;
 
 			hole_em->block_start = EXTENT_MAP_HOLE;
 			hole_em->disk_bytenr = EXTENT_MAP_HOLE;
@@ -6899,7 +6894,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 		goto out;
 	}
 	em->start = EXTENT_MAP_HOLE;
-	em->orig_start = EXTENT_MAP_HOLE;
 	em->disk_bytenr = EXTENT_MAP_HOLE;
 	em->len = (u64)-1;
 	em->block_len = (u64)-1;
@@ -6992,7 +6986,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 
 		/* New extent overlaps with existing one */
 		em->start = start;
-		em->orig_start = start;
 		em->len = found_key.offset - start;
 		em->block_start = EXTENT_MAP_HOLE;
 		goto insert;
@@ -7028,7 +7021,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	}
 not_found:
 	em->start = start;
-	em->orig_start = start;
 	em->len = len;
 	em->block_start = EXTENT_MAP_HOLE;
 insert:
@@ -7061,7 +7053,6 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
 						  const u64 len,
-						  const u64 orig_start,
 						  const u64 block_start,
 						  const u64 block_len,
 						  const u64 orig_block_len,
@@ -7073,7 +7064,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
-		em = create_io_em(inode, start, len, orig_start, block_start,
+		em = create_io_em(inode, start, len, block_start,
 				  block_len, orig_block_len, ram_bytes,
 				  BTRFS_COMPRESS_NONE, /* compress_type */
 				  file_extent, type);
@@ -7132,7 +7123,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	file_extent.ram_bytes = ins.offset;
 	file_extent.offset = 0;
 	file_extent.compression = BTRFS_COMPRESS_NONE;
-	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset, start,
+	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
 				     ins.objectid, ins.offset, ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR,
 				     &file_extent);
@@ -7178,7 +7169,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  *	 any ordered extents.
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
-			      u64 *orig_start, u64 *orig_block_len,
+			      u64 *orig_block_len,
 			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
 			      bool nowait, bool strict)
 {
@@ -7265,8 +7256,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 		}
 	}
 
-	if (orig_start)
-		*orig_start = key.offset - nocow_args.extent_offset;
 	if (orig_block_len)
 		*orig_block_len = nocow_args.disk_num_bytes;
 	if (file_extent)
@@ -7375,7 +7364,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 
 /* The callers of this must take lock_extent() */
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
-				       u64 len, u64 orig_start, u64 block_start,
+				       u64 len, u64 block_start,
 				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       struct btrfs_file_extent *file_extent,
@@ -7413,7 +7402,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 		ASSERT(ram_bytes == len);
 
 		/* Since it's a new extent, we should not have any offset. */
-		ASSERT(orig_start == start);
+		ASSERT(file_extent->offset == 0);
 		break;
 	case BTRFS_ORDERED_COMPRESSED:
 		/* Must be compressed. */
@@ -7432,7 +7421,6 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 		return ERR_PTR(-ENOMEM);
 
 	em->start = start;
-	em->orig_start = orig_start;
 	em->len = len;
 	em->block_len = block_len;
 	em->block_start = block_start;
@@ -7467,7 +7455,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 	struct btrfs_file_extent file_extent;
 	struct extent_map *em = *map;
 	int type;
-	u64 block_start, orig_start, orig_block_len, ram_bytes;
+	u64 block_start, orig_block_len, ram_bytes;
 	struct btrfs_block_group *bg;
 	bool can_nocow = false;
 	bool space_reserved = false;
@@ -7494,7 +7482,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		len = min(len, em->len - (start - em->start));
 		block_start = em->block_start + (start - em->start);
 
-		if (can_nocow_extent(inode, start, &len, &orig_start,
+		if (can_nocow_extent(inode, start, &len,
 				     &orig_block_len, &ram_bytes,
 				     &file_extent, false, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
@@ -7522,7 +7510,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		space_reserved = true;
 
 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
-					      orig_start, block_start,
+					      block_start,
 					      len, orig_block_len,
 					      ram_bytes, type,
 					      &file_extent);
@@ -9662,7 +9650,6 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		}
 
 		em->start = cur_offset;
-		em->orig_start = cur_offset;
 		em->len = ins.offset;
 		em->block_start = ins.objectid;
 		em->disk_bytenr = ins.objectid;
@@ -10171,7 +10158,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 		disk_io_size = em->block_len;
 		count = em->block_len;
 		encoded->unencoded_len = em->ram_bytes;
-		encoded->unencoded_offset = iocb->ki_pos - em->orig_start;
+		encoded->unencoded_offset = iocb->ki_pos - (em->start - em->offset);
 		ret = btrfs_encoded_io_compression_from_extent(fs_info,
 							       extent_map_compression(em));
 		if (ret < 0)
@@ -10416,7 +10403,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	file_extent.offset = encoded->unencoded_offset;
 	file_extent.compression = compression;
 	em = create_io_em(inode, start, num_bytes,
-			  start - encoded->unencoded_offset, ins.objectid,
+			  ins.objectid,
 			  ins.offset, ins.offset, ram_bytes, compression,
 			  &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
@@ -10748,7 +10735,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		free_extent_map(em);
 		em = NULL;
 
-		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, NULL, false, true);
+		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
 		if (ret < 0) {
 			goto out;
 		} else if (ret) {
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 151ed1ebd291..21061a0b2e7c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2911,7 +2911,6 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 		return -ENOMEM;
 
 	em->start = start;
-	em->orig_start = start;
 	em->len = end + 1 - start;
 	em->block_len = em->len;
 	em->block_start = block_start;
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index e73ac7a0869c..65c6921ff4a2 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -99,7 +99,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_16K;
-	em->orig_start = SZ_16K;
 	em->len = SZ_4K;
 	em->block_start = SZ_32K; /* avoid merging */
 	em->block_len = SZ_4K;
@@ -124,7 +123,6 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	/* Add [0, 8K), should return [0, 16K) instead. */
 	em->start = start;
-	em->orig_start = start;
 	em->len = len;
 	em->block_start = start;
 	em->block_len = len;
@@ -206,7 +204,6 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_4K;
-	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
@@ -283,7 +280,6 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 
 	/* Add [4K, 8K) */
 	em->start = SZ_4K;
-	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
@@ -421,7 +417,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 
 	/* Add [8K, 32K) */
 	em->start = SZ_8K;
-	em->orig_start = SZ_8K;
 	em->len = 24 * SZ_1K;
 	em->block_start = SZ_16K; /* avoid merging */
 	em->block_len = 24 * SZ_1K;
@@ -445,7 +440,6 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	/* Add [0K, 32K) */
 	em->start = 0;
-	em->orig_start = 0;
 	em->len = SZ_32K;
 	em->block_start = 0;
 	em->block_len = SZ_32K;
@@ -533,7 +527,6 @@ static int add_compressed_extent(struct btrfs_inode *inode,
 	}
 
 	em->start = start;
-	em->orig_start = start;
 	em->len = len;
 	em->block_start = block_start;
 	em->block_len = SZ_4K;
@@ -758,7 +751,6 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_4K;
-	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_16K;
 	em->block_len = SZ_16K;
@@ -840,7 +832,6 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	/* [32K, 48K), not pinned */
 	em->start = SZ_32K;
-	em->orig_start = SZ_32K;
 	em->len = SZ_16K;
 	em->block_start = SZ_32K;
 	em->block_len = SZ_16K;
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 0895c6e06812..fc390c18ac95 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -358,9 +358,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -386,9 +385,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	disk_bytenr = em->block_start;
@@ -437,9 +435,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != orig_start) {
-		test_err("wrong orig offset, want %llu, have %llu",
-			 orig_start, em->orig_start);
+	if (em->start - em->offset != orig_start) {
+		test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu",
+			 em->start, em->offset, orig_start);
 		goto out;
 	}
 	disk_bytenr += (em->start - orig_start);
@@ -472,9 +470,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 prealloc_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -501,9 +498,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 prealloc_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	disk_bytenr = em->block_start;
@@ -530,15 +526,14 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != orig_start) {
-		test_err("unexpected orig offset, wanted %llu, have %llu",
-			 orig_start, em->orig_start);
+	if (em->start - em->offset != orig_start) {
+		test_err("unexpected offset, wanted %llu, have %llu",
+			 em->start - orig_start, em->offset);
 		goto out;
 	}
-	if (em->block_start != (disk_bytenr + (em->start - em->orig_start))) {
+	if (em->block_start != disk_bytenr + em->offset) {
 		test_err("unexpected block start, wanted %llu, have %llu",
-			 disk_bytenr + (em->start - em->orig_start),
-			 em->block_start);
+			 disk_bytenr + em->offset, em->block_start);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -564,15 +559,14 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 prealloc_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != orig_start) {
-		test_err("wrong orig offset, want %llu, have %llu", orig_start,
-			 em->orig_start);
+	if (em->start - em->offset != orig_start) {
+		test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu",
+			 em->start, em->offset, orig_start);
 		goto out;
 	}
-	if (em->block_start != (disk_bytenr + (em->start - em->orig_start))) {
+	if (em->block_start != disk_bytenr + em->offset) {
 		test_err("unexpected block start, wanted %llu, have %llu",
-			 disk_bytenr + (em->start - em->orig_start),
-			 em->block_start);
+			 disk_bytenr + em->offset, em->block_start);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -599,9 +593,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 compressed_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu",
-			 em->start, em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) {
@@ -633,9 +626,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 compressed_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu",
-			 em->start, em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) {
@@ -667,9 +659,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -696,9 +687,9 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 compressed_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != orig_start) {
-		test_err("wrong orig offset, want %llu, have %llu",
-			 em->start, orig_start);
+	if (em->start - em->offset != orig_start) {
+		test_err("wrong offset, em->start=%llu em->offset=%llu orig_start=%llu",
+			 em->start, em->offset, orig_start);
 		goto out;
 	}
 	if (extent_map_compression(em) != BTRFS_COMPRESS_ZLIB) {
@@ -729,9 +720,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -762,9 +752,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 			 vacancy_only, em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	offset = em->start + em->len;
@@ -789,9 +778,8 @@ static noinline int test_btrfs_get_extent(u32 sectorsize, u32 nodesize)
 		test_err("unexpected flags set, want 0 have %u", em->flags);
 		goto out;
 	}
-	if (em->orig_start != em->start) {
-		test_err("wrong orig offset, want %llu, have %llu", em->start,
-			 em->orig_start);
+	if (em->offset != 0) {
+		test_err("wrong orig offset, want 0, have %llu", em->offset);
 		goto out;
 	}
 	ret = 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index f237b5ed80ec..d6a3151d6c37 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4691,7 +4691,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 	struct btrfs_key key;
 	enum btrfs_compression_type compress_type;
-	u64 extent_offset = em->start - em->orig_start;
+	u64 extent_offset = em->offset;
 	u64 block_len;
 	int ret;
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index d2d94d7c3fb5..cbac7cd11995 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -291,7 +291,6 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__field(	u64,  ino		)
 		__field(	u64,  start		)
 		__field(	u64,  len		)
-		__field(	u64,  orig_start	)
 		__field(	u64,  block_start	)
 		__field(	u64,  block_len		)
 		__field(	u32,  flags		)
@@ -303,7 +302,6 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 		__entry->ino		= btrfs_ino(inode);
 		__entry->start		= map->start;
 		__entry->len		= map->len;
-		__entry->orig_start	= map->orig_start;
 		__entry->block_start	= map->block_start;
 		__entry->block_len	= map->block_len;
 		__entry->flags		= map->flags;
@@ -311,13 +309,11 @@ TRACE_EVENT_CONDITION(btrfs_get_extent,
 	),
 
 	TP_printk_btrfs("root=%llu(%s) ino=%llu start=%llu len=%llu "
-		  "orig_start=%llu block_start=%llu(%s) "
-		  "block_len=%llu flags=%s refs=%u",
+		  "block_start=%llu(%s) block_len=%llu flags=%s refs=%u",
 		  show_root_type(__entry->root_objectid),
 		  __entry->ino,
 		  __entry->start,
 		  __entry->len,
-		  __entry->orig_start,
 		  show_map_type(__entry->block_start),
 		  __entry->block_len,
 		  show_map_flags(__entry->flags),
-- 
2.45.1


* [PATCH v3 04/11] btrfs: introduce extra sanity checks for extent maps
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Since the extent_map structure has all the needed members to represent
a file extent directly, we can apply all the file extent sanity checks
to an extent map.

The new sanity checks cross-check both the old members
(block_start/block_len/orig_start) and the new members
(disk_bytenr/disk_num_bytes/offset).

There is a special case for the offset/orig_start/start cross-check: we
only do this sanity check for compressed extents, as only compressed
read and encoded write really utilize orig_start.
This is confirmed by the later cleanup patch that removes orig_start.
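
Condensed, the new checks boil down to the following invariants (a
sketch only; the real validate_extent_map() below dumps the extent map
instead of asserting directly, and is a no-op without
CONFIG_BTRFS_DEBUG):

	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
		if (extent_map_is_compressed(em)) {
			ASSERT(em->block_start == em->disk_bytenr);
			ASSERT(em->block_len == em->disk_num_bytes);
			ASSERT(em->orig_start == em->start - em->offset);
		} else {
			ASSERT(em->block_start == em->disk_bytenr + em->offset);
		}
	} else {
		/* Holes and inline extents must not carry an offset. */
		ASSERT(em->offset == 0);
	}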

The checks happen at the following times:

- add_extent_mapping()
  This is for newly added extent maps

- replace_extent_mapping()
  This is for btrfs_drop_extent_map_range() and split_extent_map()

- try_merge_map()

For a lot of call sites we have to properly populate all the members to
pass the sanity checks, while the following code needs extra
modification:

- setup_file_extents() from inode-tests
  The file extent layout of setup_file_extents() is already so invalid
  that the tree-checker would reject most of it in the real world.

  However there is just one special unaligned regular extent, which has
  mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1).
  So instead of dropping the whole test case, here we just unify
  disk_num_bytes and ram_bytes to 4096 - 1.

- test_case_7() from extent-map-tests
  An extent is inserted with a 16K length, but the on-disk extent size
  is only 4K.
  This means it must be a compressed extent, so set the compressed flag
  for it (see the sketch below).
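
  For illustration, the layout test_case_7() ends up with (numbers as
  visible in the later hunks of this series; the compression type is
  chosen arbitrarily for the sketch):

	em->start = 0;
	em->len = SZ_16K;		/* 16K of file data ... */
	em->disk_num_bytes = SZ_4K;	/* ... backed by only 4K on disk */
	extent_map_set_compression(em, BTRFS_COMPRESS_ZLIB);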

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_map.c             | 60 +++++++++++++++++++++++++++++++
 fs/btrfs/relocation.c             |  4 +++
 fs/btrfs/tests/extent-map-tests.c | 56 ++++++++++++++++++++++++++++-
 fs/btrfs/tests/inode-tests.c      |  2 +-
 4 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index c7d2393692e6..b157f30ac241 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -283,8 +283,62 @@ static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *nex
 	next->offset = new_offset;
 }
 
+static void dump_extent_map(struct btrfs_fs_info *fs_info,
+			    const char *prefix, struct extent_map *em)
+{
+	if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
+		return;
+	btrfs_crit(fs_info, "%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu orig_start=%llu block_start=%llu block_len=%llu flags=0x%x\n",
+		prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
+		em->ram_bytes, em->offset, em->orig_start, em->block_start,
+		em->block_len, em->flags);
+	ASSERT(0);
+}
+
+/* Internal sanity checks for btrfs debug builds. */
+static void validate_extent_map(struct btrfs_fs_info *fs_info,
+				struct extent_map *em)
+{
+	if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
+		return;
+	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
+		if (em->disk_num_bytes == 0)
+			dump_extent_map(fs_info, "zero disk_num_bytes", em);
+		if (em->offset + em->len > em->ram_bytes)
+			dump_extent_map(fs_info, "ram_bytes too small", em);
+		if (em->offset + em->len > em->disk_num_bytes &&
+		    !extent_map_is_compressed(em))
+			dump_extent_map(fs_info, "disk_num_bytes too small", em);
+
+		if (extent_map_is_compressed(em)) {
+			if (em->block_start != em->disk_bytenr)
+				dump_extent_map(fs_info,
+				"mismatch block_start/disk_bytenr/offset", em);
+			if (em->disk_num_bytes != em->block_len)
+				dump_extent_map(fs_info,
+				"mismatch disk_num_bytes/block_len", em);
+			/*
+			 * Here we only check the start/orig_start/offset for
+			 * compressed extents as that's the only case where
+			 * orig_start is utilized.
+			 */
+			if (em->orig_start != em->start - em->offset)
+				dump_extent_map(fs_info,
+				"mismatch orig_start/offset/start", em);
+
+		} else if (em->block_start != em->disk_bytenr + em->offset) {
+			dump_extent_map(fs_info,
+				"mismatch block_start/disk_bytenr/offset", em);
+		}
+	} else if (em->offset) {
+		dump_extent_map(fs_info,
+				"non-zero offset for hole/inline", em);
+	}
+}
+
 static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_map_tree *tree = &inode->extent_tree;
 	struct extent_map *merge = NULL;
 	struct rb_node *rb;
@@ -319,6 +373,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 				merge_ondisk_extents(merge, em);
 			em->flags |= EXTENT_FLAG_MERGED;
 
+			validate_extent_map(fs_info, em);
 			rb_erase(&merge->rb_node, &tree->root);
 			RB_CLEAR_NODE(&merge->rb_node);
 			free_extent_map(merge);
@@ -334,6 +389,7 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 		em->block_len += merge->block_len;
 		if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
 			merge_ondisk_extents(em, merge);
+		validate_extent_map(fs_info, em);
 		rb_erase(&merge->rb_node, &tree->root);
 		RB_CLEAR_NODE(&merge->rb_node);
 		em->generation = max(em->generation, merge->generation);
@@ -445,6 +501,7 @@ static int add_extent_mapping(struct btrfs_inode *inode,
 
 	lockdep_assert_held_write(&tree->lock);
 
+	validate_extent_map(fs_info, em);
 	ret = tree_insert(&tree->root, em);
 	if (ret)
 		return ret;
@@ -548,10 +605,13 @@ static void replace_extent_mapping(struct btrfs_inode *inode,
 				   struct extent_map *new,
 				   int modified)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct extent_map_tree *tree = &inode->extent_tree;
 
 	lockdep_assert_held_write(&tree->lock);
 
+	validate_extent_map(fs_info, new);
+
 	WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
 	ASSERT(extent_map_in_tree(cur));
 	if (!(cur->flags & EXTENT_FLAG_LOGGING))
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 5f1a909a1d91..151ed1ebd291 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2911,9 +2911,13 @@ static noinline_for_stack int setup_relocation_extent_mapping(struct inode *inod
 		return -ENOMEM;
 
 	em->start = start;
+	em->orig_start = start;
 	em->len = end + 1 - start;
 	em->block_len = em->len;
 	em->block_start = block_start;
+	em->disk_bytenr = block_start;
+	em->disk_num_bytes = em->len;
+	em->ram_bytes = em->len;
 	em->flags |= EXTENT_FLAG_PINNED;
 
 	lock_extent(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index c511a1297956..e73ac7a0869c 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -78,6 +78,9 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->len = SZ_16K;
 	em->block_start = 0;
 	em->block_len = SZ_16K;
+	em->disk_bytenr = 0;
+	em->disk_num_bytes = SZ_16K;
+	em->ram_bytes = SZ_16K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -96,9 +99,13 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_16K;
+	em->orig_start = SZ_16K;
 	em->len = SZ_4K;
 	em->block_start = SZ_32K; /* avoid merging */
 	em->block_len = SZ_4K;
+	em->disk_bytenr = SZ_32K; /* avoid merging */
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_4K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -117,9 +124,13 @@ static int test_case_1(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	/* Add [0, 8K), should return [0, 16K) instead. */
 	em->start = start;
+	em->orig_start = start;
 	em->len = len;
 	em->block_start = start;
 	em->block_len = len;
+	em->disk_bytenr = start;
+	em->disk_num_bytes = len;
+	em->ram_bytes = len;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -174,6 +185,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->len = SZ_1K;
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
+	em->disk_bytenr = EXTENT_MAP_INLINE;
+	em->disk_num_bytes = 0;
+	em->ram_bytes = SZ_1K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -192,9 +206,13 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_4K;
+	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
+	em->disk_bytenr = SZ_4K;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_4K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -216,6 +234,9 @@ static int test_case_2(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->len = SZ_1K;
 	em->block_start = EXTENT_MAP_INLINE;
 	em->block_len = (u64)-1;
+	em->disk_bytenr = EXTENT_MAP_INLINE;
+	em->disk_num_bytes = 0;
+	em->ram_bytes = SZ_1K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -262,9 +283,13 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 
 	/* Add [4K, 8K) */
 	em->start = SZ_4K;
+	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_4K;
 	em->block_len = SZ_4K;
+	em->disk_bytenr = SZ_4K;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_4K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -286,6 +311,9 @@ static int __test_case_3(struct btrfs_fs_info *fs_info,
 	em->len = SZ_16K;
 	em->block_start = 0;
 	em->block_len = SZ_16K;
+	em->disk_bytenr = 0;
+	em->disk_num_bytes = SZ_16K;
+	em->ram_bytes = SZ_16K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
@@ -372,6 +400,9 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	em->len = SZ_8K;
 	em->block_start = 0;
 	em->block_len = SZ_8K;
+	em->disk_bytenr = 0;
+	em->disk_num_bytes = SZ_8K;
+	em->ram_bytes = SZ_8K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -390,9 +421,13 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 
 	/* Add [8K, 32K) */
 	em->start = SZ_8K;
+	em->orig_start = SZ_8K;
 	em->len = 24 * SZ_1K;
 	em->block_start = SZ_16K; /* avoid merging */
 	em->block_len = 24 * SZ_1K;
+	em->disk_bytenr = SZ_16K; /* avoid merging */
+	em->disk_num_bytes = 24 * SZ_1K;
+	em->ram_bytes = 24 * SZ_1K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -410,9 +445,13 @@ static int __test_case_4(struct btrfs_fs_info *fs_info,
 	}
 	/* Add [0K, 32K) */
 	em->start = 0;
+	em->orig_start = 0;
 	em->len = SZ_32K;
 	em->block_start = 0;
 	em->block_len = SZ_32K;
+	em->disk_bytenr = 0;
+	em->disk_num_bytes = SZ_32K;
+	em->ram_bytes = SZ_32K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, start, len);
 	write_unlock(&em_tree->lock);
@@ -494,9 +533,13 @@ static int add_compressed_extent(struct btrfs_inode *inode,
 	}
 
 	em->start = start;
+	em->orig_start = start;
 	em->len = len;
 	em->block_start = block_start;
 	em->block_len = SZ_4K;
+	em->disk_bytenr = block_start;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = len;
 	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
@@ -715,9 +758,13 @@ static int test_case_6(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	}
 
 	em->start = SZ_4K;
+	em->orig_start = SZ_4K;
 	em->len = SZ_4K;
 	em->block_start = SZ_16K;
 	em->block_len = SZ_16K;
+	em->disk_bytenr = SZ_16K;
+	em->disk_num_bytes = SZ_16K;
+	em->ram_bytes = SZ_16K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, 0, SZ_8K);
 	write_unlock(&em_tree->lock);
@@ -771,7 +818,10 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	em->len = SZ_16K;
 	em->block_start = 0;
 	em->block_len = SZ_4K;
-	em->flags |= EXTENT_FLAG_PINNED;
+	em->disk_bytenr = 0;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_16K;
+	em->flags |= (EXTENT_FLAG_PINNED | EXTENT_FLAG_COMPRESS_ZLIB);
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
@@ -790,9 +840,13 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 
 	/* [32K, 48K), not pinned */
 	em->start = SZ_32K;
+	em->orig_start = SZ_32K;
 	em->len = SZ_16K;
 	em->block_start = SZ_32K;
 	em->block_len = SZ_16K;
+	em->disk_bytenr = SZ_32K;
+	em->disk_num_bytes = SZ_16K;
+	em->ram_bytes = SZ_16K;
 	write_lock(&em_tree->lock);
 	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
 	write_unlock(&em_tree->lock);
diff --git a/fs/btrfs/tests/inode-tests.c b/fs/btrfs/tests/inode-tests.c
index 99da9d34b77a..0895c6e06812 100644
--- a/fs/btrfs/tests/inode-tests.c
+++ b/fs/btrfs/tests/inode-tests.c
@@ -117,7 +117,7 @@ static void setup_file_extents(struct btrfs_root *root, u32 sectorsize)
 
 	/* Now for a regular extent */
 	insert_extent(root, offset, sectorsize - 1, sectorsize - 1, 0,
-		      disk_bytenr, sectorsize, BTRFS_FILE_EXTENT_REG, 0, slot);
+		      disk_bytenr, sectorsize - 1, BTRFS_FILE_EXTENT_REG, 0, slot);
 	slot++;
 	disk_bytenr += sectorsize;
 	offset += sectorsize - 1;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 03/11] btrfs: introduce new members for extent_map
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 01/11] btrfs: rename extent_map::orig_block_len to disk_num_bytes Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 02/11] btrfs: export the expected file extent through can_nocow_extent() Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23 16:53  1%   ` Filipe Manana
  2024-05-23 18:21  1%   ` Filipe Manana
  2024-05-23  5:03  1% ` [PATCH v3 04/11] btrfs: introduce extra sanity checks for extent maps Qu Wenruo
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Introduce two new members for extent_map:

- disk_bytenr
- offset

Both match the members with the same name inside
btrfs_file_extent_item.

For now this patch only touches those members when:

- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps
  With the new disk_bytenr and disk_num_bytes, merging becomes a little
  more complex, as there are 3 different cases:

  * Both extent maps are referring to the same data extents
    |<----- data extent A ----->|
       |<- em 1 ->|<- em 2 ->|

  * Both extent maps are referring to different data extents
    |<-- data extent A -->|<-- data extent B -->|
               |<- em 1 ->|<- em 2 ->|

  * One of the extent maps is referring to a merged and larger data
    extent that covers both extent maps

    This is not really a valid case outside of some selftests,
    so that test case is removed.

  A new helper merge_ondisk_extents() is introduced to handle the
  above valid cases; a small worked example follows right below.
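
  As a rough illustration with made-up example numbers (not from the
  patch), case 2) with extent A at 1M/4K fully covered by @prev and
  extent B at (1M + 4K)/4K covered by @next works out as:

    const u64 prev_bytenr = SZ_1M, prev_disk_num_bytes = SZ_4K;
    const u64 prev_offset = 0;
    const u64 next_bytenr = SZ_1M + SZ_4K, next_disk_num_bytes = SZ_4K;

    u64 new_bytenr = min(prev_bytenr, next_bytenr);		/* 1M */
    u64 new_disk_num_bytes = max(prev_bytenr + prev_disk_num_bytes,
				 next_bytenr + next_disk_num_bytes) -
			     new_bytenr;			/* 8K */
    u64 new_offset = prev_bytenr + prev_offset - new_bytenr;	/* 0 */

  For case 1) the same formulas simply re-derive the original data
  extent, so one calculation covers both valid cases.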

To properly assign values for those new members, a new btrfs_file_extent
parameter is introduced to all the involved call sites.

- For NOCOW writes, the btrfs_file_extent is exposed by
  can_nocow_file_extent().

- For other writes, the members can be easily calculated, as most of
  them have a 0 offset and utilize the whole on-disk data extent (see
  the sketch right after this list).
  The exception is encoded writes, but thankfully that interface
  provides the offset directly, along with all other needed info.
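
For reference, a minimal sketch of the plain COW case (mirroring the
cow_file_range() hunk further below, where "ins" is the freshly
reserved extent):

  file_extent.disk_bytenr = ins.objectid;
  file_extent.disk_num_bytes = ins.offset;
  file_extent.num_bytes = ins.offset;
  file_extent.ram_bytes = ins.offset;
  file_extent.offset = 0;
  file_extent.compression = BTRFS_COMPRESS_NONE;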

For now, the old members (block_start/block_len/orig_start) co-exist
with the new members (disk_bytenr/offset), while all the critical code
still uses only the old members.

The cleanup will happen later, after both the old and new members are
properly validated.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/defrag.c     |  4 +++
 fs/btrfs/extent_map.c | 78 ++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/extent_map.h | 17 ++++++++++
 fs/btrfs/file-item.c  |  9 ++++-
 fs/btrfs/file.c       |  1 +
 fs/btrfs/inode.c      | 57 +++++++++++++++++++++++++++----
 6 files changed, 155 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 407ccec3e57e..242c5469f4ba 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -709,6 +709,10 @@ static struct extent_map *defrag_get_extent(struct btrfs_inode *inode,
 			em->start = start;
 			em->orig_start = start;
 			em->block_start = EXTENT_MAP_HOLE;
+			em->disk_bytenr = EXTENT_MAP_HOLE;
+			em->disk_num_bytes = 0;
+			em->ram_bytes = 0;
+			em->offset = 0;
 			em->len = key.offset - start;
 			break;
 		}
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index a9d60d1eade9..c7d2393692e6 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -229,6 +229,60 @@ static bool mergeable_maps(const struct extent_map *prev, const struct extent_ma
 	return next->block_start == prev->block_start;
 }
 
+/*
+ * Handle the ondisk data extents merge for @prev and @next.
+ *
+ * Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes.
+ * For now only uncompressed regular extent can be merged.
+ *
+ * @prev and @next will be both updated to point to the new merged range.
+ * Thus one of them should be removed by the caller.
+ */
+static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next)
+{
+	u64 new_disk_bytenr;
+	u64 new_disk_num_bytes;
+	u64 new_offset;
+
+	/* @prev and @next should not be compressed. */
+	ASSERT(!extent_map_is_compressed(prev));
+	ASSERT(!extent_map_is_compressed(next));
+
+	/*
+	 * There are two different cases where @prev and @next can be merged.
+	 *
+	 * 1) They are referring to the same data extent
+	 * |<----- data extent A ----->|
+	 *    |<- prev ->|<- next ->|
+	 *
+	 * 2) They are referring to different data extents but still adjacent
+	 *
+	 * |<-- data extent A -->|<-- data extent B -->|
+	 *            |<- prev ->|<- next ->|
+	 *
+	 * The calculation here always merge the data extents first, then update
+	 * @offset using the new data extents.
+	 *
+	 * For case 1), the merged data extent would be the same.
+	 * For case 2), we just merge the two data extents into one.
+	 */
+	new_disk_bytenr = min(prev->disk_bytenr, next->disk_bytenr);
+	new_disk_num_bytes = max(prev->disk_bytenr + prev->disk_num_bytes,
+				 next->disk_bytenr + next->disk_num_bytes) -
+			     new_disk_bytenr;
+	new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr;
+
+	prev->disk_bytenr = new_disk_bytenr;
+	prev->disk_num_bytes = new_disk_num_bytes;
+	prev->ram_bytes = new_disk_num_bytes;
+	prev->offset = new_offset;
+
+	next->disk_bytenr = new_disk_bytenr;
+	next->disk_num_bytes = new_disk_num_bytes;
+	next->ram_bytes = new_disk_num_bytes;
+	next->offset = new_offset;
+}
+
 static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 {
 	struct extent_map_tree *tree = &inode->extent_tree;
@@ -260,6 +314,9 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 			em->block_len += merge->block_len;
 			em->block_start = merge->block_start;
 			em->generation = max(em->generation, merge->generation);
+
+			if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
+				merge_ondisk_extents(merge, em);
 			em->flags |= EXTENT_FLAG_MERGED;
 
 			rb_erase(&merge->rb_node, &tree->root);
@@ -275,6 +332,8 @@ static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
 	if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
 		em->len += merge->len;
 		em->block_len += merge->block_len;
+		if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
+			merge_ondisk_extents(em, merge);
 		rb_erase(&merge->rb_node, &tree->root);
 		RB_CLEAR_NODE(&merge->rb_node);
 		em->generation = max(em->generation, merge->generation);
@@ -562,6 +621,7 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 	    !extent_map_is_compressed(em)) {
 		em->block_start += start_diff;
 		em->block_len = em->len;
+		em->offset += start_diff;
 	}
 	return add_extent_mapping(inode, em, 0);
 }
@@ -785,14 +845,18 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 					split->block_len = em->block_len;
 				else
 					split->block_len = split->len;
+				split->disk_bytenr = em->disk_bytenr;
 				split->disk_num_bytes = max(split->block_len,
 							    em->disk_num_bytes);
+				split->offset = em->offset;
 				split->ram_bytes = em->ram_bytes;
 			} else {
 				split->orig_start = split->start;
 				split->block_len = 0;
 				split->block_start = em->block_start;
+				split->disk_bytenr = em->disk_bytenr;
 				split->disk_num_bytes = 0;
+				split->offset = 0;
 				split->ram_bytes = split->len;
 			}
 
@@ -813,13 +877,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			split->start = end;
 			split->len = em_end - end;
 			split->block_start = em->block_start;
+			split->disk_bytenr = em->disk_bytenr;
 			split->flags = flags;
 			split->generation = gen;
 
 			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
 				split->disk_num_bytes = max(em->block_len,
 							    em->disk_num_bytes);
-
+				split->offset = em->offset + end - em->start;
 				split->ram_bytes = em->ram_bytes;
 				if (compressed) {
 					split->block_len = em->block_len;
@@ -832,10 +897,11 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 					split->orig_start = em->orig_start;
 				}
 			} else {
+				split->disk_num_bytes = 0;
+				split->offset = 0;
 				split->ram_bytes = split->len;
 				split->orig_start = split->start;
 				split->block_len = 0;
-				split->disk_num_bytes = 0;
 			}
 
 			if (extent_map_in_tree(em)) {
@@ -989,10 +1055,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	/* First, replace the em with a new extent_map starting from * em->start */
 	split_pre->start = em->start;
 	split_pre->len = pre;
+	split_pre->disk_bytenr = new_logical;
+	split_pre->disk_num_bytes = split_pre->len;
+	split_pre->offset = 0;
 	split_pre->orig_start = split_pre->start;
 	split_pre->block_start = new_logical;
 	split_pre->block_len = split_pre->len;
-	split_pre->disk_num_bytes = split_pre->block_len;
 	split_pre->ram_bytes = split_pre->len;
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
@@ -1007,10 +1075,12 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	/* Insert the middle extent_map. */
 	split_mid->start = em->start + pre;
 	split_mid->len = em->len - pre;
+	split_mid->disk_bytenr = em->block_start + pre;
+	split_mid->disk_num_bytes = split_mid->len;
+	split_mid->offset = 0;
 	split_mid->orig_start = split_mid->start;
 	split_mid->block_start = em->block_start + pre;
 	split_mid->block_len = split_mid->len;
-	split_mid->disk_num_bytes = split_mid->block_len;
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 2b7bbffd594b..0b1a8e409377 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -70,12 +70,29 @@ struct extent_map {
 	 */
 	u64 orig_start;
 
+	/*
+	 * The bytenr of the full on-disk extent.
+	 *
+	 * For regular extents it's btrfs_file_extent_item::disk_bytenr.
+	 * For holes it's EXTENT_MAP_HOLE and for inline extents it's
+	 * EXTENT_MAP_INLINE.
+	 */
+	u64 disk_bytenr;
+
 	/*
 	 * The full on-disk extent length, matching
 	 * btrfs_file_extent_item::disk_num_bytes.
 	 */
 	u64 disk_num_bytes;
 
+	/*
+	 * Offset inside the decompressed extent.
+	 *
+	 * For regular extents it's btrfs_file_extent_item::offset.
+	 * For holes and inline extents it's 0.
+	 */
+	u64 offset;
+
 	/*
 	 * The decompressed size of the whole on-disk extent, matching
 	 * btrfs_file_extent_item::ram_bytes.
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 430dce44ebd2..1298afea9503 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1295,12 +1295,17 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->len = btrfs_file_extent_end(path) - extent_start;
 		em->orig_start = extent_start -
 			btrfs_file_extent_offset(leaf, fi);
-		em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
 		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
 		if (bytenr == 0) {
 			em->block_start = EXTENT_MAP_HOLE;
+			em->disk_bytenr = EXTENT_MAP_HOLE;
+			em->disk_num_bytes = 0;
+			em->offset = 0;
 			return;
 		}
+		em->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+		em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		em->offset = btrfs_file_extent_offset(leaf, fi);
 		if (compress_type != BTRFS_COMPRESS_NONE) {
 			extent_map_set_compression(em, compress_type);
 			em->block_start = bytenr;
@@ -1317,8 +1322,10 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		ASSERT(extent_start == 0);
 
 		em->block_start = EXTENT_MAP_INLINE;
+		em->disk_bytenr = EXTENT_MAP_INLINE;
 		em->start = 0;
 		em->len = fs_info->sectorsize;
+		em->offset = 0;
 		/*
 		 * Initialize orig_start and block_len with the same values
 		 * as in inode.c:btrfs_get_extent().
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 7c42565da70c..5133c6705d74 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2350,6 +2350,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 		hole_em->orig_start = offset;
 
 		hole_em->block_start = EXTENT_MAP_HOLE;
+		hole_em->disk_bytenr = EXTENT_MAP_HOLE;
 		hole_em->block_len = 0;
 		hole_em->disk_num_bytes = 0;
 		hole_em->generation = trans->transid;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ac489fb5e39..7afcdea27782 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -141,6 +141,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
 				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
+				       struct btrfs_file_extent *file_extent,
 				       int type);
 
 static int data_reloc_print_warning_inode(u64 inum, u64 offset, u64 num_bytes,
@@ -1152,6 +1153,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_ordered_extent *ordered;
+	struct btrfs_file_extent file_extent;
 	struct btrfs_key ins;
 	struct page *locked_page = NULL;
 	struct extent_state *cached = NULL;
@@ -1198,6 +1200,13 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 	lock_extent(io_tree, start, end, &cached);
 
 	/* Here we're doing allocation and writeback of the compressed pages */
+	file_extent.disk_bytenr = ins.objectid;
+	file_extent.disk_num_bytes = ins.offset;
+	file_extent.ram_bytes = async_extent->ram_size;
+	file_extent.num_bytes = async_extent->ram_size;
+	file_extent.offset = 0;
+	file_extent.compression = async_extent->compress_type;
+
 	em = create_io_em(inode, start,
 			  async_extent->ram_size,	/* len */
 			  start,			/* orig_start */
@@ -1206,6 +1215,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
 			  ins.offset,			/* orig_block_len */
 			  async_extent->ram_size,	/* ram_bytes */
 			  async_extent->compress_type,
+			  &file_extent,
 			  BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
@@ -1395,6 +1405,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 
 	while (num_bytes > 0) {
 		struct btrfs_ordered_extent *ordered;
+		struct btrfs_file_extent file_extent;
 
 		cur_alloc_size = num_bytes;
 		ret = btrfs_reserve_extent(root, cur_alloc_size, cur_alloc_size,
@@ -1431,6 +1442,12 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 		extent_reserved = true;
 
 		ram_size = ins.offset;
+		file_extent.disk_bytenr = ins.objectid;
+		file_extent.disk_num_bytes = ins.offset;
+		file_extent.num_bytes = ins.offset;
+		file_extent.ram_bytes = ins.offset;
+		file_extent.offset = 0;
+		file_extent.compression = BTRFS_COMPRESS_NONE;
 
 		lock_extent(&inode->io_tree, start, start + ram_size - 1,
 			    &cached);
@@ -1442,6 +1459,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 				  ins.offset, /* orig_block_len */
 				  ram_size, /* ram_bytes */
 				  BTRFS_COMPRESS_NONE, /* compress_type */
+				  &file_extent,
 				  BTRFS_ORDERED_REGULAR /* type */);
 		if (IS_ERR(em)) {
 			unlock_extent(&inode->io_tree, start,
@@ -2180,6 +2198,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
 					  nocow_args.num_bytes, /* block_len */
 					  nocow_args.disk_num_bytes, /* orig_block_len */
 					  ram_bytes, BTRFS_COMPRESS_NONE,
+					  &nocow_args.file_extent,
 					  BTRFS_ORDERED_PREALLOC);
 			if (IS_ERR(em)) {
 				unlock_extent(&inode->io_tree, cur_offset,
@@ -5012,6 +5031,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 			hole_em->orig_start = cur_offset;
 
 			hole_em->block_start = EXTENT_MAP_HOLE;
+			hole_em->disk_bytenr = EXTENT_MAP_HOLE;
 			hole_em->block_len = 0;
 			hole_em->disk_num_bytes = 0;
 			hole_em->ram_bytes = hole_size;
@@ -6880,6 +6900,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 	}
 	em->start = EXTENT_MAP_HOLE;
 	em->orig_start = EXTENT_MAP_HOLE;
+	em->disk_bytenr = EXTENT_MAP_HOLE;
 	em->len = (u64)-1;
 	em->block_len = (u64)-1;
 
@@ -7045,7 +7066,8 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const u64 block_len,
 						  const u64 orig_block_len,
 						  const u64 ram_bytes,
-						  const int type)
+						  const int type,
+						  struct btrfs_file_extent *file_extent)
 {
 	struct extent_map *em = NULL;
 	struct btrfs_ordered_extent *ordered;
@@ -7054,7 +7076,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		em = create_io_em(inode, start, len, orig_start, block_start,
 				  block_len, orig_block_len, ram_bytes,
 				  BTRFS_COMPRESS_NONE, /* compress_type */
-				  type);
+				  file_extent, type);
 		if (IS_ERR(em))
 			goto out;
 	}
@@ -7085,6 +7107,7 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct btrfs_file_extent file_extent;
 	struct extent_map *em;
 	struct btrfs_key ins;
 	u64 alloc_hint;
@@ -7103,9 +7126,16 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	if (ret)
 		return ERR_PTR(ret);
 
+	file_extent.disk_bytenr = ins.objectid;
+	file_extent.disk_num_bytes = ins.offset;
+	file_extent.num_bytes = ins.offset;
+	file_extent.ram_bytes = ins.offset;
+	file_extent.offset = 0;
+	file_extent.compression = BTRFS_COMPRESS_NONE;
 	em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset, start,
 				     ins.objectid, ins.offset, ins.offset,
-				     ins.offset, BTRFS_ORDERED_REGULAR);
+				     ins.offset, BTRFS_ORDERED_REGULAR,
+				     &file_extent);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	if (IS_ERR(em))
 		btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
@@ -7348,6 +7378,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
 				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
+				       struct btrfs_file_extent *file_extent,
 				       int type)
 {
 	struct extent_map *em;
@@ -7405,9 +7436,11 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 	em->len = len;
 	em->block_len = block_len;
 	em->block_start = block_start;
+	em->disk_bytenr = file_extent->disk_bytenr;
 	em->disk_num_bytes = disk_num_bytes;
 	em->ram_bytes = ram_bytes;
 	em->generation = -1;
+	em->offset = file_extent->offset;
 	em->flags |= EXTENT_FLAG_PINNED;
 	if (type == BTRFS_ORDERED_COMPRESSED)
 		extent_map_set_compression(em, compress_type);
@@ -7431,6 +7464,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 {
 	const bool nowait = (iomap_flags & IOMAP_NOWAIT);
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	struct btrfs_file_extent file_extent;
 	struct extent_map *em = *map;
 	int type;
 	u64 block_start, orig_start, orig_block_len, ram_bytes;
@@ -7461,7 +7495,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		block_start = em->block_start + (start - em->start);
 
 		if (can_nocow_extent(inode, start, &len, &orig_start,
-				     &orig_block_len, &ram_bytes, NULL, false, false) == 1) {
+				     &orig_block_len, &ram_bytes,
+				     &file_extent, false, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
 			if (bg)
 				can_nocow = true;
@@ -7489,7 +7524,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
 					      orig_start, block_start,
 					      len, orig_block_len,
-					      ram_bytes, type);
+					      ram_bytes, type,
+					      &file_extent);
 		btrfs_dec_nocow_writers(bg);
 		if (type == BTRFS_ORDERED_PREALLOC) {
 			free_extent_map(em);
@@ -9629,6 +9665,8 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		em->orig_start = cur_offset;
 		em->len = ins.offset;
 		em->block_start = ins.objectid;
+		em->disk_bytenr = ins.objectid;
+		em->offset = 0;
 		em->block_len = ins.offset;
 		em->disk_num_bytes = ins.offset;
 		em->ram_bytes = ins.offset;
@@ -10195,6 +10233,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	struct extent_changeset *data_reserved = NULL;
 	struct extent_state *cached_state = NULL;
 	struct btrfs_ordered_extent *ordered;
+	struct btrfs_file_extent file_extent;
 	int compression;
 	size_t orig_count;
 	u64 start, end;
@@ -10370,10 +10409,16 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 		goto out_delalloc_release;
 	extent_reserved = true;
 
+	file_extent.disk_bytenr = ins.objectid;
+	file_extent.disk_num_bytes = ins.offset;
+	file_extent.num_bytes = num_bytes;
+	file_extent.ram_bytes = ram_bytes;
+	file_extent.offset = encoded->unencoded_offset;
+	file_extent.compression = compression;
 	em = create_io_em(inode, start, num_bytes,
 			  start - encoded->unencoded_offset, ins.objectid,
 			  ins.offset, ins.offset, ram_bytes, compression,
-			  BTRFS_ORDERED_COMPRESSED);
+			  &file_extent, BTRFS_ORDERED_COMPRESSED);
 	if (IS_ERR(em)) {
 		ret = PTR_ERR(em);
 		goto out_free_reserved;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 02/11] btrfs: export the expected file extent through can_nocow_extent()
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 01/11] btrfs: rename extent_map::orig_block_len to disk_num_bytes Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 03/11] btrfs: introduce new members for extent_map Qu Wenruo
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Currently the function can_nocow_extent() only returns the members
needed for extent_map.

However since we will soon change the extent_map structure to be more
like btrfs_file_extent_item, we want to expose the expected file extent
caused by the NOCOW write for future usage.

This introduces a new structure, btrfs_file_extent, to be a more
memory-access-friendly representation of btrfs_file_extent_item.
That structure is then used to expose the expected file extent caused
by the NOCOW write.

For now there is no user of the new structure yet.
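
As a rough sketch (assuming a path already positioned at a regular file
extent item, mirroring the can_nocow_file_extent() hunk below), filling
the new structure from an on-disk item boils down to:

  struct btrfs_file_extent fe;
  struct extent_buffer *leaf = path->nodes[0];
  struct btrfs_file_extent_item *fi;

  fi = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
  fe.disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
  fe.disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
  fe.num_bytes = btrfs_file_extent_num_bytes(leaf, fi);
  fe.ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
  fe.offset = btrfs_file_extent_offset(leaf, fi);
  fe.compression = btrfs_file_extent_compression(leaf, fi);

The NOCOW path further trims num_bytes and offset to the range actually
being written, as the hunk below does.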

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/btrfs_inode.h | 17 ++++++++++++++++-
 fs/btrfs/file.c        |  2 +-
 fs/btrfs/inode.c       | 22 +++++++++++++++++++---
 3 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 60fd81eaedb8..9ada4185ff93 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -514,9 +514,24 @@ int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
+
+/*
+ * This represents details about the target file extent item of a write
+ * operation.
+ */
+struct btrfs_file_extent {
+	u64 disk_bytenr;
+	u64 disk_num_bytes;
+	u64 num_bytes;
+	u64 ram_bytes;
+	u64 offset;
+	u8 compression;
+};
+
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      u64 *orig_start, u64 *orig_block_len,
-			      u64 *ram_bytes, bool nowait, bool strict);
+			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
+			      bool nowait, bool strict);
 
 void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
 struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index a216a0fdf58d..7c42565da70c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1104,7 +1104,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 						   &cached_state);
 	}
 	ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
-			NULL, NULL, NULL, nowait, false);
+			NULL, NULL, NULL, NULL, nowait, false);
 	if (ret <= 0)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 	else
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7148da50e435..8ac489fb5e39 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1855,6 +1855,9 @@ struct can_nocow_file_extent_args {
 	u64 extent_offset;
 	/* Number of bytes that can be written to in NOCOW mode. */
 	u64 num_bytes;
+
+	/* The expected file extent for the NOCOW write. */
+	struct btrfs_file_extent file_extent;
 };
 
 /*
@@ -1919,6 +1922,12 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 
 	extent_end = btrfs_file_extent_end(path);
 
+	args->file_extent.disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+	args->file_extent.disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+	args->file_extent.ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
+	args->file_extent.offset = btrfs_file_extent_offset(leaf, fi);
+	args->file_extent.compression = btrfs_file_extent_compression(leaf, fi);
+
 	/*
 	 * The following checks can be expensive, as they need to take other
 	 * locks and do btree or rbtree searches, so release the path to avoid
@@ -1953,6 +1962,9 @@ static int can_nocow_file_extent(struct btrfs_path *path,
 	args->disk_bytenr += args->start - key->offset;
 	args->num_bytes = min(args->end + 1, extent_end) - args->start;
 
+	args->file_extent.num_bytes = args->num_bytes;
+	args->file_extent.offset += args->start - key->offset;
+
 	/*
 	 * Force COW if csums exist in the range. This ensures that csums for a
 	 * given extent are either valid or do not exist.
@@ -7137,7 +7149,8 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
  */
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 			      u64 *orig_start, u64 *orig_block_len,
-			      u64 *ram_bytes, bool nowait, bool strict)
+			      u64 *ram_bytes, struct btrfs_file_extent *file_extent,
+			      bool nowait, bool strict)
 {
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	struct can_nocow_file_extent_args nocow_args = { 0 };
@@ -7226,6 +7239,9 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
 		*orig_start = key.offset - nocow_args.extent_offset;
 	if (orig_block_len)
 		*orig_block_len = nocow_args.disk_num_bytes;
+	if (file_extent)
+		memcpy(file_extent, &nocow_args.file_extent,
+		       sizeof(*file_extent));
 
 	*len = nocow_args.num_bytes;
 	ret = 1;
@@ -7445,7 +7461,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		block_start = em->block_start + (start - em->start);
 
 		if (can_nocow_extent(inode, start, &len, &orig_start,
-				     &orig_block_len, &ram_bytes, false, false) == 1) {
+				     &orig_block_len, &ram_bytes, NULL, false, false) == 1) {
 			bg = btrfs_inc_nocow_writers(fs_info, block_start);
 			if (bg)
 				can_nocow = true;
@@ -10687,7 +10703,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 		free_extent_map(em);
 		em = NULL;
 
-		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
+		ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, NULL, false, true);
 		if (ret < 0) {
 			goto out;
 		} else if (ret) {
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 01/11] btrfs: rename extent_map::orig_block_len to disk_num_bytes
  2024-05-23  5:03  2% [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent Qu Wenruo
@ 2024-05-23  5:03  1% ` Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 02/11] btrfs: export the expected file extent through can_nocow_extent() Qu Wenruo
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

This makes it very obvious that the member simply matches
btrfs_file_extent_item::disk_num_bytes.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_map.c | 16 ++++++++--------
 fs/btrfs/extent_map.h |  2 +-
 fs/btrfs/file-item.c  |  4 ++--
 fs/btrfs/file.c       |  2 +-
 fs/btrfs/inode.c      | 12 ++++++------
 fs/btrfs/tree-log.c   |  4 ++--
 6 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 35e163152dbc..a9d60d1eade9 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -785,14 +785,14 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 					split->block_len = em->block_len;
 				else
 					split->block_len = split->len;
-				split->orig_block_len = max(split->block_len,
-						em->orig_block_len);
+				split->disk_num_bytes = max(split->block_len,
+							    em->disk_num_bytes);
 				split->ram_bytes = em->ram_bytes;
 			} else {
 				split->orig_start = split->start;
 				split->block_len = 0;
 				split->block_start = em->block_start;
-				split->orig_block_len = 0;
+				split->disk_num_bytes = 0;
 				split->ram_bytes = split->len;
 			}
 
@@ -817,8 +817,8 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 			split->generation = gen;
 
 			if (em->block_start < EXTENT_MAP_LAST_BYTE) {
-				split->orig_block_len = max(em->block_len,
-						    em->orig_block_len);
+				split->disk_num_bytes = max(em->block_len,
+							    em->disk_num_bytes);
 
 				split->ram_bytes = em->ram_bytes;
 				if (compressed) {
@@ -835,7 +835,7 @@ void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
 				split->ram_bytes = split->len;
 				split->orig_start = split->start;
 				split->block_len = 0;
-				split->orig_block_len = 0;
+				split->disk_num_bytes = 0;
 			}
 
 			if (extent_map_in_tree(em)) {
@@ -992,7 +992,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_pre->orig_start = split_pre->start;
 	split_pre->block_start = new_logical;
 	split_pre->block_len = split_pre->len;
-	split_pre->orig_block_len = split_pre->block_len;
+	split_pre->disk_num_bytes = split_pre->block_len;
 	split_pre->ram_bytes = split_pre->len;
 	split_pre->flags = flags;
 	split_pre->generation = em->generation;
@@ -1010,7 +1010,7 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
 	split_mid->orig_start = split_mid->start;
 	split_mid->block_start = em->block_start + pre;
 	split_mid->block_len = split_mid->len;
-	split_mid->orig_block_len = split_mid->block_len;
+	split_mid->disk_num_bytes = split_mid->block_len;
 	split_mid->ram_bytes = split_mid->len;
 	split_mid->flags = flags;
 	split_mid->generation = em->generation;
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
index 9144721b88a5..2b7bbffd594b 100644
--- a/fs/btrfs/extent_map.h
+++ b/fs/btrfs/extent_map.h
@@ -74,7 +74,7 @@ struct extent_map {
 	 * The full on-disk extent length, matching
 	 * btrfs_file_extent_item::disk_num_bytes.
 	 */
-	u64 orig_block_len;
+	u64 disk_num_bytes;
 
 	/*
 	 * The decompressed size of the whole on-disk extent, matching
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index f3ed78e21fa4..430dce44ebd2 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -1295,7 +1295,7 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		em->len = btrfs_file_extent_end(path) - extent_start;
 		em->orig_start = extent_start -
 			btrfs_file_extent_offset(leaf, fi);
-		em->orig_block_len = btrfs_file_extent_disk_num_bytes(leaf, fi);
+		em->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
 		bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
 		if (bytenr == 0) {
 			em->block_start = EXTENT_MAP_HOLE;
@@ -1304,7 +1304,7 @@ void btrfs_extent_item_to_extent_map(struct btrfs_inode *inode,
 		if (compress_type != BTRFS_COMPRESS_NONE) {
 			extent_map_set_compression(em, compress_type);
 			em->block_start = bytenr;
-			em->block_len = em->orig_block_len;
+			em->block_len = em->disk_num_bytes;
 		} else {
 			bytenr += btrfs_file_extent_offset(leaf, fi);
 			em->block_start = bytenr;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index af58a1b33498..a216a0fdf58d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2351,7 +2351,7 @@ static int fill_holes(struct btrfs_trans_handle *trans,
 
 		hole_em->block_start = EXTENT_MAP_HOLE;
 		hole_em->block_len = 0;
-		hole_em->orig_block_len = 0;
+		hole_em->disk_num_bytes = 0;
 		hole_em->generation = trans->transid;
 
 		ret = btrfs_replace_extent_map_range(inode, hole_em, true);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4410b463c6a..7148da50e435 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -139,7 +139,7 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
 				     bool pages_dirty);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
-				       u64 block_len, u64 orig_block_len,
+				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       int type);
 
@@ -5001,7 +5001,7 @@ int btrfs_cont_expand(struct btrfs_inode *inode, loff_t oldsize, loff_t size)
 
 			hole_em->block_start = EXTENT_MAP_HOLE;
 			hole_em->block_len = 0;
-			hole_em->orig_block_len = 0;
+			hole_em->disk_num_bytes = 0;
 			hole_em->ram_bytes = hole_size;
 			hole_em->generation = btrfs_get_fs_generation(fs_info);
 
@@ -7330,7 +7330,7 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 /* The callers of this must take lock_extent() */
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
-				       u64 block_len, u64 orig_block_len,
+				       u64 block_len, u64 disk_num_bytes,
 				       u64 ram_bytes, int compress_type,
 				       int type)
 {
@@ -7362,7 +7362,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 		ASSERT(block_len == len);
 
 		/* COW results a new extent matching our file extent size. */
-		ASSERT(orig_block_len == len);
+		ASSERT(disk_num_bytes == len);
 		ASSERT(ram_bytes == len);
 
 		/* Since it's a new extent, we should not have any offset. */
@@ -7389,7 +7389,7 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 	em->len = len;
 	em->block_len = block_len;
 	em->block_start = block_start;
-	em->orig_block_len = orig_block_len;
+	em->disk_num_bytes = disk_num_bytes;
 	em->ram_bytes = ram_bytes;
 	em->generation = -1;
 	em->flags |= EXTENT_FLAG_PINNED;
@@ -9614,7 +9614,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 		em->len = ins.offset;
 		em->block_start = ins.objectid;
 		em->block_len = ins.offset;
-		em->orig_block_len = ins.offset;
+		em->disk_num_bytes = ins.offset;
 		em->ram_bytes = ins.offset;
 		em->flags |= EXTENT_FLAG_PREALLOC;
 		em->generation = trans->transid;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 51a167559ae8..f237b5ed80ec 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4651,7 +4651,7 @@ static int log_extent_csums(struct btrfs_trans_handle *trans,
 	/* If we're compressed we have to save the entire range of csums. */
 	if (extent_map_is_compressed(em)) {
 		csum_offset = 0;
-		csum_len = max(em->block_len, em->orig_block_len);
+		csum_len = max(em->block_len, em->disk_num_bytes);
 	} else {
 		csum_offset = mod_start - em->start;
 		csum_len = mod_len;
@@ -4701,7 +4701,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
 	else
 		btrfs_set_stack_file_extent_type(&fi, BTRFS_FILE_EXTENT_REG);
 
-	block_len = max(em->block_len, em->orig_block_len);
+	block_len = max(em->block_len, em->disk_num_bytes);
 	compress_type = extent_map_compression(em);
 	if (compress_type != BTRFS_COMPRESS_NONE) {
 		btrfs_set_stack_file_extent_disk_bytenr(&fi, em->block_start);
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent
@ 2024-05-23  5:03  2% Qu Wenruo
  2024-05-23  5:03  1% ` [PATCH v3 01/11] btrfs: rename extent_map::orig_block_len to disk_num_bytes Qu Wenruo
                   ` (12 more replies)
  0 siblings, 13 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  5:03 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v3:
- Rebased to the latest for-next
  There is a small conflict with the extent map tree member changes,
  no big deal.

- Fix an error where the original code checks
  btrfs_file_extent_disk_bytenr() while the newer code was checking
  disk_num_bytes, which is wrong.

- Various commit message/comment updates
  Mostly some grammar fixes and removal of rants on the btrfs_file_extent
  member mismatches for btrfs_alloc_ordered_extent().
  However a comment is still left inside btrfs_alloc_ordered_extent()
  for NOCOW/PREALLOC as a reminder for further cleanup.

v2:
- Rebased to the latest for-next
  There is a conflict with extent locking, and maybe some other
  hidden conflicts for NOCOW/PREALLOC?
  Previously the patchset passed the fstests auto group, but after
  merging with other patches it always crashed at btrfs/060.

- Fix an error in the final cleanup patch
  It's the NOCOW/PREALLOC shenanigans again, in the buffered NOCOW path,
  that we have to use the old inaccurate numbers for NOCOW/PREALLOC OEs.

- Split the final cleanup into 4 patches
  Most cleanups are very straightforward, but the cleanup for
  btrfs_alloc_ordered_extent() needs extra special handling for
  NOCOW/PREALLOC.

v1:
- Rebased to the latest for-next
  To resolve the conflicts with the recently introduced extent map
  shrinker

- A new cleanup patch to remove the recursive header inclusion

- Use a new structure to pass the file extent item related members
  around

- Add a new comment on why we're intentionally passing incorrect
  numbers for NOCOW/PREALLOC ordered extents inside
  btrfs_create_dio_extent()

[REPO]
https://github.com/adam900710/linux/tree/em_cleanup

This series introduces two new members (disk_bytenr/offset) to
extent_map, removes three old members
(block_start/block_len/orig_start), and finally renames one member
(orig_block_len -> disk_num_bytes).

This should save us one u64 for extent_map (three u64 members are
removed while only two are added), although with the recent extent map
shrinker the saving is not that useful.

But to make the migration safe, I introduce extra sanity checks for
extent_map, and cross-check both the old and new members.

The extra sanity checks already exposed one bug (thankfully harmless)
causing em::block_start to be incorrect.

But so far, the patchset passes the default fstests run.

Furthermore, since we already have overly long parameter lists for
extent_map/ordered_extent/can_nocow_extent, here is a new structure,
btrfs_file_extent, a memory-access-friendly structure to represent a
btrfs_file_extent_item.

With the help of that structure, we can represent a file extent item
without a super long parameter list.

The patchset first renames orig_block_len to disk_num_bytes.
Then it introduces the new members, the extra sanity checks, and the
new btrfs_file_extent structure, and uses that to remove the 3 older
members from extent_map.

After all the above work is done, btrfs_file_extent is used to further clean up
can_nocow_file_extent_args()/btrfs_alloc_ordered_extent()/create_io_em()/
btrfs_create_dio_extent().

The cleanup is in fact pretty tricky: the current code base never
expects correct numbers for NOCOW/PREALLOC OEs, thus we have to keep the
old but incorrect numbers just for NOCOW/PREALLOC.

I will address the NOCOW/PREALLOC shenanigans in the future, but
after the huge cleanup across multiple core structures.

Qu Wenruo (11):
  btrfs: rename extent_map::orig_block_len to disk_num_bytes
  btrfs: export the expected file extent through can_nocow_extent()
  btrfs: introduce new members for extent_map
  btrfs: introduce extra sanity checks for extent maps
  btrfs: remove extent_map::orig_start member
  btrfs: remove extent_map::block_len member
  btrfs: remove extent_map::block_start member
  btrfs: cleanup duplicated parameters related to
    can_nocow_file_extent_args
  btrfs: cleanup duplicated parameters related to
    btrfs_alloc_ordered_extent
  btrfs: cleanup duplicated parameters related to create_io_em()
  btrfs: cleanup duplicated parameters related to
    btrfs_create_dio_extent()

 fs/btrfs/btrfs_inode.h            |   4 +-
 fs/btrfs/compression.c            |   7 +-
 fs/btrfs/defrag.c                 |  14 +-
 fs/btrfs/extent_io.c              |  10 +-
 fs/btrfs/extent_map.c             | 192 +++++++++++++------
 fs/btrfs/extent_map.h             |  51 +++--
 fs/btrfs/file-item.c              |  23 +--
 fs/btrfs/file.c                   |  18 +-
 fs/btrfs/inode.c                  | 308 +++++++++++++-----------------
 fs/btrfs/ordered-data.c           |  34 +++-
 fs/btrfs/ordered-data.h           |  19 +-
 fs/btrfs/relocation.c             |   5 +-
 fs/btrfs/tests/extent-map-tests.c | 114 ++++++-----
 fs/btrfs/tests/inode-tests.c      | 177 ++++++++---------
 fs/btrfs/tree-log.c               |  23 ++-
 fs/btrfs/zoned.c                  |   4 +-
 include/trace/events/btrfs.h      |  18 +-
 17 files changed, 541 insertions(+), 480 deletions(-)

-- 
2.45.1


^ permalink raw reply	[relevance 2%]

* Re: [PATCH v2 11/11] btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()
  2024-05-20 16:48  1%   ` Filipe Manana
@ 2024-05-23  4:03  1%     ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  4:03 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2024/5/21 02:18, Filipe Manana wrote:
> On Fri, May 3, 2024 at 7:03 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>> The following 3 parameters can be cleaned up using btrfs_file_extent
>> structure:
>>
>> - len
>>    btrfs_file_extent::num_bytes
>>
>> - orig_block_len
>>    btrfs_file_extent::disk_num_bytes
>>
>> - ram_bytes
>>    btrfs_file_extent::ram_bytes
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/inode.c | 22 ++++++++--------------
>>   1 file changed, 8 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index a95dc2333972..09974c86d3d1 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -6969,11 +6969,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>>   static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>>                                                    struct btrfs_dio_data *dio_data,
>>                                                    const u64 start,
>> -                                                 const u64 len,
>> -                                                 const u64 orig_block_len,
>> -                                                 const u64 ram_bytes,
>> -                                                 const int type,
>> -                                                 struct btrfs_file_extent *file_extent)
>> +                                                 struct btrfs_file_extent *file_extent,
>> +                                                 const int type)
>>   {
>>          struct extent_map *em = NULL;
>>          struct btrfs_ordered_extent *ordered;
>> @@ -6991,7 +6988,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>>                  if (em) {
>>                          free_extent_map(em);
>>                          btrfs_drop_extent_map_range(inode, start,
>> -                                                   start + len - 1, false);
>> +                                       start + file_extent->num_bytes - 1, false);
>>                  }
>>                  em = ERR_CAST(ordered);
>>          } else {
>> @@ -7034,10 +7031,9 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>>          file_extent.ram_bytes = ins.offset;
>>          file_extent.offset = 0;
>>          file_extent.compression = BTRFS_COMPRESS_NONE;
>> -       em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
>> -                                    ins.offset,
>> -                                    ins.offset, BTRFS_ORDERED_REGULAR,
>> -                                    &file_extent);
>> +       em = btrfs_create_dio_extent(inode, dio_data, start,
>> +                                    &file_extent,
>> +                                    BTRFS_ORDERED_REGULAR);
>
> As we're changing this, we can leave this in a single line as it fits.
>
>>          btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>>          if (IS_ERR(em))
>>                  btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
>> @@ -7404,10 +7400,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>>                  }
>>                  space_reserved = true;
>>
>> -               em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
>> -                                             file_extent.disk_num_bytes,
>> -                                             file_extent.ram_bytes, type,
>> -                                             &file_extent);
>> +               em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start,
>> +                                             &file_extent, type);
>
> Same here.

Just a small question related to the single line one.

The parameter @start with its trailing ',' is already at 80 chars,
so do we still need to follow the old 80 chars width recommendation?

With the previous several patches, I re-checked the lines; some can
indeed be improved a little, but some BTRFS_ORDERED_* flags cannot be
merged without exceeding the 80 chars limit.

Thanks,
Qu
>
> The rest looks good, thanks.
>
>>                  btrfs_dec_nocow_writers(bg);
>>                  if (type == BTRFS_ORDERED_PREALLOC) {
>>                          free_extent_map(em);
>> --
>> 2.45.0
>>
>>
>

^ permalink raw reply	[relevance 1%]

* [PATCH v3 2/2] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
  2024-05-23  1:19  2% [PATCH v3 0/2] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
  2024-05-23  1:19  1% ` [PATCH v3 1/2] btrfs: move extent_range_clear_dirty_for_io() into inode.c Qu Wenruo
@ 2024-05-23  1:19  1% ` Qu Wenruo
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  1:19 UTC (permalink / raw)
  To: linux-btrfs

Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(),
as we expect all involved folios to still be locked, thus no folio
should be missing.

However for extent_range_clear_dirty_for_io() itself, we can skip a
missing folio, handle the remaining ones, and return an error if
anything is wrong.

So this patch removes the BUG_ON() and lets the caller handle the
error.
In the caller we do not have a quick way to clean up on error, but all
the compression routines handle a missing folio as an error and
properly error out, so we only need an ASSERT() for developers, while
for non-debug builds the compression routines handle the error
correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 99be256f4f0e..126457236427 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -890,19 +890,24 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 		btrfs_add_inode_defrag(NULL, inode, small_write);
 }
 
-static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+static int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
 {
-	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
 	struct page *page;
+	int ret = 0;
 
-	while (index <= end_index) {
+	for (unsigned long index = start >> PAGE_SHIFT;
+	     index <= end_index; index++) {
 		page = find_get_page(inode->i_mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		if (unlikely(!page)) {
+			if (!ret)
+				ret = -ENOENT;
+			continue;
+		}
 		clear_page_dirty_for_io(page);
 		put_page(page);
-		index++;
 	}
+	return ret;
 }
 
 /*
@@ -946,7 +951,16 @@ static void compress_file_range(struct btrfs_work *work)
 	 * Otherwise applications with the file mmap'd can wander in and change
 	 * the page contents while we are compressing them.
 	 */
-	extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+	ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+
+	/*
+	 * All the folios should have been locked, thus no failure.
+	 *
+	 * And even if some folios are missing, btrfs_compress_folios()
+	 * will handle them correctly, so here just do an ASSERT() check to
+	 * catch logic errors early.
+	 */
+	ASSERT(ret == 0);
 
 	/*
 	 * We need to save i_size before now because it could change in between
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 1/2] btrfs: move extent_range_clear_dirty_for_io() into inode.c
  2024-05-23  1:19  2% [PATCH v3 0/2] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
@ 2024-05-23  1:19  1% ` Qu Wenruo
  2024-05-23  1:19  1% ` [PATCH v3 2/2] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() Qu Wenruo
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  1:19 UTC (permalink / raw)
  To: linux-btrfs

The function is only used by compress_file_range() in inode.c, so move
it to inode.c and unexport it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 15 ---------------
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c     | 15 +++++++++++++++
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7275bd919a3e..4af16c09dd88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -164,21 +164,6 @@ void __cold extent_buffer_free_cachep(void)
 	kmem_cache_destroy(extent_buffer_cache);
 }
 
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
-{
-	unsigned long index = start >> PAGE_SHIFT;
-	unsigned long end_index = end >> PAGE_SHIFT;
-	struct page *page;
-
-	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
-		put_page(page);
-		index++;
-	}
-}
-
 static void process_one_page(struct btrfs_fs_info *fs_info,
 			     struct page *page, struct page *locked_page,
 			     unsigned long page_ops, u64 start, u64 end)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index dca6b12769ec..7c2f1bbc6b67 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -350,7 +350,6 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
 void set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  struct extent_state **cached,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 000809e16aba..99be256f4f0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -890,6 +890,21 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 		btrfs_add_inode_defrag(NULL, inode, small_write);
 }
 
+static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_SHIFT;
+	unsigned long end_index = end >> PAGE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		put_page(page);
+		index++;
+	}
+}
+
 /*
  * Work queue callback to start compression on a file and pages.
  *
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 0/2] btrfs: enhance function extent_range_clear_dirty_for_io()
@ 2024-05-23  1:19  2% Qu Wenruo
  2024-05-23  1:19  1% ` [PATCH v3 1/2] btrfs: move extent_range_clear_dirty_for_io() into inode.c Qu Wenruo
  2024-05-23  1:19  1% ` [PATCH v3 2/2] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() Qu Wenruo
  0 siblings, 2 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  1:19 UTC (permalink / raw)
  To: linux-btrfs

[Changelog]
v3:
- Drop the patch to use subpage helper
  For subpage cases, fsstress with compression can lead to a hang where
  an ordered extent (OE) appears stuck and is never finished.
  So far it looks like some race with an i_size change, but it is still
  not clear why the code change is involved.
  Drop the subpage helper change for now.

v2:
- Split the original patch into 3

- Return the error from filemap_get_folio() to be future-proof

- Enhance the comments for the new ASSERT() on
  extent_range_clear_dirty_for_io() error
  In fact, even if some pages are missing, we do not need to handle the
  error at compress_file_range(), as btrfs_compress_folios() and each
  compression routine handle a missing folio correctly.

  Thus the new ASSERT() is only an early warning for developers.

Qu Wenruo (2):
  btrfs: move extent_range_clear_dirty_for_io() into inode.c
  btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()

 fs/btrfs/extent_io.c | 15 ---------------
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c     | 31 ++++++++++++++++++++++++++++++-
 3 files changed, 30 insertions(+), 17 deletions(-)

-- 
2.45.1


^ permalink raw reply	[relevance 2%]

* Re: [PATCH v2 2/3] btrfs: make extent_range_clear_dirty_for_io() subpage compatible
  2024-05-22 23:47  1% ` [PATCH v2 2/3] btrfs: make extent_range_clear_dirty_for_io() subpage compatible Qu Wenruo
@ 2024-05-23  0:49  1%   ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-23  0:49 UTC (permalink / raw)
  To: linux-btrfs



On 2024/5/23 09:17, Qu Wenruo wrote:
> Although the function is never called for subpage ranges, there is no
> harm in making it subpage compatible for the future sector-perfect
> subpage compression support.
> 
> And since we're here, also change it to use folio APIs, as the subpage
> helper is already folio based.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This patch is causing hangs in fstests when compression is involved.

Please drop the series for now.

Thanks,
Qu
> ---
>   fs/btrfs/inode.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 99be256f4f0e..dda47a273813 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -892,15 +892,20 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
>   
>   static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
>   {
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +	const u64 len = end + 1 - start;
>   	unsigned long index = start >> PAGE_SHIFT;
>   	unsigned long end_index = end >> PAGE_SHIFT;
> -	struct page *page;
>   
> +	/* We should not have such a large range. */
> +	ASSERT(len < U32_MAX);
>   	while (index <= end_index) {
> -		page = find_get_page(inode->i_mapping, index);
> -		BUG_ON(!page); /* Pages should be in the extent_io_tree */
> -		clear_page_dirty_for_io(page);
> -		put_page(page);
> +		struct folio *folio;
> +
> +		folio = filemap_get_folio(inode->i_mapping, index);
> +		BUG_ON(IS_ERR(folio)); /* Pages should have been locked. */
> +		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
> +		folio_put(folio);
>   		index++;
>   	}
>   }

^ permalink raw reply	[relevance 1%]

* [PATCH v2 3/3] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()
  2024-05-22 23:47  1% [PATCH v2 0/3] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 1/3] btrfs: move extent_range_clear_dirty_for_io() into inode.c Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 2/3] btrfs: make extent_range_clear_dirty_for_io() subpage compatible Qu Wenruo
@ 2024-05-22 23:47  1% ` Qu Wenruo
  2 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-22 23:47 UTC (permalink / raw)
  To: linux-btrfs

Previously we had a BUG_ON() inside extent_range_clear_dirty_for_io(),
since we expect all the involved folios to still be locked, thus no
folio should be missing.

However, extent_range_clear_dirty_for_io() itself can simply skip any
missing folio, handle the remaining ones, and return an error if
anything is wrong.

So this patch removes the BUG_ON() and lets the caller handle the
error.
In the caller we do not have a quick way to clean up on error, but all
the compression routines treat a missing folio as an error and properly
error out, so we only need an ASSERT() for developers; on non-debug
builds the compression routines still handle the error correctly.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index dda47a273813..18b833e58d19 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -890,24 +890,29 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 		btrfs_add_inode_defrag(NULL, inode, small_write);
 }
 
-static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+static int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
 {
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	const u64 len = end + 1 - start;
-	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
+	int ret = 0;
 
 	/* We should not have such a large range. */
 	ASSERT(len < U32_MAX);
-	while (index <= end_index) {
+	for (unsigned long index = start >> PAGE_SHIFT;
+	     index <= end_index; index++) {
 		struct folio *folio;
 
 		folio = filemap_get_folio(inode->i_mapping, index);
-		BUG_ON(IS_ERR(folio)); /* Pages should have been locked. */
+		if (IS_ERR(folio)) {
+			if (!ret)
+				ret = PTR_ERR(folio);
+			continue;
+		}
 		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
 		folio_put(folio);
-		index++;
 	}
+	return ret;
 }
 
 /*
@@ -951,7 +956,16 @@ static void compress_file_range(struct btrfs_work *work)
 	 * Otherwise applications with the file mmap'd can wander in and change
 	 * the page contents while we are compressing them.
 	 */
-	extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+	ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+
+	/*
+	 * All the folios should have been locked, thus no failure.
+	 *
+	 * And even if some folios are missing, btrfs_compress_folios()
+	 * will handle them correctly, so here just do an ASSERT() check to
+	 * catch logic errors early.
+	 */
+	ASSERT(ret == 0);
 
 	/*
 	 * We need to save i_size before now because it could change in between
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v2 2/3] btrfs: make extent_range_clear_dirty_for_io() subpage compatible
  2024-05-22 23:47  1% [PATCH v2 0/3] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 1/3] btrfs: move extent_range_clear_dirty_for_io() into inode.c Qu Wenruo
@ 2024-05-22 23:47  1% ` Qu Wenruo
  2024-05-23  0:49  1%   ` Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 3/3] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() Qu Wenruo
  2 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-22 23:47 UTC (permalink / raw)
  To: linux-btrfs

Although the function is never called for subpage ranges, there is no
harm in making it subpage compatible for the future sector-perfect
subpage compression support.

And since we're here, also change it to use folio APIs, as the subpage
helper is already folio based.
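
For instance (sketch only, the sizes are just an example): on a system
with 64K pages and a 4K sector size, a range may cover only a few
sectors of a folio, and the clamp helper clears only those sectors'
dirty bits instead of the whole folio:

	/* Clear dirty bits only for the sectors inside [start, start + len). */
	btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);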

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/inode.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 99be256f4f0e..dda47a273813 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -892,15 +892,20 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 
 static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
 {
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	const u64 len = end + 1 - start;
 	unsigned long index = start >> PAGE_SHIFT;
 	unsigned long end_index = end >> PAGE_SHIFT;
-	struct page *page;
 
+	/* We should not have such a large range. */
+	ASSERT(len < U32_MAX);
 	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
-		put_page(page);
+		struct folio *folio;
+
+		folio = filemap_get_folio(inode->i_mapping, index);
+		BUG_ON(IS_ERR(folio)); /* Pages should have been locked. */
+		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
+		folio_put(folio);
 		index++;
 	}
 }
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v2 1/3] btrfs: move extent_range_clear_dirty_for_io() into inode.c
  2024-05-22 23:47  1% [PATCH v2 0/3] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
@ 2024-05-22 23:47  1% ` Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 2/3] btrfs: make extent_range_clear_dirty_for_io() subpage compatible Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 3/3] btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() Qu Wenruo
  2 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-22 23:47 UTC (permalink / raw)
  To: linux-btrfs

The function is only used by compress_file_range() in inode.c, so move
it to inode.c and unexport it.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 15 ---------------
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c     | 15 +++++++++++++++
 3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7275bd919a3e..4af16c09dd88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -164,21 +164,6 @@ void __cold extent_buffer_free_cachep(void)
 	kmem_cache_destroy(extent_buffer_cache);
 }
 
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
-{
-	unsigned long index = start >> PAGE_SHIFT;
-	unsigned long end_index = end >> PAGE_SHIFT;
-	struct page *page;
-
-	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
-		put_page(page);
-		index++;
-	}
-}
-
 static void process_one_page(struct btrfs_fs_info *fs_info,
 			     struct page *page, struct page *locked_page,
 			     unsigned long page_ops, u64 start, u64 end)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index dca6b12769ec..7c2f1bbc6b67 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -350,7 +350,6 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
 void set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  struct extent_state **cached,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 000809e16aba..99be256f4f0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -890,6 +890,21 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 		btrfs_add_inode_defrag(NULL, inode, small_write);
 }
 
+static void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	unsigned long index = start >> PAGE_SHIFT;
+	unsigned long end_index = end >> PAGE_SHIFT;
+	struct page *page;
+
+	while (index <= end_index) {
+		page = find_get_page(inode->i_mapping, index);
+		BUG_ON(!page); /* Pages should be in the extent_io_tree */
+		clear_page_dirty_for_io(page);
+		put_page(page);
+		index++;
+	}
+}
+
 /*
  * Work queue callback to start compression on a file and pages.
  *
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v2 0/3] btrfs: enhance function extent_range_clear_dirty_for_io()
@ 2024-05-22 23:47  1% Qu Wenruo
  2024-05-22 23:47  1% ` [PATCH v2 1/3] btrfs: move extent_range_clear_dirty_for_io() into inode.c Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Qu Wenruo @ 2024-05-22 23:47 UTC (permalink / raw)
  To: linux-btrfs

[Changelog]
v2:
- Split the original patch into 3

- Return the error from filemap_get_folio() to be future-proof

- Enhance the comments for the new ASSERT() on
  extent_range_clear_dirty_for_io() error
  In fact, even if some pages are missing, we do not need to handle the
  error at compress_file_range(), as btrfs_compress_folios() and each
  compression routine handle a missing folio correctly.

  Thus the new ASSERT() is only an early warning for developers.

This is a preparation for the (near) future support of sector perfect
subpage compression support. (the current one requires full page
alignment).

The function extent_range_clear_dirty_for_io() is just a simple start.

Qu Wenruo (3):
  btrfs: move extent_range_clear_dirty_for_io() into inode.c
  btrfs: make extent_range_clear_dirty_for_io() subpage compatible
  btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io()

 fs/btrfs/extent_io.c | 15 ---------------
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c     | 36 +++++++++++++++++++++++++++++++++++-
 3 files changed, 35 insertions(+), 17 deletions(-)

-- 
2.45.1


^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (7 preceding siblings ...)
  2024-05-22 15:21  1% ` [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions Josef Bacik
@ 2024-05-22 22:21  1% ` Qu Wenruo
  2024-05-23 14:02  1%   ` Filipe Manana
  2024-05-23 17:03  1% ` David Sterba
  9 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-22 22:21 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/5/23 00:06, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> A few places can unnecessarily create an empty transaction and then commit
> it, when the goal is just to catch the current transaction and wait for
> its commit to complete. This results in wasting IO, time and rotation of
> the precious backup roots in the super block. Details in the change logs.
> The patches are all independent, except patch 4 that applies on top of
> patch 3 (but could have been done in any order really, they are independent).

Looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Have you considered outputting a warning if we're committing an empty
transaction (for debug builds)?

That would prevent such problems from happening again.
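
Something like this, perhaps (just a sketch; btrfs_transaction_is_empty()
is a hypothetical helper, the real check would need to decide what
counts as "no changes"):

	#ifdef CONFIG_BTRFS_DEBUG
		/* Warn when a commit is requested for a transaction with no changes. */
		WARN_ON_ONCE(btrfs_transaction_is_empty(trans->transaction));
	#endif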

Thanks,
Qu
>
> Filipe Manana (7):
>    btrfs: qgroup: avoid start/commit empty transaction when flushing reservations
>    btrfs: avoid create and commit empty transaction when committing super
>    btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient
>    btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate()
>    btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned()
>    btrfs: add and use helper to commit the current transaction
>    btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()
>
>   fs/btrfs/disk-io.c     |  8 +-------
>   fs/btrfs/qgroup.c      | 31 +++++--------------------------
>   fs/btrfs/scrub.c       |  6 +-----
>   fs/btrfs/send.c        | 32 ++++++++------------------------
>   fs/btrfs/space-info.c  |  9 +--------
>   fs/btrfs/super.c       | 11 +----------
>   fs/btrfs/transaction.c | 19 +++++++++++++++++++
>   fs/btrfs/transaction.h |  1 +
>   8 files changed, 37 insertions(+), 80 deletions(-)
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: move fiemap code from extent_io.c to inode.c
  2024-05-22 17:33  1% ` David Sterba
@ 2024-05-22 20:18  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-22 20:18 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs, Josef Bacik

On Wed, May 22, 2024 at 6:34 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Wed, May 22, 2024 at 03:43:58PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Currently the core of the fiemap code lives in extent_io.c, which does
> > not make any sense because it's not related to extent IO at all (and it
> > was not as well before the big rewrite of fiemap I did some time ago).
> >
> > Fiemap is an inode operation and its entry point is defined at inode.c,
> > where it really belongs. So move all the fiemap code from extent_io.c
> > into inode.c. This is a simple move without any other changes, only
> > extent_fiemap() is made static after being moved to inode.c and its
> > prototype declaration removed from extent_io.h.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/extent_io.c | 871 ------------------------------------------
> >  fs/btrfs/extent_io.h |   2 -
> >  fs/btrfs/inode.c     | 872 +++++++++++++++++++++++++++++++++++++++++++
>
> With so much code moved and no dependencies, you could also move it to a
> new file so we don't bloat inode.c.

Sounds good, updated to:

https://lore.kernel.org/linux-btrfs/d7579e89a2926ae126ba42794de3e7c39726f6eb.1716408773.git.fdmanana@suse.com/

Thanks.
>

^ permalink raw reply	[relevance 1%]

* [PATCH] btrfs: move fiemap code into its own file
@ 2024-05-22 20:15  1% fdmanana
  2024-05-23 10:25  1% ` Johannes Thumshirn
  2024-05-23 16:33  1% ` David Sterba
  0 siblings, 2 replies; 200+ results
From: fdmanana @ 2024-05-22 20:15 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not related even before the big rewrite of fiemap I did some time
ago). The entry point for fiemap, btrfs_fiemap(), lives in inode.c
since it's an inode operation.

Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/Makefile    |   2 +-
 fs/btrfs/extent_io.c | 871 ----------------------------------------
 fs/btrfs/extent_io.h |   2 -
 fs/btrfs/fiemap.c    | 930 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/fiemap.h    |  11 +
 fs/btrfs/inode.c     |  52 +--
 6 files changed, 943 insertions(+), 925 deletions(-)
 create mode 100644 fs/btrfs/fiemap.c
 create mode 100644 fs/btrfs/fiemap.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 525af975f61c..50b19d15e956 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -33,7 +33,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
 	   subpage.o tree-mod-log.o extent-io-tree.o fs.o messages.o bio.o \
-	   lru_cache.o raid-stripe-tree.o
+	   lru_cache.o raid-stripe-tree.o fiemap.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf50301ee528..f2898f45a4d6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2470,877 +2470,6 @@ bool try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(io_tree, page, mask);
 }
 
-struct btrfs_fiemap_entry {
-	u64 offset;
-	u64 phys;
-	u64 len;
-	u32 flags;
-};
-
-/*
- * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file
- * range from the inode's io tree, unlock the subvolume tree search path, flush
- * the fiemap cache and relock the file range and research the subvolume tree.
- * The value here is something negative that can't be confused with a valid
- * errno value and different from 1 because that's also a return value from
- * fiemap_fill_next_extent() and also it's often used to mean some btree search
- * did not find a key, so make it some distinct negative value.
- */
-#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
-
-/*
- * Used to:
- *
- * - Cache the next entry to be emitted to the fiemap buffer, so that we can
- *   merge extents that are contiguous and can be grouped as a single one;
- *
- * - Store extents ready to be written to the fiemap buffer in an intermediary
- *   buffer. This intermediary buffer is to ensure that in case the fiemap
- *   buffer is memory mapped to the fiemap target file, we don't deadlock
- *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
- *   an extent range in order to prevent races with delalloc flushing and
- *   ordered extent completion, which is needed in order to reliably detect
- *   delalloc in holes and prealloc extents. And this can lead to a deadlock
- *   if the fiemap buffer is memory mapped to the file we are running fiemap
- *   against (a silly, useless in practice scenario, but possible) because
- *   btrfs_page_mkwrite() will try to lock the same extent range.
- */
-struct fiemap_cache {
-	/* An array of ready fiemap entries. */
-	struct btrfs_fiemap_entry *entries;
-	/* Number of entries in the entries array. */
-	int entries_size;
-	/* Index of the next entry in the entries array to write to. */
-	int entries_pos;
-	/*
-	 * Once the entries array is full, this indicates what's the offset for
-	 * the next file extent item we must search for in the inode's subvolume
-	 * tree after unlocking the extent range in the inode's io tree and
-	 * releasing the search path.
-	 */
-	u64 next_search_offset;
-	/*
-	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
-	 * to count the extents we emitted ourselves and stop instead of relying on
-	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
-	 * the @entries array, and we want to stop as soon as we hit the max
-	 * amount of extents to map, not just to save time but also to make the
-	 * logic at extent_fiemap() simpler.
-	 */
-	unsigned int extents_mapped;
-	/* Fields for the cached extent (unsubmitted, not ready, extent). */
-	u64 offset;
-	u64 phys;
-	u64 len;
-	u32 flags;
-	bool cached;
-};
-
-static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
-			      struct fiemap_cache *cache)
-{
-	for (int i = 0; i < cache->entries_pos; i++) {
-		struct btrfs_fiemap_entry *entry = &cache->entries[i];
-		int ret;
-
-		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
-					      entry->phys, entry->len,
-					      entry->flags);
-		/*
-		 * Ignore 1 (reached max entries) because we keep track of that
-		 * ourselves in emit_fiemap_extent().
-		 */
-		if (ret < 0)
-			return ret;
-	}
-	cache->entries_pos = 0;
-
-	return 0;
-}
-
-/*
- * Helper to submit fiemap extent.
- *
- * Will try to merge the current fiemap extent specified by @offset, @phys,
- * @len and @flags with the cached one.
- * Only when we fail to merge will the cached one be submitted as a
- * fiemap extent.
- *
- * Return value is the same as fiemap_fill_next_extent().
- */
-static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
-				struct fiemap_cache *cache,
-				u64 offset, u64 phys, u64 len, u32 flags)
-{
-	struct btrfs_fiemap_entry *entry;
-	u64 cache_end;
-
-	/* Set at the end of extent_fiemap(). */
-	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
-
-	if (!cache->cached)
-		goto assign;
-
-	/*
-	 * When iterating the extents of the inode, at extent_fiemap(), we may
-	 * find an extent that starts at an offset behind the end offset of the
-	 * previous extent we processed. This happens if fiemap is called
-	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
-	 * after we had to unlock the file range, release the search path, emit
-	 * the fiemap extents stored in the buffer (cache->entries array) and
-	 * then lock the remainder of the range and re-search the btree.
-	 *
-	 * For example we are in leaf X processing its last item, which is the
-	 * file extent item for file range [512K, 1M[, and after
-	 * btrfs_next_leaf() releases the path, there's an ordered extent that
-	 * completes for the file range [768K, 2M[, and that results in trimming
-	 * the file extent item so that it now corresponds to the file range
-	 * [512K, 768K[ and a new file extent item is inserted for the file
-	 * range [768K, 2M[, which may end up as the last item of leaf X or as
-	 * the first item of the next leaf - in either case btrfs_next_leaf()
-	 * will leave us with a path pointing to the new extent item, for the
-	 * file range [768K, 2M[, since that's the first key that follows the
-	 * last one we processed. So in order not to report overlapping extents
-	 * to user space, we trim the length of the previously cached extent and
-	 * emit it.
-	 *
-	 * Upon calling btrfs_next_leaf() we may also find an extent with an
-	 * offset smaller than or equals to cache->offset, and this happens
-	 * when we had a hole or prealloc extent with several delalloc ranges in
-	 * it, but after btrfs_next_leaf() released the path, delalloc was
-	 * flushed and the resulting ordered extents were completed, so we can
-	 * now have found a file extent item for an offset that is smaller than
-	 * or equals to what we have in cache->offset. We deal with this as
-	 * described below.
-	 */
-	cache_end = cache->offset + cache->len;
-	if (cache_end > offset) {
-		if (offset == cache->offset) {
-			/*
-			 * We cached a delalloc range (found in the io tree) for
-			 * a hole or prealloc extent and we have now found a
-			 * file extent item for the same offset. What we have
-			 * now is more recent and up to date, so discard what
-			 * we had in the cache and use what we have just found.
-			 */
-			goto assign;
-		} else if (offset > cache->offset) {
-			/*
-			 * The extent range we previously found ends after the
-			 * offset of the file extent item we found and that
-			 * offset falls somewhere in the middle of that previous
-			 * extent range. So adjust the range we previously found
-			 * to end at the offset of the file extent item we have
-			 * just found, since this extent is more up to date.
-			 * Emit that adjusted range and cache the file extent
-			 * item we have just found. This corresponds to the case
-			 * where a previously found file extent item was split
-			 * due to an ordered extent completing.
-			 */
-			cache->len = offset - cache->offset;
-			goto emit;
-		} else {
-			const u64 range_end = offset + len;
-
-			/*
-			 * The offset of the file extent item we have just found
-			 * is behind the cached offset. This means we were
-			 * processing a hole or prealloc extent for which we
-			 * have found delalloc ranges (in the io tree), so what
-			 * we have in the cache is the last delalloc range we
-			 * found while the file extent item we found can be
-			 * either for a whole delalloc range we previously
-			 * emitted or only a part of that range.
-			 *
-			 * We have two cases here:
-			 *
-			 * 1) The file extent item's range ends at or behind the
-			 *    cached extent's end. In this case just ignore the
-			 *    current file extent item because we don't want to
-			 *    overlap with previous ranges that may have been
-			 *    emitted already;
-			 *
-			 * 2) The file extent item starts behind the currently
-			 *    cached extent but its end offset goes beyond the
-			 *    end offset of the cached extent. We don't want to
-			 *    overlap with a previous range that may have been
-			 *    emitted already, so we emit the currently cached
-			 *    extent and then partially store the current file
-			 *    extent item's range in the cache, for the subrange
-			 *    going from the cached extent's end to the end of the
-			 *    file extent item.
-			 */
-			if (range_end <= cache_end)
-				return 0;
-
-			if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
-				phys += cache_end - offset;
-
-			offset = cache_end;
-			len = range_end - cache_end;
-			goto emit;
-		}
-	}
-
-	/*
-	 * Only merges fiemap extents if
-	 * 1) Their logical addresses are continuous
-	 *
-	 * 2) Their physical addresses are continuous
-	 *    So truly compressed (physical size smaller than logical size)
-	 *    extents won't get merged with each other
-	 *
-	 * 3) Share same flags
-	 */
-	if (cache->offset + cache->len  == offset &&
-	    cache->phys + cache->len == phys  &&
-	    cache->flags == flags) {
-		cache->len += len;
-		return 0;
-	}
-
-emit:
-	/* Not mergeable, need to submit cached one */
-
-	if (cache->entries_pos == cache->entries_size) {
-		/*
-		 * We will need to research for the end offset of the last
-		 * stored extent and not from the current offset, because after
-		 * unlocking the range and releasing the path, if there's a hole
-		 * between that end offset and this current offset, a new extent
-		 * may have been inserted due to a new write, so we don't want
-		 * to miss it.
-		 */
-		entry = &cache->entries[cache->entries_size - 1];
-		cache->next_search_offset = entry->offset + entry->len;
-		cache->cached = false;
-
-		return BTRFS_FIEMAP_FLUSH_CACHE;
-	}
-
-	entry = &cache->entries[cache->entries_pos];
-	entry->offset = cache->offset;
-	entry->phys = cache->phys;
-	entry->len = cache->len;
-	entry->flags = cache->flags;
-	cache->entries_pos++;
-	cache->extents_mapped++;
-
-	if (cache->extents_mapped == fieinfo->fi_extents_max) {
-		cache->cached = false;
-		return 1;
-	}
-assign:
-	cache->cached = true;
-	cache->offset = offset;
-	cache->phys = phys;
-	cache->len = len;
-	cache->flags = flags;
-
-	return 0;
-}
-
-/*
- * Emit last fiemap cache
- *
- * The last fiemap cache may still be cached in the following case:
- * 0		      4k		    8k
- * |<- Fiemap range ->|
- * |<------------  First extent ----------->|
- *
- * In this case, the first extent range will be cached but not emitted.
- * So we must emit it before ending extent_fiemap().
- */
-static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
-				  struct fiemap_cache *cache)
-{
-	int ret;
-
-	if (!cache->cached)
-		return 0;
-
-	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
-				      cache->len, cache->flags);
-	cache->cached = false;
-	if (ret > 0)
-		ret = 0;
-	return ret;
-}
-
-static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path)
-{
-	struct extent_buffer *clone = path->nodes[0];
-	struct btrfs_key key;
-	int slot;
-	int ret;
-
-	path->slots[0]++;
-	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
-		return 0;
-
-	/*
-	 * Add a temporary extra ref to an already cloned extent buffer to
-	 * prevent btrfs_next_leaf() freeing it, we want to reuse it to avoid
-	 * the cost of allocating a new one.
-	 */
-	ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags));
-	atomic_inc(&clone->refs);
-
-	ret = btrfs_next_leaf(inode->root, path);
-	if (ret != 0)
-		goto out;
-
-	/*
-	 * Don't bother with cloning if there are no more file extent items for
-	 * our inode.
-	 */
-	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) {
-		ret = 1;
-		goto out;
-	}
-
-	/*
-	 * Important to preserve the start field, for the optimizations when
-	 * checking if extents are shared (see extent_fiemap()).
-	 *
-	 * We must set ->start before calling copy_extent_buffer_full().  If we
-	 * are on sub-pagesize blocksize, we use ->start to determine the offset
-	 * into the folio where our eb exists, and if we update ->start after
-	 * the fact then any subsequent reads of the eb may read from a
-	 * different offset in the folio than where we originally copied into.
-	 */
-	clone->start = path->nodes[0]->start;
-	/* See the comment at fiemap_search_slot() about why we clone. */
-	copy_extent_buffer_full(clone, path->nodes[0]);
-
-	slot = path->slots[0];
-	btrfs_release_path(path);
-	path->nodes[0] = clone;
-	path->slots[0] = slot;
-out:
-	if (ret)
-		free_extent_buffer(clone);
-
-	return ret;
-}
-
-/*
- * Search for the first file extent item that starts at a given file offset or
- * the one that starts immediately before that offset.
- * Returns: 0 on success, < 0 on error, 1 if not found.
- */
-static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
-			      u64 file_offset)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct btrfs_root *root = inode->root;
-	struct extent_buffer *clone;
-	struct btrfs_key key;
-	int slot;
-	int ret;
-
-	key.objectid = ino;
-	key.type = BTRFS_EXTENT_DATA_KEY;
-	key.offset = file_offset;
-
-	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
-	if (ret < 0)
-		return ret;
-
-	if (ret > 0 && path->slots[0] > 0) {
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
-		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
-			path->slots[0]--;
-	}
-
-	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
-		ret = btrfs_next_leaf(root, path);
-		if (ret != 0)
-			return ret;
-
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
-			return 1;
-	}
-
-	/*
-	 * We clone the leaf and use it during fiemap. This is because while
-	 * using the leaf we do expensive things like checking if an extent is
-	 * shared, which can take a long time. In order to prevent blocking
-	 * other tasks for too long, we use a clone of the leaf. We have locked
-	 * the file range in the inode's io tree, so we know none of our file
-	 * extent items can change. This way we avoid blocking other tasks that
-	 * want to insert items for other inodes in the same leaf or b+tree
-	 * rebalance operations (triggered for example when someone is trying
-	 * to push items into this leaf when trying to insert an item in a
-	 * neighbour leaf).
-	 * We also need the private clone because holding a read lock on an
-	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
-	 * when we check if extents are shared, as backref walking may need to
-	 * lock the same leaf we are processing.
-	 */
-	clone = btrfs_clone_extent_buffer(path->nodes[0]);
-	if (!clone)
-		return -ENOMEM;
-
-	slot = path->slots[0];
-	btrfs_release_path(path);
-	path->nodes[0] = clone;
-	path->slots[0] = slot;
-
-	return 0;
-}
-
-/*
- * Process a range which is a hole or a prealloc extent in the inode's subvolume
- * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
- * extent. The end offset (@end) is inclusive.
- */
-static int fiemap_process_hole(struct btrfs_inode *inode,
-			       struct fiemap_extent_info *fieinfo,
-			       struct fiemap_cache *cache,
-			       struct extent_state **delalloc_cached_state,
-			       struct btrfs_backref_share_check_ctx *backref_ctx,
-			       u64 disk_bytenr, u64 extent_offset,
-			       u64 extent_gen,
-			       u64 start, u64 end)
-{
-	const u64 i_size = i_size_read(&inode->vfs_inode);
-	u64 cur_offset = start;
-	u64 last_delalloc_end = 0;
-	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
-	bool checked_extent_shared = false;
-	int ret;
-
-	/*
-	 * There can be no delalloc past i_size, so don't waste time looking for
-	 * it beyond i_size.
-	 */
-	while (cur_offset < end && cur_offset < i_size) {
-		u64 delalloc_start;
-		u64 delalloc_end;
-		u64 prealloc_start;
-		u64 prealloc_len = 0;
-		bool delalloc;
-
-		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
-							delalloc_cached_state,
-							&delalloc_start,
-							&delalloc_end);
-		if (!delalloc)
-			break;
-
-		/*
-		 * If this is a prealloc extent we have to report every section
-		 * of it that has no delalloc.
-		 */
-		if (disk_bytenr != 0) {
-			if (last_delalloc_end == 0) {
-				prealloc_start = start;
-				prealloc_len = delalloc_start - start;
-			} else {
-				prealloc_start = last_delalloc_end + 1;
-				prealloc_len = delalloc_start - prealloc_start;
-			}
-		}
-
-		if (prealloc_len > 0) {
-			if (!checked_extent_shared && fieinfo->fi_extents_max) {
-				ret = btrfs_is_data_extent_shared(inode,
-								  disk_bytenr,
-								  extent_gen,
-								  backref_ctx);
-				if (ret < 0)
-					return ret;
-				else if (ret > 0)
-					prealloc_flags |= FIEMAP_EXTENT_SHARED;
-
-				checked_extent_shared = true;
-			}
-			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
-						 disk_bytenr + extent_offset,
-						 prealloc_len, prealloc_flags);
-			if (ret)
-				return ret;
-			extent_offset += prealloc_len;
-		}
-
-		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
-					 delalloc_end + 1 - delalloc_start,
-					 FIEMAP_EXTENT_DELALLOC |
-					 FIEMAP_EXTENT_UNKNOWN);
-		if (ret)
-			return ret;
-
-		last_delalloc_end = delalloc_end;
-		cur_offset = delalloc_end + 1;
-		extent_offset += cur_offset - delalloc_start;
-		cond_resched();
-	}
-
-	/*
-	 * Either we found no delalloc for the whole prealloc extent or we have
-	 * a prealloc extent that spans i_size or starts at or after i_size.
-	 */
-	if (disk_bytenr != 0 && last_delalloc_end < end) {
-		u64 prealloc_start;
-		u64 prealloc_len;
-
-		if (last_delalloc_end == 0) {
-			prealloc_start = start;
-			prealloc_len = end + 1 - start;
-		} else {
-			prealloc_start = last_delalloc_end + 1;
-			prealloc_len = end + 1 - prealloc_start;
-		}
-
-		if (!checked_extent_shared && fieinfo->fi_extents_max) {
-			ret = btrfs_is_data_extent_shared(inode,
-							  disk_bytenr,
-							  extent_gen,
-							  backref_ctx);
-			if (ret < 0)
-				return ret;
-			else if (ret > 0)
-				prealloc_flags |= FIEMAP_EXTENT_SHARED;
-		}
-		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
-					 disk_bytenr + extent_offset,
-					 prealloc_len, prealloc_flags);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
-
-static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
-					  struct btrfs_path *path,
-					  u64 *last_extent_end_ret)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct btrfs_root *root = inode->root;
-	struct extent_buffer *leaf;
-	struct btrfs_file_extent_item *ei;
-	struct btrfs_key key;
-	u64 disk_bytenr;
-	int ret;
-
-	/*
-	 * Lookup the last file extent. We're not using i_size here because
-	 * there might be preallocation past i_size.
-	 */
-	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
-	/* There can't be a file extent item at offset (u64)-1 */
-	ASSERT(ret != 0);
-	if (ret < 0)
-		return ret;
-
-	/*
-	 * For a non-existing key, btrfs_search_slot() always leaves us at a
-	 * slot > 0, except if the btree is empty, which is impossible because
-	 * at least it has the inode item for this inode and all the items for
-	 * the root inode 256.
-	 */
-	ASSERT(path->slots[0] > 0);
-	path->slots[0]--;
-	leaf = path->nodes[0];
-	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
-		/* No file extent items in the subvolume tree. */
-		*last_extent_end_ret = 0;
-		return 0;
-	}
-
-	/*
-	 * For an inline extent, the disk_bytenr is where inline data starts at,
-	 * so first check if we have an inline extent item before checking if we
-	 * have an implicit hole (disk_bytenr == 0).
-	 */
-	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
-	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
-		*last_extent_end_ret = btrfs_file_extent_end(path);
-		return 0;
-	}
-
-	/*
-	 * Find the last file extent item that is not a hole (when NO_HOLES is
-	 * not enabled). This should take at most 2 iterations in the worst
-	 * case: we have one hole file extent item at slot 0 of a leaf and
-	 * another hole file extent item as the last item in the previous leaf.
-	 * This is because we merge file extent items that represent holes.
-	 */
-	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-	while (disk_bytenr == 0) {
-		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
-		if (ret < 0) {
-			return ret;
-		} else if (ret > 0) {
-			/* No file extent items that are not holes. */
-			*last_extent_end_ret = 0;
-			return 0;
-		}
-		leaf = path->nodes[0];
-		ei = btrfs_item_ptr(leaf, path->slots[0],
-				    struct btrfs_file_extent_item);
-		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-	}
-
-	*last_extent_end_ret = btrfs_file_extent_end(path);
-	return 0;
-}
-
-int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
-		  u64 start, u64 len)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct extent_state *cached_state = NULL;
-	struct extent_state *delalloc_cached_state = NULL;
-	struct btrfs_path *path;
-	struct fiemap_cache cache = { 0 };
-	struct btrfs_backref_share_check_ctx *backref_ctx;
-	u64 last_extent_end;
-	u64 prev_extent_end;
-	u64 range_start;
-	u64 range_end;
-	const u64 sectorsize = inode->root->fs_info->sectorsize;
-	bool stopped = false;
-	int ret;
-
-	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
-	cache.entries = kmalloc_array(cache.entries_size,
-				      sizeof(struct btrfs_fiemap_entry),
-				      GFP_KERNEL);
-	backref_ctx = btrfs_alloc_backref_share_check_ctx();
-	path = btrfs_alloc_path();
-	if (!cache.entries || !backref_ctx || !path) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-restart:
-	range_start = round_down(start, sectorsize);
-	range_end = round_up(start + len, sectorsize);
-	prev_extent_end = range_start;
-
-	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
-
-	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
-	if (ret < 0)
-		goto out_unlock;
-	btrfs_release_path(path);
-
-	path->reada = READA_FORWARD;
-	ret = fiemap_search_slot(inode, path, range_start);
-	if (ret < 0) {
-		goto out_unlock;
-	} else if (ret > 0) {
-		/*
-		 * No file extent item found, but we may have delalloc between
-		 * the current offset and i_size. So check for that.
-		 */
-		ret = 0;
-		goto check_eof_delalloc;
-	}
-
-	while (prev_extent_end < range_end) {
-		struct extent_buffer *leaf = path->nodes[0];
-		struct btrfs_file_extent_item *ei;
-		struct btrfs_key key;
-		u64 extent_end;
-		u64 extent_len;
-		u64 extent_offset = 0;
-		u64 extent_gen;
-		u64 disk_bytenr = 0;
-		u64 flags = 0;
-		int extent_type;
-		u8 compression;
-
-		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
-			break;
-
-		extent_end = btrfs_file_extent_end(path);
-
-		/*
-		 * The first iteration can leave us at an extent item that ends
-		 * before our range's start. Move to the next item.
-		 */
-		if (extent_end <= range_start)
-			goto next_item;
-
-		backref_ctx->curr_leaf_bytenr = leaf->start;
-
-		/* We have an implicit hole (NO_HOLES feature enabled). */
-		if (prev_extent_end < key.offset) {
-			const u64 hole_end = min(key.offset, range_end) - 1;
-
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx, 0, 0, 0,
-						  prev_extent_end, hole_end);
-			if (ret < 0) {
-				goto out_unlock;
-			} else if (ret > 0) {
-				/* fiemap_fill_next_extent() told us to stop. */
-				stopped = true;
-				break;
-			}
-
-			/* We've reached the end of the fiemap range, stop. */
-			if (key.offset >= range_end) {
-				stopped = true;
-				break;
-			}
-		}
-
-		extent_len = extent_end - key.offset;
-		ei = btrfs_item_ptr(leaf, path->slots[0],
-				    struct btrfs_file_extent_item);
-		compression = btrfs_file_extent_compression(leaf, ei);
-		extent_type = btrfs_file_extent_type(leaf, ei);
-		extent_gen = btrfs_file_extent_generation(leaf, ei);
-
-		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
-			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-			if (compression == BTRFS_COMPRESS_NONE)
-				extent_offset = btrfs_file_extent_offset(leaf, ei);
-		}
-
-		if (compression != BTRFS_COMPRESS_NONE)
-			flags |= FIEMAP_EXTENT_ENCODED;
-
-		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
-			flags |= FIEMAP_EXTENT_DATA_INLINE;
-			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
-			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
-						 extent_len, flags);
-		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx,
-						  disk_bytenr, extent_offset,
-						  extent_gen, key.offset,
-						  extent_end - 1);
-		} else if (disk_bytenr == 0) {
-			/* We have an explicit hole. */
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx, 0, 0, 0,
-						  key.offset, extent_end - 1);
-		} else {
-			/* We have a regular extent. */
-			if (fieinfo->fi_extents_max) {
-				ret = btrfs_is_data_extent_shared(inode,
-								  disk_bytenr,
-								  extent_gen,
-								  backref_ctx);
-				if (ret < 0)
-					goto out_unlock;
-				else if (ret > 0)
-					flags |= FIEMAP_EXTENT_SHARED;
-			}
-
-			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
-						 disk_bytenr + extent_offset,
-						 extent_len, flags);
-		}
-
-		if (ret < 0) {
-			goto out_unlock;
-		} else if (ret > 0) {
-			/* emit_fiemap_extent() told us to stop. */
-			stopped = true;
-			break;
-		}
-
-		prev_extent_end = extent_end;
-next_item:
-		if (fatal_signal_pending(current)) {
-			ret = -EINTR;
-			goto out_unlock;
-		}
-
-		ret = fiemap_next_leaf_item(inode, path);
-		if (ret < 0) {
-			goto out_unlock;
-		} else if (ret > 0) {
-			/* No more file extent items for this inode. */
-			break;
-		}
-		cond_resched();
-	}
-
-check_eof_delalloc:
-	if (!stopped && prev_extent_end < range_end) {
-		ret = fiemap_process_hole(inode, fieinfo, &cache,
-					  &delalloc_cached_state, backref_ctx,
-					  0, 0, 0, prev_extent_end, range_end - 1);
-		if (ret < 0)
-			goto out_unlock;
-		prev_extent_end = range_end;
-	}
-
-	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
-		const u64 i_size = i_size_read(&inode->vfs_inode);
-
-		if (prev_extent_end < i_size) {
-			u64 delalloc_start;
-			u64 delalloc_end;
-			bool delalloc;
-
-			delalloc = btrfs_find_delalloc_in_range(inode,
-								prev_extent_end,
-								i_size - 1,
-								&delalloc_cached_state,
-								&delalloc_start,
-								&delalloc_end);
-			if (!delalloc)
-				cache.flags |= FIEMAP_EXTENT_LAST;
-		} else {
-			cache.flags |= FIEMAP_EXTENT_LAST;
-		}
-	}
-
-out_unlock:
-	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
-
-	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
-		btrfs_release_path(path);
-		ret = flush_fiemap_cache(fieinfo, &cache);
-		if (ret)
-			goto out;
-		len -= cache.next_search_offset - start;
-		start = cache.next_search_offset;
-		goto restart;
-	} else if (ret < 0) {
-		goto out;
-	}
-
-	/*
-	 * Must free the path before emitting to the fiemap buffer because we
-	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
-	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
-	 * waiting for an ordered extent that in order to complete needs to
-	 * modify that leaf, therefore leading to a deadlock.
-	 */
-	btrfs_free_path(path);
-	path = NULL;
-
-	ret = flush_fiemap_cache(fieinfo, &cache);
-	if (ret)
-		goto out;
-
-	ret = emit_last_fiemap_cache(fieinfo, &cache);
-out:
-	free_extent_state(delalloc_cached_state);
-	kfree(cache.entries);
-	btrfs_free_backref_share_ctx(backref_ctx);
-	btrfs_free_path(path);
-	return ret;
-}
-
 static void __free_extent_buffer(struct extent_buffer *eb)
 {
 	kmem_cache_free(extent_buffer_cache, eb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index dca6b12769ec..ecf89424502e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -242,8 +242,6 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
 int btree_write_cache_pages(struct address_space *mapping,
 			    struct writeback_control *wbc);
 void btrfs_readahead(struct readahead_control *rac);
-int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
-		  u64 start, u64 len);
 int set_folio_extent_mapped(struct folio *folio);
 int set_page_extent_mapped(struct page *page);
 void clear_page_extent_mapped(struct page *page);
diff --git a/fs/btrfs/fiemap.c b/fs/btrfs/fiemap.c
new file mode 100644
index 000000000000..8f95f3e44e99
--- /dev/null
+++ b/fs/btrfs/fiemap.c
@@ -0,0 +1,930 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "backref.h"
+#include "btrfs_inode.h"
+#include "fiemap.h"
+#include "file.h"
+#include "file-item.h"
+
+struct btrfs_fiemap_entry {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+};
+
+/*
+ * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file
+ * range from the inode's io tree, unlock the subvolume tree search path, flush
+ * the fiemap cache and relock the file range and research the subvolume tree.
+ * The value here is something negative that can't be confused with a valid
+ * errno value and different from 1 because that's also a return value from
+ * fiemap_fill_next_extent() and also it's often used to mean some btree search
+ * did not find a key, so make it some distinct negative value.
+ */
+#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
+
+/*
+ * Used to:
+ *
+ * - Cache the next entry to be emitted to the fiemap buffer, so that we can
+ *   merge extents that are contiguous and can be grouped as a single one;
+ *
+ * - Store extents ready to be written to the fiemap buffer in an intermediary
+ *   buffer. This intermediary buffer is to ensure that in case the fiemap
+ *   buffer is memory mapped to the fiemap target file, we don't deadlock
+ *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
+ *   an extent range in order to prevent races with delalloc flushing and
+ *   ordered extent completion, which is needed in order to reliably detect
+ *   delalloc in holes and prealloc extents. And this can lead to a deadlock
+ *   if the fiemap buffer is memory mapped to the file we are running fiemap
+ *   against (a silly, useless in practice scenario, but possible) because
+ *   btrfs_page_mkwrite() will try to lock the same extent range.
+ */
+struct fiemap_cache {
+	/* An array of ready fiemap entries. */
+	struct btrfs_fiemap_entry *entries;
+	/* Number of entries in the entries array. */
+	int entries_size;
+	/* Index of the next entry in the entries array to write to. */
+	int entries_pos;
+	/*
+	 * Once the entries array is full, this indicates what's the offset for
+	 * the next file extent item we must search for in the inode's subvolume
+	 * tree after unlocking the extent range in the inode's io tree and
+	 * releasing the search path.
+	 */
+	u64 next_search_offset;
+	/*
+	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
+	 * to count the extents we emitted ourselves and stop instead of relying on
+	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
+	 * the @entries array, and we want to stop as soon as we hit the max
+	 * amount of extents to map, not just to save time but also to make the
+	 * logic at extent_fiemap() simpler.
+	 */
+	unsigned int extents_mapped;
+	/* Fields for the cached extent (unsubmitted, not ready, extent). */
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+	bool cached;
+};
+
+static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
+			      struct fiemap_cache *cache)
+{
+	for (int i = 0; i < cache->entries_pos; i++) {
+		struct btrfs_fiemap_entry *entry = &cache->entries[i];
+		int ret;
+
+		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
+					      entry->phys, entry->len,
+					      entry->flags);
+		/*
+		 * Ignore 1 (reached max entries) because we keep track of that
+		 * ourselves in emit_fiemap_extent().
+		 */
+		if (ret < 0)
+			return ret;
+	}
+	cache->entries_pos = 0;
+
+	return 0;
+}
+
+/*
+ * Helper to submit fiemap extent.
+ *
+ * Will try to merge the current fiemap extent, specified by @offset, @phys,
+ * @len and @flags, with the cached one.
+ * Only when we fail to merge is the cached one submitted as a fiemap
+ * extent.
+ *
+ * Return value is the same as fiemap_fill_next_extent().
+ */
+static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+				struct fiemap_cache *cache,
+				u64 offset, u64 phys, u64 len, u32 flags)
+{
+	struct btrfs_fiemap_entry *entry;
+	u64 cache_end;
+
+	/* Set at the end of extent_fiemap(). */
+	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
+
+	if (!cache->cached)
+		goto assign;
+
+	/*
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * after we had to unlock the file range, release the search path, emit
+	 * the fiemap extents stored in the buffer (cache->entries array) and
+	 * then lock the remainder of the range and re-search the btree.
+	 *
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
+	 *
+	 * Upon calling btrfs_next_leaf() we may also find an extent with an
+	 * offset smaller than or equal to cache->offset, and this happens
+	 * when we had a hole or prealloc extent with several delalloc ranges in
+	 * it, but after btrfs_next_leaf() released the path, delalloc was
+	 * flushed and the resulting ordered extents were completed, so we can
+	 * now have found a file extent item for an offset that is smaller than
+	 * or equal to what we have in cache->offset. We deal with this as
+	 * described below.
+	 */
+	cache_end = cache->offset + cache->len;
+	if (cache_end > offset) {
+		if (offset == cache->offset) {
+			/*
+			 * We cached a delalloc range (found in the io tree) for
+			 * a hole or prealloc extent and we have now found a
+			 * file extent item for the same offset. What we have
+			 * now is more recent and up to date, so discard what
+			 * we had in the cache and use what we have just found.
+			 */
+			goto assign;
+		} else if (offset > cache->offset) {
+			/*
+			 * The extent range we previously found ends after the
+			 * offset of the file extent item we found and that
+			 * offset falls somewhere in the middle of that previous
+			 * extent range. So adjust the range we previously found
+			 * to end at the offset of the file extent item we have
+			 * just found, since this extent is more up to date.
+			 * Emit that adjusted range and cache the file extent
+			 * item we have just found. This corresponds to the case
+			 * where a previously found file extent item was split
+			 * due to an ordered extent completing.
+			 */
+			cache->len = offset - cache->offset;
+			goto emit;
+		} else {
+			const u64 range_end = offset + len;
+
+			/*
+			 * The offset of the file extent item we have just found
+			 * is behind the cached offset. This means we were
+			 * processing a hole or prealloc extent for which we
+			 * have found delalloc ranges (in the io tree), so what
+			 * we have in the cache is the last delalloc range we
+			 * found while the file extent item we found can be
+			 * either for a whole delalloc range we previously
+			 * emitted or only a part of that range.
+			 *
+			 * We have two cases here:
+			 *
+			 * 1) The file extent item's range ends at or behind the
+			 *    cached extent's end. In this case just ignore the
+			 *    current file extent item because we don't want to
+			 *    overlap with previous ranges that may have been
+			 *    emitted already;
+			 *
+			 * 2) The file extent item starts behind the currently
+			 *    cached extent but its end offset goes beyond the
+			 *    end offset of the cached extent. We don't want to
+			 *    overlap with a previous range that may have been
+			 *    emitted already, so we emit the currently cached
+			 *    extent and then partially store the current file
+			 *    extent item's range in the cache, for the subrange
+			 *    going from the cached extent's end to the end of the
+			 *    file extent item.
+			 */
+			if (range_end <= cache_end)
+				return 0;
+
+			if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
+				phys += cache_end - offset;
+
+			offset = cache_end;
+			len = range_end - cache_end;
+			goto emit;
+		}
+	}
+
+	/*
+	 * Only merge fiemap extents if:
+	 * 1) Their logical addresses are continuous
+	 *
+	 * 2) Their physical addresses are continuous
+	 *    So truly compressed (physical size smaller than logical size)
+	 *    extents won't get merged with each other
+	 *
+	 * 3) They share the same flags
+	 */
+	if (cache->offset + cache->len == offset &&
+	    cache->phys + cache->len == phys &&
+	    cache->flags == flags) {
+		cache->len += len;
+		return 0;
+	}
+
+emit:
+	/* Not mergeable, need to submit the cached one. */
+
+	if (cache->entries_pos == cache->entries_size) {
+		/*
+		 * We will need to restart the search from the end offset of the
+		 * last stored extent, and not from the current offset, because after
+		 * unlocking the range and releasing the path, if there's a hole
+		 * between that end offset and this current offset, a new extent
+		 * may have been inserted due to a new write, so we don't want
+		 * to miss it.
+		 */
+		entry = &cache->entries[cache->entries_size - 1];
+		cache->next_search_offset = entry->offset + entry->len;
+		cache->cached = false;
+
+		return BTRFS_FIEMAP_FLUSH_CACHE;
+	}
+
+	entry = &cache->entries[cache->entries_pos];
+	entry->offset = cache->offset;
+	entry->phys = cache->phys;
+	entry->len = cache->len;
+	entry->flags = cache->flags;
+	cache->entries_pos++;
+	cache->extents_mapped++;
+
+	if (cache->extents_mapped == fieinfo->fi_extents_max) {
+		cache->cached = false;
+		return 1;
+	}
+assign:
+	cache->cached = true;
+	cache->offset = offset;
+	cache->phys = phys;
+	cache->len = len;
+	cache->flags = flags;
+
+	return 0;
+}
+
+/*
+ * Emit the last fiemap cache entry.
+ *
+ * The last entry may still be cached (not yet emitted) in the following case:
+ * 0		      4k		    8k
+ * |<- Fiemap range ->|
+ * |<------------  First extent ----------->|
+ *
+ * In this case, the first extent range will be cached but not emitted.
+ * So we must emit it before ending extent_fiemap().
+ */
+static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
+				  struct fiemap_cache *cache)
+{
+	int ret;
+
+	if (!cache->cached)
+		return 0;
+
+	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
+				      cache->len, cache->flags);
+	cache->cached = false;
+	if (ret > 0)
+		ret = 0;
+	return ret;
+}
+
+static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path)
+{
+	struct extent_buffer *clone = path->nodes[0];
+	struct btrfs_key key;
+	int slot;
+	int ret;
+
+	path->slots[0]++;
+	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
+		return 0;
+
+	/*
+	 * Add a temporary extra ref to an already cloned extent buffer to
+	 * prevent btrfs_next_leaf() freeing it, we want to reuse it to avoid
+	 * the cost of allocating a new one.
+	 */
+	ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags));
+	atomic_inc(&clone->refs);
+
+	ret = btrfs_next_leaf(inode->root, path);
+	if (ret != 0)
+		goto out;
+
+	/*
+	 * Don't bother with cloning if there are no more file extent items for
+	 * our inode.
+	 */
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) {
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Important to preserve the start field, for the optimizations when
+	 * checking if extents are shared (see extent_fiemap()).
+	 *
+	 * We must set ->start before calling copy_extent_buffer_full().  If we
+	 * are on sub-pagesize blocksize, we use ->start to determine the offset
+	 * into the folio where our eb exists, and if we update ->start after
+	 * the fact then any subsequent reads of the eb may read from a
+	 * different offset in the folio than where we originally copied into.
+	 */
+	clone->start = path->nodes[0]->start;
+	/* See the comment at fiemap_search_slot() about why we clone. */
+	copy_extent_buffer_full(clone, path->nodes[0]);
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+out:
+	if (ret)
+		free_extent_buffer(clone);
+
+	return ret;
+}
+
+/*
+ * Search for the first file extent item that starts at a given file offset or
+ * the one that starts immediately before that offset.
+ * Returns: 0 on success, < 0 on error, 1 if not found.
+ */
+static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
+			      u64 file_offset)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *clone;
+	struct btrfs_key key;
+	int slot;
+	int ret;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = file_offset;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	if (ret > 0 && path->slots[0] > 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path->slots[0]--;
+	}
+
+	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret != 0)
+			return ret;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			return 1;
+	}
+
+	/*
+	 * We clone the leaf and use it during fiemap. This is because while
+	 * using the leaf we do expensive things like checking if an extent is
+	 * shared, which can take a long time. In order to prevent blocking
+	 * other tasks for too long, we use a clone of the leaf. We have locked
+	 * the file range in the inode's io tree, so we know none of our file
+	 * extent items can change. This way we avoid blocking other tasks that
+	 * want to insert items for other inodes in the same leaf or b+tree
+	 * rebalance operations (triggered for example when someone is trying
+	 * to push items into this leaf when trying to insert an item in a
+	 * neighbour leaf).
+	 * We also need the private clone because holding a read lock on an
+	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
+	 * when we check if extents are shared, as backref walking may need to
+	 * lock the same leaf we are processing.
+	 */
+	clone = btrfs_clone_extent_buffer(path->nodes[0]);
+	if (!clone)
+		return -ENOMEM;
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+
+	return 0;
+}
+
+/*
+ * Process a range which is a hole or a prealloc extent in the inode's subvolume
+ * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
+ * extent. The end offset (@end) is inclusive.
+ */
+static int fiemap_process_hole(struct btrfs_inode *inode,
+			       struct fiemap_extent_info *fieinfo,
+			       struct fiemap_cache *cache,
+			       struct extent_state **delalloc_cached_state,
+			       struct btrfs_backref_share_check_ctx *backref_ctx,
+			       u64 disk_bytenr, u64 extent_offset,
+			       u64 extent_gen,
+			       u64 start, u64 end)
+{
+	const u64 i_size = i_size_read(&inode->vfs_inode);
+	u64 cur_offset = start;
+	u64 last_delalloc_end = 0;
+	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
+	bool checked_extent_shared = false;
+	int ret;
+
+	/*
+	 * There can be no delalloc past i_size, so don't waste time looking for
+	 * it beyond i_size.
+	 */
+	while (cur_offset < end && cur_offset < i_size) {
+		u64 delalloc_start;
+		u64 delalloc_end;
+		u64 prealloc_start;
+		u64 prealloc_len = 0;
+		bool delalloc;
+
+		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
+							delalloc_cached_state,
+							&delalloc_start,
+							&delalloc_end);
+		if (!delalloc)
+			break;
+
+		/*
+		 * If this is a prealloc extent we have to report every section
+		 * of it that has no delalloc.
+		 */
+		if (disk_bytenr != 0) {
+			if (last_delalloc_end == 0) {
+				prealloc_start = start;
+				prealloc_len = delalloc_start - start;
+			} else {
+				prealloc_start = last_delalloc_end + 1;
+				prealloc_len = delalloc_start - prealloc_start;
+			}
+		}
+
+		if (prealloc_len > 0) {
+			if (!checked_extent_shared && fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(inode,
+								  disk_bytenr,
+								  extent_gen,
+								  backref_ctx);
+				if (ret < 0)
+					return ret;
+				else if (ret > 0)
+					prealloc_flags |= FIEMAP_EXTENT_SHARED;
+
+				checked_extent_shared = true;
+			}
+			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+						 disk_bytenr + extent_offset,
+						 prealloc_len, prealloc_flags);
+			if (ret)
+				return ret;
+			extent_offset += prealloc_len;
+		}
+
+		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
+					 delalloc_end + 1 - delalloc_start,
+					 FIEMAP_EXTENT_DELALLOC |
+					 FIEMAP_EXTENT_UNKNOWN);
+		if (ret)
+			return ret;
+
+		last_delalloc_end = delalloc_end;
+		cur_offset = delalloc_end + 1;
+		extent_offset += cur_offset - delalloc_start;
+		cond_resched();
+	}
+
+	/*
+	 * Either we found no delalloc for the whole prealloc extent or we have
+	 * a prealloc extent that spans i_size or starts at or after i_size.
+	 */
+	if (disk_bytenr != 0 && last_delalloc_end < end) {
+		u64 prealloc_start;
+		u64 prealloc_len;
+
+		if (last_delalloc_end == 0) {
+			prealloc_start = start;
+			prealloc_len = end + 1 - start;
+		} else {
+			prealloc_start = last_delalloc_end + 1;
+			prealloc_len = end + 1 - prealloc_start;
+		}
+
+		if (!checked_extent_shared && fieinfo->fi_extents_max) {
+			ret = btrfs_is_data_extent_shared(inode,
+							  disk_bytenr,
+							  extent_gen,
+							  backref_ctx);
+			if (ret < 0)
+				return ret;
+			else if (ret > 0)
+				prealloc_flags |= FIEMAP_EXTENT_SHARED;
+		}
+		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+					 disk_bytenr + extent_offset,
+					 prealloc_len, prealloc_flags);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
+					  struct btrfs_path *path,
+					  u64 *last_extent_end_ret)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *ei;
+	struct btrfs_key key;
+	u64 disk_bytenr;
+	int ret;
+
+	/*
+	 * Look up the last file extent. We're not using i_size here because
+	 * there might be preallocation past i_size.
+	 */
+	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
+	/* There can't be a file extent item at offset (u64)-1 */
+	ASSERT(ret != 0);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * For a non-existing key, btrfs_search_slot() always leaves us at a
+	 * slot > 0, except if the btree is empty, which is impossible because
+	 * at least it has the inode item for this inode and all the items for
+	 * the root inode 256.
+	 */
+	ASSERT(path->slots[0] > 0);
+	path->slots[0]--;
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+		/* No file extent items in the subvolume tree. */
+		*last_extent_end_ret = 0;
+		return 0;
+	}
+
+	/*
+	 * For an inline extent, the disk_bytenr is where inline data starts at,
+	 * so first check if we have an inline extent item before checking if we
+	 * have an implicit hole (disk_bytenr == 0).
+	 */
+	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
+		*last_extent_end_ret = btrfs_file_extent_end(path);
+		return 0;
+	}
+
+	/*
+	 * Find the last file extent item that is not a hole (when NO_HOLES is
+	 * not enabled). This should take at most 2 iterations in the worst
+	 * case: we have one hole file extent item at slot 0 of a leaf and
+	 * another hole file extent item as the last item in the previous leaf.
+	 * This is because we merge file extent items that represent holes.
+	 */
+	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	while (disk_bytenr == 0) {
+		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
+		if (ret < 0) {
+			return ret;
+		} else if (ret > 0) {
+			/* No file extent items that are not holes. */
+			*last_extent_end_ret = 0;
+			return 0;
+		}
+		leaf = path->nodes[0];
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	}
+
+	*last_extent_end_ret = btrfs_file_extent_end(path);
+	return 0;
+}
+
+static int extent_fiemap(struct btrfs_inode *inode,
+			 struct fiemap_extent_info *fieinfo,
+			 u64 start, u64 len)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
+	struct extent_state *delalloc_cached_state = NULL;
+	struct btrfs_path *path;
+	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_share_check_ctx *backref_ctx;
+	u64 last_extent_end;
+	u64 prev_extent_end;
+	u64 range_start;
+	u64 range_end;
+	const u64 sectorsize = inode->root->fs_info->sectorsize;
+	bool stopped = false;
+	int ret;
+
+	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
+	cache.entries = kmalloc_array(cache.entries_size,
+				      sizeof(struct btrfs_fiemap_entry),
+				      GFP_KERNEL);
+	backref_ctx = btrfs_alloc_backref_share_check_ctx();
+	path = btrfs_alloc_path();
+	if (!cache.entries || !backref_ctx || !path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+restart:
+	range_start = round_down(start, sectorsize);
+	range_end = round_up(start + len, sectorsize);
+	prev_extent_end = range_start;
+
+	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
+	if (ret < 0)
+		goto out_unlock;
+	btrfs_release_path(path);
+
+	path->reada = READA_FORWARD;
+	ret = fiemap_search_slot(inode, path, range_start);
+	if (ret < 0) {
+		goto out_unlock;
+	} else if (ret > 0) {
+		/*
+		 * No file extent item found, but we may have delalloc between
+		 * the current offset and i_size. So check for that.
+		 */
+		ret = 0;
+		goto check_eof_delalloc;
+	}
+
+	while (prev_extent_end < range_end) {
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_file_extent_item *ei;
+		struct btrfs_key key;
+		u64 extent_end;
+		u64 extent_len;
+		u64 extent_offset = 0;
+		u64 extent_gen;
+		u64 disk_bytenr = 0;
+		u64 flags = 0;
+		int extent_type;
+		u8 compression;
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			break;
+
+		extent_end = btrfs_file_extent_end(path);
+
+		/*
+		 * The first iteration can leave us at an extent item that ends
+		 * before our range's start. Move to the next item.
+		 */
+		if (extent_end <= range_start)
+			goto next_item;
+
+		backref_ctx->curr_leaf_bytenr = leaf->start;
+
+		/* We have an implicit hole (NO_HOLES feature enabled). */
+		if (prev_extent_end < key.offset) {
+			const u64 hole_end = min(key.offset, range_end) - 1;
+
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx, 0, 0, 0,
+						  prev_extent_end, hole_end);
+			if (ret < 0) {
+				goto out_unlock;
+			} else if (ret > 0) {
+				/* fiemap_fill_next_extent() told us to stop. */
+				stopped = true;
+				break;
+			}
+
+			/* We've reached the end of the fiemap range, stop. */
+			if (key.offset >= range_end) {
+				stopped = true;
+				break;
+			}
+		}
+
+		extent_len = extent_end - key.offset;
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		compression = btrfs_file_extent_compression(leaf, ei);
+		extent_type = btrfs_file_extent_type(leaf, ei);
+		extent_gen = btrfs_file_extent_generation(leaf, ei);
+
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+			if (compression == BTRFS_COMPRESS_NONE)
+				extent_offset = btrfs_file_extent_offset(leaf, ei);
+		}
+
+		if (compression != BTRFS_COMPRESS_NONE)
+			flags |= FIEMAP_EXTENT_ENCODED;
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+			flags |= FIEMAP_EXTENT_DATA_INLINE;
+			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
+						 extent_len, flags);
+		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx,
+						  disk_bytenr, extent_offset,
+						  extent_gen, key.offset,
+						  extent_end - 1);
+		} else if (disk_bytenr == 0) {
+			/* We have an explicit hole. */
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx, 0, 0, 0,
+						  key.offset, extent_end - 1);
+		} else {
+			/* We have a regular extent. */
+			if (fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(inode,
+								  disk_bytenr,
+								  extent_gen,
+								  backref_ctx);
+				if (ret < 0)
+					goto out_unlock;
+				else if (ret > 0)
+					flags |= FIEMAP_EXTENT_SHARED;
+			}
+
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
+						 disk_bytenr + extent_offset,
+						 extent_len, flags);
+		}
+
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* emit_fiemap_extent() told us to stop. */
+			stopped = true;
+			break;
+		}
+
+		prev_extent_end = extent_end;
+next_item:
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out_unlock;
+		}
+
+		ret = fiemap_next_leaf_item(inode, path);
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* No more file extent items for this inode. */
+			break;
+		}
+		cond_resched();
+	}
+
+check_eof_delalloc:
+	if (!stopped && prev_extent_end < range_end) {
+		ret = fiemap_process_hole(inode, fieinfo, &cache,
+					  &delalloc_cached_state, backref_ctx,
+					  0, 0, 0, prev_extent_end, range_end - 1);
+		if (ret < 0)
+			goto out_unlock;
+		prev_extent_end = range_end;
+	}
+
+	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
+		const u64 i_size = i_size_read(&inode->vfs_inode);
+
+		if (prev_extent_end < i_size) {
+			u64 delalloc_start;
+			u64 delalloc_end;
+			bool delalloc;
+
+			delalloc = btrfs_find_delalloc_in_range(inode,
+								prev_extent_end,
+								i_size - 1,
+								&delalloc_cached_state,
+								&delalloc_start,
+								&delalloc_end);
+			if (!delalloc)
+				cache.flags |= FIEMAP_EXTENT_LAST;
+		} else {
+			cache.flags |= FIEMAP_EXTENT_LAST;
+		}
+	}
+
+out_unlock:
+	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
+		btrfs_release_path(path);
+		ret = flush_fiemap_cache(fieinfo, &cache);
+		if (ret)
+			goto out;
+		len -= cache.next_search_offset - start;
+		start = cache.next_search_offset;
+		goto restart;
+	} else if (ret < 0) {
+		goto out;
+	}
+
+	/*
+	 * Must free the path before emitting to the fiemap buffer because we
+	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
+	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
+	 * waiting for an ordered extent that, in order to complete, needs to
+	 * modify that leaf, leading to a deadlock.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	ret = flush_fiemap_cache(fieinfo, &cache);
+	if (ret)
+		goto out;
+
+	ret = emit_last_fiemap_cache(fieinfo, &cache);
+out:
+	free_extent_state(delalloc_cached_state);
+	kfree(cache.entries);
+	btrfs_free_backref_share_ctx(backref_ctx);
+	btrfs_free_path(path);
+	return ret;
+}
+
+int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		 u64 start, u64 len)
+{
+	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
+	int ret;
+
+	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
+	if (ret)
+		return ret;
+
+	/*
+	 * fiemap_prep() called filemap_write_and_wait() for the whole possible
+	 * file range (0 to LLONG_MAX), but that is not enough if we have
+	 * compression enabled. The first filemap_fdatawrite_range() only kicks
+	 * off the compression of data (in an async thread) and will return
+	 * before the compression is done and writeback is started. A second
+	 * filemap_fdatawrite_range() is needed to wait for the compression to
+	 * complete and writeback to start. We also need to wait for ordered
+	 * extents to complete, because our fiemap implementation uses mainly
+	 * file extent items to list the extents, searching for extent maps
+	 * only for file ranges with holes or prealloc extents to figure out
+	 * if we have delalloc in those ranges.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
+		if (ret)
+			return ret;
+	}
+
+	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	/*
+	 * We did an initial flush to avoid holding the inode's lock while
+	 * triggering writeback and waiting for the completion of IO and ordered
+	 * extents. Now after we locked the inode we do it again, because it's
+	 * possible a new write may have happened in between those two steps.
+	 */
+	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
+		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
+		if (ret) {
+			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+			return ret;
+		}
+	}
+
+	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
+	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
+
+	return ret;
+}
diff --git a/fs/btrfs/fiemap.h b/fs/btrfs/fiemap.h
new file mode 100644
index 000000000000..cfd74b35988f
--- /dev/null
+++ b/fs/btrfs/fiemap.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_FIEMAP_H
+#define BTRFS_FIEMAP_H
+
+#include <linux/fiemap.h>
+
+int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+		 u64 start, u64 len);
+
+#endif /* BTRFS_FIEMAP_H */
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4410b463c6a..ab16b4ff3612 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -70,6 +70,7 @@
 #include "orphan.h"
 #include "backref.h"
 #include "raid-stripe-tree.h"
+#include "fiemap.h"
 
 struct btrfs_iget_args {
 	u64 ino;
@@ -7929,57 +7930,6 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
 }
 
-static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-			u64 start, u64 len)
-{
-	struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
-	int	ret;
-
-	ret = fiemap_prep(inode, fieinfo, start, &len, 0);
-	if (ret)
-		return ret;
-
-	/*
-	 * fiemap_prep() called filemap_write_and_wait() for the whole possible
-	 * file range (0 to LLONG_MAX), but that is not enough if we have
-	 * compression enabled. The first filemap_fdatawrite_range() only kicks
-	 * in the compression of data (in an async thread) and will return
-	 * before the compression is done and writeback is started. A second
-	 * filemap_fdatawrite_range() is needed to wait for the compression to
-	 * complete and writeback to start. We also need to wait for ordered
-	 * extents to complete, because our fiemap implementation uses mainly
-	 * file extent items to list the extents, searching for extent maps
-	 * only for file ranges with holes or prealloc extents to figure out
-	 * if we have delalloc in those ranges.
-	 */
-	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
-		if (ret)
-			return ret;
-	}
-
-	btrfs_inode_lock(btrfs_inode, BTRFS_ILOCK_SHARED);
-
-	/*
-	 * We did an initial flush to avoid holding the inode's lock while
-	 * triggering writeback and waiting for the completion of IO and ordered
-	 * extents. Now after we locked the inode we do it again, because it's
-	 * possible a new write may have happened in between those two steps.
-	 */
-	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
-		if (ret) {
-			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
-			return ret;
-		}
-	}
-
-	ret = extent_fiemap(btrfs_inode, fieinfo, start, len);
-	btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
-
-	return ret;
-}
-
 /*
  * For release_folio() and invalidate_folio() we have a race window where
  * folio_end_writeback() is called but the subpage spinlock is not yet released.
-- 
2.43.0
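
Since btrfs_fiemap() above is the handler that ultimately services the
FS_IOC_FIEMAP ioctl, here is a minimal sketch of how that entry point gets
exercised from userspace (a standalone example, not part of the patch; error
handling trimmed):

	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/fiemap.h>
	#include <linux/fs.h>

	int main(int argc, char **argv)
	{
		unsigned int count = 32;	/* up to 32 extents per call */
		struct fiemap *fm;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);
		if (fd < 0)
			return 1;

		/* The extent array is allocated right after struct fiemap. */
		fm = calloc(1, sizeof(*fm) + count * sizeof(struct fiemap_extent));
		fm->fm_start = 0;
		fm->fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
		fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
		fm->fm_extent_count = count;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
			return 1;

		for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			printf("logical %llu phys %llu len %llu flags 0x%x\n",
			       (unsigned long long)fe->fe_logical,
			       (unsigned long long)fe->fe_physical,
			       (unsigned long long)fe->fe_length,
			       fe->fe_flags);
		}
		return 0;
	}

Passing FIEMAP_FLAG_SYNC is what makes btrfs_fiemap() run the two
btrfs_wait_ordered_range() calls seen above, so delalloc is flushed and
ordered extents are completed before the file extent items are walked.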



* Re: [PATCH] btrfs: move fiemap code from extent_io.c to inode.c
  2024-05-22 14:43  1% [PATCH] btrfs: move fiemap code from extent_io.c to inode.c fdmanana
  2024-05-22 15:20  1% ` Josef Bacik
@ 2024-05-22 17:33  1% ` David Sterba
  2024-05-22 20:18  1%   ` Filipe Manana
  1 sibling, 1 reply; 200+ results
From: David Sterba @ 2024-05-22 17:33 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 22, 2024 at 03:43:58PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently the core of the fiemap code lives in extent_io.c, which does
> not make any sense because it's not related to extent IO at all (and it
> was not even before the big rewrite of fiemap I did some time ago).
> 
> Fiemap is an inode operation and its entry point is defined in inode.c,
> where it really belongs. So move all the fiemap code from extent_io.c
> into inode.c. This is a simple move without any other changes; only
> extent_fiemap() is made static after being moved to inode.c, and its
> prototype declaration is removed from extent_io.h.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent_io.c | 871 ------------------------------------------
>  fs/btrfs/extent_io.h |   2 -
>  fs/btrfs/inode.c     | 872 +++++++++++++++++++++++++++++++++++++++++++

With so much code moved and no dependencies, you could also move it to a
new file so we don't bloat inode.c.
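
Splitting it out into a new file also needs the kbuild wiring. As a minimal
sketch, assuming fs/btrfs/Makefile keeps its usual pattern of appending to
the module's object list (the exact list is not shown in this excerpt), the
addition would be a single line:

	btrfs-y += fiemap.o

With that in place the new fiemap.c/fiemap.h pair builds into btrfs.ko like
any other source file in the directory.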


* Re: [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (6 preceding siblings ...)
  2024-05-22 14:36  1% ` [PATCH 7/7] btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate() fdmanana
@ 2024-05-22 15:21  1% ` Josef Bacik
  2024-05-22 22:21  1% ` Qu Wenruo
  2024-05-23 17:03  1% ` David Sterba
  9 siblings, 0 replies; 200+ results
From: Josef Bacik @ 2024-05-22 15:21 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 22, 2024 at 03:36:28PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> A few places can unnecessarily create an empty transaction and then commit
> it, when the goal is just to catch the current transaction and wait for
> its commit to complete. This results in wasted IO and time, and in needless
> rotation of the precious backup roots in the super block. Details in the
> change logs. The patches are all independent, except that patch 4 applies on
> top of patch 3 (but they could have been done in any order, really; they are
> independent).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
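
The shape of the change across the series, as a rough sketch (not a verbatim
hunk from the patches): instead of starting and committing a new, possibly
empty, transaction, catch the currently running one and wait for its commit.
This assumes the long-standing btrfs_attach_transaction_barrier() semantics,
where -ENOENT means no transaction is currently running:

	static int catch_and_wait_current_transaction(struct btrfs_root *root)
	{
		struct btrfs_trans_handle *trans;

		trans = btrfs_attach_transaction_barrier(root);
		if (IS_ERR(trans)) {
			/* No running transaction: nothing to wait for. */
			if (PTR_ERR(trans) == -ENOENT)
				return 0;
			return PTR_ERR(trans);
		}
		/* Committing an attached handle waits for the commit. */
		return btrfs_commit_transaction(trans);
	}

Committing an attached handle never creates a new transaction, so the backup
roots in the super block are not rotated for nothing.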


* Re: [PATCH] btrfs: move fiemap code from extent_io.c to inode.c
  2024-05-22 14:43  1% [PATCH] btrfs: move fiemap code from extent_io.c to inode.c fdmanana
@ 2024-05-22 15:20  1% ` Josef Bacik
  2024-05-22 17:33  1% ` David Sterba
  1 sibling, 0 replies; 200+ results
From: Josef Bacik @ 2024-05-22 15:20 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, May 22, 2024 at 03:43:58PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Currently the core of the fiemap code lives in extent_io.c, which does
> not make any sense because it's not related to extent IO at all (and it
> was not even before the big rewrite of fiemap I did some time ago).
> 
> Fiemap is an inode operation and its entry point is defined in inode.c,
> where it really belongs. So move all the fiemap code from extent_io.c
> into inode.c. This is a simple move without any other changes; only
> extent_fiemap() is made static after being moved to inode.c, and its
> prototype declaration is removed from extent_io.h.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Yes please,

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef
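
For reviewing a pure code move like this one, plain git can confirm that
nothing changed besides the location. For example (standard git options,
nothing btrfs-specific):

	git show --color-moved=dimmed-zebra <commit>

Moved-but-unmodified blocks are dimmed, so any accidental edit inside the
relocated fiemap code would stand out.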


* [PATCH] btrfs: move fiemap code from extent_io.c to inode.c
@ 2024-05-22 14:43  1% fdmanana
  2024-05-22 15:20  1% ` Josef Bacik
  2024-05-22 17:33  1% ` David Sterba
  0 siblings, 2 replies; 200+ results
From: fdmanana @ 2024-05-22 14:43 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not even before the big rewrite of fiemap I did some time ago).

Fiemap is an inode operation and its entry point is defined in inode.c,
where it really belongs. So move all the fiemap code from extent_io.c
into inode.c. This is a simple move without any other changes; only
extent_fiemap() is made static after being moved to inode.c, and its
prototype declaration is removed from extent_io.h.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_io.c | 871 ------------------------------------------
 fs/btrfs/extent_io.h |   2 -
 fs/btrfs/inode.c     | 872 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 872 insertions(+), 873 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bf50301ee528..f2898f45a4d6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2470,877 +2470,6 @@ bool try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(io_tree, page, mask);
 }
 
-struct btrfs_fiemap_entry {
-	u64 offset;
-	u64 phys;
-	u64 len;
-	u32 flags;
-};
-
-/*
- * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file
- * range from the inode's io tree, unlock the subvolume tree search path, flush
- * the fiemap cache and relock the file range and research the subvolume tree.
- * The value here is something negative that can't be confused with a valid
- * errno value and different from 1 because that's also a return value from
- * fiemap_fill_next_extent() and also it's often used to mean some btree search
- * did not find a key, so make it some distinct negative value.
- */
-#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
-
-/*
- * Used to:
- *
- * - Cache the next entry to be emitted to the fiemap buffer, so that we can
- *   merge extents that are contiguous and can be grouped as a single one;
- *
- * - Store extents ready to be written to the fiemap buffer in an intermediary
- *   buffer. This intermediary buffer is to ensure that in case the fiemap
- *   buffer is memory mapped to the fiemap target file, we don't deadlock
- *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
- *   an extent range in order to prevent races with delalloc flushing and
- *   ordered extent completion, which is needed in order to reliably detect
- *   delalloc in holes and prealloc extents. And this can lead to a deadlock
- *   if the fiemap buffer is memory mapped to the file we are running fiemap
- *   against (a silly, useless in practice scenario, but possible) because
- *   btrfs_page_mkwrite() will try to lock the same extent range.
- */
-struct fiemap_cache {
-	/* An array of ready fiemap entries. */
-	struct btrfs_fiemap_entry *entries;
-	/* Number of entries in the entries array. */
-	int entries_size;
-	/* Index of the next entry in the entries array to write to. */
-	int entries_pos;
-	/*
-	 * Once the entries array is full, this indicates what's the offset for
-	 * the next file extent item we must search for in the inode's subvolume
-	 * tree after unlocking the extent range in the inode's io tree and
-	 * releasing the search path.
-	 */
-	u64 next_search_offset;
-	/*
-	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
-	 * to count ourselves emitted extents and stop instead of relying on
-	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
-	 * the @entries array, and we want to stop as soon as we hit the max
-	 * amount of extents to map, not just to save time but also to make the
-	 * logic at extent_fiemap() simpler.
-	 */
-	unsigned int extents_mapped;
-	/* Fields for the cached extent (unsubmitted, not ready, extent). */
-	u64 offset;
-	u64 phys;
-	u64 len;
-	u32 flags;
-	bool cached;
-};
-
-static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
-			      struct fiemap_cache *cache)
-{
-	for (int i = 0; i < cache->entries_pos; i++) {
-		struct btrfs_fiemap_entry *entry = &cache->entries[i];
-		int ret;
-
-		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
-					      entry->phys, entry->len,
-					      entry->flags);
-		/*
-		 * Ignore 1 (reached max entries) because we keep track of that
-		 * ourselves in emit_fiemap_extent().
-		 */
-		if (ret < 0)
-			return ret;
-	}
-	cache->entries_pos = 0;
-
-	return 0;
-}
-
-/*
- * Helper to submit fiemap extent.
- *
- * Will try to merge current fiemap extent specified by @offset, @phys,
- * @len and @flags with cached one.
- * And only when we fails to merge, cached one will be submitted as
- * fiemap extent.
- *
- * Return value is the same as fiemap_fill_next_extent().
- */
-static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
-				struct fiemap_cache *cache,
-				u64 offset, u64 phys, u64 len, u32 flags)
-{
-	struct btrfs_fiemap_entry *entry;
-	u64 cache_end;
-
-	/* Set at the end of extent_fiemap(). */
-	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
-
-	if (!cache->cached)
-		goto assign;
-
-	/*
-	 * When iterating the extents of the inode, at extent_fiemap(), we may
-	 * find an extent that starts at an offset behind the end offset of the
-	 * previous extent we processed. This happens if fiemap is called
-	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
-	 * after we had to unlock the file range, release the search path, emit
-	 * the fiemap extents stored in the buffer (cache->entries array) and
-	 * the lock the remainder of the range and re-search the btree.
-	 *
-	 * For example we are in leaf X processing its last item, which is the
-	 * file extent item for file range [512K, 1M[, and after
-	 * btrfs_next_leaf() releases the path, there's an ordered extent that
-	 * completes for the file range [768K, 2M[, and that results in trimming
-	 * the file extent item so that it now corresponds to the file range
-	 * [512K, 768K[ and a new file extent item is inserted for the file
-	 * range [768K, 2M[, which may end up as the last item of leaf X or as
-	 * the first item of the next leaf - in either case btrfs_next_leaf()
-	 * will leave us with a path pointing to the new extent item, for the
-	 * file range [768K, 2M[, since that's the first key that follows the
-	 * last one we processed. So in order not to report overlapping extents
-	 * to user space, we trim the length of the previously cached extent and
-	 * emit it.
-	 *
-	 * Upon calling btrfs_next_leaf() we may also find an extent with an
-	 * offset smaller than or equals to cache->offset, and this happens
-	 * when we had a hole or prealloc extent with several delalloc ranges in
-	 * it, but after btrfs_next_leaf() released the path, delalloc was
-	 * flushed and the resulting ordered extents were completed, so we can
-	 * now have found a file extent item for an offset that is smaller than
-	 * or equals to what we have in cache->offset. We deal with this as
-	 * described below.
-	 */
-	cache_end = cache->offset + cache->len;
-	if (cache_end > offset) {
-		if (offset == cache->offset) {
-			/*
-			 * We cached a dealloc range (found in the io tree) for
-			 * a hole or prealloc extent and we have now found a
-			 * file extent item for the same offset. What we have
-			 * now is more recent and up to date, so discard what
-			 * we had in the cache and use what we have just found.
-			 */
-			goto assign;
-		} else if (offset > cache->offset) {
-			/*
-			 * The extent range we previously found ends after the
-			 * offset of the file extent item we found and that
-			 * offset falls somewhere in the middle of that previous
-			 * extent range. So adjust the range we previously found
-			 * to end at the offset of the file extent item we have
-			 * just found, since this extent is more up to date.
-			 * Emit that adjusted range and cache the file extent
-			 * item we have just found. This corresponds to the case
-			 * where a previously found file extent item was split
-			 * due to an ordered extent completing.
-			 */
-			cache->len = offset - cache->offset;
-			goto emit;
-		} else {
-			const u64 range_end = offset + len;
-
-			/*
-			 * The offset of the file extent item we have just found
-			 * is behind the cached offset. This means we were
-			 * processing a hole or prealloc extent for which we
-			 * have found delalloc ranges (in the io tree), so what
-			 * we have in the cache is the last delalloc range we
-			 * found while the file extent item we found can be
-			 * either for a whole delalloc range we previously
-			 * emmitted or only a part of that range.
-			 *
-			 * We have two cases here:
-			 *
-			 * 1) The file extent item's range ends at or behind the
-			 *    cached extent's end. In this case just ignore the
-			 *    current file extent item because we don't want to
-			 *    overlap with previous ranges that may have been
-			 *    emmitted already;
-			 *
-			 * 2) The file extent item starts behind the currently
-			 *    cached extent but its end offset goes beyond the
-			 *    end offset of the cached extent. We don't want to
-			 *    overlap with a previous range that may have been
-			 *    emmitted already, so we emit the currently cached
-			 *    extent and then partially store the current file
-			 *    extent item's range in the cache, for the subrange
-			 *    going the cached extent's end to the end of the
-			 *    file extent item.
-			 */
-			if (range_end <= cache_end)
-				return 0;
-
-			if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
-				phys += cache_end - offset;
-
-			offset = cache_end;
-			len = range_end - cache_end;
-			goto emit;
-		}
-	}
-
-	/*
-	 * Only merges fiemap extents if
-	 * 1) Their logical addresses are continuous
-	 *
-	 * 2) Their physical addresses are continuous
-	 *    So truly compressed (physical size smaller than logical size)
-	 *    extents won't get merged with each other
-	 *
-	 * 3) Share same flags
-	 */
-	if (cache->offset + cache->len  == offset &&
-	    cache->phys + cache->len == phys  &&
-	    cache->flags == flags) {
-		cache->len += len;
-		return 0;
-	}
-
-emit:
-	/* Not mergeable, need to submit cached one */
-
-	if (cache->entries_pos == cache->entries_size) {
-		/*
-		 * We will need to research for the end offset of the last
-		 * stored extent and not from the current offset, because after
-		 * unlocking the range and releasing the path, if there's a hole
-		 * between that end offset and this current offset, a new extent
-		 * may have been inserted due to a new write, so we don't want
-		 * to miss it.
-		 */
-		entry = &cache->entries[cache->entries_size - 1];
-		cache->next_search_offset = entry->offset + entry->len;
-		cache->cached = false;
-
-		return BTRFS_FIEMAP_FLUSH_CACHE;
-	}
-
-	entry = &cache->entries[cache->entries_pos];
-	entry->offset = cache->offset;
-	entry->phys = cache->phys;
-	entry->len = cache->len;
-	entry->flags = cache->flags;
-	cache->entries_pos++;
-	cache->extents_mapped++;
-
-	if (cache->extents_mapped == fieinfo->fi_extents_max) {
-		cache->cached = false;
-		return 1;
-	}
-assign:
-	cache->cached = true;
-	cache->offset = offset;
-	cache->phys = phys;
-	cache->len = len;
-	cache->flags = flags;
-
-	return 0;
-}
-
-/*
- * Emit last fiemap cache
- *
- * The last fiemap cache may still be cached in the following case:
- * 0		      4k		    8k
- * |<- Fiemap range ->|
- * |<------------  First extent ----------->|
- *
- * In this case, the first extent range will be cached but not emitted.
- * So we must emit it before ending extent_fiemap().
- */
-static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
-				  struct fiemap_cache *cache)
-{
-	int ret;
-
-	if (!cache->cached)
-		return 0;
-
-	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
-				      cache->len, cache->flags);
-	cache->cached = false;
-	if (ret > 0)
-		ret = 0;
-	return ret;
-}
-
-static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path)
-{
-	struct extent_buffer *clone = path->nodes[0];
-	struct btrfs_key key;
-	int slot;
-	int ret;
-
-	path->slots[0]++;
-	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
-		return 0;
-
-	/*
-	 * Add a temporary extra ref to an already cloned extent buffer to
-	 * prevent btrfs_next_leaf() freeing it, we want to reuse it to avoid
-	 * the cost of allocating a new one.
-	 */
-	ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags));
-	atomic_inc(&clone->refs);
-
-	ret = btrfs_next_leaf(inode->root, path);
-	if (ret != 0)
-		goto out;
-
-	/*
-	 * Don't bother with cloning if there are no more file extent items for
-	 * our inode.
-	 */
-	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) {
-		ret = 1;
-		goto out;
-	}
-
-	/*
-	 * Important to preserve the start field, for the optimizations when
-	 * checking if extents are shared (see extent_fiemap()).
-	 *
-	 * We must set ->start before calling copy_extent_buffer_full().  If we
-	 * are on sub-pagesize blocksize, we use ->start to determine the offset
-	 * into the folio where our eb exists, and if we update ->start after
-	 * the fact then any subsequent reads of the eb may read from a
-	 * different offset in the folio than where we originally copied into.
-	 */
-	clone->start = path->nodes[0]->start;
-	/* See the comment at fiemap_search_slot() about why we clone. */
-	copy_extent_buffer_full(clone, path->nodes[0]);
-
-	slot = path->slots[0];
-	btrfs_release_path(path);
-	path->nodes[0] = clone;
-	path->slots[0] = slot;
-out:
-	if (ret)
-		free_extent_buffer(clone);
-
-	return ret;
-}
-
-/*
- * Search for the first file extent item that starts at a given file offset or
- * the one that starts immediately before that offset.
- * Returns: 0 on success, < 0 on error, 1 if not found.
- */
-static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
-			      u64 file_offset)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct btrfs_root *root = inode->root;
-	struct extent_buffer *clone;
-	struct btrfs_key key;
-	int slot;
-	int ret;
-
-	key.objectid = ino;
-	key.type = BTRFS_EXTENT_DATA_KEY;
-	key.offset = file_offset;
-
-	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
-	if (ret < 0)
-		return ret;
-
-	if (ret > 0 && path->slots[0] > 0) {
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
-		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
-			path->slots[0]--;
-	}
-
-	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
-		ret = btrfs_next_leaf(root, path);
-		if (ret != 0)
-			return ret;
-
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
-			return 1;
-	}
-
-	/*
-	 * We clone the leaf and use it during fiemap. This is because while
-	 * using the leaf we do expensive things like checking if an extent is
-	 * shared, which can take a long time. In order to prevent blocking
-	 * other tasks for too long, we use a clone of the leaf. We have locked
-	 * the file range in the inode's io tree, so we know none of our file
-	 * extent items can change. This way we avoid blocking other tasks that
-	 * want to insert items for other inodes in the same leaf or b+tree
-	 * rebalance operations (triggered for example when someone is trying
-	 * to push items into this leaf when trying to insert an item in a
-	 * neighbour leaf).
-	 * We also need the private clone because holding a read lock on an
-	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
-	 * when we check if extents are shared, as backref walking may need to
-	 * lock the same leaf we are processing.
-	 */
-	clone = btrfs_clone_extent_buffer(path->nodes[0]);
-	if (!clone)
-		return -ENOMEM;
-
-	slot = path->slots[0];
-	btrfs_release_path(path);
-	path->nodes[0] = clone;
-	path->slots[0] = slot;
-
-	return 0;
-}
-
-/*
- * Process a range which is a hole or a prealloc extent in the inode's subvolume
- * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
- * extent. The end offset (@end) is inclusive.
- */
-static int fiemap_process_hole(struct btrfs_inode *inode,
-			       struct fiemap_extent_info *fieinfo,
-			       struct fiemap_cache *cache,
-			       struct extent_state **delalloc_cached_state,
-			       struct btrfs_backref_share_check_ctx *backref_ctx,
-			       u64 disk_bytenr, u64 extent_offset,
-			       u64 extent_gen,
-			       u64 start, u64 end)
-{
-	const u64 i_size = i_size_read(&inode->vfs_inode);
-	u64 cur_offset = start;
-	u64 last_delalloc_end = 0;
-	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
-	bool checked_extent_shared = false;
-	int ret;
-
-	/*
-	 * There can be no delalloc past i_size, so don't waste time looking for
-	 * it beyond i_size.
-	 */
-	while (cur_offset < end && cur_offset < i_size) {
-		u64 delalloc_start;
-		u64 delalloc_end;
-		u64 prealloc_start;
-		u64 prealloc_len = 0;
-		bool delalloc;
-
-		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
-							delalloc_cached_state,
-							&delalloc_start,
-							&delalloc_end);
-		if (!delalloc)
-			break;
-
-		/*
-		 * If this is a prealloc extent we have to report every section
-		 * of it that has no delalloc.
-		 */
-		if (disk_bytenr != 0) {
-			if (last_delalloc_end == 0) {
-				prealloc_start = start;
-				prealloc_len = delalloc_start - start;
-			} else {
-				prealloc_start = last_delalloc_end + 1;
-				prealloc_len = delalloc_start - prealloc_start;
-			}
-		}
-
-		if (prealloc_len > 0) {
-			if (!checked_extent_shared && fieinfo->fi_extents_max) {
-				ret = btrfs_is_data_extent_shared(inode,
-								  disk_bytenr,
-								  extent_gen,
-								  backref_ctx);
-				if (ret < 0)
-					return ret;
-				else if (ret > 0)
-					prealloc_flags |= FIEMAP_EXTENT_SHARED;
-
-				checked_extent_shared = true;
-			}
-			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
-						 disk_bytenr + extent_offset,
-						 prealloc_len, prealloc_flags);
-			if (ret)
-				return ret;
-			extent_offset += prealloc_len;
-		}
-
-		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
-					 delalloc_end + 1 - delalloc_start,
-					 FIEMAP_EXTENT_DELALLOC |
-					 FIEMAP_EXTENT_UNKNOWN);
-		if (ret)
-			return ret;
-
-		last_delalloc_end = delalloc_end;
-		cur_offset = delalloc_end + 1;
-		extent_offset += cur_offset - delalloc_start;
-		cond_resched();
-	}
-
-	/*
-	 * Either we found no delalloc for the whole prealloc extent or we have
-	 * a prealloc extent that spans i_size or starts at or after i_size.
-	 */
-	if (disk_bytenr != 0 && last_delalloc_end < end) {
-		u64 prealloc_start;
-		u64 prealloc_len;
-
-		if (last_delalloc_end == 0) {
-			prealloc_start = start;
-			prealloc_len = end + 1 - start;
-		} else {
-			prealloc_start = last_delalloc_end + 1;
-			prealloc_len = end + 1 - prealloc_start;
-		}
-
-		if (!checked_extent_shared && fieinfo->fi_extents_max) {
-			ret = btrfs_is_data_extent_shared(inode,
-							  disk_bytenr,
-							  extent_gen,
-							  backref_ctx);
-			if (ret < 0)
-				return ret;
-			else if (ret > 0)
-				prealloc_flags |= FIEMAP_EXTENT_SHARED;
-		}
-		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
-					 disk_bytenr + extent_offset,
-					 prealloc_len, prealloc_flags);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
-
-static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
-					  struct btrfs_path *path,
-					  u64 *last_extent_end_ret)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct btrfs_root *root = inode->root;
-	struct extent_buffer *leaf;
-	struct btrfs_file_extent_item *ei;
-	struct btrfs_key key;
-	u64 disk_bytenr;
-	int ret;
-
-	/*
-	 * Lookup the last file extent. We're not using i_size here because
-	 * there might be preallocation past i_size.
-	 */
-	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
-	/* There can't be a file extent item at offset (u64)-1 */
-	ASSERT(ret != 0);
-	if (ret < 0)
-		return ret;
-
-	/*
-	 * For a non-existing key, btrfs_search_slot() always leaves us at a
-	 * slot > 0, except if the btree is empty, which is impossible because
-	 * at least it has the inode item for this inode and all the items for
-	 * the root inode 256.
-	 */
-	ASSERT(path->slots[0] > 0);
-	path->slots[0]--;
-	leaf = path->nodes[0];
-	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
-		/* No file extent items in the subvolume tree. */
-		*last_extent_end_ret = 0;
-		return 0;
-	}
-
-	/*
-	 * For an inline extent, the disk_bytenr is where inline data starts at,
-	 * so first check if we have an inline extent item before checking if we
-	 * have an implicit hole (disk_bytenr == 0).
-	 */
-	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
-	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
-		*last_extent_end_ret = btrfs_file_extent_end(path);
-		return 0;
-	}
-
-	/*
-	 * Find the last file extent item that is not a hole (when NO_HOLES is
-	 * not enabled). This should take at most 2 iterations in the worst
-	 * case: we have one hole file extent item at slot 0 of a leaf and
-	 * another hole file extent item as the last item in the previous leaf.
-	 * This is because we merge file extent items that represent holes.
-	 */
-	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-	while (disk_bytenr == 0) {
-		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
-		if (ret < 0) {
-			return ret;
-		} else if (ret > 0) {
-			/* No file extent items that are not holes. */
-			*last_extent_end_ret = 0;
-			return 0;
-		}
-		leaf = path->nodes[0];
-		ei = btrfs_item_ptr(leaf, path->slots[0],
-				    struct btrfs_file_extent_item);
-		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-	}
-
-	*last_extent_end_ret = btrfs_file_extent_end(path);
-	return 0;
-}
-
-int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
-		  u64 start, u64 len)
-{
-	const u64 ino = btrfs_ino(inode);
-	struct extent_state *cached_state = NULL;
-	struct extent_state *delalloc_cached_state = NULL;
-	struct btrfs_path *path;
-	struct fiemap_cache cache = { 0 };
-	struct btrfs_backref_share_check_ctx *backref_ctx;
-	u64 last_extent_end;
-	u64 prev_extent_end;
-	u64 range_start;
-	u64 range_end;
-	const u64 sectorsize = inode->root->fs_info->sectorsize;
-	bool stopped = false;
-	int ret;
-
-	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
-	cache.entries = kmalloc_array(cache.entries_size,
-				      sizeof(struct btrfs_fiemap_entry),
-				      GFP_KERNEL);
-	backref_ctx = btrfs_alloc_backref_share_check_ctx();
-	path = btrfs_alloc_path();
-	if (!cache.entries || !backref_ctx || !path) {
-		ret = -ENOMEM;
-		goto out;
-	}
-
-restart:
-	range_start = round_down(start, sectorsize);
-	range_end = round_up(start + len, sectorsize);
-	prev_extent_end = range_start;
-
-	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
-
-	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
-	if (ret < 0)
-		goto out_unlock;
-	btrfs_release_path(path);
-
-	path->reada = READA_FORWARD;
-	ret = fiemap_search_slot(inode, path, range_start);
-	if (ret < 0) {
-		goto out_unlock;
-	} else if (ret > 0) {
-		/*
-		 * No file extent item found, but we may have delalloc between
-		 * the current offset and i_size. So check for that.
-		 */
-		ret = 0;
-		goto check_eof_delalloc;
-	}
-
-	while (prev_extent_end < range_end) {
-		struct extent_buffer *leaf = path->nodes[0];
-		struct btrfs_file_extent_item *ei;
-		struct btrfs_key key;
-		u64 extent_end;
-		u64 extent_len;
-		u64 extent_offset = 0;
-		u64 extent_gen;
-		u64 disk_bytenr = 0;
-		u64 flags = 0;
-		int extent_type;
-		u8 compression;
-
-		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
-		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
-			break;
-
-		extent_end = btrfs_file_extent_end(path);
-
-		/*
-		 * The first iteration can leave us at an extent item that ends
-		 * before our range's start. Move to the next item.
-		 */
-		if (extent_end <= range_start)
-			goto next_item;
-
-		backref_ctx->curr_leaf_bytenr = leaf->start;
-
-		/* We have an implicit hole (NO_HOLES feature enabled). */
-		if (prev_extent_end < key.offset) {
-			const u64 hole_end = min(key.offset, range_end) - 1;
-
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx, 0, 0, 0,
-						  prev_extent_end, hole_end);
-			if (ret < 0) {
-				goto out_unlock;
-			} else if (ret > 0) {
-				/* fiemap_fill_next_extent() told us to stop. */
-				stopped = true;
-				break;
-			}
-
-			/* We've reached the end of the fiemap range, stop. */
-			if (key.offset >= range_end) {
-				stopped = true;
-				break;
-			}
-		}
-
-		extent_len = extent_end - key.offset;
-		ei = btrfs_item_ptr(leaf, path->slots[0],
-				    struct btrfs_file_extent_item);
-		compression = btrfs_file_extent_compression(leaf, ei);
-		extent_type = btrfs_file_extent_type(leaf, ei);
-		extent_gen = btrfs_file_extent_generation(leaf, ei);
-
-		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
-			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
-			if (compression == BTRFS_COMPRESS_NONE)
-				extent_offset = btrfs_file_extent_offset(leaf, ei);
-		}
-
-		if (compression != BTRFS_COMPRESS_NONE)
-			flags |= FIEMAP_EXTENT_ENCODED;
-
-		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
-			flags |= FIEMAP_EXTENT_DATA_INLINE;
-			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
-			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
-						 extent_len, flags);
-		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx,
-						  disk_bytenr, extent_offset,
-						  extent_gen, key.offset,
-						  extent_end - 1);
-		} else if (disk_bytenr == 0) {
-			/* We have an explicit hole. */
-			ret = fiemap_process_hole(inode, fieinfo, &cache,
-						  &delalloc_cached_state,
-						  backref_ctx, 0, 0, 0,
-						  key.offset, extent_end - 1);
-		} else {
-			/* We have a regular extent. */
-			if (fieinfo->fi_extents_max) {
-				ret = btrfs_is_data_extent_shared(inode,
-								  disk_bytenr,
-								  extent_gen,
-								  backref_ctx);
-				if (ret < 0)
-					goto out_unlock;
-				else if (ret > 0)
-					flags |= FIEMAP_EXTENT_SHARED;
-			}
-
-			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
-						 disk_bytenr + extent_offset,
-						 extent_len, flags);
-		}
-
-		if (ret < 0) {
-			goto out_unlock;
-		} else if (ret > 0) {
-			/* emit_fiemap_extent() told us to stop. */
-			stopped = true;
-			break;
-		}
-
-		prev_extent_end = extent_end;
-next_item:
-		if (fatal_signal_pending(current)) {
-			ret = -EINTR;
-			goto out_unlock;
-		}
-
-		ret = fiemap_next_leaf_item(inode, path);
-		if (ret < 0) {
-			goto out_unlock;
-		} else if (ret > 0) {
-			/* No more file extent items for this inode. */
-			break;
-		}
-		cond_resched();
-	}
-
-check_eof_delalloc:
-	if (!stopped && prev_extent_end < range_end) {
-		ret = fiemap_process_hole(inode, fieinfo, &cache,
-					  &delalloc_cached_state, backref_ctx,
-					  0, 0, 0, prev_extent_end, range_end - 1);
-		if (ret < 0)
-			goto out_unlock;
-		prev_extent_end = range_end;
-	}
-
-	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
-		const u64 i_size = i_size_read(&inode->vfs_inode);
-
-		if (prev_extent_end < i_size) {
-			u64 delalloc_start;
-			u64 delalloc_end;
-			bool delalloc;
-
-			delalloc = btrfs_find_delalloc_in_range(inode,
-								prev_extent_end,
-								i_size - 1,
-								&delalloc_cached_state,
-								&delalloc_start,
-								&delalloc_end);
-			if (!delalloc)
-				cache.flags |= FIEMAP_EXTENT_LAST;
-		} else {
-			cache.flags |= FIEMAP_EXTENT_LAST;
-		}
-	}
-
-out_unlock:
-	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
-
-	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
-		btrfs_release_path(path);
-		ret = flush_fiemap_cache(fieinfo, &cache);
-		if (ret)
-			goto out;
-		len -= cache.next_search_offset - start;
-		start = cache.next_search_offset;
-		goto restart;
-	} else if (ret < 0) {
-		goto out;
-	}
-
-	/*
-	 * Must free the path before emitting to the fiemap buffer because we
-	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
-	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
-	 * a wait on an ordered extent that needs to modify that leaf in order
-	 * to complete, therefore leading to a deadlock.
-	 */
-	btrfs_free_path(path);
-	path = NULL;
-
-	ret = flush_fiemap_cache(fieinfo, &cache);
-	if (ret)
-		goto out;
-
-	ret = emit_last_fiemap_cache(fieinfo, &cache);
-out:
-	free_extent_state(delalloc_cached_state);
-	kfree(cache.entries);
-	btrfs_free_backref_share_ctx(backref_ctx);
-	btrfs_free_path(path);
-	return ret;
-}
-
 static void __free_extent_buffer(struct extent_buffer *eb)
 {
 	kmem_cache_free(extent_buffer_cache, eb);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index dca6b12769ec..ecf89424502e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -242,8 +242,6 @@ int btrfs_writepages(struct address_space *mapping, struct writeback_control *wb
 int btree_write_cache_pages(struct address_space *mapping,
 			    struct writeback_control *wbc);
 void btrfs_readahead(struct readahead_control *rac);
-int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
-		  u64 start, u64 len);
 int set_folio_extent_mapped(struct folio *folio);
 int set_page_extent_mapped(struct page *page);
 void clear_page_extent_mapped(struct page *page);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b4410b463c6a..729873e873ea 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7929,6 +7929,878 @@ struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
 }
 
+struct btrfs_fiemap_entry {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+};
+
+/*
+ * Indicate to the caller of emit_fiemap_extent() that it needs to unlock the
+ * file range from the inode's io tree, release the subvolume tree search path,
+ * flush the fiemap cache, relock the file range and re-search the subvolume tree.
+ * The value here is something negative that can't be confused with a valid
+ * errno value and different from 1 because that's also a return value from
+ * fiemap_fill_next_extent() and also it's often used to mean some btree search
+ * did not find a key, so make it some distinct negative value.
+ */
+#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
+
+/*
+ * Used to:
+ *
+ * - Cache the next entry to be emitted to the fiemap buffer, so that we can
+ *   merge extents that are contiguous and can be grouped as a single one;
+ *
+ * - Store extents ready to be written to the fiemap buffer in an intermediary
+ *   buffer. This intermediary buffer is to ensure that in case the fiemap
+ *   buffer is memory mapped to the fiemap target file, we don't deadlock
+ *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
+ *   an extent range in order to prevent races with delalloc flushing and
+ *   ordered extent completion, which is needed in order to reliably detect
+ *   delalloc in holes and prealloc extents. And this can lead to a deadlock
+ *   if the fiemap buffer is memory mapped to the file we are running fiemap
+ *   against (a silly, useless in practice scenario, but possible) because
+ *   btrfs_page_mkwrite() will try to lock the same extent range.
+ */
+struct fiemap_cache {
+	/* An array of ready fiemap entries. */
+	struct btrfs_fiemap_entry *entries;
+	/* Number of entries in the entries array. */
+	int entries_size;
+	/* Index of the next entry in the entries array to write to. */
+	int entries_pos;
+	/*
+	 * Once the entries array is full, this indicates the offset of the
+	 * next file extent item we must search for in the inode's subvolume
+	 * tree after unlocking the extent range in the inode's io tree and
+	 * releasing the search path.
+	 */
+	u64 next_search_offset;
+	/*
+	 * This matches struct fiemap_extent_info::fi_mapped_extents; we use it
+	 * to count the emitted extents ourselves and stop, instead of relying
+	 * on fiemap_fill_next_extent(), because we buffer ready fiemap entries
+	 * in the @entries array, and we want to stop as soon as we hit the max
+	 * amount of extents to map, not just to save time but also to make the
+	 * logic at extent_fiemap() simpler.
+	 */
+	unsigned int extents_mapped;
+	/* Fields for the cached extent (unsubmitted, not yet ready). */
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+	bool cached;
+};
+
+static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
+			      struct fiemap_cache *cache)
+{
+	for (int i = 0; i < cache->entries_pos; i++) {
+		struct btrfs_fiemap_entry *entry = &cache->entries[i];
+		int ret;
+
+		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
+					      entry->phys, entry->len,
+					      entry->flags);
+		/*
+		 * Ignore 1 (reached max entries) because we keep track of that
+		 * ourselves in emit_fiemap_extent().
+		 */
+		if (ret < 0)
+			return ret;
+	}
+	cache->entries_pos = 0;
+
+	return 0;
+}
+
+/*
+ * Helper to submit a fiemap extent.
+ *
+ * Will try to merge the current fiemap extent, specified by @offset, @phys,
+ * @len and @flags, with the cached one.
+ * Only when the merge fails is the cached one submitted as a
+ * fiemap extent.
+ *
+ * Return value is the same as fiemap_fill_next_extent().
+ */
+static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
+				struct fiemap_cache *cache,
+				u64 offset, u64 phys, u64 len, u32 flags)
+{
+	struct btrfs_fiemap_entry *entry;
+	u64 cache_end;
+
+	/* Set at the end of extent_fiemap(). */
+	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
+
+	if (!cache->cached)
+		goto assign;
+
+	/*
+	 * When iterating the extents of the inode, at extent_fiemap(), we may
+	 * find an extent that starts at an offset behind the end offset of the
+	 * previous extent we processed. This happens if fiemap is called
+	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
+	 * after we had to unlock the file range, release the search path, emit
+	 * the fiemap extents stored in the buffer (cache->entries array) and
+	 * then lock the remainder of the range and re-search the btree.
+	 *
+	 * For example we are in leaf X processing its last item, which is the
+	 * file extent item for file range [512K, 1M[, and after
+	 * btrfs_next_leaf() releases the path, there's an ordered extent that
+	 * completes for the file range [768K, 2M[, and that results in trimming
+	 * the file extent item so that it now corresponds to the file range
+	 * [512K, 768K[ and a new file extent item is inserted for the file
+	 * range [768K, 2M[, which may end up as the last item of leaf X or as
+	 * the first item of the next leaf - in either case btrfs_next_leaf()
+	 * will leave us with a path pointing to the new extent item, for the
+	 * file range [768K, 2M[, since that's the first key that follows the
+	 * last one we processed. So in order not to report overlapping extents
+	 * to user space, we trim the length of the previously cached extent and
+	 * emit it.
+	 *
+	 * Upon calling btrfs_next_leaf() we may also find an extent with an
+	 * offset smaller than or equal to cache->offset, and this happens
+	 * when we had a hole or prealloc extent with several delalloc ranges in
+	 * it, but after btrfs_next_leaf() released the path, delalloc was
+	 * flushed and the resulting ordered extents were completed, so we can
+	 * now have found a file extent item for an offset that is smaller than
+	 * or equals to what we have in cache->offset. We deal with this as
+	 * described below.
+	 */
+	cache_end = cache->offset + cache->len;
+	if (cache_end > offset) {
+		if (offset == cache->offset) {
+			/*
+			 * We cached a delalloc range (found in the io tree) for
+			 * a hole or prealloc extent and we have now found a
+			 * file extent item for the same offset. What we have
+			 * now is more recent and up to date, so discard what
+			 * we had in the cache and use what we have just found.
+			 */
+			goto assign;
+		} else if (offset > cache->offset) {
+			/*
+			 * The extent range we previously found ends after the
+			 * offset of the file extent item we found and that
+			 * offset falls somewhere in the middle of that previous
+			 * extent range. So adjust the range we previously found
+			 * to end at the offset of the file extent item we have
+			 * just found, since this extent is more up to date.
+			 * Emit that adjusted range and cache the file extent
+			 * item we have just found. This corresponds to the case
+			 * where a previously found file extent item was split
+			 * due to an ordered extent completing.
+			 */
+			cache->len = offset - cache->offset;
+			goto emit;
+		} else {
+			const u64 range_end = offset + len;
+
+			/*
+			 * The offset of the file extent item we have just found
+			 * is behind the cached offset. This means we were
+			 * processing a hole or prealloc extent for which we
+			 * have found delalloc ranges (in the io tree), so what
+			 * we have in the cache is the last delalloc range we
+			 * found while the file extent item we found can be
+			 * either for a whole delalloc range we previously
+			 * emitted or only a part of that range.
+			 *
+			 * We have two cases here:
+			 *
+			 * 1) The file extent item's range ends at or behind the
+			 *    cached extent's end. In this case just ignore the
+			 *    current file extent item because we don't want to
+			 *    overlap with previous ranges that may have been
+			 *    emitted already;
+			 *
+			 * 2) The file extent item starts behind the currently
+			 *    cached extent but its end offset goes beyond the
+			 *    end offset of the cached extent. We don't want to
+			 *    overlap with a previous range that may have been
+			 *    emitted already, so we emit the currently cached
+			 *    extent and then partially store the current file
+			 *    extent item's range in the cache, for the subrange
+			 *    going from the cached extent's end to the end of the
+			 *    file extent item.
+			 */
+			if (range_end <= cache_end)
+				return 0;
+
+			if (!(flags & (FIEMAP_EXTENT_ENCODED | FIEMAP_EXTENT_DELALLOC)))
+				phys += cache_end - offset;
+
+			offset = cache_end;
+			len = range_end - cache_end;
+			goto emit;
+		}
+	}
+
+	/*
+	 * Only merges fiemap extents if
+	 * 1) Their logical addresses are contiguous
+	 *
+	 * 2) Their physical addresses are contiguous
+	 *    So truly compressed (physical size smaller than logical size)
+	 *    extents won't get merged with each other
+	 *
+	 * 3) They share the same flags
+	 */
+	if (cache->offset + cache->len == offset &&
+	    cache->phys + cache->len == phys &&
+	    cache->flags == flags) {
+		cache->len += len;
+		return 0;
+	}
+
+emit:
+	/* Not mergeable, need to submit the cached one. */
+
+	if (cache->entries_pos == cache->entries_size) {
+		/*
+		 * We will need to re-search from the end offset of the last
+		 * stored extent and not from the current offset, because after
+		 * unlocking the range and releasing the path, if there's a hole
+		 * between that end offset and this current offset, a new extent
+		 * may have been inserted due to a new write, so we don't want
+		 * to miss it.
+		 */
+		entry = &cache->entries[cache->entries_size - 1];
+		cache->next_search_offset = entry->offset + entry->len;
+		cache->cached = false;
+
+		return BTRFS_FIEMAP_FLUSH_CACHE;
+	}
+
+	entry = &cache->entries[cache->entries_pos];
+	entry->offset = cache->offset;
+	entry->phys = cache->phys;
+	entry->len = cache->len;
+	entry->flags = cache->flags;
+	cache->entries_pos++;
+	cache->extents_mapped++;
+
+	if (cache->extents_mapped == fieinfo->fi_extents_max) {
+		cache->cached = false;
+		return 1;
+	}
+assign:
+	cache->cached = true;
+	cache->offset = offset;
+	cache->phys = phys;
+	cache->len = len;
+	cache->flags = flags;
+
+	return 0;
+}
+
+/*
+ * Emit last fiemap cache
+ *
+ * The last fiemap extent may still be cached in the following case:
+ * 0		      4k		    8k
+ * |<- Fiemap range ->|
+ * |<------------  First extent ----------->|
+ *
+ * In this case, the first extent range will be cached but not emitted.
+ * So we must emit it before ending extent_fiemap().
+ */
+static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
+				  struct fiemap_cache *cache)
+{
+	int ret;
+
+	if (!cache->cached)
+		return 0;
+
+	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
+				      cache->len, cache->flags);
+	cache->cached = false;
+	if (ret > 0)
+		ret = 0;
+	return ret;
+}
+
+static int fiemap_next_leaf_item(struct btrfs_inode *inode, struct btrfs_path *path)
+{
+	struct extent_buffer *clone = path->nodes[0];
+	struct btrfs_key key;
+	int slot;
+	int ret;
+
+	path->slots[0]++;
+	if (path->slots[0] < btrfs_header_nritems(path->nodes[0]))
+		return 0;
+
+	/*
+	 * Add a temporary extra ref to an already cloned extent buffer to
+	 * prevent btrfs_next_leaf() from freeing it, as we want to reuse it to
+	 * avoid the cost of allocating a new one.
+	 */
+	ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, &clone->bflags));
+	atomic_inc(&clone->refs);
+
+	ret = btrfs_next_leaf(inode->root, path);
+	if (ret != 0)
+		goto out;
+
+	/*
+	 * Don't bother with cloning if there are no more file extent items for
+	 * our inode.
+	 */
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	if (key.objectid != btrfs_ino(inode) || key.type != BTRFS_EXTENT_DATA_KEY) {
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Important to preserve the start field, for the optimizations when
+	 * checking if extents are shared (see extent_fiemap()).
+	 *
+	 * We must set ->start before calling copy_extent_buffer_full().  If we
+	 * are on sub-pagesize blocksize, we use ->start to determine the offset
+	 * into the folio where our eb exists, and if we update ->start after
+	 * the fact then any subsequent reads of the eb may read from a
+	 * different offset in the folio than where we originally copied into.
+	 */
+	clone->start = path->nodes[0]->start;
+	/* See the comment at fiemap_search_slot() about why we clone. */
+	copy_extent_buffer_full(clone, path->nodes[0]);
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+out:
+	if (ret)
+		free_extent_buffer(clone);
+
+	return ret;
+}
+
+/*
+ * Search for the first file extent item that starts at a given file offset or
+ * the one that starts immediately before that offset.
+ * Returns: 0 on success, < 0 on error, 1 if not found.
+ */
+static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path,
+			      u64 file_offset)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *clone;
+	struct btrfs_key key;
+	int slot;
+	int ret;
+
+	key.objectid = ino;
+	key.type = BTRFS_EXTENT_DATA_KEY;
+	key.offset = file_offset;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	if (ret > 0 && path->slots[0] > 0) {
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
+		if (key.objectid == ino && key.type == BTRFS_EXTENT_DATA_KEY)
+			path->slots[0]--;
+	}
+
+	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+		ret = btrfs_next_leaf(root, path);
+		if (ret != 0)
+			return ret;
+
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			return 1;
+	}
+
+	/*
+	 * We clone the leaf and use it during fiemap. This is because while
+	 * using the leaf we do expensive things like checking if an extent is
+	 * shared, which can take a long time. In order to prevent blocking
+	 * other tasks for too long, we use a clone of the leaf. We have locked
+	 * the file range in the inode's io tree, so we know none of our file
+	 * extent items can change. This way we avoid blocking other tasks that
+	 * want to insert items for other inodes in the same leaf or b+tree
+	 * rebalance operations (triggered, for example, when another task is
+	 * trying to push items into this leaf while inserting an item in a
+	 * neighbour leaf).
+	 * We also need the private clone because holding a read lock on an
+	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
+	 * when we check if extents are shared, as backref walking may need to
+	 * lock the same leaf we are processing.
+	 */
+	clone = btrfs_clone_extent_buffer(path->nodes[0]);
+	if (!clone)
+		return -ENOMEM;
+
+	slot = path->slots[0];
+	btrfs_release_path(path);
+	path->nodes[0] = clone;
+	path->slots[0] = slot;
+
+	return 0;
+}
+
+/*
+ * Process a range which is a hole or a prealloc extent in the inode's subvolume
+ * btree. If @disk_bytenr is 0, we are dealing with a hole, otherwise a prealloc
+ * extent. The end offset (@end) is inclusive.
+ */
+static int fiemap_process_hole(struct btrfs_inode *inode,
+			       struct fiemap_extent_info *fieinfo,
+			       struct fiemap_cache *cache,
+			       struct extent_state **delalloc_cached_state,
+			       struct btrfs_backref_share_check_ctx *backref_ctx,
+			       u64 disk_bytenr, u64 extent_offset,
+			       u64 extent_gen,
+			       u64 start, u64 end)
+{
+	const u64 i_size = i_size_read(&inode->vfs_inode);
+	u64 cur_offset = start;
+	u64 last_delalloc_end = 0;
+	u32 prealloc_flags = FIEMAP_EXTENT_UNWRITTEN;
+	bool checked_extent_shared = false;
+	int ret;
+
+	/*
+	 * There can be no delalloc past i_size, so don't waste time looking for
+	 * it beyond i_size.
+	 */
+	while (cur_offset < end && cur_offset < i_size) {
+		u64 delalloc_start;
+		u64 delalloc_end;
+		u64 prealloc_start;
+		u64 prealloc_len = 0;
+		bool delalloc;
+
+		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
+							delalloc_cached_state,
+							&delalloc_start,
+							&delalloc_end);
+		if (!delalloc)
+			break;
+
+		/*
+		 * If this is a prealloc extent we have to report every section
+		 * of it that has no delalloc.
+		 */
+		if (disk_bytenr != 0) {
+			if (last_delalloc_end == 0) {
+				prealloc_start = start;
+				prealloc_len = delalloc_start - start;
+			} else {
+				prealloc_start = last_delalloc_end + 1;
+				prealloc_len = delalloc_start - prealloc_start;
+			}
+		}
+
+		if (prealloc_len > 0) {
+			if (!checked_extent_shared && fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(inode,
+								  disk_bytenr,
+								  extent_gen,
+								  backref_ctx);
+				if (ret < 0)
+					return ret;
+				else if (ret > 0)
+					prealloc_flags |= FIEMAP_EXTENT_SHARED;
+
+				checked_extent_shared = true;
+			}
+			ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+						 disk_bytenr + extent_offset,
+						 prealloc_len, prealloc_flags);
+			if (ret)
+				return ret;
+			extent_offset += prealloc_len;
+		}
+
+		ret = emit_fiemap_extent(fieinfo, cache, delalloc_start, 0,
+					 delalloc_end + 1 - delalloc_start,
+					 FIEMAP_EXTENT_DELALLOC |
+					 FIEMAP_EXTENT_UNKNOWN);
+		if (ret)
+			return ret;
+
+		last_delalloc_end = delalloc_end;
+		cur_offset = delalloc_end + 1;
+		extent_offset += cur_offset - delalloc_start;
+		cond_resched();
+	}
+
+	/*
+	 * Either we found no delalloc for the whole prealloc extent or we have
+	 * a prealloc extent that spans i_size or starts at or after i_size.
+	 */
+	if (disk_bytenr != 0 && last_delalloc_end < end) {
+		u64 prealloc_start;
+		u64 prealloc_len;
+
+		if (last_delalloc_end == 0) {
+			prealloc_start = start;
+			prealloc_len = end + 1 - start;
+		} else {
+			prealloc_start = last_delalloc_end + 1;
+			prealloc_len = end + 1 - prealloc_start;
+		}
+
+		if (!checked_extent_shared && fieinfo->fi_extents_max) {
+			ret = btrfs_is_data_extent_shared(inode,
+							  disk_bytenr,
+							  extent_gen,
+							  backref_ctx);
+			if (ret < 0)
+				return ret;
+			else if (ret > 0)
+				prealloc_flags |= FIEMAP_EXTENT_SHARED;
+		}
+		ret = emit_fiemap_extent(fieinfo, cache, prealloc_start,
+					 disk_bytenr + extent_offset,
+					 prealloc_len, prealloc_flags);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int fiemap_find_last_extent_offset(struct btrfs_inode *inode,
+					  struct btrfs_path *path,
+					  u64 *last_extent_end_ret)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct btrfs_root *root = inode->root;
+	struct extent_buffer *leaf;
+	struct btrfs_file_extent_item *ei;
+	struct btrfs_key key;
+	u64 disk_bytenr;
+	int ret;
+
+	/*
+	 * Lookup the last file extent. We're not using i_size here because
+	 * there might be preallocation past i_size.
+	 */
+	ret = btrfs_lookup_file_extent(NULL, root, path, ino, (u64)-1, 0);
+	/* There can't be a file extent item at offset (u64)-1 */
+	ASSERT(ret != 0);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * For a non-existing key, btrfs_search_slot() always leaves us at a
+	 * slot > 0, except if the btree is empty, which is impossible because
+	 * at least it has the inode item for this inode and all the items for
+	 * the root inode 256.
+	 */
+	ASSERT(path->slots[0] > 0);
+	path->slots[0]--;
+	leaf = path->nodes[0];
+	btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+	if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY) {
+		/* No file extent items in the subvolume tree. */
+		*last_extent_end_ret = 0;
+		return 0;
+	}
+
+	/*
+	 * For an inline extent, the disk_bytenr is where inline data starts at,
+	 * so first check if we have an inline extent item before checking if we
+	 * have an implicit hole (disk_bytenr == 0).
+	 */
+	ei = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
+	if (btrfs_file_extent_type(leaf, ei) == BTRFS_FILE_EXTENT_INLINE) {
+		*last_extent_end_ret = btrfs_file_extent_end(path);
+		return 0;
+	}
+
+	/*
+	 * Find the last file extent item that is not a hole (when NO_HOLES is
+	 * not enabled). This should take at most 2 iterations in the worst
+	 * case: we have one hole file extent item at slot 0 of a leaf and
+	 * another hole file extent item as the last item in the previous leaf.
+	 * This is because we merge file extent items that represent holes.
+	 */
+	disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	while (disk_bytenr == 0) {
+		ret = btrfs_previous_item(root, path, ino, BTRFS_EXTENT_DATA_KEY);
+		if (ret < 0) {
+			return ret;
+		} else if (ret > 0) {
+			/* No file extent items that are not holes. */
+			*last_extent_end_ret = 0;
+			return 0;
+		}
+		leaf = path->nodes[0];
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+	}
+
+	*last_extent_end_ret = btrfs_file_extent_end(path);
+	return 0;
+}
+
+static int extent_fiemap(struct btrfs_inode *inode,
+			 struct fiemap_extent_info *fieinfo,
+			 u64 start, u64 len)
+{
+	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
+	struct extent_state *delalloc_cached_state = NULL;
+	struct btrfs_path *path;
+	struct fiemap_cache cache = { 0 };
+	struct btrfs_backref_share_check_ctx *backref_ctx;
+	u64 last_extent_end;
+	u64 prev_extent_end;
+	u64 range_start;
+	u64 range_end;
+	const u64 sectorsize = inode->root->fs_info->sectorsize;
+	bool stopped = false;
+	int ret;
+
+	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
+	cache.entries = kmalloc_array(cache.entries_size,
+				      sizeof(struct btrfs_fiemap_entry),
+				      GFP_KERNEL);
+	backref_ctx = btrfs_alloc_backref_share_check_ctx();
+	path = btrfs_alloc_path();
+	if (!cache.entries || !backref_ctx || !path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+restart:
+	range_start = round_down(start, sectorsize);
+	range_end = round_up(start + len, sectorsize);
+	prev_extent_end = range_start;
+
+	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
+	if (ret < 0)
+		goto out_unlock;
+	btrfs_release_path(path);
+
+	path->reada = READA_FORWARD;
+	ret = fiemap_search_slot(inode, path, range_start);
+	if (ret < 0) {
+		goto out_unlock;
+	} else if (ret > 0) {
+		/*
+		 * No file extent item found, but we may have delalloc between
+		 * the current offset and i_size. So check for that.
+		 */
+		ret = 0;
+		goto check_eof_delalloc;
+	}
+
+	while (prev_extent_end < range_end) {
+		struct extent_buffer *leaf = path->nodes[0];
+		struct btrfs_file_extent_item *ei;
+		struct btrfs_key key;
+		u64 extent_end;
+		u64 extent_len;
+		u64 extent_offset = 0;
+		u64 extent_gen;
+		u64 disk_bytenr = 0;
+		u64 flags = 0;
+		int extent_type;
+		u8 compression;
+
+		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+		if (key.objectid != ino || key.type != BTRFS_EXTENT_DATA_KEY)
+			break;
+
+		extent_end = btrfs_file_extent_end(path);
+
+		/*
+		 * The first iteration can leave us at an extent item that ends
+		 * before our range's start. Move to the next item.
+		 */
+		if (extent_end <= range_start)
+			goto next_item;
+
+		backref_ctx->curr_leaf_bytenr = leaf->start;
+
+		/* We have an implicit hole (NO_HOLES feature enabled). */
+		if (prev_extent_end < key.offset) {
+			const u64 hole_end = min(key.offset, range_end) - 1;
+
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx, 0, 0, 0,
+						  prev_extent_end, hole_end);
+			if (ret < 0) {
+				goto out_unlock;
+			} else if (ret > 0) {
+				/* fiemap_fill_next_extent() told us to stop. */
+				stopped = true;
+				break;
+			}
+
+			/* We've reached the end of the fiemap range, stop. */
+			if (key.offset >= range_end) {
+				stopped = true;
+				break;
+			}
+		}
+
+		extent_len = extent_end - key.offset;
+		ei = btrfs_item_ptr(leaf, path->slots[0],
+				    struct btrfs_file_extent_item);
+		compression = btrfs_file_extent_compression(leaf, ei);
+		extent_type = btrfs_file_extent_type(leaf, ei);
+		extent_gen = btrfs_file_extent_generation(leaf, ei);
+
+		if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+			disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, ei);
+			if (compression == BTRFS_COMPRESS_NONE)
+				extent_offset = btrfs_file_extent_offset(leaf, ei);
+		}
+
+		if (compression != BTRFS_COMPRESS_NONE)
+			flags |= FIEMAP_EXTENT_ENCODED;
+
+		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+			flags |= FIEMAP_EXTENT_DATA_INLINE;
+			flags |= FIEMAP_EXTENT_NOT_ALIGNED;
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset, 0,
+						 extent_len, flags);
+		} else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx,
+						  disk_bytenr, extent_offset,
+						  extent_gen, key.offset,
+						  extent_end - 1);
+		} else if (disk_bytenr == 0) {
+			/* We have an explicit hole. */
+			ret = fiemap_process_hole(inode, fieinfo, &cache,
+						  &delalloc_cached_state,
+						  backref_ctx, 0, 0, 0,
+						  key.offset, extent_end - 1);
+		} else {
+			/* We have a regular extent. */
+			if (fieinfo->fi_extents_max) {
+				ret = btrfs_is_data_extent_shared(inode,
+								  disk_bytenr,
+								  extent_gen,
+								  backref_ctx);
+				if (ret < 0)
+					goto out_unlock;
+				else if (ret > 0)
+					flags |= FIEMAP_EXTENT_SHARED;
+			}
+
+			ret = emit_fiemap_extent(fieinfo, &cache, key.offset,
+						 disk_bytenr + extent_offset,
+						 extent_len, flags);
+		}
+
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* emit_fiemap_extent() told us to stop. */
+			stopped = true;
+			break;
+		}
+
+		prev_extent_end = extent_end;
+next_item:
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			goto out_unlock;
+		}
+
+		ret = fiemap_next_leaf_item(inode, path);
+		if (ret < 0) {
+			goto out_unlock;
+		} else if (ret > 0) {
+			/* No more file extent items for this inode. */
+			break;
+		}
+		cond_resched();
+	}
+
+check_eof_delalloc:
+	if (!stopped && prev_extent_end < range_end) {
+		ret = fiemap_process_hole(inode, fieinfo, &cache,
+					  &delalloc_cached_state, backref_ctx,
+					  0, 0, 0, prev_extent_end, range_end - 1);
+		if (ret < 0)
+			goto out_unlock;
+		prev_extent_end = range_end;
+	}
+
+	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
+		const u64 i_size = i_size_read(&inode->vfs_inode);
+
+		if (prev_extent_end < i_size) {
+			u64 delalloc_start;
+			u64 delalloc_end;
+			bool delalloc;
+
+			delalloc = btrfs_find_delalloc_in_range(inode,
+								prev_extent_end,
+								i_size - 1,
+								&delalloc_cached_state,
+								&delalloc_start,
+								&delalloc_end);
+			if (!delalloc)
+				cache.flags |= FIEMAP_EXTENT_LAST;
+		} else {
+			cache.flags |= FIEMAP_EXTENT_LAST;
+		}
+	}
+
+out_unlock:
+	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
+		btrfs_release_path(path);
+		ret = flush_fiemap_cache(fieinfo, &cache);
+		if (ret)
+			goto out;
+		len -= cache.next_search_offset - start;
+		start = cache.next_search_offset;
+		goto restart;
+	} else if (ret < 0) {
+		goto out;
+	}
+
+	/*
+	 * Must free the path before emitting to the fiemap buffer because we
+	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
+	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
+	 * a wait on an ordered extent that needs to modify that leaf in order
+	 * to complete, therefore leading to a deadlock.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	ret = flush_fiemap_cache(fieinfo, &cache);
+	if (ret)
+		goto out;
+
+	ret = emit_last_fiemap_cache(fieinfo, &cache);
+out:
+	free_extent_state(delalloc_cached_state);
+	kfree(cache.entries);
+	btrfs_free_backref_share_ctx(backref_ctx);
+	btrfs_free_path(path);
+	return ret;
+}
+
 static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 			u64 start, u64 len)
 {
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]
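
Background note: extent_fiemap() is the btrfs backend for the generic
FIEMAP ioctl (FS_IOC_FIEMAP), and the offset/phys/len/flags tuples built by
emit_fiemap_extent() above are what userspace ultimately receives as
struct fiemap_extent records. Below is a minimal userspace sketch of that
interface, for reference only (standard uapi headers and calls, error
handling mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_* flags */

int main(int argc, char **argv)
{
	unsigned int n = 128;	/* max extents to fetch in one call */
	struct fiemap *fm;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	/* struct fiemap ends in a flexible array of struct fiemap_extent. */
	fm = calloc(1, sizeof(*fm) + n * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = n;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];

		printf("logical %llu phys %llu len %llu flags 0x%x\n",
		       (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length, fe->fe_flags);
		if (fe->fe_flags & FIEMAP_EXTENT_LAST)
			break;
	}
	free(fm);
	return 0;
}

Passing FIEMAP_FLAG_SYNC sidesteps most of the delalloc handling discussed
in this patch, since delalloc is flushed before the extents are mapped;
without it, dirty ranges may be reported with FIEMAP_EXTENT_DELALLOC and no
physical address, as fiemap_process_hole() does above.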

* [PATCH 7/7] btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (5 preceding siblings ...)
  2024-05-22 14:36  1% ` [PATCH 6/7] btrfs: add and use helper to commit the current transaction fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 15:21  1% ` [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions Josef Bacik
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Now that there is a helper to commit the current transaction and we are
using it, there's no need for the label and goto statements at
ensure_commit_roots_uptodate(). So replace them with direct return
statements that call btrfs_commit_current_transaction().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 7a82132500a8..2099b5f8c022 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -8001,23 +8001,15 @@ static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 	struct btrfs_root *root = sctx->parent_root;
 
 	if (root && root->node != root->commit_root)
-		goto commit_trans;
+		return btrfs_commit_current_transaction(root);
 
 	for (int i = 0; i < sctx->clone_roots_cnt; i++) {
 		root = sctx->clone_roots[i].root;
 		if (root->node != root->commit_root)
-			goto commit_trans;
+			return btrfs_commit_current_transaction(root);
 	}
 
 	return 0;
-
-commit_trans:
-	/*
-	 * Use the first root we found. We could use any but that would cause
-	 * an unnecessary update of the root's item in the root tree when
-	 * committing the transaction if that root wasn't changed before.
-	 */
-	return btrfs_commit_current_transaction(root);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 6/7] btrfs: add and use helper to commit the current transaction
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (4 preceding siblings ...)
  2024-05-22 14:36  1% ` [PATCH 5/7] btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned() fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 7/7] btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate() fdmanana
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

We have several places that attach to the current transaction with
btrfs_attach_transaction_barrier() and then commit the transaction if
there is one. Add a helper and use it to deduplicate this pattern.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c     | 12 +-----------
 fs/btrfs/qgroup.c      | 33 +++++----------------------------
 fs/btrfs/scrub.c       | 10 +---------
 fs/btrfs/send.c        | 10 +---------
 fs/btrfs/space-info.c  |  9 +--------
 fs/btrfs/super.c       | 11 +----------
 fs/btrfs/transaction.c | 19 +++++++++++++++++++
 fs/btrfs/transaction.h |  1 +
 8 files changed, 30 insertions(+), 75 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2f56f967beb8..78d3966232ae 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4144,9 +4144,6 @@ void btrfs_drop_and_free_fs_root(struct btrfs_fs_info *fs_info,
 
 int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_root *root = fs_info->tree_root;
-	struct btrfs_trans_handle *trans;
-
 	mutex_lock(&fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(fs_info);
 	mutex_unlock(&fs_info->cleaner_mutex);
@@ -4156,14 +4153,7 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 	down_write(&fs_info->cleanup_work_sem);
 	up_write(&fs_info->cleanup_work_sem);
 
-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		int ret = PTR_ERR(trans);
-
-		return (ret == -ENOENT) ? 0 : ret;
-	}
-
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(fs_info->tree_root);
 }
 
 static void warn_about_uncommitted_trans(struct btrfs_fs_info *fs_info)
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 391af9e79dd6..6864a8a201df 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1334,7 +1334,6 @@ int btrfs_quota_enable(struct btrfs_fs_info *fs_info,
  */
 static int flush_reservations(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_trans_handle *trans;
 	int ret;
 
 	ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false);
@@ -1342,13 +1341,7 @@ static int flush_reservations(struct btrfs_fs_info *fs_info)
 		return ret;
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
 
-	trans = btrfs_attach_transaction_barrier(fs_info->tree_root);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		return (ret == -ENOENT) ? 0 : ret;
-	}
-
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(fs_info->tree_root);
 }
 
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
@@ -4028,7 +4021,6 @@ int
 btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
 {
 	int ret = 0;
-	struct btrfs_trans_handle *trans;
 
 	ret = qgroup_rescan_init(fs_info, 0, 1);
 	if (ret)
@@ -4045,16 +4037,10 @@ btrfs_qgroup_rescan(struct btrfs_fs_info *fs_info)
 	 * going to clear all tracking information for a clean start.
 	 */
 
-	trans = btrfs_attach_transaction_barrier(fs_info->fs_root);
-	if (IS_ERR(trans) && trans != ERR_PTR(-ENOENT)) {
+	ret = btrfs_commit_current_transaction(fs_info->fs_root);
+	if (ret) {
 		fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
-		return PTR_ERR(trans);
-	} else if (trans != ERR_PTR(-ENOENT)) {
-		ret = btrfs_commit_transaction(trans);
-		if (ret) {
-			fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
-			return ret;
-		}
+		return ret;
 	}
 
 	qgroup_rescan_zero_tracking(fs_info);
@@ -4190,7 +4176,6 @@ static int qgroup_unreserve_range(struct btrfs_inode *inode,
  */
 static int try_flush_qgroup(struct btrfs_root *root)
 {
-	struct btrfs_trans_handle *trans;
 	int ret;
 
 	/* Can't hold an open transaction or we run the risk of deadlocking. */
@@ -4213,15 +4198,7 @@ static int try_flush_qgroup(struct btrfs_root *root)
 		goto out;
 	btrfs_wait_ordered_extents(root, U64_MAX, NULL);
 
-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		ret = PTR_ERR(trans);
-		if (ret == -ENOENT)
-			ret = 0;
-		goto out;
-	}
-
-	ret = btrfs_commit_transaction(trans);
+	ret = btrfs_commit_current_transaction(root);
 out:
 	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
 	wake_up(&root->qgroup_flush_wait);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 6c7b5d52591e..188a9c42c9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2437,7 +2437,6 @@ static int finish_extent_writes_for_zoned(struct btrfs_root *root,
 					  struct btrfs_block_group *cache)
 {
 	struct btrfs_fs_info *fs_info = cache->fs_info;
-	struct btrfs_trans_handle *trans;
 
 	if (!btrfs_is_zoned(fs_info))
 		return 0;
@@ -2446,14 +2445,7 @@ static int finish_extent_writes_for_zoned(struct btrfs_root *root,
 	btrfs_wait_nocow_writers(cache);
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache);
 
-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		int ret = PTR_ERR(trans);
-
-		return (ret == -ENOENT) ? 0 : ret;
-	}
-
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(root);
 }
 
 static noinline_for_stack
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 289e5e6a6c56..7a82132500a8 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7998,7 +7998,6 @@ static int send_subvol(struct send_ctx *sctx)
  */
 static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 {
-	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root = sctx->parent_root;
 
 	if (root && root->node != root->commit_root)
@@ -8018,14 +8017,7 @@ static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 	 * an unnecessary update of the root's item in the root tree when
 	 * committing the transaction if that root wasn't changed before.
 	 */
-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		int ret = PTR_ERR(trans);
-
-		return (ret == -ENOENT) ? 0 : ret;
-	}
-
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(root);
 }
 
 /*
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 69760e1e726f..0283ee9bf813 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -805,14 +805,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		 * because that does not wait for a transaction to fully commit
 		 * (only for it to be unblocked, state TRANS_STATE_UNBLOCKED).
 		 */
-		trans = btrfs_attach_transaction_barrier(root);
-		if (IS_ERR(trans)) {
-			ret = PTR_ERR(trans);
-			if (ret == -ENOENT)
-				ret = 0;
-			break;
-		}
-		ret = btrfs_commit_transaction(trans);
+		ret = btrfs_commit_current_transaction(root);
 		break;
 	default:
 		ret = -ENOSPC;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 117a355dbd7a..21d986e07500 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2257,9 +2257,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
 
 static int btrfs_freeze(struct super_block *sb)
 {
-	struct btrfs_trans_handle *trans;
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
-	struct btrfs_root *root = fs_info->tree_root;
 
 	set_bit(BTRFS_FS_FROZEN, &fs_info->flags);
 	/*
@@ -2268,14 +2266,7 @@ static int btrfs_freeze(struct super_block *sb)
 	 * we want to avoid on a frozen filesystem), or do the commit
 	 * ourselves.
 	 */
-	trans = btrfs_attach_transaction_barrier(root);
-	if (IS_ERR(trans)) {
-		/* no transaction, don't bother */
-		if (PTR_ERR(trans) == -ENOENT)
-			return 0;
-		return PTR_ERR(trans);
-	}
-	return btrfs_commit_transaction(trans);
+	return btrfs_commit_current_transaction(fs_info->tree_root);
 }
 
 static int check_dev_super(struct btrfs_device *dev)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 639755f025b4..9590a1899b9d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1989,6 +1989,25 @@ void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans)
 	btrfs_put_transaction(cur_trans);
 }
 
+/*
+ * If there is a running transaction commit it or if it's already committing,
+ * wait for its commit to complete. Does not start and commit a new transaction
+ * if there isn't any running.
+ */
+int btrfs_commit_current_transaction(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+
+	trans = btrfs_attach_transaction_barrier(root);
+	if (IS_ERR(trans)) {
+		int ret = PTR_ERR(trans);
+
+		return (ret == -ENOENT) ? 0 : ret;
+	}
+
+	return btrfs_commit_transaction(trans);
+}
+
 static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 90b987941dd1..81da655b5ee7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -268,6 +268,7 @@ void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info);
 int btrfs_clean_one_deleted_snapshot(struct btrfs_fs_info *fs_info);
 int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
 void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans);
+int btrfs_commit_current_transaction(struct btrfs_root *root);
 int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans);
 bool btrfs_should_end_transaction(struct btrfs_trans_handle *trans);
 void btrfs_throttle(struct btrfs_fs_info *fs_info);
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]
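
Background note: the IS_ERR()/PTR_ERR() dance in the new helper relies on
the kernel's ERR_PTR convention, where a small negative errno is encoded
directly into an otherwise invalid pointer value. A simplified sketch of
the idiom follows (paraphrasing include/linux/err.h, not the exact kernel
definitions):

#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)
{
	/* Encode e.g. -ENOENT as a pointer into the top of the address space. */
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	/* Decode the pointer back into a negative errno. */
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* The highest MAX_ERRNO addresses are reserved for encoded errnos. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

This is why btrfs_attach_transaction_barrier() can report "no running
transaction" in-band as ERR_PTR(-ENOENT), and why the helper translates
exactly that errno into a successful return of 0.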

* [PATCH 5/7] btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned()
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (3 preceding siblings ...)
  2024-05-22 14:36  1% ` [PATCH 4/7] btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate() fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 6/7] btrfs: add and use helper to commit the current transaction fdmanana
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At finish_extent_writes_for_zoned() we use btrfs_join_transaction() to
catch any running transaction and then commit it. This will however create
a new and empty transaction in case there's no running transaction anymore
(it got committed by the transaction kthread or another task, for example) or
there's a running transaction finishing its commit and with a state >=
TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything,
while in the latter case we just need to wait for the transaction to
complete its commit.

So improve this by using btrfs_attach_transaction_barrier() instead, which
does not create a new transaction if there's none running, and if there's
a current transaction that is committing, it will wait for it to fully
commit and not create a new transaction. This helps avoid creating and
committing empty transactions, saving IO and time and avoiding unnecessary
rotation of the backup roots in the super block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/scrub.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 376c5c2e9aed..6c7b5d52591e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2446,9 +2446,13 @@ static int finish_extent_writes_for_zoned(struct btrfs_root *root,
 	btrfs_wait_nocow_writers(cache);
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, cache);
 
-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
+	trans = btrfs_attach_transaction_barrier(root);
+	if (IS_ERR(trans)) {
+		int ret = PTR_ERR(trans);
+
+		return (ret == -ENOENT) ? 0 : ret;
+	}
+
 	return btrfs_commit_transaction(trans);
 }
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]
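
As a side note, the pattern this and the sibling patches converge on is the
one factored out into btrfs_commit_current_transaction() by patch 6/7 of
this series. A caller-side sketch of the semantics described above (the
behavior is as stated in the changelog, not a description of the functions'
implementations):

static int commit_current_transaction_sketch(struct btrfs_root *root)
{
	struct btrfs_trans_handle *trans;

	/*
	 * btrfs_join_transaction() would create a transaction here when
	 * none is running, forcing a pointless empty commit below.
	 * btrfs_attach_transaction_barrier() instead returns
	 * ERR_PTR(-ENOENT) when there is nothing to commit, and waits for
	 * an already committing transaction to fully complete rather than
	 * spawning a new one.
	 */
	trans = btrfs_attach_transaction_barrier(root);
	if (IS_ERR(trans)) {
		int ret = PTR_ERR(trans);

		return (ret == -ENOENT) ? 0 : ret;
	}

	return btrfs_commit_transaction(trans);
}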

* [PATCH 4/7] btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate()
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
                   ` (2 preceding siblings ...)
  2024-05-22 14:36  1% ` [PATCH 3/7] btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 5/7] btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned() fdmanana
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At ensure_commit_roots_uptodate() we use btrfs_join_transaction() to
catch any running transaction and then commit it. This will however create
a new and empty transaction in case there's no running transaction anymore
(it got committed by the transaction kthread or another task, for example) or
there's a running transaction finishing its commit and with a state >=
TRANS_STATE_UNBLOCKED. In the former case we don't need to do anything,
while in the latter case we just need to wait for the transaction to
complete its commit.

So improve this by using btrfs_attach_transaction_barrier() instead, which
does not create a new transaction if there's none running, and if there's
a current transaction that is committing, it will wait for it to fully
commit and not create a new transaction. This helps avoid creating and
committing empty transactions, saving IO and time and avoiding unnecessary
rotation of the backup roots in the super block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 2c46bd1c90d3..289e5e6a6c56 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -8018,9 +8018,12 @@ static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 	 * an unnecessary update of the root's item in the root tree when
 	 * committing the transaction if that root wasn't changed before.
 	 */
-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
+	trans = btrfs_attach_transaction_barrier(root);
+	if (IS_ERR(trans)) {
+		int ret = PTR_ERR(trans);
+
+		return (ret == -ENOENT) ? 0 : ret;
+	}
 
 	return btrfs_commit_transaction(trans);
 }
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 3/7] btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
  2024-05-22 14:36  1% ` [PATCH 1/7] btrfs: qgroup: avoid start/commit empty transaction when flushing reservations fdmanana
  2024-05-22 14:36  1% ` [PATCH 2/7] btrfs: avoid create and commit empty transaction when committing super fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 4/7] btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate() fdmanana
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Before starting a send operation we have to make sure that every root has
its commit root matching the regular root, so that send doesn't find stale
inodes in the commit root (inodes that were deleted in the regular root)
and fail the inode lookups with -ESTALE.

Currently we keep looking for roots used by the send operation and as soon
as we find a modified one we commit the current transaction (or a new one since
btrfs_join_transaction() creates one if there isn't any running or the
running one is in a state >= TRANS_STATE_UNBLOCKED). It's pointless to
keep looking until we don't find any, because after the first transaction
commit all the other roots are updated too, as they were already tagged in
the fs_info->fs_roots_radix radix tree when they were modified in order to
have a commit root different from the regular root.

Currently we are also always passing the main send root into
btrfs_join_transaction(), which, despite not causing any functional issue,
is not optimal because, in case the root wasn't modified, we end up
adding it to fs_info->fs_roots_radix and then updating its root item in the
root tree when committing the transaction, causing unnecessary work.

So simplify and make this more efficient by removing the loop and by
passing the first modified root we find as the argument to
btrfs_join_transaction().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/send.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index c69743233be5..2c46bd1c90d3 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7998,32 +7998,29 @@ static int send_subvol(struct send_ctx *sctx)
  */
 static int ensure_commit_roots_uptodate(struct send_ctx *sctx)
 {
-	int i;
-	struct btrfs_trans_handle *trans = NULL;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_root *root = sctx->parent_root;
 
-again:
-	if (sctx->parent_root &&
-	    sctx->parent_root->node != sctx->parent_root->commit_root)
+	if (root && root->node != root->commit_root)
 		goto commit_trans;
 
-	for (i = 0; i < sctx->clone_roots_cnt; i++)
-		if (sctx->clone_roots[i].root->node !=
-		    sctx->clone_roots[i].root->commit_root)
+	for (int i = 0; i < sctx->clone_roots_cnt; i++) {
+		root = sctx->clone_roots[i].root;
+		if (root->node != root->commit_root)
 			goto commit_trans;
-
-	if (trans)
-		return btrfs_end_transaction(trans);
+	}
 
 	return 0;
 
 commit_trans:
-	/* Use any root, all fs roots will get their commit roots updated. */
-	if (!trans) {
-		trans = btrfs_join_transaction(sctx->send_root);
-		if (IS_ERR(trans))
-			return PTR_ERR(trans);
-		goto again;
-	}
+	/*
+	 * Use the first root we found. We could use any but that would cause
+	 * an unnecessary update of the root's item in the root tree when
+	 * committing the transaction if that root wasn't changed before.
+	 */
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
 
 	return btrfs_commit_transaction(trans);
 }
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]
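
The "modified" check used above deserves a brief note: a btrfs root has
pending changes in the current transaction precisely when its live node no
longer matches the commit root captured by the last transaction commit. A
trivial sketch of that predicate as the patch uses it (the helper name is
illustrative, not something the patch adds):

static bool root_was_modified(const struct btrfs_root *root)
{
	/* Has the tree diverged from its last committed version? */
	return root->node != root->commit_root;
}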

* [PATCH 2/7] btrfs: avoid create and commit empty transaction when committing super
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
  2024-05-22 14:36  1% ` [PATCH 1/7] btrfs: qgroup: avoid start/commit empty transaction when flushing reservations fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 3/7] btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient fdmanana
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

At btrfs_commit_super(), called in a few contexts such as when unmounting
a filesystem, we use btrfs_join_transaction() to catch any running
transaction and then commit it. This will however create a new and empty
transaction in case there's no running transaction or there's a running
transaction with a state >= TRANS_STATE_UNBLOCKED.

As we just want to be sure that any existing transaction is fully
committed, we can use btrfs_attach_transaction_barrier() instead of
btrfs_join_transaction(), therefore avoiding the creation and commit of
empty transactions, which only wastes IO and causes unnecessary rotation of
the precious backup roots.

Example where we create and commit a pointless empty transaction:

  $ mkfs.btrfs -f /dev/sdj
  $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
  generation            6

  $ mount /dev/sdj /mnt/sdj
  $ touch /mnt/sdj/foo

  # Commit the currently open transaction. Just 'sync' or wait ~30
  # seconds for the transaction kthread to commit it.
  $ sync

  $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
  generation            7

  $ umount /mnt/sdj

  $ btrfs inspect-internal dump-super /dev/sdj | grep -e '^generation'
  generation            8

The transaction with id 8 was pointless, an empty transaction that did
not achieve anything.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/disk-io.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index eec5bb392b8e..2f56f967beb8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4156,9 +4156,13 @@ int btrfs_commit_super(struct btrfs_fs_info *fs_info)
 	down_write(&fs_info->cleanup_work_sem);
 	up_write(&fs_info->cleanup_work_sem);
 
-	trans = btrfs_join_transaction(root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
+	trans = btrfs_attach_transaction_barrier(root);
+	if (IS_ERR(trans)) {
+		int ret = PTR_ERR(trans);
+
+		return (ret == -ENOENT) ? 0 : ret;
+	}
+
 	return btrfs_commit_transaction(trans);
 }
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 1/7] btrfs: qgroup: avoid start/commit empty transaction when flushing reservations
  2024-05-22 14:36  2% [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions fdmanana
@ 2024-05-22 14:36  1% ` fdmanana
  2024-05-22 14:36  1% ` [PATCH 2/7] btrfs: avoid create and commit empty transaction when committing super fdmanana
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When flushing reservations we are using btrfs_join_transaction() to get a
handle for the current transaction and then commit it to try to release
space. However btrfs_join_transaction() has some undesirable consequences:

1) If there's no running transaction, it will create one, and we will
   commit it right after. This is unnecessary because it will not release
   any space, and it will result in unnecessary IO and rotation of backup
   roots in the superblock;

2) If there's a current transaction and that transaction is committing
   (its state is >= TRANS_STATE_COMMIT_DOING), it will wait for that
   transaction to almost finish its commit (for its state to be >=
   TRANS_STATE_UNBLOCKED) and then start and return a new transaction.

   We will then commit that new transaction, which is pointless because
   all we wanted was to wait for the current (previous) transaction to
   fully finish its commit (state == TRANS_STATE_COMPLETED), and by
   starting and committing a new transaction we are wasting IO too and
   causing unnecessary rotation of backup roots in the superblock.

So improve this by using btrfs_attach_transaction_barrier() instead, which
does not create a new transaction if there's none running, and if there's
a current transaction that is committing, it will wait for it to fully
commit and not create a new transaction.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/qgroup.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a2e329710287..391af9e79dd6 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1341,12 +1341,14 @@ static int flush_reservations(struct btrfs_fs_info *fs_info)
 	if (ret)
 		return ret;
 	btrfs_wait_ordered_roots(fs_info, U64_MAX, NULL);
-	trans = btrfs_join_transaction(fs_info->tree_root);
-	if (IS_ERR(trans))
-		return PTR_ERR(trans);
-	ret = btrfs_commit_transaction(trans);
 
-	return ret;
+	trans = btrfs_attach_transaction_barrier(fs_info->tree_root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		return (ret == -ENOENT) ? 0 : ret;
+	}
+
+	return btrfs_commit_transaction(trans);
 }
 
 int btrfs_quota_disable(struct btrfs_fs_info *fs_info)
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 0/7] btrfs: avoid some unnecessary commit of empty transactions
@ 2024-05-22 14:36  2% fdmanana
  2024-05-22 14:36  1% ` [PATCH 1/7] btrfs: qgroup: avoid start/commit empty transaction when flushing reservations fdmanana
                   ` (9 more replies)
  0 siblings, 10 replies; 200+ results
From: fdmanana @ 2024-05-22 14:36 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

A few places can unnecessarily create an empty transaction and then commit
it, when the goal is just to catch the current transaction and wait for
its commit to complete. This results in wasted IO and time, and in needless
rotation of the precious backup roots in the super block. Details in the
change logs. The patches are mostly independent; only patch 4 applies on top
of patch 3 (and even those two could have been done in either order).

Filipe Manana (7):
  btrfs: qgroup: avoid start/commit empty transaction when flushing reservations
  btrfs: avoid create and commit empty transaction when committing super
  btrfs: send: make ensure_commit_roots_uptodate() simpler and more efficient
  btrfs: send: avoid create/commit empty transaction at ensure_commit_roots_uptodate()
  btrfs: scrub: avoid create/commit empty transaction at finish_extent_writes_for_zoned()
  btrfs: add and use helper to commit the current transaction
  btrfs: send: get rid of the label and gotos at ensure_commit_roots_uptodate()

 fs/btrfs/disk-io.c     |  8 +-------
 fs/btrfs/qgroup.c      | 31 +++++--------------------------
 fs/btrfs/scrub.c       |  6 +-----
 fs/btrfs/send.c        | 32 ++++++++------------------------
 fs/btrfs/space-info.c  |  9 +--------
 fs/btrfs/super.c       | 11 +----------
 fs/btrfs/transaction.c | 19 +++++++++++++++++++
 fs/btrfs/transaction.h |  1 +
 8 files changed, 37 insertions(+), 80 deletions(-)

-- 
2.43.0


^ permalink raw reply	[relevance 2%]

* Re: [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-21 15:22  1%   ` Filipe Manana
@ 2024-05-22  8:31  1%     ` Johannes Thumshirn
  0 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-22  8:31 UTC (permalink / raw)
  To: Filipe Manana, Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Naohiro Aota

On 21.05.24 17:23, Filipe Manana wrote:
>> +static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
>> +{
>> +       struct btrfs_block_group *bg;
>> +
>> +       for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
>> +               list_for_each_entry(bg, &sinfo->block_groups[i], list) {
>> +                       if (bg->flags != flags)
>> +                               continue;
>> +                       if (bg->used == 0)
>> +                               return bg->start;
>> +               }
>> +       }
> I believe I commented about this in some previous patchset version,
> but here goes again.
> 
> This happens at mount time, where we have already loaded all block groups.
> When we load them, if we find unused ones, we add them to the list of
> empty block groups, so that the next time the cleaner kthread runs it
> deletes them.
> 
> I don't see any code here removing the selected block group from that
> list, or anything at btrfs_delete_unused_bgs() that prevents deleting
> a block group if it was selected as the data reloc bg.
> 
> Maybe I'm missing something?
> How do you ensure the selected block group isn't deleted by the cleaner kthread?

Indeed, I forgot about that.
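
A minimal sketch of one possible fix, assuming the in-tree unused_bgs list
and lock in fs_info (this is not part of the posted patch):

	spin_lock(&fs_info->unused_bgs_lock);
	if (!list_empty(&bg->bg_list)) {
		/* drop it from the list the cleaner kthread scans */
		list_del_init(&bg->bg_list);
		btrfs_put_block_group(bg);	/* the list holds a reference */
	}
	spin_unlock(&fs_info->unused_bgs_lock);

That way btrfs_delete_unused_bgs() would never see the block group we
selected for relocation.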

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  2024-05-22  6:53  1%     ` Naohiro Aota
@ 2024-05-22  7:38  1%       ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  7:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 22, 2024 at 06:53:44AM GMT, Naohiro Aota wrote:
> On Wed, May 22, 2024 at 04:13:34PM GMT, Qu Wenruo wrote:
> > 
> > 
> > 在 2024/5/22 15:32, Naohiro Aota 写道:
> > > While "byte_count" is eventually rounded down to sectorsize at make_btrfs()
> > > or btrfs_add_to_fs_id(), it would be better round it down first and do the
> > > size checks not to confuse the things.
> > > 
> > > Also, on a zoned device, creating a btrfs whose size is not aligned to the
> > > zone boundary can be confusing. Round it down further to the zone boundary.
> > > 
> > > The size calculation with a source directory is also tweaked to be aligned.
> > > 
> > > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > > ---
> > >   mkfs/main.c | 11 +++++++++--
> > >   1 file changed, 9 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mkfs/main.c b/mkfs/main.c
> > > index a437ecc40c7f..baf889873b41 100644
> > > --- a/mkfs/main.c
> > > +++ b/mkfs/main.c
> > > @@ -1591,6 +1591,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
> > >   	min_dev_size = btrfs_min_dev_size(nodesize, mixed,
> > >   					  opt_zoned ? zone_size(file) : 0,
> > >   					  metadata_profile, data_profile);
> > > +	if (byte_count) {
> > > +		byte_count = round_down(byte_count, sectorsize);
> > > +		if (opt_zoned)
> > > +			byte_count = round_down(byte_count,  zone_size(file));
> > > +	}
> > > +
> > >   	/*
> > >   	 * Enlarge the destination file or create a new one, using the size
> > >   	 * calculated from source dir.
> > > @@ -1624,12 +1630,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
> > >   		 * Or we will always use source_dir_size calculated for mkfs.
> > >   		 */
> > >   		if (!byte_count)
> > > -			byte_count = device_get_partition_size_fd_stat(fd, &statbuf);
> > > +			byte_count = round_up(device_get_partition_size_fd_stat(fd, &statbuf),
> > > +					      sectorsize);
> > >   		source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
> > >   				min_dev_size, metadata_profile, data_profile);
> > >   		if (byte_count < source_dir_size) {
> > >   			if (S_ISREG(statbuf.st_mode)) {
> > > -				byte_count = source_dir_size;
> > > +				byte_count = round_up(source_dir_size, sectorsize);
> > 
> > I believe we should round up not round down, if we're using --rootdir
> > option.
> > 
> > As smaller size would only be more possible to hit ENOSPC.
> > 
> > Otherwise looks good to me.
> 
> The commit log was vague about that, but actually the source dir
> calculations are rounded up in the code. Sorry for the confusion.

Checking this line again. I think btrfs_mkfs_size_dir() returns a
"sectorsize"-aligned size in the first place. So, I think I can just drop
this diff line.

> 
> Regards,
> 
> > 
> > Thanks,
> > Qu
> > >   			} else {
> > >   				warning(
> > >   "the target device %llu (%s) is smaller than the calculated source directory size %llu (%s), mkfs may fail",

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  2024-05-22  6:43  1%   ` Qu Wenruo
  2024-05-22  6:49  1%     ` Qu Wenruo
@ 2024-05-22  6:53  1%     ` Naohiro Aota
  2024-05-22  7:38  1%       ` Naohiro Aota
  1 sibling, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-22  6:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, May 22, 2024 at 04:13:34PM GMT, Qu Wenruo wrote:
> 
> 
> 在 2024/5/22 15:32, Naohiro Aota 写道:
> > While "byte_count" is eventually rounded down to sectorsize at make_btrfs()
> > or btrfs_add_to_fs_id(), it would be better round it down first and do the
> > size checks not to confuse the things.
> > 
> > Also, on a zoned device, creating a btrfs whose size is not aligned to the
> > zone boundary can be confusing. Round it down further to the zone boundary.
> > 
> > The size calculation with a source directory is also tweaked to be aligned.
> > 
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >   mkfs/main.c | 11 +++++++++--
> >   1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mkfs/main.c b/mkfs/main.c
> > index a437ecc40c7f..baf889873b41 100644
> > --- a/mkfs/main.c
> > +++ b/mkfs/main.c
> > @@ -1591,6 +1591,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
> >   	min_dev_size = btrfs_min_dev_size(nodesize, mixed,
> >   					  opt_zoned ? zone_size(file) : 0,
> >   					  metadata_profile, data_profile);
> > +	if (byte_count) {
> > +		byte_count = round_down(byte_count, sectorsize);
> > +		if (opt_zoned)
> > +			byte_count = round_down(byte_count,  zone_size(file));
> > +	}
> > +
> >   	/*
> >   	 * Enlarge the destination file or create a new one, using the size
> >   	 * calculated from source dir.
> > @@ -1624,12 +1630,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
> >   		 * Or we will always use source_dir_size calculated for mkfs.
> >   		 */
> >   		if (!byte_count)
> > -			byte_count = device_get_partition_size_fd_stat(fd, &statbuf);
> > +			byte_count = round_up(device_get_partition_size_fd_stat(fd, &statbuf),
> > +					      sectorsize);
> >   		source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
> >   				min_dev_size, metadata_profile, data_profile);
> >   		if (byte_count < source_dir_size) {
> >   			if (S_ISREG(statbuf.st_mode)) {
> > -				byte_count = source_dir_size;
> > +				byte_count = round_up(source_dir_size, sectorsize);
> 
> I believe we should round up not round down, if we're using --rootdir
> option.
> 
> As smaller size would only be more possible to hit ENOSPC.
> 
> Otherwise looks good to me.

The commit log was vague about that, but actually the source dir
calculations are rounded up in the code. Sorry for the confusion.

Regards,

> 
> Thanks,
> Qu
> >   			} else {
> >   				warning(
> >   "the target device %llu (%s) is smaller than the calculated source directory size %llu (%s), mkfs may fail",

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (9 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 10/10] btrfs-progs: test: use nullb helpers in 031-zoned-bgt Naohiro Aota
@ 2024-05-22  6:50  1% ` Qu Wenruo
  10 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-22  6:50 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs



在 2024/5/22 15:32, Naohiro Aota 写道:
> mkfs.btrfs -b <byte_count> on a zoned device has several issues listed
> below.
>
> - The FS size needs to be larger than minimal size that can host a btrfs,
>    but its calculation does not consider non-SINGLE profile
> - The calculation also does not ensure tree-log BG and data relocation BG
> - It allows creating a FS not aligned to the zone boundary
> - It resets all device zones beyond the specified length
>
> This series fixes the issues with some cleanups.
>
> This one passed CI workflow here:
>
> Patches 1 to 3 are clean up patches, so they should not change the behavior.
>
> Patches 4 to 6 address the issues.
>
> Patches 7 to 10 add/modify the test cases. First, patch 7 adds nullb
> functions to use in later patches. Patch 8 adds a new test for
> zone resetting. And, patches 9 and 10 rewrites existing tests with the
> nullb helper.

Looks good to me overall, with only one small problem related to patch 5
commented.

You can add my reviewed-by after fixing that small
round_down()/round_up() problem.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
>
> Changes:
> - v3:
>    - Tweak minimum FS size calculation style.
>    - Round down the specified byte_count towards sectorsize and zone
>      size, instead of banning unaligned byte_count.
>    - Add active zone description in the commit log of patch 6.
>    - Add nullb test functions and use them in tests.
> - v2: https://lore.kernel.org/linux-btrfs/20240514182227.1197664-1-naohiro.aota@wdc.com/T/#t
>    - fix function declaration on older distro (non-ZONED setup)
>    - fix mkfs test failure
> - v1: https://lore.kernel.org/linux-btrfs/20240514005133.44786-1-naohiro.aota@wdc.com/
>
> Naohiro Aota (10):
>    btrfs-progs: rename block_count to byte_count
>    btrfs-progs: mkfs: remove duplicated device size check
>    btrfs-progs: mkfs: unify zoned mode minimum size calc into
>      btrfs_min_dev_size()
>    btrfs-progs: mkfs: fix minimum size calculation for zoned mode
>    btrfs-progs: mkfs: align byte_count with sectorsize and zone size
>    btrfs-progs: support byte length for zone resetting
>    btrfs-progs: test: add nullb setup functions
>    btrfs-progs: test: add test for zone resetting
>    btrfs-progs: test: use nullb helper and smaller zone size
>    btrfs-progs: test: use nullb helpers in 031-zoned-bgt
>
>   common/device-utils.c                    | 45 +++++++-----
>   kernel-shared/zoned.c                    | 23 ++++++-
>   kernel-shared/zoned.h                    |  7 +-
>   mkfs/common.c                            | 62 ++++++++++++++---
>   mkfs/common.h                            |  2 +-
>   mkfs/main.c                              | 88 ++++++++++--------------
>   tests/common                             | 63 +++++++++++++++++
>   tests/mkfs-tests/030-zoned-rst/test.sh   | 14 ++--
>   tests/mkfs-tests/031-zoned-bgt/test.sh   | 30 ++------
>   tests/mkfs-tests/032-zoned-reset/test.sh | 43 ++++++++++++
>   10 files changed, 259 insertions(+), 118 deletions(-)
>   create mode 100755 tests/mkfs-tests/032-zoned-reset/test.sh
>
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  2024-05-22  6:43  1%   ` Qu Wenruo
@ 2024-05-22  6:49  1%     ` Qu Wenruo
  2024-05-22  6:53  1%     ` Naohiro Aota
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-22  6:49 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs



在 2024/5/22 16:13, Qu Wenruo 写道:
> 
> 
> 在 2024/5/22 15:32, Naohiro Aota 写道:
>> While "byte_count" is eventually rounded down to sectorsize at 
>> make_btrfs()
>> or btrfs_add_to_fs_id(), it would be better round it down first and do 
>> the
>> size checks not to confuse the things.
>>
>> Also, on a zoned device, creating a btrfs whose size is not aligned to 
>> the
>> zone boundary can be confusing. Round it down further to the zone 
>> boundary.
>>
>> The size calculation with a source directory is also tweaked to be 
>> aligned.
>>
>> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
>> ---
>>   mkfs/main.c | 11 +++++++++--
>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/mkfs/main.c b/mkfs/main.c
>> index a437ecc40c7f..baf889873b41 100644
>> --- a/mkfs/main.c
>> +++ b/mkfs/main.c
>> @@ -1591,6 +1591,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>>       min_dev_size = btrfs_min_dev_size(nodesize, mixed,
>>                         opt_zoned ? zone_size(file) : 0,
>>                         metadata_profile, data_profile);
>> +    if (byte_count) {
>> +        byte_count = round_down(byte_count, sectorsize);
>> +        if (opt_zoned)
>> +            byte_count = round_down(byte_count,  zone_size(file));
>> +    }
>> +
>>       /*
>>        * Enlarge the destination file or create a new one, using the size
>>        * calculated from source dir.
>> @@ -1624,12 +1630,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>>            * Or we will always use source_dir_size calculated for mkfs.
>>            */
>>           if (!byte_count)
>> -            byte_count = device_get_partition_size_fd_stat(fd, 
>> &statbuf);
>> +            byte_count = 
>> round_up(device_get_partition_size_fd_stat(fd, &statbuf),
>> +                          sectorsize);

My bad, forgot this one too.

We should round_down() here.

As if we have a device with 512-byte blocks, and the partition is only
aligned to 512 bytes, rounding up can make the last sector go beyond the
device boundary.
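
For example, assuming the usual round_up()/round_down() macros:

	u64 dev_size = (8ULL << 20) + 512;	/* only 512-byte aligned */

	round_up(dev_size, 4096);	/* 8 MiB + 4 KiB: last sector past the device end */
	round_down(dev_size, 4096);	/* 8 MiB: stays within the device */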

Thanks,
Qu
>>           source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
>>                   min_dev_size, metadata_profile, data_profile);
>>           if (byte_count < source_dir_size) {
>>               if (S_ISREG(statbuf.st_mode)) {
>> -                byte_count = source_dir_size;
>> +                byte_count = round_up(source_dir_size, sectorsize);
> 
> I believe we should round up not round down, if we're using --rootdir
> option.
> 
> As smaller size would only be more possible to hit ENOSPC.
> 
> Otherwise looks good to me.
> 
> Thanks,
> Qu
>>               } else {
>>                   warning(
>>   "the target device %llu (%s) is smaller than the calculated source 
>> directory size %llu (%s), mkfs may fail",
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  2024-05-22  6:02  1% ` [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size Naohiro Aota
@ 2024-05-22  6:43  1%   ` Qu Wenruo
  2024-05-22  6:49  1%     ` Qu Wenruo
  2024-05-22  6:53  1%     ` Naohiro Aota
  0 siblings, 2 replies; 200+ results
From: Qu Wenruo @ 2024-05-22  6:43 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs



在 2024/5/22 15:32, Naohiro Aota 写道:
> While "byte_count" is eventually rounded down to sectorsize at make_btrfs()
> or btrfs_add_to_fs_id(), it would be better round it down first and do the
> size checks not to confuse the things.
>
> Also, on a zoned device, creating a btrfs whose size is not aligned to the
> zone boundary can be confusing. Round it down further to the zone boundary.
>
> The size calculation with a source directory is also tweaked to be aligned.
>
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>   mkfs/main.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mkfs/main.c b/mkfs/main.c
> index a437ecc40c7f..baf889873b41 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -1591,6 +1591,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   	min_dev_size = btrfs_min_dev_size(nodesize, mixed,
>   					  opt_zoned ? zone_size(file) : 0,
>   					  metadata_profile, data_profile);
> +	if (byte_count) {
> +		byte_count = round_down(byte_count, sectorsize);
> +		if (opt_zoned)
> +			byte_count = round_down(byte_count,  zone_size(file));
> +	}
> +
>   	/*
>   	 * Enlarge the destination file or create a new one, using the size
>   	 * calculated from source dir.
> @@ -1624,12 +1630,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
>   		 * Or we will always use source_dir_size calculated for mkfs.
>   		 */
>   		if (!byte_count)
> -			byte_count = device_get_partition_size_fd_stat(fd, &statbuf);
> +			byte_count = round_up(device_get_partition_size_fd_stat(fd, &statbuf),
> +					      sectorsize);
>   		source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
>   				min_dev_size, metadata_profile, data_profile);
>   		if (byte_count < source_dir_size) {
>   			if (S_ISREG(statbuf.st_mode)) {
> -				byte_count = source_dir_size;
> +				byte_count = round_up(source_dir_size, sectorsize);

I believe we should round up, not round down, if we're using the --rootdir
option.

A smaller size would only make it more likely to hit ENOSPC.

Otherwise looks good to me.

Thanks,
Qu
>   			} else {
>   				warning(
>   "the target device %llu (%s) is smaller than the calculated source directory size %llu (%s), mkfs may fail",

^ permalink raw reply	[relevance 1%]

* [PATCH v3 09/10] btrfs-progs: test: use nullb helper and smaller zone size
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (7 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 08/10] btrfs-progs: test: add test for zone resetting Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 10/10] btrfs-progs: test: use nullb helpers in 031-zoned-bgt Naohiro Aota
  2024-05-22  6:50  1% ` [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Qu Wenruo
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

With the change of minimal number of zones, mkfs-tests/030-zoned-rst now
fails because the loopback device is 2GB and can contain 8x 256MB zones.

Use the nullb helpers to choose a smaller zone size.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 tests/mkfs-tests/030-zoned-rst/test.sh | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/tests/mkfs-tests/030-zoned-rst/test.sh b/tests/mkfs-tests/030-zoned-rst/test.sh
index 2e048cf79f20..b1c696c96eb7 100755
--- a/tests/mkfs-tests/030-zoned-rst/test.sh
+++ b/tests/mkfs-tests/030-zoned-rst/test.sh
@@ -4,22 +4,22 @@
 source "$TEST_TOP/common" || exit
 
 setup_root_helper
-setup_loopdevs 4
-prepare_loopdevs
-TEST_DEV=${loopdevs[1]}
+setup_nullbdevs 4 128 4
+prepare_nullbdevs
+TEST_DEV=${nullb_devs[1]}
 
 profiles="single dup raid1 raid1c3 raid1c4 raid10"
 
 for dprofile in $profiles; do
 	for mprofile in $profiles; do
 		# It's sufficient to specify only 'zoned', the rst will be enabled
-		run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d "$dprofile" -m "$mprofile" "${loopdevs[@]}"
+		run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d "$dprofile" -m "$mprofile" "${nullb_devs[@]}"
 	done
 done
 
 run_mustfail "unsupported profile raid56 created" \
-	$SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d raid5 -m raid5 "${loopdevs[@]}"
+	$SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d raid5 -m raid5 "${nullb_devs[@]}"
 run_mustfail "unsupported profile raid56 created" \
-	$SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d raid6 -m raid6 "${loopdevs[@]}"
+	$SUDO_HELPER "$TOP/mkfs.btrfs" -f -O zoned -d raid6 -m raid6 "${nullb_devs[@]}"
 
-cleanup_loopdevs
+cleanup_nullbdevs
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 10/10] btrfs-progs: test: use nullb helpers in 031-zoned-bgt
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (8 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 09/10] btrfs-progs: test: use nullb helper and smaller zone size Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:50  1% ` [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Qu Wenruo
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Rewrite 031-zoned-bgt with the nullb helpers.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 tests/mkfs-tests/031-zoned-bgt/test.sh | 30 +++++---------------------
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/tests/mkfs-tests/031-zoned-bgt/test.sh b/tests/mkfs-tests/031-zoned-bgt/test.sh
index 91c107cd5a3b..e296c29b9238 100755
--- a/tests/mkfs-tests/031-zoned-bgt/test.sh
+++ b/tests/mkfs-tests/031-zoned-bgt/test.sh
@@ -4,37 +4,17 @@
 source "$TEST_TOP/common" || exit
 
 setup_root_helper
-prepare_test_dev
-
-nullb="$TEST_TOP/nullb"
 # Create one 128M device with 4M zones, 32 of them
-size=128
-zone=4
-
-run_mayfail $SUDO_HELPER "$nullb" setup
-if [ $? != 0 ]; then
-	_not_run "cannot setup nullb environment for zoned devices"
-fi
-
-# Record any other pre-existing devices in case creation fails
-run_check $SUDO_HELPER "$nullb" ls
-
-# Last line has the name of the device node path
-out=$(run_check_stdout $SUDO_HELPER "$nullb" create -s "$size" -z "$zone")
-if [ $? != 0 ]; then
-	_fail "cannot create nullb zoned device $i"
-fi
-dev=$(echo "$out" | tail -n 1)
-name=$(basename "${dev}")
+setup_nullbdevs 1 128 4
 
-run_check $SUDO_HELPER "$nullb" ls
+prepare_nullbdevs
 
-TEST_DEV="${dev}"
+TEST_DEV="${nullb_devs[1]}"
 # Use single as it's supported on more kernels
-run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -m single -d single -O block-group-tree "${dev}"
+run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -m single -d single -O block-group-tree "${TEST_DEV}"
 run_check_mount_test_dev
 run_check $SUDO_HELPER dd if=/dev/zero of="$TEST_MNT"/file bs=1M count=1
 run_check $SUDO_HELPER "$TOP/btrfs" filesystem usage -T "$TEST_MNT"
 run_check_umount_test_dev
 
-run_check $SUDO_HELPER "$nullb" rm "${name}"
+cleanup_nullbdevs
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 08/10] btrfs-progs: test: add test for zone resetting
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (6 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 07/10] btrfs-progs: test: add nullb setup functions Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 09/10] btrfs-progs: test: use nullb helper and smaller zone size Naohiro Aota
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Add a test for mkfs.btrfs's zone reset behavior to check that

- it resets all the zones without the "-b" option
- it detects an active zone outside of the FS range
- it does not reset a zone outside of the range

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 tests/mkfs-tests/032-zoned-reset/test.sh | 43 ++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100755 tests/mkfs-tests/032-zoned-reset/test.sh

diff --git a/tests/mkfs-tests/032-zoned-reset/test.sh b/tests/mkfs-tests/032-zoned-reset/test.sh
new file mode 100755
index 000000000000..2aedb14abb03
--- /dev/null
+++ b/tests/mkfs-tests/032-zoned-reset/test.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+# Verify zone resetting behavior of mkfs.btrfs on zoned devices
+
+source "$TEST_TOP/common" || exit
+
+check_global_prereq blkzone
+setup_root_helper
+# Create one 128M device with 4M zones, 32 of them
+setup_nullbdevs 1 128 4
+
+prepare_nullbdevs
+
+TEST_DEV="${nullb_devs[1]}"
+last_zone_sector=$(( 4 * 31 * 1024 * 1024 / 512 ))
+# Write some data to the last zone
+run_check $SUDO_HELPER dd if=/dev/urandom of="${TEST_DEV}" bs=1M count=4 seek=$(( 4 * 31 ))
+# Use single as it's supported on more kernels
+run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f -m single -d single "${TEST_DEV}"
+# Check if the lat zone is empty
+run_check_stdout $SUDO_HELPER blkzone report -o ${last_zone_sector} -c 1 "${TEST_DEV}" | grep -Fq '(em)'
+if [ $? != 0 ]; then
+	_fail "last zone is not empty"
+fi
+
+# Write some data to the last zone
+run_check $SUDO_HELPER dd if=/dev/urandom of="${TEST_DEV}" bs=1M count=1 seek=$(( 4 * 31 ))
+# Create a FS excluding the last zone
+run_mayfail $SUDO_HELPER "$TOP/mkfs.btrfs" -f -b $(( 4 * 31 ))M -m single -d single "${TEST_DEV}"
+if [ $? == 0 ]; then
+	_fail "mkfs.btrfs should detect active zone outside of FS range"
+fi
+
+# Fill the last zone to finish it
+run_check $SUDO_HELPER dd if=/dev/urandom of="${TEST_DEV}" bs=1M count=3 seek=$(( 4 * 31 + 1 ))
+# Create a FS excluding the last zone
+run_mayfail $SUDO_HELPER "$TOP/mkfs.btrfs" -f -b $(( 4 * 31 ))M -m single -d single "${TEST_DEV}"
+# Check if the last zone is not empty
+run_check_stdout $SUDO_HELPER blkzone report -o ${last_zone_sector} -c 1 "${TEST_DEV}" | grep -Fq '(em)'
+if [ $? == 0 ]; then
+	_fail "last zone is empty"
+fi
+
+cleanup_nullbdevs
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 07/10] btrfs-progs: test: add nullb setup functions
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (5 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 06/10] btrfs-progs: support byte length for zone resetting Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 08/10] btrfs-progs: test: add test for zone resetting Naohiro Aota
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Add functions to setup, create and remove nullb devices.
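
For illustration, a test is expected to use them like this (the device
count, size and zone size values here are arbitrary):

	setup_nullbdevs 4 128 4		# four 128M devices with 4M zones
	prepare_nullbdevs
	TEST_DEV="${nullb_devs[1]}"
	# ... exercise mkfs.btrfs/mount against "${TEST_DEV}" ...
	cleanup_nullbdevs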

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 tests/common | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/tests/common b/tests/common
index 1f880adead6d..ef9fcd32870a 100644
--- a/tests/common
+++ b/tests/common
@@ -882,6 +882,69 @@ cond_wait_for_loopdevs() {
 	fi
 }
 
+# prepare environment for nullb devices, set up the following variables
+# - nullb_count -- number of desired devices
+# - nullb_size -- size of the devices
+# - nullb_zone_size -- zone size of the devices
+# - nullb_devs  -- array containing paths to all devices (after prepare is called)
+#
+# $1: number of nullb devices to be set up
+# $2: size of the devices
+# $3: zone size of the devices
+setup_nullbdevs()
+{
+	if [ "$#" -lt 3 ]; then
+		_fail "setup_nullbdevs <number of device> <size> <zone size>"
+	fi
+
+	setup_root_helper
+	local nullb="${TEST_TOP}/nullb"
+
+	run_mayfail $SUDO_HELPER "${nullb}" setup
+	if [ $? != 0 ]; then
+		_not_run "cannot setup nullb environment for zoned devices"
+	fi
+
+	nullb_count="$1"
+	nullb_size="$2"
+	nullb_zone_size="$3"
+	declare -a nullb_devs
+}
+
+# create all nullb devices from a given nullb environment
+prepare_nullbdevs()
+{
+	setup_root_helper
+	local nullb="${TEST_TOP}/nullb"
+
+	# Record any other pre-existing devices in case creation fails
+	run_check $SUDO_HELPER "${nullb}" ls
+
+	for i in `seq ${nullb_count}`; do
+		# Last line has the name of the device node path
+		out=$(run_check_stdout $SUDO_HELPER "${nullb}" create -s "${nullb_size}" -z "${nullb_zone_size}")
+		if [ $? != 0 ]; then
+			_fail "cannot create nullb zoned device $i"
+		fi
+		dev=$(echo "${out}" | tail -n 1)
+		nullb_devs[$i]=${dev}
+	done
+
+	run_check $SUDO_HELPER "${nullb}" ls
+}
+
+# remove nullb devices
+cleanup_nullbdevs()
+{
+	setup_root_helper
+	local nullb="${TEST_TOP}/nullb"
+
+	for dev in ${nullb_devs[@]}; do
+		name=$(basename ${dev})
+		run_check $SUDO_HELPER "${nullb}" rm "${name}"
+	done
+}
+
 init_env()
 {
 	TEST_MNT="${TEST_MNT:-$TEST_TOP/mnt}"
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 01/10] btrfs-progs: rename block_count to byte_count
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 02/10] btrfs-progs: mkfs: remove duplicated device size check Naohiro Aota
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

block_count and dev_block_count count sizes in bytes, so comparing them
with, e.g., "min_dev_size" is confusing. Rename them to better represent
the unit.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 common/device-utils.c | 28 +++++++++++-----------
 mkfs/main.c           | 56 +++++++++++++++++++++----------------------
 2 files changed, 42 insertions(+), 42 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index d086e9ea2564..86942e0c7041 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -222,11 +222,11 @@ out:
  * - reset zones
  * - delete end of the device
  */
-int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
-		u64 max_block_count, unsigned opflags)
+int btrfs_prepare_device(int fd, const char *file, u64 *byte_count_ret,
+		u64 max_byte_count, unsigned opflags)
 {
 	struct btrfs_zoned_device_info *zinfo = NULL;
-	u64 block_count;
+	u64 byte_count;
 	struct stat st;
 	int i, ret;
 
@@ -236,13 +236,13 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		return 1;
 	}
 
-	block_count = device_get_partition_size_fd_stat(fd, &st);
-	if (block_count == 0) {
+	byte_count = device_get_partition_size_fd_stat(fd, &st);
+	if (byte_count == 0) {
 		error("unable to determine size of %s", file);
 		return 1;
 	}
-	if (max_block_count)
-		block_count = min(block_count, max_block_count);
+	if (max_byte_count)
+		byte_count = min(byte_count, max_byte_count);
 
 	if (opflags & PREP_DEVICE_ZONED) {
 		ret = btrfs_get_zone_info(fd, file, &zinfo);
@@ -276,18 +276,18 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 		if (discard_supported(file)) {
 			if (opflags & PREP_DEVICE_VERBOSE)
 				printf("Performing full device TRIM %s (%s) ...\n",
-						file, pretty_size(block_count));
-			device_discard_blocks(fd, 0, block_count);
+						file, pretty_size(byte_count));
+			device_discard_blocks(fd, 0, byte_count);
 		}
 	}
 
-	ret = zero_dev_clamped(fd, zinfo, 0, ZERO_DEV_BYTES, block_count);
+	ret = zero_dev_clamped(fd, zinfo, 0, ZERO_DEV_BYTES, byte_count);
 	for (i = 0 ; !ret && i < BTRFS_SUPER_MIRROR_MAX; i++)
 		ret = zero_dev_clamped(fd, zinfo, btrfs_sb_offset(i),
-				       BTRFS_SUPER_INFO_SIZE, block_count);
+				       BTRFS_SUPER_INFO_SIZE, byte_count);
 	if (!ret && (opflags & PREP_DEVICE_ZERO_END))
-		ret = zero_dev_clamped(fd, zinfo, block_count - ZERO_DEV_BYTES,
-				       ZERO_DEV_BYTES, block_count);
+		ret = zero_dev_clamped(fd, zinfo, byte_count - ZERO_DEV_BYTES,
+				       ZERO_DEV_BYTES, byte_count);
 
 	if (ret < 0) {
 		errno = -ret;
@@ -302,7 +302,7 @@ int btrfs_prepare_device(int fd, const char *file, u64 *block_count_ret,
 	}
 
 	free(zinfo);
-	*block_count_ret = block_count;
+	*byte_count_ret = byte_count;
 	return 0;
 
 err:
diff --git a/mkfs/main.c b/mkfs/main.c
index a467795d4428..950f76101058 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -80,8 +80,8 @@ static int opt_oflags = O_RDWR;
 struct prepare_device_progress {
 	int fd;
 	char *file;
-	u64 dev_block_count;
-	u64 block_count;
+	u64 dev_byte_count;
+	u64 byte_count;
 	int ret;
 };
 
@@ -1159,8 +1159,8 @@ static void *prepare_one_device(void *ctx)
 	}
 	prepare_ctx->ret = btrfs_prepare_device(prepare_ctx->fd,
 				prepare_ctx->file,
-				&prepare_ctx->dev_block_count,
-				prepare_ctx->block_count,
+				&prepare_ctx->dev_byte_count,
+				prepare_ctx->byte_count,
 				(bconf.verbose ? PREP_DEVICE_VERBOSE : 0) |
 				(opt_zero_end ? PREP_DEVICE_ZERO_END : 0) |
 				(opt_discard ? PREP_DEVICE_DISCARD : 0) |
@@ -1204,8 +1204,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	bool metadata_profile_set = false;
 	u64 data_profile = 0;
 	bool data_profile_set = false;
-	u64 block_count = 0;
-	u64 dev_block_count = 0;
+	u64 byte_count = 0;
+	u64 dev_byte_count = 0;
 	bool mixed = false;
 	char *label = NULL;
 	int nr_global_roots = sysconf(_SC_NPROCESSORS_ONLN);
@@ -1347,7 +1347,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 				sectorsize = arg_strtou64_with_suffix(optarg);
 				break;
 			case 'b':
-				block_count = arg_strtou64_with_suffix(optarg);
+				byte_count = arg_strtou64_with_suffix(optarg);
 				opt_zero_end = false;
 				break;
 			case 'v':
@@ -1623,34 +1623,34 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		 * Block_count not specified, use file/device size first.
 		 * Or we will always use source_dir_size calculated for mkfs.
 		 */
-		if (!block_count)
-			block_count = device_get_partition_size_fd_stat(fd, &statbuf);
+		if (!byte_count)
+			byte_count = device_get_partition_size_fd_stat(fd, &statbuf);
 		source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
 				min_dev_size, metadata_profile, data_profile);
-		if (block_count < source_dir_size) {
+		if (byte_count < source_dir_size) {
 			if (S_ISREG(statbuf.st_mode)) {
-				block_count = source_dir_size;
+				byte_count = source_dir_size;
 			} else {
 				warning(
 "the target device %llu (%s) is smaller than the calculated source directory size %llu (%s), mkfs may fail",
-					block_count, pretty_size(block_count),
+					byte_count, pretty_size(byte_count),
 					source_dir_size, pretty_size(source_dir_size));
 			}
 		}
-		ret = zero_output_file(fd, block_count);
+		ret = zero_output_file(fd, byte_count);
 		if (ret) {
 			error("unable to zero the output file");
 			close(fd);
 			goto error;
 		}
 		/* our "device" is the new image file */
-		dev_block_count = block_count;
+		dev_byte_count = byte_count;
 		close(fd);
 	}
-	/* Check device/block_count after the nodesize is determined */
-	if (block_count && block_count < min_dev_size) {
+	/* Check device/byte_count after the nodesize is determined */
+	if (byte_count && byte_count < min_dev_size) {
 		error("size %llu is too small to make a usable filesystem",
-			block_count);
+			byte_count);
 		error("minimum size for btrfs filesystem is %llu",
 			min_dev_size);
 		goto error;
@@ -1661,9 +1661,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	 * 1 zone for a metadata block group
 	 * 1 zone for a data block group
 	 */
-	if (opt_zoned && block_count && block_count < 5 * zone_size(file)) {
+	if (opt_zoned && byte_count && byte_count < 5 * zone_size(file)) {
 		error("size %llu is too small to make a usable filesystem",
-			block_count);
+			byte_count);
 		error("minimum size for a zoned btrfs filesystem is %llu",
 			min_dev_size);
 		goto error;
@@ -1741,8 +1741,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	/* Start threads */
 	for (i = 0; i < device_count; i++) {
 		prepare_ctx[i].file = argv[optind + i - 1];
-		prepare_ctx[i].block_count = block_count;
-		prepare_ctx[i].dev_block_count = block_count;
+		prepare_ctx[i].byte_count = byte_count;
+		prepare_ctx[i].dev_byte_count = byte_count;
 		ret = pthread_create(&t_prepare[i], NULL, prepare_one_device,
 				     &prepare_ctx[i]);
 		if (ret) {
@@ -1763,16 +1763,16 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
-	dev_block_count = prepare_ctx[0].dev_block_count;
-	if (block_count && block_count > dev_block_count) {
+	dev_byte_count = prepare_ctx[0].dev_byte_count;
+	if (byte_count && byte_count > dev_byte_count) {
 		error("%s is smaller than requested size, expected %llu, found %llu",
-		      file, block_count, dev_block_count);
+		      file, byte_count, dev_byte_count);
 		goto error;
 	}
 
 	/* To create the first block group and chunk 0 in make_btrfs */
 	system_group_size = (opt_zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE);
-	if (dev_block_count < system_group_size) {
+	if (dev_byte_count < system_group_size) {
 		error("device is too small to make filesystem, must be at least %llu",
 				system_group_size);
 		goto error;
@@ -1794,7 +1794,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	mkfs_cfg.label = label;
 	memcpy(mkfs_cfg.fs_uuid, fs_uuid, sizeof(mkfs_cfg.fs_uuid));
 	memcpy(mkfs_cfg.dev_uuid, dev_uuid, sizeof(mkfs_cfg.dev_uuid));
-	mkfs_cfg.num_bytes = dev_block_count;
+	mkfs_cfg.num_bytes = dev_byte_count;
 	mkfs_cfg.nodesize = nodesize;
 	mkfs_cfg.sectorsize = sectorsize;
 	mkfs_cfg.stripesize = stripesize;
@@ -1889,7 +1889,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 				file);
 			continue;
 		}
-		dev_block_count = prepare_ctx[i].dev_block_count;
+		dev_byte_count = prepare_ctx[i].dev_byte_count;
 
 		if (prepare_ctx[i].ret) {
 			errno = -prepare_ctx[i].ret;
@@ -1898,7 +1898,7 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		}
 
 		ret = btrfs_add_to_fsid(trans, root, prepare_ctx[i].fd,
-					prepare_ctx[i].file, dev_block_count,
+					prepare_ctx[i].file, dev_byte_count,
 					sectorsize, sectorsize, sectorsize);
 		if (ret) {
 			error("unable to add %s to filesystem: %d",
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support
@ 2024-05-22  6:02  2% Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 01/10] btrfs-progs: rename block_count to byte_count Naohiro Aota
                   ` (10 more replies)
  0 siblings, 11 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

mkfs.btrfs -b <byte_count> on a zoned device has several issues listed
below.

- The FS size needs to be larger than the minimal size that can host a btrfs,
  but its calculation does not consider non-SINGLE profiles
- The calculation also does not reserve a tree-log BG and a data relocation BG
- It allows creating a FS not aligned to the zone boundary
- It resets all device zones beyond the specified length

This series fixes the issues with some cleanups.

This one passed CI workflow here:

Patches 1 to 3 are cleanup patches, so they should not change the behavior.

Patches 4 to 6 address the issues.

Patches 7 to 10 add/modify the test cases. First, patch 7 adds nullb
functions to use in later patches. Patch 8 adds a new test for
zone resetting. And, patches 9 and 10 rewrite existing tests with the
nullb helpers.

Changes:
- v3:
  - Tweak minimum FS size calculation style.
  - Round down the specified byte_count towards sectorsize and zone
    size, instead of banning unaligned byte_count.
  - Add active zone description in the commit log of patch 6.
  - Add nullb test functions and use them in tests.
- v2: https://lore.kernel.org/linux-btrfs/20240514182227.1197664-1-naohiro.aota@wdc.com/T/#t
  - fix function declaration on older distro (non-ZONED setup)
  - fix mkfs test failure
- v1: https://lore.kernel.org/linux-btrfs/20240514005133.44786-1-naohiro.aota@wdc.com/

Naohiro Aota (10):
  btrfs-progs: rename block_count to byte_count
  btrfs-progs: mkfs: remove duplicated device size check
  btrfs-progs: mkfs: unify zoned mode minimum size calc into
    btrfs_min_dev_size()
  btrfs-progs: mkfs: fix minimum size calculation for zoned mode
  btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  btrfs-progs: support byte length for zone resetting
  btrfs-progs: test: add nullb setup functions
  btrfs-progs: test: add test for zone resetting
  btrfs-progs: test: use nullb helper and smaller zone size
  btrfs-progs: test: use nullb helpers in 031-zoned-bgt

 common/device-utils.c                    | 45 +++++++-----
 kernel-shared/zoned.c                    | 23 ++++++-
 kernel-shared/zoned.h                    |  7 +-
 mkfs/common.c                            | 62 ++++++++++++++---
 mkfs/common.h                            |  2 +-
 mkfs/main.c                              | 88 ++++++++++--------------
 tests/common                             | 63 +++++++++++++++++
 tests/mkfs-tests/030-zoned-rst/test.sh   | 14 ++--
 tests/mkfs-tests/031-zoned-bgt/test.sh   | 30 ++------
 tests/mkfs-tests/032-zoned-reset/test.sh | 43 ++++++++++++
 10 files changed, 259 insertions(+), 118 deletions(-)
 create mode 100755 tests/mkfs-tests/032-zoned-reset/test.sh

--
2.45.1


^ permalink raw reply	[relevance 2%]

* [PATCH v3 06/10] btrfs-progs: support byte length for zone resetting
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (4 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 07/10] btrfs-progs: test: add nullb setup functions Naohiro Aota
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota, Qu Wenruo

Even with "mkfs.btrfs -b", mkfs.btrfs resets all the zones on the device.
Limit the reset target within the specified length.

Also, we need to check that there is no active zone outside of the FS
range. Having an active zone outside the FS reduces the number of zones
btrfs can write simultaneously. Technically, we could still scan all the
device zones, keep the active zones outside the FS intact, and try to live
with the limited active zones. But that would make btrfs operations harder.

It is generally a bad idea to use "-b" for non-test usage on a device with
an active zone limit in the first place. You really need to take care that
the FS and the area outside the FS together do not go over the limit. That
means you'll never be able to use the zones outside the FS anyway.

So, until there is a strong request for that, I don't think it's worthwhile
to do so.
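
For illustration, with a hypothetical 32-zone nullb device with 4M zones:

	# Limit the FS to the first 31 zones; the last zone is not reset.
	mkfs.btrfs -b $((4 * 31))M -m single -d single /dev/nullb0
	# Inspect the zone beyond byte_count. If that zone had been active
	# (open or closed), mkfs.btrfs would now fail with EBUSY instead.
	blkzone report -o $((4 * 31 * 1024 * 1024 / 512)) -c 1 /dev/nullb0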

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
---
 common/device-utils.c | 17 ++++++++++++-----
 kernel-shared/zoned.c | 23 ++++++++++++++++++++++-
 kernel-shared/zoned.h |  7 ++++---
 3 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/common/device-utils.c b/common/device-utils.c
index 86942e0c7041..7df7d9ce39d8 100644
--- a/common/device-utils.c
+++ b/common/device-utils.c
@@ -254,16 +254,23 @@ int btrfs_prepare_device(int fd, const char *file, u64 *byte_count_ret,
 
 		if (!zinfo->emulated) {
 			if (opflags & PREP_DEVICE_VERBOSE)
-				printf("Resetting device zones %s (%u zones) ...\n",
-				       file, zinfo->nr_zones);
+				printf("Resetting device zones %s (%llu zones) ...\n",
+				       file, byte_count / zinfo->zone_size);
 			/*
 			 * We cannot ignore zone reset errors for a zoned block
 			 * device as this could result in the inability to write
 			 * to non-empty sequential zones of the device.
 			 */
-			if (btrfs_reset_all_zones(fd, zinfo)) {
-				error("zoned: failed to reset device '%s' zones: %m",
-				      file);
+			ret = btrfs_reset_zones(fd, zinfo, byte_count);
+			if (ret) {
+				if (ret == EBUSY) {
+					error("zoned: device '%s' contains an active zone outside of the FS range",
+					      file);
+					error("zoned: btrfs needs full control of active zones");
+				} else {
+					error("zoned: failed to reset device '%s' zones: %m",
+					      file);
+				}
 				goto err;
 			}
 		}
diff --git a/kernel-shared/zoned.c b/kernel-shared/zoned.c
index fb1e1388804e..b4244966ca36 100644
--- a/kernel-shared/zoned.c
+++ b/kernel-shared/zoned.c
@@ -395,16 +395,24 @@ static int report_zones(int fd, const char *file,
  * Discard blocks in the zones of a zoned block device. Process this with zone
  * size granularity so that blocks in conventional zones are discarded using
  * discard_range and blocks in sequential zones are reset though a zone reset.
+ *
+ * We need to ensure that zones outside of the FS are not active, so that
+ * the FS can use all the active zones. Return EBUSY if there is an active
+ * zone.
  */
-int btrfs_reset_all_zones(int fd, struct btrfs_zoned_device_info *zinfo)
+int btrfs_reset_zones(int fd, struct btrfs_zoned_device_info *zinfo, u64 byte_count)
 {
 	unsigned int i;
 	int ret = 0;
 
 	ASSERT(zinfo);
+	ASSERT(IS_ALIGNED(byte_count, zinfo->zone_size));
 
 	/* Zone size granularity */
 	for (i = 0; i < zinfo->nr_zones; i++) {
+		if (byte_count == 0)
+			break;
+
 		if (zinfo->zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL) {
 			ret = device_discard_blocks(fd,
 					     zinfo->zones[i].start << SECTOR_SHIFT,
@@ -419,7 +427,20 @@ int btrfs_reset_all_zones(int fd, struct btrfs_zoned_device_info *zinfo)
 
 		if (ret)
 			return ret;
+
+		byte_count -= zinfo->zone_size;
 	}
+	for (; i < zinfo->nr_zones; i++) {
+		const enum blk_zone_cond cond = zinfo->zones[i].cond;
+
+		if (zinfo->zones[i].type == BLK_ZONE_TYPE_CONVENTIONAL)
+			continue;
+		if (cond == BLK_ZONE_COND_IMP_OPEN ||
+		    cond == BLK_ZONE_COND_EXP_OPEN ||
+		    cond == BLK_ZONE_COND_CLOSED)
+			return EBUSY;
+	}
+
 	return fsync(fd);
 }
 
diff --git a/kernel-shared/zoned.h b/kernel-shared/zoned.h
index 6eba86d266bf..2bf24cbba62a 100644
--- a/kernel-shared/zoned.h
+++ b/kernel-shared/zoned.h
@@ -149,7 +149,7 @@ bool btrfs_redirty_extent_buffer_for_zoned(struct btrfs_fs_info *fs_info,
 					   u64 start, u64 end);
 int btrfs_reset_chunk_zones(struct btrfs_fs_info *fs_info, u64 devid,
 			    u64 offset, u64 length);
-int btrfs_reset_all_zones(int fd, struct btrfs_zoned_device_info *zinfo);
+int btrfs_reset_zones(int fd, struct btrfs_zoned_device_info *zinfo, u64 byte_count);
 int zero_zone_blocks(int fd, struct btrfs_zoned_device_info *zinfo, off_t start,
 		     size_t len);
 int btrfs_wipe_temporary_sb(struct btrfs_fs_devices *fs_devices);
@@ -203,8 +203,9 @@ static inline int btrfs_reset_chunk_zones(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-static inline int btrfs_reset_all_zones(int fd,
-					struct btrfs_zoned_device_info *zinfo)
+static inline int btrfs_reset_zones(int fd,
+				    struct btrfs_zoned_device_info *zinfo,
+				    u64 byte_count)
 {
 	return -EOPNOTSUPP;
 }
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (3 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 04/10] btrfs-progs: mkfs: fix minimum size calculation for zoned mode Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:43  1%   ` Qu Wenruo
  2024-05-22  6:02  1% ` [PATCH v3 06/10] btrfs-progs: support byte length for zone resetting Naohiro Aota
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

While "byte_count" is eventually rounded down to sectorsize at make_btrfs()
or btrfs_add_to_fs_id(), it would be better round it down first and do the
size checks not to confuse the things.

Also, on a zoned device, creating a btrfs whose size is not aligned to the
zone boundary can be confusing. Round it down further to the zone boundary.

The size calculation with a source directory is also tweaked to be aligned.
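
For example (hypothetical numbers), with a 4K sectorsize and a 4M zone size:

	byte_count = 130M + 1000 bytes
	round_down(byte_count, 4K)	-> 130M		(sector alignment)
	round_down(130M, 4M)		-> 128M		(zone alignment)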

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/main.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index a437ecc40c7f..baf889873b41 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1591,6 +1591,12 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	min_dev_size = btrfs_min_dev_size(nodesize, mixed,
 					  opt_zoned ? zone_size(file) : 0,
 					  metadata_profile, data_profile);
+	if (byte_count) {
+		byte_count = round_down(byte_count, sectorsize);
+		if (opt_zoned)
+			byte_count = round_down(byte_count,  zone_size(file));
+	}
+
 	/*
 	 * Enlarge the destination file or create a new one, using the size
 	 * calculated from source dir.
@@ -1624,12 +1630,13 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		 * Or we will always use source_dir_size calculated for mkfs.
 		 */
 		if (!byte_count)
-			byte_count = device_get_partition_size_fd_stat(fd, &statbuf);
+			byte_count = round_up(device_get_partition_size_fd_stat(fd, &statbuf),
+					      sectorsize);
 		source_dir_size = btrfs_mkfs_size_dir(source_dir, sectorsize,
 				min_dev_size, metadata_profile, data_profile);
 		if (byte_count < source_dir_size) {
 			if (S_ISREG(statbuf.st_mode)) {
-				byte_count = source_dir_size;
+				byte_count = round_up(source_dir_size, sectorsize);
 			} else {
 				warning(
 "the target device %llu (%s) is smaller than the calculated source directory size %llu (%s), mkfs may fail",
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 02/10] btrfs-progs: mkfs: remove duplicated device size check
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 01/10] btrfs-progs: rename block_count to byte_count Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 03/10] btrfs-progs: mkfs: unify zoned mode minimum size calc into btrfs_min_dev_size() Naohiro Aota
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

test_minimum_size() already checks if each device can host the initial
block groups. There is no need to check if the first device can host the
initial system chunk again.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/main.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/mkfs/main.c b/mkfs/main.c
index 950f76101058..f6f67abf3b0e 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1189,7 +1189,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	struct prepare_device_progress *prepare_ctx = NULL;
 	struct mkfs_allocation allocation = { 0 };
 	struct btrfs_mkfs_config mkfs_cfg;
-	u64 system_group_size;
 	/* Options */
 	bool force_overwrite = false;
 	struct btrfs_mkfs_features features = btrfs_mkfs_default_features;
@@ -1770,14 +1769,6 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
-	/* To create the first block group and chunk 0 in make_btrfs */
-	system_group_size = (opt_zoned ? zone_size(file) : BTRFS_MKFS_SYSTEM_GROUP_SIZE);
-	if (dev_byte_count < system_group_size) {
-		error("device is too small to make filesystem, must be at least %llu",
-				system_group_size);
-		goto error;
-	}
-
 	if (btrfs_bg_type_to_tolerated_failures(metadata_profile) <
 	    btrfs_bg_type_to_tolerated_failures(data_profile))
 		warning("metadata has lower redundancy than data!\n");
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 03/10] btrfs-progs: mkfs: unify zoned mode minimum size calc into btrfs_min_dev_size()
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 01/10] btrfs-progs: rename block_count to byte_count Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 02/10] btrfs-progs: mkfs: remove duplicated device size check Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 04/10] btrfs-progs: mkfs: fix minimum size calculation for zoned mode Naohiro Aota
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

We are going to implement a better minimum size calculation for the zoned
mode. Move the current logic to btrfs_min_dev_size() and unify the size
checking path.

Also, convert "int mixed" to "bool mixed" while at it.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/common.c | 11 ++++++++++-
 mkfs/common.h |  2 +-
 mkfs/main.c   | 22 +++++-----------------
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/mkfs/common.c b/mkfs/common.c
index e61020002417..2550c2219c90 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -811,13 +811,22 @@ static u64 btrfs_min_global_blk_rsv_size(u32 nodesize)
 	return (u64)nodesize << 10;
 }
 
-u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile,
+u64 btrfs_min_dev_size(u32 nodesize, bool mixed, u64 zone_size, u64 meta_profile,
 		       u64 data_profile)
 {
 	u64 reserved = 0;
 	u64 meta_size;
 	u64 data_size;
 
+	/*
+	 * 2 zones for the primary superblock
+	 * 1 zone for the system block group
+	 * 1 zone for a metadata block group
+	 * 1 zone for a data block group
+	 */
+	if (zone_size)
+		return 5 * zone_size;
+
 	if (mixed)
 		return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
 			    btrfs_min_global_blk_rsv_size(nodesize));
diff --git a/mkfs/common.h b/mkfs/common.h
index d9183c997bb2..de0ff57beee8 100644
--- a/mkfs/common.h
+++ b/mkfs/common.h
@@ -105,7 +105,7 @@ struct btrfs_mkfs_config {
 int make_btrfs(int fd, struct btrfs_mkfs_config *cfg);
 int btrfs_make_root_dir(struct btrfs_trans_handle *trans,
 			struct btrfs_root *root, u64 objectid);
-u64 btrfs_min_dev_size(u32 nodesize, int mixed, u64 meta_profile,
+u64 btrfs_min_dev_size(u32 nodesize, bool mixed, u64 zone_size, u64 meta_profile,
 		       u64 data_profile);
 int test_minimum_size(const char *file, u64 min_dev_size);
 int is_vol_small(const char *file);
diff --git a/mkfs/main.c b/mkfs/main.c
index f6f67abf3b0e..a437ecc40c7f 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -1588,8 +1588,9 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 		goto error;
 	}
 
-	min_dev_size = btrfs_min_dev_size(nodesize, mixed, metadata_profile,
-					  data_profile);
+	min_dev_size = btrfs_min_dev_size(nodesize, mixed,
+					  opt_zoned ? zone_size(file) : 0,
+					  metadata_profile, data_profile);
 	/*
 	 * Enlarge the destination file or create a new one, using the size
 	 * calculated from source dir.
@@ -1650,21 +1651,8 @@ int BOX_MAIN(mkfs)(int argc, char **argv)
 	if (byte_count && byte_count < min_dev_size) {
 		error("size %llu is too small to make a usable filesystem",
 			byte_count);
-		error("minimum size for btrfs filesystem is %llu",
-			min_dev_size);
-		goto error;
-	}
-	/*
-	 * 2 zones for the primary superblock
-	 * 1 zone for the system block group
-	 * 1 zone for a metadata block group
-	 * 1 zone for a data block group
-	 */
-	if (opt_zoned && byte_count && byte_count < 5 * zone_size(file)) {
-		error("size %llu is too small to make a usable filesystem",
-			byte_count);
-		error("minimum size for a zoned btrfs filesystem is %llu",
-			min_dev_size);
+		error("minimum size for a %sbtrfs filesystem is %llu",
+		      opt_zoned ? "zoned mode " : "", min_dev_size);
 		goto error;
 	}
 
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 04/10] btrfs-progs: mkfs: fix minimum size calculation for zoned mode
  2024-05-22  6:02  2% [PATCH v3 00/10] btrfs-progs: zoned: proper "mkfs.btrfs -b" support Naohiro Aota
                   ` (2 preceding siblings ...)
  2024-05-22  6:02  1% ` [PATCH v3 03/10] btrfs-progs: mkfs: unify zoned mode minimum size calc into btrfs_min_dev_size() Naohiro Aota
@ 2024-05-22  6:02  1% ` Naohiro Aota
  2024-05-22  6:02  1% ` [PATCH v3 05/10] btrfs-progs: mkfs: align byte_count with sectorsize and zone size Naohiro Aota
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  6:02 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Currently, we check if a device is larger than 5 zones to determine whether
we can create btrfs on the device. Actually, we need more zones to create
DUP block groups, so mkfs fails with "ERROR: not enough free space to
allocate chunk". Implement proper support for non-SINGLE profiles.

Also, the current code does not ensure we can create a tree-log BG and a
data relocation BG, which are essential for real usage. Count them as a
requirement too.

The calculation for a regular btrfs is also adjusted to use the dev_stripes
style.
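
As a rough worked example under the new logic (the 256M zone size is an
assumed value for illustration, not part of the patch): DUP metadata with
SINGLE data needs 2 zones (primary superblock) + 3 zones (initial SINGLE
system/meta/data) + 3 * 2 zones (real system, metadata and tree-log BGs at
two stripes each) + 1 zone (data relocation BG) = 12 zones, i.e. 3G. An
all-SINGLE layout reuses the initial block groups and only needs
2 + 3 + 1 + 1 = 7 zones.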

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 mkfs/common.c | 67 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/mkfs/common.c b/mkfs/common.c
index 2550c2219c90..1b09c8b1a673 100644
--- a/mkfs/common.c
+++ b/mkfs/common.c
@@ -817,15 +817,50 @@ u64 btrfs_min_dev_size(u32 nodesize, bool mixed, u64 zone_size, u64 meta_profile
 	u64 reserved = 0;
 	u64 meta_size;
 	u64 data_size;
+	u64 dev_stripes;
 
-	/*
-	 * 2 zones for the primary superblock
-	 * 1 zone for the system block group
-	 * 1 zone for a metadata block group
-	 * 1 zone for a data block group
-	 */
-	if (zone_size)
-		return 5 * zone_size;
+	if (zone_size) {
+		/* 2 zones for the primary superblock. */
+		reserved += 2 * zone_size;
+
+		/*
+		 * 1 zone each for the initial SINGLE system, SINGLE
+		 * metadata, and SINGLE data block group
+		 */
+		reserved += 3 * zone_size;
+
+		/*
+		 * On non-SINGLE profile, we need to add real system and
+		 * metadata block group. And, we also need to add a space
+		 * for a tree-log block group.
+		 *
+		 * SINGLE profile can reuse the initial block groups and
+		 * only need to add a tree-log block group
+		 */
+		dev_stripes = (meta_profile & BTRFS_BLOCK_GROUP_DUP) ? 2 : 1;
+		if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
+			meta_size = 3 * dev_stripes * zone_size;
+		else
+			meta_size = dev_stripes * zone_size;
+		reserved += meta_size;
+
+		/*
+		 * On non-SINGLE profile, we need to add real data block
+		 * group. And, we also need to add a space for a data
+		 * relocation block group.
+		 *
+		 * SINGLE profile can reuse the initial block groups and
+		 * only need to add a data relocation block group.
+		 */
+		dev_stripes = (data_profile & BTRFS_BLOCK_GROUP_DUP) ? 2 : 1;
+		if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
+			data_size = 2 * dev_stripes * zone_size;
+		else
+			data_size = dev_stripes * zone_size;
+		reserved += data_size;
+
+		return reserved;
+	}
 
 	if (mixed)
 		return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
@@ -863,22 +898,18 @@ u64 btrfs_min_dev_size(u32 nodesize, bool mixed, u64 zone_size, u64 meta_profile
 	 *
 	 * And use the stripe size to calculate its physical used space.
 	 */
+	dev_stripes = (meta_profile & BTRFS_BLOCK_GROUP_DUP) ? 2 : 1;
 	if (meta_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
-		meta_size = SZ_8M + SZ_32M;
+		meta_size = dev_stripes * (SZ_8M + SZ_32M);
 	else
-		meta_size = SZ_8M + SZ_8M;
-	/* For DUP/metadata,  2 stripes on one disk */
-	if (meta_profile & BTRFS_BLOCK_GROUP_DUP)
-		meta_size *= 2;
+		meta_size = dev_stripes * (SZ_8M + SZ_8M);
 	reserved += meta_size;
 
+	dev_stripes = (data_profile & BTRFS_BLOCK_GROUP_DUP) ? 2 : 1;
 	if (data_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK)
-		data_size = SZ_64M;
+		data_size = dev_stripes * SZ_64M;
 	else
-		data_size = SZ_8M;
-	/* For DUP/data,  2 stripes on one disk */
-	if (data_profile & BTRFS_BLOCK_GROUP_DUP)
-		data_size *= 2;
+		data_size = dev_stripes * SZ_8M;
 	reserved += data_size;
 
 	return reserved;
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [syzbot] [nilfs?] [btrfs?] WARNING in filemap_unaccount_folio
@ 2024-05-22  2:55  1% syzbot
  0 siblings, 0 replies; 200+ results
From: syzbot @ 2024-05-22  2:55 UTC (permalink / raw)
  To: brauner, clm, dsterba, jack, josef, konishi.ryusuke, linux-btrfs,
	linux-fsdevel, linux-kernel, linux-nilfs, syzkaller-bugs, viro

Hello,

syzbot found the following issue on:

HEAD commit:    b6394d6f7159 Merge tag 'pull-misc' of git://git.kernel.org..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=142a7cb2980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=713476114e57eef3
dashboard link: https://syzkaller.appspot.com/bug?extid=026119922c20a8915631
compiler:       Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=14d43f84980000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11d4fadc980000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/e8e1377d4772/disk-b6394d6f.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/19fbbb3b6dd5/vmlinux-b6394d6f.xz
kernel image: https://storage.googleapis.com/syzbot-assets/4dcce16af95d/bzImage-b6394d6f.xz
mounted in repro #1: https://storage.googleapis.com/syzbot-assets/e197bb1019a1/mount_0.gz
mounted in repro #2: https://storage.googleapis.com/syzbot-assets/1c62d475ecf4/mount_2.gz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+026119922c20a8915631@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 1 PID: 5096 at mm/filemap.c:217 filemap_unaccount_folio+0x6be/0xe40 mm/filemap.c:216
Modules linked in:
CPU: 1 PID: 5096 Comm: syz-executor306 Not tainted 6.9.0-syzkaller-10729-gb6394d6f7159 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
RIP: 0010:filemap_unaccount_folio+0x6be/0xe40 mm/filemap.c:216
Code: 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df 0f b6 04 08 84 c0 0f 85 e5 00 00 00 8b 6d 00 ff c5 e9 45 fa ff ff e8 c3 66 ca ff 90 <0f> 0b 90 48 b8 00 00 00 00 00 fc ff df 41 80 3c 06 00 74 0a 48 8b
RSP: 0018:ffffc9000382f1f8 EFLAGS: 00010093
RAX: ffffffff81cbd3ad RBX: ffff888079ef0380 RCX: ffff88802d4f5a00
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: 0000000000000003 R08: ffffffff81cbd2c9 R09: 1ffffd40000c1ec8
R10: dffffc0000000000 R11: fffff940000c1ec9 R12: 1ffffd40000c1ec8
R13: ffffea000060f640 R14: 1ffff1100f3de070 R15: ffffea000060f648
FS:  00007f13ab0c76c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000002ca92000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 delete_from_page_cache_batch+0x173/0xc70 mm/filemap.c:341
 truncate_inode_pages_range+0x364/0xfc0 mm/truncate.c:359
 truncate_inode_pages mm/truncate.c:439 [inline]
 truncate_pagecache mm/truncate.c:732 [inline]
 truncate_setsize+0xcf/0xf0 mm/truncate.c:757
 simple_setattr+0xbe/0x110 fs/libfs.c:886
 notify_change+0xbb4/0xe70 fs/attr.c:499
 do_truncate+0x220/0x310 fs/open.c:65
 handle_truncate fs/namei.c:3308 [inline]
 do_open fs/namei.c:3654 [inline]
 path_openat+0x2a3d/0x3280 fs/namei.c:3807
 do_filp_open+0x235/0x490 fs/namei.c:3834
 do_sys_openat2+0x13e/0x1d0 fs/open.c:1405
 do_sys_open fs/open.c:1420 [inline]
 __do_sys_creat fs/open.c:1496 [inline]
 __se_sys_creat fs/open.c:1490 [inline]
 __x64_sys_creat+0x123/0x170 fs/open.c:1490
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f13ab131c99
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f13ab0c7198 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f13ab1bf6d8 RCX: 00007f13ab131c99
RDX: 00007f13ab131c99 RSI: 0000000000000000 RDI: 00000000200001c0
RBP: 00007f13ab1bf6d0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f13ab18c160
R13: 000000000000006e R14: 0030656c69662f2e R15: 00007f13ab186bc0
 </TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 2/2] btrfs: reserve new relocation block-group after successful relocation
  2024-05-21 14:58  1% ` [PATCH v3 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
@ 2024-05-22  1:17  1%   ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  1:17 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Johannes Thumshirn

On Tue, May 21, 2024 at 04:58:08PM GMT, Johannes Thumshirn wrote:
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> After we've committed a relocation transaction, we know we have just freed
> up space. Set it as hint for the next relocation.
> 
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---

Looks good.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  2024-05-21 22:16  1%         ` Qu Wenruo
@ 2024-05-22  1:10  1%           ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-22  1:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Qu Wenruo, linux-btrfs, Johannes Thumshirn, josef

On Wed, May 22, 2024 at 07:46:46AM GMT, Qu Wenruo wrote:
> 
> 
> On 2024/5/21 21:24, Naohiro Aota wrote:
> > On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
> > > 
> > > 
> > > On 2024/5/21 17:41, Naohiro Aota wrote:
> > > [...]
> > > > Same here.
> > > > 
> > > > >    	while (delalloc_start < page_end) {
> > > > >    		delalloc_end = page_end;
> > > > >    		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > > > >    			delalloc_start = delalloc_end + 1;
> > > > >    			continue;
> > > > >    		}
> > > > > -
> > > > > -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > > > -					       delalloc_end, wbc);
> > > > > -		if (ret < 0)
> > > > > -			return ret;
> > > > > -
> > > > > +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > > > +					    min(delalloc_end, page_end) + 1 -
> > > > > +					    delalloc_start);
> > > > > +		last_delalloc_end = delalloc_end;
> > > > >    		delalloc_start = delalloc_end + 1;
> > > > >    	}
> > > > 
> > > > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > > > following code simpler.
> > > 
> > > Mind to explain it a little more?
> > > 
> > > Did you mean something like this:
> > > 
> > > 	while (delalloc_start < page_end) {
> > > 		/* lock all subpage delalloc range code */
> > > 	}
> > > 	if (!last_delalloc_end)
> > > 		goto finish;
> > > 	while (delalloc_start < page_end) {
> > > 		/* run the delalloc ranges code* /
> > > 	}
> > > 
> > > If so, I can definitely go that way.
> > 
> > Yes, I meant that way. Apparently, "!last_delalloc_end" means it got no
> > delalloc region. So, we can just return 0 in that case without touching
> > "wbc->nr_to_write" as the current code does.
> > 
> > BTW, is this actually an overlooked error case? Is it OK to progress in
> > __extent_writepage() even if we don't run run_delalloc_range() ?
> 
> That's totally expected, and it would even be more common in fact.
> 
> Consider a very ordinary case like this:
> 
>    0             4K              8K            12K
>    |/////////////|///////////////|/////////////|
> 
> When running extent_writepage() for page 0, we run delalloc range for
> the whole [0, 12K) range, and created an OE for it.
> Then __extent_writepage_io() add page range [0, 4k) for bio.
> 
> Then extent_writepage() for page 4K, find_lock_delalloc() would not find
> any range, as previous iteration at page 0 has already created OE for
> the whole [0, 12K) range.
> 
> Although we would still run __extent_writepage_io() to add page range
> [4k, 8K) to the bio.
> 
> The same for page 8K.
> 
> Thanks,
> Qu

Ah, yes, that's true. I forgot about the case of the following pages. Thank
you for your explanation.

Regards,

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  2024-05-21 11:54  1%       ` Naohiro Aota
@ 2024-05-21 22:16  1%         ` Qu Wenruo
  2024-05-22  1:10  1%           ` Naohiro Aota
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-21 22:16 UTC (permalink / raw)
  To: Naohiro Aota, Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef



On 2024/5/21 21:24, Naohiro Aota wrote:
> On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
>>
>>
>> On 2024/5/21 17:41, Naohiro Aota wrote:
>> [...]
>>> Same here.
>>>
>>>>    	while (delalloc_start < page_end) {
>>>>    		delalloc_end = page_end;
>>>>    		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>>>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>>>    			delalloc_start = delalloc_end + 1;
>>>>    			continue;
>>>>    		}
>>>> -
>>>> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>>>> -					       delalloc_end, wbc);
>>>> -		if (ret < 0)
>>>> -			return ret;
>>>> -
>>>> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>>>> +					    min(delalloc_end, page_end) + 1 -
>>>> +					    delalloc_start);
>>>> +		last_delalloc_end = delalloc_end;
>>>>    		delalloc_start = delalloc_end + 1;
>>>>    	}
>>>
>>> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
>>> following code simpler.
>>
>> Mind to explain it a little more?
>>
>> Did you mean something like this:
>>
>> 	while (delalloc_start < page_end) {
>> 		/* lock all subpage delalloc range code */
>> 	}
>> 	if (!last_delalloc_end)
>> 		goto finish;
>> 	while (delalloc_start < page_end) {
>> 		/* run the delalloc ranges code* /
>> 	}
>>
>> If so, I can definitely go that way.
>
> Yes, I meant that way. Apparently, "!last_delalloc_end" means it got no
> delalloc region. So, we can just return 0 in that case without touching
> "wbc->nr_to_write" as the current code does.
>
> BTW, is this actually an overlooked error case? Is it OK to progress in
> __extent_writepage() even if we don't run run_delalloc_range() ?

That's totally expected, and it would even be more common in fact.

Consider a very ordinary case like this:

    0             4K              8K            12K
    |/////////////|///////////////|/////////////|

When running extent_writepage() for page 0, we run delalloc range for
the whole [0, 12K) range, and created an OE for it.
Then __extent_writepage_io() add page range [0, 4k) for bio.

Then extent_writepage() for page 4K, find_lock_delalloc() would not find
any range, as previous iteration at page 0 has already created OE for
the whole [0, 12K) range.

Although we would still run __extent_writepage_io() to add page range
[4k, 8K) to the bio.

The same for page 8K.

Thanks,
Qu

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v4 0/6] part3 trivial adjustments for return variable coding style
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
                   ` (5 preceding siblings ...)
  2024-05-21 17:11  1% ` [PATCH v4 6/6] btrfs: rename err to ret in btrfs_find_orphan_roots() Anand Jain
@ 2024-05-21 18:10  1% ` David Sterba
  2024-05-23 17:18  1%   ` David Sterba
  6 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-21 18:10 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Wed, May 22, 2024 at 01:11:06AM +0800, Anand Jain wrote:
> This is v4 of part 3 of the series, containing renaming with optimization of the
> return variable.
> 
> v3 part3:
>   https://lore.kernel.org/linux-btrfs/cover.1715783315.git.anand.jain@oracle.com/
> v2 part2:
>   https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
> v1:
>   https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
> 
> Anand Jain (6):
>   btrfs: rename err to ret in btrfs_cleanup_fs_roots()
>   btrfs: rename ret to err in btrfs_recover_relocation()
>   btrfs: rename ret to ret2 in btrfs_recover_relocation()
>   btrfs: rename err to ret in btrfs_recover_relocation()
>   btrfs: rename err to ret in btrfs_drop_snapshot()
>   btrfs: rename err to ret in btrfs_find_orphan_roots()

1-5 look ok to me, for patch 6 there's the ret = 0 reset question sent
to v3.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 6/6] btrfs: rename and optimize return variable in btrfs_find_orphan_roots
  2024-05-21 17:10  1%     ` Anand Jain
@ 2024-05-21 17:59  1%       ` David Sterba
  2024-05-23 14:35  1%         ` Anand Jain
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-21 17:59 UTC (permalink / raw)
  To: Anand Jain; +Cc: dsterba, linux-btrfs

On Wed, May 22, 2024 at 01:10:08AM +0800, Anand Jain wrote:
> 
> 
> On 5/21/24 23:18, David Sterba wrote:
> > On Thu, May 16, 2024 at 07:12:15PM +0800, Anand Jain wrote:
> >> The variable err is the actual return value of this function, and the
> >> variable ret is a helper variable for err, which actually is not
> >> needed and can be handled just by err, which is renamed to ret.
> >>
> >> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> >> ---
> >> v3: drop ret2 as there is no need for it.
> >> v2: n/a
> >>   fs/btrfs/root-tree.c | 32 ++++++++++++++++----------------
> >>   1 file changed, 16 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> >> index 33962671a96c..c11b0bccf513 100644
> >> --- a/fs/btrfs/root-tree.c
> >> +++ b/fs/btrfs/root-tree.c
> >> @@ -220,8 +220,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
> >>   	struct btrfs_path *path;
> >>   	struct btrfs_key key;
> >>   	struct btrfs_root *root;
> >> -	int err = 0;
> >> -	int ret;
> >> +	int ret = 0;
> >>   
> >>   	path = btrfs_alloc_path();
> >>   	if (!path)
> >> @@ -235,18 +234,19 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
> >>   		u64 root_objectid;
> >>   
> >>   		ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
> >> -		if (ret < 0) {
> >> -			err = ret;
> >> +		if (ret < 0)
> >>   			break;
> >> -		}
> >> +		ret = 0;
> > 
> > Should this be handled when ret > 0? This would be unexpected and
> > probably means a corruption but simply overwriting the value does not
> > seem right.
> > 
> 
> Agreed.
> 
> +               if (ret > 0)
> +                       ret = 0;
> 
> is much neater.

That's not what I meant. When btrfs_search_slot() returns 1, the key was
not found but could be inserted; path points to the slot. This is handled
in many other places, so it should also be handled in the orphan root
lookup.
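
For reference, the common caller pattern looks roughly like this (a sketch
only, not the final patch):

	ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
	if (ret < 0)
		break;	/* hard error */
	if (ret > 0) {
		/*
		 * The key was not found; path->slots[0] points at the
		 * slot where it could be inserted. Decide explicitly
		 * whether this means "nothing to do" or corruption.
		 */
	}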

^ permalink raw reply	[relevance 1%]

* [PATCH v4 6/6] btrfs: rename err to ret in btrfs_find_orphan_roots()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
                   ` (4 preceding siblings ...)
  2024-05-21 17:11  1% ` [PATCH v4 5/6] btrfs: rename err to ret in btrfs_drop_snapshot() Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 18:10  1% ` [PATCH v4 0/6] part3 trivial adjustments for return variable coding style David Sterba
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

The variable err is the actual return value of this function, while ret
is only a helper for err. The helper is not actually needed: everything
can be handled by err alone, which is then renamed to ret.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: Move ret = 0 under if (ret > 0)
    Title changed
v3: drop ret2 as there is no need for it.
v2: n/a

 fs/btrfs/root-tree.c | 33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
index 33962671a96c..c18915a76d9d 100644
--- a/fs/btrfs/root-tree.c
+++ b/fs/btrfs/root-tree.c
@@ -220,8 +220,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
 	struct btrfs_path *path;
 	struct btrfs_key key;
 	struct btrfs_root *root;
-	int err = 0;
-	int ret;
+	int ret = 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -235,18 +234,20 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
 		u64 root_objectid;
 
 		ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
-		if (ret < 0) {
-			err = ret;
+		if (ret < 0)
 			break;
-		}
+		if (ret > 0)
+			ret = 0;
 
 		leaf = path->nodes[0];
 		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
 			ret = btrfs_next_leaf(tree_root, path);
 			if (ret < 0)
-				err = ret;
-			if (ret != 0)
 				break;
+			if (ret > 0) {
+				ret = 0;
+				break;
+			}
 			leaf = path->nodes[0];
 		}
 
@@ -261,26 +262,26 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
 		key.offset++;
 
 		root = btrfs_get_fs_root(fs_info, root_objectid, false);
-		err = PTR_ERR_OR_ZERO(root);
-		if (err && err != -ENOENT) {
+		ret = PTR_ERR_OR_ZERO(root);
+		if (ret && ret != -ENOENT) {
 			break;
-		} else if (err == -ENOENT) {
+		} else if (ret == -ENOENT) {
 			struct btrfs_trans_handle *trans;
 
 			btrfs_release_path(path);
 
 			trans = btrfs_join_transaction(tree_root);
 			if (IS_ERR(trans)) {
-				err = PTR_ERR(trans);
-				btrfs_handle_fs_error(fs_info, err,
+				ret = PTR_ERR(trans);
+				btrfs_handle_fs_error(fs_info, ret,
 					    "Failed to start trans to delete orphan item");
 				break;
 			}
-			err = btrfs_del_orphan_item(trans, tree_root,
+			ret = btrfs_del_orphan_item(trans, tree_root,
 						    root_objectid);
 			btrfs_end_transaction(trans);
-			if (err) {
-				btrfs_handle_fs_error(fs_info, err,
+			if (ret) {
+				btrfs_handle_fs_error(fs_info, ret,
 					    "Failed to delete root orphan item");
 				break;
 			}
@@ -311,7 +312,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
 	}
 
 	btrfs_free_path(path);
-	return err;
+	return ret;
 }
 
 /* drop the root item for 'key' from the tree root */
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 5/6] btrfs: rename err to ret in btrfs_drop_snapshot()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
                   ` (3 preceding siblings ...)
  2024-05-21 17:11  1% ` [PATCH v4 4/6] btrfs: rename err to ret " Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 6/6] btrfs: rename err to ret in btrfs_find_orphan_roots() Anand Jain
  2024-05-21 18:10  1% ` [PATCH v4 0/6] part3 trivial adjustments for return variable coding style David Sterba
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

Drop the variable 'err' and reuse the variable 'ret', reinitializing it
to zero where necessary.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: Title changed.
v3: Fix comment formatting.
v2: handle return error better, no need of original 'ret'. (Josef).
 fs/btrfs/extent-tree.c | 48 +++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3774c191e36d..5aa7c8a0dbc6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5833,8 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	struct btrfs_root_item *root_item = &root->root_item;
 	struct walk_control *wc;
 	struct btrfs_key key;
-	int err = 0;
-	int ret;
+	int ret = 0;
 	int level;
 	bool root_dropped = false;
 	bool unfinished_drop = false;
@@ -5843,14 +5842,14 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 
 	path = btrfs_alloc_path();
 	if (!path) {
-		err = -ENOMEM;
+		ret = -ENOMEM;
 		goto out;
 	}
 
 	wc = kzalloc(sizeof(*wc), GFP_NOFS);
 	if (!wc) {
 		btrfs_free_path(path);
-		err = -ENOMEM;
+		ret = -ENOMEM;
 		goto out;
 	}
 
@@ -5863,12 +5862,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	else
 		trans = btrfs_start_transaction(tree_root, 0);
 	if (IS_ERR(trans)) {
-		err = PTR_ERR(trans);
+		ret = PTR_ERR(trans);
 		goto out_free;
 	}
 
-	err = btrfs_run_delayed_items(trans);
-	if (err)
+	ret = btrfs_run_delayed_items(trans);
+	if (ret)
 		goto out_end_trans;
 
 	/*
@@ -5899,11 +5898,11 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		path->lowest_level = level;
 		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
 		path->lowest_level = 0;
-		if (ret < 0) {
-			err = ret;
+		if (ret < 0)
 			goto out_end_trans;
-		}
+
 		WARN_ON(ret > 0);
+		ret = 0;
 
 		/*
 		 * unlock our path, this is safe because only this
@@ -5916,14 +5915,17 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 			btrfs_tree_lock(path->nodes[level]);
 			path->locks[level] = BTRFS_WRITE_LOCK;
 
+			/*
+			 * btrfs_lookup_extent_info() returns 0 for success,
+			 * or < 0 for error.
+			 */
 			ret = btrfs_lookup_extent_info(trans, fs_info,
 						path->nodes[level]->start,
 						level, 1, &wc->refs[level],
 						&wc->flags[level], NULL);
-			if (ret < 0) {
-				err = ret;
+			if (ret < 0)
 				goto out_end_trans;
-			}
+
 			BUG_ON(wc->refs[level] == 0);
 
 			if (level == btrfs_root_drop_level(root_item))
@@ -5949,19 +5951,18 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 		ret = walk_down_tree(trans, root, path, wc);
 		if (ret < 0) {
 			btrfs_abort_transaction(trans, ret);
-			err = ret;
 			break;
 		}
 
 		ret = walk_up_tree(trans, root, path, wc, BTRFS_MAX_LEVEL);
 		if (ret < 0) {
 			btrfs_abort_transaction(trans, ret);
-			err = ret;
 			break;
 		}
 
 		if (ret > 0) {
 			BUG_ON(wc->stage != DROP_REFERENCE);
+			ret = 0;
 			break;
 		}
 
@@ -5983,7 +5984,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 						root_item);
 			if (ret) {
 				btrfs_abort_transaction(trans, ret);
-				err = ret;
 				goto out_end_trans;
 			}
 
@@ -5994,7 +5994,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 			if (!for_reloc && btrfs_need_cleaner_sleep(fs_info)) {
 				btrfs_debug(fs_info,
 					    "drop snapshot early exit");
-				err = -EAGAIN;
+				ret = -EAGAIN;
 				goto out_free;
 			}
 
@@ -6008,19 +6008,18 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 			else
 				trans = btrfs_start_transaction(tree_root, 0);
 			if (IS_ERR(trans)) {
-				err = PTR_ERR(trans);
+				ret = PTR_ERR(trans);
 				goto out_free;
 			}
 		}
 	}
 	btrfs_release_path(path);
-	if (err)
+	if (ret)
 		goto out_end_trans;
 
 	ret = btrfs_del_root(trans, &root->root_key);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
-		err = ret;
 		goto out_end_trans;
 	}
 
@@ -6029,10 +6028,11 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 				      NULL, NULL);
 		if (ret < 0) {
 			btrfs_abort_transaction(trans, ret);
-			err = ret;
 			goto out_end_trans;
 		} else if (ret > 0) {
-			/* if we fail to delete the orphan item this time
+			ret = 0;
+			/*
+			 * If we fail to delete the orphan item this time
 			 * around, it'll get picked up the next time.
 			 *
 			 * The most common failure here is just -ENOENT.
@@ -6067,7 +6067,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	 * We were an unfinished drop root, check to see if there are any
 	 * pending, and if not clear and wake up any waiters.
 	 */
-	if (!err && unfinished_drop)
+	if (!ret && unfinished_drop)
 		btrfs_maybe_wake_unfinished_drop(fs_info);
 
 	/*
@@ -6079,7 +6079,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	 */
 	if (!for_reloc && !root_dropped)
 		btrfs_add_dead_root(root);
-	return err;
+	return ret;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 4/6] btrfs: rename err to ret in btrfs_recover_relocation()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
                   ` (2 preceding siblings ...)
  2024-05-21 17:11  1% ` [PATCH v4 3/6] btrfs: rename ret to ret2 " Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 5/6] btrfs: rename err to ret in btrfs_drop_snapshot() Anand Jain
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

Fix coding style: in the function btrfs_recover_relocation(), rename the
return variable from 'err' to 'ret'.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: title changed
v3: new
 fs/btrfs/relocation.c | 56 +++++++++++++++++++++----------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d621fdbf59f3..cd3f4c686e5f 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4222,7 +4222,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 	struct reloc_control *rc = NULL;
 	struct btrfs_trans_handle *trans;
 	int ret2;
-	int err = 0;
+	int ret = 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -4234,16 +4234,16 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 	key.offset = (u64)-1;
 
 	while (1) {
-		err = btrfs_search_slot(NULL, fs_info->tree_root, &key,
+		ret = btrfs_search_slot(NULL, fs_info->tree_root, &key,
 					path, 0, 0);
-		if (err < 0)
+		if (ret < 0)
 			goto out;
-		if (err > 0) {
+		if (ret > 0) {
 			if (path->slots[0] == 0)
 				break;
 			path->slots[0]--;
 		}
-		err = 0;
+		ret = 0;
 		leaf = path->nodes[0];
 		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
 		btrfs_release_path(path);
@@ -4254,7 +4254,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 		reloc_root = btrfs_read_tree_root(fs_info->tree_root, &key);
 		if (IS_ERR(reloc_root)) {
-			err = PTR_ERR(reloc_root);
+			ret = PTR_ERR(reloc_root);
 			goto out;
 		}
 
@@ -4265,13 +4265,13 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 			fs_root = btrfs_get_fs_root(fs_info,
 					reloc_root->root_key.offset, false);
 			if (IS_ERR(fs_root)) {
-				err = PTR_ERR(fs_root);
-				if (err != -ENOENT)
+				ret = PTR_ERR(fs_root);
+				if (ret != -ENOENT)
 					goto out;
-				err = mark_garbage_root(reloc_root);
-				if (err < 0)
+				ret = mark_garbage_root(reloc_root);
+				if (ret < 0)
 					goto out;
-				err = 0;
+				ret = 0;
 			} else {
 				btrfs_put_root(fs_root);
 			}
@@ -4289,12 +4289,12 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	rc = alloc_reloc_control(fs_info);
 	if (!rc) {
-		err = -ENOMEM;
+		ret = -ENOMEM;
 		goto out;
 	}
 
-	err = reloc_chunk_start(fs_info);
-	if (err < 0)
+	ret = reloc_chunk_start(fs_info);
+	if (ret < 0)
 		goto out_end;
 
 	rc->extent_root = btrfs_extent_root(fs_info, 0);
@@ -4303,7 +4303,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	trans = btrfs_join_transaction(rc->extent_root);
 	if (IS_ERR(trans)) {
-		err = PTR_ERR(trans);
+		ret = PTR_ERR(trans);
 		goto out_unset;
 	}
 
@@ -4323,15 +4323,15 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 		fs_root = btrfs_get_fs_root(fs_info, reloc_root->root_key.offset,
 					    false);
 		if (IS_ERR(fs_root)) {
-			err = PTR_ERR(fs_root);
+			ret = PTR_ERR(fs_root);
 			list_add_tail(&reloc_root->root_list, &reloc_roots);
 			btrfs_end_transaction(trans);
 			goto out_unset;
 		}
 
-		err = __add_reloc_root(reloc_root);
-		ASSERT(err != -EEXIST);
-		if (err) {
+		ret = __add_reloc_root(reloc_root);
+		ASSERT(ret != -EEXIST);
+		if (ret) {
 			list_add_tail(&reloc_root->root_list, &reloc_roots);
 			btrfs_put_root(fs_root);
 			btrfs_end_transaction(trans);
@@ -4341,8 +4341,8 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 		btrfs_put_root(fs_root);
 	}
 
-	err = btrfs_commit_transaction(trans);
-	if (err)
+	ret = btrfs_commit_transaction(trans);
+	if (ret)
 		goto out_unset;
 
 	merge_reloc_roots(rc);
@@ -4351,14 +4351,14 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	trans = btrfs_join_transaction(rc->extent_root);
 	if (IS_ERR(trans)) {
-		err = PTR_ERR(trans);
+		ret = PTR_ERR(trans);
 		goto out_clean;
 	}
-	err = btrfs_commit_transaction(trans);
+	ret = btrfs_commit_transaction(trans);
 out_clean:
 	ret2 = clean_dirty_subvols(rc);
-	if (ret2 < 0 && !err)
-		err = ret2;
+	if (ret2 < 0 && !ret)
+		ret = ret2;
 out_unset:
 	unset_reloc_control(rc);
 out_end:
@@ -4369,14 +4369,14 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 
 	btrfs_free_path(path);
 
-	if (err == 0) {
+	if (ret == 0) {
 		/* cleanup orphan inode in data relocation tree */
 		fs_root = btrfs_grab_root(fs_info->data_reloc_root);
 		ASSERT(fs_root);
-		err = btrfs_orphan_cleanup(fs_root);
+		ret = btrfs_orphan_cleanup(fs_root);
 		btrfs_put_root(fs_root);
 	}
-	return err;
+	return ret;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 3/6] btrfs: rename ret to ret2 in btrfs_recover_relocation()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 1/6] btrfs: rename err to ret in btrfs_cleanup_fs_roots() Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 2/6] btrfs: rename ret to err in btrfs_recover_relocation() Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 4/6] btrfs: rename err to ret " Anand Jain
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

A preparatory patch to rename 'err' to 'ret'. Since 'ret' is already used
as an intermediary return value, first rename 'ret' to 'ret2'.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: title changed
v3: new
 fs/btrfs/relocation.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d0352077f0fc..d621fdbf59f3 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4221,7 +4221,7 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 	struct extent_buffer *leaf;
 	struct reloc_control *rc = NULL;
 	struct btrfs_trans_handle *trans;
-	int ret;
+	int ret2;
 	int err = 0;
 
 	path = btrfs_alloc_path();
@@ -4356,9 +4356,9 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 	}
 	err = btrfs_commit_transaction(trans);
 out_clean:
-	ret = clean_dirty_subvols(rc);
-	if (ret < 0 && !err)
-		err = ret;
+	ret2 = clean_dirty_subvols(rc);
+	if (ret2 < 0 && !err)
+		err = ret2;
 out_unset:
 	unset_reloc_control(rc);
 out_end:
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 2/6] btrfs: rename ret to err in btrfs_recover_relocation()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 1/6] btrfs: rename err to ret in btrfs_cleanup_fs_roots() Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 3/6] btrfs: rename ret to ret2 " Anand Jain
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

In the function btrfs_recover_relocation(), the variable 'err' currently
carries the return value and 'ret' holds the intermediary return value.
However, in some places we don't need this two-step approach; we can use
'err' directly. So optimize those places, which requires reinitializing
'err' to zero at two locations.

This is a preparatory patch to fix the code style by renaming 'err'
to 'ret'.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: title changed
v3: splits optimization part from the rename part.

 fs/btrfs/relocation.c | 28 +++++++++++-----------------
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8ce337ec033c..d0352077f0fc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4234,17 +4234,16 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 	key.offset = (u64)-1;
 
 	while (1) {
-		ret = btrfs_search_slot(NULL, fs_info->tree_root, &key,
+		err = btrfs_search_slot(NULL, fs_info->tree_root, &key,
 					path, 0, 0);
-		if (ret < 0) {
-			err = ret;
+		if (err < 0)
 			goto out;
-		}
-		if (ret > 0) {
+		if (err > 0) {
 			if (path->slots[0] == 0)
 				break;
 			path->slots[0]--;
 		}
+		err = 0;
 		leaf = path->nodes[0];
 		btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
 		btrfs_release_path(path);
@@ -4266,16 +4265,13 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 			fs_root = btrfs_get_fs_root(fs_info,
 					reloc_root->root_key.offset, false);
 			if (IS_ERR(fs_root)) {
-				ret = PTR_ERR(fs_root);
-				if (ret != -ENOENT) {
-					err = ret;
+				err = PTR_ERR(fs_root);
+				if (err != -ENOENT)
 					goto out;
-				}
-				ret = mark_garbage_root(reloc_root);
-				if (ret < 0) {
-					err = ret;
+				err = mark_garbage_root(reloc_root);
+				if (err < 0)
 					goto out;
-				}
+				err = 0;
 			} else {
 				btrfs_put_root(fs_root);
 			}
@@ -4297,11 +4293,9 @@ int btrfs_recover_relocation(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
-	ret = reloc_chunk_start(fs_info);
-	if (ret < 0) {
-		err = ret;
+	err = reloc_chunk_start(fs_info);
+	if (err < 0)
 		goto out_end;
-	}
 
 	rc->extent_root = btrfs_extent_root(fs_info, 0);
 
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 1/6] btrfs: rename err to ret in btrfs_cleanup_fs_roots()
  2024-05-21 17:11  1% [PATCH v4 0/6] part3 trivial adjustments for return variable coding style Anand Jain
@ 2024-05-21 17:11  1% ` Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 2/6] btrfs: rename ret to err in btrfs_recover_relocation() Anand Jain
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

Since err represents the function return value, rename it to ret, and
rename the original ret, which serves as a helper return value, to found.
Also, optimize the code to continue calling btrfs_put_root() for the rest
of the roots even if btrfs_orphan_cleanup() returns an error.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
v4: localize variable i in the for()
v3: Add a code comment.
v2: Rename to 'found' instead of 'ret2' (Josef).
    Call btrfs_put_root() in the while-loop, avoids use of the variable
        'found' outside of the while loop (Qu).
    Use 'unsigned int i' instead of 'int' (Goffredo).

 fs/btrfs/disk-io.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 94b95836f61f..1f744bd6b785 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2914,22 +2914,22 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
 {
 	u64 root_objectid = 0;
 	struct btrfs_root *gang[8];
-	int i = 0;
-	int err = 0;
-	unsigned int ret = 0;
+	int ret = 0;
 
 	while (1) {
+		unsigned int found;
+
 		spin_lock(&fs_info->fs_roots_radix_lock);
-		ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+		found = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
 					     (void **)gang, root_objectid,
 					     ARRAY_SIZE(gang));
-		if (!ret) {
+		if (!found) {
 			spin_unlock(&fs_info->fs_roots_radix_lock);
 			break;
 		}
-		root_objectid = btrfs_root_id(gang[ret - 1]) + 1;
+		root_objectid = btrfs_root_id(gang[found - 1]) + 1;
 
-		for (i = 0; i < ret; i++) {
+		for (int i = 0; i < found; i++) {
 			/* Avoid to grab roots in dead_roots. */
 			if (btrfs_root_refs(&gang[i]->root_item) == 0) {
 				gang[i] = NULL;
@@ -2940,24 +2940,25 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
 		}
 		spin_unlock(&fs_info->fs_roots_radix_lock);
 
-		for (i = 0; i < ret; i++) {
+		for (int i = 0; i < found; i++) {
 			if (!gang[i])
 				continue;
 			root_objectid = btrfs_root_id(gang[i]);
-			err = btrfs_orphan_cleanup(gang[i]);
-			if (err)
-				goto out;
+			/*
+			 * Continue to release the remaining roots after the first
+			 * error without cleanup and preserve the first error
+			 * for the return.
+			 */
+			if (!ret)
+				ret = btrfs_orphan_cleanup(gang[i]);
 			btrfs_put_root(gang[i]);
 		}
+		if (ret)
+			break;
+
 		root_objectid++;
 	}
-out:
-	/* Release the uncleaned roots due to error. */
-	for (; i < ret; i++) {
-		if (gang[i])
-			btrfs_put_root(gang[i]);
-	}
-	return err;
+	return ret;
 }
 
 /*
-- 
2.41.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 0/6] part3 trivial adjustments for return variable coding style
@ 2024-05-21 17:11  1% Anand Jain
  2024-05-21 17:11  1% ` [PATCH v4 1/6] btrfs: rename err to ret in btrfs_cleanup_fs_roots() Anand Jain
                   ` (6 more replies)
  0 siblings, 7 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:11 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Anand Jain

This is v4 of part 3 of the series, containing renaming with optimization of the
return variable.

v3 part3:
  https://lore.kernel.org/linux-btrfs/cover.1715783315.git.anand.jain@oracle.com/
v2 part2:
  https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
v1:
  https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/

Anand Jain (6):
  btrfs: rename err to ret in btrfs_cleanup_fs_roots()
  btrfs: rename ret to err in btrfs_recover_relocation()
  btrfs: rename ret to ret2 in btrfs_recover_relocation()
  btrfs: rename err to ret in btrfs_recover_relocation()
  btrfs: rename err to ret in btrfs_drop_snapshot()
  btrfs: rename err to ret in btrfs_find_orphan_roots()

 fs/btrfs/disk-io.c     | 37 ++++++++++++++--------------
 fs/btrfs/extent-tree.c | 48 ++++++++++++++++++------------------
 fs/btrfs/relocation.c  | 56 +++++++++++++++++++-----------------------
 fs/btrfs/root-tree.c   | 33 +++++++++++++------------
 4 files changed, 85 insertions(+), 89 deletions(-)

-- 
2.41.0


^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 0/6] part3 trivial adjustments for return variable coding style
  2024-05-21 15:21  2% ` David Sterba
@ 2024-05-21 17:10  1%   ` Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:10 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 5/21/24 23:21, David Sterba wrote:
> On Thu, May 16, 2024 at 07:12:09PM +0800, Anand Jain wrote:
>> This is part 3 of the series, containing renaming with optimization of the
>> return variable.
>>
>> Some of the patches are new; they weren't part of v1 or v2. The new patches follow
>> a verb-first format for titles. Older patches are not renamed, for backward reference.
>>
>> Patchset passed tests -g quick without regressions, sending them first.
>>
>> Patch 3/6 and 4/6 can be merged; they are separated for easier diff.
> 
> Splitting the patches like this might seem strange, but reviewing the changes
> individually is indeed a bit easier, so you can keep it like that.
> 
>> v2 part2:
>>    https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
>> v1:
>>    https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
>>
>> Anand Jain (6):
>>    btrfs: btrfs_cleanup_fs_roots handle ret variable
>>    btrfs: simplify ret in btrfs_recover_relocation
>>    btrfs: rename ret in btrfs_recover_relocation
>>    btrfs: rename err in btrfs_recover_relocation
>>    btrfs: btrfs_drop_snapshot optimize return variable
>>    btrfs: rename and optimize return variable in btrfs_find_orphan_roots
> 
> I've edited the subject lines from the previous series, please have a
> look and copy the subjects when the kind of change is the same in the
> patch. Also use the () when a function is mentioned in the subject.
> Thanks.

yep. I have updated the patch titles in v4.

Thx. Anand

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 6/6] btrfs: rename and optimize return variable in btrfs_find_orphan_roots
  2024-05-21 15:18  1%   ` David Sterba
@ 2024-05-21 17:10  1%     ` Anand Jain
  2024-05-21 17:59  1%       ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Anand Jain @ 2024-05-21 17:10 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 5/21/24 23:18, David Sterba wrote:
> On Thu, May 16, 2024 at 07:12:15PM +0800, Anand Jain wrote:
>> The variable err is the actual return value of this function, and the
>> variable ret is a helper variable for err, which actually is not
>> needed and can be handled just by err, which is renamed to ret.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>> v3: drop ret2 as there is no need for it.
>> v2: n/a
>>   fs/btrfs/root-tree.c | 32 ++++++++++++++++----------------
>>   1 file changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
>> index 33962671a96c..c11b0bccf513 100644
>> --- a/fs/btrfs/root-tree.c
>> +++ b/fs/btrfs/root-tree.c
>> @@ -220,8 +220,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>   	struct btrfs_path *path;
>>   	struct btrfs_key key;
>>   	struct btrfs_root *root;
>> -	int err = 0;
>> -	int ret;
>> +	int ret = 0;
>>   
>>   	path = btrfs_alloc_path();
>>   	if (!path)
>> @@ -235,18 +234,19 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>   		u64 root_objectid;
>>   
>>   		ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
>> -		if (ret < 0) {
>> -			err = ret;
>> +		if (ret < 0)
>>   			break;
>> -		}
>> +		ret = 0;
> 
> Should this be handled when ret > 0? This would be unexpected and
> probably means a corruption but simply overwriting the value does not
> seem right.
> 

Agreed.

+               if (ret > 0)
+                       ret = 0;

is much neater.

As in v4.

Thanks, Anand

>>   
>>   		leaf = path->nodes[0];
>>   		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>>   			ret = btrfs_next_leaf(tree_root, path);
>>   			if (ret < 0)
>> -				err = ret;
>> -			if (ret != 0)
>>   				break;
>> +			if (ret > 0) {
>> +				ret = 0;
>> +				break;
>> +			}
>>   			leaf = path->nodes[0];
>>   		}
>>   
>> @@ -261,26 +261,26 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>   		key.offset++;
>>   
>>   		root = btrfs_get_fs_root(fs_info, root_objectid, false);
>> -		err = PTR_ERR_OR_ZERO(root);
>> -		if (err && err != -ENOENT) {
>> +		ret = PTR_ERR_OR_ZERO(root);
>> +		if (ret && ret != -ENOENT) {
>>   			break;
>> -		} else if (err == -ENOENT) {
>> +		} else if (ret == -ENOENT) {
>>   			struct btrfs_trans_handle *trans;
>>   
>>   			btrfs_release_path(path);
>>   
>>   			trans = btrfs_join_transaction(tree_root);
>>   			if (IS_ERR(trans)) {
>> -				err = PTR_ERR(trans);
>> -				btrfs_handle_fs_error(fs_info, err,
>> +				ret = PTR_ERR(trans);
>> +				btrfs_handle_fs_error(fs_info, ret,
>>   					    "Failed to start trans to delete orphan item");
>>   				break;
>>   			}
>> -			err = btrfs_del_orphan_item(trans, tree_root,
>> +			ret = btrfs_del_orphan_item(trans, tree_root,
>>   						    root_objectid);
>>   			btrfs_end_transaction(trans);
>> -			if (err) {
>> -				btrfs_handle_fs_error(fs_info, err,
>> +			if (ret) {
>> +				btrfs_handle_fs_error(fs_info, ret,
>>   					    "Failed to delete root orphan item");
>>   				break;
>>   			}
>> @@ -311,7 +311,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>>   	}
>>   
>>   	btrfs_free_path(path);
>> -	return err;
>> +	return ret;
>>   }
>>   
>>   /* drop the root item for 'key' from the tree root */
>> -- 
>> 2.38.1
>>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 1/6] btrfs: btrfs_cleanup_fs_roots handle ret variable
  2024-05-21 15:10  1%   ` David Sterba
@ 2024-05-21 17:08  1%     ` Anand Jain
  0 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21 17:08 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 5/21/24 23:10, David Sterba wrote:
> On Thu, May 16, 2024 at 07:12:10PM +0800, Anand Jain wrote:
>> Since err represents the function return value, rename it as ret,
>> and rename the original ret, which serves as a helper return value,
>> to found. Also, optimize the code to continue calling btrfs_put_root()
>> for the rest of the roots even if btrfs_orphan_cleanup() returns an
>> error.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> ---
>> v3: Add a code comment.
>> v2: Rename to 'found' instead of 'ret2' (Josef).
>>      Call btrfs_put_root() in the while-loop, avoids use of the variable
>> 	'found' outside of the while loop (Qu).
>>      Use 'unsigned int i' instead of 'int' (Goffredo).
>>
>>   fs/btrfs/disk-io.c | 38 ++++++++++++++++++++------------------
>>   1 file changed, 20 insertions(+), 18 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index a91a8056758a..d38cf973b02a 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2925,22 +2925,23 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
>>   {
>>   	u64 root_objectid = 0;
>>   	struct btrfs_root *gang[8];
>> -	int i = 0;
>> -	int err = 0;
>> -	unsigned int ret = 0;
>> +	int ret = 0;
>>   
>>   	while (1) {
>> +		unsigned int i;
>> +		unsigned int found;
>> +
>>   		spin_lock(&fs_info->fs_roots_radix_lock);
>> -		ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
>> +		found = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
>>   					     (void **)gang, root_objectid,
>>   					     ARRAY_SIZE(gang));
>> -		if (!ret) {
>> +		if (!found) {
>>   			spin_unlock(&fs_info->fs_roots_radix_lock);
>>   			break;
>>   		}
>> -		root_objectid = btrfs_root_id(gang[ret - 1]) + 1;
>> +		root_objectid = btrfs_root_id(gang[found - 1]) + 1;
>>   
>> -		for (i = 0; i < ret; i++) {
>> +		for (i = 0; i < found; i++) {
> 
> You could also move the declaration of 'i' to the for loop as you move
> the other definition anyway.

Yep. Done in v4.

Thanks.

> 
>>   			/* Avoid to grab roots in dead_roots. */
>>   			if (btrfs_root_refs(&gang[i]->root_item) == 0) {
>>   				gang[i] = NULL;
>> @@ -2951,24 +2952,25 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
>>   		}
>>   		spin_unlock(&fs_info->fs_roots_radix_lock);
>>   
>> -		for (i = 0; i < ret; i++) {
>> +		for (i = 0; i < found; i++) {
>>   			if (!gang[i])
>>   				continue;
>>   			root_objectid = btrfs_root_id(gang[i]);
>> -			err = btrfs_orphan_cleanup(gang[i]);
>> -			if (err)
>> -				goto out;
>> +			/*
>> +			 * Continue to release the remaining roots after the first
>> +			 * error without cleanup and preserve the first error
>> +			 * for the return.
>> +			 */
>> +			if (!ret)
>> +				ret = btrfs_orphan_cleanup(gang[i]);
>>   			btrfs_put_root(gang[i]);
>>   		}
>> +		if (ret)
>> +			break;
>> +
>>   		root_objectid++;
>>   	}
>> -out:
>> -	/* Release the uncleaned roots due to error. */
>> -	for (; i < ret; i++) {
>> -		if (gang[i])
>> -			btrfs_put_root(gang[i]);
>> -	}
>> -	return err;
>> +	return ret;
>>   }
>>   
>>   /*
>> -- 
>> 2.38.1
>>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-21 14:58  1% ` [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
@ 2024-05-21 15:22  1%   ` Filipe Manana
  2024-05-22  8:31  1%     ` Johannes Thumshirn
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-21 15:22 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Chris Mason, Josef Bacik, David Sterba, Hans Holmberg,
	linux-btrfs, linux-kernel, Naohiro Aota, Johannes Thumshirn

On Tue, May 21, 2024 at 3:58 PM Johannes Thumshirn <jth@kernel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> Reserve one zone as a data relocation target on each mount. If we already
> find one empty block group, there's no need to force a chunk allocation,
> but we can use this empty data block group as our relocation target.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>  fs/btrfs/disk-io.c |  2 ++
>  fs/btrfs/zoned.c   | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/zoned.h   |  3 +++
>  3 files changed, 70 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a91a8056758a..19e7b4a59a9e 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3558,6 +3558,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>         }
>         btrfs_discard_resume(fs_info);
>
> +       btrfs_reserve_relocation_bg(fs_info);
> +
>         if (fs_info->uuid_root &&
>             (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
>              fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 4cba80b34387..9404cb32256f 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -17,6 +17,7 @@
>  #include "fs.h"
>  #include "accessors.h"
>  #include "bio.h"
> +#include "transaction.h"
>
>  /* Maximum number of zones to report per blkdev_report_zones() call */
>  #define BTRFS_REPORT_NR_ZONES   4096
> @@ -2634,3 +2635,67 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
>         }
>         spin_unlock(&fs_info->zone_active_bgs_lock);
>  }
> +
> +static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
> +{
> +       struct btrfs_block_group *bg;
> +
> +       for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +               list_for_each_entry(bg, &sinfo->block_groups[i], list) {
> +                       if (bg->flags != flags)
> +                               continue;
> +                       if (bg->used == 0)
> +                               return bg->start;
> +               }
> +       }

I believe I commented about this in some previous patchset version,
but here goes again.

This happens at mount time, where we have already loaded all block groups.
When we load them, if we find unused ones, we add them to the list of
empty block groups, so that the next time the cleaner kthread runs it
deletes them.

I don't see any code here removing the selected block group from that
list, or anything at btrfs_delete_unused_bgs() that prevents deleting
a block group if it was selected as the data reloc bg.

Maybe I'm missing something?
How do ensure the selected block group isn't deleted by the cleaner kthread?

Thanks.
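
For illustration, a minimal sketch of one way to pin the selected block
group, assuming the current unused-bgs bookkeeping (fs_info->unused_bgs
protected by unused_bgs_lock, with the list entry holding a block group
reference); this is only a sketch, not the actual fix from any later
revision:

	spin_lock(&fs_info->unused_bgs_lock);
	if (!list_empty(&bg->bg_list)) {
		/* Drop the list entry together with the reference it holds. */
		list_del_init(&bg->bg_list);
		btrfs_put_block_group(bg);
	}
	spin_unlock(&fs_info->unused_bgs_lock);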


> +
> +       return 0;
> +}
> +
> +void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
> +{
> +       struct btrfs_root *tree_root = fs_info->tree_root;
> +       struct btrfs_space_info *sinfo = fs_info->data_sinfo;
> +       struct btrfs_trans_handle *trans;
> +       u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
> +       u64 bytenr = 0;
> +
> +       lockdep_assert_not_held(&fs_info->relocation_bg_lock);
> +
> +       if (!btrfs_is_zoned(fs_info))
> +               return;
> +
> +       if (fs_info->data_reloc_bg)
> +               return;
> +
> +       bytenr = find_empty_block_group(sinfo, flags);
> +       if (!bytenr) {
> +               int ret;
> +
> +               trans = btrfs_join_transaction(tree_root);
> +               if (IS_ERR(trans))
> +                       return;
> +
> +               ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
> +               btrfs_end_transaction(trans);
> +
> +               if (!ret) {
> +                       struct btrfs_block_group *bg;
> +
> +                       bytenr = find_empty_block_group(sinfo, flags);
> +                       if (!bytenr)
> +                               goto out;
> +                       bg = btrfs_lookup_block_group(fs_info, bytenr);
> +                       ASSERT(bg);
> +
> +                       if (!btrfs_zone_activate(bg))
> +                               bytenr = 0;
> +                       btrfs_put_block_group(bg);
> +               }
> +       }
> +
> +out:
> +       spin_lock(&fs_info->relocation_bg_lock);
> +       fs_info->data_reloc_bg = bytenr;
> +       spin_unlock(&fs_info->relocation_bg_lock);
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index 77c4321e331f..b9935222bf7a 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -97,6 +97,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
>  int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
>                                 struct btrfs_space_info *space_info, bool do_finish);
>  void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
> +void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info);
>  #else /* CONFIG_BLK_DEV_ZONED */
>  static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>                                      struct blk_zone *zone)
> @@ -271,6 +272,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
>
>  static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
>
> +static inline void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info) { }
> +
>  #endif
>
>  static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>
> --
> 2.43.0
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 0/6] part3 trivial adjustments for return variable coding style
                     ` (2 preceding siblings ...)
  2024-05-21  1:04  1% ` [PATCH v3 0/6] part3 trivial adjustments for return variable coding style Anand Jain
@ 2024-05-21 15:21  2% ` David Sterba
  2024-05-21 17:10  1%   ` Anand Jain
  3 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-21 15:21 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, May 16, 2024 at 07:12:09PM +0800, Anand Jain wrote:
> This is part 3 of the series, containing renames together with optimization
> of the return variables.
> 
> Some of the patches are new; they weren't part of v1 or v2. The new patches
> follow a verb-first format for titles. Older patches are not renamed, for
> backward reference.
> 
> The patchset passed tests -g quick without regressions, sending them first.
> 
> Patches 3/6 and 4/6 could be merged; they are separated for an easier diff.

Splitting the patches like this might seem strange, but reviewing the
changes individually is indeed a bit easier, so you can keep it like that.

> v2 part2:
>   https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
> v1:
>   https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
> 
> Anand Jain (6):
>   btrfs: btrfs_cleanup_fs_roots handle ret variable
>   btrfs: simplify ret in btrfs_recover_relocation
>   btrfs: rename ret in btrfs_recover_relocation
>   btrfs: rename err in btrfs_recover_relocation
>   btrfs: btrfs_drop_snapshot optimize return variable
>   btrfs: rename and optimize return variable in btrfs_find_orphan_roots

I've edited the subject lines from the previous series, please have a
look and copy the subjects when the kind of change in the patch is the
same. Also use the () when a function is mentioned in the subject.
Thanks.

^ permalink raw reply	[relevance 2%]

* Re: [PATCH v3 6/6] btrfs: rename and optimize return variable in btrfs_find_orphan_roots
  @ 2024-05-21 15:18  1%   ` David Sterba
  2024-05-21 17:10  1%     ` Anand Jain
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-21 15:18 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, May 16, 2024 at 07:12:15PM +0800, Anand Jain wrote:
> The variable err is the actual return value of this function, while
> the variable ret is only a helper for err. The helper is not actually
> needed, so handle everything with err, which is renamed to ret.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: drop ret2 as there is no need for it.
> v2: n/a
>  fs/btrfs/root-tree.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 33962671a96c..c11b0bccf513 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -220,8 +220,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>  	struct btrfs_path *path;
>  	struct btrfs_key key;
>  	struct btrfs_root *root;
> -	int err = 0;
> -	int ret;
> +	int ret = 0;
>  
>  	path = btrfs_alloc_path();
>  	if (!path)
> @@ -235,18 +234,19 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>  		u64 root_objectid;
>  
>  		ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
> -		if (ret < 0) {
> -			err = ret;
> +		if (ret < 0)
>  			break;
> -		}
> +		ret = 0;

Should this be handled when ret > 0? This would be unexpected and
probably means a corruption but simply overwriting the value does not
seem right.
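
For reference, btrfs_search_slot() returns 0 on an exact key match, 1
when the key is not found (with the path pointing at the slot where it
would be inserted), and negative on error. A hedged sketch of how the
scan could document that contract explicitly:

	ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
	if (ret < 0)
		break;
	/*
	 * ret == 1 (no exact match) is the expected outcome when
	 * scanning for the next orphan key, the path now points at
	 * the next slot to examine.
	 */
	ret = 0;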

>  
>  		leaf = path->nodes[0];
>  		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
>  			ret = btrfs_next_leaf(tree_root, path);
>  			if (ret < 0)
> -				err = ret;
> -			if (ret != 0)
>  				break;
> +			if (ret > 0) {
> +				ret = 0;
> +				break;
> +			}
>  			leaf = path->nodes[0];
>  		}
>  
> @@ -261,26 +261,26 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>  		key.offset++;
>  
>  		root = btrfs_get_fs_root(fs_info, root_objectid, false);
> -		err = PTR_ERR_OR_ZERO(root);
> -		if (err && err != -ENOENT) {
> +		ret = PTR_ERR_OR_ZERO(root);
> +		if (ret && ret != -ENOENT) {
>  			break;
> -		} else if (err == -ENOENT) {
> +		} else if (ret == -ENOENT) {
>  			struct btrfs_trans_handle *trans;
>  
>  			btrfs_release_path(path);
>  
>  			trans = btrfs_join_transaction(tree_root);
>  			if (IS_ERR(trans)) {
> -				err = PTR_ERR(trans);
> -				btrfs_handle_fs_error(fs_info, err,
> +				ret = PTR_ERR(trans);
> +				btrfs_handle_fs_error(fs_info, ret,
>  					    "Failed to start trans to delete orphan item");
>  				break;
>  			}
> -			err = btrfs_del_orphan_item(trans, tree_root,
> +			ret = btrfs_del_orphan_item(trans, tree_root,
>  						    root_objectid);
>  			btrfs_end_transaction(trans);
> -			if (err) {
> -				btrfs_handle_fs_error(fs_info, err,
> +			if (ret) {
> +				btrfs_handle_fs_error(fs_info, ret,
>  					    "Failed to delete root orphan item");
>  				break;
>  			}
> @@ -311,7 +311,7 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>  	}
>  
>  	btrfs_free_path(path);
> -	return err;
> +	return ret;
>  }
>  
>  /* drop the root item for 'key' from the tree root */
> -- 
> 2.38.1
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 1/6] btrfs: btrfs_cleanup_fs_roots handle ret variable
  @ 2024-05-21 15:10  1%   ` David Sterba
  2024-05-21 17:08  1%     ` Anand Jain
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-21 15:10 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, May 16, 2024 at 07:12:10PM +0800, Anand Jain wrote:
> Since err represents the function return value, rename it as ret,
> and rename the original ret, which serves as a helper return value,
> to found. Also, optimize the code to continue calling btrfs_put_root()
> for the rest of the roots even if btrfs_orphan_cleanup() returns an
> error.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---
> v3: Add a code comment.
> v2: Rename to 'found' instead of 'ret2' (Josef).
>     Call btrfs_put_root() in the while-loop, avoids use of the variable
> 	'found' outside of the while loop (Qu).
>     Use 'unsigned int i' instead of 'int' (Goffredo).
> 
>  fs/btrfs/disk-io.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index a91a8056758a..d38cf973b02a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2925,22 +2925,23 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
>  {
>  	u64 root_objectid = 0;
>  	struct btrfs_root *gang[8];
> -	int i = 0;
> -	int err = 0;
> -	unsigned int ret = 0;
> +	int ret = 0;
>  
>  	while (1) {
> +		unsigned int i;
> +		unsigned int found;
> +
>  		spin_lock(&fs_info->fs_roots_radix_lock);
> -		ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
> +		found = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
>  					     (void **)gang, root_objectid,
>  					     ARRAY_SIZE(gang));
> -		if (!ret) {
> +		if (!found) {
>  			spin_unlock(&fs_info->fs_roots_radix_lock);
>  			break;
>  		}
> -		root_objectid = btrfs_root_id(gang[ret - 1]) + 1;
> +		root_objectid = btrfs_root_id(gang[found - 1]) + 1;
>  
> -		for (i = 0; i < ret; i++) {
> +		for (i = 0; i < found; i++) {

You could also move the declaration of 'i' to the for loop as you move
the other definition anyway.

>  			/* Avoid to grab roots in dead_roots. */
>  			if (btrfs_root_refs(&gang[i]->root_item) == 0) {
>  				gang[i] = NULL;
> @@ -2951,24 +2952,25 @@ static int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
>  		}
>  		spin_unlock(&fs_info->fs_roots_radix_lock);
>  
> -		for (i = 0; i < ret; i++) {
> +		for (i = 0; i < found; i++) {
>  			if (!gang[i])
>  				continue;
>  			root_objectid = btrfs_root_id(gang[i]);
> -			err = btrfs_orphan_cleanup(gang[i]);
> -			if (err)
> -				goto out;
> +			/*
> +			 * Continue to release the remaining roots after the first
> +			 * error without cleanup and preserve the first error
> +			 * for the return.
> +			 */
> +			if (!ret)
> +				ret = btrfs_orphan_cleanup(gang[i]);
>  			btrfs_put_root(gang[i]);
>  		}
> +		if (ret)
> +			break;
> +
>  		root_objectid++;
>  	}
> -out:
> -	/* Release the uncleaned roots due to error. */
> -	for (; i < ret; i++) {
> -		if (gang[i])
> -			btrfs_put_root(gang[i]);
> -	}
> -	return err;
> +	return ret;
>  }
>  
>  /*
> -- 
> 2.38.1
> 

^ permalink raw reply	[relevance 1%]

* [PATCH v3 2/2] btrfs: reserve new relocation block-group after successful relocation
  2024-05-21 14:58  1% [PATCH v3 0/2] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
  2024-05-21 14:58  1% ` [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
@ 2024-05-21 14:58  1% ` Johannes Thumshirn
  2024-05-22  1:17  1%   ` Naohiro Aota
  1 sibling, 1 reply; 200+ results
From: Johannes Thumshirn @ 2024-05-21 14:58 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

After we've committed a relocation transaction, we know we have just freed
up space. Set it as a hint for the next relocation.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/relocation.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 8b24bb5a0aa1..764317a1c55d 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3811,6 +3811,13 @@ static noinline_for_stack int relocate_block_group(struct reloc_control *rc)
 	ret = btrfs_commit_transaction(trans);
 	if (ret && !err)
 		err = ret;
+
+	/*
+	 * We know we have just freed space, set it as a hint for the
+	 * next relocation.
+	 */
+	if (!err)
+		btrfs_reserve_relocation_bg(fs_info);
 out_free:
 	ret = clean_dirty_subvols(rc);
 	if (ret < 0 && !err)

-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount
  2024-05-21 14:58  1% [PATCH v3 0/2] btrfs: zoned: always set aside a zone for relocation Johannes Thumshirn
@ 2024-05-21 14:58  1% ` Johannes Thumshirn
  2024-05-21 15:22  1%   ` Filipe Manana
  2024-05-21 14:58  1% ` [PATCH v3 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  1 sibling, 1 reply; 200+ results
From: Johannes Thumshirn @ 2024-05-21 14:58 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Johannes Thumshirn

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Reserve one zone as a data relocation target on each mount. If we already
find one empty block group, there's no need to force a chunk allocation,
but we can use this empty data block group as our relocation target.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/disk-io.c |  2 ++
 fs/btrfs/zoned.c   | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h   |  3 +++
 3 files changed, 70 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a91a8056758a..19e7b4a59a9e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3558,6 +3558,8 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
 	}
 	btrfs_discard_resume(fs_info);
 
+	btrfs_reserve_relocation_bg(fs_info);
+
 	if (fs_info->uuid_root &&
 	    (btrfs_test_opt(fs_info, RESCAN_UUID_TREE) ||
 	     fs_info->generation != btrfs_super_uuid_tree_generation(disk_super))) {
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 4cba80b34387..9404cb32256f 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -17,6 +17,7 @@
 #include "fs.h"
 #include "accessors.h"
 #include "bio.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES   4096
@@ -2634,3 +2635,67 @@ void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info)
 	}
 	spin_unlock(&fs_info->zone_active_bgs_lock);
 }
+
+static u64 find_empty_block_group(struct btrfs_space_info *sinfo, u64 flags)
+{
+	struct btrfs_block_group *bg;
+
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		list_for_each_entry(bg, &sinfo->block_groups[i], list) {
+			if (bg->flags != flags)
+				continue;
+			if (bg->used == 0)
+				return bg->start;
+		}
+	}
+
+	return 0;
+}
+
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+	struct btrfs_trans_handle *trans;
+	u64 flags = btrfs_get_alloc_profile(fs_info, sinfo->flags);
+	u64 bytenr = 0;
+
+	lockdep_assert_not_held(&fs_info->relocation_bg_lock);
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	if (fs_info->data_reloc_bg)
+		return;
+
+	bytenr = find_empty_block_group(sinfo, flags);
+	if (!bytenr) {
+		int ret;
+
+		trans = btrfs_join_transaction(tree_root);
+		if (IS_ERR(trans))
+			return;
+
+		ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
+		btrfs_end_transaction(trans);
+
+		if (!ret) {
+			struct btrfs_block_group *bg;
+
+			bytenr = find_empty_block_group(sinfo, flags);
+			if (!bytenr)
+				goto out;
+			bg = btrfs_lookup_block_group(fs_info, bytenr);
+			ASSERT(bg);
+
+			if (!btrfs_zone_activate(bg))
+				bytenr = 0;
+			btrfs_put_block_group(bg);
+		}
+	}
+
+out:
+	spin_lock(&fs_info->relocation_bg_lock);
+	fs_info->data_reloc_bg = bytenr;
+	spin_unlock(&fs_info->relocation_bg_lock);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 77c4321e331f..b9935222bf7a 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -97,6 +97,7 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 				struct btrfs_space_info *space_info, bool do_finish);
 void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info);
+void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -271,6 +272,8 @@ static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
 
 static inline void btrfs_check_active_zone_reservation(struct btrfs_fs_info *fs_info) { }
 
+static inline void btrfs_reserve_relocation_bg(struct btrfs_fs_info *fs_info) { }
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v3 0/2] btrfs: zoned: always set aside a zone for relocation
@ 2024-05-21 14:58  1% Johannes Thumshirn
  2024-05-21 14:58  1% ` [PATCH v3 1/2] btrfs: zoned: reserve relocation block-group on mount Johannes Thumshirn
  2024-05-21 14:58  1% ` [PATCH v3 2/2] btrfs: reserve new relocation block-group after successful relocation Johannes Thumshirn
  0 siblings, 2 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-21 14:58 UTC (permalink / raw)
  To: Chris Mason, Josef Bacik, David Sterba
  Cc: Hans Holmberg, linux-btrfs, linux-kernel, Naohiro Aota,
	Johannes Thumshirn

For zoned filesystems we heavily rely on relocation for garbage collection,
as we cannot do any in-place updates of disk blocks.

But there can be situations where we're running out of space for doing the
relocation.

To solve this, always have a zone reserved for relocation.

This is a subset of another approach to this problem I've submitted in 
https://lore.kernel.org/r/20240328-hans-v1-0-4cd558959407@kernel.org

---
Changes in v3:
- Rename btrfs_reserve_relocation_zone -> btrfs_reserve_relocation_bg
- Bail out if we already have a relocation bg set
- Link to v2: https://lore.kernel.org/r/20240515-zoned-gc-v2-0-20c7cb9763cd@kernel.org

Changes in v2:
- Incorporate Naohiro's review
- Link to v1: https://lore.kernel.org/r/20240514-zoned-gc-v1-0-109f1a6c7447@kernel.org

---
Johannes Thumshirn (2):
      btrfs: zoned: reserve relocation block-group on mount
      btrfs: reserve new relocation block-group after successful relocation

 fs/btrfs/disk-io.c    |  2 ++
 fs/btrfs/relocation.c |  7 ++++++
 fs/btrfs/zoned.c      | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h      |  3 +++
 4 files changed, 77 insertions(+)
---
base-commit: d52875a6df98dc77933853e8427bd77f4598a9a7
change-id: 20240514-zoned-gc-2ce793459eb7

Best regards,
-- 
Johannes Thumshirn <jth@kernel.org>


^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/3] btrfs: avoid data races when accessing an inode's delayed_node
  2024-05-20 20:20  1%     ` David Sterba
@ 2024-05-21 14:47  1%       ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-21 14:47 UTC (permalink / raw)
  To: David Sterba; +Cc: Filipe Manana, linux-btrfs

On Mon, May 20, 2024 at 10:20:26PM +0200, David Sterba wrote:
> On Mon, May 20, 2024 at 05:58:37PM +0100, Filipe Manana wrote:
> > On Mon, May 20, 2024 at 4:48 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Fri, May 17, 2024 at 02:13:23PM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > We do have some data races when accessing an inode's delayed_node, namely
> > > > we use READ_ONCE() in a couple places while there's no pairing WRITE_ONCE()
> > > > anywhere, and in one place (btrfs_dirty_inode()) we neither user READ_ONCE()
> > > > nor take the lock that protects the delayed_node. So fix these and add
> > > > helpers to access and update an inode's delayed_node.
> > > >
> > > > Filipe Manana (3):
> > > >   btrfs: always set an inode's delayed_inode with WRITE_ONCE()
> > > >   btrfs: use READ_ONCE() when accessing delayed_node at btrfs_dirty_node()
> > > >   btrfs: add and use helpers to get and set an inode's delayed_node
> > >
> > The READ_ONCE for delayed nodes has been there historically but I don't
> > think it's needed everywhere. The legitimate case is in
> > btrfs_get_delayed_node() where the first use is without the lock and it
> > is then rechecked under the lock, so we do want to read a fresh value.
> > This is to prevent a compiler optimization that coalesces the reads.
> > >
> > > Writing to delayed node under lock also does not need WRITE_ONCE.
> > >
> > IOW, I would rather remove the use of the _ONCE helpers and not add more,
> > as this is not the pattern where they're supposed to be used. You say it's
> > to prevent load tearing, but for a pointer type this does not happen, as
> > it's guaranteed by the hardware.
> > 
> > If you are sure that pointers aren't subject to load/store tearing
> > issues, then I'm fine with dropping the patchset.
> 
> This will be in some CPU manual; the thread on LWN
> https://lwn.net/Articles/793895/
> mentions that. I base my claim on reading about that in various other
> discussions on LKML over time. Pointers match the unsigned long type,
> which is the machine word and register size, and assignment from register
> to register/memory happens in one go. What could be problematic is a
> constant (immediate) assigned to a register on architectures like SPARC
> that have fixed size instructions, where the constant has to be written
> in two steps.
> 
> The need for READ_ONCE/WRITE_ONCE is to prevent compiler optimizations
> and also load tearing, but for native types up to unsigned long no
> tearing seems to happen.
> 
> https://elixir.bootlin.com/linux/latest/source/include/asm-generic/rwonce.h#L29
> 
> The only situation that can possibly cause tearing, even for a
> pointer, is if it's not aligned and could be split over two cachelines.
> Alignment should hold for structures defined in a normal way (i.e. no
> forced mis-alignment or __packed).

For future reference: documented for example in Intel 64 and IA-32
Architectures, Software Developer's Manual, volume 3A, 9.2 Memory
ordering.

From section 9.2.3.1: "[...] Instructions that read or write a quadword
(8 bytes) whose address is aligned on an 8 byte boundary."
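
As a minimal sketch of the pattern discussed above (the names only
approximate the btrfs delayed-node code, this is not a verbatim
excerpt):

	static struct btrfs_delayed_node *get_node(struct btrfs_inode *inode)
	{
		/*
		 * Lockless fast path: READ_ONCE() keeps the compiler from
		 * coalescing this load with the plain one done below.
		 */
		struct btrfs_delayed_node *node = READ_ONCE(inode->delayed_node);

		if (node)
			return node;

		spin_lock(&inode->lock);
		/* Recheck under the lock, a plain load is fine here. */
		node = inode->delayed_node;
		spin_unlock(&inode->lock);
		return node;
	}

Per the argument above, the matching store (a naturally aligned pointer
assignment done under inode->lock) needs no WRITE_ONCE().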

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] generic/733: add commit ID for btrfs
  2024-05-21 12:01  1% [PATCH] generic/733: add commit ID for btrfs fdmanana
  2024-05-21 12:58  1% ` David Sterba
@ 2024-05-21 14:01  1% ` Zorro Lang
  1 sibling, 0 replies; 200+ results
From: Zorro Lang @ 2024-05-21 14:01 UTC (permalink / raw)
  To: fdmanana; +Cc: fstests, linux-btrfs, Filipe Manana

On Tue, May 21, 2024 at 01:01:29PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> As of commit 5d6f0e9890ed ("btrfs: stop locking the source extent range
> during reflink"), btrfs now does reflink operations without locking the
> source file's range, allowing concurrent reads in the whole source file.
> So update the test to annotate that commit.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---

Makes sense to me, thanks!

Reviewed-by: Zorro Lang <zlang@redhat.com>

>  tests/generic/733 | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/generic/733 b/tests/generic/733
> index d88d92a4..f6ee7f71 100755
> --- a/tests/generic/733
> +++ b/tests/generic/733
> @@ -7,7 +7,8 @@
>  # Race file reads with a very slow reflink operation to see if the reads
>  # actually complete while the reflink is ongoing.  This is a functionality
>  # test for XFS commit 14a537983b22 "xfs: allow read IO and FICLONE to run
> -# concurrently".
> +# concurrently" and for BTRFS commit 5d6f0e9890ed "btrfs: stop locking the
> +# source extent range during reflink".
>  #
>  . ./common/preamble
>  _begin_fstest auto clone punch
> @@ -26,8 +27,16 @@ _require_test_program "punch-alternating"
>  _require_test_program "t_reflink_read_race"
>  _require_command "$TIMEOUT_PROG" timeout
>  
> -[ "$FSTYP" = "xfs" ] && _fixed_by_kernel_commit 14a537983b22 \
> -        "xfs: allow read IO and FICLONE to run concurrently"
> +case "$FSTYP" in
> +"btrfs")
> +	_fixed_by_kernel_commit 5d6f0e9890ed \
> +		"btrfs: stop locking the source extent range during reflink"
> +	;;
> +"xfs")
> +	_fixed_by_kernel_commit 14a537983b22 \
> +		"xfs: allow read IO and FICLONE to run concurrently"
> +	;;
> +esac
>  
>  rm -f "$seqres.full"
>  
> -- 
> 2.43.0
> 
> 


^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: re-introduce 'norecovery' mount option
  2024-05-21  9:57  1% [PATCH] btrfs: re-introduce 'norecovery' mount option Qu Wenruo
                   ` (2 preceding siblings ...)
  2024-05-21 13:24  1% ` Lennart Poettering
@ 2024-05-21 13:26  1% ` David Sterba
  3 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-21 13:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Lennart Poettering, Jiri Slaby, stable

On Tue, May 21, 2024 at 07:27:31PM +0930, Qu Wenruo wrote:
> Although the 'norecovery' mount option has been marked deprecated for a
> long time, and a warning message was introduced during the deprecation
> window, it's still actively utilized by several projects that need a
> safe way to mount a btrfs without any writes.
> 
> Furthermore this 'norecovery' mount option is supported by most major
> filesystems, which makes it harder to justify removing it.
> 
> This patch re-introduces the 'norecovery' mount option, and outputs
> a message recommending the 'rescue=nologreplay' option.
> 
> Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t
> Link: https://github.com/systemd/systemd/pull/32892
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429
> Reported-by: Lennart Poettering <lennart@poettering.net>
> Reported-by: Jiri Slaby <jslaby@suse.com>
> Fixes: a1912f712188 ("btrfs: remove code for inode_cache and recovery mount options")
> Cc: stable@vger.kernel.org # 6.8+
> Signed-off-by: Qu Wenruo <wqu@suse.com>

I'll add it to for-next myself, there are a few more fixes that I
plan to send during the merge window so this patch can be picked up
for stable next week.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: re-introduce 'norecovery' mount option
  2024-05-21  9:57  1% [PATCH] btrfs: re-introduce 'norecovery' mount option Qu Wenruo
  2024-05-21 10:43  1% ` Johannes Thumshirn
  2024-05-21 13:13  1% ` David Sterba
@ 2024-05-21 13:24  1% ` Lennart Poettering
  2024-05-21 13:26  1% ` David Sterba
  3 siblings, 0 replies; 200+ results
From: Lennart Poettering @ 2024-05-21 13:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Jiri Slaby, stable

On Di, 21.05.24 19:27, Qu Wenruo (wqu@suse.com) wrote:

thank you!

lgtm.

> Although the 'norecovery' mount option has been marked deprecated for a
> long time, and a warning message was introduced during the deprecation
> window, it's still actively utilized by several projects that need a
> safe way to mount a btrfs without any writes.
>
> Furthermore this 'norecovery' mount option is supported by most major
> filesystems, which makes it harder to justify removing it.
>
> This patch re-introduces the 'norecovery' mount option, and outputs
> a message recommending the 'rescue=nologreplay' option.
>
> Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t
> Link: https://github.com/systemd/systemd/pull/32892
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429
> Reported-by: Lennart Poettering <lennart@poettering.net>
> Reported-by: Jiri Slaby <jslaby@suse.com>
> Fixes: a1912f712188 ("btrfs: remove code for inode_cache and recovery mount options")
> Cc: stable@vger.kernel.org # 6.8+
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/super.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 2dbc930a20f7..f05cce7c8b8d 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -119,6 +119,7 @@ enum {
>  	Opt_thread_pool,
>  	Opt_treelog,
>  	Opt_user_subvol_rm_allowed,
> +	Opt_norecovery,
>
>  	/* Rescue options */
>  	Opt_rescue,
> @@ -245,6 +246,8 @@ static const struct fs_parameter_spec btrfs_fs_parameters[] = {
>  	__fsparam(NULL, "nologreplay", Opt_nologreplay, fs_param_deprecated, NULL),
>  	/* Deprecated, with alias rescue=usebackuproot */
>  	__fsparam(NULL, "usebackuproot", Opt_usebackuproot, fs_param_deprecated, NULL),
> +	/* For compatibility only, alias for "rescue=nologreplay". */
> +	fsparam_flag("norecovery", Opt_norecovery),
>
>  	/* Debugging options. */
>  	fsparam_flag_no("enospc_debug", Opt_enospc_debug),
> @@ -438,6 +441,11 @@ static int btrfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		"'nologreplay' is deprecated, use 'rescue=nologreplay' instead");
>  		btrfs_set_opt(ctx->mount_opt, NOLOGREPLAY);
>  		break;
> +	case Opt_norecovery:
> +		btrfs_info(NULL,
> +"'norecovery' is for compatibility only, recommended to use 'rescue=nologreplay'");
> +		btrfs_set_opt(ctx->mount_opt, NOLOGREPLAY);
> +		break;
>  	case Opt_flushoncommit:
>  		if (result.negated)
>  			btrfs_clear_opt(ctx->mount_opt, FLUSHONCOMMIT);
> --
> 2.45.1
>
>

Lennart

--
Lennart Poettering, Berlin

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: re-introduce 'norecovery' mount option
  2024-05-21  9:57  1% [PATCH] btrfs: re-introduce 'norecovery' mount option Qu Wenruo
  2024-05-21 10:43  1% ` Johannes Thumshirn
@ 2024-05-21 13:13  1% ` David Sterba
  2024-05-21 13:24  1% ` Lennart Poettering
  2024-05-21 13:26  1% ` David Sterba
  3 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-21 13:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Lennart Poettering, Jiri Slaby, stable

On Tue, May 21, 2024 at 07:27:31PM +0930, Qu Wenruo wrote:
> Although 'norecovery' mount option is marked deprecated for a long time
> and a warning message is introduced during the deprecation window, it's
> still actively utilized by several projects that need a safely way to
> mount a btrfs without any writes.
> 
> Furthermore this 'norecovery' mount option is supported by most major
> filesystems, which makes it harder to validate our motivation.
> 
> This patch would re-introduce the 'norecovery' mount option, and output
> a message to recommend 'rescue=nologreplay' option.
> 
> Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t
> Link: https://github.com/systemd/systemd/pull/32892
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429
> Reported-by: Lennart Poettering <lennart@poettering.net>
> Reported-by: Jiri Slaby <jslaby@suse.com>
> Fixes: a1912f712188 ("btrfs: remove code for inode_cache and recovery mount options")
> Cc: stable@vger.kernel.org # 6.8+
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 3/6] btrfs: use for-local variables that shadow function variables
  2024-05-21  4:13  1%   ` Naohiro Aota
@ 2024-05-21 13:01  1%     ` David Sterba
  0 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-21 13:01 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: David Sterba, linux-btrfs

On Tue, May 21, 2024 at 04:13:53AM +0000, Naohiro Aota wrote:
> On Mon, May 20, 2024 at 09:52:26PM GMT, David Sterba wrote:
> > We've started to use for-loop local variables and in a few places this
> > shadows a function variable. Convert a few cases reported by 'make W=2'.
> > If applicable also change the style to post-increment, that's the
> > preferred one.
> > 
> > Signed-off-by: David Sterba <dsterba@suse.com>
> > ---
> 
> LGTM asides from a small nit below.
> 
> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
> 
> >  fs/btrfs/qgroup.c  | 11 +++++------
> >  fs/btrfs/volumes.c |  9 +++------
> >  fs/btrfs/zoned.c   |  8 +++-----
> >  3 files changed, 11 insertions(+), 17 deletions(-)
> > 
> > diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> > index fc2a7ea26354..a94a5b87b042 100644
> > --- a/fs/btrfs/qgroup.c
> > +++ b/fs/btrfs/qgroup.c
> > @@ -3216,7 +3216,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  			 struct btrfs_qgroup_inherit *inherit)
> >  {
> >  	int ret = 0;
> > -	int i;
> >  	u64 *i_qgroups;
> >  	bool committing = false;
> >  	struct btrfs_fs_info *fs_info = trans->fs_info;
> > @@ -3273,7 +3272,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  		i_qgroups = (u64 *)(inherit + 1);
> >  		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
> >  		       2 * inherit->num_excl_copies;
> > -		for (i = 0; i < nums; ++i) {
> > +		for (int i = 0; i < nums; i++) {
> >  			srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
> >  
> >  			/*
> > @@ -3300,7 +3299,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  	 */
> >  	if (inherit) {
> >  		i_qgroups = (u64 *)(inherit + 1);
> > -		for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
> > +		for (int i = 0; i < inherit->num_qgroups; i++, i_qgroups++) {
> >  			if (*i_qgroups == 0)
> >  				continue;
> >  			ret = add_qgroup_relation_item(trans, objectid,
> > @@ -3386,7 +3385,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  		goto unlock;
> >  
> >  	i_qgroups = (u64 *)(inherit + 1);
> > -	for (i = 0; i < inherit->num_qgroups; ++i) {
> > +	for (int i = 0; i < inherit->num_qgroups; i++) {
> >  		if (*i_qgroups) {
> >  			ret = add_relation_rb(fs_info, qlist_prealloc[i], objectid,
> >  					      *i_qgroups);
> > @@ -3406,7 +3405,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  		++i_qgroups;
> >  	}
> >  
> > -	for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
> > +	for (int i = 0; i < inherit->num_ref_copies; i++, i_qgroups += 2) {
> >  		struct btrfs_qgroup *src;
> >  		struct btrfs_qgroup *dst;
> >  
> > @@ -3427,7 +3426,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
> >  		/* Manually tweaking numbers certainly needs a rescan */
> >  		need_rescan = true;
> >  	}
> > -	for (i = 0; i <  inherit->num_excl_copies; ++i, i_qgroups += 2) {
> > +	for (int i = 0; i <  inherit->num_excl_copies; i++, i_qgroups += 2) {
>                            ^
> nit:                       we have double space here for no reason.
> Can we just dedup it as well?

I remember removing it but probably forgot to refresh the patch before
sending.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] generic/733: add commit ID for btrfs
  2024-05-21 12:01  1% [PATCH] generic/733: add commit ID for btrfs fdmanana
@ 2024-05-21 12:58  1% ` David Sterba
  2024-05-21 14:01  1% ` Zorro Lang
  1 sibling, 0 replies; 200+ results
From: David Sterba @ 2024-05-21 12:58 UTC (permalink / raw)
  To: fdmanana; +Cc: fstests, linux-btrfs, Filipe Manana

On Tue, May 21, 2024 at 01:01:29PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> As of commit 5d6f0e9890ed ("btrfs: stop locking the source extent range
> during reflink"), btrfs now does reflink operations without locking the
> source file's range, allowing concurrent reads in the whole source file.
> So update the test to annotate that commit.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: David Sterba <dsterba@suse.com>

^ permalink raw reply	[relevance 1%]

* [PATCH] generic/733: add commit ID for btrfs
@ 2024-05-21 12:01  1% fdmanana
  2024-05-21 12:58  1% ` David Sterba
  2024-05-21 14:01  1% ` Zorro Lang
  0 siblings, 2 replies; 200+ results
From: fdmanana @ 2024-05-21 12:01 UTC (permalink / raw)
  To: fstests; +Cc: linux-btrfs, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

As of commit 5d6f0e9890ed ("btrfs: stop locking the source extent range
during reflink"), btrfs now does reflink operations without locking the
source file's range, allowing concurrent reads in the whole source file.
So update the test to annotate that commit.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 tests/generic/733 | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/tests/generic/733 b/tests/generic/733
index d88d92a4..f6ee7f71 100755
--- a/tests/generic/733
+++ b/tests/generic/733
@@ -7,7 +7,8 @@
 # Race file reads with a very slow reflink operation to see if the reads
 # actually complete while the reflink is ongoing.  This is a functionality
 # test for XFS commit 14a537983b22 "xfs: allow read IO and FICLONE to run
-# concurrently".
+# concurrently" and for BTRFS commit 5d6f0e9890ed "btrfs: stop locking the
+# source extent range during reflink".
 #
 . ./common/preamble
 _begin_fstest auto clone punch
@@ -26,8 +27,16 @@ _require_test_program "punch-alternating"
 _require_test_program "t_reflink_read_race"
 _require_command "$TIMEOUT_PROG" timeout
 
-[ "$FSTYP" = "xfs" ] && _fixed_by_kernel_commit 14a537983b22 \
-        "xfs: allow read IO and FICLONE to run concurrently"
+case "$FSTYP" in
+"btrfs")
+	_fixed_by_kernel_commit 5d6f0e9890ed \
+		"btrfs: stop locking the source extent range during reflink"
+	;;
+"xfs")
+	_fixed_by_kernel_commit 14a537983b22 \
+		"xfs: allow read IO and FICLONE to run concurrently"
+	;;
+esac
 
 rm -f "$seqres.full"
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  2024-05-21  8:45  1%     ` Qu Wenruo
@ 2024-05-21 11:54  1%       ` Naohiro Aota
  2024-05-21 22:16  1%         ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-21 11:54 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef

On Tue, May 21, 2024 at 06:15:32PM GMT, Qu Wenruo wrote:
> 
> 
> On 2024/5/21 17:41, Naohiro Aota wrote:
> [...]
> > Same here.
> > 
> > >   	while (delalloc_start < page_end) {
> > >   		delalloc_end = page_end;
> > >   		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> > > @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
> > >   			delalloc_start = delalloc_end + 1;
> > >   			continue;
> > >   		}
> > > -
> > > -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> > > -					       delalloc_end, wbc);
> > > -		if (ret < 0)
> > > -			return ret;
> > > -
> > > +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> > > +					    min(delalloc_end, page_end) + 1 -
> > > +					    delalloc_start);
> > > +		last_delalloc_end = delalloc_end;
> > >   		delalloc_start = delalloc_end + 1;
> > >   	}
> > 
> > Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> > following code simpler.
> 
> Mind to explain it a little more?
> 
> Did you mean something like this:
> 
> 	while (delalloc_start < page_end) {
> 		/* lock all subpage delalloc range code */
> 	}
> 	if (!last_delalloc_end)
> 		goto finish;
> 	while (delalloc_start < page_end) {
		/* run the delalloc ranges code */
> 	}
> 
> If so, I can definitely go that way.

Yes, I meant that way. Apparently, "!last_delalloc_end" means it got no
delalloc region. So, we can just return 0 in that case without touching
"wbc->nr_to_write" as the current code does.

BTW, is this actually an overlooked error case? Is it OK to progress in
__extent_writepage() even if we don't run run_delalloc_range()?

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: re-introduce 'norecovery' mount option
  2024-05-21  9:57  1% [PATCH] btrfs: re-introduce 'norecovery' mount option Qu Wenruo
@ 2024-05-21 10:43  1% ` Johannes Thumshirn
  2024-05-21 13:13  1% ` David Sterba
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Johannes Thumshirn @ 2024-05-21 10:43 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs; +Cc: Lennart Poettering, Jiri Slaby, stable

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[relevance 1%]

* [PATCH] btrfs: re-introduce 'norecovery' mount option
@ 2024-05-21  9:57  1% Qu Wenruo
  2024-05-21 10:43  1% ` Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  9:57 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Lennart Poettering, Jiri Slaby, stable

Although the 'norecovery' mount option has been marked deprecated for a
long time, and a warning message was introduced during the deprecation
window, it's still actively utilized by several projects that need a
safe way to mount a btrfs without any writes.

Furthermore this 'norecovery' mount option is supported by most major
filesystems, which makes it harder to justify removing it.

This patch re-introduces the 'norecovery' mount option, and outputs
a message recommending the 'rescue=nologreplay' option.
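
For reference, after this patch both spellings behave the same; like
'rescue=nologreplay', this only makes sense together with a read-only
mount (the device path below is just an example):

  # mount -o ro,norecovery /dev/sdb /mnt
  # mount -o ro,rescue=nologreplay /dev/sdb /mnt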

Link: https://lore.kernel.org/linux-btrfs/ZkxZT0J-z0GYvfy8@gardel-login/#t
Link: https://github.com/systemd/systemd/pull/32892
Link: https://bugzilla.suse.com/show_bug.cgi?id=1222429
Reported-by: Lennart Poettering <lennart@poettering.net>
Reported-by: Jiri Slaby <jslaby@suse.com>
Fixes: a1912f712188 ("btrfs: remove code for inode_cache and recovery mount options")
Cc: stable@vger.kernel.org # 6.8+
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/super.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2dbc930a20f7..f05cce7c8b8d 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -119,6 +119,7 @@ enum {
 	Opt_thread_pool,
 	Opt_treelog,
 	Opt_user_subvol_rm_allowed,
+	Opt_norecovery,
 
 	/* Rescue options */
 	Opt_rescue,
@@ -245,6 +246,8 @@ static const struct fs_parameter_spec btrfs_fs_parameters[] = {
 	__fsparam(NULL, "nologreplay", Opt_nologreplay, fs_param_deprecated, NULL),
 	/* Deprecated, with alias rescue=usebackuproot */
 	__fsparam(NULL, "usebackuproot", Opt_usebackuproot, fs_param_deprecated, NULL),
+	/* For compatibility only, alias for "rescue=nologreplay". */
+	fsparam_flag("norecovery", Opt_norecovery),
 
 	/* Debugging options. */
 	fsparam_flag_no("enospc_debug", Opt_enospc_debug),
@@ -438,6 +441,11 @@ static int btrfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		"'nologreplay' is deprecated, use 'rescue=nologreplay' instead");
 		btrfs_set_opt(ctx->mount_opt, NOLOGREPLAY);
 		break;
+	case Opt_norecovery:
+		btrfs_info(NULL,
+"'norecovery' is for compatibility only, recommended to use 'rescue=nologreplay'");
+		btrfs_set_opt(ctx->mount_opt, NOLOGREPLAY);
+		break;
 	case Opt_flushoncommit:
 		if (result.negated)
 			btrfs_clear_opt(ctx->mount_opt, FLUSHONCOMMIT);
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  2024-05-21  8:11  1%   ` Naohiro Aota
@ 2024-05-21  8:45  1%     ` Qu Wenruo
  2024-05-21 11:54  1%       ` Naohiro Aota
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-21  8:45 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, Johannes Thumshirn, josef



On 2024/5/21 17:41, Naohiro Aota wrote:
[...]
> Same here.
> 
>>   	while (delalloc_start < page_end) {
>>   		delalloc_end = page_end;
>>   		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
>> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>>   			delalloc_start = delalloc_end + 1;
>>   			continue;
>>   		}
>> -
>> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
>> -					       delalloc_end, wbc);
>> -		if (ret < 0)
>> -			return ret;
>> -
>> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
>> +					    min(delalloc_end, page_end) + 1 -
>> +					    delalloc_start);
>> +		last_delalloc_end = delalloc_end;
>>   		delalloc_start = delalloc_end + 1;
>>   	}
> 
> Can we bail out on the "if (!last_delalloc_end)" case? It would make the
> following code simpler.

Mind to explain it a little more?

Did you mean something like this:

	while (delalloc_start < page_end) {
		/* lock all subpage delalloc range code */
	}
	if (!last_delalloc_end)
		goto finish;
	while (delalloc_start < page_end) {
		/* run the delalloc ranges code */
	}

If so, I can definitely go that way.
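
Or, as a rough equivalent with an early return instead of the goto
(variable names assumed from the posted patch):

	/* Phase 1: lock all subpage delalloc ranges in the page. */
	while (delalloc_start < page_end) {
		/* ... btrfs_folio_set_writer_lock() for each range ... */
		last_delalloc_end = delalloc_end;
		delalloc_start = delalloc_end + 1;
	}

	/* No delalloc range touched this page, nothing to run. */
	if (!last_delalloc_end)
		return 0;

	/* Phase 2: run the locked delalloc ranges. */
	delalloc_start = page_start;
	while (delalloc_start < page_end) {
		/* ... btrfs_run_delalloc_range() for each locked range ... */
	}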

> 
>> +	delalloc_start = page_start;
>> +	/* Run the delalloc ranges for above locked ranges. */
>> +	while (last_delalloc_end && delalloc_start < page_end) {
>> +		u64 found_start;
>> +		u32 found_len;
>> +		bool found;
>>   
>> +		if (!btrfs_is_subpage(fs_info, page->mapping)) {
>> +			/*
>> +			 * For non-subpage case, the found delalloc range must
>> +			 * cover this page and there must be only one locked
>> +			 * delalloc range.
>> +			 */
>> +			found_start = page_start;
>> +			found_len = last_delalloc_end + 1 - found_start;
>> +			found = true;
>> +		} else {
>> +			found = btrfs_subpage_find_writer_locked(fs_info, folio,
>> +					delalloc_start, &found_start, &found_len);
>> +		}
>> +		if (!found)
>> +			break;
>> +		/*
>> +		 * The subpage range covers the last sector, the delalloc range may
>> +		 * end beyond the page boundary, use the saved delalloc_end
>> +		 * instead.
>> +		 */
>> +		if (found_start + found_len >= page_end)
>> +			found_len = last_delalloc_end + 1 - found_start;
>> +
>> +		if (ret < 0) {
> 
> At first glance, it is strange because "ret" is not set above. But, it is
> executed when btrfs_run_delalloc_range() returns an error in an iteration,
> for the remaining iterations...
> 
> I'd like to have a dedicated clean-up path ... but I agree it is difficult
> to make such a cleanup loop clean.

I can add an extra bool to indicate if we have any error, but overall 
it's not much different.

> 
> Flipping the if-conditions looks better? Or, adding more comments would be nice.

I guess that would go this path, flipping the if conditions and extra 
comments.

> 
>> +			/* Cleanup the remaining locked ranges. */
>> +			unlock_extent(&inode->io_tree, found_start,
>> +				      found_start + found_len - 1, NULL);
>> +			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
>> +					      found_start + found_len - 1);
>> +		} else {
>> +			ret = btrfs_run_delalloc_range(inode, page, found_start,
>> +						       found_start + found_len - 1, wbc);
> 
> Also, what happens if the first range returns "1" and a later one returns
> "0"? Is it OK to override the "ret" for the case? Actually, I guess it
> won't happen now because (as said in patch 5) subpage disables an inline
> extent, but having an ASSERT() would be good to catch a future mistake.

It's not only inline; compression can also return 1.

Thankfully for subpage, inline is disabled, while compression can
only be done for a fully page-aligned range (start and end are both
page aligned).

Considering you're mentioning this, I would definitely add an ASSERT() 
with comments explaining this.
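
Something along these lines, perhaps (purely illustrative, the actual
series may shape it differently):

	ret = btrfs_run_delalloc_range(inode, page, found_start,
				       found_start + found_len - 1, wbc);
	/*
	 * For subpage, ret == 1 can only come from compression, which
	 * requires a fully page-aligned range, so it must be the last
	 * locked range of this page.
	 */
	ASSERT(ret != 1 || found_start + found_len >= last_delalloc_end + 1);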

Thanks for the feedback!
Qu

> 
>> +		}
>> +		/*
>> +		 * Above btrfs_run_delalloc_range() may have unlocked the page,
>> +		 * Thus for the last range, we can not touch the page anymore.
>> +		 */
>> +		if (found_start + found_len >= last_delalloc_end + 1)
>> +			break;
>> +
>> +		delalloc_start = found_start + found_len;
>> +	}
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (last_delalloc_end)
>> +		delalloc_end = last_delalloc_end;
>> +	else
>> +		delalloc_end = page_end;
>>   	/*
>>   	 * delalloc_end is already one less than the total length, so
>>   	 * we don't subtract one from PAGE_SIZE
>> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>>   					       PAGE_SIZE, !ret);
>>   		mapping_set_error(page->mapping, ret);
>>   	}
>> -	unlock_page(page);
>> +
>> +	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
>>   	ASSERT(ret <= 0);
>>   	return ret;
>>   }
>> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
>> index 3c957d03324e..81b862d7ab53 100644
>> --- a/fs/btrfs/subpage.c
>> +++ b/fs/btrfs/subpage.c
>> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>>   void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>>   				 struct folio *folio)
>>   {
>> +	struct btrfs_subpage *subpage = folio_get_private(folio);
>>   	u64 folio_start = folio_pos(folio);
>>   	u64 cur = folio_start;
>>   
>> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>>   		return;
>>   	}
>>   
>> +	/* The page has no new delalloc range locked on it. Just plain unlock. */
>> +	if (atomic_read(&subpage->writers) == 0) {
>> +		folio_unlock(folio);
>> +		return;
>> +	}
>>   	while (cur < folio_start + PAGE_SIZE) {
>>   		u64 found_start;
>>   		u32 found_len;
>> -- 
>> 2.45.0

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 3/5] btrfs: lock subpage ranges in one go for writepage_delalloc()
  @ 2024-05-21  8:11  1%   ` Naohiro Aota
  2024-05-21  8:45  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-21  8:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef

On Sat, May 18, 2024 at 02:37:41PM GMT, Qu Wenruo wrote:
> If we have a subpage range like this for a 16K page with 4K sectorsize:
> 
>     0     4K     8K     12K     16K
>     |/////|      |//////|       |
> 
>     |/////| = dirty range
> 
> Currently writepage_delalloc() would go through the following steps:
> 
> - lock range [0, 4K)
> - run delalloc range for [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [8K 12K)
> 
> So far it's fine for regular subpage writeback, as
> btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
> cow_file_range() and run_delalloc_compressed().
> 
> But there is a special pitfall for zoned subpage, where we will go
> through run_delalloc_cow(), which would create the ordered extent for the
> range and immediately submit the range.
> This would unlock the whole page range, causing all kinds of different
> ASSERT()s related to locked page.
> 
> This patch would address the page unlocking problem of run_delalloc_cow(),
> by changing the workflow to the following one:
> 
> - lock range [0, 4K)
> - lock range [8K, 12K)
> - run delalloc range for [0, 4K)
> - run delalloc range for [8K, 12K)
> 
> So that run_delalloc_cow() can only unlock the full page once the
> last lock user has released it.
> 
> To do that, this patch would:
> 
> - Utilizing subpage locked bitmap
>   So for every delalloc range we found, call
>   btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
>   and later btrfs_folio_end_all_writers() if the page is fully unlocked.
> 
>   So we know there is a delalloc range that needs to be run later.
> 
> - Save the @delalloc_end as @last_delalloc_end inside
>   writepage_delalloc()
>   Since the subpage locked bitmap is only for ranges inside the page,
>   while a delalloc range can end beyond the page boundary, we have to
>   save @last_delalloc_end just in case it's beyond our page boundary.
> 
> Although there is one extra point to notice:
> 
> - We need to handle errors from previous iterations
>   Since we can have multiple locked delalloc ranges, we have to call
>   run_delalloc_ranges() multiple times.
>   If we hit an error halfway, we still need to unlock the remaining
>   ranges.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 77 ++++++++++++++++++++++++++++++++++++++++----
>  fs/btrfs/subpage.c   |  6 ++++
>  2 files changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 8a4a7d00795f..b6dc9308105d 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1226,13 +1226,22 @@ static inline void contiguous_readpages(struct page *pages[], int nr_pages,
>  static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  		struct page *page, struct writeback_control *wbc)
>  {
> +	struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
> +	struct folio *folio = page_folio(page);
>  	const u64 page_start = page_offset(page);
>  	const u64 page_end = page_start + PAGE_SIZE - 1;
> +	/*
> +	 * Saves the last found delalloc end. As the delalloc end can go beyond
> +	 * page boundary, thus we can not rely on subpage bitmap to locate
> +	 * the last dealloc end.

typo: dealloc -> delalloc

> +	 */
> +	u64 last_delalloc_end = 0;
>  	u64 delalloc_start = page_start;
>  	u64 delalloc_end = page_end;
>  	u64 delalloc_to_write = 0;
>  	int ret = 0;
>  
> +	/* Lock all (subpage) dealloc ranges inside the page first. */

Same here.

>  	while (delalloc_start < page_end) {
>  		delalloc_end = page_end;
>  		if (!find_lock_delalloc_range(&inode->vfs_inode, page,
> @@ -1240,15 +1249,68 @@ static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
>  			delalloc_start = delalloc_end + 1;
>  			continue;
>  		}
> -
> -		ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
> -					       delalloc_end, wbc);
> -		if (ret < 0)
> -			return ret;
> -
> +		btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
> +					    min(delalloc_end, page_end) + 1 -
> +					    delalloc_start);
> +		last_delalloc_end = delalloc_end;
>  		delalloc_start = delalloc_end + 1;
>  	}

Can we bail out on the "if (!last_delalloc_end)" case? It would make the
following code simpler.
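
Something like the following is what I have in mind (a rough sketch, not
even compile-tested, and the "out" label is hypothetical):

	/* No subpage delalloc range locked at all, nothing to run. */
	if (!last_delalloc_end) {
		delalloc_end = page_end;
		goto out;
	}

Then the second loop no longer needs the "last_delalloc_end &&" part of
its condition.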

> +	delalloc_start = page_start;
> +	/* Run the delalloc ranges for above locked ranges. */
> +	while (last_delalloc_end && delalloc_start < page_end) {
> +		u64 found_start;
> +		u32 found_len;
> +		bool found;
>  
> +		if (!btrfs_is_subpage(fs_info, page->mapping)) {
> +			/*
> +			 * For non-subpage case, the found delalloc range must
> +			 * cover this page and there must be only one locked
> +			 * delalloc range.
> +			 */
> +			found_start = page_start;
> +			found_len = last_delalloc_end + 1 - found_start;
> +			found = true;
> +		} else {
> +			found = btrfs_subpage_find_writer_locked(fs_info, folio,
> +					delalloc_start, &found_start, &found_len);
> +		}
> +		if (!found)
> +			break;
> +		/*
> +		 * If the subpage range covers the last sector, the delalloc range may
> +		 * end beyond the page boundary, use the saved delalloc_end
> +		 * instead.
> +		 */
> +		if (found_start + found_len >= page_end)
> +			found_len = last_delalloc_end + 1 - found_start;
> +
> +		if (ret < 0) {

At first glance, it looks strange because "ret" is not set above. But it
is executed for the remaining iterations when btrfs_run_delalloc_range()
returned an error in an earlier iteration...

I'd like to have a dedicated clean-up path ... but I agree it is difficult
to make such a cleanup loop clean.

Maybe flipping the if-conditions would look better? Or, adding more comments would be nice.
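
E.g. something like this (untested, only to illustrate the flipped
conditions):

	if (ret >= 0) {
		ret = btrfs_run_delalloc_range(inode, page, found_start,
					       found_start + found_len - 1, wbc);
	} else {
		/* A previous iteration failed, only clean up the locked ranges. */
		unlock_extent(&inode->io_tree, found_start,
			      found_start + found_len - 1, NULL);
		__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
				      found_start + found_len - 1);
	}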

> +			/* Cleanup the remaining locked ranges. */
> +			unlock_extent(&inode->io_tree, found_start,
> +				      found_start + found_len - 1, NULL);
> +			__unlock_for_delalloc(&inode->vfs_inode, page, found_start,
> +					      found_start + found_len - 1);
> +		} else {
> +			ret = btrfs_run_delalloc_range(inode, page, found_start,
> +						       found_start + found_len - 1, wbc);

Also, what happens if the first range returns "1" and a later one returns
"0"? Is it OK to override "ret" in that case? Actually, I guess it
won't happen now because (as said in patch 5) subpage disables inline
extents, but having an ASSERT() would be good to catch a future mistake.
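
Maybe something like this (untested, and it assumes a range returning 1
can only be the last locked range in the page):

	ret = btrfs_run_delalloc_range(inode, page, found_start,
				       found_start + found_len - 1, wbc);
	/* Returning 1 means the whole page is handled, must be the last range. */
	ASSERT(ret != 1 || found_start + found_len >= last_delalloc_end + 1);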

> +		}
> +		/*
> +		 * Above btrfs_run_delalloc_range() may have unlocked the page,
> +		 * thus for the last range we can not touch the page anymore.
> +		 */
> +		if (found_start + found_len >= last_delalloc_end + 1)
> +			break;
> +
> +		delalloc_start = found_start + found_len;
> +	}
> +	if (ret < 0)
> +		return ret;
> +
> +	if (last_delalloc_end)
> +		delalloc_end = last_delalloc_end;
> +	else
> +		delalloc_end = page_end;
>  	/*
>  	 * delalloc_end is already one less than the total length, so
>  	 * we don't subtract one from PAGE_SIZE
> @@ -1520,7 +1582,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>  					       PAGE_SIZE, !ret);
>  		mapping_set_error(page->mapping, ret);
>  	}
> -	unlock_page(page);
> +
> +	btrfs_folio_end_all_writers(inode_to_fs_info(inode), folio);
>  	ASSERT(ret <= 0);
>  	return ret;
>  }
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 3c957d03324e..81b862d7ab53 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -862,6 +862,7 @@ bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>  void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>  				 struct folio *folio)
>  {
> +	struct btrfs_subpage *subpage = folio_get_private(folio);
>  	u64 folio_start = folio_pos(folio);
>  	u64 cur = folio_start;
>  
> @@ -871,6 +872,11 @@ void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>  		return;
>  	}
>  
> +	/* The page has no new delalloc range locked on it. Just plain unlock. */
> +	if (atomic_read(&subpage->writers) == 0) {
> +		folio_unlock(folio);
> +		return;
> +	}
>  	while (cur < folio_start + PAGE_SIZE) {
>  		u64 found_start;
>  		u32 found_len;
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
  2024-05-21  7:50  1%   ` Naohiro Aota
@ 2024-05-21  7:57  1%     ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  7:57 UTC (permalink / raw)
  To: Naohiro Aota, Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef



On 2024/5/21 17:20, Naohiro Aota wrote:
[...]
>> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
>> +				 struct folio *folio, u64 start, u32 len)
>> +{
>> +	struct btrfs_subpage *subpage;
>> +	unsigned long flags;
>> +	int start_bit;
>> +	int nbits;
>
> May want to use unsigned int for consistency...
>

I can definitely change all the int to "unsigned int" to be consistent
when pushing to the for-next branch.

[...]
>> +	found = true;
>> +	*found_start_ret = folio_pos(folio) +
>> +		((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);
>
> It's a bit worrying to see an "int" value shifted and added into a u64
> value. But I guess sectorsize is within the 32-bit range, right?

In fact, (first_set - locked_bitmap_start) is never going to be larger
than (PAGE_SIZE / sectorsize).

I can add extra ASSERT() to be extra safe for that too.
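
Something like this (untested):

	ASSERT(first_set - locked_bitmap_start <
	       fs_info->subpage_info->bitmap_nr_bits);

So the value being shifted can never exceed the number of sectors in a
page.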

Thanks,
Qu

>
>> +
>> +	first_zero = find_next_zero_bit(subpage->bitmaps,
>> +					locked_bitmap_end, first_set);
>> +	*found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
>> +out:
>> +	spin_unlock_irqrestore(&subpage->lock, flags);
>> +	return found;
>> +}
>> +
>> +/*
> + * Unlike btrfs_folio_end_writer_lock() which unlocks a specified subpage range,
>> + * this would end all writer locked ranges of a page.
>> + *
>> + * This is for the locked page of __extent_writepage(), as the locked page
>> + * can contain several locked subpage ranges.
>> + */
>> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> +				 struct folio *folio)
>> +{
>> +	u64 folio_start = folio_pos(folio);
>> +	u64 cur = folio_start;
>> +
>> +	ASSERT(folio_test_locked(folio));
>> +	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
>> +		folio_unlock(folio);
>> +		return;
>> +	}
>> +
>> +	while (cur < folio_start + PAGE_SIZE) {
>> +		u64 found_start;
>> +		u32 found_len;
>> +		bool found;
>> +		bool last;
>> +
>> +		found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
>> +							 &found_start, &found_len);
>> +		if (!found)
>> +			break;
>> +		last = btrfs_subpage_end_and_test_writer(fs_info, folio,
>> +							 found_start, found_len);
>> +		if (last) {
>> +			folio_unlock(folio);
>> +			break;
>> +		}
>> +		cur = found_start + found_len;
>> +	}
>> +}
>> +
>>   #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst)		\
>>   	bitmap_cut(dst, subpage->bitmaps, 0,				\
>>   		   subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
>> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
>> index 4b363d9453af..9f19850d59f2 100644
>> --- a/fs/btrfs/subpage.h
>> +++ b/fs/btrfs/subpage.h
>> @@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
>>   				  struct folio *folio, u64 start, u32 len);
>>   void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
>>   				 struct folio *folio, u64 start, u32 len);
>> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
>> +				 struct folio *folio, u64 start, u32 len);
>> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
>> +				      struct folio *folio, u64 search_start,
>> +				      u64 *found_start_ret, u32 *found_len_ret);
>> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
>> +				 struct folio *folio);
>>
>>   /*
>>    * Template for subpage related operations.
>> --
>> 2.45.0
>>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 2/5] btrfs: subpage: introduce helpers to handle subpage delalloc locking
  @ 2024-05-21  7:50  1%   ` Naohiro Aota
  2024-05-21  7:57  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-21  7:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef

On Sat, May 18, 2024 at 02:37:40PM GMT, Qu Wenruo wrote:
> Three new helpers are introduced for the incoming subpage delalloc locking
> change.
> 
> - btrfs_folio_set_writer_lock()
>   This is to mark the specified range with a subpage-specific writer lock.
>   After calling this, the subpage range can be properly unlocked by
>   btrfs_folio_end_writer_lock()
> 
> - btrfs_subpage_find_writer_locked()
>   This is to find the writer locked subpage range in a page.
>   With the help of btrfs_folio_set_writer_lock(), it allows us to
>   record and find a previously locked subpage range without extra memory
>   allocation.
> 
> - btrfs_folio_end_all_writers()
>   This is for the locked_page of __extent_writepage(), as there may be
>   multiple subpage delalloc ranges locked.
> 
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---

There are some nits inlined below, but basically it looks good.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

>  fs/btrfs/subpage.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/subpage.h |   7 +++
>  2 files changed, 123 insertions(+)
> 
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 183b32f51f51..3c957d03324e 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -775,6 +775,122 @@ void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
>  	btrfs_folio_end_writer_lock(fs_info, folio, start, len);
>  }
>  
> +/*
> + * This is for folio already locked by plain lock_page()/folio_lock(), which
> + * doesn't have any subpage awareness.
> + *
> + * This would populate the involved subpage ranges so that subpage helpers can
> + * properly unlock them.
> + */
> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
> +				 struct folio *folio, u64 start, u32 len)
> +{
> +	struct btrfs_subpage *subpage;
> +	unsigned long flags;
> +	int start_bit;
> +	int nbits;

May want to use unsigned int for consistency...

> +	int ret;
> +
> +	ASSERT(folio_test_locked(folio));
> +	if (unlikely(!fs_info) || !btrfs_is_subpage(fs_info, folio->mapping))
> +		return;
> +
> +	subpage = folio_get_private(folio);
> +	start_bit = subpage_calc_start_bit(fs_info, folio, locked, start, len);
> +	nbits = len >> fs_info->sectorsize_bits;
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	/* Target range should not yet be locked. */
> +	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> +	bitmap_set(subpage->bitmaps, start_bit, nbits);
> +	ret = atomic_add_return(nbits, &subpage->writers);
> +	ASSERT(ret <= fs_info->subpage_info->bitmap_nr_bits);
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +}
> +
> +/*
> + * Find any subpage writer locked range inside @folio, starting at file offset
> + * @search_start.
> + * The caller should ensure the folio is locked.
> + *
> + * Return true and update @found_start_ret and @found_len_ret to the first
> + * writer locked range.
> + * Return false if there is no writer locked range.
> + */
> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
> +				      struct folio *folio, u64 search_start,
> +				      u64 *found_start_ret, u32 *found_len_ret)
> +{
> +	struct btrfs_subpage_info *subpage_info = fs_info->subpage_info;
> +	struct btrfs_subpage *subpage = folio_get_private(folio);
> +	const int len = PAGE_SIZE - offset_in_page(search_start);
> +	const int start_bit = subpage_calc_start_bit(fs_info, folio, locked,
> +						     search_start, len);
> +	const int locked_bitmap_start = subpage_info->locked_offset;
> +	const int locked_bitmap_end = locked_bitmap_start +
> +				      subpage_info->bitmap_nr_bits;
> +	unsigned long flags;
> +	int first_zero;
> +	int first_set;
> +	bool found = false;
> +
> +	ASSERT(folio_test_locked(folio));
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	first_set = find_next_bit(subpage->bitmaps, locked_bitmap_end,
> +				  start_bit);
> +	if (first_set >= locked_bitmap_end)
> +		goto out;
> +
> +	found = true;
> +	*found_start_ret = folio_pos(folio) +
> +		((first_set - locked_bitmap_start) << fs_info->sectorsize_bits);

It's a bit worrying to see an "int" value shifted and added into a u64
value. But I guess sectorsize is within the 32-bit range, right?

> +
> +	first_zero = find_next_zero_bit(subpage->bitmaps,
> +					locked_bitmap_end, first_set);
> +	*found_len_ret = (first_zero - first_set) << fs_info->sectorsize_bits;
> +out:
> +	spin_unlock_irqrestore(&subpage->lock, flags);
> +	return found;
> +}
> +
> +/*
> + * Unlike btrfs_folio_end_writer_lock() which unlocks a specified subpage range,
> + * this would end all writer locked ranges of a page.
> + *
> + * This is for the locked page of __extent_writepage(), as the locked page
> + * can contain several locked subpage ranges.
> + */
> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> +				 struct folio *folio)
> +{
> +	u64 folio_start = folio_pos(folio);
> +	u64 cur = folio_start;
> +
> +	ASSERT(folio_test_locked(folio));
> +	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
> +		folio_unlock(folio);
> +		return;
> +	}
> +
> +	while (cur < folio_start + PAGE_SIZE) {
> +		u64 found_start;
> +		u32 found_len;
> +		bool found;
> +		bool last;
> +
> +		found = btrfs_subpage_find_writer_locked(fs_info, folio, cur,
> +							 &found_start, &found_len);
> +		if (!found)
> +			break;
> +		last = btrfs_subpage_end_and_test_writer(fs_info, folio,
> +							 found_start, found_len);
> +		if (last) {
> +			folio_unlock(folio);
> +			break;
> +		}
> +		cur = found_start + found_len;
> +	}
> +}
> +
>  #define GET_SUBPAGE_BITMAP(subpage, subpage_info, name, dst)		\
>  	bitmap_cut(dst, subpage->bitmaps, 0,				\
>  		   subpage_info->name##_offset, subpage_info->bitmap_nr_bits)
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index 4b363d9453af..9f19850d59f2 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -112,6 +112,13 @@ int btrfs_folio_start_writer_lock(const struct btrfs_fs_info *fs_info,
>  				  struct folio *folio, u64 start, u32 len);
>  void btrfs_folio_end_writer_lock(const struct btrfs_fs_info *fs_info,
>  				 struct folio *folio, u64 start, u32 len);
> +void btrfs_folio_set_writer_lock(const struct btrfs_fs_info *fs_info,
> +				 struct folio *folio, u64 start, u32 len);
> +bool btrfs_subpage_find_writer_locked(const struct btrfs_fs_info *fs_info,
> +				      struct folio *folio, u64 search_start,
> +				      u64 *found_start_ret, u32 *found_len_ret);
> +void btrfs_folio_end_all_writers(const struct btrfs_fs_info *fs_info,
> +				 struct folio *folio);
>  
>  /*
>   * Template for subpage related operations.
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 1/5] btrfs: make __extent_writepage_io() to write specified range only
  @ 2024-05-21  7:23  1%   ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-21  7:23 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef

On Sat, May 18, 2024 at 02:37:39PM GMT, Qu Wenruo wrote:
> Function __extent_writepage_io() is designed to find all dirty ranges of
> a page, and add those dirty ranges into the bio_ctrl for submission.
> It requires all the dirtied ranges to be covered by an ordered extent.
> 
> It gets called in two locations, but one call site is not subpage aware:
> 
> - __extent_writepage()
>   It gets called when writepage_delalloc() returns 0, which means
>   writepage_delalloc() has handled delalloc for all subpage sectors
>   inside the page.
> 
>   So this call site is OK.
> 
> - extent_write_locked_range()
>   This call site is utilized by zoned support, and in this case, we may
>   only run delalloc range for a subset of the page, like this: (64K page
>   size)
> 
>   0     16K     32K     48K     64K
>   |/////|       |///////|       |
> 
>   In the above case, if extent_write_locked_range() is only triggered for
>   range [0, 16K), __extent_writepage_io() would still try to submit
>   the dirty range of [32K, 48K), then it would not find any ordered
>   extent for it and trigger various ASSERT()s.
> 
> Fix this problem by:
> 
> - Introducing @start and @len parameters to specify the range
> 
>   For the first call site, we just pass the whole page, and the behavior
>   is not touched, since run_delalloc_range() for the page should have
>   created all ordered extents for the page.
> 
>   For the second call site, we would avoid touching anything beyond the
>   range, thus avoid the dirty range which is not yet covered by any
>   delalloc range.
> 
> - Making btrfs_folio_assert_not_dirty() subpage aware
>   The only caller is inside __extent_writepage_io(), and since that
>   caller now accepts a subpage range, we should also check the subpage
>   range rather than the whole page.
> 
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 18 +++++++++++-------
>  fs/btrfs/subpage.c   | 22 ++++++++++++++++------
>  fs/btrfs/subpage.h   |  3 ++-
>  3 files changed, 29 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 597387e9f040..8a4a7d00795f 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1339,20 +1339,23 @@ static void find_next_dirty_byte(struct btrfs_fs_info *fs_info,
>   * < 0 if there were errors (page still locked)
>   */
>  static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
> -				 struct page *page,
> +				 struct page *page, u64 start, u32 len,
>  				 struct btrfs_bio_ctrl *bio_ctrl,
>  				 loff_t i_size,
>  				 int *nr_ret)
>  {
>  	struct btrfs_fs_info *fs_info = inode->root->fs_info;
> -	u64 cur = page_offset(page);
> -	u64 end = cur + PAGE_SIZE - 1;
> +	u64 cur = start;
> +	u64 end = start + len - 1;
>  	u64 extent_offset;
>  	u64 block_start;
>  	struct extent_map *em;
>  	int ret = 0;
>  	int nr = 0;
>  
> +	ASSERT(start >= page_offset(page) &&
> +	       start + len <= page_offset(page) + PAGE_SIZE);
> +
>  	ret = btrfs_writepage_cow_fixup(page);
>  	if (ret) {
>  		/* Fixup worker will requeue */
> @@ -1441,7 +1444,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
>  		nr++;
>  	}
>  
> -	btrfs_folio_assert_not_dirty(fs_info, page_folio(page));
> +	btrfs_folio_assert_not_dirty(fs_info, page_folio(page), start, len);
>  	*nr_ret = nr;
>  	return 0;
>  
> @@ -1499,7 +1502,8 @@ static int __extent_writepage(struct page *page, struct btrfs_bio_ctrl *bio_ctrl
>  	if (ret)
>  		goto done;
>  
> -	ret = __extent_writepage_io(BTRFS_I(inode), page, bio_ctrl, i_size, &nr);
> +	ret = __extent_writepage_io(BTRFS_I(inode), page, page_offset(page),
> +				    PAGE_SIZE, bio_ctrl, i_size, &nr);
>  	if (ret == 1)
>  		return 0;
>  
> @@ -2251,8 +2255,8 @@ void extent_write_locked_range(struct inode *inode, struct page *locked_page,
>  			clear_page_dirty_for_io(page);
>  		}
>  
> -		ret = __extent_writepage_io(BTRFS_I(inode), page, &bio_ctrl,
> -					    i_size, &nr);
> +		ret = __extent_writepage_io(BTRFS_I(inode), page, cur, cur_len,
> +					    &bio_ctrl, i_size, &nr);
>  		if (ret == 1)
>  			goto next_page;
>  
> diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> index 54736f6238e6..183b32f51f51 100644
> --- a/fs/btrfs/subpage.c
> +++ b/fs/btrfs/subpage.c
> @@ -703,19 +703,29 @@ IMPLEMENT_BTRFS_PAGE_OPS(checked, folio_set_checked, folio_clear_checked,
>   * Make sure not only the page dirty bit is cleared, but also subpage dirty bit
>   * is cleared.
>   */
> -void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio)
> +void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> +				  struct folio *folio, u64 start, u32 len)
>  {
> -	struct btrfs_subpage *subpage = folio_get_private(folio);
> +	struct btrfs_subpage *subpage;
> +	int start_bit;
> +	int nbits;

Can we have these as "unsigned int" to be consistent with the function
interface? But, it's not a big deal so

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

> +	unsigned long flags;
>  
>  	if (!IS_ENABLED(CONFIG_BTRFS_ASSERT))
>  		return;
>  
> -	ASSERT(!folio_test_dirty(folio));
> -	if (!btrfs_is_subpage(fs_info, folio->mapping))
> +	if (!btrfs_is_subpage(fs_info, folio->mapping)) {
> +		ASSERT(!folio_test_dirty(folio));
>  		return;
> +	}
>  
> -	ASSERT(folio_test_private(folio) && folio_get_private(folio));
> -	ASSERT(subpage_test_bitmap_all_zero(fs_info, subpage, dirty));
> +	start_bit = subpage_calc_start_bit(fs_info, folio, dirty, start, len);
> +	nbits = len >> fs_info->sectorsize_bits;
> +	subpage = folio_get_private(folio);
> +	ASSERT(subpage);
> +	spin_lock_irqsave(&subpage->lock, flags);
> +	ASSERT(bitmap_test_range_all_zero(subpage->bitmaps, start_bit, nbits));
> +	spin_unlock_irqrestore(&subpage->lock, flags);
>  }
>  
>  /*
> diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
> index b6dc013b0fdc..4b363d9453af 100644
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -156,7 +156,8 @@ DECLARE_BTRFS_SUBPAGE_OPS(checked);
>  bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
>  					struct folio *folio, u64 start, u32 len);
>  
> -void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info, struct folio *folio);
> +void btrfs_folio_assert_not_dirty(const struct btrfs_fs_info *fs_info,
> +				  struct folio *folio, u64 start, u32 len);
>  void btrfs_folio_unlock_writer(struct btrfs_fs_info *fs_info,
>  			       struct folio *folio, u64 start, u32 len);
>  void __cold btrfs_subpage_dump_bitmap(const struct btrfs_fs_info *fs_info,
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v5 5/5] btrfs: make extent_write_locked_range() to handle subpage writeback correctly
  @ 2024-05-21  7:13  1%   ` Naohiro Aota
  0 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-21  7:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, Johannes Thumshirn, josef

On Sat, May 18, 2024 at 02:37:43PM GMT, Qu Wenruo wrote:
> When extent_write_locked_range() generates an inline extent, it sets
> and finishes the writeback for the whole page.
> 
> Although currently it's safe since subpage disables inline creation,
> for the sake of consistency, let it go with subpage helpers to set and
> clear the writeback flags.
> 
> Reviewed-by: Josef Bacik <josef@toxicpanda.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---

LGTM

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/6] Cleanups and W=2 warning fixes
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (7 preceding siblings ...)
  2024-05-21  0:38  1% ` Anand Jain
@ 2024-05-21  4:21  1% ` Naohiro Aota
  8 siblings, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-21  4:21 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Mon, May 20, 2024 at 09:52:08PM GMT, David Sterba wrote:
> We have a clean run of 'make W=1' with gcc 13; there are some
> interesting warnings to fix with level 2 and even 3. We can't enable the
> warning flags by default due to reports from generic code.
> 
> This short series removes shadowed variables, adds const and removes
> an unused macro. There are still some shadow variables to fix but the
> remaining cases are with 'ret' variables so I skipped it for now.
> 
> David Sterba (6):
>   btrfs: remove duplicate name variable declarations
>   btrfs: rename macro local variables that clash with other variables
>   btrfs: use for-local variabls that shadow function variables
>   btrfs: remove unused define EXTENT_SIZE_PER_ITEM
>   btrfs: keep const whene returnin value from get_unaligned_le8()
>   btrfs: constify parameters of write_eb_member() and its users

A small nit added to patch 3 but for whole the series

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

>  fs/btrfs/accessors.h   | 12 ++++++------
>  fs/btrfs/extent_io.c   |  6 ++----
>  fs/btrfs/inode.c       |  2 --
>  fs/btrfs/qgroup.c      | 11 +++++------
>  fs/btrfs/space-info.c  |  2 --
>  fs/btrfs/subpage.c     |  8 ++++----
>  fs/btrfs/transaction.h |  6 +++---
>  fs/btrfs/volumes.c     |  9 +++------
>  fs/btrfs/zoned.c       |  8 +++-----
>  9 files changed, 26 insertions(+), 38 deletions(-)
> 
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 3/6] btrfs: use for-local variabls that shadow function variables
  2024-05-20 19:52  1% ` [PATCH 3/6] btrfs: use for-local variabls that shadow function variables David Sterba
@ 2024-05-21  4:13  1%   ` Naohiro Aota
  2024-05-21 13:01  1%     ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Naohiro Aota @ 2024-05-21  4:13 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Mon, May 20, 2024 at 09:52:26PM GMT, David Sterba wrote:
> We've started to use for-loop local variables and in a few places this
> shadows a function variable. Convert a few cases reported by 'make W=2'.
> If applicable, also change the style to post-increment, which is the
> preferred one.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---

LGTM asides from a small nit below.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

>  fs/btrfs/qgroup.c  | 11 +++++------
>  fs/btrfs/volumes.c |  9 +++------
>  fs/btrfs/zoned.c   |  8 +++-----
>  3 files changed, 11 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index fc2a7ea26354..a94a5b87b042 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3216,7 +3216,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  			 struct btrfs_qgroup_inherit *inherit)
>  {
>  	int ret = 0;
> -	int i;
>  	u64 *i_qgroups;
>  	bool committing = false;
>  	struct btrfs_fs_info *fs_info = trans->fs_info;
> @@ -3273,7 +3272,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		i_qgroups = (u64 *)(inherit + 1);
>  		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
>  		       2 * inherit->num_excl_copies;
> -		for (i = 0; i < nums; ++i) {
> +		for (int i = 0; i < nums; i++) {
>  			srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
>  
>  			/*
> @@ -3300,7 +3299,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  	 */
>  	if (inherit) {
>  		i_qgroups = (u64 *)(inherit + 1);
> -		for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
> +		for (int i = 0; i < inherit->num_qgroups; i++, i_qgroups++) {
>  			if (*i_qgroups == 0)
>  				continue;
>  			ret = add_qgroup_relation_item(trans, objectid,
> @@ -3386,7 +3385,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		goto unlock;
>  
>  	i_qgroups = (u64 *)(inherit + 1);
> -	for (i = 0; i < inherit->num_qgroups; ++i) {
> +	for (int i = 0; i < inherit->num_qgroups; i++) {
>  		if (*i_qgroups) {
>  			ret = add_relation_rb(fs_info, qlist_prealloc[i], objectid,
>  					      *i_qgroups);
> @@ -3406,7 +3405,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		++i_qgroups;
>  	}
>  
> -	for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
> +	for (int i = 0; i < inherit->num_ref_copies; i++, i_qgroups += 2) {
>  		struct btrfs_qgroup *src;
>  		struct btrfs_qgroup *dst;
>  
> @@ -3427,7 +3426,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
>  		/* Manually tweaking numbers certainly needs a rescan */
>  		need_rescan = true;
>  	}
> -	for (i = 0; i <  inherit->num_excl_copies; ++i, i_qgroups += 2) {
> +	for (int i = 0; i <  inherit->num_excl_copies; i++, i_qgroups += 2) {
                           ^
nit:                       we have double space here for no reason.
Can we just dedup it as well?

>  		struct btrfs_qgroup *src;
>  		struct btrfs_qgroup *dst;
>  
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7c9d68b1ba69..3f70f727dacf 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -5623,8 +5623,6 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
>  	u64 start = ctl->start;
>  	u64 type = ctl->type;
>  	int ret;
> -	int i;
> -	int j;
>  
>  	map = btrfs_alloc_chunk_map(ctl->num_stripes, GFP_NOFS);
>  	if (!map)
> @@ -5639,8 +5637,8 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
>  	map->sub_stripes = ctl->sub_stripes;
>  	map->num_stripes = ctl->num_stripes;
>  
> -	for (i = 0; i < ctl->ndevs; ++i) {
> -		for (j = 0; j < ctl->dev_stripes; ++j) {
> +	for (int i = 0; i < ctl->ndevs; i++) {
> +		for (int j = 0; j < ctl->dev_stripes; j++) {
>  			int s = i * ctl->dev_stripes + j;
>  			map->stripes[s].dev = devices_info[i].dev;
>  			map->stripes[s].physical = devices_info[i].dev_offset +
> @@ -6618,7 +6616,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>  	struct btrfs_chunk_map *map;
>  	struct btrfs_io_geometry io_geom = { 0 };
>  	u64 map_offset;
> -	int i;
>  	int ret = 0;
>  	int num_copies;
>  	struct btrfs_io_context *bioc = NULL;
> @@ -6764,7 +6761,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>  		 * For all other non-RAID56 profiles, just copy the target
>  		 * stripe into the bioc.
>  		 */
> -		for (i = 0; i < io_geom.num_stripes; i++) {
> +		for (int i = 0; i < io_geom.num_stripes; i++) {
>  			ret = set_io_stripe(fs_info, logical, length,
>  					    &bioc->stripes[i], map, &io_geom);
>  			if (ret < 0)
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index dde4a0a34037..e9087264f3e3 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -87,9 +87,8 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
>  	bool empty[BTRFS_NR_SB_LOG_ZONES];
>  	bool full[BTRFS_NR_SB_LOG_ZONES];
>  	sector_t sector;
> -	int i;
>  
> -	for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
> +	for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
>  		ASSERT(zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL);
>  		empty[i] = (zones[i].cond == BLK_ZONE_COND_EMPTY);
>  		full[i] = sb_zone_is_full(&zones[i]);
> @@ -121,9 +120,8 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
>  		struct address_space *mapping = bdev->bd_inode->i_mapping;
>  		struct page *page[BTRFS_NR_SB_LOG_ZONES];
>  		struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
> -		int i;
>  
> -		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
> +		for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
>  			u64 zone_end = (zones[i].start + zones[i].capacity) << SECTOR_SHIFT;
>  			u64 bytenr = ALIGN_DOWN(zone_end, BTRFS_SUPER_INFO_SIZE) -
>  						BTRFS_SUPER_INFO_SIZE;
> @@ -144,7 +142,7 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
>  		else
>  			sector = zones[0].start;
>  
> -		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
> +		for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
>  			btrfs_release_disk_super(super[i]);
>  	} else if (!full[0] && (empty[1] || full[1])) {
>  		sector = zones[0].wp;
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* [PATCH v4 2/2] btrfs: automatically remove the subvolume qgroup
  2024-05-21  3:03  1% [PATCH v4 0/2] btrfs: qgroup: stale qgroups related impromvents Qu Wenruo
  2024-05-21  3:03  1% ` [PATCH v4 1/2] btrfs: slightly loose the requirement for qgroup removal Qu Wenruo
@ 2024-05-21  3:03  1% ` Qu Wenruo
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  3:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Boris Burkov

Currently if we fully remove a subvolume (not only unlinking it, but fully
dropping its root item), its qgroup is not removed.

Thus we have "btrfs qgroup clear-stale" to handle such 0 level qgroups.

This patch changes the behavior by automatically removing the qgroup of
a fully dropped subvolume when possible:

- Full qgroup but still consistent
  We can and should remove the qgroup.
  The qgroup numbers should be 0, without any rsv.

- Full qgroup but inconsistent
  Can happen with drop_subtree_threshold feature (skip accounting
  and mark qgroup inconsistent).

  We can and should remove the qgroup.
  Higher level qgroup numbers will be incorrect, but since qgroup
  is already inconsistent, it should not be a problem.

- Squota mode
  This is the special case, we can only drop the qgroup if its numbers
  are all 0.

  This would be handled by can_delete_qgroup(), so we only need to check
  the return value and ignore the -EBUSY error.

Reviewed-by: Boris Burkov <boris@bur.io>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent-tree.c |  8 ++++++++
 fs/btrfs/qgroup.c      | 38 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/qgroup.h      |  2 ++
 3 files changed, 48 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3774c191e36d..32f03c0a7b9e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5833,6 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	struct btrfs_root_item *root_item = &root->root_item;
 	struct walk_control *wc;
 	struct btrfs_key key;
+	const u64 rootid = btrfs_root_id(root);
 	int err = 0;
 	int ret;
 	int level;
@@ -6063,6 +6064,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
 	kfree(wc);
 	btrfs_free_path(path);
 out:
+	if (!err && root_dropped) {
+		ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid);
+		if (ret < 0)
+			btrfs_warn_rl(fs_info,
+				      "failed to cleanup qgroup 0/%llu: %d",
+				      rootid, ret);
+	}
 	/*
 	 * We were an unfinished drop root, check to see if there are any
 	 * pending, and if not clear and wake up any waiters.
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9b4de1b43298..826d972a7c75 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1893,6 +1893,44 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	return ret;
 }
 
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
+					   u64 subvolid)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) ||
+	    !fs_info->quota_root)
+		return 0;
+
+	/*
+	 * Commit current transaction to make sure all the rfer/excl numbers
+	 * get updated.
+	 */
+	trans = btrfs_start_transaction(fs_info->quota_root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	ret = btrfs_commit_transaction(trans);
+	if (ret < 0)
+		return ret;
+
+	/* Start new trans to delete the qgroup info and limit items. */
+	trans = btrfs_start_transaction(fs_info->quota_root, 2);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	ret = btrfs_remove_qgroup(trans, subvolid);
+	btrfs_end_transaction(trans);
+	/*
+	 * It's squota and the subvolume still has numbers needed
+	 * for future accounting, in this case we can not delete.
+	 * Just skip it.
+	 */
+	if (ret == -EBUSY)
+		ret = 0;
+	return ret;
+}
+
 int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
 		       struct btrfs_qgroup_limit *limit)
 {
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 706640be0ec2..3f93856a02e1 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
 			      u64 dst);
 int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
 int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
+int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
+					   u64 subvolid);
 int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
 		       struct btrfs_qgroup_limit *limit);
 int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 1/2] btrfs: slightly loose the requirement for qgroup removal
  2024-05-21  3:03  1% [PATCH v4 0/2] btrfs: qgroup: stale qgroups related impromvents Qu Wenruo
@ 2024-05-21  3:03  1% ` Qu Wenruo
  2024-05-21  3:03  1% ` [PATCH v4 2/2] btrfs: automatically remove the subvolume qgroup Qu Wenruo
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  3:03 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Boris Burkov

[BUG]
Currently if one is utilizing the "qgroups/drop_subtree_threshold" sysfs file,
and a snapshot with a level higher than that value is dropped, btrfs will
not be able to delete the qgroup until the next qgroup rescan:

  uuid=ffffffff-eeee-dddd-cccc-000000000000

  wipefs -fa $dev
  mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid
  mount $dev $mnt

  btrfs subvolume create $mnt/subv1/
  for (( i = 0; i < 1024; i++ )); do
  	xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null
  done
  sync
  btrfs subv snapshot $mnt/subv1 $mnt/snapshot
  btrfs quota enable $mnt
  btrfs quota rescan -w $mnt
  sync
  echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold
  btrfs subvolume delete $mnt/snapshot
  btrfs subv sync $mnt
  btrfs qgroup show -prce --sync $mnt
  btrfs qgroup destroy 0/257 $mnt
  umount $mnt

The final qgroup removal would fail with the following error:

  ERROR: unable to destroy quota group: Device or resource busy

[CAUSE]
The above script would generate a subvolume of level 2, then snapshot
it, enable qgroups, set the drop_subtree_threshold, then drop the
snapshot.

Since the subvolume drop would meet the threshold, the qgroup would be
marked inconsistent and accounting would be skipped, to avoid hanging the
system at transaction commit.

But currently we do not allow a qgroup with any rfer/excl numbers to be
dropped, and this is not really compatible with the new
drop_subtree_threshold behavior.

[FIX]
Only require strictly zero rfer/excl/rfer_cmpr/excl_cmpr for squota
mode.
This is due to the fact that squota can never go inconsistent, and it
can have a dropped subvolume with non-zero qgroup numbers kept for future
accounting.

For full qgroup mode, we only check whether a subvolume still exists for
the qgroup.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/qgroup.c | 91 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 84 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index fc2a7ea26354..9b4de1b43298 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1748,13 +1748,57 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	return ret;
 }
 
-static bool qgroup_has_usage(struct btrfs_qgroup *qgroup)
+/*
+ * Return 0 if we can not delete the qgroup (not empty or has children etc).
+ * Return >0 if we can delete the qgroup.
+ * Return <0 for other errors during tree search.
+ */
+static int can_delete_qgroup(struct btrfs_fs_info *fs_info,
+			     struct btrfs_qgroup *qgroup)
 {
-	return (qgroup->rfer > 0 || qgroup->rfer_cmpr > 0 ||
-		qgroup->excl > 0 || qgroup->excl_cmpr > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] > 0 ||
-		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS] > 0);
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	int ret;
+
+	/*
+	 * Squota would never be inconsistent, but there can still be
+	 * cases where a dropped subvolume still has qgroup numbers, and
+	 * squota relies on such qgroup for future accounting.
+	 *
+	 * So for squota, do not allow dropping any non-zero qgroup.
+	 */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE &&
+	    (qgroup->rfer || qgroup->excl || qgroup->excl_cmpr ||
+	     qgroup->rfer_cmpr))
+		return 0;
+
+	/* For higher level qgroup, we can only delete it if it has no child. */
+	if (btrfs_qgroup_level(qgroup->qgroupid)) {
+		if (!list_empty(&qgroup->members))
+			return 0;
+		return 1;
+	}
+
+	/*
+	 * For level-0 qgroups, we can only delete it if it has no subvolume
+	 * This means even if a subvolume is unlinked but not yet fully dropped,
+	 * This means even a subvolume is unlinked but not yet fully dropped,
+	 * we can not delete the qgroup.
+	 */
+	key.objectid = qgroup->qgroupid;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = -1ULL;
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	ret = btrfs_find_root(fs_info->tree_root, &key, path, NULL, NULL);
+	btrfs_free_path(path);
+	/*
+	 * The @ret from btrfs_find_root() exactly matches our definition for
+	 * the return value, thus can be returned directly.
+	 */
+	return ret;
 }
 
 int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
@@ -1776,7 +1820,10 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 		goto out;
 	}
 
-	if (is_fstree(qgroupid) && qgroup_has_usage(qgroup)) {
+	ret = can_delete_qgroup(fs_info, qgroup);
+	if (ret < 0)
+		goto out;
+	if (ret == 0) {
 		ret = -EBUSY;
 		goto out;
 	}
@@ -1801,6 +1848,36 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
 	}
 
 	spin_lock(&fs_info->qgroup_lock);
+	/*
+	 * Warn on reserved space. The qgroup should have no child nor
+	 * corresponding subvolume.
+	 * Thus its reserved space should all be zero, no matter if the qgroup
+	 * is consistent or which mode is in use.
+	 */
+	WARN_ON(qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] ||
+		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] ||
+		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS]);
+	/*
+	 * The same for rfer/excl numbers, but that's only if our qgroup is
+	 * consistent and if it's in regular qgroup mode.
+	 * For simple mode it's not as accurate thus we can hit non-zero values
+	 * very frequently.
+	 */
+	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL &&
+	    !(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) {
+		if (WARN_ON(qgroup->rfer || qgroup->excl ||
+			    qgroup->rfer_cmpr || qgroup->excl_cmpr)) {
+			btrfs_warn_rl(fs_info,
+"to be deleted qgroup %u/%llu has non-zero numbers, rfer %llu rfer_cmpr %llu excl %llu excl_cmpr %llu",
+				      btrfs_qgroup_level(qgroup->qgroupid),
+				      btrfs_qgroup_subvolid(qgroup->qgroupid),
+				      qgroup->rfer,
+				      qgroup->rfer_cmpr,
+				      qgroup->excl,
+				      qgroup->excl_cmpr);
+			qgroup_mark_inconsistent(fs_info);
+		}
+	}
 	del_qgroup_rb(fs_info, qgroupid);
 	spin_unlock(&fs_info->qgroup_lock);
 
-- 
2.45.1


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 0/2] btrfs: qgroup: stale qgroups related impromvents
@ 2024-05-21  3:03  1% Qu Wenruo
  2024-05-21  3:03  1% ` [PATCH v4 1/2] btrfs: slightly loose the requirement for qgroup removal Qu Wenruo
  2024-05-21  3:03  1% ` [PATCH v4 2/2] btrfs: automatically remove the subvolume qgroup Qu Wenruo
  0 siblings, 2 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  3:03 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v4:
- Slight code style changes
  Move one nested if into a single line one.

- Make can_delete_qgroup() return int to indicate error during tree
  search

v3:
- Do not allow dropping non-zero qgroups for squota
  Due to the design of squota, a fully dropped subvolume can still have
  non-zero qgroups, as in the extent tree the delta is still using the
  original owner (the dropped subvolume).

  So we can not drop any non-zero squota qgroups

- Slight change on the can_delete_qgroup() condition
  To co-operate with above change.

v2:
- Do more sanity checks before deleting a qgroup

- Make squota to handle auto deleted qgroup more gracefully
  Unfortunately the behavior change would affect btrfs/301, as the
  fully deleted subvolume would make the test case to cause bash grammar
  error (since the qgroup is gone with the subvolume).

  Cc Boris for extra comments on squota compatibility and future
  btrfs/311 updates ideas.


We have two problems in recent qgroup code:

- One can not delete a fully removed qgroup if the drop hits
  drop_subtree_threshold
  As hitting drop_subtree_threshold would mark the qgroup inconsistent and
  skip all accounting, this would leave the qgroup numbers untouched (thus
  non-zero), and btrfs refuses to delete a qgroup with non-zero rfer/excl
  numbers.

  This would be addressed by the first patch, allowing qgroup
  deletion as long as it doesn't have any child nor a corresponding
  subvolume.

- A deleted snapshot would leave a stale qgroup
  This is a long-standing problem.

  Although previous pushes all failed, let me try again.

  The idea is commit current transaction if needed (full accounting mode
  and qgroup numbers are consistent), then try to remove the subvolume
  qgroup after it is fully dropped.

  For full qgroup mode, no matter if the qgroup is inconsistent, we will
  auto-remove the dropped level-0 qgroup.
  (If consistent, the qgroup numbers should all be 0, making the
   removal safe. If inconsistent, we still remove the qgroup, leaving
   higher level qgroups incorrect, but since it's already inconsistent,
   we need a rescan anyway.)

  For squota mode, we only do the auto reap if the qgroup numbers are
  already 0.
  Otherwise the qgroup numbers would be needed for future accounting.

The behavior change would only affect btrfs/301, which still expects the
dropped subvolume to leave its qgroup untouched.
But that can be easily fixed by explicitly echoing "0" for the dropped
subvolume.


Qu Wenruo (2):
  btrfs: slightly loose the requirement for qgroup removal
  btrfs: automatically remove the subvolume qgroup

 fs/btrfs/extent-tree.c |   8 +++
 fs/btrfs/qgroup.c      | 129 ++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/qgroup.h      |   2 +
 3 files changed, 132 insertions(+), 7 deletions(-)

-- 
2.45.1


^ permalink raw reply	[relevance 1%]

* Re: [PATCH] fstests: btrfs/301: handle auto-removed qgroups
  @ 2024-05-21  1:19  1% ` Boris Burkov
  2024-05-23 15:43  1% ` Anand Jain
  1 sibling, 0 replies; 200+ results
From: Boris Burkov @ 2024-05-21  1:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, fstests

On Tue, May 07, 2024 at 04:36:06PM +0930, Qu Wenruo wrote:
> There are always attempts to auto-remove empty qgroups after dropping a
> subvolume.
> 
> For squota mode, not all qgroups can or should be dropped, as there are
> common cases where the dropped subvolume is still referred to by other
> snapshots.
> In that case, the numbers can only be freed when the last referencer
> is dropped.
> 
> The latest kernel attempt would only try to drop empty qgroups for
> squota mode.
> But even with such safe change, the test case still needs to handle
> auto-removed qgroups, by explicitly echoing "0", or later calculation
> would break bash grammar.
> 
> This patch would add extra handling for such removed qgroups, to be
> future proof for qgroup auto-removal behavior change.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Looks good, thanks!
Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  tests/btrfs/301 | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/tests/btrfs/301 b/tests/btrfs/301
> index db469724..bb18ab04 100755
> --- a/tests/btrfs/301
> +++ b/tests/btrfs/301
> @@ -51,9 +51,17 @@ _require_fio $fio_config
>  get_qgroup_usage()
>  {
>  	local qgroupid=$1
> +	local output
>  
> -	$BTRFS_UTIL_PROG qgroup show --sync --raw $SCRATCH_MNT | \
> -				grep "$qgroupid" | $AWK_PROG '{print $3}'
> +	output=$($BTRFS_UTIL_PROG qgroup show --sync --raw $SCRATCH_MNT | \
> +		 grep "$qgroupid" | $AWK_PROG '{print $3}')
> +	# The qgroup is auto-removed, this can only happen if its numbers are
> +	# already all zeros, so here we only need to explicitly echo "0".
> +	if [ -z "$output" ]; then
> +		echo "0"
> +	else
> +		echo "$output"
> +	fi
>  }
>  
>  get_subvol_usage()
> -- 
> 2.44.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 2/2] btrfs: automatically remove the subvolume qgroup
  2024-05-20 23:03  1%     ` Qu Wenruo
@ 2024-05-21  1:05  1%       ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-21  1:05 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



On 2024/5/21 08:33, Qu Wenruo wrote:
> 
> 
> On 2024/5/21 08:20, Boris Burkov wrote:
> [...]
>>> +    /*
>>> +     * It's squota and the subvolume still has numbers needed
>>> +     * for future accounting, in this case we can not delete.
>>> +     * Just skip it.
>>> +     */
>>
>> Maybe throw in an ASSERT or WARN or whatever you think is best checking
>> for squota mode, if we are sure this shouldn't happen for normal qgroup?
> 
> Sounds good.
> 
> Will add an ASSERT() to make sure it's squota mode.

After more thought, I believe ASSERT() can lead to false alerts.

The problem here is that we do not have any extra race prevention; we
really rely on a one-time call of btrfs_remove_qgroup() to do the proper
locking.

But after btrfs_remove_qgroup() returns -EBUSY, we can race with qgroup
disabling, thus doing an ASSERT() without the proper lock context can
lead to a false alert, e.g. when qgroups get disabled right after the
btrfs_remove_qgroup() call.
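
Roughly the following interleaving is what I'm worried about (just my
understanding of the race, not verified):

	cleanup thread                     quota disable thread
	btrfs_remove_qgroup()
	|- returns -EBUSY
	                                   btrfs_quota_disable()
	ASSERT(btrfs_qgroup_mode(fs_info) ==
	       BTRFS_QGROUP_MODE_SIMPLE)

By the time the ASSERT() runs, qgroup is already disabled, thus a false
alert.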

So I'm afraid we can not do extra checks here.

Thanks,
Qu
> 
> Thanks,
> Qu
> 
>>
>>> +    if (ret == -EBUSY)
>>> +        ret = 0;
>>> +    return ret;
>>> +}
>>> +
>>>   int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>>>                  struct btrfs_qgroup_limit *limit)
>>>   {
>>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>>> index 706640be0ec2..3f93856a02e1 100644
>>> --- a/fs/btrfs/qgroup.h
>>> +++ b/fs/btrfs/qgroup.h
>>> @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct 
>>> btrfs_trans_handle *trans, u64 src,
>>>                     u64 dst);
>>>   int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 
>>> qgroupid);
>>>   int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 
>>> qgroupid);
>>> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info 
>>> *fs_info,
>>> +                       u64 subvolid);
>>>   int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>>>                  struct btrfs_qgroup_limit *limit);
>>>   int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
>>> -- 
>>> 2.45.0
>>>
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 0/6] part3 trivial adjustments for return variable coding style
      @ 2024-05-21  1:04  1% ` Anand Jain
  2024-05-21 15:21  2% ` David Sterba
  3 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21  1:04 UTC (permalink / raw)
  To: linux-btrfs


Any RB please?

Thanks, Anand


On 5/16/24 19:12, Anand Jain wrote:
> This is part 3 of the series, containing renames along with optimization of the
> return variables.
> 
> Some of the patches are new; they weren't part of v1 or v2. The new patches
> follow a verb-first format for titles. Older patches are not renamed, for easier
> back-reference.
> 
> The patchset passed the '-g quick' tests without regressions; sending them first.
> 
> Patch 3/6 and 4/6 can be merged; they are separated for easier diff.
> 
> v2 part2:
>    https://lore.kernel.org/linux-btrfs/cover.1713370756.git.anand.jain@oracle.com/
> v1:
>    https://lore.kernel.org/linux-btrfs/cover.1710857863.git.anand.jain@oracle.com/
> 
> Anand Jain (6):
>    btrfs: btrfs_cleanup_fs_roots handle ret variable
>    btrfs: simplify ret in btrfs_recover_relocation
>    btrfs: rename ret in btrfs_recover_relocation
>    btrfs: rename err in btrfs_recover_relocation
>    btrfs: btrfs_drop_snapshot optimize return variable
>    btrfs: rename and optimize return variable in btrfs_find_orphan_roots
> 
>   fs/btrfs/disk-io.c     | 38 ++++++++++++++--------------
>   fs/btrfs/extent-tree.c | 48 ++++++++++++++++++------------------
>   fs/btrfs/relocation.c  | 56 +++++++++++++++++++-----------------------
>   fs/btrfs/root-tree.c   | 32 ++++++++++++------------
>   4 files changed, 85 insertions(+), 89 deletions(-)
> 


^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/6] Cleanups and W=2 warning fixes
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (6 preceding siblings ...)
  2024-05-20 23:00  1% ` [PATCH 0/6] Cleanups and W=2 warning fixes Boris Burkov
@ 2024-05-21  0:38  1% ` Anand Jain
  2024-05-21  4:21  1% ` Naohiro Aota
  8 siblings, 0 replies; 200+ results
From: Anand Jain @ 2024-05-21  0:38 UTC (permalink / raw)
  To: David Sterba, linux-btrfs



On 5/21/24 03:52, David Sterba wrote:
> We have a clean run of 'make W=1' with gcc 13; there are some
> interesting warnings to fix with level 2 and even 3. We can't enable the
> warning flags by default due to reports from generic code.
> 
> This short series removes shadowed variables, adds const and removes
> an unused macro. There are still some shadow variables to fix but the
> remaining cases are with 'ret' variables so I skipped it for now.
> 
> David Sterba (6):
>    btrfs: remove duplicate name variable declarations
>    btrfs: rename macro local variables that clash with other variables
>    btrfs: use for-local variabls that shadow function variables
>    btrfs: remove unused define EXTENT_SIZE_PER_ITEM
>    btrfs: keep const whene returnin value from get_unaligned_le8()
>    btrfs: constify parameters of write_eb_member() and its users


Reviewed-by: Anand Jain <anand.jain@oracle.com>


Thanks, Anand

> 
>   fs/btrfs/accessors.h   | 12 ++++++------
>   fs/btrfs/extent_io.c   |  6 ++----
>   fs/btrfs/inode.c       |  2 --
>   fs/btrfs/qgroup.c      | 11 +++++------
>   fs/btrfs/space-info.c  |  2 --
>   fs/btrfs/subpage.c     |  8 ++++----
>   fs/btrfs/transaction.h |  6 +++---
>   fs/btrfs/volumes.c     |  9 +++------
>   fs/btrfs/zoned.c       |  8 +++-----
>   9 files changed, 26 insertions(+), 38 deletions(-)
> 


^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 2/2] btrfs: automatically remove the subvolume qgroup
  2024-05-20 22:50  1%   ` Boris Burkov
@ 2024-05-20 23:03  1%     ` Qu Wenruo
  2024-05-21  1:05  1%       ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-20 23:03 UTC (permalink / raw)
  To: Boris Burkov; +Cc: linux-btrfs



On 2024/5/21 08:20, Boris Burkov wrote:
[...]
>> +	/*
>> +	 * It's squota and the subvolume still has numbers needed
>> +	 * for future accounting, in this case we can not delete.
>> +	 * Just skip it.
>> +	 */
> 
> Maybe throw in an ASSERT or WARN or whatever you think is best checking
> for squota mode, if we are sure this shouldn't happen for normal qgroup?

Sounds good.

Will add an ASSERT() to make sure it's squota mode.

Thanks,
Qu

> 
>> +	if (ret == -EBUSY)
>> +		ret = 0;
>> +	return ret;
>> +}
>> +
>>   int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>>   		       struct btrfs_qgroup_limit *limit)
>>   {
>> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
>> index 706640be0ec2..3f93856a02e1 100644
>> --- a/fs/btrfs/qgroup.h
>> +++ b/fs/btrfs/qgroup.h
>> @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>>   			      u64 dst);
>>   int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
>>   int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
>> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
>> +					   u64 subvolid);
>>   int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>>   		       struct btrfs_qgroup_limit *limit);
>>   int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
>> -- 
>> 2.45.0
>>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/6] Cleanups and W=2 warning fixes
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (5 preceding siblings ...)
  2024-05-20 19:52  1% ` [PATCH 6/6] btrfs: constify parameters of write_eb_member() and its users David Sterba
@ 2024-05-20 23:00  1% ` Boris Burkov
  2024-05-21  0:38  1% ` Anand Jain
  2024-05-21  4:21  1% ` Naohiro Aota
  8 siblings, 0 replies; 200+ results
From: Boris Burkov @ 2024-05-20 23:00 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Mon, May 20, 2024 at 09:52:08PM +0200, David Sterba wrote:
> We have a clean run of 'make W=1' with gcc 13; there are some
> interesting warnings to fix with level 2 and even 3. We can't enable the
> warning flags by default due to reports from generic code.
> 
> This short series removes shadowed variables, adds const and removes
> an unused macro. There are still some shadow variables to fix but the
> remaining cases are with 'ret' variables so I skipped it for now.

These LGTM. I did notice a few typos in the patch titles, though.
Reviewed-by: Boris Burkov <boris@bur.io>

> 
> David Sterba (6):
>   btrfs: remove duplicate name variable declarations
>   btrfs: rename macro local variables that clash with other variables
>   btrfs: use for-local variabls that shadow function variables
btrfs: use for-local variables that shadow function variables
>   btrfs: remove unused define EXTENT_SIZE_PER_ITEM
>   btrfs: keep const whene returnin value from get_unaligned_le8()
btrfs: keep const when returning value from get_unaligned_le8()
>   btrfs: constify parameters of write_eb_member() and its users
> 
>  fs/btrfs/accessors.h   | 12 ++++++------
>  fs/btrfs/extent_io.c   |  6 ++----
>  fs/btrfs/inode.c       |  2 --
>  fs/btrfs/qgroup.c      | 11 +++++------
>  fs/btrfs/space-info.c  |  2 --
>  fs/btrfs/subpage.c     |  8 ++++----
>  fs/btrfs/transaction.h |  6 +++---
>  fs/btrfs/volumes.c     |  9 +++------
>  fs/btrfs/zoned.c       |  8 +++-----
>  9 files changed, 26 insertions(+), 38 deletions(-)
> 
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 2/2] btrfs: automatically remove the subvolume qgroup
  @ 2024-05-20 22:50  1%   ` Boris Burkov
  2024-05-20 23:03  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Boris Burkov @ 2024-05-20 22:50 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 07, 2024 at 04:28:11PM +0930, Qu Wenruo wrote:
> Currently if we fully remove a subvolume (not only unlink it, but fully
> drop its root item), its qgroup is not removed.
> 
> Thus we have "btrfs qgroup clear-stale" to handle such level-0 qgroups.
> 
> This patch changes the behavior by automatically removing the qgroup of
> a fully dropped subvolume when possible:
> 
> - Full qgroup but still consistent
>   We can and should remove the qgroup.
>   The qgroup numbers should be 0, without any rsv.
> 
> - Full qgroup but inconsistent
>   Can happen with drop_subtree_threshold feature (skip accounting
>   and mark qgroup inconsistent).
> 
>   We can and should remove the qgroup.
>   Higher level qgroup numbers will be incorrect, but since qgroup
>   is already inconsistent, it should not be a problem.
> 
> - Squota mode
>   This is the special case, we can only drop the qgroup if its numbers
>   are all 0.
> 
>   This would be handled by can_delete_qgroup(), so we only need to check
>   the return value and ignore the -EBUSY error.
> 
> Link: https://bugzilla.suse.com/show_bug.cgi?id=1222847
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> ---
>  fs/btrfs/extent-tree.c |  8 ++++++++
>  fs/btrfs/qgroup.c      | 38 ++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/qgroup.h      |  2 ++
>  3 files changed, 48 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 47d48233b592..21e07b698625 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5833,6 +5833,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>  	struct btrfs_root_item *root_item = &root->root_item;
>  	struct walk_control *wc;
>  	struct btrfs_key key;
> +	const u64 rootid = btrfs_root_id(root);
>  	int err = 0;
>  	int ret;
>  	int level;
> @@ -6063,6 +6064,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
>  	kfree(wc);
>  	btrfs_free_path(path);
>  out:
> +	if (!err && root_dropped) {
> +		ret = btrfs_qgroup_cleanup_dropped_subvolume(fs_info, rootid);
> +		if (ret < 0)
> +			btrfs_warn_rl(fs_info,
> +				      "failed to cleanup qgroup 0/%llu: %d",
> +				      rootid, ret);
> +	}
>  	/*
>  	 * We were an unfinished drop root, check to see if there are any
>  	 * pending, and if not clear and wake up any waiters.
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d89f16366a1c..d894a7e2bf30 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1864,6 +1864,44 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  	return ret;
>  }
>  
> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
> +					   u64 subvolid)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	if (!is_fstree(subvolid) || !btrfs_qgroup_enabled(fs_info) ||
> +	    !fs_info->quota_root)
> +		return 0;
> +
> +	/*
> +	 * Commit current transaction to make sure all the rfer/excl numbers
> +	 * get updated.
> +	 */
> +	trans = btrfs_start_transaction(fs_info->quota_root, 0);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +
> +	ret = btrfs_commit_transaction(trans);
> +	if (ret < 0)
> +		return ret;
> +
> +	/* Start new trans to delete the qgroup info and limit items. */
> +	trans = btrfs_start_transaction(fs_info->quota_root, 2);
> +	if (IS_ERR(trans))
> +		return PTR_ERR(trans);
> +	ret = btrfs_remove_qgroup(trans, subvolid);
> +	btrfs_end_transaction(trans);
> +	/*
> +	 * It's squota and the subvolume still has numbers needed
> +	 * for future accounting; in this case we cannot delete it.
> +	 * Just skip it.
> +	 */

Maybe throw in an ASSERT or WARN or whatever you think is best checking
for squota mode, if we are sure this shouldn't happen for normal qgroup?

> +	if (ret == -EBUSY)
> +		ret = 0;
> +	return ret;
> +}
> +
>  int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>  		       struct btrfs_qgroup_limit *limit)
>  {
> diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
> index 706640be0ec2..3f93856a02e1 100644
> --- a/fs/btrfs/qgroup.h
> +++ b/fs/btrfs/qgroup.h
> @@ -327,6 +327,8 @@ int btrfs_del_qgroup_relation(struct btrfs_trans_handle *trans, u64 src,
>  			      u64 dst);
>  int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
>  int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid);
> +int btrfs_qgroup_cleanup_dropped_subvolume(struct btrfs_fs_info *fs_info,
> +					   u64 subvolid);
>  int btrfs_limit_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid,
>  		       struct btrfs_qgroup_limit *limit);
>  int btrfs_read_qgroup_config(struct btrfs_fs_info *fs_info);
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 1/2] btrfs: slightly loose the requirement for qgroup removal
  @ 2024-05-20 22:46  1%   ` Boris Burkov
  0 siblings, 0 replies; 200+ results
From: Boris Burkov @ 2024-05-20 22:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, May 07, 2024 at 04:28:10PM +0930, Qu Wenruo wrote:
> [BUG]
> Currently if one is utilizing the "qgroups/drop_subtree_threshold" sysfs
> knob, and a snapshot with a level higher than that value is dropped,
> btrfs will not be able to delete the qgroup until the next qgroup rescan:
> 
>   uuid=ffffffff-eeee-dddd-cccc-000000000000
> 
>   wipefs -fa $dev
>   mkfs.btrfs -f $dev -O quota -s 4k -n 4k -U $uuid
>   mount $dev $mnt
> 
>   btrfs subvolume create $mnt/subv1/
>   for (( i = 0; i < 1024; i++ )); do
>   	xfs_io -f -c "pwrite 0 2k" $mnt/subv1/file_$i > /dev/null
>   done
>   sync
>   btrfs subv snapshot $mnt/subv1 $mnt/snapshot
>   btrfs quota enable $mnt
>   btrfs quota rescan -w $mnt
>   sync
>   echo 1 > /sys/fs/btrfs/$uuid/qgroups/drop_subtree_threshold
>   btrfs subvolume delete $mnt/snapshot
>   btrfs subv sync $mnt
>   btrfs qgroup show -prce --sync $mnt
>   btrfs qgroup destroy 0/257 $mnt
>   umount $mnt
> 
> The final qgroup removal would fail with the following error:
> 
>   ERROR: unable to destroy quota group: Device or resource busy
> 
> [CAUSE]
> The above script would generate a subvolume of level 2, then snapshot
> it, enable qgroup, set the drop_subtree_threshold, then drop the
> snapshot.
> 
> Since the subvolume drop would meet the threshold, the qgroup would be
> marked inconsistent and accounting skipped, to avoid hanging the system at
> transaction commit.
> 
> But currently we do not allow a qgroup with any rfer/excl numbers to be
> dropped, and this is not really compatible with the new
> drop_subtree_threshold behavior.
> 
> [FIX]
> Only require the strong zero rfer/excl/rfer_cmpr/excl_cmpr for squota
> mode.
> This is due to the fact that squota can never go inconsistent, and it
> can have a dropped subvolume whose non-zero qgroup numbers are kept for
> future accounting.
> 
> For full qgroup mode, we only check if there is a subvolume for it.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Sorry, I got dragged into other stuff and didn't get around to reviewing
this. LGTM!

Reviewed-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/qgroup.c | 82 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 75 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index eb28141d5c37..d89f16366a1c 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1728,13 +1728,51 @@ int btrfs_create_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  	return ret;
>  }
>  
> -static bool qgroup_has_usage(struct btrfs_qgroup *qgroup)
> +static bool can_delete_qgroup(struct btrfs_fs_info *fs_info,
> +			      struct btrfs_qgroup *qgroup)
>  {
> -	return (qgroup->rfer > 0 || qgroup->rfer_cmpr > 0 ||
> -		qgroup->excl > 0 || qgroup->excl_cmpr > 0 ||
> -		qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] > 0 ||
> -		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] > 0 ||
> -		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS] > 0);
> +	struct btrfs_key key;
> +	struct btrfs_path *path;
> +	int ret;
> +
> +	/*
> +	 * Squota would never be inconsistent, but there can still be a
> +	 * case where a dropped subvolume still has qgroup numbers, and
> +	 * squota relies on such a qgroup for future accounting.
> +	 *
> +	 * So for squota, do not allow dropping any non-zero qgroup.
> +	 */
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE) {

nit: you can chain this condition with an and rather than a nested if.

> +		if (qgroup->rfer || qgroup->excl || qgroup->excl_cmpr ||
> +		    qgroup->rfer_cmpr)
> +			return false;
> +	}
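
i.e. something like:

	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE &&
	    (qgroup->rfer || qgroup->excl || qgroup->excl_cmpr ||
	     qgroup->rfer_cmpr))
		return false;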
> +
> +	/* For higher level qgroup, we can only delete it if it has no child. */
> +	if (btrfs_qgroup_level(qgroup->qgroupid)) {
> +		if (!list_empty(&qgroup->members))
> +			return false;
> +		return true;
> +	}
> +
> +	/*
> +	 * For level-0 qgroups, we can only delete one if there is no
> +	 * subvolume for it.
> +	 * This means that even if a subvolume is unlinked but not yet
> +	 * fully dropped, we cannot delete the qgroup.
> +	 */
> +	key.objectid = qgroup->qgroupid;
> +	key.type = BTRFS_ROOT_ITEM_KEY;
> +	key.offset = -1ULL;
> +	path = btrfs_alloc_path();
> +	if (!path)

I suppose ideally this should be ENOMEM, not false => EBUSY (sketch below)

> +		return false;
> +
> +	ret = btrfs_find_root(fs_info->tree_root, &key, path, NULL, NULL);
> +	btrfs_free_path(path);
> +	if (ret > 0)
> +		return true;
> +	return false;
>  }
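
To be concrete, here's a rough sketch of what I mean -- same logic as
above, but with an int return so an allocation failure isn't reported as
"busy" (the caller in btrfs_remove_qgroup() would then check for a
negative errno instead of a bool):

	/* Sketch: return 0 if the qgroup can be deleted, -errno otherwise. */
	static int can_delete_qgroup(struct btrfs_fs_info *fs_info,
				     struct btrfs_qgroup *qgroup)
	{
		struct btrfs_key key;
		struct btrfs_path *path;
		int ret;

		if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_SIMPLE &&
		    (qgroup->rfer || qgroup->excl || qgroup->excl_cmpr ||
		     qgroup->rfer_cmpr))
			return -EBUSY;

		if (btrfs_qgroup_level(qgroup->qgroupid))
			return list_empty(&qgroup->members) ? 0 : -EBUSY;

		key.objectid = qgroup->qgroupid;
		key.type = BTRFS_ROOT_ITEM_KEY;
		key.offset = -1ULL;
		path = btrfs_alloc_path();
		if (!path)
			return -ENOMEM;

		ret = btrfs_find_root(fs_info->tree_root, &key, path, NULL, NULL);
		btrfs_free_path(path);
		if (ret < 0)
			return ret;
		/* Root item still exists -> subvolume not fully dropped. */
		return ret > 0 ? 0 : -EBUSY;
	}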
>  
>  int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
> @@ -1756,7 +1794,7 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  		goto out;
>  	}
>  
> -	if (is_fstree(qgroupid) && qgroup_has_usage(qgroup)) {
> +	if (!can_delete_qgroup(fs_info, qgroup)) {
>  		ret = -EBUSY;
>  		goto out;
>  	}
> @@ -1781,6 +1819,36 @@ int btrfs_remove_qgroup(struct btrfs_trans_handle *trans, u64 qgroupid)
>  	}
>  
>  	spin_lock(&fs_info->qgroup_lock);
> +	/*
> +	 * Warn on reserved space. The qgroup should have no child qgroup
> +	 * nor corresponding subvolume anymore.
> +	 * Thus its reserved space should all be zero, no matter if the
> +	 * qgroup is consistent or which mode is in use.
> +	 */
> +	WARN_ON(qgroup->rsv.values[BTRFS_QGROUP_RSV_DATA] ||
> +		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PREALLOC] ||
> +		qgroup->rsv.values[BTRFS_QGROUP_RSV_META_PERTRANS]);
> +	/*
> +	 * The same for rfer/excl numbers, but that's only if our qgroup is
> +	 * consistent and if it's in regular qgroup mode.
> +	 * For simple mode it's not as accurate thus we can hit non-zero values
> +	 * very frequently.
> +	 */
> +	if (btrfs_qgroup_mode(fs_info) == BTRFS_QGROUP_MODE_FULL &&
> +	    !(fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)) {
> +		if (WARN_ON(qgroup->rfer || qgroup->excl ||
> +			    qgroup->rfer_cmpr || qgroup->excl_cmpr)) {
> +			btrfs_warn_rl(fs_info,
> +"to be deleted qgroup %u/%llu has non-zero numbers, rfer %llu rfer_cmpr %llu excl %llu excl_cmpr %llu",
> +				      btrfs_qgroup_level(qgroup->qgroupid),
> +				      btrfs_qgroup_subvolid(qgroup->qgroupid),
> +				      qgroup->rfer,
> +				      qgroup->rfer_cmpr,
> +				      qgroup->excl,
> +				      qgroup->excl_cmpr);
> +			qgroup_mark_inconsistent(fs_info);
> +		}
> +	}
>  	del_qgroup_rb(fs_info, qgroupid);
>  	spin_unlock(&fs_info->qgroup_lock);
>  
> -- 
> 2.45.0
> 

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v2 08/11] btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args
  2024-05-20 15:55  1%   ` Filipe Manana
@ 2024-05-20 22:13  1%     ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-20 22:13 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2024/5/21 01:25, Filipe Manana wrote:
> On Fri, May 3, 2024 at 7:02 AM Qu Wenruo <wqu@suse.com> wrote:
[...]
>> @@ -1926,7 +1916,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>>                  goto out;
>>
>>          /* An explicit hole, must COW. */
>> -       if (args->disk_bytenr == 0)
>> +       if (btrfs_file_extent_disk_num_bytes(leaf, fi) == 0)
>
> No, this is not correct.

All my bad, will fix it definitely.

> It's btrfs_file_extent_disk_bytenr() that we want, not
> btrfs_file_extent_disk_num_bytes().
> In fact a disk_num_bytes of 0 should be invalid and never happen.

Thankfully for most cases, an explicit hole has both its disk_num_bytes
and disk_bytenr as zeros, thus no test case triggered the problem:

	item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53
		generation 7 type 1 (regular)
		extent data disk byte 0 nr 0
		extent data offset 0 nr 65536 ram 65536
		extent compression 0 (none)

But still I should fix that.

Thanks,
Qu

>
>>                  goto out;
>>
>>          /* Compressed/encrypted/encoded extents must be COWed. */
>> @@ -1951,8 +1941,8 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>>          btrfs_release_path(path);
>>
>>          ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
>> -                                   key->offset - args->extent_offset,
>> -                                   args->disk_bytenr, args->strict, path);
>> +                                   key->offset - args->file_extent.offset,
>> +                                   args->file_extent.disk_bytenr, args->strict, path);
>>          WARN_ON_ONCE(ret > 0 && is_freespace_inode);
>>          if (ret != 0)
>>                  goto out;
>> @@ -1973,21 +1963,18 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>>              atomic_read(&root->snapshot_force_cow))
>>                  goto out;
>>
>> -       args->disk_bytenr += args->extent_offset;
>> -       args->disk_bytenr += args->start - key->offset;
>> -       args->num_bytes = min(args->end + 1, extent_end) - args->start;
>> -
>> -       args->file_extent.num_bytes = args->num_bytes;
>> +       args->file_extent.num_bytes = min(args->end + 1, extent_end) - args->start;
>>          args->file_extent.offset += args->start - key->offset;
>> +       io_start = args->file_extent.disk_bytenr + args->file_extent.offset;
>>
>>          /*
>>           * Force COW if csums exist in the range. This ensures that csums for a
>>           * given extent are either valid or do not exist.
>>           */
>>
>> -       csum_root = btrfs_csum_root(root->fs_info, args->disk_bytenr);
>> -       ret = btrfs_lookup_csums_list(csum_root, args->disk_bytenr,
>> -                                     args->disk_bytenr + args->num_bytes - 1,
>> +       csum_root = btrfs_csum_root(root->fs_info, io_start);
>> +       ret = btrfs_lookup_csums_list(csum_root, io_start,
>> +                                     io_start + args->file_extent.num_bytes - 1,
>>                                        NULL, nowait);
>>          WARN_ON_ONCE(ret > 0 && is_freespace_inode);
>>          if (ret != 0)
>> @@ -2046,7 +2033,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>                  struct extent_buffer *leaf;
>>                  struct extent_state *cached_state = NULL;
>>                  u64 extent_end;
>> -               u64 ram_bytes;
>>                  u64 nocow_end;
>>                  int extent_type;
>>                  bool is_prealloc;
>> @@ -2125,7 +2111,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>                          ret = -EUCLEAN;
>>                          goto error;
>>                  }
>> -               ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
>>                  extent_end = btrfs_file_extent_end(path);
>>
>>                  /*
>> @@ -2145,7 +2130,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>                          goto must_cow;
>>
>>                  ret = 0;
>> -               nocow_bg = btrfs_inc_nocow_writers(fs_info, nocow_args.disk_bytenr);
>> +               nocow_bg = btrfs_inc_nocow_writers(fs_info,
>> +                               nocow_args.file_extent.disk_bytenr +
>> +                               nocow_args.file_extent.offset);
>>                  if (!nocow_bg) {
>>   must_cow:
>>                          /*
>> @@ -2181,16 +2168,18 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>                          }
>>                  }
>>
>> -               nocow_end = cur_offset + nocow_args.num_bytes - 1;
>> +               nocow_end = cur_offset + nocow_args.file_extent.num_bytes - 1;
>>                  lock_extent(&inode->io_tree, cur_offset, nocow_end, &cached_state);
>>
>>                  is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC;
>>                  if (is_prealloc) {
>>                          struct extent_map *em;
>>
>> -                       em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
>> -                                         nocow_args.disk_num_bytes, /* orig_block_len */
>> -                                         ram_bytes, BTRFS_COMPRESS_NONE,
>> +                       em = create_io_em(inode, cur_offset,
>> +                                         nocow_args.file_extent.num_bytes,
>> +                                         nocow_args.file_extent.disk_num_bytes,
>> +                                         nocow_args.file_extent.ram_bytes,
>> +                                         BTRFS_COMPRESS_NONE,
>>                                            &nocow_args.file_extent,
>>                                            BTRFS_ORDERED_PREALLOC);
>>                          if (IS_ERR(em)) {
>> @@ -2203,9 +2192,16 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>>                          free_extent_map(em);
>>                  }
>>
>> +               /*
>> +                * Check btrfs_create_dio_extent() for why we intentionally pass
>> +                * incorrect value for NOCOW/PREALLOC OEs.
>> +                */
>
> If in the next version you remove that similar comment/rant about OEs
> and disk_bytenr, also remove this one.
>
> Everything else in this patch looks fine, thanks.
>
>
>>                  ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
>> -                               nocow_args.num_bytes, nocow_args.num_bytes,
>> -                               nocow_args.disk_bytenr, nocow_args.num_bytes, 0,
>> +                               nocow_args.file_extent.num_bytes,
>> +                               nocow_args.file_extent.num_bytes,
>> +                               nocow_args.file_extent.disk_bytenr +
>> +                               nocow_args.file_extent.offset,
>> +                               nocow_args.file_extent.num_bytes, 0,
>>                                  is_prealloc
>>                                  ? (1 << BTRFS_ORDERED_PREALLOC)
>>                                  : (1 << BTRFS_ORDERED_NOCOW),
>> @@ -7144,8 +7140,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
>>    *      any ordered extents.
>>    */
>>   noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>> -                             u64 *orig_block_len,
>> -                             u64 *ram_bytes, struct btrfs_file_extent *file_extent,
>> +                             struct btrfs_file_extent *file_extent,
>>                                bool nowait, bool strict)
>>   {
>>          struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
>> @@ -7196,8 +7191,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>>
>>          fi = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
>>          found_type = btrfs_file_extent_type(leaf, fi);
>> -       if (ram_bytes)
>> -               *ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
>>
>>          nocow_args.start = offset;
>>          nocow_args.end = offset + *len - 1;
>> @@ -7215,14 +7208,15 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>>          }
>>
>>          ret = 0;
>> -       if (btrfs_extent_readonly(fs_info, nocow_args.disk_bytenr))
>> +       if (btrfs_extent_readonly(fs_info,
>> +                               nocow_args.file_extent.disk_bytenr + nocow_args.file_extent.offset))
>>                  goto out;
>>
>>          if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
>>              found_type == BTRFS_FILE_EXTENT_PREALLOC) {
>>                  u64 range_end;
>>
>> -               range_end = round_up(offset + nocow_args.num_bytes,
>> +               range_end = round_up(offset + nocow_args.file_extent.num_bytes,
>>                                       root->fs_info->sectorsize) - 1;
>>                  ret = test_range_bit_exists(io_tree, offset, range_end, EXTENT_DELALLOC);
>>                  if (ret) {
>> @@ -7231,13 +7225,11 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>>                  }
>>          }
>>
>> -       if (orig_block_len)
>> -               *orig_block_len = nocow_args.disk_num_bytes;
>>          if (file_extent)
>>                  memcpy(file_extent, &nocow_args.file_extent,
>>                         sizeof(struct btrfs_file_extent));
>>
>> -       *len = nocow_args.num_bytes;
>> +       *len = nocow_args.file_extent.num_bytes;
>>          ret = 1;
>>   out:
>>          btrfs_free_path(path);
>> @@ -7422,7 +7414,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>>          struct btrfs_file_extent file_extent = { 0 };
>>          struct extent_map *em = *map;
>>          int type;
>> -       u64 block_start, orig_block_len, ram_bytes;
>> +       u64 block_start;
>>          struct btrfs_block_group *bg;
>>          bool can_nocow = false;
>>          bool space_reserved = false;
>> @@ -7450,7 +7442,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>>                  block_start = extent_map_block_start(em) + (start - em->start);
>>
>>                  if (can_nocow_extent(inode, start, &len,
>> -                                    &orig_block_len, &ram_bytes,
>>                                       &file_extent, false, false) == 1) {
>>                          bg = btrfs_inc_nocow_writers(fs_info, block_start);
>>                          if (bg)
>> @@ -7477,8 +7468,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>>                  space_reserved = true;
>>
>>                  em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
>> -                                             orig_block_len,
>> -                                             ram_bytes, type,
>> +                                             file_extent.disk_num_bytes,
>> +                                             file_extent.ram_bytes, type,
>>                                                &file_extent);
>>                  btrfs_dec_nocow_writers(bg);
>>                  if (type == BTRFS_ORDERED_PREALLOC) {
>> @@ -10709,7 +10700,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>>                  free_extent_map(em);
>>                  em = NULL;
>>
>> -               ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
>> +               ret = can_nocow_extent(inode, start, &len, NULL, false, true);
>>                  if (ret < 0) {
>>                          goto out;
>>                  } else if (ret) {
>> --
>> 2.45.0
>>
>>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/3] btrfs: avoid data races when accessing an inode's delayed_node
  2024-05-20 16:58  1%   ` Filipe Manana
@ 2024-05-20 20:20  1%     ` David Sterba
  2024-05-21 14:47  1%       ` David Sterba
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-20 20:20 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On Mon, May 20, 2024 at 05:58:37PM +0100, Filipe Manana wrote:
> On Mon, May 20, 2024 at 4:48 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, May 17, 2024 at 02:13:23PM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > We do have some data races when accessing an inode's delayed_node, namely
> > > we use READ_ONCE() in a couple places while there's no pairing WRITE_ONCE()
> > > anywhere, and in one place (btrfs_dirty_inode()) we neither use READ_ONCE()
> > > nor take the lock that protects the delayed_node. So fix these and add
> > > helpers to access and update an inode's delayed_node.
> > >
> > > Filipe Manana (3):
> > >   btrfs: always set an inode's delayed_inode with WRITE_ONCE()
> > >   btrfs: use READ_ONCE() when accessing delayed_node at btrfs_dirty_node()
> > >   btrfs: add and use helpers to get and set an inode's delayed_node
> >
> > The READ_ONCE for delayed nodes has been there historically but I don't
> > think it's needed everywhere. The legitimate case is in
> > btrfs_get_delayed_node(), where the first use is without the lock and we
> > then recheck under the lock, so we do want to read a fresh value. This
> > is to prevent compiler optimizations from coalescing the reads.
> >
> > Writing to the delayed node under the lock also does not need WRITE_ONCE.
> >
> > IOW, I would rather remove use of the _ONCE helpers and not add more, as
> > this is not the pattern they are supposed to be used for. You say it's
> > to prevent load tearing, but for a pointer type this does not happen -
> > it is an assumption we can make about the hardware.
> 
> If you are sure that pointers aren't subject to load/store tearing
> issues, then I'm fine with dropping the patchset.

This will be in some CPU manual; the thread on LWN
https://lwn.net/Articles/793895/
mentions that. I base my claim on reading about that in various other
discussions on LKML over time. Pointers match the unsigned long type,
which is the machine word and register size, and an assignment from
register to register/memory happens in one go. What could be problematic
is a constant (immediate) assigned to a register on architectures like
SPARC that have fixed-size instructions, where the constant has to be
written in two steps.

The need for READ_ONCE/WRITE_ONCE is to prevent compiler optimizations
and also load tearing, but for native types up to unsigned long the
tearing does not seem to happen.

https://elixir.bootlin.com/linux/latest/source/include/asm-generic/rwonce.h#L29

The only condition that can possibly cause tearing even for a
pointer is if it's not aligned and could be split over two cachelines.
This should not happen for structures defined in a normal way (ie. no
forced mis-alignment or __packed).

In the case of u64 we use _ONCE access because of 32bit architectures,
when there's an unlocked access.
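
To illustrate, the legitimate pattern has roughly this shape (a sketch
of the check/recheck idea, not the exact code in delayed-inode.c):

	/* Unlocked fast path, may race with another task setting the node. */
	node = READ_ONCE(btrfs_inode->delayed_node);
	if (node)
		return node;

	spin_lock(&root->inode_lock);
	/* Under the lock a plain read is enough, the value cannot change. */
	node = btrfs_inode->delayed_node;
	spin_unlock(&root->inode_lock);
	return node;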

^ permalink raw reply	[relevance 1%]

* [PATCH 6/6] btrfs: constify parameters of write_eb_member() and its users
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (4 preceding siblings ...)
  2024-05-20 19:52  1% ` [PATCH 5/6] btrfs: keep const whene returnin value from get_unaligned_le8() David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-20 23:00  1% ` [PATCH 0/6] Cleanups and W=2 warning fixes Boris Burkov
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Reported by '-Wcast-qual': the argument from which write_extent_buffer()
reads data to write to the eb should be const. In addition the const
needs to be also added to __write_extent_buffer() local buffers.

All callers of write_eb_member() can now be updated to use const for the
input buffer structure or type.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/accessors.h | 10 +++++-----
 fs/btrfs/extent_io.c |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index c60f0e7f768a..6c3deaa3e878 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -48,8 +48,8 @@ static inline void put_unaligned_le8(u8 val, void *p)
 			    offsetof(type, member),			\
 			    sizeof_field(type, member)))
 
-#define write_eb_member(eb, ptr, type, member, result) (\
-	write_extent_buffer(eb, (char *)(result),			\
+#define write_eb_member(eb, ptr, type, member, source) (		\
+	write_extent_buffer(eb, (const char *)(source),			\
 			   ((unsigned long)(ptr)) +			\
 			    offsetof(type, member),			\
 			    sizeof_field(type, member)))
@@ -353,7 +353,7 @@ static inline void btrfs_tree_block_key(const struct extent_buffer *eb,
 
 static inline void btrfs_set_tree_block_key(const struct extent_buffer *eb,
 					    struct btrfs_tree_block_info *item,
-					    struct btrfs_disk_key *key)
+					    const struct btrfs_disk_key *key)
 {
 	write_eb_member(eb, item, struct btrfs_tree_block_info, key, key);
 }
@@ -446,7 +446,7 @@ void btrfs_node_key(const struct extent_buffer *eb,
 		    struct btrfs_disk_key *disk_key, int nr);
 
 static inline void btrfs_set_node_key(const struct extent_buffer *eb,
-				      struct btrfs_disk_key *disk_key, int nr)
+				      const struct btrfs_disk_key *disk_key, int nr)
 {
 	unsigned long ptr;
 
@@ -512,7 +512,7 @@ static inline void btrfs_item_key(const struct extent_buffer *eb,
 }
 
 static inline void btrfs_set_item_key(struct extent_buffer *eb,
-				      struct btrfs_disk_key *disk_key, int nr)
+				      const struct btrfs_disk_key *disk_key, int nr)
 {
 	struct btrfs_item *item = btrfs_item_nr(eb, nr);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d773c1cbaa7..bf50301ee528 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4604,7 +4604,7 @@ static void __write_extent_buffer(const struct extent_buffer *eb,
 	size_t cur;
 	size_t offset;
 	char *kaddr;
-	char *src = (char *)srcv;
+	const char *src = (const char *)srcv;
 	unsigned long i = get_eb_folio_index(eb, start);
 	/* For unmapped (dummy) ebs, no need to check their uptodate status. */
 	const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 5/6] btrfs: keep const whene returnin value from get_unaligned_le8()
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (3 preceding siblings ...)
  2024-05-20 19:52  1% ` [PATCH 4/6] btrfs: remove unused define EXTENT_SIZE_PER_ITEM David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-20 19:52  1% ` [PATCH 6/6] btrfs: constify parameters of write_eb_member() and its users David Sterba
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

This was reported by -Wcast-qual; get_unaligned_le8() simply returns
the argument and there's no reason for the cast to drop the const.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/accessors.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/accessors.h b/fs/btrfs/accessors.h
index 6fce3e8d3dac..c60f0e7f768a 100644
--- a/fs/btrfs/accessors.h
+++ b/fs/btrfs/accessors.h
@@ -34,7 +34,7 @@ void btrfs_init_map_token(struct btrfs_map_token *token, struct extent_buffer *e
 
 static inline u8 get_unaligned_le8(const void *p)
 {
-       return *(u8 *)p;
+       return *(const u8 *)p;
 }
 
 static inline void put_unaligned_le8(u8 val, void *p)
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 4/6] btrfs: remove unused define EXTENT_SIZE_PER_ITEM
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
                   ` (2 preceding siblings ...)
  2024-05-20 19:52  1% ` [PATCH 3/6] btrfs: use for-local variabls that shadow function variables David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-20 19:52  1% ` [PATCH 5/6] btrfs: keep const whene returnin value from get_unaligned_le8() David Sterba
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

This was added in c61a16a701a126 ("Btrfs: fix the confusion between
delalloc bytes and metadata bytes") and its last use was removed in
03fe78cc2942c5 ("btrfs: use delalloc_bytes to determine flush amount for
shrink_delalloc") where the calculation was reworked to use non-constant
numbers. This was found by 'make W=2'.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/space-info.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d620323d08ea..411b95601f18 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -587,8 +587,6 @@ static inline u64 calc_reclaim_items_nr(const struct btrfs_fs_info *fs_info,
 	return nr;
 }
 
-#define EXTENT_SIZE_PER_ITEM	SZ_256K
-
 /*
  * shrink metadata reservation for delalloc
  */
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 3/6] btrfs: use for-local variabls that shadow function variables
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
  2024-05-20 19:52  1% ` [PATCH 1/6] btrfs: remove duplicate name variable declarations David Sterba
  2024-05-20 19:52  1% ` [PATCH 2/6] btrfs: rename macro local variables that clash with other variables David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-21  4:13  1%   ` Naohiro Aota
  2024-05-20 19:52  1% ` [PATCH 4/6] btrfs: remove unused define EXTENT_SIZE_PER_ITEM David Sterba
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

We've started to use for-loop local variables and in a few places this
shadows a function variable. Convert a few cases reported by 'make W=2'.
If applicable, also change the style to post-increment, which is the
preferred one.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/qgroup.c  | 11 +++++------
 fs/btrfs/volumes.c |  9 +++------
 fs/btrfs/zoned.c   |  8 +++-----
 3 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index fc2a7ea26354..a94a5b87b042 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3216,7 +3216,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 			 struct btrfs_qgroup_inherit *inherit)
 {
 	int ret = 0;
-	int i;
 	u64 *i_qgroups;
 	bool committing = false;
 	struct btrfs_fs_info *fs_info = trans->fs_info;
@@ -3273,7 +3272,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		i_qgroups = (u64 *)(inherit + 1);
 		nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
 		       2 * inherit->num_excl_copies;
-		for (i = 0; i < nums; ++i) {
+		for (int i = 0; i < nums; i++) {
 			srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
 
 			/*
@@ -3300,7 +3299,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 	 */
 	if (inherit) {
 		i_qgroups = (u64 *)(inherit + 1);
-		for (i = 0; i < inherit->num_qgroups; ++i, ++i_qgroups) {
+		for (int i = 0; i < inherit->num_qgroups; i++, i_qgroups++) {
 			if (*i_qgroups == 0)
 				continue;
 			ret = add_qgroup_relation_item(trans, objectid,
@@ -3386,7 +3385,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		goto unlock;
 
 	i_qgroups = (u64 *)(inherit + 1);
-	for (i = 0; i < inherit->num_qgroups; ++i) {
+	for (int i = 0; i < inherit->num_qgroups; i++) {
 		if (*i_qgroups) {
 			ret = add_relation_rb(fs_info, qlist_prealloc[i], objectid,
 					      *i_qgroups);
@@ -3406,7 +3405,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		++i_qgroups;
 	}
 
-	for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
+	for (int i = 0; i < inherit->num_ref_copies; i++, i_qgroups += 2) {
 		struct btrfs_qgroup *src;
 		struct btrfs_qgroup *dst;
 
@@ -3427,7 +3426,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 		/* Manually tweaking numbers certainly needs a rescan */
 		need_rescan = true;
 	}
-	for (i = 0; i <  inherit->num_excl_copies; ++i, i_qgroups += 2) {
+	for (int i = 0; i <  inherit->num_excl_copies; i++, i_qgroups += 2) {
 		struct btrfs_qgroup *src;
 		struct btrfs_qgroup *dst;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7c9d68b1ba69..3f70f727dacf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5623,8 +5623,6 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 	u64 start = ctl->start;
 	u64 type = ctl->type;
 	int ret;
-	int i;
-	int j;
 
 	map = btrfs_alloc_chunk_map(ctl->num_stripes, GFP_NOFS);
 	if (!map)
@@ -5639,8 +5637,8 @@ static struct btrfs_block_group *create_chunk(struct btrfs_trans_handle *trans,
 	map->sub_stripes = ctl->sub_stripes;
 	map->num_stripes = ctl->num_stripes;
 
-	for (i = 0; i < ctl->ndevs; ++i) {
-		for (j = 0; j < ctl->dev_stripes; ++j) {
+	for (int i = 0; i < ctl->ndevs; i++) {
+		for (int j = 0; j < ctl->dev_stripes; j++) {
 			int s = i * ctl->dev_stripes + j;
 			map->stripes[s].dev = devices_info[i].dev;
 			map->stripes[s].physical = devices_info[i].dev_offset +
@@ -6618,7 +6616,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	struct btrfs_chunk_map *map;
 	struct btrfs_io_geometry io_geom = { 0 };
 	u64 map_offset;
-	int i;
 	int ret = 0;
 	int num_copies;
 	struct btrfs_io_context *bioc = NULL;
@@ -6764,7 +6761,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		 * For all other non-RAID56 profiles, just copy the target
 		 * stripe into the bioc.
 		 */
-		for (i = 0; i < io_geom.num_stripes; i++) {
+		for (int i = 0; i < io_geom.num_stripes; i++) {
 			ret = set_io_stripe(fs_info, logical, length,
 					    &bioc->stripes[i], map, &io_geom);
 			if (ret < 0)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index dde4a0a34037..e9087264f3e3 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -87,9 +87,8 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 	bool empty[BTRFS_NR_SB_LOG_ZONES];
 	bool full[BTRFS_NR_SB_LOG_ZONES];
 	sector_t sector;
-	int i;
 
-	for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
+	for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
 		ASSERT(zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL);
 		empty[i] = (zones[i].cond == BLK_ZONE_COND_EMPTY);
 		full[i] = sb_zone_is_full(&zones[i]);
@@ -121,9 +120,8 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 		struct address_space *mapping = bdev->bd_inode->i_mapping;
 		struct page *page[BTRFS_NR_SB_LOG_ZONES];
 		struct btrfs_super_block *super[BTRFS_NR_SB_LOG_ZONES];
-		int i;
 
-		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
+		for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++) {
 			u64 zone_end = (zones[i].start + zones[i].capacity) << SECTOR_SHIFT;
 			u64 bytenr = ALIGN_DOWN(zone_end, BTRFS_SUPER_INFO_SIZE) -
 						BTRFS_SUPER_INFO_SIZE;
@@ -144,7 +142,7 @@ static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones,
 		else
 			sector = zones[0].start;
 
-		for (i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
+		for (int i = 0; i < BTRFS_NR_SB_LOG_ZONES; i++)
 			btrfs_release_disk_super(super[i]);
 	} else if (!full[0] && (empty[1] || full[1])) {
 		sector = zones[0].wp;
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 2/6] btrfs: rename macro local variables that clash with other variables
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
  2024-05-20 19:52  1% ` [PATCH 1/6] btrfs: remove duplicate name variable declarations David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-20 19:52  1% ` [PATCH 3/6] btrfs: use for-local variabls that shadow function variables David Sterba
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Fix variable names in two macros where there's a local function variable
of the same name. In subpage_calc_start_bit() it's in several callers;
in btrfs_abort_transaction() it's only in replace_file_extents().
Found by 'make W=2'.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/subpage.c     | 8 ++++----
 fs/btrfs/transaction.h | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 54736f6238e6..9127704236ab 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -242,12 +242,12 @@ static void btrfs_subpage_assert(const struct btrfs_fs_info *fs_info,
 
 #define subpage_calc_start_bit(fs_info, folio, name, start, len)	\
 ({									\
-	unsigned int start_bit;						\
+	unsigned int __start_bit;						\
 									\
 	btrfs_subpage_assert(fs_info, folio, start, len);		\
-	start_bit = offset_in_page(start) >> fs_info->sectorsize_bits;	\
-	start_bit += fs_info->subpage_info->name##_offset;		\
-	start_bit;							\
+	__start_bit = offset_in_page(start) >> fs_info->sectorsize_bits;	\
+	__start_bit += fs_info->subpage_info->name##_offset;		\
+	__start_bit;							\
 })
 
 void btrfs_subpage_start_reader(const struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 4e451ab173b1..90b987941dd1 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -229,11 +229,11 @@ bool __cold abort_should_print_stack(int error);
  */
 #define btrfs_abort_transaction(trans, error)		\
 do {								\
-	bool first = false;					\
+	bool __first = false;					\
 	/* Report first abort since mount */			\
 	if (!test_and_set_bit(BTRFS_FS_STATE_TRANS_ABORTED,	\
 			&((trans)->fs_info->fs_state))) {	\
-		first = true;					\
+		__first = true;					\
 		if (WARN(abort_should_print_stack(error),	\
 			KERN_ERR				\
 			"BTRFS: Transaction aborted (error %d)\n",	\
@@ -246,7 +246,7 @@ do {								\
 		}						\
 	}							\
 	__btrfs_abort_transaction((trans), __func__,		\
-				  __LINE__, (error), first);	\
+				  __LINE__, (error), __first);	\
 } while (0)
 
 int btrfs_end_transaction(struct btrfs_trans_handle *trans);
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 1/6] btrfs: remove duplicate name variable declarations
  2024-05-20 19:52  1% [PATCH 0/6] Cleanups and W=2 warning fixes David Sterba
@ 2024-05-20 19:52  1% ` David Sterba
  2024-05-20 19:52  1% ` [PATCH 2/6] btrfs: rename macro local variables that clash with other variables David Sterba
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block; no problematic cases were found.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent_io.c | 4 +---
 fs/btrfs/inode.c     | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 597387e9f040..2d773c1cbaa7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3507,7 +3507,6 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 
 	for (int i = 0; i < num_folios; i++) {
 		struct folio *folio = new->folios[i];
-		int ret;
 
 		ret = attach_extent_buffer_folio(new, folio, NULL);
 		if (ret < 0) {
@@ -4587,8 +4586,7 @@ static void assert_eb_folio_uptodate(const struct extent_buffer *eb, int i)
 		return;
 
 	if (fs_info->nodesize < PAGE_SIZE) {
-		struct folio *folio = eb->folios[0];
-
+		folio = eb->folios[0];
 		ASSERT(i == 0);
 		if (WARN_ON(!btrfs_subpage_test_uptodate(fs_info, folio,
 							 eb->start, eb->len)))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3cf32bc721d2..87278d2f8447 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1617,10 +1617,8 @@ static noinline void submit_compressed_extents(struct btrfs_work *work, bool do_
 	u64 alloc_hint = 0;
 
 	if (do_free) {
-		struct async_chunk *async_chunk;
 		struct async_cow *async_cow;
 
-		async_chunk = container_of(work, struct async_chunk, work);
 		btrfs_add_delayed_iput(async_chunk->inode);
 		if (async_chunk->blkcg_css)
 			css_put(async_chunk->blkcg_css);
-- 
2.45.0


^ permalink raw reply related	[relevance 1%]

* [PATCH 0/6] Cleanups and W=2 warning fixes
@ 2024-05-20 19:52  1% David Sterba
  2024-05-20 19:52  1% ` [PATCH 1/6] btrfs: remove duplicate name variable declarations David Sterba
                   ` (8 more replies)
  0 siblings, 9 replies; 200+ results
From: David Sterba @ 2024-05-20 19:52 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

We have a clean run of 'make W=1' with gcc 13; there are some
interesting warnings to fix with level 2 and even 3. We can't enable the
warning flags by default due to reports from generic code.

This short series removes shadowed variables, adds const and removes
an unused macro. There are still some shadow variables to fix but the
remaining cases are with 'ret' variables so I skipped it for now.

David Sterba (6):
  btrfs: remove duplicate name variable declarations
  btrfs: rename macro local variables that clash with other variables
  btrfs: use for-local variabls that shadow function variables
  btrfs: remove unused define EXTENT_SIZE_PER_ITEM
  btrfs: keep const whene returnin value from get_unaligned_le8()
  btrfs: constify parameters of write_eb_member() and its users

 fs/btrfs/accessors.h   | 12 ++++++------
 fs/btrfs/extent_io.c   |  6 ++----
 fs/btrfs/inode.c       |  2 --
 fs/btrfs/qgroup.c      | 11 +++++------
 fs/btrfs/space-info.c  |  2 --
 fs/btrfs/subpage.c     |  8 ++++----
 fs/btrfs/transaction.h |  6 +++---
 fs/btrfs/volumes.c     |  9 +++------
 fs/btrfs/zoned.c       |  8 +++-----
 9 files changed, 26 insertions(+), 38 deletions(-)

-- 
2.45.0


^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/3] btrfs: avoid data races when accessing an inode's delayed_node
  2024-05-20 15:48  1% ` David Sterba
@ 2024-05-20 16:58  1%   ` Filipe Manana
  2024-05-20 20:20  1%     ` David Sterba
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-20 16:58 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs

On Mon, May 20, 2024 at 4:48 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Fri, May 17, 2024 at 02:13:23PM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > We do have some data races when accessing an inode's delayed_node, namely
> > we use READ_ONCE() in a couple places while there's no pairing WRITE_ONCE()
> > anywhere, and in one place (btrfs_dirty_inode()) we neither use READ_ONCE()
> > nor take the lock that protects the delayed_node. So fix these and add
> > helpers to access and update an inode's delayed_node.
> >
> > Filipe Manana (3):
> >   btrfs: always set an inode's delayed_inode with WRITE_ONCE()
> >   btrfs: use READ_ONCE() when accessing delayed_node at btrfs_dirty_node()
> >   btrfs: add and use helpers to get and set an inode's delayed_node
>
> The READ_ONCE for delayed nodes has been there historically but I don't
> think it's needed everywhere. The legitimate case is in
> btrfs_get_delayed_node(), where the first use is without the lock and we
> then recheck under the lock, so we do want to read a fresh value. This
> is to prevent compiler optimizations from coalescing the reads.
>
> Writing to the delayed node under the lock also does not need WRITE_ONCE.
>
> IOW, I would rather remove use of the _ONCE helpers and not add more, as
> this is not the pattern they are supposed to be used for. You say it's
> to prevent load tearing, but for a pointer type this does not happen -
> it is an assumption we can make about the hardware.

If you are sure that pointers aren't subject to load/store tearing
issues, then I'm fine with dropping the patchset.

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v2 00/11] btrfs: extent-map: unify the members with btrfs_ordered_extent
                     ` (3 preceding siblings ...)
  @ 2024-05-20 16:55  2% ` Filipe Manana
  4 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-20 16:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 3, 2024 at 7:02 AM Qu Wenruo <wqu@suse.com> wrote:
>
> [CHANGELOG]
> v2:
> - Rebased to the latest for-next
>   There is a conflict with extent locking, and maybe some other
>   hidden conflicts for NOCOW/PREALLOC?
>   Previously the patchset passed the fstests auto group, but after
>   merging with other patches, it always crashes at btrfs/060.
>
> - Fix an error in the final cleanup patch
>   It's the NOCOW/PREALLOC shenanigans again: in the buffered NOCOW path
>   we have to use the old inaccurate numbers for NOCOW/PREALLOC OEs.
>
> - Split the final cleanup into 4 patches
>   Most cleanups are very straightforward, but the cleanup for
>   btrfs_alloc_ordered_extent() needs extra special handling for
>   NOCOW/PREALLOC.
>
> v1:
> - Rebased to the latest for-next
>   To resolve the conflicts with the recently introduced extent map
>   shrinker
>
> - A new cleanup patch to remove the recursive header inclusion
>
> - Use a new structure to pass the file extent item related members
>   around
>
> - Add a new comment on why we're intentionally passing incorrect
>   numbers for NOCOW/PREALLOC ordered extents inside
>   btrfs_create_dio_extent()
>
> [REPO]
> https://github.com/adam900710/linux/tree/em_cleanup
>
> This series introduces two new members (disk_bytenr/offset) to
> extent_map, and removes three old members
> (block_start/block_len/offset), finally rename one member
> (orig_block_len -> disk_num_bytes).
>
> This should save us one u64 for extent_map, although with the recent
> extent map shrinker, the saving is not that useful.

The shrinker doesn't invalidate or make this patchset less useful.

It's always good to reduce the size of a structure like this, for which
we can easily have millions of instances; it reduces the number of
pages we consume.

Things are a bit hard to review here, because a lot of code is added
and then removed later, a few fields at a time, so a lot of
cross-reference checks are needed.
Changing the approach here would be a lot of work, and probably would
be more bike shedding than anything else.

But it looks fine, and all the comments on the individual patches are
minor, except for a bug in patch 8/11.

Thanks!

>
> But to make the migration safe, I introduce extra sanity checks for
> extent_map, and do cross checks between the old and new members.
>
> The extra sanity checks already exposed one bug (thankfully harmless)
> causing em::block_start to be incorrect.
>
> But so far, the patchset is fine for the default fstests run.
>
> Furthermore, since we already have too long parameter lists for
> extent_map/ordered_extent/can_nocow_extent, here is a new structure,
> btrfs_file_extent, a memory-access-friendly structure to represent a
> btrfs_file_extent_item.
>
> With the help of that structure, we can use that to represent a file
> extent item without a super long parameter list.
>
> The patchset first renames orig_block_len to disk_num_bytes.
> Then it introduces the new members, the extra sanity checks, and the
> new btrfs_file_extent structure, and uses that to remove the older 3
> members from extent_map.
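>
> For reference, the new structure roughly looks like this (a sketch
> inferred from how its members are used in the patches; see the series
> for the authoritative definition):
>
> 	struct btrfs_file_extent {
> 		u64 disk_bytenr;
> 		u64 disk_num_bytes;
> 		u64 num_bytes;
> 		u64 ram_bytes;
> 		u64 offset;
> 		u8 compression;
> 	};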
>
> After all the above work is done, use btrfs_file_extent to further clean up
> can_nocow_file_extent_args()/btrfs_alloc_ordered_extent()/create_io_em()/
> btrfs_create_dio_extent().
>
> The cleanup is in fact pretty tricky: the current code base never
> expects correct numbers for NOCOW/PREALLOC OEs, thus we have to keep the
> old but incorrect numbers just for NOCOW/PREALLOC.
>
> I will address the NOCOW/PREALLOC shenanigans in the future, but
> after the huge cleanup across multiple core structures.
>
> Qu Wenruo (11):
>   btrfs: rename extent_map::orig_block_len to disk_num_bytes
>   btrfs: export the expected file extent through can_nocow_extent()
>   btrfs: introduce new members for extent_map
>   btrfs: introduce extra sanity checks for extent maps
>   btrfs: remove extent_map::orig_start member
>   btrfs: remove extent_map::block_len member
>   btrfs: remove extent_map::block_start member
>   btrfs: cleanup duplicated parameters related to
>     can_nocow_file_extent_args
>   btrfs: cleanup duplicated parameters related to
>     btrfs_alloc_ordered_extent
>   btrfs: cleanup duplicated parameters related to create_io_em()
>   btrfs: cleanup duplicated parameters related to
>     btrfs_create_dio_extent()
>
>  fs/btrfs/btrfs_inode.h            |   4 +-
>  fs/btrfs/compression.c            |   7 +-
>  fs/btrfs/defrag.c                 |  14 +-
>  fs/btrfs/extent_io.c              |  10 +-
>  fs/btrfs/extent_map.c             | 187 ++++++++++++------
>  fs/btrfs/extent_map.h             |  51 +++--
>  fs/btrfs/file-item.c              |  23 +--
>  fs/btrfs/file.c                   |  18 +-
>  fs/btrfs/inode.c                  | 308 +++++++++++++-----------------
>  fs/btrfs/ordered-data.c           |  36 +++-
>  fs/btrfs/ordered-data.h           |  22 ++-
>  fs/btrfs/relocation.c             |   5 +-
>  fs/btrfs/tests/extent-map-tests.c | 114 ++++++-----
>  fs/btrfs/tests/inode-tests.c      | 177 ++++++++---------
>  fs/btrfs/tree-log.c               |  25 +--
>  fs/btrfs/zoned.c                  |   4 +-
>  include/trace/events/btrfs.h      |  26 +--
>  17 files changed, 548 insertions(+), 483 deletions(-)
>
> --
> 2.45.0
>
>

^ permalink raw reply	[relevance 2%]

* Re: [PATCH v2 11/11] btrfs: cleanup duplicated parameters related to btrfs_create_dio_extent()
  @ 2024-05-20 16:48  1%   ` Filipe Manana
  2024-05-23  4:03  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-20 16:48 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 3, 2024 at 7:03 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The following 3 parameters can be cleaned up using btrfs_file_extent
> structure:
>
> - len
>   btrfs_file_extent::num_bytes
>
> - orig_block_len
>   btrfs_file_extent::disk_num_bytes
>
> - ram_bytes
>   btrfs_file_extent::ram_bytes
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a95dc2333972..09974c86d3d1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -6969,11 +6969,8 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
>  static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                                                   struct btrfs_dio_data *dio_data,
>                                                   const u64 start,
> -                                                 const u64 len,
> -                                                 const u64 orig_block_len,
> -                                                 const u64 ram_bytes,
> -                                                 const int type,
> -                                                 struct btrfs_file_extent *file_extent)
> +                                                 struct btrfs_file_extent *file_extent,
> +                                                 const int type)
>  {
>         struct extent_map *em = NULL;
>         struct btrfs_ordered_extent *ordered;
> @@ -6991,7 +6988,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                 if (em) {
>                         free_extent_map(em);
>                         btrfs_drop_extent_map_range(inode, start,
> -                                                   start + len - 1, false);
> +                                       start + file_extent->num_bytes - 1, false);
>                 }
>                 em = ERR_CAST(ordered);
>         } else {
> @@ -7034,10 +7031,9 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
>         file_extent.ram_bytes = ins.offset;
>         file_extent.offset = 0;
>         file_extent.compression = BTRFS_COMPRESS_NONE;
> -       em = btrfs_create_dio_extent(inode, dio_data, start, ins.offset,
> -                                    ins.offset,
> -                                    ins.offset, BTRFS_ORDERED_REGULAR,
> -                                    &file_extent);
> +       em = btrfs_create_dio_extent(inode, dio_data, start,
> +                                    &file_extent,
> +                                    BTRFS_ORDERED_REGULAR);

As we're changing this, we can leave this in a single line as it fits.
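
For example (just a formatting sketch, not compile-tested):

        em = btrfs_create_dio_extent(inode, dio_data, start, &file_extent, BTRFS_ORDERED_REGULAR);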

>         btrfs_dec_block_group_reservations(fs_info, ins.objectid);
>         if (IS_ERR(em))
>                 btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset,
> @@ -7404,10 +7400,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 }
>                 space_reserved = true;
>
> -               em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
> -                                             file_extent.disk_num_bytes,
> -                                             file_extent.ram_bytes, type,
> -                                             &file_extent);
> +               em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start,
> +                                             &file_extent, type);

Same here.

The rest looks good, thanks.

>                 btrfs_dec_nocow_writers(bg);
>                 if (type == BTRFS_ORDERED_PREALLOC) {
>                         free_extent_map(em);
> --
> 2.45.0
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v2 10/11] btrfs: cleanup duplicated parameters related to create_io_em()
  @ 2024-05-20 16:46  1%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-20 16:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 3, 2024 at 7:03 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Most parameters of create_io_em() can be replaced by the members with
> the same name inside btrfs_file_extent.
>
> Do a straight parameters cleanup here.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/inode.c | 50 +++++++++++++-----------------------------------
>  1 file changed, 13 insertions(+), 37 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index eec5ecb917d8..a95dc2333972 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -138,9 +138,6 @@ static noinline int run_delalloc_cow(struct btrfs_inode *inode,
>                                      u64 end, struct writeback_control *wbc,
>                                      bool pages_dirty);
>  static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -                                      u64 len,
> -                                      u64 disk_num_bytes,
> -                                      u64 ram_bytes, int compress_type,
>                                        struct btrfs_file_extent *file_extent,
>                                        int type);
>
> @@ -1207,12 +1204,7 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         file_extent.offset = 0;
>         file_extent.compression = async_extent->compress_type;
>
> -       em = create_io_em(inode, start,
> -                         async_extent->ram_size,       /* len */
> -                         ins.offset,                   /* orig_block_len */
> -                         async_extent->ram_size,       /* ram_bytes */
> -                         async_extent->compress_type,
> -                         &file_extent,
> +       em = create_io_em(inode, start, &file_extent,
>                           BTRFS_ORDERED_COMPRESSED);

Btw, as we're changing this, we can take the chance to make everything
in a single line since it fits and gets more readable.

>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
> @@ -1443,11 +1435,7 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                 lock_extent(&inode->io_tree, start, start + ram_size - 1,
>                             &cached);
>
> -               em = create_io_em(inode, start, ins.offset, /* len */
> -                                 ins.offset, /* orig_block_len */
> -                                 ram_size, /* ram_bytes */
> -                                 BTRFS_COMPRESS_NONE, /* compress_type */
> -                                 &file_extent,
> +               em = create_io_em(inode, start, &file_extent,
>                                   BTRFS_ORDERED_REGULAR /* type */);

Same here, and remove the /* type */ comment there which is superfluous anyway.
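
I.e. something like this (just a formatting sketch, not compile-tested):

        em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_REGULAR);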

>                 if (IS_ERR(em)) {
>                         unlock_extent(&inode->io_tree, start,
> @@ -2168,10 +2156,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         struct extent_map *em;
>
>                         em = create_io_em(inode, cur_offset,
> -                                         nocow_args.file_extent.num_bytes,
> -                                         nocow_args.file_extent.disk_num_bytes,
> -                                         nocow_args.file_extent.ram_bytes,
> -                                         BTRFS_COMPRESS_NONE,
>                                           &nocow_args.file_extent,
>                                           BTRFS_ORDERED_PREALLOC);

Same here.

The rest looks good, simple patch.
Thanks.

>                         if (IS_ERR(em)) {
> @@ -6995,10 +6979,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>         struct btrfs_ordered_extent *ordered;
>
>         if (type != BTRFS_ORDERED_NOCOW) {
> -               em = create_io_em(inode, start, len,
> -                                 orig_block_len, ram_bytes,
> -                                 BTRFS_COMPRESS_NONE, /* compress_type */
> -                                 file_extent, type);
> +               em = create_io_em(inode, start, file_extent, type);
>                 if (IS_ERR(em))
>                         goto out;
>         }
> @@ -7290,9 +7271,6 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
>
>  /* The callers of this must take lock_extent() */
>  static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
> -                                      u64 len,
> -                                      u64 disk_num_bytes,
> -                                      u64 ram_bytes, int compress_type,
>                                        struct btrfs_file_extent *file_extent,
>                                        int type)
>  {
> @@ -7314,25 +7292,25 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>         switch (type) {
>         case BTRFS_ORDERED_PREALLOC:
>                 /* We're only referring part of a larger preallocated extent. */
> -               ASSERT(len <= ram_bytes);
> +               ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
>                 break;
>         case BTRFS_ORDERED_REGULAR:
>                 /* COW results a new extent matching our file extent size. */
> -               ASSERT(disk_num_bytes == len);
> -               ASSERT(ram_bytes == len);
> +               ASSERT(file_extent->disk_num_bytes == file_extent->num_bytes);
> +               ASSERT(file_extent->ram_bytes == file_extent->num_bytes);
>
>                 /* Since it's a new extent, we should not have any offset. */
>                 ASSERT(file_extent->offset == 0);
>                 break;
>         case BTRFS_ORDERED_COMPRESSED:
>                 /* Must be compressed. */
> -               ASSERT(compress_type != BTRFS_COMPRESS_NONE);
> +               ASSERT(file_extent->compression != BTRFS_COMPRESS_NONE);
>
>                 /*
>                  * Encoded write can make us to refer to part of the
>                  * uncompressed extent.
>                  */
> -               ASSERT(len <= ram_bytes);
> +               ASSERT(file_extent->num_bytes <= file_extent->ram_bytes);
>                 break;
>         }
>
> @@ -7341,15 +7319,15 @@ static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
>                 return ERR_PTR(-ENOMEM);
>
>         em->start = start;
> -       em->len = len;
> +       em->len = file_extent->num_bytes;
>         em->disk_bytenr = file_extent->disk_bytenr;
> -       em->disk_num_bytes = disk_num_bytes;
> -       em->ram_bytes = ram_bytes;
> +       em->disk_num_bytes = file_extent->disk_num_bytes;
> +       em->ram_bytes = file_extent->ram_bytes;
>         em->generation = -1;
>         em->offset = file_extent->offset;
>         em->flags |= EXTENT_FLAG_PINNED;
>         if (type == BTRFS_ORDERED_COMPRESSED)
> -               extent_map_set_compression(em, compress_type);
> +               extent_map_set_compression(em, file_extent->compression);
>
>         ret = btrfs_replace_extent_map_range(inode, em, true);
>         if (ret) {
> @@ -10327,9 +10305,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         file_extent.ram_bytes = ram_bytes;
>         file_extent.offset = encoded->unencoded_offset;
>         file_extent.compression = compression;
> -       em = create_io_em(inode, start, num_bytes,
> -                         ins.offset, ram_bytes, compression,
> -                         &file_extent, BTRFS_ORDERED_COMPRESSED);
> +       em = create_io_em(inode, start, &file_extent, BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(em)) {
>                 ret = PTR_ERR(em);
>                 goto out_free_reserved;
> --
> 2.45.0
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v2 09/11] btrfs: cleanup duplicated parameters related to btrfs_alloc_ordered_extent
  @ 2024-05-20 16:31  2%   ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-20 16:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 3, 2024 at 7:03 AM Qu Wenruo <wqu@suse.com> wrote:
>
> All parameters after @filepos of btrfs_alloc_ordered_extent() can be
> replaced with btrfs_file_extent structure.
>
> This patch would do the cleanup, meanwhile some points to note:

would do -> does

>
> - Move btrfs_file_extent structure to ordered-data.h
>   The structure is needed by both btrfs_alloc_ordered_extent() and
>   can_nocow_extent(), but since btrfs_inode.h would include

would include -> includes  (it's including at the moment, before this
patchset - there's an include of ordered-data.h in btrfs_inode.h)

>   ordered-data.h, so we need to move the structure to ordered-data.h.
>
> - Move the special handling of NOCOW/PREALLOC into
>   btrfs_alloc_ordered_extent()
>   Previously we had two call sites intentionally forging the numbers,

So this forging is the thing you don't like: the disk_bytenr not
matching the disk_bytenr of a file extent item, but rather being that
plus some offset.
I would leave the rant about it out of the changelog and the comments
in code. Leave that for a patch that specifically addresses that.

Otherwise the rest of this patch looks fine.
Thanks.

>   but now with accurate btrfs_file_extent results, it's better to move
>   the special handling into btrfs_alloc_ordered_extent(), so callers can
>   just pass the accurate file_extent.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h  | 17 -----------
>  fs/btrfs/inode.c        | 64 +++++++----------------------------------
>  fs/btrfs/ordered-data.c | 36 +++++++++++++++++++----
>  fs/btrfs/ordered-data.h | 22 ++++++++++++--
>  4 files changed, 59 insertions(+), 80 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index bea343615ad1..6622485389dc 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -443,23 +443,6 @@ int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
>                             u32 pgoff, u8 *csum, const u8 * const csum_expected);
>  bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
>                         u32 bio_offset, struct bio_vec *bv);
> -
> -/*
> - * A more access-friendly representation of btrfs_file_extent_item.
> - *
> - * Unused members are excluded.
> - */
> -struct btrfs_file_extent {
> -       u64 disk_bytenr;
> -       u64 disk_num_bytes;
> -
> -       u64 num_bytes;
> -       u64 ram_bytes;
> -       u64 offset;
> -
> -       u8 compression;
> -};
> -
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>                               struct btrfs_file_extent *file_extent,
>                               bool nowait, bool strict);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 89f284ae26a4..eec5ecb917d8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1220,14 +1220,8 @@ static void submit_one_async_extent(struct async_chunk *async_chunk,
>         }
>         free_extent_map(em);
>
> -       ordered = btrfs_alloc_ordered_extent(inode, start,      /* file_offset */
> -                                      async_extent->ram_size,  /* num_bytes */
> -                                      async_extent->ram_size,  /* ram_bytes */
> -                                      ins.objectid,            /* disk_bytenr */
> -                                      ins.offset,              /* disk_num_bytes */
> -                                      0,                       /* offset */
> -                                      1 << BTRFS_ORDERED_COMPRESSED,
> -                                      async_extent->compress_type);
> +       ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
> +                                      1 << BTRFS_ORDERED_COMPRESSED);
>         if (IS_ERR(ordered)) {
>                 btrfs_drop_extent_map_range(inode, start, end, false);
>                 ret = PTR_ERR(ordered);
> @@ -1463,10 +1457,8 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
>                 }
>                 free_extent_map(em);
>
> -               ordered = btrfs_alloc_ordered_extent(inode, start, ram_size,
> -                                       ram_size, ins.objectid, cur_alloc_size,
> -                                       0, 1 << BTRFS_ORDERED_REGULAR,
> -                                       BTRFS_COMPRESS_NONE);
> +               ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
> +                                                    1 << BTRFS_ORDERED_REGULAR);
>                 if (IS_ERR(ordered)) {
>                         unlock_extent(&inode->io_tree, start,
>                                       start + ram_size - 1, &cached);
> @@ -2192,20 +2184,11 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         free_extent_map(em);
>                 }
>
> -               /*
> -                * Check btrfs_create_dio_extent() for why we intentionally pass
> -                * incorrect value for NOCOW/PREALLOC OEs.
> -                */
>                 ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
> -                               nocow_args.file_extent.num_bytes,
> -                               nocow_args.file_extent.num_bytes,
> -                               nocow_args.file_extent.disk_bytenr +
> -                               nocow_args.file_extent.offset,
> -                               nocow_args.file_extent.num_bytes, 0,
> +                               &nocow_args.file_extent,
>                                 is_prealloc
>                                 ? (1 << BTRFS_ORDERED_PREALLOC)
> -                               : (1 << BTRFS_ORDERED_NOCOW),
> -                               BTRFS_COMPRESS_NONE);
> +                               : (1 << BTRFS_ORDERED_NOCOW));
>                 btrfs_dec_nocow_writers(nocow_bg);
>                 if (IS_ERR(ordered)) {
>                         if (is_prealloc) {
> @@ -7020,33 +7003,9 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
>                         goto out;
>         }
>
> -       /*
> -        * NOTE: I know the numbers are totally wrong for NOCOW/PREALLOC,
> -        * but it doesn't cause problem at least for now.
> -        *
> -        * For regular writes, we would have file_extent->offset as 0,
> -        * thus we really only need disk_bytenr, every other length
> -        * (disk_num_bytes/ram_bytes) would match @len and fe->num_bytes.
> -        * The current numbers are totally fine.
> -        *
> -        * For NOCOW, we don't really care about the numbers except @file_pos
> -        * and @num_bytes, as we won't insert a file extent item at all.
> -        *
> -        * For PREALLOC, we do not use ordered extent's member, but
> -        * btrfs_mark_extent_written() would handle everything.
> -        *
> -        * So here we intentionally go with pseudo numbers for the NOCOW/PREALLOC
> -        * OEs, or btrfs_extract_ordered_extent() would need a completely new
> -        * routine to handle NOCOW/PREALLOC splits, meanwhile result nothing
> -        * different.
> -        */
> -       ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
> -                                            file_extent->disk_bytenr +
> -                                            file_extent->offset,
> -                                            len, 0,
> +       ordered = btrfs_alloc_ordered_extent(inode, start, file_extent,
>                                              (1 << type) |
> -                                            (1 << BTRFS_ORDERED_DIRECT),
> -                                            BTRFS_COMPRESS_NONE);
> +                                            (1 << BTRFS_ORDERED_DIRECT));
>         if (IS_ERR(ordered)) {
>                 if (em) {
>                         free_extent_map(em);
> @@ -10377,12 +10336,9 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
>         }
>         free_extent_map(em);
>
> -       ordered = btrfs_alloc_ordered_extent(inode, start, num_bytes, ram_bytes,
> -                                      ins.objectid, ins.offset,
> -                                      encoded->unencoded_offset,
> +       ordered = btrfs_alloc_ordered_extent(inode, start, &file_extent,
>                                        (1 << BTRFS_ORDERED_ENCODED) |
> -                                      (1 << BTRFS_ORDERED_COMPRESSED),
> -                                      compression);
> +                                      (1 << BTRFS_ORDERED_COMPRESSED));
>         if (IS_ERR(ordered)) {
>                 btrfs_drop_extent_map_range(inode, start, end, false);
>                 ret = PTR_ERR(ordered);
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index c5bdd674f55c..371a85250d6a 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -263,17 +263,41 @@ static void insert_ordered_extent(struct btrfs_ordered_extent *entry)
>   */
>  struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
>                         struct btrfs_inode *inode, u64 file_offset,
> -                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> -                       u64 disk_num_bytes, u64 offset, unsigned long flags,
> -                       int compress_type)
> +                       struct btrfs_file_extent *file_extent,
> +                       unsigned long flags)
>  {
>         struct btrfs_ordered_extent *entry;
>
>         ASSERT((flags & ~BTRFS_ORDERED_TYPE_FLAGS) == 0);
>
> -       entry = alloc_ordered_extent(inode, file_offset, num_bytes, ram_bytes,
> -                                    disk_bytenr, disk_num_bytes, offset, flags,
> -                                    compress_type);
> +       /*
> +        * NOTE: I know the numbers are totally wrong for NOCOW/PREALLOC,
> +        * but it doesn't cause problem at least for now.
> +        *
> +        * For NOCOW, we don't really care about the numbers except @file_pos
> +        * and @num_bytes, as we won't insert a file extent item at all.
> +        *
> +        * For PREALLOC, we do not use ordered extent's member, but
> +        * btrfs_mark_extent_written() would handle everything.
> +        *
> +        * So here we intentionally go with pseudo numbers for the NOCOW/PREALLOC
> +        * OEs, or btrfs_extract_ordered_extent() would need a completely new
> +        * routine to handle NOCOW/PREALLOC splits, meanwhile result nothing
> +        * different.
> +        */
> +       if (flags & ((1 << BTRFS_ORDERED_NOCOW) | (1 << BTRFS_ORDERED_PREALLOC)))
> +               entry = alloc_ordered_extent(inode, file_offset,
> +                               file_extent->num_bytes, file_extent->num_bytes,
> +                               file_extent->disk_bytenr + file_extent->offset,
> +                               file_extent->num_bytes, 0, flags,
> +                               file_extent->compression);
> +       else
> +               entry = alloc_ordered_extent(inode, file_offset,
> +                               file_extent->num_bytes, file_extent->ram_bytes,
> +                               file_extent->disk_bytenr,
> +                               file_extent->disk_num_bytes,
> +                               file_extent->offset, flags,
> +                               file_extent->compression);
>         if (!IS_ERR(entry))
>                 insert_ordered_extent(entry);
>         return entry;
> diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
> index b6f6c6b91732..5bbec06fbc8d 100644
> --- a/fs/btrfs/ordered-data.h
> +++ b/fs/btrfs/ordered-data.h
> @@ -171,11 +171,27 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
>  bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
>                                     struct btrfs_ordered_extent **cached,
>                                     u64 file_offset, u64 io_size);
> +
> +/*
> + * A more access-friendly representation of btrfs_file_extent_item.
> + *
> + * Unused members are excluded.
> + */
> +struct btrfs_file_extent {
> +       u64 disk_bytenr;
> +       u64 disk_num_bytes;
> +
> +       u64 num_bytes;
> +       u64 ram_bytes;
> +       u64 offset;
> +
> +       u8 compression;
> +};
> +
>  struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
>                         struct btrfs_inode *inode, u64 file_offset,
> -                       u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
> -                       u64 disk_num_bytes, u64 offset, unsigned long flags,
> -                       int compress_type);
> +                       struct btrfs_file_extent *file_extent,
> +                       unsigned long flags);
>  void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
>                            struct btrfs_ordered_sum *sum);
>  struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
> --
> 2.45.0
>
>

^ permalink raw reply	[relevance 2%]

* Re: [PATCH v2 08/11] btrfs: cleanup duplicated parameters related to can_nocow_file_extent_args
  @ 2024-05-20 15:55  1%   ` Filipe Manana
  2024-05-20 22:13  1%     ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-20 15:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 3, 2024 at 7:02 AM Qu Wenruo <wqu@suse.com> wrote:
>
> The following functions and structures can be simplified using the
> btrfs_file_extent structure:
>
> - can_nocow_extent()
>   No need to return ram_bytes/orig_block_len through the parameter list,
>   the @file_extent parameter contains all needed info.
>
> - can_nocow_file_extent_args
>   The following members are no longer needed:
>
>   * disk_bytenr
>     This one is confusing as it's not really the
>     btrfs_file_extent_item::disk_bytenr, but where the IO would be,
>     thus it's file_extent::disk_bytenr + file_extent::offset now.
>
>   * num_bytes
>     Now file_extent::num_bytes.
>
>   * extent_offset
>     Now file_extent::offset.
>
>   * disk_num_bytes
>     Now file_extent::disk_num_bytes.
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/btrfs_inode.h |  3 +-
>  fs/btrfs/file.c        |  2 +-
>  fs/btrfs/inode.c       | 89 +++++++++++++++++++-----------------------
>  3 files changed, 42 insertions(+), 52 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index f30afce4f6ca..bea343615ad1 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -461,8 +461,7 @@ struct btrfs_file_extent {
>  };
>
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> -                             u64 *orig_block_len,
> -                             u64 *ram_bytes, struct btrfs_file_extent *file_extent,
> +                             struct btrfs_file_extent *file_extent,
>                               bool nowait, bool strict);
>
>  void btrfs_del_delalloc_inode(struct btrfs_inode *inode);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 102b5c17ece1..6aaeb9ee048d 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1104,7 +1104,7 @@ int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
>                                                    &cached_state);
>         }
>         ret = can_nocow_extent(&inode->vfs_inode, lockstart, &num_bytes,
> -                       NULL, NULL, NULL, nowait, false);
> +                              NULL, nowait, false);
>         if (ret <= 0)
>                 btrfs_drew_write_unlock(&root->snapshot_lock);
>         else
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 8bc1f165193a..89f284ae26a4 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1862,16 +1862,10 @@ struct can_nocow_file_extent_args {
>          */
>         bool free_path;
>
> -       /* Output fields. Only set when can_nocow_file_extent() returns 1. */
> -
> -       u64 disk_bytenr;
> -       u64 disk_num_bytes;
> -       u64 extent_offset;
> -
> -       /* Number of bytes that can be written to in NOCOW mode. */
> -       u64 num_bytes;
> -
> -       /* The expected file extent for the NOCOW write. */
> +       /*
> +        * Output fields. Only set when can_nocow_file_extent() returns 1.
> +        * The expected file extent for the NOCOW write.
> +        */
>         struct btrfs_file_extent file_extent;
>  };
>
> @@ -1894,6 +1888,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>         struct btrfs_root *root = inode->root;
>         struct btrfs_file_extent_item *fi;
>         struct btrfs_root *csum_root;
> +       u64 io_start;
>         u64 extent_end;
>         u8 extent_type;
>         int can_nocow = 0;
> @@ -1906,11 +1901,6 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>         if (extent_type == BTRFS_FILE_EXTENT_INLINE)
>                 goto out;
>
> -       /* Can't access these fields unless we know it's not an inline extent. */
> -       args->disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
> -       args->disk_num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
> -       args->extent_offset = btrfs_file_extent_offset(leaf, fi);
> -
>         if (!(inode->flags & BTRFS_INODE_NODATACOW) &&
>             extent_type == BTRFS_FILE_EXTENT_REG)
>                 goto out;
> @@ -1926,7 +1916,7 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>                 goto out;
>
>         /* An explicit hole, must COW. */
> -       if (args->disk_bytenr == 0)
> +       if (btrfs_file_extent_disk_num_bytes(leaf, fi) == 0)

No, this is not correct.
It's btrfs_file_extent_disk_bytenr() that we want, not
btrfs_file_extent_disk_num_bytes().
In fact a disk_num_bytes of 0 should be invalid and never happen.
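
A minimal sketch of what the corrected check would look like (illustrative
only, not compile-tested):

        /* An explicit hole, must COW. */
        if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0)
                goto out;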

>                 goto out;
>
>         /* Compressed/encrypted/encoded extents must be COWed. */
> @@ -1951,8 +1941,8 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>         btrfs_release_path(path);
>
>         ret = btrfs_cross_ref_exist(root, btrfs_ino(inode),
> -                                   key->offset - args->extent_offset,
> -                                   args->disk_bytenr, args->strict, path);
> +                                   key->offset - args->file_extent.offset,
> +                                   args->file_extent.disk_bytenr, args->strict, path);
>         WARN_ON_ONCE(ret > 0 && is_freespace_inode);
>         if (ret != 0)
>                 goto out;
> @@ -1973,21 +1963,18 @@ static int can_nocow_file_extent(struct btrfs_path *path,
>             atomic_read(&root->snapshot_force_cow))
>                 goto out;
>
> -       args->disk_bytenr += args->extent_offset;
> -       args->disk_bytenr += args->start - key->offset;
> -       args->num_bytes = min(args->end + 1, extent_end) - args->start;
> -
> -       args->file_extent.num_bytes = args->num_bytes;
> +       args->file_extent.num_bytes = min(args->end + 1, extent_end) - args->start;
>         args->file_extent.offset += args->start - key->offset;
> +       io_start = args->file_extent.disk_bytenr + args->file_extent.offset;
>
>         /*
>          * Force COW if csums exist in the range. This ensures that csums for a
>          * given extent are either valid or do not exist.
>          */
>
> -       csum_root = btrfs_csum_root(root->fs_info, args->disk_bytenr);
> -       ret = btrfs_lookup_csums_list(csum_root, args->disk_bytenr,
> -                                     args->disk_bytenr + args->num_bytes - 1,
> +       csum_root = btrfs_csum_root(root->fs_info, io_start);
> +       ret = btrfs_lookup_csums_list(csum_root, io_start,
> +                                     io_start + args->file_extent.num_bytes - 1,
>                                       NULL, nowait);
>         WARN_ON_ONCE(ret > 0 && is_freespace_inode);
>         if (ret != 0)
> @@ -2046,7 +2033,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                 struct extent_buffer *leaf;
>                 struct extent_state *cached_state = NULL;
>                 u64 extent_end;
> -               u64 ram_bytes;
>                 u64 nocow_end;
>                 int extent_type;
>                 bool is_prealloc;
> @@ -2125,7 +2111,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         ret = -EUCLEAN;
>                         goto error;
>                 }
> -               ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
>                 extent_end = btrfs_file_extent_end(path);
>
>                 /*
> @@ -2145,7 +2130,9 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         goto must_cow;
>
>                 ret = 0;
> -               nocow_bg = btrfs_inc_nocow_writers(fs_info, nocow_args.disk_bytenr);
> +               nocow_bg = btrfs_inc_nocow_writers(fs_info,
> +                               nocow_args.file_extent.disk_bytenr +
> +                               nocow_args.file_extent.offset);
>                 if (!nocow_bg) {
>  must_cow:
>                         /*
> @@ -2181,16 +2168,18 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         }
>                 }
>
> -               nocow_end = cur_offset + nocow_args.num_bytes - 1;
> +               nocow_end = cur_offset + nocow_args.file_extent.num_bytes - 1;
>                 lock_extent(&inode->io_tree, cur_offset, nocow_end, &cached_state);
>
>                 is_prealloc = extent_type == BTRFS_FILE_EXTENT_PREALLOC;
>                 if (is_prealloc) {
>                         struct extent_map *em;
>
> -                       em = create_io_em(inode, cur_offset, nocow_args.num_bytes,
> -                                         nocow_args.disk_num_bytes, /* orig_block_len */
> -                                         ram_bytes, BTRFS_COMPRESS_NONE,
> +                       em = create_io_em(inode, cur_offset,
> +                                         nocow_args.file_extent.num_bytes,
> +                                         nocow_args.file_extent.disk_num_bytes,
> +                                         nocow_args.file_extent.ram_bytes,
> +                                         BTRFS_COMPRESS_NONE,
>                                           &nocow_args.file_extent,
>                                           BTRFS_ORDERED_PREALLOC);
>                         if (IS_ERR(em)) {
> @@ -2203,9 +2192,16 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode,
>                         free_extent_map(em);
>                 }
>
> +               /*
> +                * Check btrfs_create_dio_extent() for why we intentionally pass
> +                * incorrect value for NOCOW/PREALLOC OEs.
> +                */

If in the next version you remove that similar comment/rant about OEs
and disk_bytenr, also remove this one.

Everything else in this patch looks fine, thanks.


>                 ordered = btrfs_alloc_ordered_extent(inode, cur_offset,
> -                               nocow_args.num_bytes, nocow_args.num_bytes,
> -                               nocow_args.disk_bytenr, nocow_args.num_bytes, 0,
> +                               nocow_args.file_extent.num_bytes,
> +                               nocow_args.file_extent.num_bytes,
> +                               nocow_args.file_extent.disk_bytenr +
> +                               nocow_args.file_extent.offset,
> +                               nocow_args.file_extent.num_bytes, 0,
>                                 is_prealloc
>                                 ? (1 << BTRFS_ORDERED_PREALLOC)
>                                 : (1 << BTRFS_ORDERED_NOCOW),
> @@ -7144,8 +7140,7 @@ static bool btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr)
>   *      any ordered extents.
>   */
>  noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
> -                             u64 *orig_block_len,
> -                             u64 *ram_bytes, struct btrfs_file_extent *file_extent,
> +                             struct btrfs_file_extent *file_extent,
>                               bool nowait, bool strict)
>  {
>         struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> @@ -7196,8 +7191,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>
>         fi = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_file_extent_item);
>         found_type = btrfs_file_extent_type(leaf, fi);
> -       if (ram_bytes)
> -               *ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
>
>         nocow_args.start = offset;
>         nocow_args.end = offset + *len - 1;
> @@ -7215,14 +7208,15 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>         }
>
>         ret = 0;
> -       if (btrfs_extent_readonly(fs_info, nocow_args.disk_bytenr))
> +       if (btrfs_extent_readonly(fs_info,
> +                               nocow_args.file_extent.disk_bytenr + nocow_args.file_extent.offset))
>                 goto out;
>
>         if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) &&
>             found_type == BTRFS_FILE_EXTENT_PREALLOC) {
>                 u64 range_end;
>
> -               range_end = round_up(offset + nocow_args.num_bytes,
> +               range_end = round_up(offset + nocow_args.file_extent.num_bytes,
>                                      root->fs_info->sectorsize) - 1;
>                 ret = test_range_bit_exists(io_tree, offset, range_end, EXTENT_DELALLOC);
>                 if (ret) {
> @@ -7231,13 +7225,11 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
>                 }
>         }
>
> -       if (orig_block_len)
> -               *orig_block_len = nocow_args.disk_num_bytes;
>         if (file_extent)
>                 memcpy(file_extent, &nocow_args.file_extent,
>                        sizeof(struct btrfs_file_extent));
>
> -       *len = nocow_args.num_bytes;
> +       *len = nocow_args.file_extent.num_bytes;
>         ret = 1;
>  out:
>         btrfs_free_path(path);
> @@ -7422,7 +7414,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>         struct btrfs_file_extent file_extent = { 0 };
>         struct extent_map *em = *map;
>         int type;
> -       u64 block_start, orig_block_len, ram_bytes;
> +       u64 block_start;
>         struct btrfs_block_group *bg;
>         bool can_nocow = false;
>         bool space_reserved = false;
> @@ -7450,7 +7442,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 block_start = extent_map_block_start(em) + (start - em->start);
>
>                 if (can_nocow_extent(inode, start, &len,
> -                                    &orig_block_len, &ram_bytes,
>                                      &file_extent, false, false) == 1) {
>                         bg = btrfs_inc_nocow_writers(fs_info, block_start);
>                         if (bg)
> @@ -7477,8 +7468,8 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
>                 space_reserved = true;
>
>                 em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
> -                                             orig_block_len,
> -                                             ram_bytes, type,
> +                                             file_extent.disk_num_bytes,
> +                                             file_extent.ram_bytes, type,
>                                               &file_extent);
>                 btrfs_dec_nocow_writers(bg);
>                 if (type == BTRFS_ORDERED_PREALLOC) {
> @@ -10709,7 +10700,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>                 free_extent_map(em);
>                 em = NULL;
>
> -               ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL, false, true);
> +               ret = can_nocow_extent(inode, start, &len, NULL, false, true);
>                 if (ret < 0) {
>                         goto out;
>                 } else if (ret) {
> --
> 2.45.0
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH 0/3] btrfs: avoid data races when accessing an inode's delayed_node
  @ 2024-05-20 15:48  1% ` David Sterba
  2024-05-20 16:58  1%   ` Filipe Manana
  0 siblings, 1 reply; 200+ results
From: David Sterba @ 2024-05-20 15:48 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Fri, May 17, 2024 at 02:13:23PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> We do have some data races when accessing an inode's delayed_node, namely
> we use READ_ONCE() in a couple places while there's no pairing WRITE_ONCE()
> anywhere, and in one place (btrfs_dirty_inode()) we neither user READ_ONCE()
> nor take the lock that protects the delayed_node. So fix these and add
> helpers to access and update an inode's delayed_node.
> 
> Filipe Manana (3):
>   btrfs: always set an inode's delayed_inode with WRITE_ONCE()
>   btrfs: use READ_ONCE() when accessing delayed_node at btrfs_dirty_inode()
>   btrfs: add and use helpers to get and set an inode's delayed_node

The READ_ONCE for delayed nodes has been there historically but I don't
think it's needed everywhere. The legitimate case is in
btrfs_get_delayed_node(), where the first read is done without the lock and
the value is then read again under the lock, so we do want a fresh read
there. This is to prevent a compiler optimization from coalescing the reads.

Writing to the delayed node under the lock also does not need WRITE_ONCE().

IOW, I would rather remove uses of the _ONCE helpers and not add more, as
this is not the pattern they are meant for. You say it's to prevent load
tearing, but for a pointer type that does not happen - it's an assumption
guaranteed by the hardware.
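
For reference, a simplified sketch of the one legitimate pattern (loosely
modeled on btrfs_get_delayed_node(), not the exact code; the lock and the
attach helper are placeholders):

        /*
         * Lockless fast path; READ_ONCE() keeps this read from being
         * coalesced with the read done under the lock below.
         */
        node = READ_ONCE(btrfs_inode->delayed_node);
        if (node)
                return node;

        spin_lock(&lock);
        /* Re-read under the lock to get a fresh value. */
        node = btrfs_inode->delayed_node;
        if (!node)
                node = attach_new_delayed_node(btrfs_inode);
        spin_unlock(&lock);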

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: enhance function extent_range_clear_dirty_for_io()
  2024-05-20 10:51  1% ` Filipe Manana
@ 2024-05-20 11:06  1%   ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-20 11:06 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs



On 2024/5/20 20:21, Filipe Manana wrote:
> On Mon, May 20, 2024 at 4:56 AM Qu Wenruo <wqu@suse.com> wrote:
[...]
>> - Make it subpage compatible
>>    Although currently compression only happens on a full page basis even
>>    for the subpage routine, there is no harm in making it subpage
>>    compatible now.
> 
> The changes seem ok and reasonable to me.
> 
> However I think these are really 3 separate changes that should be in
> 3 different patches.
> It makes it easier to review and to revert in case there's a need to do so.
> 
> So I would make the move to inode.c first, and then the other changes.
> Or the move last in case we need to backport the other changes.

Sure, that indeed sounds better.

[...]
>> +       if (missing_folio)
>> +               return -ENOENT;
> 
> Why not return the error from filemap_get_folio()? We could keep it
> and then return it after finishing the loop.
> Currently it can only return -ENOENT, according to the function's
> comment, but it would be more future proof to return whatever error
> it returns.

Sure, although we can only either keep the first or the last error.
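
E.g. keeping the first error could look like this (a rough sketch based on
the loop in the patch, not compile-tested):

        int ret = 0;

        for (unsigned long index = start >> PAGE_SHIFT; index <= end_index; index++) {
                struct folio *folio;

                folio = filemap_get_folio(inode->i_mapping, index);
                if (IS_ERR(folio)) {
                        if (!ret)
                                ret = PTR_ERR(folio);
                        continue;
                }
                btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
                folio_put(folio);
        }
        return ret;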

Thanks,
Qu

> 
> Thanks.
> 
>> +       return 0;
>> +}
>> +
>>   /*
>>    * Work queue call back to started compression on a file and pages.
>>    *
>> @@ -931,7 +957,10 @@ static void compress_file_range(struct btrfs_work *work)
>>           * Otherwise applications with the file mmap'd can wander in and change
>>           * the page contents while we are compressing them.
>>           */
>> -       extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
>> +       ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
>> +
>> +       /* We have locked all the involved pages, shouldn't hit a missing page. */
>> +       ASSERT(ret == 0);
>>
>>          /*
>>           * We need to save i_size before now because it could change in between
>> --
>> 2.45.1
>>
>>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: enhance function extent_range_clear_dirty_for_io()
  2024-05-20  3:55  1% [PATCH] btrfs: enhance function extent_range_clear_dirty_for_io() Qu Wenruo
@ 2024-05-20 10:51  1% ` Filipe Manana
  2024-05-20 11:06  1%   ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Filipe Manana @ 2024-05-20 10:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Mon, May 20, 2024 at 4:56 AM Qu Wenruo <wqu@suse.com> wrote:
>
> Enhance that function by:
>
> - Moving it to inode.c
>   As there is only one user inside compress_file_range(), there is no
>   need to export it through extent_io.h.
>
> - Add extra error handling
>   Previously we would BUG_ON() if we could not find a page inside the range.
>   Now we downgrade it to ASSERT(), as this really means a logic
>   error since we should have all the pages locked already.
>
> - Make it subpage compatible
>   Although currently compression only happens on a full page basis even
>   for the subpage routine, there is no harm in making it subpage
>   compatible now.

The changes seem ok and reasonable to me.

However I think these are really 3 separate changes that should be in
3 different patches.
It makes it easier to review and to revert in case there's a need to do so.

So I would make the move to inode.c first, and then the other changes.
Or the move last in case we need to backport the other changes.

Some comments inlined below.

>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/extent_io.c | 15 ---------------
>  fs/btrfs/extent_io.h |  1 -
>  fs/btrfs/inode.c     | 31 ++++++++++++++++++++++++++++++-
>  3 files changed, 30 insertions(+), 17 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index a8fc0fcfa69f..9a6f369945c6 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -164,21 +164,6 @@ void __cold extent_buffer_free_cachep(void)
>         kmem_cache_destroy(extent_buffer_cache);
>  }
>
> -void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
> -{
> -       unsigned long index = start >> PAGE_SHIFT;
> -       unsigned long end_index = end >> PAGE_SHIFT;
> -       struct page *page;
> -
> -       while (index <= end_index) {
> -               page = find_get_page(inode->i_mapping, index);
> -               BUG_ON(!page); /* Pages should be in the extent_io_tree */
> -               clear_page_dirty_for_io(page);
> -               put_page(page);
> -               index++;
> -       }
> -}
> -
>  static void process_one_page(struct btrfs_fs_info *fs_info,
>                              struct page *page, struct page *locked_page,
>                              unsigned long page_ops, u64 start, u64 end)
> diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
> index dca6b12769ec..7c2f1bbc6b67 100644
> --- a/fs/btrfs/extent_io.h
> +++ b/fs/btrfs/extent_io.h
> @@ -350,7 +350,6 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
>  void set_extent_buffer_dirty(struct extent_buffer *eb);
>  void set_extent_buffer_uptodate(struct extent_buffer *eb);
>  void clear_extent_buffer_uptodate(struct extent_buffer *eb);
> -void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
>  void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
>                                   struct page *locked_page,
>                                   struct extent_state **cached,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 000809e16aba..541a719284a9 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -890,6 +890,32 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
>                 btrfs_add_inode_defrag(NULL, inode, small_write);
>  }
>
> +static int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
> +{
> +       struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
> +       const u64 len = end + 1 - start;
> +       unsigned long end_index = end >> PAGE_SHIFT;
> +       bool missing_folio = false;
> +
> +       /* We should not have such a large range. */
> +       ASSERT(len < U32_MAX);
> +       for (unsigned long index = start >> PAGE_SHIFT;
> +            index <= end_index; index++) {
> +               struct folio *folio;
> +
> +               folio = filemap_get_folio(inode->i_mapping, index);
> +               if (IS_ERR(folio)) {
> +                       missing_folio = true;
> +                       continue;
> +               }
> +               btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
> +               folio_put(folio);
> +       }
> +       if (missing_folio)
> +               return -ENOENT;

Why not return the error from filemap_get_folio()? We could keep it
and then return it after finishing the loop.
Currently it can only return -ENOENT, according to the function's
comment, but it would be more future proof to return whatever error
it returns.

Thanks.

> +       return 0;
> +}
> +
>  /*
>   * Work queue call back to started compression on a file and pages.
>   *
> @@ -931,7 +957,10 @@ static void compress_file_range(struct btrfs_work *work)
>          * Otherwise applications with the file mmap'd can wander in and change
>          * the page contents while we are compressing them.
>          */
> -       extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
> +       ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
> +
> +       /* We have locked all the involved pages, shouldn't hit a missing page. */
> +       ASSERT(ret == 0);
>
>         /*
>          * We need to save i_size before now because it could change in between
> --
> 2.45.1
>
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
                     ` (5 preceding siblings ...)
  2024-05-20  9:46  1%   ` [PATCH v4 6/6] btrfs: use a btrfs_inode local variable at btrfs_sync_file() fdmanana
@ 2024-05-20 10:23  1%   ` Qu Wenruo
  6 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-20 10:23 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/5/20 19:16, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> There's a bug where a fast fsync can log extent maps that were not written
> due to an error in a write path or during writeback. This affects both
> direct IO writes and buffered writes, and besides the failure depends on
> a race due to the fact that ordered extent completion happens in a work
> queue and a fast fsync doesn't wait for ordered extent completion before
> logging. The details are in the change log of the first patch.
>
> V4: Use a slightly different approach to avoid a deadlock on the inode's
>      spinlock due to it being used both in irq and non-irq context, pointed
>      out by Qu.
>      Added some cleanup patches (patches 3, 4, 5 and 6).

The whole series looks good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
>
> V3: Change the approach of patch 1/2 to not drop extent maps at
>      btrfs_finish_ordered_extent() since that runs in irq context and
>      dropping an extent map range triggers NOFS extent map allocations,
>      which can trigger a reclaim and that can't run in irq context.
>      Updated comments and changelog to distinguish differences between
>      failures for direct IO writes and buffered writes.
>
> V2: Rework solution since other error paths caused the same problem, make
>      it more generic.
>      Added more details to change log and comment about what's going on,
>      and why reads aren't affected.
>
>      https://lore.kernel.org/linux-btrfs/cover.1715798440.git.fdmanana@suse.com/
>
> V1: https://lore.kernel.org/linux-btrfs/cover.1715688057.git.fdmanana@suse.com/
>
> Filipe Manana (6):
>    btrfs: ensure fast fsync waits for ordered extents after a write failure
>    btrfs: make btrfs_finish_ordered_extent() return void
>    btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx)
>    btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()
>    btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()
>    btrfs: use a btrfs_inode local variable at btrfs_sync_file()
>
>   fs/btrfs/btrfs_inode.h      | 10 ++++++
>   fs/btrfs/file.c             | 63 ++++++++++++++++++++++---------------
>   fs/btrfs/file.h             |  2 +-
>   fs/btrfs/free-space-cache.c |  4 +--
>   fs/btrfs/inode.c            | 16 +++++-----
>   fs/btrfs/ordered-data.c     | 40 ++++++++++++++++++++---
>   fs/btrfs/ordered-data.h     |  4 +--
>   fs/btrfs/reflink.c          |  8 ++---
>   fs/btrfs/relocation.c       |  2 +-
>   fs/btrfs/tree-log.c         | 10 +++---
>   fs/btrfs/tree-log.h         |  4 +--
>   11 files changed, 108 insertions(+), 55 deletions(-)
>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure
  2024-05-20  9:46  1%   ` [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
@ 2024-05-20 10:20  1%     ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-20 10:20 UTC (permalink / raw)
  To: fdmanana, linux-btrfs



On 2024/5/20 19:16, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> If a write path in COW mode fails, either before submitting a bio for
> the new extents or because an actual IO error happened, we can end up
> allowing a fast fsync to log file extent items that point to unwritten
> extents.
>
> This is because dropping the extent maps happens when completing ordered
> extents, at btrfs_finish_one_ordered(), and the completion of an ordered
> extent is executed in a work queue.
>
> This can result in a fast fsync starting to log file extent items based
> on existing extent maps before the ordered extents complete, therefore
> resulting in a log that has file extent items that point to unwritten
> extents, resulting in a corrupt file if a crash happens after and the log
> tree is replayed the next time the fs is mounted.
>
> This can happen for both direct IO writes and buffered writes.
>
> For example consider a direct IO write, in COW mode, that fails at
> btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
> error:
>
> 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
>     set to false, meaning an error happened;
>
> 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
>     flag;
>
> 3) btrfs_finish_ordered_extent() queues the completion of the ordered
>     extent - so that btrfs_finish_one_ordered() will be executed later in
>     a work queue. That function will drop extent maps in the range when
>     it's executed, since the extent maps point to unwritten locations
>     (signaled by the BTRFS_ORDERED_IOERR flag);
>
> 4) After calling btrfs_finish_ordered_extent() we keep going down the
>     write path and unlock the inode;
>
> 5) After that a fast fsync starts and locks the inode;
>
> 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
>     task sees the extent maps that point to the unwritten locations and
>     logs file extent items based on them - it does not know they are
>     unwritten, and the fast fsync path does not wait for ordered extents
>     to complete, which is an intentional behaviour in order to reduce
>     latency.
>
> For the buffered write case, here's one example:
>
> 1) A fast fsync begins, and it starts by flushing delalloc and waiting for
>     the writeback to complete by calling filemap_fdatawait_range();
>
> 2) Flushing the delalloc created a new extent map X;
>
> 3) During the writeback some IO error happened, and at the end io callback
>     (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
>     sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
>     completion;
>
> 4) After queuing the ordered extent completion, the end io callback clears
>     the writeback flag from all pages (or folios), and from that moment the
>     fast fsync can proceed;
>
> 5) The fast fsync proceeds, sees extent map X and logs a file extent item
>     based on extent map X, resulting in a log that points to an unwritten
>     data extent - because the ordered extent completion hasn't run yet, it
>     happens only after the logging.
>
> To fix this make btrfs_finish_ordered_extent() set the inode flag
> BTRFS_INODE_COW_WRITE_ERROR in case an error happened for a COW write,
> so that a fast fsync will wait for ordered extent completion.
>
> Note that these issues of using extent maps that point to unwritten
> locations can not happen for reads, because in read paths we start by
> locking the extent range and wait for any ordered extents in the range
> to complete before looking for extent maps.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

So a new inode flag without touching the spinlock, that's a solid solution.
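
In short, the two atomic bitops pair up without any spinlock (condensed from
the hunks above):

        /* In btrfs_finish_ordered_extent(), safe to call in irq context: */
        set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);

        /* In btrfs_sync_file(), after waiting for writeback: */
        if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
                               &BTRFS_I(inode)->runtime_flags))
                ret = btrfs_wait_ordered_range(inode, start, len);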

Thanks,
Qu
> ---
>   fs/btrfs/btrfs_inode.h  | 10 ++++++++++
>   fs/btrfs/file.c         | 16 ++++++++++++++++
>   fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++
>   3 files changed, 57 insertions(+)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index 3c8bc7a8ebdd..46db4027bf15 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -112,6 +112,16 @@ enum {
>   	 * done at new_simple_dir(), called from btrfs_lookup_dentry().
>   	 */
>   	BTRFS_INODE_ROOT_STUB,
> +	/*
> +	 * Set if an error happened when doing a COW write before submitting a
> +	 * bio or during writeback. Used for both buffered writes and direct IO
> +	 * writes. This is to signal a fast fsync that it has to wait for
> +	 * ordered extents to complete and therefore not log extent maps that
> +	 * point to unwritten extents (when an ordered extent completes and it
> +	 * has the BTRFS_ORDERED_IOERR flag set, it drops extent maps in its
> +	 * range).
> +	 */
> +	BTRFS_INODE_COW_WRITE_ERROR,
>   };
>
>   /* in memory btrfs inode */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0c7c1b42028e..00670596bf06 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1885,6 +1885,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   	 */
>   	if (full_sync || btrfs_is_zoned(fs_info)) {
>   		ret = btrfs_wait_ordered_range(inode, start, len);
> +		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
>   	} else {
>   		/*
>   		 * Get our ordered extents as soon as possible to avoid doing
> @@ -1894,6 +1895,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
>   						      &ctx.ordered_extents);
>   		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
> +		if (ret)
> +			goto out_release_extents;
> +
> +		/*
> +		 * Check and clear the BTRFS_INODE_COW_WRITE_ERROR now after
> +		 * starting and waiting for writeback, because for buffered IO
> +		 * it may have been set during the end IO callback
> +		 * (end_bbio_data_write() -> btrfs_finish_ordered_extent()) in
> +		 * case an error happened and we need to wait for ordered
> +		 * extents to complete so that any extent maps that point to
> +		 * unwritten locations are dropped and we don't log them.
> +		 */
> +		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
> +				       &BTRFS_I(inode)->runtime_flags))
> +			ret = btrfs_wait_ordered_range(inode, start, len);
>   	}
>
>   	if (ret)
> diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> index 44157e43fd2a..7d175d10a6d0 100644
> --- a/fs/btrfs/ordered-data.c
> +++ b/fs/btrfs/ordered-data.c
> @@ -388,6 +388,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
>   	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
>   	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
>
> +	/*
> +	 * If this is a COW write it means we created new extent maps for the
> +	 * range and they point to unwritten locations if we got an error either
> +	 * before submitting a bio or during IO.
> +	 *
> +	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
> +	 * are queuing its completion below. During completion, at
> +	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
> +	 * unwritten extents.
> +	 *
> +	 * However because completion runs in a work queue we can end up having
> +	 * a fast fsync running before that. In the case of direct IO, once we
> +	 * unlock the inode the fsync might start, and we queue the completion
> +	 * before unlocking the inode. In the case of buffered IO when writeback
> +	 * finishes (end_bbio_data_write()) we queue the completion, so if the
> +	 * writeback was triggered by a fast fsync, the fsync might start
> +	 * logging before ordered extent completion runs in the work queue.
> +	 *
> +	 * The fast fsync will log file extent items based on the extent maps it
> +	 * finds, so if by the time it collects extent maps the ordered extent
> +	 * completion didn't happen yet, it will log file extent items that
> +	 * point to unwritten extents, resulting in a corruption if a crash
> +	 * happens and the log tree is replayed. Note that a fast fsync does not
> +	 * wait for completion of ordered extents in order to reduce latency.
> +	 *
> +	 * Set a flag in the inode so that the next fast fsync will wait for
> +	 * ordered extents to complete before starting to log.
> +	 */
> +	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
> +		set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
> +
>   	if (ret)
>   		btrfs_queue_ordered_fn(ordered);
>   	return ret;

^ permalink raw reply	[relevance 1%]

* Re: [PATCH v3 1/2] btrfs: ensure fast fsync waits for ordered extents after a write failure
  @ 2024-05-20  9:46  1%     ` Filipe Manana
  0 siblings, 0 replies; 200+ results
From: Filipe Manana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, May 18, 2024 at 6:28 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/5/18 02:22, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If a write path in COW mode fails, either before submitting a bio for the
> > new extents or an actual IO error happens, we can end up allowing a fast
> > fsync to log file extent items that point to unwritten extents.
> >
> > This is because dropping the extent maps happens when completing ordered
> > extents, at btrfs_finish_one_ordered(), and the completion of an ordered
> > extent is executed in a work queue.
> >
> > This can result in a fast fsync starting to log file extent items based
> > on existing extent maps before the ordered extents complete, therefore
> > resulting in a log that has file extent items that point to unwritten
> > extents, resulting in a corrupt file if a crash happens afterwards and the log
> > tree is replayed the next time the fs is mounted.
> >
> > This can happen for both direct IO writes and buffered writes.
> >
> > For example consider a direct IO write, in COW mode, that fails at
> > btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
> > error:
> >
> > 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
> >     set to false, meaning an error happened;
> >
> > 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
> >     flag;
> >
> > 3) btrfs_finish_ordered_extent() queues the completion of the ordered
> >     extent - so that btrfs_finish_one_ordered() will be executed later in
> >     a work queue. That function will drop extent maps in the range when
> >     it's executed, since the extent maps point to unwritten locations
> >     (signaled by the BTRFS_ORDERED_IOERR flag);
> >
> > 4) After calling btrfs_finish_ordered_extent() we keep going down the
> >     write path and unlock the inode;
> >
> > 5) After that a fast fsync starts and locks the inode;
> >
> > 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
> >     task sees the extent maps that point to the unwritten locations and
> >     logs file extent items based on them - it does not know they are
> >     unwritten, and the fast fsync path does not wait for ordered extents
> >     to complete, which is an intentional behaviour in order to reduce
> >     latency.
> >
> > For the buffered write case, here's one example:
> >
> > 1) A fast fsync begins, and it starts by flushing delalloc and waiting for
> >     the writeback to complete by calling filemap_fdatawait_range();
> >
> > 2) Flushing the delalloc created a new extent map X;
> >
> > 3) During the writeback some IO error happened, and at the end io callback
> >     (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
> >     sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
> >     completion;
> >
> > 4) After queuing the ordered extent completion, the end io callback clears
> >     the writeback flag from all pages (or folios), and from that moment the
> >     fast fsync can proceed;
> >
> > 5) The fast fsync proceeds, sees extent map X and logs a file extent item
> >     based on extent map X, resulting in a log that points to an unwritten
> >     data extent - because the ordered extent completion hasn't run yet, it
> >     happens only after the logging.
> >
> > To fix this make btrfs_finish_ordered_extent() set the inode flag
> > BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write,
> > so that a fast fsync will wait for ordered extent completion.
> >
> > Note that this issue of using extent maps that point to unwritten
> > locations cannot happen for reads, because in read paths we start by
> > locking the extent range and waiting for any ordered extents in the
> > range to complete before looking for extent maps.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
Thanks for the updated commit messages, they make the race window much clearer.
>
> And since we no longer try to run finish_ordered_io() inside the endio
> function, we should no longer hit the memory allocation warning inside
> irq context.
>
> But the inode->lock usage seems unsafe to me, comment inlined below:
> [...]
> > @@ -478,10 +485,10 @@ static inline void btrfs_set_inode_full_sync(struct btrfs_inode *inode)
> >        * while ->last_trans was not yet updated in the current transaction,
> >        * and therefore has a lower value.
> >        */
> > -     spin_lock(&inode->lock);
> > +     spin_lock_irqsave(&inode->lock, flags);
> >       if (inode->last_reflink_trans < inode->last_trans)
> >               inode->last_reflink_trans = inode->last_trans;
> > -     spin_unlock(&inode->lock);
> > +     spin_unlock_irqrestore(&inode->lock, flags);
>
> IIRC this is not how we change the lock usage to be irq safe.
> We need all lock users to use irq variants.
>
> Or we can hit situation like:
>
>         Thread A
>
>         spin_lock(&inode->lock);
> --- IRQ happens for the endio ---
>         spin_lock_irqsave();
>
> Then we dead lock.
>
> Thus all inode->lock users need to use the irq variant, which would be
> a huge change.

Indeed, I missed that, thanks.
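
(To spell the rule out: once a lock can be taken from irq context, every
acquisition of it must disable irqs. A rough sketch of the self-deadlock
otherwise - hypothetical code, not from the patch:

    spin_lock(&inode->lock);                 /* process context, irqs on */
    /* irq arrives on the same CPU, the endio handler runs: */
    spin_lock_irqsave(&inode->lock, flags);  /* spins forever: AA deadlock */

hence the need to convert every inode->lock user.)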

>
> I guess unconditionally waiting for ordered extents inside
> btrfs_sync_file() would be too slow?

No way. Not waiting for ordered extent completion is one of the main
things that make the fast fsync faster than the full fsync.
It's ok to wait only in case of errors, since they are unexpected and
unlikely, and in error cases the ordered extent completion doesn't do
much (no trees to update).
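
In other words, roughly this (a sketch of what v4 ends up doing,
matching the diff in patch 1):

    if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
                           &BTRFS_I(inode)->runtime_flags))
            ret = btrfs_wait_ordered_range(inode, start, len);

so the extra wait is only paid on the rare error path.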

Fixed in v4, thanks.

>
> Thanks,
> Qu
>
> >   }
> >
> >   static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index 0c7c1b42028e..d635bc0c01df 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -1894,6 +1894,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> >               btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
> >                                                     &ctx.ordered_extents);
> >               ret = filemap_fdatawait_range(inode->i_mapping, start, end);
> > +             if (ret)
> > +                     goto out_release_extents;
> > +
> > +             /*
> > +              * Check again the full sync flag, because it may have been set
> > +              * during the end IO callback (end_bbio_data_write() ->
> > +              * btrfs_finish_ordered_extent()) in case an error happened and
> > +              * we need to wait for ordered extents to complete so that any
> > +              * extent maps that point to unwritten locations are dropped and
> > +              * we don't log them.
> > +              */
> > +             full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
> > +                                  &BTRFS_I(inode)->runtime_flags);
> > +             if (full_sync)
> > +                     ret = btrfs_wait_ordered_range(inode, start, len);
> >       }
> >
> >       if (ret)
> > diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
> > index 44157e43fd2a..55a9aeed7344 100644
> > --- a/fs/btrfs/ordered-data.c
> > +++ b/fs/btrfs/ordered-data.c
> > @@ -388,6 +388,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
> >       ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
> >       spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
> >
> > +     /*
> > +      * If this is a COW write it means we created new extent maps for the
> > +      * range and they point to unwritten locations if we got an error either
> > +      * before submitting a bio or during IO.
> > +      *
> > +      * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
> > +      * are queuing its completion below. During completion, at
> > +      * btrfs_finish_one_ordered(), we will drop the extent maps for the
> > +      * unwritten extents.
> > +      *
> > +      * However because completion runs in a work queue we can end up having
> > +      * a fast fsync running before that. In the case of direct IO, once we
> > +      * unlock the inode the fsync might start, and we queue the completion
> > +      * before unlocking the inode. In the case of buffered IO when writeback
> > +      * finishes (end_bbio_data_write()) we queue the completion, so if the
> > +      * writeback was triggered by a fast fsync, the fsync might start
> > +      * logging before ordered extent completion runs in the work queue.
> > +      *
> > +      * The fast fsync will log file extent items based on the extent maps it
> > +      * finds, so if by the time it collects extent maps the ordered extent
> > +      * completion didn't happen yet, it will log file extent items that
> > +      * point to unwritten extents, resulting in a corruption if a crash
> > +      * happens and the log tree is replayed. Note that a fast fsync does not
> > +      * wait for completion of ordered extents in order to reduce latency.
> > +      *
> > +      * Set a flag in the inode so that the next fast fsync will wait for
> > +      * ordered extents to complete before starting to log.
> > +      */
> > +     if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
> > +             btrfs_set_inode_full_sync(inode);
> > +
> >       if (ret)
> >               btrfs_queue_ordered_fn(ordered);
> >       return ret;

^ permalink raw reply	[relevance 1%]

* [PATCH v4 6/6] btrfs: use a btrfs_inode local variable at btrfs_sync_file()
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
                     ` (4 preceding siblings ...)
  2024-05-20  9:46  1%   ` [PATCH v4 5/6] btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20 10:23  1%   ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths Qu Wenruo
  6 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of using a VFS inode local pointer and then doing many BTRFS_I()
calls inside btrfs_sync_file(), use a btrfs_inode pointer instead. This
makes everything a bit easier to read and less confusing, and allows
making some statements shorter.
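
For reference, the two directions of the conversion look roughly like
this (illustrative sketch, not part of the diff below):

    struct btrfs_inode *inode = BTRFS_I(d_inode(dentry)); /* VFS -> btrfs inode */
    struct inode *vfs_inode = &inode->vfs_inode;          /* btrfs -> VFS inode */

With the conversion done once at the top, plain inode-> accesses replace
the repeated BTRFS_I(inode)-> chains throughout the function.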

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 21a48d326b99..af58a1b33498 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1794,9 +1794,9 @@ static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
 int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
 	struct dentry *dentry = file_dentry(file);
-	struct inode *inode = d_inode(dentry);
-	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
-	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_inode *inode = BTRFS_I(d_inode(dentry));
+	struct btrfs_root *root = inode->root;
+	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_trans_handle *trans;
 	struct btrfs_log_ctx ctx;
 	int ret = 0, err;
@@ -1805,7 +1805,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	trace_btrfs_sync_file(file, datasync);
 
-	btrfs_init_log_ctx(&ctx, BTRFS_I(inode));
+	btrfs_init_log_ctx(&ctx, inode);
 
 	/*
 	 * Always set the range to a full range, otherwise we can get into
@@ -1825,11 +1825,11 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * multi-task, and make the performance up.  See
 	 * btrfs_wait_ordered_range for an explanation of the ASYNC check.
 	 */
-	ret = start_ordered_ops(BTRFS_I(inode), start, end);
+	ret = start_ordered_ops(inode, start, end);
 	if (ret)
 		goto out;
 
-	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+	btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
 
 	atomic_inc(&root->log_batch);
 
@@ -1851,9 +1851,9 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * So trigger writeback for any eventual new dirty pages and then we
 	 * wait for all ordered extents to complete below.
 	 */
-	ret = start_ordered_ops(BTRFS_I(inode), start, end);
+	ret = start_ordered_ops(inode, start, end);
 	if (ret) {
-		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+		btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 		goto out;
 	}
 
@@ -1865,8 +1865,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * running delalloc the full sync flag may be set if we need to drop
 	 * extra extent map ranges due to temporary memory allocation failures.
 	 */
-	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-			     &BTRFS_I(inode)->runtime_flags);
+	full_sync = test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
 
 	/*
 	 * We have to do this here to avoid the priority inversion of waiting on
@@ -1884,17 +1883,16 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * to wait for the IO to stabilize the logical address.
 	 */
 	if (full_sync || btrfs_is_zoned(fs_info)) {
-		ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
-		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
+		ret = btrfs_wait_ordered_range(inode, start, len);
+		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
 	} else {
 		/*
 		 * Get our ordered extents as soon as possible to avoid doing
 		 * checksum lookups in the csum tree, and use instead the
 		 * checksums attached to the ordered extents.
 		 */
-		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
-						      &ctx.ordered_extents);
-		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
+		btrfs_get_ordered_extents_for_logging(inode, &ctx.ordered_extents);
+		ret = filemap_fdatawait_range(inode->vfs_inode.i_mapping, start, end);
 		if (ret)
 			goto out_release_extents;
 
@@ -1908,8 +1906,8 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 * unwritten locations are dropped and we don't log them.
 		 */
 		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
-				       &BTRFS_I(inode)->runtime_flags))
-			ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
+				       &inode->runtime_flags))
+			ret = btrfs_wait_ordered_range(inode, start, len);
 	}
 
 	if (ret)
@@ -1923,8 +1921,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 * modified so clear this flag in case it was set for whatever
 		 * reason, it's no longer relevant.
 		 */
-		clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
-			  &BTRFS_I(inode)->runtime_flags);
+		clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags);
 		/*
 		 * An ordered extent might have started before and completed
 		 * already with io errors, in which case the inode was not
@@ -1932,7 +1929,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 * for any errors that might have happened since we last
 		 * checked called fsync.
 		 */
-		ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err);
+		ret = filemap_check_wb_err(inode->vfs_inode.i_mapping, file->f_wb_err);
 		goto out_release_extents;
 	}
 
@@ -1982,7 +1979,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * file again, but that will end up using the synchronization
 	 * inside btrfs_sync_log to keep things safe.
 	 */
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 
 	if (ret == BTRFS_NO_LOG_SYNC) {
 		ret = btrfs_end_transaction(trans);
@@ -2014,7 +2011,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		ret = btrfs_end_transaction(trans);
 		if (ret)
 			goto out;
-		ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
+		ret = btrfs_wait_ordered_range(inode, start, len);
 		if (ret)
 			goto out;
 
@@ -2051,7 +2048,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 out_release_extents:
 	btrfs_release_log_ctx_extents(&ctx);
-	btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
+	btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
 	goto out;
 }
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 5/6] btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
                     ` (3 preceding siblings ...)
  2024-05-20  9:46  1%   ` [PATCH v4 4/6] btrfs: pass a btrfs_inode to btrfs_fdatawrite_range() fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 6/6] btrfs: use a btrfs_inode local variable at btrfs_sync_file() fdmanana
  2024-05-20 10:23  1%   ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths Qu Wenruo
  6 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later help to
remove a lot of BTRFS_I() calls in btrfs_sync_file().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c             | 10 +++++-----
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/inode.c            | 16 ++++++++--------
 fs/btrfs/ordered-data.c     |  8 ++++----
 fs/btrfs/ordered-data.h     |  2 +-
 fs/btrfs/reflink.c          |  8 ++++----
 fs/btrfs/relocation.c       |  2 +-
 7 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 23c5510f6271..21a48d326b99 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1884,7 +1884,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * to wait for the IO to stabilize the logical address.
 	 */
 	if (full_sync || btrfs_is_zoned(fs_info)) {
-		ret = btrfs_wait_ordered_range(inode, start, len);
+		ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
 		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
 	} else {
 		/*
@@ -1909,7 +1909,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 */
 		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
 				       &BTRFS_I(inode)->runtime_flags))
-			ret = btrfs_wait_ordered_range(inode, start, len);
+			ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
 	}
 
 	if (ret)
@@ -2014,7 +2014,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		ret = btrfs_end_transaction(trans);
 		if (ret)
 			goto out;
-		ret = btrfs_wait_ordered_range(inode, start, len);
+		ret = btrfs_wait_ordered_range(BTRFS_I(inode), start, len);
 		if (ret)
 			goto out;
 
@@ -2814,7 +2814,7 @@ static int btrfs_punch_hole(struct file *file, loff_t offset, loff_t len)
 
 	btrfs_inode_lock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 
-	ret = btrfs_wait_ordered_range(inode, offset, len);
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), offset, len);
 	if (ret)
 		goto out_only_mutex;
 
@@ -3309,7 +3309,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	 * the file range and, due to the previous locking we did, we know there
 	 * can't be more delalloc or ordered extents in the range.
 	 */
-	ret = btrfs_wait_ordered_range(inode, alloc_start,
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), alloc_start,
 				       alloc_end - alloc_start);
 	if (ret)
 		goto out;
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index e6d599efd713..8bed59fe937c 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1268,7 +1268,7 @@ static int flush_dirty_cache(struct inode *inode)
 {
 	int ret;
 
-	ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		clear_extent_bit(&BTRFS_I(inode)->io_tree, 0, inode->i_size - 1,
 				 EXTENT_DELALLOC, NULL);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2f3129fe0e58..3cf32bc721d2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5081,7 +5081,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 
 		if (btrfs_is_zoned(fs_info)) {
-			ret = btrfs_wait_ordered_range(inode,
+			ret = btrfs_wait_ordered_range(BTRFS_I(inode),
 					ALIGN(newsize, fs_info->sectorsize),
 					(u64)-1);
 			if (ret)
@@ -5111,7 +5111,7 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 			 * wait for disk_i_size to be stable and then update the
 			 * in-memory size to match.
 			 */
-			err = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+			err = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 			if (err)
 				return err;
 			i_size_write(inode, BTRFS_I(inode)->disk_i_size);
@@ -7955,7 +7955,7 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	 * if we have delalloc in those ranges.
 	 */
 	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
 		if (ret)
 			return ret;
 	}
@@ -7969,7 +7969,7 @@ static int btrfs_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
 	 * possible a new write may have happened in between those two steps.
 	 */
 	if (fieinfo->fi_flags & FIEMAP_FLAG_SYNC) {
-		ret = btrfs_wait_ordered_range(inode, 0, LLONG_MAX);
+		ret = btrfs_wait_ordered_range(btrfs_inode, 0, LLONG_MAX);
 		if (ret) {
 			btrfs_inode_unlock(btrfs_inode, BTRFS_ILOCK_SHARED);
 			return ret;
@@ -8238,7 +8238,7 @@ static int btrfs_truncate(struct btrfs_inode *inode, bool skip_writeback)
 	const u64 min_size = btrfs_calc_metadata_size(fs_info, 1);
 
 	if (!skip_writeback) {
-		ret = btrfs_wait_ordered_range(&inode->vfs_inode,
+		ret = btrfs_wait_ordered_range(inode,
 					       inode->vfs_inode.i_size & (~mask),
 					       (u64)-1);
 		if (ret)
@@ -10059,7 +10059,7 @@ ssize_t btrfs_encoded_read(struct kiocb *iocb, struct iov_iter *iter,
 	for (;;) {
 		struct btrfs_ordered_extent *ordered;
 
-		ret = btrfs_wait_ordered_range(&inode->vfs_inode, start,
+		ret = btrfs_wait_ordered_range(inode, start,
 					       lockend - start + 1);
 		if (ret)
 			goto out_unlock_inode;
@@ -10302,7 +10302,7 @@ ssize_t btrfs_do_encoded_write(struct kiocb *iocb, struct iov_iter *from,
 	for (;;) {
 		struct btrfs_ordered_extent *ordered;
 
-		ret = btrfs_wait_ordered_range(&inode->vfs_inode, start, num_bytes);
+		ret = btrfs_wait_ordered_range(inode, start, num_bytes);
 		if (ret)
 			goto out_folios;
 		ret = invalidate_inode_pages2_range(inode->vfs_inode.i_mapping,
@@ -10575,7 +10575,7 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	 * file changes again after this, the user is doing something stupid and
 	 * we don't really care.
 	 */
-	ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		return ret;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 605d88e09525..e2c176f7c387 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -840,7 +840,7 @@ void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry)
 /*
  * Used to wait on ordered extents across a large range of bytes.
  */
-int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len)
+int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len)
 {
 	int ret = 0;
 	int ret_wb = 0;
@@ -859,7 +859,7 @@ int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len)
 	/* start IO across the range first to instantiate any delalloc
 	 * extents
 	 */
-	ret = btrfs_fdatawrite_range(BTRFS_I(inode), start, orig_end);
+	ret = btrfs_fdatawrite_range(inode, start, orig_end);
 	if (ret)
 		return ret;
 
@@ -870,11 +870,11 @@ int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len)
 	 * before the ordered extents complete - to avoid failures (-EEXIST)
 	 * when adding the new ordered extents to the ordered tree.
 	 */
-	ret_wb = filemap_fdatawait_range(inode->i_mapping, start, orig_end);
+	ret_wb = filemap_fdatawait_range(inode->vfs_inode.i_mapping, start, orig_end);
 
 	end = orig_end;
 	while (1) {
-		ordered = btrfs_lookup_first_ordered_extent(BTRFS_I(inode), end);
+		ordered = btrfs_lookup_first_ordered_extent(inode, end);
 		if (!ordered)
 			break;
 		if (ordered->file_offset > orig_end) {
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index bef22179e7c5..4a4dd15d38ba 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -181,7 +181,7 @@ void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct btrfs_inode *inode,
 							 u64 file_offset);
 void btrfs_start_ordered_extent(struct btrfs_ordered_extent *entry);
-int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len);
+int btrfs_wait_ordered_range(struct btrfs_inode *inode, u64 start, u64 len);
 struct btrfs_ordered_extent *
 btrfs_lookup_first_ordered_extent(struct btrfs_inode *inode, u64 file_offset);
 struct btrfs_ordered_extent *btrfs_lookup_first_ordered_range(
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index d0a3fcecc46a..df6b93b927cd 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -733,7 +733,7 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 		 * we found the previous extent covering eof and before we
 		 * attempted to increment its reference count).
 		 */
-		ret = btrfs_wait_ordered_range(inode, wb_start,
+		ret = btrfs_wait_ordered_range(BTRFS_I(inode), wb_start,
 					       destoff - wb_start);
 		if (ret)
 			return ret;
@@ -755,7 +755,7 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	 * range, so wait for writeback to complete before truncating pages
 	 * from the page cache. This is a rare case.
 	 */
-	wb_ret = btrfs_wait_ordered_range(inode, destoff, len);
+	wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
 	ret = ret ? ret : wb_ret;
 	/*
 	 * Truncate page cache pages so that future reads will see the cloned
@@ -835,11 +835,11 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 	if (ret < 0)
 		return ret;
 
-	ret = btrfs_wait_ordered_range(inode_in, ALIGN_DOWN(pos_in, bs),
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode_in), ALIGN_DOWN(pos_in, bs),
 				       wb_len);
 	if (ret < 0)
 		return ret;
-	ret = btrfs_wait_ordered_range(inode_out, ALIGN_DOWN(pos_out, bs),
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode_out), ALIGN_DOWN(pos_out, bs),
 				       wb_len);
 	if (ret < 0)
 		return ret;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 9f35524b6664..8ce337ec033c 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4149,7 +4149,7 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
 		 * out of the loop if we hit an error.
 		 */
 		if (rc->stage == MOVE_DATA_EXTENTS && rc->found_file_extent) {
-			ret = btrfs_wait_ordered_range(rc->data_inode, 0,
+			ret = btrfs_wait_ordered_range(BTRFS_I(rc->data_inode), 0,
 						       (u64)-1);
 			if (ret)
 				err = ret;
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 4/6] btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
                     ` (2 preceding siblings ...)
  2024-05-20  9:46  1%   ` [PATCH v4 3/6] btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx) fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 5/6] btrfs: pass a btrfs_inode to btrfs_wait_ordered_range() fdmanana
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later help to
remove a lot of BTRFS_I() calls in btrfs_sync_file().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c             | 18 +++++++++---------
 fs/btrfs/file.h             |  2 +-
 fs/btrfs/free-space-cache.c |  2 +-
 fs/btrfs/ordered-data.c     |  2 +-
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 506eabcd809d..23c5510f6271 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1625,7 +1625,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	 * able to read what was just written.
 	 */
 	endbyte = pos + written_buffered - 1;
-	ret = btrfs_fdatawrite_range(inode, pos, endbyte);
+	ret = btrfs_fdatawrite_range(BTRFS_I(inode), pos, endbyte);
 	if (ret)
 		goto out;
 	ret = filemap_fdatawait_range(inode->i_mapping, pos, endbyte);
@@ -1738,7 +1738,7 @@ int btrfs_release_file(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
+static int start_ordered_ops(struct btrfs_inode *inode, loff_t start, loff_t end)
 {
 	int ret;
 	struct blk_plug plug;
@@ -1825,7 +1825,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * multi-task, and make the performance up.  See
 	 * btrfs_wait_ordered_range for an explanation of the ASYNC check.
 	 */
-	ret = start_ordered_ops(inode, start, end);
+	ret = start_ordered_ops(BTRFS_I(inode), start, end);
 	if (ret)
 		goto out;
 
@@ -1851,7 +1851,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 * So trigger writeback for any eventual new dirty pages and then we
 	 * wait for all ordered extents to complete below.
 	 */
-	ret = start_ordered_ops(inode, start, end);
+	ret = start_ordered_ops(BTRFS_I(inode), start, end);
 	if (ret) {
 		btrfs_inode_unlock(BTRFS_I(inode), BTRFS_ILOCK_MMAP);
 		goto out;
@@ -4045,8 +4045,9 @@ const struct file_operations btrfs_file_operations = {
 	.remap_file_range = btrfs_remap_file_range,
 };
 
-int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
+int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end)
 {
+	struct address_space *mapping = inode->vfs_inode.i_mapping;
 	int ret;
 
 	/*
@@ -4063,10 +4064,9 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end)
 	 * know better and pull this out at some point in the future, it is
 	 * right and you are wrong.
 	 */
-	ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
-	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
-			     &BTRFS_I(inode)->runtime_flags))
-		ret = filemap_fdatawrite_range(inode->i_mapping, start, end);
+	ret = filemap_fdatawrite_range(mapping, start, end);
+	if (!ret && test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags))
+		ret = filemap_fdatawrite_range(mapping, start, end);
 
 	return ret;
 }
diff --git a/fs/btrfs/file.h b/fs/btrfs/file.h
index 77aaca208c7b..ce93ed7083ab 100644
--- a/fs/btrfs/file.h
+++ b/fs/btrfs/file.h
@@ -37,7 +37,7 @@ int btrfs_release_file(struct inode *inode, struct file *file);
 int btrfs_dirty_pages(struct btrfs_inode *inode, struct page **pages,
 		      size_t num_pages, loff_t pos, size_t write_bytes,
 		      struct extent_state **cached, bool noreserve);
-int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+int btrfs_fdatawrite_range(struct btrfs_inode *inode, loff_t start, loff_t end);
 int btrfs_check_nocow_lock(struct btrfs_inode *inode, loff_t pos,
 			   size_t *write_bytes, bool nowait);
 void btrfs_check_nocow_unlock(struct btrfs_inode *inode);
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c8a05d5eb9cb..e6d599efd713 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1483,7 +1483,7 @@ static int __btrfs_write_out_cache(struct inode *inode,
 	io_ctl->entries = entries;
 	io_ctl->bitmaps = bitmaps;
 
-	ret = btrfs_fdatawrite_range(inode, 0, (u64)-1);
+	ret = btrfs_fdatawrite_range(BTRFS_I(inode), 0, (u64)-1);
 	if (ret)
 		goto out;
 
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 16f9ddd2831c..605d88e09525 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -859,7 +859,7 @@ int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len)
 	/* start IO across the range first to instantiate any delalloc
 	 * extents
 	 */
-	ret = btrfs_fdatawrite_range(inode, start, orig_end);
+	ret = btrfs_fdatawrite_range(BTRFS_I(inode), start, orig_end);
 	if (ret)
 		return ret;
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 2/6] btrfs: make btrfs_finish_ordered_extent() return void
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 3/6] btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx) fdmanana
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Currently btrfs_finish_ordered_extent() returns a boolean indicating if
the ordered extent was added to the work queue for completion, but none
of its callers cares about it, so make it return void.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ordered-data.c | 3 +--
 fs/btrfs/ordered-data.h | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 7d175d10a6d0..16f9ddd2831c 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -374,7 +374,7 @@ static void btrfs_queue_ordered_fn(struct btrfs_ordered_extent *ordered)
 	btrfs_queue_work(wq, &ordered->work);
 }
 
-bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 				 struct page *page, u64 file_offset, u64 len,
 				 bool uptodate)
 {
@@ -421,7 +421,6 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
-	return ret;
 }
 
 /*
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index b6f6c6b91732..bef22179e7c5 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -162,7 +162,7 @@ int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
 void btrfs_remove_ordered_extent(struct btrfs_inode *btrfs_inode,
 				struct btrfs_ordered_extent *entry);
-bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
+void btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 				 struct page *page, u64 file_offset, u64 len,
 				 bool uptodate);
 void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 3/6] btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx)
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 2/6] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 4/6] btrfs: pass a btrfs_inode to btrfs_fdatawrite_range() fdmanana
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Instead of using an inode pointer, use a btrfs_inode pointer in the log
context structure, as this is generally what we need and allows some
internal APIs to take a btrfs_inode instead, making them more consistent
with most of the code base. This will later help to remove a lot of
BTRFS_I() calls in btrfs_sync_file().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c     |  4 ++--
 fs/btrfs/tree-log.c | 10 +++++-----
 fs/btrfs/tree-log.h |  4 ++--
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 00670596bf06..506eabcd809d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1758,7 +1758,7 @@ static int start_ordered_ops(struct inode *inode, loff_t start, loff_t end)
 
 static inline bool skip_inode_logging(const struct btrfs_log_ctx *ctx)
 {
-	struct btrfs_inode *inode = BTRFS_I(ctx->inode);
+	struct btrfs_inode *inode = ctx->inode;
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 
 	if (btrfs_inode_in_log(inode, btrfs_get_fs_generation(fs_info)) &&
@@ -1805,7 +1805,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 
 	trace_btrfs_sync_file(file, datasync);
 
-	btrfs_init_log_ctx(&ctx, inode);
+	btrfs_init_log_ctx(&ctx, BTRFS_I(inode));
 
 	/*
 	 * Always set the range to a full range, otherwise we can get into
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2e762b89d4a2..51a167559ae8 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2821,7 +2821,7 @@ static void wait_for_writer(struct btrfs_root *root)
 	finish_wait(&root->log_writer_wait, &wait);
 }
 
-void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct inode *inode)
+void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct btrfs_inode *inode)
 {
 	ctx->log_ret = 0;
 	ctx->log_transid = 0;
@@ -2840,7 +2840,7 @@ void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct inode *inode)
 
 void btrfs_init_log_ctx_scratch_eb(struct btrfs_log_ctx *ctx)
 {
-	struct btrfs_inode *inode = BTRFS_I(ctx->inode);
+	struct btrfs_inode *inode = ctx->inode;
 
 	if (!test_bit(BTRFS_INODE_NEEDS_FULL_SYNC, &inode->runtime_flags) &&
 	    !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
@@ -2858,7 +2858,7 @@ void btrfs_release_log_ctx_extents(struct btrfs_log_ctx *ctx)
 	struct btrfs_ordered_extent *ordered;
 	struct btrfs_ordered_extent *tmp;
 
-	ASSERT(inode_is_locked(ctx->inode));
+	ASSERT(inode_is_locked(&ctx->inode->vfs_inode));
 
 	list_for_each_entry_safe(ordered, tmp, &ctx->ordered_extents, log_list) {
 		list_del_init(&ordered->log_list);
@@ -5908,7 +5908,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans,
 			if (ret < 0) {
 				return ret;
 			} else if (ret > 0 &&
-				   other_ino != btrfs_ino(BTRFS_I(ctx->inode))) {
+				   other_ino != btrfs_ino(ctx->inode)) {
 				if (ins_nr > 0) {
 					ins_nr++;
 				} else {
@@ -7570,7 +7570,7 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
 			goto out;
 	}
 
-	btrfs_init_log_ctx(&ctx, &inode->vfs_inode);
+	btrfs_init_log_ctx(&ctx, inode);
 	ctx.logging_new_name = true;
 	btrfs_init_log_ctx_scratch_eb(&ctx);
 	/*
diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h
index 22e9cbc81577..fa0a689259b1 100644
--- a/fs/btrfs/tree-log.h
+++ b/fs/btrfs/tree-log.h
@@ -37,7 +37,7 @@ struct btrfs_log_ctx {
 	bool logging_new_delayed_dentries;
 	/* Indicate if the inode being logged was logged before. */
 	bool logged_before;
-	struct inode *inode;
+	struct btrfs_inode *inode;
 	struct list_head list;
 	/* Only used for fast fsyncs. */
 	struct list_head ordered_extents;
@@ -55,7 +55,7 @@ struct btrfs_log_ctx {
 	struct extent_buffer *scratch_eb;
 };
 
-void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct inode *inode);
+void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, struct btrfs_inode *inode);
 void btrfs_init_log_ctx_scratch_eb(struct btrfs_log_ctx *ctx);
 void btrfs_release_log_ctx_extents(struct btrfs_log_ctx *ctx);
 
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure
  2024-05-20  9:46  2% ` [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths fdmanana
@ 2024-05-20  9:46  1%   ` fdmanana
  2024-05-20 10:20  1%     ` Qu Wenruo
  2024-05-20  9:46  1%   ` [PATCH v4 2/6] btrfs: make btrfs_finish_ordered_extent() return void fdmanana
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

If a write path in COW mode fails, either before submitting a bio for the
new extents or an actual IO error happens, we can end up allowing a fast
fsync to log file extent items that point to unwritten extents.

This is because dropping the extent maps happens when completing ordered
extents, at btrfs_finish_one_ordered(), and the completion of an ordered
extent is executed in a work queue.

This can result in a fast fsync starting to log file extent items based
on existing extent maps before the ordered extents complete, therefore
resulting in a log that has file extent items that point to unwritten
extents, resulting in a corrupt file if a crash happens afterwards and the log
tree is replayed the next time the fs is mounted.

This can happen for both direct IO writes and buffered writes.

For example consider a direct IO write, in COW mode, that fails at
btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an
error:

1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter
   set to false, meaning an error happened;

2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR
   flag;

3) btrfs_finish_ordered_extent() queues the completion of the ordered
   extent - so that btrfs_finish_one_ordered() will be executed later in
   a work queue. That function will drop extent maps in the range when
   it's executed, since the extent maps point to unwritten locations
   (signaled by the BTRFS_ORDERED_IOERR flag);

4) After calling btrfs_finish_ordered_extent() we keep going down the
   write path and unlock the inode;

5) After that a fast fsync starts and locks the inode;

6) Before the work queue executes btrfs_finish_one_ordered(), the fsync
   task sees the extent maps that point to the unwritten locations and
   logs file extent items based on them - it does not know they are
   unwritten, and the fast fsync path does not wait for ordered extents
   to complete, which is an intentional behaviour in order to reduce
   latency.

For the buffered write case, here's one example:

1) A fast fsync begins, and it starts by flushing delalloc and waiting for
   the writeback to complete by calling filemap_fdatawait_range();

2) Flushing the delalloc created a new extent map X;

3) During the writeback some IO error happened, and at the end io callback
   (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which
   sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its
   completion;

4) After queuing the ordered extent completion, the end io callback clears
   the writeback flag from all pages (or folios), and from that moment the
   fast fsync can proceed;

5) The fast fsync proceeds, sees extent map X and logs a file extent item
   based on extent map X, resulting in a log that points to an unwritten
   data extent - because the ordered extent completion hasn't run yet, it
   happens only after the logging.

To fix this make btrfs_finish_ordered_extent() set the inode flag
BTRFS_INODE_COW_WRITE_ERROR in case an error happened for a COW write,
so that a fast fsync will wait for ordered extent completion.

Note that this issue of using extent maps that point to unwritten
locations cannot happen for reads, because in read paths we start by
locking the extent range and waiting for any ordered extents in the
range to complete before looking for extent maps.
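
(As a rough sketch of that read-side serialization - simplified
pseudo-code, not lifted verbatim from the read path:

    lock_extent(&inode->io_tree, start, end, &cached);
    ordered = btrfs_lookup_ordered_range(inode, start, len);
    if (ordered) {
            /* wait for completion, then retry the lookup */
            btrfs_start_ordered_extent(ordered);
            ...
    }
    /* only now is it safe to trust the extent maps */

so reads never observe the stale extent maps that a fast fsync could.)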

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/btrfs_inode.h  | 10 ++++++++++
 fs/btrfs/file.c         | 16 ++++++++++++++++
 fs/btrfs/ordered-data.c | 31 +++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3c8bc7a8ebdd..46db4027bf15 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,6 +112,16 @@ enum {
 	 * done at new_simple_dir(), called from btrfs_lookup_dentry().
 	 */
 	BTRFS_INODE_ROOT_STUB,
+	/*
+	 * Set if an error happened when doing a COW write before submitting a
+	 * bio or during writeback. Used for both buffered writes and direct IO
+	 * writes. This is to signal a fast fsync that it has to wait for
+	 * ordered extents to complete and therefore not log extent maps that
+	 * point to unwritten extents (when an ordered extent completes and it
+	 * has the BTRFS_ORDERED_IOERR flag set, it drops extent maps in its
+	 * range).
+	 */
+	BTRFS_INODE_COW_WRITE_ERROR,
 };
 
 /* in memory btrfs inode */
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0c7c1b42028e..00670596bf06 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1885,6 +1885,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	 */
 	if (full_sync || btrfs_is_zoned(fs_info)) {
 		ret = btrfs_wait_ordered_range(inode, start, len);
+		clear_bit(BTRFS_INODE_COW_WRITE_ERROR, &BTRFS_I(inode)->runtime_flags);
 	} else {
 		/*
 		 * Get our ordered extents as soon as possible to avoid doing
@@ -1894,6 +1895,21 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		btrfs_get_ordered_extents_for_logging(BTRFS_I(inode),
 						      &ctx.ordered_extents);
 		ret = filemap_fdatawait_range(inode->i_mapping, start, end);
+		if (ret)
+			goto out_release_extents;
+
+		/*
+		 * Check and clear the BTRFS_INODE_COW_WRITE_ERROR now after
+		 * starting and waiting for writeback, because for buffered IO
+		 * it may have been set during the end IO callback
+		 * (end_bbio_data_write() -> btrfs_finish_ordered_extent()) in
+		 * case an error happened and we need to wait for ordered
+		 * extents to complete so that any extent maps that point to
+		 * unwritten locations are dropped and we don't log them.
+		 */
+		if (test_and_clear_bit(BTRFS_INODE_COW_WRITE_ERROR,
+				       &BTRFS_I(inode)->runtime_flags))
+			ret = btrfs_wait_ordered_range(inode, start, len);
 	}
 
 	if (ret)
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 44157e43fd2a..7d175d10a6d0 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -388,6 +388,37 @@ bool btrfs_finish_ordered_extent(struct btrfs_ordered_extent *ordered,
 	ret = can_finish_ordered_extent(ordered, page, file_offset, len, uptodate);
 	spin_unlock_irqrestore(&inode->ordered_tree_lock, flags);
 
+	/*
+	 * If this is a COW write it means we created new extent maps for the
+	 * range and they point to unwritten locations if we got an error either
+	 * before submitting a bio or during IO.
+	 *
+	 * We have marked the ordered extent with BTRFS_ORDERED_IOERR, and we
+	 * are queuing its completion below. During completion, at
+	 * btrfs_finish_one_ordered(), we will drop the extent maps for the
+	 * unwritten extents.
+	 *
+	 * However because completion runs in a work queue we can end up having
+	 * a fast fsync running before that. In the case of direct IO, once we
+	 * unlock the inode the fsync might start, and we queue the completion
+	 * before unlocking the inode. In the case of buffered IO when writeback
+	 * finishes (end_bbio_data_write()) we queue the completion, so if the
+	 * writeback was triggered by a fast fsync, the fsync might start
+	 * logging before ordered extent completion runs in the work queue.
+	 *
+	 * The fast fsync will log file extent items based on the extent maps it
+	 * finds, so if by the time it collects extent maps the ordered extent
+	 * completion didn't happen yet, it will log file extent items that
+	 * point to unwritten extents, resulting in a corruption if a crash
+	 * happens and the log tree is replayed. Note that a fast fsync does not
+	 * wait for completion of ordered extents in order to reduce latency.
+	 *
+	 * Set a flag in the inode so that the next fast fsync will wait for
+	 * ordered extents to complete before starting to log.
+	 */
+	if (!uptodate && !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags))
+		set_bit(BTRFS_INODE_COW_WRITE_ERROR, &inode->runtime_flags);
+
 	if (ret)
 		btrfs_queue_ordered_fn(ordered);
 	return ret;
-- 
2.43.0


^ permalink raw reply related	[relevance 1%]

* [PATCH v4 0/6] btrfs: fix logging unwritten extents after failure in write paths
    @ 2024-05-20  9:46  2% ` fdmanana
  2024-05-20  9:46  1%   ` [PATCH v4 1/6] btrfs: ensure fast fsync waits for ordered extents after a write failure fdmanana
                     ` (6 more replies)
  1 sibling, 7 replies; 200+ results
From: fdmanana @ 2024-05-20  9:46 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

There's a bug where a fast fsync can log extent maps that were not written
due to an error in a write path or during writeback. This affects both
direct IO writes and buffered writes, and besides the failure depends on
a race due to the fact that ordered extent completion happens in a work
queue and a fast fsync doesn't wait for ordered extent completion before
logging. The details are in the change log of the first patch.

V4: Use a slightly different approach to avoid a deadlock on the inode's
    spinlock due to it being used both in irq and non-irq context, pointed
    out by Qu.
    Added some cleanup patches (patches 3, 4, 5 and 6).

V3: Change the approach of patch 1/2 to not drop extent maps at
    btrfs_finish_ordered_extent() since that runs in irq context and
    dropping an extent map range triggers NOFS extent map allocations,
    which can trigger a reclaim and that can't run in irq context.
    Updated comments and changelog to distinguish differences between
    failures for direct IO writes and buffered writes.

V2: Rework solution since other error paths caused the same problem, make
    it more generic.
    Added more details to change log and comment about what's going on,
    and why reads aren't affected.

    https://lore.kernel.org/linux-btrfs/cover.1715798440.git.fdmanana@suse.com/

V1: https://lore.kernel.org/linux-btrfs/cover.1715688057.git.fdmanana@suse.com/

Filipe Manana (6):
  btrfs: ensure fast fsync waits for ordered extents after a write failure
  btrfs: make btrfs_finish_ordered_extent() return void
  btrfs: use a btrfs_inode in the log context (struct btrfs_log_ctx)
  btrfs: pass a btrfs_inode to btrfs_fdatawrite_range()
  btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()
  btrfs: use a btrfs_inode local variable at btrfs_sync_file()

 fs/btrfs/btrfs_inode.h      | 10 ++++++
 fs/btrfs/file.c             | 63 ++++++++++++++++++++++---------------
 fs/btrfs/file.h             |  2 +-
 fs/btrfs/free-space-cache.c |  4 +--
 fs/btrfs/inode.c            | 16 +++++-----
 fs/btrfs/ordered-data.c     | 40 ++++++++++++++++++++---
 fs/btrfs/ordered-data.h     |  4 +--
 fs/btrfs/reflink.c          |  8 ++---
 fs/btrfs/relocation.c       |  2 +-
 fs/btrfs/tree-log.c         | 10 +++---
 fs/btrfs/tree-log.h         |  4 +--
 11 files changed, 108 insertions(+), 55 deletions(-)

-- 
2.43.0


^ permalink raw reply	[relevance 2%]

* Re: [PATCH] btrfs: do not clear page dirty at extent_write_cache_pages()
  2024-05-20  6:04  1% ` Qu Wenruo
@ 2024-05-20  6:28  1%   ` Qu Wenruo
  0 siblings, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-20  6:28 UTC (permalink / raw)
  To: linux-btrfs



On 2024/5/20 15:34, Qu Wenruo wrote:
> 
> 
> On 2024/5/20 13:00, Qu Wenruo wrote:
>> [PROBLEM]
>> Currently we call folio_clear_dirty_for_io() for the locked dirty folio
>> inside extent_write_cache_pages().
>>
>> However this call is not really subpage aware; it's from the older days
>> when one page could only have one sector.
>>
>> But with today's subpage support, we can have multiple sectors inside
>> one page, thus if we clear the whole page dirty flag, it would make the
>> subpage and page dirty flags desynchronize.
>>
>> Thankfully this is not a big deal, as our current subpage routine always
>> calls __extent_writepage_io() for all the subpage dirty ranges, thus
>> ensuring there is no subpage range left dirty.
>>
>> [FIX]
>> So here we just drop the folio_clear_dirty_for_io() call, and let
>> __extent_writepage_io() and extent_clear_unlock_delalloc() (which is for
>> the compression path) handle the dirty page and subpage dirty clearing.
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> Please drop the patch.
> 
> Weirdly with this one, generic/027 would hang on locking the page...

More weirdly, this only happens for aarch64 subpage cases...

> 
> Thanks,
> Qu
>> ---
>> This patch is independent from the subpage zoned fixes, thus it can be
>> applied either before or after the subpage zoned fixes.
>> ---
>>   fs/btrfs/extent_io.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 7275bd919a3e..a8fc0fcfa69f 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2231,8 +2231,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
>>                   folio_wait_writeback(folio);
>>               }
>> -            if (folio_test_writeback(folio) ||
>> -                !folio_clear_dirty_for_io(folio)) {
>> +            if (folio_test_writeback(folio)) {
>>                   folio_unlock(folio);
>>                   continue;
>>               }
> 


* Re: [PATCH] btrfs: do not clear page dirty at extent_write_cache_pages()
  2024-05-20  3:30  1% [PATCH] btrfs: do not clear page dirty at extent_write_cache_pages() Qu Wenruo
@ 2024-05-20  6:04  1% ` Qu Wenruo
  2024-05-20  6:28  1%   ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-20  6:04 UTC (permalink / raw)
  To: linux-btrfs



On 2024/5/20 13:00, Qu Wenruo wrote:
> [PROBLEM]
> Currently we call folio_clear_dirty_for_io() for the locked dirty folio
> inside extent_write_cache_pages().
> 
> However this call is not really subpage aware; it dates from the older
> days when one page could only contain one sector.
>
> But with today's subpage support, we can have multiple sectors inside
> one page, so if we clear the whole page dirty flag, the subpage and
> page dirty flags become desynchronized.
>
> Thankfully this is not a big deal, as our current subpage routine
> always calls __extent_writepage_io() for all the subpage dirty ranges,
> thus ensuring there is no subpage dirty range left.
>
> [FIX]
> So here we just drop the folio_clear_dirty_for_io() call, and let
> __extent_writepage_io() and extent_clear_unlock_delalloc() (which is
> for the compression path) handle the dirty page and subpage clearing.
> 
> Signed-off-by: Qu Wenruo <wqu@suse.com>

Please drop the patch.

Weirdly with this one, generic/027 would hang on locking the page...

Thanks,
Qu
> ---
> This patch is independent from the subpage zoned fixes, thus it can be
> applied either before or after the subpage zoned fixes.
> ---
>   fs/btrfs/extent_io.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 7275bd919a3e..a8fc0fcfa69f 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2231,8 +2231,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
>   				folio_wait_writeback(folio);
>   			}
>   
> -			if (folio_test_writeback(folio) ||
> -			    !folio_clear_dirty_for_io(folio)) {
> +			if (folio_test_writeback(folio)) {
>   				folio_unlock(folio);
>   				continue;
>   			}


* [PATCH] btrfs: enhance function extent_range_clear_dirty_for_io()
@ 2024-05-20  3:55  1% Qu Wenruo
  2024-05-20 10:51  1% ` Filipe Manana
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-20  3:55 UTC (permalink / raw)
  To: linux-btrfs

Enhance the function by:

- Moving it to inode.c
  As its only user is inside compress_file_range(), there is no need to
  export it through extent_io.h.

- Adding extra error handling
  Previously we would BUG_ON() if we could not find a page inside the
  range. Now we downgrade it to an ASSERT(), as a missing page really
  means a logic error, since we should already have all the pages
  locked.

- Making it subpage compatible
  Although compression currently happens on a full page basis even for
  the subpage routine, there is no harm in making it subpage compatible
  now.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 15 ---------------
 fs/btrfs/extent_io.h |  1 -
 fs/btrfs/inode.c     | 31 ++++++++++++++++++++++++++++++-
 3 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a8fc0fcfa69f..9a6f369945c6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -164,21 +164,6 @@ void __cold extent_buffer_free_cachep(void)
 	kmem_cache_destroy(extent_buffer_cache);
 }
 
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
-{
-	unsigned long index = start >> PAGE_SHIFT;
-	unsigned long end_index = end >> PAGE_SHIFT;
-	struct page *page;
-
-	while (index <= end_index) {
-		page = find_get_page(inode->i_mapping, index);
-		BUG_ON(!page); /* Pages should be in the extent_io_tree */
-		clear_page_dirty_for_io(page);
-		put_page(page);
-		index++;
-	}
-}
-
 static void process_one_page(struct btrfs_fs_info *fs_info,
 			     struct page *page, struct page *locked_page,
 			     unsigned long page_ops, u64 start, u64 end)
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index dca6b12769ec..7c2f1bbc6b67 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -350,7 +350,6 @@ void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
 void set_extent_buffer_dirty(struct extent_buffer *eb);
 void set_extent_buffer_uptodate(struct extent_buffer *eb);
 void clear_extent_buffer_uptodate(struct extent_buffer *eb);
-void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end);
 void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct page *locked_page,
 				  struct extent_state **cached,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 000809e16aba..541a719284a9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -890,6 +890,32 @@ static inline void inode_should_defrag(struct btrfs_inode *inode,
 		btrfs_add_inode_defrag(NULL, inode, small_write);
 }
 
+static int extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
+{
+	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
+	const u64 len = end + 1 - start;
+	unsigned long end_index = end >> PAGE_SHIFT;
+	bool missing_folio = false;
+
+	/* We should not have such a large range. */
+	ASSERT(len < U32_MAX);
+	for (unsigned long index = start >> PAGE_SHIFT;
+	     index <= end_index; index++) {
+		struct folio *folio;
+
+		folio = filemap_get_folio(inode->i_mapping, index);
+		if (IS_ERR(folio)) {
+			missing_folio = true;
+			continue;
+		}
+		btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
+		folio_put(folio);
+	}
+	if (missing_folio)
+		return -ENOENT;
+	return 0;
+}
+
 /*
  * Work queue call back to started compression on a file and pages.
  *
@@ -931,7 +957,10 @@ static void compress_file_range(struct btrfs_work *work)
 	 * Otherwise applications with the file mmap'd can wander in and change
 	 * the page contents while we are compressing them.
 	 */
-	extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+	ret = extent_range_clear_dirty_for_io(&inode->vfs_inode, start, end);
+
+	/* We have locked all the involved pages, shouldn't hit a missing page. */
+	ASSERT(ret == 0);
 
 	/*
 	 * We need to save i_size before now because it could change in between
-- 
2.45.1
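
A side note on the conversion above: unlike find_get_page(), which
returns NULL on a miss, filemap_get_folio() returns an ERR_PTR
(-ENOENT when no folio is present), hence the IS_ERR() check in the
new code. A minimal usage sketch (kernel-style C, illustrative only):

    folio = filemap_get_folio(mapping, index);
    if (IS_ERR(folio))
            return PTR_ERR(folio);  /* -ENOENT if the folio is absent */
    /* ... use the folio ... */
    folio_put(folio);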



* [PATCH] btrfs: do not clear page dirty at extent_write_cache_pages()
@ 2024-05-20  3:30  1% Qu Wenruo
  2024-05-20  6:04  1% ` Qu Wenruo
  0 siblings, 1 reply; 200+ results
From: Qu Wenruo @ 2024-05-20  3:30 UTC (permalink / raw)
  To: linux-btrfs

[PROBLEM]
Currently we call folio_clear_dirty_for_io() for the locked dirty folio
inside extent_write_cache_pages().

However this call is not really subpage aware; it dates from the older
days when one page could only contain one sector.

But with today's subpage support, we can have multiple sectors inside
one page, so if we clear the whole page dirty flag, the subpage and
page dirty flags become desynchronized.

Thankfully this is not a big deal, as our current subpage routine
always calls __extent_writepage_io() for all the subpage dirty ranges,
thus ensuring there is no subpage dirty range left.

[FIX]
So here we just drop the folio_clear_dirty_for_io() call, and let
__extent_writepage_io() and extent_clear_unlock_delalloc() (which is
for the compression path) handle the dirty page and subpage clearing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
This patch is independent from the subpage zoned fixes, thus it can be
applied either before or after the subpage zoned fixes.
---
 fs/btrfs/extent_io.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7275bd919a3e..a8fc0fcfa69f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2231,8 +2231,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				folio_wait_writeback(folio);
 			}
 
-			if (folio_test_writeback(folio) ||
-			    !folio_clear_dirty_for_io(folio)) {
+			if (folio_test_writeback(folio)) {
 				folio_unlock(folio);
 				continue;
 			}
-- 
2.45.1



* Re: [PATCH] btrfs: move btrfs_block_group_root to block-group.c
    2024-05-19 23:29  1% ` Qu Wenruo
@ 2024-05-20  1:35  1% ` Naohiro Aota
  1 sibling, 0 replies; 200+ results
From: Naohiro Aota @ 2024-05-20  1:35 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Sun, May 19, 2024 at 08:20:41AM GMT, Anand Jain wrote:
> The function btrfs_block_group_root() is declared in disk-io.c; however,
> all its callers are in block-group.c. Move it to the latter file and
> declare it static.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> ---

Looks reasonable.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>

^ permalink raw reply	[relevance 1%]

* Re: [PATCH] btrfs: move btrfs_block_group_root to block-group.c
  @ 2024-05-19 23:29  1% ` Qu Wenruo
  2024-05-20  1:35  1% ` Naohiro Aota
  1 sibling, 0 replies; 200+ results
From: Qu Wenruo @ 2024-05-19 23:29 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 2024/5/19 09:50, Anand Jain wrote:
> The function btrfs_block_group_root() is declared in disk-io.c; however,
> all its callers are in block-group.c. Move it to the latter file and
> declare it static.
>
> Signed-off-by: Anand Jain <anand.jain@oracle.com>

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu
> ---
>   fs/btrfs/block-group.c | 7 +++++++
>   fs/btrfs/disk-io.c     | 7 -------
>   fs/btrfs/disk-io.h     | 1 -
>   3 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 1e09aeea69c2..9910bae89966 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1022,6 +1022,13 @@ static void clear_incompat_bg_bits(struct btrfs_fs_info *fs_info, u64 flags)
>   	}
>   }
>
> +static struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info)
> +{
> +	if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))
> +		return fs_info->block_group_root;
> +	return btrfs_extent_root(fs_info, 0);
> +}
> +
>   static int remove_block_group_item(struct btrfs_trans_handle *trans,
>   				   struct btrfs_path *path,
>   				   struct btrfs_block_group *block_group)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index e6bf895b3547..94b95836f61f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -846,13 +846,6 @@ struct btrfs_root *btrfs_extent_root(struct btrfs_fs_info *fs_info, u64 bytenr)
>   	return btrfs_global_root(fs_info, &key);
>   }
>
> -struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info)
> -{
> -	if (btrfs_fs_compat_ro(fs_info, BLOCK_GROUP_TREE))
> -		return fs_info->block_group_root;
> -	return btrfs_extent_root(fs_info, 0);
> -}
> -
>   struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
>   				     u64 objectid)
>   {
> diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
> index 76eb53fe7a11..1f93feae1872 100644
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -83,7 +83,6 @@ struct btrfs_root *btrfs_global_root(struct btrfs_fs_info *fs_info,
>   				     struct btrfs_key *key);
>   struct btrfs_root *btrfs_csum_root(struct btrfs_fs_info *fs_info, u64 bytenr);
>   struct btrfs_root *btrfs_extent_root(struct btrfs_fs_info *fs_info, u64 bytenr);
> -struct btrfs_root *btrfs_block_group_root(struct btrfs_fs_info *fs_info);
>
>   void btrfs_free_fs_info(struct btrfs_fs_info *fs_info);
>   void btrfs_btree_balance_dirty(struct btrfs_fs_info *fs_info);


* Re: system drive corruption, btrfs check failure
  @ 2024-05-19  2:17  0%   ` Jared Van Bortel
  0 siblings, 0 replies; 200+ results
From: Jared Van Bortel @ 2024-05-19  2:17 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

On Sat, 2024-03-30 at 10:12 +1030, Qu Wenruo wrote:
> 
> 
> On 2024/3/30 04:00, Jared Van Bortel wrote:
> > Hi,
> > 
> > Yesterday I ran `pacman -Syu` to update my Arch Linux installation.
> > I
> > saw a lot of complaints from ldconfig, and programs started
> > crashing.
> > Thinking it was related to having only 7GiB of free space available,
> > I
> > tried deleting some large files and reinstalling the affected
> > packages. I saw no clear improvement from this, and eventually
> > decided
> > to shut my computer down.
> 
> Do you have any dmesg of that incident?

Hi, sorry for the delay. I finally got around to running the lowmem
check on the old drives.

Firstly, there was nothing relevant in dmesg. When I first saw your
reply, I checked the system journal from the time of the incident and
there was nothing disk-related from the kernel between mount and
shutdown - medium errors, btrfs I/O errors, or anything like that.

> 
> > 
> > I booted memtest, and it completed a full pass without errors. I
> > then
> > booted a live USB and ran `btrfs check --readonly /dev/nvme0n1p2`,
> > and
> > saw a long list of errors, realizing my filesystem is most likely
> > beyond repair.
> > 
> > Basic information (RAID1 metadata, single data):
> > ```
> > Label: 'System'  uuid: 76721faa-8c32-4e70-8a9e-859dece0aec1
> > Total devices 2 FS bytes used 2.18TiB
> > devid    1 size 422.63GiB used 422.63GiB path /dev/nvme0n1p2
> > devid    2 size 1.82TiB used 1.82TiB path /dev/nvme1n1
> > ```
> > The installed kernel is linux-zen 6.6.10 with a few patches. The
> > live
> > USB I'm using has the Arch Linux 6.4.7-arch1-1 kernel. Full `btrfs
> > check` log and smartctl information is attached.
> > 
> > There are three main errors. One:
> > ```
> > ref mismatch on [1248293634048 16384] extent item 1, found 0
> > tree extent[1248293634048, 16384] parent 2368656916480 has no tree
> > block found
> > incorrect global backref count on 1248293634048 found 1 wanted 0
> > backpointer mismatch on [1248293634048 16384]
> > owner ref check failed [1248293634048 16384]
> > ```
> > 
> > Two:
> > ```
> > ref mismatch on [1261902016512 4096] extent item 2, found 1
> > data extent[1261902016512, 4096] bytenr mimsmatch, extent item
> > bytenr
> > 1261902016512 file item bytenr 0
> > data extent[1261902016512, 4096] referencer count mismatch (parent
> > 2369673248768) wanted 1 have 0
> > backpointer mismatch on [1261902016512 4096]
> > ```
> 
> Corrupted extent tree; this can lead to the fs falling back to
> read-only halfway.

This fs actually still mounts writable without any issue, FWIW, although
the error counters are not zeroed:

bdev /dev/nvme0n1p2 errs: wr 51, rd 0, flush 0, corrupt 0, gen 5

It's not clear to me when these errors occurred - wouldn't they have
been logged to dmesg at the time?

> > 
> > Three:
> > ```
> > block group 1342751899648 has wrong amount of free space, free space
> > cache has 34193408 block group has 42893312
> > failed to load free space cache for block group 1342751899648
> > ```
> 
> This is not that uncommon if the extent tree is already corrupted.
> 
> But unfortunately, this may not be the direct/root cause of the
> corruption.
> 
> Thus I'd prefer to have the initial dmesg.
> 
> > 
> > And this warning:
> > ```
> > [4/7] checking fs roots
> > warning line 3916
> > ```
> > 
> > I bought some replacement disks that I can install alongside the old
> > ones, but I don't have a recent backup of the full FS. It seems to
> > mount readonly without issue.
> 
> The fs tree is mostly fine, so you can mount it RO and grab your data.
> 
> > 
> > What's the best way to recover the data that's left? And is there
> > any
> > clue here as to what went wrong? I'm really not sure. If this is a
> > drive failure, it seems premature.
> 
> It's hard to say. The old original-mode check output is not that
> helpful for locating the root cause.
> 
> Would you mind running "btrfs check --mode=lowmem" on that fs, and
> saving both stderr and stdout?

Here is the output:

$ sudo btrfs check --mode=lowmem /dev/nvme0n1p2
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p2
UUID: 76721faa-8c32-4e70-8a9e-859dece0aec1
[1/7] checking root items
[2/7] checking extents
ERROR: shared extent[1248293634048 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248304054272 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248304807936 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248315129856 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248324468736 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248588103680 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248685195264 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248901791744 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248939687936 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248953860096 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248962543616 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248966279168 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248966934528 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1248978468864 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249032519680 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249075806208 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249310851072 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249319256064 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249401454592 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249661976576 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249796259840 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1249802420224 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1250339143680 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1250870542336 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1250975170560 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251038543872 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251216883712 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251319889920 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251321462784 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251324887040 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1251325100032 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1253394546688 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1254622068736 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1255368310784 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1255986282496 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1255986331648 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256068186112 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256145797120 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256190967808 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256305262592 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256650244096 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1256696086528 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258225631232 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967351296 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967367680 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967400448 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967482368 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967515136 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967531520 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1258967564288 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1259028234240 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1259345543168 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1259767791616 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260607963136 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260611665920 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260611862528 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260617236480 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260620627968 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260707020800 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260716212224 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260718686208 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260725223424 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260836290560 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260837158912 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260838322176 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260840239104 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260841402368 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260841910272 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260842811392 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260843696128 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260848300032 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260850085888 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260851412992 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260852494336 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260854067200 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260856000512 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260856590336 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260857556992 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260859555840 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260861636608 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260861898752 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260863062016 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[1260863815680 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent 1284548759552 referencer lost (parent: 1252056268800)
ERROR: shared extent 1287628115968 referencer lost (parent: 1252056268800)
ERROR: shared extent 1287628156928 referencer lost (parent: 1252056268800)
ERROR: shared extent 1291874332672 referencer lost (parent: 1252056268800)
ERROR: shared extent 1291892744192 referencer lost (parent: 1252056268800)
ERROR: shared extent 1291893682176 referencer lost (parent: 1252056268800)
ERROR: shared extent 1302641364992 referencer lost (parent: 1252056268800)
ERROR: shared extent 1302646538240 referencer lost (parent: 1252056268800)
ERROR: shared extent 1302646906880 referencer lost (parent: 1252056268800)
ERROR: shared extent 1302725738496 referencer lost (parent: 1252056268800)
ERROR: shared extent 1307336536064 referencer lost (parent: 1252056268800)
ERROR: shared extent 1307421573120 referencer lost (parent: 1252056268800)
ERROR: shared extent 1318061346816 referencer lost (parent: 1252056268800)
ERROR: shared extent 1321886617600 referencer lost (parent: 1252056268800)
ERROR: shared extent 1548210819072 referencer lost (parent: 1252056268800)
ERROR: shared extent 1564826959872 referencer lost (parent: 1252056268800)
ERROR: shared extent 1609847508992 referencer lost (parent: 1252056268800)
ERROR: shared extent 1609915379712 referencer lost (parent: 1252056268800)
ERROR: shared extent 1611724808192 referencer lost (parent: 1252056268800)
ERROR: shared extent 1611749105664 referencer lost (parent: 1252056268800)
ERROR: shared extent 1611749363712 referencer lost (parent: 1252056268800)
ERROR: shared extent 1620880302080 referencer lost (parent: 1252056268800)
ERROR: shared extent 1620899688448 referencer lost (parent: 1252056268800)
ERROR: shared extent 1620944224256 referencer lost (parent: 1252056268800)
ERROR: shared extent 1623958888448 referencer lost (parent: 1252056268800)
ERROR: shared extent 1624013213696 referencer lost (parent: 1252056268800)
ERROR: shared extent 1624013307904 referencer lost (parent: 1252056268800)
ERROR: shared extent[2368230686720 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2368279707648 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2368409845760 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2368417906688 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2368656982016 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369673248768 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369795997696 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369801355264 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369801568256 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369801666560 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369807466496 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369807548416 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369808187392 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369808678912 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369808990208 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369809530880 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369809661952 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369809743872 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369811316736 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369811365888 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369811562496 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369811906560 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369811972096 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369812234240 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369812365312 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369813331968 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369813610496 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369814020096 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369816100864 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369816231936 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369816788992 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817116672 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817182208 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817247744 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817296896 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817395200 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817509888 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817542656 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369817853952 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818001408 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818116096 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818148864 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818263552 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818492928 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818640384 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369818869760 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369819623424 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369819869184 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369821540352 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369821622272 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369821949952 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369822572544 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369823145984 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369823195136 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369823621120 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369824112640 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2369863860224 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2370334277632 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[2370373042176 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949035106304 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949037793280 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949039235072 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949053538304 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949401337856 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949457502208 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949804662784 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949804761088 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949804777472 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949804793856 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: shared extent[4949849210880 16384] lost its parent (parent: 2368656916480, level: 0)
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
block group 1342751899648 has wrong amount of free space, free space cache has 34193408 block group has 42893312
failed to load free space cache for block group 1342751899648
block group 1343825641472 has wrong amount of free space, free space cache has 23257088 block group has 25903104
failed to load free space cache for block group 1343825641472
block group 1351341834240 has wrong amount of free space, free space cache has 22396928 block group has 42348544
failed to load free space cache for block group 1351341834240
block group 1566090199040 has wrong amount of free space, free space cache has 48562176 block group has 50651136
failed to load free space cache for block group 1566090199040
block group 1572532649984 has wrong amount of free space, free space cache has 12173312 block group has 15945728
failed to load free space cache for block group 1572532649984
block group 1580048842752 has wrong amount of free space, free space cache has 29745152 block group has 33087488
failed to load free space cache for block group 1580048842752
block group 1584343810048 has wrong amount of free space, free space cache has 56512512 block group has 58601472
failed to load free space cache for block group 1584343810048
block group 1602597421056 has wrong amount of free space, free space cache has 87953408 block group has 90349568
failed to load free space cache for block group 1602597421056
block group 1744331341824 has wrong amount of free space, free space cache has 339968 block group has 393216
failed to load free space cache for block group 1744331341824
block group 2666675568640 has wrong amount of free space, free space cache has 602112 block group has 782336
failed to load free space cache for block group 2666675568640
block group 2909341220864 has wrong amount of free space, free space cache has 65536 block group has 151552
failed to load free space cache for block group 2909341220864
block group 3904699891712 has wrong amount of free space, free space cache has 172032 block group has 221184
failed to load free space cache for block group 3904699891712
block group 3941207113728 has wrong amount of free space, free space cache has 1728512 block group has 1826816
failed to load free space cache for block group 3941207113728
block group 4085088518144 has wrong amount of free space, free space cache has 5697536 block group has 5754880
failed to load free space cache for block group 4085088518144
block group 4241854824448 has wrong amount of free space, free space cache has 23293952 block group has 28966912
failed to load free space cache for block group 4241854824448
block group 4838855278592 has wrong amount of free space, free space cache has 86016 block group has 118784
failed to load free space cache for block group 4838855278592
block group 4847445213184 has wrong amount of free space, free space cache has 49152 block group has 110592
failed to load free space cache for block group 4847445213184
block group 4897911078912 has wrong amount of free space, free space cache has 7475200 block group has 7577600
failed to load free space cache for block group 4897911078912
block group 5010008178688 has wrong amount of free space, free space cache has 69632 block group has 106496
failed to load free space cache for block group 5010008178688
block group 5062655082496 has wrong amount of free space, free space cache has 5836800 block group has 5890048
failed to load free space cache for block group 5062655082496
block group 5268813512704 has wrong amount of free space, free space cache has 135168 block group has 221184
failed to load free space cache for block group 5268813512704
[4/7] checking fs roots
ERROR: root 259 EXTENT_DATA[1522634 4096] gap exists, expected: EXTENT_DATA[1522634 128]
ERROR: root 259 EXTENT_DATA[1522636 4096] gap exists, expected: EXTENT_DATA[1522636 128]
ERROR: root 407 EXTENT_DATA[398831 4096] gap exists, expected: EXTENT_DATA[398831 25]
ERROR: root 407 EXTENT_DATA[398973 4096] gap exists, expected: EXTENT_DATA[398973 25]
ERROR: root 407 EXTENT_DATA[398975 4096] gap exists, expected: EXTENT_DATA[398975 25]
ERROR: root 407 EXTENT_DATA[398976 4096] gap exists, expected: EXTENT_DATA[398976 25]
ERROR: root 407 EXTENT_DATA[418307 4096] gap exists, expected: EXTENT_DATA[418307 25]
ERROR: root 407 EXTENT_DATA[418316 4096] gap exists, expected: EXTENT_DATA[418316 25]
ERROR: root 407 EXTENT_DATA[418317 4096] gap exists, expected: EXTENT_DATA[418317 25]
ERROR: root 407 EXTENT_DATA[420660 4096] gap exists, expected: EXTENT_DATA[420660 25]
ERROR: root 407 EXTENT_DATA[420673 4096] gap exists, expected: EXTENT_DATA[420673 25]
ERROR: root 407 EXTENT_DATA[439382 4096] gap exists, expected: EXTENT_DATA[439382 25]
ERROR: root 407 EXTENT_DATA[439383 4096] gap exists, expected: EXTENT_DATA[439383 25]
ERROR: root 407 EXTENT_DATA[451252 4096] gap exists, expected: EXTENT_DATA[451252 25]
ERROR: root 407 EXTENT_DATA[451264 4096] gap exists, expected: EXTENT_DATA[451264 25]
ERROR: root 407 EXTENT_DATA[451265 4096] gap exists, expected: EXTENT_DATA[451265 25]
ERROR: root 407 EXTENT_DATA[452326 4096] gap exists, expected: EXTENT_DATA[452326 25]
ERROR: root 407 EXTENT_DATA[452332 4096] gap exists, expected: EXTENT_DATA[452332 25]
ERROR: root 407 EXTENT_DATA[452339 4096] gap exists, expected: EXTENT_DATA[452339 25]
ERROR: root 407 EXTENT_DATA[4293157 4096] gap exists, expected: EXTENT_DATA[4293157 25]
ERROR: root 407 EXTENT_DATA[4293570 4096] gap exists, expected: EXTENT_DATA[4293570 25]
ERROR: root 407 EXTENT_DATA[4293571 4096] gap exists, expected: EXTENT_DATA[4293571 25]
ERROR: root 407 EXTENT_DATA[4293572 4096] gap exists, expected: EXTENT_DATA[4293572 25]
ERROR: root 407 EXTENT_DATA[4302136 4096] gap exists, expected: EXTENT_DATA[4302136 25]
ERROR: root 407 EXTENT_DATA[4302148 4096] gap exists, expected: EXTENT_DATA[4302148 25]
ERROR: root 407 EXTENT_DATA[4302149 4096] gap exists, expected: EXTENT_DATA[4302149 25]
ERROR: root 407 EXTENT_DATA[4302150 4096] gap exists, expected: EXTENT_DATA[4302150 25]
ERROR: root 407 EXTENT_DATA[5970391 4096] gap exists, expected: EXTENT_DATA[5970391 25]
ERROR: errors found in fs roots
found 2397613547520 bytes used, error(s) found
total csum bytes: 1840478932
total tree bytes: 13337329664
total fs tree bytes: 10208378880
total extent tree bytes: 874070016
btree space waste bytes: 2240708820
file data blocks allocated: 24819271946240
 referenced 2695187488768


Hopefully that means something to you. I'm still curious to know to what
degree I should still trust these drives if I were to wipe the fs and
start over. I suppose I could run a SMART test or something, right?

Thanks,
Jared

> 
> Thanks,
> Qu
> 
> > 
> > Thanks,
> > Jared


