* [PATCH 00/13] btrfs: zoned: fix active zone tracking issues
@ 2022-07-04  4:58 Naohiro Aota
  2022-07-04  4:58 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
                   ` (13 more replies)
  0 siblings, 14 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

This series mainly addresses two issues in the active zone tracking of
zoned btrfs, plus one issue that the main fixes depend on.

* Background

A ZNS drive has an upper limit on the number of zones that can be written
simultaneously. We call this limit max_active_zones. An active zone is
deactivated when we fully write the zone, or when we explicitly send a
REQ_OP_ZONE_FINISH command to make it full.

Zoned btrfs must be aware of max_active_zones to use a ZNS drive. So, we
have an active zone tracking system that considers a block group active
iff its underlying zone is active. In practice, we consider a block group
(and its underlying zones) active once we start allocating from it. Then,
when the last allocatable region in the block group is written, we send a
REQ_OP_ZONE_FINISH command to each zone and consider the block group
inactive.

So, in short, we currently depend on writing fully to a zone to finish a block group.
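
For reference, explicitly finishing (and thus deactivating) a zone boils
down to a single zone management command. A minimal sketch, assuming a
zoned struct block_device *bdev and the zone's start and length in
sectors (illustration only, not code from this series):

  int ret;

  /* Finish one zone: the device marks it full and deactivates it. */
  ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_FINISH, zone_start,
                         zone_sectors, GFP_NOFS);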

* Issues
** Issue A

In certain situations, extent allocation on current zoned btrfs fails
with an early -ENOSPC on a ZNS drive. When no block group has enough
space left for the allocation, it tries to allocate a new block group,
which is possible only if we can activate a new zone. If we cannot, it
returns -ENOSPC even though the device still has free space left.
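
In code terms, the zoned branch of can_allocate_chunk() before this
series (visible as the hunk removed in patch 7 below) bails out as soon
as no other zone can be activated:

  if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
      !btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
          return false; /* propagated as the early -ENOSPC */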

** Issue B

When doing a buffered write, we call cow_file_range() to allocate the
data extents. cow_file_range() works in an all-or-nothing manner: it
returns 0 if it can allocate the whole range, or -ENOSPC if not. Thus,
when all the block groups have only a little free space left and btrfs
cannot finish any block group, the allocation partly succeeds but fails
in the end. This also results in an early -ENOSPC.

There are situations where we cannot finish any block group. Consider
that we have 8 active data block groups (ignoring metadata/system block
groups here), each with 1 MB of free space left, and we want to do a
10 MB buffered write. We can allocate blocks for 8 of the 10 MB, after
which we can no longer allocate from any block group. Furthermore, we
cannot finish any block group either, because every block group now has
1 MB of reserved, unwritten space left. And, since these 1 MB regions are
owned by the allocating process itself, simply waiting for them to be
written won't work.

** Issue C

To address issue A, we needed to disable metadata reservation
over-commit. That revealed that we underestimate the number of extents to
be written on zoned btrfs. On zoned btrfs, we use the ZONE APPEND command
to write data, whose bio size is limited by max_zone_append_sectors and
max_segments. So, a data extent is always split into pieces at most as
large as that limit. As a result, if BTRFS_MAX_EXTENT_SIZE is larger than
the limit, we tend to have more extents than the estimation based on
BTRFS_MAX_EXTENT_SIZE predicts.

The metadata reservation is done before the allocation (e.g., at
btrfs_buffered_write()) and released afterward along with the delalloc
process or the ordered extent creation. As a result, we can run short of
the metadata reservation in certain situations, which triggers a WARN.
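
To make the mismatch concrete, here is a sketch with assumed numbers (the
1 MB ZONE APPEND limit is an example value, not taken from a real
device); BTRFS_MAX_EXTENT_SIZE is SZ_128M:

  u64 len = SZ_16M;   /* a 16 MB delalloc range to be written */

  /* Estimated at reservation time: 1 extent. */
  u64 reserved = div_u64(len + SZ_128M - 1, SZ_128M);
  /* Actually created at writeback time: 16 extents. */
  u64 written = div_u64(len + SZ_1M - 1, SZ_1M);

The reservation covers metadata updates for one extent while writeback
creates 16 ordered extents.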

* Solutions
** For issue A

Issue A is that we can have early -ENOSPC if we cannot activate another
block group and no block group has enough space left.

To avoid the early -ENOSPC, we need to choose one block group and finish
it to make room for a new block group to be activated. But that is only
possible from the data extent allocation context. From the metadata
context, we could cause a deadlock because we might need to wait for a
running transaction to make the block group being finished read-only.

So, we use two different methods for data allocation and metadata
allocation. For data allocation, we can finish a block group on demand
from the btrfs_reserve_extent() context. The block group to finish is the
one with the least free space left.

For metadata allocation, we use flush_space() to ensure that reserved bytes
can be written into active block groups. To do so, we track active block
groups' total bytes as active_total_bytes, and activate a block group
on-demand from flush_space().
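
Conceptually, the space check then treats only active block groups as
writable for zoned metadata. Patch 8 below captures this in
writable_total_bytes():

  /* On regular btrfs (and for data), all total_bytes are writable. */
  if (!btrfs_is_zoned(fs_info) ||
      (space_info->flags & BTRFS_BLOCK_GROUP_DATA))
          return space_info->total_bytes;
  return space_info->active_total_bytes;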

Also, a block group newly allocated from certain contexts must be
activated; patch 11 handles this.

** For issue B

Issue B concerns the case when we cannot allocate space from any block
group and cannot finish any block group either. This issue only occurs
when allocating a data extent, because the solution for issue A ensures
the metadata reservation is contained in active block groups.

In this case, writing out the partially allocated region will close the
gap between the allocation pointer and the capacity of the block group,
finish the zone, and open up room to activate a new block group. So, this
series implements writing out the partial allocation and retrying the
allocation.

In certain cases, we can't allocate anything from the block groups. In
that case, we expect there are on-going IOs that will eventually finish a
block group. So, we wait for them and retry the allocation.

** For issue C

Issue C is that we underestimate the number of extents to be written on
zoned btrfs, because we don't expect an ordered extent to be split by the
size of a bio.

We need to use a proper extent size limit to fix issue C. For that, we
revive fs_info->max_zone_append_size and use it to calculate
count_max_extents(). Technically, the bio size is also limited by
max_segments, so the limit is capped by that as well.
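
Condensed, the per-device limit computed in patch 2 below is:

  zone_info->max_zone_append_size =
          min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
                (u64)bdev_max_segments(bdev) << PAGE_SHIFT);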

* Patch structure
 
The fix for issue C comes first because it is a dependency of the fixes for
issue A and B.

Patches 1 to 5 address issue C by reviving fs_info->max_zone_append_size
and using it to replace BTRFS_MAX_EXTENT_SIZE on zoned btrfs.

Patches 6 to 11 address issue A. In detail, patch 7 fixes the data
allocation by finishing a block group when we cannot activate another block
group. Patch 10 fixes the metadata allocation by finishing a block group at
space reservation time.

Patches 12 and 13 address issue B by writing out the successfully
allocated part first and retrying allocation of the rest.

Naohiro Aota (13):
  block: add bdev_max_segments() helper
  btrfs: zoned: revive max_zone_append_bytes
  btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  btrfs: convert count_max_extents() to use fs_info->max_extent_size
  btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  btrfs: let can_allocate_chunk return int
  btrfs: zoned: finish least available block group on data BG allocation
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: activate metadata BG on flush_space
  btrfs: zoned: activate necessary block group
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: wait until zone is finished when allocation didn't
    progress

 fs/btrfs/block-group.c    |  23 +++++++-
 fs/btrfs/ctree.h          |  25 +++++---
 fs/btrfs/delalloc-space.c |   6 +-
 fs/btrfs/disk-io.c        |   3 +
 fs/btrfs/extent-tree.c    |  64 +++++++++++++++-----
 fs/btrfs/extent_io.c      |   3 +-
 fs/btrfs/inode.c          |  90 ++++++++++++++++++++--------
 fs/btrfs/ioctl.c          |  11 ++--
 fs/btrfs/space-info.c     |  66 +++++++++++++++++----
 fs/btrfs/space-info.h     |   4 +-
 fs/btrfs/zoned.c          | 119 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h          |  18 ++++++
 include/linux/blkdev.h    |   5 ++
 13 files changed, 368 insertions(+), 69 deletions(-)

-- 
2.35.1



* [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  6:57   ` Johannes Thumshirn
  2022-07-04  8:23   ` Christoph Hellwig
  2022-07-04  4:58 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Add bdev_max_segments() like other queue parameters.
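
For instance, a caller can derive a byte-size cap from the segment limit
(a sketch; patch 2 of this series uses the helper exactly this way):

  u64 seg_limit = (u64)bdev_max_segments(bdev) << PAGE_SHIFT;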

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 include/linux/blkdev.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f7b43444c5f..62e3ff52ab03 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1206,6 +1206,11 @@ bdev_max_zone_append_sectors(struct block_device *bdev)
 	return queue_max_zone_append_sectors(bdev_get_queue(bdev));
 }
 
+static inline unsigned int bdev_max_segments(struct block_device *bdev)
+{
+	return queue_max_segments(bdev_get_queue(bdev));
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.35.1



* [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
  2022-07-04  4:58 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:57   ` Johannes Thumshirn
                     ` (2 more replies)
  2022-07-04  4:58 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
                   ` (11 subsequent siblings)
  13 siblings, 3 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without the unnecessary ASSERT
and check. The max_zone_append_size will be used as a hint to estimate
the number of extents needed to cover a delalloc/writeback region in
later commits.

The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit takes that into account when calculating
max_zone_append_size. Technically, a bio can be larger than
queue_max_segments() * PAGE_SIZE if the pages are contiguous. But, it is
safe to consider "queue_max_segments() * PAGE_SIZE" as an upper limit of
an extent size to calculate the number of extents needed to write data.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/zoned.c | 10 ++++++++++
 fs/btrfs/zoned.h |  1 +
 3 files changed, 13 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4e2569f84aab..e4879912c475 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1071,6 +1071,8 @@ struct btrfs_fs_info {
 	 */
 	u64 zone_size;
 
+	/* Max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7a0f8fa44800..271b8b8fd4d0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -415,6 +415,9 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
+	zone_info->max_zone_append_size =
+		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
+		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
@@ -640,6 +643,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
@@ -674,6 +678,11 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -723,6 +732,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	/*
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6b2eec99162b..9caeab07fd38 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -19,6 +19,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
-- 
2.35.1



* [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
  2022-07-04  4:58 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
  2022-07-04  4:58 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:02   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

On zoned btrfs, data write-out is limited by max_zone_append_size, and a
large ordered extent is split according to the size of a bio. OTOH, the
number of extents to be written is calculated using BTRFS_MAX_EXTENT_SIZE,
and that estimated number is used to reserve the metadata bytes to update
and/or create the metadata items.

The metadata reservation is done at, e.g., btrfs_buffered_write() and
then released as the estimation changes. Thus, if the number of extents
increases massively, the reserved metadata can run out.

Such an increase in the number of extents easily occurs on zoned btrfs
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And it causes the
following warning in a small-RAM environment with metadata over-commit
disabled (done in a following patch).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.524407] Modules linked in: btrfs null_blk blake2b_generic xor
raid6_pq loop dm_flakey dm_mod algif_hash af_alg veth xt_nat xt_conntrack
xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter
iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter
br_netfilter bridge stp llc overlay sunrpc ext4 mbcache jbd2 rapl ipmi_ssif
bfq k10temp i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler zram ip_tables
ccp ast bnxt_en drm_vram_helper drm_ttm_helper pkcs8_key_parser
asn1_decoder public_key oid_registry fuse ipv6 [last unloaded: btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.624255] Code: 83 c0 01 38 d0 7c 0c 84 d2 74 08 4c 89 ff e8 57 59 64
e0 41 0f b7 74 24 62 ba e4 ff ff ff 48 c7 c7 a0 dc 33 a1 e8 c4 58 50 e2
<0f> 0b e9 9c fe ff ff 4d 8d a5 a0 02 00 00 4c 89 e7 e8 aa fb 5f e2
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166]  <TASK>
[75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431]  ? memcpy+0x4e/0x60
[75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
[75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300]  ? lock_downgrade+0x7c0/0x7c0
[75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085]  ? lock_release+0x552/0xf80
[75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886]  ? __kasan_check_write+0x14/0x20
[75721.895152]  ? do_raw_read_unlock+0x44/0x80
[75721.901323]  ? _raw_write_lock_irq+0x60/0x80
[75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166]  ? _raw_write_unlock+0x23/0x40
[75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906]  ? try_to_wake_up+0x30/0x14a0
[75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111]  ? lock_acquire+0x41b/0x4c0
[75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643]  process_one_work+0x815/0x1460
[75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
[75722.013086]  worker_thread+0x59a/0xeb0
[75722.018511]  kthread+0x2ac/0x360
[75722.023428]  ? process_one_work+0x1460/0x1460
[75722.029431]  ? kthread_complete_and_exit+0x30/0x30
[75722.036044]  ret_from_fork+0x22/0x30
[75722.041255]  </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allows setting different sizes for
regular btrfs vs zoned btrfs.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
btrfs, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h     | 3 +++
 fs/btrfs/disk-io.c   | 2 ++
 fs/btrfs/extent_io.c | 3 ++-
 fs/btrfs/inode.c     | 6 ++++--
 fs/btrfs/zoned.c     | 2 ++
 5 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e4879912c475..fca253bdb4b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1056,6 +1056,9 @@ struct btrfs_fs_info {
 	u32 csums_per_leaf;
 	u32 stripesize;
 
+	/* Maximum size of an extent. BTRFS_MAX_EXTENT_SIZE on regular btrfs. */
+	u64 max_extent_size;
+
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
 	struct rb_root swapfile_pins;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70b388de4d66..ef9d28147b9e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3138,6 +3138,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	fs_info->sectorsize_bits = ilog2(4096);
 	fs_info->stripesize = 4096;
 
+	fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	spin_lock_init(&fs_info->swapfile_pins_lock);
 	fs_info->swapfile_pins = RB_ROOT;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3194eca41635..80d9c218534f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2021,10 +2021,11 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    struct page *locked_page, u64 *start,
 				    u64 *end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	const u64 orig_start = *start;
 	const u64 orig_end = *end;
-	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
+	u64 max_bytes = fs_info->max_extent_size;
 	u64 delalloc_start;
 	u64 delalloc_end;
 	bool found;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9890782fe932..74ac7ef69a3f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2201,6 +2201,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 size;
 
 	/* not delalloc, ignore it */
@@ -2208,7 +2209,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > BTRFS_MAX_EXTENT_SIZE) {
+	if (size > fs_info->max_extent_size) {
 		u32 num_extents;
 		u64 new_size;
 
@@ -2237,6 +2238,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 				 struct extent_state *other)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 new_size, old_size;
 	u32 num_extents;
 
@@ -2250,7 +2252,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= BTRFS_MAX_EXTENT_SIZE) {
+	if (new_size <= fs_info->max_extent_size) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		btrfs_mod_outstanding_extents(BTRFS_I(inode), -1);
 		spin_unlock(&BTRFS_I(inode)->lock);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 271b8b8fd4d0..eb5a612ea912 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -734,6 +734,8 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
+	if (fs_info->max_zone_append_size < fs_info->max_extent_size)
+		fs_info->max_extent_size = fs_info->max_zone_append_size;
 
 	/*
 	 * Check mount options here, because we might change fs_info->zoned
-- 
2.35.1



* [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (2 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:56   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs releases too much of the metadata reservation on
its way to writing out the data.

Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h          | 16 ++++++++--------
 fs/btrfs/delalloc-space.c |  6 +++---
 fs/btrfs/inode.c          | 16 ++++++++--------
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fca253bdb4b8..4aac7df5a17d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -107,14 +107,6 @@ struct btrfs_ioctl_encoded_io_args;
 #define BTRFS_STAT_CURR		0
 #define BTRFS_STAT_PREV		1
 
-/*
- * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size
- */
-static inline u32 count_max_extents(u64 size)
-{
-	return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-}
-
 static inline unsigned long btrfs_chunk_item_size(int num_stripes)
 {
 	BUG_ON(num_stripes == 0);
@@ -4057,6 +4049,14 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return fs_info->zone_size > 0;
 }
 
+/*
+ * Count how many fs_info->max_extent_size cover the @size
+ */
+static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
+{
+	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
+}
+
 static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
 {
 	return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 36ab0859a263..1e8f17ff829e 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -273,7 +273,7 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 				    u64 num_bytes, u64 disk_num_bytes,
 				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
-	u64 nr_extents = count_max_extents(num_bytes);
+	u64 nr_extents = count_max_extents(fs_info, num_bytes);
 	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
@@ -350,7 +350,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * needs to free the reservation we just made.
 	 */
 	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
+	nr_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
 	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -413,7 +413,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 	unsigned num_extents;
 
 	spin_lock(&inode->lock);
-	num_extents = count_max_extents(num_bytes);
+	num_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ac7ef69a3f..357322da51b5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2218,10 +2218,10 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = count_max_extents(new_size);
+		num_extents = count_max_extents(fs_info, new_size);
 		new_size = split - orig->start;
-		num_extents += count_max_extents(new_size);
-		if (count_max_extents(size) >= num_extents)
+		num_extents += count_max_extents(fs_info, new_size);
+		if (count_max_extents(fs_info, size) >= num_extents)
 			return;
 	}
 
@@ -2278,10 +2278,10 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = count_max_extents(old_size);
+	num_extents = count_max_extents(fs_info, old_size);
 	old_size = new->end - new->start + 1;
-	num_extents += count_max_extents(old_size);
-	if (count_max_extents(new_size) >= num_extents)
+	num_extents += count_max_extents(fs_info, old_size);
+	if (count_max_extents(fs_info, new_size) >= num_extents)
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -2360,7 +2360,7 @@ void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
 		u64 len = state->end + 1 - state->start;
-		u32 num_extents = count_max_extents(len);
+		u32 num_extents = count_max_extents(fs_info, len);
 		bool do_list = !btrfs_is_free_space_inode(BTRFS_I(inode));
 
 		spin_lock(&BTRFS_I(inode)->lock);
@@ -2402,7 +2402,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 	struct btrfs_inode *inode = BTRFS_I(vfs_inode);
 	struct btrfs_fs_info *fs_info = btrfs_sb(vfs_inode->i_sb);
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = count_max_extents(len);
+	u32 num_extents = count_max_extents(fs_info, len);
 
 	if ((state->state & EXTENT_DEFRAG) && (bits & EXTENT_DEFRAG)) {
 		spin_lock(&inode->lock);
-- 
2.35.1



* [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (3 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:12   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

Also use fs_info->max_extent_size in get_extent_max_capacity() for
completeness. This is only used for defrag and is not strictly necessary
for fixing the metadata reservation size. But it still suppresses
unnecessary defrag operations.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ioctl.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7e1b4b0fbd6c..37480d4e6443 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1230,16 +1230,18 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
 	return em;
 }
 
-static u32 get_extent_max_capacity(const struct extent_map *em)
+static u32 get_extent_max_capacity(struct btrfs_fs_info *fs_info,
+				   const struct extent_map *em)
 {
 	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
 		return BTRFS_MAX_COMPRESSED;
-	return BTRFS_MAX_EXTENT_SIZE;
+	return fs_info->max_extent_size;
 }
 
 static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 				     u32 extent_thresh, u64 newer_than, bool locked)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *next;
 	bool ret = false;
 
@@ -1263,7 +1265,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 	 * If the next extent is at its max capacity, defragging current extent
 	 * makes no sense, as the total number of extents won't change.
 	 */
-	if (next->len >= get_extent_max_capacity(em))
+	if (next->len >= get_extent_max_capacity(fs_info, em))
 		goto out;
 	/* Skip older extent */
 	if (next->generation < newer_than)
@@ -1400,6 +1402,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 				  bool locked, struct list_head *target_list,
 				  u64 *last_scanned_ret)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	bool last_is_target = false;
 	u64 cur = start;
 	int ret = 0;
@@ -1484,7 +1487,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		 * Skip extents already at its max capacity, this is mostly for
 		 * compressed extents, which max cap is only 128K.
 		 */
-		if (em->len >= get_extent_max_capacity(em))
+		if (em->len >= get_extent_max_capacity(fs_info, em))
 			goto next;
 
 		/*
-- 
2.35.1



* [PATCH 06/13] btrfs: let can_allocate_chunk return int
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (4 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:11   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

For a later patch, convert the return type from bool to int. There are
no functional changes.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f97a0f28f464..c8f26ab7fe24 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3965,12 +3965,12 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	}
 }
 
-static bool can_allocate_chunk(struct btrfs_fs_info *fs_info,
-			       struct find_free_extent_ctl *ffe_ctl)
+static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
+			      struct find_free_extent_ctl *ffe_ctl)
 {
 	switch (ffe_ctl->policy) {
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
-		return true;
+		return 0;
 	case BTRFS_EXTENT_ALLOC_ZONED:
 		/*
 		 * If we have enough free space left in an already
@@ -3980,8 +3980,8 @@ static bool can_allocate_chunk(struct btrfs_fs_info *fs_info,
 		 */
 		if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
 		    !btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
-			return false;
-		return true;
+			return -ENOSPC;
+		return 0;
 	default:
 		BUG();
 	}
@@ -4063,8 +4063,9 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 			int exist = 0;
 
 			/*Check if allocation policy allows to create a new chunk */
-			if (!can_allocate_chunk(fs_info, ffe_ctl))
-				return -ENOSPC;
+			ret = can_allocate_chunk(fs_info, ffe_ctl);
+			if (ret)
+				return ret;
 
 			trans = current->journal_info;
 			if (trans)
-- 
2.35.1



* [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (5 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:25   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes Naohiro Aota
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

When we run out of active zones and no sufficient space is left in any
block groups, we need to finish one block group to make room to activate a
new block group.

However, we cannot do this for metadata block groups because we can cause a
deadlock by waiting for a running transaction commit. So, do that only for
a data block group.

Furthermore, the block group to be finished has two requirements. First,
the block group must not have reserved bytes left. Having reserved bytes
means we have an allocated region but did not yet send bios for it. If that
region is allocated by the thread calling btrfs_zone_finish(), it results
in a deadlock.

Second, the block group to be finished must not be a SYSTEM block
group. Finishing a SYSTEM block group easily breaks further chunk
allocation by nullifying the SYSTEM free space.

In certain cases, we cannot find any zone-finish candidate, or
btrfs_zone_finish() may fail. In that case, we fall back to splitting the
allocation bytes and filling the remaining space in the block groups.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 43 ++++++++++++++++++++++++++++++++----------
 fs/btrfs/zoned.c       | 40 +++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  7 +++++++
 3 files changed, 80 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c8f26ab7fe24..62e75c1d1155 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3965,6 +3965,38 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	}
 }
 
+static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
+				    struct find_free_extent_ctl *ffe_ctl)
+{
+	/* If we can activate new zone, just allocate a chunk and use it */
+	if (btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
+		return 0;
+
+	/*
+	 * We already reached the max active zones. Try to finish one block
+	 * group to make room for a new block group. This is only possible for
+	 * a data BG because btrfs_zone_finish() may need to wait for a running
+	 * transaction which can cause a deadlock for metadata allocation.
+	 */
+	if ((ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) && btrfs_finish_one_bg(fs_info))
+		return 0;
+
+	/*
+	 * If we have enough free space left in an already active block group
+	 * and we can't activate any other zone now, do not allow allocating a
+	 * new chunk and let find_free_extent() retry with a smaller size.
+	 */
+	if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size)
+		return -ENOSPC;
+
+	/*
+	 * We cannot activate a new block group and not enough space is left
+	 * in any block group. So, allocating a new block group may not help.
+	 * But there is nothing else to do anyway, so let's go with it.
+	 */
+	return 0;
+}
+
 static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
 			      struct find_free_extent_ctl *ffe_ctl)
 {
@@ -3972,16 +4004,7 @@ static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return 0;
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/*
-		 * If we have enough free space left in an already
-		 * active block group and we can't activate any other
-		 * zone now, do not allow allocating a new chunk and
-		 * let find_free_extent() retry with a smaller size.
-		 */
-		if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
-		    !btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
-			return -ENOSPC;
-		return 0;
+		return can_allocate_chunk_zoned(fs_info, ffe_ctl);
 	default:
 		BUG();
 	}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index eb5a612ea912..4a69e8492177 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2178,3 +2178,43 @@ void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logica
 	spin_unlock(&block_group->lock);
 	btrfs_put_block_group(block_group);
 }
+
+bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_block_group *block_group;
+	struct btrfs_block_group *min_bg = NULL;
+	u64 min_avail = U64_MAX;
+	int ret;
+
+	spin_lock(&fs_info->zone_active_bgs_lock);
+	list_for_each_entry(block_group, &fs_info->zone_active_bgs,
+			    active_bg_list) {
+		u64 avail;
+
+		spin_lock(&block_group->lock);
+		if (block_group->reserved ||
+		    (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM)) {
+			spin_unlock(&block_group->lock);
+			continue;
+		}
+
+		avail = block_group->zone_capacity - block_group->alloc_offset;
+		if (min_avail > avail) {
+			if (min_bg)
+				btrfs_put_block_group(min_bg);
+			min_bg = block_group;
+			min_avail = avail;
+			btrfs_get_block_group(min_bg);
+		}
+		spin_unlock(&block_group->lock);
+	}
+	spin_unlock(&fs_info->zone_active_bgs_lock);
+
+	if (!min_bg)
+		return false;
+
+	ret = btrfs_zone_finish(min_bg);
+	btrfs_put_block_group(min_bg);
+
+	return ret == 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 9caeab07fd38..09a19772ee68 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -80,6 +80,7 @@ void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
 bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info);
 void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
 				       u64 length);
+bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -249,6 +250,12 @@ static inline bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
 
 static inline void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info,
 						     u64 logical, u64 length) { }
+
+static inline bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info)
+{
+	return true;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.35.1



* [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (6 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  4:58 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

The active_total_bytes, like the total_bytes, accounts for the total bytes
of active block groups in the space_info.

With the introduction of active_total_bytes, we can check whether the
reserved bytes can be written to the block groups without activating a
new block group. The check is necessary for metadata allocation on zoned
btrfs: we cannot finish a block group, which may require waiting for the
current transaction, from the metadata allocation context. Instead, we
need to ensure the on-going allocation (reserved bytes) fits in active
block groups.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 12 +++++++++---
 fs/btrfs/space-info.c  | 41 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/space-info.h  |  4 +++-
 fs/btrfs/zoned.c       | 16 ++++++++++++++++
 4 files changed, 60 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e930749770ac..51e7c1f1d93f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1051,8 +1051,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->length * factor);
+		WARN_ON(block_group->zone_is_active &&
+			block_group->space_info->active_total_bytes
+			< block_group->length);
 	}
 	block_group->space_info->total_bytes -= block_group->length;
+	if (block_group->zone_is_active)
+		block_group->space_info->active_total_bytes -= block_group->length;
 	block_group->space_info->bytes_readonly -=
 		(block_group->length - block_group->zone_unusable);
 	block_group->space_info->bytes_zone_unusable -=
@@ -2107,7 +2112,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	trace_btrfs_add_block_group(info, cache, 0);
 	btrfs_update_space_info(info, cache->flags, cache->length,
 				cache->used, cache->bytes_super,
-				cache->zone_unusable, &space_info);
+				cache->zone_unusable, cache->zone_is_active,
+				&space_info);
 
 	cache->space_info = space_info;
 
@@ -2177,7 +2183,7 @@ static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
 		}
 
 		btrfs_update_space_info(fs_info, bg->flags, em->len, em->len,
-					0, 0, &space_info);
+					0, 0, false, &space_info);
 		bg->space_info = space_info;
 		link_block_group(bg);
 
@@ -2558,7 +2564,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
 				cache->bytes_super, cache->zone_unusable,
-				&cache->space_info);
+				cache->zone_is_active, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 62d25112310d..c7a60341b2d2 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -295,7 +295,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
 			     u64 bytes_readonly, u64 bytes_zone_unusable,
-			     struct btrfs_space_info **space_info)
+			     bool active, struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
 	int factor;
@@ -306,6 +306,8 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	ASSERT(found);
 	spin_lock(&found->lock);
 	found->total_bytes += total_bytes;
+	if (active)
+		found->active_total_bytes += total_bytes;
 	found->disk_total += total_bytes * factor;
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
@@ -369,6 +371,22 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	return avail;
 }
 
+static inline u64 writable_total_bytes(struct btrfs_fs_info *fs_info,
+				       struct btrfs_space_info *space_info)
+{
+	/*
+	 * On regular btrfs, all total_bytes are always writable. On zoned
+	 * btrfs, there may be a limitation imposed by max_active_zones. For
+	 * metadata allocation, we cannot finish an existing active block group
+	 * to avoid a deadlock. Thus, we need to consider only the active groups
+	 * to be writable for metadata space.
+	 */
+	if (!btrfs_is_zoned(fs_info) || (space_info->flags & BTRFS_BLOCK_GROUP_DATA))
+		return space_info->total_bytes;
+
+	return space_info->active_total_bytes;
+}
+
 int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 			 struct btrfs_space_info *space_info, u64 bytes,
 			 enum btrfs_reserve_flush_enum flush)
@@ -383,7 +401,7 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 	used = btrfs_space_info_used(space_info, true);
 	avail = calc_available_free_space(fs_info, space_info, flush);
 
-	if (used + bytes < space_info->total_bytes + avail)
+	if (used + bytes < writable_total_bytes(fs_info, space_info) + avail)
 		return 1;
 	return 0;
 }
@@ -419,7 +437,7 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
 		ticket = list_first_entry(head, struct reserve_ticket, list);
 
 		/* Check and see if our ticket can be satisfied now. */
-		if ((used + ticket->bytes <= space_info->total_bytes) ||
+		if ((used + ticket->bytes <= writable_total_bytes(fs_info, space_info)) ||
 		    btrfs_can_overcommit(fs_info, space_info, ticket->bytes,
 					 flush)) {
 			btrfs_space_info_update_bytes_may_use(fs_info,
@@ -750,6 +768,7 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 {
 	u64 used;
 	u64 avail;
+	u64 total;
 	u64 to_reclaim = space_info->reclaim_size;
 
 	lockdep_assert_held(&space_info->lock);
@@ -764,8 +783,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	 * space.  If that's the case add in our overage so we make sure to put
 	 * appropriate pressure on the flushing state machine.
 	 */
-	if (space_info->total_bytes + avail < used)
-		to_reclaim += used - (space_info->total_bytes + avail);
+	total = writable_total_bytes(fs_info, space_info);
+	if (total + avail < used)
+		to_reclaim += used - (total + avail);
 
 	return to_reclaim;
 }
@@ -775,9 +795,12 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 {
 	u64 global_rsv_size = fs_info->global_block_rsv.reserved;
 	u64 ordered, delalloc;
-	u64 thresh = div_factor_fine(space_info->total_bytes, 90);
+	u64 total = writable_total_bytes(fs_info, space_info);
+	u64 thresh;
 	u64 used;
 
+	thresh = div_factor_fine(total, 90);
+
 	lockdep_assert_held(&space_info->lock);
 
 	/* If we're just plain full then async reclaim just slows us down. */
@@ -839,8 +862,8 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 					   BTRFS_RESERVE_FLUSH_ALL);
 	used = space_info->bytes_used + space_info->bytes_reserved +
 	       space_info->bytes_readonly + global_rsv_size;
-	if (used < space_info->total_bytes)
-		thresh += space_info->total_bytes - used;
+	if (used < total)
+		thresh += total - used;
 	thresh >>= space_info->clamp;
 
 	used = space_info->bytes_pinned;
@@ -1557,7 +1580,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	 * can_overcommit() to ensure we can overcommit to continue.
 	 */
 	if (!pending_tickets &&
-	    ((used + orig_bytes <= space_info->total_bytes) ||
+	    ((used + orig_bytes <= writable_total_bytes(fs_info, space_info)) ||
 	     btrfs_can_overcommit(fs_info, space_info, orig_bytes, flush))) {
 		btrfs_space_info_update_bytes_may_use(fs_info, space_info,
 						      orig_bytes);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index e7de24a529cf..3cc356a55c53 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -19,6 +19,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 active_total_bytes;	/* total bytes in the space, but only
+					   accounting active block groups */
 	u64 bytes_zone_unusable;	/* total bytes that are unusable until
 					   resetting the device zone */
 
@@ -124,7 +126,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
 			     u64 bytes_readonly, u64 bytes_zone_unusable,
-			     struct btrfs_space_info **space_info);
+			     bool active, struct btrfs_space_info **space_info);
 void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
 					u64 chunk_size);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 4a69e8492177..9cabf088b800 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1838,6 +1838,7 @@ struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
 bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_space_info *space_info = block_group->space_info;
 	struct map_lookup *map;
 	struct btrfs_device *device;
 	u64 physical;
@@ -1849,6 +1850,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	map = block_group->physical_map;
 
+	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	if (block_group->zone_is_active) {
 		ret = true;
@@ -1877,7 +1879,10 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	/* Successfully activated all the zones */
 	block_group->zone_is_active = 1;
+	space_info->active_total_bytes += block_group->length;
 	spin_unlock(&block_group->lock);
+	btrfs_try_granting_tickets(fs_info, space_info);
+	spin_unlock(&space_info->lock);
 
 	/* For the active block group list */
 	btrfs_get_block_group(block_group);
@@ -1890,20 +1895,24 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 out_unlock:
 	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
 	return ret;
 }
 
 static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_written)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_space_info *space_info = block_group->space_info;
 	struct map_lookup *map;
 	bool need_zone_finish;
 	int ret = 0;
 	int i;
 
+	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	if (!block_group->zone_is_active) {
 		spin_unlock(&block_group->lock);
+		spin_unlock(&space_info->lock);
 		return 0;
 	}
 
@@ -1912,6 +1921,7 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	     (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_SYSTEM)) &&
 	    block_group->start + block_group->alloc_offset > block_group->meta_write_pointer) {
 		spin_unlock(&block_group->lock);
+		spin_unlock(&space_info->lock);
 		return -EAGAIN;
 	}
 
@@ -1924,6 +1934,7 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	 */
 	if (!fully_written) {
 		spin_unlock(&block_group->lock);
+		spin_unlock(&space_info->lock);
 
 		ret = btrfs_inc_block_group_ro(block_group, false);
 		if (ret)
@@ -1935,6 +1946,7 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 		btrfs_wait_ordered_roots(fs_info, U64_MAX, block_group->start,
 					 block_group->length);
 
+		spin_lock(&space_info->lock);
 		spin_lock(&block_group->lock);
 
 		/*
@@ -1943,12 +1955,14 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 		 */
 		if (!block_group->zone_is_active) {
 			spin_unlock(&block_group->lock);
+			spin_unlock(&space_info->lock);
 			btrfs_dec_block_group_ro(block_group);
 			return 0;
 		}
 
 		if (block_group->reserved) {
 			spin_unlock(&block_group->lock);
+			spin_unlock(&space_info->lock);
 			btrfs_dec_block_group_ro(block_group);
 			return -EAGAIN;
 		}
@@ -1965,7 +1979,9 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	block_group->free_space_ctl->free_space = 0;
 	btrfs_clear_treelog_bg(block_group);
 	btrfs_clear_data_reloc_bg(block_group);
+	space_info->active_total_bytes -= block_group->length;
 	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
 
 	map = block_group->physical_map;
 	for (i = 0; i < map->num_stripes; i++) {
-- 
2.35.1



* [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (7 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:52   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

The metadata overcommit makes the space reservation flexible but it is also
harmful to active zone tracking. Since we cannot finish a block group from
the metadata allocation context, we might not activate a new block group
and might not be able to actually write out the overcommit reservations.

So, disable metadata overcommit for zoned btrfs. We will ensure the
reservations are under active_total_bytes in the following patches.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/space-info.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index c7a60341b2d2..4ce9dfbabd97 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -399,7 +399,10 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 		return 0;
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = calc_available_free_space(fs_info, space_info, flush);
+	if (btrfs_is_zoned(fs_info) && (space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
+		avail = 0;
+	else
+		avail = calc_available_free_space(fs_info, space_info, flush);
 
 	if (used + bytes < writable_total_bytes(fs_info, space_info) + avail)
 		return 1;
-- 
2.35.1



* [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (8 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-07 14:51   ` Johannes Thumshirn
  2022-07-04  4:58 ` [PATCH 11/13] btrfs: zoned: activate necessary block group Naohiro Aota
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

For metadata space on zoned btrfs, reaching ALLOC_CHUNK{,_FORCE} means we
don't have enough space left in active_total_bytes. In that case, before
allocating a new chunk, we can try to activate an existing block group.

Also, allocating a chunk is not enough to grant a ticket for metadata space
on zoned btrfs. We need to activate the block group to increase
active_total_bytes.

btrfs_zoned_activate_one_bg() implements the activation. It activates a
block group, (maybe) finishing another block group first, and gives up if
it cannot finish any block group.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/space-info.c | 20 +++++++++++++++++++
 fs/btrfs/zoned.c      | 45 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h      | 10 ++++++++++
 3 files changed, 75 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 4ce9dfbabd97..f35f36d89660 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -9,6 +9,7 @@
 #include "ordered-data.h"
 #include "transaction.h"
 #include "block-group.h"
+#include "zoned.h"
 
 /*
  * HOW DOES SPACE RESERVATION WORK
@@ -724,6 +725,15 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		break;
 	case ALLOC_CHUNK:
 	case ALLOC_CHUNK_FORCE:
+		/*
+		 * For metadata space on zoned btrfs, reaching here means we
+		 * don't have enough space left in active_total_bytes. Try to
+		 * activate a block group first, because we may have an inactive
+		 * block group already allocated.
+		 */
+		if (btrfs_zoned_activate_one_bg(fs_info, space_info, false))
+			break;
+
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -734,6 +744,16 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 				(state == ALLOC_CHUNK) ? CHUNK_ALLOC_NO_FORCE :
 					CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
+
+		/*
+		 * For metadata space on zoned btrfs, allocating a new chunk is
+		 * not enough. We still need to activate the block group. Activate
+		 * the newly allocated block group by (maybe) finishing a block
+		 * group.
+		 */
+		if (ret == 1)
+			btrfs_zoned_activate_one_bg(fs_info, space_info, true);
+
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
 		break;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 9cabf088b800..6441a311e658 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2234,3 +2234,48 @@ bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info)
 
 	return ret == 0;
 }
+
+bool btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+				 struct btrfs_space_info *space_info,
+				 bool do_finish)
+{
+	struct btrfs_block_group *bg;
+	bool need_finish;
+	int index;
+
+	if (!btrfs_is_zoned(fs_info) || (space_info->flags & BTRFS_BLOCK_GROUP_DATA))
+		return false;
+
+	/* No more block groups to activate */
+	if (space_info->active_total_bytes == space_info->total_bytes)
+		return false;
+
+	for (;;) {
+		need_finish = false;
+		down_read(&space_info->groups_sem);
+		for (index = 0; index < BTRFS_NR_RAID_TYPES; index++) {
+			list_for_each_entry(bg, &space_info->block_groups[index], list) {
+				if (!spin_trylock(&bg->lock))
+					continue;
+				if (bg->zone_is_active) {
+					spin_unlock(&bg->lock);
+					continue;
+				}
+				spin_unlock(&bg->lock);
+
+				if (btrfs_zone_activate(bg)) {
+					up_read(&space_info->groups_sem);
+					return true;
+				}
+
+				need_finish = true;
+			}
+		}
+		up_read(&space_info->groups_sem);
+
+		if (!do_finish || !need_finish || !btrfs_finish_one_bg(fs_info))
+			break;
+	}
+
+	return false;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 09a19772ee68..1beca00c69fc 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -81,6 +81,8 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info);
 void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
 				       u64 length);
 bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info);
+bool btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+				 struct btrfs_space_info *space_info, bool do_finish);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -256,6 +258,14 @@ static inline bool btrfs_finish_one_bg(struct btrfs_fs_info *fs_info)
 	return true;
 }
 
+static inline bool btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+					       struct btrfs_space_info *space_info,
+					       bool do_finish)
+{
+	/* Assume all the block groups are active */
+	return false;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.35.1



* [PATCH 11/13] btrfs: zoned: activate necessary block group
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (9 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  4:58 ` [PATCH 12/13] btrfs: zoned: write out partially allocated region Naohiro Aota
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

There are two places where allocating a chunk is not enough:
btrfs_inc_block_group_ro() and reserve_chunk_space() try to ensure space
by allocating a chunk. To meet the active_total_bytes condition on zoned
btrfs, we also need to activate a block group there.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 51e7c1f1d93f..1c22cfe91a65 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2664,6 +2664,11 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
 	if (ret < 0)
 		goto out;
+	/*
+	 * We have allocated a new chunk. We also need to activate that chunk to
+	 * grant metadata tickets for zoned btrfs.
+	 */
+	btrfs_zoned_activate_one_bg(fs_info, cache->space_info, true);
 	ret = inc_block_group_ro(cache, 0);
 	if (ret == -ETXTBSY)
 		goto unlock_out;
@@ -3889,6 +3894,12 @@ static void reserve_chunk_space(struct btrfs_trans_handle *trans,
 		if (IS_ERR(bg)) {
 			ret = PTR_ERR(bg);
 		} else {
+			/*
+			 * We have a new chunk. We also need to activate it for
+			 * zoned btrfs.
+			 */
+			btrfs_zoned_activate_one_bg(fs_info, info, true);
+
 			/*
 			 * If we fail to add the chunk item here, we end up
 			 * trying again at phase 2 of chunk allocation, at
-- 
2.35.1



* [PATCH 12/13] btrfs: zoned: write out partially allocated region
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (10 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 11/13] btrfs: zoned: activate necessary block group Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  4:58 ` [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress Naohiro Aota
  2022-07-08 18:01 ` [PATCH 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
  13 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

cow_file_range() works in an all-or-nothing way: if it fails to allocate
an extent for a part of the given region, it gives up the whole region
including the successfully allocated parts. Consequently,
run_delalloc_zoned() writes out data only when cow_file_range()
successfully allocates the entire region.

This all-or-nothing allocation and write-out become problematic when the
available space in all the block groups gets tight under the active zone
restriction. btrfs_reserve_extent() tries hard to utilize the space left
in the active block groups, but finally gives up and fails with -ENOSPC.
However, if we send IOs for the successfully allocated region, we can
finish a zone and continue the rest of the allocation in a newly allocated
block group.

This patch implements partial write-out for run_delalloc_zoned(). With
this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
write out the allocated region and retry, and reports the successfully
allocated region via done_offset. Furthermore, the zoned extent allocator
returns -EAGAIN to make cow_file_range() bail out to its caller.

Note that we still need to wait for an IO to complete to continue the
allocation. The next patch implements that part.
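
For example (the numbers are illustrative): for a delalloc range covering
[0, 10 MiB) where the allocator can only place the first 4 MiB,
cow_file_range() returns -EAGAIN with *done_offset pointing at the last
allocated byte. run_delalloc_zoned() then submits the write for
[0, done_offset], which may let a zone finish, and retries from
done_offset + 1 for the remaining 6 MiB.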

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 10 +++++++
 fs/btrfs/inode.c       | 63 ++++++++++++++++++++++++++++++++----------
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 62e75c1d1155..5637d1cea1c5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3989,6 +3989,16 @@ static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
 	if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size)
 		return -ENOSPC;
 
+	/*
+	 * Not even min_alloc_size is left in any block group. Since we cannot
+	 * activate a new block group, allocating one may not help. Let's tell
+	 * the caller to try again and hope it makes progress by writing out
+	 * some parts of the region. That is only possible for data block
+	 * groups, where a part of the region can be written.
+	 */
+	if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA)
+		return -EAGAIN;
+
 	/*
 	 * We cannot activate a new block group and no enough space left in any
 	 * block groups. So, allocating a new block group may not help. But,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 357322da51b5..163f3d995f00 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -117,7 +117,8 @@ static int btrfs_truncate(struct inode *inode, bool skip_writeback);
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   u64 *done_offset);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
 				       u64 block_len, u64 orig_block_len,
@@ -921,7 +922,7 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
 	 * can directly submit them without interruption.
 	 */
 	ret = cow_file_range(inode, locked_page, start, end, &page_started,
-			     &nr_written, 0);
+			     &nr_written, 0, NULL);
 	/* Inline extent inserted, page gets unlocked and everything is done */
 	if (page_started) {
 		ret = 0;
@@ -1170,7 +1171,8 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock)
+				   unsigned long *nr_written, int unlock,
+				   u64 *done_offset)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -1363,6 +1365,21 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_unlock:
+	/*
+	 * If done_offset is non-NULL and ret == -EAGAIN, we expect the
+	 * caller to write out the successfully allocated region and retry.
+	 */
+	if (done_offset && ret == -EAGAIN) {
+		if (orig_start < start)
+			*done_offset = start - 1;
+		else
+			*done_offset = start;
+		return ret;
+	} else if (ret == -EAGAIN) {
+		/* Convert to -ENOSPC since the caller cannot retry. */
+		ret = -ENOSPC;
+	}
+
 	/*
 	 * Now, we have three regions to clean up:
 	 *
@@ -1608,19 +1625,37 @@ static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
 				       u64 end, int *page_started,
 				       unsigned long *nr_written)
 {
+	u64 done_offset = end;
 	int ret;
+	bool locked_page_done = false;
 
-	ret = cow_file_range(inode, locked_page, start, end, page_started,
-			     nr_written, 0);
-	if (ret)
-		return ret;
+	while (start <= end) {
+		ret = cow_file_range(inode, locked_page, start, end, page_started,
+				     nr_written, 0, &done_offset);
+		if (ret && ret != -EAGAIN)
+			return ret;
 
-	if (*page_started)
-		return 0;
+		if (*page_started) {
+			ASSERT(ret == 0);
+			return 0;
+		}
+
+		if (ret == 0)
+			done_offset = end;
+
+		if (done_offset == start)
+			return -ENOSPC;
+
+		if (!locked_page_done) {
+			__set_page_dirty_nobuffers(locked_page);
+			account_page_redirty(locked_page);
+		}
+		locked_page_done = true;
+		extent_write_locked_range(&inode->vfs_inode, start, done_offset);
+
+		start = done_offset + 1;
+	}
 
-	__set_page_dirty_nobuffers(locked_page);
-	account_page_redirty(locked_page);
-	extent_write_locked_range(&inode->vfs_inode, start, end);
 	*page_started = 1;
 
 	return 0;
@@ -1712,7 +1747,7 @@ static int fallback_to_cow(struct btrfs_inode *inode, struct page *locked_page,
 	}
 
 	return cow_file_range(inode, locked_page, start, end, page_started,
-			      nr_written, 1);
+			      nr_written, 1, NULL);
 }
 
 struct can_nocow_file_extent_args {
@@ -2185,7 +2220,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 						 page_started, nr_written);
 		else
 			ret = cow_file_range(inode, locked_page, start, end,
-					     page_started, nr_written, 1);
+					     page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
 		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
-- 
2.35.1



* [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (11 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 12/13] btrfs: zoned: write out partially allocated region Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-08 18:01 ` [PATCH 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
  13 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

When the allocated position doesn't progress, we cannot submit IOs to
finish a block group ourselves, but there should be ongoing IOs that will
finish one. So, in that case, we wait for a zone to be finished and retry
the allocation after that.

Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH in fs_info->flags to
indicate that we need a zone finish before the next allocation can
proceed. The flag is set when the allocator detects it cannot activate a
new block group, and is cleared once a zone is finished.
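
In short, the protocol works like this (all names from the hunks below):
btrfs_can_activate_zone() sets BTRFS_FS_NEED_ZONE_FINISH when no zone can
be activated, run_delalloc_zoned() sleeps on zone_finish_wait until the
flag is cleared whenever allocation made no progress, and do_zone_finish()
clears the flag and wakes all waiters once a zone has been finished.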

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h   | 4 ++++
 fs/btrfs/disk-io.c | 1 +
 fs/btrfs/inode.c   | 9 +++++++--
 fs/btrfs/zoned.c   | 6 ++++++
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4aac7df5a17d..bace2f2eb9d5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -638,6 +638,9 @@ enum {
 	/* Indicate we have half completed snapshot deletions pending. */
 	BTRFS_FS_UNFINISHED_DROPS,
 
+	/* Indicate we have to finish a zone to do the next allocation. */
+	BTRFS_FS_NEED_ZONE_FINISH,
+
 #if BITS_PER_LONG == 32
 	/* Indicate if we have error/warn message printed on 32bit systems */
 	BTRFS_FS_32BIT_ERROR,
@@ -1084,6 +1087,7 @@ struct btrfs_fs_info {
 
 	spinlock_t zone_active_bgs_lock;
 	struct list_head zone_active_bgs;
+	wait_queue_head_t zone_finish_wait;
 
 	/* Updates are not protected by any lock */
 	struct btrfs_commit_stats commit_stats;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ef9d28147b9e..b76b7ef6d85d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3131,6 +3131,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 	init_waitqueue_head(&fs_info->delayed_iputs_wait);
+	init_waitqueue_head(&fs_info->zone_finish_wait);
 
 	/* Usable values until the real ones are cached from the superblock */
 	fs_info->nodesize = 4096;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 163f3d995f00..d5f27cd1eef2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1643,8 +1643,13 @@ static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
 		if (ret == 0)
 			done_offset = end;
 
-		if (done_offset == start)
-			return -ENOSPC;
+		if (done_offset == start) {
+			struct btrfs_fs_info *info = inode->root->fs_info;
+
+			wait_event(info->zone_finish_wait,
+				   !test_bit(BTRFS_FS_NEED_ZONE_FINISH, &info->flags));
+			continue;
+		}
 
 		if (!locked_page_done) {
 			__set_page_dirty_nobuffers(locked_page);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6441a311e658..3503dd29eab0 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2015,6 +2015,9 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	/* For active_bg_list */
 	btrfs_put_block_group(block_group);
 
+	clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
+	wake_up_all(&fs_info->zone_finish_wait);
+
 	return 0;
 }
 
@@ -2051,6 +2054,9 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
 	}
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	if (!ret)
+		set_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
+
 	return ret;
 }
 
-- 
2.35.1



* Re: [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-04  4:58 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
@ 2022-07-04  6:57   ` Johannes Thumshirn
  2022-07-04  8:23   ` Christoph Hellwig
  1 sibling, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  6:57 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-04  4:58 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
@ 2022-07-04  7:02   ` Johannes Thumshirn
  2022-07-04  7:46     ` Naohiro Aota
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:02 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

On 04.07.22 06:59, Naohiro Aota wrote:
>  fs/btrfs/ctree.h     | 3 +++
>  fs/btrfs/disk-io.c   | 2 ++
>  fs/btrfs/extent_io.c | 3 ++-
>  fs/btrfs/inode.c     | 6 ++++--
>  fs/btrfs/zoned.c     | 2 ++
>  5 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e4879912c475..fca253bdb4b8 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1056,6 +1056,9 @@ struct btrfs_fs_info {
>  	u32 csums_per_leaf;
>  	u32 stripesize;
>  
> +	/* Maximum size of an extent. BTRFS_MAX_EXTENT_SIZE on regular btrfs. */
> +	u64 max_extent_size;
> +
>  	/* Block groups and devices containing active swapfiles. */
>  	spinlock_t swapfile_pins_lock;
>  	struct rb_root swapfile_pins;

Shouldn't count_max_extents() use fs_info->max_extent_size instead of
BTRFS_MAX_EXTENT_SIZE as well?

IIRC I did a similar patch once as well, which didn't get merged for
different reasons, and count_max_extents() needed to be converted as well.


* Re: [PATCH 06/13] btrfs: let can_allocate_chunk return int
  2022-07-04  4:58 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
@ 2022-07-04  7:11   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:11 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  2022-07-04  4:58 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
@ 2022-07-04  7:12   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:12 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation
  2022-07-04  4:58 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
@ 2022-07-04  7:25   ` Johannes Thumshirn
  2022-07-04 13:25     ` Naohiro Aota
  0 siblings, 1 reply; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:25 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

On 04.07.22 06:59, Naohiro Aota wrote:
> +
> +	ret = btrfs_zone_finish(min_bg);
> +	btrfs_put_block_group(min_bg);
> +
> +	return ret == 0;

Why aren't you propagating the error from btrfs_zone_finish()
back to the caller?


* Re: [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-04  7:02   ` Johannes Thumshirn
@ 2022-07-04  7:46     ` Naohiro Aota
  0 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04  7:46 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Jul 04, 2022 at 07:02:57AM +0000, Johannes Thumshirn wrote:
> On 04.07.22 06:59, Naohiro Aota wrote:
> > [...]
> 
> Shouldn't count_max_extents() use fs_info->max_extent_size instead of
> BTRFS_MAX_EXTENT_SIZE as well?
> 
> IIRC I did a similar patch once as well, which didn't get merged for
> different reasons, and count_max_extents() needed to be converted as well.

That is done in patch 04. I split the patches because using
fs_info->max_extent_size in count_max_extents() requires an argument
change, and I think it looks good to have them separated.

But, squashing them might be OK.


* Re: [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned
  2022-07-04  4:58 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
@ 2022-07-04  7:52   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:52 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-04  4:58 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
@ 2022-07-04  7:56   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:56 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

On 04.07.22 06:59, Naohiro Aota wrote:
> +/*
> + * Count how many fs_info->max_extent_size cover the @size
> + */
> +static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
> +{
> +	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
> +}
> +


For the record, Naohiro and I just discussed this offline.
count_max_extents() needs to check for a possibly NULL fs_info
because of the selftest code.


static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
{
	if (IS_ENABLED(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) && !fs_info)
		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);

	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
}
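
A quick worked example of the math (the zoned limit here is illustrative):
with the regular BTRFS_MAX_EXTENT_SIZE of 128 MiB, a 300 MiB delalloc
range counts as div_u64(300M + 128M - 1, 128M) = 3 extents, while on a
zoned device whose fs_info->max_extent_size ends up at, say, 1 MiB, the
same range counts as 300 extents. That is why the reservation code has to
use the per-filesystem value.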


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04  4:58 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
@ 2022-07-04  7:57   ` Johannes Thumshirn
  2022-07-04  8:24   ` Christoph Hellwig
  2022-07-04  9:33   ` Johannes Thumshirn
  2 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:57 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-04  4:58 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
  2022-07-04  6:57   ` Johannes Thumshirn
@ 2022-07-04  8:23   ` Christoph Hellwig
  2022-07-04 13:27     ` Naohiro Aota
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-07-04  8:23 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

Please Cc the block list.

On Mon, Jul 04, 2022 at 01:58:05PM +0900, Naohiro Aota wrote:
> Add bdev_max_segments() like other queue parameters.
> 
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  include/linux/blkdev.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2f7b43444c5f..62e3ff52ab03 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1206,6 +1206,11 @@ bdev_max_zone_append_sectors(struct block_device *bdev)
>  	return queue_max_zone_append_sectors(bdev_get_queue(bdev));
>  }
>  
> +static inline unsigned int bdev_max_segments(struct block_device *bdev)
> +{
> +	return queue_max_segments(bdev_get_queue(bdev));
> +}
> +
>  static inline unsigned queue_logical_block_size(const struct request_queue *q)
>  {
>  	int retval = 512;
> -- 
> 2.35.1
> 
---end quoted text---


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04  4:58 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
  2022-07-04  7:57   ` Johannes Thumshirn
@ 2022-07-04  8:24   ` Christoph Hellwig
  2022-07-04  9:33   ` Johannes Thumshirn
  2 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2022-07-04  8:24 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 7a0f8fa44800..271b8b8fd4d0 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -415,6 +415,9 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
>  	nr_sectors = bdev_nr_sectors(bdev);
>  	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
>  	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
> +	zone_info->max_zone_append_size =
> +		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
> +		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);

This assumes each segment is just page sized, so you probably want to
document how you arrived at that assumption.


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04  4:58 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
  2022-07-04  7:57   ` Johannes Thumshirn
  2022-07-04  8:24   ` Christoph Hellwig
@ 2022-07-04  9:33   ` Johannes Thumshirn
  2022-07-04 11:54     ` Christoph Hellwig
  2 siblings, 1 reply; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  9:33 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

> @@ -723,6 +732,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
>  	}
>  
>  	fs_info->zone_size = zone_size;
> +	fs_info->max_zone_append_size = max_zone_append_size;
>  	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
>  
>  	/*

Thinking a bit more about this, this needs to be the min() of all
max_zone_append_size values of the underlying devices, because even as of
now zoned btrfs supports multiple devices on a single FS.

Sorry for not noticing this earlier.


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04  9:33   ` Johannes Thumshirn
@ 2022-07-04 11:54     ` Christoph Hellwig
  2022-07-04 13:24       ` Naohiro Aota
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2022-07-04 11:54 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Naohiro Aota, linux-btrfs

On Mon, Jul 04, 2022 at 09:33:32AM +0000, Johannes Thumshirn wrote:
> > @@ -723,6 +732,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
> >  	}
> >  
> >  	fs_info->zone_size = zone_size;
> > +	fs_info->max_zone_append_size = max_zone_append_size;
> >  	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
> >  
> >  	/*
> 
> > Thinking a bit more about this, this needs to be the min() of all
> > max_zone_append_size values of the underlying devices, because even as of
> > now zoned btrfs supports multiple devices on a single FS.

Yes.


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-04 11:54     ` Christoph Hellwig
@ 2022-07-04 13:24       ` Naohiro Aota
  0 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04 13:24 UTC (permalink / raw)
  To: hch; +Cc: Johannes Thumshirn, linux-btrfs

On Mon, Jul 04, 2022 at 04:54:44AM -0700, Christoph Hellwig wrote:
> On Mon, Jul 04, 2022 at 09:33:32AM +0000, Johannes Thumshirn wrote:
> > > @@ -723,6 +732,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
> > >  	}
> > >  
> > >  	fs_info->zone_size = zone_size;
> > > +	fs_info->max_zone_append_size = max_zone_append_size;
> > >  	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
> > >  
> > >  	/*
> > 
> > Thinking a bit more about this, this needs to be the min() of all
> > max_zone_append_size values of the underlying devices, because even as of
> > now zoned btrfs supports multiple devices on a single FS.
> 
> Yes.

That min() is done by the hunk above.


* Re: [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation
  2022-07-04  7:25   ` Johannes Thumshirn
@ 2022-07-04 13:25     ` Naohiro Aota
  0 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04 13:25 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: linux-btrfs

On Mon, Jul 04, 2022 at 07:25:14AM +0000, Johannes Thumshirn wrote:
> On 04.07.22 06:59, Naohiro Aota wrote:
> > +
> > +	ret = btrfs_zone_finish(min_bg);
> > +	btrfs_put_block_group(min_bg);
> > +
> > +	return ret == 0;
> 
> Why aren't you propagating the error from btrfs_zone_finish()
> back to the caller?

Ah, yeah. We should propagate it. I'll fix it in the next version.

Thanks.


* Re: [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-04  8:23   ` Christoph Hellwig
@ 2022-07-04 13:27     ` Naohiro Aota
  0 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-04 13:27 UTC (permalink / raw)
  To: hch; +Cc: linux-btrfs

On Mon, Jul 04, 2022 at 01:23:44AM -0700, Christoph Hellwig wrote:
> Please Cc the block list.

Oops, I completely forgot about it. I will do so from the next version.



* Re: [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space
  2022-07-04  4:58 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
@ 2022-07-07 14:51   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-07 14:51 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 00/13] btrfs: zoned: fix active zone tracking issues
  2022-07-04  4:58 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (12 preceding siblings ...)
  2022-07-04  4:58 ` [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress Naohiro Aota
@ 2022-07-08 18:01 ` David Sterba
  2022-07-08 23:06   ` Naohiro Aota
  13 siblings, 1 reply; 35+ messages in thread
From: David Sterba @ 2022-07-08 18:01 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Mon, Jul 04, 2022 at 01:58:04PM +0900, Naohiro Aota wrote:
> This series addresses mainly two issues on zoned btrfs' active zone
> tracking and one issue which is a dependency of the main issue.
> 
> * Background
[...]

Thanks for the writeup, this seems to be fixing a serious problem, also
guessing by the length of the series. Some of the patches are marked for
stable 5.12 or 5.16 but I think this would need to be backported
manually and to 5.18 as the other versions have been EOLed.

As most of the changes are in zoned code I can add the whole series to
misc-next rather sooner than later because the code freeze is near.

I did a quick test and it crashes in the self tests so I can't add the
branch to for-next.

[   13.324894] Btrfs loaded, crc32c=crc32c-generic, debug=on, assert=on, integrity-checker=on, ref-verify=on, zoned=yes, fsverity=yes
[   13.326507] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
[   13.327303] BTRFS: selftest: running btrfs free space cache tests
[   13.328133] BTRFS: selftest: running extent only tests
[   13.328935] BTRFS: selftest: running bitmap only tests
[   13.329770] BTRFS: selftest: running bitmap and extent tests
[   13.330647] BTRFS: selftest: running space stealing from bitmap to extent tests
[   13.331990] BTRFS: selftest: running bytes index tests
[   13.332915] BTRFS: selftest: running extent buffer operation tests
[   13.333924] BTRFS: selftest: running btrfs_split_item tests
[   13.334922] BTRFS: selftest: running extent I/O tests
[   13.335733] BTRFS: selftest: running find delalloc tests
[   13.525595] BUG: unable to handle page fault for address: 0000000000002360
[   13.526677] #PF: supervisor read access in kernel mode
[   13.527480] #PF: error_code(0x0000) - not-present page
[   13.528381] PGD 0 P4D 0 
[   13.528909] Oops: 0000 [#1] PREEMPT SMP
[   13.529604] CPU: 0 PID: 642 Comm: modprobe Not tainted 5.19.0-rc5-default+ #1809
[   13.530742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[   13.532475] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
[   13.535137] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
[   13.535467] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
[   13.535880] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
[   13.536534] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
[   13.537141] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
[   13.537548] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
[   13.537956] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
[   13.538467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.538808] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0
[   13.539213] Call Trace:
[   13.539407]  <TASK>
[   13.539584]  test_find_delalloc+0x19b/0x695 [btrfs]
[   13.539978]  btrfs_test_extent_io+0x1e/0x39 [btrfs]
[   13.540710]  btrfs_run_sanity_tests.cold+0x33/0xcd [btrfs]
[   13.541686]  init_btrfs_fs+0xcc/0x12b [btrfs]
[   13.542328]  ? 0xffffffffc060a000
[   13.542868]  do_one_initcall+0x65/0x330
[   13.543348]  ? rcu_read_lock_sched_held+0x3b/0x70
[   13.543980]  ? trace_kmalloc+0x33/0xe0
[   13.544394]  ? kmem_cache_alloc_trace+0x188/0x270
[   13.544805]  do_init_module+0x4a/0x1f0
[   13.545076]  __do_sys_finit_module+0x9e/0xf0
[   13.545373]  do_syscall_64+0x3c/0x80
[   13.545618]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[   13.545961] RIP: 0033:0x7eff49c6da8d
[   13.547264] RSP: 002b:00007ffcb6e6bbc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   13.547724] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff49c6da8d
[   13.548184] RDX: 0000000000000000 RSI: 000055a8143faab2 RDI: 000000000000000d
[   13.549012] RBP: 000055a81441d460 R08: 0000000000000000 R09: 0000000000000000
[   13.549550] R10: 000000000000000d R11: 0000000000000246 R12: 000055a8143faab2
[   13.549958] R13: 000055a814423530 R14: 0000000000000000 R15: 000055a814424ab8
[   13.550363]  </TASK>
[   13.550541] Modules linked in: btrfs(+) blake2b_generic libcrc32c xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash loop
[   13.551275] CR2: 0000000000002360
[   13.551525] ---[ end trace 0000000000000000 ]---
[   13.551819] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
[   13.555923] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
[   13.557052] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
[   13.558614] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
[   13.559857] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
[   13.561280] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
[   13.562619] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
[   13.563522] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
[   13.564612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.565356] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0


* Re: [PATCH 00/13] btrfs: zoned: fix active zone tracking issues
  2022-07-08 18:01 ` [PATCH 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
@ 2022-07-08 23:06   ` Naohiro Aota
  0 siblings, 0 replies; 35+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:06 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On Fri, Jul 08, 2022 at 08:01:01PM +0200, David Sterba wrote:
> On Mon, Jul 04, 2022 at 01:58:04PM +0900, Naohiro Aota wrote:
> > This series addresses mainly two issues on zoned btrfs' active zone
> > tracking and one issue which is a dependency of the main issue.
> > 
> > * Background
> [...]
> 
> Thanks for the writeup, this seems to be fixing a serious problem, also
> guessing by the length of the series. Some of the patches are marked for
> stable 5.12 or 5.16 but I think this would need to be backported
> manually and to 5.18 as the other versions have been EOLed.
> 
> As most of the changes are in zoned code I can add the whole series to
> misc-next rather sooner than later because the code freeze is near.

Thank you.

> I did a quick test and it crashes in the self tests so I can't add the
> branch to for-next.
> 
> [   13.324894] Btrfs loaded, crc32c=crc32c-generic, debug=on, assert=on, integrity-checker=on, ref-verify=on, zoned=yes, fsverity=yes
> [   13.326507] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
> [   13.327303] BTRFS: selftest: running btrfs free space cache tests
> [   13.328133] BTRFS: selftest: running extent only tests
> [   13.328935] BTRFS: selftest: running bitmap only tests
> [   13.329770] BTRFS: selftest: running bitmap and extent tests
> [   13.330647] BTRFS: selftest: running space stealing from bitmap to extent tests
> [   13.331990] BTRFS: selftest: running bytes index tests
> [   13.332915] BTRFS: selftest: running extent buffer operation tests
> [   13.333924] BTRFS: selftest: running btrfs_split_item tests
> [   13.334922] BTRFS: selftest: running extent I/O tests
> [   13.335733] BTRFS: selftest: running find delalloc tests
> [   13.525595] BUG: unable to handle page fault for address: 0000000000002360
> [   13.526677] #PF: supervisor read access in kernel mode
> [   13.527480] #PF: error_code(0x0000) - not-present page
> [   13.528381] PGD 0 P4D 0 
> [   13.528909] Oops: 0000 [#1] PREEMPT SMP
> [   13.529604] CPU: 0 PID: 642 Comm: modprobe Not tainted 5.19.0-rc5-default+ #1809
> [   13.530742] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
> [   13.532475] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
> [   13.535137] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
> [   13.535467] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
> [   13.535880] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
> [   13.536534] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
> [   13.537141] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
> [   13.537548] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
> [   13.537956] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
> [   13.538467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.538808] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0
> [   13.539213] Call Trace:
> [   13.539407]  <TASK>
> [   13.539584]  test_find_delalloc+0x19b/0x695 [btrfs]
> [   13.539978]  btrfs_test_extent_io+0x1e/0x39 [btrfs]
> [   13.540710]  btrfs_run_sanity_tests.cold+0x33/0xcd [btrfs]
> [   13.541686]  init_btrfs_fs+0xcc/0x12b [btrfs]
> [   13.542328]  ? 0xffffffffc060a000
> [   13.542868]  do_one_initcall+0x65/0x330
> [   13.543348]  ? rcu_read_lock_sched_held+0x3b/0x70
> [   13.543980]  ? trace_kmalloc+0x33/0xe0
> [   13.544394]  ? kmem_cache_alloc_trace+0x188/0x270
> [   13.544805]  do_init_module+0x4a/0x1f0
> [   13.545076]  __do_sys_finit_module+0x9e/0xf0
> [   13.545373]  do_syscall_64+0x3c/0x80
> [   13.545618]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [   13.545961] RIP: 0033:0x7eff49c6da8d
> [   13.547264] RSP: 002b:00007ffcb6e6bbc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [   13.547724] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007eff49c6da8d
> [   13.548184] RDX: 0000000000000000 RSI: 000055a8143faab2 RDI: 000000000000000d
> [   13.549012] RBP: 000055a81441d460 R08: 0000000000000000 R09: 0000000000000000
> [   13.549550] R10: 000000000000000d R11: 0000000000000246 R12: 000055a8143faab2
> [   13.549958] R13: 000055a814423530 R14: 0000000000000000 R15: 000055a814424ab8
> [   13.550363]  </TASK>
> [   13.550541] Modules linked in: btrfs(+) blake2b_generic libcrc32c xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash loop
> [   13.551275] CR2: 0000000000002360
> [   13.551525] ---[ end trace 0000000000000000 ]---
> [   13.551819] RIP: 0010:find_lock_delalloc_range+0x41/0x2a0 [btrfs]
> [   13.555923] RSP: 0018:ffffb750c05ebcd8 EFLAGS: 00010296
> [   13.557052] RAX: 0000000000000000 RBX: ffff96b1fe0f3440 RCX: 0000000000000fff
> [   13.558614] RDX: ffffb750c05ebd60 RSI: ffff96b1fe0f3440 RDI: ffff96b1838b8f00
> [   13.559857] RBP: ffff96b1838b8f00 R08: 0000000000000000 R09: 0000000000000001
> [   13.561280] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000001000
> [   13.562619] R13: ffff96b1838b8ab8 R14: 0000000000000fff R15: ffffb750c05ebd58
> [   13.563522] FS:  00007eff49b43740(0000) GS:ffff96b1fd600000(0000) knlGS:0000000000000000
> [   13.564612] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   13.565356] CR2: 0000000000002360 CR3: 0000000003f1c000 CR4: 00000000000006b0

Yes, this was also pointed out by Johannes. I'm going to send a v2 with
this fixed.


* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-08 23:18 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
@ 2022-07-09 11:34   ` Johannes Thumshirn
  0 siblings, 0 replies; 35+ messages in thread
From: Johannes Thumshirn @ 2022-07-09 11:34 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-08 23:18 Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-09 11:34   ` Johannes Thumshirn
  0 siblings, 1 reply; 35+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without the now unnecessary
ASSERT and check. The max_zone_append_size will be used as a hint to
estimate the number of extents needed to cover the delalloc/writeback
region in later commits.

The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit takes that into account when calculating max_zone_append_size.
Technically, a bio can be larger than queue_max_segments() * PAGE_SIZE if
the pages are physically contiguous. But, it is safe to consider
"queue_max_segments() * PAGE_SIZE" as an upper limit of an extent size for
calculating the number of extents needed to write data.
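
To put numbers on it (illustrative, the actual limits are per-device):
with 4 KiB pages and a queue_max_segments() of 128, the cap is
128 * 4 KiB = 512 KiB. A bio built from physically contiguous pages could
exceed that, but underestimating the maximum extent size only makes us
count more extents and reserve more metadata, never less, so the estimate
errs on the safe side.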

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/zoned.c | 17 +++++++++++++++++
 fs/btrfs/zoned.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4e2569f84aab..e4879912c475 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1071,6 +1071,8 @@ struct btrfs_fs_info {
 	 */
 	u64 zone_size;
 
+	/* Max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 79a2d48a5251..bdc533fa80ae 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -415,6 +415,16 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
+	/*
+	 * We limit max_zone_append_size also by max_segments *
+	 * PAGE_SIZE. Technically, we can have multiple pages per segment. But,
+	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
+	 * increase the metadata reservation even if it increases the number of
+	 * extents, it is safe to stick with the limit.
+	 */
+	zone_info->max_zone_append_size =
+		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
+		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
@@ -640,6 +650,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
@@ -674,6 +685,11 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -723,6 +739,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	/*
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6b2eec99162b..9caeab07fd38 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -19,6 +19,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
-- 
2.35.1

