* [PATCH 00/13] btrfs: zoned: fix active zone tracking issues
@ 2022-07-08 23:18 Naohiro Aota
  2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
                   ` (14 more replies)
  0 siblings, 15 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

This series mainly addresses two issues in zoned btrfs' active zone
tracking, plus one issue that the fixes for the main issues depend on.

* ChangeLog
- v2
  - Support sanity tests (Johannes)
    - fs_info can be NULL while running sanity tests. Consider that case
      under CONFIG_BTRFS_FS_RUN_SANITY_TESTS.
  - Propagate an error of btrfs_zone_finish() (Johannes)
  - Add a comment to max_segments limitation (Christoph)
  - Rename btrfs_finish_one_bg() to btrfs_zone_finish_one_bg() to make it
    clear it is related to the zoned code.
  - Do not reduce active_total_bytes when finishing a block group.
    - While the block group is no longer active, it can still have "used"
      bytes. So, it must still be counted in "total_bytes"; otherwise, the
      free space calculation breaks.
  - Do not try to activate a fully allocated block group.

* Background

A ZNS drive has an upper limit on the number of zones that can be written
simultaneously. We call this limit max_active_zones. An active zone is
deactivated when we write fully to the zone, or when we explicitly send a
REQ_OP_ZONE_FINISH command to transition it to the full state.
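
For illustration, here is a minimal sketch of how a zone can be explicitly
finished from kernel code via the block layer's zone management API. The
helper name is made up for this example and error handling is elided; it
is a sketch, not the exact call path btrfs uses.

  #include <linux/blkdev.h>

  /*
   * Transition the zone starting at @sector on @bdev to the "full" state,
   * so it no longer counts against the device's max_active_zones limit.
   */
  static int finish_one_zone(struct block_device *bdev, sector_t sector,
                             sector_t zone_sectors)
  {
          return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_FINISH, sector,
                                  zone_sectors, GFP_NOFS);
  }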

Zoned btrfs must be aware of max_active_zones to use a ZNS drive. So, we
have an active zone tracking system that considers a block group active
iff its underlying zones are active. Concretely, we consider a block group
(and its underlying zones) active when we start allocating from it. Then,
when the last allocatable region in the block group is written, we send a
REQ_OP_ZONE_FINISH command to each zone and consider the block group
inactive.

So, in short, we currently depend on writing fully to a zone to finish a
block group.

* Issues
** Issue A

In certain situations, the current zoned btrfs extent allocation fails
with an early -ENOSPC on a ZNS drive. When no block group has enough space
left for the allocation, it tries to allocate a new block group, which is
possible only if we can activate a new zone. If not, it returns -ENOSPC
even though the device still has free space left.

** Issue B

When doing a buffered write, we call cow_file_range() to allocate the data
extent. cow_file_range() works in an all-or-nothing manner: it returns 0
if it can allocate the whole range, and -ENOSPC otherwise. Thus, when all
the block groups have only small free space left and btrfs cannot finish
any block group, the allocation partly succeeds but fails in the end. This
also results in an early -ENOSPC.

There are situations where we cannot finish any block group. Consider that
we have 8 active data block groups (forget about metadata/system block
groups here), each with 1 MB of free space left. Now, we want to do a 10
MB buffered write. We can allocate blocks for 8 of the 10 MB, and then we
can no longer allocate from any block group. Furthermore, we cannot finish
any block group, because every block group now has 1 MB of reserved but
unwritten space left. And, since these 1 MB regions are owned by the
allocating process itself, simply waiting for the regions to be written
won't work.
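
As a toy illustration of the arithmetic above (plain userspace C, not
btrfs code; the block group count and sizes are just the example values):

  #include <stdio.h>

  #define NR_BGS 8
  #define MB     (1024L * 1024)

  int main(void)
  {
          long free_left[NR_BGS];
          long want = 10 * MB, got = 0;
          int i;

          /* 8 active data block groups, each with 1 MB free. */
          for (i = 0; i < NR_BGS; i++)
                  free_left[i] = 1 * MB;

          /* Reserve greedily from each block group. */
          for (i = 0; i < NR_BGS && got < want; i++) {
                  got += free_left[i];
                  free_left[i] = 0; /* reserved, but not yet written */
          }

          /*
           * 8 MB of 10 MB is reserved. No block group has free space
           * left, and none can be zone-finished because each still holds
           * reserved-but-unwritten bytes owned by this very allocation:
           * an early -ENOSPC.
           */
          printf("reserved %ld MB of %ld MB\n", got / MB, want / MB);
          return 0;
  }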

** Issue C

To address issue A, we needed to disable metadata reservation over-commit.
That revealed that we underestimate the number of extents to be written on
zoned btrfs. On zoned btrfs, we use a ZONE APPEND command to write data,
whose bio size is limited by max_zone_append_sectors and max_segments. So,
a data extent is always split into pieces of at most the size of that
limit. As a result, if BTRFS_MAX_EXTENT_SIZE is larger than the limit, we
tend to have more extents than the estimation using BTRFS_MAX_EXTENT_SIZE
expects.
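
To make the mismatch concrete, a back-of-the-envelope sketch in plain C
(BTRFS_MAX_EXTENT_SIZE is 128 MiB upstream; the 1 MiB zone append limit
is an assumed example value, real devices vary):

  #include <stdio.h>

  #define MIB                   (1024ULL * 1024)
  #define BTRFS_MAX_EXTENT_SIZE (128 * MIB)

  /* Same rounding-up division count_max_extents() does. */
  static unsigned int count_extents(unsigned long long size,
                                    unsigned long long max_extent)
  {
          return (size + max_extent - 1) / max_extent;
  }

  int main(void)
  {
          unsigned long long write_size = 10 * MIB;
          unsigned long long zone_append_limit = 1 * MIB; /* example */

          /* The reservation is estimated with BTRFS_MAX_EXTENT_SIZE... */
          printf("estimated extents: %u\n",
                 count_extents(write_size, BTRFS_MAX_EXTENT_SIZE));
          /* ...but ZONE APPEND actually splits the write at the limit. */
          printf("actual extents:    %u\n",
                 count_extents(write_size, zone_append_limit));
          return 0; /* prints 1 vs 10: a 10x under-estimation */
  }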

The metadata reservation is done before the allocation (e.g., at
btrfs_buffered_write()) and released afterward along with the delalloc
process or ordered extent creation. As a result, we can run short of the
metadata reservation in certain situations, which causes a WARN.

* Solutions
** For issue A

Issue A is that we can hit an early -ENOSPC if we cannot activate another
block group and no block group has enough space left.

To avoid the early -ENOSPC, we need to choose one block group and finish
it to make room for a new block group to be activated. But, that is only
possible from the data extent allocation context. From the metadata
context, it can cause a deadlock because we might need to wait for a
running transaction to make the finishing block group read-only.

So, we use two different methods for data and metadata allocation. For
data allocation, we can finish a block group on-demand from the
btrfs_reserve_extent() context. The block group to finish is the one with
the least free space left.

For metadata allocation, we use flush_space() to ensure that reserved bytes
can be written into active block groups. To do so, we track active block
groups' total bytes as active_total_bytes, and activate a block group
on-demand from flush_space().

Also, a block group newly allocated from certain contexts must be
activated immediately; patch 11 handles this.

** For issue B

Issue B is the case where we can neither allocate space from any block
group nor finish any block group. This issue only occurs when allocating a
data extent, because the metadata reservations are guaranteed to fit in
active block groups by the solution for issue A.

In this case, writing out the partially allocated region will close the
gap between the allocation pointer and the capacity of the block group,
finish the zone, and open up room to activate a new block group. So, this
series implements writing out the partially allocated region and retrying
the rest of the allocation.

In certain cases, we can't allocate anything from the block groups. In
that case, we expect on-going IOs to eventually finish a block group. So,
we wait for them and retry the allocation.

** For issue C

Issue C is that we underestimate the number of extents to be written on
zoned btrfs, because we don't expect an ordered extent to be split by the
size of a bio.

We need to use a proper extent size limit to fix issue C. For that, we
revive fs_info->max_zone_append_size and use it in the count_max_extents()
calculation. Technically, the bio size is also limited by max_segments, so
the limit is capped by that as well.
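
The effective cap can be sketched as the minimum of the two device limits,
mirroring the min_t() computation in patch 2 (the queue limit values below
are illustrative, not from a particular device):

  #include <stdio.h>

  int main(void)
  {
          /* Example queue limits as a ZNS device might report them. */
          unsigned long long max_zone_append_sectors = 4096; /* 2 MiB */
          unsigned long long max_segments = 128;
          unsigned long long page_size = 4096;

          unsigned long long by_append = max_zone_append_sectors * 512;
          unsigned long long by_segs = max_segments * page_size;
          unsigned long long max_extent_size =
                  by_append < by_segs ? by_append : by_segs;

          /* Here max_segments is the tighter cap: 512 KiB per extent. */
          printf("max_extent_size = %llu KiB\n", max_extent_size / 1024);
          return 0;
  }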

* Patch structure
 
The fix for issue C comes first because it is a dependency of the fixes for
issue A and B.

Patches 1 to 5 address issue C by reviving fs_info->max_zone_append_size
and using it to replace BTRFS_MAX_EXTENT_SIZE on zoned btrfs.

Patches 6 to 11 address issue A. In detail, patch 7 fixes the data
allocation by finishing a block group when we cannot activate another block
group. Patch 10 fixes the metadata allocation by finishing a block group at
space reservation time.

Patches 12 and 13 address issue B by writing out the successfully
allocated part first and retrying the allocation of the rest.

Naohiro Aota (13):
  block: add bdev_max_segments() helper
  btrfs: zoned: revive max_zone_append_bytes
  btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  btrfs: convert count_max_extents() to use fs_info->max_extent_size
  btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  btrfs: let can_allocate_chunk return int
  btrfs: zoned: finish least available block group on data BG allocation
  btrfs: zoned: introduce space_info->active_total_bytes
  btrfs: zoned: disable metadata overcommit for zoned
  btrfs: zoned: activate metadata BG on flush_space
  btrfs: zoned: activate necessary block group
  btrfs: zoned: write out partially allocated region
  btrfs: zoned: wait until zone is finished when allocation didn't
    progress

 fs/btrfs/block-group.c    |  28 ++++++++-
 fs/btrfs/ctree.h          |  30 ++++++---
 fs/btrfs/delalloc-space.c |   6 +-
 fs/btrfs/disk-io.c        |   3 +
 fs/btrfs/extent-tree.c    |  70 ++++++++++++++++-----
 fs/btrfs/extent_io.c      |   8 ++-
 fs/btrfs/inode.c          |  90 +++++++++++++++++++--------
 fs/btrfs/ioctl.c          |  11 ++--
 fs/btrfs/space-info.c     |  76 ++++++++++++++++++++---
 fs/btrfs/space-info.h     |   4 +-
 fs/btrfs/zoned.c          | 124 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h          |  18 ++++++
 include/linux/blkdev.h    |   5 ++
 13 files changed, 404 insertions(+), 69 deletions(-)

-- 
2.35.1



* [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-09 16:10   ` Jens Axboe
  2022-07-11  7:00   ` Christoph Hellwig
  2022-07-08 23:18 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota, Johannes Thumshirn

Add bdev_max_segments() like the other queue parameter helpers.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 include/linux/blkdev.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f7b43444c5f..62e3ff52ab03 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1206,6 +1206,11 @@ bdev_max_zone_append_sectors(struct block_device *bdev)
 	return queue_max_zone_append_sectors(bdev_get_queue(bdev));
 }
 
+static inline unsigned int bdev_max_segments(struct block_device *bdev)
+{
+	return queue_max_segments(bdev_get_queue(bdev));
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
-- 
2.35.1



* [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
  2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-09 11:34   ` Johannes Thumshirn
  2022-07-08 23:18 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned:
remove max_zone_append_size logic"), but without unnecessary ASSERT and
check. The max_zone_append_size will be used as a hint to estimate the
number of extents to cover delalloc/writeback region in the later commits.

The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit considers it to calculate max_zone_append_size. Technically, a
bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
as an upper limit of an extent size to calculate the number of extents
needed to write data.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/zoned.c | 17 +++++++++++++++++
 fs/btrfs/zoned.h |  1 +
 3 files changed, 20 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4e2569f84aab..e4879912c475 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1071,6 +1071,8 @@ struct btrfs_fs_info {
 	 */
 	u64 zone_size;
 
+	/* Max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;
 	struct mutex zoned_meta_io_lock;
 	spinlock_t treelog_bg_lock;
 	u64 treelog_bg;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 79a2d48a5251..bdc533fa80ae 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -415,6 +415,16 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
 	nr_sectors = bdev_nr_sectors(bdev);
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
 	zone_info->nr_zones = nr_sectors >> ilog2(zone_sectors);
+	/*
+	 * We limit max_zone_append_size also by max_segments *
+	 * PAGE_SIZE. Technically, we can have multiple pages per segment. But,
+	 * since btrfs adds the pages one by one to a bio, and btrfs cannot
+	 * increase the metadata reservation even if it increases the number of
+	 * extents, it is safe to stick with the limit.
+	 */
+	zone_info->max_zone_append_size =
+		min_t(u64, (u64)bdev_max_zone_append_sectors(bdev) << SECTOR_SHIFT,
+		      (u64)bdev_max_segments(bdev) << PAGE_SHIFT);
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
 
@@ -640,6 +650,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 zoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	const bool incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
 
@@ -674,6 +685,11 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -723,6 +739,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}
 
 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	/*
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 6b2eec99162b..9caeab07fd38 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -19,6 +19,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned int max_active_zones;
 	atomic_t active_zones_left;
-- 
2.35.1



* [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
  2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
  2022-07-08 23:18 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-09 11:36   ` Johannes Thumshirn
  2022-07-08 23:18 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

On zoned btrfs, data write-out is limited by max_zone_append_size, and a
large ordered extent is split according to the size of a bio. OTOH, the
number of extents to be written is calculated using BTRFS_MAX_EXTENT_SIZE,
and that estimated number is used to reserve the metadata bytes to update
and/or create the metadata items.

The metadata reservation is done at e.g., btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of
extents increases massively, the reserved metadata can run out.

The increase in the number of extents easily occurs on zoned btrfs if
BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the following
warning on a small-RAM environment with metadata over-commit disabled
(which a following patch does).

[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.524407] Modules linked in: btrfs null_blk blake2b_generic xor
raid6_pq loop dm_flakey dm_mod algif_hash af_alg veth xt_nat xt_conntrack
xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter
iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter
br_netfilter bridge stp llc overlay sunrpc ext4 mbcache jbd2 rapl ipmi_ssif
bfq k10temp i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler zram ip_tables
ccp ast bnxt_en drm_vram_helper drm_ttm_helper pkcs8_key_parser
asn1_decoder public_key oid_registry fuse ipv6 [last unloaded: btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.624255] Code: 83 c0 01 38 d0 7c 0c 84 d2 74 08 4c 89 ff e8 57 59 64
e0 41 0f b7 74 24 62 ba e4 ff ff ff 48 c7 c7 a0 dc 33 a1 e8 c4 58 50 e2
<0f> 0b e9 9c fe ff ff 4d 8d a5 a0 02 00 00 4c 89 e7 e8 aa fb 5f e2
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166]  <TASK>
[75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431]  ? memcpy+0x4e/0x60
[75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
[75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300]  ? lock_downgrade+0x7c0/0x7c0
[75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085]  ? lock_release+0x552/0xf80
[75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886]  ? __kasan_check_write+0x14/0x20
[75721.895152]  ? do_raw_read_unlock+0x44/0x80
[75721.901323]  ? _raw_write_lock_irq+0x60/0x80
[75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166]  ? _raw_write_unlock+0x23/0x40
[75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906]  ? try_to_wake_up+0x30/0x14a0
[75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111]  ? lock_acquire+0x41b/0x4c0
[75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643]  process_one_work+0x815/0x1460
[75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
[75722.013086]  worker_thread+0x59a/0xeb0
[75722.018511]  kthread+0x2ac/0x360
[75722.023428]  ? process_one_work+0x1460/0x1460
[75722.029431]  ? kthread_complete_and_exit+0x30/0x30
[75722.036044]  ret_from_fork+0x22/0x30
[75722.041255]  </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---

To fix the estimation, introduce fs_info->max_extent_size to replace
BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for regular
btrfs vs zoned btrfs.

Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
btrfs, it is set to fs_info->max_zone_append_size.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h     | 3 +++
 fs/btrfs/disk-io.c   | 2 ++
 fs/btrfs/extent_io.c | 8 +++++++-
 fs/btrfs/inode.c     | 6 ++++--
 fs/btrfs/zoned.c     | 2 ++
 5 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e4879912c475..fca253bdb4b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1056,6 +1056,9 @@ struct btrfs_fs_info {
 	u32 csums_per_leaf;
 	u32 stripesize;
 
+	/* Maximum size of an extent. BTRFS_MAX_EXTENT_SIZE on regular btrfs. */
+	u64 max_extent_size;
+
 	/* Block groups and devices containing active swapfiles. */
 	spinlock_t swapfile_pins_lock;
 	struct rb_root swapfile_pins;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 76835394a61b..914557d59472 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3142,6 +3142,8 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	fs_info->sectorsize_bits = ilog2(4096);
 	fs_info->stripesize = 4096;
 
+	fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;
+
 	spin_lock_init(&fs_info->swapfile_pins_lock);
 	fs_info->swapfile_pins = RB_ROOT;
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3194eca41635..cedc94a7d5b2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2021,10 +2021,16 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
 				    struct page *locked_page, u64 *start,
 				    u64 *end)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	const u64 orig_start = *start;
 	const u64 orig_end = *end;
-	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+	/* The sanity tests may not set a valid fs_info. */
+	u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
+#else
+	u64 max_bytes = fs_info->max_extent_size;
+#endif
 	u64 delalloc_start;
 	u64 delalloc_end;
 	bool found;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b9485e19b696..155282dacc6e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2201,6 +2201,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 void btrfs_split_delalloc_extent(struct inode *inode,
 				 struct extent_state *orig, u64 split)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 size;
 
 	/* not delalloc, ignore it */
@@ -2208,7 +2209,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 		return;
 
 	size = orig->end - orig->start + 1;
-	if (size > BTRFS_MAX_EXTENT_SIZE) {
+	if (size > fs_info->max_extent_size) {
 		u32 num_extents;
 		u64 new_size;
 
@@ -2237,6 +2238,7 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 				 struct extent_state *other)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	u64 new_size, old_size;
 	u32 num_extents;
 
@@ -2250,7 +2252,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 		new_size = other->end - new->start + 1;
 
 	/* we're not bigger than the max, unreserve the space and go */
-	if (new_size <= BTRFS_MAX_EXTENT_SIZE) {
+	if (new_size <= fs_info->max_extent_size) {
 		spin_lock(&BTRFS_I(inode)->lock);
 		btrfs_mod_outstanding_extents(BTRFS_I(inode), -1);
 		spin_unlock(&BTRFS_I(inode)->lock);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index bdc533fa80ae..3b45b35aa945 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -741,6 +741,8 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;
 	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
+	if (fs_info->max_zone_append_size < fs_info->max_extent_size)
+		fs_info->max_extent_size = fs_info->max_zone_append_size;
 
 	/*
 	 * Check mount options here, because we might change fs_info->zoned
-- 
2.35.1



* [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (2 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-09 11:37   ` Johannes Thumshirn
  2022-07-08 23:18 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs releases too much of the metadata reservation on
its way to write out the data.

Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h          | 21 +++++++++++++--------
 fs/btrfs/delalloc-space.c |  6 +++---
 fs/btrfs/inode.c          | 16 ++++++++--------
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fca253bdb4b8..c215e15baea2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -107,14 +107,6 @@ struct btrfs_ioctl_encoded_io_args;
 #define BTRFS_STAT_CURR		0
 #define BTRFS_STAT_PREV		1
 
-/*
- * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size
- */
-static inline u32 count_max_extents(u64 size)
-{
-	return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-}
-
 static inline unsigned long btrfs_chunk_item_size(int num_stripes)
 {
 	BUG_ON(num_stripes == 0);
@@ -4057,6 +4049,19 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return fs_info->zone_size > 0;
 }
 
+/*
+ * Count how many fs_info->max_extent_size cover the @size
+ */
+static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
+{
+#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
+	if (!fs_info)
+		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
+#endif
+
+	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
+}
+
 static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
 {
 	return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 36ab0859a263..1e8f17ff829e 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -273,7 +273,7 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 				    u64 num_bytes, u64 disk_num_bytes,
 				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
-	u64 nr_extents = count_max_extents(num_bytes);
+	u64 nr_extents = count_max_extents(fs_info, num_bytes);
 	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
@@ -350,7 +350,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * needs to free the reservation we just made.
 	 */
 	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
+	nr_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
 	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -413,7 +413,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 	unsigned num_extents;
 
 	spin_lock(&inode->lock);
-	num_extents = count_max_extents(num_bytes);
+	num_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 155282dacc6e..8ce937b0b014 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2218,10 +2218,10 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = count_max_extents(new_size);
+		num_extents = count_max_extents(fs_info, new_size);
 		new_size = split - orig->start;
-		num_extents += count_max_extents(new_size);
-		if (count_max_extents(size) >= num_extents)
+		num_extents += count_max_extents(fs_info, new_size);
+		if (count_max_extents(fs_info, size) >= num_extents)
 			return;
 	}
 
@@ -2278,10 +2278,10 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = count_max_extents(old_size);
+	num_extents = count_max_extents(fs_info, old_size);
 	old_size = new->end - new->start + 1;
-	num_extents += count_max_extents(old_size);
-	if (count_max_extents(new_size) >= num_extents)
+	num_extents += count_max_extents(fs_info, old_size);
+	if (count_max_extents(fs_info, new_size) >= num_extents)
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -2360,7 +2360,7 @@ void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
 		u64 len = state->end + 1 - state->start;
-		u32 num_extents = count_max_extents(len);
+		u32 num_extents = count_max_extents(fs_info, len);
 		bool do_list = !btrfs_is_free_space_inode(BTRFS_I(inode));
 
 		spin_lock(&BTRFS_I(inode)->lock);
@@ -2402,7 +2402,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 	struct btrfs_inode *inode = BTRFS_I(vfs_inode);
 	struct btrfs_fs_info *fs_info = btrfs_sb(vfs_inode->i_sb);
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = count_max_extents(len);
+	u32 num_extents = count_max_extents(fs_info, len);
 
 	if ((state->state & EXTENT_DEFRAG) && (bits & EXTENT_DEFRAG)) {
 		spin_lock(&inode->lock);
-- 
2.35.1



* [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (3 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota, Johannes Thumshirn

Use fs_info->max_extent_size also in get_extent_max_capacity() for
completeness. This is only used for defrag and is not strictly necessary
to fix the metadata reservation size. But, it still suppresses unnecessary
defrag operations.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/ioctl.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7e1b4b0fbd6c..37480d4e6443 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1230,16 +1230,18 @@ static struct extent_map *defrag_lookup_extent(struct inode *inode, u64 start,
 	return em;
 }
 
-static u32 get_extent_max_capacity(const struct extent_map *em)
+static u32 get_extent_max_capacity(struct btrfs_fs_info *fs_info,
+				   const struct extent_map *em)
 {
 	if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
 		return BTRFS_MAX_COMPRESSED;
-	return BTRFS_MAX_EXTENT_SIZE;
+	return fs_info->max_extent_size;
 }
 
 static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 				     u32 extent_thresh, u64 newer_than, bool locked)
 {
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 	struct extent_map *next;
 	bool ret = false;
 
@@ -1263,7 +1265,7 @@ static bool defrag_check_next_extent(struct inode *inode, struct extent_map *em,
 	 * If the next extent is at its max capacity, defragging current extent
 	 * makes no sense, as the total number of extents won't change.
 	 */
-	if (next->len >= get_extent_max_capacity(em))
+	if (next->len >= get_extent_max_capacity(fs_info, em))
 		goto out;
 	/* Skip older extent */
 	if (next->generation < newer_than)
@@ -1400,6 +1402,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 				  bool locked, struct list_head *target_list,
 				  u64 *last_scanned_ret)
 {
+	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	bool last_is_target = false;
 	u64 cur = start;
 	int ret = 0;
@@ -1484,7 +1487,7 @@ static int defrag_collect_targets(struct btrfs_inode *inode,
 		 * Skip extents already at its max capacity, this is mostly for
 		 * compressed extents, which max cap is only 128K.
 		 */
-		if (em->len >= get_extent_max_capacity(em))
+		if (em->len >= get_extent_max_capacity(fs_info, em))
 			goto next;
 
 		/*
-- 
2.35.1



* [PATCH 06/13] btrfs: let can_allocate_chunk return int
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (4 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota, Johannes Thumshirn

For a later patch, convert the return type from bool to int. There are no
functional changes.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent-tree.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f97a0f28f464..c8f26ab7fe24 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3965,12 +3965,12 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	}
 }
 
-static bool can_allocate_chunk(struct btrfs_fs_info *fs_info,
-			       struct find_free_extent_ctl *ffe_ctl)
+static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
+			      struct find_free_extent_ctl *ffe_ctl)
 {
 	switch (ffe_ctl->policy) {
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
-		return true;
+		return 0;
 	case BTRFS_EXTENT_ALLOC_ZONED:
 		/*
 		 * If we have enough free space left in an already
@@ -3980,8 +3980,8 @@ static bool can_allocate_chunk(struct btrfs_fs_info *fs_info,
 		 */
 		if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
 		    !btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
-			return false;
-		return true;
+			return -ENOSPC;
+		return 0;
 	default:
 		BUG();
 	}
@@ -4063,8 +4063,9 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 			int exist = 0;
 
 			/*Check if allocation policy allows to create a new chunk */
-			if (!can_allocate_chunk(fs_info, ffe_ctl))
-				return -ENOSPC;
+			ret = can_allocate_chunk(fs_info, ffe_ctl);
+			if (ret)
+				return ret;
 
 			trans = current->journal_info;
 			if (trans)
-- 
2.35.1



* [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (5 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-11  6:45   ` Johannes Thumshirn
  2022-07-08 23:18 ` [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes Naohiro Aota
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

When we run out of active zones and no sufficient space is left in any
block groups, we need to finish one block group to make room to activate a
new block group.

However, we cannot do this for metadata block groups because we can cause a
deadlock by waiting for a running transaction commit. So, do that only for
a data block group.

Furthermore, the block group to be finished has two requirements. First,
the block group must not have reserved bytes left. Having reserved bytes
means we have an allocated region but did not yet send bios for it. If that
region is allocated by the thread calling btrfs_zone_finish(), it results
in a deadlock.

Second, the block group to be finished must not be a SYSTEM block
group. Finishing a SYSTEM block group easily breaks further chunk
allocation by nullifying the SYSTEM free space.

In certain cases, we cannot find any zone finish candidate, or
btrfs_zone_finish() may fail. In that case, we fall back to splitting the
allocation and filling the remaining free space left in the block groups.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 49 +++++++++++++++++++++++++++++++++---------
 fs/btrfs/zoned.c       | 40 ++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  7 ++++++
 3 files changed, 86 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c8f26ab7fe24..5589e04eda0e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3965,6 +3965,44 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl,
 	}
 }
 
+static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
+				    struct find_free_extent_ctl *ffe_ctl)
+{
+	/* If we can activate new zone, just allocate a chunk and use it */
+	if (btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
+		return 0;
+
+	/*
+	 * We already reached the max active zones. Try to finish one block
+	 * group to make room for a new block group. This is only possible for
+	 * a data BG because btrfs_zone_finish() may need to wait for a running
+	 * transaction which can cause a deadlock for metadata allocation.
+	 */
+	if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA) {
+		int ret = btrfs_zone_finish_one_bg(fs_info);
+
+		if (ret == 1)
+			return 0;
+		else if (ret < 0)
+			return ret;
+	}
+
+	/*
+	 * If we have enough free space left in an already active block group
+	 * and we can't activate any other zone now, do not allow allocating a
+	 * new chunk and let find_free_extent() retry with a smaller size.
+	 */
+	if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size)
+		return -ENOSPC;
+
+	/*
+	 * We cannot activate a new block group and there is not enough space
+	 * left in any block group. So, allocating a new block group may not
+	 * help. But, there is nothing to do anyway, so let's go with it.
+	 */
+	return 0;
+}
+
 static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
 			      struct find_free_extent_ctl *ffe_ctl)
 {
@@ -3972,16 +4010,7 @@ static int can_allocate_chunk(struct btrfs_fs_info *fs_info,
 	case BTRFS_EXTENT_ALLOC_CLUSTERED:
 		return 0;
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/*
-		 * If we have enough free space left in an already
-		 * active block group and we can't activate any other
-		 * zone now, do not allow allocating a new chunk and
-		 * let find_free_extent() retry with a smaller size.
-		 */
-		if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size &&
-		    !btrfs_can_activate_zone(fs_info->fs_devices, ffe_ctl->flags))
-			return -ENOSPC;
-		return 0;
+		return can_allocate_chunk_zoned(fs_info, ffe_ctl);
 	default:
 		BUG();
 	}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3b45b35aa945..40ac90272b53 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2179,3 +2179,43 @@ void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logica
 	spin_unlock(&block_group->lock);
 	btrfs_put_block_group(block_group);
 }
+
+int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_block_group *block_group;
+	struct btrfs_block_group *min_bg = NULL;
+	u64 min_avail = U64_MAX;
+	int ret;
+
+	spin_lock(&fs_info->zone_active_bgs_lock);
+	list_for_each_entry(block_group, &fs_info->zone_active_bgs,
+			    active_bg_list) {
+		u64 avail;
+
+		spin_lock(&block_group->lock);
+		if (block_group->reserved ||
+		    (block_group->flags & BTRFS_BLOCK_GROUP_SYSTEM)) {
+			spin_unlock(&block_group->lock);
+			continue;
+		}
+
+		avail = block_group->zone_capacity - block_group->alloc_offset;
+		if (min_avail > avail) {
+			if (min_bg)
+				btrfs_put_block_group(min_bg);
+			min_bg = block_group;
+			min_avail = avail;
+			btrfs_get_block_group(min_bg);
+		}
+		spin_unlock(&block_group->lock);
+	}
+	spin_unlock(&fs_info->zone_active_bgs_lock);
+
+	if (!min_bg)
+		return 0;
+
+	ret = btrfs_zone_finish(min_bg);
+	btrfs_put_block_group(min_bg);
+
+	return ret < 0 ? ret : 1;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 9caeab07fd38..329d28e2fd8d 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -80,6 +80,7 @@ void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
 bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info);
 void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
 				       u64 length);
+int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -249,6 +250,12 @@ static inline bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
 
 static inline void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info,
 						     u64 logical, u64 length) { }
+
+static inline int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
+{
+	return 1;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.35.1



* [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (6 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

The active_total_bytes, like the total_bytes, accounts for the total bytes
of active block groups in the space_info.

With the introduction of active_total_bytes, we can check if the reserved
bytes can be written to the block groups without activating a new block
group. The check is necessary for metadata allocation on zoned btrfs. We
cannot finish a block group, which may require waiting for the current
transaction, from the metadata allocation context. Instead, we need to
ensure the on-going allocation (reserved bytes) fits in the active block
groups.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 12 +++++++++---
 fs/btrfs/space-info.c  | 41 ++++++++++++++++++++++++++++++++---------
 fs/btrfs/space-info.h  |  4 +++-
 fs/btrfs/zoned.c       |  6 ++++++
 4 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index e930749770ac..51e7c1f1d93f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1051,8 +1051,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			< block_group->zone_unusable);
 		WARN_ON(block_group->space_info->disk_total
 			< block_group->length * factor);
+		WARN_ON(block_group->zone_is_active &&
+			block_group->space_info->active_total_bytes
+			< block_group->length);
 	}
 	block_group->space_info->total_bytes -= block_group->length;
+	if (block_group->zone_is_active)
+		block_group->space_info->active_total_bytes -= block_group->length;
 	block_group->space_info->bytes_readonly -=
 		(block_group->length - block_group->zone_unusable);
 	block_group->space_info->bytes_zone_unusable -=
@@ -2107,7 +2112,8 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	trace_btrfs_add_block_group(info, cache, 0);
 	btrfs_update_space_info(info, cache->flags, cache->length,
 				cache->used, cache->bytes_super,
-				cache->zone_unusable, &space_info);
+				cache->zone_unusable, cache->zone_is_active,
+				&space_info);
 
 	cache->space_info = space_info;
 
@@ -2177,7 +2183,7 @@ static int fill_dummy_bgs(struct btrfs_fs_info *fs_info)
 		}
 
 		btrfs_update_space_info(fs_info, bg->flags, em->len, em->len,
-					0, 0, &space_info);
+					0, 0, false, &space_info);
 		bg->space_info = space_info;
 		link_block_group(bg);
 
@@ -2558,7 +2564,7 @@ struct btrfs_block_group *btrfs_make_block_group(struct btrfs_trans_handle *tran
 	trace_btrfs_add_block_group(fs_info, cache, 1);
 	btrfs_update_space_info(fs_info, cache->flags, size, bytes_used,
 				cache->bytes_super, cache->zone_unusable,
-				&cache->space_info);
+				cache->zone_is_active, &cache->space_info);
 	btrfs_update_global_block_rsv(fs_info);
 
 	link_block_group(cache);
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 62d25112310d..b970909c0820 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -295,7 +295,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info)
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
 			     u64 bytes_readonly, u64 bytes_zone_unusable,
-			     struct btrfs_space_info **space_info)
+			     bool active, struct btrfs_space_info **space_info)
 {
 	struct btrfs_space_info *found;
 	int factor;
@@ -306,6 +306,8 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 	ASSERT(found);
 	spin_lock(&found->lock);
 	found->total_bytes += total_bytes;
+	if (active)
+		found->active_total_bytes += total_bytes;
 	found->disk_total += total_bytes * factor;
 	found->bytes_used += bytes_used;
 	found->disk_used += bytes_used * factor;
@@ -369,6 +371,22 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	return avail;
 }
 
+static inline u64 writable_total_bytes(struct btrfs_fs_info *fs_info,
+				       struct btrfs_space_info *space_info)
+{
+	/*
+	 * On regular btrfs, all total_bytes are always writable. On zoned
+	 * btrfs, there may be a limitation imposed by max_active_zones. For
+	 * metadata allocation, we cannot finish an existing active block group
+	 * to avoid a deadlock. Thus, we need to consider only the active groups
+	 * to be writable for metadata space.
+	 */
+	if (!btrfs_is_zoned(fs_info) || (space_info->flags & BTRFS_BLOCK_GROUP_DATA))
+		return space_info->total_bytes;
+
+	return space_info->active_total_bytes;
+}
+
 int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 			 struct btrfs_space_info *space_info, u64 bytes,
 			 enum btrfs_reserve_flush_enum flush)
@@ -383,7 +401,7 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 	used = btrfs_space_info_used(space_info, true);
 	avail = calc_available_free_space(fs_info, space_info, flush);
 
-	if (used + bytes < space_info->total_bytes + avail)
+	if (used + bytes < writable_total_bytes(fs_info, space_info) + avail)
 		return 1;
 	return 0;
 }
@@ -419,7 +437,7 @@ void btrfs_try_granting_tickets(struct btrfs_fs_info *fs_info,
 		ticket = list_first_entry(head, struct reserve_ticket, list);
 
 		/* Check and see if our ticket can be satisfied now. */
-		if ((used + ticket->bytes <= space_info->total_bytes) ||
+		if ((used + ticket->bytes <= writable_total_bytes(fs_info, space_info)) ||
 		    btrfs_can_overcommit(fs_info, space_info, ticket->bytes,
 					 flush)) {
 			btrfs_space_info_update_bytes_may_use(fs_info,
@@ -750,6 +768,7 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 {
 	u64 used;
 	u64 avail;
+	u64 total;
 	u64 to_reclaim = space_info->reclaim_size;
 
 	lockdep_assert_held(&space_info->lock);
@@ -764,8 +783,9 @@ btrfs_calc_reclaim_metadata_size(struct btrfs_fs_info *fs_info,
 	 * space.  If that's the case add in our overage so we make sure to put
 	 * appropriate pressure on the flushing state machine.
 	 */
-	if (space_info->total_bytes + avail < used)
-		to_reclaim += used - (space_info->total_bytes + avail);
+	total = writable_total_bytes(fs_info, space_info);
+	if (total + avail < used)
+		to_reclaim += used - (total + avail);
 
 	return to_reclaim;
 }
@@ -775,9 +795,12 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 {
 	u64 global_rsv_size = fs_info->global_block_rsv.reserved;
 	u64 ordered, delalloc;
-	u64 thresh = div_factor_fine(space_info->total_bytes, 90);
+	u64 total = writable_total_bytes(fs_info, space_info);
+	u64 thresh;
 	u64 used;
 
+	thresh = div_factor_fine(total, 90);
+
 	lockdep_assert_held(&space_info->lock);
 
 	/* If we're just plain full then async reclaim just slows us down. */
@@ -839,8 +862,8 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
 					   BTRFS_RESERVE_FLUSH_ALL);
 	used = space_info->bytes_used + space_info->bytes_reserved +
 	       space_info->bytes_readonly + global_rsv_size;
-	if (used < space_info->total_bytes)
-		thresh += space_info->total_bytes - used;
+	if (used < total)
+		thresh += total - used;
 	thresh >>= space_info->clamp;
 
 	used = space_info->bytes_pinned;
@@ -1557,7 +1580,7 @@ static int __reserve_bytes(struct btrfs_fs_info *fs_info,
 	 * can_overcommit() to ensure we can overcommit to continue.
 	 */
 	if (!pending_tickets &&
-	    ((used + orig_bytes <= space_info->total_bytes) ||
+	    ((used + orig_bytes <= writable_total_bytes(fs_info, space_info)) ||
 	     btrfs_can_overcommit(fs_info, space_info, orig_bytes, flush))) {
 		btrfs_space_info_update_bytes_may_use(fs_info, space_info,
 						      orig_bytes);
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index e7de24a529cf..3cc356a55c53 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -19,6 +19,8 @@ struct btrfs_space_info {
 	u64 bytes_may_use;	/* number of bytes that may be used for
 				   delalloc/allocations */
 	u64 bytes_readonly;	/* total bytes that are read only */
+	u64 active_total_bytes;	/* total bytes in the space, but only accounts
+					   active block groups. */
 	u64 bytes_zone_unusable;	/* total bytes that are unusable until
 					   resetting the device zone */
 
@@ -124,7 +126,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info);
 void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags,
 			     u64 total_bytes, u64 bytes_used,
 			     u64 bytes_readonly, u64 bytes_zone_unusable,
-			     struct btrfs_space_info **space_info);
+			     bool active, struct btrfs_space_info **space_info);
 void btrfs_update_space_info_chunk_size(struct btrfs_space_info *space_info,
 					u64 chunk_size);
 struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info,
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 40ac90272b53..44a4b9e7dae9 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1848,6 +1848,7 @@ struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
 bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 {
 	struct btrfs_fs_info *fs_info = block_group->fs_info;
+	struct btrfs_space_info *space_info = block_group->space_info;
 	struct map_lookup *map;
 	struct btrfs_device *device;
 	u64 physical;
@@ -1859,6 +1860,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	map = block_group->physical_map;
 
+	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
 	if (block_group->zone_is_active) {
 		ret = true;
@@ -1887,7 +1889,10 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 	/* Successfully activated all the zones */
 	block_group->zone_is_active = 1;
+	space_info->active_total_bytes += block_group->length;
 	spin_unlock(&block_group->lock);
+	btrfs_try_granting_tickets(fs_info, space_info);
+	spin_unlock(&space_info->lock);
 
 	/* For the active block group list */
 	btrfs_get_block_group(block_group);
@@ -1900,6 +1905,7 @@ bool btrfs_zone_activate(struct btrfs_block_group *block_group)
 
 out_unlock:
 	spin_unlock(&block_group->lock);
+	spin_unlock(&space_info->lock);
 	return ret;
 }
 
-- 
2.35.1



* [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (7 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota, Johannes Thumshirn

The metadata overcommit makes the space reservation flexible but it is also
harmful to active zone tracking. Since we cannot finish a block group from
the metadata allocation context, we might not activate a new block group
and might not be able to actually write out the overcommit reservations.

So, disable metadata overcommit for zoned btrfs. We will ensure the
reservations are under active_total_bytes in the following patches.

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/space-info.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index b970909c0820..7183a8dc9b34 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -399,7 +399,10 @@ int btrfs_can_overcommit(struct btrfs_fs_info *fs_info,
 		return 0;
 
 	used = btrfs_space_info_used(space_info, true);
-	avail = calc_available_free_space(fs_info, space_info, flush);
+	if (btrfs_is_zoned(fs_info) && (space_info->flags & BTRFS_BLOCK_GROUP_METADATA))
+		avail = 0;
+	else
+		avail = calc_available_free_space(fs_info, space_info, flush);
 
 	if (used + bytes < writable_total_bytes(fs_info, space_info) + avail)
 		return 1;
-- 
2.35.1



* [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (8 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 11/13] btrfs: zoned: activate necessary block group Naohiro Aota
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

For metadata space on zoned btrfs, reaching ALLOC_CHUNK{,_FORCE} means we
don't have enough space left in active_total_bytes. In that case, before
allocating a new chunk, we can try to activate an existing block group.

Also, allocating a chunk is not enough to grant a ticket for metadata space
on zoned btrfs. We need to activate the block group to increase
active_total_bytes.

btrfs_zoned_activate_one_bg() implements the activation. It tries to
activate an inactive block group, finishing another block group first when
necessary to free an active zone. It gives up when it cannot finish any
block group; see the sketch of the caller side below.
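
A condensed, hypothetical sketch of how flush_space() consumes the return
value (the summary comment paraphrases the semantics; it is not the literal
hunk):

	/*
	 * btrfs_zoned_activate_one_bg() returns 1 if it activated a block
	 * group, 0 if there is nothing (more) to activate, and a negative
	 * errno if finishing a block group failed.
	 */
	ret = btrfs_zoned_activate_one_bg(fs_info, space_info, false);
	if (ret != 0)
		break;	/* Activated one, or hit an error. */
	/* Nothing to activate: fall through to chunk allocation. */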

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/space-info.c | 30 ++++++++++++++++++++++++
 fs/btrfs/zoned.c      | 53 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h      | 10 ++++++++
 3 files changed, 93 insertions(+)

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 7183a8dc9b34..b99e3c32c07d 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -9,6 +9,7 @@
 #include "ordered-data.h"
 #include "transaction.h"
 #include "block-group.h"
+#include "zoned.h"
 
 /*
  * HOW DOES SPACE RESERVATION WORK
@@ -724,6 +725,18 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		break;
 	case ALLOC_CHUNK:
 	case ALLOC_CHUNK_FORCE:
+		/*
+		 * For metadata space on zoned btrfs, reaching here means we
+		 * don't have enough space left in active_total_bytes. Try to
+		 * activate a block group first, because we may already have
+		 * an inactive block group allocated.
+		 */
+		ret = btrfs_zoned_activate_one_bg(fs_info, space_info, false);
+		if (ret < 0)
+			break;
+		else if (ret == 1)
+			break;
+
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
 			ret = PTR_ERR(trans);
@@ -734,6 +747,23 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 				(state == ALLOC_CHUNK) ? CHUNK_ALLOC_NO_FORCE :
 					CHUNK_ALLOC_FORCE);
 		btrfs_end_transaction(trans);
+
+		/*
+		 * For metadata space on zoned btrfs, allocating a new chunk is
+		 * not enough. We still need to activate the block group. Activate
+		 * the newly allocated block group by (maybe) finishing a block
+		 * group.
+		 */
+		if (ret == 1) {
+			ret = btrfs_zoned_activate_one_bg(fs_info, space_info, true);
+			/*
+			 * Restore the original ret regardless of whether we
+			 * could finish one block group or not.
+			 */
+			if (ret >= 0)
+				ret = 1;
+		}
+
 		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
 		break;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 44a4b9e7dae9..67098f3fcd14 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2225,3 +2225,56 @@ int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
 
 	return ret < 0 ? ret : 1;
 }
+
+int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+				struct btrfs_space_info *space_info,
+				bool do_finish)
+{
+	struct btrfs_block_group *bg;
+	bool need_finish;
+	int index;
+
+	if (!btrfs_is_zoned(fs_info) || (space_info->flags & BTRFS_BLOCK_GROUP_DATA))
+		return 0;
+
+	/* No more block groups to activate */
+	if (space_info->active_total_bytes == space_info->total_bytes)
+		return 0;
+
+	for (;;) {
+		int ret;
+
+		need_finish = false;
+		down_read(&space_info->groups_sem);
+		for (index = 0; index < BTRFS_NR_RAID_TYPES; index++) {
+			list_for_each_entry(bg, &space_info->block_groups[index], list) {
+				if (!spin_trylock(&bg->lock))
+					continue;
+				if (btrfs_zoned_bg_is_full(bg) || bg->zone_is_active) {
+					spin_unlock(&bg->lock);
+					continue;
+				}
+				spin_unlock(&bg->lock);
+
+				if (btrfs_zone_activate(bg)) {
+					up_read(&space_info->groups_sem);
+					return 1;
+				}
+
+				need_finish = true;
+			}
+		}
+		up_read(&space_info->groups_sem);
+
+		if (!do_finish || !need_finish)
+			break;
+
+		ret = btrfs_zone_finish_one_bg(fs_info);
+		if (ret == 0)
+			break;
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 329d28e2fd8d..f7b0b9035fd6 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -81,6 +81,8 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info);
 void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
 				       u64 length);
 int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info);
+int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+				struct btrfs_space_info *space_info, bool do_finish);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -256,6 +258,14 @@ static inline int btrfs_zone_finish_one_bg(struct btrfs_fs_info *fs_info)
 	return 1;
 }
 
+static inline int btrfs_zoned_activate_one_bg(struct btrfs_fs_info *fs_info,
+					      struct btrfs_space_info *space_info,
+					      bool do_finish)
+{
+	/* Consider all the block groups active */
+	return 0;
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 11/13] btrfs: zoned: activate necessary block group
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (9 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 12/13] btrfs: zoned: write out partially allocated region Naohiro Aota
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

There are two places that try to ensure space by allocating a chunk:
btrfs_inc_block_group_ro() and reserve_chunk_space(). On zoned btrfs,
allocating a chunk is not enough to meet the condition on
active_total_bytes, so we also need to activate a block group there.
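
Both call sites follow the same allocate-then-activate pattern; a condensed
sketch (the error handling labels differ per call site):

	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
	if (ret < 0)
		goto out;
	/* On zoned btrfs, the new chunk is not usable until activated. */
	ret = btrfs_zoned_activate_one_bg(fs_info, space_info, true);
	if (ret < 0)
		goto out;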

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 51e7c1f1d93f..14084da12844 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2664,6 +2664,14 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache,
 	ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
 	if (ret < 0)
 		goto out;
+	/*
+	 * We have allocated a new chunk. We also need to activate that chunk to
+	 * grant metadata tickets for zoned btrfs.
+	 */
+	ret = btrfs_zoned_activate_one_bg(fs_info, cache->space_info, true);
+	if (ret < 0)
+		goto out;
+
 	ret = inc_block_group_ro(cache, 0);
 	if (ret == -ETXTBSY)
 		goto unlock_out;
@@ -3889,6 +3897,14 @@ static void reserve_chunk_space(struct btrfs_trans_handle *trans,
 		if (IS_ERR(bg)) {
 			ret = PTR_ERR(bg);
 		} else {
+			/*
+			 * We have a new chunk. We also need to activate it for
+			 * zoned btrfs.
+			 */
+			ret = btrfs_zoned_activate_one_bg(fs_info, info, true);
+			if (ret < 0)
+				return;
+
 			/*
 			 * If we fail to add the chunk item here, we end up
 			 * trying again at phase 2 of chunk allocation, at
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 12/13] btrfs: zoned: write out partially allocated region
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (10 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 11/13] btrfs: zoned: activate necessary block group Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-08 23:18 ` [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress Naohiro Aota
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

cow_file_range() works in an all-or-nothing way: if it fails to allocate an
extent for a part of the given region, it gives up the entire region,
including the successfully allocated parts. Accordingly,
run_delalloc_zoned() writes out data for the region only when
cow_file_range() successfully allocates the entire region.

This all-or-nothing allocation and write-out is problematic when the
available space in all the block groups gets tight under the active zone
restriction. btrfs_reserve_extent() tries hard to use the space left in the
active block groups, but finally gives up and fails with -ENOSPC. However,
if we send IOs for the successfully allocated region, we can finish a zone
and continue the rest of the allocation in a newly allocated block group.

This patch implements partial write-out for run_delalloc_zoned(). With this
patch applied, cow_file_range() returns -EAGAIN to tell the caller to do
something to make further allocation progress, and reports the successfully
allocated region via done_offset. Furthermore, the zoned extent allocator
returns -EAGAIN to make cow_file_range() return to its caller.

We still need to wait for the IOs to complete before continuing the
allocation; the next patch implements that part. A walk-through of the
resulting retry loop follows.
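
Consider a hypothetical 10 MiB range of which only the first 4 MiB is
allocatable (all numbers are made up):

	/*
	 * run_delalloc_zoned(inode, page, start, start + 10M - 1, ...)
	 *   cow_file_range(..., &done_offset)
	 *     allocates [start, start + 4M), then the allocator returns
	 *     -EAGAIN; *done_offset = start + 4M - 1
	 *   extent_write_locked_range(start, done_offset)
	 *     submits the 4 MiB that was allocated; completing these
	 *     writes may let a zone be finished, freeing an active zone
	 *   start = done_offset + 1; loop to allocate the remaining 6 MiB
	 */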

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent-tree.c | 10 +++++++
 fs/btrfs/inode.c       | 63 ++++++++++++++++++++++++++++++++----------
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5589e04eda0e..1b29b16f6736 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3995,6 +3995,16 @@ static int can_allocate_chunk_zoned(struct btrfs_fs_info *fs_info,
 	if (ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size)
 		return -ENOSPC;
 
+	/*
+	 * Not even min_alloc_size is left in any block group. Since we cannot
+	 * activate a new block group, allocating a chunk may not help. Tell the
+	 * caller to try again and hope it makes progress by writing out some
+	 * parts of the region. That is only possible for data block groups,
+	 * where a part of the region can be written out.
+	 */
+	if (ffe_ctl->flags & BTRFS_BLOCK_GROUP_DATA)
+		return -EAGAIN;
+
 	/*
 	 * We cannot activate a new block group and no enough space left in any
 	 * block groups. So, allocating a new block group may not help. But,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ce937b0b014..681e2cb4dd9c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -117,7 +117,8 @@ static int btrfs_truncate(struct inode *inode, bool skip_writeback);
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   u64 *done_offset);
 static struct extent_map *create_io_em(struct btrfs_inode *inode, u64 start,
 				       u64 len, u64 orig_start, u64 block_start,
 				       u64 block_len, u64 orig_block_len,
@@ -921,7 +922,7 @@ static int submit_uncompressed_range(struct btrfs_inode *inode,
 	 * can directly submit them without interruption.
 	 */
 	ret = cow_file_range(inode, locked_page, start, end, &page_started,
-			     &nr_written, 0);
+			     &nr_written, 0, NULL);
 	/* Inline extent inserted, page gets unlocked and everything is done */
 	if (page_started) {
 		ret = 0;
@@ -1170,7 +1171,8 @@ static u64 get_extent_allocation_hint(struct btrfs_inode *inode, u64 start,
 static noinline int cow_file_range(struct btrfs_inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock)
+				   unsigned long *nr_written, int unlock,
+				   u64 *done_offset)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -1363,6 +1365,21 @@ static noinline int cow_file_range(struct btrfs_inode *inode,
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
 out_unlock:
+	/*
+	 * If done_offset is non-NULL and ret == -EAGAIN, we expect the
+	 * caller to write out the successfully allocated region and retry.
+	 */
+	if (done_offset && ret == -EAGAIN) {
+		if (orig_start < start)
+			*done_offset = start - 1;
+		else
+			*done_offset = start;
+		return ret;
+	} else if (ret == -EAGAIN) {
+		/* Convert to -ENOSPC since the caller cannot retry. */
+		ret = -ENOSPC;
+	}
+
 	/*
 	 * Now, we have three regions to clean up:
 	 *
@@ -1608,19 +1625,37 @@ static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
 				       u64 end, int *page_started,
 				       unsigned long *nr_written)
 {
+	u64 done_offset = end;
 	int ret;
+	bool locked_page_done = false;
 
-	ret = cow_file_range(inode, locked_page, start, end, page_started,
-			     nr_written, 0);
-	if (ret)
-		return ret;
+	while (start <= end) {
+		ret = cow_file_range(inode, locked_page, start, end, page_started,
+				     nr_written, 0, &done_offset);
+		if (ret && ret != -EAGAIN)
+			return ret;
 
-	if (*page_started)
-		return 0;
+		if (*page_started) {
+			ASSERT(ret == 0);
+			return 0;
+		}
+
+		if (ret == 0)
+			done_offset = end;
+
+		if (done_offset == start)
+			return -ENOSPC;
+
+		if (!locked_page_done) {
+			__set_page_dirty_nobuffers(locked_page);
+			account_page_redirty(locked_page);
+		}
+		locked_page_done = true;
+		extent_write_locked_range(&inode->vfs_inode, start, done_offset);
+
+		start = done_offset + 1;
+	}
 
-	__set_page_dirty_nobuffers(locked_page);
-	account_page_redirty(locked_page);
-	extent_write_locked_range(&inode->vfs_inode, start, end);
 	*page_started = 1;
 
 	return 0;
@@ -1712,7 +1747,7 @@ static int fallback_to_cow(struct btrfs_inode *inode, struct page *locked_page,
 	}
 
 	return cow_file_range(inode, locked_page, start, end, page_started,
-			      nr_written, 1);
+			      nr_written, 1, NULL);
 }
 
 struct can_nocow_file_extent_args {
@@ -2185,7 +2220,7 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page
 						 page_started, nr_written);
 		else
 			ret = cow_file_range(inode, locked_page, start, end,
-					     page_started, nr_written, 1);
+					     page_started, nr_written, 1, NULL);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags);
 		ret = cow_file_range_async(inode, wbc, locked_page, start, end,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (11 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 12/13] btrfs: zoned: write out partially allocated region Naohiro Aota
@ 2022-07-08 23:18 ` Naohiro Aota
  2022-07-11 20:29 ` [PATCH v2 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
  2022-07-12 20:32 ` David Sterba
  14 siblings, 0 replies; 25+ messages in thread
From: Naohiro Aota @ 2022-07-08 23:18 UTC (permalink / raw)
  To: linux-btrfs; +Cc: linux-block, Naohiro Aota

When the allocation position does not progress, we cannot submit IOs to
finish a block group ourselves, but there should be ongoing IOs that will
eventually finish a block group. So, in that case, wait for a zone to be
finished and retry the allocation afterward.

Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH in fs_info->flags to
indicate that a zone finish has to happen before the next allocation can
proceed. The flag is set when the allocator detects that it cannot activate
a new block group, and it is cleared once a zone is finished, as sketched
below.
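
The waiter and the waker pair up as follows (condensed from the diff below;
wait_var_event() must be paired with wake_up_var() on the same address):

	/* Allocator side, run_delalloc_zoned(), when no progress was made: */
	wait_var_event(&fs_info->zone_finish_wait,
		       !test_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags));

	/* Zone finish side, do_zone_finish(): */
	clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
	wake_up_var(&fs_info->zone_finish_wait);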

CC: stable@vger.kernel.org # 5.16+
Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h   | 4 ++++
 fs/btrfs/disk-io.c | 1 +
 fs/btrfs/inode.c   | 9 +++++++--
 fs/btrfs/zoned.c   | 6 ++++++
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c215e15baea2..ddecd92fa848 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -638,6 +638,9 @@ enum {
 	/* Indicate we have half completed snapshot deletions pending. */
 	BTRFS_FS_UNFINISHED_DROPS,
 
+	/* Indicate we have to finish a zone before the next allocation. */
+	BTRFS_FS_NEED_ZONE_FINISH,
+
 #if BITS_PER_LONG == 32
 	/* Indicate if we have error/warn message printed on 32bit systems */
 	BTRFS_FS_32BIT_ERROR,
@@ -1084,6 +1087,7 @@ struct btrfs_fs_info {
 
 	spinlock_t zone_active_bgs_lock;
 	struct list_head zone_active_bgs;
+	wait_queue_head_t zone_finish_wait;
 
 	/* Updates are not protected by any lock */
 	struct btrfs_commit_stats commit_stats;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 914557d59472..1fe5f79770a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3135,6 +3135,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
 	init_waitqueue_head(&fs_info->transaction_blocked_wait);
 	init_waitqueue_head(&fs_info->async_submit_wait);
 	init_waitqueue_head(&fs_info->delayed_iputs_wait);
+	init_waitqueue_head(&fs_info->zone_finish_wait);
 
 	/* Usable values until the real ones are cached from the superblock */
 	fs_info->nodesize = 4096;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 681e2cb4dd9c..815121350d91 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1643,8 +1643,13 @@ static noinline int run_delalloc_zoned(struct btrfs_inode *inode,
 		if (ret == 0)
 			done_offset = end;
 
-		if (done_offset == start)
-			return -ENOSPC;
+		if (done_offset == start) {
+			struct btrfs_fs_info *info = inode->root->fs_info;
+
+			wait_var_event(&info->zone_finish_wait,
+				       !test_bit(BTRFS_FS_NEED_ZONE_FINISH, &info->flags));
+			continue;
+		}
 
 		if (!locked_page_done) {
 			__set_page_dirty_nobuffers(locked_page);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 67098f3fcd14..471d870875ed 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2006,6 +2006,9 @@ static int do_zone_finish(struct btrfs_block_group *block_group, bool fully_writ
 	/* For active_bg_list */
 	btrfs_put_block_group(block_group);
 
+	clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
+	wake_up_var(&fs_info->zone_finish_wait);
+
 	return 0;
 }
 
@@ -2042,6 +2045,9 @@ bool btrfs_can_activate_zone(struct btrfs_fs_devices *fs_devices, u64 flags)
 	}
 	mutex_unlock(&fs_info->chunk_mutex);
 
+	if (!ret)
+		set_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
+
 	return ret;
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes
  2022-07-08 23:18 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
@ 2022-07-09 11:34   ` Johannes Thumshirn
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2022-07-09 11:34 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-08 23:18 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
@ 2022-07-09 11:36   ` Johannes Thumshirn
  2022-07-11 15:30     ` David Sterba
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Thumshirn @ 2022-07-09 11:36 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block

On 09.07.22 01:21, Naohiro Aota wrote:
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 3194eca41635..cedc94a7d5b2 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2021,10 +2021,16 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
>  				    struct page *locked_page, u64 *start,
>  				    u64 *end)
>  {
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
>  	const u64 orig_start = *start;
>  	const u64 orig_end = *end;
> -	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +	/* The sanity tests may not set a valid fs_info. */
> +	u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
> +#else
> +	u64 max_bytes = fs_info->max_extent_size;
> +#endif

Do we really need the ifdef here? I don't think there will be much of a
performance penalty from the one compare that we save with the ifdef.

Otherwise
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-08 23:18 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
@ 2022-07-09 11:37   ` Johannes Thumshirn
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2022-07-09 11:37 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block

On 09.07.22 01:21, Naohiro Aota wrote:
> +/*
> + * Count how many fs_info->max_extent_size cover the @size
> + */
> +static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
> +{
> +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> +	if (!fs_info)
> +		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
> +#endif

	if (IS_ENABLED(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) && !fs_info) ?
> +
> +	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
> +}


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
@ 2022-07-09 16:10   ` Jens Axboe
  2022-07-11  7:00   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2022-07-09 16:10 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block, Johannes Thumshirn

On 7/8/22 5:18 PM, Naohiro Aota wrote:
> Add bdev_max_segments() like other queue parameters.

Reviewed-by: Jens Axboe <axboe@kernel.dk>

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation
  2022-07-08 23:18 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
@ 2022-07-11  6:45   ` Johannes Thumshirn
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2022-07-11  6:45 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs; +Cc: linux-block

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 01/13] block: add bdev_max_segments() helper
  2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
  2022-07-09 16:10   ` Jens Axboe
@ 2022-07-11  7:00   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2022-07-11  7:00 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, linux-block, Johannes Thumshirn

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
  2022-07-09 11:36   ` Johannes Thumshirn
@ 2022-07-11 15:30     ` David Sterba
  0 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2022-07-11 15:30 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Naohiro Aota, linux-btrfs, linux-block

On Sat, Jul 09, 2022 at 11:36:45AM +0000, Johannes Thumshirn wrote:
> On 09.07.22 01:21, Naohiro Aota wrote:
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 3194eca41635..cedc94a7d5b2 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2021,10 +2021,16 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
> >  				    struct page *locked_page, u64 *start,
> >  				    u64 *end)
> >  {
> > +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> >  	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
> >  	const u64 orig_start = *start;
> >  	const u64 orig_end = *end;
> > -	u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
> > +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> > +	/* The sanity tests may not set a valid fs_info. */
> > +	u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
> > +#else
> > +	u64 max_bytes = fs_info->max_extent_size;
> > +#endif
> 
> Do we really need the ifdef here? I don't think there will be much of a
> performance penalty from the one compare that we save with the ifdef.

You're right that performance-wise it does not make much of an improvement.
However, with the explicit ifdef it's clear that it's there for tests;
otherwise the conditional fs_info check would be hard to spot, whereas a
search for CONFIG_BTRFS_FS_RUN_SANITY_TESTS finds it. On the other hand,
the whole function is EXPORT_FOR_TESTS, so we can expect exceptions for
tests. So this could be the pattern to follow should we need it in the
future (i.e., no ifdef).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/13] btrfs: zoned: fix active zone tracking issues
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (12 preceding siblings ...)
  2022-07-08 23:18 ` [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress Naohiro Aota
@ 2022-07-11 20:29 ` David Sterba
  2022-07-12 20:32 ` David Sterba
  14 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2022-07-11 20:29 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, linux-block

On Sat, Jul 09, 2022 at 08:18:37AM +0900, Naohiro Aota wrote:
> This series addresses mainly two issues on zoned btrfs' active zone
> tracking and one issue which is a dependency of the main issue.
> 
> * ChangeLog
> - v2
>   - Support sanity tests (Johannes)
>     - fs_info can be NULL while it is running sanity tests. Consider that
>       case in CONFIG_FS_BTRFS_RUN_SANITY_TESTS.
>   - Propagete an error of btrfs_zone_finish() (Johannes)
>   - Add a comment to max_segments limitation (Christoph)
>   - Rename btrfs_finish_one_bg() to btrfs_zone_finish_one_bg() to make the
>     it clear it is related to zoned code.
>   - Do not reduce active_total_bytes when finishing a block group.
>     - While it's no longer active, but it still can have "used" bytes. So,
>       it should be counted to host "total_bytes". Or, it breaks free space
>       calculation.
>   - Do not try to activate a fully allocated block group.

The self tests are now working and I haven't seen any new errors in
fstests, so I'll add the branch to misc-next soon. This will probably be
the last big patchset before code freeze.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 00/13] btrfs: zoned: fix active zone tracking issues
  2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
                   ` (13 preceding siblings ...)
  2022-07-11 20:29 ` [PATCH v2 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
@ 2022-07-12 20:32 ` David Sterba
  14 siblings, 0 replies; 25+ messages in thread
From: David Sterba @ 2022-07-12 20:32 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs, linux-block

On Sat, Jul 09, 2022 at 08:18:37AM +0900, Naohiro Aota wrote:
> This series addresses mainly two issues on zoned btrfs' active zone

Btw, I have one minor remark that applies to many patches: please don't use
'zoned btrfs' when referring to it; use either 'zoned filesystem' or 'zoned
mode'. I've fixed it where I noticed, but in the future it would be good to
have it in the patches from the beginning. Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-04  4:58 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
@ 2022-07-04  7:56   ` Johannes Thumshirn
  0 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2022-07-04  7:56 UTC (permalink / raw)
  To: Naohiro Aota, linux-btrfs

On 04.07.22 06:59, Naohiro Aota wrote:
> +/*
> + * Count how many fs_info->max_extent_size cover the @size
> + */
> +static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
> +{
> +	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
> +}
> +


For the record, Naohiro and I just discussed this offline.
count_max_extents() needs to check for a possible NULL fs_info
because of the selftest code.


static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
{
	if (IS_ENABLED(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) && !fs_info)
		return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);

	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size
  2022-07-04  4:58 [PATCH " Naohiro Aota
@ 2022-07-04  4:58 ` Naohiro Aota
  2022-07-04  7:56   ` Johannes Thumshirn
  0 siblings, 1 reply; 25+ messages in thread
From: Naohiro Aota @ 2022-07-04  4:58 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs releases too much of the metadata reservation on
its way to writing out the data.

Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.
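
As a hypothetical example (sizes made up):

	/*
	 * size = 10 MiB of dirtied data:
	 *
	 *   with BTRFS_MAX_EXTENT_SIZE = 128 MiB:
	 *     count_max_extents() = DIV_ROUND_UP(10M, 128M) = 1 extent
	 *
	 *   on a zoned device with fs_info->max_extent_size = 4 MiB:
	 *     the data is actually split into DIV_ROUND_UP(10M, 4M) = 3
	 *     extents, so reserving (and later releasing) metadata as if
	 *     there were only 1 extent frees the reservation too early.
	 */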

CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/ctree.h          | 16 ++++++++--------
 fs/btrfs/delalloc-space.c |  6 +++---
 fs/btrfs/inode.c          | 16 ++++++++--------
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fca253bdb4b8..4aac7df5a17d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -107,14 +107,6 @@ struct btrfs_ioctl_encoded_io_args;
 #define BTRFS_STAT_CURR		0
 #define BTRFS_STAT_PREV		1
 
-/*
- * Count how many BTRFS_MAX_EXTENT_SIZE cover the @size
- */
-static inline u32 count_max_extents(u64 size)
-{
-	return div_u64(size + BTRFS_MAX_EXTENT_SIZE - 1, BTRFS_MAX_EXTENT_SIZE);
-}
-
 static inline unsigned long btrfs_chunk_item_size(int num_stripes)
 {
 	BUG_ON(num_stripes == 0);
@@ -4057,6 +4049,14 @@ static inline bool btrfs_is_zoned(const struct btrfs_fs_info *fs_info)
 	return fs_info->zone_size > 0;
 }
 
+/*
+ * Count how many fs_info->max_extent_size cover the @size
+ */
+static inline u32 count_max_extents(struct btrfs_fs_info *fs_info, u64 size)
+{
+	return div_u64(size + fs_info->max_extent_size - 1, fs_info->max_extent_size);
+}
+
 static inline bool btrfs_is_data_reloc_root(const struct btrfs_root *root)
 {
 	return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID;
diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
index 36ab0859a263..1e8f17ff829e 100644
--- a/fs/btrfs/delalloc-space.c
+++ b/fs/btrfs/delalloc-space.c
@@ -273,7 +273,7 @@ static void calc_inode_reservations(struct btrfs_fs_info *fs_info,
 				    u64 num_bytes, u64 disk_num_bytes,
 				    u64 *meta_reserve, u64 *qgroup_reserve)
 {
-	u64 nr_extents = count_max_extents(num_bytes);
+	u64 nr_extents = count_max_extents(fs_info, num_bytes);
 	u64 csum_leaves = btrfs_csum_bytes_to_leaves(fs_info, disk_num_bytes);
 	u64 inode_update = btrfs_calc_metadata_size(fs_info, 1);
 
@@ -350,7 +350,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes,
 	 * needs to free the reservation we just made.
 	 */
 	spin_lock(&inode->lock);
-	nr_extents = count_max_extents(num_bytes);
+	nr_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, nr_extents);
 	inode->csum_bytes += disk_num_bytes;
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
@@ -413,7 +413,7 @@ void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes)
 	unsigned num_extents;
 
 	spin_lock(&inode->lock);
-	num_extents = count_max_extents(num_bytes);
+	num_extents = count_max_extents(fs_info, num_bytes);
 	btrfs_mod_outstanding_extents(inode, -num_extents);
 	btrfs_calculate_inode_block_rsv_size(fs_info, inode);
 	spin_unlock(&inode->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 74ac7ef69a3f..357322da51b5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2218,10 +2218,10 @@ void btrfs_split_delalloc_extent(struct inode *inode,
 		 * applies here, just in reverse.
 		 */
 		new_size = orig->end - split + 1;
-		num_extents = count_max_extents(new_size);
+		num_extents = count_max_extents(fs_info, new_size);
 		new_size = split - orig->start;
-		num_extents += count_max_extents(new_size);
-		if (count_max_extents(size) >= num_extents)
+		num_extents += count_max_extents(fs_info, new_size);
+		if (count_max_extents(fs_info, size) >= num_extents)
 			return;
 	}
 
@@ -2278,10 +2278,10 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new,
 	 * this case.
 	 */
 	old_size = other->end - other->start + 1;
-	num_extents = count_max_extents(old_size);
+	num_extents = count_max_extents(fs_info, old_size);
 	old_size = new->end - new->start + 1;
-	num_extents += count_max_extents(old_size);
-	if (count_max_extents(new_size) >= num_extents)
+	num_extents += count_max_extents(fs_info, old_size);
+	if (count_max_extents(fs_info, new_size) >= num_extents)
 		return;
 
 	spin_lock(&BTRFS_I(inode)->lock);
@@ -2360,7 +2360,7 @@ void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state,
 	if (!(state->state & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
 		struct btrfs_root *root = BTRFS_I(inode)->root;
 		u64 len = state->end + 1 - state->start;
-		u32 num_extents = count_max_extents(len);
+		u32 num_extents = count_max_extents(fs_info, len);
 		bool do_list = !btrfs_is_free_space_inode(BTRFS_I(inode));
 
 		spin_lock(&BTRFS_I(inode)->lock);
@@ -2402,7 +2402,7 @@ void btrfs_clear_delalloc_extent(struct inode *vfs_inode,
 	struct btrfs_inode *inode = BTRFS_I(vfs_inode);
 	struct btrfs_fs_info *fs_info = btrfs_sb(vfs_inode->i_sb);
 	u64 len = state->end + 1 - state->start;
-	u32 num_extents = count_max_extents(len);
+	u32 num_extents = count_max_extents(fs_info, len);
 
 	if ((state->state & EXTENT_DEFRAG) && (bits & EXTENT_DEFRAG)) {
 		spin_lock(&inode->lock);
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-07-12 20:37 UTC | newest]

Thread overview: 25+ messages
2022-07-08 23:18 [PATCH 00/13] btrfs: zoned: fix active zone tracking issues Naohiro Aota
2022-07-08 23:18 ` [PATCH 01/13] block: add bdev_max_segments() helper Naohiro Aota
2022-07-09 16:10   ` Jens Axboe
2022-07-11  7:00   ` Christoph Hellwig
2022-07-08 23:18 ` [PATCH 02/13] btrfs: zoned: revive max_zone_append_bytes Naohiro Aota
2022-07-09 11:34   ` Johannes Thumshirn
2022-07-08 23:18 ` [PATCH 03/13] btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size Naohiro Aota
2022-07-09 11:36   ` Johannes Thumshirn
2022-07-11 15:30     ` David Sterba
2022-07-08 23:18 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
2022-07-09 11:37   ` Johannes Thumshirn
2022-07-08 23:18 ` [PATCH 05/13] btrfs: use fs_info->max_extent_size in get_extent_max_capacity() Naohiro Aota
2022-07-08 23:18 ` [PATCH 06/13] btrfs: let can_allocate_chunk return int Naohiro Aota
2022-07-08 23:18 ` [PATCH 07/13] btrfs: zoned: finish least available block group on data BG allocation Naohiro Aota
2022-07-11  6:45   ` Johannes Thumshirn
2022-07-08 23:18 ` [PATCH 08/13] btrfs: zoned: introduce space_info->active_total_bytes Naohiro Aota
2022-07-08 23:18 ` [PATCH 09/13] btrfs: zoned: disable metadata overcommit for zoned Naohiro Aota
2022-07-08 23:18 ` [PATCH 10/13] btrfs: zoned: activate metadata BG on flush_space Naohiro Aota
2022-07-08 23:18 ` [PATCH 11/13] btrfs: zoned: activate necessary block group Naohiro Aota
2022-07-08 23:18 ` [PATCH 12/13] btrfs: zoned: write out partially allocated region Naohiro Aota
2022-07-08 23:18 ` [PATCH 13/13] btrfs: zoned: wait until zone is finished when allocation didn't progress Naohiro Aota
2022-07-11 20:29 ` [PATCH v2 00/13] btrfs: zoned: fix active zone tracking issues David Sterba
2022-07-12 20:32 ` David Sterba
  -- strict thread matches above, loose matches on Subject: below --
2022-07-04  4:58 [PATCH " Naohiro Aota
2022-07-04  4:58 ` [PATCH 04/13] btrfs: convert count_max_extents() to use fs_info->max_extent_size Naohiro Aota
2022-07-04  7:56   ` Johannes Thumshirn
