All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] btrfs: zoned: fixes for data relocation
@ 2022-06-06 15:59 Naohiro Aota
  2022-06-06 15:59 ` [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG Naohiro Aota
  2022-06-06 15:59 ` [PATCH 2/2] btrfs: zoned: fix critical section of relocation inode writeback Naohiro Aota
  0 siblings, 2 replies; 5+ messages in thread
From: Naohiro Aota @ 2022-06-06 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

There are two long-standing potential bugs in the data relocation path of
zoned btrfs. They are recently revealed by commit 5f0addf7b890 ("btrfs:
zoned: use dedicated lock for data relocation"). One is a mixed issue of
WRITE (for relocation extents) and ZONE APPEND (for regular extent) at the
same time, which confuses the write pointer. The other one is a too short
critical section, which can cause an out-of-order issue of the IOs.

Actually, these bugs are easily reproducible with a smaller zone size (e.g,
128 MB) with fstests btrfs/232. For example, IO failures occurs like this:

  [99909.031820][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
  <snip>
  [99909.268769][T4038707] Call Trace:
  [99909.272105][T4038707]  <TASK>
  [99909.275093][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
  [99909.280996][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
  [99909.286622][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
  [99909.292333][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
  [99909.298825][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
  [99909.305222][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
  [99909.310844][T4038707]  ? lock_downgrade+0x6a0/0x6a0
  [99909.315732][T4038707]  ? orc_find.part.0+0x1ed/0x300
  [99909.320705][T4038707]  ? __module_address.part.0+0x25/0x300
  [99909.326280][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
  <snip>
  [99909.883814][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
  [99909.893855][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
  [99909.901819][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
  [99909.909525][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
  [99909.919544][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
  [99909.930329][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0

Or, an assertion failure occur like this:

  [   12.527832] assertion failed: start >= found_start && end <= found_end, in fs/btrfs/free-space-tree.c:737
  <snip>
  [   12.533391] Call Trace:
  [   12.533391]  <TASK>
  [   12.533391]  __remove_from_free_space_tree.cold+0x11/0x22 [btrfs]
  [   12.542073]  ? setup_items_for_insert.isra.0+0x2bf/0x3f0 [btrfs]
  [   12.542073]  remove_from_free_space_tree+0x80/0x110 [btrfs]
  [   12.542073]  alloc_reserved_file_extent+0x1b4/0x240 [btrfs]
  [   12.542073]  __btrfs_run_delayed_refs+0x692/0xf30 [btrfs]
  [   12.542073]  ? btrfs_btree_balance_dirty+0x2f/0x50 [btrfs]
  [   12.542073]  btrfs_run_delayed_refs+0x81/0x1e0 [btrfs]
  [   12.542073]  btrfs_commit_transaction+0x54/0xaf0 [btrfs]
  [   12.542073]  ? start_transaction+0xc2/0x5b0 [btrfs]
  [   12.542073]  ? _raw_read_lock_irqsave+0x20/0x40
  [   12.542073]  relocate_block_group+0x320/0x550 [btrfs]
  [   12.542073]  btrfs_relocate_block_group+0x1f9/0x3a0 [btrfs]
  [   12.542073]  btrfs_relocate_chunk+0x36/0xf0 [btrfs]
  [   12.542073]  btrfs_reclaim_bgs_work.cold+0x4f/0x74 [btrfs]
  [   12.542073]  process_one_work+0x1b0/0x310
  [   12.542073]  worker_thread+0x48/0x3d0
  [   12.542073]  ? rescuer_thread+0x3a0/0x3a0
  [   12.542073]  kthread+0xed/0x120
  [   12.550506]  ? kthread_complete_and_exit+0x20/0x20
  [   12.550506]  ret_from_fork+0x22/0x30
  [   12.550506]  </TASK>

This series fixes the two issues. The first one is fixed by introducing a
new btrfs_block_group bit to disallow extent allocation but still allow
nocow writes to start.

The second one is simply fixed by extending the critical section.

Naohiro Aota (2):
  btrfs: zoned: prevent allocation from previous data relocation BG
  btrfs: zoned: fix critical section of relocation inode writeback

 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
 fs/btrfs/extent_io.c   |  3 ++-
 fs/btrfs/inode.c       |  2 ++
 fs/btrfs/zoned.c       | 27 +++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  5 +++++
 6 files changed, 55 insertions(+), 3 deletions(-)

-- 
2.35.1


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG
  2022-06-06 15:59 [PATCH 0/2] btrfs: zoned: fixes for data relocation Naohiro Aota
@ 2022-06-06 15:59 ` Naohiro Aota
  2022-06-06 17:40   ` David Sterba
  2022-06-06 15:59 ` [PATCH 2/2] btrfs: zoned: fix critical section of relocation inode writeback Naohiro Aota
  1 sibling, 1 reply; 5+ messages in thread
From: Naohiro Aota @ 2022-06-06 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

After commit 5f0addf7b890 ("btrfs: zoned: use dedicated lock for data
relocation"), we observe IO errors on e.g, btrfs/232 like below.

   [99909.031820][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
   <snip>
   [99909.268769][T4038707] Call Trace:
   [99909.272105][T4038707]  <TASK>
   [99909.275093][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
   [99909.280996][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
   [99909.286622][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
   [99909.292333][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
   [99909.298825][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
   [99909.305222][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
   [99909.310844][T4038707]  ? lock_downgrade+0x6a0/0x6a0
   [99909.315732][T4038707]  ? orc_find.part.0+0x1ed/0x300
   [99909.320705][T4038707]  ? __module_address.part.0+0x25/0x300
   [99909.326280][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
   <snip>
   [99909.883814][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
   [99909.893855][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
   [99909.901819][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
   [99909.909525][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
   [99909.919544][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
   [99909.930329][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0

The IO errors occur when we allocate a regular extent in a previous data
relocation block group.

On zoned btrfs, we use a dedicated block group to relocate a data
extent. Thus, we allocate relocating data extents (pre-alloc) only from the
dedicated block group and vice versa. Once the free space in the dedicated
block group gets tight, a relocating extent may not fit into the block
group. In that case, we need to switch the dedicated block group to the
next one. Then, the previous one is now freed up for allocating a regular
extent. The BG is already not enough to allocate the relocating extent, but
there is still room to allocate a smaller extent. Now the problem
happens. By allocating a regular extent while nocow IOs for the relocation
is still on-going, we will issue WRITE IOs (for relocation) and ZONE APPEND
IOs (for the regular writes) at the same time. That mixed IOs confuses the
write pointer and arises the unaligned write errors.

This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
btrfs_block_group. We set this bit before releasing the dedicated block
group, and no extent are allocated from a block group having this bit
set. This bit is similar to setting block_group->ro, but is different from
it by allowing nocow writes to start.

Once all the nocow IOs for relocation is done (hooked from
btrfs_finish_ordered_io), we reset the bit to release the block group for
further allocation.

Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
 fs/btrfs/inode.c       |  2 ++
 fs/btrfs/zoned.c       | 27 +++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  5 +++++
 5 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 3ac668ace50a..5787b3dd759c 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -104,6 +104,7 @@ struct btrfs_block_group {
 	unsigned int relocating_repair:1;
 	unsigned int chunk_item_inserted:1;
 	unsigned int zone_is_active:1;
+	unsigned int zoned_data_reloc_ongoing;
 
 	int disk_cache_state;
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fb367689d9d2..abf7dc438409 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3832,7 +3832,7 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	       block_group->start == fs_info->data_reloc_bg ||
 	       fs_info->data_reloc_bg == 0);
 
-	if (block_group->ro) {
+	if (block_group->ro || block_group->zoned_data_reloc_ongoing) {
 		ret = 1;
 		goto out;
 	}
@@ -3894,8 +3894,24 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 out:
 	if (ret && ffe_ctl->for_treelog)
 		fs_info->treelog_bg = 0;
-	if (ret && ffe_ctl->for_data_reloc)
+	if (ret && ffe_ctl->for_data_reloc &&
+	    fs_info->data_reloc_bg == block_group->start) {
+		/*
+		 * Do not allow further allocations from this block group.
+		 * Compared to increasing the ->ro, setting the
+		 * ->zoned_data_reloc_ongoing flag still allows nocow
+		 *  writers to come in. See btrfs_inc_nocow_writers().
+		 *
+		 * We need to disable an allocation to avoid an allocation of
+		 * regular (non relocation data) extent. With mixed of
+		 * relocation extents and regular extents, we can dispatch WRITE
+		 * commands (for relocation extents) and ZONE APPEND commands
+		 * (for regular extents) at the same time to the same zone,
+		 * which easily break the write pointer.
+		 */
+		block_group->zoned_data_reloc_ongoing = 1;
 		fs_info->data_reloc_bg = 0;
+	}
 	spin_unlock(&fs_info->relocation_bg_lock);
 	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f85503ccf106..0ab23bd1310a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3216,6 +3216,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						ordered_extent->file_offset,
 						ordered_extent->file_offset +
 						logical_len);
+		btrfs_zoned_release_data_reloc_bg(fs_info, ordered_extent->disk_bytenr,
+						  ordered_extent->disk_num_bytes);
 	} else {
 		BUG_ON(root == fs_info->tree_root);
 		ret = insert_ordered_extent_file_extent(trans, ordered_extent);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c09b1b0208c4..4930ab566e2c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -2140,3 +2140,30 @@ bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
 	factor = div64_u64(used * 100, total);
 	return factor >= fs_info->bg_reclaim_threshold;
 }
+
+void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
+				       u64 length)
+{
+	struct btrfs_block_group *block_group;
+
+	if (!btrfs_is_zoned(fs_info))
+		return;
+
+	block_group = btrfs_lookup_block_group(fs_info, logical);
+	/* It should be called on a previous data relocation block group. */
+	ASSERT(block_group && (block_group->flags & BTRFS_BLOCK_GROUP_DATA));
+
+	spin_lock(&block_group->lock);
+	if (!block_group->zoned_data_reloc_ongoing)
+		goto out;
+
+	/* All relocation extents are written. */
+	if (block_group->start + block_group->alloc_offset == logical + length) {
+		/* Now, release this block group for further allocations. */
+		block_group->zoned_data_reloc_ongoing = 0;
+	}
+
+out:
+	spin_unlock(&block_group->lock);
+	btrfs_put_block_group(block_group);
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index bb1a189e11f9..6b2eec99162b 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -77,6 +77,8 @@ void btrfs_schedule_zone_finish_bg(struct btrfs_block_group *bg,
 void btrfs_clear_data_reloc_bg(struct btrfs_block_group *bg);
 void btrfs_free_zone_cache(struct btrfs_fs_info *fs_info);
 bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info);
+void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info, u64 logical,
+				       u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -243,6 +245,9 @@ static inline bool btrfs_zoned_should_reclaim(struct btrfs_fs_info *fs_info)
 {
 	return false;
 }
+
+static inline void btrfs_zoned_release_data_reloc_bg(struct btrfs_fs_info *fs_info,
+						     u64 logical, u64 length) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH 2/2] btrfs: zoned: fix critical section of relocation inode writeback
  2022-06-06 15:59 [PATCH 0/2] btrfs: zoned: fixes for data relocation Naohiro Aota
  2022-06-06 15:59 ` [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG Naohiro Aota
@ 2022-06-06 15:59 ` Naohiro Aota
  1 sibling, 0 replies; 5+ messages in thread
From: Naohiro Aota @ 2022-06-06 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Naohiro Aota

We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
write out to the relocation inode. That critical section must include all
the IO submission for the inode. However, flush_write_bio() in
extent_writepages() is out of the critical section, causing an IO
submission outside of the lock. This leads to an out of the order IO
submission and fail the relocation process.

Fix it by extending the critical section.

Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
 fs/btrfs/extent_io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4847e0471dbf..7a125b319a9f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5227,13 +5227,14 @@ int extent_writepages(struct address_space *mapping,
 	 */
 	btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
 	ret = extent_write_cache_pages(mapping, wbc, &epd);
-	btrfs_zoned_data_reloc_unlock(BTRFS_I(inode));
 	ASSERT(ret <= 0);
 	if (ret < 0) {
+		btrfs_zoned_data_reloc_unlock(BTRFS_I(inode));
 		end_write_bio(&epd, ret);
 		return ret;
 	}
 	flush_write_bio(&epd);
+	btrfs_zoned_data_reloc_unlock(BTRFS_I(inode));
 	return ret;
 }
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG
  2022-06-06 15:59 ` [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG Naohiro Aota
@ 2022-06-06 17:40   ` David Sterba
  2022-06-07  6:59     ` Naohiro Aota
  0 siblings, 1 reply; 5+ messages in thread
From: David Sterba @ 2022-06-06 17:40 UTC (permalink / raw)
  To: Naohiro Aota; +Cc: linux-btrfs

On Tue, Jun 07, 2022 at 12:59:20AM +0900, Naohiro Aota wrote:
> Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group")
> CC: stable@vger.kernel.org # 5.16+
> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> ---
>  fs/btrfs/block-group.h |  1 +
>  fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
>  fs/btrfs/inode.c       |  2 ++
>  fs/btrfs/zoned.c       | 27 +++++++++++++++++++++++++++
>  fs/btrfs/zoned.h       |  5 +++++
>  5 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index 3ac668ace50a..5787b3dd759c 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -104,6 +104,7 @@ struct btrfs_block_group {
>  	unsigned int relocating_repair:1;
>  	unsigned int chunk_item_inserted:1;
>  	unsigned int zone_is_active:1;
> +	unsigned int zoned_data_reloc_ongoing;

Probably jsut a typo, the variable is used only in logical conditions

	unsigned int zoned_data_reloc_ongoing:1;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG
  2022-06-06 17:40   ` David Sterba
@ 2022-06-07  6:59     ` Naohiro Aota
  0 siblings, 0 replies; 5+ messages in thread
From: Naohiro Aota @ 2022-06-07  6:59 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On Mon, Jun 06, 2022 at 07:40:02PM +0200, David Sterba wrote:
> On Tue, Jun 07, 2022 at 12:59:20AM +0900, Naohiro Aota wrote:
> > Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group")
> > CC: stable@vger.kernel.org # 5.16+
> > Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
> > ---
> >  fs/btrfs/block-group.h |  1 +
> >  fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
> >  fs/btrfs/inode.c       |  2 ++
> >  fs/btrfs/zoned.c       | 27 +++++++++++++++++++++++++++
> >  fs/btrfs/zoned.h       |  5 +++++
> >  5 files changed, 53 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> > index 3ac668ace50a..5787b3dd759c 100644
> > --- a/fs/btrfs/block-group.h
> > +++ b/fs/btrfs/block-group.h
> > @@ -104,6 +104,7 @@ struct btrfs_block_group {
> >  	unsigned int relocating_repair:1;
> >  	unsigned int chunk_item_inserted:1;
> >  	unsigned int zone_is_active:1;
> > +	unsigned int zoned_data_reloc_ongoing;
> 
> Probably jsut a typo, the variable is used only in logical conditions
> 
> 	unsigned int zoned_data_reloc_ongoing:1;

Yep, exactly. It should be a bit.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-06-07  7:00 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-06 15:59 [PATCH 0/2] btrfs: zoned: fixes for data relocation Naohiro Aota
2022-06-06 15:59 ` [PATCH 1/2] btrfs: zoned: prevent allocation from previous data relocation BG Naohiro Aota
2022-06-06 17:40   ` David Sterba
2022-06-07  6:59     ` Naohiro Aota
2022-06-06 15:59 ` [PATCH 2/2] btrfs: zoned: fix critical section of relocation inode writeback Naohiro Aota

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.