[PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem
@ 2021-05-18 15:40 Johannes Thumshirn
  2021-05-18 15:40 ` [PATCH v2 1/3] btrfs: zoned: pass start block to btrfs_use_zone_append Johannes Thumshirn
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-18 15:40 UTC (permalink / raw)
  To: David Sterba, linux-btrfs; +Cc: Johannes Thumshirn

David reported that I/O errors get thrown on a zoned filesystem with
compression enabled.

This happens because we're using regular writes instead of zoned append, but
with regular writes and increased parallelism, we cannot guarantee the data
placement requirements can be met.

This series switches the compressed I/O submission path to zone append writing
on a zoned filesystem.

Changes in v2:
- Factor out zoned device lookup
- limit scope of bdev variable

Johannes Thumshirn (3):
  btrfs: zoned: pass start block to btrfs_use_zone_append
  btrfs: zoned: fix compressed writes
  btrfs: zoned: factor out zoned device lookup

 fs/btrfs/compression.c | 35 +++++++++++++++++++++++++++++++----
 fs/btrfs/extent_io.c   | 18 ++++++------------
 fs/btrfs/inode.c       |  2 +-
 fs/btrfs/zoned.c       | 25 +++++++++++++++++++++++--
 fs/btrfs/zoned.h       | 14 +++++++++++---
 5 files changed, 72 insertions(+), 22 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/3] btrfs: zoned: pass start block to btrfs_use_zone_append
  2021-05-18 15:40 [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem Johannes Thumshirn
@ 2021-05-18 15:40 ` Johannes Thumshirn
  2021-05-18 15:40 ` [PATCH v2 2/3] btrfs: zoned: fix compressed writes Johannes Thumshirn
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-18 15:40 UTC (permalink / raw)
  To: David Sterba, linux-btrfs; +Cc: Johannes Thumshirn

btrfs_use_zone_append only needs the passed in extent_map's block_start
member, so there's no need to pass in the full extent map.

This also enables the use of btrfs_use_zone_append in places where we only
have a start byte but no extent_map.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/inode.c     | 2 +-
 fs/btrfs/zoned.c     | 4 ++--
 fs/btrfs/zoned.h     | 5 ++---
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 78d3f2ec90e0..ce6364dd1517 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3765,7 +3765,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
 		/* Note that em_end from extent_map_end() is exclusive */
 		iosize = min(em_end, end + 1) - cur;
 
-		if (btrfs_use_zone_append(inode, em))
+		if (btrfs_use_zone_append(inode, em->block_start))
 			opf = REQ_OP_ZONE_APPEND;
 
 		free_extent_map(em);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 955d0f5849e3..105deb6a300a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7794,7 +7794,7 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start,
 	iomap->bdev = fs_info->fs_devices->latest_bdev;
 	iomap->length = len;
 
-	if (write && btrfs_use_zone_append(BTRFS_I(inode), em))
+	if (write && btrfs_use_zone_append(BTRFS_I(inode), em->block_start))
 		iomap->flags |= IOMAP_F_ZONE_APPEND;
 
 	free_extent_map(em);
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c41373a92476..b9d5579a578d 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1296,7 +1296,7 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	spin_unlock(&trans->releasing_ebs_lock);
 }
 
-bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em)
+bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start)
 {
 	struct btrfs_fs_info *fs_info = inode->root->fs_info;
 	struct btrfs_block_group *cache;
@@ -1311,7 +1311,7 @@ bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em)
 	if (!is_data_inode(&inode->vfs_inode))
 		return false;
 
-	cache = btrfs_lookup_block_group(fs_info, em->block_start);
+	cache = btrfs_lookup_block_group(fs_info, start);
 	ASSERT(cache);
 	if (!cache)
 		return false;
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 5e41a74a9cb2..e55d32595c2c 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -53,7 +53,7 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
-bool btrfs_use_zone_append(struct btrfs_inode *inode, struct extent_map *em);
+bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start);
 void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset,
 				 struct bio *bio);
 void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered);
@@ -152,8 +152,7 @@ static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 					  struct extent_buffer *eb) { }
 static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 
-static inline bool btrfs_use_zone_append(struct btrfs_inode *inode,
-					 struct extent_map *em)
+static inline bool btrfs_use_zone_append(struct btrfs_inode *inode, u64 start)
 {
 	return false;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-18 15:40 [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem Johannes Thumshirn
  2021-05-18 15:40 ` [PATCH v2 1/3] btrfs: zoned: pass start block to btrfs_use_zone_append Johannes Thumshirn
@ 2021-05-18 15:40 ` Johannes Thumshirn
  2021-05-23 14:13   ` Josef Bacik
  2021-06-10  7:27   ` Qu Wenruo
  2021-05-18 15:40 ` [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup Johannes Thumshirn
  2021-05-20 15:05 ` [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem David Sterba
  3 siblings, 2 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-18 15:40 UTC (permalink / raw)
  To: David Sterba, linux-btrfs; +Cc: Johannes Thumshirn

When multiple processes write data to the same block group on a compressed
zoned filesystem, the underlying device could report I/O errors and data
corruption is possible.

This happens because on a zoned file system, compressed data writes where
sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
operation. But with REQ_OP_WRITE and parallel submission it cannot be
guaranteed that the data is always submitted aligned to the underlying
zone's write pointer.

The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
filesystem is non intrusive on a regular file system or when submitting to
a conventional zone on a zoned filesystem, as it is guarded by
btrfs_use_zone_append.

Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/compression.c | 43 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 39 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2bea01d23a5b..f224c35de5d8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -28,6 +28,7 @@
 #include "compression.h"
 #include "extent_io.h"
 #include "extent_map.h"
+#include "zoned.h"
 
 static const char* const btrfs_compress_types[] = { "", "zlib", "lzo", "zstd" };
 
@@ -349,6 +350,7 @@ static void end_compressed_bio_write(struct bio *bio)
 	 */
 	inode = cb->inode;
 	cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
+	btrfs_record_physical_zoned(inode, cb->start, bio);
 	btrfs_writepage_endio_finish_ordered(cb->compressed_pages[0],
 			cb->start, cb->start + cb->len - 1,
 			bio->bi_status == BLK_STS_OK);
@@ -401,6 +403,9 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	u64 first_byte = disk_start;
 	blk_status_t ret;
 	int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;
+	const bool use_append = btrfs_use_zone_append(inode, disk_start);
+	const unsigned int bio_op =
+		use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE;
 
 	WARN_ON(!PAGE_ALIGNED(start));
 	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
@@ -418,10 +423,31 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	cb->nr_pages = nr_pages;
 
 	bio = btrfs_bio_alloc(first_byte);
-	bio->bi_opf = REQ_OP_WRITE | write_flags;
+	bio->bi_opf = bio_op | write_flags;
 	bio->bi_private = cb;
 	bio->bi_end_io = end_compressed_bio_write;
 
+	if (use_append) {
+		struct extent_map *em;
+		struct map_lookup *map;
+		struct block_device *bdev;
+
+		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
+		if (IS_ERR(em)) {
+			kfree(cb);
+			bio_put(bio);
+			return BLK_STS_NOTSUPP;
+		}
+
+		map = em->map_lookup;
+		/* We only support single profile for now */
+		ASSERT(map->num_stripes == 1);
+		bdev = map->stripes[0].dev->bdev;
+
+		bio_set_dev(bio, bdev);
+		free_extent_map(em);
+	}
+
 	if (blkcg_css) {
 		bio->bi_opf |= REQ_CGROUP_PUNT;
 		kthread_associate_blkcg(blkcg_css);
@@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	bytes_left = compressed_len;
 	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
 		int submit = 0;
+		int len;
 
 		page = compressed_pages[pg_index];
 		page->mapping = inode->vfs_inode.i_mapping;
@@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
 							  0);
 
+		if (pg_index == 0 && use_append)
+			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
+		else
+			len = bio_add_page(bio, page, PAGE_SIZE, 0);
+
 		page->mapping = NULL;
-		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
-		    PAGE_SIZE) {
+		if (submit || len < PAGE_SIZE) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 			}
 
 			bio = btrfs_bio_alloc(first_byte);
-			bio->bi_opf = REQ_OP_WRITE | write_flags;
+			bio->bi_opf = bio_op | write_flags;
 			bio->bi_private = cb;
 			bio->bi_end_io = end_compressed_bio_write;
 			if (blkcg_css)
 				bio->bi_opf |= REQ_CGROUP_PUNT;
+			/*
+			 * Use bio_add_page() to ensure the bio has at least one
+			 * page.
+			 */
 			bio_add_page(bio, page, PAGE_SIZE, 0);
 		}
 		if (bytes_left < PAGE_SIZE) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup
  2021-05-18 15:40 [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem Johannes Thumshirn
  2021-05-18 15:40 ` [PATCH v2 1/3] btrfs: zoned: pass start block to btrfs_use_zone_append Johannes Thumshirn
  2021-05-18 15:40 ` [PATCH v2 2/3] btrfs: zoned: fix compressed writes Johannes Thumshirn
@ 2021-05-18 15:40 ` Johannes Thumshirn
  2021-05-24 10:00   ` Qu Wenruo
  2021-05-20 15:05 ` [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem David Sterba
  3 siblings, 1 reply; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-18 15:40 UTC (permalink / raw)
  To: David Sterba, linux-btrfs; +Cc: Johannes Thumshirn

To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.

Factor out the lookup calls into a helper and use it in the submission
paths.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 fs/btrfs/compression.c | 16 ++++------------
 fs/btrfs/extent_io.c   | 16 +++++-----------
 fs/btrfs/zoned.c       | 21 +++++++++++++++++++++
 fs/btrfs/zoned.h       |  9 +++++++++
 4 files changed, 39 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index f224c35de5d8..b101bc483e40 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -428,24 +428,16 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
 	bio->bi_end_io = end_compressed_bio_write;
 
 	if (use_append) {
-		struct extent_map *em;
-		struct map_lookup *map;
-		struct block_device *bdev;
+		struct btrfs_device *device;
 
-		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
-		if (IS_ERR(em)) {
+		device = btrfs_zoned_get_device(fs_info, disk_start, PAGE_SIZE);
+		if (IS_ERR(device)) {
 			kfree(cb);
 			bio_put(bio);
 			return BLK_STS_NOTSUPP;
 		}
 
-		map = em->map_lookup;
-		/* We only support single profile for now */
-		ASSERT(map->num_stripes == 1);
-		bdev = map->stripes[0].dev->bdev;
-
-		bio_set_dev(bio, bdev);
-		free_extent_map(em);
+		bio_set_dev(bio, device->bdev);
 	}
 
 	if (blkcg_css) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ce6364dd1517..2b250c610562 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3266,19 +3266,13 @@ static int submit_extent_page(unsigned int opf,
 		wbc_account_cgroup_owner(wbc, page, io_size);
 	}
 	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
-		struct extent_map *em;
-		struct map_lookup *map;
+		struct btrfs_device *device;
 
-		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
-		if (IS_ERR(em))
-			return PTR_ERR(em);
+		device = btrfs_zoned_get_device(fs_info, disk_bytenr, io_size);
+		if (IS_ERR(device))
+			return PTR_ERR(device);
 
-		map = em->map_lookup;
-		/* We only support single profile for now */
-		ASSERT(map->num_stripes == 1);
-		btrfs_io_bio(bio)->device = map->stripes[0].dev;
-
-		free_extent_map(em);
+		btrfs_io_bio(bio)->device = device;
 	}
 
 	*bio_ret = bio;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b9d5579a578d..15843a858bf6 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1520,3 +1520,24 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 	length = wp - physical_pos;
 	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
 }
+
+struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
+					    u64 logical, u64 length)
+{
+	struct btrfs_device *device;
+	struct extent_map *em;
+	struct map_lookup *map;
+
+	em = btrfs_get_chunk_map(fs_info, logical, length);
+	if (IS_ERR(em))
+		return ERR_CAST(em);
+
+	map = em->map_lookup;
+	/* We only support single profile for now */
+	ASSERT(map->num_stripes == 1);
+	device = map->stripes[0].dev;
+
+	free_extent_map(em);
+
+	return device;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e55d32595c2c..b0ae2608cb6b 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -65,6 +65,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
 int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
 				  u64 physical_start, u64 physical_pos);
+struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
+					    u64 logical, u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -191,6 +193,13 @@ static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
 	return -EOPNOTSUPP;
 }
 
+static inline struct btrfs_device *btrfs_zoned_get_device(
+						  struct btrfs_fs_info *fs_info,
+						  u64 logical, u64 length)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem
  2021-05-18 15:40 [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem Johannes Thumshirn
                   ` (2 preceding siblings ...)
  2021-05-18 15:40 ` [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup Johannes Thumshirn
@ 2021-05-20 15:05 ` David Sterba
  3 siblings, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-05-20 15:05 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: David Sterba, linux-btrfs

On Wed, May 19, 2021 at 12:40:26AM +0900, Johannes Thumshirn wrote:
> David reported that I/O errors get thrown on a zoned filesystem with
> compression enabled.
> 
> This happens because we're using regular writes instead of zoned append, but
> with regular writes and increased parallelism, we cannot guarantee the data
> placement requirements can be met.
> 
> This series switches the compressed I/O submission path to zone append writing
> on a zoned filesystem.
> 
> Changes in v2:
> - Factor out zoned device lookup
> - limit scope of bdev variable
> 
> Johannes Thumshirn (3):
>   btrfs: zoned: pass start block to btrfs_use_zone_append
>   btrfs: zoned: fix compressed writes
>   btrfs: zoned: factor out zoned device lookup

The test now does not report any problems. Moved to misc-next, thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-18 15:40 ` [PATCH v2 2/3] btrfs: zoned: fix compressed writes Johannes Thumshirn
@ 2021-05-23 14:13   ` Josef Bacik
  2021-05-23 23:09     ` Qu Wenruo
  2021-05-25  5:46     ` Johannes Thumshirn
  2021-06-10  7:27   ` Qu Wenruo
  1 sibling, 2 replies; 19+ messages in thread
From: Josef Bacik @ 2021-05-23 14:13 UTC (permalink / raw)
  To: Johannes Thumshirn, David Sterba, linux-btrfs

On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
> When multiple processes write data to the same block group on a compressed
> zoned filesystem, the underlying device could report I/O errors and data
> corruption is possible.
> 
> This happens because on a zoned file system, compressed data writes where
> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
> operation. But with REQ_OP_WRITE and parallel submission it cannot be
> guaranteed that the data is always submitted aligned to the underlying
> zone's write pointer.
> 
> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
> filesystem is non intrusive on a regular file system or when submitting to
> a conventional zone on a zoned filesystem, as it is guarded by
> btrfs_use_zone_append.
> 
> Reported-by: David Sterba <dsterba@suse.com>
> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

This one is causing panics with btrfs/027 with -o compress.  I bisected it to 
something else earlier, but it was still happening today and I bisected again 
and this is what popped out.  I also went the extra step to revert the patch as 
I have already fucked this up once, and the problem didn't re-occur with this 
patch reverted.  The panic looks like this

May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping failed 
logical 22429696 bio len 53248 len 49152
May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------
May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643!
May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI
May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4 Not 
tainted 5.13.0-rc2+ #240
May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 
2009), BIOS 1.13.0-2.fc32 04/01/2014
May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc btrfs_work_helper
May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a
May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff 48 83 
c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db 62 96 e8 47 83 
ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49 8b 7a 50 44 89 f2
May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS: 00010282
May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX: 0000000000000000 
RCX: 0000000000000000
May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI: ffff9b9a7bd18e10 
RDI: ffff9b9a7bd18e10
May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08: 0000000000000000 
R09: 0000000000000000
May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11: ffffffff96973748 
R12: 0000000000000000
May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14: 000000000000d000 
R15: 0000000001564000
May 22 00:33:16 xfstests2 kernel: FS:  0000000000000000(0000) 
GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000
May 22 00:33:16 xfstests2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3: 000000013943a005 
CR4: 0000000000370ee0
May 22 00:33:16 xfstests2 kernel: Call Trace:
May 22 00:33:16 xfstests2 kernel:  btrfs_submit_compressed_write+0x2d7/0x470
May 22 00:33:16 xfstests2 kernel:  submit_compressed_extents+0x364/0x420
May 22 00:33:16 xfstests2 kernel:  ? lock_acquire+0x15d/0x380
May 22 00:33:16 xfstests2 kernel:  ? lock_release+0x1cd/0x2a0
May 22 00:33:16 xfstests2 kernel:  ? submit_compressed_extents+0x420/0x420
May 22 00:33:16 xfstests2 kernel:  btrfs_work_helper+0x133/0x520
May 22 00:33:16 xfstests2 kernel:  process_one_work+0x26b/0x570
May 22 00:33:16 xfstests2 kernel:  worker_thread+0x55/0x3c0
May 22 00:33:16 xfstests2 kernel:  ? process_one_work+0x570/0x570
May 22 00:33:16 xfstests2 kernel:  kthread+0x134/0x150
May 22 00:33:16 xfstests2 kernel:  ? __kthread_bind_mask+0x60/0x60
May 22 00:33:16 xfstests2 kernel:  ret_from_fork+0x1f/0x30

Thanks,

Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-23 14:13   ` Josef Bacik
@ 2021-05-23 23:09     ` Qu Wenruo
  2021-05-24 13:04       ` Qu Wenruo
  2021-05-25  5:46     ` Johannes Thumshirn
  1 sibling, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2021-05-23 23:09 UTC (permalink / raw)
  To: Josef Bacik, Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/5/23 下午10:13, Josef Bacik wrote:
> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a
>> compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when
>> submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>
> This one is causing panics with btrfs/027 with -o compress.  I bisected
> it to something else earlier, but it was still happening today and I
> bisected again and this is what popped out.  I also went the extra step
> to revert the patch as I have already fucked this up once, and the
> problem didn't re-occur with this patch reverted.  The panic looks like
> this
>
> May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping
> failed logical 22429696 bio len 53248 len 49152

This is the interesting part, it means we are just one sector beyond the
stripe boundary.
Definitely a sign of changed bio submission timing.

Just like the code:

+		if (pg_index == 0 && use_append)
+			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
+		else
+			len = bio_add_page(bio, page, PAGE_SIZE, 0);
+
  		page->mapping = NULL;
-		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
-		    PAGE_SIZE) {
+		if (submit || len < PAGE_SIZE) {

The code has changed the timing of bio_add_page().

Previously, if we have submit == true, we won't even try to call
bio_add_page().

But now, we will add the page even we're already at the stripe boundary,
thus it causes the extra sector being added to bio, and crosses stripe
boundary.

This part is already super tricky, thus I refactored
submit_extent_page() to do a better job at stripe boundary calculation.

We definitely need to make other bio_add_page() callers to use a better
helper, not only for later subpage, but also to make our lives easier.

Thanks,
Qu
> May 22 00:33:16 xfstests2 kernel: ------------[ cut here ]------------
> May 22 00:33:16 xfstests2 kernel: kernel BUG at fs/btrfs/volumes.c:6643!
> May 22 00:33:16 xfstests2 kernel: invalid opcode: 0000 [#1] SMP NOPTI
> May 22 00:33:16 xfstests2 kernel: CPU: 1 PID: 2236088 Comm: kworker/u4:4
> Not tainted 5.13.0-rc2+ #240
> May 22 00:33:16 xfstests2 kernel: Hardware name: QEMU Standard PC (Q35 +
> ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
> May 22 00:33:16 xfstests2 kernel: Workqueue: btrfs-delalloc
> btrfs_work_helper
> May 22 00:33:16 xfstests2 kernel: RIP: 0010:btrfs_map_bio.cold+0x58/0x5a
> May 22 00:33:16 xfstests2 kernel: Code: 50 e8 6b 83 ff ff e8 5b 0d 88 ff
> 48 83 c4 18 e9 94 8f 88 ff 48 8b 3c 24 4c 89 f1 4c 89 fa 48 c7 c6 f8 db
> 62 96 e8 47 83 ff ff <0f> 0b 4c 89 e7 e8 52 1f 83 ff e9 03 98 88 ff 49
> 8b 7a 50 44 89 f2
> May 22 00:33:16 xfstests2 kernel: RSP: 0018:ffffb310c1de7c88 EFLAGS:
> 00010282
> May 22 00:33:16 xfstests2 kernel: RAX: 0000000000000055 RBX:
> 0000000000000000 RCX: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: RDX: ffff9b9a7bd27540 RSI:
> ffff9b9a7bd18e10 RDI: ffff9b9a7bd18e10
> May 22 00:33:16 xfstests2 kernel: RBP: ffff9b9a482ad7f8 R08:
> 0000000000000000 R09: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: R10: ffffb310c1de7a48 R11:
> ffffffff96973748 R12: 0000000000000000
> May 22 00:33:16 xfstests2 kernel: R13: ffff9b9a001e7300 R14:
> 000000000000d000 R15: 0000000001564000
> May 22 00:33:16 xfstests2 kernel: FS:  0000000000000000(0000)
> GS:ffff9b9a7bd00000(0000) knlGS:0000000000000000
> May 22 00:33:16 xfstests2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> May 22 00:33:16 xfstests2 kernel: CR2: 00005621fe4566e0 CR3:
> 000000013943a005 CR4: 0000000000370ee0
> May 22 00:33:16 xfstests2 kernel: Call Trace:
> May 22 00:33:16 xfstests2 kernel:
> btrfs_submit_compressed_write+0x2d7/0x470
> May 22 00:33:16 xfstests2 kernel:  submit_compressed_extents+0x364/0x420
> May 22 00:33:16 xfstests2 kernel:  ? lock_acquire+0x15d/0x380
> May 22 00:33:16 xfstests2 kernel:  ? lock_release+0x1cd/0x2a0
> May 22 00:33:16 xfstests2 kernel:  ? submit_compressed_extents+0x420/0x420
> May 22 00:33:16 xfstests2 kernel:  btrfs_work_helper+0x133/0x520
> May 22 00:33:16 xfstests2 kernel:  process_one_work+0x26b/0x570
> May 22 00:33:16 xfstests2 kernel:  worker_thread+0x55/0x3c0
> May 22 00:33:16 xfstests2 kernel:  ? process_one_work+0x570/0x570
> May 22 00:33:16 xfstests2 kernel:  kthread+0x134/0x150
> May 22 00:33:16 xfstests2 kernel:  ? __kthread_bind_mask+0x60/0x60
> May 22 00:33:16 xfstests2 kernel:  ret_from_fork+0x1f/0x30
>
> Thanks,
>
> Josef

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup
  2021-05-18 15:40 ` [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup Johannes Thumshirn
@ 2021-05-24 10:00   ` Qu Wenruo
  2021-05-25  9:11     ` Johannes Thumshirn
  0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2021-05-24 10:00 UTC (permalink / raw)
  To: Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
> To be able to construct a zone append bio we need to look up the
> btrfs_device. The code doing the chunk map lookup to get the device is
> present in btrfs_submit_compressed_write and submit_extent_page.
>
> Factor out the lookup calls into a helper and use it in the submission
> paths.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> ---
>   fs/btrfs/compression.c | 16 ++++------------
>   fs/btrfs/extent_io.c   | 16 +++++-----------
>   fs/btrfs/zoned.c       | 21 +++++++++++++++++++++
>   fs/btrfs/zoned.h       |  9 +++++++++
>   4 files changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index f224c35de5d8..b101bc483e40 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -428,24 +428,16 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   	bio->bi_end_io = end_compressed_bio_write;
>
>   	if (use_append) {
> -		struct extent_map *em;
> -		struct map_lookup *map;
> -		struct block_device *bdev;
> +		struct btrfs_device *device;
>
> -		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
> -		if (IS_ERR(em)) {
> +		device = btrfs_zoned_get_device(fs_info, disk_start, PAGE_SIZE);

Not sure if it's too late, but I find that all callers are just passing
PAGE_SIZE (to be more accurate, should be sectorsize) to
btrfs_zoned_get_device().

Do we really need that @length parameter anyway?

Since we're just using one sector to grab the chunk map, the @length
parameter is completely useless and we can grab sectorsize using fs_info.

This makes me to dig deeper than needed every time I see such length
usage...

Thanks,
Qu

> +		if (IS_ERR(device)) {
>   			kfree(cb);
>   			bio_put(bio);
>   			return BLK_STS_NOTSUPP;
>   		}
>
> -		map = em->map_lookup;
> -		/* We only support single profile for now */
> -		ASSERT(map->num_stripes == 1);
> -		bdev = map->stripes[0].dev->bdev;
> -
> -		bio_set_dev(bio, bdev);
> -		free_extent_map(em);
> +		bio_set_dev(bio, device->bdev);
>   	}
>
>   	if (blkcg_css) {
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index ce6364dd1517..2b250c610562 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3266,19 +3266,13 @@ static int submit_extent_page(unsigned int opf,
>   		wbc_account_cgroup_owner(wbc, page, io_size);
>   	}
>   	if (btrfs_is_zoned(fs_info) && bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -		struct extent_map *em;
> -		struct map_lookup *map;
> +		struct btrfs_device *device;
>
> -		em = btrfs_get_chunk_map(fs_info, disk_bytenr, io_size);
> -		if (IS_ERR(em))
> -			return PTR_ERR(em);
> +		device = btrfs_zoned_get_device(fs_info, disk_bytenr, io_size);
> +		if (IS_ERR(device))
> +			return PTR_ERR(device);
>
> -		map = em->map_lookup;
> -		/* We only support single profile for now */
> -		ASSERT(map->num_stripes == 1);
> -		btrfs_io_bio(bio)->device = map->stripes[0].dev;
> -
> -		free_extent_map(em);
> +		btrfs_io_bio(bio)->device = device;
>   	}
>
>   	*bio_ret = bio;
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index b9d5579a578d..15843a858bf6 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -1520,3 +1520,24 @@ int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
>   	length = wp - physical_pos;
>   	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
>   }
> +
> +struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
> +					    u64 logical, u64 length)
> +{
> +	struct btrfs_device *device;
> +	struct extent_map *em;
> +	struct map_lookup *map;
> +
> +	em = btrfs_get_chunk_map(fs_info, logical, length);
> +	if (IS_ERR(em))
> +		return ERR_CAST(em);
> +
> +	map = em->map_lookup;
> +	/* We only support single profile for now */
> +	ASSERT(map->num_stripes == 1);
> +	device = map->stripes[0].dev;
> +
> +	free_extent_map(em);
> +
> +	return device;
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index e55d32595c2c..b0ae2608cb6b 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -65,6 +65,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
>   int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length);
>   int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
>   				  u64 physical_start, u64 physical_pos);
> +struct btrfs_device *btrfs_zoned_get_device(struct btrfs_fs_info *fs_info,
> +					    u64 logical, u64 length);
>   #else /* CONFIG_BLK_DEV_ZONED */
>   static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>   				     struct blk_zone *zone)
> @@ -191,6 +193,13 @@ static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
>   	return -EOPNOTSUPP;
>   }
>
> +static inline struct btrfs_device *btrfs_zoned_get_device(
> +						  struct btrfs_fs_info *fs_info,
> +						  u64 logical, u64 length)
> +{
> +	return ERR_PTR(-EOPNOTSUPP);
> +}
> +
>   #endif
>
>   static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-23 23:09     ` Qu Wenruo
@ 2021-05-24 13:04       ` Qu Wenruo
  2021-05-24 13:30         ` David Sterba
  2021-05-25  6:31         ` Johannes Thumshirn
  0 siblings, 2 replies; 19+ messages in thread
From: Qu Wenruo @ 2021-05-24 13:04 UTC (permalink / raw)
  To: Josef Bacik, Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/5/24 上午7:09, Qu Wenruo wrote:
>
>
> On 2021/5/23 下午10:13, Josef Bacik wrote:
>> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>>> When multiple processes write data to the same block group on a
>>> compressed
>>> zoned filesystem, the underlying device could report I/O errors and data
>>> corruption is possible.
>>>
>>> This happens because on a zoned file system, compressed data writes
>>> where
>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>> guaranteed that the data is always submitted aligned to the underlying
>>> zone's write pointer.
>>>
>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
>>> zoned
>>> filesystem is non intrusive on a regular file system or when
>>> submitting to
>>> a conventional zone on a zoned filesystem, as it is guarded by
>>> btrfs_use_zone_append.
>>>
>>> Reported-by: David Sterba <dsterba@suse.com>
>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat
>>> flag")
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> This one is causing panics with btrfs/027 with -o compress.  I bisected
>> it to something else earlier, but it was still happening today and I
>> bisected again and this is what popped out.  I also went the extra step
>> to revert the patch as I have already fucked this up once, and the
>> problem didn't re-occur with this patch reverted.  The panic looks like
>> this
>>
>> May 22 00:33:16 xfstests2 kernel: BTRFS critical (device dm-9): mapping
>> failed logical 22429696 bio len 53248 len 49152
>
> This is the interesting part, it means we are just one sector beyond the
> stripe boundary.
> Definitely a sign of changed bio submission timing.
>
> Just like the code:
>
> +        if (pg_index == 0 && use_append)
> +            len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> +        else
> +            len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +
>           page->mapping = NULL;
> -        if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> -            PAGE_SIZE) {
> +        if (submit || len < PAGE_SIZE) {
>
> The code has changed the timing of bio_add_page().
>
> Previously, if we have submit == true, we won't even try to call
> bio_add_page().
>
> But now, we will add the page even we're already at the stripe boundary,
> thus it causes the extra sector being added to bio, and crosses stripe
> boundary.
>
> This part is already super tricky, thus I refactored
> submit_extent_page() to do a better job at stripe boundary calculation.

BTW, I can also reproduce the problem in btrfs/027 using the latest
misc-next branch.

Thus to workaround the problem, I'm using the following diff, feel free
to fold in to the offending patch.

Thanks,
Qu

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 831e6ae92940..26e2dceda1fc 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct
btrfs_inode *inode, u64 start,
                         submit = btrfs_bio_fits_in_stripe(page,
PAGE_SIZE, bio,
                                                           0);

-               if (pg_index == 0 && use_append)
-                       len = bio_add_zone_append_page(bio, page,
PAGE_SIZE, 0);
-               else
-                       len = bio_add_page(bio, page, PAGE_SIZE, 0);
+               if (!submit) {
+                       if (pg_index == 0 && use_append)
+                               len = bio_add_zone_append_page(bio, page,
+
PAGE_SIZE, 0);
+                       else
+                               len = bio_add_page(bio, page, PAGE_SIZE, 0);
+               }

                 page->mapping = NULL;
                 if (submit || len < PAGE_SIZE) {

>
> We definitely need to make other bio_add_page() callers to use a better
> helper, not only for later subpage, but also to make our lives easier.
>
> Thanks,
> Qu

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-24 13:04       ` Qu Wenruo
@ 2021-05-24 13:30         ` David Sterba
  2021-05-25  6:31         ` Johannes Thumshirn
  1 sibling, 0 replies; 19+ messages in thread
From: David Sterba @ 2021-05-24 13:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Josef Bacik, Johannes Thumshirn, David Sterba, linux-btrfs

On Mon, May 24, 2021 at 09:04:40PM +0800, Qu Wenruo wrote:
> > This is the interesting part, it means we are just one sector beyond the
> > stripe boundary.
> > Definitely a sign of changed bio submission timing.
> >
> > Just like the code:
> >
> > +        if (pg_index == 0 && use_append)
> > +            len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> > +        else
> > +            len = bio_add_page(bio, page, PAGE_SIZE, 0);
> > +
> >           page->mapping = NULL;
> > -        if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> > -            PAGE_SIZE) {
> > +        if (submit || len < PAGE_SIZE) {
> >
> > The code has changed the timing of bio_add_page().
> >
> > Previously, if we have submit == true, we won't even try to call
> > bio_add_page().
> >
> > But now, we will add the page even we're already at the stripe boundary,
> > thus it causes the extra sector being added to bio, and crosses stripe
> > boundary.
> >
> > This part is already super tricky, thus I refactored
> > submit_extent_page() to do a better job at stripe boundary calculation.
> 
> BTW, I can also reproduce the problem in btrfs/027 using the latest
> misc-next branch.
> 
> Thus to workaround the problem, I'm using the following diff, feel free
> to fold in to the offending patch.

The patch is now in master so we'll need a proper fix.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-23 14:13   ` Josef Bacik
  2021-05-23 23:09     ` Qu Wenruo
@ 2021-05-25  5:46     ` Johannes Thumshirn
  1 sibling, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-25  5:46 UTC (permalink / raw)
  To: Josef Bacik, David Sterba, linux-btrfs

On 23/05/2021 16:13, Josef Bacik wrote:
> On 5/18/21 11:40 AM, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> This one is causing panics with btrfs/027 with -o compress.  I bisected it to 
> something else earlier, but it was still happening today and I bisected again 
> and this is what popped out.  I also went the extra step to revert the patch as 
> I have already fucked this up once, and the problem didn't re-occur with this 
> patch reverted.  The panic looks like this
> 
This is on a regular block device or on a zoned block device? I guess it's a 
regular, but I can't see yet why.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-24 13:04       ` Qu Wenruo
  2021-05-24 13:30         ` David Sterba
@ 2021-05-25  6:31         ` Johannes Thumshirn
  1 sibling, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-25  6:31 UTC (permalink / raw)
  To: Qu Wenruo, Josef Bacik, David Sterba, linux-btrfs

On 24/05/2021 15:05, Qu Wenruo wrote:
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 831e6ae92940..26e2dceda1fc 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -455,10 +455,13 @@ blk_status_t btrfs_submit_compressed_write(struct
> btrfs_inode *inode, u64 start,
>                          submit = btrfs_bio_fits_in_stripe(page,
> PAGE_SIZE, bio,
>                                                            0);
> 
> -               if (pg_index == 0 && use_append)
> -                       len = bio_add_zone_append_page(bio, page,
> PAGE_SIZE, 0);
> -               else
> -                       len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +               if (!submit) {
> +                       if (pg_index == 0 && use_append)
> +                               len = bio_add_zone_append_page(bio, page,
> +
> PAGE_SIZE, 0);
> +                       else
> +                               len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +               }
> 
>                  page->mapping = NULL;
>                  if (submit || len < PAGE_SIZE) {

This looks good, thanks for fixing it. I indeed messed up the bio_add_page() 
call in the new version when untangling that if (submit || bio_add_page())
construct. 

Can you send an official patch?

Thanks a lot,
	Johannes

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup
  2021-05-24 10:00   ` Qu Wenruo
@ 2021-05-25  9:11     ` Johannes Thumshirn
  0 siblings, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-05-25  9:11 UTC (permalink / raw)
  To: Qu Wenruo, David Sterba, linux-btrfs

On 24/05/2021 12:01, Qu Wenruo wrote:
>>   	if (use_append) {
>> -		struct extent_map *em;
>> -		struct map_lookup *map;
>> -		struct block_device *bdev;
>> +		struct btrfs_device *device;
>>
>> -		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>> -		if (IS_ERR(em)) {
>> +		device = btrfs_zoned_get_device(fs_info, disk_start, PAGE_SIZE);
> Not sure if it's too late, but I find that all callers are just passing
> PAGE_SIZE (to be more accurate, should be sectorsize) to
> btrfs_zoned_get_device().
> 
> Do we really need that @length parameter anyway?
> 
> Since we're just using one sector to grab the chunk map, the @length
> parameter is completely useless and we can grab sectorsize using fs_info.
> 
> This makes me to dig deeper than needed every time I see such length
> usage...


Hm yes I think you're right. I'll send a cleanup soon.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-05-18 15:40 ` [PATCH v2 2/3] btrfs: zoned: fix compressed writes Johannes Thumshirn
  2021-05-23 14:13   ` Josef Bacik
@ 2021-06-10  7:27   ` Qu Wenruo
  2021-06-10  7:36     ` Damien Le Moal
  1 sibling, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2021-06-10  7:27 UTC (permalink / raw)
  To: Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
> When multiple processes write data to the same block group on a compressed
> zoned filesystem, the underlying device could report I/O errors and data
> corruption is possible.
>
> This happens because on a zoned file system, compressed data writes where
> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
> operation. But with REQ_OP_WRITE and parallel submission it cannot be
> guaranteed that the data is always submitted aligned to the underlying
> zone's write pointer.
>
> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
> filesystem is non intrusive on a regular file system or when submitting to
> a conventional zone on a zoned filesystem, as it is guarded by
> btrfs_use_zone_append.
>
> Reported-by: David Sterba <dsterba@suse.com>
> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

Now working on compression support for subpage, just noticed some
strange code behavior, I'm not sure if it's designed or just a typo.

So please correct me if possible.

[...]
>
>   	bio = btrfs_bio_alloc(first_byte);
> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
> +	bio->bi_opf = bio_op | write_flags;
>   	bio->bi_private = cb;
>   	bio->bi_end_io = end_compressed_bio_write;
>
> +	if (use_append) {
> +		struct extent_map *em;
> +		struct map_lookup *map;
> +		struct block_device *bdev;
> +
> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
> +		if (IS_ERR(em)) {
> +			kfree(cb);
> +			bio_put(bio);
> +			return BLK_STS_NOTSUPP;
> +		}
> +
> +		map = em->map_lookup;
> +		/* We only support single profile for now */
> +		ASSERT(map->num_stripes == 1);
> +		bdev = map->stripes[0].dev->bdev;
> +
> +		bio_set_dev(bio, bdev);
> +		free_extent_map(em);
> +	}
> +

Here for the newly created bio, we will try to call bio_set_dev() for
it. (although later patch refactor this part a little)

So far so good.

>   	if (blkcg_css) {
>   		bio->bi_opf |= REQ_CGROUP_PUNT;
>   		kthread_associate_blkcg(blkcg_css);
> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   	bytes_left = compressed_len;
>   	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>   		int submit = 0;
> +		int len;
>
>   		page = compressed_pages[pg_index];
>   		page->mapping = inode->vfs_inode.i_mapping;
> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>   							  0);
>
> +		if (pg_index == 0 && use_append)
> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
> +		else
> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
> +
>   		page->mapping = NULL;
> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
> -		    PAGE_SIZE) {
> +		if (submit || len < PAGE_SIZE) {
>   			/*
>   			 * inc the count before we submit the bio so
>   			 * we know the end IO handler won't happen before
> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>   			}
>
>   			bio = btrfs_bio_alloc(first_byte);
> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
> +			bio->bi_opf = bio_op | write_flags;

But here, for the newly allocated bio, we didn't call bio_set_dev() at all.

Shouldn't all zoned write bio need that bio_set_dev() call?

I guess since most compressed extents are pretty small, it's really hard
to hit a case where we need to split the bio due to stripe boundary,
thus very hard to hit anything wrong.

Anyway, since I'm working on compression code to make compressed write
to follow the same boundary check in extent_io.c, I can definitely
refactor the bio allocation code to add the zoned needed calls.

Thanks,
Qu

>   			bio->bi_private = cb;
>   			bio->bi_end_io = end_compressed_bio_write;
>   			if (blkcg_css)
>   				bio->bi_opf |= REQ_CGROUP_PUNT;
> +			/*
> +			 * Use bio_add_page() to ensure the bio has at least one
> +			 * page.
> +			 */
>   			bio_add_page(bio, page, PAGE_SIZE, 0);
>   		}
>   		if (bytes_left < PAGE_SIZE) {
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-06-10  7:27   ` Qu Wenruo
@ 2021-06-10  7:36     ` Damien Le Moal
  2021-06-10  7:41       ` Qu Wenruo
  0 siblings, 1 reply; 19+ messages in thread
From: Damien Le Moal @ 2021-06-10  7:36 UTC (permalink / raw)
  To: Qu Wenruo, Johannes Thumshirn, David Sterba, linux-btrfs

On 2021/06/10 16:28, Qu Wenruo wrote:
> 
> 
> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>> When multiple processes write data to the same block group on a compressed
>> zoned filesystem, the underlying device could report I/O errors and data
>> corruption is possible.
>>
>> This happens because on a zoned file system, compressed data writes where
>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>> guaranteed that the data is always submitted aligned to the underlying
>> zone's write pointer.
>>
>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>> filesystem is non intrusive on a regular file system or when submitting to
>> a conventional zone on a zoned filesystem, as it is guarded by
>> btrfs_use_zone_append.
>>
>> Reported-by: David Sterba <dsterba@suse.com>
>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> 
> Now working on compression support for subpage, just noticed some
> strange code behavior, I'm not sure if it's designed or just a typo.
> 
> So please correct me if possible.
> 
> [...]
>>
>>   	bio = btrfs_bio_alloc(first_byte);
>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>> +	bio->bi_opf = bio_op | write_flags;
>>   	bio->bi_private = cb;
>>   	bio->bi_end_io = end_compressed_bio_write;
>>
>> +	if (use_append) {
>> +		struct extent_map *em;
>> +		struct map_lookup *map;
>> +		struct block_device *bdev;
>> +
>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>> +		if (IS_ERR(em)) {
>> +			kfree(cb);
>> +			bio_put(bio);
>> +			return BLK_STS_NOTSUPP;
>> +		}
>> +
>> +		map = em->map_lookup;
>> +		/* We only support single profile for now */
>> +		ASSERT(map->num_stripes == 1);
>> +		bdev = map->stripes[0].dev->bdev;

This variable seems rather useless...

>> +
>> +		bio_set_dev(bio, bdev);
>> +		free_extent_map(em);
>> +	}
>> +
> 
> Here for the newly created bio, we will try to call bio_set_dev() for
> it. (although later patch refactor this part a little)
> 
> So far so good.
> 
>>   	if (blkcg_css) {
>>   		bio->bi_opf |= REQ_CGROUP_PUNT;
>>   		kthread_associate_blkcg(blkcg_css);
>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   	bytes_left = compressed_len;
>>   	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>   		int submit = 0;
>> +		int len;
>>
>>   		page = compressed_pages[pg_index];
>>   		page->mapping = inode->vfs_inode.i_mapping;
>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>   							  0);
>>
>> +		if (pg_index == 0 && use_append)
>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>> +		else
>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>> +
>>   		page->mapping = NULL;
>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>> -		    PAGE_SIZE) {
>> +		if (submit || len < PAGE_SIZE) {
>>   			/*
>>   			 * inc the count before we submit the bio so
>>   			 * we know the end IO handler won't happen before
>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>   			}
>>
>>   			bio = btrfs_bio_alloc(first_byte);
>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>> +			bio->bi_opf = bio_op | write_flags;
> 
> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
> 
> Shouldn't all zoned write bio need that bio_set_dev() call?

Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
oops here... Johannes ?

> 
> I guess since most compressed extents are pretty small, it's really hard
> to hit a case where we need to split the bio due to stripe boundary,
> thus very hard to hit anything wrong.
> 
> Anyway, since I'm working on compression code to make compressed write
> to follow the same boundary check in extent_io.c, I can definitely
> refactor the bio allocation code to add the zoned needed calls.
> 
> Thanks,
> Qu
> 
>>   			bio->bi_private = cb;
>>   			bio->bi_end_io = end_compressed_bio_write;
>>   			if (blkcg_css)
>>   				bio->bi_opf |= REQ_CGROUP_PUNT;
>> +			/*
>> +			 * Use bio_add_page() to ensure the bio has at least one
>> +			 * page.
>> +			 */
>>   			bio_add_page(bio, page, PAGE_SIZE, 0);
>>   		}
>>   		if (bytes_left < PAGE_SIZE) {
>>
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-06-10  7:36     ` Damien Le Moal
@ 2021-06-10  7:41       ` Qu Wenruo
  2021-06-10  7:45         ` Damien Le Moal
  0 siblings, 1 reply; 19+ messages in thread
From: Qu Wenruo @ 2021-06-10  7:41 UTC (permalink / raw)
  To: Damien Le Moal, Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/6/10 下午3:36, Damien Le Moal wrote:
> On 2021/06/10 16:28, Qu Wenruo wrote:
>>
>>
>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>> When multiple processes write data to the same block group on a compressed
>>> zoned filesystem, the underlying device could report I/O errors and data
>>> corruption is possible.
>>>
>>> This happens because on a zoned file system, compressed data writes where
>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>> guaranteed that the data is always submitted aligned to the underlying
>>> zone's write pointer.
>>>
>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>> filesystem is non intrusive on a regular file system or when submitting to
>>> a conventional zone on a zoned filesystem, as it is guarded by
>>> btrfs_use_zone_append.
>>>
>>> Reported-by: David Sterba <dsterba@suse.com>
>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>
>> Now working on compression support for subpage, just noticed some
>> strange code behavior, I'm not sure if it's designed or just a typo.
>>
>> So please correct me if possible.
>>
>> [...]
>>>
>>>    	bio = btrfs_bio_alloc(first_byte);
>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>> +	bio->bi_opf = bio_op | write_flags;
>>>    	bio->bi_private = cb;
>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>
>>> +	if (use_append) {
>>> +		struct extent_map *em;
>>> +		struct map_lookup *map;
>>> +		struct block_device *bdev;
>>> +
>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>> +		if (IS_ERR(em)) {
>>> +			kfree(cb);
>>> +			bio_put(bio);
>>> +			return BLK_STS_NOTSUPP;
>>> +		}
>>> +
>>> +		map = em->map_lookup;
>>> +		/* We only support single profile for now */
>>> +		ASSERT(map->num_stripes == 1);
>>> +		bdev = map->stripes[0].dev->bdev;
>
> This variable seems rather useless...

No need to bother that, that has already been removed by later refactor.

>
>>> +
>>> +		bio_set_dev(bio, bdev);
>>> +		free_extent_map(em);
>>> +	}
>>> +
>>
>> Here for the newly created bio, we will try to call bio_set_dev() for
>> it. (although later patch refactor this part a little)
>>
>> So far so good.
>>
>>>    	if (blkcg_css) {
>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>    		kthread_associate_blkcg(blkcg_css);
>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    	bytes_left = compressed_len;
>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>    		int submit = 0;
>>> +		int len;
>>>
>>>    		page = compressed_pages[pg_index];
>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>    							  0);
>>>
>>> +		if (pg_index == 0 && use_append)
>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>> +		else
>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>> +
>>>    		page->mapping = NULL;
>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>> -		    PAGE_SIZE) {
>>> +		if (submit || len < PAGE_SIZE) {
>>>    			/*
>>>    			 * inc the count before we submit the bio so
>>>    			 * we know the end IO handler won't happen before
>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>    			}
>>>
>>>    			bio = btrfs_bio_alloc(first_byte);
>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>> +			bio->bi_opf = bio_op | write_flags;
>>
>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>
>> Shouldn't all zoned write bio need that bio_set_dev() call?
>
> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
> oops here... Johannes ?

That's because it's really really rare/hard to have a compressed extent
just lies at the stripe boundary.

For most cases, the data we provide for compression tests is either:
- Too compressible
   Thus the whole range can be compressed into just one sector, thus
   it will never cross stripe boundary.

- Not compressible at all
   We fall back to regular buffered write, which will do their proper
   stripe boundary check correctly.

Thus it's really near impossible to hit it in various tests.

Thanks,
Qu

>
>>
>> I guess since most compressed extents are pretty small, it's really hard
>> to hit a case where we need to split the bio due to stripe boundary,
>> thus very hard to hit anything wrong.
>>
>> Anyway, since I'm working on compression code to make compressed write
>> to follow the same boundary check in extent_io.c, I can definitely
>> refactor the bio allocation code to add the zoned needed calls.
>>
>> Thanks,
>> Qu
>>
>>>    			bio->bi_private = cb;
>>>    			bio->bi_end_io = end_compressed_bio_write;
>>>    			if (blkcg_css)
>>>    				bio->bi_opf |= REQ_CGROUP_PUNT;
>>> +			/*
>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>> +			 * page.
>>> +			 */
>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>    		}
>>>    		if (bytes_left < PAGE_SIZE) {
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-06-10  7:41       ` Qu Wenruo
@ 2021-06-10  7:45         ` Damien Le Moal
  2021-06-10  7:51           ` Qu Wenruo
  2021-06-10  8:28           ` Johannes Thumshirn
  0 siblings, 2 replies; 19+ messages in thread
From: Damien Le Moal @ 2021-06-10  7:45 UTC (permalink / raw)
  To: Qu Wenruo, Johannes Thumshirn, David Sterba, linux-btrfs

On 2021/06/10 16:41, Qu Wenruo wrote:
> 
> 
> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>> When multiple processes write data to the same block group on a compressed
>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>> corruption is possible.
>>>>
>>>> This happens because on a zoned file system, compressed data writes where
>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>> guaranteed that the data is always submitted aligned to the underlying
>>>> zone's write pointer.
>>>>
>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>> btrfs_use_zone_append.
>>>>
>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>
>>> Now working on compression support for subpage, just noticed some
>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>
>>> So please correct me if possible.
>>>
>>> [...]
>>>>
>>>>    	bio = btrfs_bio_alloc(first_byte);
>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>    	bio->bi_private = cb;
>>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>>
>>>> +	if (use_append) {
>>>> +		struct extent_map *em;
>>>> +		struct map_lookup *map;
>>>> +		struct block_device *bdev;
>>>> +
>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>> +		if (IS_ERR(em)) {
>>>> +			kfree(cb);
>>>> +			bio_put(bio);
>>>> +			return BLK_STS_NOTSUPP;
>>>> +		}
>>>> +
>>>> +		map = em->map_lookup;
>>>> +		/* We only support single profile for now */
>>>> +		ASSERT(map->num_stripes == 1);
>>>> +		bdev = map->stripes[0].dev->bdev;
>>
>> This variable seems rather useless...
> 
> No need to bother that, that has already been removed by later refactor.
> 
>>
>>>> +
>>>> +		bio_set_dev(bio, bdev);
>>>> +		free_extent_map(em);
>>>> +	}
>>>> +
>>>
>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>> it. (although later patch refactor this part a little)
>>>
>>> So far so good.
>>>
>>>>    	if (blkcg_css) {
>>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>    		kthread_associate_blkcg(blkcg_css);
>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    	bytes_left = compressed_len;
>>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>    		int submit = 0;
>>>> +		int len;
>>>>
>>>>    		page = compressed_pages[pg_index];
>>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>    							  0);
>>>>
>>>> +		if (pg_index == 0 && use_append)
>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>> +		else
>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>> +
>>>>    		page->mapping = NULL;
>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>> -		    PAGE_SIZE) {
>>>> +		if (submit || len < PAGE_SIZE) {
>>>>    			/*
>>>>    			 * inc the count before we submit the bio so
>>>>    			 * we know the end IO handler won't happen before
>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>    			}
>>>>
>>>>    			bio = btrfs_bio_alloc(first_byte);
>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>> +			bio->bi_opf = bio_op | write_flags;
>>>
>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>
>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>
>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>> oops here... Johannes ?
> 
> That's because it's really really rare/hard to have a compressed extent
> just lies at the stripe boundary.
> 
> For most cases, the data we provide for compression tests is either:
> - Too compressible
>    Thus the whole range can be compressed into just one sector, thus
>    it will never cross stripe boundary.
> 
> - Not compressible at all
>    We fall back to regular buffered write, which will do their proper
>    stripe boundary check correctly.
> 
> Thus it's really near impossible to hit it in various tests.

But this is a data write, isn't it ? So in the zoned case, it means a zone
append write. And adding a page for even a single sector using
bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
the bio size... Am I misunderstanding something here about this IO path ?

> 
> Thanks,
> Qu
> 
>>
>>>
>>> I guess since most compressed extents are pretty small, it's really hard
>>> to hit a case where we need to split the bio due to stripe boundary,
>>> thus very hard to hit anything wrong.
>>>
>>> Anyway, since I'm working on compression code to make compressed write
>>> to follow the same boundary check in extent_io.c, I can definitely
>>> refactor the bio allocation code to add the zoned needed calls.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>    			bio->bi_private = cb;
>>>>    			bio->bi_end_io = end_compressed_bio_write;
>>>>    			if (blkcg_css)
>>>>    				bio->bi_opf |= REQ_CGROUP_PUNT;
>>>> +			/*
>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>> +			 * page.
>>>> +			 */
>>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>    		}
>>>>    		if (bytes_left < PAGE_SIZE) {
>>>>
>>>
>>
>>
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-06-10  7:45         ` Damien Le Moal
@ 2021-06-10  7:51           ` Qu Wenruo
  2021-06-10  8:28           ` Johannes Thumshirn
  1 sibling, 0 replies; 19+ messages in thread
From: Qu Wenruo @ 2021-06-10  7:51 UTC (permalink / raw)
  To: Damien Le Moal, Johannes Thumshirn, David Sterba, linux-btrfs



On 2021/6/10 下午3:45, Damien Le Moal wrote:
> On 2021/06/10 16:41, Qu Wenruo wrote:
>>
>>
>> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>>> When multiple processes write data to the same block group on a compressed
>>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>>> corruption is possible.
>>>>>
>>>>> This happens because on a zoned file system, compressed data writes where
>>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>>> guaranteed that the data is always submitted aligned to the underlying
>>>>> zone's write pointer.
>>>>>
>>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>>> btrfs_use_zone_append.
>>>>>
>>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> Now working on compression support for subpage, just noticed some
>>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>>
>>>> So please correct me if possible.
>>>>
>>>> [...]
>>>>>
>>>>>     	bio = btrfs_bio_alloc(first_byte);
>>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>>     	bio->bi_private = cb;
>>>>>     	bio->bi_end_io = end_compressed_bio_write;
>>>>>
>>>>> +	if (use_append) {
>>>>> +		struct extent_map *em;
>>>>> +		struct map_lookup *map;
>>>>> +		struct block_device *bdev;
>>>>> +
>>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>>> +		if (IS_ERR(em)) {
>>>>> +			kfree(cb);
>>>>> +			bio_put(bio);
>>>>> +			return BLK_STS_NOTSUPP;
>>>>> +		}
>>>>> +
>>>>> +		map = em->map_lookup;
>>>>> +		/* We only support single profile for now */
>>>>> +		ASSERT(map->num_stripes == 1);
>>>>> +		bdev = map->stripes[0].dev->bdev;
>>>
>>> This variable seems rather useless...
>>
>> No need to bother that, that has already been removed by later refactor.
>>
>>>
>>>>> +
>>>>> +		bio_set_dev(bio, bdev);
>>>>> +		free_extent_map(em);
>>>>> +	}
>>>>> +
>>>>
>>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>>> it. (although later patch refactor this part a little)
>>>>
>>>> So far so good.
>>>>
>>>>>     	if (blkcg_css) {
>>>>>     		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>>     		kthread_associate_blkcg(blkcg_css);
>>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     	bytes_left = compressed_len;
>>>>>     	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>>     		int submit = 0;
>>>>> +		int len;
>>>>>
>>>>>     		page = compressed_pages[pg_index];
>>>>>     		page->mapping = inode->vfs_inode.i_mapping;
>>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>>     							  0);
>>>>>
>>>>> +		if (pg_index == 0 && use_append)
>>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>>> +		else
>>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>> +
>>>>>     		page->mapping = NULL;
>>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>>> -		    PAGE_SIZE) {
>>>>> +		if (submit || len < PAGE_SIZE) {
>>>>>     			/*
>>>>>     			 * inc the count before we submit the bio so
>>>>>     			 * we know the end IO handler won't happen before
>>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>     			}
>>>>>
>>>>>     			bio = btrfs_bio_alloc(first_byte);
>>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +			bio->bi_opf = bio_op | write_flags;
>>>>
>>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>>
>>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>>
>>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>>> oops here... Johannes ?
>>
>> That's because it's really really rare/hard to have a compressed extent
>> just lies at the stripe boundary.
>>
>> For most cases, the data we provide for compression tests is either:
>> - Too compressible
>>     Thus the whole range can be compressed into just one sector, thus
>>     it will never cross stripe boundary.
>>
>> - Not compressible at all
>>     We fall back to regular buffered write, which will do their proper
>>     stripe boundary check correctly.
>>
>> Thus it's really near impossible to hit it in various tests.
>
> But this is a data write, isn't it ? So in the zoned case, it means a zone
> append write. And adding a page for even a single sector using
> bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
> the bio size... Am I misunderstanding something here about this IO path ?

The bio can only be allocated when we have a write range crossing device
stripe boundary.
Which is already a rare case for compressed write.


And the initial bio always has the correct setup, thus we have to hit
the corner case to get into this situation.


Finally I find that since zoned support doesn't support multi-device
profile yet, there is no limitation for stripe length, thus we won't get
into this branch at all.

So my bad, false alert...

Thanks,
Qu

>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>> I guess since most compressed extents are pretty small, it's really hard
>>>> to hit a case where we need to split the bio due to stripe boundary,
>>>> thus very hard to hit anything wrong.
>>>>
>>>> Anyway, since I'm working on compression code to make compressed write
>>>> to follow the same boundary check in extent_io.c, I can definitely
>>>> refactor the bio allocation code to add the zoned needed calls.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>     			bio->bi_private = cb;
>>>>>     			bio->bi_end_io = end_compressed_bio_write;
>>>>>     			if (blkcg_css)
>>>>>     				bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>> +			/*
>>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>>> +			 * page.
>>>>> +			 */
>>>>>     			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>>     		}
>>>>>     		if (bytes_left < PAGE_SIZE) {
>>>>>
>>>>
>>>
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] btrfs: zoned: fix compressed writes
  2021-06-10  7:45         ` Damien Le Moal
  2021-06-10  7:51           ` Qu Wenruo
@ 2021-06-10  8:28           ` Johannes Thumshirn
  1 sibling, 0 replies; 19+ messages in thread
From: Johannes Thumshirn @ 2021-06-10  8:28 UTC (permalink / raw)
  To: Damien Le Moal, Qu Wenruo, David Sterba, linux-btrfs

On 10/06/2021 09:45, Damien Le Moal wrote:
> On 2021/06/10 16:41, Qu Wenruo wrote:
>>
>>
>> On 2021/6/10 下午3:36, Damien Le Moal wrote:
>>> On 2021/06/10 16:28, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2021/5/18 下午11:40, Johannes Thumshirn wrote:
>>>>> When multiple processes write data to the same block group on a compressed
>>>>> zoned filesystem, the underlying device could report I/O errors and data
>>>>> corruption is possible.
>>>>>
>>>>> This happens because on a zoned file system, compressed data writes where
>>>>> sent to the device via a REQ_OP_WRITE instead of a REQ_OP_ZONE_APPEND
>>>>> operation. But with REQ_OP_WRITE and parallel submission it cannot be
>>>>> guaranteed that the data is always submitted aligned to the underlying
>>>>> zone's write pointer.
>>>>>
>>>>> The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a zoned
>>>>> filesystem is non intrusive on a regular file system or when submitting to
>>>>> a conventional zone on a zoned filesystem, as it is guarded by
>>>>> btrfs_use_zone_append.
>>>>>
>>>>> Reported-by: David Sterba <dsterba@suse.com>
>>>>> Fixes: 9d294a685fbc ("btrfs: zoned: enable to mount ZONED incompat flag")
>>>>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>>>>
>>>> Now working on compression support for subpage, just noticed some
>>>> strange code behavior, I'm not sure if it's designed or just a typo.
>>>>
>>>> So please correct me if possible.
>>>>
>>>> [...]
>>>>>
>>>>>    	bio = btrfs_bio_alloc(first_byte);
>>>>> -	bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +	bio->bi_opf = bio_op | write_flags;
>>>>>    	bio->bi_private = cb;
>>>>>    	bio->bi_end_io = end_compressed_bio_write;
>>>>>
>>>>> +	if (use_append) {
>>>>> +		struct extent_map *em;
>>>>> +		struct map_lookup *map;
>>>>> +		struct block_device *bdev;
>>>>> +
>>>>> +		em = btrfs_get_chunk_map(fs_info, disk_start, PAGE_SIZE);
>>>>> +		if (IS_ERR(em)) {
>>>>> +			kfree(cb);
>>>>> +			bio_put(bio);
>>>>> +			return BLK_STS_NOTSUPP;
>>>>> +		}
>>>>> +
>>>>> +		map = em->map_lookup;
>>>>> +		/* We only support single profile for now */
>>>>> +		ASSERT(map->num_stripes == 1);
>>>>> +		bdev = map->stripes[0].dev->bdev;
>>>
>>> This variable seems rather useless...
>>
>> No need to bother that, that has already been removed by later refactor.
>>
>>>
>>>>> +
>>>>> +		bio_set_dev(bio, bdev);
>>>>> +		free_extent_map(em);
>>>>> +	}
>>>>> +
>>>>
>>>> Here for the newly created bio, we will try to call bio_set_dev() for
>>>> it. (although later patch refactor this part a little)
>>>>
>>>> So far so good.
>>>>
>>>>>    	if (blkcg_css) {
>>>>>    		bio->bi_opf |= REQ_CGROUP_PUNT;
>>>>>    		kthread_associate_blkcg(blkcg_css);
>>>>> @@ -432,6 +458,7 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    	bytes_left = compressed_len;
>>>>>    	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
>>>>>    		int submit = 0;
>>>>> +		int len;
>>>>>
>>>>>    		page = compressed_pages[pg_index];
>>>>>    		page->mapping = inode->vfs_inode.i_mapping;
>>>>> @@ -439,9 +466,13 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    			submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio,
>>>>>    							  0);
>>>>>
>>>>> +		if (pg_index == 0 && use_append)
>>>>> +			len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
>>>>> +		else
>>>>> +			len = bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>> +
>>>>>    		page->mapping = NULL;
>>>>> -		if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
>>>>> -		    PAGE_SIZE) {
>>>>> +		if (submit || len < PAGE_SIZE) {
>>>>>    			/*
>>>>>    			 * inc the count before we submit the bio so
>>>>>    			 * we know the end IO handler won't happen before
>>>>> @@ -465,11 +496,15 @@ blk_status_t btrfs_submit_compressed_write(struct btrfs_inode *inode, u64 start,
>>>>>    			}
>>>>>
>>>>>    			bio = btrfs_bio_alloc(first_byte);
>>>>> -			bio->bi_opf = REQ_OP_WRITE | write_flags;
>>>>> +			bio->bi_opf = bio_op | write_flags;
>>>>
>>>> But here, for the newly allocated bio, we didn't call bio_set_dev() at all.
>>>>
>>>> Shouldn't all zoned write bio need that bio_set_dev() call?
>>>
>>> Yep, bio->bi_bdev must be set before bio_add_zone_append_page() is called.
>>> Otherwise, there will be a crash (first line of bio_add_zone_append_page() gets
>>> the request queue from bio->bi_bdev). I wonder why we do not see NULL pointer
>>> oops here... Johannes ?
>>
>> That's because it's really really rare/hard to have a compressed extent
>> just lies at the stripe boundary.
>>
>> For most cases, the data we provide for compression tests is either:
>> - Too compressible
>>    Thus the whole range can be compressed into just one sector, thus
>>    it will never cross stripe boundary.
>>
>> - Not compressible at all
>>    We fall back to regular buffered write, which will do their proper
>>    stripe boundary check correctly.
>>
>> Thus it's really near impossible to hit it in various tests.
> 
> But this is a data write, isn't it ? So in the zoned case, it means a zone
> append write. And adding a page for even a single sector using
> bio_add_zone_append_page() will oops if the bio bdev is not set, regardless of
> the bio size... Am I misunderstanding something here about this IO path ?

Because we're not using bio_add_zone_append_page() but bio_add_page() as we only add
one page to a newly allocated bio. It's a bit complicated.  

>>>>> +			/*
>>>>> +			 * Use bio_add_page() to ensure the bio has at least one
>>>>> +			 * page.
>>>>> +			 */
>>>>>    			bio_add_page(bio, page, PAGE_SIZE, 0);
>>>>>    		}
>>>>>    		if (bytes_left < PAGE_SIZE) {
>>>>>
>>>>
>>>
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-06-10  8:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-18 15:40 [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem Johannes Thumshirn
2021-05-18 15:40 ` [PATCH v2 1/3] btrfs: zoned: pass start block to btrfs_use_zone_append Johannes Thumshirn
2021-05-18 15:40 ` [PATCH v2 2/3] btrfs: zoned: fix compressed writes Johannes Thumshirn
2021-05-23 14:13   ` Josef Bacik
2021-05-23 23:09     ` Qu Wenruo
2021-05-24 13:04       ` Qu Wenruo
2021-05-24 13:30         ` David Sterba
2021-05-25  6:31         ` Johannes Thumshirn
2021-05-25  5:46     ` Johannes Thumshirn
2021-06-10  7:27   ` Qu Wenruo
2021-06-10  7:36     ` Damien Le Moal
2021-06-10  7:41       ` Qu Wenruo
2021-06-10  7:45         ` Damien Le Moal
2021-06-10  7:51           ` Qu Wenruo
2021-06-10  8:28           ` Johannes Thumshirn
2021-05-18 15:40 ` [PATCH v2 3/3] btrfs: zoned: factor out zoned device lookup Johannes Thumshirn
2021-05-24 10:00   ` Qu Wenruo
2021-05-25  9:11     ` Johannes Thumshirn
2021-05-20 15:05 ` [PATCH v2 0/3] btrfs: zoned: fix writes on a compressed zoned filesystem David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.