* [dm-devel] [PATCH 0/8] dm: io accounting & polling improvement
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

Hello Guys,

The 1st patch adds a bdev based io accounting interface.

The 2nd ~ 5th patches improve dm's io accounting & splitting, and
meanwhile fix a kernel panic on dm-zone.

The other patches improve io polling & dm io reference handling.


Ming Lei (8):
  block: replace disk based account with bdev's
  dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
  dm: pass 'dm_io' instance to dm_io_acct directly
  dm: switch to bdev based io accounting interface
  dm: always setup ->orig_bio in alloc_io
  dm: don't grab target io reference in dm_zone_map_bio
  dm: improve target io referencing
  dm: put all polled io into one single list

 block/blk-core.c              |  15 +--
 drivers/block/zram/zram_drv.c |   5 +-
 drivers/md/dm-core.h          |  17 ++-
 drivers/md/dm-zone.c          |  10 --
 drivers/md/dm.c               | 190 +++++++++++++++++++---------------
 include/linux/blkdev.h        |   7 +-
 6 files changed, 131 insertions(+), 113 deletions(-)

-- 
2.31.1

* [dm-devel] [PATCH 1/8] block: replace disk based account with bdev's
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

'block_device' is the generic type used at the interface level, while
gendisk is becoming more of a block layer internal type, so replace the
disk based accounting interface with a bdev based one.

Also add a 'start_time' parameter to bdev_start_io_acct() so that
device mapper's io accounting can be covered by the two bdev based
helpers.
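
For a bio based driver the calling pattern ends up like the sketch
below (just an illustration of the new helpers; 'bdev', 'nr_sectors'
and 'op' stand for whatever the driver already has at hand):

    unsigned long start;

    start = bdev_start_io_acct(bdev, nr_sectors, op, jiffies);
    /* ... perform the actual I/O ... */
    bdev_end_io_acct(bdev, op, start);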

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c              | 15 ++++++++-------
 drivers/block/zram/zram_drv.c |  5 +++--
 include/linux/blkdev.h        |  7 ++++---
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 937bb6b86331..a3ae13b129ff 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1063,12 +1063,13 @@ unsigned long bio_start_io_acct(struct bio *bio)
 }
 EXPORT_SYMBOL_GPL(bio_start_io_acct);
 
-unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
-				 unsigned int op)
+unsigned long bdev_start_io_acct(struct block_device *bdev,
+				 unsigned int sectors, unsigned int op,
+				 unsigned long start_time)
 {
-	return __part_start_io_acct(disk->part0, sectors, op, jiffies);
+	return __part_start_io_acct(bdev, sectors, op, start_time);
 }
-EXPORT_SYMBOL(disk_start_io_acct);
+EXPORT_SYMBOL(bdev_start_io_acct);
 
 static void __part_end_io_acct(struct block_device *part, unsigned int op,
 			       unsigned long start_time)
@@ -1091,12 +1092,12 @@ void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time,
 }
 EXPORT_SYMBOL_GPL(bio_end_io_acct_remapped);
 
-void disk_end_io_acct(struct gendisk *disk, unsigned int op,
+void bdev_end_io_acct(struct block_device *bdev, unsigned int op,
 		      unsigned long start_time)
 {
-	__part_end_io_acct(disk->part0, op, start_time);
+	__part_end_io_acct(bdev, op, start_time);
 }
-EXPORT_SYMBOL(disk_end_io_acct);
+EXPORT_SYMBOL(bdev_end_io_acct);
 
 /**
  * blk_lld_busy - Check if underlying low-level drivers of a device are busy
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e9474b02012d..adb5209a556a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1675,9 +1675,10 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
 	bv.bv_len = PAGE_SIZE;
 	bv.bv_offset = 0;
 
-	start_time = disk_start_io_acct(bdev->bd_disk, SECTORS_PER_PAGE, op);
+	start_time = bdev_start_io_acct(bdev->bd_disk->part0,
+			SECTORS_PER_PAGE, op, jiffies);
 	ret = zram_bvec_rw(zram, &bv, index, offset, op, NULL);
-	disk_end_io_acct(bdev->bd_disk, op, start_time);
+	bdev_end_io_acct(bdev->bd_disk->part0, op, start_time);
 out:
 	/*
 	 * If I/O fails, just return error(ie, non-zero) without
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 60d016138997..f680ba6f0ab2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1491,9 +1491,10 @@ static inline void blk_wake_io_task(struct task_struct *waiter)
 		wake_up_process(waiter);
 }
 
-unsigned long disk_start_io_acct(struct gendisk *disk, unsigned int sectors,
-		unsigned int op);
-void disk_end_io_acct(struct gendisk *disk, unsigned int op,
+unsigned long bdev_start_io_acct(struct block_device *bdev,
+				 unsigned int sectors, unsigned int op,
+				 unsigned long start_time);
+void bdev_end_io_acct(struct block_device *bdev, unsigned int op,
 		unsigned long start_time);
 
 void bio_start_io_acct_time(struct bio *bio, unsigned long start_time);
-- 
2.31.1

* [dm-devel] [PATCH 2/8] dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

dm_io->orig_bio is always what gets passed to __dm_start_io_acct and
dm_end_io_acct, so there is no need for the two helpers to take a bio
parameter.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..62f7af815ef8 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -526,16 +526,13 @@ static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
 		bio->bi_iter.bi_size = bi_size;
 }
 
-static void __dm_start_io_acct(struct dm_io *io, struct bio *bio)
+static void __dm_start_io_acct(struct dm_io *io)
 {
-	dm_io_acct(false, io->md, bio, io->start_time, &io->stats_aux);
+	dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
 }
 
 static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
 {
-	/* Must account IO to DM device in terms of orig_bio */
-	struct bio *bio = io->orig_bio;
-
 	/*
 	 * Ensure IO accounting is only ever started once.
 	 * Expect no possibility for race unless DM_TIO_IS_DUPLICATE_BIO.
@@ -555,12 +552,12 @@ static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
 		spin_unlock_irqrestore(&io->lock, flags);
 	}
 
-	__dm_start_io_acct(io, bio);
+	__dm_start_io_acct(io);
 }
 
-static void dm_end_io_acct(struct dm_io *io, struct bio *bio)
+static void dm_end_io_acct(struct dm_io *io)
 {
-	dm_io_acct(true, io->md, bio, io->start_time, &io->stats_aux);
+	dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
 }
 
 static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
@@ -875,14 +872,14 @@ static void dm_io_complete(struct dm_io *io)
 
 	io_error = io->status;
 	if (dm_io_flagged(io, DM_IO_ACCOUNTED))
-		dm_end_io_acct(io, bio);
+		dm_end_io_acct(io);
 	else if (!io_error) {
 		/*
 		 * Must handle target that DM_MAPIO_SUBMITTED only to
 		 * then bio_endio() rather than dm_submit_bio_remap()
 		 */
-		__dm_start_io_acct(io, bio);
-		dm_end_io_acct(io, bio);
+		__dm_start_io_acct(io);
+		dm_end_io_acct(io);
 	}
 	free_io(io);
 	smp_wmb();
-- 
2.31.1

* [dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

All four of the other parameters can be retrieved from the 'dm_io'
instance, so there is no need to pass them to dm_io_acct() individually.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 62f7af815ef8..ed85cd1165a4 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -498,9 +498,12 @@ static bool bio_is_flush_with_data(struct bio *bio)
 	return ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size);
 }
 
-static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
-		       unsigned long start_time, struct dm_stats_aux *stats_aux)
+static void dm_io_acct(struct dm_io *io, bool end)
 {
+	struct dm_stats_aux *stats_aux = &io->stats_aux;
+	unsigned long start_time = io->start_time;
+	struct mapped_device *md = io->md;
+	struct bio *bio = io->orig_bio;
 	bool is_flush_with_data;
 	unsigned int bi_size;
 
@@ -528,7 +531,7 @@ static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
 
 static void __dm_start_io_acct(struct dm_io *io)
 {
-	dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
+	dm_io_acct(io, false);
 }
 
 static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
@@ -557,7 +560,7 @@ static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
 
 static void dm_end_io_acct(struct dm_io *io)
 {
-	dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
+	dm_io_acct(io, true);
 }
 
 static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
-- 
2.31.1

* [dm-devel] [PATCH 4/8] dm: switch to bdev based io accounting interface
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

DM doesn't account sectors for flush IO, and the sector count can be
retrieved from 'dm_io' instead, which will let a following patch avoid
allocating & updating a new original bio.

So switch to the bdev based io accounting interface.
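
The practical difference for the flush-with-data case is roughly the
following (simplified from the hunks below, not the exact code): the
bio no longer has to be mutated just for accounting.

    /* before: hide the payload from the bio based accounting helpers */
    bi_size = bio->bi_iter.bi_size;
    bio->bi_iter.bi_size = 0;
    bio_start_io_acct_time(bio, start_time);
    ...
    bio->bi_iter.bi_size = bi_size;  /* restore for a possible requeue */

    /* after: the sector count is an explicit argument, bio untouched */
    bdev_start_io_acct(bio->bi_bdev, 0, bio_op(bio), start_time);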

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ed85cd1165a4..31eacc0e93ed 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -504,29 +504,24 @@ static void dm_io_acct(struct dm_io *io, bool end)
 	unsigned long start_time = io->start_time;
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->orig_bio;
-	bool is_flush_with_data;
-	unsigned int bi_size;
+	unsigned int sectors;
 
 	/* If REQ_PREFLUSH set save any payload but do not account it */
-	is_flush_with_data = bio_is_flush_with_data(bio);
-	if (is_flush_with_data) {
-		bi_size = bio->bi_iter.bi_size;
-		bio->bi_iter.bi_size = 0;
-	}
+	if (bio_is_flush_with_data(bio))
+		sectors = 0;
+	else
+		sectors = bio_sectors(bio);
 
 	if (!end)
-		bio_start_io_acct_time(bio, start_time);
+		bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
+				start_time);
 	else
-		bio_end_io_acct(bio, start_time);
+		bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
 
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
 				    bio->bi_iter.bi_sector, bio_sectors(bio),
 				    end, start_time, stats_aux);
-
-	/* Restore bio's payload so it does get accounted upon requeue */
-	if (is_flush_with_data)
-		bio->bi_iter.bi_size = bi_size;
 }
 
 static void __dm_start_io_acct(struct dm_io *io)
-- 
2.31.1

* [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

The current DM code sets up ->orig_bio only after __map_bio() returns,
which not only causes a kernel panic for dm zone, but is also a bit
ugly and tricky, especially the waiting until ->orig_bio is set in
dm_submit_bio_remap().

The reason is that a new bio is cloned from the original FS bio to
represent the mapped part, and it only serves io accounting.

Now that we have switched to the bdev based io accounting interface,
the sectors/bio_op can easily be retrieved from the real original bio
plus the newly added .sector_offset & .sectors fields, so the cloned
bio isn't necessary any more.

This not only fixes dm-zone's kernel panic, but also cleans up dm io
accounting & splitting a bit.
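
A worked example of the new fields (the numbers are made up, only for
illustration): assume the FS bio spans sectors [0, 100) and the first
pass maps 60 sectors, leaving ci.sector_count == 40.

    ci.io->sectors = bio_sectors(bio) - ci.sector_count;  /* 100-40 = 60 */
    ci.io->sector_offset = bio_end_sector(bio) -
                           bio->bi_iter.bi_sector;        /* 100 */
    bio_trim(bio, ci.io->sectors, ci.sector_count);       /* now [60, 100) */

    /* later, dm_io_acct() recovers the mapped part without any clone: */
    sector  = bio_end_sector(bio) - io->sector_offset;    /* 100-100 = 0 */
    sectors = io->sectors;                                 /* 60 */

Here bio_trim() advances the front of the bio while keeping its end in
place, so bio_end_sector() stays constant and the start of the mapped
part can always be recomputed from .sector_offset.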

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  8 ++++++-
 drivers/md/dm.c      | 51 ++++++++++++++++++++++----------------------
 2 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..aefb080c230d 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -247,7 +247,12 @@ struct dm_io {
 	blk_short_t flags;
 	atomic_t io_count;
 	struct mapped_device *md;
+
+	/* The three fields represent mapped part of original bio */
 	struct bio *orig_bio;
+	unsigned int sector_offset; /* offset to end of orig_bio */
+	unsigned int sectors;
+
 	blk_status_t status;
 	spinlock_t lock;
 	unsigned long start_time;
@@ -264,7 +269,8 @@ struct dm_io {
  */
 enum {
 	DM_IO_START_ACCT,
-	DM_IO_ACCOUNTED
+	DM_IO_ACCOUNTED,
+	DM_IO_SPLITTED
 };
 
 static inline bool dm_io_flagged(struct dm_io *io, unsigned int bit)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 31eacc0e93ed..df1d013fb793 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -509,8 +509,10 @@ static void dm_io_acct(struct dm_io *io, bool end)
 	/* If REQ_PREFLUSH set save any payload but do not account it */
 	if (bio_is_flush_with_data(bio))
 		sectors = 0;
-	else
+	else if (likely(!(dm_io_flagged(io, DM_IO_SPLITTED))))
 		sectors = bio_sectors(bio);
+	else
+		sectors = io->sectors;
 
 	if (!end)
 		bdev_start_io_acct(bio->bi_bdev, sectors, bio_op(bio),
@@ -518,10 +520,21 @@ static void dm_io_acct(struct dm_io *io, bool end)
 	else
 		bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
 
-	if (unlikely(dm_stats_used(&md->stats)))
+	if (unlikely(dm_stats_used(&md->stats))) {
+		sector_t sector;
+
+		if (likely(!dm_io_flagged(io, DM_IO_SPLITTED))) {
+			sector = bio->bi_iter.bi_sector;
+			sectors = bio_sectors(bio);
+		} else {
+			sector = bio_end_sector(bio) - io->sector_offset;
+			sectors = io->sectors;
+		}
+
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
-				    bio->bi_iter.bi_sector, bio_sectors(bio),
+				    sector, sectors,
 				    end, start_time, stats_aux);
+	}
 }
 
 static void __dm_start_io_acct(struct dm_io *io)
@@ -576,7 +589,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->status = 0;
 	atomic_set(&io->io_count, 1);
 	this_cpu_inc(*md->pending_io);
-	io->orig_bio = NULL;
+	io->orig_bio = bio;
 	io->md = md;
 	io->map_task = current;
 	spin_lock_init(&io->lock);
@@ -1222,13 +1235,6 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 		/* Still in target's map function */
 		dm_io_set_flag(io, DM_IO_START_ACCT);
 	} else {
-		/*
-		 * Called by another thread, managed by DM target,
-		 * wait for dm_split_and_process_bio() to store
-		 * io->orig_bio
-		 */
-		while (unlikely(!smp_load_acquire(&io->orig_bio)))
-			msleep(1);
 		dm_start_io_acct(io, clone);
 	}
 
@@ -1557,7 +1563,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 				     struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	struct bio *orig_bio = NULL;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1573,22 +1578,16 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/*
-	 * Remainder must be passed to submit_bio_noacct() so it gets handled
-	 * *after* bios already submitted have been completely processed.
-	 * We take a clone of the original to store in ci.io->orig_bio to be
-	 * used by dm_end_io_acct() and for dm_io_complete() to use for
-	 * completion handling.
-	 */
-	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
-			     GFP_NOIO, &md->queue->bio_split);
-	bio_chain(orig_bio, bio);
-	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
+	/* setup the mapped part for accounting */
+	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
+	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
+	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
+
+	bio_trim(bio, ci.io->sectors, ci.sector_count);
+	trace_block_split(bio, bio->bi_iter.bi_sector);
+	bio_inc_remaining(bio);
 	submit_bio_noacct(bio);
 out:
-	if (!orig_bio)
-		orig_bio = bio;
-	smp_store_release(&ci.io->orig_bio, orig_bio);
 	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
 		dm_start_io_acct(ci.io, NULL);
 
-- 
2.31.1

* [dm-devel] [PATCH 6/8] dm: don't grab target io reference in dm_zone_map_bio
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

dm_zone_map_bio() is only called from __map_bio(), in which the io's
reference has already been grabbed, and that reference won't be
released until the bio is submitted, so there is no need to take an
extra one in dm_zone_map_bio() any more.
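
For reference, the relevant call flow at this point in the series looks
roughly like this (simplified, not the exact code):

    __map_bio(clone)
        dm_io_inc_pending(io);      /* io reference already grabbed */
        ...
        dm_zone_map_bio(tio)        /* no extra get/put needed here */
            ti->type->map(ti, clone);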

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  7 -------
 drivers/md/dm-zone.c | 10 ----------
 drivers/md/dm.c      |  7 ++++++-
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index aefb080c230d..811c0ccbc63d 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -283,13 +283,6 @@ static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
 	io->flags |= (1U << bit);
 }
 
-static inline void dm_io_inc_pending(struct dm_io *io)
-{
-	atomic_inc(&io->io_count);
-}
-
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
-
 static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
 {
 	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..85d3c158719f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		return DM_MAPIO_KILL;
 	}
 
-	/*
-	 * The target map function may issue and complete the IO quickly.
-	 * Take an extra reference on the IO to make sure it does disappear
-	 * until we run dm_zone_map_bio_end().
-	 */
-	dm_io_inc_pending(io);
-
 	/* Let the target do its work */
 	r = ti->type->map(ti, clone);
 	switch (r) {
@@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		break;
 	}
 
-	/* Drop the extra reference on the IO */
-	dm_io_dec_pending(io, sts);
-
 	if (sts != BLK_STS_OK)
 		return DM_MAPIO_KILL;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index df1d013fb793..3c3ba6b4e19b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -937,11 +937,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
 		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
+static void dm_io_inc_pending(struct dm_io *io)
+{
+	atomic_inc(&io->io_count);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
+static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
 {
 	/* Push-back supersedes any I/O errors */
 	if (unlikely(error)) {
-- 
2.31.1

* [dm-devel] [PATCH 7/8] dm: improve target io referencing
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

Currently the target io's reference counter is grabbed before calling
__map_bio(); this isn't efficient since the grabbing can be moved into
alloc_io().

Meanwhile this becomes the typical async io reference counting model:
one reference is for the submission side, the other for the completion
side, and the io won't be completed until both sides are done.
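
Conceptually the model is the usual one for async io (a simplified
view, not the exact hunks below):

    /* alloc_io(): one reference for the submitter, one for completion */
    atomic_set(&io->io_count, 2);

    /* completion side (e.g. clone_endio()) drops its reference */
    dm_io_dec_pending(io, status);

    /* submission side: dropped once splitting/mapping has finished, so
     * the io can't be completed and freed while it is still being
     * processed by dm_split_and_process_bio()
     */
    dm_io_dec_pending(io, errno_to_blk_status(error));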

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 51 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c3ba6b4e19b..2987f7cf7b47 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -587,7 +587,9 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io = container_of(tio, struct dm_io, tio);
 	io->magic = DM_IO_MAGIC;
 	io->status = 0;
-	atomic_set(&io->io_count, 1);
+
+	/* one is for submission, the other is for completion */
+	atomic_set(&io->io_count, 2);
 	this_cpu_inc(*md->pending_io);
 	io->orig_bio = bio;
 	io->md = md;
@@ -937,11 +939,6 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
 		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
-static void dm_io_inc_pending(struct dm_io *io)
-{
-	atomic_inc(&io->io_count);
-}
-
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1276,7 +1273,6 @@ static void __map_bio(struct bio *clone)
 	/*
 	 * Map the clone.
 	 */
-	dm_io_inc_pending(io);
 	tio->old_sector = clone->bi_iter.bi_sector;
 
 	if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1358,11 +1354,12 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
 	}
 }
 
-static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
+static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
 	struct bio_list blist = BIO_EMPTY_LIST;
 	struct bio *clone;
+	int ret = 0;
 
 	switch (num_bios) {
 	case 0:
@@ -1371,15 +1368,19 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 		clone = alloc_tio(ci, ti, 0, len, GFP_NOIO);
 		dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
 		__map_bio(clone);
+		ret = 1;
 		break;
 	default:
 		alloc_multiple_bios(&blist, ci, ti, num_bios, len);
 		while ((clone = bio_list_pop(&blist))) {
 			dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
 			__map_bio(clone);
+			ret += 1;
 		}
 		break;
 	}
+
+	return ret;
 }
 
 static void __send_empty_flush(struct clone_info *ci)
@@ -1399,8 +1400,19 @@ static void __send_empty_flush(struct clone_info *ci)
 	ci->bio = &flush_bio;
 	ci->sector_count = 0;
 
-	while ((ti = dm_table_get_target(ci->map, target_nr++)))
-		__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+	while ((ti = dm_table_get_target(ci->map, target_nr++))) {
+		int bios;
+
+		atomic_add(ti->num_flush_bios, &ci->io->io_count);
+		bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+		atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
+	}
+
+	/*
+	 * alloc_io() takes one extra reference for submission, so the
+	 * reference won't reach 0 after the following subtraction
+	 */
+	atomic_sub(1, &ci->io->io_count);
 
 	bio_uninit(ci->bio);
 }
@@ -1409,6 +1421,7 @@ static void __send_changing_extent_only(struct clone_info *ci, struct dm_target
 					unsigned num_bios)
 {
 	unsigned len;
+	int bios;
 
 	len = min_t(sector_t, ci->sector_count,
 		    max_io_len_target_boundary(ti, dm_target_offset(ti, ci->sector)));
@@ -1420,7 +1433,13 @@ static void __send_changing_extent_only(struct clone_info *ci, struct dm_target
 	ci->sector += len;
 	ci->sector_count -= len;
 
-	__send_duplicate_bios(ci, ti, num_bios, &len);
+	atomic_add(num_bios, &ci->io->io_count);
+	bios = __send_duplicate_bios(ci, ti, num_bios, &len);
+	/*
+	 * alloc_io() takes one extra reference for submission, so the
+	 * reference won't reach 0 after the following subtraction
+	 */
+	atomic_sub(num_bios - bios + 1, &ci->io->io_count);
 }
 
 static bool is_abnormal_io(struct bio *bio)
@@ -1603,9 +1622,15 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	 * Add every dm_io instance into the hlist_head which is stored in
 	 * bio->bi_private, so that dm_poll_bio can poll them all.
 	 */
-	if (error || !ci.submit_as_polled)
+	if (error || !ci.submit_as_polled) {
+		/*
+		 * In case of submission failure, the extra reference for
+		 * submitting io isn't consumed yet
+		 */
+		if (error)
+			atomic_dec(&ci.io->io_count);
 		dm_io_dec_pending(ci.io, errno_to_blk_status(error));
-	else
+	} else
 		dm_queue_poll_io(bio, ci.io);
 }
 
-- 
2.31.1

* [dm-devel] [PATCH 8/8] dm: put all polled io into one single list
From: Ming Lei @ 2022-04-12  8:56 UTC
  To: Jens Axboe, Mike Snitzer; +Cc: linux-block, dm-devel, Damien Le Moal, Ming Lei

If bio_split() isn't involved, it is overkill to link the dm_io into an
hlist, given there is only a single dm_io in the list, so convert to a
singly linked list for holding all dm_io instances associated with this
bio.
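
The resulting layout, for a bio that ended up being split into three
dm_io instances, looks like this (illustration only, not code from the
patch):

    bio->bi_private --> io C --> io B --> io A --> NULL
                          .next     .next    .next

    /* io A, the first one queued, keeps the original bi_private value
     * in A->data so that dm_poll_bio() can restore it afterwards */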

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  2 +-
 drivers/md/dm.c      | 46 +++++++++++++++++++++++---------------------
 2 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 811c0ccbc63d..7f51957913e8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -257,7 +257,7 @@ struct dm_io {
 	spinlock_t lock;
 	unsigned long start_time;
 	void *data;
-	struct hlist_node node;
+	struct dm_io *next;
 	struct task_struct *map_task;
 	struct dm_stats_aux stats_aux;
 	/* last member of dm_target_io is 'struct bio' */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2987f7cf7b47..db23efd6bbf6 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1492,7 +1492,7 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
 }
 
 /*
- * Reuse ->bi_private as hlist head for storing all dm_io instances
+ * Reuse ->bi_private as dm_io list head for storing all dm_io instances
  * associated with this bio, and this bio's bi_private needs to be
  * stored in dm_io->data before the reuse.
  *
@@ -1500,14 +1500,14 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
  * touch it after splitting. Meantime it won't be changed by anyone after
  * bio is submitted. So this reuse is safe.
  */
-static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
+static inline struct dm_io **dm_poll_list_head(struct bio *bio)
 {
-	return (struct hlist_head *)&bio->bi_private;
+	return (struct dm_io **)&bio->bi_private;
 }
 
 static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 {
-	struct hlist_head *head = dm_get_bio_hlist_head(bio);
+	struct dm_io **head = dm_poll_list_head(bio);
 
 	if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
 		bio->bi_opf |= REQ_DM_POLL_LIST;
@@ -1517,19 +1517,20 @@ static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 		 */
 		io->data = bio->bi_private;
 
-		INIT_HLIST_HEAD(head);
-
 		/* tell block layer to poll for completion */
 		bio->bi_cookie = ~BLK_QC_T_NONE;
+
+		io->next = NULL;
 	} else {
 		/*
 		 * bio recursed due to split, reuse original poll list,
 		 * and save bio->bi_private too.
 		 */
-		io->data = hlist_entry(head->first, struct dm_io, node)->data;
+		io->data = (*head)->data;
+		io->next = *head;
 	}
 
-	hlist_add_head(&io->node, head);
+	*head = io;
 }
 
 /*
@@ -1682,18 +1683,16 @@ static bool dm_poll_dm_io(struct dm_io *io, struct io_comp_batch *iob,
 static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
 		       unsigned int flags)
 {
-	struct hlist_head *head = dm_get_bio_hlist_head(bio);
-	struct hlist_head tmp = HLIST_HEAD_INIT;
-	struct hlist_node *next;
-	struct dm_io *io;
+	struct dm_io **head = dm_poll_list_head(bio);
+	struct dm_io *list = *head;
+	struct dm_io *tmp = NULL;
+	struct dm_io *curr, *next;
 
 	/* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
 	if (!(bio->bi_opf & REQ_DM_POLL_LIST))
 		return 0;
 
-	WARN_ON_ONCE(hlist_empty(head));
-
-	hlist_move_list(head, &tmp);
+	WARN_ON_ONCE(!list);
 
 	/*
 	 * Restore .bi_private before possibly completing dm_io.
@@ -1704,24 +1703,27 @@ static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
 	 * clearing REQ_DM_POLL_LIST here.
 	 */
 	bio->bi_opf &= ~REQ_DM_POLL_LIST;
-	bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data;
+	bio->bi_private = list->data;
 
-	hlist_for_each_entry_safe(io, next, &tmp, node) {
-		if (dm_poll_dm_io(io, iob, flags)) {
-			hlist_del_init(&io->node);
+	for (curr = list, next = curr->next; curr; curr = next, next =
+			curr ? curr->next : NULL) {
+		if (dm_poll_dm_io(curr, iob, flags)) {
 			/*
 			 * clone_endio() has already occurred, so passing
 			 * error as 0 here doesn't override io->status
 			 */
-			dm_io_dec_pending(io, 0);
+			dm_io_dec_pending(curr, 0);
+		} else {
+			curr->next = tmp;
+			tmp = curr;
 		}
 	}
 
 	/* Not done? */
-	if (!hlist_empty(&tmp)) {
+	if (tmp) {
 		bio->bi_opf |= REQ_DM_POLL_LIST;
 		/* Reset bio->bi_private to dm_io list head */
-		hlist_move_list(&tmp, head);
+		*head = tmp;
 		return 0;
 	}
 	return 1;
-- 
2.31.1


* Re: [PATCH 0/8] dm: io accounting & polling improvement
  2022-04-12  8:56 ` Ming Lei
@ 2022-04-12 17:15   ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 17:15 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Hello Guys,
> 
> The 1st patch adds bdev based io accounting interface.
> 
> The 2nd ~ 5th patches improves dm's io accounting & split, meantime
> fixes kernel panic on dm-zone.
> 
> The other patches improves io polling & dm io reference handling.
> 
> 
> Ming Lei (8):
>   block: replace disk based account with bdev's
>   dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
>   dm: pass 'dm_io' instance to dm_io_acct directly
>   dm: switch to bdev based io accounting interface
>   dm: always setup ->orig_bio in alloc_io
>   dm: don't grab target io reference in dm_zone_map_bio
>   dm: improve target io referencing
>   dm: put all polled io into one single list
> 
>  block/blk-core.c              |  15 +--
>  drivers/block/zram/zram_drv.c |   5 +-
>  drivers/md/dm-core.h          |  17 ++-
>  drivers/md/dm-zone.c          |  10 --
>  drivers/md/dm.c               | 190 +++++++++++++++++++---------------
>  include/linux/blkdev.h        |   7 +-
>  6 files changed, 131 insertions(+), 113 deletions(-)
> 
> -- 
> 2.31.1
> 

I'll review this closely, but a couple of weeks ago I queued up quite a
lot of conflicting changes for 5.19 here:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 0/8] dm: io accounting & polling improvement
@ 2022-04-12 17:15   ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 17:15 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Hello Guys,
> 
> The 1st patch adds bdev based io accounting interface.
> 
> The 2nd ~ 5th patches improves dm's io accounting & split, meantime
> fixes kernel panic on dm-zone.
> 
> The other patches improves io polling & dm io reference handling.
> 
> 
> Ming Lei (8):
>   block: replace disk based account with bdev's
>   dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct
>   dm: pass 'dm_io' instance to dm_io_acct directly
>   dm: switch to bdev based io accounting interface
>   dm: always setup ->orig_bio in alloc_io
>   dm: don't grab target io reference in dm_zone_map_bio
>   dm: improve target io referencing
>   dm: put all polled io into one single list
> 
>  block/blk-core.c              |  15 +--
>  drivers/block/zram/zram_drv.c |   5 +-
>  drivers/md/dm-core.h          |  17 ++-
>  drivers/md/dm-zone.c          |  10 --
>  drivers/md/dm.c               | 190 +++++++++++++++++++---------------
>  include/linux/blkdev.h        |   7 +-
>  6 files changed, 131 insertions(+), 113 deletions(-)
> 
> -- 
> 2.31.1
> 

I'll review this closely, but a couple of weeks ago I queued up quite a
lot of conflicting changes for 5.19 here:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly
  2022-04-12  8:56   ` Ming Lei
@ 2022-04-12 20:28     ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 20:28 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> All the other 4 parameters are retrieved from the 'dm_io' instance, so
> not necessary to pass all four to dm_io_acct().
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Yeah, commit 0ab30b4079e103 ("dm: eliminate copying of dm_io fields in
dm_io_dec_pending") could've gone further to do what you've done here
in this patch.

But it stopped short because of the additional "games" associated with
the late assignment of io->orig_bio that is in the dm-5.19 branch.

Mike


> ---
>  drivers/md/dm.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 62f7af815ef8..ed85cd1165a4 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -498,9 +498,12 @@ static bool bio_is_flush_with_data(struct bio *bio)
>  	return ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size);
>  }
>  
> -static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
> -		       unsigned long start_time, struct dm_stats_aux *stats_aux)
> +static void dm_io_acct(struct dm_io *io, bool end)
>  {
> +	struct dm_stats_aux *stats_aux = &io->stats_aux;
> +	unsigned long start_time = io->start_time;
> +	struct mapped_device *md = io->md;
> +	struct bio *bio = io->orig_bio;
>  	bool is_flush_with_data;
>  	unsigned int bi_size;
>  
> @@ -528,7 +531,7 @@ static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
>  
>  static void __dm_start_io_acct(struct dm_io *io)
>  {
> -	dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
> +	dm_io_acct(io, false);
>  }
>  
>  static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
> @@ -557,7 +560,7 @@ static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
>  
>  static void dm_end_io_acct(struct dm_io *io)
>  {
> -	dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
> +	dm_io_acct(io, true);
>  }
>  
>  static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> -- 
> 2.31.1
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly
@ 2022-04-12 20:28     ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 20:28 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> All the other 4 parameters are retrieved from the 'dm_io' instance, so
> not necessary to pass all four to dm_io_acct().
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Yeah, commit 0ab30b4079e103 ("dm: eliminate copying of dm_io fields in
dm_io_dec_pending") could've gone further to do what you've done here
in this patch.

But it stopped short because of the additional "games" associated with
the late assignment of io->orig_bio that is in the dm-5.19 branch.

Mike


> ---
>  drivers/md/dm.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 62f7af815ef8..ed85cd1165a4 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -498,9 +498,12 @@ static bool bio_is_flush_with_data(struct bio *bio)
>  	return ((bio->bi_opf & REQ_PREFLUSH) && bio->bi_iter.bi_size);
>  }
>  
> -static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
> -		       unsigned long start_time, struct dm_stats_aux *stats_aux)
> +static void dm_io_acct(struct dm_io *io, bool end)
>  {
> +	struct dm_stats_aux *stats_aux = &io->stats_aux;
> +	unsigned long start_time = io->start_time;
> +	struct mapped_device *md = io->md;
> +	struct bio *bio = io->orig_bio;
>  	bool is_flush_with_data;
>  	unsigned int bi_size;
>  
> @@ -528,7 +531,7 @@ static void dm_io_acct(bool end, struct mapped_device *md, struct bio *bio,
>  
>  static void __dm_start_io_acct(struct dm_io *io)
>  {
> -	dm_io_acct(false, io->md, io->orig_bio, io->start_time, &io->stats_aux);
> +	dm_io_acct(io, false);
>  }
>  
>  static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
> @@ -557,7 +560,7 @@ static void dm_start_io_acct(struct dm_io *io, struct bio *clone)
>  
>  static void dm_end_io_acct(struct dm_io *io)
>  {
> -	dm_io_acct(true, io->md, io->orig_bio, io->start_time, &io->stats_aux);
> +	dm_io_acct(io, true);
>  }
>  
>  static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> -- 
> 2.31.1
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12  8:56   ` Ming Lei
@ 2022-04-12 20:52     ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 20:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> The current DM codes setup ->orig_bio after __map_bio() returns,
> and not only cause kernel panic for dm zone, but also a bit ugly
> and tricky, especially the waiting until ->orig_bio is set in
> dm_submit_bio_remap().
> 
> The reason is that one new bio is cloned from original FS bio to
> represent the mapped part, which just serves io accounting.
> 
> Now we have switched to bdev based io accounting interface, and we
> can retrieve sectors/bio_op from both the real original bio and the
> added fields of .sector_offset & .sectors easily, so the new cloned
> bio isn't necessary any more.
> 
> Not only fixes dm-zone's kernel panic, but also cleans up dm io
> accounting & split a bit.

You're conflating quite a few things here.  DM zone really has no
business accessing io->orig_bio (dm-zone.c can just as easily inspect
the tio->clone; because it hasn't been remapped yet, it still reflects
io->orig_bio, so there is no need to look at io->orig_bio) -- but
yes, I clearly broke things during the 5.18 merge and it needs fixing
ASAP.

But I'm (ab)using the io->orig_bio assignment to signal to the
completion path that it may proceed.  These dm-5.19 commits impose
that even further:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a

And then leveraged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514

It could be that all these dm-5.19 changes suck, but I do know
dm-zone.c is too tightly coupled to DM core.  So I'll focus on that
first, fix 5.18, and then circle back to "what's next?".

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-12 20:52     ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 20:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  4:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> The current DM codes setup ->orig_bio after __map_bio() returns,
> and not only cause kernel panic for dm zone, but also a bit ugly
> and tricky, especially the waiting until ->orig_bio is set in
> dm_submit_bio_remap().
> 
> The reason is that one new bio is cloned from original FS bio to
> represent the mapped part, which just serves io accounting.
> 
> Now we have switched to bdev based io accounting interface, and we
> can retrieve sectors/bio_op from both the real original bio and the
> added fields of .sector_offset & .sectors easily, so the new cloned
> bio isn't necessary any more.
> 
> Not only fixes dm-zone's kernel panic, but also cleans up dm io
> accounting & split a bit.

You're conflating quite a few things here.  DM zone really has no
business accessing io->orig_bio (dm-zone.c can just as easily inspect
the tio->clone; because it hasn't been remapped yet, it still reflects
io->orig_bio, so there is no need to look at io->orig_bio) -- but
yes, I clearly broke things during the 5.18 merge and it needs fixing
ASAP.

But I'm (ab)using the io->orig_bio assignment to signal to the
completion path that it may proceed.  These dm-5.19 commits impose
that even further:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a

And then leveraged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514

It could be that all these dm-5.19 changes suck, but I do know
dm-zone.c is too tightly coupled to DM core.  So I'll focus on that
first, fix 5.18, and then circle back to "what's next?".

Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12 20:52     ` Mike Snitzer
@ 2022-04-12 22:38       ` Damien Le Moal
  -1 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-12 22:38 UTC (permalink / raw)
  To: Mike Snitzer, Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel

On 4/13/22 05:52, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
>> The current DM codes setup ->orig_bio after __map_bio() returns,
>> and not only cause kernel panic for dm zone, but also a bit ugly
>> and tricky, especially the waiting until ->orig_bio is set in
>> dm_submit_bio_remap().
>>
>> The reason is that one new bio is cloned from original FS bio to
>> represent the mapped part, which just serves io accounting.
>>
>> Now we have switched to bdev based io accounting interface, and we
>> can retrieve sectors/bio_op from both the real original bio and the
>> added fields of .sector_offset & .sectors easily, so the new cloned
>> bio isn't necessary any more.
>>
>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>> accounting & split a bit.
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone, because it hasn't been remapped yet it reflects the
> io->origin_bio, so there is no need to look at io->orig_bio) -- but
> yes I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

The problem is that we need to look at the BIO op in both the submission
AND the completion path to handle zone append requests. Looking at the
clone on submission is OK since its op is still the original one. But on
the completion path, the clone may by then be a regular write emulating a
zone append op, and looking only at the clone does not allow detecting
that zone append.

We could save the orig_bio op in dm_io to avoid looking at the orig_bio
at all when detecting zone append.
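
Something along these lines, maybe (a rough, untested sketch just to show
the idea; the 'orig_bio_op' field and the completion helper are made-up
names, not existing code):

	/* in struct dm_io (drivers/md/dm-core.h) */
	unsigned int orig_bio_op;	/* bio_op() of the original bio */

	/* in alloc_io(), before __map_bio() can rewrite the clone */
	io->orig_bio_op = bio_op(bio);

	/* on the completion path, instead of bio_op(io->orig_bio) or
	 * looking at the clone (which may have become a plain write): */
	if (io->orig_bio_op == REQ_OP_ZONE_APPEND)
		handle_emulated_zone_append(io);	/* made-up helper */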

> 
> But I'm (ab)using io->orig_bio assignment to indicate to completion
> that it may proceed.  See these dm-5.19 commits to see it imposed even
> further:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a
> 
> And then leveraged here:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514
> 
> Could be all these dm-5.19 changes suck.. but I do know dm-zone.c is
> too tightly coupled to DM core.  So I'll focus on that first, fix
> 5.18, and then circle back to "what's next?".
> 
> Mike
> 


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-12 22:38       ` Damien Le Moal
  0 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-12 22:38 UTC (permalink / raw)
  To: Mike Snitzer, Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel

On 4/13/22 05:52, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
>> The current DM codes setup ->orig_bio after __map_bio() returns,
>> and not only cause kernel panic for dm zone, but also a bit ugly
>> and tricky, especially the waiting until ->orig_bio is set in
>> dm_submit_bio_remap().
>>
>> The reason is that one new bio is cloned from original FS bio to
>> represent the mapped part, which just serves io accounting.
>>
>> Now we have switched to bdev based io accounting interface, and we
>> can retrieve sectors/bio_op from both the real original bio and the
>> added fields of .sector_offset & .sectors easily, so the new cloned
>> bio isn't necessary any more.
>>
>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>> accounting & split a bit.
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone, because it hasn't been remapped yet it reflects the
> io->origin_bio, so there is no need to look at io->orig_bio) -- but
> yes I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

The problem is that we need to look at the BIO op in both the submission
AND the completion path to handle zone append requests. Looking at the
clone on submission is OK since its op is still the original one. But on
the completion path, the clone may by then be a regular write emulating a
zone append op, and looking only at the clone does not allow detecting
that zone append.

We could save the orig_bio op in dm_io to avoid looking at the orig_bio
at all when detecting zone append.

> 
> But I'm (ab)using io->orig_bio assignment to indicate to completion
> that it may proceed.  See these dm-5.19 commits to see it imposed even
> further:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=311a8e6650601a79079000466db77386c5ec2abb
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=56219ebb5f5c84785aa821f755d545eae41bdb1a
> 
> And then leveraged here:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19&id=4aa7a368370c2a172d5a0b8927c6332c4b6a3514
> 
> Could be all these dm-5.19 changes suck.. but I do know dm-zone.c is
> too tightly coupled to DM core.  So I'll focus on that first, fix
> 5.18, and then circle back to "what's next?".
> 
> Mike
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12 22:38       ` Damien Le Moal
@ 2022-04-12 23:00         ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 23:00 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Jens Axboe, linux-block, dm-devel, Ming Lei

On Tue, Apr 12 2022 at  6:38P -0400,
Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:

> On 4/13/22 05:52, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  4:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> >> The current DM codes setup ->orig_bio after __map_bio() returns,
> >> and not only cause kernel panic for dm zone, but also a bit ugly
> >> and tricky, especially the waiting until ->orig_bio is set in
> >> dm_submit_bio_remap().
> >>
> >> The reason is that one new bio is cloned from original FS bio to
> >> represent the mapped part, which just serves io accounting.
> >>
> >> Now we have switched to bdev based io accounting interface, and we
> >> can retrieve sectors/bio_op from both the real original bio and the
> >> added fields of .sector_offset & .sectors easily, so the new cloned
> >> bio isn't necessary any more.
> >>
> >> Not only fixes dm-zone's kernel panic, but also cleans up dm io
> >> accounting & split a bit.
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone, because it hasn't been remapped yet it reflects the
> > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > yes I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> Problem is that we need to look at the BIO op in submission AND completion
> path to handle zone append requests. So looking at the clone on submission
> is OK since its op is still the original one. But on the completion path,
> the clone may now be a regular write emulating a zone append op. And
> looking at the clone only does not allow detecting that zone append.
> 
> We could have the orig_bio op saved in dm_io to avoid completely looking
> at the orig_bio for detecting zone append.

Can you please try the following patch?

Really sorry for breaking dm-zone.c; please teach this man how to test
the basics of all things dm-zoned (is there a testsuite in the tools
or something?).

Thanks,
Mike

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..896127366000 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
 	return 0;
 }
 
+struct orig_bio_details {
+	unsigned int op;
+	unsigned int nr_sectors;
+};
+
 /*
  * First phase of BIO mapping for targets with zone append emulation:
  * check all BIO that change a zone writer pointer and change zone
  * append operations into regular write operations.
  */
-static bool dm_zone_map_bio_begin(struct mapped_device *md,
-				  struct bio *orig_bio, struct bio *clone)
+static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
+				  struct bio *clone)
 {
 	sector_t zsectors = blk_queue_zone_sectors(md->queue);
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/*
@@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
 	}
 
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 		return true;
@@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		 * target zone.
 		 */
 		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
-			(orig_bio->bi_opf & (~REQ_OP_MASK));
-		clone->bi_iter.bi_sector =
-			orig_bio->bi_iter.bi_sector + zwp_offset;
+			(clone->bi_opf & (~REQ_OP_MASK));
+		clone->bi_iter.bi_sector += zwp_offset;
 		break;
 	default:
 		DMWARN_LIMIT("Invalid BIO operation");
@@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
  * data written to a zone. Note that at this point, the remapped clone BIO
  * may already have completed, so we do not touch it.
  */
-static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
-					struct bio *orig_bio,
+static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
+					struct orig_bio_details *orig_bio_details,
 					unsigned int nr_sectors)
 {
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/* The clone BIO may already have been completed and failed */
@@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		return BLK_STS_IOERR;
 
 	/* Update the zone wp offset */
-	switch (bio_op(orig_bio)) {
+	switch (orig_bio_details->op) {
 	case REQ_OP_ZONE_RESET:
 		WRITE_ONCE(md->zwp_offset[zno], 0);
 		return BLK_STS_OK;
@@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		 * Check that the target did not truncate the write operation
 		 * emulating a zone append.
 		 */
-		if (nr_sectors != bio_sectors(orig_bio)) {
+		if (nr_sectors != orig_bio_details->nr_sectors) {
 			DMWARN_LIMIT("Truncated write for zone append");
 			return BLK_STS_IOERR;
 		}
@@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
 	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
 }
 
-static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
+static bool dm_need_zone_wp_tracking(struct bio *clone)
 {
 	/*
 	 * Special processing is not needed for operations that do not need the
@@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
 	 * zones and all operations that do not modify directly a sequential
 	 * zone write pointer.
 	 */
-	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
+	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
 		return false;
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 	case REQ_OP_ZONE_APPEND:
-		return bio_zone_is_seq(orig_bio);
+		return bio_zone_is_seq(clone);
 	default:
 		return false;
 	}
@@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = io->md;
 	struct request_queue *q = md->queue;
-	struct bio *orig_bio = io->orig_bio;
 	struct bio *clone = &tio->clone;
+	struct orig_bio_details orig_bio_details;
 	unsigned int zno;
 	blk_status_t sts;
 	int r;
@@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	 * IOs that do not change a zone write pointer do not need
 	 * any additional special processing.
 	 */
-	if (!dm_need_zone_wp_tracking(orig_bio))
+	if (!dm_need_zone_wp_tracking(clone))
 		return ti->type->map(ti, clone);
 
 	/* Lock the target zone */
-	zno = bio_zone_no(orig_bio);
+	zno = bio_zone_no(clone);
 	dm_zone_lock(q, zno, clone);
 
+	orig_bio_details.nr_sectors = bio_sectors(clone);
+	orig_bio_details.op = bio_op(clone);
+
 	/*
 	 * Check that the bio and the target zone write pointer offset are
 	 * both valid, and if the bio is a zone append, remap it to a write.
 	 */
-	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
+	if (!dm_zone_map_bio_begin(md, zno, clone)) {
 		dm_zone_unlock(q, zno, clone);
 		return DM_MAPIO_KILL;
 	}
@@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * The target submitted the clone BIO. The target zone will
 		 * be unlocked on completion of the clone.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		break;
 	case DM_MAPIO_REMAPPED:
 		/*
@@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * unlock the target zone here as the clone will not be
 		 * submitted.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		if (sts != BLK_STS_OK)
 			dm_zone_unlock(q, zno, clone);
 		break;

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-12 23:00         ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-12 23:00 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: Ming Lei, Jens Axboe, linux-block, dm-devel

On Tue, Apr 12 2022 at  6:38P -0400,
Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:

> On 4/13/22 05:52, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  4:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> >> The current DM codes setup ->orig_bio after __map_bio() returns,
> >> and not only cause kernel panic for dm zone, but also a bit ugly
> >> and tricky, especially the waiting until ->orig_bio is set in
> >> dm_submit_bio_remap().
> >>
> >> The reason is that one new bio is cloned from original FS bio to
> >> represent the mapped part, which just serves io accounting.
> >>
> >> Now we have switched to bdev based io accounting interface, and we
> >> can retrieve sectors/bio_op from both the real original bio and the
> >> added fields of .sector_offset & .sectors easily, so the new cloned
> >> bio isn't necessary any more.
> >>
> >> Not only fixes dm-zone's kernel panic, but also cleans up dm io
> >> accounting & split a bit.
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone, because it hasn't been remapped yet it reflects the
> > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > yes I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> Problem is that we need to look at the BIO op in submission AND completion
> path to handle zone append requests. So looking at the clone on submission
> is OK since its op is still the original one. But on the completion path,
> the clone may now be a regular write emulating a zone append op. And
> looking at the clone only does not allow detecting that zone append.
> 
> We could have the orig_bio op saved in dm_io to avoid completely looking
> at the orig_bio for detecting zone append.

Can you please try the following patch?

Really sorry for breaking dm-zone.c; please teach this man how to test
the basics of all things dm-zoned (is there a testsuite in the tools
or something?).

Thanks,
Mike

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..896127366000 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
 	return 0;
 }
 
+struct orig_bio_details {
+	unsigned int op;
+	unsigned int nr_sectors;
+};
+
 /*
  * First phase of BIO mapping for targets with zone append emulation:
  * check all BIO that change a zone writer pointer and change zone
  * append operations into regular write operations.
  */
-static bool dm_zone_map_bio_begin(struct mapped_device *md,
-				  struct bio *orig_bio, struct bio *clone)
+static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
+				  struct bio *clone)
 {
 	sector_t zsectors = blk_queue_zone_sectors(md->queue);
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/*
@@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
 	}
 
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 		return true;
@@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
 		 * target zone.
 		 */
 		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
-			(orig_bio->bi_opf & (~REQ_OP_MASK));
-		clone->bi_iter.bi_sector =
-			orig_bio->bi_iter.bi_sector + zwp_offset;
+			(clone->bi_opf & (~REQ_OP_MASK));
+		clone->bi_iter.bi_sector += zwp_offset;
 		break;
 	default:
 		DMWARN_LIMIT("Invalid BIO operation");
@@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
  * data written to a zone. Note that at this point, the remapped clone BIO
  * may already have completed, so we do not touch it.
  */
-static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
-					struct bio *orig_bio,
+static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
+					struct orig_bio_details *orig_bio_details,
 					unsigned int nr_sectors)
 {
-	unsigned int zno = bio_zone_no(orig_bio);
 	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
 
 	/* The clone BIO may already have been completed and failed */
@@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		return BLK_STS_IOERR;
 
 	/* Update the zone wp offset */
-	switch (bio_op(orig_bio)) {
+	switch (orig_bio_details->op) {
 	case REQ_OP_ZONE_RESET:
 		WRITE_ONCE(md->zwp_offset[zno], 0);
 		return BLK_STS_OK;
@@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
 		 * Check that the target did not truncate the write operation
 		 * emulating a zone append.
 		 */
-		if (nr_sectors != bio_sectors(orig_bio)) {
+		if (nr_sectors != orig_bio_details->nr_sectors) {
 			DMWARN_LIMIT("Truncated write for zone append");
 			return BLK_STS_IOERR;
 		}
@@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
 	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
 }
 
-static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
+static bool dm_need_zone_wp_tracking(struct bio *clone)
 {
 	/*
 	 * Special processing is not needed for operations that do not need the
@@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
 	 * zones and all operations that do not modify directly a sequential
 	 * zone write pointer.
 	 */
-	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
+	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
 		return false;
-	switch (bio_op(orig_bio)) {
+	switch (bio_op(clone)) {
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE:
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_FINISH:
 	case REQ_OP_ZONE_APPEND:
-		return bio_zone_is_seq(orig_bio);
+		return bio_zone_is_seq(clone);
 	default:
 		return false;
 	}
@@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	struct dm_target *ti = tio->ti;
 	struct mapped_device *md = io->md;
 	struct request_queue *q = md->queue;
-	struct bio *orig_bio = io->orig_bio;
 	struct bio *clone = &tio->clone;
+	struct orig_bio_details orig_bio_details;
 	unsigned int zno;
 	blk_status_t sts;
 	int r;
@@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 	 * IOs that do not change a zone write pointer do not need
 	 * any additional special processing.
 	 */
-	if (!dm_need_zone_wp_tracking(orig_bio))
+	if (!dm_need_zone_wp_tracking(clone))
 		return ti->type->map(ti, clone);
 
 	/* Lock the target zone */
-	zno = bio_zone_no(orig_bio);
+	zno = bio_zone_no(clone);
 	dm_zone_lock(q, zno, clone);
 
+	orig_bio_details.nr_sectors = bio_sectors(clone);
+	orig_bio_details.op = bio_op(clone);
+
 	/*
 	 * Check that the bio and the target zone write pointer offset are
 	 * both valid, and if the bio is a zone append, remap it to a write.
 	 */
-	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
+	if (!dm_zone_map_bio_begin(md, zno, clone)) {
 		dm_zone_unlock(q, zno, clone);
 		return DM_MAPIO_KILL;
 	}
@@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * The target submitted the clone BIO. The target zone will
 		 * be unlocked on completion of the clone.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		break;
 	case DM_MAPIO_REMAPPED:
 		/*
@@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		 * unlock the target zone here as the clone will not be
 		 * submitted.
 		 */
-		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
+		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
+					  *tio->len_ptr);
 		if (sts != BLK_STS_OK)
 			dm_zone_unlock(q, zno, clone);
 		break;


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12 23:00         ` Mike Snitzer
@ 2022-04-12 23:31           ` Damien Le Moal
  -1 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-12 23:31 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Ming Lei

[-- Attachment #1: Type: text/plain, Size: 8952 bytes --]

On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> On Tue, Apr 12 2022 at  4:56P -0400,
>>> Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>>> The current DM codes setup ->orig_bio after __map_bio() returns,
>>>> and not only cause kernel panic for dm zone, but also a bit ugly
>>>> and tricky, especially the waiting until ->orig_bio is set in
>>>> dm_submit_bio_remap().
>>>>
>>>> The reason is that one new bio is cloned from original FS bio to
>>>> represent the mapped part, which just serves io accounting.
>>>>
>>>> Now we have switched to bdev based io accounting interface, and we
>>>> can retrieve sectors/bio_op from both the real original bio and the
>>>> added fields of .sector_offset & .sectors easily, so the new cloned
>>>> bio isn't necessary any more.
>>>>
>>>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>>>> accounting & split a bit.
>>>
>>> You're conflating quite a few things here.  DM zone really has no
>>> business accessing io->orig_bio (dm-zone.c can just as easily inspect
>>> the tio->clone, because it hasn't been remapped yet it reflects the
>>> io->origin_bio, so there is no need to look at io->orig_bio) -- but
>>> yes I clearly broke things during the 5.18 merge and it needs fixing
>>> ASAP.
>>
>> Problem is that we need to look at the BIO op in submission AND completion
>> path to handle zone append requests. So looking at the clone on submission
>> is OK since its op is still the original one. But on the completion path,
>> the clone may now be a regular write emulating a zone append op. And
>> looking at the clone only does not allow detecting that zone append.
>>
>> We could have the orig_bio op saved in dm_io to avoid completely looking
>> at the orig_bio for detecting zone append.
> 
> Can you please try the following patch?

OK. Will do right away.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).

We have an internal test suite that checks all things related to zones. We
run it weekly on all RC releases. We did not catch this problem earlier
because we have not been running against for-next trees in previous cycles.
We could add such runs :)

We would be happy to contribute stuff for testing. Ideally, integrating
that into blktests, with a new DM group, would be nice. That was discussed
at past LSF events. Maybe a topic again for this year? Besides zone stuff,
I am sure we can add more DM tests (I am sure you also have a test suite?).

For quick tests, I generally use a zoned nullblk device. I am attaching two
scripts which make it easy to create and delete nullblk devices.
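
For reference, creating a zoned nullblk device through configfs boils down
to something like this (a simplified sketch, not the attached script itself;
the device name and sizes below are just examples):

	#!/bin/sh
	# Create a zoned null_blk device through configfs.
	modprobe null_blk nr_devices=0

	mkdir /sys/kernel/config/nullb/nullb0
	cd /sys/kernel/config/nullb/nullb0

	echo 4096 > blocksize      # logical block size in bytes
	echo 1024 > size           # capacity in MB
	echo 1 > memory_backed
	echo 1 > zoned
	echo 64 > zone_size        # zone size in MB
	echo 0 > zone_nr_conv      # number of conventional zones
	echo 1 > power             # device comes up as /dev/nullb0

Tearing it down is essentially the reverse:

	echo 0 > /sys/kernel/config/nullb/nullb0/power
	rmdir /sys/kernel/config/nullb/nullb0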

> 
> Thanks,
> Mike
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..896127366000 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
>  	return 0;
>  }
>  
> +struct orig_bio_details {
> +	unsigned int op;
> +	unsigned int nr_sectors;
> +};
> +
>  /*
>   * First phase of BIO mapping for targets with zone append emulation:
>   * check all BIO that change a zone writer pointer and change zone
>   * append operations into regular write operations.
>   */
> -static bool dm_zone_map_bio_begin(struct mapped_device *md,
> -				  struct bio *orig_bio, struct bio *clone)
> +static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
> +				  struct bio *clone)
>  {
>  	sector_t zsectors = blk_queue_zone_sectors(md->queue);
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/*
> @@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
>  	}
>  
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  		return true;
> @@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		 * target zone.
>  		 */
>  		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
> -			(orig_bio->bi_opf & (~REQ_OP_MASK));
> -		clone->bi_iter.bi_sector =
> -			orig_bio->bi_iter.bi_sector + zwp_offset;
> +			(clone->bi_opf & (~REQ_OP_MASK));
> +		clone->bi_iter.bi_sector += zwp_offset;
>  		break;
>  	default:
>  		DMWARN_LIMIT("Invalid BIO operation");
> @@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>   * data written to a zone. Note that at this point, the remapped clone BIO
>   * may already have completed, so we do not touch it.
>   */
> -static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
> -					struct bio *orig_bio,
> +static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
> +					struct orig_bio_details *orig_bio_details,
>  					unsigned int nr_sectors)
>  {
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/* The clone BIO may already have been completed and failed */
> @@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		return BLK_STS_IOERR;
>  
>  	/* Update the zone wp offset */
> -	switch (bio_op(orig_bio)) {
> +	switch (orig_bio_details->op) {
>  	case REQ_OP_ZONE_RESET:
>  		WRITE_ONCE(md->zwp_offset[zno], 0);
>  		return BLK_STS_OK;
> @@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		 * Check that the target did not truncate the write operation
>  		 * emulating a zone append.
>  		 */
> -		if (nr_sectors != bio_sectors(orig_bio)) {
> +		if (nr_sectors != orig_bio_details->nr_sectors) {
>  			DMWARN_LIMIT("Truncated write for zone append");
>  			return BLK_STS_IOERR;
>  		}
> @@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
>  	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
>  }
>  
> -static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
> +static bool dm_need_zone_wp_tracking(struct bio *clone)
>  {
>  	/*
>  	 * Special processing is not needed for operations that do not need the
> @@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
>  	 * zones and all operations that do not modify directly a sequential
>  	 * zone write pointer.
>  	 */
> -	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
> +	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
>  		return false;
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  	case REQ_OP_ZONE_APPEND:
> -		return bio_zone_is_seq(orig_bio);
> +		return bio_zone_is_seq(clone);
>  	default:
>  		return false;
>  	}
> @@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	struct dm_target *ti = tio->ti;
>  	struct mapped_device *md = io->md;
>  	struct request_queue *q = md->queue;
> -	struct bio *orig_bio = io->orig_bio;
>  	struct bio *clone = &tio->clone;
> +	struct orig_bio_details orig_bio_details;
>  	unsigned int zno;
>  	blk_status_t sts;
>  	int r;
> @@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	 * IOs that do not change a zone write pointer do not need
>  	 * any additional special processing.
>  	 */
> -	if (!dm_need_zone_wp_tracking(orig_bio))
> +	if (!dm_need_zone_wp_tracking(clone))
>  		return ti->type->map(ti, clone);
>  
>  	/* Lock the target zone */
> -	zno = bio_zone_no(orig_bio);
> +	zno = bio_zone_no(clone);
>  	dm_zone_lock(q, zno, clone);
>  
> +	orig_bio_details.nr_sectors = bio_sectors(clone);
> +	orig_bio_details.op = bio_op(clone);
> +
>  	/*
>  	 * Check that the bio and the target zone write pointer offset are
>  	 * both valid, and if the bio is a zone append, remap it to a write.
>  	 */
> -	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
> +	if (!dm_zone_map_bio_begin(md, zno, clone)) {
>  		dm_zone_unlock(q, zno, clone);
>  		return DM_MAPIO_KILL;
>  	}
> @@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * The target submitted the clone BIO. The target zone will
>  		 * be unlocked on completion of the clone.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		break;
>  	case DM_MAPIO_REMAPPED:
>  		/*
> @@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * unlock the target zone here as the clone will not be
>  		 * submitted.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		if (sts != BLK_STS_OK)
>  			dm_zone_unlock(q, zno, clone);
>  		break;
> 


-- 
Damien Le Moal
Western Digital Research

[-- Attachment #2: nullblk-create.sh --]
[-- Type: application/x-shellscript, Size: 3574 bytes --]

[-- Attachment #3: nullblk-destroy.sh --]
[-- Type: application/x-shellscript, Size: 306 bytes --]

[-- Attachment #4: Type: text/plain, Size: 98 bytes --]

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-12 23:31           ` Damien Le Moal
  0 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-12 23:31 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Jens Axboe, linux-block, dm-devel

[-- Attachment #1: Type: text/plain, Size: 8952 bytes --]

On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> On Tue, Apr 12 2022 at  4:56P -0400,
>>> Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>>> The current DM codes setup ->orig_bio after __map_bio() returns,
>>>> and not only cause kernel panic for dm zone, but also a bit ugly
>>>> and tricky, especially the waiting until ->orig_bio is set in
>>>> dm_submit_bio_remap().
>>>>
>>>> The reason is that one new bio is cloned from original FS bio to
>>>> represent the mapped part, which just serves io accounting.
>>>>
>>>> Now we have switched to bdev based io accounting interface, and we
>>>> can retrieve sectors/bio_op from both the real original bio and the
>>>> added fields of .sector_offset & .sectors easily, so the new cloned
>>>> bio isn't necessary any more.
>>>>
>>>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>>>> accounting & split a bit.
>>>
>>> You're conflating quite a few things here.  DM zone really has no
>>> business accessing io->orig_bio (dm-zone.c can just as easily inspect
>>> the tio->clone, because it hasn't been remapped yet it reflects the
>>> io->origin_bio, so there is no need to look at io->orig_bio) -- but
>>> yes I clearly broke things during the 5.18 merge and it needs fixing
>>> ASAP.
>>
>> Problem is that we need to look at the BIO op in submission AND completion
>> path to handle zone append requests. So looking at the clone on submission
>> is OK since its op is still the original one. But on the completion path,
>> the clone may now be a regular write emulating a zone append op. And
>> looking at the clone only does not allow detecting that zone append.
>>
>> We could have the orig_bio op saved in dm_io to avoid completely looking
>> at the orig_bio for detecting zone append.
> 
> Can you please try the following patch?

OK. Will do right away.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).

We have an internal test suite that checks all things related to zones. We
run it weekly on all RC releases. We did not catch this problem earlier
because we have not been running against for-next trees in previous cycles.
We could add such runs :)

We would be happy to contribute stuff for testing. Ideally, integrating
that into blktests, with a new DM group, would be nice. That was discussed
at past LSF events. Maybe a topic again for this year? Besides zone stuff,
I am sure we can add more DM tests (I am sure you also have a test suite?).

For quick tests, I generally use a zoned nullblk device. I am attaching two
scripts which make it easy to create and delete nullblk devices.

> 
> Thanks,
> Mike
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..896127366000 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
>  	return 0;
>  }
>  
> +struct orig_bio_details {
> +	unsigned int op;
> +	unsigned int nr_sectors;
> +};
> +
>  /*
>   * First phase of BIO mapping for targets with zone append emulation:
>   * check all BIO that change a zone writer pointer and change zone
>   * append operations into regular write operations.
>   */
> -static bool dm_zone_map_bio_begin(struct mapped_device *md,
> -				  struct bio *orig_bio, struct bio *clone)
> +static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
> +				  struct bio *clone)
>  {
>  	sector_t zsectors = blk_queue_zone_sectors(md->queue);
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/*
> @@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
>  	}
>  
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  		return true;
> @@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		 * target zone.
>  		 */
>  		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
> -			(orig_bio->bi_opf & (~REQ_OP_MASK));
> -		clone->bi_iter.bi_sector =
> -			orig_bio->bi_iter.bi_sector + zwp_offset;
> +			(clone->bi_opf & (~REQ_OP_MASK));
> +		clone->bi_iter.bi_sector += zwp_offset;
>  		break;
>  	default:
>  		DMWARN_LIMIT("Invalid BIO operation");
> @@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>   * data written to a zone. Note that at this point, the remapped clone BIO
>   * may already have completed, so we do not touch it.
>   */
> -static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
> -					struct bio *orig_bio,
> +static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
> +					struct orig_bio_details *orig_bio_details,
>  					unsigned int nr_sectors)
>  {
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/* The clone BIO may already have been completed and failed */
> @@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		return BLK_STS_IOERR;
>  
>  	/* Update the zone wp offset */
> -	switch (bio_op(orig_bio)) {
> +	switch (orig_bio_details->op) {
>  	case REQ_OP_ZONE_RESET:
>  		WRITE_ONCE(md->zwp_offset[zno], 0);
>  		return BLK_STS_OK;
> @@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		 * Check that the target did not truncate the write operation
>  		 * emulating a zone append.
>  		 */
> -		if (nr_sectors != bio_sectors(orig_bio)) {
> +		if (nr_sectors != orig_bio_details->nr_sectors) {
>  			DMWARN_LIMIT("Truncated write for zone append");
>  			return BLK_STS_IOERR;
>  		}
> @@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
>  	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
>  }
>  
> -static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
> +static bool dm_need_zone_wp_tracking(struct bio *clone)
>  {
>  	/*
>  	 * Special processing is not needed for operations that do not need the
> @@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
>  	 * zones and all operations that do not modify directly a sequential
>  	 * zone write pointer.
>  	 */
> -	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
> +	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
>  		return false;
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  	case REQ_OP_ZONE_APPEND:
> -		return bio_zone_is_seq(orig_bio);
> +		return bio_zone_is_seq(clone);
>  	default:
>  		return false;
>  	}
> @@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	struct dm_target *ti = tio->ti;
>  	struct mapped_device *md = io->md;
>  	struct request_queue *q = md->queue;
> -	struct bio *orig_bio = io->orig_bio;
>  	struct bio *clone = &tio->clone;
> +	struct orig_bio_details orig_bio_details;
>  	unsigned int zno;
>  	blk_status_t sts;
>  	int r;
> @@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	 * IOs that do not change a zone write pointer do not need
>  	 * any additional special processing.
>  	 */
> -	if (!dm_need_zone_wp_tracking(orig_bio))
> +	if (!dm_need_zone_wp_tracking(clone))
>  		return ti->type->map(ti, clone);
>  
>  	/* Lock the target zone */
> -	zno = bio_zone_no(orig_bio);
> +	zno = bio_zone_no(clone);
>  	dm_zone_lock(q, zno, clone);
>  
> +	orig_bio_details.nr_sectors = bio_sectors(clone);
> +	orig_bio_details.op = bio_op(clone);
> +
>  	/*
>  	 * Check that the bio and the target zone write pointer offset are
>  	 * both valid, and if the bio is a zone append, remap it to a write.
>  	 */
> -	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
> +	if (!dm_zone_map_bio_begin(md, zno, clone)) {
>  		dm_zone_unlock(q, zno, clone);
>  		return DM_MAPIO_KILL;
>  	}
> @@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * The target submitted the clone BIO. The target zone will
>  		 * be unlocked on completion of the clone.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		break;
>  	case DM_MAPIO_REMAPPED:
>  		/*
> @@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * unlock the target zone here as the clone will not be
>  		 * submitted.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		if (sts != BLK_STS_OK)
>  			dm_zone_unlock(q, zno, clone);
>  		break;
> 


-- 
Damien Le Moal
Western Digital Research

[-- Attachment #2: nullblk-create.sh --]
[-- Type: application/x-shellscript, Size: 3574 bytes --]

[-- Attachment #3: nullblk-destroy.sh --]
[-- Type: application/x-shellscript, Size: 306 bytes --]

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12 23:00         ` Mike Snitzer
@ 2022-04-13  0:00           ` Damien Le Moal
  -1 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-13  0:00 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Jens Axboe, linux-block, dm-devel

On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> On Tue, Apr 12 2022 at  4:56P -0400,
>>> Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>>> The current DM codes setup ->orig_bio after __map_bio() returns,
>>>> and not only cause kernel panic for dm zone, but also a bit ugly
>>>> and tricky, especially the waiting until ->orig_bio is set in
>>>> dm_submit_bio_remap().
>>>>
>>>> The reason is that one new bio is cloned from original FS bio to
>>>> represent the mapped part, which just serves io accounting.
>>>>
>>>> Now we have switched to bdev based io accounting interface, and we
>>>> can retrieve sectors/bio_op from both the real original bio and the
>>>> added fields of .sector_offset & .sectors easily, so the new cloned
>>>> bio isn't necessary any more.
>>>>
>>>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>>>> accounting & split a bit.
>>>
>>> You're conflating quite a few things here.  DM zone really has no
>>> business accessing io->orig_bio (dm-zone.c can just as easily inspect
>>> the tio->clone, because it hasn't been remapped yet it reflects the
>>> io->origin_bio, so there is no need to look at io->orig_bio) -- but
>>> yes I clearly broke things during the 5.18 merge and it needs fixing
>>> ASAP.
>>
>> Problem is that we need to look at the BIO op in submission AND completion
>> path to handle zone append requests. So looking at the clone on submission
>> is OK since its op is still the original one. But on the completion path,
>> the clone may now be a regular write emulating a zone append op. And
>> looking at the clone only does not allow detecting that zone append.
>>
>> We could have the orig_bio op saved in dm_io to avoid completely looking
>> at the orig_bio for detecting zone append.
> 
> Can you please try the following patch?

This works. I tested with a zoned nullblk + dm-crypt, forcing the zone
append emulation code to be used. I ran zonefs tests on top of that with
no issues. I will run btrfs tests too later today to exercise things a
little more.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).
> 
> Thanks,
> Mike
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..896127366000 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
>  	return 0;
>  }
>  
> +struct orig_bio_details {
> +	unsigned int op;
> +	unsigned int nr_sectors;
> +};
> +
>  /*
>   * First phase of BIO mapping for targets with zone append emulation:
>   * check all BIO that change a zone writer pointer and change zone
>   * append operations into regular write operations.
>   */
> -static bool dm_zone_map_bio_begin(struct mapped_device *md,
> -				  struct bio *orig_bio, struct bio *clone)
> +static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
> +				  struct bio *clone)
>  {
>  	sector_t zsectors = blk_queue_zone_sectors(md->queue);
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/*
> @@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
>  	}
>  
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  		return true;
> @@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		 * target zone.
>  		 */
>  		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
> -			(orig_bio->bi_opf & (~REQ_OP_MASK));
> -		clone->bi_iter.bi_sector =
> -			orig_bio->bi_iter.bi_sector + zwp_offset;
> +			(clone->bi_opf & (~REQ_OP_MASK));
> +		clone->bi_iter.bi_sector += zwp_offset;
>  		break;
>  	default:
>  		DMWARN_LIMIT("Invalid BIO operation");
> @@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>   * data written to a zone. Note that at this point, the remapped clone BIO
>   * may already have completed, so we do not touch it.
>   */
> -static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
> -					struct bio *orig_bio,
> +static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
> +					struct orig_bio_details *orig_bio_details,
>  					unsigned int nr_sectors)
>  {
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/* The clone BIO may already have been completed and failed */
> @@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		return BLK_STS_IOERR;
>  
>  	/* Update the zone wp offset */
> -	switch (bio_op(orig_bio)) {
> +	switch (orig_bio_details->op) {
>  	case REQ_OP_ZONE_RESET:
>  		WRITE_ONCE(md->zwp_offset[zno], 0);
>  		return BLK_STS_OK;
> @@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		 * Check that the target did not truncate the write operation
>  		 * emulating a zone append.
>  		 */
> -		if (nr_sectors != bio_sectors(orig_bio)) {
> +		if (nr_sectors != orig_bio_details->nr_sectors) {
>  			DMWARN_LIMIT("Truncated write for zone append");
>  			return BLK_STS_IOERR;
>  		}
> @@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
>  	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
>  }
>  
> -static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
> +static bool dm_need_zone_wp_tracking(struct bio *clone)
>  {
>  	/*
>  	 * Special processing is not needed for operations that do not need the
> @@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
>  	 * zones and all operations that do not modify directly a sequential
>  	 * zone write pointer.
>  	 */
> -	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
> +	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
>  		return false;
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  	case REQ_OP_ZONE_APPEND:
> -		return bio_zone_is_seq(orig_bio);
> +		return bio_zone_is_seq(clone);
>  	default:
>  		return false;
>  	}
> @@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	struct dm_target *ti = tio->ti;
>  	struct mapped_device *md = io->md;
>  	struct request_queue *q = md->queue;
> -	struct bio *orig_bio = io->orig_bio;
>  	struct bio *clone = &tio->clone;
> +	struct orig_bio_details orig_bio_details;
>  	unsigned int zno;
>  	blk_status_t sts;
>  	int r;
> @@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	 * IOs that do not change a zone write pointer do not need
>  	 * any additional special processing.
>  	 */
> -	if (!dm_need_zone_wp_tracking(orig_bio))
> +	if (!dm_need_zone_wp_tracking(clone))
>  		return ti->type->map(ti, clone);
>  
>  	/* Lock the target zone */
> -	zno = bio_zone_no(orig_bio);
> +	zno = bio_zone_no(clone);
>  	dm_zone_lock(q, zno, clone);
>  
> +	orig_bio_details.nr_sectors = bio_sectors(clone);
> +	orig_bio_details.op = bio_op(clone);
> +
>  	/*
>  	 * Check that the bio and the target zone write pointer offset are
>  	 * both valid, and if the bio is a zone append, remap it to a write.
>  	 */
> -	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
> +	if (!dm_zone_map_bio_begin(md, zno, clone)) {
>  		dm_zone_unlock(q, zno, clone);
>  		return DM_MAPIO_KILL;
>  	}
> @@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * The target submitted the clone BIO. The target zone will
>  		 * be unlocked on completion of the clone.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		break;
>  	case DM_MAPIO_REMAPPED:
>  		/*
> @@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * unlock the target zone here as the clone will not be
>  		 * submitted.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		if (sts != BLK_STS_OK)
>  			dm_zone_unlock(q, zno, clone);
>  		break;
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-13  0:00           ` Damien Le Moal
  0 siblings, 0 replies; 62+ messages in thread
From: Damien Le Moal @ 2022-04-13  0:00 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Ming Lei

On 4/13/22 08:00, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  6:38P -0400,
> Damien Le Moal <damien.lemoal@opensource.wdc.com> wrote:
> 
>> On 4/13/22 05:52, Mike Snitzer wrote:
>>> On Tue, Apr 12 2022 at  4:56P -0400,
>>> Ming Lei <ming.lei@redhat.com> wrote:
>>>
>>>> The current DM codes setup ->orig_bio after __map_bio() returns,
>>>> and not only cause kernel panic for dm zone, but also a bit ugly
>>>> and tricky, especially the waiting until ->orig_bio is set in
>>>> dm_submit_bio_remap().
>>>>
>>>> The reason is that one new bio is cloned from original FS bio to
>>>> represent the mapped part, which just serves io accounting.
>>>>
>>>> Now we have switched to bdev based io accounting interface, and we
>>>> can retrieve sectors/bio_op from both the real original bio and the
>>>> added fields of .sector_offset & .sectors easily, so the new cloned
>>>> bio isn't necessary any more.
>>>>
>>>> Not only fixes dm-zone's kernel panic, but also cleans up dm io
>>>> accounting & split a bit.
>>>
>>> You're conflating quite a few things here.  DM zone really has no
>>> business accessing io->orig_bio (dm-zone.c can just as easily inspect
>>> the tio->clone, because it hasn't been remapped yet it reflects the
>>> io->origin_bio, so there is no need to look at io->orig_bio) -- but
>>> yes I clearly broke things during the 5.18 merge and it needs fixing
>>> ASAP.
>>
>> Problem is that we need to look at the BIO op in submission AND completion
>> path to handle zone append requests. So looking at the clone on submission
>> is OK since its op is still the original one. But on the completion path,
>> the clone may now be a regular write emulating a zone append op. And
>> looking at the clone only does not allow detecting that zone append.
>>
>> We could have the orig_bio op saved in dm_io to avoid completely looking
>> at the orig_bio for detecting zone append.
> 
> Can you please try the following patch?

This works. I tested with a zoned nullblk + dm-crypt, forcing the zone
append emulation code to be used. I ran zonefs tests on top of that with
no issues. I will run btrfs tests too later today to exercise things a
little more.

> 
> Really sorry for breaking dm-zone.c; please teach this man how to test
> the basics of all things dm-zoned (is there a testsuite in the tools
> or something?).
> 
> Thanks,
> Mike
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..896127366000 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -360,16 +360,20 @@ static int dm_update_zone_wp_offset(struct mapped_device *md, unsigned int zno,
>  	return 0;
>  }
>  
> +struct orig_bio_details {
> +	unsigned int op;
> +	unsigned int nr_sectors;
> +};
> +
>  /*
>   * First phase of BIO mapping for targets with zone append emulation:
>   * check all BIO that change a zone writer pointer and change zone
>   * append operations into regular write operations.
>   */
> -static bool dm_zone_map_bio_begin(struct mapped_device *md,
> -				  struct bio *orig_bio, struct bio *clone)
> +static bool dm_zone_map_bio_begin(struct mapped_device *md, unsigned int zno,
> +				  struct bio *clone)
>  {
>  	sector_t zsectors = blk_queue_zone_sectors(md->queue);
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/*
> @@ -384,7 +388,7 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		WRITE_ONCE(md->zwp_offset[zno], zwp_offset);
>  	}
>  
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  		return true;
> @@ -401,9 +405,8 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>  		 * target zone.
>  		 */
>  		clone->bi_opf = REQ_OP_WRITE | REQ_NOMERGE |
> -			(orig_bio->bi_opf & (~REQ_OP_MASK));
> -		clone->bi_iter.bi_sector =
> -			orig_bio->bi_iter.bi_sector + zwp_offset;
> +			(clone->bi_opf & (~REQ_OP_MASK));
> +		clone->bi_iter.bi_sector += zwp_offset;
>  		break;
>  	default:
>  		DMWARN_LIMIT("Invalid BIO operation");
> @@ -423,11 +426,10 @@ static bool dm_zone_map_bio_begin(struct mapped_device *md,
>   * data written to a zone. Note that at this point, the remapped clone BIO
>   * may already have completed, so we do not touch it.
>   */
> -static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
> -					struct bio *orig_bio,
> +static blk_status_t dm_zone_map_bio_end(struct mapped_device *md, unsigned int zno,
> +					struct orig_bio_details *orig_bio_details,
>  					unsigned int nr_sectors)
>  {
> -	unsigned int zno = bio_zone_no(orig_bio);
>  	unsigned int zwp_offset = READ_ONCE(md->zwp_offset[zno]);
>  
>  	/* The clone BIO may already have been completed and failed */
> @@ -435,7 +437,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		return BLK_STS_IOERR;
>  
>  	/* Update the zone wp offset */
> -	switch (bio_op(orig_bio)) {
> +	switch (orig_bio_details->op) {
>  	case REQ_OP_ZONE_RESET:
>  		WRITE_ONCE(md->zwp_offset[zno], 0);
>  		return BLK_STS_OK;
> @@ -452,7 +454,7 @@ static blk_status_t dm_zone_map_bio_end(struct mapped_device *md,
>  		 * Check that the target did not truncate the write operation
>  		 * emulating a zone append.
>  		 */
> -		if (nr_sectors != bio_sectors(orig_bio)) {
> +		if (nr_sectors != orig_bio_details->nr_sectors) {
>  			DMWARN_LIMIT("Truncated write for zone append");
>  			return BLK_STS_IOERR;
>  		}
> @@ -488,7 +490,7 @@ static inline void dm_zone_unlock(struct request_queue *q,
>  	bio_clear_flag(clone, BIO_ZONE_WRITE_LOCKED);
>  }
>  
> -static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
> +static bool dm_need_zone_wp_tracking(struct bio *clone)
>  {
>  	/*
>  	 * Special processing is not needed for operations that do not need the
> @@ -496,15 +498,15 @@ static bool dm_need_zone_wp_tracking(struct bio *orig_bio)
>  	 * zones and all operations that do not modify directly a sequential
>  	 * zone write pointer.
>  	 */
> -	if (op_is_flush(orig_bio->bi_opf) && !bio_sectors(orig_bio))
> +	if (op_is_flush(clone->bi_opf) && !bio_sectors(clone))
>  		return false;
> -	switch (bio_op(orig_bio)) {
> +	switch (bio_op(clone)) {
>  	case REQ_OP_WRITE_ZEROES:
>  	case REQ_OP_WRITE:
>  	case REQ_OP_ZONE_RESET:
>  	case REQ_OP_ZONE_FINISH:
>  	case REQ_OP_ZONE_APPEND:
> -		return bio_zone_is_seq(orig_bio);
> +		return bio_zone_is_seq(clone);
>  	default:
>  		return false;
>  	}
> @@ -519,8 +521,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	struct dm_target *ti = tio->ti;
>  	struct mapped_device *md = io->md;
>  	struct request_queue *q = md->queue;
> -	struct bio *orig_bio = io->orig_bio;
>  	struct bio *clone = &tio->clone;
> +	struct orig_bio_details orig_bio_details;
>  	unsigned int zno;
>  	blk_status_t sts;
>  	int r;
> @@ -529,18 +531,21 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  	 * IOs that do not change a zone write pointer do not need
>  	 * any additional special processing.
>  	 */
> -	if (!dm_need_zone_wp_tracking(orig_bio))
> +	if (!dm_need_zone_wp_tracking(clone))
>  		return ti->type->map(ti, clone);
>  
>  	/* Lock the target zone */
> -	zno = bio_zone_no(orig_bio);
> +	zno = bio_zone_no(clone);
>  	dm_zone_lock(q, zno, clone);
>  
> +	orig_bio_details.nr_sectors = bio_sectors(clone);
> +	orig_bio_details.op = bio_op(clone);
> +
>  	/*
>  	 * Check that the bio and the target zone write pointer offset are
>  	 * both valid, and if the bio is a zone append, remap it to a write.
>  	 */
> -	if (!dm_zone_map_bio_begin(md, orig_bio, clone)) {
> +	if (!dm_zone_map_bio_begin(md, zno, clone)) {
>  		dm_zone_unlock(q, zno, clone);
>  		return DM_MAPIO_KILL;
>  	}
> @@ -560,7 +565,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * The target submitted the clone BIO. The target zone will
>  		 * be unlocked on completion of the clone.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		break;
>  	case DM_MAPIO_REMAPPED:
>  		/*
> @@ -568,7 +574,8 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		 * unlock the target zone here as the clone will not be
>  		 * submitted.
>  		 */
> -		sts = dm_zone_map_bio_end(md, orig_bio, *tio->len_ptr);
> +		sts = dm_zone_map_bio_end(md, zno, &orig_bio_details,
> +					  *tio->len_ptr);
>  		if (sts != BLK_STS_OK)
>  			dm_zone_unlock(q, zno, clone);
>  		break;
> 


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly
  2022-04-12 20:28     ` Mike Snitzer
@ 2022-04-13  1:43       ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13  1:43 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 04:28:59PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > All the other 4 parameters are retrieved from the 'dm_io' instance, so
> > not necessary to pass all four to dm_io_acct().
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Yeah, commit 0ab30b4079e103 ("dm: eliminate copying of dm_io fields in
> dm_io_dec_pending") could've gone further to do what you've done here
> in this patch.
> 
> But it stopped short because of the additional "games" associated with
> the late assignment of io->orig_bio that is in the dm-5.19 branch.

OK, I will rebase on dm-5.19, but IMO the late assignment of
io->orig_bio isn't a good idea, and the same goes for splitting one
bio just for accounting; things shouldn't be so tricky.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly
@ 2022-04-13  1:43       ` Ming Lei
  0 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13  1:43 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 04:28:59PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > All the other 4 parameters are retrieved from the 'dm_io' instance, so
> > not necessary to pass all four to dm_io_acct().
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Yeah, commit 0ab30b4079e103 ("dm: eliminate copying of dm_io fields in
> dm_io_dec_pending") could've gone further to do what you've done here
> in this patch.
> 
> But it stopped short because of the additional "games" associated with
> the late assignment of io->orig_bio that is in the dm-5.19 branch.

OK, I will rebase on dm-5.19, but IMO the late assignment of
io->orig_bio isn't a good idea, and the same goes for splitting one
bio just for accounting; things shouldn't be so tricky.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-12 20:52     ` Mike Snitzer
@ 2022-04-13  1:56       ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13  1:56 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > The current DM codes setup ->orig_bio after __map_bio() returns,
> > and not only cause kernel panic for dm zone, but also a bit ugly
> > and tricky, especially the waiting until ->orig_bio is set in
> > dm_submit_bio_remap().
> > 
> > The reason is that one new bio is cloned from original FS bio to
> > represent the mapped part, which just serves io accounting.
> > 
> > Now we have switched to bdev based io accounting interface, and we
> > can retrieve sectors/bio_op from both the real original bio and the
> > added fields of .sector_offset & .sectors easily, so the new cloned
> > bio isn't necessary any more.
> > 
> > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > accounting & split a bit.
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone, because it hasn't been remapped yet it reflects the
> io->origin_bio, so there is no need to look at io->orig_bio) -- but
> yes I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

You can just consider the cleanup part of these patches, :-)

1) no late assignment of ->orig_bio; always set it in alloc_io()
(a rough sketch follows below)

2) no waiting on ->orig_bio, especially since the waiting is done in
the fast path of dm_submit_bio_remap().

3) no split for io accounting
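
The first item is essentially a one-liner; a rough sketch of the idea
(not the exact hunk from patch 5/8) is:

	/* in alloc_io(): point ->orig_bio at the FS bio as soon as the
	 * dm_io is allocated, instead of leaving it NULL and assigning
	 * it only after __map_bio() returns */
	io->orig_bio = bio;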


Thanks,
Ming


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-13  1:56       ` Ming Lei
  0 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13  1:56 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  4:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > The current DM codes setup ->orig_bio after __map_bio() returns,
> > and not only cause kernel panic for dm zone, but also a bit ugly
> > and tricky, especially the waiting until ->orig_bio is set in
> > dm_submit_bio_remap().
> > 
> > The reason is that one new bio is cloned from original FS bio to
> > represent the mapped part, which just serves io accounting.
> > 
> > Now we have switched to bdev based io accounting interface, and we
> > can retrieve sectors/bio_op from both the real original bio and the
> > added fields of .sector_offset & .sectors easily, so the new cloned
> > bio isn't necessary any more.
> > 
> > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > accounting & split a bit.
> 
> You're conflating quite a few things here.  DM zone really has no
> business accessing io->orig_bio (dm-zone.c can just as easily inspect
> the tio->clone, because it hasn't been remapped yet it reflects the
> io->origin_bio, so there is no need to look at io->orig_bio) -- but
> yes I clearly broke things during the 5.18 merge and it needs fixing
> ASAP.

You can just consider the cleanup part of these patches, :-)

1) no late assignment of ->orig_bio; always set it in alloc_io()

2) no waiting on ->orig_bio, especially since the waiting is done in
the fast path of dm_submit_bio_remap().

3) no split for io accounting


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-13  1:56       ` [dm-devel] " Ming Lei
@ 2022-04-13  6:12         ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-13  6:12 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  9:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  4:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > and tricky, especially the waiting until ->orig_bio is set in
> > > dm_submit_bio_remap().
> > > 
> > > The reason is that one new bio is cloned from original FS bio to
> > > represent the mapped part, which just serves io accounting.
> > > 
> > > Now we have switched to bdev based io accounting interface, and we
> > > can retrieve sectors/bio_op from both the real original bio and the
> > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > bio isn't necessary any more.
> > > 
> > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > accounting & split a bit.
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone, because it hasn't been remapped yet it reflects the
> > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > yes I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> You can just consider the cleanup part of this patches, :-)

I will.  But your following list doesn't reflect any "cleanup" that I
saw in your patchset.  These are pretty fundamental changes that are
similar to, but different from, the dm-5.19 changes I've staged.

> 1) no late assignment of ->orig_bio, and always set it in alloc_io()
>
> 2) no waiting on on ->origi_bio, especially the waiting is done in
> fast path of dm_submit_bio_remap().

For 5.18 waiting on io->orig_bio just enables a signal that the IO was
split and can be accounted.

For 5.19 I also plan on using late io->orig_bio assignment as an
alternative to the full-blown refcounting currently done with
io->io_count.  I've yet to quantify the gains with focused testing, but
in theory this approach should scale better on large systems with many
concurrent IO threads to the same device (RCU is the primary
constraint now).

I'll try to write a bpftrace script to measure how frequently "waiting
on io->orig_bio" occurs for dm_submit_bio_remap()-heavy usage (like
dm-crypt). But I think we'll find it is very rarely, if ever, waited
on in the fast path.

> 3) no split for io accounting

DM's more recent approach to splitting was never done for the benefit
or use of IO accounting; see this commit for its origin:
18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")

Not sure why you keep poking fun at DM only doing a single split when
that is the actual design.  DM splits off orig_bio and then recurses to
handle the remainder of the bio that wasn't issued.  Storing it in
io->orig_bio (previously io->bio) was always a means of reflecting
things properly. And yes, IO accounting is one use; the other is IO
completion. But unfortunately DM's IO accounting has always been a
mess ever since the above commit. Changes in 5.18 fixed that.

But again, DM's splitting has _nothing_ to do with IO accounting.
Splitting only happens when needed for IO submission, given the
constraints of the DM target(s) or underlying layers.
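
For reference, a rough sketch of that split-and-recurse step (a
paraphrase of the dm.c code, not the exact hunk):

	/* target(s) consumed only part of the bio: split off the part
	 * that was just mapped, remember it for accounting/completion,
	 * and resubmit the remainder so it is handled depth-first */
	if (ci.sector_count) {
		struct bio *b = bio_split(bio, bio_sectors(bio) - ci.sector_count,
					  GFP_NOIO, &md->queue->bio_split);

		ci.io->orig_bio = b;	/* the split-off, already-issued part */
		bio_chain(b, bio);	/* 'bio' now also waits on 'b' to complete */
		submit_bio_noacct(bio);	/* recurse on the remainder */
	}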

All said, I will look closer at your entire set and see if it is better
to go with your approach.  This patch in particular is interesting
(avoids cloning and other complexity of bio_split + bio_chain):
https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-13  6:12         ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-13  6:12 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12 2022 at  9:56P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  4:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > and tricky, especially the waiting until ->orig_bio is set in
> > > dm_submit_bio_remap().
> > > 
> > > The reason is that one new bio is cloned from original FS bio to
> > > represent the mapped part, which just serves io accounting.
> > > 
> > > Now we have switched to bdev based io accounting interface, and we
> > > can retrieve sectors/bio_op from both the real original bio and the
> > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > bio isn't necessary any more.
> > > 
> > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > accounting & split a bit.
> > 
> > You're conflating quite a few things here.  DM zone really has no
> > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > the tio->clone, because it hasn't been remapped yet it reflects the
> > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > yes I clearly broke things during the 5.18 merge and it needs fixing
> > ASAP.
> 
> You can just consider the cleanup part of this patches, :-)

I will.  But your following list doesn't reflect any "cleanup" that I
saw in your patchset.  These are pretty fundamental changes that are
similar to, but different from, the dm-5.19 changes I've staged.

> 1) no late assignment of ->orig_bio, and always set it in alloc_io()
>
> 2) no waiting on on ->origi_bio, especially the waiting is done in
> fast path of dm_submit_bio_remap().

For 5.18 waiting on io->orig_bio just enables a signal that the IO was
split and can be accounted.

For 5.19 I also plan on using late io->orig_bio assignment as an
alternative to the full-blown refcounting currently done with
io->io_count.  I've yet to quantify the gains with focused testing, but
in theory this approach should scale better on large systems with many
concurrent IO threads to the same device (RCU is the primary
constraint now).

I'll try to write a bpftrace script to measure how frequently "waiting
on io->orig_bio" occurs for dm_submit_bio_remap()-heavy usage (like
dm-crypt). But I think we'll find it is very rarely, if ever, waited
on in the fast path.

> 3) no split for io accounting

DM's more recent approach to splitting was never done for the benefit
or use of IO accounting; see this commit for its origin:
18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")

Not sure why you keep poking fun at DM only doing a single split when
that is the actual design.  DM splits off orig_bio and then recurses to
handle the remainder of the bio that wasn't issued.  Storing it in
io->orig_bio (previously io->bio) was always a means of reflecting
things properly. And yes, IO accounting is one use; the other is IO
completion. But unfortunately DM's IO accounting has always been a
mess ever since the above commit. Changes in 5.18 fixed that.

But again, DM's splitting has _nothing_ to do with IO accounting.
Splitting only happens when needed for IO submission, given the
constraints of the DM target(s) or underlying layers.

All said, I will look closer at your entire set and see if it is better
to go with your approach.  This patch in particular is interesting
(avoids cloning and other complexity of bio_split + bio_chain):
https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-13  6:12         ` [dm-devel] " Mike Snitzer
@ 2022-04-13 12:26           ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13 12:26 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  9:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > dm_submit_bio_remap().
> > > > 
> > > > The reason is that one new bio is cloned from original FS bio to
> > > > represent the mapped part, which just serves io accounting.
> > > > 
> > > > Now we have switched to bdev based io accounting interface, and we
> > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > bio isn't necessary any more.
> > > > 
> > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > accounting & split a bit.
> > > 
> > > You're conflating quite a few things here.  DM zone really has no
> > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > ASAP.
> > 
> > You can just consider the cleanup part of this patches, :-)
> 
> I will.  But your following list doesn't reflect any "cleanup" that I
> saw in your patchset.  Pretty fundamental changes that are similar,
> but different, to the dm-5.19 changes I've staged.
> 
> > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> >
> > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > fast path of dm_submit_bio_remap().
> 
> For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> split and can be accounted.
> 
> For 5.19 I also plan on using late io->orig_bio assignment as an
> alternative to the full-blown refcounting currently done with
> io->io_count.  I've yet to quantify the gains with focused testing but
> in theory this approach should scale better on large systems with many
> concurrent IO threads to the same device (RCU is primary constraint
> now).
> 
> I'll try to write a bpfrace script to measure how frequently "waiting on
> io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> dm-crypt). But I think we'll find it is very rarely, if ever, waited
> on in the fast path.

The waiting depends on CPU and device speed: if the device is quicker
than the CPU, the wait should be longer. Testing in one environment is
usually not enough.

> 
> > 3) no split for io accounting
> 
> DM's more recent approach to splitting has never been done for benefit
> or use of IO accounting, see this commit for its origin:
> 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> 
> Not sure why you keep poking fun at DM only doing a single split when:
> that is the actual design.  DM splits off orig_bio then recurses to
> handle the remainder of the bio that wasn't issued.  Storing it in
> io->orig_bio (previously io->bio) was always a means of reflecting
> things properly. And yes IO accounting is one use, the other is IO
> completion. But unfortunately DM's IO accounting has always been a
> mess ever since the above commit. Changes in 5.18 fixed that.
> 
> But again, DM's splitting has _nothing_ to do with IO accounting.
> Splitting only happens when needed for IO submission given constraints
> of DM target(s) or underlying layers.

What I meant is that the bio returned from bio_split() is only for
io accounting. Yeah, the comment said it can be used for io completion
too, but that is easily done without the split bio.

> 
> All said, I will look closer at your entire set and see if it better
> to go with your approach.  This patch in particular is interesting
> (avoids cloning and other complexity of bio_split + bio_chain):
> https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/

That patch shows we can avoid the extra split, and it also shows that
the split bio returned from bio_split() is for io accounting only.


thanks,
Ming


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-13 12:26           ` Ming Lei
  0 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-13 12:26 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> On Tue, Apr 12 2022 at  9:56P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > dm_submit_bio_remap().
> > > > 
> > > > The reason is that one new bio is cloned from original FS bio to
> > > > represent the mapped part, which just serves io accounting.
> > > > 
> > > > Now we have switched to bdev based io accounting interface, and we
> > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > bio isn't necessary any more.
> > > > 
> > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > accounting & split a bit.
> > > 
> > > You're conflating quite a few things here.  DM zone really has no
> > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > ASAP.
> > 
> > You can just consider the cleanup part of this patches, :-)
> 
> I will.  But your following list doesn't reflect any "cleanup" that I
> saw in your patchset.  Pretty fundamental changes that are similar,
> but different, to the dm-5.19 changes I've staged.
> 
> > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> >
> > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > fast path of dm_submit_bio_remap().
> 
> For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> split and can be accounted.
> 
> For 5.19 I also plan on using late io->orig_bio assignment as an
> alternative to the full-blown refcounting currently done with
> io->io_count.  I've yet to quantify the gains with focused testing but
> in theory this approach should scale better on large systems with many
> concurrent IO threads to the same device (RCU is primary constraint
> now).
> 
> I'll try to write a bpfrace script to measure how frequently "waiting on
> io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> dm-crypt). But I think we'll find it is very rarely, if ever, waited
> on in the fast path.

The waiting depends on CPU and device speed: if the device is quicker
than the CPU, the wait should be longer. Testing in one environment is
usually not enough.

> 
> > 3) no split for io accounting
> 
> DM's more recent approach to splitting has never been done for benefit
> or use of IO accounting, see this commit for its origin:
> 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> 
> Not sure why you keep poking fun at DM only doing a single split when:
> that is the actual design.  DM splits off orig_bio then recurses to
> handle the remainder of the bio that wasn't issued.  Storing it in
> io->orig_bio (previously io->bio) was always a means of reflecting
> things properly. And yes IO accounting is one use, the other is IO
> completion. But unfortunately DM's IO accounting has always been a
> mess ever since the above commit. Changes in 5.18 fixed that.
> 
> But again, DM's splitting has _nothing_ to do with IO accounting.
> Splitting only happens when needed for IO submission given constraints
> of DM target(s) or underlying layers.

What I meant is that the bio returned from bio_split() is only for
io accounting. Yeah, the comment said it can be used for io completion
too, but that is easily done without the split bio.

> 
> All said, I will look closer at your entire set and see if it better
> to go with your approach.  This patch in particular is interesting
> (avoids cloning and other complexity of bio_split + bio_chain):
> https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/

That patch shows we can avoid the extra split, and it also shows that
the split bio returned from bio_split() is for io accounting only.


thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-13 12:26           ` [dm-devel] " Ming Lei
@ 2022-04-13 17:58             ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-13 17:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13 2022 at  8:26P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  9:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > 
> > > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > > dm_submit_bio_remap().
> > > > > 
> > > > > The reason is that one new bio is cloned from original FS bio to
> > > > > represent the mapped part, which just serves io accounting.
> > > > > 
> > > > > Now we have switched to bdev based io accounting interface, and we
> > > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > > bio isn't necessary any more.
> > > > > 
> > > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > > accounting & split a bit.
> > > > 
> > > > You're conflating quite a few things here.  DM zone really has no
> > > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > > ASAP.
> > > 
> > > You can just consider the cleanup part of this patches, :-)
> > 
> > I will.  But your following list doesn't reflect any "cleanup" that I
> > saw in your patchset.  Pretty fundamental changes that are similar,
> > but different, to the dm-5.19 changes I've staged.
> > 
> > > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> > >
> > > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > > fast path of dm_submit_bio_remap().
> > 
> > For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> > split and can be accounted.
> > 
> > For 5.19 I also plan on using late io->orig_bio assignment as an
> > alternative to the full-blown refcounting currently done with
> > io->io_count.  I've yet to quantify the gains with focused testing but
> > in theory this approach should scale better on large systems with many
> > concurrent IO threads to the same device (RCU is primary constraint
> > now).
> > 
> > I'll try to write a bpfrace script to measure how frequently "waiting on
> > io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> > dm-crypt). But I think we'll find it is very rarely, if ever, waited
> > on in the fast path.
> 
> The waiting depends on CPU and device's speed, if device is quicker than
> CPU, the wait should be longer. Testing in one environment is usually
> not enough.
> 
> > 
> > > 3) no split for io accounting
> > 
> > DM's more recent approach to splitting has never been done for benefit
> > or use of IO accounting, see this commit for its origin:
> > 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> > 
> > Not sure why you keep poking fun at DM only doing a single split when:
> > that is the actual design.  DM splits off orig_bio then recurses to
> > handle the remainder of the bio that wasn't issued.  Storing it in
> > io->orig_bio (previously io->bio) was always a means of reflecting
> > things properly. And yes IO accounting is one use, the other is IO
> > completion. But unfortunately DM's IO accounting has always been a
> > mess ever since the above commit. Changes in 5.18 fixed that.
> > 
> > But again, DM's splitting has _nothing_ to do with IO accounting.
> > Splitting only happens when needed for IO submission given constraints
> > of DM target(s) or underlying layers.
> 
> What I meant is that the bio returned from bio_split() is only for
> io accounting. Yeah, the comment said it can be for io completion too,
> but that is easily done without the splitted bio.
>
> > All said, I will look closer at your entire set and see if it better
> > to go with your approach.  This patch in particular is interesting
> > (avoids cloning and other complexity of bio_split + bio_chain):
> > https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/
> 
> That patch shows we can avoid the extra split, also shows that the
> splitted bio from bio_split() is for io accounting only.

Yes I see that now.  But it also served to preserve the original bio
for use in completion.  Not a big deal, but it did track the head of
the bio_chain.

The bigger issue with this patch is that you've caused
dm_submit_bio_remap() to go back to accounting the entire original bio
before any split occurs.  That is a problem because you'll end up
accounting that bio for every split, so in split-heavy workloads the
IO accounting won't reflect when the IO is actually issued and we'll
regress back to having very inaccurate IO accounting for
dm_submit_bio_remap()-heavy targets (e.g. dm-crypt).

Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-13 17:58             ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-13 17:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13 2022 at  8:26P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> > On Tue, Apr 12 2022 at  9:56P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > 
> > > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > > dm_submit_bio_remap().
> > > > > 
> > > > > The reason is that one new bio is cloned from original FS bio to
> > > > > represent the mapped part, which just serves io accounting.
> > > > > 
> > > > > Now we have switched to bdev based io accounting interface, and we
> > > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > > bio isn't necessary any more.
> > > > > 
> > > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > > accounting & split a bit.
> > > > 
> > > > You're conflating quite a few things here.  DM zone really has no
> > > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > > ASAP.
> > > 
> > > You can just consider the cleanup part of this patches, :-)
> > 
> > I will.  But your following list doesn't reflect any "cleanup" that I
> > saw in your patchset.  Pretty fundamental changes that are similar,
> > but different, to the dm-5.19 changes I've staged.
> > 
> > > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> > >
> > > 2) no waiting on on ->origi_bio, especially the waiting is done in
> > > fast path of dm_submit_bio_remap().
> > 
> > For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> > split and can be accounted.
> > 
> > For 5.19 I also plan on using late io->orig_bio assignment as an
> > alternative to the full-blown refcounting currently done with
> > io->io_count.  I've yet to quantify the gains with focused testing but
> > in theory this approach should scale better on large systems with many
> > concurrent IO threads to the same device (RCU is primary constraint
> > now).
> > 
> > I'll try to write a bpfrace script to measure how frequently "waiting on
> > io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> > dm-crypt). But I think we'll find it is very rarely, if ever, waited
> > on in the fast path.
> 
> The waiting depends on CPU and device's speed, if device is quicker than
> CPU, the wait should be longer. Testing in one environment is usually
> not enough.
> 
> > 
> > > 3) no split for io accounting
> > 
> > DM's more recent approach to splitting has never been done for benefit
> > or use of IO accounting, see this commit for its origin:
> > 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> > 
> > Not sure why you keep poking fun at DM only doing a single split when:
> > that is the actual design.  DM splits off orig_bio then recurses to
> > handle the remainder of the bio that wasn't issued.  Storing it in
> > io->orig_bio (previously io->bio) was always a means of reflecting
> > things properly. And yes IO accounting is one use, the other is IO
> > completion. But unfortunately DM's IO accounting has always been a
> > mess ever since the above commit. Changes in 5.18 fixed that.
> > 
> > But again, DM's splitting has _nothing_ to do with IO accounting.
> > Splitting only happens when needed for IO submission given constraints
> > of DM target(s) or underlying layers.
> 
> What I meant is that the bio returned from bio_split() is only for
> io accounting. Yeah, the comment said it can be for io completion too,
> but that is easily done without the splitted bio.
>
> > All said, I will look closer at your entire set and see if it better
> > to go with your approach.  This patch in particular is interesting
> > (avoids cloning and other complexity of bio_split + bio_chain):
> > https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/
> 
> That patch shows we can avoid the extra split, also shows that the
> splitted bio from bio_split() is for io accounting only.

Yes I see that now.  But it also served to preserve the original bio
for use in completion.  Not a big deal, but it did track the head of
the bio_chain.

The bigger issue with this patch is that you've caused
dm_submit_bio_remap() to go back to accounting the entire original bio
before any split occurs.  That is a problem because you'll end up
accounting that bio for every split, so in split-heavy workloads the
IO accounting won't reflect when the IO is actually issued and we'll
regress back to having very inaccurate IO accounting for
dm_submit_bio_remap()-heavy targets (e.g. dm-crypt).

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-13 17:58             ` [dm-devel] " Mike Snitzer
@ 2022-04-14  0:36               ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-14  0:36 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:26P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2022 at 02:12:47AM -0400, Mike Snitzer wrote:
> > > On Tue, Apr 12 2022 at  9:56P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Tue, Apr 12, 2022 at 04:52:40PM -0400, Mike Snitzer wrote:
> > > > > On Tue, Apr 12 2022 at  4:56P -0400,
> > > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > > 
> > > > > > The current DM codes setup ->orig_bio after __map_bio() returns,
> > > > > > and not only cause kernel panic for dm zone, but also a bit ugly
> > > > > > and tricky, especially the waiting until ->orig_bio is set in
> > > > > > dm_submit_bio_remap().
> > > > > > 
> > > > > > The reason is that one new bio is cloned from original FS bio to
> > > > > > represent the mapped part, which just serves io accounting.
> > > > > > 
> > > > > > Now we have switched to bdev based io accounting interface, and we
> > > > > > can retrieve sectors/bio_op from both the real original bio and the
> > > > > > added fields of .sector_offset & .sectors easily, so the new cloned
> > > > > > bio isn't necessary any more.
> > > > > > 
> > > > > > Not only fixes dm-zone's kernel panic, but also cleans up dm io
> > > > > > accounting & split a bit.
> > > > > 
> > > > > You're conflating quite a few things here.  DM zone really has no
> > > > > business accessing io->orig_bio (dm-zone.c can just as easily inspect
> > > > > the tio->clone, because it hasn't been remapped yet it reflects the
> > > > > io->origin_bio, so there is no need to look at io->orig_bio) -- but
> > > > > yes I clearly broke things during the 5.18 merge and it needs fixing
> > > > > ASAP.
> > > > 
> > > > You can just consider the cleanup part of this patches, :-)
> > > 
> > > I will.  But your following list doesn't reflect any "cleanup" that I
> > > saw in your patchset.  Pretty fundamental changes that are similar,
> > > but different, to the dm-5.19 changes I've staged.
> > > 
> > > > 1) no late assignment of ->orig_bio, and always set it in alloc_io()
> > > >
> > > > 2) no waiting on ->orig_bio, especially the waiting is done in
> > > > fast path of dm_submit_bio_remap().
> > > 
> > > For 5.18 waiting on io->orig_bio just enables a signal that the IO was
> > > split and can be accounted.
> > > 
> > > For 5.19 I also plan on using late io->orig_bio assignment as an
> > > alternative to the full-blown refcounting currently done with
> > > io->io_count.  I've yet to quantify the gains with focused testing but
> > > in theory this approach should scale better on large systems with many
> > > concurrent IO threads to the same device (RCU is primary constraint
> > > now).
> > > 
> > > I'll try to write a bpftrace script to measure how frequently "waiting on
> > > io->orig_bio" occurs for dm_submit_bio_remap() heavy usage (like
> > > dm-crypt). But I think we'll find it is very rarely, if ever, waited
> > > on in the fast path.
> > 
> > The waiting depends on CPU and device's speed, if device is quicker than
> > CPU, the wait should be longer. Testing in one environment is usually
> > not enough.
> > 
> > > 
> > > > 3) no split for io accounting
> > > 
> > > DM's more recent approach to splitting has never been done for benefit
> > > or use of IO accounting, see this commit for its origin:
> > > 18a25da84354c6b ("dm: ensure bio submission follows a depth-first tree walk")
> > > 
> > > Not sure why you keep poking fun at DM only doing a single split when:
> > > that is the actual design.  DM splits off orig_bio then recurses to
> > > handle the remainder of the bio that wasn't issued.  Storing it in
> > > io->orig_bio (previously io->bio) was always a means of reflecting
> > > things properly. And yes IO accounting is one use, the other is IO
> > > completion. But unfortunately DM's IO accounting has always been a
> > > mess ever since the above commit. Changes in 5.18 fixed that.
> > > 
> > > But again, DM's splitting has _nothing_ to do with IO accounting.
> > > Splitting only happens when needed for IO submission given constraints
> > > of DM target(s) or underlying layers.
> > 
> > What I meant is that the bio returned from bio_split() is only for
> > io accounting. Yeah, the comment said it can be for io completion too,
> > but that is easily done without the split bio.
> >
> > > All said, I will look closer at your entire set and see if it better
> > > to go with your approach.  This patch in particular is interesting
> > > (avoids cloning and other complexity of bio_split + bio_chain):
> > > https://patchwork.kernel.org/project/dm-devel/patch/20220412085616.1409626-6-ming.lei@redhat.com/
> > 
> > That patch shows we can avoid the extra split, also shows that the
> > split bio from bio_split() is for io accounting only.
> 
> Yes I see that now.  But it also served to preserve the original bio
> for use in completion.  Not a big deal, but it did track the head of
> the bio_chain.
> 
> The bigger issue with this patch is that you've caused
> dm_submit_bio_remap() to go back to accounting the entire original bio
> before any split occurs.  That is a problem because you'll end up
> accounting that bio for every split, so in split heavy workloads the
> IO accounting won't reflect when the IO is actually issued and we'll
> regress back to having very inaccurate and incorrect IO accounting for
> dm_submit_bio_remap() heavy targets (e.g. dm-crypt).

Good catch, but we know the length of the mapped part of the original bio
before calling __map_bio(), so io->sectors/io->sector_offset can be set up
at that point; something like the following delta change should address it:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index db23efd6bbf6..06b554f3104b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
 
 	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
 	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
+
+	if (ci->sector_count > len) {
+		/* setup the mapped part for accounting */
+		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
+		ci->io->sectors = len;
+		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
+	}
 	__map_bio(clone);
 
 	ci->sector += len;
@@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/* setup the mapped part for accounting */
-	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
-	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
-	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
-
 	bio_trim(bio, ci.io->sectors, ci.sector_count);
 	trace_block_split(bio, bio->bi_iter.bi_sector);
 	bio_inc_remaining(bio);
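
To make the delta concrete, here is a made-up example of what the two
fields end up holding for the first mapped part (numbers are hypothetical):

/*
 * Original FS bio: 2048 sectors starting at sector 10240, and the target
 * boundary (max_io_len()) allows only 512 sectors from ci->sector:
 *
 *	ci->sector       = 10240
 *	ci->sector_count = 2048
 *	len              = 512
 *
 * Since ci->sector_count > len, before __map_bio():
 *
 *	io->sectors       = len = 512
 *	io->sector_offset = bio_end_sector(ci->bio) - ci->sector
 *	                  = (10240 + 2048) - 10240 = 2048
 *
 * so accounting can be done against just the 512-sector mapped part
 * instead of the whole original bio.
 */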

-- 
Ming


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-14  0:36               ` [dm-devel] " Ming Lei
@ 2022-04-14  2:25                 ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-14  2:25 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13 2022 at  8:36P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > 
> > The bigger issue with this patch is that you've caused
> > dm_submit_bio_remap() to go back to accounting the entire original bio
> > before any split occurs.  That is a problem because you'll end up
> > accounting that bio for every split, so in split heavy workloads the
> > IO accounting won't reflect when the IO is actually issued and we'll
> > regress back to having very inaccurate and incorrect IO accounting for
> > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> 
> Good catch, but we know the length of mapped part in original bio before
> calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> something like the following delta change should address it:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index db23efd6bbf6..06b554f3104b 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
>  
>  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
>  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> +
> +	if (ci->sector_count > len) {
> +		/* setup the mapped part for accounting */
> +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> +		ci->io->sectors = len;
> +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> +	}
>  	__map_bio(clone);
>  
>  	ci->sector += len;
> @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  	if (error || !ci.sector_count)
>  		goto out;
>  
> -	/* setup the mapped part for accounting */
> -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> -
>  	bio_trim(bio, ci.io->sectors, ci.sector_count);
>  	trace_block_split(bio, bio->bi_iter.bi_sector);
>  	bio_inc_remaining(bio);
> 
> -- 
> Ming
> 

Unfortunately we do need splitting after __map_bio() because a dm
target's ->map can use dm_accept_partial_bio() to further reduce a
bio's mapped part.

But I think dm_accept_partial_bio() could be trained to update
tio->io->sectors?

dm_accept_partial_bio() has been around for a long time; it keeps
growing BUG_ONs that are actually helpful in narrowing its use to "normal
IO", so it should be OK.

Running 'make check' in a built cryptsetup source tree should be a
good test for DM target interface functionality.

But there aren't automated tests for IO accounting correctness yet.

Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-14  2:25                 ` [dm-devel] " Mike Snitzer
@ 2022-04-14  3:57                   ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-14  3:57 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at  8:36P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > 
> > > The bigger issue with this patch is that you've caused
> > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > before any split occurs.  That is a problem because you'll end up
> > > accounting that bio for every split, so in split heavy workloads the
> > > IO accounting won't reflect when the IO is actually issued and we'll
> > > regress back to having very inaccurate and incorrect IO accounting for
> > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > 
> > Good catch, but we know the length of mapped part in original bio before
> > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > something like the following delta change should address it:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index db23efd6bbf6..06b554f3104b 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> >  
> >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > +
> > +	if (ci->sector_count > len) {
> > +		/* setup the mapped part for accounting */
> > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > +		ci->io->sectors = len;
> > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > +	}
> >  	__map_bio(clone);
> >  
> >  	ci->sector += len;
> > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  	if (error || !ci.sector_count)
> >  		goto out;
> >  
> > -	/* setup the mapped part for accounting */
> > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > -
> >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> >  	bio_inc_remaining(bio);
> > 
> > -- 
> > Ming
> > 
> 
> Unfortunately we do need splitting after __map_bio() because a dm
> target's ->map can use dm_accept_partial_bio() to further reduce a
> bio's mapped part.
> 
> But I think dm_accept_partial_bio() could be trained to update
> tio->io->sectors?

->orig_bio is just for serving io accounting, but ->orig_bio isn't
passed to dm_accept_partial_bio(), and it doesn't get updated after
dm_accept_partial_bio() is called.
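
For reference, the interface only ever sees the clone; the prototype is
roughly:

	void dm_accept_partial_bio(struct bio *bio, unsigned int n_sectors);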

If that is an issue, it must be a pre-existing issue in dm io accounting,
since ->orig_bio isn't updated when dm_accept_partial_bio() is called.

So do we have to update it?

> 
> dm_accept_partial_bio() has been around for a long time, it keeps
> growing BUG_ONs that are actually helpful to narrow its use to "normal
> IO", so it should be OK.
> 
> Running 'make check' in a built cryptsetup source tree should be a
> good test for DM target interface functionality.

Care to share the test tree?

> 
> But there aren't automated tests for IO accounting correctness yet.

I did verify io accounting by running dm-thin with blk-throttle, and the
observed throughput matched the configured limit. I ran both small and
large block sizes, so both the non-split and split code paths are covered.

Maybe you can add this kind of test to the dm io accounting automated tests.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-14  3:57                   ` [dm-devel] " Ming Lei
@ 2022-04-14 17:45                     ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-14 17:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Wed, Apr 13 2022 at 11:57P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > On Wed, Apr 13 2022 at  8:36P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > 
> > > > The bigger issue with this patch is that you've caused
> > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > before any split occurs.  That is a problem because you'll end up
> > > > accounting that bio for every split, so in split heavy workloads the
> > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > 
> > > Good catch, but we know the length of mapped part in original bio before
> > > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > > something like the following delta change should address it:
> > > 
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index db23efd6bbf6..06b554f3104b 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > >  
> > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > +
> > > +	if (ci->sector_count > len) {
> > > +		/* setup the mapped part for accounting */
> > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > +		ci->io->sectors = len;
> > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > +	}
> > >  	__map_bio(clone);
> > >  
> > >  	ci->sector += len;
> > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > >  	if (error || !ci.sector_count)
> > >  		goto out;
> > >  
> > > -	/* setup the mapped part for accounting */
> > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > -
> > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > >  	bio_inc_remaining(bio);
> > > 
> > > -- 
> > > Ming
> > > 
> > 
> > Unfortunately we do need splitting after __map_bio() because a dm
> > target's ->map can use dm_accept_partial_bio() to further reduce a
> > bio's mapped part.
> > 
> > But I think dm_accept_partial_bio() could be trained to update
> > tio->io->sectors?
> 
> ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> passed to dm_accept_partial_bio(), and not gets updated after
> dm_accept_partial_bio() is called.
> 
> If that is one issue, it must be one existed issue in dm io accounting
> since ->orig_bio isn't updated when dm_accept_partial_bio() is called.

Recall that ->orig_bio is updated after the bio_split() at the bottom of
dm_split_and_process_bio().

That bio_split() is based on ci->sector_count, which is reduced as a
side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
pretty circuitous so I can absolutely understand why you didn't
immediately appreciate the interface.  The block comment above
dm_accept_partial_bio() does a pretty comprehensive job of explaining.

But basically dm_accept_partial_bio() gives DM targets a way to control
DM core's splitting if they find that they cannot accommodate the
entirety of the clone bio that is sent to their ->map.
dm_accept_partial_bio() may only ever be called from a target's ->map.
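
For illustration, a toy target ->map might use it like this (made-up
target and numbers, just to show the contract):

static int toy_map(struct dm_target *ti, struct bio *bio)
{
	struct dm_dev *dev = ti->private;	/* assume ->private holds a dm_dev */
	unsigned int max_sectors = 8;		/* pretend only 8 sectors fit now */

	if (bio_sectors(bio) > max_sectors)
		/* accept part of the clone; DM core resubmits the remainder */
		dm_accept_partial_bio(bio, max_sectors);

	bio_set_dev(bio, dev->bdev);
	return DM_MAPIO_REMAPPED;
}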

> So do we have to update it?
> 
> > 
> > dm_accept_partial_bio() has been around for a long time, it keeps
> > growing BUG_ONs that are actually helpful to narrow its use to "normal
> > IO", so it should be OK.
> > 
> > Running 'make check' in a built cryptsetup source tree should be a
> > good test for DM target interface functionality.
> 
> Care to share the test tree?

https://gitlab.com/cryptsetup/cryptsetup.git

> 
> > 
> > But there aren't automated tests for IO accounting correctness yet.
> 
> I did verify io accounting by running dm-thin with blk-throttle, and the
> observed throughput is same with expected setting. Running both small bs
> and large bs, so non-split and split code path are covered.
> 
> Maybe you can add this kind of test into dm io accounting automated test.

Yeah, something like that would be good.

Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-14 17:45                     ` [dm-devel] " Mike Snitzer
@ 2022-04-15  0:14                       ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-15  0:14 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> On Wed, Apr 13 2022 at 11:57P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > 
> > > > > The bigger issue with this patch is that you've caused
> > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > before any split occurs.  That is a problem because you'll end up
> > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > 
> > > > Good catch, but we know the length of mapped part in original bio before
> > > > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > > > something like the following delta change should address it:
> > > > 
> > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > index db23efd6bbf6..06b554f3104b 100644
> > > > --- a/drivers/md/dm.c
> > > > +++ b/drivers/md/dm.c
> > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > >  
> > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > +
> > > > +	if (ci->sector_count > len) {
> > > > +		/* setup the mapped part for accounting */
> > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > +		ci->io->sectors = len;
> > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > +	}
> > > >  	__map_bio(clone);
> > > >  
> > > >  	ci->sector += len;
> > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > >  	if (error || !ci.sector_count)
> > > >  		goto out;
> > > >  
> > > > -	/* setup the mapped part for accounting */
> > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > -
> > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > >  	bio_inc_remaining(bio);
> > > > 
> > > > -- 
> > > > Ming
> > > > 
> > > 
> > > Unfortunately we do need splitting after __map_bio() because a dm
> > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > bio's mapped part.
> > > 
> > > But I think dm_accept_partial_bio() could be trained to update
> > > tio->io->sectors?
> > 
> > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > passed to dm_accept_partial_bio(), and not gets updated after
> > dm_accept_partial_bio() is called.
> > 
> > If that is one issue, it must be one existed issue in dm io accounting
> > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> 
> Recall that ->orig_bio is updated after the bio_split() at the bottom of
> dm_split_and_process_bio().
> 
> That bio_split() is based on ci->sector_count, which is reduced as a
> side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> pretty circuitous so I can absolutely understand why you didn't
> immediately appreciate the interface.  The block comment above
> dm_accept_partial_bio() does a pretty comprehensive job of explaining.

Got it now, thanks for the explanation.

As you mentioned, it can be addressed in dm_accept_partial_bio()
by updating tio->io->sectors.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-15  0:14                       ` [dm-devel] " Ming Lei
@ 2022-04-15 21:06                         ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-15 21:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Thu, Apr 14 2022 at  8:14P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > On Wed, Apr 13 2022 at 11:57P -0400,
> > Ming Lei <ming.lei@redhat.com> wrote:
> > 
> > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > 
> > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > 
> > > > > > The bigger issue with this patch is that you've caused
> > > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > 
> > > > > Good catch, but we know the length of mapped part in original bio before
> > > > > calling __map_bio(), so io->sectors/io->offset_sector can be setup here,
> > > > > something like the following delta change should address it:
> > > > > 
> > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > --- a/drivers/md/dm.c
> > > > > +++ b/drivers/md/dm.c
> > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > > >  
> > > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > +
> > > > > +	if (ci->sector_count > len) {
> > > > > +		/* setup the mapped part for accounting */
> > > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > +		ci->io->sectors = len;
> > > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > > +	}
> > > > >  	__map_bio(clone);
> > > > >  
> > > > >  	ci->sector += len;
> > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > > >  	if (error || !ci.sector_count)
> > > > >  		goto out;
> > > > >  
> > > > > -	/* setup the mapped part for accounting */
> > > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > > -
> > > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > >  	bio_inc_remaining(bio);
> > > > > 
> > > > > -- 
> > > > > Ming
> > > > > 
> > > > 
> > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > bio's mapped part.
> > > > 
> > > > But I think dm_accept_partial_bio() could be trained to update
> > > > tio->io->sectors?
> > > 
> > > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > > passed to dm_accept_partial_bio(), and not gets updated after
> > > dm_accept_partial_bio() is called.
> > > 
> > > If that is one issue, it must be one existed issue in dm io accounting
> > > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> > 
> > Recall that ->orig_bio is updated after the bio_split() at the bottom of
> > dm_split_and_process_bio().
> > 
> > That bio_split() is based on ci->sector_count, which is reduced as a
> > side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> > pretty circuitous so I can absolutely understand why you didn't
> > immediately appreciate the interface.  The block comment above
> > dm_accept_partial_bio() does a pretty comprehensive job of explaining.
> 
> Got it now, thanks for the explanation.
> 
> As you mentioned, it can be addressed in dm_accept_partial_bio()
> by updating tio->io->sectors.

Yes, I rebased your patchset on top of dm-5.19 and fixed up your
splitting like we discussed.  I'll be rebasing on top of v5.18-rc3 once
it is released, but please have a look at this 'dm-5.19-v2' branch:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2

And this commit in particular:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19-v2&id=fe5a99da8b0d0518342f5cdb522a06b0f123ca09

Once I've verified with you that it looks OK I'll fold it into your
commit (at the same time I rebase on v5.18-rc3 early next week).

In general this rebase sucked... but I wanted to layer your changes
on top of earlier changes I had already staged for 5.19. I've at least
verified that DM's bio polling seems to work as usual and also
verified that cryptsetup's 'make check' passes.

No urgency on you reviewing before early next week.  Have a great
weekend.

Mike


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 1/8] block: replace disk based account with bdev's
  2022-04-12  8:56   ` Ming Lei
@ 2022-04-16  5:57     ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-04-16  5:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, Mike Snitzer, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 04:56:09PM +0800, Ming Lei wrote:
> 'block device' is generic type for interface, and gendisk becomes more
> one block layer internal type, so replace disk based account interface
> with bdec's.

I can't parse this sentence.

> +unsigned long bdev_start_io_acct(struct block_device *bdev,
> +				 unsigned int sectors, unsigned int op,
> +				 unsigned long start_time)
>  {
> -	return __part_start_io_acct(disk->part0, sectors, op, jiffies);
> +	return __part_start_io_acct(bdev, sectors, op, start_time);
>  }
> -EXPORT_SYMBOL(disk_start_io_acct);
> +EXPORT_SYMBOL(bdev_start_io_acct);

Just rename __part_start_io_acct to bdev_start_io_acct instead.
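
I.e. something like this (sketch only, assuming the body stays what
__part_start_io_acct does today):

unsigned long bdev_start_io_acct(struct block_device *bdev,
				 unsigned int sectors, unsigned int op,
				 unsigned long start_time)
{
	const int sgrp = op_stat_group(op);

	part_stat_lock();
	update_io_ticks(bdev, start_time, false);
	part_stat_inc(bdev, ios[sgrp]);
	part_stat_add(bdev, sectors[sgrp], sectors);
	part_stat_local_inc(bdev, in_flight[op_is_write(op)]);
	part_stat_unlock();

	return start_time;
}
EXPORT_SYMBOL(bdev_start_io_acct);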

> -void disk_end_io_acct(struct gendisk *disk, unsigned int op,
> +void bdev_end_io_acct(struct block_device *bdev, unsigned int op,
>  		      unsigned long start_time)
>  {
> -	__part_end_io_acct(disk->part0, op, start_time);
> +	__part_end_io_acct(bdev, op, start_time);
>  }
> -EXPORT_SYMBOL(disk_end_io_acct);
> +EXPORT_SYMBOL(bdev_end_io_acct);

Same here.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/8] dm: io accounting & polling improvement
  2022-04-12 17:15   ` [dm-devel] " Mike Snitzer
@ 2022-04-16  5:58     ` Christoph Hellwig
  -1 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-04-16  5:58 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Tue, Apr 12, 2022 at 01:15:14PM -0400, Mike Snitzer wrote:
> I'll review this closely but, a couple weeks ago, I queued up quite a
> lot of conflicting changes for 5.19 here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19

Can you please send them out to the device mapper list for review?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 0/8] dm: io accounting & polling improvement
@ 2022-04-16  5:58     ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2022-04-16  5:58 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal, Ming Lei

On Tue, Apr 12, 2022 at 01:15:14PM -0400, Mike Snitzer wrote:
> I'll review this closely but, a couple weeks ago, I queued up quite a
> lot of conflicting changes for 5.19 here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19

Can you please send them out to the device mapper list for review?

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 0/8] dm: io accounting & polling improvement
  2022-04-16  5:58     ` [dm-devel] " Christoph Hellwig
@ 2022-04-17  1:23       ` Mike Snitzer
  -1 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-17  1:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Sat, Apr 16 2022 at  1:58P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Apr 12, 2022 at 01:15:14PM -0400, Mike Snitzer wrote:
> > I'll review this closely but, a couple weeks ago, I queued up quite a
> > lot of conflicting changes for 5.19 here:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
> 
> Can you please send them out to the device mapper list for review?

Yes, no problem. I brought my various earlier changes to the front,
dropped my changes for removing dm_io refcounting for normal r/w IO
(still visible in 'dm-5.19-v1'), and rebased Ming's series (now
visible in 'dm-5.19-v2').

Common: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
Old: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v1
New: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2

I'll clean up the 'dm-5.19-v2' series further and likely post to the
list on Tuesday.

I discussed with Jens and we'll both be rebasing our 5.19 trees to
v5.18-rc3. dm will be rebasing on block (I can fix up Ming's patch for
bdev_{start,end}_io_acct based on your review and Jens can then pick
it up from the first patch in the series I post).

Mike
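
As a side note for readers, the shape of the bdev-based accounting
interface quoted earlier is that the caller records a start time once
and hands it to both helpers, so accounting can be started when the IO
is actually issued while latency is still charged from the original
submission. A purely hypothetical caller (not DM's actual code, and the
struct/function names below are invented for illustration) might look
like:

struct example_io {
	struct bio	*orig_bio;
	unsigned long	start_time;	/* jiffies at original submission */
	unsigned int	sectors;	/* portion actually being issued */
};

static void example_io_start_acct(struct example_io *io)
{
	/* account only the issued portion, but from the saved time */
	bdev_start_io_acct(io->orig_bio->bi_bdev, io->sectors,
			   bio_op(io->orig_bio), io->start_time);
}

static void example_io_end_acct(struct example_io *io)
{
	bdev_end_io_acct(io->orig_bio->bi_bdev, bio_op(io->orig_bio),
			 io->start_time);
}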


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 0/8] dm: io accounting & polling improvement
@ 2022-04-17  1:23       ` Mike Snitzer
  0 siblings, 0 replies; 62+ messages in thread
From: Mike Snitzer @ 2022-04-17  1:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal, Ming Lei

On Sat, Apr 16 2022 at  1:58P -0400,
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Apr 12, 2022 at 01:15:14PM -0400, Mike Snitzer wrote:
> > I'll review this closely but, a couple weeks ago, I queued up quite a
> > lot of conflicting changes for 5.19 here:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
> 
> Can you please send them out to the device mapper list for review?

Yes, no problem. I brought my various earlier changes to the front,
dropped my changes for removing dm_io refcounting for normal r/w IO
(still visible in 'dm-5.19-v1'), and rebased Ming's series (now
visible in 'dm-5.19-v2').

Common: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19
Old: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v1
New: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2

I'll clean up the 'dm-5.19-v2' series further and likely post to the
list on Tuesday.

I discussed with Jens and we'll both be rebasing our 5.19 trees to
v5.18-rc3. dm will be rebasing on block (I can fix up Ming's patch for
bdev_{start,end}_io_acct based on your review and Jens can then pick
it up from the first patch in the series I post).

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
  2022-04-15 21:06                         ` [dm-devel] " Mike Snitzer
@ 2022-04-17  2:22                           ` Ming Lei
  -1 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-17  2:22 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Fri, Apr 15, 2022 at 05:06:55PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2022 at  8:14P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at 11:57P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > > 
> > > > > > > The bigger issue with this patch is that you've caused
> > > > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > > 
> > > > > > Good catch, but we know the length of the mapped part in the original bio
> > > > > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > > > > here; something like the following delta change should address it:
> > > > > > 
> > > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > > --- a/drivers/md/dm.c
> > > > > > +++ b/drivers/md/dm.c
> > > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > > > >  
> > > > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > > +
> > > > > > +	if (ci->sector_count > len) {
> > > > > > +		/* setup the mapped part for accounting */
> > > > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > > +		ci->io->sectors = len;
> > > > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > > > +	}
> > > > > >  	__map_bio(clone);
> > > > > >  
> > > > > >  	ci->sector += len;
> > > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > > > >  	if (error || !ci.sector_count)
> > > > > >  		goto out;
> > > > > >  
> > > > > > -	/* setup the mapped part for accounting */
> > > > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > > > -
> > > > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > > >  	bio_inc_remaining(bio);
> > > > > > 
> > > > > > -- 
> > > > > > Ming
> > > > > > 
> > > > > 
> > > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > > bio's mapped part.
> > > > > 
> > > > > But I think dm_accept_partial_bio() could be trained to update
> > > > > tio->io->sectors?
> > > > 
> > > > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > > > passed to dm_accept_partial_bio(), and does not get updated after
> > > > dm_accept_partial_bio() is called.
> > > > 
> > > > If that is an issue, it must be an existing issue in dm io accounting
> > > > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> > > 
> > > Recall that ->orig_bio is updated after the bio_split() at the bottom of
> > > dm_split_and_process_bio().
> > > 
> > > That bio_split() is based on ci->sector_count, which is reduced as a
> > > side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> > > pretty circuitous so I can absolutely understand why you didn't
> > > immediately appreciate the interface.  The block comment above
> > > dm_accept_partial_bio() does a pretty comprehensive job of explaining.
> > 
> > Got it now, thanks for the explanation.
> > 
> > As you mentioned, it can be addressed in dm_accept_partial_bio()
> > by updating tio->io->sectors.
> 
> Yes, I rebased your patchset on top of dm-5.19 and fixed up your
> splitting like we discussed.  I'll be rebasing on top of v5.18-rc3 once
> it is released, but please have a look at this 'dm-5.19-v2' branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2
> 
> And this commit in particular:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19-v2&id=fe5a99da8b0d0518342f5cdb522a06b0f123ca09
> 
> Once I've verified with you that it looks OK I'll fold it into your
> commit (at the same time I rebase on v5.18-rc3 early next week).

Hi Mike,

Your delta change looks good, thanks for fixing it!

Thanks,
Ming
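
For completeness, a hypothetical sketch of the kind of
dm_accept_partial_bio() update discussed in this sub-thread (the real
fix is the dm-5.19-v2 commit linked above; the clone_to_tio() helper
and the exact BUG_ON checks are assumptions based on dm.c of this era,
while DM_IO_SPLITTED and io->sectors come from the delta change quoted
earlier):

void dm_accept_partial_bio(struct bio *bio, unsigned int n_sectors)
{
	struct dm_target_io *tio = clone_to_tio(bio);
	unsigned int bi_sectors = bio_sectors(bio);

	BUG_ON(bio->bi_opf & REQ_PREFLUSH);
	BUG_ON(op_is_zone_mgmt(bio_op(bio)));
	BUG_ON(bio_op(bio) == REQ_OP_ZONE_APPEND);
	BUG_ON(bi_sectors > *tio->len_ptr);
	BUG_ON(n_sectors > bi_sectors);

	/* existing behaviour: shrink the clone and the target's len_ptr */
	*tio->len_ptr -= bi_sectors - n_sectors;
	bio->bi_iter.bi_size = n_sectors << SECTOR_SHIFT;

	/*
	 * Sketched addition: __split_and_process_bio() may already have
	 * recorded the mapped length for accounting, and it just shrank,
	 * so keep io->sectors in sync and mark the io as split.
	 */
	dm_io_set_flag(tio->io, DM_IO_SPLITTED);
	tio->io->sectors = n_sectors;
}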


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io
@ 2022-04-17  2:22                           ` Ming Lei
  0 siblings, 0 replies; 62+ messages in thread
From: Ming Lei @ 2022-04-17  2:22 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: Jens Axboe, linux-block, dm-devel, Damien Le Moal

On Fri, Apr 15, 2022 at 05:06:55PM -0400, Mike Snitzer wrote:
> On Thu, Apr 14 2022 at  8:14P -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > On Thu, Apr 14, 2022 at 01:45:33PM -0400, Mike Snitzer wrote:
> > > On Wed, Apr 13 2022 at 11:57P -0400,
> > > Ming Lei <ming.lei@redhat.com> wrote:
> > > 
> > > > On Wed, Apr 13, 2022 at 10:25:45PM -0400, Mike Snitzer wrote:
> > > > > On Wed, Apr 13 2022 at  8:36P -0400,
> > > > > Ming Lei <ming.lei@redhat.com> wrote:
> > > > > 
> > > > > > On Wed, Apr 13, 2022 at 01:58:54PM -0400, Mike Snitzer wrote:
> > > > > > > 
> > > > > > > The bigger issue with this patch is that you've caused
> > > > > > > dm_submit_bio_remap() to go back to accounting the entire original bio
> > > > > > > before any split occurs.  That is a problem because you'll end up
> > > > > > > accounting that bio for every split, so in split heavy workloads the
> > > > > > > IO accounting won't reflect when the IO is actually issued and we'll
> > > > > > > regress back to having very inaccurate and incorrect IO accounting for
> > > > > > > dm_submit_bio_remap() heavy targets (e.g. dm-crypt).
> > > > > > 
> > > > > > Good catch, but we know the length of the mapped part in the original bio
> > > > > > before calling __map_bio(), so io->sectors/io->sector_offset can be set up
> > > > > > here; something like the following delta change should address it:
> > > > > > 
> > > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > > index db23efd6bbf6..06b554f3104b 100644
> > > > > > --- a/drivers/md/dm.c
> > > > > > +++ b/drivers/md/dm.c
> > > > > > @@ -1558,6 +1558,13 @@ static int __split_and_process_bio(struct clone_info *ci)
> > > > > >  
> > > > > >  	len = min_t(sector_t, max_io_len(ti, ci->sector), ci->sector_count);
> > > > > >  	clone = alloc_tio(ci, ti, 0, &len, GFP_NOIO);
> > > > > > +
> > > > > > +	if (ci->sector_count > len) {
> > > > > > +		/* setup the mapped part for accounting */
> > > > > > +		dm_io_set_flag(ci->io, DM_IO_SPLITTED);
> > > > > > +		ci->io->sectors = len;
> > > > > > +		ci->io->sector_offset = bio_end_sector(ci->bio) - ci->sector;
> > > > > > +	}
> > > > > >  	__map_bio(clone);
> > > > > >  
> > > > > >  	ci->sector += len;
> > > > > > @@ -1603,11 +1610,6 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> > > > > >  	if (error || !ci.sector_count)
> > > > > >  		goto out;
> > > > > >  
> > > > > > -	/* setup the mapped part for accounting */
> > > > > > -	dm_io_set_flag(ci.io, DM_IO_SPLITTED);
> > > > > > -	ci.io->sectors = bio_sectors(bio) - ci.sector_count;
> > > > > > -	ci.io->sector_offset = bio_end_sector(bio) - bio->bi_iter.bi_sector;
> > > > > > -
> > > > > >  	bio_trim(bio, ci.io->sectors, ci.sector_count);
> > > > > >  	trace_block_split(bio, bio->bi_iter.bi_sector);
> > > > > >  	bio_inc_remaining(bio);
> > > > > > 
> > > > > > -- 
> > > > > > Ming
> > > > > > 
> > > > > 
> > > > > Unfortunately we do need splitting after __map_bio() because a dm
> > > > > target's ->map can use dm_accept_partial_bio() to further reduce a
> > > > > bio's mapped part.
> > > > > 
> > > > > But I think dm_accept_partial_bio() could be trained to update
> > > > > tio->io->sectors?
> > > > 
> > > > ->orig_bio is just for serving io accounting, but ->orig_bio isn't
> > > > passed to dm_accept_partial_bio(), and does not get updated after
> > > > dm_accept_partial_bio() is called.
> > > > 
> > > > If that is an issue, it must be an existing issue in dm io accounting
> > > > since ->orig_bio isn't updated when dm_accept_partial_bio() is called.
> > > 
> > > Recall that ->orig_bio is updated after the bio_split() at the bottom of
> > > dm_split_and_process_bio().
> > > 
> > > That bio_split() is based on ci->sector_count, which is reduced as a
> > > side-effect of dm_accept_partial_bio() reducing tio->len_ptr.  It is
> > > pretty circuitous so I can absolutely understand why you didn't
> > > immediately appreciate the interface.  The block comment above
> > > dm_accept_partial_bio() does a pretty comprehensive job of explaining.
> > 
> > Got it now, thanks for the explanation.
> > 
> > As you mentioned, it can be addressed in dm_accept_partial_bio()
> > by updating tio->io->sectors.
> 
> Yes, I rebased your patchset on top of dm-5.19 and fixed up your
> splitting like we discussed.  I'll be rebasing on top of v5.18-rc3 once
> it is released, but please have a look at this 'dm-5.19-v2' branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-5.19-v2
> 
> And this commit in particular:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.19-v2&id=fe5a99da8b0d0518342f5cdb522a06b0f123ca09
> 
> Once I've verified with you that it looks OK I'll fold it into your
> commit (at the same time I rebase on v5.18-rc3 early next week).

Hi Mike,

Your delta change looks good, thanks for fixing it!

Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2022-04-17  2:22 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-12  8:56 [dm-devel] [PATCH 0/8] dm: io accounting & polling improvement Ming Lei
2022-04-12  8:56 ` Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 1/8] block: replace disk based account with bdev's Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-16  5:57   ` Christoph Hellwig
2022-04-16  5:57     ` [dm-devel] " Christoph Hellwig
2022-04-12  8:56 ` [dm-devel] [PATCH 2/8] dm: don't pass bio to __dm_start_io_acct and dm_end_io_acct Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 3/8] dm: pass 'dm_io' instance to dm_io_acct directly Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12 20:28   ` [dm-devel] " Mike Snitzer
2022-04-12 20:28     ` Mike Snitzer
2022-04-13  1:43     ` Ming Lei
2022-04-13  1:43       ` [dm-devel] " Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 4/8] dm: switch to bdev based io accounting interface Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 5/8] dm: always setup ->orig_bio in alloc_io Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12 20:52   ` [dm-devel] " Mike Snitzer
2022-04-12 20:52     ` Mike Snitzer
2022-04-12 22:38     ` [dm-devel] " Damien Le Moal
2022-04-12 22:38       ` Damien Le Moal
2022-04-12 23:00       ` [dm-devel] " Mike Snitzer
2022-04-12 23:00         ` Mike Snitzer
2022-04-12 23:31         ` [dm-devel] " Damien Le Moal
2022-04-12 23:31           ` Damien Le Moal
2022-04-13  0:00         ` Damien Le Moal
2022-04-13  0:00           ` [dm-devel] " Damien Le Moal
2022-04-13  1:56     ` Ming Lei
2022-04-13  1:56       ` [dm-devel] " Ming Lei
2022-04-13  6:12       ` Mike Snitzer
2022-04-13  6:12         ` [dm-devel] " Mike Snitzer
2022-04-13 12:26         ` Ming Lei
2022-04-13 12:26           ` [dm-devel] " Ming Lei
2022-04-13 17:58           ` Mike Snitzer
2022-04-13 17:58             ` [dm-devel] " Mike Snitzer
2022-04-14  0:36             ` Ming Lei
2022-04-14  0:36               ` [dm-devel] " Ming Lei
2022-04-14  2:25               ` Mike Snitzer
2022-04-14  2:25                 ` [dm-devel] " Mike Snitzer
2022-04-14  3:57                 ` Ming Lei
2022-04-14  3:57                   ` [dm-devel] " Ming Lei
2022-04-14 17:45                   ` Mike Snitzer
2022-04-14 17:45                     ` [dm-devel] " Mike Snitzer
2022-04-15  0:14                     ` Ming Lei
2022-04-15  0:14                       ` [dm-devel] " Ming Lei
2022-04-15 21:06                       ` Mike Snitzer
2022-04-15 21:06                         ` [dm-devel] " Mike Snitzer
2022-04-17  2:22                         ` Ming Lei
2022-04-17  2:22                           ` [dm-devel] " Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 6/8] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 7/8] dm: improve target io referencing Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12  8:56 ` [dm-devel] [PATCH 8/8] dm: put all polled io into one single list Ming Lei
2022-04-12  8:56   ` Ming Lei
2022-04-12 17:15 ` [PATCH 0/8] dm: io accounting & polling improvement Mike Snitzer
2022-04-12 17:15   ` [dm-devel] " Mike Snitzer
2022-04-16  5:58   ` Christoph Hellwig
2022-04-16  5:58     ` [dm-devel] " Christoph Hellwig
2022-04-17  1:23     ` Mike Snitzer
2022-04-17  1:23       ` [dm-devel] " Mike Snitzer
