* [dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll
@ 2022-04-08 17:12 Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Ming Lei @ 2022-04-08 17:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Ming Lei

Hello,

The 1st patch removes the get/put of the io reference in dm_zone_map_bio().

The 2nd one improves io referencing.

The 3rd patch switches to holding the dm_io instances in a single list.


Ming Lei (3):
  dm: don't grab target io reference in dm_zone_map_bio
  dm: improve target io referencing
  dm: put all polled io into one single list

 drivers/md/dm-core.h |  9 +----
 drivers/md/dm-zone.c | 10 ------
 drivers/md/dm.c      | 79 ++++++++++++++++++++++++++++----------------
 3 files changed, 51 insertions(+), 47 deletions(-)

-- 
2.31.1


* [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-08 17:12 [dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll Ming Lei
@ 2022-04-08 17:12 ` Ming Lei
  2022-04-11  0:18   ` Damien Le Moal
  2022-04-11  9:40   ` Damien Le Moal
  2022-04-08 17:12 ` [dm-devel] [PATCH 2/3] dm: improve target io referencing Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 3/3] dm: put all polled io into one single list Ming Lei
  2 siblings, 2 replies; 21+ messages in thread
From: Ming Lei @ 2022-04-08 17:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Ming Lei

dm_zone_map_bio() is only called from __map_bio(), in which the io's
reference has already been grabbed, and that reference won't be released
until the bio is submitted, so it is no longer necessary to do it in
dm_zone_map_bio().
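
For reference, a simplified sketch of the submission path this relies on
(illustrative only, function bodies elided; not the verbatim dm.c code):

	dm_split_and_process_bio(md, map, bio)
		init_clone_info()/alloc_io()		/* io_count = 1 */
		__map_bio(clone)
			dm_io_inc_pending(io)		/* io_count = 2 */
			dm_zone_map_bio(tio)		/* ->map() may complete the clone,
							   dropping io_count back to 1 */
		dm_io_dec_pending(ci.io, errno)		/* submission reference dropped here */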

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  7 -------
 drivers/md/dm-zone.c | 10 ----------
 drivers/md/dm.c      |  7 ++++++-
 3 files changed, 6 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 4277853c7535..13136bd47cb3 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -277,13 +277,6 @@ static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
 	io->flags |= (1U << bit);
 }
 
-static inline void dm_io_inc_pending(struct dm_io *io)
-{
-	atomic_inc(&io->io_count);
-}
-
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
-
 static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
 {
 	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index c1ca9be4b79e..85d3c158719f 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		return DM_MAPIO_KILL;
 	}
 
-	/*
-	 * The target map function may issue and complete the IO quickly.
-	 * Take an extra reference on the IO to make sure it does disappear
-	 * until we run dm_zone_map_bio_end().
-	 */
-	dm_io_inc_pending(io);
-
 	/* Let the target do its work */
 	r = ti->type->map(ti, clone);
 	switch (r) {
@@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
 		break;
 	}
 
-	/* Drop the extra reference on the IO */
-	dm_io_dec_pending(io, sts);
-
 	if (sts != BLK_STS_OK)
 		return DM_MAPIO_KILL;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..b8424a4b4725 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -929,11 +929,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
 		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
+static void dm_io_inc_pending(struct dm_io *io)
+{
+	atomic_inc(&io->io_count);
+}
+
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
  */
-void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
+static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
 {
 	/* Push-back supersedes any I/O errors */
 	if (unlikely(error)) {
-- 
2.31.1


* [dm-devel] [PATCH 2/3] dm: improve target io referencing
  2022-04-08 17:12 [dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
@ 2022-04-08 17:12 ` Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 3/3] dm: put all polled io into one single list Ming Lei
  2 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2022-04-08 17:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Ming Lei

Currently the target io's reference counter is grabbed before calling
__map_bio(); this isn't efficient since the grabbing can be moved
into alloc_io().

Meantime this becomes the typical async io reference counter model: one
reference is for the submission side, the other is for the completion side,
and the io won't be completed until both sides are done.
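
A minimal sketch of the resulting model (illustrative only, not the
verbatim dm.c code):

	alloc_io():
		/* one reference for submission, one for completion */
		atomic_set(&io->io_count, 2);

	/* completion side: dropped when the cloned bio completes */
	clone_endio() -> dm_io_dec_pending(io, error);

	/* submission side: dropped once all clones have been issued,
	 * e.g. at the end of dm_split_and_process_bio()
	 */
	dm_io_dec_pending(io, 0);

	/* the original bio is completed only when io_count reaches 0 */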

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b8424a4b4725..528559ca2f91 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -579,7 +579,9 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io = container_of(tio, struct dm_io, tio);
 	io->magic = DM_IO_MAGIC;
 	io->status = 0;
-	atomic_set(&io->io_count, 1);
+
+	/* one is for submission, the other is for completion */
+	atomic_set(&io->io_count, 2);
 	this_cpu_inc(*md->pending_io);
 	io->orig_bio = NULL;
 	io->md = md;
@@ -929,11 +931,6 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
 		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
 }
 
-static void dm_io_inc_pending(struct dm_io *io)
-{
-	atomic_inc(&io->io_count);
-}
-
 /*
  * Decrements the number of outstanding ios that a bio has been
  * cloned into, completing the original io if necc.
@@ -1275,7 +1272,6 @@ static void __map_bio(struct bio *clone)
 	/*
 	 * Map the clone.
 	 */
-	dm_io_inc_pending(io);
 	tio->old_sector = clone->bi_iter.bi_sector;
 
 	if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1357,11 +1353,12 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
 	}
 }
 
-static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
+static int __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 				  unsigned num_bios, unsigned *len)
 {
 	struct bio_list blist = BIO_EMPTY_LIST;
 	struct bio *clone;
+	int ret = 0;
 
 	switch (num_bios) {
 	case 0:
@@ -1370,15 +1367,19 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 		clone = alloc_tio(ci, ti, 0, len, GFP_NOIO);
 		dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
 		__map_bio(clone);
+		ret = 1;
 		break;
 	default:
 		alloc_multiple_bios(&blist, ci, ti, num_bios, len);
 		while ((clone = bio_list_pop(&blist))) {
 			dm_tio_set_flag(clone_to_tio(clone), DM_TIO_IS_DUPLICATE_BIO);
 			__map_bio(clone);
+			ret += 1;
 		}
 		break;
 	}
+
+	return ret;
 }
 
 static void __send_empty_flush(struct clone_info *ci)
@@ -1398,8 +1399,16 @@ static void __send_empty_flush(struct clone_info *ci)
 	ci->bio = &flush_bio;
 	ci->sector_count = 0;
 
-	while ((ti = dm_table_get_target(ci->map, target_nr++)))
-		__send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+	while ((ti = dm_table_get_target(ci->map, target_nr++))) {
+		int bios;
+
+		atomic_add(ti->num_flush_bios, &ci->io->io_count);
+		bios = __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);
+		atomic_sub(ti->num_flush_bios - bios, &ci->io->io_count);
+	}
+
+	/* alloc_io() takes one extra reference for submission */
+	atomic_sub(1, &ci->io->io_count);
 
 	bio_uninit(ci->bio);
 }
@@ -1408,6 +1417,7 @@ static void __send_changing_extent_only(struct clone_info *ci, struct dm_target
 					unsigned num_bios)
 {
 	unsigned len;
+	int bios;
 
 	len = min_t(sector_t, ci->sector_count,
 		    max_io_len_target_boundary(ti, dm_target_offset(ti, ci->sector)));
@@ -1419,7 +1429,11 @@ static void __send_changing_extent_only(struct clone_info *ci, struct dm_target
 	ci->sector += len;
 	ci->sector_count -= len;
 
-	__send_duplicate_bios(ci, ti, num_bios, &len);
+
+	atomic_add(num_bios, &ci->io->io_count);
+	bios = __send_duplicate_bios(ci, ti, num_bios, &len);
+	/* alloc_io() takes one extra reference for submission */
+	atomic_sub(num_bios - bios + 1, &ci->io->io_count);
 }
 
 static bool is_abnormal_io(struct bio *bio)
-- 
2.31.1


* [dm-devel] [PATCH 3/3] dm: put all polled io into one single list
  2022-04-08 17:12 [dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
  2022-04-08 17:12 ` [dm-devel] [PATCH 2/3] dm: improve target io referencing Ming Lei
@ 2022-04-08 17:12 ` Ming Lei
  2 siblings, 0 replies; 21+ messages in thread
From: Ming Lei @ 2022-04-08 17:12 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: dm-devel, Ming Lei

If bio_split() isn't involved, it is a bit of overkill to link the dm_io into
an hlist, given there is only a single dm_io in the list, so convert to a
singly-linked list for holding all dm_io instances associated with this bio.
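
In essence dm_queue_poll_io() becomes a push onto a singly-linked list whose
head lives in bio->bi_private (condensed sketch of the diff below):

	struct dm_io **head = (struct dm_io **)&bio->bi_private;

	if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
		/* first dm_io: save the original ->bi_private, start the list */
		io->data = bio->bi_private;
		io->next = NULL;
	} else {
		/* bio recursed due to split: push in front of existing dm_io */
		io->data = (*head)->data;
		io->next = *head;
	}
	*head = io;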

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-core.h |  2 +-
 drivers/md/dm.c      | 46 +++++++++++++++++++++++---------------------
 2 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 13136bd47cb3..e28f7d54d2b7 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -252,7 +252,7 @@ struct dm_io {
 	spinlock_t lock;
 	unsigned long start_time;
 	void *data;
-	struct hlist_node node;
+	struct dm_io *next;
 	struct task_struct *map_task;
 	struct dm_stats_aux stats_aux;
 	/* last member of dm_target_io is 'struct bio' */
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 528559ca2f91..84cb73462fcd 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1486,7 +1486,7 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
 }
 
 /*
- * Reuse ->bi_private as hlist head for storing all dm_io instances
+ * Reuse ->bi_private as dm_io list head for storing all dm_io instances
  * associated with this bio, and this bio's bi_private needs to be
  * stored in dm_io->data before the reuse.
  *
@@ -1494,14 +1494,14 @@ static bool __process_abnormal_io(struct clone_info *ci, struct dm_target *ti,
  * touch it after splitting. Meantime it won't be changed by anyone after
  * bio is submitted. So this reuse is safe.
  */
-static inline struct hlist_head *dm_get_bio_hlist_head(struct bio *bio)
+static inline struct dm_io **dm_poll_list_head(struct bio *bio)
 {
-	return (struct hlist_head *)&bio->bi_private;
+	return (struct dm_io **)&bio->bi_private;
 }
 
 static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 {
-	struct hlist_head *head = dm_get_bio_hlist_head(bio);
+	struct dm_io **head = dm_poll_list_head(bio);
 
 	if (!(bio->bi_opf & REQ_DM_POLL_LIST)) {
 		bio->bi_opf |= REQ_DM_POLL_LIST;
@@ -1511,19 +1511,20 @@ static void dm_queue_poll_io(struct bio *bio, struct dm_io *io)
 		 */
 		io->data = bio->bi_private;
 
-		INIT_HLIST_HEAD(head);
-
 		/* tell block layer to poll for completion */
 		bio->bi_cookie = ~BLK_QC_T_NONE;
+
+		io->next = NULL;
 	} else {
 		/*
 		 * bio recursed due to split, reuse original poll list,
 		 * and save bio->bi_private too.
 		 */
-		io->data = hlist_entry(head->first, struct dm_io, node)->data;
+		io->data = (*head)->data;
+		io->next = *head;
 	}
 
-	hlist_add_head(&io->node, head);
+	*head = io;
 }
 
 /*
@@ -1677,18 +1678,16 @@ static bool dm_poll_dm_io(struct dm_io *io, struct io_comp_batch *iob,
 static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
 		       unsigned int flags)
 {
-	struct hlist_head *head = dm_get_bio_hlist_head(bio);
-	struct hlist_head tmp = HLIST_HEAD_INIT;
-	struct hlist_node *next;
-	struct dm_io *io;
+	struct dm_io **head = dm_poll_list_head(bio);
+	struct dm_io *list = *head;
+	struct dm_io *tmp = NULL;
+	struct dm_io *curr, *next;
 
 	/* Only poll normal bio which was marked as REQ_DM_POLL_LIST */
 	if (!(bio->bi_opf & REQ_DM_POLL_LIST))
 		return 0;
 
-	WARN_ON_ONCE(hlist_empty(head));
-
-	hlist_move_list(head, &tmp);
+	WARN_ON_ONCE(!list);
 
 	/*
 	 * Restore .bi_private before possibly completing dm_io.
@@ -1699,24 +1698,27 @@ static int dm_poll_bio(struct bio *bio, struct io_comp_batch *iob,
 	 * clearing REQ_DM_POLL_LIST here.
 	 */
 	bio->bi_opf &= ~REQ_DM_POLL_LIST;
-	bio->bi_private = hlist_entry(tmp.first, struct dm_io, node)->data;
+	bio->bi_private = list->data;
 
-	hlist_for_each_entry_safe(io, next, &tmp, node) {
-		if (dm_poll_dm_io(io, iob, flags)) {
-			hlist_del_init(&io->node);
+	for (curr = list, next = curr->next; curr; curr = next, next =
+			curr ? curr->next : NULL) {
+		if (dm_poll_dm_io(curr, iob, flags)) {
 			/*
 			 * clone_endio() has already occurred, so passing
 			 * error as 0 here doesn't override io->status
 			 */
-			dm_io_dec_pending(io, 0);
+			dm_io_dec_pending(curr, 0);
+		} else {
+			curr->next = tmp;
+			tmp = curr;
 		}
 	}
 
 	/* Not done? */
-	if (!hlist_empty(&tmp)) {
+	if (tmp) {
 		bio->bi_opf |= REQ_DM_POLL_LIST;
 		/* Reset bio->bi_private to dm_io list head */
-		hlist_move_list(&tmp, head);
+		*head = tmp;
 		return 0;
 	}
 	return 1;
-- 
2.31.1


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
@ 2022-04-11  0:18   ` Damien Le Moal
  2022-04-11  0:36     ` Ming Lei
  2022-04-11  9:40   ` Damien Le Moal
  1 sibling, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  0:18 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: dm-devel, Damien Le Moal

On 4/9/22 02:12, Ming Lei wrote:
> dm_zone_map_bio() is only called from __map_bio in which the io's
> reference is grabbed already, and the reference won't be released
> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> any more.

I do not think that this patch is correct. Removing the extra reference on
the io can lead to problems if the io is completed in the target
->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
to reference the original bio, so we must ensure that it is still around,
which means keeping an extra IO reference to avoid dm_io_dec_pending()
calling bio_endio() on the orig bio before dm_zone_map_bio_end() has a
chance to run.
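
To illustrate, the feared sequence is roughly (sketch only):

	dm_zone_map_bio(tio)
		ti->type->map(ti, clone)
			bio_endio(clone)	/* target completes the clone here */
				clone_endio()
					dm_io_dec_pending(io, ...)
		dm_zone_map_bio_end(...)	/* would then see a stale orig bio */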

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-core.h |  7 -------
>  drivers/md/dm-zone.c | 10 ----------
>  drivers/md/dm.c      |  7 ++++++-
>  3 files changed, 6 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
> index 4277853c7535..13136bd47cb3 100644
> --- a/drivers/md/dm-core.h
> +++ b/drivers/md/dm-core.h
> @@ -277,13 +277,6 @@ static inline void dm_io_set_flag(struct dm_io *io, unsigned int bit)
>  	io->flags |= (1U << bit);
>  }
>  
> -static inline void dm_io_inc_pending(struct dm_io *io)
> -{
> -	atomic_inc(&io->io_count);
> -}
> -
> -void dm_io_dec_pending(struct dm_io *io, blk_status_t error);
> -
>  static inline struct completion *dm_get_completion_from_kobject(struct kobject *kobj)
>  {
>  	return &container_of(kobj, struct dm_kobject_holder, kobj)->completion;
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index c1ca9be4b79e..85d3c158719f 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -545,13 +545,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		return DM_MAPIO_KILL;
>  	}
>  
> -	/*
> -	 * The target map function may issue and complete the IO quickly.
> -	 * Take an extra reference on the IO to make sure it does disappear
> -	 * until we run dm_zone_map_bio_end().
> -	 */
> -	dm_io_inc_pending(io);
> -
>  	/* Let the target do its work */
>  	r = ti->type->map(ti, clone);
>  	switch (r) {
> @@ -580,9 +573,6 @@ int dm_zone_map_bio(struct dm_target_io *tio)
>  		break;
>  	}
>  
> -	/* Drop the extra reference on the IO */
> -	dm_io_dec_pending(io, sts);
> -
>  	if (sts != BLK_STS_OK)
>  		return DM_MAPIO_KILL;
>  
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..b8424a4b4725 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -929,11 +929,16 @@ static inline bool dm_tio_is_normal(struct dm_target_io *tio)
>  		!dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
>  }
>  
> +static void dm_io_inc_pending(struct dm_io *io)
> +{
> +	atomic_inc(&io->io_count);
> +}
> +
>  /*
>   * Decrements the number of outstanding ios that a bio has been
>   * cloned into, completing the original io if necc.
>   */
> -void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
> +static void dm_io_dec_pending(struct dm_io *io, blk_status_t error)
>  {
>  	/* Push-back supersedes any I/O errors */
>  	if (unlikely(error)) {


-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  0:18   ` Damien Le Moal
@ 2022-04-11  0:36     ` Ming Lei
  2022-04-11  0:50       ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-04-11  0:36 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> On 4/9/22 02:12, Ming Lei wrote:
> > dm_zone_map_bio() is only called from __map_bio in which the io's
> > reference is grabbed already, and the reference won't be released
> > until the bio is submitted, so no necessary to do it dm_zone_map_bio
> > any more.
> 
> I do not think that this patch is correct. Removing the extra reference on
> the io can lead to problems if the io is completed in the target
> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs

__map_bio():
	...
	dm_io_inc_pending(io);
	...
	dm_zone_map_bio(tio);
	...

dm_submit_bio():
	dm_split_and_process_bio
		init_clone_info(&ci, md, map, bio)
			atomic_set(&io->io_count, 1)
		...
		__map_bio()
		...
		dm_io_dec_pending()

The target io can only be completed after the above line returns,
so the extra reference in dm_zone_map_bio() doesn't make any difference,
does it?


Thanks, 
Ming

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  0:36     ` Ming Lei
@ 2022-04-11  0:50       ` Damien Le Moal
  2022-04-11  1:04         ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  0:50 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 09:36, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>> On 4/9/22 02:12, Ming Lei wrote:
>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>> reference is grabbed already, and the reference won't be released
>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>> any more.
>>
>> I do not think that this patch is correct. Removing the extra reference on
>> the io can lead to problems if the io is completed in the target
>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> 
> __map_bio():
> 	...
> 	dm_io_inc_pending(io);
> 	...
> 	dm_zone_map_bio(tio);
> 	...

dm-crypt (for instance) may terminate the clone bio immediately in its
->map() function context, resulting in the bio_endio(clone) ->
clone_endio() -> dm_io_dec_pending() call chain.

With that, the io is gone and dm_zone_map_bio_end() will not have a valid
reference on the orig bio.

What am I missing here ?

> 
> dm_submit_bio():
> 	dm_split_and_process_bio
> 		init_clone_info(&ci, md, map, bio)
> 			atomic_set(&io->io_count, 1)
> 		...
> 		__map_bio()
> 		...
> 		dm_io_dec_pending()
> 
> The target io can only be completed after the above line returns,
> so the extra reference in dm_zone_map_bio() doesn't make any difference,
> does it?
> 
> 
> Thanks, 
> Ming
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  0:50       ` Damien Le Moal
@ 2022-04-11  1:04         ` Ming Lei
  2022-04-11  1:08           ` Damien Le Moal
  2022-04-11  2:19           ` Damien Le Moal
  0 siblings, 2 replies; 21+ messages in thread
From: Ming Lei @ 2022-04-11  1:04 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> On 4/11/22 09:36, Ming Lei wrote:
> > On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >> On 4/9/22 02:12, Ming Lei wrote:
> >>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>> reference is grabbed already, and the reference won't be released
> >>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>> any more.
> >>
> >> I do not think that this patch is correct. Removing the extra reference on
> >> the io can lead to problems if the io is completed in the target
> >> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> > 
> > __map_bio():
> > 	...
> > 	dm_io_inc_pending(io);
> > 	...
> > 	dm_zone_map_bio(tio);
> > 	...
> 
> dm-crypt (for instance) may terminate the clone bio immediately in its
> ->map() function context, resulting in the bio_endio()clone) ->
> clone_endio() -> dm_io_dec_pending() call chain.
> 
> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> reference on the orig bio.

Any target can complete the io during ->map. Nothing here looks special to
dm-crypt or dm-zone, so why does only dm-zone need the extra reference?

The reference counter is initialized to 1 in init_clone_info(), and dm_io_inc_pending()
in __map_bio() increases it to 2, so after the call chain you mentioned is done,
the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
in dm_split_and_process_bio() is called.

Or am I missing some extra requirement from dm-zone?
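
To spell out the counting in that scenario (illustrative trace):

	init_clone_info()/alloc_io()			io_count = 1
	__map_bio(): dm_io_inc_pending(io)		io_count = 2
	->map() completes clone -> clone_endio()
		-> dm_io_dec_pending()			io_count = 1
	dm_split_and_process_bio():
		dm_io_dec_pending()			io_count = 0, complete the original bio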


thanks,
Ming

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  1:04         ` Ming Lei
@ 2022-04-11  1:08           ` Damien Le Moal
  2022-04-11  2:19           ` Damien Le Moal
  1 sibling, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  1:08 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 10:04, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
>> On 4/11/22 09:36, Ming Lei wrote:
>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>>>> On 4/9/22 02:12, Ming Lei wrote:
>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>>>> reference is grabbed already, and the reference won't be released
>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>>>> any more.
>>>>
>>>> I do not think that this patch is correct. Removing the extra reference on
>>>> the io can lead to problems if the io is completed in the target
>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
>>>
>>> __map_bio():
>>> 	...
>>> 	dm_io_inc_pending(io);
>>> 	...
>>> 	dm_zone_map_bio(tio);
>>> 	...
>>
>> dm-crypt (for instance) may terminate the clone bio immediately in its
>> ->map() function context, resulting in the bio_endio()clone) ->
>> clone_endio() -> dm_io_dec_pending() call chain.
>>
>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
>> reference on the orig bio.
> 
> Any target can complete io during ->map. Here looks nothing is special with
> dm-crypt or dm-zone, why does only dm zone need extra reference?
> 
> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
> in dm_split_and_process_bio() is called.
> 
> Or maybe I miss any extra requirement from dm-zone?

I added that extra reference as I discovered it was needed when testing
zone append emulation with dm-crypt, back when it was implemented (the
emulation, not dm-crypt :)). Let me revisit this.

> 
> 
> thanks,
> Ming


-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  1:04         ` Ming Lei
  2022-04-11  1:08           ` Damien Le Moal
@ 2022-04-11  2:19           ` Damien Le Moal
  2022-04-11  2:55             ` Damien Le Moal
  1 sibling, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  2:19 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 10:04, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
>> On 4/11/22 09:36, Ming Lei wrote:
>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>>>> On 4/9/22 02:12, Ming Lei wrote:
>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>>>> reference is grabbed already, and the reference won't be released
>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>>>> any more.
>>>>
>>>> I do not think that this patch is correct. Removing the extra reference on
>>>> the io can lead to problems if the io is completed in the target
>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
>>>
>>> __map_bio():
>>> 	...
>>> 	dm_io_inc_pending(io);
>>> 	...
>>> 	dm_zone_map_bio(tio);
>>> 	...
>>
>> dm-crypt (for instance) may terminate the clone bio immediately in its
>> ->map() function context, resulting in the bio_endio()clone) ->
>> clone_endio() -> dm_io_dec_pending() call chain.
>>
>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
>> reference on the orig bio.
> 
> Any target can complete io during ->map. Here looks nothing is special with
> dm-crypt or dm-zone, why does only dm zone need extra reference?
> 
> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
> in dm_split_and_process_bio() is called.
> 
> Or maybe I miss any extra requirement from dm-zone?

Something is wrong... With and without your patch, when I set up a dm-crypt
target on top of a zoned nullblk device, I get:

[  292.596454] device-mapper: uevent: version 1.0.3
[  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
initialised: dm-devel@redhat.com
[  292.732217] general protection fault, probably for non-canonical
address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
[  292.743724] KASAN: null-ptr-deref in range
[0x0000000000000010-0x0000000000000017]
[  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
5.18.0-rc2+ #1458
[  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
02/21/2020
[  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
[  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
<0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
[  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
[  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
1ffff11034470027
[  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
ffff8885c5bcdc60
[  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
ffff8881a238013f
[  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
0000000000000000
[  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
ffff8885c5bcdd08
[  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
knlGS:0000000000000000
[  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
00000000007706f0
[  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  292.868194] PKRU: 55555554
[  292.870949] Call Trace:
[  292.873446]  <TASK>
[  292.875593]  ? lock_is_held_type+0xd7/0x130
[  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
[  292.885718]  ? __module_address.part.0+0x25/0x300
[  292.890509]  ? is_module_address+0x43/0x70
[  292.894674]  ? static_obj+0x8a/0xc0
[  292.898233]  __map_bio+0x352/0x740 [dm_mod]
[  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
[  292.907222]  ? find_held_lock+0x2c/0x110
[  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
[  292.916459]  ? lock_release+0x3b2/0x6f0
[  292.920368]  ? lock_downgrade+0x6d0/0x6d0
[  292.924458]  ? lock_is_held_type+0xd7/0x130
[  292.928714]  __submit_bio+0x12a/0x1f0
[  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
[  292.937324]  ? should_fail_request+0x70/0x70
[  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
[  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
[  292.950888]  ? lock_is_held_type+0xd7/0x130
[  292.957813]  mpage_readahead+0x32e/0x4b0
[  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
[  292.971661]  ? blkdev_write_begin+0x20/0x20
[  292.978567]  ? lock_release+0x3b2/0x6f0
[  292.985073]  ? folio_add_lru+0x217/0x3f0
[  292.991620]  ? lock_downgrade+0x6d0/0x6d0
[  292.998237]  read_pages+0x18c/0x990
[  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
[  293.011404]  ? folio_add_lru+0x238/0x3f0
[  293.017805]  ? file_ra_state_init+0xd0/0xd0
[  293.024395]  ? policy_node+0xbb/0x140
[  293.030416]  page_cache_ra_unbounded+0x258/0x410
[  293.037376]  force_page_cache_ra+0x281/0x400
[  293.043944]  filemap_get_pages+0x25e/0x1290
[  293.050342]  ? __lock_acquire+0x1603/0x6180
[  293.056654]  ? filemap_add_folio+0x140/0x140
[  293.063002]  ? lock_is_held_type+0xd7/0x130
[  293.069236]  filemap_read+0x29e/0x910
[  293.074927]  ? filemap_get_pages+0x1290/0x1290
[  293.081378]  ? __lock_acquire+0x1603/0x6180
[  293.087558]  blkdev_read_iter+0x20c/0x640
[  293.093529]  ? cp_new_stat+0x47a/0x590
[  293.099190]  ? cp_old_stat+0x470/0x470
[  293.104795]  new_sync_read+0x2e4/0x520
[  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
[  293.116269]  ? lock_acquire+0x1b2/0x4d0
[  293.121928]  ? find_held_lock+0x2c/0x110
[  293.127648]  vfs_read+0x312/0x430
[  293.132755]  ksys_read+0xf3/0x1d0
[  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
[  293.144032]  do_syscall_64+0x35/0x80
[  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae

The crash is at: drivers/md/dm-zone.c:499, which is
dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
pointer is invalid. Weird. Investigating.

Also checking why our weekly test runs did not catch this.

-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  2:19           ` Damien Le Moal
@ 2022-04-11  2:55             ` Damien Le Moal
  2022-04-11  3:10               ` Damien Le Moal
  2022-04-11  7:34               ` Ming Lei
  0 siblings, 2 replies; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  2:55 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 11:19, Damien Le Moal wrote:
> On 4/11/22 10:04, Ming Lei wrote:
>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
>>> On 4/11/22 09:36, Ming Lei wrote:
>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>>>>> On 4/9/22 02:12, Ming Lei wrote:
>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>>>>> reference is grabbed already, and the reference won't be released
>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>>>>> any more.
>>>>>
>>>>> I do not think that this patch is correct. Removing the extra reference on
>>>>> the io can lead to problems if the io is completed in the target
>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
>>>>
>>>> __map_bio():
>>>> 	...
>>>> 	dm_io_inc_pending(io);
>>>> 	...
>>>> 	dm_zone_map_bio(tio);
>>>> 	...
>>>
>>> dm-crypt (for instance) may terminate the clone bio immediately in its
>>> ->map() function context, resulting in the bio_endio()clone) ->
>>> clone_endio() -> dm_io_dec_pending() call chain.
>>>
>>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
>>> reference on the orig bio.
>>
>> Any target can complete io during ->map. Here looks nothing is special with
>> dm-crypt or dm-zone, why does only dm zone need extra reference?
>>
>> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
>> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
>> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
>> in dm_split_and_process_bio() is called.
>>
>> Or maybe I miss any extra requirement from dm-zone?
> 
> Something is wrong... With and without your patch, when I setup a dm-crypt
> target on top of a zoned nullblk device, I get:
> 
> [  292.596454] device-mapper: uevent: version 1.0.3
> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> initialised: dm-devel@redhat.com
> [  292.732217] general protection fault, probably for non-canonical
> address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
> [  292.743724] KASAN: null-ptr-deref in range
> [0x0000000000000010-0x0000000000000017]
> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> 5.18.0-rc2+ #1458
> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> 02/21/2020
> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
> [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
> 1ffff11034470027
> [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
> ffff8885c5bcdc60
> [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
> ffff8881a238013f
> [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
> 0000000000000000
> [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
> ffff8885c5bcdd08
> [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
> knlGS:0000000000000000
> [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
> 00000000007706f0
> [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  292.868194] PKRU: 55555554
> [  292.870949] Call Trace:
> [  292.873446]  <TASK>
> [  292.875593]  ? lock_is_held_type+0xd7/0x130
> [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
> [  292.885718]  ? __module_address.part.0+0x25/0x300
> [  292.890509]  ? is_module_address+0x43/0x70
> [  292.894674]  ? static_obj+0x8a/0xc0
> [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
> [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
> [  292.907222]  ? find_held_lock+0x2c/0x110
> [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
> [  292.916459]  ? lock_release+0x3b2/0x6f0
> [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
> [  292.924458]  ? lock_is_held_type+0xd7/0x130
> [  292.928714]  __submit_bio+0x12a/0x1f0
> [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
> [  292.937324]  ? should_fail_request+0x70/0x70
> [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
> [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
> [  292.950888]  ? lock_is_held_type+0xd7/0x130
> [  292.957813]  mpage_readahead+0x32e/0x4b0
> [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
> [  292.971661]  ? blkdev_write_begin+0x20/0x20
> [  292.978567]  ? lock_release+0x3b2/0x6f0
> [  292.985073]  ? folio_add_lru+0x217/0x3f0
> [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
> [  292.998237]  read_pages+0x18c/0x990
> [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
> [  293.011404]  ? folio_add_lru+0x238/0x3f0
> [  293.017805]  ? file_ra_state_init+0xd0/0xd0
> [  293.024395]  ? policy_node+0xbb/0x140
> [  293.030416]  page_cache_ra_unbounded+0x258/0x410
> [  293.037376]  force_page_cache_ra+0x281/0x400
> [  293.043944]  filemap_get_pages+0x25e/0x1290
> [  293.050342]  ? __lock_acquire+0x1603/0x6180
> [  293.056654]  ? filemap_add_folio+0x140/0x140
> [  293.063002]  ? lock_is_held_type+0xd7/0x130
> [  293.069236]  filemap_read+0x29e/0x910
> [  293.074927]  ? filemap_get_pages+0x1290/0x1290
> [  293.081378]  ? __lock_acquire+0x1603/0x6180
> [  293.087558]  blkdev_read_iter+0x20c/0x640
> [  293.093529]  ? cp_new_stat+0x47a/0x590
> [  293.099190]  ? cp_old_stat+0x470/0x470
> [  293.104795]  new_sync_read+0x2e4/0x520
> [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
> [  293.116269]  ? lock_acquire+0x1b2/0x4d0
> [  293.121928]  ? find_held_lock+0x2c/0x110
> [  293.127648]  vfs_read+0x312/0x430
> [  293.132755]  ksys_read+0xf3/0x1d0
> [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
> [  293.144032]  do_syscall_64+0x35/0x80
> [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The crash is at: drivers/md/dm-zone.c:499, which is
> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
> pointer is invalid. Weird. Investigating.
> 
> Also checking why our weekly test runs did not catch this.

This fixes the issue:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..3dd6735450c5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
        io->status = 0;
        atomic_set(&io->io_count, 1);
        this_cpu_inc(*md->pending_io);
-       io->orig_bio = NULL;
+       io->orig_bio = bio;
        io->md = md;
        io->map_task = current;
        spin_lock_init(&io->lock);

Otherwise, the dm-zone.c code sees a NULL orig_bio.
However, this change may be messing up the bio accounting. Need to check that.


-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  2:55             ` Damien Le Moal
@ 2022-04-11  3:10               ` Damien Le Moal
  2022-04-11  7:34               ` Ming Lei
  1 sibling, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  3:10 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 11:55, Damien Le Moal wrote:
> On 4/11/22 11:19, Damien Le Moal wrote:
>> On 4/11/22 10:04, Ming Lei wrote:
>>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
>>>> On 4/11/22 09:36, Ming Lei wrote:
>>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>>>>>> On 4/9/22 02:12, Ming Lei wrote:
>>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>>>>>> reference is grabbed already, and the reference won't be released
>>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>>>>>> any more.
>>>>>>
>>>>>> I do not think that this patch is correct. Removing the extra reference on
>>>>>> the io can lead to problems if the io is completed in the target
>>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
>>>>>
>>>>> __map_bio():
>>>>> 	...
>>>>> 	dm_io_inc_pending(io);
>>>>> 	...
>>>>> 	dm_zone_map_bio(tio);
>>>>> 	...
>>>>
>>>> dm-crypt (for instance) may terminate the clone bio immediately in its
>>>> ->map() function context, resulting in the bio_endio()clone) ->
>>>> clone_endio() -> dm_io_dec_pending() call chain.
>>>>
>>>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
>>>> reference on the orig bio.
>>>
>>> Any target can complete io during ->map. Here looks nothing is special with
>>> dm-crypt or dm-zone, why does only dm zone need extra reference?
>>>
>>> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
>>> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
>>> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
>>> in dm_split_and_process_bio() is called.
>>>
>>> Or maybe I miss any extra requirement from dm-zone?
>>
>> Something is wrong... With and without your patch, when I setup a dm-crypt
>> target on top of a zoned nullblk device, I get:
>>
>> [  292.596454] device-mapper: uevent: version 1.0.3
>> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
>> initialised: dm-devel@redhat.com
>> [  292.732217] general protection fault, probably for non-canonical
>> address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
>> [  292.743724] KASAN: null-ptr-deref in range
>> [0x0000000000000010-0x0000000000000017]
>> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
>> 5.18.0-rc2+ #1458
>> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
>> 02/21/2020
>> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
>> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
>> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
>> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
>> [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
>> [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
>> 1ffff11034470027
>> [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
>> ffff8885c5bcdc60
>> [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
>> ffff8881a238013f
>> [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
>> 0000000000000000
>> [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
>> ffff8885c5bcdd08
>> [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
>> knlGS:0000000000000000
>> [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
>> 00000000007706f0
>> [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [  292.868194] PKRU: 55555554
>> [  292.870949] Call Trace:
>> [  292.873446]  <TASK>
>> [  292.875593]  ? lock_is_held_type+0xd7/0x130
>> [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
>> [  292.885718]  ? __module_address.part.0+0x25/0x300
>> [  292.890509]  ? is_module_address+0x43/0x70
>> [  292.894674]  ? static_obj+0x8a/0xc0
>> [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
>> [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
>> [  292.907222]  ? find_held_lock+0x2c/0x110
>> [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
>> [  292.916459]  ? lock_release+0x3b2/0x6f0
>> [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
>> [  292.924458]  ? lock_is_held_type+0xd7/0x130
>> [  292.928714]  __submit_bio+0x12a/0x1f0
>> [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
>> [  292.937324]  ? should_fail_request+0x70/0x70
>> [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
>> [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
>> [  292.950888]  ? lock_is_held_type+0xd7/0x130
>> [  292.957813]  mpage_readahead+0x32e/0x4b0
>> [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
>> [  292.971661]  ? blkdev_write_begin+0x20/0x20
>> [  292.978567]  ? lock_release+0x3b2/0x6f0
>> [  292.985073]  ? folio_add_lru+0x217/0x3f0
>> [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
>> [  292.998237]  read_pages+0x18c/0x990
>> [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
>> [  293.011404]  ? folio_add_lru+0x238/0x3f0
>> [  293.017805]  ? file_ra_state_init+0xd0/0xd0
>> [  293.024395]  ? policy_node+0xbb/0x140
>> [  293.030416]  page_cache_ra_unbounded+0x258/0x410
>> [  293.037376]  force_page_cache_ra+0x281/0x400
>> [  293.043944]  filemap_get_pages+0x25e/0x1290
>> [  293.050342]  ? __lock_acquire+0x1603/0x6180
>> [  293.056654]  ? filemap_add_folio+0x140/0x140
>> [  293.063002]  ? lock_is_held_type+0xd7/0x130
>> [  293.069236]  filemap_read+0x29e/0x910
>> [  293.074927]  ? filemap_get_pages+0x1290/0x1290
>> [  293.081378]  ? __lock_acquire+0x1603/0x6180
>> [  293.087558]  blkdev_read_iter+0x20c/0x640
>> [  293.093529]  ? cp_new_stat+0x47a/0x590
>> [  293.099190]  ? cp_old_stat+0x470/0x470
>> [  293.104795]  new_sync_read+0x2e4/0x520
>> [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
>> [  293.116269]  ? lock_acquire+0x1b2/0x4d0
>> [  293.121928]  ? find_held_lock+0x2c/0x110
>> [  293.127648]  vfs_read+0x312/0x430
>> [  293.132755]  ksys_read+0xf3/0x1d0
>> [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
>> [  293.144032]  do_syscall_64+0x35/0x80
>> [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>
>> The crash is at: drivers/md/dm-zone.c:499, which is
>> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
>> pointer is invalid. Weird. Investigating.
>>
>> Also checking why our weekly test runs did not catch this.
> 
> This fixes the issue:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..3dd6735450c5 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> *md, struct bio *bio)
>         io->status = 0;
>         atomic_set(&io->io_count, 1);
>         this_cpu_inc(*md->pending_io);
> -       io->orig_bio = NULL;
> +       io->orig_bio = bio;
>         io->md = md;
>         io->map_task = current;
>         spin_lock_init(&io->lock);
> 
> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> However, this change may be messing up the bio accounting. Need to check that.

Forgot to mention that with the above fix + your patch applied, everything
seems to be fine. No issues. The extra reference is indeed not needed;
the only problem is the missing orig_bio initialization.


-- 
Damien Le Moal
Western Digital Research


* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  2:55             ` Damien Le Moal
  2022-04-11  3:10               ` Damien Le Moal
@ 2022-04-11  7:34               ` Ming Lei
  2022-04-11  7:42                 ` Damien Le Moal
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-04-11  7:34 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> On 4/11/22 11:19, Damien Le Moal wrote:
> > On 4/11/22 10:04, Ming Lei wrote:
> >> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> >>> On 4/11/22 09:36, Ming Lei wrote:
> >>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >>>>> On 4/9/22 02:12, Ming Lei wrote:
> >>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>>>>> reference is grabbed already, and the reference won't be released
> >>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>>>>> any more.
> >>>>>
> >>>>> I do not think that this patch is correct. Removing the extra reference on
> >>>>> the io can lead to problems if the io is completed in the target
> >>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> >>>>
> >>>> __map_bio():
> >>>> 	...
> >>>> 	dm_io_inc_pending(io);
> >>>> 	...
> >>>> 	dm_zone_map_bio(tio);
> >>>> 	...
> >>>
> >>> dm-crypt (for instance) may terminate the clone bio immediately in its
> >>> ->map() function context, resulting in the bio_endio()clone) ->
> >>> clone_endio() -> dm_io_dec_pending() call chain.
> >>>
> >>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> >>> reference on the orig bio.
> >>
> >> Any target can complete io during ->map. Here looks nothing is special with
> >> dm-crypt or dm-zone, why does only dm zone need extra reference?
> >>
> >> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
> >> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
> >> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
> >> in dm_split_and_process_bio() is called.
> >>
> >> Or maybe I miss any extra requirement from dm-zone?
> > 
> > Something is wrong... With and without your patch, when I setup a dm-crypt
> > target on top of a zoned nullblk device, I get:
> > 
> > [  292.596454] device-mapper: uevent: version 1.0.3
> > [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> > initialised: dm-devel@redhat.com
> > [  292.732217] general protection fault, probably for non-canonical
> > address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
> > [  292.743724] KASAN: null-ptr-deref in range
> > [0x0000000000000010-0x0000000000000017]
> > [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> > 5.18.0-rc2+ #1458
> > [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> > 02/21/2020
> > [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> > [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> > 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> > <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> > [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
> > [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
> > 1ffff11034470027
> > [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
> > ffff8885c5bcdc60
> > [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
> > ffff8881a238013f
> > [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
> > 0000000000000000
> > [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
> > ffff8885c5bcdd08
> > [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
> > knlGS:0000000000000000
> > [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
> > 00000000007706f0
> > [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [  292.868194] PKRU: 55555554
> > [  292.870949] Call Trace:
> > [  292.873446]  <TASK>
> > [  292.875593]  ? lock_is_held_type+0xd7/0x130
> > [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
> > [  292.885718]  ? __module_address.part.0+0x25/0x300
> > [  292.890509]  ? is_module_address+0x43/0x70
> > [  292.894674]  ? static_obj+0x8a/0xc0
> > [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
> > [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
> > [  292.907222]  ? find_held_lock+0x2c/0x110
> > [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
> > [  292.916459]  ? lock_release+0x3b2/0x6f0
> > [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
> > [  292.924458]  ? lock_is_held_type+0xd7/0x130
> > [  292.928714]  __submit_bio+0x12a/0x1f0
> > [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
> > [  292.937324]  ? should_fail_request+0x70/0x70
> > [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
> > [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
> > [  292.950888]  ? lock_is_held_type+0xd7/0x130
> > [  292.957813]  mpage_readahead+0x32e/0x4b0
> > [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
> > [  292.971661]  ? blkdev_write_begin+0x20/0x20
> > [  292.978567]  ? lock_release+0x3b2/0x6f0
> > [  292.985073]  ? folio_add_lru+0x217/0x3f0
> > [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
> > [  292.998237]  read_pages+0x18c/0x990
> > [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
> > [  293.011404]  ? folio_add_lru+0x238/0x3f0
> > [  293.017805]  ? file_ra_state_init+0xd0/0xd0
> > [  293.024395]  ? policy_node+0xbb/0x140
> > [  293.030416]  page_cache_ra_unbounded+0x258/0x410
> > [  293.037376]  force_page_cache_ra+0x281/0x400
> > [  293.043944]  filemap_get_pages+0x25e/0x1290
> > [  293.050342]  ? __lock_acquire+0x1603/0x6180
> > [  293.056654]  ? filemap_add_folio+0x140/0x140
> > [  293.063002]  ? lock_is_held_type+0xd7/0x130
> > [  293.069236]  filemap_read+0x29e/0x910
> > [  293.074927]  ? filemap_get_pages+0x1290/0x1290
> > [  293.081378]  ? __lock_acquire+0x1603/0x6180
> > [  293.087558]  blkdev_read_iter+0x20c/0x640
> > [  293.093529]  ? cp_new_stat+0x47a/0x590
> > [  293.099190]  ? cp_old_stat+0x470/0x470
> > [  293.104795]  new_sync_read+0x2e4/0x520
> > [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
> > [  293.116269]  ? lock_acquire+0x1b2/0x4d0
> > [  293.121928]  ? find_held_lock+0x2c/0x110
> > [  293.127648]  vfs_read+0x312/0x430
> > [  293.132755]  ksys_read+0xf3/0x1d0
> > [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
> > [  293.144032]  do_syscall_64+0x35/0x80
> > [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > 
> > The crash is at: drivers/md/dm-zone.c:499, which is
> > dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
> > pointer is invalid. Weird. Investigating.
> > 
> > Also checking why our weekly test runs did not catch this.
> 
> This fixes the issue:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..3dd6735450c5 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> *md, struct bio *bio)
>         io->status = 0;
>         atomic_set(&io->io_count, 1);
>         this_cpu_inc(*md->pending_io);
> -       io->orig_bio = NULL;
> +       io->orig_bio = bio;
>         io->md = md;
>         io->map_task = current;
>         spin_lock_init(&io->lock);
> 
> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> However, this change may be messing up the bio accounting. Need to check that.

Looks like it is a recent regression since:

commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  7:34               ` Ming Lei
@ 2022-04-11  7:42                 ` Damien Le Moal
  2022-04-11 14:18                   ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  7:42 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 16:34, Ming Lei wrote:
> On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
>> On 4/11/22 11:19, Damien Le Moal wrote:
>>> On 4/11/22 10:04, Ming Lei wrote:
>>>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
>>>>> On 4/11/22 09:36, Ming Lei wrote:
>>>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
>>>>>>> On 4/9/22 02:12, Ming Lei wrote:
>>>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
>>>>>>>> reference is grabbed already, and the reference won't be released
>>>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
>>>>>>>> any more.
>>>>>>>
>>>>>>> I do not think that this patch is correct. Removing the extra reference on
>>>>>>> the io can lead to problems if the io is completed in the target
>>>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
>>>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
>>>>>>
>>>>>> __map_bio():
>>>>>> 	...
>>>>>> 	dm_io_inc_pending(io);
>>>>>> 	...
>>>>>> 	dm_zone_map_bio(tio);
>>>>>> 	...
>>>>>
>>>>> dm-crypt (for instance) may terminate the clone bio immediately in its
>>>>> ->map() function context, resulting in the bio_endio()clone) ->
>>>>> clone_endio() -> dm_io_dec_pending() call chain.
>>>>>
>>>>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
>>>>> reference on the orig bio.
>>>>
>>>> Any target can complete io during ->map. Here looks nothing is special with
>>>> dm-crypt or dm-zone, why does only dm zone need extra reference?
>>>>
>>>> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
>>>> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
>>>> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
>>>> in dm_split_and_process_bio() is called.
>>>>
>>>> Or maybe I miss any extra requirement from dm-zone?
>>>
>>> Something is wrong... With and without your patch, when I setup a dm-crypt
>>> target on top of a zoned nullblk device, I get:
>>>
>>> [  292.596454] device-mapper: uevent: version 1.0.3
>>> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
>>> initialised: dm-devel@redhat.com
>>> [  292.732217] general protection fault, probably for non-canonical
>>> address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
>>> [  292.743724] KASAN: null-ptr-deref in range
>>> [0x0000000000000010-0x0000000000000017]
>>> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
>>> 5.18.0-rc2+ #1458
>>> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
>>> 02/21/2020
>>> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
>>> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
>>> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
>>> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
>>> [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
>>> [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
>>> 1ffff11034470027
>>> [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
>>> ffff8885c5bcdc60
>>> [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
>>> ffff8881a238013f
>>> [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
>>> 0000000000000000
>>> [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
>>> ffff8885c5bcdd08
>>> [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
>>> knlGS:0000000000000000
>>> [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
>>> 00000000007706f0
>>> [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>> 0000000000000000
>>> [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>>> 0000000000000400
>>> [  292.868194] PKRU: 55555554
>>> [  292.870949] Call Trace:
>>> [  292.873446]  <TASK>
>>> [  292.875593]  ? lock_is_held_type+0xd7/0x130
>>> [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
>>> [  292.885718]  ? __module_address.part.0+0x25/0x300
>>> [  292.890509]  ? is_module_address+0x43/0x70
>>> [  292.894674]  ? static_obj+0x8a/0xc0
>>> [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
>>> [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
>>> [  292.907222]  ? find_held_lock+0x2c/0x110
>>> [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
>>> [  292.916459]  ? lock_release+0x3b2/0x6f0
>>> [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
>>> [  292.924458]  ? lock_is_held_type+0xd7/0x130
>>> [  292.928714]  __submit_bio+0x12a/0x1f0
>>> [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
>>> [  292.937324]  ? should_fail_request+0x70/0x70
>>> [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
>>> [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
>>> [  292.950888]  ? lock_is_held_type+0xd7/0x130
>>> [  292.957813]  mpage_readahead+0x32e/0x4b0
>>> [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
>>> [  292.971661]  ? blkdev_write_begin+0x20/0x20
>>> [  292.978567]  ? lock_release+0x3b2/0x6f0
>>> [  292.985073]  ? folio_add_lru+0x217/0x3f0
>>> [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
>>> [  292.998237]  read_pages+0x18c/0x990
>>> [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
>>> [  293.011404]  ? folio_add_lru+0x238/0x3f0
>>> [  293.017805]  ? file_ra_state_init+0xd0/0xd0
>>> [  293.024395]  ? policy_node+0xbb/0x140
>>> [  293.030416]  page_cache_ra_unbounded+0x258/0x410
>>> [  293.037376]  force_page_cache_ra+0x281/0x400
>>> [  293.043944]  filemap_get_pages+0x25e/0x1290
>>> [  293.050342]  ? __lock_acquire+0x1603/0x6180
>>> [  293.056654]  ? filemap_add_folio+0x140/0x140
>>> [  293.063002]  ? lock_is_held_type+0xd7/0x130
>>> [  293.069236]  filemap_read+0x29e/0x910
>>> [  293.074927]  ? filemap_get_pages+0x1290/0x1290
>>> [  293.081378]  ? __lock_acquire+0x1603/0x6180
>>> [  293.087558]  blkdev_read_iter+0x20c/0x640
>>> [  293.093529]  ? cp_new_stat+0x47a/0x590
>>> [  293.099190]  ? cp_old_stat+0x470/0x470
>>> [  293.104795]  new_sync_read+0x2e4/0x520
>>> [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
>>> [  293.116269]  ? lock_acquire+0x1b2/0x4d0
>>> [  293.121928]  ? find_held_lock+0x2c/0x110
>>> [  293.127648]  vfs_read+0x312/0x430
>>> [  293.132755]  ksys_read+0xf3/0x1d0
>>> [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
>>> [  293.144032]  do_syscall_64+0x35/0x80
>>> [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>
>>> The crash is at: drivers/md/dm-zone.c:499, which is
>>> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
>>> pointer is invalid. Weird. Investigating.
>>>
>>> Also checking why our weekly test runs did not catch this.
>>
>> This fixes the issue:
>>
>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>> index 3c5fad7c4ee6..3dd6735450c5 100644
>> --- a/drivers/md/dm.c
>> +++ b/drivers/md/dm.c
>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>> *md, struct bio *bio)
>>         io->status = 0;
>>         atomic_set(&io->io_count, 1);
>>         this_cpu_inc(*md->pending_io);
>> -       io->orig_bio = NULL;
>> +       io->orig_bio = bio;
>>         io->md = md;
>>         io->map_task = current;
>>         spin_lock_init(&io->lock);
>>
>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>> However, this change may be messing up the bio accounting. Need to check that.
> 
> Looks it is one recent regression since:
> 
> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")

Yep, saw that. Problem is, I really do not understand that change setting
io->orig_bio *after* __map_bio() is called. It seems that the accounting
is done on each fragment of the orig_bio instead of once for the entire
BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
an argument to  __map_bio() from __split_and_process_bio(), I do not think
my change is correct. Thoughts ?
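
For reference, the broken ordering reduces to something like this minimal
userspace mock (toy structs only, nothing here is the real dm code):

#include <stdio.h>

struct bio { unsigned int op; };
struct dm_io { struct bio *orig_bio; };

static void map_bio(struct dm_io *io)
{
	/* stands in for dm_zone_map_bio(), which needs the original bio's op */
	if (!io->orig_bio) {
		fprintf(stderr, "orig_bio is NULL during ->map\n");
		return;
	}
	printf("mapping op %u\n", io->orig_bio->op);
}

int main(void)
{
	struct bio bio = { .op = 1 };
	struct dm_io io = { .orig_bio = NULL };	/* as in alloc_io() */

	map_bio(&io);		/* __map_bio() runs first ... */
	io.orig_bio = &bio;	/* ... orig_bio is only stored afterwards */
	return 0;
}

That is essentially what the oops above shows: dm_zone_map_bio() runs while
io->orig_bio is still NULL.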

-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
  2022-04-11  0:18   ` Damien Le Moal
@ 2022-04-11  9:40   ` Damien Le Moal
  1 sibling, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11  9:40 UTC (permalink / raw)
  To: Ming Lei, Mike Snitzer; +Cc: dm-devel, Damien Le Moal

On 4/9/22 02:12, Ming Lei wrote:
> dm_zone_map_bio() is only called from __map_bio in which the io's
> reference is grabbed already, and the reference won't be released
> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> any more.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11  7:42                 ` Damien Le Moal
@ 2022-04-11 14:18                   ` Ming Lei
  2022-04-11 23:33                     ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-04-11 14:18 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Mon, Apr 11, 2022 at 04:42:59PM +0900, Damien Le Moal wrote:
> On 4/11/22 16:34, Ming Lei wrote:
> > On Mon, Apr 11, 2022 at 11:55:14AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 11:19, Damien Le Moal wrote:
> >>> On 4/11/22 10:04, Ming Lei wrote:
> >>>> On Mon, Apr 11, 2022 at 09:50:57AM +0900, Damien Le Moal wrote:
> >>>>> On 4/11/22 09:36, Ming Lei wrote:
> >>>>>> On Mon, Apr 11, 2022 at 09:18:56AM +0900, Damien Le Moal wrote:
> >>>>>>> On 4/9/22 02:12, Ming Lei wrote:
> >>>>>>>> dm_zone_map_bio() is only called from __map_bio in which the io's
> >>>>>>>> reference is grabbed already, and the reference won't be released
> >>>>>>>> until the bio is submitted, so no necessary to do it dm_zone_map_bio
> >>>>>>>> any more.
> >>>>>>>
> >>>>>>> I do not think that this patch is correct. Removing the extra reference on
> >>>>>>> the io can lead to problems if the io is completed in the target
> >>>>>>> ->map(ti, clone) call or before dm_zone_map_bio_end() is called for the
> >>>>>>> DM_MAPIO_SUBMITTED or DM_MAPIO_REMAPPED cases. dm_zone_map_bio_end() needs
> >>>>>>
> >>>>>> __map_bio():
> >>>>>> 	...
> >>>>>> 	dm_io_inc_pending(io);
> >>>>>> 	...
> >>>>>> 	dm_zone_map_bio(tio);
> >>>>>> 	...
> >>>>>
> >>>>> dm-crypt (for instance) may terminate the clone bio immediately in its
> >>>>> ->map() function context, resulting in the bio_endio()clone) ->
> >>>>> clone_endio() -> dm_io_dec_pending() call chain.
> >>>>>
> >>>>> With that, the io is gone and dm_zone_map_bio_end() will not have a valid
> >>>>> reference on the orig bio.
> >>>>
> >>>> Any target can complete io during ->map. Here looks nothing is special with
> >>>> dm-crypt or dm-zone, why does only dm zone need extra reference?
> >>>>
> >>>> The reference counter is initialized as 1 in init_clone_info(), dm_io_inc_pending()
> >>>> in __map_bio() increases it to 2, so after the above call chain you mentioned is done,
> >>>> the counter becomes 1. The original bio can't be completed until dm_io_dec_pending()
> >>>> in dm_split_and_process_bio() is called.
> >>>>
> >>>> Or maybe I miss any extra requirement from dm-zone?
> >>>
> >>> Something is wrong... With and without your patch, when I setup a dm-crypt
> >>> target on top of a zoned nullblk device, I get:
> >>>
> >>> [  292.596454] device-mapper: uevent: version 1.0.3
> >>> [  292.602746] device-mapper: ioctl: 4.46.0-ioctl (2022-02-22)
> >>> initialised: dm-devel@redhat.com
> >>> [  292.732217] general protection fault, probably for non-canonical
> >>> address 0xdffffc0000000002: 0000 [#1] PREEMPT SMP KASAN PTI
> >>> [  292.743724] KASAN: null-ptr-deref in range
> >>> [0x0000000000000010-0x0000000000000017]
> >>> [  292.751409] CPU: 0 PID: 4259 Comm: systemd-udevd Not tainted
> >>> 5.18.0-rc2+ #1458
> >>> [  292.758746] Hardware name: Supermicro Super Server/X11DPL-i, BIOS 3.3
> >>> 02/21/2020
> >>> [  292.766250] RIP: 0010:dm_zone_map_bio+0x146/0x1740 [dm_mod]
> >>> [  292.771938] Code: 00 00 4d 8b 65 10 48 8d 43 28 48 89 44 24 10 49 8d 44
> >>> 24 10 48 89 c2 48 89 44 24 18 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03
> >>> <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 78 0e 00 00 45 8b 7c 24 10 41
> >>> [  292.790946] RSP: 0018:ffff8883cd847218 EFLAGS: 00010202
> >>> [  292.796260] RAX: dffffc0000000000 RBX: ffff8885c5bcdce8 RCX:
> >>> 1ffff11034470027
> >>> [  292.803496] RDX: 0000000000000002 RSI: 0000000000000008 RDI:
> >>> ffff8885c5bcdc60
> >>> [  292.810732] RBP: 1ffff11079b08e4f R08: ffff8881a23801d8 R09:
> >>> ffff8881a238013f
> >>> [  292.817970] R10: ffff88821c594040 R11: 0000000000000001 R12:
> >>> 0000000000000000
> >>> [  292.825206] R13: ffff8885c5bcdc50 R14: ffff8881a2380000 R15:
> >>> ffff8885c5bcdd08
> >>> [  292.832442] FS:  00007fe169b06b40(0000) GS:ffff88880fc00000(0000)
> >>> knlGS:0000000000000000
> >>> [  292.840646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [  292.846481] CR2: 00007ffd80a57a38 CR3: 00000004b91b0006 CR4:
> >>> 00000000007706f0
> >>> [  292.853722] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>> 0000000000000000
> >>> [  292.860957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>> 0000000000000400
> >>> [  292.868194] PKRU: 55555554
> >>> [  292.870949] Call Trace:
> >>> [  292.873446]  <TASK>
> >>> [  292.875593]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.879860]  ? dm_set_zones_restrictions+0x8f0/0x8f0 [dm_mod]
> >>> [  292.885718]  ? __module_address.part.0+0x25/0x300
> >>> [  292.890509]  ? is_module_address+0x43/0x70
> >>> [  292.894674]  ? static_obj+0x8a/0xc0
> >>> [  292.898233]  __map_bio+0x352/0x740 [dm_mod]
> >>> [  292.902512]  dm_submit_bio+0x72f/0x17a0 [dm_mod]
> >>> [  292.907222]  ? find_held_lock+0x2c/0x110
> >>> [  292.911217]  ? __send_empty_flush+0x2b0/0x2b0 [dm_mod]
> >>> [  292.916459]  ? lock_release+0x3b2/0x6f0
> >>> [  292.920368]  ? lock_downgrade+0x6d0/0x6d0
> >>> [  292.924458]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.928714]  __submit_bio+0x12a/0x1f0
> >>> [  292.932450]  submit_bio_noacct_nocheck+0x324/0x840
> >>> [  292.937324]  ? should_fail_request+0x70/0x70
> >>> [  292.941670]  ? rcu_read_lock_sched_held+0x3f/0x70
> >>> [  292.946458]  ? submit_bio_noacct+0xfa4/0x1530
> >>> [  292.950888]  ? lock_is_held_type+0xd7/0x130
> >>> [  292.957813]  mpage_readahead+0x32e/0x4b0
> >>> [  292.964470]  ? do_mpage_readpage+0x17c0/0x17c0
> >>> [  292.971661]  ? blkdev_write_begin+0x20/0x20
> >>> [  292.978567]  ? lock_release+0x3b2/0x6f0
> >>> [  292.985073]  ? folio_add_lru+0x217/0x3f0
> >>> [  292.991620]  ? lock_downgrade+0x6d0/0x6d0
> >>> [  292.998237]  read_pages+0x18c/0x990
> >>> [  293.004308]  ? memcg_list_lru_alloc+0x810/0x810
> >>> [  293.011404]  ? folio_add_lru+0x238/0x3f0
> >>> [  293.017805]  ? file_ra_state_init+0xd0/0xd0
> >>> [  293.024395]  ? policy_node+0xbb/0x140
> >>> [  293.030416]  page_cache_ra_unbounded+0x258/0x410
> >>> [  293.037376]  force_page_cache_ra+0x281/0x400
> >>> [  293.043944]  filemap_get_pages+0x25e/0x1290
> >>> [  293.050342]  ? __lock_acquire+0x1603/0x6180
> >>> [  293.056654]  ? filemap_add_folio+0x140/0x140
> >>> [  293.063002]  ? lock_is_held_type+0xd7/0x130
> >>> [  293.069236]  filemap_read+0x29e/0x910
> >>> [  293.074927]  ? filemap_get_pages+0x1290/0x1290
> >>> [  293.081378]  ? __lock_acquire+0x1603/0x6180
> >>> [  293.087558]  blkdev_read_iter+0x20c/0x640
> >>> [  293.093529]  ? cp_new_stat+0x47a/0x590
> >>> [  293.099190]  ? cp_old_stat+0x470/0x470
> >>> [  293.104795]  new_sync_read+0x2e4/0x520
> >>> [  293.110362]  ? __x64_sys_lseek+0x1d0/0x1d0
> >>> [  293.116269]  ? lock_acquire+0x1b2/0x4d0
> >>> [  293.121928]  ? find_held_lock+0x2c/0x110
> >>> [  293.127648]  vfs_read+0x312/0x430
> >>> [  293.132755]  ksys_read+0xf3/0x1d0
> >>> [  293.137863]  ? __x64_sys_pwrite64+0x1f0/0x1f0
> >>> [  293.144032]  do_syscall_64+0x35/0x80
> >>> [  293.149391]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>>
> >>> The crash is at: drivers/md/dm-zone.c:499, which is
> >>> dm_need_zone_wp_tracking() called from dm_zone_map_bio(). The orig_bio
> >>> pointer is invalid. Weird. Investigating.
> >>>
> >>> Also checking why our weekly test runs did not catch this.
> >>
> >> This fixes the issue:
> >>
> >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >> index 3c5fad7c4ee6..3dd6735450c5 100644
> >> --- a/drivers/md/dm.c
> >> +++ b/drivers/md/dm.c
> >> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >> *md, struct bio *bio)
> >>         io->status = 0;
> >>         atomic_set(&io->io_count, 1);
> >>         this_cpu_inc(*md->pending_io);
> >> -       io->orig_bio = NULL;
> >> +       io->orig_bio = bio;
> >>         io->md = md;
> >>         io->map_task = current;
> >>         spin_lock_init(&io->lock);
> >>
> >> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >> However, this change may be messing up the bio accounting. Need to check that.
> > 
> > Looks it is one recent regression since:
> > 
> > commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> 
> Yep, saw that. Problem is, I really do not understand that change setting
> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> is done on each fragment of the orig_bio instead of once for the entire
> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> my change is correct. Thoughts ?

Frankly speaking, both changing ->orig_bio after the split and setting ->orig_bio
after ->map() look ugly & tricky, and the following change should avoid the
issue while also simplifying dm accounting a bit:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3c5fad7c4ee6..f1fe83113608 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
 	io->status = 0;
 	atomic_set(&io->io_count, 1);
 	this_cpu_inc(*md->pending_io);
-	io->orig_bio = NULL;
+	io->orig_bio = bio;
 	io->md = md;
 	io->map_task = current;
 	spin_lock_init(&io->lock);
@@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
 	 * Account io->origin_bio to DM dev on behalf of target
 	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
 	 */
-	if (io->map_task == current) {
+	if (io->map_task == current)
 		/* Still in target's map function */
 		dm_io_set_flag(io, DM_IO_START_ACCT);
-	} else {
-		/*
-		 * Called by another thread, managed by DM target,
-		 * wait for dm_split_and_process_bio() to store
-		 * io->orig_bio
-		 */
-		while (unlikely(!smp_load_acquire(&io->orig_bio)))
-			msleep(1);
+	else
 		dm_start_io_acct(io, clone);
-	}
 
 	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
 			      tio->old_sector);
@@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 				     struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	struct bio *orig_bio = NULL;
+	struct bio *new_bio = NULL;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
 	if (error || !ci.sector_count)
 		goto out;
 
-	/*
-	 * Remainder must be passed to submit_bio_noacct() so it gets handled
-	 * *after* bios already submitted have been completely processed.
-	 * We take a clone of the original to store in ci.io->orig_bio to be
-	 * used by dm_end_io_acct() and for dm_io_complete() to use for
-	 * completion handling.
-	 */
-	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
-			     GFP_NOIO, &md->queue->bio_split);
-	bio_chain(orig_bio, bio);
-	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
-	submit_bio_noacct(bio);
+	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
+			&md->queue->bio_split);
+	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
+	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
+	bio_chain(new_bio, bio);
+	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
+	submit_bio_noacct(new_bio);
 out:
-	if (!orig_bio)
-		orig_bio = bio;
-	smp_store_release(&ci.io->orig_bio, orig_bio);
 	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
 		dm_start_io_acct(ci.io, NULL);
 

-- 
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11 14:18                   ` Ming Lei
@ 2022-04-11 23:33                     ` Damien Le Moal
  2022-04-12  0:09                       ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-11 23:33 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/11/22 23:18, Ming Lei wrote:
>>>> This fixes the issue:
>>>>
>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>> --- a/drivers/md/dm.c
>>>> +++ b/drivers/md/dm.c
>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>> *md, struct bio *bio)
>>>>         io->status = 0;
>>>>         atomic_set(&io->io_count, 1);
>>>>         this_cpu_inc(*md->pending_io);
>>>> -       io->orig_bio = NULL;
>>>> +       io->orig_bio = bio;
>>>>         io->md = md;
>>>>         io->map_task = current;
>>>>         spin_lock_init(&io->lock);
>>>>
>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>
>>> Looks it is one recent regression since:
>>>
>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>
>> Yep, saw that. Problem is, I really do not understand that change setting
>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>> is done on each fragment of the orig_bio instead of once for the entire
>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>> my change is correct. Thoughts ?
> 
> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> after ->map() looks ugly & tricky, and the following change should avoid the
> issue, meantime simplify dm accounting a bit:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3c5fad7c4ee6..f1fe83113608 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>  	io->status = 0;
>  	atomic_set(&io->io_count, 1);
>  	this_cpu_inc(*md->pending_io);
> -	io->orig_bio = NULL;
> +	io->orig_bio = bio;
>  	io->md = md;
>  	io->map_task = current;
>  	spin_lock_init(&io->lock);
> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>  	 * Account io->origin_bio to DM dev on behalf of target
>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>  	 */
> -	if (io->map_task == current) {
> +	if (io->map_task == current)
>  		/* Still in target's map function */
>  		dm_io_set_flag(io, DM_IO_START_ACCT);
> -	} else {
> -		/*
> -		 * Called by another thread, managed by DM target,
> -		 * wait for dm_split_and_process_bio() to store
> -		 * io->orig_bio
> -		 */
> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> -			msleep(1);
> +	else

Curly brackets around the else here.

>  		dm_start_io_acct(io, clone);
> -	}
>  
>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>  			      tio->old_sector);
> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  				     struct dm_table *map, struct bio *bio)
>  {
>  	struct clone_info ci;
> -	struct bio *orig_bio = NULL;
> +	struct bio *new_bio = NULL;
>  	int error = 0;
>  
>  	init_clone_info(&ci, md, map, bio);
> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>  	if (error || !ci.sector_count)
>  		goto out;
>  
> -	/*
> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> -	 * *after* bios already submitted have been completely processed.
> -	 * We take a clone of the original to store in ci.io->orig_bio to be
> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> -	 * completion handling.
> -	 */

This comment should remain with some adjustment.

> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> -			     GFP_NOIO, &md->queue->bio_split);
> -	bio_chain(orig_bio, bio);
> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> -	submit_bio_noacct(bio);
> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> +			&md->queue->bio_split);

Why not keep using bio_split() ?

> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> +	bio_chain(new_bio, bio);
> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> +	submit_bio_noacct(new_bio);
>  out:
> -	if (!orig_bio)
> -		orig_bio = bio;
> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>  		dm_start_io_acct(ci.io, NULL);

I tested this and it works. Need to check the accounting though.
And I agree this is a lot cleaner :)

-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-11 23:33                     ` Damien Le Moal
@ 2022-04-12  0:09                       ` Ming Lei
  2022-04-12  0:28                         ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-04-12  0:09 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> On 4/11/22 23:18, Ming Lei wrote:
> >>>> This fixes the issue:
> >>>>
> >>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>> --- a/drivers/md/dm.c
> >>>> +++ b/drivers/md/dm.c
> >>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>> *md, struct bio *bio)
> >>>>         io->status = 0;
> >>>>         atomic_set(&io->io_count, 1);
> >>>>         this_cpu_inc(*md->pending_io);
> >>>> -       io->orig_bio = NULL;
> >>>> +       io->orig_bio = bio;
> >>>>         io->md = md;
> >>>>         io->map_task = current;
> >>>>         spin_lock_init(&io->lock);
> >>>>
> >>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>> However, this change may be messing up the bio accounting. Need to check that.
> >>>
> >>> Looks it is one recent regression since:
> >>>
> >>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>
> >> Yep, saw that. Problem is, I really do not understand that change setting
> >> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >> is done on each fragment of the orig_bio instead of once for the entire
> >> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> >> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> >> my change is correct. Thoughts ?
> > 
> > Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> > after ->map() looks ugly & tricky, and the following change should avoid the
> > issue, meantime simplify dm accounting a bit:
> > 
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 3c5fad7c4ee6..f1fe83113608 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> >  	io->status = 0;
> >  	atomic_set(&io->io_count, 1);
> >  	this_cpu_inc(*md->pending_io);
> > -	io->orig_bio = NULL;
> > +	io->orig_bio = bio;
> >  	io->md = md;
> >  	io->map_task = current;
> >  	spin_lock_init(&io->lock);
> > @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
> >  	 * Account io->origin_bio to DM dev on behalf of target
> >  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
> >  	 */
> > -	if (io->map_task == current) {
> > +	if (io->map_task == current)
> >  		/* Still in target's map function */
> >  		dm_io_set_flag(io, DM_IO_START_ACCT);
> > -	} else {
> > -		/*
> > -		 * Called by another thread, managed by DM target,
> > -		 * wait for dm_split_and_process_bio() to store
> > -		 * io->orig_bio
> > -		 */
> > -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> > -			msleep(1);
> > +	else
> 
> Curly brackets around the else here.
> 
> >  		dm_start_io_acct(io, clone);
> > -	}
> >  
> >  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >  			      tio->old_sector);
> > @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  				     struct dm_table *map, struct bio *bio)
> >  {
> >  	struct clone_info ci;
> > -	struct bio *orig_bio = NULL;
> > +	struct bio *new_bio = NULL;
> >  	int error = 0;
> >  
> >  	init_clone_info(&ci, md, map, bio);
> > @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >  	if (error || !ci.sector_count)
> >  		goto out;
> >  
> > -	/*
> > -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> > -	 * *after* bios already submitted have been completely processed.
> > -	 * We take a clone of the original to store in ci.io->orig_bio to be
> > -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> > -	 * completion handling.
> > -	 */
> 
> This comment should remain with some adjustment.

Fine, just felt the approach is very straightforward.

> 
> > -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> > -			     GFP_NOIO, &md->queue->bio_split);
> > -	bio_chain(orig_bio, bio);
> > -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> > -	submit_bio_noacct(bio);
> > +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> > +			&md->queue->bio_split);
> 
> Why not keep using bio_split() ?

With bio_split(), 'bio' actually tracks the remainder, and the returned
'orig_bio' tracks the part for the current target io, so ->orig_bio has to
be updated in this way.

With bio_alloc_clone() and bio_trim(), ->orig_bio need not be
changed, and assigning it in alloc_io() works perfectly.
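
To spell that out, here is a rough userspace mock of the two bookkeeping
schemes (toy bio struct with sector counts only, not the real block layer
helpers):

#include <stdio.h>

struct bio { unsigned long sector; unsigned int nr_sectors; };

/* mock of bio_split(): return the head, advance @bio to the remainder */
static struct bio split(struct bio *bio, unsigned int head)
{
	struct bio front = { bio->sector, head };

	bio->sector += head;
	bio->nr_sectors -= head;
	return front;
}

/* mock of bio_trim(): keep @size sectors starting @offset into @bio */
static void trim(struct bio *bio, unsigned int offset, unsigned int size)
{
	bio->sector += offset;
	bio->nr_sectors = size;
}

int main(void)
{
	unsigned int remaining = 8;	/* ci.sector_count left after mapping */
	struct bio a = { 100, 32 }, b = { 100, 32 }, clone, head;

	/* current code: 'a' ends up describing the remainder */
	head = split(&a, 32 - remaining);
	printf("split:      mapped=%lu+%u resubmit=%lu+%u\n",
	       head.sector, head.nr_sectors, a.sector, a.nr_sectors);

	/* proposed code: 'b' keeps describing the mapped head */
	clone = b;
	trim(&clone, 32 - remaining, remaining);	/* the remainder */
	trim(&b, 0, 32 - remaining);			/* the mapped head */
	printf("clone+trim: mapped=%lu+%u resubmit=%lu+%u\n",
	       b.sector, b.nr_sectors, clone.sector, clone.nr_sectors);
	return 0;
}

Both end up with the same two ranges; the difference is only which struct
the pointer stored in alloc_io() keeps describing, which is why the
bio_split() variant had to rewrite ->orig_bio afterwards.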

> 
> > +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> > +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> > +	bio_chain(new_bio, bio);
> > +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> > +	submit_bio_noacct(new_bio);
> >  out:
> > -	if (!orig_bio)
> > -		orig_bio = bio;
> > -	smp_store_release(&ci.io->orig_bio, orig_bio);
> >  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
> >  		dm_start_io_acct(ci.io, NULL);
> 
> I tested this and it works. Need to check the accounting though.
> And I agree this is a lot cleaner :)

BTW, the cloned bio for the split is just for accounting purposes; if
->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
the cloned bio can be avoided, but the code may become less readable than
before.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-12  0:09                       ` Ming Lei
@ 2022-04-12  0:28                         ` Damien Le Moal
  2022-04-12  1:02                           ` Ming Lei
  0 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2022-04-12  0:28 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/12/22 09:09, Ming Lei wrote:
> On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
>> On 4/11/22 23:18, Ming Lei wrote:
>>>>>> This fixes the issue:
>>>>>>
>>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>>>> --- a/drivers/md/dm.c
>>>>>> +++ b/drivers/md/dm.c
>>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>>>> *md, struct bio *bio)
>>>>>>         io->status = 0;
>>>>>>         atomic_set(&io->io_count, 1);
>>>>>>         this_cpu_inc(*md->pending_io);
>>>>>> -       io->orig_bio = NULL;
>>>>>> +       io->orig_bio = bio;
>>>>>>         io->md = md;
>>>>>>         io->map_task = current;
>>>>>>         spin_lock_init(&io->lock);
>>>>>>
>>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>>>
>>>>> Looks it is one recent regression since:
>>>>>
>>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>>>
>>>> Yep, saw that. Problem is, I really do not understand that change setting
>>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>>>> is done on each fragment of the orig_bio instead of once for the entire
>>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>>>> my change is correct. Thoughts ?
>>>
>>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
>>> after ->map() looks ugly & tricky, and the following change should avoid the
>>> issue, meantime simplify dm accounting a bit:
>>>
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index 3c5fad7c4ee6..f1fe83113608 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>>>  	io->status = 0;
>>>  	atomic_set(&io->io_count, 1);
>>>  	this_cpu_inc(*md->pending_io);
>>> -	io->orig_bio = NULL;
>>> +	io->orig_bio = bio;
>>>  	io->md = md;
>>>  	io->map_task = current;
>>>  	spin_lock_init(&io->lock);
>>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>>>  	 * Account io->origin_bio to DM dev on behalf of target
>>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>>>  	 */
>>> -	if (io->map_task == current) {
>>> +	if (io->map_task == current)
>>>  		/* Still in target's map function */
>>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
>>> -	} else {
>>> -		/*
>>> -		 * Called by another thread, managed by DM target,
>>> -		 * wait for dm_split_and_process_bio() to store
>>> -		 * io->orig_bio
>>> -		 */
>>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
>>> -			msleep(1);
>>> +	else
>>
>> Curly brackets around the else here.
>>
>>>  		dm_start_io_acct(io, clone);
>>> -	}
>>>  
>>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>>>  			      tio->old_sector);
>>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>  				     struct dm_table *map, struct bio *bio)
>>>  {
>>>  	struct clone_info ci;
>>> -	struct bio *orig_bio = NULL;
>>> +	struct bio *new_bio = NULL;
>>>  	int error = 0;
>>>  
>>>  	init_clone_info(&ci, md, map, bio);
>>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>  	if (error || !ci.sector_count)
>>>  		goto out;
>>>  
>>> -	/*
>>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
>>> -	 * *after* bios already submitted have been completely processed.
>>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
>>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
>>> -	 * completion handling.
>>> -	 */
>>
>> This comment should remain with some adjustment.
> 
> Fine, just felt the approach is very straightforward.
> 
>>
>>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
>>> -			     GFP_NOIO, &md->queue->bio_split);
>>> -	bio_chain(orig_bio, bio);
>>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
>>> -	submit_bio_noacct(bio);
>>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
>>> +			&md->queue->bio_split);
>>
>> Why not keep using bio_split() ?
> 
> With bio_split(), 'bio' actually tracks the remainder, and the returned
> 'orig_bio' tracks the part for current target io, so ->orig_bio has to
> be updated in this way.
> 
> With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
> changed, and assigning it in alloc_io() works perfectly.

OK. Got it.

>>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
>>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
>>> +	bio_chain(new_bio, bio);
>>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
>>> +	submit_bio_noacct(new_bio);
>>>  out:
>>> -	if (!orig_bio)
>>> -		orig_bio = bio;
>>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>>>  		dm_start_io_acct(ci.io, NULL);
>>
>> I tested this and it works. Need to check the accounting though.
>> And I agree this is a lot cleaner :)
> 
> BTW, the cloned bio for split is just for accounting purpose, if
> ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
> the cloned bio can be avoided, but code may become not readable as
> before.

The BIO op would be needed too for remapping zone append to regular writes
when zone append emulation is enabled. That is actually why
dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
since the clone op may not be the same.

So I think this fix+cleanup as is is good for now. Will you send a proper
patch ?

> 
> 
> Thanks,
> Ming
> 


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-12  0:28                         ` Damien Le Moal
@ 2022-04-12  1:02                           ` Ming Lei
  2022-04-12  1:17                             ` Damien Le Moal
  0 siblings, 1 reply; 21+ messages in thread
From: Ming Lei @ 2022-04-12  1:02 UTC (permalink / raw)
  To: Damien Le Moal; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote:
> On 4/12/22 09:09, Ming Lei wrote:
> > On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
> >> On 4/11/22 23:18, Ming Lei wrote:
> >>>>>> This fixes the issue:
> >>>>>>
> >>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
> >>>>>> --- a/drivers/md/dm.c
> >>>>>> +++ b/drivers/md/dm.c
> >>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
> >>>>>> *md, struct bio *bio)
> >>>>>>         io->status = 0;
> >>>>>>         atomic_set(&io->io_count, 1);
> >>>>>>         this_cpu_inc(*md->pending_io);
> >>>>>> -       io->orig_bio = NULL;
> >>>>>> +       io->orig_bio = bio;
> >>>>>>         io->md = md;
> >>>>>>         io->map_task = current;
> >>>>>>         spin_lock_init(&io->lock);
> >>>>>>
> >>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
> >>>>>> However, this change may be messing up the bio accounting. Need to check that.
> >>>>>
> >>>>> Looks it is one recent regression since:
> >>>>>
> >>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
> >>>>
> >>>> Yep, saw that. Problem is, I really do not understand that change setting
> >>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
> >>>> is done on each fragment of the orig_bio instead of once for the entire
> >>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
> >>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
> >>>> my change is correct. Thoughts ?
> >>>
> >>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
> >>> after ->map() looks ugly & tricky, and the following change should avoid the
> >>> issue, meantime simplify dm accounting a bit:
> >>>
> >>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> >>> index 3c5fad7c4ee6..f1fe83113608 100644
> >>> --- a/drivers/md/dm.c
> >>> +++ b/drivers/md/dm.c
> >>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
> >>>  	io->status = 0;
> >>>  	atomic_set(&io->io_count, 1);
> >>>  	this_cpu_inc(*md->pending_io);
> >>> -	io->orig_bio = NULL;
> >>> +	io->orig_bio = bio;
> >>>  	io->md = md;
> >>>  	io->map_task = current;
> >>>  	spin_lock_init(&io->lock);
> >>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
> >>>  	 * Account io->origin_bio to DM dev on behalf of target
> >>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
> >>>  	 */
> >>> -	if (io->map_task == current) {
> >>> +	if (io->map_task == current)
> >>>  		/* Still in target's map function */
> >>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
> >>> -	} else {
> >>> -		/*
> >>> -		 * Called by another thread, managed by DM target,
> >>> -		 * wait for dm_split_and_process_bio() to store
> >>> -		 * io->orig_bio
> >>> -		 */
> >>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
> >>> -			msleep(1);
> >>> +	else
> >>
> >> Curly brackets around the else here.
> >>
> >>>  		dm_start_io_acct(io, clone);
> >>> -	}
> >>>  
> >>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
> >>>  			      tio->old_sector);
> >>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >>>  				     struct dm_table *map, struct bio *bio)
> >>>  {
> >>>  	struct clone_info ci;
> >>> -	struct bio *orig_bio = NULL;
> >>> +	struct bio *new_bio = NULL;
> >>>  	int error = 0;
> >>>  
> >>>  	init_clone_info(&ci, md, map, bio);
> >>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
> >>>  	if (error || !ci.sector_count)
> >>>  		goto out;
> >>>  
> >>> -	/*
> >>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
> >>> -	 * *after* bios already submitted have been completely processed.
> >>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
> >>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
> >>> -	 * completion handling.
> >>> -	 */
> >>
> >> This comment should remain with some adjustment.
> > 
> > Fine, just felt the approach is very straightforward.
> > 
> >>
> >>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
> >>> -			     GFP_NOIO, &md->queue->bio_split);
> >>> -	bio_chain(orig_bio, bio);
> >>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
> >>> -	submit_bio_noacct(bio);
> >>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
> >>> +			&md->queue->bio_split);
> >>
> >> Why not keep using bio_split() ?
> > 
> > With bio_split(), 'bio' actually tracks the remainder, and the returned
> > 'orig_bio' tracks the part for current target io, so ->orig_bio has to
> > be updated in this way.
> > 
> > With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
> > changed, and assigning it in alloc_io() works perfectly.
> 
> OK. Got it.
> 
> >>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
> >>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
> >>> +	bio_chain(new_bio, bio);
> >>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
> >>> +	submit_bio_noacct(new_bio);
> >>>  out:
> >>> -	if (!orig_bio)
> >>> -		orig_bio = bio;
> >>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
> >>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
> >>>  		dm_start_io_acct(ci.io, NULL);
> >>
> >> I tested this and it works. Need to check the accounting though.
> >> And I agree this is a lot cleaner :)
> > 
> > BTW, the cloned bio for split is just for accounting purpose, if
> > ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
> > the cloned bio can be avoided, but code may become not readable as
> > before.
> 
> The BIO op would be needed too for remapping zone append to regular writes
> when zone append emulation is enabled. That is actually why
> dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
> since the clone op may not be the same.

The clone inherits the original bio's op. I meant to always store the
real bio from the FS as ->orig_bio in alloc_io(), and simply trim it in case
of split & re-submission, while storing orig_bio's ->bi_sector & ->bi_size in
'dm_io' for accounting purposes. But it looks a bit complicated and messy.
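
(Just to illustrate that alternative, a rough userspace mock; the extra
snapshot fields are hypothetical, not something dm core has today:)

#include <stdio.h>

struct bio { unsigned long sector; unsigned int nr_sectors; };

/* hypothetical snapshot kept in 'struct dm_io' so that accounting no
 * longer needs a cloned bio */
struct dm_io_acct { unsigned long sector; unsigned int nr_sectors; };

int main(void)
{
	struct bio orig = { 100, 32 };
	struct dm_io_acct acct = { orig.sector, orig.nr_sectors };
	unsigned int remaining = 8;

	/* the real bio is trimmed for the current target io ... */
	orig.nr_sectors -= remaining;

	/* ... while accounting still sees the range captured at alloc time */
	printf("account %lu+%u, mapped here %lu+%u\n",
	       acct.sector, acct.nr_sectors, orig.sector, orig.nr_sectors);
	return 0;
}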

Wrt. dm zone, I'd suggest double checking everywhere the orig bio is used,
since only part of the orig bio may be mapped in case of splitting, which is
actually post-split.

If a bio split won't happen for dm-zone, your patch is fine. But I guess
that isn't true for dm-zone.

> 
> So I think this fix+cleanup as is is good for now. Will you send a proper
> patch ?

Not yet; the fix+cleanup patch actually breaks the recent dm io polling, and I
haven't figured out a solution yet.


Thanks,
Ming
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio
  2022-04-12  1:02                           ` Ming Lei
@ 2022-04-12  1:17                             ` Damien Le Moal
  0 siblings, 0 replies; 21+ messages in thread
From: Damien Le Moal @ 2022-04-12  1:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: dm-devel, Damien Le Moal, Mike Snitzer

On 4/12/22 10:02, Ming Lei wrote:
> On Tue, Apr 12, 2022 at 09:28:46AM +0900, Damien Le Moal wrote:
>> On 4/12/22 09:09, Ming Lei wrote:
>>> On Tue, Apr 12, 2022 at 08:33:04AM +0900, Damien Le Moal wrote:
>>>> On 4/11/22 23:18, Ming Lei wrote:
>>>>>>>> This fixes the issue:
>>>>>>>>
>>>>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>>>>> index 3c5fad7c4ee6..3dd6735450c5 100644
>>>>>>>> --- a/drivers/md/dm.c
>>>>>>>> +++ b/drivers/md/dm.c
>>>>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device
>>>>>>>> *md, struct bio *bio)
>>>>>>>>         io->status = 0;
>>>>>>>>         atomic_set(&io->io_count, 1);
>>>>>>>>         this_cpu_inc(*md->pending_io);
>>>>>>>> -       io->orig_bio = NULL;
>>>>>>>> +       io->orig_bio = bio;
>>>>>>>>         io->md = md;
>>>>>>>>         io->map_task = current;
>>>>>>>>         spin_lock_init(&io->lock);
>>>>>>>>
>>>>>>>> Otherwise, the dm-zone.c code sees a NULL orig_bio.
>>>>>>>> However, this change may be messing up the bio accounting. Need to check that.
>>>>>>>
>>>>>>> Looks it is one recent regression since:
>>>>>>>
>>>>>>> commit 0fbb4d93b38b ("dm: add dm_submit_bio_remap interface")
>>>>>>
>>>>>> Yep, saw that. Problem is, I really do not understand that change setting
>>>>>> io->orig_bio *after* __map_bio() is called. It seems that the accounting
>>>>>> is done on each fragment of the orig_bio instead of once for the entire
>>>>>> BIO... So my "fix" above seems wrong. Apart from passing along orig_bio as
>>>>>> an argument to  __map_bio() from __split_and_process_bio(), I do not think
>>>>>> my change is correct. Thoughts ?
>>>>>
>>>>> Frankly speaking, both changing ->orig_bio after split and setting ->orig_bio
>>>>> after ->map() looks ugly & tricky, and the following change should avoid the
>>>>> issue, meantime simplify dm accounting a bit:
>>>>>
>>>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>>>> index 3c5fad7c4ee6..f1fe83113608 100644
>>>>> --- a/drivers/md/dm.c
>>>>> +++ b/drivers/md/dm.c
>>>>> @@ -581,7 +581,7 @@ static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
>>>>>  	io->status = 0;
>>>>>  	atomic_set(&io->io_count, 1);
>>>>>  	this_cpu_inc(*md->pending_io);
>>>>> -	io->orig_bio = NULL;
>>>>> +	io->orig_bio = bio;
>>>>>  	io->md = md;
>>>>>  	io->map_task = current;
>>>>>  	spin_lock_init(&io->lock);
>>>>> @@ -1223,19 +1223,11 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone)
>>>>>  	 * Account io->origin_bio to DM dev on behalf of target
>>>>>  	 * that took ownership of IO with DM_MAPIO_SUBMITTED.
>>>>>  	 */
>>>>> -	if (io->map_task == current) {
>>>>> +	if (io->map_task == current)
>>>>>  		/* Still in target's map function */
>>>>>  		dm_io_set_flag(io, DM_IO_START_ACCT);
>>>>> -	} else {
>>>>> -		/*
>>>>> -		 * Called by another thread, managed by DM target,
>>>>> -		 * wait for dm_split_and_process_bio() to store
>>>>> -		 * io->orig_bio
>>>>> -		 */
>>>>> -		while (unlikely(!smp_load_acquire(&io->orig_bio)))
>>>>> -			msleep(1);
>>>>> +	else
>>>>
>>>> Curly brackets around the else here.
>>>>
>>>>>  		dm_start_io_acct(io, clone);
>>>>> -	}
>>>>>  
>>>>>  	__dm_submit_bio_remap(tgt_clone, disk_devt(io->md->disk),
>>>>>  			      tio->old_sector);
>>>>> @@ -1562,7 +1554,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>>>  				     struct dm_table *map, struct bio *bio)
>>>>>  {
>>>>>  	struct clone_info ci;
>>>>> -	struct bio *orig_bio = NULL;
>>>>> +	struct bio *new_bio = NULL;
>>>>>  	int error = 0;
>>>>>  
>>>>>  	init_clone_info(&ci, md, map, bio);
>>>>> @@ -1578,22 +1570,14 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>>>>>  	if (error || !ci.sector_count)
>>>>>  		goto out;
>>>>>  
>>>>> -	/*
>>>>> -	 * Remainder must be passed to submit_bio_noacct() so it gets handled
>>>>> -	 * *after* bios already submitted have been completely processed.
>>>>> -	 * We take a clone of the original to store in ci.io->orig_bio to be
>>>>> -	 * used by dm_end_io_acct() and for dm_io_complete() to use for
>>>>> -	 * completion handling.
>>>>> -	 */
>>>>
>>>> This comment should remain with some adjustment.
>>>
>>> Fine, just felt the approach is very straightforward.
>>>
>>>>
>>>>> -	orig_bio = bio_split(bio, bio_sectors(bio) - ci.sector_count,
>>>>> -			     GFP_NOIO, &md->queue->bio_split);
>>>>> -	bio_chain(orig_bio, bio);
>>>>> -	trace_block_split(orig_bio, bio->bi_iter.bi_sector);
>>>>> -	submit_bio_noacct(bio);
>>>>> +	new_bio = bio_alloc_clone(bio->bi_bdev, bio, GFP_NOIO,
>>>>> +			&md->queue->bio_split);
>>>>
>>>> Why not keep using bio_split() ?
>>>
>>> With bio_split(), 'bio' actually tracks the remainder, and the returned
>>> 'orig_bio' tracks the part for current target io, so ->orig_bio has to
>>> be updated in this way.
>>>
>>> With bio_alloc_clone() and bio_trim(), ->orig_bio needn't to be
>>> changed, and assigning it in alloc_io() works perfectly.
>>
>> OK. Got it.
>>
>>>>> +	bio_trim(new_bio, bio_sectors(bio) - ci.sector_count, ci.sector_count);
>>>>> +	bio_trim(bio, 0, bio_sectors(bio) - ci.sector_count);
>>>>> +	bio_chain(new_bio, bio);
>>>>> +	trace_block_split(new_bio, new_bio->bi_iter.bi_sector);
>>>>> +	submit_bio_noacct(new_bio);
>>>>>  out:
>>>>> -	if (!orig_bio)
>>>>> -		orig_bio = bio;
>>>>> -	smp_store_release(&ci.io->orig_bio, orig_bio);
>>>>>  	if (dm_io_flagged(ci.io, DM_IO_START_ACCT))
>>>>>  		dm_start_io_acct(ci.io, NULL);
>>>>
>>>> I tested this and it works. Need to check the accounting though.
>>>> And I agree this is a lot cleaner :)
>>>
>>> BTW, the cloned bio for split is just for accounting purpose, if
>>> ->bi_iter.bi_sector & ->bi_iter.bi_size can be stored in 'struct dm_io',
>>> the cloned bio can be avoided, but code may become not readable as
>>> before.
>>
>> The BIO op would be needed too for remapping zone append to regular writes
>> when zone append emulation is enabled. That is actually why
>> dm_zone_map_bio() needs the original BIO, to have the unmodified BIO op
>> since the clone op may not be the same.
> 
> clone inherits the original bio's op. I meant to store the
> real bio from FS as ->orig_bio always in alloc_io(), and simply trim it in case
> of split & re-submission, meantime store orig_bio's ->bi_sector & ->size into
> 'dm_io' for account purpose. But it looks a bit complicated and messy.

Indeed.

> Wrt. dm zone, I'd suggest to double check anywhere orig bio is used,
> since only part of orig bio may be mapped in case of splitting, which is
> actually post-split.
> 
> If bio split won't happen for dm-zone, your patch is fine. But I guess
> it isn't true for dm-zone.

Zone append operations can never be split. That is checked before a user
bio reaches DM __map_bio()/dm_zone_map_bio(). So I think that there is no
issue with the clone+trim change since that will never be done for a zone
append operation.

In the case of emulated zone append, regular writes will still go through
the dm_zone_map_bio() function, and these writes may be split. But then
the orig_bio does not really matter in that case and the zone write
pointer tracking relies on the clone sector & size, not on the original BIO.

I will run more tests with this patch. But so far, this seems all OK.

>> So I think this fix+cleanup as is is good for now. Will you send a proper
>> patch ?
> 
> Not yet, the fix+cleanup patch actually breaks recent dm io polling, and I
> don't figure out one solution yet.

OK. Let me know if you need help testing.


-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-04-12  1:17 UTC | newest]

Thread overview: 21+ messages
2022-04-08 17:12 [dm-devel] [PATCH 0/3] dm: improvement on io referencing and io poll Ming Lei
2022-04-08 17:12 ` [dm-devel] [PATCH 1/3] dm: don't grab target io reference in dm_zone_map_bio Ming Lei
2022-04-11  0:18   ` Damien Le Moal
2022-04-11  0:36     ` Ming Lei
2022-04-11  0:50       ` Damien Le Moal
2022-04-11  1:04         ` Ming Lei
2022-04-11  1:08           ` Damien Le Moal
2022-04-11  2:19           ` Damien Le Moal
2022-04-11  2:55             ` Damien Le Moal
2022-04-11  3:10               ` Damien Le Moal
2022-04-11  7:34               ` Ming Lei
2022-04-11  7:42                 ` Damien Le Moal
2022-04-11 14:18                   ` Ming Lei
2022-04-11 23:33                     ` Damien Le Moal
2022-04-12  0:09                       ` Ming Lei
2022-04-12  0:28                         ` Damien Le Moal
2022-04-12  1:02                           ` Ming Lei
2022-04-12  1:17                             ` Damien Le Moal
2022-04-11  9:40   ` Damien Le Moal
2022-04-08 17:12 ` [dm-devel] [PATCH 2/3] dm: improve target io referencing Ming Lei
2022-04-08 17:12 ` [dm-devel] [PATCH 3/3] dm: put all polled io into one single list Ming Lei
