* [PATCH RFC 0/7] dm: add support of iopoll
@ 2020-12-23 11:26 Jeffle Xu
  2020-12-23 11:26 ` [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h Jeffle Xu
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

This patch set adds support for iopoll on dm devices.

Several months ago, I sent a patch set adding support for iopoll for
dm devices [1]. The old patch set implemented this by unconditionally
polling all polling-mode hardware queues of all underlying target
devices, no matter which hardware queue the bio was actually enqueued
into.

Ming Lei pointed out that this implementation may have a performance
issue. Mike Snitzer also discussed the design issues and helped a lot.
At that time, we agreed that this feature should be implemented on the
basis of split-bio tracking, since the bio submitted to the dm device
could be split into multiple bios submitted to the underlying target
devices.

This patch set actually implements the split-bio tracking part.
Regrettably this implementation is quite coarse and preliminary. Quite
a lot of code refactoring is introduced in the block core, since device
mapper also calls methods provided by the block core to split bios. The
new fields are added directly into the 'struct bio' structure. So this
is just an RFC version, and there may be quite a lot of design issues
to be confronted. I implemented a version quickly and would like to
share my thoughts and the problems at hand.

This implementation works but has poor performance. Awkwardly, the
performance of direct 8k randread decreases by ~20% on an 8k-striped
dm-stripe device.

The first 5 patches prepare for the introduction of iopoll for dm
devices. Patch 6 is the core part and implements a mechanism for
tracking split bios for bio-based devices. Patch 7 just enables this
feature.

[1] https://patchwork.kernel.org/project/linux-block/cover/20201020065420.124885-1-jefflexu@linux.alibaba.com/


[Design Notes]

- recursive way or non-recursive way?

The core of the split-bio tracking mechanism is that a list is
maintained in the top bio (the original bio submitted to the dm
device), which holds all valid cookies of all split bios.

This is actually a non-recursive way of implementing the tracking
mechanism. On the contrary, the recursive way would maintain the split
bios at each dm layer. DM devices can be built into a device stack. For
example, consider the following device stack:

```
            dm0(bio 0)
dm1(bio 1)             dm2(bio 2)
nvme0(bio 3)           nvme1(bio 4)

```

The non-recursive way is that bio 3/4 are directly maintained as a
list beneath bio 0. The recursive way is that bio 3 is maintained as a
list beneath bio 1 and bio 4 is maintained as a list beneath bio 2,
while bio 1/2 are maintained as a list beneath bio 0.

The reason why I chose the non-recursive way is that we would need to
call blk_bio_poll() or something similar recursively if it were
implemented the recursive way, and I worry this would use up the stack
space if the device stack is too deep. After all, bio_list is used
precisely to prevent dm devices from using up stack space when
submitting bios.
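
For illustration, with the fields that patch 6 adds to struct bio, the
tracking state for the stack above would roughly look like this (a
simplified sketch, not the exact code):

```
/* fields added to struct bio by patch 6 (simplified) */
struct bio {
	...
	struct bio		*bi_root;	/* top bio, i.e. bio 0 */
	struct list_head	bi_plist;	/* non-empty in the top bio only */
	struct list_head	bi_pnode;	/* links a split bio into bi_plist */
	struct spinlock		bi_plock;
	blk_qc_t		bi_cookie;
	...
};

/* non-recursive: the leaf bios hang directly off the top bio */
bio3->bi_root = bio4->bi_root = bio0;
list_add_tail(&bio3->bi_pnode, &bio0->bi_plist);
list_add_tail(&bio4->bi_pnode, &bio0->bi_plist);
```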


- why embed new fields into struct bio directly?

Mike Snitzer once suggested that the newly added fields could be
integrated into the clone bio allocated by the dm subsystem through
bio_set. There are three reasons why I embed these fields directly into
struct bio:

1. The implementation difference between DM and MD
The DM subsystem indeed allocates clone bios itself, in which case we
could allocate extra per-bio space for specific usage. However, the MD
subsystem doesn't allocate an extra clone bio, and just uses the bio
structure originally submitted to the MD device as the bio structure
submitted to the underlying target device. (At least raid0 works this
way.)

2. In the previously mentioned non-recursive way of iopoll, a list
containing all valid cookies must be maintained. For convenience, I
just put this list in the top bio (the original bio structure
submitted to the dm device). This original bio structure is allocated
by the upper layer, e.g. the filesystem, and is out of the control of
the DM subsystem. (Of course we could resolve this problem technically,
e.g. by putting these newly added fields into the corresponding dm_io
structure of the top bio; see the sketch below.)

3. As a quick implementation, I just put these fields into struct bio
directly. Obviously more design issues need to be considered when it
comes to the formal version.
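
For reference, the dm_io alternative mentioned in reason 2 would look
roughly like this (hypothetical, not implemented in this series):

```
/* hypothetical: keep the tracking state in dm's per-bio data instead
 * of in struct bio itself */
struct dm_io {
	struct mapped_device	*md;
	...
	struct list_head	plist;	/* cookies of split bios */
	struct spinlock		plock;
	...
};
```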


Jeffle Xu (7):
  block: move definition of blk_qc_t to types.h
  block: add helper function fetching gendisk from queue
  block: add iopoll method for non-mq device
  block: define blk_qc_t as uintptr_t
  dm: always return BLK_QC_T_NONE for bio-based device
  block: track cookies of split bios for bio-based device
  dm: add support for IO polling

 block/bio.c                  |   8 ++
 block/blk-core.c             | 163 ++++++++++++++++++++++++++++++++++-
 block/blk-mq.c               |  70 ++-------------
 drivers/md/dm-table.c        |  28 ++++++
 drivers/md/dm.c              |  27 +++---
 include/linux/blk-mq.h       |   3 +
 include/linux/blk_types.h    |  41 ++++++++-
 include/linux/blkdev.h       |   3 +
 include/linux/fs.h           |   2 +-
 include/linux/types.h        |   3 +
 include/trace/events/kyber.h |   6 +-
 11 files changed, 270 insertions(+), 84 deletions(-)

-- 
2.27.0



* [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 19:04   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 2/7] block: add helper function fetching gendisk from queue Jeffle Xu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
the encapsulation.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 include/linux/blk_types.h | 2 +-
 include/linux/fs.h        | 2 +-
 include/linux/types.h     | 3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 866f74261b3b..2e05244fc16d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -532,7 +532,7 @@ static inline int op_stat_group(unsigned int op)
 	return op_is_write(op);
 }
 
-typedef unsigned int blk_qc_t;
+/* Macros for blk_qc_t */
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ad4cf1bae586..58db714c4834 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -330,7 +330,7 @@ struct kiocb {
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
 	union {
-		unsigned int		ki_cookie; /* for ->iopoll */
+		blk_qc_t		ki_cookie; /* for ->iopoll */
 		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
 	};
 
diff --git a/include/linux/types.h b/include/linux/types.h
index a147977602b5..da5ca7e1bea9 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -125,6 +125,9 @@ typedef s64			int64_t;
 typedef u64 sector_t;
 typedef u64 blkcnt_t;
 
+/* cookie used for IO polling */
+typedef unsigned int blk_qc_t;
+
 /*
  * The type of an index into the pagecache.
  */
-- 
2.27.0



* [PATCH RFC 2/7] block: add helper function fetching gendisk from queue
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
  2020-12-23 11:26 ` [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 20:31   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 3/7] block: add iopoll method for non-mq device Jeffle Xu
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

Sometimes we need to get the corresponding gendisk from a
request_queue.

One such use case is that the block device driver may have stored the
same private data both in queue->queuedata and in gendisk->private_data,
while nowadays gendisk->private_data is preferred in such cases, e.g.
commit c4a59c4e5db3 ("dm: stop using ->queuedata"). So if only the
request_queue is given, we need to get the corresponding gendisk from
the queue to retrieve the private data stored in the gendisk.
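
Patch 3 of this series uses this helper in the polling path, e.g.:

	struct gendisk *disk = queue_to_disk(q);

	if (disk->fops->iopoll)
		ret = disk->fops->iopoll(q, cookie);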

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 070de09425ad..2303d06a5a82 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -691,6 +691,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.27.0



* [PATCH RFC 3/7] block: add iopoll method for non-mq device
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
  2020-12-23 11:26 ` [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h Jeffle Xu
  2020-12-23 11:26 ` [PATCH RFC 2/7] block: add helper function fetching gendisk from queue Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 21:47   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t Jeffle Xu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") for supporting non-mq queues such as
nvme multipath, but was removed in commit 529262d56dbe ("block: remove
->poll_fn").

To add support for IO polling on non-mq devices, this method needs to
be brought back. Since commit c62b37d96b6e ("block: move
->make_request_fn to struct block_device_operations") has moved all
callbacks into struct block_device_operations in gendisk, we add the
new method, named ->iopoll, in block_device_operations as well.
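
A bio-based driver can then hook into polling like below (a sketch
with placeholder 'foo' names; dm does exactly this in the last patch
of this series):

	static const struct block_device_operations foo_blk_dops = {
		.submit_bio	= foo_submit_bio,
		.iopoll		= foo_iopoll,	/* called from blk_poll() */
	};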

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-core.c       | 79 ++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq.c         | 70 +++++--------------------------------
 include/linux/blk-mq.h |  3 ++
 include/linux/blkdev.h |  1 +
 4 files changed, 92 insertions(+), 61 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 96e5fcd7f071..2f5c51ce32e3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
+{
+	struct blk_mq_hw_ctx *hctx;
+
+	/* TODO: bio-based device doesn't support hybrid poll. */
+	if (!queue_is_mq(q))
+		return false;
+
+	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+	if (blk_mq_poll_hybrid(q, hctx, cookie))
+		return true;
+
+	hctx->poll_considered++;
+	return false;
+}
+
+/**
+ * blk_poll - poll for IO completions
+ * @q:  the queue
+ * @cookie: cookie passed back at IO submission time
+ * @spin: whether to spin for completions
+ *
+ * Description:
+ *    Poll for completions on the passed in queue. Returns number of
+ *    completed entries found. If @spin is true, then blk_poll will continue
+ *    looping until at least one completion is found, unless the task is
+ *    otherwise marked running (or we need to reschedule).
+ */
+int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	long state;
+
+	if (!blk_qc_t_valid(cookie) ||
+	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+		return 0;
+
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+
+	/*
+	 * If we sleep, have the caller restart the poll loop to reset
+	 * the state. Like for the other success return cases, the
+	 * caller is responsible for checking if the IO completed. If
+	 * the IO isn't complete, we'll get called again and will go
+	 * straight to the busy poll loop. If specified not to spin,
+	 * we also should not sleep.
+	 */
+	if (spin && blk_poll_hybrid(q, cookie))
+		return 1;
+
+	state = current->state;
+	do {
+		int ret;
+		struct gendisk *disk = queue_to_disk(q);
+
+		if (disk->fops->iopoll)
+			ret = disk->fops->iopoll(q, cookie);
+		else
+			ret = blk_mq_poll(q, cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for the new queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b09ce00cc6af..85258958e9f1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3818,8 +3818,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 	return true;
 }
 
-static bool blk_mq_poll_hybrid(struct request_queue *q,
-			       struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
+bool blk_mq_poll_hybrid(struct request_queue *q,
+			struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
 {
 	struct request *rq;
 
@@ -3843,72 +3843,20 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
-/**
- * blk_poll - poll for IO completions
- * @q:  the queue
- * @cookie: cookie passed back at IO submission time
- * @spin: whether to spin for completions
- *
- * Description:
- *    Poll for completions on the passed in queue. Returns number of
- *    completed entries found. If @spin is true, then blk_poll will continue
- *    looping until at least one completion is found, unless the task is
- *    otherwise marked running (or we need to reschedule).
- */
-int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 {
+	int ret;
 	struct blk_mq_hw_ctx *hctx;
-	long state;
-
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
-		return 0;
-
-	if (current->plug)
-		blk_flush_plug_list(current->plug, false);
 
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
-	/*
-	 * If we sleep, have the caller restart the poll loop to reset
-	 * the state. Like for the other success return cases, the
-	 * caller is responsible for checking if the IO completed. If
-	 * the IO isn't complete, we'll get called again and will go
-	 * straight to the busy poll loop. If specified not to spin,
-	 * we also should not sleep.
-	 */
-	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
-		return 1;
-
-	hctx->poll_considered++;
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
 
-	state = current->state;
-	do {
-		int ret;
-
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
-		if (ret > 0) {
-			hctx->poll_success++;
-			__set_current_state(TASK_RUNNING);
-			return ret;
-		}
-
-		if (signal_pending_state(state, current))
-			__set_current_state(TASK_RUNNING);
-
-		if (current->state == TASK_RUNNING)
-			return 1;
-		if (ret < 0 || !spin)
-			break;
-		cpu_relax();
-	} while (!need_resched());
-
-	__set_current_state(TASK_RUNNING);
-	return 0;
+	return ret;
 }
-EXPORT_SYMBOL_GPL(blk_poll);
 
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 47b021952ac7..032e08ecd42e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -607,6 +607,9 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
+bool blk_mq_poll_hybrid(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2303d06a5a82..e8965879eb90 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1845,6 +1845,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
+	int (*iopoll)(struct request_queue *q, blk_qc_t cookie);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
-- 
2.27.0



* [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
                   ` (2 preceding siblings ...)
  2020-12-23 11:26 ` [PATCH RFC 3/7] block: add iopoll method for non-mq device Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 21:52   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

To support iopoll for bio-based devices, the returned cookie is
actually a pointer to an implementation-specific object, e.g. an object
maintaining all split bios.

In such a case, blk_qc_t should be large enough to contain a pointer,
so uintptr_t is used here. Since uintptr_t is an integer type in
essence, there's no need for type casting in the original mq path,
while type casting is indeed needed in the bio-based routine.
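
Concretely, the casts only show up in the bio-based path (simplified
from patch 6 of this series):

	/* submission side: hand the top bio back as the cookie */
	return (blk_qc_t)root;

	/* polling side: recover the top bio from the cookie */
	struct bio *root = (struct bio *)cookie;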

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 include/linux/types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index da5ca7e1bea9..f6301014a459 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -126,7 +126,7 @@ typedef u64 sector_t;
 typedef u64 blkcnt_t;
 
 /* cookie used for IO polling */
-typedef unsigned int blk_qc_t;
+typedef uintptr_t blk_qc_t;
 
 /*
  * The type of an index into the pagecache.
-- 
2.27.0



* [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
                   ` (3 preceding siblings ...)
  2020-12-23 11:26 ` [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 21:54   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 6/7] block: track cookies of split bios " Jeffle Xu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

Currently the returned cookie of bio-based device is not used at all.

In the following patches, bio-based device will actually return a
pointer to a specific object as the returned cookie.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 drivers/md/dm.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5b2f371ec4bb..03c2b867acaa 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1252,14 +1252,13 @@ void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
 }
 EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
 
-static blk_qc_t __map_bio(struct dm_target_io *tio)
+static void __map_bio(struct dm_target_io *tio)
 {
 	int r;
 	sector_t sector;
 	struct bio *clone = &tio->clone;
 	struct dm_io *io = tio->io;
 	struct dm_target *ti = tio->ti;
-	blk_qc_t ret = BLK_QC_T_NONE;
 
 	clone->bi_end_io = clone_endio;
 
@@ -1278,7 +1277,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 	case DM_MAPIO_REMAPPED:
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
-		ret = submit_bio_noacct(clone);
+		submit_bio_noacct(clone);
 		break;
 	case DM_MAPIO_KILL:
 		free_tio(tio);
@@ -1292,8 +1291,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		DMWARN("unimplemented target map return value: %d", r);
 		BUG();
 	}
-
-	return ret;
 }
 
 static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
@@ -1380,7 +1377,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
 	}
 }
 
-static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
+static void __clone_and_map_simple_bio(struct clone_info *ci,
 					   struct dm_target_io *tio, unsigned *len)
 {
 	struct bio *clone = &tio->clone;
@@ -1391,7 +1388,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
 	if (len)
 		bio_setup_sector(clone, ci->sector, *len);
 
-	return __map_bio(tio);
+	__map_bio(tio);
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
@@ -1405,7 +1402,7 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 
 	while ((bio = bio_list_pop(&blist))) {
 		tio = container_of(bio, struct dm_target_io, clone);
-		(void) __clone_and_map_simple_bio(ci, tio, len);
+		__clone_and_map_simple_bio(ci, tio, len);
 	}
 }
 
@@ -1450,7 +1447,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 		free_tio(tio);
 		return r;
 	}
-	(void) __map_bio(tio);
+	__map_bio(tio);
 
 	return 0;
 }
@@ -1565,11 +1562,10 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
 /*
  * Entry point to split a bio into clones and submit them to the targets.
  */
-static blk_qc_t __split_and_process_bio(struct mapped_device *md,
+static void __split_and_process_bio(struct mapped_device *md,
 					struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1613,7 +1609,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 				bio_chain(b, bio);
 				trace_block_split(b, bio->bi_iter.bi_sector);
-				ret = submit_bio_noacct(bio);
+				submit_bio_noacct(bio);
 				break;
 			}
 		}
@@ -1621,13 +1617,11 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
 }
 
 static blk_qc_t dm_submit_bio(struct bio *bio)
 {
 	struct mapped_device *md = bio->bi_disk->private_data;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int srcu_idx;
 	struct dm_table *map;
 
@@ -1657,10 +1651,10 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	if (is_abnormal_io(bio))
 		blk_queue_split(&bio);
 
-	ret = __split_and_process_bio(md, map, bio);
+	__split_and_process_bio(md, map, bio);
 out:
 	dm_put_live_table(md, srcu_idx);
-	return ret;
+	return BLK_QC_T_NONE;
 }
 
 /*-----------------------------------------------------------------
-- 
2.27.0



* [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
                   ` (4 preceding siblings ...)
  2020-12-23 11:26 ` [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-07 22:18   ` Mike Snitzer
  2020-12-23 11:26 ` [PATCH RFC 7/7] dm: add support for IO polling Jeffle Xu
  2021-01-07  1:14 ` [dm-devel] [PATCH RFC 0/7] dm: add support of iopoll JeffleXu
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

This is actually the core part of supporting iopoll for bio-based
devices.

A list is maintained in the top bio (the original bio submitted to the
dm device), which is used to maintain all valid cookies of split bios.
The IO polling routine will actually iterate this list and poll on the
corresponding hardware queues of the underlying mq devices.
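
The cookie returned at submission time flows back into this list walk
through the generic polling path, roughly as follows (a sketch of the
call chain, not exact code):

	cookie = submit_bio(bio);	/* bio-based: (blk_qc_t)top_bio */
	...
	blk_poll(q, cookie, spin)
	  -> disk->fops->iopoll(q, cookie)	/* blk_bio_poll() for dm */
	    -> blk_mq_poll(q, split->bi_cookie)	/* each tracked split bio */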

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/bio.c               |  8 ++++
 block/blk-core.c          | 84 ++++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h | 39 ++++++++++++++++++
 3 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1f2cc1fbe283..ca6d1a7ee196 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
 
 	bio->bi_io_vec = table;
 	bio->bi_max_vecs = max_vecs;
+
+	INIT_LIST_HEAD(&bio->bi_plist);
+	INIT_LIST_HEAD(&bio->bi_pnode);
+	spin_lock_init(&bio->bi_plock);
 }
 EXPORT_SYMBOL(bio_init);
 
@@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
 	bio->bi_write_hint = bio_src->bi_write_hint;
 	bio->bi_iter = bio_src->bi_iter;
 	bio->bi_io_vec = bio_src->bi_io_vec;
+	bio->bi_root = bio_src->bi_root;
 
 	bio_clone_blkg_association(bio, bio_src);
 	blkcg_bio_issue_init(bio);
@@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
 	if (bio->bi_disk)
 		rq_qos_done_bio(bio->bi_disk->queue, bio);
 
+	bio_del_poll_list(bio);
+
 	/*
 	 * Need to have a real endio function for chained bios, otherwise
 	 * various corner cases will break (like stacking block devices that
@@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
 	blk_throtl_bio_endio(bio);
 	/* release cgroup info */
 	bio_uninit(bio);
+
 	if (bio->bi_end_io)
 		bio->bi_end_io(bio);
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index 2f5c51ce32e3..5a332af01939 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
+	bool iopoll;
+	struct bio *root;
 
 	BUG_ON(bio->bi_next);
 
 	bio_list_init(&bio_list_on_stack[0]);
 	current->bio_list = bio_list_on_stack;
 
+	iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
+	iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
+
+	if (iopoll) {
+		bio->bi_root = root = bio;
+		/*
+		 * We need to pin root bio here since there's a reference from
+		 * the returned cookie. bio_get() is not enough since the whole
+		 * bio and the corresponding kiocb/dio may have already
+		 * completed and thus won't call blk_poll() at all, in which
+		 * case the pairing bio_put() in blk_bio_poll() won't be called.
+		 * The side effect of bio_inc_remaining() is that, the whole bio
+		 * won't complete until blk_poll() called.
+		 */
+		bio_inc_remaining(root);
+	}
+
 	do {
 		struct request_queue *q = bio->bi_disk->queue;
 		struct bio_list lower, same;
@@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		if (iopoll) {
+			/* See the comments of above bio_inc_remaining(). */
+			bio_inc_remaining(bio);
+			bio->bi_cookie = __submit_bio(bio);
+
+			if (blk_qc_t_valid(bio->bi_cookie))
+				bio_add_poll_list(bio);
+
+			bio_endio(bio);
+		} else {
+			ret = __submit_bio(bio);
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
 
 	current->bio_list = NULL;
-	return ret;
+
+	if (iopoll)
+		return (blk_qc_t)root;
+
+	return BLK_QC_T_NONE;
 }
 
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
@@ -1131,6 +1165,52 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	int ret = 0;
+	struct bio *bio, *root = (struct bio*)cookie;
+
+	if (list_empty(&root->bi_plist)) {
+		bio_endio(root);
+		return 1;
+	}
+
+	spin_lock(&root->bi_plock);
+	bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
+
+	while (bio) {
+		struct request_queue *q = bio->bi_disk->queue;
+		blk_qc_t cookie = bio->bi_cookie;
+
+		spin_unlock(&root->bi_plock);
+		BUG_ON(!blk_qc_t_valid(cookie));
+
+		ret += blk_mq_poll(q, cookie);
+
+		spin_lock(&root->bi_plock);
+		/*
+		 * One blk_mq_poll() call could complete multiple bios, and
+		 * thus multiple bios could be removed from the root->bi_plist
+		 * list.
+		 */
+		bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
+	}
+
+	spin_unlock(&root->bi_plock);
+
+	if (list_empty(&root->bi_plist)) {
+		bio_endio(root);
+		/*
+		 * 'ret' may be 0 here. root->bi_plist may be empty once we
+		 * acquire the list spinlock.
+		 */
+		ret = max(ret, 1);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(blk_bio_poll);
+
 static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
 {
 	struct blk_mq_hw_ctx *hctx;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 2e05244fc16d..2cf5d8f0ea34 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -277,6 +277,12 @@ struct bio {
 
 	struct bio_set		*bi_pool;
 
+	struct bio		*bi_root;	/* original bio of submit_bio() */
+	struct list_head        bi_plist;
+	struct list_head        bi_pnode;
+	struct spinlock         bi_plock;
+	blk_qc_t		bi_cookie;
+
 	/*
 	 * We can inline a number of vecs at the end of the bio, to avoid
 	 * double allocations for a small number of bio_vecs. This member
@@ -557,6 +563,39 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
 	return (cookie & BLK_QC_T_INTERNAL) != 0;
 }
 
+static inline void bio_add_poll_list(struct bio *bio)
+{
+	struct bio *root = bio->bi_root;
+
+	/*
+	 * The spin_lock() variant is enough since bios in root->bi_plist are
+	 * all enqueued into polling mode hardware queue, thus the list_del()
+	 * operation is handled only in process context.
+	 */
+	spin_lock(&root->bi_plock);
+	list_add_tail(&bio->bi_pnode, &root->bi_plist);
+	spin_unlock(&root->bi_plock);
+}
+
+static inline void bio_del_poll_list(struct bio *bio)
+{
+	struct bio *root = bio->bi_root;
+
+	/*
+	 * bios in mq routine: @bi_root is NULL, @bi_cookie is 0;
+	 * bios in bio-based routine: @bi_root is non-NULL, @bi_cookie is valid
+	 * (including 0) for those in root->bi_plist, invalid for the
+	 * remaining.
+	 */
+	if (bio->bi_root && blk_qc_t_valid(bio->bi_cookie)) {
+		spin_lock(&root->bi_plock);
+		list_del(&bio->bi_pnode);
+		spin_unlock(&root->bi_plock);
+	}
+}
+
+int blk_bio_poll(struct request_queue *q, blk_qc_t cookie);
+
 struct blk_rq_stat {
 	u64 mean;
 	u64 min;
-- 
2.27.0



* [PATCH RFC 7/7] dm: add support for IO polling
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
                   ` (5 preceding siblings ...)
  2020-12-23 11:26 ` [PATCH RFC 6/7] block: track cookies of split bios " Jeffle Xu
@ 2020-12-23 11:26 ` Jeffle Xu
  2021-01-08  3:12   ` [dm-devel] " JeffleXu
  2021-01-07  1:14 ` [dm-devel] [PATCH RFC 0/7] dm: add support of iopoll JeffleXu
  7 siblings, 1 reply; 26+ messages in thread
From: Jeffle Xu @ 2020-12-23 11:26 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

Enable iopoll when all underlying target devices support iopoll.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 drivers/md/dm-table.c | 28 ++++++++++++++++++++++++++++
 drivers/md/dm.c       |  1 +
 2 files changed, 29 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 188f41287f18..b0cd5bf58c3c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1791,6 +1791,31 @@ static bool dm_table_requires_stable_pages(struct dm_table *t)
 	return false;
 }
 
+static int device_supports_poll(struct dm_target *ti, struct dm_dev *dev,
+				sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return q && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
+static bool dm_table_supports_poll(struct dm_table *t)
+{
+	struct dm_target *ti;
+	unsigned int i;
+
+	/* Ensure that all targets support iopoll. */
+	for (i = 0; i < dm_table_get_num_targets(t); i++) {
+		ti = dm_table_get_target(t, i);
+
+		if (!ti->type->iterate_devices ||
+		    !ti->type->iterate_devices(ti, device_supports_poll, NULL))
+			return false;
+	}
+
+	return true;
+}
+
 void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			       struct queue_limits *limits)
 {
@@ -1883,6 +1908,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 #endif
 
 	blk_queue_update_readahead(q);
+
+	if (dm_table_supports_poll(t))
+		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 03c2b867acaa..ffd2c5ead256 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -3049,6 +3049,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.iopoll = blk_bio_poll,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
-- 
2.27.0



* Re: [dm-devel] [PATCH RFC 0/7] dm: add support of iopoll
  2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
                   ` (6 preceding siblings ...)
  2020-12-23 11:26 ` [PATCH RFC 7/7] dm: add support for IO polling Jeffle Xu
@ 2021-01-07  1:14 ` JeffleXu
  7 siblings, 0 replies; 26+ messages in thread
From: JeffleXu @ 2021-01-07  1:14 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring

Hi Mike,

Would you please give some suggestions on this series?


Thanks,
Jeffle

On 12/23/20 7:26 PM, Jeffle Xu wrote:
> This patch set adds support for iopoll on dm devices.
> 
> [...]

-- 
Thanks,
Jeffle


* Re: [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h
  2020-12-23 11:26 ` [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h Jeffle Xu
@ 2021-01-07 19:04   ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 19:04 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
> the encapsulation.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  include/linux/blk_types.h | 2 +-
>  include/linux/fs.h        | 2 +-
>  include/linux/types.h     | 3 +++
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 866f74261b3b..2e05244fc16d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -532,7 +532,7 @@ static inline int op_stat_group(unsigned int op)
>  	return op_is_write(op);
>  }
>  
> -typedef unsigned int blk_qc_t;
> +/* Macros for blk_qc_t */
>  #define BLK_QC_T_NONE		-1U
>  #define BLK_QC_T_SHIFT		16
>  #define BLK_QC_T_INTERNAL	(1U << 31)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ad4cf1bae586..58db714c4834 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -330,7 +330,7 @@ struct kiocb {
>  	u16			ki_hint;
>  	u16			ki_ioprio; /* See linux/ioprio.h */
>  	union {
> -		unsigned int		ki_cookie; /* for ->iopoll */
> +		blk_qc_t		ki_cookie; /* for ->iopoll */
>  		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
>  	};
>  
> diff --git a/include/linux/types.h b/include/linux/types.h
> index a147977602b5..da5ca7e1bea9 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -125,6 +125,9 @@ typedef s64			int64_t;
>  typedef u64 sector_t;
>  typedef u64 blkcnt_t;
>  
> +/* cookie used for IO polling */
> +typedef unsigned int blk_qc_t;
> +
>  /*
>   * The type of an index into the pagecache.
>   */
> -- 
> 2.27.0
> 

Unfortunate that you cannot just include blk_types.h in fs.h; but
vma_is_dax() ruins that for us since commit baabda2614245 ("mm: always
enable thp for dax mappings").

Reviewed-by: Mike Snitzer <snitzer@redhat.com>



* Re: [PATCH RFC 2/7] block: add helper function fetching gendisk from queue
  2020-12-23 11:26 ` [PATCH RFC 2/7] block: add helper function fetching gendisk from queue Jeffle Xu
@ 2021-01-07 20:31   ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 20:31 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> Sometimes we need to get the corresponding gendisk from a
> request_queue.
> 
> One such use case is that the block device driver may have stored the
> same private data both in queue->queuedata and in gendisk->private_data,
> while nowadays gendisk->private_data is preferred in such cases, e.g.
> commit c4a59c4e5db3 ("dm: stop using ->queuedata"). So if only the
> request_queue is given, we need to get the corresponding gendisk from
> the queue to retrieve the private data stored in the gendisk.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  include/linux/blkdev.h       | 2 ++
>  include/trace/events/kyber.h | 6 +++---
>  2 files changed, 5 insertions(+), 3 deletions(-)

Looks good, but please update the patch subject and header to be:

block: add queue_to_disk() to get gendisk from request_queue

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>



* Re: [PATCH RFC 3/7] block: add iopoll method for non-mq device
  2020-12-23 11:26 ` [PATCH RFC 3/7] block: add iopoll method for non-mq device Jeffle Xu
@ 2021-01-07 21:47   ` Mike Snitzer
  2021-01-08  3:24     ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 21:47 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") for supporting non-mq queues such as
> nvme multipath, but was removed in commit 529262d56dbe ("block: remove
> ->poll_fn").
> 
> To add support for IO polling on non-mq devices, this method needs to
> be brought back. Since commit c62b37d96b6e ("block: move
> ->make_request_fn to struct block_device_operations") has moved all
> callbacks into struct block_device_operations in gendisk, we add the
> new method, named ->iopoll, in block_device_operations as well.

Please update patch subject and header to:

block: add iopoll method to support bio-based IO polling

->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") to support bio-based queues such as
nvme multipath, but was later removed in commit 529262d56dbe ("block:
remove ->poll_fn").

Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
block_device_operations") restore the possibility of bio-based IO
polling support by adding an ->iopoll method to gendisk->fops.
Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
blk_mq_poll() to blk-mq specific code that is called from blk_poll().

> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  block/blk-core.c       | 79 ++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq.c         | 70 +++++--------------------------------
>  include/linux/blk-mq.h |  3 ++
>  include/linux/blkdev.h |  1 +
>  4 files changed, 92 insertions(+), 61 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 96e5fcd7f071..2f5c51ce32e3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
> +{
> +	struct blk_mq_hw_ctx *hctx;
> +
> +	/* TODO: bio-based device doesn't support hybrid poll. */
> +	if (!queue_is_mq(q))
> +		return false;
> +
> +	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +	if (blk_mq_poll_hybrid(q, hctx, cookie))
> +		return true;
> +
> +	hctx->poll_considered++;
> +	return false;
> +}

I don't see where you ever backfill bio-based hybrid support in
the following patches of this series, so it is a lingering TODO.

> +
> +/**
> + * blk_poll - poll for IO completions
> + * @q:  the queue
> + * @cookie: cookie passed back at IO submission time
> + * @spin: whether to spin for completions
> + *
> + * Description:
> + *    Poll for completions on the passed in queue. Returns number of
> + *    completed entries found. If @spin is true, then blk_poll will continue
> + *    looping until at least one completion is found, unless the task is
> + *    otherwise marked running (or we need to reschedule).
> + */
> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> +	long state;
> +
> +	if (!blk_qc_t_valid(cookie) ||
> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +		return 0;
> +
> +	if (current->plug)
> +		blk_flush_plug_list(current->plug, false);
> +
> +	/*
> +	 * If we sleep, have the caller restart the poll loop to reset
> +	 * the state. Like for the other success return cases, the
> +	 * caller is responsible for checking if the IO completed. If
> +	 * the IO isn't complete, we'll get called again and will go
> +	 * straight to the busy poll loop. If specified not to spin,
> +	 * we also should not sleep.
> +	 */
> +	if (spin && blk_poll_hybrid(q, cookie))
> +		return 1;
> +
> +	state = current->state;
> +	do {
> +		int ret;
> +		struct gendisk *disk = queue_to_disk(q);
> +
> +		if (disk->fops->iopoll)
> +			ret = disk->fops->iopoll(q, cookie);
> +		else
> +			ret = blk_mq_poll(q, cookie);

Really don't like that blk-mq is needlessly getting gendisk and checking
disk->fops->iopoll.

This is just to give an idea (whitespace damaged due to coding in a
mail client), but why not remove the above blk_poll_hybrid() and do:

struct blk_mq_hw_ctx *hctx = NULL;
struct gendisk *disk = NULL;
...

if (queue_is_mq(q)) {
	/*
	 * If we sleep, have the caller restart the poll loop to reset
	 * the state. Like for the other success return cases, the
	 * caller is responsible for checking if the IO completed. If
	 * the IO isn't complete, we'll get called again and will go
	 * straight to the busy poll loop. If specified not to spin,
	 * we also should not sleep.
	 */
	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
		return 1;
	hctx->poll_considered++;   
} else {
	disk = queue_to_disk(q);
}

do {
	int ret;

	if (hctx)
	        ret = blk_mq_poll(q, hctx, cookie);
	else if (disk->fops->iopoll)
		ret = disk->fops->iopoll(q, cookie);
		
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(blk_poll);
> +
>  /**
>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>   *                              for the new queue limits
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b09ce00cc6af..85258958e9f1 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3818,8 +3818,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
>  	return true;
>  }
>  
> -static bool blk_mq_poll_hybrid(struct request_queue *q,
> -			       struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
> +bool blk_mq_poll_hybrid(struct request_queue *q,
> +			struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
>  {
>  	struct request *rq;
>  
> @@ -3843,72 +3843,20 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>  	return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> -/**
> - * blk_poll - poll for IO completions
> - * @q:  the queue
> - * @cookie: cookie passed back at IO submission time
> - * @spin: whether to spin for completions
> - *
> - * Description:
> - *    Poll for completions on the passed in queue. Returns number of
> - *    completed entries found. If @spin is true, then blk_poll will continue
> - *    looping until at least one completion is found, unless the task is
> - *    otherwise marked running (or we need to reschedule).
> - */
> -int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
>  {
> +	int ret;
>  	struct blk_mq_hw_ctx *hctx;
> -	long state;
> -
> -	if (!blk_qc_t_valid(cookie) ||
> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> -		return 0;
> -
> -	if (current->plug)
> -		blk_flush_plug_list(current->plug, false);
>  
>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];


Given my suggested code changes above, pass hctx in to blk_mq_poll() to
avoid redundant code to access it in q->queue_hw_ctx[], so:

int blk_mq_poll(struct request_queue *q,
    		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)

> -	/*
> -	 * If we sleep, have the caller restart the poll loop to reset
> -	 * the state. Like for the other success return cases, the
> -	 * caller is responsible for checking if the IO completed. If
> -	 * the IO isn't complete, we'll get called again and will go
> -	 * straight to the busy poll loop. If specified not to spin,
> -	 * we also should not sleep.
> -	 */
> -	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
> -		return 1;
> -
> -	hctx->poll_considered++;
> +	hctx->poll_invoked++;
> +	ret = q->mq_ops->poll(hctx);
> +	if (ret > 0)
> +		hctx->poll_success++;
>  
> -	state = current->state;
> -	do {
> -		int ret;
> -
> -		hctx->poll_invoked++;
> -
> -		ret = q->mq_ops->poll(hctx);
> -		if (ret > 0) {
> -			hctx->poll_success++;
> -			__set_current_state(TASK_RUNNING);
> -			return ret;
> -		}
> -
> -		if (signal_pending_state(state, current))
> -			__set_current_state(TASK_RUNNING);
> -
> -		if (current->state == TASK_RUNNING)
> -			return 1;
> -		if (ret < 0 || !spin)
> -			break;
> -		cpu_relax();
> -	} while (!need_resched());
> -
> -	__set_current_state(TASK_RUNNING);
> -	return 0;
> +	return ret;
>  }
> -EXPORT_SYMBOL_GPL(blk_poll);
>  
>  unsigned int blk_mq_rq_cpu(struct request *rq)
>  {
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 47b021952ac7..032e08ecd42e 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -607,6 +607,9 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>  }
>  
>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
> +bool blk_mq_poll_hybrid(struct request_queue *q,
> +		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>  		struct lock_class_key *key);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2303d06a5a82..e8965879eb90 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1845,6 +1845,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>  
>  struct block_device_operations {
>  	blk_qc_t (*submit_bio) (struct bio *bio);
> +	int (*iopoll)(struct request_queue *q, blk_qc_t cookie);
>  	int (*open) (struct block_device *, fmode_t);
>  	void (*release) (struct gendisk *, fmode_t);
>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
> -- 
> 2.27.0
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel



* Re: [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t
  2020-12-23 11:26 ` [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t Jeffle Xu
@ 2021-01-07 21:52   ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 21:52 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> To support iopoll for bio-based devices, the returned cookie is
> actually a pointer to an implementation-specific object, e.g. an object
> maintaining all split bios.
> 
> In such a case, blk_qc_t should be large enough to contain a pointer,
> so uintptr_t is used here. Since uintptr_t is an integer type in
> essence, there's no need for type casting in the original mq path,
> while type casting is indeed needed in the bio-based routine.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  include/linux/types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/types.h b/include/linux/types.h
> index da5ca7e1bea9..f6301014a459 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -126,7 +126,7 @@ typedef u64 sector_t;
>  typedef u64 blkcnt_t;
>  
>  /* cookie used for IO polling */
> -typedef unsigned int blk_qc_t;
> +typedef uintptr_t blk_qc_t;
>  
>  /*
>   * The type of an index into the pagecache.

I'd just fold this into patch 6.. not seeing benefit to having this be
separate.

Patch 6's header needs a lot more detail anyway so..



* Re: [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device
  2020-12-23 11:26 ` [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
@ 2021-01-07 21:54   ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 21:54 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> Currently the returned cookie of a bio-based device is not used at all.
> 
> In the following patches, a bio-based device will actually return a
> pointer to a specific object as the returned cookie.

In the following patch, ...

> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

> ---
>  drivers/md/dm.c | 26 ++++++++++----------------
>  1 file changed, 10 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 5b2f371ec4bb..03c2b867acaa 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1252,14 +1252,13 @@ void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors)
>  }
>  EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
>  
> -static blk_qc_t __map_bio(struct dm_target_io *tio)
> +static void __map_bio(struct dm_target_io *tio)
>  {
>  	int r;
>  	sector_t sector;
>  	struct bio *clone = &tio->clone;
>  	struct dm_io *io = tio->io;
>  	struct dm_target *ti = tio->ti;
> -	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	clone->bi_end_io = clone_endio;
>  
> @@ -1278,7 +1277,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
>  	case DM_MAPIO_REMAPPED:
>  		/* the bio has been remapped so dispatch it */
>  		trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
> -		ret = submit_bio_noacct(clone);
> +		submit_bio_noacct(clone);
>  		break;
>  	case DM_MAPIO_KILL:
>  		free_tio(tio);
> @@ -1292,8 +1291,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
>  		DMWARN("unimplemented target map return value: %d", r);
>  		BUG();
>  	}
> -
> -	return ret;
>  }
>  
>  static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
> @@ -1380,7 +1377,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
>  	}
>  }
>  
> -static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
> +static void __clone_and_map_simple_bio(struct clone_info *ci,
>  					   struct dm_target_io *tio, unsigned *len)
>  {
>  	struct bio *clone = &tio->clone;
> @@ -1391,7 +1388,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
>  	if (len)
>  		bio_setup_sector(clone, ci->sector, *len);
>  
> -	return __map_bio(tio);
> +	__map_bio(tio);
>  }
>  
>  static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
> @@ -1405,7 +1402,7 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
>  
>  	while ((bio = bio_list_pop(&blist))) {
>  		tio = container_of(bio, struct dm_target_io, clone);
> -		(void) __clone_and_map_simple_bio(ci, tio, len);
> +		__clone_and_map_simple_bio(ci, tio, len);
>  	}
>  }
>  
> @@ -1450,7 +1447,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
>  		free_tio(tio);
>  		return r;
>  	}
> -	(void) __map_bio(tio);
> +	__map_bio(tio);
>  
>  	return 0;
>  }
> @@ -1565,11 +1562,10 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
>  /*
>   * Entry point to split a bio into clones and submit them to the targets.
>   */
> -static blk_qc_t __split_and_process_bio(struct mapped_device *md,
> +static void __split_and_process_bio(struct mapped_device *md,
>  					struct dm_table *map, struct bio *bio)
>  {
>  	struct clone_info ci;
> -	blk_qc_t ret = BLK_QC_T_NONE;
>  	int error = 0;
>  
>  	init_clone_info(&ci, md, map, bio);
> @@ -1613,7 +1609,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
>  
>  				bio_chain(b, bio);
>  				trace_block_split(b, bio->bi_iter.bi_sector);
> -				ret = submit_bio_noacct(bio);
> +				submit_bio_noacct(bio);
>  				break;
>  			}
>  		}
> @@ -1621,13 +1617,11 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
>  
>  	/* drop the extra reference count */
>  	dec_pending(ci.io, errno_to_blk_status(error));
> -	return ret;
>  }
>  
>  static blk_qc_t dm_submit_bio(struct bio *bio)
>  {
>  	struct mapped_device *md = bio->bi_disk->private_data;
> -	blk_qc_t ret = BLK_QC_T_NONE;
>  	int srcu_idx;
>  	struct dm_table *map;
>  
> @@ -1657,10 +1651,10 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>  	if (is_abnormal_io(bio))
>  		blk_queue_split(&bio);
>  
> -	ret = __split_and_process_bio(md, map, bio);
> +	__split_and_process_bio(md, map, bio);
>  out:
>  	dm_put_live_table(md, srcu_idx);
> -	return ret;
> +	return BLK_QC_T_NONE;
>  }
>  
>  /*-----------------------------------------------------------------
> -- 
> 2.27.0
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2020-12-23 11:26 ` [PATCH RFC 6/7] block: track cookies of split bios " Jeffle Xu
@ 2021-01-07 22:18   ` Mike Snitzer
  2021-01-08  3:08     ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2021-01-07 22:18 UTC (permalink / raw)
  To: Jeffle Xu; +Cc: linux-block, dm-devel, io-uring

On Wed, Dec 23 2020 at  6:26am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> This is actually the core of supporting iopoll for bio-based devices.
> 
> A list is maintained in the top bio (the original bio submitted to the
> dm device), which is used to maintain all valid cookies of split bios.
> The IO polling routine will actually iterate this list and poll on the
> corresponding hardware queues of the underlying mq devices.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>

Like I said in response to patch 4 in this series: please fold patch 4
into this patch and _really_ improve this patch header.

In particular, the (ab)use of bio_inc_remaining() needs to be documented in
this patch header very well.

But its use could easily be why you're seeing a performance hit (coupled
with the extra spinlock locking and list management used).  Just added
latency and contention across CPUs.

> ---
>  block/bio.c               |  8 ++++
>  block/blk-core.c          | 84 ++++++++++++++++++++++++++++++++++++++-
>  include/linux/blk_types.h | 39 ++++++++++++++++++
>  3 files changed, 129 insertions(+), 2 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..ca6d1a7ee196 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
>  
>  	bio->bi_io_vec = table;
>  	bio->bi_max_vecs = max_vecs;
> +
> +	INIT_LIST_HEAD(&bio->bi_plist);
> +	INIT_LIST_HEAD(&bio->bi_pnode);
> +	spin_lock_init(&bio->bi_plock);
>  }
>  EXPORT_SYMBOL(bio_init);
>  
> @@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>  	bio->bi_write_hint = bio_src->bi_write_hint;
>  	bio->bi_iter = bio_src->bi_iter;
>  	bio->bi_io_vec = bio_src->bi_io_vec;
> +	bio->bi_root = bio_src->bi_root;
>  
>  	bio_clone_blkg_association(bio, bio_src);
>  	blkcg_bio_issue_init(bio);
> @@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
>  	if (bio->bi_disk)
>  		rq_qos_done_bio(bio->bi_disk->queue, bio);
>  
> +	bio_del_poll_list(bio);
> +
>  	/*
>  	 * Need to have a real endio function for chained bios, otherwise
>  	 * various corner cases will break (like stacking block devices that
> @@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
>  	blk_throtl_bio_endio(bio);
>  	/* release cgroup info */
>  	bio_uninit(bio);
> +
>  	if (bio->bi_end_io)
>  		bio->bi_end_io(bio);
>  }
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2f5c51ce32e3..5a332af01939 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> +	bool iopoll;
> +	struct bio *root;
>  
>  	BUG_ON(bio->bi_next);
>  
>  	bio_list_init(&bio_list_on_stack[0]);
>  	current->bio_list = bio_list_on_stack;
>  
> +	iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
> +	iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
> +
> +	if (iopoll) {
> +		bio->bi_root = root = bio;
> +		/*
> +		 * We need to pin root bio here since there's a reference from
> +		 * the returned cookie. bio_get() is not enough since the whole
> +		 * bio and the corresponding kiocb/dio may have already
> +		 * completed and thus won't call blk_poll() at all, in which
> +		 * case the pairing bio_put() in blk_bio_poll() won't be called.
> +		 * The side effect of bio_inc_remaining() is that, the whole bio
> +		 * won't complete until blk_poll() called.
> +		 */
> +		bio_inc_remaining(root);
> +	}
> +
>  	do {
>  		struct request_queue *q = bio->bi_disk->queue;
>  		struct bio_list lower, same;
> @@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (iopoll) {
> +			/* See the comments of above bio_inc_remaining(). */
> +			bio_inc_remaining(bio);
> +			bio->bi_cookie = __submit_bio(bio);
> +
> +			if (blk_qc_t_valid(bio->bi_cookie))
> +				bio_add_poll_list(bio);
> +
> +			bio_endio(bio);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>  
>  	current->bio_list = NULL;
> -	return ret;
> +
> +	if (iopoll)
> +		return (blk_qc_t)root;
> +
> +	return BLK_QC_T_NONE;
>  }
>  
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> @@ -1131,6 +1165,52 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> +{
> +	int ret = 0;
> +	struct bio *bio, *root = (struct bio*)cookie;
> +
> +	if (list_empty(&root->bi_plist)) {
> +		bio_endio(root);
> +		return 1;
> +	}
> +
> +	spin_lock(&root->bi_plock);
> +	bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
> +
> +	while (bio) {
> +		struct request_queue *q = bio->bi_disk->queue;
> +		blk_qc_t cookie = bio->bi_cookie;
> +
> +		spin_unlock(&root->bi_plock);
> +		BUG_ON(!blk_qc_t_valid(cookie));
> +
> +		ret += blk_mq_poll(q, cookie);

Not yet clear to me how you _know_ this q is blk-mq...
What about a deep stack of bio-based DM devices?

Or are you confining bio-based DM IO polling support to bio-based DM
stacked directly on blk-mq? (patch 7 likely shows that to be the case).

If so, I'm not liking that at all.  So if your implementation doesn't
support arbitrary bio-based IO stacks then this bio-based IO polling
support really has limited utility still.

Helps explain how you got away with having bio-based DM always returning
BLK_QC_T_NONE in patch 5 though... feels far too simplistic.  Patch 5+6
together are implicitly ignoring the complexity that results from
arbitrary bio-based DM stacking.

Or am I missing something?

Mike


> +
> +		spin_lock(&root->bi_plock);
> +		/*
> +		 * One blk_mq_poll() call could complete multiple bios, and
> +		 * thus multiple bios could be removed from root->bi_plist
> +		 * list.
> +		 */
> +		bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
> +	}
> +
> +	spin_unlock(&root->bi_plock);
> +
> +	if (list_empty(&root->bi_plist)) {
> +		bio_endio(root);
> +		/*
> +		 * 'ret' may be 0 here. root->bi_plist may be empty once we
> +		 * acquire the list spinlock.
> +		 */
> +		ret = max(ret, 1);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(blk_bio_poll);
> +
>  static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
>  {
>  	struct blk_mq_hw_ctx *hctx;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 2e05244fc16d..2cf5d8f0ea34 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -277,6 +277,12 @@ struct bio {
>  
>  	struct bio_set		*bi_pool;
>  
> +	struct bio		*bi_root;	/* original bio of submit_bio() */
> +	struct list_head        bi_plist;
> +	struct list_head        bi_pnode;
> +	struct spinlock         bi_plock;
> +	blk_qc_t		bi_cookie;
> +
>  	/*
>  	 * We can inline a number of vecs at the end of the bio, to avoid
>  	 * double allocations for a small number of bio_vecs. This member
> @@ -557,6 +563,39 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
>  	return (cookie & BLK_QC_T_INTERNAL) != 0;
>  }
>  
> +static inline void bio_add_poll_list(struct bio *bio)
> +{
> +	struct bio *root = bio->bi_root;
> +
> +	/*
> +	 * The spin_lock() variant is enough since bios in root->bi_plist are
> +	 * all enqueued into polling mode hardware queue, thus the list_del()
> +	 * operation is handled only in process context.
> +	 */
> +	spin_lock(&root->bi_plock);
> +	list_add_tail(&bio->bi_pnode, &root->bi_plist);
> +	spin_unlock(&root->bi_plock);
> +}
> +
> +static inline void bio_del_poll_list(struct bio *bio)
> +{
> +	struct bio *root = bio->bi_root;
> +
> +	/*
> +	 * bios in mq routine: @bi_root is NULL, @bi_cookie is 0;
> +	 * bios in bio-based routine: @bi_root is non-NULL, @bi_cookie is valid
> +	 * (including 0) for those in root->bi_plist, invalid for the
> +	 * remaining.
> +	 */
> +	if (bio->bi_root && blk_qc_t_valid(bio->bi_cookie)) {
> +		spin_lock(&root->bi_plock);
> +		list_del(&bio->bi_pnode);
> +		spin_unlock(&root->bi_plock);
> +	}
> +}
> +
> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie);
> +
>  struct blk_rq_stat {
>  	u64 mean;
>  	u64 min;
> -- 
> 2.27.0
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-07 22:18   ` Mike Snitzer
@ 2021-01-08  3:08     ` JeffleXu
  2021-01-08 17:26       ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: JeffleXu @ 2021-01-08  3:08 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring

Thanks for reviewing.


On 1/8/21 6:18 AM, Mike Snitzer wrote:
> On Wed, Dec 23 2020 at  6:26am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> This is actually the core of supporting iopoll for bio-based devices.
>>
>> A list is maintained in the top bio (the original bio submitted to the
>> dm device), which is used to maintain all valid cookies of split bios.
>> The IO polling routine will actually iterate this list and poll on the
>> corresponding hardware queues of the underlying mq devices.
>>
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> 
> Like I said in response to patch 4 in this series: please fold patch 4
> into this patch and _really_ improve this patch header.
> 
> In particular, the (ab)use of bio_inc_remaining() needs to be documented in
> this patch header very well.
> 
> But its use could easily be why you're seeing a performance hit (coupled
> with the extra spinlock locking and list management used).  Just added
> latency and contention across CPUs.

Indeed bio_inc_remaining() is abused here and the code seems quite
hacky.

Actually I'm considering implementing the split-bio tracking mechanism
in the recursive way you once suggested. That is, the split bios could
be maintained in an array, which is allocated with 'struct dm_io'. This
way the overhead of the spinlock protecting &root->bi_plist could be
avoided. Also the lifetime management could be simplified somewhat. But
the block core would then need to fetch the per-bio private data, just
as you suggested before.

What do you think, Mike?


Besides, the lifetime management is quite annoying to me. As long as
the tracking object (representing a valid split bio) is dynamically
allocated, no matter whether it's embedded directly in 'struct bio' (as
in this patch) or allocated with 'struct dm_io', the lifetime management
of the tracking object comes into play. Here the so-called tracking
object is something like

struct node {
    struct blk_mq_hw_ctx *hctx;
    blk_qc_t cookie;
};


Currently the tracking objects are all allocated with 'struct bio', so
the lifetime management of the tracking objects is effectively
equivalent to that of the bio. Since the returned cookie is actually a
pointer to the bio, the refcount of this bio must be incremented,
because we release a reference to this bio through the returned cookie;
the abuse of the refcount trick thus seems unavoidable. Unless we
allocate the tracking object individually: then the returned cookie
points to the tracking object, and a refcount is maintained individually
for the tracking object.
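
To make that last option concrete, here is a minimal sketch of an
individually allocated, individually refcounted tracking object (all
names below are made up just for illustration):

#include <linux/blk-mq.h>
#include <linux/refcount.h>
#include <linux/slab.h>

/*
 * Sketch only: decouple the tracking object from the bio lifetime.
 * The returned cookie would point at this object instead of at the
 * root bio, so no bio refcount trick would be needed.
 */
struct bio_poll_node {
	struct blk_mq_hw_ctx	*hctx;		/* hw queue of the split bio */
	blk_qc_t		cookie;		/* cookie returned by blk-mq */
	refcount_t		ref;		/* list ref + cookie ref */
	struct list_head	node;		/* entry in the root's poll list */
};

static void bio_poll_node_put(struct bio_poll_node *pn)
{
	/* Free once both the list and the cookie have dropped their ref. */
	if (refcount_dec_and_test(&pn->ref))
		kfree(pn);
}

This way bio_endio() of the split bio and blk_poll() would each drop
their own reference, rather than holding the whole root bio open with
bio_inc_remaining().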


> 
>> ---
>>  block/bio.c               |  8 ++++
>>  block/blk-core.c          | 84 ++++++++++++++++++++++++++++++++++++++-
>>  include/linux/blk_types.h | 39 ++++++++++++++++++
>>  3 files changed, 129 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/bio.c b/block/bio.c
>> index 1f2cc1fbe283..ca6d1a7ee196 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
>>  
>>  	bio->bi_io_vec = table;
>>  	bio->bi_max_vecs = max_vecs;
>> +
>> +	INIT_LIST_HEAD(&bio->bi_plist);
>> +	INIT_LIST_HEAD(&bio->bi_pnode);
>> +	spin_lock_init(&bio->bi_plock);
>>  }
>>  EXPORT_SYMBOL(bio_init);
>>  
>> @@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>>  	bio->bi_write_hint = bio_src->bi_write_hint;
>>  	bio->bi_iter = bio_src->bi_iter;
>>  	bio->bi_io_vec = bio_src->bi_io_vec;
>> +	bio->bi_root = bio_src->bi_root;
>>  
>>  	bio_clone_blkg_association(bio, bio_src);
>>  	blkcg_bio_issue_init(bio);
>> @@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
>>  	if (bio->bi_disk)
>>  		rq_qos_done_bio(bio->bi_disk->queue, bio);
>>  
>> +	bio_del_poll_list(bio);
>> +
>>  	/*
>>  	 * Need to have a real endio function for chained bios, otherwise
>>  	 * various corner cases will break (like stacking block devices that
>> @@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
>>  	blk_throtl_bio_endio(bio);
>>  	/* release cgroup info */
>>  	bio_uninit(bio);
>> +
>>  	if (bio->bi_end_io)
>>  		bio->bi_end_io(bio);
>>  }
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 2f5c51ce32e3..5a332af01939 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  {
>>  	struct bio_list bio_list_on_stack[2];
>>  	blk_qc_t ret = BLK_QC_T_NONE;
>> +	bool iopoll;
>> +	struct bio *root;
>>  
>>  	BUG_ON(bio->bi_next);
>>  
>>  	bio_list_init(&bio_list_on_stack[0]);
>>  	current->bio_list = bio_list_on_stack;
>>  
>> +	iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
>> +	iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
>> +
>> +	if (iopoll) {
>> +		bio->bi_root = root = bio;
>> +		/*
>> +		 * We need to pin root bio here since there's a reference from
>> +		 * the returned cookie. bio_get() is not enough since the whole
>> +		 * bio and the corresponding kiocb/dio may have already
>> +		 * completed and thus won't call blk_poll() at all, in which
>> +		 * case the pairing bio_put() in blk_bio_poll() won't be called.
>> +		 * The side effect of bio_inc_remaining() is that, the whole bio
>> +		 * won't complete until blk_poll() called.
>> +		 */
>> +		bio_inc_remaining(root);
>> +	}
>> +
>>  	do {
>>  		struct request_queue *q = bio->bi_disk->queue;
>>  		struct bio_list lower, same;
>> @@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>  		bio_list_init(&bio_list_on_stack[0]);
>>  
>> -		ret = __submit_bio(bio);
>> +		if (iopoll) {
>> +			/* See the comments of above bio_inc_remaining(). */
>> +			bio_inc_remaining(bio);
>> +			bio->bi_cookie = __submit_bio(bio);
>> +
>> +			if (blk_qc_t_valid(bio->bi_cookie))
>> +				bio_add_poll_list(bio);
>> +
>> +			bio_endio(bio);
>> +		} else {
>> +			ret = __submit_bio(bio);
>> +		}
>>  
>>  		/*
>>  		 * Sort new bios into those for a lower level and those for the
>> @@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>>  
>>  	current->bio_list = NULL;
>> -	return ret;
>> +
>> +	if (iopoll)
>> +		return (blk_qc_t)root;
>> +
>> +	return BLK_QC_T_NONE;
>>  }
>>  
>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>> @@ -1131,6 +1165,52 @@ blk_qc_t submit_bio(struct bio *bio)
>>  }
>>  EXPORT_SYMBOL(submit_bio);
>>  
>> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
>> +{
>> +	int ret = 0;
>> +	struct bio *bio, *root = (struct bio*)cookie;
>> +
>> +	if (list_empty(&root->bi_plist)) {
>> +		bio_endio(root);
>> +		return 1;
>> +	}
>> +
>> +	spin_lock(&root->bi_plock);
>> +	bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
>> +
>> +	while (bio) {
>> +		struct request_queue *q = bio->bi_disk->queue;
>> +		blk_qc_t cookie = bio->bi_cookie;
>> +
>> +		spin_unlock(&root->bi_plock);
>> +		BUG_ON(!blk_qc_t_valid(cookie));
>> +
>> +		ret += blk_mq_poll(q, cookie);
> 
> Not yet clear to me how you _know_ this q is blk-mq...
> What about a deep stack of bio-based DM devices?
> 

This design works in arbitrary bio-based DM stacking.


> Or are you confining bio-based DM IO polling support to bio-based DM
> stacked directly on blk-mq? (patch 7 likely shows that to be the case).
> 

Patch 7 works with arbitrary bio-based DM stacking. Please see the
reply to patch 7 for details.


> If so, I'm not liking that at all.  So if your implementation doesn't
> support arbitrary bio-based IO stacks then this bio-based IO polling
> support really has limited utility still.
> 
> Helps explain how you got away with having bio-based DM always returning
> BLK_QC_T_NONE in patch 5 though... feels far too simplistic.  Patch 5+6
> together are implicitly ignoring the complexity that results from
> arbitrary bio-based DM stacking.
> 
> Or am I missing something?

The magic is in patch 5. Bios submitted directly to a DM device won't
be enqueued into this &root->bi_plist list, since all bios submitted
directly to a DM device return BLK_QC_T_NONE as of patch 5, and
__submit_bio_noacct() only enqueues split bios with a valid cookie into
the &root->bi_plist list. Thus only bios submitted to an mq device will
be enqueued into this list.

Following is the related logic (the blk_qc_t_valid() part).


>> -		ret = __submit_bio(bio);
>> +		if (iopoll) {
>> +			/* See the comments of above bio_inc_remaining(). */
>> +			bio_inc_remaining(bio);
>> +			bio->bi_cookie = __submit_bio(bio);
>> +
>> +			if (blk_qc_t_valid(bio->bi_cookie))
>> +				bio_add_poll_list(bio);
>> +
>> +			bio_endio(bio);
>> +		} else {
>> +			ret = __submit_bio(bio);
>> +		}



Suppose we have the following device stack hierarchy, that is, dm0 is
stacked on dm1, while dm1 is stacked on nvme0 and nvme1.

    dm0
    dm1
nvme0  nvme1


Then the bio graph is like:


                                   +------------+
                                   |bio0(to dm0)|
                                   +------------+
                                         ^
                                         | orig_bio
                                   +--------------------+
                                   |struct dm_io of bio0|
+--------------------+ bi_private  ----------------------
|bio3(to dm1)        |------------>|bio1(to dm1)        |
+--------------------+             +--------------------+
        ^                                ^
        | ->orig_bio                     | ->orig_bio
+--------------------+             +--------------------+
|struct dm_io        |             |struct dm_io        |
----------------------             ----------------------
|bio2(to nvme0)      |             |bio4(to nvme1)      |
+--------------------+             +--------------------+

In this way, bios 0/1/3 will return BLK_QC_T_NONE and won't be enqueued
into the &root->bi_plist list, while bios 2/4 will be enqueued if they
return a valid cookie.

> 
> 
>> +
>> +		spin_lock(&root->bi_plock);
>> +		/*
>> +		 * One blk_mq_poll() call could complete multiple bios, and
>> +		 * thus multiple bios could be removed from root->bi_plist
>> +		 * list.
>> +		 */
>> +		bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
>> +	}
>> +
>> +	spin_unlock(&root->bi_plock);
>> +
>> +	if (list_empty(&root->bi_plist)) {
>> +		bio_endio(root);
>> +		/*
>> +		 * 'ret' may be 0 here. root->bi_plist may be empty once we
>> +		 * acquire the list spinlock.
>> +		 */
>> +		ret = max(ret, 1);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(blk_bio_poll);
>> +
>>  static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
>>  {
>>  	struct blk_mq_hw_ctx *hctx;
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 2e05244fc16d..2cf5d8f0ea34 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -277,6 +277,12 @@ struct bio {
>>  
>>  	struct bio_set		*bi_pool;
>>  
>> +	struct bio		*bi_root;	/* original bio of submit_bio() */
>> +	struct list_head        bi_plist;
>> +	struct list_head        bi_pnode;
>> +	struct spinlock         bi_plock;
>> +	blk_qc_t		bi_cookie;
>> +
>>  	/*
>>  	 * We can inline a number of vecs at the end of the bio, to avoid
>>  	 * double allocations for a small number of bio_vecs. This member
>> @@ -557,6 +563,39 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
>>  	return (cookie & BLK_QC_T_INTERNAL) != 0;
>>  }
>>  
>> +static inline void bio_add_poll_list(struct bio *bio)
>> +{
>> +	struct bio *root = bio->bi_root;
>> +
>> +	/*
>> +	 * The spin_lock() variant is enough since bios in root->bi_plist are
>> +	 * all enqueued into polling mode hardware queue, thus the list_del()
>> +	 * operation is handled only in process context.
>> +	 */
>> +	spin_lock(&root->bi_plock);
>> +	list_add_tail(&bio->bi_pnode, &root->bi_plist);
>> +	spin_unlock(&root->bi_plock);
>> +}
>> +
>> +static inline void bio_del_poll_list(struct bio *bio)
>> +{
>> +	struct bio *root = bio->bi_root;
>> +
>> +	/*
>> +	 * bios in mq routine: @bi_root is NULL, @bi_cookie is 0;
>> +	 * bios in bio-based routine: @bi_root is non-NULL, @bi_cookie is valid
>> +	 * (including 0) for those in root->bi_plist, invalid for the
>> +	 * remaining.
>> +	 */
>> +	if (bio->bi_root && blk_qc_t_valid(bio->bi_cookie)) {
>> +		spin_lock(&root->bi_plock);
>> +		list_del(&bio->bi_pnode);
>> +		spin_unlock(&root->bi_plock);
>> +	}
>> +}
>> +
>> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie);
>> +
>>  struct blk_rq_stat {
>>  	u64 mean;
>>  	u64 min;
>> -- 
>> 2.27.0
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 7/7] dm: add support for IO polling
  2020-12-23 11:26 ` [PATCH RFC 7/7] dm: add support for IO polling Jeffle Xu
@ 2021-01-08  3:12   ` JeffleXu
  0 siblings, 0 replies; 26+ messages in thread
From: JeffleXu @ 2021-01-08  3:12 UTC (permalink / raw)
  To: snitzer; +Cc: linux-block, dm-devel, io-uring



On 12/23/20 7:26 PM, Jeffle Xu wrote:
> Enable iopoll when all underlying target devices supports iopoll.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  drivers/md/dm-table.c | 28 ++++++++++++++++++++++++++++
>  drivers/md/dm.c       |  1 +
>  2 files changed, 29 insertions(+)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 188f41287f18..b0cd5bf58c3c 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1791,6 +1791,31 @@ static bool dm_table_requires_stable_pages(struct dm_table *t)
>  	return false;
>  }
>  
> +static int device_supports_poll(struct dm_target *ti, struct dm_dev *dev,
> +				sector_t start, sector_t len, void *data)
> +{
> +	struct request_queue *q = bdev_get_queue(dev->bdev);
> +
> +	return q && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
> +}
> +
> +static bool dm_table_supports_poll(struct dm_table *t)
> +{
> +	struct dm_target *ti;
> +	unsigned int i;
> +
> +	/* Ensure that all targets support iopoll. */
> +	for (i = 0; i < dm_table_get_num_targets(t); i++) {
> +		ti = dm_table_get_target(t, i);
> +
> +		if (!ti->type->iterate_devices ||
> +		    !ti->type->iterate_devices(ti, device_supports_poll, NULL))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
>  void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  			       struct queue_limits *limits)
>  {
> @@ -1883,6 +1908,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  #endif
>  
>  	blk_queue_update_readahead(q);
> +
> +	if (dm_table_supports_poll(t))
> +		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>  }
>  

This works in arbitrary bio-based DM stacking. Suppose the following
device stack:

    dm0
    dm1
nvme0  nvme1


dm0 will check if dm1 has the QUEUE_FLAG_POLL flag set.


>  unsigned int dm_table_get_num_targets(struct dm_table *t)
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 03c2b867acaa..ffd2c5ead256 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -3049,6 +3049,7 @@ static const struct pr_ops dm_pr_ops = {
>  };
>  
>  static const struct block_device_operations dm_blk_dops = {
> +	.iopoll = blk_bio_poll,
>  	.submit_bio = dm_submit_bio,
>  	.open = dm_blk_open,
>  	.release = dm_blk_close,
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 3/7] block: add iopoll method for non-mq device
  2021-01-07 21:47   ` Mike Snitzer
@ 2021-01-08  3:24     ` JeffleXu
  2021-01-08 17:33       ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: JeffleXu @ 2021-01-08  3:24 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring



On 1/8/21 5:47 AM, Mike Snitzer wrote:
> On Wed, Dec 23 2020 at  6:26am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> ->poll_fn is introduced in commit ea435e1b9392 ("block: add a poll_fn
>> callback to struct request_queue") for supporting non-mq queues such as
>> nvme multipath, but removed in commit 529262d56dbe ("block: remove
>> ->poll_fn").
>>
> >> To add support of IO polling for non-mq devices, this method needs to
> >> be brought back. Since commit c62b37d96b6e ("block: move ->make_request_fn to
>> struct block_device_operations") has moved all callbacks into struct
>> block_device_operations in gendisk, we also add the new method named
>> ->iopoll in block_device_operations.
> 
> Please update patch subject and header to:
> 
> block: add iopoll method to support bio-based IO polling
> 
> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") to support bio-based queues such as
> nvme multipath, but was later removed in commit 529262d56dbe ("block:
> remove ->poll_fn").
> 
> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> block_device_operations") restore the possibility of bio-based IO
> polling support by adding an ->iopoll method to gendisk->fops.
> Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
> blk_mq_poll() to blk-mq specific code that is called from blk_poll().
> 
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> ---
>>  block/blk-core.c       | 79 ++++++++++++++++++++++++++++++++++++++++++
>>  block/blk-mq.c         | 70 +++++--------------------------------
>>  include/linux/blk-mq.h |  3 ++
>>  include/linux/blkdev.h |  1 +
>>  4 files changed, 92 insertions(+), 61 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 96e5fcd7f071..2f5c51ce32e3 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
>>  }
>>  EXPORT_SYMBOL(submit_bio);
>>  
>> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
>> +{
>> +	struct blk_mq_hw_ctx *hctx;
>> +
>> +	/* TODO: bio-based device doesn't support hybrid poll. */
>> +	if (!queue_is_mq(q))
>> +		return false;
>> +
>> +	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>> +	if (blk_mq_poll_hybrid(q, hctx, cookie))
>> +		return true;
>> +
>> +	hctx->poll_considered++;
>> +	return false;
>> +}
> 
> I don't see where you ever backfill bio-based hybrid support in
> the following patches in this series, so it is a lingering TODO.

Yes, hybrid polling is not implemented and is thus bypassed for
bio-based devices currently.

> 
>> +
>> +/**
>> + * blk_poll - poll for IO completions
>> + * @q:  the queue
>> + * @cookie: cookie passed back at IO submission time
>> + * @spin: whether to spin for completions
>> + *
>> + * Description:
>> + *    Poll for completions on the passed in queue. Returns number of
>> + *    completed entries found. If @spin is true, then blk_poll will continue
>> + *    looping until at least one completion is found, unless the task is
>> + *    otherwise marked running (or we need to reschedule).
>> + */
>> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>> +{
>> +	long state;
>> +
>> +	if (!blk_qc_t_valid(cookie) ||
>> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>> +		return 0;
>> +
>> +	if (current->plug)
>> +		blk_flush_plug_list(current->plug, false);
>> +
>> +	/*
>> +	 * If we sleep, have the caller restart the poll loop to reset
>> +	 * the state. Like for the other success return cases, the
>> +	 * caller is responsible for checking if the IO completed. If
>> +	 * the IO isn't complete, we'll get called again and will go
>> +	 * straight to the busy poll loop. If specified not to spin,
>> +	 * we also should not sleep.
>> +	 */
>> +	if (spin && blk_poll_hybrid(q, cookie))
>> +		return 1;
>> +
>> +	state = current->state;
>> +	do {
>> +		int ret;
>> +		struct gendisk *disk = queue_to_disk(q);
>> +
>> +		if (disk->fops->iopoll)
>> +			ret = disk->fops->iopoll(q, cookie);
>> +		else
>> +			ret = blk_mq_poll(q, cookie);

The original code is indeed buggy. For a bio-based device, ->iopoll()
may not be implemented while the QUEUE_FLAG_POLL flag is still set, in
which case blk_mq_poll() would be called for this bio-based device.

> 
> Really don't like that blk-mq is needlessly getting gendisk and checking
> disk->fops->iopoll.
> 
> This is just to give an idea (whitespace damaged due to composing in a
> mail client), but why not remove the above blk_poll_hybrid() and do:
> 
> struct blk_mq_hw_ctx *hctx = NULL;
> struct gendisk *disk = NULL;
> ...
> 
> if (queue_is_mq(q)) {
> 	/*
> 	 * If we sleep, have the caller restart the poll loop to reset
> 	 * the state. Like for the other success return cases, the
> 	 * caller is responsible for checking if the IO completed. If
> 	 * the IO isn't complete, we'll get called again and will go
> 	 * straight to the busy poll loop. If specified not to spin,
> 	 * we also should not sleep.
> 	 */
> 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> 	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
> 		return 1;
> 	hctx->poll_considered++;   
> } else {
> 	disk = queue_to_disk(q);
> }
> 
> do {
> 	int ret;
> 
> 	if (hctx)
> 	        ret = blk_mq_poll(q, hctx, cookie);
> 	else if (disk->fops->iopoll)
> 		ret = disk->fops->iopoll(q, cookie);
> 		

Regards.


>> +		if (ret > 0) {
>> +			__set_current_state(TASK_RUNNING);
>> +			return ret;
>> +		}
>> +
>> +		if (signal_pending_state(state, current))
>> +			__set_current_state(TASK_RUNNING);
>> +
>> +		if (current->state == TASK_RUNNING)
>> +			return 1;
>> +		if (ret < 0 || !spin)
>> +			break;
>> +		cpu_relax();
>> +	} while (!need_resched());
>> +
>> +	__set_current_state(TASK_RUNNING);
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(blk_poll);
>> +
>>  /**
>>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>>   *                              for the new queue limits
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b09ce00cc6af..85258958e9f1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3818,8 +3818,8 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
>>  	return true;
>>  }
>>  
>> -static bool blk_mq_poll_hybrid(struct request_queue *q,
>> -			       struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
>> +bool blk_mq_poll_hybrid(struct request_queue *q,
>> +			struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
>>  {
>>  	struct request *rq;
>>  
>> @@ -3843,72 +3843,20 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>>  	return blk_mq_poll_hybrid_sleep(q, rq);
>>  }
>>  
>> -/**
>> - * blk_poll - poll for IO completions
>> - * @q:  the queue
>> - * @cookie: cookie passed back at IO submission time
>> - * @spin: whether to spin for completions
>> - *
>> - * Description:
>> - *    Poll for completions on the passed in queue. Returns number of
>> - *    completed entries found. If @spin is true, then blk_poll will continue
>> - *    looping until at least one completion is found, unless the task is
>> - *    otherwise marked running (or we need to reschedule).
>> - */
>> -int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
>>  {
>> +	int ret;
>>  	struct blk_mq_hw_ctx *hctx;
>> -	long state;
>> -
>> -	if (!blk_qc_t_valid(cookie) ||
>> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>> -		return 0;
>> -
>> -	if (current->plug)
>> -		blk_flush_plug_list(current->plug, false);
>>  
>>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> 
> 
> Given my suggested code changes above, pass hctx in to blk_mq_poll() to
> avoid the redundant lookup in q->queue_hw_ctx[], so:
> 
> int blk_mq_poll(struct request_queue *q,
>     		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
> 

Regards.

>> -	/*
>> -	 * If we sleep, have the caller restart the poll loop to reset
>> -	 * the state. Like for the other success return cases, the
>> -	 * caller is responsible for checking if the IO completed. If
>> -	 * the IO isn't complete, we'll get called again and will go
>> -	 * straight to the busy poll loop. If specified not to spin,
>> -	 * we also should not sleep.
>> -	 */
>> -	if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
>> -		return 1;
>> -
>> -	hctx->poll_considered++;
>> +	hctx->poll_invoked++;
>> +	ret = q->mq_ops->poll(hctx);
>> +	if (ret > 0)
>> +		hctx->poll_success++;
>>  
>> -	state = current->state;
>> -	do {
>> -		int ret;
>> -
>> -		hctx->poll_invoked++;
>> -
>> -		ret = q->mq_ops->poll(hctx);
>> -		if (ret > 0) {
>> -			hctx->poll_success++;
>> -			__set_current_state(TASK_RUNNING);
>> -			return ret;
>> -		}
>> -
>> -		if (signal_pending_state(state, current))
>> -			__set_current_state(TASK_RUNNING);
>> -
>> -		if (current->state == TASK_RUNNING)
>> -			return 1;
>> -		if (ret < 0 || !spin)
>> -			break;
>> -		cpu_relax();
>> -	} while (!need_resched());
>> -
>> -	__set_current_state(TASK_RUNNING);
>> -	return 0;
>> +	return ret;
>>  }
>> -EXPORT_SYMBOL_GPL(blk_poll);
>>  
>>  unsigned int blk_mq_rq_cpu(struct request *rq)
>>  {
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 47b021952ac7..032e08ecd42e 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -607,6 +607,9 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>>  }
>>  
>>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
>> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
>> +bool blk_mq_poll_hybrid(struct request_queue *q,
>> +		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
>>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>>  		struct lock_class_key *key);
>>  
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 2303d06a5a82..e8965879eb90 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1845,6 +1845,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>>  
>>  struct block_device_operations {
>>  	blk_qc_t (*submit_bio) (struct bio *bio);
>> +	int (*iopoll)(struct request_queue *q, blk_qc_t cookie);
>>  	int (*open) (struct block_device *, fmode_t);
>>  	void (*release) (struct gendisk *, fmode_t);
>>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
>> -- 
>> 2.27.0
>>
>> --
>> dm-devel mailing list
>> dm-devel@redhat.com
>> https://www.redhat.com/mailman/listinfo/dm-devel
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-08  3:08     ` [dm-devel] " JeffleXu
@ 2021-01-08 17:26       ` Mike Snitzer
  2021-01-12  5:46         ` [dm-devel] " JeffleXu
  2021-01-12  7:11         ` [dm-devel] " JeffleXu
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-08 17:26 UTC (permalink / raw)
  To: JeffleXu; +Cc: linux-block, dm-devel, io-uring

On Thu, Jan 07 2021 at 10:08pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> Thanks for reviewing.
> 
> 
> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> > On Wed, Dec 23 2020 at  6:26am -0500,
> > Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > 
> >> This is actually the core of supporting iopoll for bio-based devices.
> >>
> >> A list is maintained in the top bio (the original bio submitted to the
> >> dm device), which is used to maintain all valid cookies of split bios.
> >> The IO polling routine will actually iterate this list and poll on the
> >> corresponding hardware queues of the underlying mq devices.
> >>
> >> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> > 
> > Like I said in response to patch 4 in this series: please fold patch 4
> > into this patch and _really_ improve this patch header.
> > 
> > In particular, the (ab)use of bio_inc_remaining() needs to be documented in
> > this patch header very well.
> > 
> > But its use could easily be why you're seeing a performance hit (coupled
> > with the extra spinlock locking and list management used).  Just added
> > latency and contention across CPUs.
> 
> Indeed bio_inc_remaining() is abused here and the code seems quite
> hacky.
> 
> Actually I'm considering implementing the split-bio tracking mechanism
> in the recursive way you once suggested. That is, the split bios could
> be maintained in an array, which is allocated with 'struct dm_io'. This
> way the overhead of the spinlock protecting &root->bi_plist could be
> avoided. Also the lifetime management could be simplified somewhat. But
> the block core would then need to fetch the per-bio private data, just
> as you suggested before.
> 
> What do you think, Mike?

Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').

As for using an array, how would you index the array?  blk-mq is able to
use an array (with cookie to hctx index translation) because there are a
finite number of fixed hctx for the life of the device.  But with
stacked bio-based DM devices, each DM table associated with a DM device
can change via table reload.  Any reloads should flush outstanding IO,
but there are cases where no flushing occurs (e.g. dm-multipath when no
paths are available, _but_ in that instance, there wouldn't be any
mapping that results in a blk-mq hctx endpoint).

All the DM edge cases aside, you need to ensure that the per-bio-data
that holds the 'struct node' (which you correctly detailed needing
below) doesn't somehow get used _after_ the hctx and/or cookie are no
longer valid.  So to start we'll need some BUG_ON()s to validate the
lifetime is correct.

> Besides, the lifetime management is quite annoying to me. As long as
> the tracking object (representing a valid split bio) is dynamically
> allocated, no matter whether it's embedded directly in 'struct bio' (as
> in this patch) or allocated with 'struct dm_io', the lifetime management
> of the tracking object comes into play. Here the so-called tracking
> object is something like
> 
> struct node {
>     struct blk_mq_hw_ctx *hctx;
>     blk_qc_t cookie;
> };

Needs a better name, think I had 'struct dm_poll_data'
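
Something like the following minimal sketch, living in DM's
per-bio-data rather than in 'struct bio' (the fields and the 'pd'
variable are placeholders; exact contents TBD by the lifetime rules):

#include <linux/blk-mq.h>

struct dm_poll_data {
	struct blk_mq_hw_ctx	*hctx;		/* hw queue the clone was issued to */
	blk_qc_t		cookie;		/* cookie returned at submission */
};

/* At poll time, validate the lifetime before dereferencing: */
static inline void dm_poll_data_check(struct dm_poll_data *pd)
{
	BUG_ON(!pd->hctx);
	BUG_ON(!blk_qc_t_valid(pd->cookie));
}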
 
> Currently the tracking objects are all allocated with 'struct bio', so
> the lifetime management of the tracking objects is effectively
> equivalent to that of the bio. Since the returned cookie is actually a
> pointer to the bio, the refcount of this bio must be incremented,
> because we release a reference to this bio through the returned cookie;
> the abuse of the refcount trick thus seems unavoidable. Unless we
> allocate the tracking object individually: then the returned cookie
> points to the tracking object, and a refcount is maintained individually
> for the tracking object.

The refcounting and lifetime of the per-bio-data should all work as is.
Would hope you can avoid the extra bio_inc_remaining().. that infrastructure
is way too tightly coupled to bio_chain()'ing, etc.

The challenge you have is that the array that would point at these
various per-bio-data structures needs to be rooted somewhere (you put it
in the topmost original bio with the current patchset).  But why not
manage that as
part of 'struct mapped_device'?  It'd need proper management at DM table
reload boundaries and such but it seems like the most logical place to
put the array.  But again, this array needs to be dynamic.. so thinking
further, maybe a better model would be to have a fixed array in 'struct
dm_table' for each hctx associated with a blk_mq _data_ device directly
used/managed by that dm_table?

And ideally, access to these arrays should be as lockless as possible
(rcu, or whatever) so that scaling to many cpus isn't a problem.
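
Purely to illustrate the shape of that (nothing below exists in the
tree; all names are made up):

#include <linux/blk-mq.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical: a fixed array in the dm_table of the polling hw queues
 * of the blk-mq data devices it directly uses, sized once at table
 * load and published with RCU so the poll hot path takes no locks.
 */
struct dm_poll_hctx {
	struct request_queue	*queue;		/* underlying blk-mq queue */
	struct blk_mq_hw_ctx	*hctx;		/* its polling hw queue */
};

struct dm_table_poll {
	unsigned int		num_poll_hctxs;	/* fixed for the table's life */
	struct dm_poll_hctx	poll_hctxs[];	/* indexed by cookie */
};

A cookie could then encode an index into poll_hctxs[], translated to an
hctx under rcu_read_lock() at blk_poll() time; table reload would
publish a new array with rcu_assign_pointer() and free the old one
after a grace period.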

> >> ---
> >>  block/bio.c               |  8 ++++
> >>  block/blk-core.c          | 84 ++++++++++++++++++++++++++++++++++++++-
> >>  include/linux/blk_types.h | 39 ++++++++++++++++++
> >>  3 files changed, 129 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/bio.c b/block/bio.c
> >> index 1f2cc1fbe283..ca6d1a7ee196 100644
> >> --- a/block/bio.c
> >> +++ b/block/bio.c
> >> @@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
> >>  
> >>  	bio->bi_io_vec = table;
> >>  	bio->bi_max_vecs = max_vecs;
> >> +
> >> +	INIT_LIST_HEAD(&bio->bi_plist);
> >> +	INIT_LIST_HEAD(&bio->bi_pnode);
> >> +	spin_lock_init(&bio->bi_plock);
> >>  }
> >>  EXPORT_SYMBOL(bio_init);
> >>  
> >> @@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
> >>  	bio->bi_write_hint = bio_src->bi_write_hint;
> >>  	bio->bi_iter = bio_src->bi_iter;
> >>  	bio->bi_io_vec = bio_src->bi_io_vec;
> >> +	bio->bi_root = bio_src->bi_root;
> >>  
> >>  	bio_clone_blkg_association(bio, bio_src);
> >>  	blkcg_bio_issue_init(bio);
> >> @@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
> >>  	if (bio->bi_disk)
> >>  		rq_qos_done_bio(bio->bi_disk->queue, bio);
> >>  
> >> +	bio_del_poll_list(bio);
> >> +
> >>  	/*
> >>  	 * Need to have a real endio function for chained bios, otherwise
> >>  	 * various corner cases will break (like stacking block devices that
> >> @@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
> >>  	blk_throtl_bio_endio(bio);
> >>  	/* release cgroup info */
> >>  	bio_uninit(bio);
> >> +
> >>  	if (bio->bi_end_io)
> >>  		bio->bi_end_io(bio);
> >>  }
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 2f5c51ce32e3..5a332af01939 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>  {
> >>  	struct bio_list bio_list_on_stack[2];
> >>  	blk_qc_t ret = BLK_QC_T_NONE;
> >> +	bool iopoll;
> >> +	struct bio *root;
> >>  
> >>  	BUG_ON(bio->bi_next);
> >>  
> >>  	bio_list_init(&bio_list_on_stack[0]);
> >>  	current->bio_list = bio_list_on_stack;
> >>  
> >> +	iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
> >> +	iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
> >> +
> >> +	if (iopoll) {
> >> +		bio->bi_root = root = bio;
> >> +		/*
> >> +		 * We need to pin root bio here since there's a reference from
> >> +		 * the returned cookie. bio_get() is not enough since the whole
> >> +		 * bio and the corresponding kiocb/dio may have already
> >> +		 * completed and thus won't call blk_poll() at all, in which
> >> +		 * case the pairing bio_put() in blk_bio_poll() won't be called.
> >> +		 * The side effect of bio_inc_remaining() is that, the whole bio
> >> +		 * won't complete until blk_poll() called.
> >> +		 */
> >> +		bio_inc_remaining(root);
> >> +	}
> >> +
> >>  	do {
> >>  		struct request_queue *q = bio->bi_disk->queue;
> >>  		struct bio_list lower, same;
> >> @@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>  		bio_list_init(&bio_list_on_stack[0]);
> >>  
> >> -		ret = __submit_bio(bio);
> >> +		if (iopoll) {
> >> +			/* See the comments of above bio_inc_remaining(). */
> >> +			bio_inc_remaining(bio);
> >> +			bio->bi_cookie = __submit_bio(bio);
> >> +
> >> +			if (blk_qc_t_valid(bio->bi_cookie))
> >> +				bio_add_poll_list(bio);
> >> +
> >> +			bio_endio(bio);
> >> +		} else {
> >> +			ret = __submit_bio(bio);
> >> +		}
> >>  
> >>  		/*
> >>  		 * Sort new bios into those for a lower level and those for the
> >> @@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
> >>  
> >>  	current->bio_list = NULL;
> >> -	return ret;
> >> +
> >> +	if (iopoll)
> >> +		return (blk_qc_t)root;
> >> +
> >> +	return BLK_QC_T_NONE;
> >>  }
> >>  
> >>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >> @@ -1131,6 +1165,52 @@ blk_qc_t submit_bio(struct bio *bio)
> >>  }
> >>  EXPORT_SYMBOL(submit_bio);
> >>  
> >> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> >> +{
> >> +	int ret = 0;
> >> +	struct bio *bio, *root = (struct bio*)cookie;
> >> +
> >> +	if (list_empty(&root->bi_plist)) {
> >> +		bio_endio(root);
> >> +		return 1;
> >> +	}
> >> +
> >> +	spin_lock(&root->bi_plock);
> >> +	bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
> >> +
> >> +	while (bio) {
> >> +		struct request_queue *q = bio->bi_disk->queue;
> >> +		blk_qc_t cookie = bio->bi_cookie;
> >> +
> >> +		spin_unlock(&root->bi_plock);
> >> +		BUG_ON(!blk_qc_t_valid(cookie));
> >> +
> >> +		ret += blk_mq_poll(q, cookie);
> > 
> > Not yet clear to me how you _know_ this q is blk-mq...
> > What about a deep stack of bio-based DM devices?
> > 
> 
> This design works in arbitrary bio-based DM stacking.
> 
> 
> > Or are you confining bio-based DM IO polling support to bio-based DM
> > stacked directly on blk-mq? (patch 7 likely shows that to be the case).
> > 
> 
> Patch 7 works with arbitrary bio-based DM stacking. Please see the
> reply to patch 7 for details.

OK, I see.  Definitely need to capture that aspect of the design in the
relevant patch header(s).

And likely a block comment in blk_bio_poll().

> > If so, I'm not liking that at all.  So if your implementation doesn't
> > support arbitrary bio-based IO stacks then this bio-based IO polling
> > support really has limited utility still.
> > 
> > Helps explain how you got away with having bio-based DM always returning
> > BLK_QC_T_NONE in patch 5 though... feels far too simplistic.  Patch 5+6
> > together are implicitly ignoring the complexity that results from
> > arbitrary bio-based DM stacking.
> > 
> > Or am I missing something?
> 
> The magic is in patch 5. Bios submitted directly to a DM device won't
> be enqueued into this &root->bi_plist list, since all bios submitted
> directly to a DM device return BLK_QC_T_NONE as of patch 5, and
> __submit_bio_noacct() only enqueues split bios with a valid cookie into
> the &root->bi_plist list. Thus only bios submitted to an mq device will
> be enqueued into this list.
> 
> Following is the related logic (the blk_qc_t_valid() part).
> 
> 
> >> -		ret = __submit_bio(bio);
> >> +		if (iopoll) {
> >> +			/* See the comments of above bio_inc_remaining(). */
> >> +			bio_inc_remaining(bio);
> >> +			bio->bi_cookie = __submit_bio(bio);
> >> +
> >> +			if (blk_qc_t_valid(bio->bi_cookie))
> >> +				bio_add_poll_list(bio);
> >> +
> >> +			bio_endio(bio);
> >> +		} else {
> >> +			ret = __submit_bio(bio);
> >> +		}
> 
> 
> 
> Suppose we have the following device stack hierarchy, that is, dm0 is
> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
> 
>     dm0
>     dm1
> nvme0  nvme1
> 
> 
> Then the bio graph is like:
> 
> 
>                                    +------------+
>                                    |bio0(to dm0)|
>                                    +------------+
>                                          ^
>                                          | orig_bio
>                                    +--------------------+
>                                    |struct dm_io of bio0|
> +--------------------+ bi_private  ----------------------
> |bio3(to dm1)        |------------>|bio1(to dm1)        |
> +--------------------+             +--------------------+
>         ^                                ^
>         | ->orig_bio                     | ->orig_bio
> +--------------------+             +--------------------+
> |struct dm_io        |             |struct dm_io        |
> ----------------------             ----------------------
> |bio2(to nvme0)      |             |bio4(to nvme1)      |
> +--------------------+             +--------------------+
> 
> In this way, bios 0/1/3 will return BLK_QC_T_NONE and won't be enqueued
> into the &root->bi_plist list, while bios 2/4 will be enqueued if they
> return a valid cookie.

Yes, useful insight, thanks.

Mike


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 3/7] block: add iopoll method for non-mq device
  2021-01-08  3:24     ` [dm-devel] " JeffleXu
@ 2021-01-08 17:33       ` Mike Snitzer
  2021-01-11  7:50         ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2021-01-08 17:33 UTC (permalink / raw)
  To: JeffleXu; +Cc: linux-block, dm-devel, io-uring

On Thu, Jan 07 2021 at 10:24pm -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> 
> On 1/8/21 5:47 AM, Mike Snitzer wrote:
> > On Wed, Dec 23 2020 at  6:26am -0500,
> > Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> > 
> >> ->poll_fn is introduced in commit ea435e1b9392 ("block: add a poll_fn
> >> callback to struct request_queue") for supporting non-mq queues such as
> >> nvme multipath, but removed in commit 529262d56dbe ("block: remove
> >> ->poll_fn").
> >>
> >> To add support of IO polling for non-mq devices, this method needs to
> >> be brought back. Since commit c62b37d96b6e ("block: move ->make_request_fn to
> >> struct block_device_operations") has moved all callbacks into struct
> >> block_device_operations in gendisk, we also add the new method named
> >> ->iopoll in block_device_operations.
> > 
> > Please update patch subject and header to:
> > 
> > block: add iopoll method to support bio-based IO polling
> > 
> > ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> > callback to struct request_queue") to support bio-based queues such as
> > nvme multipath, but was later removed in commit 529262d56dbe ("block:
> > remove ->poll_fn").
> > 
> > Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> > block_device_operations") restore the possibility of bio-based IO
> > polling support by adding an ->iopoll method to gendisk->fops.
> > Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
> > blk_mq_poll() to blk-mq specific code that is called from blk_poll().
> > 
> >> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >> ---
> >>  block/blk-core.c       | 79 ++++++++++++++++++++++++++++++++++++++++++
> >>  block/blk-mq.c         | 70 +++++--------------------------------
> >>  include/linux/blk-mq.h |  3 ++
> >>  include/linux/blkdev.h |  1 +
> >>  4 files changed, 92 insertions(+), 61 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 96e5fcd7f071..2f5c51ce32e3 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
> >>  }
> >>  EXPORT_SYMBOL(submit_bio);
> >>  
> >> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
> >> +{
> >> +	struct blk_mq_hw_ctx *hctx;
> >> +
> >> +	/* TODO: bio-based device doesn't support hybrid poll. */
> >> +	if (!queue_is_mq(q))
> >> +		return false;
> >> +
> >> +	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >> +	if (blk_mq_poll_hybrid(q, hctx, cookie))
> >> +		return true;
> >> +
> >> +	hctx->poll_considered++;
> >> +	return false;
> >> +}
> > 
> > I don't see where you ever backfill bio-based hybrid support in
> > the following patches in this series, so it is a lingering TODO.
> 
> Yes, hybrid polling is not implemented and is thus bypassed for
> bio-based devices currently.
> 
> > 
> >> +
> >> +/**
> >> + * blk_poll - poll for IO completions
> >> + * @q:  the queue
> >> + * @cookie: cookie passed back at IO submission time
> >> + * @spin: whether to spin for completions
> >> + *
> >> + * Description:
> >> + *    Poll for completions on the passed in queue. Returns number of
> >> + *    completed entries found. If @spin is true, then blk_poll will continue
> >> + *    looping until at least one completion is found, unless the task is
> >> + *    otherwise marked running (or we need to reschedule).
> >> + */
> >> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >> +{
> >> +	long state;
> >> +
> >> +	if (!blk_qc_t_valid(cookie) ||
> >> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> >> +		return 0;
> >> +
> >> +	if (current->plug)
> >> +		blk_flush_plug_list(current->plug, false);
> >> +
> >> +	/*
> >> +	 * If we sleep, have the caller restart the poll loop to reset
> >> +	 * the state. Like for the other success return cases, the
> >> +	 * caller is responsible for checking if the IO completed. If
> >> +	 * the IO isn't complete, we'll get called again and will go
> >> +	 * straight to the busy poll loop. If specified not to spin,
> >> +	 * we also should not sleep.
> >> +	 */
> >> +	if (spin && blk_poll_hybrid(q, cookie))
> >> +		return 1;
> >> +
> >> +	state = current->state;
> >> +	do {
> >> +		int ret;
> >> +		struct gendisk *disk = queue_to_disk(q);
> >> +
> >> +		if (disk->fops->iopoll)
> >> +			ret = disk->fops->iopoll(q, cookie);
> >> +		else
> >> +			ret = blk_mq_poll(q, cookie);
> 
> The original code is indeed buggy. For a bio-based device, ->iopoll()
> may not be implemented while the QUEUE_FLAG_POLL flag is still set, in
> which case blk_mq_poll() would be called for this bio-based device.

Yes, here is the patch I created to capture my suggestions.  Provided it
looks good to you, please fold it into patch 3 when you rebase for
posting a v2 of your patchset:

From: Mike Snitzer <snitzer@redhat.com>
Date: Thu, 7 Jan 2021 20:45:17 -0500
Subject: [PATCH] fixup patch 3

---
 block/blk-core.c       | 51 +++++++++++++++++++++-----------------------------
 block/blk-mq.c         |  6 ++----
 include/linux/blk-mq.h |  3 ++-
 3 files changed, 25 insertions(+), 35 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e6671f6ce1a4..44f62dc0fa9f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1212,22 +1212,6 @@ int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
 }
 EXPORT_SYMBOL(blk_bio_poll);
 
-static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
-{
-	struct blk_mq_hw_ctx *hctx;
-
-	/* TODO: bio-based device doesn't support hybrid poll. */
-	if (!queue_is_mq(q))
-		return false;
-
-	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
-	if (blk_mq_poll_hybrid(q, hctx, cookie))
-		return true;
-
-	hctx->poll_considered++;
-	return false;
-}
-
 /**
  * blk_poll - poll for IO completions
  * @q:  the queue
@@ -1243,6 +1227,8 @@ static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
 int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	long state;
+	struct blk_mq_hw_ctx *hctx = NULL;
+	struct gendisk *disk = NULL;
 
 	if (!blk_qc_t_valid(cookie) ||
 	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
@@ -1251,26 +1237,31 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	if (current->plug)
 		blk_flush_plug_list(current->plug, false);
 
-	/*
-	 * If we sleep, have the caller restart the poll loop to reset
-	 * the state. Like for the other success return cases, the
-	 * caller is responsible for checking if the IO completed. If
-	 * the IO isn't complete, we'll get called again and will go
-	 * straight to the busy poll loop. If specified not to spin,
-	 * we also should not sleep.
-	 */
-	if (spin && blk_poll_hybrid(q, cookie))
-		return 1;
+	if (queue_is_mq(q)) {
+		/*
+		 * If we sleep, have the caller restart the poll loop to reset
+		 * the state. Like for the other success return cases, the
+		 * caller is responsible for checking if the IO completed. If
+		 * the IO isn't complete, we'll get called again and will go
+		 * straight to the busy poll loop. If specified not to spin,
+		 * we also should not sleep.
+		 */
+		hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+		if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
+			return 1;
+		hctx->poll_considered++;
+	} else
+		disk = queue_to_disk(q);
 
 	state = current->state;
 	do {
 		int ret;
-		struct gendisk *disk = queue_to_disk(q);
 
-		if (disk->fops->iopoll)
+		if (hctx)
+			ret = blk_mq_poll(q, hctx, cookie);
+		else if (disk->fops->iopoll)
 			ret = disk->fops->iopoll(q, cookie);
-		else
-			ret = blk_mq_poll(q, cookie);
+
 		if (ret > 0) {
 			__set_current_state(TASK_RUNNING);
 			return ret;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fcb44604f806..90d8dead1da5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3826,12 +3826,10 @@ bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
-int blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
+int blk_mq_poll(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
 {
 	int ret;
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	hctx->poll_invoked++;
 	ret = q->mq_ops->poll(hctx);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2f3742207df5..b95f2ffa866f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -607,7 +607,8 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
-int blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
+int blk_mq_poll(struct request_queue *q,
+		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
 bool blk_mq_poll_hybrid(struct request_queue *q,
 		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
-- 
2.15.0

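For reference, a minimal sketch (not part of the posted series) of how a
bio-based driver could wire up the new ->iopoll method in
block_device_operations once the above lands; all example_* names are
hypothetical:

```
static blk_qc_t example_submit_bio(struct bio *bio)
{
	/* ...map and issue @bio, returning a pollable cookie... */
	return BLK_QC_T_NONE;
}

static int example_iopoll(struct request_queue *q, blk_qc_t cookie)
{
	/*
	 * Reap completions for the IO identified by @cookie; return the
	 * number of completed entries found (0 if none yet).
	 */
	return 0;
}

static const struct block_device_operations example_fops = {
	.owner		= THIS_MODULE,
	.submit_bio	= example_submit_bio,
	.iopoll		= example_iopoll,
};
```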

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 3/7] block: add iopoll method for non-mq device
  2021-01-08 17:33       ` Mike Snitzer
@ 2021-01-11  7:50         ` JeffleXu
  0 siblings, 0 replies; 26+ messages in thread
From: JeffleXu @ 2021-01-11  7:50 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring



On 1/9/21 1:33 AM, Mike Snitzer wrote:
> On Thu, Jan 07 2021 at 10:24pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
> 
>>
>>
>> On 1/8/21 5:47 AM, Mike Snitzer wrote:
>>> On Wed, Dec 23 2020 at  6:26am -0500,
>>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> ->poll_fn is introduced in commit ea435e1b9392 ("block: add a poll_fn
>>>> callback to struct request_queue") for supporting non-mq queues such as
>>>> nvme multipath, but removed in commit 529262d56dbe ("block: remove
>>>> ->poll_fn").
>>>>
>>>> To add support of IO polling for non-mq devices, this method needs to
>>>> be brought back. Since commit c62b37d96b6e ("block: move ->make_request_fn to
>>>> struct block_device_operations") has moved all callbacks into struct
>>>> block_device_operations in gendisk, we also add the new method named
>>>> ->iopoll in block_device_operations.
>>>
>>> Please update patch subject and header to:
>>>
>>> block: add iopoll method to support bio-based IO polling
>>>
>>> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
>>> callback to struct request_queue") to support bio-based queues such as
>>> nvme multipath, but was later removed in commit 529262d56dbe ("block:
>>> remove ->poll_fn").
>>>
>>> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
>>> block_device_operations") restore the possibility of bio-based IO
>>> polling support by adding an ->iopoll method to gendisk->fops.
>>> Elevate bulk of blk_mq_poll() implementation to blk_poll() and reduce
>>> blk_mq_poll() to blk-mq specific code that is called from blk_poll().
>>>
>>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>> ---
>>>>  block/blk-core.c       | 79 ++++++++++++++++++++++++++++++++++++++++++
>>>>  block/blk-mq.c         | 70 +++++--------------------------------
>>>>  include/linux/blk-mq.h |  3 ++
>>>>  include/linux/blkdev.h |  1 +
>>>>  4 files changed, 92 insertions(+), 61 deletions(-)
>>>>
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 96e5fcd7f071..2f5c51ce32e3 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -1131,6 +1131,85 @@ blk_qc_t submit_bio(struct bio *bio)
>>>>  }
>>>>  EXPORT_SYMBOL(submit_bio);
>>>>  
>>>> +static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
>>>> +{
>>>> +	struct blk_mq_hw_ctx *hctx;
>>>> +
>>>> +	/* TODO: bio-based device doesn't support hybrid poll. */
>>>> +	if (!queue_is_mq(q))
>>>> +		return false;
>>>> +
>>>> +	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>>> +	if (blk_mq_poll_hybrid(q, hctx, cookie))
>>>> +		return true;
>>>> +
>>>> +	hctx->poll_considered++;
>>>> +	return false;
>>>> +}
>>>
>>> I don't see where you ever backfill bio-based hybrid support in
>>> the following patches in this series, so it is a lingering TODO.
>>
>> Yes, hybrid polling is not implemented and is thus bypassed for
>> bio-based devices currently.
>>
>>>
>>>> +
>>>> +/**
>>>> + * blk_poll - poll for IO completions
>>>> + * @q:  the queue
>>>> + * @cookie: cookie passed back at IO submission time
>>>> + * @spin: whether to spin for completions
>>>> + *
>>>> + * Description:
>>>> + *    Poll for completions on the passed in queue. Returns number of
>>>> + *    completed entries found. If @spin is true, then blk_poll will continue
>>>> + *    looping until at least one completion is found, unless the task is
>>>> + *    otherwise marked running (or we need to reschedule).
>>>> + */
>>>> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>>> +{
>>>> +	long state;
>>>> +
>>>> +	if (!blk_qc_t_valid(cookie) ||
>>>> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>>>> +		return 0;
>>>> +
>>>> +	if (current->plug)
>>>> +		blk_flush_plug_list(current->plug, false);
>>>> +
>>>> +	/*
>>>> +	 * If we sleep, have the caller restart the poll loop to reset
>>>> +	 * the state. Like for the other success return cases, the
>>>> +	 * caller is responsible for checking if the IO completed. If
>>>> +	 * the IO isn't complete, we'll get called again and will go
>>>> +	 * straight to the busy poll loop. If specified not to spin,
>>>> +	 * we also should not sleep.
>>>> +	 */
>>>> +	if (spin && blk_poll_hybrid(q, cookie))
>>>> +		return 1;
>>>> +
>>>> +	state = current->state;
>>>> +	do {
>>>> +		int ret;
>>>> +		struct gendisk *disk = queue_to_disk(q);
>>>> +
>>>> +		if (disk->fops->iopoll)
>>>> +			ret = disk->fops->iopoll(q, cookie);
>>>> +		else
>>>> +			ret = blk_mq_poll(q, cookie);
>>
>> The original code is indeed buggy. For a bio-based device, ->iopoll()
>> may not be implemented while the QUEUE_FLAG_POLL flag is still set, in
>> which case blk_mq_poll() would be called for this bio-based device.
> 
> Yes, here is the patch I created to capture my suggestions.  Provided it
> looks good to you, please fold it into patch 3 when you rebase for
> posting a v2 of your patchset:

Thanks, I will merge it into the next version.


Thanks,
Jeffle

> 
> From: Mike Snitzer <snitzer@redhat.com>
> Date: Thu, 7 Jan 2021 20:45:17 -0500
> Subject: [PATCH] fixup patch 3
> 
> ---
>  block/blk-core.c       | 51 +++++++++++++++++++++-----------------------------
>  block/blk-mq.c         |  6 ++----
>  include/linux/blk-mq.h |  3 ++-
>  3 files changed, 25 insertions(+), 35 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e6671f6ce1a4..44f62dc0fa9f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1212,22 +1212,6 @@ int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
>  }
>  EXPORT_SYMBOL(blk_bio_poll);
>  
> -static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
> -{
> -	struct blk_mq_hw_ctx *hctx;
> -
> -	/* TODO: bio-based device doesn't support hybrid poll. */
> -	if (!queue_is_mq(q))
> -		return false;
> -
> -	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> -	if (blk_mq_poll_hybrid(q, hctx, cookie))
> -		return true;
> -
> -	hctx->poll_considered++;
> -	return false;
> -}
> -
>  /**
>   * blk_poll - poll for IO completions
>   * @q:  the queue
> @@ -1243,6 +1227,8 @@ static bool blk_poll_hybrid(struct request_queue *q, blk_qc_t cookie)
>  int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
>  	long state;
> +	struct blk_mq_hw_ctx *hctx = NULL;
> +	struct gendisk *disk = NULL;
>  
>  	if (!blk_qc_t_valid(cookie) ||
>  	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> @@ -1251,26 +1237,31 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	if (current->plug)
>  		blk_flush_plug_list(current->plug, false);
>  
> -	/*
> -	 * If we sleep, have the caller restart the poll loop to reset
> -	 * the state. Like for the other success return cases, the
> -	 * caller is responsible for checking if the IO completed. If
> -	 * the IO isn't complete, we'll get called again and will go
> -	 * straight to the busy poll loop. If specified not to spin,
> -	 * we also should not sleep.
> -	 */
> -	if (spin && blk_poll_hybrid(q, cookie))
> -		return 1;
> +	if (queue_is_mq(q)) {
> +		/*
> +		 * If we sleep, have the caller restart the poll loop to reset
> +		 * the state. Like for the other success return cases, the
> +		 * caller is responsible for checking if the IO completed. If
> +		 * the IO isn't complete, we'll get called again and will go
> +		 * straight to the busy poll loop. If specified not to spin,
> +		 * we also should not sleep.
> +		 */
> +		hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +		if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
> +			return 1;
> +		hctx->poll_considered++;
> +	} else
> +		disk = queue_to_disk(q);
>  
>  	state = current->state;
>  	do {
>  		int ret;
> -		struct gendisk *disk = queue_to_disk(q);
>  
> -		if (disk->fops->iopoll)
> +		if (hctx)
> +			ret = blk_mq_poll(q, hctx, cookie);
> +		else if (disk->fops->iopoll)
>  			ret = disk->fops->iopoll(q, cookie);
> -		else
> -			ret = blk_mq_poll(q, cookie);
> +
>  		if (ret > 0) {
>  			__set_current_state(TASK_RUNNING);
>  			return ret;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index fcb44604f806..90d8dead1da5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3826,12 +3826,10 @@ bool blk_mq_poll_hybrid(struct request_queue *q,
>  	return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> -int blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
> +int blk_mq_poll(struct request_queue *q,
> +		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie)
>  {
>  	int ret;
> -	struct blk_mq_hw_ctx *hctx;
> -
> -	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>  
>  	hctx->poll_invoked++;
>  	ret = q->mq_ops->poll(hctx);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2f3742207df5..b95f2ffa866f 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -607,7 +607,8 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>  }
>  
>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
> -int blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
> +int blk_mq_poll(struct request_queue *q,
> +		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
>  bool blk_mq_poll_hybrid(struct request_queue *q,
>  		struct blk_mq_hw_ctx *hctx, blk_qc_t cookie);
>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-08 17:26       ` Mike Snitzer
@ 2021-01-12  5:46         ` JeffleXu
  2021-01-12 16:13           ` Mike Snitzer
  2021-01-12  7:11         ` [dm-devel] " JeffleXu
  1 sibling, 1 reply; 26+ messages in thread
From: JeffleXu @ 2021-01-12  5:46 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring



On 1/9/21 1:26 AM, Mike Snitzer wrote:
> On Thu, Jan 07 2021 at 10:08pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
> 
>> Thanks for reviewing.
>>
>>
>> On 1/8/21 6:18 AM, Mike Snitzer wrote:
>>> On Wed, Dec 23 2020 at  6:26am -0500,
>>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> This is actually the core part of supporting iopoll for bio-based devices.
>>>>
>>>> A list is maintained in the top bio (the original bio submitted to dm
>>>> device), which is used to maintain all valid cookies of split bios. The
>>>> IO polling routine will actually iterate this list and poll on
>>>> corresponding hardware queues of the underlying mq devices.
>>>>
>>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>
>>> Like I said in response to patch 4 in this series: please fold patch 4
>>> into this patch and _really_ improve this patch header.
>>>
>>> In particular, the (ab)use of bio_inc_remaining() needs to be documented
>>> in this patch header very well.
>>>
>>> But its use could easily be why you're seeing a performance hit (coupled
>>> with the extra spinlock locking and list management used).  Just added
>>> latency and contention across CPUs.
>>
>> Indeed bio_inc_remaining() is abused and the code seems quite hacky
>> here.
>>
>> Actually I'm considering implementing the split bio tracking mechanism
>> in the recursive way you once suggested. That is, the split bios could
>> be maintained in an array, which is allocated with 'struct dm_io'. This
>> way the overhead of the spinlock protecting &root->bi_plist could be
>> avoided. Also the lifetime management may be simplified somewhat. But
>> the block core would then need to fetch the per-bio private data, just
>> as you suggested before.
>>
>> What do you think, Mike?
> 
> Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').

Agreed. Then MD will need some refactoring to support IO polling, if
possible, since as I mentioned in patch 0, MD doesn't allocate an extra
clone bio and just re-uses the original bio structure.


> 
> As for using an array, how would you index the array?  

The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
in struct dm_table as you mentioned. Actually what I mean is to maintain
an array of struct dm_poll_data (or something like that, e.g. just
struct blk_mq_hw_ctx *) in per-bio private data. The size of the array
just equals the number of the target devices.

For example, for the following device stack,

>>
>> Suppose we have the following device stack hierarchy, that is, dm0 is
>> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
>>
>>     dm0
>>     dm1
>> nvme0  nvme1
>>
>>
>> Then the bio graph is like:
>>
>>
>>                                    +------------+
>>                                    |bio0(to dm0)|
>>                                    +------------+
>>                                          ^
>>                                          | orig_bio
>>                                    +--------------------+
>>                                    |struct dm_io A      |
>> +--------------------+ bi_private  ----------------------
>> |bio3(to dm1)        |------------>|bio1(to dm1)        |
>> +--------------------+             +--------------------+
>>         ^                                ^
>>         | ->orig_bio                     | ->orig_bio
>> +--------------------+             +--------------------+
>> |struct dm_io        |             |struct dm_io B      |
>> ----------------------             ----------------------
>> |bio2(to nvme0)      |             |bio4(to nvme1)      |
>> +--------------------+             +--------------------+
>>

An array of struct blk_mq_hw_ctx * is maintained in struct dm_io B.


struct blk_mq_hw_ctx * hctxs[2];

The array size is two since dm1 maps to two target devices (i.e. nvme0
and nvme1). Then hctxs[0] points to the hw queue of nvme0, while
hctxs[1] points to the hw queue of nvme1.


This mechanism supports arbitrary device stacking. Similarly, an array
of struct blk_mq_hw_ctx * is maintained in struct dm_io A. The array
size is one since dm0 only maps to one target device (i.e. dm1). In this
case, hctx[0] points to the struct dm_io of the next level, i.e. struct
dm_io B.

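Roughly, the per-bio layout I have in mind would look something like
this (field names are illustrative, and how the slots get filled at
split time is elided):

```
struct dm_io {
	/* ... existing dm_io fields ... */
	unsigned int nr_targets;	/* targets this level maps to */
	/*
	 * One slot per underlying target device. For a leaf level these
	 * point at blk-mq hw queues (e.g. [0] for nvme0 and [1] for
	 * nvme1 in dm_io B above). For a stacked level the slot would
	 * instead refer to the next level's dm_io, so the pointer type
	 * here is only illustrative.
	 */
	struct blk_mq_hw_ctx *hctxs[];
};
```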

But I'm afraid the implementation of this style may be more complex.


>> struct node {
>>     struct blk_mq_hw_ctx *hctx;
>>     blk_qc_t cookie;
>> };
> 
> Needs a better name, think I had 'struct dm_poll_data'

Sure, the name here is just for example.


>  
>> Actually currently the tracking objects are all allocated with 'struct
>> bio', then the lifetime management of the tracking objects is actually
>> equivalent to lifetime management of bio. Since the returned cookie is
>> actually a pointer to the bio, the refcount of this bio must be
>> incremented, since we release a reference to this bio through the
>> returned cookie, in which case the abuse of the refcount trick seems
>> unavoidable? Unless we allocate the tracking object individually, then
>> the returned cookie is actually pointing to the tracking object, and the
>> refcount is individually maintained for the tracking object.
> 
> The refcounting and lifetime of the per-bio-data should all work as is.
> Would hope you can avoid extra bio_inc_remaining().. that infrastructure
> is way too tightly coupled to bio_chain()'ing, etc.
> 
> The challenge you have is the array that would point at these various
> per-bio-data needs to be rooted somewhere (you put it in the topmost
> original bio with the current patchset).  But why not manage that as
> part of 'struct mapped_device'?  It'd need proper management at DM table
> reload boundaries and such but it seems like the most logical place to
> put the array.  But again, this array needs to be dynamic.. so thinking
> further, maybe a better model would be to have a fixed array in 'struct
> dm_table' for each hctx associated with a blk_mq _data_ device directly
> used/managed by that dm_table?

It seems that you are referring to the 'array' here as an array of
'struct blk_mq_hw_ctx *'? Such as

struct dm_table {
    ...
    struct blk_mq_hw_ctx *hctxs[];
};

Certainly with this we can replace the original 'struct blk_mq_hw_ctx *'
pointer in 'struct dm_poll_data' with the index into this array, such as

struct dm_poll_data {
     int hctx_index; /* index into dm_table->hctxs[] */
     blk_qc_t cookie;
};


But I doubt this makes much sense. The core difficulty here is
maintaining a list (or dynamically sized array) to track all split bios.
With the array of 'struct blk_mq_hw_ctx *' maintained in struct
dm_table, we still need some **per-bio** structure (e.g., &bio->bi_plist
in the current patch set) to track these split bios.




-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-08 17:26       ` Mike Snitzer
  2021-01-12  5:46         ` [dm-devel] " JeffleXu
@ 2021-01-12  7:11         ` JeffleXu
  1 sibling, 0 replies; 26+ messages in thread
From: JeffleXu @ 2021-01-12  7:11 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring

Actually I'm rethinking whether the first RFC version [1] I implemented
could work after all.

Indeed, iterating over all hw queues of all target devices may not
perform well. But since we would now have an array of struct
blk_mq_hw_ctx * in struct dm_table, we could also maintain a bitmap or
something similar to track whether polling bios are currently enqueued
on these hw queues.

```
struct dm_table {
    ...
    struct blk_mq_hw_ctx *hctxs[];
};
```

Then when polling for one specific bio, we could iterate over and poll
all hw queues flagged as pending (i.e., those with polling bios
enqueued).


However this may also increase contention. Consider the following
device stack:

```
       dm 0
nvme0  nvme1 nvme2
```

If bio 0 is actually enqueued on nvme0, and bio 1 is enqueued on nvme1,
then polling for bio 0 will poll both nvme0 and nvme1, and the same goes
for bio 1, which increases contention.

Then we could maintain a second bitmap tracking whether a polling
instance is currently polling each hw queue. A polling instance would
skip hw queues flagged as busy (i.e., those with another polling
instance currently polling them).


This way we wouldn't need to implement the complex split-bio tracking
mechanism anymore.

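To make this concrete, a rough sketch of what I mean (the bitmap fields
and dm_table_poll() are hypothetical, and allocation/lifetime details
are elided):

```
struct dm_table {
	/* ... */
	unsigned int nr_hctxs;
	struct blk_mq_hw_ctx **hctxs;	/* hw queues of underlying devs */
	unsigned long *pending;		/* bit set: polling bios queued */
	unsigned long *busy;		/* bit set: a poller owns it */
};

static int dm_table_poll(struct dm_table *t)
{
	unsigned int i;
	int ret = 0;

	for_each_set_bit(i, t->pending, t->nr_hctxs) {
		/* Skip hw queues another polling instance is already on. */
		if (test_and_set_bit(i, t->busy))
			continue;
		/* cookie is unused by blk_mq_poll() after the earlier fixup */
		ret += blk_mq_poll(t->hctxs[i]->queue, t->hctxs[i], 0);
		clear_bit(i, t->busy);
	}
	return ret;
}
```
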
How about this approach, Mike?


[1] https://www.spinics.net/lists/dm-devel/msg43307.html


On 1/9/21 1:26 AM, Mike Snitzer wrote:
> On Thu, Jan 07 2021 at 10:08pm -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
> 
>> Thanks for reviewing.
>>
>>
>> On 1/8/21 6:18 AM, Mike Snitzer wrote:
>>> On Wed, Dec 23 2020 at  6:26am -0500,
>>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> This is actually the core part of supporting iopoll for bio-based devices.
>>>>
>>>> A list is maintained in the top bio (the original bio submitted to dm
>>>> device), which is used to maintain all valid cookies of split bios. The
>>>> IO polling routine will actually iterate this list and poll on
>>>> corresponding hardware queues of the underlying mq devices.
>>>>
>>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>
>>> Like I said in response to patch 4 in this series: please fold patch 4
>>> into this patch and _really_ improve this patch header.
>>>
>>> In particular, the (ab)use of bio_inc_remaining() needs to be documented
>>> in this patch header very well.
>>>
>>> But its use could easily be why you're seeing a performance hit (coupled
>>> with the extra spinlock locking and list management used).  Just added
>>> latency and contention across CPUs.
>>
>> Indeed bio_inc_remaining() is abused and the code seems quite hacky
>> here.
>>
>> Actually I'm considering implementing the split bio tracking mechanism
>> in the recursive way you once suggested. That is, the split bios could
>> be maintained in an array, which is allocated with 'struct dm_io'. This
>> way the overhead of the spinlock protecting &root->bi_plist could be
>> avoided. Also the lifetime management may be simplified somewhat. But
>> the block core would then need to fetch the per-bio private data, just
>> as you suggested before.
>>
>> What do you think, Mike?
> 
> Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
> 
> As for using an array, how would you index the array?  blk-mq is able to
> use an array (with cookie to hctx index translation) because there are a
> finite number of fixed hctx for the life of the device.  But with
> stacked bio-based DM devices, each DM table associated with a DM device
> can change via table reload.  Any reloads should flush outstanding IO,
> but there are cases where no flushing occurs (e.g. dm-multipath when no
> paths are available, _but_ in that instance, there wouldn't be any
> mapping that results in a blk-mq hctx endpoint).
> 
> All the DM edge cases aside, you need to ensure that the per-bio-data
> that holds the 'struct node' (which you correctly detailed needing
> below) doesn't somehow get used _after_ the hctx and/or cookie are no
> longer valid.  So to start we'll need some BUG_ON() to validate the
> lifetime is correct.
> 
>> Besides the lifetime management is quite annoying to me. As long as the
>> tracking object representing a valid split bio) is dynamically
>> allocated, no matter it's embedded directly in 'struct bio' (in this
>> patch), or allocated with 'struct dm_io', the lifetime management of the
>> tracking object comes in. Here the so called tracking object is
>> something like
>>
>> struct node {
>>     struct blk_mq_hw_ctx *hctx;
>>     blk_qc_t cookie;
>> };
> 
> Needs a better name, think I had 'struct dm_poll_data'
>  
>> Actually currently the tracking objects are all allocated with 'struct
>> bio', then the lifetime management of the tracking objects is actually
>> equivalent to lifetime management of bio. Since the returned cookie is
>> actually a pointer to the bio, the refcount of this bio must be
>> incremented, since we release a reference to this bio through the
>> returned cookie, in which case the abuse of the refcount trick seems
>> unavoidable? Unless we allocate the tracking object individually, then
>> the returned cookie is actually pointing to the tracking object, and the
>> refcount is individually maintained for the tracking object.
> 
> The refcounting and lifetime of the per-bio-data should all work as is.
> Would hope you can avoid extra bio_inc_remaining().. that infrastructure
> is way too tightly coupled to bio_chain()'ing, etc.
> 
> The challenge you have is the array that would point at these various
> per-bio-data needs to be rooted somewhere (you put it in the topmost
> original bio with the current patchset).  But why not manage that as
> part of 'struct mapped_device'?  It'd need proper management at DM table
> reload boundaries and such but it seems like the most logical place to
> put the array.  But again, this array needs to be dynamic.. so thinking
> further, maybe a better model would be to have a fixed array in 'struct
> dm_table' for each hctx associated with a blk_mq _data_ device directly
> used/managed by that dm_table?
> 
> And ideally, access to these arrays should be as lockless as possible
> (rcu, or whatever) so that scaling to many cpus isn't a problem.
> 
>>>> ---
>>>>  block/bio.c               |  8 ++++
>>>>  block/blk-core.c          | 84 ++++++++++++++++++++++++++++++++++++++-
>>>>  include/linux/blk_types.h | 39 ++++++++++++++++++
>>>>  3 files changed, 129 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index 1f2cc1fbe283..ca6d1a7ee196 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -284,6 +284,10 @@ void bio_init(struct bio *bio, struct bio_vec *table,
>>>>  
>>>>  	bio->bi_io_vec = table;
>>>>  	bio->bi_max_vecs = max_vecs;
>>>> +
>>>> +	INIT_LIST_HEAD(&bio->bi_plist);
>>>> +	INIT_LIST_HEAD(&bio->bi_pnode);
>>>> +	spin_lock_init(&bio->bi_plock);
>>>>  }
>>>>  EXPORT_SYMBOL(bio_init);
>>>>  
>>>> @@ -689,6 +693,7 @@ void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
>>>>  	bio->bi_write_hint = bio_src->bi_write_hint;
>>>>  	bio->bi_iter = bio_src->bi_iter;
>>>>  	bio->bi_io_vec = bio_src->bi_io_vec;
>>>> +	bio->bi_root = bio_src->bi_root;
>>>>  
>>>>  	bio_clone_blkg_association(bio, bio_src);
>>>>  	blkcg_bio_issue_init(bio);
>>>> @@ -1425,6 +1430,8 @@ void bio_endio(struct bio *bio)
>>>>  	if (bio->bi_disk)
>>>>  		rq_qos_done_bio(bio->bi_disk->queue, bio);
>>>>  
>>>> +	bio_del_poll_list(bio);
>>>> +
>>>>  	/*
>>>>  	 * Need to have a real endio function for chained bios, otherwise
>>>>  	 * various corner cases will break (like stacking block devices that
>>>> @@ -1446,6 +1453,7 @@ void bio_endio(struct bio *bio)
>>>>  	blk_throtl_bio_endio(bio);
>>>>  	/* release cgroup info */
>>>>  	bio_uninit(bio);
>>>> +
>>>>  	if (bio->bi_end_io)
>>>>  		bio->bi_end_io(bio);
>>>>  }
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index 2f5c51ce32e3..5a332af01939 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -960,12 +960,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  {
>>>>  	struct bio_list bio_list_on_stack[2];
>>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>>> +	bool iopoll;
>>>> +	struct bio *root;
>>>>  
>>>>  	BUG_ON(bio->bi_next);
>>>>  
>>>>  	bio_list_init(&bio_list_on_stack[0]);
>>>>  	current->bio_list = bio_list_on_stack;
>>>>  
>>>> +	iopoll = test_bit(QUEUE_FLAG_POLL, &bio->bi_disk->queue->queue_flags);
>>>> +	iopoll = iopoll && (bio->bi_opf & REQ_HIPRI);
>>>> +
>>>> +	if (iopoll) {
>>>> +		bio->bi_root = root = bio;
>>>> +		/*
>>>> +		 * We need to pin root bio here since there's a reference from
>>>> +		 * the returned cookie. bio_get() is not enough since the whole
>>>> +		 * bio and the corresponding kiocb/dio may have already
>>>> +		 * completed and thus won't call blk_poll() at all, in which
>>>> +		 * case the pairing bio_put() in blk_bio_poll() won't be called.
>>>> +		 * The side effect of bio_inc_remaining() is that, the whole bio
>>>> +		 * won't complete until blk_poll() called.
>>>> +		 */
>>>> +		bio_inc_remaining(root);
>>>> +	}
>>>> +
>>>>  	do {
>>>>  		struct request_queue *q = bio->bi_disk->queue;
>>>>  		struct bio_list lower, same;
>>>> @@ -979,7 +998,18 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>>  
>>>> -		ret = __submit_bio(bio);
>>>> +		if (iopoll) {
>>>> +			/* See the comments of above bio_inc_remaining(). */
>>>> +			bio_inc_remaining(bio);
>>>> +			bio->bi_cookie = __submit_bio(bio);
>>>> +
>>>> +			if (blk_qc_t_valid(bio->bi_cookie))
>>>> +				bio_add_poll_list(bio);
>>>> +
>>>> +			bio_endio(bio);
>>>> +		} else {
>>>> +			ret = __submit_bio(bio);
>>>> +		}
>>>>  
>>>>  		/*
>>>>  		 * Sort new bios into those for a lower level and those for the
>>>> @@ -1002,7 +1032,11 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>>>>  
>>>>  	current->bio_list = NULL;
>>>> -	return ret;
>>>> +
>>>> +	if (iopoll)
>>>> +		return (blk_qc_t)root;
>>>> +
>>>> +	return BLK_QC_T_NONE;
>>>>  }
>>>>  
>>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>> @@ -1131,6 +1165,52 @@ blk_qc_t submit_bio(struct bio *bio)
>>>>  }
>>>>  EXPORT_SYMBOL(submit_bio);
>>>>  
>>>> +int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
>>>> +{
>>>> +	int ret = 0;
>>>> +	struct bio *bio, *root = (struct bio*)cookie;
>>>> +
>>>> +	if (list_empty(&root->bi_plist)) {
>>>> +		bio_endio(root);
>>>> +		return 1;
>>>> +	}
>>>> +
>>>> +	spin_lock(&root->bi_plock);
>>>> +	bio = list_first_entry_or_null(&root->bi_plist, struct bio, bi_pnode);
>>>> +
>>>> +	while (bio) {
>>>> +		struct request_queue *q = bio->bi_disk->queue;
>>>> +		blk_qc_t cookie = bio->bi_cookie;
>>>> +
>>>> +		spin_unlock(&root->bi_plock);
>>>> +		BUG_ON(!blk_qc_t_valid(cookie));
>>>> +
>>>> +		ret += blk_mq_poll(q, cookie);
>>>
>>> Not yet clear to me how you _know_ this q is blk-mq...
>>> What about a deep stack of bio-based DM devices?
>>>
>>
>> This design works in arbitrary bio-based DM stacking.
>>
>>
>>> Or are you confining bio-based DM IO polling support to bio-based
>>> stacked directly on blk-mq? (patch 7 likely shows that to be the case).
>>>
>>
>> patch 7 works in arbitrary bio-based DM stacking. Please see the reply
>> for patch 7 for details.
> 
> OK, I see.  Definitely need to capture that aspect of the design in the
> relevant patch header(s).
> 
> And likely a block comment in blk_bio_poll().
> 
>>> If so, I'm not liking that at all.  So if your implementation doesn't
>>> support arbitrary bio-based IO stacks then this bio-based IO polling
>>> support really has limited utility still.
>>>
>>> Helps explain how you got away with having bio-based DM always returning
>>> BLK_QC_T_NONE in patch 5 though... feels far too simplistic.  Patch 5+6
>>> together are implicitly ignoring the complexity that results from
>>> arbitrary bio-based DM stacking.
>>>
>>> Or am I missing something?
>>
>> The magic is in patch 5. Bios submitted directly to a DM device won't
>> be enqueued into the &root->bi_plist list, since as of patch 5 all bios
>> submitted directly to a DM device return BLK_QC_T_NONE, and
>> __submit_bio_noacct() only enqueues split bios with a valid cookie into
>> the &root->bi_plist list. Thus only bios submitted to an mq device will
>> be enqueued into the &root->bi_plist list.
>>
>> Following is the related logic (the blk_qc_t_valid() part).
>>
>>
>>>> -		ret = __submit_bio(bio);
>>>> +		if (iopoll) {
>>>> +			/* See the comments of above bio_inc_remaining(). */
>>>> +			bio_inc_remaining(bio);
>>>> +			bio->bi_cookie = __submit_bio(bio);
>>>> +
>>>> +			if (blk_qc_t_valid(bio->bi_cookie))
>>>> +				bio_add_poll_list(bio);
>>>> +
>>>> +			bio_endio(bio);
>>>> +		} else {
>>>> +			ret = __submit_bio(bio);
>>>> +		}
>>
>>
>>
>> Suppose we have the following device stack hierarchy, that is, dm0 is
>> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
>>
>>     dm0
>>     dm1
>> nvme0  nvme1
>>
>>
>> Then the bio graph is like:
>>
>>
>>                                    +------------+
>>                                    |bio0(to dm0)|
>>                                    +------------+
>>                                          ^
>>                                          | orig_bio
>>                                    +--------------------+
>>                                    |struct dm_io of bio0|
>> +--------------------+ bi_private  ----------------------
>> |bio3(to dm1)        |------------>|bio1(to dm1)        |
>> +--------------------+             +--------------------+
>>         ^                                ^
>>         | ->orig_bio                     | ->orig_bio
>> +--------------------+             +--------------------+
>> |struct dm_io        |             |struct dm_io        |
>> ----------------------             ----------------------
>> |bio2(to nvme0)      |             |bio4(to nvme1)      |
>> +--------------------+             +--------------------+
>>
>> In this way, bio 0/1/3 will return BLK_QC_T_NONE and won't be enqueued
>> into the &root->bi_plist list, while bio 2/4 will be enqueued if they
>> return a valid cookie.
> 
> Yes, useful insight, thanks.
> 
> Mike
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-12  5:46         ` [dm-devel] " JeffleXu
@ 2021-01-12 16:13           ` Mike Snitzer
  2021-01-14  9:16             ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Snitzer @ 2021-01-12 16:13 UTC (permalink / raw)
  To: JeffleXu; +Cc: linux-block, dm-devel, io-uring

On Tue, Jan 12 2021 at 12:46am -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> 
> On 1/9/21 1:26 AM, Mike Snitzer wrote:
> > On Thu, Jan 07 2021 at 10:08pm -0500,
> > JeffleXu <jefflexu@linux.alibaba.com> wrote:
> > 
> >> Thanks for reviewing.
> >>
> >>
> >> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> >>> On Wed, Dec 23 2020 at  6:26am -0500,
> >>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>>> This is actually the core part of supporting iopoll for bio-based devices.
> >>>>
> >>>> A list is maintained in the top bio (the original bio submitted to dm
> >>>> device), which is used to maintain all valid cookies of split bios. The
> >>>> IO polling routine will actually iterate this list and poll on
> >>>> corresponding hardware queues of the underlying mq devices.
> >>>>
> >>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >>>
> >>> Like I said in response to patch 4 in this series: please fold patch 4
> >>> into this patch and _really_ improve this patch header.
> >>>
> >>> In particular, the (ab)use of bio_inc_remaining() needs to be documented
> >>> in this patch header very well.
> >>>
> >>> But its use could easily be why you're seeing a performance hit (coupled
> >>> with the extra spinlock locking and list management used).  Just added
> >>> latency and contention across CPUs.
> >>
> >> Indeed bio_inc_remaining() is abused and the code seems quite hacky
> >> here.
> >>
> >> Actually I'm considering implementing the split bio tracking mechanism
> >> in the recursive way you once suggested. That is, the split bios could
> >> be maintained in an array, which is allocated with 'struct dm_io'. This
> >> way the overhead of the spinlock protecting &root->bi_plist could be
> >> avoided. Also the lifetime management may be simplified somewhat. But
> >> the block core would then need to fetch the per-bio private data, just
> >> as you suggested before.
> >>
> >> What do you think, Mike?
> > 
> > Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
> 
> Agreed. Then MD will need some refactoring to support IO polling, if
> possible, since as I mentioned in patch 0, MD doesn't allocate an extra
> clone bio and just re-uses the original bio structure.
> 
> 
> > 
> > As for using an array, how would you index the array?  
> 
> The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
> in struct dm_table as you mentioned. Actually what I mean is to maintain
> an array of struct dm_poll_data (or something like that, e.g. just
> struct blk_mq_hw_ctx *) in per-bio private data. The size of the array
> just equals the number of the target devices.
> 
> For example, for the following device stack,
> 
> >>
> >> Suppose we have the following device stack hierarchy, that is, dm0 is
> >> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
> >>
> >>     dm0
> >>     dm1
> >> nvme0  nvme1
> >>
> >>
> >> Then the bio graph is like:
> >>
> >>
> >>                                    +------------+
> >>                                    |bio0(to dm0)|
> >>                                    +------------+
> >>                                          ^
> >>                                          | orig_bio
> >>                                    +--------------------+
> >>                                    |struct dm_io A      |
> >> +--------------------+ bi_private  ----------------------
> >> |bio3(to dm1)        |------------>|bio1(to dm1)        |
> >> +--------------------+             +--------------------+
> >>         ^                                ^
> >>         | ->orig_bio                     | ->orig_bio
> >> +--------------------+             +--------------------+
> >> |struct dm_io        |             |struct dm_io B      |
> >> ----------------------             ----------------------
> >> |bio2(to nvme0)      |             |bio4(to nvme1)      |
> >> +--------------------+             +--------------------+
> >>
> 
> An array of struct blk_mq_hw_ctx * is maintained in struct dm_io B.
> 
> 
> struct blk_mq_hw_ctx * hctxs[2];
> 
> The array size is two since dm1 maps to two target devices (i.e. nvme0
> and nvme1). Then hctxs[0] points to the hw queue of nvme0, while
> hctxs[1] points to the hw queue of nvme1.

Both nvme0 and nvme1 may have multiple hctxs.  Not sure why you're
thinking there is just one per device?

> 
> 
> This mechanism supports arbitrary device stacking. Similarly, an array
> of struct blk_mq_hw_ctx * is maintained in struct dm_io A. The array
> size is one since dm0 only maps to one target device (i.e. dm1). In this
> case, hctx[0] points to the struct dm_io of the next level, i.e. struct
> dm_io B.
> 
> 
> But I'm afraid the implementation of this style may be more complex.

We are running the risk of talking in circles about this design...


> >> struct node {
> >>     struct blk_mq_hw_ctx *hctx;
> >>     blk_qc_t cookie;
> >> };
> > 
> > Needs a better name, think I had 'struct dm_poll_data'
> 
> Sure, the name here is just for example.
> 
> 
> >  
> >> Actually currently the tracking objects are all allocated with 'struct
> >> bio', then the lifetime management of the tracking objects is actually
> >> equivalent to lifetime management of bio. Since the returned cookie is
> >> actually a pointer to the bio, the refcount of this bio must be
> >> incremented, since we release a reference to this bio through the
> >> returned cookie, in which case the abuse of the refcount trick seems
> >> unavoidable? Unless we allocate the tracking object individually, then
> >> the returned cookie is actually pointing to the tracking object, and the
> >> refcount is individually maintained for the tracking object.
> > 
> > The refcounting and lifetime of the per-bio-data should all work as is.
> > Would hope you can avoid extra bio_inc_remaining().. that infrastructure
> > is way too tightly coupled to bio_chain()'ing, etc.
> > 
> > The challenge you have is the array that would point at these various
> > per-bio-data needs to be rooted somewhere (you put it in the topmost
> > original bio with the current patchset).  But why not manage that as
> > part of 'struct mapped_device'?  It'd need proper management at DM table
> > reload boundaries and such but it seems like the most logical place to
> > put the array.  But again, this array needs to be dynamic.. so thinking
> > further, maybe a better model would be to have a fixed array in 'struct
> > dm_table' for each hctx associated with a blk_mq _data_ device directly
> > used/managed by that dm_table?
> 
> It seems that you are referring to the 'array' here as an array of
> 'struct blk_mq_hw_ctx *'? Such as
> 
> struct dm_table {
>     ...
>     struct blk_mq_hw_ctx *hctxs[];
> };
> 
> Certainly with this we can replace the original 'struct blk_mq_hw_ctx *'
> pointer in 'struct dm_poll_data' with the index into this array, such as
> 
> struct dm_poll_data {
>      int hctx_index; /* index into dm_table->hctxs[] */
>      blk_qc_t cookie;
> };

You seized on my mentioning blk-mq's array of hctx too literally.  I was
illustrating that blk-mq's cookie is converted to an index into that
array.

But for this DM bio-polling application we'd need to map the blk-mq
returned cookie to a request_queue.  Hence the original 2 members of
dm_poll_data needing to be 'struct request_queue *' and blk_qc_t.

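Spelled out, a sketch of that per-bio-data member (field names are
illustrative):

```
/*
 * Lives in DM per-bio-data; a pointer to this is what would be handed
 * back as the blk_qc_t cookie and stored in kiocb->ki_cookie.
 */
struct dm_poll_data {
	struct request_queue *queue;	/* underlying blk-mq queue */
	blk_qc_t cookie;		/* cookie that queue returned */
};
```
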
> But I doubt this makes much sense. The core difficulty here is
> maintaining a list (or dynamically sized array) to track all split bios.
> With the array of 'struct blk_mq_hw_ctx *' maintained in struct
> dm_table, we still need some **per-bio** structure (e.g., &bio->bi_plist
> in the current patch set) to track these split bios.

One primary goal of all of this design is to achieve bio-polling cleanly
(without extra locking, without block core data structure bloat, etc).
I know you share that goal.  But we need to nail down the core data
structures and what needs tracking at scale and then associate them with
DM's associated objects with consideration for object lifetime.

My suggestion was to anchor your core data structures (e.g. 'struct
dm_poll_data' array, etc) to 'struct dm_table'.  I suggested that
because the dm_table is what dm_get_device()s each underlying _data_
device (a subset of all devices in a dm_table, as iterated through
.iterate_devices).  But a DM 'struct mapped_device' has 2 potential
dm_tables, active and inactive slots, that would imply some complexity
in handing off any outstanding bio's associated 'struct dm_poll_data'
array on DM table reload.

Anyway, you seem to be gravitating to a more simplistic approach of a
single array of 'struct dm_poll_data' for each DM device (regardless of
how arbitrarily deep that DM device stack is, the topmost DM device
would accumulate the list of 'struct dm_poll_data'?).

I'm now questioning the need for any high-level data structure to track
all N of the 'struct dm_poll_data' that may result from a given bio (as
it is split to multiple blk-mq hctxs across multiple blk-mq devices).
Each 'struct dm_poll_data', that will be returned to block core and
stored in struct kiocb's ki_cookie, would have an object lifetime that
matches the original DM bio clone's per-bio-data that the 'struct
dm_poll_data' was part of; then we just need to cast that ki_cookie's
blk_qc_t as 'struct dm_poll_data' and call blk_poll().

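A rough sketch of that cast-and-poll step (the helper name is made up;
this assumes the cookie handed back to block core is a pointer to the
per-bio-data 'struct dm_poll_data' sketched above, following the same
pointer-as-cookie convention the current patchset already uses):

```
static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
{
	struct dm_poll_data *pd = (struct dm_poll_data *)cookie;

	/*
	 * @q is the DM queue; what we actually poll is the underlying
	 * blk-mq queue, with the cookie it returned at submission.
	 */
	return blk_poll(pd->queue, pd->cookie, true);
}
```
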
The hardest part is to ensure that all the disparate 'struct
dm_poll_data' (and associated clone bios) aren't free'd until the
_original_ bio completes.  That would create quite some back-pressure
with more potential to exhaust system resources -- because then the
cataylst for dropping reference counts on these clone bios would then
need to be tied to the blk_bio_poll() interface... which feels "wrong"
(e.g. it ushers in the (ab)use of bio_inc_remaining you had in your most
recent patchset).

All said, maybe post a v2 that takes the incremental steps of:
1) using DM per-bio-data for 'struct dm_poll_data'
2) simplify blk_bio_poll() to call into DM to translate provided
   blk_qc_t (from struct kiocb's ki_cookie) to request_queue and
   blk_qc_t.
   - this eliminates any need for extra list processing
3) keep your (ab)use of bio_inc_remaining() to allow for exploring this

?

Thanks,
Mike


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-12 16:13           ` Mike Snitzer
@ 2021-01-14  9:16             ` JeffleXu
  2021-01-14 14:30               ` Mike Snitzer
  0 siblings, 1 reply; 26+ messages in thread
From: JeffleXu @ 2021-01-14  9:16 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: linux-block, dm-devel, io-uring



On 1/13/21 12:13 AM, Mike Snitzer wrote:
> On Tue, Jan 12 2021 at 12:46am -0500,
> JeffleXu <jefflexu@linux.alibaba.com> wrote:
> 
>>
>>
>> On 1/9/21 1:26 AM, Mike Snitzer wrote:
>>> On Thu, Jan 07 2021 at 10:08pm -0500,
>>> JeffleXu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> Thanks for reviewing.
>>>>
>>>>
>>>> On 1/8/21 6:18 AM, Mike Snitzer wrote:
>>>>> On Wed, Dec 23 2020 at  6:26am -0500,
>>>>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>>> This is actually the core part of supporting iopoll for bio-based devices.
>>>>>>
>>>>>> A list is maintained in the top bio (the original bio submitted to dm
>>>>>> device), which is used to maintain all valid cookies of split bios. The
>>>>>> IO polling routine will actually iterate this list and poll on
>>>>>> corresponding hardware queues of the underlying mq devices.
>>>>>>
>>>>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>>>>
>>>>> Like I said in response to patch 4 in this series: please fold patch 4
>>>>> into this patch and _really_ improve this patch header.
>>>>>
>>>>> In particular, the (ab)use of bio_inc_remaining() needs to be documented
>>>>> in this patch header very well.
>>>>>
>>>>> But its use could easily be why you're seeing a performance hit (coupled
>>>>> with the extra spinlock locking and list management used).  Just added
>>>>> latency and contention across CPUs.
>>>>
>>>> Indeed bio_inc_remaining() is abused and the code seems quite hacky
>>>> here.
>>>>
>>>> Actually I'm considering implementing the split bio tracking mechanism
>>>> in the recursive way you once suggested. That is, the split bios could
>>>> be maintained in an array, which is allocated with 'struct dm_io'. This
>>>> way the overhead of the spinlock protecting &root->bi_plist could be
>>>> avoided. Also the lifetime management may be simplified somewhat. But
>>>> the block core would then need to fetch the per-bio private data, just
>>>> as you suggested before.
>>>>
>>>> What do you think, Mike?
>>>
>>> Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
>>
>> Agreed. Then MD will need some refactoring to support IO polling, if
>> possible, since as I mentioned in patch 0, MD doesn't allocate an extra
>> clone bio and just re-uses the original bio structure.
>>
>>
>>>
>>> As for using an array, how would you index the array?  
>>
>> The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
>> in struct dm_table as you mentioned. Actually what I mean is to maintain
>> an array of struct dm_poll_data (or something like that, e.g. just
>> struct blk_mq_hw_ctx *) in per-bio private data. The size of the array
>> just equals the number of the target devices.
>>
>> For example, for the following device stack,
>>
>>>>
>>>> Suppose we have the following device stack hierarchy, that is, dm0 is
>>>> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
>>>>
>>>>     dm0
>>>>     dm1
>>>> nvme0  nvme1
>>>>
>>>>
>>>> Then the bio graph is like:
>>>>
>>>>
>>>>                                    +------------+
>>>>                                    |bio0(to dm0)|
>>>>                                    +------------+
>>>>                                          ^
>>>>                                          | orig_bio
>>>>                                    +--------------------+
>>>>                                    |struct dm_io A      |
>>>> +--------------------+ bi_private  ----------------------
>>>> |bio3(to dm1)        |------------>|bio1(to dm1)        |
>>>> +--------------------+             +--------------------+
>>>>         ^                                ^
>>>>         | ->orig_bio                     | ->orig_bio
>>>> +--------------------+             +--------------------+
>>>> |struct dm_io        |             |struct dm_io B      |
>>>> ----------------------             ----------------------
>>>> |bio2(to nvme0)      |             |bio4(to nvme1)      |
>>>> +--------------------+             +--------------------+
>>>>
>>
>> An array of struct blk_mq_hw_ctx * is maintained in struct dm_io B.
>>
>>
>> struct blk_mq_hw_ctx * hctxs[2];
>>
>> The array size is two since dm1 maps to two target devices (i.e. nvme0
>> and nvme1). Then hctxs[0] points to the hw queue of nvme0, while
>> hctxs[1] points to the hw queue of nvme1.
> 
> Both nvme0 and nvme1 may have multiple hctxs.  Not sure why you're
> thinking there is just one per device?
> 
>>
>>
>> This mechanism supports arbitrary device stacking. Similarly, an array
>> of struct blk_mq_hw_ctx * is maintained in struct dm_io A. The array
>> size is one since dm0 only maps to one target device (i.e. dm1). In this
>> case, hctx[0] points to the struct dm_io of the next level, i.e. struct
>> dm_io B.
>>
>>
>> But I'm afraid the implementation of this style may be more complex.
> 
> We are running the risk of talking in circles about this design...

Sorry for the inconvenience. I have started working on the next version,
but I do want to clarify some design issues first.

> 
> 
>>>> struct node {
>>>>     struct blk_mq_hw_ctx *hctx;
>>>>     blk_qc_t cookie;
>>>> };
>>>
>>> Needs a better name, think I had 'struct dm_poll_data'
>>
>> Sure, the name here is just for example.
>>
>>
>>>  
>>>> Actually currently the tracking objects are all allocated with 'struct
>>>> bio', then the lifetime management of the tracking objects is actually
>>>> equivalent to lifetime management of bio. Since the returned cookie is
>>>> actually a pointer to the bio, the refcount of this bio must be
>>>> incremented, since we release a reference to this bio through the
>>>> returned cookie, in which case the abuse of the refcount trick seems
>>>> unavoidable? Unless we allocate the tracking object individually, then
>>>> the returned cookie is actually pointing to the tracking object, and the
>>>> refcount is individually maintained for the tracking object.
>>>
>>> The refcounting and lifetime of the per-bio-data should all work as is.
>>> Would hope you can avoid extra bio_inc_remaining().. that infrastructure
>>> is way too tightly coupled to bio_chain()'ing, etc.
>>>
>>> The challenge you have is the array that would point at these various
>>> per-bio-data needs to be rooted somewhere (you put it in the topmost
>>> original bio with the current patchset).  But why not manage that as
>>> part of 'struct mapped_device'?  It'd need proper management at DM table
>>> reload boundaries and such but it seems like the most logical place to
>>> put the array.  But again, this array needs to be dynamic.. so thinking
>>> further, maybe a better model would be to have a fixed array in 'struct
>>> dm_table' for each hctx associated with a blk_mq _data_ device directly
>>> used/managed by that dm_table?
>>

My confusion is also stated in the following comment: how could 'struct
dm_poll_data' be tied to 'struct dm_table' or 'struct mapped_device'?
In the current patchset, every bio needs to maintain one list to track
all its 'struct dm_poll_data' structures. So how could this per-bio
information be maintained in one single 'struct dm_table' or 'struct
mapped_device'?


>> It seems that you are referring to the 'array' here as an array of
>> 'struct blk_mq_hw_ctx *'? Such as
>>
>> struct dm_table {
>>     ...
>>     struct blk_mq_hw_ctx *hctxs[];
>> };
>>
>> Certainly with this we can replace the original 'struct blk_mq_hw_ctx *'
>> pointer in 'struct dm_poll_data' with the index into this array, such as
>>
>> struct dm_poll_data {
>>      int hctx_index; /* index into dm_table->hctxs[] */
>>      blk_qc_t cookie;
>> };
> 
> You seized on my mentioning blk-mq's array of hctx too literally.  I was
> illustrating that blk-mq's cookie is converted to an index into that
> array.
> 
> But for this DM bio-polling application we'd need to map the blk-mq
> returned cookie to a request_queue.  Hence the original 2 members of
> dm_poll_data needing to be 'struct request_queue *' and blk_qc_t.
> 
>> But I doubt this makes much sense. The core difficulty here is
>> maintaining a list (or dynamically sized array) to track all split bios.
>> With the array of 'struct blk_mq_hw_ctx *' maintained in struct
>> dm_table, we still need some **per-bio** structure (e.g., &bio->bi_plist
>> in the current patch set) to track these split bios.
> 
> One primary goal of all of this design is to achieve bio-polling cleanly
> (without extra locking, without block core data structure bloat, etc).
> I know you share that goal.  But we need to nail down the core data
> structures and what needs tracking at scale and then associate them with
> DM's associated objects with consideration for object lifetime.
> 
> My suggestion was to anchor your core data structures (e.g. 'struct
> dm_poll_data' array, etc) to 'struct dm_table'.  I suggested that
> because the dm_table is what dm_get_device()s each underlying _data_
> device (a subset of all devices in a dm_table, as iterated through
> .iterate_devices).  But a DM 'struct mapped_device' has 2 potential
> dm_tables, active and inactive slots, that would imply some complexity
> in handing off any outstanding bio's associated 'struct dm_poll_data'
> array on DM table reload.

1) If 'struct dm_poll_data' resides in per-bio-data, then how do you
**link** or **associate** all the 'struct dm_poll_data' structures from
one original bio? Do we link them by the internal relationship between
bio/dm_io/dm_target_io, or by some other auxiliary data structure?

2) I'm confused about how 'struct dm_poll_data' could be tied to
'struct dm_table'. Is there an array of 'struct dm_poll_data' or
'struct dm_poll_data *' maintained in 'struct dm_table'? If so, the
array may grow incredibly large, or be expanded/shrunk frequently,
since one dm_table could correspond to millions of bios.
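
(To make sure we're on the same page, I'm assuming 'struct
dm_poll_data' has the two members you described earlier, i.e.

struct dm_poll_data {
	struct request_queue	*queue;		/* underlying mq device */
	blk_qc_t		cookie;		/* cookie that device returned */
};

so that each one carries everything blk_poll() needs.)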



> 
> Anyway, you seem to be gravitating to a more simplistic approach of a
> single array of 'struct dm_poll_data' for each DM device (regardless of
> how arbitrarily deep that DM device stack is, the topmost DM device
> would accumulate the list of 'struct dm_poll_data'?).

I'm open to this. At least you don't need to care about the lifetime of
the disparate 'struct dm_poll_data's, if they are all accumulated in
one (e.g., the topmost) place.
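
That is, something like the following sketch (the new fields are purely
illustrative):

/* per-bio-data of the topmost original bio */
struct dm_io {
	...
	/* accumulates every dm_poll_data from all layers below */
	struct list_head	poll_list;
	spinlock_t		poll_lock;	/* protects poll_list */
	...
};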


> 
> I'm now questioning the need for any high-level data structure to track
> all N of the 'struct dm_poll_data' that may result from a given bio (as
> it is split to multiple blk-mq hctxs across multiple blk-mq devices).
> Each 'struct dm_poll_data', that will be returned to block core and
> stored in struct kiocb's ki_cookie, would have an object lifetime that
> matches the original DM bio clone's per-bio-data that the 'struct
> dm_poll_data' was part of; then we just need to cast that ki_cookie's
> blk_qc_t as 'struct dm_poll_data' and call blk_poll().
> 
> The hardest part is to ensure that all the disparate 'struct
> dm_poll_data' (and associated clone bios) aren't free'd until the
> _original_ bio completes.  That would create quite some back-pressure
> with more potential to exhaust system resources -- because then the
> catalyst for dropping reference counts on these clone bios would then
> need to be tied to the blk_bio_poll() interface... which feels "wrong"
> (e.g. it ushers in the (ab)use of bio_inc_remaining you had in your most
> recent patchset).
> 
> All said, maybe post a v2 that takes the incremental steps of:
> 1) using DM per-bio-data for 'struct dm_poll_data'
> 2) simplify blk_bio_poll() to call into DM to translate provided
>    blk_qc_t (from struct kiocb's ki_cookie) to request_queue and
>    blk_qc_t.
>    - this eliminates any need for extra list processing
> 3) keep your (ab)use of bio_inc_remaining() to allow for exploring this 
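
For step 2, with blk_qc_t defined as uintptr_t (patch 4), I imagine
blk_bio_poll() would reduce to something like the following sketch,
per your description of casting the ki_cookie:

static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
{
	/* 'q' is the DM device's queue; the cookie carries the real target */
	struct dm_poll_data *pd = (struct dm_poll_data *)cookie;

	/* translate the DM cookie to the underlying mq device and poll it */
	return blk_poll(pd->queue, pd->cookie, spin);
}

I'll explore this in the v2.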

-- 
Thanks,
Jeffle


* Re: [PATCH RFC 6/7] block: track cookies of split bios for bio-based device
  2021-01-14  9:16             ` [dm-devel] " JeffleXu
@ 2021-01-14 14:30               ` Mike Snitzer
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Snitzer @ 2021-01-14 14:30 UTC (permalink / raw)
  To: JeffleXu; +Cc: linux-block, dm-devel, io-uring

On Thu, Jan 14 2021 at  4:16am -0500,
JeffleXu <jefflexu@linux.alibaba.com> wrote:

> 
> 
> On 1/13/21 12:13 AM, Mike Snitzer wrote:
> > On Tue, Jan 12 2021 at 12:46am -0500,
> > JeffleXu <jefflexu@linux.alibaba.com> wrote:
> > 
> >>
> >>
> >> On 1/9/21 1:26 AM, Mike Snitzer wrote:
> >>> On Thu, Jan 07 2021 at 10:08pm -0500,
> >>> JeffleXu <jefflexu@linux.alibaba.com> wrote:
> >>>
> >>>> Thanks for reviewing.
> >>>>
> >>>>
> >>>> On 1/8/21 6:18 AM, Mike Snitzer wrote:
> >>>>> On Wed, Dec 23 2020 at  6:26am -0500,
> >>>>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>>
> >>>>>> This is actually the core part of supporting iopoll for bio-based
> >>>>>> devices.
> >>>>>>
> >>>>>> A list is maintained in the top bio (the original bio submitted to the
> >>>>>> dm device), which is used to maintain all valid cookies of the split
> >>>>>> bios. The IO polling routine will actually iterate this list and poll
> >>>>>> on the corresponding hardware queues of the underlying mq devices.
> >>>>>>
> >>>>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> >>>>>
> >>>>> Like I said in response to patch 4 in this series: please fold patch 4
> >>>>> into this patch and _really_ improve this patch header.
> >>>>>
> >>>>> In particular, the (ab)use of bio_inc_remaining() needs be documented in
> >>>>> this patch header very well.
> >>>>>
> >>>>> But its use could easily be why you're seeing a performance hit (coupled
> >>>>> with the extra spinlock locking and list management used).  Just added
> >>>>> latency and contention across CPUs.
> >>>>
> >>>> Indeed bio_inc_remaining() is abused here, and the code seems quite
> >>>> hacky.
> >>>>
> >>>> Actually I'm considering implementing the split-bio tracking mechanism
> >>>> in the recursive way you once suggested. That is, the split bios could
> >>>> be maintained in an array allocated along with 'struct dm_io'. This way
> >>>> the overhead of the spinlock protecting &root->bi_plist could be
> >>>> avoided. The lifetime management may also be somewhat simplified. But
> >>>> the block core then needs to fetch the per-bio private data, just as
> >>>> you suggested before.
> >>>>
> >>>> What do you think, Mike?
> >>>
> >>> Yes, using per-bio-data is a requirement (we cannot bloat 'struct bio').
> >>
> >> Agreed. Then MD will need some refactoring to support IO polling, if
> >> possible, since, as I mentioned in patch 0, MD doesn't allocate an
> >> extra clone bio and just re-uses the original bio structure.
> >>
> >>
> >>>
> >>> As for using an array, how would you index the array?  
> >>
> >> The 'array' here is not an array of 'struct blk_mq_hw_ctx *' maintained
> >> in struct dm_table as you mentioned. What I actually mean is to maintain
> >> an array of struct dm_poll_data (or something like that, e.g. just
> >> struct blk_mq_hw_ctx *) in per-bio private data. The array size just
> >> equals the number of target devices.
> >>
> >> For example, for the following device stack,
> >>
> >>>>
> >>>> Suppose we have the following device stack hierarchy, that is, dm0 is
> >>>> stacked on dm1, while dm1 is stacked on nvme0 and nvme1.
> >>>>
> >>>>     dm0
> >>>>     dm1
> >>>> nvme0  nvme1
> >>>>
> >>>>
> >>>> Then the bio graph is like:
> >>>>
> >>>>
> >>>>                                    +------------+
> >>>>                                    |bio0(to dm0)|
> >>>>                                    +------------+
> >>>>                                          ^
> >>>>                                          | orig_bio
> >>>>                                    +--------------------+
> >>>>                                    |struct dm_io A      |
> >>>> +--------------------+ bi_private  ----------------------
> >>>> |bio3(to dm1)        |------------>|bio1(to dm1)        |
> >>>> +--------------------+             +--------------------+
> >>>>         ^                                ^
> >>>>         | ->orig_bio                     | ->orig_bio
> >>>> +--------------------+             +--------------------+
> >>>> |struct dm_io        |             |struct dm_io B      |
> >>>> ----------------------             ----------------------
> >>>> |bio2(to nvme0)      |             |bio4(to nvme1)      |
> >>>> +--------------------+             +--------------------+
> >>>>
> >>
> >> An array of struct blk_mq_hw_ctx * is maintained in struct dm_io B.
> >>
> >>
> >> struct blk_mq_hw_ctx * hctxs[2];
> >>
> >> The array size is two since dm1 maps to two target devices (i.e. nvme0
> >> and nvme1). Then hctxs[0] points to the hw queue of nvme0, while
> >> hctxs[1] points to the hw queue of nvme1.
> > 
> > Both nvme0 and nvme1 may have multiple hctxs.  Not sure why you're
> > thinking there is just one per device?
> > 
> >>
> >>
> >> This mechanism supports arbitrary device stacking. Similarly, an array
> >> of struct blk_mq_hw_ctx * is maintained in struct dm_io A. The array
> >> size is one since dm0 only maps to one target device (i.e. dm1). In this
> >> case, hctxs[0] points to the struct dm_io of the next level, i.e. struct
> >> dm_io B.
> >>
> >>
> >> But I'm afraid the implementation of this style may be more complex.
> > 
> > We are running the risk of talking in circles about this design...
> 
> Sorry for the inconvenience. I have started working on the next version,
> but I do want to clarify some design issues first.
> 
> > 
> > 
> >>>> struct node {
> >>>>     struct blk_mq_hw_ctx *hctx;
> >>>>     blk_qc_t cookie;
> >>>> };
> >>>
> >>> Needs a better name, think I had 'struct dm_poll_data'
> >>
> >> Sure, the name here is just for example.
> >>
> >>
> >>>  
> >>>> Currently the tracking objects are all allocated along with 'struct
> >>>> bio', so lifetime management of the tracking objects is effectively
> >>>> equivalent to lifetime management of the bio. Since the returned
> >>>> cookie is actually a pointer to the bio, and we release a reference
> >>>> to the bio through that cookie, the bio's refcount must be
> >>>> incremented; in that case the abuse of the refcount trick seems
> >>>> unavoidable? Unless we allocate the tracking object individually, in
> >>>> which case the returned cookie points to the tracking object and the
> >>>> refcount is maintained individually for the tracking object.
> >>>
> >>> The refcounting and lifetime of the per-bio-data should all work as is.
> >>> Would hope you can avoid extra bio_inc_remaining().. that infrastructure
> >>> is way too tightly coupled to bio_chain()'ing, etc.
> >>>
> >>> The challenge you have is that the array that would point at these various
> >>> per-bio-data needs to be rooted somewhere (you put it in the topmost
> >>> original bio with the current patchset).  But why not manage that as
> >>> part of 'struct mapped_device'?  It'd need proper management at DM table
> >>> reload boundaries and such but it seems like the most logical place to
> >>> put the array.  But again, this array needs to be dynamic.. so thinking
> >>> further, maybe a better model would be to have a fixed array in 'struct
> >>> dm_table' for each hctx associated with a blk_mq _data_ device directly
> >>> used/managed by that dm_table?
> >>
> 
> My confusion is also stated in the following comment: how could 'struct
> dm_poll_data' be tied to 'struct dm_table' or 'struct mapped_device'?
> In the current patchset, every bio needs to maintain one list to track
> all of its 'struct dm_poll_data' structures. How, then, could this
> per-bio information be maintained in one single 'struct dm_table' or
> 'struct mapped_device'?
> 
> 
> >> It seems that you are referring to the 'array' here as an array of 'struct
> >> blk_mq_hw_ctx *'? Such as
> >>
> >> struct dm_table {
> >>     ...
> >>     struct blk_mq_hw_ctx *hctxs[];
> >> };
> >>
> >> Certainly with this we can replace the original 'struct blk_mq_hw_ctx *'
> >> pointer in 'struct dm_poll_data' with the index into this array, such as
> >>
> >> struct dm_poll_data {
> >>      int hctx_index; /* index into dm_table->hctxs[] */
> >>      blk_qc_t cookie;
> >> };
> > 
> > You seized on my mentioning blk-mq's array of hctx too literally.  I was
> > illustrating that blk-mq's cookie is converted to an index into that
> > array.
> > 
> > But for this DM bio-polling application we'd need to map the blk-mq
> > returned cookie to a request_queue.  Hence the original 2 members of
> > dm_poll_data needing to be 'struct request_queue *' and blk_qc_t.
> > 
> >> But I doubt this makes much sense. The core difficulty here is
> >> maintaining a list (or dynamically sized array) to track all split bios.
> >> With the array of 'struct blk_mq_hw_ctx *' maintained in struct
> >> dm_table, we still need some **per-bio** structure (e.g., &bio->bi_plist
> >> in the current patch set) to track these split bios.
> > 
> > One primary goal of all of this design is to achieve bio-polling cleanly
> > (without extra locking, without block core data structure bloat, etc).
> > I know you share that goal.  But we need to nail down the core data
> > structures and what needs tracking at scale and then associate them with
> > DM's associated objects with consideration for object lifetime.
> > 
> > My suggestion was to anchor your core data structures (e.g. 'struct
> > dm_poll_data' array, etc) to 'struct dm_table'.  I suggested that
> > because the dm_table is what dm_get_device()s each underlying _data_
> > device (a subset of all devices in a dm_table, as iterated through
> > .iterate_devices).  But a DM 'struct mapped_device' has 2 potential
> > dm_tables, active and inactive slots, that would imply some complexity
> > in handing off any outstanding bio's associated 'struct dm_poll_data'
> > array on DM table reload.
> 
> 1) If 'struct dm_poll_data' resides in per-bio-data, then how do you
> **link** or **associate** all the 'struct dm_poll_data' structures from
> one original bio? Do we link them by the internal relationship between
> bio/dm_io/dm_target_io, or by some other auxiliary data structure?
> 
> 2) I'm confused about how 'struct dm_poll_data' could be tied to
> 'struct dm_table'. Is there an array of 'struct dm_poll_data' or
> 'struct dm_poll_data *' maintained in 'struct dm_table'? If so, the
> array may grow incredibly large, or be expanded/shrunk frequently,
> since one dm_table could correspond to millions of bios.

My line of thinking didn't account for the fan-out of clone bios and
the 'struct dm_poll_data' associated with each needing to be tracked
with an auxiliary data structure.  I was just thinking in terms of a
single cookie for each bio.  That model works for blk-mq because a
request isn't ever split.
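
(To spell out why the single cookie works there: blk-mq packs both the
hctx number and the tag into one value, see include/linux/blk_types.h:

static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
{
	return (cookie & ~BLK_QC_T_INTERNAL) >> BLK_QC_T_SHIFT;
}

static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
{
	return cookie & ((1u << BLK_QC_T_SHIFT) - 1);
}

so blk_poll() can recover everything it needs from that one value.)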

So I had a blind spot and hoped we could avoid the complexity, but I
was mistaken.

> > Anyway, you seem to be gravitating to a more simplistic approach of a
> > single array of 'struct dm_poll_data' for each DM device (regardless of
> > how arbitrarily deep that DM device stack is, the topmost DM device
> > would accumulate the list of 'struct dm_poll_data'?).
> 
> I'm open to this. At least you don't need to care the lifetime of other
> disparate 'struct dm_poll_data's, if all 'struct dm_poll_data's are
> accumulated in one (e.g., the topmost) place.

Treating the entire IO stack as if it can all be accumulated/managed in
a single pool of objects is dangerous.  It ushers in serious lifetime
problems associated with completion of IO that must occur in order for
DM targets to work as designed.  Waiting for a chain of bios to complete
at various layers is fine.  But if that chain spans target boundaries
I think we could easily introduce problems.

So not only am I struggling to see how we avoid a data structure to
track all split bios' dm_poll_data; I also don't yet see how we can
safely allow per-bio-data to linger waiting for blk_bio_poll() to
eventually reap bios whose completion has been delayed for IO polling's
benefit.

This IO polling model is really awkward to apply to bio-based IO.

Mike

> > I'm now questioning the need for any high-level data structure to track
> > all N of the 'struct dm_poll_data' that may result from a given bio (as
> > it is split to multiple blk-mq hctxs across multiple blk-mq devices).
> > Each 'struct dm_poll_data', that will be returned to block core and
> > stored in struct kiocb's ki_cookie, would have an object lifetime that
> > matches the original DM bio clone's per-bio-data that the 'struct
> > dm_poll_data' was part of; then we just need to cast that ki_cookie's
> > blk_qc_t as 'struct dm_poll_data' and call blk_poll().
> > 
> > The hardest part is to ensure that all the disparate 'struct
> > dm_poll_data' (and associated clone bios) aren't free'd until the
> > _original_ bio completes.  That would create quite some back-pressure
> > with more potential to exhaust system resources -- because then the
> > catalyst for dropping reference counts on these clone bios would then
> > need to be tied to the blk_bio_poll() interface... which feels "wrong"
> > (e.g. it ushers in the (ab)use of bio_inc_remaining you had in your most
> > recent patchset).
> > 
> > All said, maybe post a v2 that takes the incremental steps of:
> > 1) using DM per-bio-data for 'struct dm_poll_data'
> > 2) simplify blk_bio_poll() to call into DM to translate provided
> >    blk_qc_t (from struct kiocb's ki_cookie) to request_queue and
> >    blk_qc_t.
> >    - this eliminates any need for extra list processing
> > 3) keep your (ab)use of bio_inc_remaining() to allow for exploring this 
> 
> -- 
> Thanks,
> Jeffle
> 



end of thread, other threads:[~2021-01-14 14:31 UTC | newest]

Thread overview: 26+ messages
2020-12-23 11:26 [PATCH RFC 0/7] dm: add support of iopoll Jeffle Xu
2020-12-23 11:26 ` [PATCH RFC 1/7] block: move definition of blk_qc_t to types.h Jeffle Xu
2021-01-07 19:04   ` Mike Snitzer
2020-12-23 11:26 ` [PATCH RFC 2/7] block: add helper function fetching gendisk from queue Jeffle Xu
2021-01-07 20:31   ` Mike Snitzer
2020-12-23 11:26 ` [PATCH RFC 3/7] block: add iopoll method for non-mq device Jeffle Xu
2021-01-07 21:47   ` Mike Snitzer
2021-01-08  3:24     ` [dm-devel] " JeffleXu
2021-01-08 17:33       ` Mike Snitzer
2021-01-11  7:50         ` [dm-devel] " JeffleXu
2020-12-23 11:26 ` [PATCH RFC 4/7] block: define blk_qc_t as uintptr_t Jeffle Xu
2021-01-07 21:52   ` Mike Snitzer
2020-12-23 11:26 ` [PATCH RFC 5/7] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
2021-01-07 21:54   ` Mike Snitzer
2020-12-23 11:26 ` [PATCH RFC 6/7] block: track cookies of split bios " Jeffle Xu
2021-01-07 22:18   ` Mike Snitzer
2021-01-08  3:08     ` [dm-devel] " JeffleXu
2021-01-08 17:26       ` Mike Snitzer
2021-01-12  5:46         ` [dm-devel] " JeffleXu
2021-01-12 16:13           ` Mike Snitzer
2021-01-14  9:16             ` [dm-devel] " JeffleXu
2021-01-14 14:30               ` Mike Snitzer
2021-01-12  7:11         ` [dm-devel] " JeffleXu
2020-12-23 11:26 ` [PATCH RFC 7/7] dm: add support for IO polling Jeffle Xu
2021-01-08  3:12   ` [dm-devel] " JeffleXu
2021-01-07  1:14 ` [dm-devel] [PATCH RFC 0/7] dm: add support of iopoll JeffleXu
