linux-block.vger.kernel.org archive mirror
* [PATCH v5 00/12] dm: support polling
@ 2021-03-03 11:57 Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 01/12] block: move definition of blk_qc_t to types.h Jeffle Xu
                   ` (11 more replies)
  0 siblings, 12 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

[Changes since v4]
- rebased to 5.12

- refactor patch 10 (fastpath) to fix the issue raised by Mikulas
  Patocka in [1].
  When the bio doesn't get split and is submitted to *one* underlying
  device, the polling routine will go into the fastpath. The refactored
  design is to return the dev_t of the underlying device the bio is
  finally submitted to as the returned cookie. The polling routine will
  then call blkdev_get_no_open() to get the blkdev by the dev_t stored
  in the cookie, and thus fetch the corresponding hw queue.
  One remaining issue is that blkdev_get_no_open() needs to acquire the
  global @inode_hash_lock in ilookup(), and this may become a hot spot
  as the number of polling threads increases. The RCU version (i.e.,
  find_inode_by_ino_rcu()) could perhaps be used here as an optimization.

- call disk->fops->poll() recursively in patch 12 to fix the issue
  described in [2].

- update the performance tests in the [Performance Tests] section below.

[1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00255.html
[2] https://listman.redhat.com/archives/dm-devel/2021-February/msg00254.html



[Intention]
Bio-based polling (e.g., for dm/md devices) is an indispensable part of
a high-performance IO stack. As far as I know, dm (e.g., dm-stripe) is
widely used in database deployments, combining several NVMe disks into
one logical disk in the hope of achieving better performance. With this
patch set, io_uring IO polling can be used on top of dm devices.
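
For illustration, below is a minimal userspace sketch of what this
enables: a polled O_DIRECT read against a dm device through io_uring.
The device path (/dev/mapper/test-stripe) is purely illustrative, and
it assumes liburing is available and polling has been enabled on the dm
device via /sys/block/<dev>/queue/io_poll.

/* Minimal sketch: polled O_DIRECT read on a dm device via io_uring. */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	/* IORING_SETUP_IOPOLL: completions are reaped via blk_poll() */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL) < 0)
		return 1;

	fd = open("/dev/mapper/test-stripe", O_RDONLY | O_DIRECT);
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* the wait actively polls the underlying hw queue(s) */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("read returned %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}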


[Performance Tests]
1. fastpath (patch 10)
The polling routine takes this path when the bio submitted to the dm
device is not split.

In this case, only one bio is submitted to a single polling hw queue of
one underlying mq device, and thus we don't need to track all split
bios or iterate through all polling hw queues. The dev_t of the
underlying device and the number of the polling hw queue the bio was
submitted to are encoded into the returned cookie. The polling routine
then resolves the hw queue from that cookie and calls mq_ops->poll() on
it directly.
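
For reference, here is a condensed sketch of that cookie resolution.
The wrapper function name is made up for illustration; it condenses the
blk_poll_bio() logic from patch 10, with the bdev reference dropped on
every exit path:

/*
 * Sketch only: resolve the dev_t/queue_num encoded in the cookie back
 * to the polling hw queue and poll it (condensed from patch 10).
 */
static int bio_poll_fastpath(blk_qc_t cookie)
{
	struct block_device *bdev;
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	int ret = 1;

	bdev = blkdev_get_no_open(blk_qc_t_to_devt(cookie));
	if (!bdev)
		return 1;	/* underlying device detached, e.g. by table reload */

	q = bdev->bd_disk->queue;
	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];

	/* the dev_t may have been reused by another blkdev meanwhile */
	if (hctx->type == HCTX_TYPE_POLL)
		ret = blk_mq_poll_hctx(q, hctx);

	blkdev_put_no_open(bdev);
	return ret;
}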

- One process reading dm-linear (mapping to three underlying NVMe devices,
with one polling hw queue per NVMe device).

(ioengine=io_uring, iodepth=128, numjobs=1, rw=randread, sqthread_poll=0,
direct=1, bs=4k)

                 | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---------------- | --------------- | -------------------- | ----
with patchset    |            214k |                 282k | ~32%


- Three processes reading dm-linear (mapping to three underlying NVMe
devices, with three polling hw queues per NVMe device), with every
process pinned to one CPU and mapped to one exclusive hw queue.

(ioengine=io_uring, iodepth=128, numjobs=3, rw=randread, sqthread_poll=0,
direct=1, bs=4k)

                 | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---------------- | --------------- | -------------------- | ----
with patchset    |            629k |                 820k | ~30%



2. sub-fastpath (patch 11)

The polling routine takes this path when the bio submitted to the dm
device gets split and enqueued into multiple hw queues, while the IO
submission process has not been migrated to another CPU.

In this case, the IO submission routine will return the CPU number on
which the IO submission happened as the returned cookie, while the
polling routine will iterate and poll only the hw queues that this CPU
number maps to, instead of iterating *all* hw queues.

This optimization can dramatically reduce cache ping-pong and thus
improve polling performance, when multiple polling hw queues per device
can be reserved for multiple polling processes.
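
For reference, a condensed sketch of this per-CPU branch. The wrapper
function name is made up for illustration; it condenses the
dm_poll_one_dev() logic from patch 12, using the helpers added by
patches 5, 7 and 11:

/*
 * Sketch only: poll just the polling hw queue that the submitting CPU
 * (encoded in the cookie) maps to, instead of all polling hw queues.
 */
static int poll_one_cpu_hctx(struct request_queue *q, blk_qc_t cookie)
{
	struct blk_mq_hw_ctx *hctx;
	int found = 0;

	if (!percpu_ref_tryget(&q->q_usage_counter))
		return 0;

	hctx = blk_mq_get_hctx(q, blk_qc_t_to_cpu(cookie), REQ_HIPRI);
	found = blk_mq_poll_hctx(q, hctx);

	percpu_ref_put(&q->q_usage_counter);
	return found;
}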

- Three processes reading dm-stripe (mapping to three underlying NVMe
devices, with three polling hw queues per NVMe device), with every
process pinned to one CPU and mapped to one exclusive hw queue.

(ioengine=io_uring, iodepth=128, numjobs=3, rw=randread, sqthread_poll=0,
direct=1, bs=12k (4k for each NVMe device))

                 | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
---------------- | --------------- | -------------------- | ----
with patchset    |            311k |                 409k | ~32%



[Changes since v3]
- newly added patch 7 and patch 11 as a new optimization improving the
performance of multiple polling processes. Now multiple polling
processes can scale as well as a single polling process (~30%
improvement). Refer to the [Performance Tests] section above for more
details.


[Changes since v2]

Patchset v2 cached all hw queues (in polling mode) of the underlying mq
devices in the dm layer. The polling routine then iterated through all
these cached hw queues.

However, blk-mq may change the queue mapping at runtime (e.g., on an
NVMe RESET command), so the hw queues cached in the dm layer may become
out-of-date. Patchset v3 therefore falls back to the implementation of
the very first RFC version, in which the mq layer exports one interface
iterating all polling hw queues (patch 5), and the bio-based polling
routine just calls this interface to iterate all polling hw queues.
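
For reference, a condensed sketch of what that iteration ends up
looking like for one underlying mq device. The wrapper function name is
made up for illustration; it mirrors the fallback branch of
dm_poll_one_dev() in patch 12, using the iterator exported by the mq
layer:

/*
 * Sketch only: iterate and poll every polling hw queue of one
 * underlying mq device.
 */
static int poll_all_poll_hctxs(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	int i, found = 0;

	if (!percpu_ref_tryget(&q->q_usage_counter))
		return 0;

	queue_for_each_poll_hw_ctx(q, hctx, i)
		found += blk_mq_poll_hctx(q, hctx);

	percpu_ref_put(&q->q_usage_counter);
	return found;
}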

Besides, several new optimizations are proposed.


- patches 1, 2, 7
same as v2, untouched

- patch 3
Following advice from Christoph Hellwig, the mq and bio-based polling
routines are split from the very beginning when refactoring blk_poll().
Now blk_poll() is just a simple entry point. blk_bio_poll() is simply
copied from blk_mq_poll(), so the loop structure is somewhat duplicated.

- patch 4
This patch is newly added to support turning polling on/off dynamically
through '/sys/block/<dev>/queue/io_poll' for bio-based devices.
Patchset v2 implemented this functionality by adding one new queue
flag, which is not preferred since queue flag bits are in rather short
supply nowadays.

- patch 5
This patch is newly added, preparing for the following bio-based
polling. The bio-based polling routine will call this helper function,
which also does the statistics accounting on the corresponding hw queue.

- patch 6
It's from the very first RFC version, preparing for the following
bio-based polling.

- patch 8
One fix needed by the following bio-based polling. It's actually a v2
of [1]. I had sent that v2 separately in reply to [1], though it has
not shown up on the mailing list yet, maybe due to delivery delay.

- patch 9
It's from the very first RFC version.

- patch 10
This patch is newly added. Patchset v2 had proposed one optimization of
skipping the **busy** hw queues during the iteration phase. Back then,
one 'atomic_t' flag was specifically maintained in the dm layer,
indicating whether the corresponding hw queue is busy or not. The idea
is inherited, while the implementation changes: now @nvmeq->cq_poll_lock
is used directly, with no need for an extra flag anymore.

This optimization can significantly reduce contention for one hw queue
between multiple polling instances. The following statistics are the
test results when 3 threads concurrently randread (bs=4k, direct=1) one
dm-linear device built upon 3 nvme devices, with one polling hw queue
per nvme device.

            | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
----------- | --------------- | -------------------- | -----
without opt |            318k |                 256k | ~-20%
with opt    |            314k |                 354k | ~13%

- patch 11
This is another newly added optimization for bio-based polling.

One intuitive insight is that when the original bio submitted to the dm
device doesn't get split, the bio gets enqueued into only one hw queue
of one of the underlying mq devices. In this case, we no longer need to
track all split bios, and one cookie (for the only bio) is enough. It
is implemented by returning the pointer to the corresponding hw queue
as the cookie in this case (in v5 this is reworked to encode the dev_t
of the underlying device instead; see [Changes since v4]).

It should be safe to directly return the pointer to the hw queue, since
'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of the
'struct request_queue'. Even when the number of hw queues decreases
(e.g., when an NVMe RESET happens), the 'struct blk_mq_hw_ctx'
structures of the removed hw queues won't be freed; instead they are
buffered on the &q->unused_hctx_list list.

Though this optimization seems quite intuitive, the performance test
shows that it brings neither benefit nor harm, when 3 threads
concurrently randread (bs=4k, direct=1) one dm-linear device built upon
3 nvme devices, with one polling hw queue per nvme device.

I'm not sure why it doesn't help; maybe the number of devices, or the
depth of the device stack, is too low in my test case?


[Changes since v1]
- patches 1, 2, 4 are the same as v1 and have already been reviewed
- patch 3 is refactored a bit on the basis of suggestions from
Mike Snitzer.
- patch 5 is newly added and introduces one new queue flag
representing whether the queue is capable of IO polling. This mainly
simplifies the logic in queue_poll_store().
- patch 6 implements the core mechanism supporting IO polling.
The sanity check verifying that the dm device supports IO polling is
also folded into this patch, and the queue flag will be cleared if
polling is not supported, in case of table reloading.




Jeffle Xu (12):
  block: move definition of blk_qc_t to types.h
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll method to support bio-based IO polling
  block: add poll_capable method to support bio-based IO polling
  blk-mq: extract one helper function polling hw queue
  blk-mq: add iterator for polling hw queues
  blk-mq: add one helper function getting hw queue
  dm: always return BLK_QC_T_NONE for bio-based device
  nvme/pci: don't wait for locked polling queue
  block: fastpath for bio-based polling
  block: sub-fastpath for bio-based polling
  dm: support IO polling for bio-based dm device

 block/blk-core.c              | 143 +++++++++++++++++++++++++++++++++-
 block/blk-mq.c                |  37 +++------
 block/blk-sysfs.c             |  14 +++-
 drivers/md/dm-table.c         |  26 +++++++
 drivers/md/dm.c               |  98 +++++++++++++++++++----
 drivers/nvme/host/pci.c       |   4 +-
 include/linux/blk-mq.h        |  23 ++++++
 include/linux/blk_types.h     |  96 ++++++++++++++++++++++-
 include/linux/blkdev.h        |   4 +
 include/linux/device-mapper.h |   1 +
 include/linux/fs.h            |   2 +-
 include/linux/types.h         |   3 +
 include/trace/events/kyber.h  |   6 +-
 13 files changed, 405 insertions(+), 52 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v5 01/12] block: move definition of blk_qc_t to types.h
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 02/12] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

So that kiocb.ki_cookie can be defined as blk_qc_t, which will enforce
the encapsulation.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blk_types.h | 2 +-
 include/linux/fs.h        | 2 +-
 include/linux/types.h     | 3 +++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..fb429daaa909 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -505,7 +505,7 @@ static inline int op_stat_group(unsigned int op)
 	return op_is_write(op);
 }
 
-typedef unsigned int blk_qc_t;
+/* Macros for blk_qc_t */
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..8b14dacf618d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -332,7 +332,7 @@ struct kiocb {
 	u16			ki_hint;
 	u16			ki_ioprio; /* See linux/ioprio.h */
 	union {
-		unsigned int		ki_cookie; /* for ->iopoll */
+		blk_qc_t		ki_cookie; /* for ->iopoll */
 		struct wait_page_queue	*ki_waitq; /* for async buffered IO */
 	};
 
diff --git a/include/linux/types.h b/include/linux/types.h
index ac825ad90e44..52a54ed6ffac 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -125,6 +125,9 @@ typedef s64			int64_t;
 typedef u64 sector_t;
 typedef u64 blkcnt_t;
 
+/* cookie used for IO polling */
+typedef unsigned int blk_qc_t;
+
 /*
  * The type of an index into the pagecache.
  */
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 02/12] block: add queue_to_disk() to get gendisk from request_queue
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 01/12] block: move definition of blk_qc_t to types.h Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 03/12] block: add poll method to support bio-based IO polling Jeffle Xu
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c032cfe133c7..b81a9fe015ab 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -687,6 +687,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 03/12] block: add poll method to support bio-based IO polling
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 01/12] block: move definition of blk_qc_t to types.h Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 02/12] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-10 22:01   ` Mike Snitzer
  2021-03-03 11:57 ` [PATCH v5 04/12] block: add poll_capable " Jeffle Xu
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
callback to struct request_queue") to support bio-based queues such as
nvme multipath, but was later removed in commit 529262d56dbe ("block:
remove ->poll_fn").

Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
block_device_operations"), restore the possibility of bio-based IO
polling support by adding a ->poll method to gendisk->fops.

Make blk_mq_poll() specific to blk-mq, while blk_bio_poll(), originally
copied from blk_mq_poll(), is specific to bio-based polling. Currently
hybrid polling is not supported by bio-based polling.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-core.c       | 58 ++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq.c         | 22 +---------------
 include/linux/blk-mq.h |  1 +
 include/linux/blkdev.h |  1 +
 4 files changed, 61 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..6d7d53030d7c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1119,6 +1119,64 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+
+static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	long state;
+	struct gendisk *disk = queue_to_disk(q);
+
+	state = current->state;
+	do {
+		int ret;
+
+		ret = disk->fops->poll(q, cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
+/**
+ * blk_poll - poll for IO completions
+ * @q:  the queue
+ * @cookie: cookie passed back at IO submission time
+ * @spin: whether to spin for completions
+ *
+ * Description:
+ *    Poll for completions on the passed in queue. Returns number of
+ *    completed entries found. If @spin is true, then blk_poll will continue
+ *    looping until at least one completion is found, unless the task is
+ *    otherwise marked running (or we need to reschedule).
+ */
+int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	if (!blk_qc_t_valid(cookie) ||
+	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+		return 0;
+
+	if (current->plug)
+		blk_flush_plug_list(current->plug, false);
+
+	if (queue_is_mq(q))
+		return blk_mq_poll(q, cookie, spin);
+	else
+		return blk_bio_poll(q, cookie, spin);
+}
+EXPORT_SYMBOL_GPL(blk_poll);
+
 /**
  * blk_cloned_rq_check_limits - Helper function to check a cloned request
  *                              for the new queue limits
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..214fa30b460a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,30 +3852,11 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
-/**
- * blk_poll - poll for IO completions
- * @q:  the queue
- * @cookie: cookie passed back at IO submission time
- * @spin: whether to spin for completions
- *
- * Description:
- *    Poll for completions on the passed in queue. Returns number of
- *    completed entries found. If @spin is true, then blk_poll will continue
- *    looping until at least one completion is found, unless the task is
- *    otherwise marked running (or we need to reschedule).
- */
-int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
-		return 0;
-
-	if (current->plug)
-		blk_flush_plug_list(current->plug, false);
-
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	/*
@@ -3917,7 +3898,6 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	__set_current_state(TASK_RUNNING);
 	return 0;
 }
-EXPORT_SYMBOL_GPL(blk_poll);
 
 unsigned int blk_mq_rq_cpu(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..6a7b693b9917 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 }
 
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
+int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b81a9fe015ab..9dc83c30e7bc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1866,6 +1866,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
+	int (*poll)(struct request_queue *q, blk_qc_t cookie);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 04/12] block: add poll_capable method to support bio-based IO polling
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (2 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 03/12] block: add poll method to support bio-based IO polling Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-10 22:21   ` Mike Snitzer
  2021-03-03 11:57 ` [PATCH v5 05/12] blk-mq: extract one helper function polling hw queue Jeffle Xu
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

This method can be used to check whether a bio-based device supports IO
polling or not. For mq devices, checking for hw queues in polling mode
is adequate, while the sanity check shall be implementation specific
for bio-based devices. For example, a dm device needs to check if all
underlying devices are capable of IO polling.

Though a bio-based device may have done the sanity check during the
device initialization phase, caching the result of this sanity check
(such as caching it in the queue_flags) may not work, because for dm
devices users could change the state of the underlying devices through
'/sys/block/<dev>/queue/io_poll', bypassing the dm device above. In
this case, the cached result of the initial sanity check could be
out-of-date. Thus the sanity check needs to be done every time
'io_poll' is to be modified.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-sysfs.c      | 14 +++++++++++---
 include/linux/blkdev.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0f4f0c8a7825..367c1d9a55c6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
-	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
-		return -EINVAL;
+	if (queue_is_mq(q)) {
+		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
+		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+			return -EINVAL;
+	} else {
+		struct gendisk *disk = queue_to_disk(q);
+
+		if (!disk->fops->poll_capable ||
+		    !disk->fops->poll_capable(disk))
+			return -EINVAL;
+	}
 
 	ret = queue_var_store(&poll_on, page, count);
 	if (ret < 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9dc83c30e7bc..7df40792c032 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1867,6 +1867,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
 struct block_device_operations {
 	blk_qc_t (*submit_bio) (struct bio *bio);
 	int (*poll)(struct request_queue *q, blk_qc_t cookie);
+	bool (*poll_capable)(struct gendisk *disk);
 	int (*open) (struct block_device *, fmode_t);
 	void (*release) (struct gendisk *, fmode_t);
 	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 05/12] blk-mq: extract one helper function polling hw queue
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (3 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 04/12] block: add poll_capable " Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 06/12] blk-mq: add iterator for polling hw queues Jeffle Xu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Extract the logic of polling one hw queue and the related statistics
handling into a helper function.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-mq.c         |  5 +----
 include/linux/blk-mq.h | 13 +++++++++++++
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 214fa30b460a..6ef9f0b038c2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3876,11 +3876,8 @@ int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	do {
 		int ret;
 
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
+		ret = blk_mq_poll_hctx(q, hctx);
 		if (ret > 0) {
-			hctx->poll_success++;
 			__set_current_state(TASK_RUNNING);
 			return ret;
 		}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6a7b693b9917..b406cab347d6 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -614,6 +614,19 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 		rq->rq_disk = bio->bi_bdev->bd_disk;
 }
 
+static inline int blk_mq_poll_hctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	int ret;
+
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
+
+	return ret;
+}
+
 blk_qc_t blk_mq_submit_bio(struct bio *bio);
 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 06/12] blk-mq: add iterator for polling hw queues
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (4 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 05/12] blk-mq: extract one helper function polling hw queue Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 07/12] blk-mq: add one helper function getting hw queue Jeffle Xu
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Add one iterator macro for iterating over all hardware queues in
polling mode.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 include/linux/blk-mq.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b406cab347d6..d22269b3dbe9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -582,6 +582,13 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	for ((i) = 0; (i) < (q)->nr_hw_queues &&			\
 	     ({ hctx = (q)->queue_hw_ctx[i]; 1; }); (i)++)
 
+#define queue_for_each_poll_hw_ctx(q, hctx, i)				\
+	for ((i) = 0; ((q)->tag_set->nr_maps > HCTX_TYPE_POLL) &&	\
+	     (i) < (q)->tag_set->map[HCTX_TYPE_POLL].nr_queues &&	\
+	     ({ int __idx = (q)->tag_set->map[HCTX_TYPE_POLL].queue_offset + (i); \
+	     hctx = (q)->queue_hw_ctx[__idx]; 1; }); \
+	     (i)++)
+
 #define hctx_for_each_ctx(hctx, ctx, i)					\
 	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
 	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 07/12] blk-mq: add one helper function getting hw queue
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (5 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 06/12] blk-mq: add iterator for polling hw queues Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 08/12] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Add one helper function for getting the hw queue of a specific type
that a specific CPU maps to.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-mq.c         | 10 ++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ef9f0b038c2..72390f208c82 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3902,6 +3902,16 @@ unsigned int blk_mq_rq_cpu(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_rq_cpu);
 
+
+struct blk_mq_hw_ctx *blk_mq_get_hctx(struct request_queue *q, int cpu,
+				      unsigned int flags)
+{
+	struct blk_mq_ctx *ctx = __blk_mq_get_ctx(q, cpu);
+
+	return blk_mq_map_queue(q, flags, ctx);
+}
+EXPORT_SYMBOL(blk_mq_get_hctx);
+
 static int __init blk_mq_init(void)
 {
 	int i;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d22269b3dbe9..149f6a9d9aa7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -638,5 +638,7 @@ blk_qc_t blk_mq_submit_bio(struct bio *bio);
 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
 void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
 		struct lock_class_key *key);
+struct blk_mq_hw_ctx *blk_mq_get_hctx(struct request_queue *q, int cpu,
+		unsigned int flags);
 
 #endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 08/12] dm: always return BLK_QC_T_NONE for bio-based device
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (6 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 07/12] blk-mq: add one helper function getting hw queue Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue Jeffle Xu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Currently the returned cookie of a bio-based device is not used at all.

The cookie of bio-based devices will be refactored in the following
patches.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..f1b76203b3c7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1294,14 +1294,13 @@ static noinline void __set_swap_bios_limit(struct mapped_device *md, int latch)
 	mutex_unlock(&md->swap_bios_lock);
 }
 
-static blk_qc_t __map_bio(struct dm_target_io *tio)
+static void __map_bio(struct dm_target_io *tio)
 {
 	int r;
 	sector_t sector;
 	struct bio *clone = &tio->clone;
 	struct dm_io *io = tio->io;
 	struct dm_target *ti = tio->ti;
-	blk_qc_t ret = BLK_QC_T_NONE;
 
 	clone->bi_end_io = clone_endio;
 
@@ -1328,7 +1327,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 	case DM_MAPIO_REMAPPED:
 		/* the bio has been remapped so dispatch it */
 		trace_block_bio_remap(clone, bio_dev(io->orig_bio), sector);
-		ret = submit_bio_noacct(clone);
+		submit_bio_noacct(clone);
 		break;
 	case DM_MAPIO_KILL:
 		if (unlikely(swap_bios_limit(ti, clone))) {
@@ -1350,8 +1349,6 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		DMWARN("unimplemented target map return value: %d", r);
 		BUG();
 	}
-
-	return ret;
 }
 
 static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
@@ -1438,7 +1435,7 @@ static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
 	}
 }
 
-static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
+static void __clone_and_map_simple_bio(struct clone_info *ci,
 					   struct dm_target_io *tio, unsigned *len)
 {
 	struct bio *clone = &tio->clone;
@@ -1449,7 +1446,7 @@ static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
 	if (len)
 		bio_setup_sector(clone, ci->sector, *len);
 
-	return __map_bio(tio);
+	__map_bio(tio);
 }
 
 static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
@@ -1463,7 +1460,7 @@ static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
 
 	while ((bio = bio_list_pop(&blist))) {
 		tio = container_of(bio, struct dm_target_io, clone);
-		(void) __clone_and_map_simple_bio(ci, tio, len);
+		__clone_and_map_simple_bio(ci, tio, len);
 	}
 }
 
@@ -1507,7 +1504,7 @@ static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
 		free_tio(tio);
 		return r;
 	}
-	(void) __map_bio(tio);
+	__map_bio(tio);
 
 	return 0;
 }
@@ -1622,11 +1619,10 @@ static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
 /*
  * Entry point to split a bio into clones and submit them to the targets.
  */
-static blk_qc_t __split_and_process_bio(struct mapped_device *md,
+static void __split_and_process_bio(struct mapped_device *md,
 					struct dm_table *map, struct bio *bio)
 {
 	struct clone_info ci;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int error = 0;
 
 	init_clone_info(&ci, md, map, bio);
@@ -1670,7 +1666,7 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 				bio_chain(b, bio);
 				trace_block_split(b, bio->bi_iter.bi_sector);
-				ret = submit_bio_noacct(bio);
+				submit_bio_noacct(bio);
 				break;
 			}
 		}
@@ -1678,13 +1674,11 @@ static blk_qc_t __split_and_process_bio(struct mapped_device *md,
 
 	/* drop the extra reference count */
 	dec_pending(ci.io, errno_to_blk_status(error));
-	return ret;
 }
 
 static blk_qc_t dm_submit_bio(struct bio *bio)
 {
 	struct mapped_device *md = bio->bi_bdev->bd_disk->private_data;
-	blk_qc_t ret = BLK_QC_T_NONE;
 	int srcu_idx;
 	struct dm_table *map;
 
@@ -1714,10 +1708,10 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	if (is_abnormal_io(bio))
 		blk_queue_split(&bio);
 
-	ret = __split_and_process_bio(md, map, bio);
+	__split_and_process_bio(md, map, bio);
 out:
 	dm_put_live_table(md, srcu_idx);
-	return ret;
+	return BLK_QC_T_NONE;
 }
 
 /*-----------------------------------------------------------------
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (7 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 08/12] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-10 21:57   ` Mike Snitzer
  2021-03-03 11:57 ` [PATCH v5 10/12] block: fastpath for bio-based polling Jeffle Xu
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

There's no sense in waiting for the hw queue when it is currently
locked by another polling instance. The polling instance currently
occupying the hw queue will help reap the completion events.

It shall be safe to surrender the hw queue, as long as we can reapply
for polling later. For synchronous polling, blk_poll() will reapply for
polling, since @spin is always true in this case. For asynchronous
polling, io_uring itself will reapply for polling when the previous
poll returns 0.

Besides, it shall do no harm to the polling performance of mq devices.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 drivers/nvme/host/pci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 38b0d694dfc9..150e56ed6d15 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1106,7 +1106,9 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
 	if (!nvme_cqe_pending(nvmeq))
 		return 0;
 
-	spin_lock(&nvmeq->cq_poll_lock);
+	if (!spin_trylock(&nvmeq->cq_poll_lock))
+		return 0;
+
 	found = nvme_process_cq(nvmeq);
 	spin_unlock(&nvmeq->cq_poll_lock);
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (8 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-10 23:18   ` Mike Snitzer
  2021-03-11 13:56   ` Ming Lei
  2021-03-03 11:57 ` [PATCH v5 11/12] block: sub-fastpath " Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 12/12] dm: support IO polling for bio-based dm device Jeffle Xu
  11 siblings, 2 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Offer one fastpath for bio-based polling when the bio submitted to the
dm device is not split.

In this case, only one bio is submitted to a single polling hw queue of
one underlying mq device, and thus we don't need to track all split
bios or iterate through all polling hw queues. The dev_t of the
underlying device and the number of the polling hw queue the bio was
submitted to are encoded into the returned cookie. The polling routine
then resolves the hw queue from that cookie and calls mq_ops->poll() on
it directly.

If the original bio submitted to the dm device is split into multiple
bios and thus submitted to multiple polling hw queues, the polling
routine will fall back to iterating all hw queues (in polling mode) of
all underlying mq devices.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
 include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
 include/linux/types.h     |  2 +-
 3 files changed, 135 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 6d7d53030d7c..e5cd4ff08f5c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
+	struct request_queue *top_q;
+	bool poll_on;
 
 	BUG_ON(bio->bi_next);
 
 	bio_list_init(&bio_list_on_stack[0]);
 	current->bio_list = bio_list_on_stack;
 
+	top_q = bio->bi_bdev->bd_disk->queue;
+	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
+		  (bio->bi_opf & REQ_HIPRI);
+
 	do {
-		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+		blk_qc_t cookie;
+		struct block_device *bdev = bio->bi_bdev;
+		struct request_queue *q = bdev->bd_disk->queue;
 		struct bio_list lower, same;
 
 		if (unlikely(bio_queue_enter(bio) != 0))
@@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		cookie = __submit_bio(bio);
+
+		if (poll_on && blk_qc_t_valid(cookie)) {
+			unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
+			unsigned int devt = bdev_whole(bdev)->bd_dev;
+
+			cookie = blk_qc_t_get_by_devt(devt, queue_num);
+
+			if (!blk_qc_t_valid(ret)) {
+				/* set initial value */
+				ret = cookie;
+			} else if (ret != cookie) {
+				/* bio gets split and enqueued to multi hctxs */
+				ret = BLK_QC_T_BIO_POLL_ALL;
+				poll_on = false;
+			}
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -989,6 +1013,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
 
 	current->bio_list = NULL;
+
 	return ret;
 }
 
@@ -1119,6 +1144,44 @@ blk_qc_t submit_bio(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio);
 
+static int blk_poll_bio(blk_qc_t cookie)
+{
+	unsigned int devt = blk_qc_t_to_devt(cookie);
+	unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
+	struct block_device *bdev;
+	struct request_queue *q;
+	struct blk_mq_hw_ctx *hctx;
+	int ret;
+
+	bdev = blkdev_get_no_open(devt);
+
+	/*
+	 * One such case is that dm device has reloaded table and the original
+	 * underlying device the bio submitted to has been detached. When
+	 * reloading table, dm will ensure that previously submitted IOs have
+	 * all completed, thus return directly here.
+	 */
+	if (!bdev)
+		return 1;
+
+	q = bdev->bd_disk->queue;
+	hctx = q->queue_hw_ctx[queue_num];
+
+	/*
+	 * Similar to the case described in the above comment, that dm device
+	 * has reloaded table and the original underlying device the bio
+	 * submitted to has been detached. Thus the dev_t stored in cookie may
+	 * be reused by another blkdev, and if that's the case, return directly
+	 * here.
+	 */
+	if (hctx->type != HCTX_TYPE_POLL)
+		return 1;
+
+	ret = blk_mq_poll_hctx(q, hctx);
+
+	blkdev_put_no_open(bdev);
+	return ret;
+}
 
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
@@ -1129,7 +1192,11 @@ static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	do {
 		int ret;
 
-		ret = disk->fops->poll(q, cookie);
+		if (unlikely(blk_qc_t_is_poll_multi(cookie)))
+			ret = disk->fops->poll(q, cookie);
+		else
+			ret = blk_poll_bio(cookie);
+
 		if (ret > 0) {
 			__set_current_state(TASK_RUNNING);
 			return ret;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fb429daaa909..8f970e026be9 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -505,10 +505,19 @@ static inline int op_stat_group(unsigned int op)
 	return op_is_write(op);
 }
 
-/* Macros for blk_qc_t */
+/*
+ * blk_qc_t for request-based mq devices.
+ * 63                    31 30          15          0 (bit)
+ * +----------------------+-+-----------+-----------+
+ * |      reserved        | | queue_num |    tag    |
+ * +----------------------+-+-----------+-----------+
+ *                         ^
+ *                         BLK_QC_T_INTERNAL
+ */
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
+#define BLK_QC_T_QUEUE_NUM_SIZE	15
 
 static inline bool blk_qc_t_valid(blk_qc_t cookie)
 {
@@ -517,7 +526,8 @@ static inline bool blk_qc_t_valid(blk_qc_t cookie)
 
 static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
 {
-	return (cookie & ~BLK_QC_T_INTERNAL) >> BLK_QC_T_SHIFT;
+	return (cookie >> BLK_QC_T_SHIFT) &
+	       ((1u << BLK_QC_T_QUEUE_NUM_SIZE) - 1);
 }
 
 static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
@@ -530,6 +540,58 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
 	return (cookie & BLK_QC_T_INTERNAL) != 0;
 }
 
+/*
+ * blk_qc_t for bio-based devices.
+ *
+ * 1. When @bio is not split, the returned cookie has following format.
+ *    @dev_t specifies the dev_t number of the underlying device the bio
+ *    submitted to, while @queue_num specifies the hw queue the bio submitted
+ *    to.
+ *
+ * 63                    31 30          15          0 (bit)
+ * +----------------------+-+-----------+-----------+
+ * |        dev_t         | | queue_num |  reserved |
+ * +----------------------+-+-----------+-----------+
+ *                         ^
+ *                         reserved for compatibility with mq
+ *
+ * 2. When @bio gets split and enqueued into multi hw queues, the returned
+ *    cookie is just BLK_QC_T_BIO_POLL_ALL flag.
+ *
+ * 63                                              0 (bit)
+ * +----------------------------------------------+-+
+ * |                                              |1|
+ * +----------------------------------------------+-+
+ *                                                 ^
+ *                                                 BLK_QC_T_BIO_POLL_ALL
+ *
+ * 3. Otherwise, return BLK_QC_T_NONE as the cookie.
+ *
+ * 63                                              0 (bit)
+ * +-----------------------------------------------+
+ * |                  BLK_QC_T_NONE                |
+ * +-----------------------------------------------+
+ */
+#define BLK_QC_T_HIGH_SHIFT	32
+#define BLK_QC_T_BIO_POLL_ALL	1U
+
+static inline unsigned int blk_qc_t_to_devt(blk_qc_t cookie)
+{
+	return cookie >> BLK_QC_T_HIGH_SHIFT;
+}
+
+static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int dev,
+					    unsigned int queue_num)
+{
+	return ((blk_qc_t)dev << BLK_QC_T_HIGH_SHIFT) |
+	       (queue_num << BLK_QC_T_SHIFT);
+}
+
+static inline bool blk_qc_t_is_poll_multi(blk_qc_t cookie)
+{
+	return cookie & BLK_QC_T_BIO_POLL_ALL;
+}
+
 struct blk_rq_stat {
 	u64 mean;
 	u64 min;
diff --git a/include/linux/types.h b/include/linux/types.h
index 52a54ed6ffac..7ff4bb96e0ea 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -126,7 +126,7 @@ typedef u64 sector_t;
 typedef u64 blkcnt_t;
 
 /* cookie used for IO polling */
-typedef unsigned int blk_qc_t;
+typedef u64 blk_qc_t;
 
 /*
  * The type of an index into the pagecache.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 11/12] block: sub-fastpath for bio-based polling
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (9 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 10/12] block: fastpath for bio-based polling Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  2021-03-03 11:57 ` [PATCH v5 12/12] dm: support IO polling for bio-based dm device Jeffle Xu
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

Offer one sub-fastpath for bio-based polling when the bio submitted to
the dm device gets split and enqueued into multiple hw queues, while
the IO submission process has not been migrated to another CPU.

In this case, the IO submission routine will return the CPU number on
which the IO submission happened as the returned cookie, while the
polling routine will iterate and poll only the hw queues that this CPU
number maps to, instead of iterating *all* hw queues.

This optimization can dramatically reduce cache ping-pong and thus
improve polling performance, when multiple polling hw queues per device
can be reserved for multiple polling processes.

It will fall back to iterating all hw queues in polling mode once the
process has been migrated to another CPU during the IO submission
phase.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-core.c          | 18 ++++++++++++++++--
 include/linux/blk_types.h | 38 ++++++++++++++++++++++++++++++++++----
 2 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e5cd4ff08f5c..5479fd74d3be 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -948,7 +948,8 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
 	struct request_queue *top_q;
-	bool poll_on;
+	bool orig_poll_on, poll_on;
+	u64 old_nr_migrations;
 
 	BUG_ON(bio->bi_next);
 
@@ -958,6 +959,8 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	top_q = bio->bi_bdev->bd_disk->queue;
 	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
 		  (bio->bi_opf & REQ_HIPRI);
+	orig_poll_on = poll_on;
+	old_nr_migrations = READ_ONCE(current->se.nr_migrations);
 
 	do {
 		blk_qc_t cookie;
@@ -987,7 +990,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 				ret = cookie;
 			} else if (ret != cookie) {
 				/* bio gets split and enqueued to multi hctxs */
-				ret = BLK_QC_T_BIO_POLL_ALL;
+				ret = blk_qc_t_get_by_cpu();
 				poll_on = false;
 			}
 		}
@@ -1014,6 +1017,17 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 
 	current->bio_list = NULL;
 
+	/*
+	 * For cases when bio gets split and enqueued into multi hctxs, return
+	 * the corresponding CPU number when current process has not been
+	 * migrated to another CPU. Return BLK_QC_T_BIO_POLL_ALL otherwise,
+	 * falling back to iterating and polling on all hw queues, since split
+	 * bios are submitted to different CPUs in this case.
+	 */
+	if (orig_poll_on != poll_on &&
+	    old_nr_migrations != READ_ONCE(current->se.nr_migrations))
+		ret = BLK_QC_T_BIO_POLL_ALL;
+
 	return ret;
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 8f970e026be9..32de4fb79eff 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -555,8 +555,21 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
  *                         ^
  *                         reserved for compatibility with mq
  *
- * 2. When @bio gets split and enqueued into multi hw queues, the returned
- *    cookie is just BLK_QC_T_BIO_POLL_ALL flag.
+ * 2. When @bio gets split and enqueued into multi hw queues, and current
+ *    process has *not* been migrated to another CPU, the returned cookie
+ *    actually stores the corresponding CPU number on which the IO submission
+ *    happened. Also with BLK_QC_T_BIO_POLL_CPU flag set.
+ *
+ * 63                    31                         0 (bit)
+ * +----------------------+-----------------------+-+
+ * |          cpu         |                       |1|
+ * +----------------------+-----------------------+-+
+ *                                                 ^
+ *                                                 BLK_QC_T_BIO_POLL_CPU
+ *
+ * 3. When @bio gets split and enqueued into multi hw queues, and current
+ *    process has ever been migrated to another CPU, the returned cookie is just
+ *    BLK_QC_T_BIO_POLL_ALL flag.
  *
  * 63                                              0 (bit)
  * +----------------------------------------------+-+
@@ -565,7 +578,7 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
  *                                                 ^
  *                                                 BLK_QC_T_BIO_POLL_ALL
  *
- * 3. Otherwise, return BLK_QC_T_NONE as the cookie.
+ * 4. Otherwise, return BLK_QC_T_NONE as the cookie.
  *
  * 63                                              0 (bit)
  * +-----------------------------------------------+
@@ -574,12 +587,18 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
  */
 #define BLK_QC_T_HIGH_SHIFT	32
 #define BLK_QC_T_BIO_POLL_ALL	1U
+#define BLK_QC_T_BIO_POLL_CPU	2U
 
 static inline unsigned int blk_qc_t_to_devt(blk_qc_t cookie)
 {
 	return cookie >> BLK_QC_T_HIGH_SHIFT;
 }
 
+static inline unsigned int blk_qc_t_to_cpu(blk_qc_t cookie)
+{
+	return cookie >> BLK_QC_T_HIGH_SHIFT;
+}
+
 static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int dev,
 					    unsigned int queue_num)
 {
@@ -587,9 +606,20 @@ static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int dev,
 	       (queue_num << BLK_QC_T_SHIFT);
 }
 
+static inline blk_qc_t blk_qc_t_get_by_cpu(void)
+{
+	return ((blk_qc_t)raw_smp_processor_id() << BLK_QC_T_HIGH_SHIFT) |
+	       BLK_QC_T_BIO_POLL_CPU;
+}
+
 static inline bool blk_qc_t_is_poll_multi(blk_qc_t cookie)
 {
-	return cookie & BLK_QC_T_BIO_POLL_ALL;
+	return cookie & (BLK_QC_T_BIO_POLL_ALL | BLK_QC_T_BIO_POLL_CPU);
+}
+
+static inline bool blk_qc_t_is_poll_cpu(blk_qc_t cookie)
+{
+	return cookie & BLK_QC_T_BIO_POLL_CPU;
 }
 
 struct blk_rq_stat {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 12/12] dm: support IO polling for bio-based dm device
  2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
                   ` (10 preceding siblings ...)
  2021-03-03 11:57 ` [PATCH v5 11/12] block: sub-fastpath " Jeffle Xu
@ 2021-03-03 11:57 ` Jeffle Xu
  11 siblings, 0 replies; 23+ messages in thread
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check if the underlying dm target device
supports IO polling.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 drivers/md/dm-table.c         | 26 ++++++++++++
 drivers/md/dm.c               | 74 +++++++++++++++++++++++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 101 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..ed72349eb1db 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1982,6 +1982,19 @@ static int device_requires_stable_pages(struct dm_target *ti,
 	return blk_queue_stable_writes(q);
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	return !test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
+}
+
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			       struct queue_limits *limits)
 {
@@ -2079,6 +2092,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	dm_update_keyslot_manager(q, t);
 	blk_queue_update_readahead(q);
+
+	/*
+	 * Check for request-based device is remained to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For bio-based device, only set QUEUE_FLAG_POLL when all underlying
+	 * devices supporting polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f1b76203b3c7..3f3f47d66386 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1714,6 +1714,78 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return BLK_QC_T_NONE;
 }
 
+struct dm_poll_data {
+	blk_qc_t cookie;
+	int count;
+};
+
+static int dm_poll_one_dev(struct dm_target *ti, struct dm_dev *dev,
+			   sector_t start, sector_t len, void *data)
+{
+	struct dm_poll_data *pdata = data;
+	struct request_queue *q = bdev_get_queue(dev->bdev);
+
+	if (queue_is_mq(q)) {
+		struct blk_mq_hw_ctx *hctx;
+		int i, cpu;
+
+		if (!percpu_ref_tryget(&q->q_usage_counter))
+			return 0;
+
+		if (blk_qc_t_is_poll_cpu(pdata->cookie)) {
+			cpu = blk_qc_t_to_cpu(pdata->cookie);
+			hctx = blk_mq_get_hctx(q, cpu, REQ_HIPRI);
+			pdata->count += blk_mq_poll_hctx(q, hctx);
+		} else {
+			queue_for_each_poll_hw_ctx(q, hctx, i)
+				pdata->count += blk_mq_poll_hctx(q, hctx);
+		}
+
+		percpu_ref_put(&q->q_usage_counter);
+	} else {
+		struct gendisk *disk = dev->bdev->bd_disk;
+
+		pdata->count += disk->fops->poll(q, pdata->cookie);
+	}
+
+	return 0;
+}
+
+static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	int i, srcu_idx;
+	struct dm_table *t;
+	struct dm_target *ti;
+	struct mapped_device *md = queue_to_disk(q)->private_data;
+	struct dm_poll_data pdata = {
+		.cookie = cookie,
+	};
+
+	t = dm_get_live_table(md, &srcu_idx);
+
+	for (i = 0; i < dm_table_get_num_targets(t); i++) {
+		ti = dm_table_get_target(t, i);
+		ti->type->iterate_devices(ti, dm_poll_one_dev, &pdata);
+	}
+
+	dm_put_live_table(md, srcu_idx);
+
+	return pdata.count;
+}
+
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3126,6 +3198,8 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll = dm_bio_poll,
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..31bfd6f70013 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,6 +538,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue
  2021-03-03 11:57 ` [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue Jeffle Xu
@ 2021-03-10 21:57   ` Mike Snitzer
  0 siblings, 0 replies; 23+ messages in thread
From: Mike Snitzer @ 2021-03-10 21:57 UTC (permalink / raw)
  To: Jeffle Xu
  Cc: axboe, io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> There's no sense in waiting for the hw queue when it is currently
> locked by another polling instance. The polling instance currently
> occupying the hw queue will help reap the completion events.
> 
> It is safe to surrender the hw queue, as long as we can reapply for
> polling later. For synchronous polling, blk_poll() will reapply for
> polling, since @spin is always true in this case. For asynchronous
> polling, io_uring itself will reapply for polling when the previous
> poll returns 0.
> 
> Besides, it shall do no harm to the polling performance of mq devices.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>

You should probably just send this to the linux-nvme list independent of
this patchset.

Mike


> ---
>  drivers/nvme/host/pci.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 38b0d694dfc9..150e56ed6d15 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1106,7 +1106,9 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx)
>  	if (!nvme_cqe_pending(nvmeq))
>  		return 0;
>  
> -	spin_lock(&nvmeq->cq_poll_lock);
> +	if (!spin_trylock(&nvmeq->cq_poll_lock))
> +		return 0;
> +
>  	found = nvme_process_cq(nvmeq);
>  	spin_unlock(&nvmeq->cq_poll_lock);
>  
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 03/12] block: add poll method to support bio-based IO polling
  2021-03-03 11:57 ` [PATCH v5 03/12] block: add poll method to support bio-based IO polling Jeffle Xu
@ 2021-03-10 22:01   ` Mike Snitzer
  2021-03-11  5:31     ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2021-03-10 22:01 UTC (permalink / raw)
  To: Jeffle Xu
  Cc: axboe, io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
> callback to struct request_queue") to support bio-based queues such as
> nvme multipath, but was later removed in commit 529262d56dbe ("block:
> remove ->poll_fn").
> 
> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
> block_device_operations") restore the possibility of bio-based IO
> polling support by adding an ->poll method to gendisk->fops.
> 
> Make blk_mq_poll() specific to blk-mq, while blk_bio_poll() is
> originally a copy from blk_mq_poll(), and is specific to bio-based
> polling. Currently hybrid polling is not supported by bio-based polling.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  block/blk-core.c       | 58 ++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq.c         | 22 +---------------
>  include/linux/blk-mq.h |  1 +
>  include/linux/blkdev.h |  1 +
>  4 files changed, 61 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..6d7d53030d7c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1119,6 +1119,64 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +

Minor nit: Extra empty new line here? ^

Otherwise, looks good (I like the end result of blk-mq and bio-based
polling being decoupled like hch suggested).

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> +	long state;
> +	struct gendisk *disk = queue_to_disk(q);
> +
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = disk->fops->poll(q, cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +
> +/**
> + * blk_poll - poll for IO completions
> + * @q:  the queue
> + * @cookie: cookie passed back at IO submission time
> + * @spin: whether to spin for completions
> + *
> + * Description:
> + *    Poll for completions on the passed in queue. Returns number of
> + *    completed entries found. If @spin is true, then blk_poll will continue
> + *    looping until at least one completion is found, unless the task is
> + *    otherwise marked running (or we need to reschedule).
> + */
> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> +	if (!blk_qc_t_valid(cookie) ||
> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +		return 0;
> +
> +	if (current->plug)
> +		blk_flush_plug_list(current->plug, false);
> +
> +	if (queue_is_mq(q))
> +		return blk_mq_poll(q, cookie, spin);
> +	else
> +		return blk_bio_poll(q, cookie, spin);
> +}
> +EXPORT_SYMBOL_GPL(blk_poll);
> +
>  /**
>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>   *                              for the new queue limits
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..214fa30b460a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3852,30 +3852,11 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>  	return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> -/**
> - * blk_poll - poll for IO completions
> - * @q:  the queue
> - * @cookie: cookie passed back at IO submission time
> - * @spin: whether to spin for completions
> - *
> - * Description:
> - *    Poll for completions on the passed in queue. Returns number of
> - *    completed entries found. If @spin is true, then blk_poll will continue
> - *    looping until at least one completion is found, unless the task is
> - *    otherwise marked running (or we need to reschedule).
> - */
> -int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) ||
> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> -		return 0;
> -
> -	if (current->plug)
> -		blk_flush_plug_list(current->plug, false);
> -
>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>  
>  	/*
> @@ -3917,7 +3898,6 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
> -EXPORT_SYMBOL_GPL(blk_poll);
>  
>  unsigned int blk_mq_rq_cpu(struct request *rq)
>  {
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 2c473c9b8990..6a7b693b9917 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>  }
>  
>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>  		struct lock_class_key *key);
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index b81a9fe015ab..9dc83c30e7bc 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1866,6 +1866,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>  
>  struct block_device_operations {
>  	blk_qc_t (*submit_bio) (struct bio *bio);
> +	int (*poll)(struct request_queue *q, blk_qc_t cookie);
>  	int (*open) (struct block_device *, fmode_t);
>  	void (*release) (struct gendisk *, fmode_t);
>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 04/12] block: add poll_capable method to support bio-based IO polling
  2021-03-03 11:57 ` [PATCH v5 04/12] block: add poll_capable " Jeffle Xu
@ 2021-03-10 22:21   ` Mike Snitzer
  2021-03-11  5:43     ` [dm-devel] " JeffleXu
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2021-03-10 22:21 UTC (permalink / raw)
  To: Jeffle Xu
  Cc: axboe, io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> This method can be used to check if bio-based device supports IO polling
> or not. For mq devices, checking for hw queue in polling mode is
> adequate, while the sanity check shall be implementation specific for
> bio-based devices. For example, dm device needs to check if all
> underlying devices are capable of IO polling.
> 
> Though bio-based device may have done the sanity check during the
> device initialization phase, cacheing the result of this sanity check
> (such as by cacheing in the queue_flags) may not work. Because for dm

s/cacheing/caching/

> devices, users could change the state of the underlying devices through
> '/sys/block/<dev>/io_poll', bypassing the dm device above. In this case,
> the cached result of the very beginning sanity check could be
> out-of-date. Thus the sanity check needs to be done every time 'io_poll'
> is to be modified.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>

Ideally QUEUE_FLAG_POLL would be authoritative.. but I appreciate the
problem you've described.  Though I do wonder if this should be solved
by bio-based's fops->poll() method clearing the request_queue's
QUEUE_FLAG_POLL if it finds an underlying device doesn't have
QUEUE_FLAG_POLL set.  Though making bio-based's fops->poll() always need
to validate that an underlying device does support polling is pretty
unfortunate.
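
Roughly, for dm that'd mean something like the following (completely
untested sketch, reusing dm_table_supports_poll() from later in this
series; dm_poll_all_devs() is just a stand-in for the existing device
iteration in your dm_bio_poll()):

static int dm_bio_poll(struct request_queue *q, blk_qc_t cookie)
{
	struct mapped_device *md = queue_to_disk(q)->private_data;
	struct dm_table *t;
	int srcu_idx, ret = 0;

	t = dm_get_live_table(md, &srcu_idx);

	if (unlikely(!dm_table_supports_poll(t))) {
		/* an underlying device lost polling support */
		blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
	} else {
		/* stand-in for the existing polling hw queue iteration */
		ret = dm_poll_all_devs(t, cookie);
	}

	dm_put_live_table(md, srcu_idx);
	return ret;
}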

Either way, queue_poll_store() will need to avoid blk-mq specific poll
checking for bio-based devices.

Mike

> ---
>  block/blk-sysfs.c      | 14 +++++++++++---
>  include/linux/blkdev.h |  1 +
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 0f4f0c8a7825..367c1d9a55c6 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
>  	unsigned long poll_on;
>  	ssize_t ret;
>  
> -	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
> -	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
> -		return -EINVAL;
> +	if (queue_is_mq(q)) {
> +		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
> +		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
> +			return -EINVAL;
> +	} else {
> +		struct gendisk *disk = queue_to_disk(q);
> +
> +		if (!disk->fops->poll_capable ||
> +		    !disk->fops->poll_capable(disk))
> +			return -EINVAL;
> +	}
>  
>  	ret = queue_var_store(&poll_on, page, count);
>  	if (ret < 0)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 9dc83c30e7bc..7df40792c032 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1867,6 +1867,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>  struct block_device_operations {
>  	blk_qc_t (*submit_bio) (struct bio *bio);
>  	int (*poll)(struct request_queue *q, blk_qc_t cookie);
> +	bool (*poll_capable)(struct gendisk *disk);
>  	int (*open) (struct block_device *, fmode_t);
>  	void (*release) (struct gendisk *, fmode_t);
>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-03 11:57 ` [PATCH v5 10/12] block: fastpath for bio-based polling Jeffle Xu
@ 2021-03-10 23:18   ` Mike Snitzer
  2021-03-11  6:36     ` [dm-devel] " JeffleXu
  2021-03-11 13:56   ` Ming Lei
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2021-03-10 23:18 UTC (permalink / raw)
  To: Jeffle Xu
  Cc: axboe, io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

On Wed, Mar 03 2021 at  6:57am -0500,
Jeffle Xu <jefflexu@linux.alibaba.com> wrote:

> Offer one fastpath for bio-based polling when bio submitted to dm
> device is not split.
> 
> In this case, there will be only one bio submitted to only one polling
> hw queue of one underlying mq device, and thus we don't need to track
> all split bios or iterate through all polling hw queues. The pointer to
> the polling hw queue the bio submitted to is returned here as the
> returned cookie. In this case, the polling routine will call
> mq_ops->poll() directly with the hw queue converted from the input
> cookie.
> 
> If the original bio submitted to dm device is split to multiple bios and
> thus submitted to multiple polling hw queues, the polling routine will
> fall back to iterating all hw queues (in polling mode) of all underlying
> mq devices.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
>  include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
>  include/linux/types.h     |  2 +-
>  3 files changed, 135 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6d7d53030d7c..e5cd4ff08f5c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> +	struct request_queue *top_q;
> +	bool poll_on;
>  
>  	BUG_ON(bio->bi_next);
>  
>  	bio_list_init(&bio_list_on_stack[0]);
>  	current->bio_list = bio_list_on_stack;
>  
> +	top_q = bio->bi_bdev->bd_disk->queue;
> +	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
> +		  (bio->bi_opf & REQ_HIPRI);
> +
>  	do {
> -		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +		blk_qc_t cookie;
> +		struct block_device *bdev = bio->bi_bdev;
> +		struct request_queue *q = bdev->bd_disk->queue;
>  		struct bio_list lower, same;
>  
>  		if (unlikely(bio_queue_enter(bio) != 0))
> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		cookie = __submit_bio(bio);
> +
> +		if (poll_on && blk_qc_t_valid(cookie)) {
> +			unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
> +			unsigned int devt = bdev_whole(bdev)->bd_dev;
> +
> +			cookie = blk_qc_t_get_by_devt(devt, queue_num);

The need to rebuild the cookie here is pretty awkward.  This
optimization living in block core may be worthwhile but the duality of
block core conditionally overriding the driver's returned cookie (that
is meant to be opaque to upper layer) is not great.

> +
> +			if (!blk_qc_t_valid(ret)) {
> +				/* set initial value */
> +				ret = cookie;
> +			} else if (ret != cookie) {
> +				/* bio gets split and enqueued to multi hctxs */
> +				ret = BLK_QC_T_BIO_POLL_ALL;
> +				poll_on = false;
> +			}
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -989,6 +1013,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>  
>  	current->bio_list = NULL;
> +
>  	return ret;
>  }

Nit: Not seeing need to alter white space here.

>  
> @@ -1119,6 +1144,44 @@ blk_qc_t submit_bio(struct bio *bio)
>  }
>  EXPORT_SYMBOL(submit_bio);
>  
> +static int blk_poll_bio(blk_qc_t cookie)
> +{
> +	unsigned int devt = blk_qc_t_to_devt(cookie);
> +	unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
> +	struct block_device *bdev;
> +	struct request_queue *q;
> +	struct blk_mq_hw_ctx *hctx;
> +	int ret;
> +
> +	bdev = blkdev_get_no_open(devt);

As you pointed out to me in private, but for the benefit of others,
blkdev_get_no_open()'s need to take inode lock is not ideal here.

> +
> +	/*
> +	 * One such case is that dm device has reloaded table and the original
> +	 * underlying device the bio submitted to has been detached. When
> +	 * reloading table, dm will ensure that previously submitted IOs have
> +	 * all completed, thus return directly here.
> +	 */
> +	if (!bdev)
> +		return 1;
> +
> +	q = bdev->bd_disk->queue;
> +	hctx = q->queue_hw_ctx[queue_num];
> +
> +	/*
> +	 * Similar to the case described in the above comment, that dm device
> +	 * has reloaded table and the original underlying device the bio
> +	 * submitted to has been detached. Thus the dev_t stored in cookie may
> +	 * be reused by another blkdev, and if that's the case, return directly
> +	 * here.
> +	 */
> +	if (hctx->type != HCTX_TYPE_POLL)
> +		return 1;

These checks really aren't authoritative or safe enough.  If the bdev
may have changed then it may not have queue_num hctxs (so you'd access
out-of-bounds). Similarly, a new bdev may have queue_num hctxs and may
just so happen to have its type be HCTX_TYPE_POLL.. but in reality it
isn't the same bdev the cookie was generated from.
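
At a minimum I'd expect blk_poll_bio() to look more like the following
before trusting the cookie (untested sketch, reusing the helpers from
this series):

static int blk_poll_bio(blk_qc_t cookie)
{
	unsigned int devt = blk_qc_t_to_devt(cookie);
	unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
	struct block_device *bdev;
	struct request_queue *q;
	struct blk_mq_hw_ctx *hctx;
	int ret = 1;

	bdev = blkdev_get_no_open(devt);
	if (!bdev)
		return 1;

	q = bdev->bd_disk->queue;

	/*
	 * The dev_t may have been reused by a completely different
	 * device since the cookie was generated, so don't trust
	 * queue_num blindly: the new queue may not even be blk-mq, or
	 * may have fewer hw queues.
	 */
	if (!queue_is_mq(q) || queue_num >= q->nr_hw_queues)
		goto out;

	hctx = q->queue_hw_ctx[queue_num];
	if (hctx->type != HCTX_TYPE_POLL)
		goto out;

	ret = blk_mq_poll_hctx(q, hctx);
out:
	blkdev_put_no_open(bdev);
	return ret;
}

But even that cannot prove it is still the *same* device the cookie was
generated from, which is the real problem.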

And I'm now curious how blk-mq's polling code isn't subject to async io
polling tripping over the possibility of the underlying device having
been changed out from under it -- meaning should this extra validation
be common to bio and request-based?  If not, why not?

> +
> +	ret = blk_mq_poll_hctx(q, hctx);
> +
> +	blkdev_put_no_open(bdev);
> +	return ret;
> +}
>  
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> @@ -1129,7 +1192,11 @@ static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	do {
>  		int ret;
>  
> -		ret = disk->fops->poll(q, cookie);
> +		if (unlikely(blk_qc_t_is_poll_multi(cookie)))
> +			ret = disk->fops->poll(q, cookie);
> +		else
> +			ret = blk_poll_bio(cookie);
> +

Again, this just seems too limiting. Would rather always call into
disk->fops->poll and have it optimize for single hctx based on cookie it
established.

Not seeing why all this needs to be driven by block core _yet_.
Could be I'm just wanting some flexibility for the design to harden (and
potential for optimization left as a driver concern.. as I mentioned in
my reply to the 0th patch header for this v5 patchset).

>  		if (ret > 0) {
>  			__set_current_state(TASK_RUNNING);
>  			return ret;
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index fb429daaa909..8f970e026be9 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -505,10 +505,19 @@ static inline int op_stat_group(unsigned int op)
>  	return op_is_write(op);
>  }
>  
> -/* Macros for blk_qc_t */
> +/*
> + * blk_qc_t for request-based mq devices.
> + * 63                    31 30          15          0 (bit)
> + * +----------------------+-+-----------+-----------+
> + * |      reserved        | | queue_num |    tag    |
> + * +----------------------+-+-----------+-----------+
> + *                         ^
> + *                         BLK_QC_T_INTERNAL
> + */
>  #define BLK_QC_T_NONE		-1U
>  #define BLK_QC_T_SHIFT		16
>  #define BLK_QC_T_INTERNAL	(1U << 31)
> +#define BLK_QC_T_QUEUE_NUM_SIZE	15
>  
>  static inline bool blk_qc_t_valid(blk_qc_t cookie)
>  {
> @@ -517,7 +526,8 @@ static inline bool blk_qc_t_valid(blk_qc_t cookie)
>  
>  static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
>  {
> -	return (cookie & ~BLK_QC_T_INTERNAL) >> BLK_QC_T_SHIFT;
> +	return (cookie >> BLK_QC_T_SHIFT) &
> +	       ((1u << BLK_QC_T_QUEUE_NUM_SIZE) - 1);
>  }
>  
>  static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
> @@ -530,6 +540,58 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
>  	return (cookie & BLK_QC_T_INTERNAL) != 0;
>  }
>  
> +/*
> + * blk_qc_t for bio-based devices.
> + *
> + * 1. When @bio is not split, the returned cookie has following format.
> + *    @dev_t specifies the dev_t number of the underlying device the bio
> + *    submitted to, while @queue_num specifies the hw queue the bio submitted
> + *    to.
> + *
> + * 63                    31 30          15          0 (bit)
> + * +----------------------+-+-----------+-----------+
> + * |        dev_t         | | queue_num |  reserved |
> + * +----------------------+-+-----------+-----------+
> + *                         ^
> + *                         reserved for compatibility with mq
> + *
> + * 2. When @bio gets split and enqueued into multi hw queues, the returned
> + *    cookie is just BLK_QC_T_BIO_POLL_ALL flag.
> + *
> + * 63                                              0 (bit)
> + * +----------------------------------------------+-+
> + * |                                              |1|
> + * +----------------------------------------------+-+
> + *                                                 ^
> + *                                                 BLK_QC_T_BIO_POLL_ALL
> + *
> + * 3. Otherwise, return BLK_QC_T_NONE as the cookie.
> + *
> + * 63                                              0 (bit)
> + * +-----------------------------------------------+
> + * |                  BLK_QC_T_NONE                |
> + * +-----------------------------------------------+
> + */
> +#define BLK_QC_T_HIGH_SHIFT	32
> +#define BLK_QC_T_BIO_POLL_ALL	1U

Pulling on same thread I raised above, the cookie is meant to be
opaque.  Pinning down how the cookie is (ab)used in block core seems to
undermine the intended flexibility.

I'd much rather these details be pushed into drivers/md/bio-poll.h or
something for the near-term.  We can always elevate it to block core
if/when there is sufficient justification.

Just feels we're getting too constraining too quickly.

Mike


> +
> +static inline unsigned int blk_qc_t_to_devt(blk_qc_t cookie)
> +{
> +	return cookie >> BLK_QC_T_HIGH_SHIFT;
> +}
> +
> +static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int dev,
> +					    unsigned int queue_num)
> +{
> +	return ((blk_qc_t)dev << BLK_QC_T_HIGH_SHIFT) |
> +	       (queue_num << BLK_QC_T_SHIFT);
> +}
> +
> +static inline bool blk_qc_t_is_poll_multi(blk_qc_t cookie)
> +{
> +	return cookie & BLK_QC_T_BIO_POLL_ALL;
> +}
> +
>  struct blk_rq_stat {
>  	u64 mean;
>  	u64 min;
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 52a54ed6ffac..7ff4bb96e0ea 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -126,7 +126,7 @@ typedef u64 sector_t;
>  typedef u64 blkcnt_t;
>  
>  /* cookie used for IO polling */
> -typedef unsigned int blk_qc_t;
> +typedef u64 blk_qc_t;
>  
>  /*
>   * The type of an index into the pagecache.
> -- 
> 2.27.0
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v5 03/12] block: add poll method to support bio-based IO polling
  2021-03-10 22:01   ` Mike Snitzer
@ 2021-03-11  5:31     ` JeffleXu
  0 siblings, 0 replies; 23+ messages in thread
From: JeffleXu @ 2021-03-11  5:31 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, caspar, linux-block, joseph.qi, dm-devel, mpatocka, io-uring



On 3/11/21 6:01 AM, Mike Snitzer wrote:
> On Wed, Mar 03 2021 at  6:57am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> ->poll_fn was introduced in commit ea435e1b9392 ("block: add a poll_fn
>> callback to struct request_queue") to support bio-based queues such as
>> nvme multipath, but was later removed in commit 529262d56dbe ("block:
>> remove ->poll_fn").
>>
>> Given commit c62b37d96b6e ("block: move ->make_request_fn to struct
>> block_device_operations") restore the possibility of bio-based IO
>> polling support by adding an ->poll method to gendisk->fops.
>>
>> Make blk_mq_poll() specific to blk-mq, while blk_bio_poll() is
>> originally a copy from blk_mq_poll(), and is specific to bio-based
>> polling. Currently hybrid polling is not supported by bio-based polling.
>>
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> ---
>>  block/blk-core.c       | 58 ++++++++++++++++++++++++++++++++++++++++++
>>  block/blk-mq.c         | 22 +---------------
>>  include/linux/blk-mq.h |  1 +
>>  include/linux/blkdev.h |  1 +
>>  4 files changed, 61 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index fc60ff208497..6d7d53030d7c 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1119,6 +1119,64 @@ blk_qc_t submit_bio(struct bio *bio)
>>  }
>>  EXPORT_SYMBOL(submit_bio);
>>  
>> +
> 
> Minor nit: Extra empty new line here? ^

Regards. Thanks.

> 
> Otherwise, looks good (I like the end result of blk-mq and bio-based
> polling being decoupled like hch suggested).
> 
> Reviewed-by: Mike Snitzer <snitzer@redhat.com>
> 
>> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>> +{
>> +	long state;
>> +	struct gendisk *disk = queue_to_disk(q);
>> +
>> +	state = current->state;
>> +	do {
>> +		int ret;
>> +
>> +		ret = disk->fops->poll(q, cookie);
>> +		if (ret > 0) {
>> +			__set_current_state(TASK_RUNNING);
>> +			return ret;
>> +		}
>> +
>> +		if (signal_pending_state(state, current))
>> +			__set_current_state(TASK_RUNNING);
>> +
>> +		if (current->state == TASK_RUNNING)
>> +			return 1;
>> +		if (ret < 0 || !spin)
>> +			break;
>> +		cpu_relax();
>> +	} while (!need_resched());
>> +
>> +	__set_current_state(TASK_RUNNING);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * blk_poll - poll for IO completions
>> + * @q:  the queue
>> + * @cookie: cookie passed back at IO submission time
>> + * @spin: whether to spin for completions
>> + *
>> + * Description:
>> + *    Poll for completions on the passed in queue. Returns number of
>> + *    completed entries found. If @spin is true, then blk_poll will continue
>> + *    looping until at least one completion is found, unless the task is
>> + *    otherwise marked running (or we need to reschedule).
>> + */
>> +int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>> +{
>> +	if (!blk_qc_t_valid(cookie) ||
>> +	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>> +		return 0;
>> +
>> +	if (current->plug)
>> +		blk_flush_plug_list(current->plug, false);
>> +
>> +	if (queue_is_mq(q))
>> +		return blk_mq_poll(q, cookie, spin);
>> +	else
>> +		return blk_bio_poll(q, cookie, spin);
>> +}
>> +EXPORT_SYMBOL_GPL(blk_poll);
>> +
>>  /**
>>   * blk_cloned_rq_check_limits - Helper function to check a cloned request
>>   *                              for the new queue limits
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index d4d7c1caa439..214fa30b460a 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3852,30 +3852,11 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>>  	return blk_mq_poll_hybrid_sleep(q, rq);
>>  }
>>  
>> -/**
>> - * blk_poll - poll for IO completions
>> - * @q:  the queue
>> - * @cookie: cookie passed back at IO submission time
>> - * @spin: whether to spin for completions
>> - *
>> - * Description:
>> - *    Poll for completions on the passed in queue. Returns number of
>> - *    completed entries found. If @spin is true, then blk_poll will continue
>> - *    looping until at least one completion is found, unless the task is
>> - *    otherwise marked running (or we need to reschedule).
>> - */
>> -int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>  {
>>  	struct blk_mq_hw_ctx *hctx;
>>  	long state;
>>  
>> -	if (!blk_qc_t_valid(cookie) ||
>> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>> -		return 0;
>> -
>> -	if (current->plug)
>> -		blk_flush_plug_list(current->plug, false);
>> -
>>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>  
>>  	/*
>> @@ -3917,7 +3898,6 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>  	__set_current_state(TASK_RUNNING);
>>  	return 0;
>>  }
>> -EXPORT_SYMBOL_GPL(blk_poll);
>>  
>>  unsigned int blk_mq_rq_cpu(struct request *rq)
>>  {
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 2c473c9b8990..6a7b693b9917 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -615,6 +615,7 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
>>  }
>>  
>>  blk_qc_t blk_mq_submit_bio(struct bio *bio);
>> +int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, bool spin);
>>  void blk_mq_hctx_set_fq_lock_class(struct blk_mq_hw_ctx *hctx,
>>  		struct lock_class_key *key);
>>  
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index b81a9fe015ab..9dc83c30e7bc 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1866,6 +1866,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>>  
>>  struct block_device_operations {
>>  	blk_qc_t (*submit_bio) (struct bio *bio);
>> +	int (*poll)(struct request_queue *q, blk_qc_t cookie);
>>  	int (*open) (struct block_device *, fmode_t);
>>  	void (*release) (struct gendisk *, fmode_t);
>>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
>> -- 
>> 2.27.0
>>
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v5 04/12] block: add poll_capable method to support bio-based IO polling
  2021-03-10 22:21   ` Mike Snitzer
@ 2021-03-11  5:43     ` JeffleXu
  0 siblings, 0 replies; 23+ messages in thread
From: JeffleXu @ 2021-03-11  5:43 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, caspar, linux-block, joseph.qi, dm-devel, mpatocka, io-uring



On 3/11/21 6:21 AM, Mike Snitzer wrote:
> On Wed, Mar 03 2021 at  6:57am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> This method can be used to check if bio-based device supports IO polling
>> or not. For mq devices, checking for hw queue in polling mode is
>> adequate, while the sanity check shall be implementation specific for
>> bio-based devices. For example, dm device needs to check if all
>> underlying devices are capable of IO polling.
>>
>> Though bio-based device may have done the sanity check during the
>> device initialization phase, cacheing the result of this sanity check
>> (such as by cacheing in the queue_flags) may not work. Because for dm
> 
> s/cacheing/caching/
> 
>> devices, users could change the state of the underlying devices through
>> '/sys/block/<dev>/io_poll', bypassing the dm device above. In this case,
>> the cached result of the very beginning sanity check could be
>> out-of-date. Thus the sanity check needs to be done every time 'io_poll'
>> is to be modified.
>>
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> 
> Ideally QUEUE_FLAG_POLL would be authoritative.. but I appreciate the
> problem you've described.  Though I do wonder if this should be solved
> by bio-based's fops->poll() method clearing the request_queue's
> QUEUE_FLAG_POLL if it finds an underlying device doesn't have
> QUEUE_FLAG_POLL set.  Though making bio-based's fops->poll() always need
> to validate that an underlying device does support polling is pretty
> unfortunate.

Agreed. Doing control (slow) path work in the IO (fast) path should be
avoided as much as possible.

Besides, considering the following patch [1], you should flush the queue
before clearing the QUEUE_FLAG_POLL flag. If we embed the checking and
clearing in the normal IO path, it may result in a deadlock. For
example, once the polling routine finds that one of the underlying
devices has cleared its QUEUE_FLAG_POLL flag, it needs to flush the
queue (of the dm device) before clearing the QUEUE_FLAG_POLL flag (of
the dm device), while the polling routine itself is the one responsible
for reaping the completion events.

Of course the polling routine could defer clearing the QUEUE_FLAG_POLL
flag to a workqueue or something, but all of this would add much
complexity...
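
For example, the deferred variant could look roughly like this
(completely untested sketch; the poll_disable_work member of struct
mapped_device is hypothetical, and the freeze pairing may need a
dm-level quiesce instead):

static void dm_disable_poll_work(struct work_struct *work)
{
	struct mapped_device *md =
		container_of(work, struct mapped_device, poll_disable_work);

	/*
	 * Wait for in-flight polled bios to complete before dropping
	 * QUEUE_FLAG_POLL, as required by [1]. Doing this inline from
	 * ->poll() would deadlock, since the poller itself is the one
	 * reaping those completions; from a workqueue the original
	 * poller can keep polling while we wait here.
	 */
	blk_mq_freeze_queue(md->queue);
	blk_queue_flag_clear(QUEUE_FLAG_POLL, md->queue);
	blk_mq_unfreeze_queue(md->queue);
}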


[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.12-rc2&id=6b09b4d33bd964f49d07d3cabfb4204d58cf9811

> 
> Either way, queue_poll_store() will need to avoid blk-mq specific poll
> checking for bio-based devices.
> 
> Mike
> 
>> ---
>>  block/blk-sysfs.c      | 14 +++++++++++---
>>  include/linux/blkdev.h |  1 +
>>  2 files changed, 12 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 0f4f0c8a7825..367c1d9a55c6 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
>>  	unsigned long poll_on;
>>  	ssize_t ret;
>>  
>> -	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
>> -	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
>> -		return -EINVAL;
>> +	if (queue_is_mq(q)) {
>> +		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
>> +		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
>> +			return -EINVAL;
>> +	} else {
>> +		struct gendisk *disk = queue_to_disk(q);
>> +
>> +		if (!disk->fops->poll_capable ||
>> +		    !disk->fops->poll_capable(disk))
>> +			return -EINVAL;
>> +	}
>>  
>>  	ret = queue_var_store(&poll_on, page, count);
>>  	if (ret < 0)
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 9dc83c30e7bc..7df40792c032 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1867,6 +1867,7 @@ static inline void blk_ksm_unregister(struct request_queue *q) { }
>>  struct block_device_operations {
>>  	blk_qc_t (*submit_bio) (struct bio *bio);
>>  	int (*poll)(struct request_queue *q, blk_qc_t cookie);
>> +	bool (*poll_capable)(struct gendisk *disk);
>>  	int (*open) (struct block_device *, fmode_t);
>>  	void (*release) (struct gendisk *, fmode_t);
>>  	int (*rw_page)(struct block_device *, sector_t, struct page *, unsigned int);
>> -- 
>> 2.27.0
>>
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-10 23:18   ` Mike Snitzer
@ 2021-03-11  6:36     ` JeffleXu
  2021-03-12  2:26       ` JeffleXu
  0 siblings, 1 reply; 23+ messages in thread
From: JeffleXu @ 2021-03-11  6:36 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, caspar, linux-block, joseph.qi, dm-devel, mpatocka, io-uring



On 3/11/21 7:18 AM, Mike Snitzer wrote:
> On Wed, Mar 03 2021 at  6:57am -0500,
> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
> 
>> Offer one fastpath for bio-based polling when bio submitted to dm
>> device is not split.
>>
>> In this case, there will be only one bio submitted to only one polling
>> hw queue of one underlying mq device, and thus we don't need to track
>> all split bios or iterate through all polling hw queues. The pointer to
>> the polling hw queue the bio submitted to is returned here as the
>> returned cookie. In this case, the polling routine will call
>> mq_ops->poll() directly with the hw queue converted from the input
>> cookie.
>>
>> If the original bio submitted to dm device is split to multiple bios and
>> thus submitted to multiple polling hw queues, the polling routine will
>> fall back to iterating all hw queues (in polling mode) of all underlying
>> mq devices.
>>
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> ---
>>  block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
>>  include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
>>  include/linux/types.h     |  2 +-
>>  3 files changed, 135 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 6d7d53030d7c..e5cd4ff08f5c 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  {
>>  	struct bio_list bio_list_on_stack[2];
>>  	blk_qc_t ret = BLK_QC_T_NONE;
>> +	struct request_queue *top_q;
>> +	bool poll_on;
>>  
>>  	BUG_ON(bio->bi_next);
>>  
>>  	bio_list_init(&bio_list_on_stack[0]);
>>  	current->bio_list = bio_list_on_stack;
>>  
>> +	top_q = bio->bi_bdev->bd_disk->queue;
>> +	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
>> +		  (bio->bi_opf & REQ_HIPRI);
>> +
>>  	do {
>> -		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>> +		blk_qc_t cookie;
>> +		struct block_device *bdev = bio->bi_bdev;
>> +		struct request_queue *q = bdev->bd_disk->queue;
>>  		struct bio_list lower, same;
>>  
>>  		if (unlikely(bio_queue_enter(bio) != 0))
>> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>  		bio_list_init(&bio_list_on_stack[0]);
>>  
>> -		ret = __submit_bio(bio);
>> +		cookie = __submit_bio(bio);
>> +
>> +		if (poll_on && blk_qc_t_valid(cookie)) {
>> +			unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
>> +			unsigned int devt = bdev_whole(bdev)->bd_dev;
>> +
>> +			cookie = blk_qc_t_get_by_devt(devt, queue_num);
> 
> The need to rebuild the cookie here is pretty awkward.  This
> optimization living in block core may be worthwhile but the duality of
> block core conditionally overriding the driver's returned cookie (that
> is meant to be opaque to upper layer) is not great.

I agree that this design pattern currently couples blk-core and dm
tightly together. Personally, maybe that's the most serious problem of
this patchset.

I can explain it though... Due to the nature of dm's IO path, dm itself
doesn't know whether the original bio gets split into multiple bios and
thus submitted to multiple underlying devices. Currently I can only get
this information in __submit_bio_noacct(), and that's why so much
dm-specific logic is coupled with blk-core here.

Besides, as you just pointed out, this approach is not compatible with
other bio-based devices once they start to support polling.

> 
>> +
>> +			if (!blk_qc_t_valid(ret)) {
>> +				/* set initial value */
>> +				ret = cookie;
>> +			} else if (ret != cookie) {
>> +				/* bio gets split and enqueued to multi hctxs */
>> +				ret = BLK_QC_T_BIO_POLL_ALL;
>> +				poll_on = false;
>> +			}
>> +		}
>>  
>>  		/*
>>  		 * Sort new bios into those for a lower level and those for the
>> @@ -989,6 +1013,7 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
>>  
>>  	current->bio_list = NULL;
>> +
>>  	return ret;
>>  }
> 
> Nit: Not seeing need to alter white space here.

Regards.

> 
>>  
>> @@ -1119,6 +1144,44 @@ blk_qc_t submit_bio(struct bio *bio)
>>  }
>>  EXPORT_SYMBOL(submit_bio);
>>  
>> +static int blk_poll_bio(blk_qc_t cookie)
>> +{
>> +	unsigned int devt = blk_qc_t_to_devt(cookie);
>> +	unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
>> +	struct block_device *bdev;
>> +	struct request_queue *q;
>> +	struct blk_mq_hw_ctx *hctx;
>> +	int ret;
>> +
>> +	bdev = blkdev_get_no_open(devt);
> 
> As you pointed out to me in private, but for the benefit of others,
> blkdev_get_no_open()'s need to take inode lock is not ideal here.

Regards.

> 
>> +
>> +	/*
>> +	 * One such case is that dm device has reloaded table and the original
>> +	 * underlying device the bio submitted to has been detached. When
>> +	 * reloading table, dm will ensure that previously submitted IOs have
>> +	 * all completed, thus return directly here.
>> +	 */
>> +	if (!bdev)
>> +		return 1;
>> +
>> +	q = bdev->bd_disk->queue;
>> +	hctx = q->queue_hw_ctx[queue_num];
>> +
>> +	/*
>> +	 * Similar to the case described in the above comment, that dm device
>> +	 * has reloaded table and the original underlying device the bio
>> +	 * submitted to has been detached. Thus the dev_t stored in cookie may
>> +	 * be reused by another blkdev, and if that's the case, return directly
>> +	 * here.
>> +	 */
>> +	if (hctx->type != HCTX_TYPE_POLL)
>> +		return 1;
> 
> These checks really aren't authoritative or safe enough.  If the bdev
> may have changed then it may not have queue_num hctxs (so you'd access
> out-of-bounds). Similarly, a new bdev may have queue_num hctxs and may
> just so happen to have its type be HCTX_TYPE_POLL.. but in reality it
> isn't the same bdev the cookie was generated from.

Agreed. More checks are needed here.

> 
> And I'm now curious how blk-mq's polling code isn't subject to async io
> polling tripping over the possibility of the underlying device having
> been changed out from under it -- meaning should this extra validation
> be common to bio and request-based?  If not, why not?

Blk-mq doesn't have this issue. io_uring will do fget() for input files
that need to do IO, and fput() will be deferred until the IO has
completed. Thus the polling routine is guarded by fget() and fput().
fget() ensures that the underlying disk (struct block_device, struct
gendisk, struct request_queue, etc.) stays pinned.

For dm, the above can only guarantee that the dm device itself is
pinned, while there's no guarantee for the underlying devices once a dm
table reload happens.


> 
>> +
>> +	ret = blk_mq_poll_hctx(q, hctx);
>> +
>> +	blkdev_put_no_open(bdev);
>> +	return ret;
>> +}
>>  
>>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>  {
>> @@ -1129,7 +1192,11 @@ static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>  	do {
>>  		int ret;
>>  
>> -		ret = disk->fops->poll(q, cookie);
>> +		if (unlikely(blk_qc_t_is_poll_multi(cookie)))
>> +			ret = disk->fops->poll(q, cookie);
>> +		else
>> +			ret = blk_poll_bio(cookie);
>> +
> 
> Again, this just seems too limiting. Would rather always call into
> disk->fops->poll and have it optimize for single hctx based on cookie it
> established.
> 
> Not seeing why all this needs to be driven by block core _yet_.
> Could be I'm just wanting some flexibility for the design to harden (and
> potential for optimization left as a driver concern.. as I mentioned in
> my reply to the 0th patch header for this v5 patchset).

The reason why all of this is mixed together is explained above.

> 
>>  		if (ret > 0) {
>>  			__set_current_state(TASK_RUNNING);
>>  			return ret;
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index fb429daaa909..8f970e026be9 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -505,10 +505,19 @@ static inline int op_stat_group(unsigned int op)
>>  	return op_is_write(op);
>>  }
>>  
>> -/* Macros for blk_qc_t */
>> +/*
>> + * blk_qc_t for request-based mq devices.
>> + * 63                    31 30          15          0 (bit)
>> + * +----------------------+-+-----------+-----------+
>> + * |      reserved        | | queue_num |    tag    |
>> + * +----------------------+-+-----------+-----------+
>> + *                         ^
>> + *                         BLK_QC_T_INTERNAL
>> + */
>>  #define BLK_QC_T_NONE		-1U
>>  #define BLK_QC_T_SHIFT		16
>>  #define BLK_QC_T_INTERNAL	(1U << 31)
>> +#define BLK_QC_T_QUEUE_NUM_SIZE	15
>>  
>>  static inline bool blk_qc_t_valid(blk_qc_t cookie)
>>  {
>> @@ -517,7 +526,8 @@ static inline bool blk_qc_t_valid(blk_qc_t cookie)
>>  
>>  static inline unsigned int blk_qc_t_to_queue_num(blk_qc_t cookie)
>>  {
>> -	return (cookie & ~BLK_QC_T_INTERNAL) >> BLK_QC_T_SHIFT;
>> +	return (cookie >> BLK_QC_T_SHIFT) &
>> +	       ((1u << BLK_QC_T_QUEUE_NUM_SIZE) - 1);
>>  }
>>  
>>  static inline unsigned int blk_qc_t_to_tag(blk_qc_t cookie)
>> @@ -530,6 +540,58 @@ static inline bool blk_qc_t_is_internal(blk_qc_t cookie)
>>  	return (cookie & BLK_QC_T_INTERNAL) != 0;
>>  }
>>  
>> +/*
>> + * blk_qc_t for bio-based devices.
>> + *
>> + * 1. When @bio is not split, the returned cookie has following format.
>> + *    @dev_t specifies the dev_t number of the underlying device the bio
>> + *    submitted to, while @queue_num specifies the hw queue the bio submitted
>> + *    to.
>> + *
>> + * 63                    31 30          15          0 (bit)
>> + * +----------------------+-+-----------+-----------+
>> + * |        dev_t         | | queue_num |  reserved |
>> + * +----------------------+-+-----------+-----------+
>> + *                         ^
>> + *                         reserved for compatibility with mq
>> + *
>> + * 2. When @bio gets split and enqueued into multi hw queues, the returned
>> + *    cookie is just BLK_QC_T_BIO_POLL_ALL flag.
>> + *
>> + * 63                                              0 (bit)
>> + * +----------------------------------------------+-+
>> + * |                                              |1|
>> + * +----------------------------------------------+-+
>> + *                                                 ^
>> + *                                                 BLK_QC_T_BIO_POLL_ALL
>> + *
>> + * 3. Otherwise, return BLK_QC_T_NONE as the cookie.
>> + *
>> + * 63                                              0 (bit)
>> + * +-----------------------------------------------+
>> + * |                  BLK_QC_T_NONE                |
>> + * +-----------------------------------------------+
>> + */
>> +#define BLK_QC_T_HIGH_SHIFT	32
>> +#define BLK_QC_T_BIO_POLL_ALL	1U
> 
> Pulling on same thread I raised above, the cookie is meant to be
> opaque.  Pinning down how the cookie is (ab)used in block core seems to
> undermine the intended flexibility.
> 
> I'd much rather these details be pushed into drivers/md/bio-poll.h or
> something for the near-term.  We can always elevate it to block core
> if/when there is sufficient justification.
> 
> Just feels we're getting too constraining too quickly.
> 
> Mike
> 
> 
>> +
>> +static inline unsigned int blk_qc_t_to_devt(blk_qc_t cookie)
>> +{
>> +	return cookie >> BLK_QC_T_HIGH_SHIFT;
>> +}
>> +
>> +static inline blk_qc_t blk_qc_t_get_by_devt(unsigned int dev,
>> +					    unsigned int queue_num)
>> +{
>> +	return ((blk_qc_t)dev << BLK_QC_T_HIGH_SHIFT) |
>> +	       (queue_num << BLK_QC_T_SHIFT);
>> +}
>> +
>> +static inline bool blk_qc_t_is_poll_multi(blk_qc_t cookie)
>> +{
>> +	return cookie & BLK_QC_T_BIO_POLL_ALL;
>> +}
>> +
>>  struct blk_rq_stat {
>>  	u64 mean;
>>  	u64 min;
>> diff --git a/include/linux/types.h b/include/linux/types.h
>> index 52a54ed6ffac..7ff4bb96e0ea 100644
>> --- a/include/linux/types.h
>> +++ b/include/linux/types.h
>> @@ -126,7 +126,7 @@ typedef u64 sector_t;
>>  typedef u64 blkcnt_t;
>>  
>>  /* cookie used for IO polling */
>> -typedef unsigned int blk_qc_t;
>> +typedef u64 blk_qc_t;
>>  
>>  /*
>>   * The type of an index into the pagecache.
>> -- 
>> 2.27.0
>>
> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-03 11:57 ` [PATCH v5 10/12] block: fastpath for bio-based polling Jeffle Xu
  2021-03-10 23:18   ` Mike Snitzer
@ 2021-03-11 13:56   ` Ming Lei
  2021-03-12  1:56     ` [dm-devel] " JeffleXu
  1 sibling, 1 reply; 23+ messages in thread
From: Ming Lei @ 2021-03-11 13:56 UTC (permalink / raw)
  To: Jeffle Xu
  Cc: msnitzer, axboe, io-uring, dm-devel, linux-block, mpatocka,
	caspar, joseph.qi

On Wed, Mar 03, 2021 at 07:57:38PM +0800, Jeffle Xu wrote:
> Offer one fastpath for bio-based polling when bio submitted to dm
> device is not split.
> 
> In this case, there will be only one bio submitted to only one polling
> hw queue of one underlying mq device, and thus we don't need to track
> all split bios or iterate through all polling hw queues. The pointer to
> the polling hw queue the bio submitted to is returned here as the
> returned cookie. In this case, the polling routine will call
> mq_ops->poll() directly with the hw queue converted from the input
> cookie.
> 
> If the original bio submitted to dm device is split to multiple bios and
> thus submitted to multiple polling hw queues, the polling routine will
> fall back to iterating all hw queues (in polling mode) of all underlying
> mq devices.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
>  include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
>  include/linux/types.h     |  2 +-
>  3 files changed, 135 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 6d7d53030d7c..e5cd4ff08f5c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> +	struct request_queue *top_q;
> +	bool poll_on;
>  
>  	BUG_ON(bio->bi_next);
>  
>  	bio_list_init(&bio_list_on_stack[0]);
>  	current->bio_list = bio_list_on_stack;
>  
> +	top_q = bio->bi_bdev->bd_disk->queue;
> +	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
> +		  (bio->bi_opf & REQ_HIPRI);
> +
>  	do {
> -		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +		blk_qc_t cookie;
> +		struct block_device *bdev = bio->bi_bdev;
> +		struct request_queue *q = bdev->bd_disk->queue;
>  		struct bio_list lower, same;
>  
>  		if (unlikely(bio_queue_enter(bio) != 0))
> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		cookie = __submit_bio(bio);
> +
> +		if (poll_on && blk_qc_t_valid(cookie)) {

In patch 8, dm_submit_bio() is changed to always return BLK_QC_T_NONE,
so the returned cookie may be BLK_QC_T_NONE for a DM device, e.g. when
DM_MAPIO_SUBMITTED is returned from ->map() and the underlying bios are
submitted from another context; then nothing is fed to blk_poll().


thanks, 
Ming


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-11 13:56   ` Ming Lei
@ 2021-03-12  1:56     ` JeffleXu
  0 siblings, 0 replies; 23+ messages in thread
From: JeffleXu @ 2021-03-12  1:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, msnitzer, caspar, linux-block, joseph.qi, dm-devel,
	mpatocka, io-uring



On 3/11/21 9:56 PM, Ming Lei wrote:
> On Wed, Mar 03, 2021 at 07:57:38PM +0800, Jeffle Xu wrote:
>> Offer one fastpath for bio-based polling when bio submitted to dm
>> device is not split.
>>
>> In this case, there will be only one bio submitted to only one polling
>> hw queue of one underlying mq device, and thus we don't need to track
>> all split bios or iterate through all polling hw queues. The pointer to
>> the polling hw queue the bio submitted to is returned here as the
>> returned cookie. In this case, the polling routine will call
>> mq_ops->poll() directly with the hw queue converted from the input
>> cookie.
>>
>> If the original bio submitted to dm device is split to multiple bios and
>> thus submitted to multiple polling hw queues, the polling routine will
>> fall back to iterating all hw queues (in polling mode) of all underlying
>> mq devices.
>>
>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>> ---
>>  block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
>>  include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
>>  include/linux/types.h     |  2 +-
>>  3 files changed, 135 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 6d7d53030d7c..e5cd4ff08f5c 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  {
>>  	struct bio_list bio_list_on_stack[2];
>>  	blk_qc_t ret = BLK_QC_T_NONE;
>> +	struct request_queue *top_q;
>> +	bool poll_on;
>>  
>>  	BUG_ON(bio->bi_next);
>>  
>>  	bio_list_init(&bio_list_on_stack[0]);
>>  	current->bio_list = bio_list_on_stack;
>>  
>> +	top_q = bio->bi_bdev->bd_disk->queue;
>> +	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
>> +		  (bio->bi_opf & REQ_HIPRI);
>> +
>>  	do {
>> -		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>> +		blk_qc_t cookie;
>> +		struct block_device *bdev = bio->bi_bdev;
>> +		struct request_queue *q = bdev->bd_disk->queue;
>>  		struct bio_list lower, same;
>>  
>>  		if (unlikely(bio_queue_enter(bio) != 0))
>> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>  		bio_list_init(&bio_list_on_stack[0]);
>>  
>> -		ret = __submit_bio(bio);
>> +		cookie = __submit_bio(bio);
>> +
>> +		if (poll_on && blk_qc_t_valid(cookie)) {
> 
> In patch 8, dm_submit_bio() is changed to return BLK_QC_T_NONE always,
> so the returned cookie may be BLK_QC_T_NONE for DM device, such as, in
> case of DM_MAPIO_SUBMITTED returned from ->map(), and underlying bios
> can be submitted from another context, then nothing is fed to blk_poll().

Thanks for pointing this out. Indeed this issue exists. If the IO
submission is offloaded to another process context, the current simple
cookie mechanism doesn't cover that case.


-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v5 10/12] block: fastpath for bio-based polling
  2021-03-11  6:36     ` [dm-devel] " JeffleXu
@ 2021-03-12  2:26       ` JeffleXu
  0 siblings, 0 replies; 23+ messages in thread
From: JeffleXu @ 2021-03-12  2:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, caspar, linux-block, joseph.qi, dm-devel, mpatocka, io-uring



On 3/11/21 2:36 PM, JeffleXu wrote:
> 
> 
> On 3/11/21 7:18 AM, Mike Snitzer wrote:
>> On Wed, Mar 03 2021 at  6:57am -0500,
>> Jeffle Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>> Offer one fastpath for bio-based polling when bio submitted to dm
>>> device is not split.
>>>
>>> In this case, there will be only one bio submitted to only one polling
>>> hw queue of one underlying mq device, and thus we don't need to track
>>> all split bios or iterate through all polling hw queues. The pointer to
>>> the polling hw queue the bio submitted to is returned here as the
>>> returned cookie. In this case, the polling routine will call
>>> mq_ops->poll() directly with the hw queue converted from the input
>>> cookie.
>>>
>>> If the original bio submitted to dm device is split to multiple bios and
>>> thus submitted to multiple polling hw queues, the polling routine will
>>> fall back to iterating all hw queues (in polling mode) of all underlying
>>> mq devices.
>>>
>>> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
>>> ---
>>>  block/blk-core.c          | 73 +++++++++++++++++++++++++++++++++++++--
>>>  include/linux/blk_types.h | 66 +++++++++++++++++++++++++++++++++--
>>>  include/linux/types.h     |  2 +-
>>>  3 files changed, 135 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 6d7d53030d7c..e5cd4ff08f5c 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -947,14 +947,22 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  {
>>>  	struct bio_list bio_list_on_stack[2];
>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>> +	struct request_queue *top_q;
>>> +	bool poll_on;
>>>  
>>>  	BUG_ON(bio->bi_next);
>>>  
>>>  	bio_list_init(&bio_list_on_stack[0]);
>>>  	current->bio_list = bio_list_on_stack;
>>>  
>>> +	top_q = bio->bi_bdev->bd_disk->queue;
>>> +	poll_on = test_bit(QUEUE_FLAG_POLL, &top_q->queue_flags) &&
>>> +		  (bio->bi_opf & REQ_HIPRI);
>>> +
>>>  	do {
>>> -		struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>> +		blk_qc_t cookie;
>>> +		struct block_device *bdev = bio->bi_bdev;
>>> +		struct request_queue *q = bdev->bd_disk->queue;
>>>  		struct bio_list lower, same;
>>>  
>>>  		if (unlikely(bio_queue_enter(bio) != 0))
>>> @@ -966,7 +974,23 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>  
>>> -		ret = __submit_bio(bio);
>>> +		cookie = __submit_bio(bio);
>>> +
>>> +		if (poll_on && blk_qc_t_valid(cookie)) {
>>> +			unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
>>> +			unsigned int devt = bdev_whole(bdev)->bd_dev;
>>> +
>>> +			cookie = blk_qc_t_get_by_devt(devt, queue_num);
>>
>> The need to rebuild the cookie here is pretty awkward.  This
>> optimization living in block core may be worthwhile but the duality of
>> block core conditionally overriding the driver's returned cookie (that
>> is meant to be opaque to upper layer) is not great.
> 
> I agree that this design pattern currently causes blk-core and dm to be
> tightly coupled. Personally, maybe that's the most serious problem of
> this patchset.
> 
> I can explain it though... Due to the nature of dm's IO path, dm itself
> doesn't know whether the original bio will be split into multiple bios
> and thus submitted to multiple underlying devices. Currently I can only
> get this information in __submit_bio_noacct(), and that's why so much
> dm-specific logic is coupled with blk-core here.
> 

I didn't point out the most important part of that: dm can't get the
(really valid) cookie returned by mq. Suppose one dm device is built
upon one nvme device; then dm_submit_bio() only remaps the bio, and the
remapped bio is actually *buffered* in the bio_list. The remapped bio is
only submitted later, in __submit_bio_noacct(). So dm doesn't know the
cookie (returned by the underlying mq device), while blk-core does.
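
To illustrate, a condensed sketch of submit_bio_noacct() as of 5.12
(checks and the mq-only branch elided): when dm_submit_bio() resubmits
the remapped clone while current->bio_list is active, the clone is only
buffered, so no cookie from the underlying mq device exists yet; it only
appears when __submit_bio_noacct() later calls __submit_bio() on the
clone.

	blk_qc_t submit_bio_noacct(struct bio *bio)
	{
		/*
		 * Called recursively from inside a ->submit_bio() method
		 * (e.g. dm_submit_bio()): just buffer the bio on the
		 * on-stack list; __submit_bio_noacct() picks it up after
		 * the current ->submit_bio() returns.
		 */
		if (current->bio_list) {
			bio_list_add(&current->bio_list[0], bio);
			return BLK_QC_T_NONE;
		}

		/* top-level submission: process the bio and its clones now */
		return __submit_bio_noacct(bio);
	}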

dm could "predict" the cookie of following submitted IO to mq (dev_t and
index of hw queue), and return it (built by dev_t and hw queue index) in
advance, but this "prediction" is quite fragile, since the IO submitting
process could be migrated to another CPU, or the hw queue mapping of the
underlying mq device could change before __submit_bio_noacct() really
submits the IO.
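
Purely for illustration, a hypothetical sketch of that "prediction"
(assuming blk_qc_t_get_by_devt() from patch 10 packs the dev_t together
with the hw queue index, and ignoring that blk_mq_map_queue_type() is a
block-internal helper dm has no clean access to):

	/* Hypothetical, not part of this patch set: guess the cookie the
	 * underlying mq device would return, before the remapped clone is
	 * actually submitted. Both the CPU and the queue mapping may be
	 * stale by then. */
	static blk_qc_t dm_predict_cookie(struct block_device *bdev)
	{
		struct request_queue *q = bdev->bd_disk->queue;
		struct blk_mq_hw_ctx *hctx;

		hctx = blk_mq_map_queue_type(q, HCTX_TYPE_POLL,
					     raw_smp_processor_id());
		return blk_qc_t_get_by_devt(bdev_whole(bdev)->bd_dev,
					    hctx->queue_num);
	}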


-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-03-12  2:27 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 01/12] block: move definition of blk_qc_t to types.h Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 02/12] block: add queue_to_disk() to get gendisk from request_queue Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 03/12] block: add poll method to support bio-based IO polling Jeffle Xu
2021-03-10 22:01   ` Mike Snitzer
2021-03-11  5:31     ` [dm-devel] " JeffleXu
2021-03-03 11:57 ` [PATCH v5 04/12] block: add poll_capable " Jeffle Xu
2021-03-10 22:21   ` Mike Snitzer
2021-03-11  5:43     ` [dm-devel] " JeffleXu
2021-03-03 11:57 ` [PATCH v5 05/12] blk-mq: extract one helper function polling hw queue Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 06/12] blk-mq: add iterator for polling hw queues Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 07/12] blk-mq: add one helper function getting hw queue Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 08/12] dm: always return BLK_QC_T_NONE for bio-based device Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 09/12] nvme/pci: don't wait for locked polling queue Jeffle Xu
2021-03-10 21:57   ` Mike Snitzer
2021-03-03 11:57 ` [PATCH v5 10/12] block: fastpath for bio-based polling Jeffle Xu
2021-03-10 23:18   ` Mike Snitzer
2021-03-11  6:36     ` [dm-devel] " JeffleXu
2021-03-12  2:26       ` JeffleXu
2021-03-11 13:56   ` Ming Lei
2021-03-12  1:56     ` [dm-devel] " JeffleXu
2021-03-03 11:57 ` [PATCH v5 11/12] block: sub-fastpath " Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 12/12] dm: support IO polling for bio-based dm device Jeffle Xu
