* [RFC PATCH V2 00/13] block: support bio based io polling
@ 2021-03-18 16:48 ` Ming Lei
  0 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Hi,

Add a per-task io poll context for holding HIPRI blk-mq/underlying bios
queued from a bio based driver's io submission context, and reuse one bio
padding field to store the 'cookie' returned from submit_bio() for these
bios. Also explicitly end these bios in the poll context by adding two
new bio flags.

This way we no longer need to poll all underlying hw queues, as is done
in Jeffle's patches; instead we only poll the hw queues that actually
have HIPRI IO queued.

Usually io submission and io poll share the same context, so the added io
poll context data is just like a stack variable, and the cost of saving
bios is cheap.

Any comments are welcome.

V2:
	- address the queue depth scalability issue reported by Jeffle via a
	bio group list: reuse .bi_end_io for linking bios that share the same
	.bi_end_io, and support 32 such groups in the submit queue. This way
	the scalability issue caused by kfifo is solved. Before really ending
	a bio, .bi_end_io is recovered from the group head (see the sketch
	below).
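
For illustration, here is a rough standalone C model of that grouping idea
(simplified; the real structures are bio_grp_list/bio_grp_list_data in the
patches, and they avoid the extra 'next' field by reusing .bi_end_io itself
as the link, restoring it from the group head before completion):

	#include <stdbool.h>

	struct bio;
	typedef void (bio_end_io_t)(struct bio *);

	struct bio {
		bio_end_io_t *bi_end_io;	/* shared by all bios of one group */
		struct bio *next;		/* modelled explicitly here only */
	};

	struct grp {
		bio_end_io_t *key;		/* the common .bi_end_io */
		struct bio *head;
	};

	#define NR_GRPS 32

	/* add 'bio' to the group sharing its .bi_end_io; false if no slot left */
	static bool grp_list_add(struct grp grps[NR_GRPS], struct bio *bio)
	{
		for (int i = 0; i < NR_GRPS; i++) {
			if (grps[i].key == bio->bi_end_io || !grps[i].key) {
				grps[i].key = bio->bi_end_io;
				bio->next = grps[i].head;
				grps[i].head = bio;
				return true;
			}
		}
		return false;	/* all 32 group slots taken by other callbacks */
	}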


Jeffle Xu (4):
  block/mq: extract one helper function polling hw queue
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll_capable method to support bio-based IO polling
  dm: support IO polling for bio-based dm device

Ming Lei (9):
  block: add helper of blk_queue_poll
  block: add one helper to free io_context
  block: add helper of blk_create_io_context
  block: create io poll context for submission and poll task
  block: add req flag of REQ_TAG
  block: add new field into 'struct bvec_iter'
  block: prepare for supporting bio_list via other link
  block: use per-task poll context to implement bio based io poll
  blk-mq: limit hw queues to be polled in each blk_poll()

 block/bio.c                   |   5 +
 block/blk-core.c              | 248 ++++++++++++++++++++++++++++++++--
 block/blk-ioc.c               |  12 +-
 block/blk-mq.c                | 232 ++++++++++++++++++++++++++++++-
 block/blk-sysfs.c             |  14 +-
 block/blk.h                   |  55 ++++++++
 drivers/md/dm-table.c         |  24 ++++
 drivers/md/dm.c               |  14 ++
 drivers/nvme/host/core.c      |   2 +-
 include/linux/bio.h           | 132 +++++++++---------
 include/linux/blk_types.h     |  20 ++-
 include/linux/blkdev.h        |   4 +
 include/linux/bvec.h          |   9 ++
 include/linux/device-mapper.h |   1 +
 include/linux/iocontext.h     |   2 +
 include/trace/events/kyber.h  |   6 +-
 16 files changed, 686 insertions(+), 94 deletions(-)

-- 
2.29.2


* [RFC PATCH V2 01/13] block: add helper of blk_queue_poll
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

There are already 3 users and there will be more, so add such a helper.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 2 +-
 block/blk-mq.c           | 3 +--
 drivers/nvme/host/core.c | 2 +-
 include/linux/blkdev.h   | 1 +
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..a31371d55b9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -836,7 +836,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
 	switch (bio_op(bio)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..63c81df3b8b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3869,8 +3869,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
 		return 0;
 
 	if (current->plug)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a5653892d773..1bf94f0d2e8d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -956,7 +956,7 @@ static void nvme_execute_rq_polled(struct request_queue *q,
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 
-	WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
+	WARN_ON_ONCE(!blk_queue_poll(q));
 
 	rq->cmd_flags |= REQ_HIPRI;
 	rq->end_io_data = &wait;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..89a01850cf12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,6 +665,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.29.2


* [RFC PATCH V2 02/13] block: add one helper to free io_context
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Prepare for putting the bio poll queue into io_context by adding one
helper for freeing an io_context.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-ioc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 57299f860d41..b0cde18c4b8c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -17,6 +17,11 @@
  */
 static struct kmem_cache *iocontext_cachep;
 
+static inline void free_io_context(struct io_context *ioc)
+{
+	kmem_cache_free(iocontext_cachep, ioc);
+}
+
 /**
  * get_io_context - increment reference count to io_context
  * @ioc: io_context to get
@@ -129,7 +134,7 @@ static void ioc_release_fn(struct work_struct *work)
 
 	spin_unlock_irq(&ioc->lock);
 
-	kmem_cache_free(iocontext_cachep, ioc);
+	free_io_context(ioc);
 }
 
 /**
@@ -164,7 +169,7 @@ void put_io_context(struct io_context *ioc)
 	}
 
 	if (free_ioc)
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 }
 
 /**
@@ -278,7 +283,7 @@ int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 	    (task == current || !(task->flags & PF_EXITING)))
 		task->io_context = ioc;
 	else
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 
 	ret = task->io_context ? 0 : -EBUSY;
 
-- 
2.29.2


* [RFC PATCH V2 03/13] block: add helper of blk_create_io_context
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Add one helper for creating the io context and prepare for supporting
efficient bio based io poll.

Meanwhile move the io_context creation code before the check of the bio's
REQ_HIPRI flag, because a following patch may clear REQ_HIPRI when the
io_context can't be created.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a31371d55b9d..d58f8a0c80de 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,6 +792,18 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+static inline void blk_create_io_context(struct request_queue *q)
+{
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
+	blk_create_io_context(q);
+
 	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
@@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		break;
 	}
 
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
-
 	if (blk_throtl_bio(bio)) {
 		blkcg_bio_issue_init(bio);
 		return false;
-- 
2.29.2


* [RFC PATCH V2 04/13] block: create io poll context for submission and poll task
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Create a per-task io poll context for both the IO submission and poll
tasks if the queue is bio based and supports polling.

This io polling context includes two queues: a submission queue (sq) for
storing the HIPRI bio submission result (cookie) and the bio itself,
written by the submission task and read by the poll task; and a polling
queue (pq) for holding data moved from the sq, used only in the poll
context for running bio polling.

Following patches will add bio poll support.
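
For illustration, the poll side is expected to move data between the two
queues roughly like this (a sketch only; the locking order and the elided
helpers are assumptions, not the code added by later patches):

	/* poll task: drain sq into pq, then poll pq without the spinlock */
	static void poll_ctx_reap_sketch(struct blk_bio_poll_ctx *pc)
	{
		mutex_lock(&pc->pq_lock);
		spin_lock(&pc->sq_lock);
		/* ... move bio groups from pc->sq to pc->pq ... */
		spin_unlock(&pc->sq_lock);
		/* ... walk pc->pq, poll the hw queues, end bios that are done ... */
		mutex_unlock(&pc->pq_lock);
	}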

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
 block/blk-ioc.c           |  1 +
 block/blk-mq.c            | 14 ++++++++
 block/blk.h               | 46 +++++++++++++++++++++++++
 include/linux/iocontext.h |  2 ++
 5 files changed, 122 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d58f8a0c80de..0b00c21cbefb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
-static inline void blk_create_io_context(struct request_queue *q)
+static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
 {
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
+	struct io_context *ioc = current->io_context;
+
+	return ioc ? ioc->data : NULL;
+}
+
+static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
+{
+	return sizeof(struct bio_grp_list) + nr_grps *
+		sizeof(struct bio_grp_list_data);
+}
+
+static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
+{
+	pc->sq = (void *)pc + sizeof(*pc);
+	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
+
+	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
+	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
+
+	spin_lock_init(&pc->sq_lock);
+	mutex_init(&pc->pq_lock);
+}
+
+void bio_poll_ctx_alloc(struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc;
+	unsigned int size = sizeof(*pc) +
+		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
+		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
+
+	pc = kzalloc(size, GFP_ATOMIC);
+	if (pc) {
+		bio_poll_ctx_init(pc);
+		if (cmpxchg(&ioc->data, NULL, (void *)pc))
+			kfree(pc);
+	}
+}
+
+static inline bool blk_queue_support_bio_poll(struct request_queue *q)
+{
+	return !queue_is_mq(q) && blk_queue_poll(q);
+}
+
+static inline void blk_bio_poll_preprocess(struct request_queue *q,
+		struct bio *bio)
+{
+	if (!(bio->bi_opf & REQ_HIPRI))
+		return;
+
+	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+		bio->bi_opf &= ~REQ_HIPRI;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	blk_create_io_context(q);
+	/*
+	 * Created per-task io poll queue if we supports bio polling
+	 * and it is one HIPRI bio.
+	 */
+	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			(bio->bi_opf & REQ_HIPRI));
 
-	if (!blk_queue_poll(q))
-		bio->bi_opf &= ~REQ_HIPRI;
+	blk_bio_poll_preprocess(q, bio);
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b0cde18c4b8c..5574c398eff6 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
 
 static inline void free_io_context(struct io_context *ioc)
 {
+	kfree(ioc->data);
 	kmem_cache_free(iocontext_cachep, ioc);
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 63c81df3b8b5..c832faa52ca0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	/*
+	 * Create poll queue for storing poll bio and its cookie from
+	 * submission queue
+	 */
+	blk_create_io_context(q, true);
+
+	return 0;
+}
+
 /**
  * blk_poll - poll for IO completions
  * @q:  the queue
@@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	if (current->plug)
 		blk_flush_plug_list(current->plug, false);
 
+	if (!queue_is_mq(q))
+		return blk_bio_poll(q, cookie, spin);
+
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	/*
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..ae58a706327e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -357,4 +357,50 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/* grouping bios belonging to same group into one list  */
+struct bio_grp_list_data {
+	/* group data */
+	void *grp_data;
+
+	/* all bios in this list share same 'grp_data' */
+	struct bio_list list;
+};
+
+struct bio_grp_list {
+	unsigned int max_nr_grps, nr_grps;
+	struct bio_grp_list_data head[0];
+};
+
+struct blk_bio_poll_ctx {
+	spinlock_t sq_lock;
+	struct bio_grp_list *sq;
+
+	struct mutex pq_lock;
+	struct bio_grp_list *pq;
+};
+
+#define BLK_BIO_POLL_SQ_SZ		32U
+#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
+
+void bio_poll_ctx_alloc(struct io_context *ioc);
+
+static inline void blk_create_io_context(struct request_queue *q,
+		bool need_poll_ctx)
+{
+	struct io_context *ioc;
+
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+
+	ioc = current->io_context;
+	if (need_poll_ctx && unlikely(ioc && !ioc->data))
+		bio_poll_ctx_alloc(ioc);
+}
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 0a9dc40b7be8..f9a467571356 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -110,6 +110,8 @@ struct io_context {
 	struct io_cq __rcu	*icq_hint;
 	struct hlist_head	icq_list;
 
+	void			*data;
+
 	struct work_struct release_work;
 };
 
-- 
2.29.2


* [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Add one req flag REQ_TAG which will be used in the following patch for
supporting bio based IO polling.

This flag helps in the following ways:

1) request flags are cloned in __bio_clone_fast(), so if we mark one FS bio
as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG too.

2) the per-task io polling context is created only if the bio based queue
supports polling and the submitted bio is HIPRI. This per-task io polling
context is created during submit_bio(), before marking the HIPRI bio as
REQ_TAG, so we can avoid creating such an io polling context when a cloned
bio carrying REQ_TAG is submitted from another kernel context.

3) for supporting bio based io polling we need to poll IOs from all
underlying queues of the bio device/driver; this flag helps us recognize
which IOs need to be polled in bio based style, which will be implemented
in the next patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 29 +++++++++++++++++++++++++++--
 include/linux/blk_types.h |  4 ++++
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0b00c21cbefb..efc7a61a84b4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
 static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		struct bio *bio)
 {
+	bool mq;
+
 	if (!(bio->bi_opf & REQ_HIPRI))
 		return;
 
-	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+	/*
+	 * Can't support bio based IO poll without per-task poll queue
+	 *
+	 * Now we have created per-task io poll context, and mark this
+	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
+	 * submitted from another kernel context, we won't create bio
+	 * poll context for it, so that bio will be completed by IRQ;
+	 * 2) If such bio is submitted from current context, we will
+	 * complete it via blk_poll(); 3) If driver knows that one
+	 * underlying bio allocated from driver is for FS bio, meantime
+	 * it is submitted in current context, driver can mark such bio
+	 * as REQ_TAG manually, so the bio can be completed via blk_poll
+	 * too.
+	 */
+	mq = queue_is_mq(q);
+	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
 		bio->bi_opf &= ~REQ_HIPRI;
+	else if (!mq)
+		bio->bi_opf |= REQ_TAG;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -893,9 +912,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 
 	/*
 	 * Created per-task io poll queue if we supports bio polling
-	 * and it is one HIPRI bio.
+	 * and it is one HIPRI bio, and this HIPRI bio has to be from
+	 * FS. If REQ_TAG isn't set for HIPRI bio, we think it originated
+	 * from FS.
+	 *
+	 * A driver may allocate a bio by itself with REQ_TAG set, but such
+	 * bios won't be marked as HIPRI.
 	 */
 	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			!(bio->bi_opf & REQ_TAG) &&
 			(bio->bi_opf & REQ_HIPRI));
 
 	blk_bio_poll_preprocess(q, bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..a1bcade4bcc3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -394,6 +394,9 @@ enum req_flag_bits {
 
 	__REQ_HIPRI,
 
+	/* for marking IOs originated from same FS bio in same context */
+	__REQ_TAG,
+
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
@@ -418,6 +421,7 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
+#define REQ_TAG			(1ULL << __REQ_TAG)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
-- 
2.29.2


* [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter'
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

There is a hole at the end of 'struct bvec_iter', so put a new field
there, and we can save the cookie returned from submit_bio() in it for
supporting bio based polling.

This avoids extending bio unnecessarily.
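
For illustration, the intended use looks roughly like this (the helper
names here are only an example, not ones added by this series; the actual
call sites come in a later patch):

	/* submission side: stash the underlying queue's cookie in the hole */
	static inline void bio_set_poll_cookie(struct bio *bio, blk_qc_t cookie)
	{
		bio->bi_iter.bi_private_data = cookie;
	}

	/* poll side: read it back when deciding which hw queue to poll */
	static inline blk_qc_t bio_get_poll_cookie(struct bio *bio)
	{
		return bio->bi_iter.bi_private_data;
	}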

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bvec.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ff832e698efb..61c0f55f7165 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -43,6 +43,15 @@ struct bvec_iter {
 
 	unsigned int            bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
+
+	/*
+	 * There is a hole at the end of bvec_iter, so define one field here to
+	 * hold something that isn't related to 'bvec_iter'; this way we can
+	 * avoid extending bio. So far this new field is used for bio based
+	 * polling; we will store the return value of the underlying queue's
+	 * submit_bio() here.
+	 */
+	unsigned int		bi_private_data;
 };
 
 struct bvec_iter_all {
-- 
2.29.2


* [RFC PATCH V2 07/13] block/mq: extract one helper function polling hw queue
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Extract the logic of polling one hw queue and the related statistics
handling into a helper function.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c832faa52ca0..03f59915fe2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,19 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static inline int blk_mq_poll_hctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	int ret;
+
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	/*
@@ -3908,11 +3921,8 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	do {
 		int ret;
 
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
+		ret = blk_mq_poll_hctx(q, hctx);
 		if (ret > 0) {
-			hctx->poll_success++;
 			__set_current_state(TASK_RUNNING);
 			return ret;
 		}
-- 
2.29.2


* [RFC PATCH V2 08/13] block: prepare for supporting bio_list via other link
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

So far the bio list helpers always use .bi_next to traverse the list; we
will support linking bios via another bio field.

Prepare for that by adding a macro so that users can define another set of
helpers which link bios via a different bio field.
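
For example, the existing helpers are now generated by
'BIO_LIST_HELPERS(bio_list, next);', and a second set linked through a
different field can be generated the same way (the prefix and field name
below are illustrative only, not necessarily what a later patch uses):

	/* generates bio_poll_list_add(), bio_poll_list_pop(), ... linking via bio->bi_poll */
	BIO_LIST_HELPERS(bio_poll_list, poll);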

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bio.h | 132 +++++++++++++++++++++++---------------------
 1 file changed, 68 insertions(+), 64 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d0246c92a6e8..619edd26a6c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -608,75 +608,11 @@ static inline unsigned bio_list_size(const struct bio_list *bl)
 	return sz;
 }
 
-static inline void bio_list_add(struct bio_list *bl, struct bio *bio)
-{
-	bio->bi_next = NULL;
-
-	if (bl->tail)
-		bl->tail->bi_next = bio;
-	else
-		bl->head = bio;
-
-	bl->tail = bio;
-}
-
-static inline void bio_list_add_head(struct bio_list *bl, struct bio *bio)
-{
-	bio->bi_next = bl->head;
-
-	bl->head = bio;
-
-	if (!bl->tail)
-		bl->tail = bio;
-}
-
-static inline void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
-{
-	if (!bl2->head)
-		return;
-
-	if (bl->tail)
-		bl->tail->bi_next = bl2->head;
-	else
-		bl->head = bl2->head;
-
-	bl->tail = bl2->tail;
-}
-
-static inline void bio_list_merge_head(struct bio_list *bl,
-				       struct bio_list *bl2)
-{
-	if (!bl2->head)
-		return;
-
-	if (bl->head)
-		bl2->tail->bi_next = bl->head;
-	else
-		bl->tail = bl2->tail;
-
-	bl->head = bl2->head;
-}
-
 static inline struct bio *bio_list_peek(struct bio_list *bl)
 {
 	return bl->head;
 }
 
-static inline struct bio *bio_list_pop(struct bio_list *bl)
-{
-	struct bio *bio = bl->head;
-
-	if (bio) {
-		bl->head = bl->head->bi_next;
-		if (!bl->head)
-			bl->tail = NULL;
-
-		bio->bi_next = NULL;
-	}
-
-	return bio;
-}
-
 static inline struct bio *bio_list_get(struct bio_list *bl)
 {
 	struct bio *bio = bl->head;
@@ -686,6 +622,74 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+#define BIO_LIST_HELPERS(_pre, link)					\
+									\
+static inline void _pre##_add(struct bio_list *bl, struct bio *bio)	\
+{									\
+	bio->bi_##link = NULL;						\
+									\
+	if (bl->tail)							\
+		bl->tail->bi_##link = bio;				\
+	else								\
+		bl->head = bio;						\
+									\
+	bl->tail = bio;							\
+}									\
+									\
+static inline void _pre##_add_head(struct bio_list *bl, struct bio *bio) \
+{									\
+	bio->bi_##link = bl->head;					\
+									\
+	bl->head = bio;							\
+									\
+	if (!bl->tail)							\
+		bl->tail = bio;						\
+}									\
+									\
+static inline void _pre##_merge(struct bio_list *bl, struct bio_list *bl2) \
+{									\
+	if (!bl2->head)							\
+		return;							\
+									\
+	if (bl->tail)							\
+		bl->tail->bi_##link = bl2->head;			\
+	else								\
+		bl->head = bl2->head;					\
+									\
+	bl->tail = bl2->tail;						\
+}									\
+									\
+static inline void _pre##_merge_head(struct bio_list *bl,		\
+				       struct bio_list *bl2)		\
+{									\
+	if (!bl2->head)							\
+		return;							\
+									\
+	if (bl->head)							\
+		bl2->tail->bi_##link = bl->head;			\
+	else								\
+		bl->tail = bl2->tail;					\
+									\
+	bl->head = bl2->head;						\
+}									\
+									\
+static inline struct bio *_pre##_pop(struct bio_list *bl)		\
+{									\
+	struct bio *bio = bl->head;					\
+									\
+	if (bio) {							\
+		bl->head = bl->head->bi_##link;				\
+		if (!bl->head)						\
+			bl->tail = NULL;				\
+									\
+		bio->bi_##link = NULL;					\
+	}								\
+									\
+	return bio;							\
+}									\
+
+BIO_LIST_HELPERS(bio_list, next);
+
 /*
  * Increment chain count for the bio. Make sure the CHAIN flag update
  * is visible before the raised count.
-- 
2.29.2


* [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Currently bio based IO polling needs to poll all hw queues blindly, which
is very inefficient. The big reason is that we can't pass the bio
submission result to the io poll task.

In the IO submission context, track associated underlying bios via a
per-task submission queue, save the 'cookie' poll data in
bio->bi_iter.bi_private_data, and return current->pid to the caller of
submit_bio() for any bio based driver's IO submitted from the FS.

In the IO poll context, the passed cookie tells us the PID of the
submission context, so we can find the bios from that submission context.
Move bios from the submission queue to the poll queue of the poll context,
and keep polling until these bios are ended. Remove a bio from the poll
queue once it is ended. Add BIO_DONE and BIO_END_BY_POLL for this purpose.

In the previous version, kfifo was used to implement the submission queue,
and Jeffle Xu found that kfifo can't scale well in case of high queue
depth. A bio is already close to two cachelines in size, so adding a new
field to the bio just for tracking bios via a linked list may not be
acceptable. Instead, switch to a bio group list for tracking bios: reuse
.bi_end_io for linking all bios that share the same .bi_end_io into a list
(call it a bio group), and recover .bi_end_io from the group head before
really ending the bio; the new BIO_END_BY_POLL flag makes sure that the
bio can't be ended before that. Usually .bi_end_io is the same for all
bios in the same layer, so a very limited number of groups, such as 32, is
enough to fix the scalability issue.

Usually submission shares its context with io poll. The per-task poll
context is just like a stack variable, and it is cheap to move data
between the two per-task queues.
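
Just for illustration, a minimal usage sketch (not part of this patch;
the helper name and the 'done' flag are assumptions) of how a submitter
consumes the pid based cookie:

	/*
	 * Sketch only: submit one REQ_HIPRI bio to a bio based device and
	 * spin on blk_poll() with the returned cookie (current->pid) until
	 * the caller's ->bi_end_io has set 'done'.
	 */
	static void submit_and_poll_hipri_bio(struct block_device *bdev,
					      struct bio *bio, bool *done)
	{
		blk_qc_t cookie;

		bio->bi_opf |= REQ_HIPRI;
		cookie = submit_bio(bio);

		while (!READ_ONCE(*done))
			blk_poll(bdev_get_queue(bdev), cookie, true);
	}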

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c               |   5 ++
 block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
 block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
 block/blk.h               |   9 ++
 include/linux/blk_types.h |  16 +++-
 5 files changed, 348 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 26b7f721cda8..04c043dc60fc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  **/
 void bio_endio(struct bio *bio)
 {
+	/* BIO_END_BY_POLL has to be set before calling submit_bio */
+	if (bio_flagged(bio, BIO_END_BY_POLL)) {
+		bio_set_flag(bio, BIO_DONE);
+		return;
+	}
 again:
 	if (!bio_remaining_done(bio))
 		return;
diff --git a/block/blk-core.c b/block/blk-core.c
index efc7a61a84b4..778d25a7e76c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
 		sizeof(struct bio_grp_list_data);
 }
 
+static inline void *bio_grp_data(struct bio *bio)
+{
+	return bio->bi_poll;
+}
+
+/* add bio into bio group list, return true if it is added */
+static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
+{
+	int i;
+	struct bio_grp_list_data *grp;
+
+	for (i = 0; i < list->nr_grps; i++) {
+		grp = &list->head[i];
+		if (grp->grp_data == bio_grp_data(bio)) {
+			__bio_grp_list_add(&grp->list, bio);
+			return true;
+		}
+	}
+
+	if (i == list->max_nr_grps)
+		return false;
+
+	/* create a new group */
+	grp = &list->head[i];
+	bio_list_init(&grp->list);
+	grp->grp_data = bio_grp_data(bio);
+	__bio_grp_list_add(&grp->list, bio);
+	list->nr_grps++;
+
+	return true;
+}
+
+static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
+{
+	int i;
+	struct bio_grp_list_data *grp;
+
+	for (i = 0; i < list->max_nr_grps; i++) {
+		grp = &list->head[i];
+		if (grp->grp_data == grp_data)
+			return i;
+	}
+	for (i = 0; i < list->max_nr_grps; i++) {
+		grp = &list->head[i];
+		if (bio_grp_list_grp_empty(grp))
+			return i;
+	}
+	return -1;
+}
+
+/* Move as many as possible groups from 'src' to 'dst' */
+void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
+{
+	int i, j, cnt = 0;
+	struct bio_grp_list_data *grp;
+
+	for (i = src->nr_grps - 1; i >= 0; i--) {
+		grp = &src->head[i];
+		j = bio_grp_list_find_grp(dst, grp->grp_data);
+		if (j < 0)
+			break;
+		if (bio_grp_list_grp_empty(&dst->head[j]))
+			dst->head[j].grp_data = grp->grp_data;
+		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
+		bio_list_init(&grp->list);
+		cnt++;
+	}
+
+	src->nr_grps -= cnt;
+}
+
 static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
 {
 	pc->sq = (void *)pc + sizeof(*pc);
@@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		bio->bi_opf |= REQ_TAG;
 }
 
+static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	unsigned int queued;
+
+	/*
+	 * We rely on immutable .bi_end_io between blk-mq bio submission
+	 * and completion. However, bio crypt may update .bi_end_io during
+	 * submitting, so simply not support bio based polling for this
+	 * setting.
+	 */
+	if (likely(!bio_has_crypt_ctx(bio))) {
+		/* track this bio via bio group list */
+		spin_lock(&pc->sq_lock);
+		queued = bio_grp_list_add(pc->sq, bio);
+		spin_unlock(&pc->sq_lock);
+	} else {
+		queued = false;
+	}
+
+	/*
+	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
+	 * and the bio is always completed from the pair poll context.
+	 *
+	 * One invariant is that if bio isn't completed, blk_poll() will
+	 * be called by passing cookie returned from submitting this bio.
+	 */
+	if (!queued)
+		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
+	else
+		bio_set_flag(bio, BIO_END_BY_POLL);
+
+	return queued;
+}
+
+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+	bio->bi_iter.bi_private_data = cookie;
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
  * bio_list_on_stack[1] contains bios that were submitted before the current
  *	->submit_bio_bio, but that haven't been processed yet.
  */
-static blk_qc_t __submit_bio_noacct(struct bio *bio)
+static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
@@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		if (ioc && queue_is_mq(q) &&
+				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
+			bool queued = blk_bio_poll_prep_submit(ioc, bio);
+
+			ret = __submit_bio(bio);
+			if (queued)
+				blk_bio_poll_post_submit(bio, ret);
+		} else {
+			ret = __submit_bio(bio);
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	return ret;
 }
 
+static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
+		struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+
+	__submit_bio_noacct_int(bio, ioc);
+
+	/* bio submissions queued to per-task poll context */
+	if (READ_ONCE(pc->sq->nr_grps))
+		return current->pid;
+
+	/* swapper's pid is 0, but it can't submit poll IO for us */
+	return 0;
+}
+
+static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
+{
+	struct io_context *ioc = current->io_context;
+
+	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
+		return __submit_bio_noacct_poll(bio, ioc);
+
+	return __submit_bio_noacct_int(bio, NULL);
+}
+
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 {
 	struct bio_list bio_list[2] = { };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 03f59915fe2c..f26950a51f4a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
 	return ret;
 }
 
+static blk_qc_t bio_get_poll_cookie(struct bio *bio)
+{
+	return bio->bi_iter.bi_private_data;
+}
+
+static int blk_mq_poll_io(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	blk_qc_t cookie = bio_get_poll_cookie(bio);
+	int ret = 0;
+
+	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
+		struct blk_mq_hw_ctx *hctx =
+			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+
+		ret += blk_mq_poll_hctx(q, hctx);
+	}
+	return ret;
+}
+
+static int blk_bio_poll_and_end_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	int ret = 0;
+	int i;
+
+	/*
+	 * Poll hw queue first.
+	 *
+	 * TODO: limit max poll times and make sure to not poll same
+	 * hw queue one more time.
+	 */
+	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
+		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
+		struct bio *bio;
+
+		if (bio_grp_list_grp_empty(grp))
+			continue;
+
+		for (bio = grp->list.head; bio; bio = bio->bi_poll)
+			ret += blk_mq_poll_io(bio);
+	}
+
+	/* reap bios */
+	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
+		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
+		struct bio *bio;
+		struct bio_list bl;
+
+		if (bio_grp_list_grp_empty(grp))
+			continue;
+
+		bio_list_init(&bl);
+
+		while ((bio = __bio_grp_list_pop(&grp->list))) {
+			if (bio_flagged(bio, BIO_DONE)) {
+
+				/* now recover original data */
+				bio->bi_poll = grp->grp_data;
+
+				/* clear BIO_END_BY_POLL and end me really */
+				bio_clear_flag(bio, BIO_END_BY_POLL);
+				bio_endio(bio);
+			} else {
+				__bio_grp_list_add(&bl, bio);
+			}
+		}
+		__bio_grp_list_merge(&grp->list, &bl);
+	}
+	return ret;
+}
+
+static int __blk_bio_poll_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *submit_ctx,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	/*
+	 * Move IO submission result from submission queue in submission
+	 * context to poll queue of poll context.
+	 */
+	spin_lock(&submit_ctx->sq_lock);
+	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
+	spin_unlock(&submit_ctx->sq_lock);
+
+	return blk_bio_poll_and_end_io(q, poll_ctx);
+}
+
+static int blk_bio_poll_io(struct request_queue *q,
+		struct io_context *submit_ioc,
+		struct io_context *poll_ioc)
+{
+	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
+	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
+	int ret;
+
+	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
+		mutex_lock(&poll_ctx->pq_lock);
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+		mutex_unlock(&poll_ctx->pq_lock);
+	} else {
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+	}
+	return ret;
+}
+
+static bool blk_bio_ioc_valid(struct task_struct *t)
+{
+	if (!t)
+		return false;
+
+	if (!t->io_context)
+		return false;
+
+	if (!t->io_context->data)
+		return false;
+
+	return true;
+}
+
+static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	struct io_context *poll_ioc = current->io_context;
+	pid_t pid;
+	struct task_struct *submit_task;
+	int ret;
+
+	pid = (pid_t)cookie;
+
+	/* io poll often share io submission context */
+	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
+		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
+
+	submit_task = find_get_task_by_vpid(pid);
+	if (likely(blk_bio_ioc_valid(submit_task)))
+		ret = blk_bio_poll_io(q, submit_task->io_context,
+				poll_ioc);
+	else
+		ret = 0;
+
+	put_task_struct(submit_task);
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
+	long state;
+
+	/* no need to poll */
+	if (cookie == 0)
+		return 0;
+
 	/*
 	 * Create poll queue for storing poll bio and its cookie from
 	 * submission queue
 	 */
 	blk_create_io_context(q, true);
 
+	state = current->state;
+	do {
+		int ret;
+
+		ret = __blk_bio_poll(q, cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
 	return 0;
 }
 
@@ -3893,7 +4064,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
+	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
 		return 0;
 
 	if (current->plug)
diff --git a/block/blk.h b/block/blk.h
index ae58a706327e..05b9f5eafdd1 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -403,4 +403,13 @@ static inline void blk_create_io_context(struct request_queue *q,
 		bio_poll_ctx_alloc(ioc);
 }
 
+BIO_LIST_HELPERS(__bio_grp_list, poll);
+
+static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
+{
+	return bio_list_empty(&grp->list);
+}
+
+void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1bcade4bcc3..2d47679bac71 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -235,7 +235,18 @@ struct bio {
 
 	struct bvec_iter	bi_iter;
 
-	bio_end_io_t		*bi_end_io;
+	union {
+		bio_end_io_t		*bi_end_io;
+		/*
+		 * bio based io poll needs to track bios via a bio group
+		 * list, which groups bios by the same .bi_end_io; the
+		 * original .bi_end_io is saved into the group head and
+		 * recovered before really ending the bio. BIO_END_BY_POLL
+		 * makes sure that this bio won't be really ended before
+		 * .bi_end_io is recovered.
+		 */
+		struct bio		*bi_poll;
+	};
 
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -304,6 +315,9 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
+	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
+	BIO_DONE,
 	BIO_FLAG_LAST
 };
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH V2 10/13] block: add queue_to_disk() to get gendisk from request_queue
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only request_queue is given, we need to get its corresponding
gendisk to get the private data stored in that gendisk.
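
As a usage sketch (an assumption, not from this series), a driver that
only has the request_queue at hand can now recover its private data via
the gendisk:

	static void *queue_to_driver_data(struct request_queue *q)
	{
		/* gendisk->private_data is the preferred place for driver data */
		return queue_to_disk(q)->private_data;
	}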

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89a01850cf12..bfab74b45f15 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,6 +686,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH V2 11/13] block: add poll_capable method to support bio-based IO polling
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

This method can be used to check whether a bio-based device supports IO
polling. For mq devices, checking for hw queues in polling mode is
adequate, while the sanity check has to be implementation specific for
bio-based devices. For example, a dm device needs to check if all
underlying devices are capable of IO polling.

Though a bio-based device may have done the sanity check during the
device initialization phase, caching the result of this sanity check
(such as in the queue_flags) may not work, because for dm devices users
could change the state of the underlying devices through
'/sys/block/<dev>/io_poll', bypassing the dm device above. In this case,
the cached result of the initial sanity check could be out-of-date. Thus
the sanity check needs to be done every time 'io_poll' is to be modified.
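
For illustration only, a sketch of how a bio-based driver could wire up
the new method (all names here are hypothetical):

	static bool foo_poll_capable(struct gendisk *disk)
	{
		/* e.g. walk the backing devices; trivially true in this sketch */
		return true;
	}

	static const struct block_device_operations foo_bio_ops = {
		.poll_capable	= foo_poll_capable,
		.owner		= THIS_MODULE,
	};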

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-sysfs.c      | 14 +++++++++++---
 include/linux/blkdev.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0f4f0c8a7825..367c1d9a55c6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
-	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
-		return -EINVAL;
+	if (queue_is_mq(q)) {
+		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
+		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+			return -EINVAL;
+	} else {
+		struct gendisk *disk = queue_to_disk(q);
+
+		if (!disk->fops->poll_capable ||
+		    !disk->fops->poll_capable(disk))
+			return -EINVAL;
+	}
 
 	ret = queue_var_store(&poll_on, page, count);
 	if (ret < 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bfab74b45f15..a46f975f2a2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1881,6 +1881,7 @@ struct block_device_operations {
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
+	bool (*poll_capable)(struct gendisk *disk);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH V2 12/13] dm: support IO polling for bio-based dm device
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check if the underlying dm target device
supports IO polling.
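
The policy can be summarized by the following sketch (a simplified model
with a plain array of underlying block devices, not actual dm code):

	static bool all_devs_poll_capable(struct block_device **bdevs, int nr)
	{
		int i;

		/* poll is enabled only if every underlying queue advertises it */
		for (i = 0; i < nr; i++)
			if (!blk_queue_poll(bdev_get_queue(bdevs[i])))
				return false;
		return true;
	}

For a stacked dm device, QUEUE_FLAG_POLL of the lower dm queue has
already been set or cleared from its own table, so a single level of
blk_queue_poll() checks is enough.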

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
 drivers/md/dm.c               | 14 ++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 39 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..a8f3575fb118 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1509,6 +1509,12 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
 	return &t->targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1559,6 +1565,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev,
 	return 0;
 }
 
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2079,6 +2090,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	dm_update_keyslot_manager(q, t);
 	blk_queue_update_readahead(q);
+
+	/*
+	 * The check for a request-based device is left to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For a bio-based device, only set QUEUE_FLAG_POLL when all
+	 * underlying devices support polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..fe6893b078dc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return ret;
 }
 
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..31bfd6f70013 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,6 +538,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [RFC PATCH V2 13/13] blk-mq: limit hw queues to be polled in each blk_poll()
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-18 16:48   ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-18 16:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Limit at most 8 hw queues to be polled in each blk_poll(), to avoid
adding extra latency when queue depth is high.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 66 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 46 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f26950a51f4a..9c94b7f0bf4b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3870,33 +3870,31 @@ static blk_qc_t bio_get_poll_cookie(struct bio *bio)
 	return bio->bi_iter.bi_private_data;
 }
 
-static int blk_mq_poll_io(struct bio *bio)
+#define POLL_HCTX_MAX_CNT 8
+
+static bool blk_add_unique_hctx(struct blk_mq_hw_ctx **data, int *cnt,
+		struct blk_mq_hw_ctx *hctx)
 {
-	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
-	blk_qc_t cookie = bio_get_poll_cookie(bio);
-	int ret = 0;
+	int i;
 
-	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
-		struct blk_mq_hw_ctx *hctx =
-			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+	for (i = 0; i < *cnt; i++) {
+		if (data[i] == hctx)
+			goto exit;
+	}
 
-		ret += blk_mq_poll_hctx(q, hctx);
+	if (i < POLL_HCTX_MAX_CNT) {
+		data[i] = hctx;
+		(*cnt)++;
 	}
-	return ret;
+ exit:
+	return *cnt == POLL_HCTX_MAX_CNT;
 }
 
-static int blk_bio_poll_and_end_io(struct request_queue *q,
-		struct blk_bio_poll_ctx *poll_ctx)
+static void blk_build_poll_queues(struct blk_bio_poll_ctx *poll_ctx,
+		struct blk_mq_hw_ctx **data, int *cnt)
 {
-	int ret = 0;
 	int i;
 
-	/*
-	 * Poll hw queue first.
-	 *
-	 * TODO: limit max poll times and make sure to not poll same
-	 * hw queue one more time.
-	 */
 	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
 		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
 		struct bio *bio;
@@ -3904,9 +3902,37 @@ static int blk_bio_poll_and_end_io(struct request_queue *q,
 		if (bio_grp_list_grp_empty(grp))
 			continue;
 
-		for (bio = grp->list.head; bio; bio = bio->bi_poll)
-			ret += blk_mq_poll_io(bio);
+		for (bio = grp->list.head; bio; bio = bio->bi_poll) {
+			blk_qc_t  cookie;
+			struct blk_mq_hw_ctx *hctx;
+			struct request_queue *q;
+
+			if (bio_flagged(bio, BIO_DONE))
+				continue;
+			cookie = bio_get_poll_cookie(bio);
+			if (!blk_qc_t_valid(cookie))
+				continue;
+
+			q = bio->bi_bdev->bd_disk->queue;
+			hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+			if (blk_add_unique_hctx(data, cnt, hctx))
+				return;
+		}
 	}
+}
+
+static int blk_bio_poll_and_end_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	int ret = 0;
+	int i;
+	struct blk_mq_hw_ctx *hctx[POLL_HCTX_MAX_CNT];
+	int cnt = 0;
+
+	blk_build_poll_queues(poll_ctx, hctx, &cnt);
+
+	for (i = 0; i < cnt; i++)
+		ret += blk_mq_poll_hctx(hctx[i]->queue, hctx[i]);
 
 	/* reap bios */
 	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-18 17:26     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-18 17:26 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.
> 
> In IO submission context, track associated underlying bios by per-task
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving
> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> In previous version, kfifo is used to implement submission queue, and
> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> So far bio's size is close to 2 cacheline size, and it may not be
> accepted to add new field into bio for solving the scalability issue by
> tracking bios via linked list, switch to bio group list for tracking bio,
> the idea is to reuse .bi_end_io for linking bios into a linked list for
> all sharing same .bi_end_io(call it bio group), which is recovered before
> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> provide very limited groups, such as 32 for fixing the scalability issue.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 ++
>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |   9 ++
>  include/linux/blk_types.h |  16 +++-
>  5 files changed, 348 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index efc7a61a84b4..778d25a7e76c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>  		sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> +	return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == bio_grp_data(bio)) {
> +			__bio_grp_list_add(&grp->list, bio);
> +			return true;
> +		}
> +	}
> +
> +	if (i == list->max_nr_grps)
> +		return false;
> +
> +	/* create a new group */
> +	grp = &list->head[i];
> +	bio_list_init(&grp->list);
> +	grp->grp_data = bio_grp_data(bio);
> +	__bio_grp_list_add(&grp->list, bio);
> +	list->nr_grps++;
> +
> +	return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == grp_data)
> +			return i;
> +	}
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (bio_grp_list_grp_empty(grp))
> +			return i;
> +	}
> +	return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> +	int i, j, cnt = 0;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> +		grp = &src->head[i];
> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> +		if (j < 0)
> +			break;
> +		if (bio_grp_list_grp_empty(&dst->head[j]))
> +			dst->head[j].grp_data = grp->grp_data;
> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> +		bio_list_init(&grp->list);
> +		cnt++;
> +	}
> +
> +	src->nr_grps -= cnt;
> +}
> +
>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>  {
>  	pc->sq = (void *)pc + sizeof(*pc);
> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_TAG;
>  }
>  
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/*
> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> +	 * and completion. However, bio crypt may update .bi_end_io during
> +	 * submitting, so simply not support bio based polling for this
> +	 * setting.
> +	 */
> +	if (likely(!bio_has_crypt_ctx(bio))) {
> +		/* track this bio via bio group list */
> +		spin_lock(&pc->sq_lock);
> +		queued = bio_grp_list_add(pc->sq, bio);
> +		spin_unlock(&pc->sq_lock);
> +	} else {
> +		queued = false;
> +	}
> +
> +	/*
> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> +	 * and the bio is always completed from the pair poll context.
> +	 *
> +	 * One invariant is that if bio isn't completed, blk_poll() will
> +	 * be called by passing cookie returned from submitting this bio.
> +	 */
> +	if (!queued)
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> +	else
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +
> +	return queued;
> +}
> +
> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> +{
> +	bio->bi_iter.bi_private_data = cookie;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				blk_bio_poll_post_submit(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}

So you're only supporting bio-based polling if the bio-based device is
stacked _directly_ on top of blk-mq?  That severely limits the utility of
bio-based IO polling support if such shallow stacking is required.

Mike


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 17:26     ` [dm-devel] " Mike Snitzer
@ 2021-03-18 17:38       ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-18 17:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, device-mapper development, Jeffle Xu,
	Christoph Hellwig

On Thu, Mar 18, 2021 at 1:26 PM Mike Snitzer <snitzer@redhat.com> wrote:
>
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
>
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> >
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> >
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> >
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +     /* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +     if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +             bio_set_flag(bio, BIO_DONE);
> > +             return;
> > +     }
> >  again:
> >       if (!bio_remaining_done(bio))
> >               return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >               sizeof(struct bio_grp_list_data);
> >  }
> >
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +     return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +     int i;
> > +     struct bio_grp_list_data *grp;
> > +
> > +     for (i = 0; i < list->nr_grps; i++) {
> > +             grp = &list->head[i];
> > +             if (grp->grp_data == bio_grp_data(bio)) {
> > +                     __bio_grp_list_add(&grp->list, bio);
> > +                     return true;
> > +             }
> > +     }
> > +
> > +     if (i == list->max_nr_grps)
> > +             return false;
> > +
> > +     /* create a new group */
> > +     grp = &list->head[i];
> > +     bio_list_init(&grp->list);
> > +     grp->grp_data = bio_grp_data(bio);
> > +     __bio_grp_list_add(&grp->list, bio);
> > +     list->nr_grps++;
> > +
> > +     return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +     int i;
> > +     struct bio_grp_list_data *grp;
> > +
> > +     for (i = 0; i < list->max_nr_grps; i++) {
> > +             grp = &list->head[i];
> > +             if (grp->grp_data == grp_data)
> > +                     return i;
> > +     }
> > +     for (i = 0; i < list->max_nr_grps; i++) {
> > +             grp = &list->head[i];
> > +             if (bio_grp_list_grp_empty(grp))
> > +                     return i;
> > +     }
> > +     return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +     int i, j, cnt = 0;
> > +     struct bio_grp_list_data *grp;
> > +
> > +     for (i = src->nr_grps - 1; i >= 0; i--) {
> > +             grp = &src->head[i];
> > +             j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +             if (j < 0)
> > +                     break;
> > +             if (bio_grp_list_grp_empty(&dst->head[j]))
> > +                     dst->head[j].grp_data = grp->grp_data;
> > +             __bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +             bio_list_init(&grp->list);
> > +             cnt++;
> > +     }
> > +
> > +     src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >       pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >               bio->bi_opf |= REQ_TAG;
> >  }
> >
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +     struct blk_bio_poll_ctx *pc = ioc->data;
> > +     unsigned int queued;
> > +
> > +     /*
> > +      * We rely on immutable .bi_end_io between blk-mq bio submission
> > +      * and completion. However, bio crypt may update .bi_end_io during
> > +      * submitting, so simply not support bio based polling for this
> > +      * setting.
> > +      */
> > +     if (likely(!bio_has_crypt_ctx(bio))) {
> > +             /* track this bio via bio group list */
> > +             spin_lock(&pc->sq_lock);
> > +             queued = bio_grp_list_add(pc->sq, bio);
> > +             spin_unlock(&pc->sq_lock);
> > +     } else {
> > +             queued = false;
> > +     }
> > +
> > +     /*
> > +      * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +      * and the bio is always completed from the pair poll context.
> > +      *
> > +      * One invariant is that if bio isn't completed, blk_poll() will
> > +      * be called by passing cookie returned from submitting this bio.
> > +      */
> > +     if (!queued)
> > +             bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +     else
> > +             bio_set_flag(bio, BIO_END_BY_POLL);
> > +
> > +     return queued;
> > +}
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +     bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >       struct block_device *bdev = bio->bi_bdev;
> > @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *   ->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >       struct bio_list bio_list_on_stack[2];
> >       blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >               bio_list_on_stack[1] = bio_list_on_stack[0];
> >               bio_list_init(&bio_list_on_stack[0]);
> >
> > -             ret = __submit_bio(bio);
> > +             if (ioc && queue_is_mq(q) &&
> > +                             (bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +                     bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +                     ret = __submit_bio(bio);
> > +                     if (queued)
> > +                             blk_bio_poll_post_submit(bio, ret);
> > +             } else {
> > +                     ret = __submit_bio(bio);
> > +             }
>
> So you're only supporting bio-based polling if the bio-based device is
> stacked _directly_ on top of blk-mq?  That severely limits the utility of
> bio-based IO polling support if such shallow stacking is required.

Sorry, I was too focused on this core change; I didn't look far enough
to see you're returning current->pid for all the intermediate
bio-based layers in an arbitrarily deep bio-based IO device stack.

I think I understand it now, but I _do_ think we need a better name
than "__submit_bio_noacct_int".

Mike

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 17:26     ` [dm-devel] " Mike Snitzer
@ 2021-03-19  0:30       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-19  0:30 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18, 2021 at 01:26:22PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> > 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (bio_grp_list_grp_empty(grp))
> > +			return i;
> > +	}
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j]))
> > +			dst->head[j].grp_data = grp->grp_data;
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_TAG;
> >  }
> >  
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submitting, so simply not support bio based polling for this
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +	}
> > +
> > +	/*
> > +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +	 * and the bio is always completed from the pair poll context.
> > +	 *
> > +	 * One invariant is that if bio isn't completed, blk_poll() will
> > +	 * be called by passing cookie returned from submitting this bio.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +
> > +	return queued;
> > +}
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				blk_bio_poll_post_submit(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> 
> So you're only supporting bio-based polling if the bio-based device is
> stacked _directly_ on top of blk-mq?  That severely limits the utility of
> bio-based IO polling support if such shallow stacking is required.

No, it does not have to sit directly on top of blk-mq; the blk-mq device
can be any descendant in the stack. So far only blk-mq can provide
direct polling support, see blk_poll():

                ret = q->mq_ops->poll(hctx);

If no descendant blk-mq device is involved in this bio based device, we
can't support polling so far.
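
To illustrate the idea, here is a very rough sketch of the poll side
(illustration only, not the code in this series): the cookie returned
from submit_bio() identifies the submission task, so blk_poll() can
reach that task's poll context and only poll the hw queues its HIPRI
bios were actually queued on. blk_bio_poll_ctx_from_pid() and the group
walk are hypothetical names; sq/pq, sq_lock, bio_grp_list_move(),
blk_mq_poll_hctx() and BIO_DONE are from the posted patches.

static int blk_bio_poll_sketch(blk_qc_t cookie)
{
	/* hypothetical: map cookie (submitting task's pid) to its poll context */
	struct blk_bio_poll_ctx *pc = blk_bio_poll_ctx_from_pid((pid_t)cookie);
	int ret = 0;

	if (!pc)
		return 0;

	/* move newly submitted HIPRI bios from the submission queue ... */
	spin_lock(&pc->sq_lock);
	bio_grp_list_move(pc->pq, pc->sq);
	spin_unlock(&pc->sq_lock);

	/*
	 * ... then, for each tracked bio, poll the blk-mq hw queue encoded in
	 * the cookie saved in bio->bi_iter.bi_private_data at submission time:
	 *
	 *	ret += blk_mq_poll_hctx(hctx->queue, hctx);
	 *
	 * and finally reap the bios that bio_endio() marked BIO_DONE, restoring
	 * the real .bi_end_io from the group head before ending them for real.
	 */
	return ret;
}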


Thanks,
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 00/13] block: support bio based io polling
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-19  5:50   ` JeffleXu
  -1 siblings, 0 replies; 82+ messages in thread
From: JeffleXu @ 2021-03-19  5:50 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/19/21 12:48 AM, Ming Lei wrote:
> Hi,
> 
> Add per-task io poll context for holding HIPRI blk-mq/underlying bios
> queued from bio based driver's io submission context, and reuse one bio
> padding field for storing 'cookie' returned from submit_bio() for these
> bios. Also explicitly end these bios in poll context by adding two
> new bio flags.
> 
> In this way, we needn't to poll all underlying hw queues any more,
> which is implemented in Jeffle's patches. And we can just poll hw queues
> in which there is HIPRI IO queued.
> 
> Usually io submission and io poll share same context, so the added io
> poll context data is just like one stack variable, and the cost for
> saving bios is cheap.
> 
> Any comments are welcome.
> 
> V2:
> 	- address queue depth scalability issue reported by Jeffle via bio
> 	group list. Reuse .bi_end_io for linking bios which share same
> 	.bi_end_io, and support 32 such groups in submit queue. With this way,
> 	the scalability issue caused by kfifio is solved. Before really
> 	ending bio, .bi_end_io is recovered from the group head.

I have retested this latest version, and at first glance the scaling
issue seems to be fixed.

Test results with the latest version:
3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
317 -> 409 (iodepth=128)

Compared to the test results of v1:
3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
313 -> 349 (iodepth=128, kfifo queue depth =128)
313 -> 409 (iodepth=32, kfifo queue depth =128)
314 -> 409 (iodepth=128, kfifo queue depth =512)

> 
> 
> Jeffle Xu (4):
>   block/mq: extract one helper function polling hw queue
>   block: add queue_to_disk() to get gendisk from request_queue
>   block: add poll_capable method to support bio-based IO polling
>   dm: support IO polling for bio-based dm device
> 
> Ming Lei (9):
>   block: add helper of blk_queue_poll
>   block: add one helper to free io_context
>   block: add helper of blk_create_io_context
>   block: create io poll context for submission and poll task
>   block: add req flag of REQ_TAG
>   block: add new field into 'struct bvec_iter'
>   block: prepare for supporting bio_list via other link
>   block: use per-task poll context to implement bio based io poll
>   blk-mq: limit hw queues to be polled in each blk_poll()
> 
>  block/bio.c                   |   5 +
>  block/blk-core.c              | 248 ++++++++++++++++++++++++++++++++--
>  block/blk-ioc.c               |  12 +-
>  block/blk-mq.c                | 232 ++++++++++++++++++++++++++++++-
>  block/blk-sysfs.c             |  14 +-
>  block/blk.h                   |  55 ++++++++
>  drivers/md/dm-table.c         |  24 ++++
>  drivers/md/dm.c               |  14 ++
>  drivers/nvme/host/core.c      |   2 +-
>  include/linux/bio.h           | 132 +++++++++---------
>  include/linux/blk_types.h     |  20 ++-
>  include/linux/blkdev.h        |   4 +
>  include/linux/bvec.h          |   9 ++
>  include/linux/device-mapper.h |   1 +
>  include/linux/iocontext.h     |   2 +
>  include/trace/events/kyber.h  |   6 +-
>  16 files changed, 686 insertions(+), 94 deletions(-)
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19  7:59     ` JeffleXu
  -1 siblings, 0 replies; 82+ messages in thread
From: JeffleXu @ 2021-03-19  7:59 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/19/21 12:48 AM, Ming Lei wrote:
> Add one req flag REQ_TAG which will be used in the following patch for
> supporting bio based IO polling.
> 
> Exactly this flag can help us to do:
> 
> 1) request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG.
> 
> 2)create per-task io polling context if the bio based queue supports polling
> and the submitted bio is HIPRI. This per-task io polling context will be
> created during submit_bio() before marking this HIPRI bio as REQ_TAG. Then
> we can avoid to create such io polling context if one cloned bio with REQ_TAG
> is submitted from another kernel context.
> 
> 3) for supporting bio based io polling, we need to poll IOs from all
> underlying queues of bio device/driver, this way help us to recognize which
> IOs need to polled in bio based style, which will be implemented in next
> patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c          | 29 +++++++++++++++++++++++++++--
>  include/linux/blk_types.h |  4 ++++
>  2 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0b00c21cbefb..efc7a61a84b4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>  static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		struct bio *bio)
>  {
> +	bool mq;
> +
>  	if (!(bio->bi_opf & REQ_HIPRI))
>  		return;
>  
> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +	/*
> +	 * Can't support bio based IO poll without per-task poll queue
> +	 *
> +	 * Now we have created per-task io poll context, and mark this
> +	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
> +	 * submitted from another kernel context, we won't create bio
> +	 * poll context for it, so that bio will be completed by IRQ;
> +	 * 2) If such bio is submitted from current context, we will
> +	 * complete it via blk_poll(); 3) If driver knows that one
> +	 * underlying bio allocated from driver is for FS bio, meantime
> +	 * it is submitted in current context, driver can mark such bio
> +	 * as REQ_TAG manually, so the bio can be completed via blk_poll
> +	 * too.
> +	 */

Sorry, I can't understand case 3; could you please explain it further? If
'driver marks such bio as REQ_TAG manually', then the per-task io poll
context won't be created, and thus REQ_HIPRI will be cleared, in which
case the bio will be completed by IRQ. How could it be completed by
blk_poll()?


> +	mq = queue_is_mq(q);
> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>  		bio->bi_opf &= ~REQ_HIPRI;




If the use cases are mixed, say one kernel context may submit IO both with
and without REQ_TAG at the same time (though I don't know if this
situation exists), then the above code may not work as we expect.

For example, dm-XXX will return DM_MAPIO_SUBMITTED and actually submits
the cloned bio (with REQ_TAG) from internal kernel threads. Besides,
dm-XXX will also allocate bios of its own (without REQ_TAG) for metadata
logging or something. When submitting the bios without REQ_TAG, a per-task
io poll context will be allocated. Later, when submitting the cloned bios
with REQ_TAG, the poll context already exists and thus REQ_HIPRI is
kept for these bios and they are submitted to polling hw queues.


> +	else if (!mq)
> +		bio->bi_opf |= REQ_TAG;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -893,9 +912,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  
>  	/*
>  	 * Created per-task io poll queue if we supports bio polling
> -	 * and it is one HIPRI bio.
> +	 * and it is one HIPRI bio, and this HIPRI bio has to be from
> +	 * FS. If REQ_TAG isn't set for HIPRI bio, we think it originated
> +	 * from FS.
> +	 *
> +	 * Driver may allocated bio by itself and REQ_TAG is set, but they
> +	 * won't be marked as HIPRI.
>  	 */
>  	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			!(bio->bi_opf & REQ_TAG) &&
>  			(bio->bi_opf & REQ_HIPRI));
>  
>  	blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..a1bcade4bcc3 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>  
>  	__REQ_HIPRI,
>  
> +	/* for marking IOs originated from same FS bio in same context */
> +	__REQ_TAG,
> +
>  	/* for driver use */
>  	__REQ_DRV,
>  	__REQ_SWAP,		/* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>  
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> +#define REQ_TAG			(1ULL << __REQ_TAG)
>  
>  #define REQ_DRV			(1ULL << __REQ_DRV)
>  #define REQ_SWAP		(1ULL << __REQ_SWAP)
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-19  7:59     ` [dm-devel] " JeffleXu
@ 2021-03-19  8:48       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-19  8:48 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Fri, Mar 19, 2021 at 03:59:06PM +0800, JeffleXu wrote:
> 
> 
> On 3/19/21 12:48 AM, Ming Lei wrote:
> > Add one req flag REQ_TAG which will be used in the following patch for
> > supporting bio based IO polling.
> > 
> > Exactly this flag can help us to do:
> > 
> > 1) request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> > as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG.
> > 
> > 2)create per-task io polling context if the bio based queue supports polling
> > and the submitted bio is HIPRI. This per-task io polling context will be
> > created during submit_bio() before marking this HIPRI bio as REQ_TAG. Then
> > we can avoid to create such io polling context if one cloned bio with REQ_TAG
> > is submitted from another kernel context.
> > 
> > 3) for supporting bio based io polling, we need to poll IOs from all
> > underlying queues of bio device/driver, this way help us to recognize which
> > IOs need to polled in bio based style, which will be implemented in next
> > patch.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-core.c          | 29 +++++++++++++++++++++++++++--
> >  include/linux/blk_types.h |  4 ++++
> >  2 files changed, 31 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 0b00c21cbefb..efc7a61a84b4 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> >  static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		struct bio *bio)
> >  {
> > +	bool mq;
> > +
> >  	if (!(bio->bi_opf & REQ_HIPRI))
> >  		return;
> >  
> > -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> > +	/*
> > +	 * Can't support bio based IO poll without per-task poll queue
> > +	 *
> > +	 * Now we have created per-task io poll context, and mark this
> > +	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
> > +	 * submitted from another kernel context, we won't create bio
> > +	 * poll context for it, so that bio will be completed by IRQ;
> > +	 * 2) If such bio is submitted from current context, we will
> > +	 * complete it via blk_poll(); 3) If driver knows that one
> > +	 * underlying bio allocated from driver is for FS bio, meantime
> > +	 * it is submitted in current context, driver can mark such bio
> > +	 * as REQ_TAG manually, so the bio can be completed via blk_poll
> > +	 * too.
> > +	 */
> 
> Sorry I can't understand case 3, could you please further explain it? If

I meant the driver may allocate a bio and submit it in the current context,
and this allocated bio is for completing the FS hipri bio too. So far, HIPRI
won't be set for this bio, but the driver may mark this bio as HIPRI and
TAG, so this created bio can be polled.

> 'driver marks such bio as REQ_TAG manually', then per-task io poll
> context won't be created, and thus REQ_HIPRI will be cleared, in which
> case the bio will be completed by IRQ. How could it be completed by
> blk_poll()?

The io poll context is created when the FS HIPRI bio is submitted on the
bio based queue (DM); at that time, the bio based driver's ->submit_bio
isn't called yet. So when the bio based driver's ->submit_bio() allocates
new bios and submits them in the current context, if the driver marks these
bios as HIPRI and TAG, they can be polled too.
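
For example, such a driver's ->submit_bio() could do roughly the following
for a bio it allocates by itself (just a rough sketch; dm_foo_submit_bio,
lower_bdev and my_end_io are made-up names, not from this series):

	static blk_qc_t dm_foo_submit_bio(struct bio *fs_bio)
	{
		/* driver private bio used for serving the FS bio */
		struct bio *clone = bio_alloc(GFP_NOIO, 1);

		bio_set_dev(clone, lower_bdev);
		clone->bi_end_io = my_end_io;
		clone->bi_private = fs_bio;
		/* set sector/size and add pages as usual, omitted here */

		if (fs_bio->bi_opf & REQ_HIPRI)
			/*
			 * allocated for completing the FS HIPRI bio and
			 * submitted from the current context, so opt it
			 * into bio based polling as case 3) above
			 */
			clone->bi_opf |= REQ_HIPRI | REQ_TAG;

		submit_bio(clone);
		return BLK_QC_T_NONE;
	}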

> 
> 
> > +	mq = queue_is_mq(q);
> > +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
> >  		bio->bi_opf &= ~REQ_HIPRI;
> 
> 
> 
> 
> If the use cases are mixed, saying one kernel context may submit IO with
> and without REQ_TAG at the meantime (though I don't know if this
> situation exists), then the above code may not work as we expect.

The poll context shouldn't be created for a kernel context.

So far, this patch doesn't cover bios submitted from a kernel context, and
for any bios submitted from a kernel context, their HIPRI will be cleared.

> 
> For example, dm-XXX will return DM_MAPIO_SUBMITTED and actually submits
> the cloned bio (with REQ_TAG) with internal kernel threads. Besides,
> dm-XXX will also allocate bio (without REQ_TAG) of itself for metadata
> logging or something. When submitting bios (without REQ_TAG), per-task

But HIPRI won't be set for this allocated bio.

> io poll context will be allocated. Later when submitting cloned bios
> (with REQ_TAG), the poll context already exists and thus REQ_HIPRI is
> kept for these bios and they are submitted to polling hw queues.

Originally I planned to add a new helper, submit_poll_bio(), for the current
HIPRI uses, and only create the poll context in this code path; this way can
decouple things from REQ_TAG a bit. But it looks like it is enough to re-use
REQ_TAG for this purpose. If not, it is quite easy to address the issue wrt.
creating the poll context.
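
Roughly like the following (untested sketch, not part of this series;
blk_queue_poll() and blk_create_io_context() are the helpers added earlier
in this patchset):

	blk_qc_t submit_poll_bio(struct bio *bio)
	{
		struct request_queue *q = bio->bi_bdev->bd_disk->queue;

		/*
		 * only this path creates the per-task poll context, so
		 * plain submit_bio() callers never pay for it
		 */
		if (blk_queue_poll(q))
			blk_create_io_context(q, true);

		bio->bi_opf |= REQ_HIPRI;
		return submit_bio(bio);
	}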


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19  9:38     ` JeffleXu
  -1 siblings, 0 replies; 82+ messages in thread
From: JeffleXu @ 2021-03-19  9:38 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

I'm thinking about how this mechanism could work with *original* bio-based
devices that are not built upon mq devices, such as nvdimm. This
mechanism (also including my original design) mainly focuses on virtual
devices built upon mq devices, i.e., md/dm.

If the original bio-based devices want to support IO polling in the
future, then they should somehow be distinguished from md/dm.
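
(For example, the distinction could be expressed as a dedicated queue flag
that only stacking drivers set; just to illustrate what I mean, the
QUEUE_FLAG_POLL_STACKED below is hypothetical and doesn't exist today:

	/* true only for bio based drivers stacked upon mq devices (dm/md) */
	static inline bool blk_queue_stacked_bio_poll(struct request_queue *q)
	{
		return !queue_is_mq(q) &&
		       test_bit(QUEUE_FLAG_POLL_STACKED, &q->queue_flags);
	}

Original bio based drivers such as nvdimm would then take a different
polling path.)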


On 3/19/21 12:48 AM, Ming Lei wrote:
> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.
> 
> In IO submission context, track associated underlying bios by per-task
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving
> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> In previous version, kfifo is used to implement submission queue, and
> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> So far bio's size is close to 2 cacheline size, and it may not be
> accepted to add new field into bio for solving the scalability issue by
> tracking bios via linked list, switch to bio group list for tracking bio,
> the idea is to reuse .bi_end_io for linking bios into a linked list for
> all sharing same .bi_end_io(call it bio group), which is recovered before
> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> provide very limited groups, such as 32 for fixing the scalability issue.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 ++
>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |   9 ++
>  include/linux/blk_types.h |  16 +++-
>  5 files changed, 348 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index efc7a61a84b4..778d25a7e76c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>  		sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> +	return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == bio_grp_data(bio)) {
> +			__bio_grp_list_add(&grp->list, bio);
> +			return true;
> +		}
> +	}
> +
> +	if (i == list->max_nr_grps)
> +		return false;
> +
> +	/* create a new group */
> +	grp = &list->head[i];
> +	bio_list_init(&grp->list);
> +	grp->grp_data = bio_grp_data(bio);
> +	__bio_grp_list_add(&grp->list, bio);
> +	list->nr_grps++;
> +
> +	return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == grp_data)
> +			return i;
> +	}
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (bio_grp_list_grp_empty(grp))
> +			return i;
> +	}
> +	return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> +	int i, j, cnt = 0;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> +		grp = &src->head[i];
> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> +		if (j < 0)
> +			break;
> +		if (bio_grp_list_grp_empty(&dst->head[j]))
> +			dst->head[j].grp_data = grp->grp_data;
> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> +		bio_list_init(&grp->list);
> +		cnt++;
> +	}
> +
> +	src->nr_grps -= cnt;
> +}

Not sure why it iterates in reverse order (starting from 'nr_grps - 1').


> +
>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>  {
>  	pc->sq = (void *)pc + sizeof(*pc);
> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_TAG;
>  }
>  
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/*
> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> +	 * and completion. However, bio crypt may update .bi_end_io during
> +	 * submitting, so simply not support bio based polling for this
> +	 * setting.
> +	 */
> +	if (likely(!bio_has_crypt_ctx(bio))) {
> +		/* track this bio via bio group list */
> +		spin_lock(&pc->sq_lock);
> +		queued = bio_grp_list_add(pc->sq, bio);
> +		spin_unlock(&pc->sq_lock);
> +	} else {
> +		queued = false;
> +	}
> +
> +	/*
> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> +	 * and the bio is always completed from the pair poll context.
> +	 *
> +	 * One invariant is that if bio isn't completed, blk_poll() will
> +	 * be called by passing cookie returned from submitting this bio.
> +	 */
> +	if (!queued)
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> +	else
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +
> +	return queued;
> +}
> +
> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> +{
> +	bio->bi_iter.bi_private_data = cookie;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				blk_bio_poll_post_submit(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}

If the input @ioc is NULL, this will still return the cookie (returned by
__submit_bio()), which can then be passed into blk_bio_poll(), which is not
expected. (It can pass the blk_queue_poll() check in blk_poll(), e.g., the
dm device itself is marked as QUEUE_FLAG_POLL, but the IO submitting
process failed to allocate its ->io_context, and thus
__submit_bio_noacct_int() is called with @ioc being NULL.)
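
Maybe __submit_bio_noacct() below needs something like this (just a sketch
of what I mean, based on the code in this patch):

	static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
	{
		struct io_context *ioc = current->io_context;

		if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
			return __submit_bio_noacct_poll(bio, ioc);

		__submit_bio_noacct_int(bio, NULL);
		/*
		 * nothing was queued to a per-task poll context, so don't
		 * return the underlying mq cookie; 0 means "no need to
		 * poll" for blk_bio_poll()
		 */
		return 0;
	}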


>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	return ret;
>  }
>  
> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> +		struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +
> +	__submit_bio_noacct_int(bio, ioc);
> +
> +	/* bio submissions queued to per-task poll context */
> +	if (READ_ONCE(pc->sq->nr_grps))
> +		return current->pid;
> +
> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> +	return 0;
> +}
> +
> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> +{
> +	struct io_context *ioc = current->io_context;
> +
> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> +		return __submit_bio_noacct_poll(bio, ioc);
> +


> +	return __submit_bio_noacct_int(bio, NULL);
> +}
> +
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  {
>  	struct bio_list bio_list[2] = { };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 03f59915fe2c..f26950a51f4a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>  	return ret;
>  }
>  
> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> +{
> +	return bio->bi_iter.bi_private_data;
> +}
> +
> +static int blk_mq_poll_io(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> +	int ret = 0;
> +
> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> +		struct blk_mq_hw_ctx *hctx =
> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +
> +		ret += blk_mq_poll_hctx(q, hctx);
> +	}
> +	return ret;
> +}
> +
> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * Poll hw queue first.
> +	 *
> +	 * TODO: limit max poll times and make sure to not poll same
> +	 * hw queue one more time.
> +	 */
> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> +		struct bio *bio;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> +			ret += blk_mq_poll_io(bio);
> +	}
> +
> +	/* reap bios */
> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> +		struct bio *bio;
> +		struct bio_list bl;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		bio_list_init(&bl);
> +
> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> +			if (bio_flagged(bio, BIO_DONE)) {
> +
> +				/* now recover original data */
> +				bio->bi_poll = grp->grp_data;
> +
> +				/* clear BIO_END_BY_POLL and end me really */
> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> +				bio_endio(bio);
> +			} else {
> +				__bio_grp_list_add(&bl, bio);
> +			}
> +		}
> +		__bio_grp_list_merge(&grp->list, &bl);
> +	}
> +	return ret;
> +}
> +
> +static int __blk_bio_poll_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *submit_ctx,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	/*
> +	 * Move IO submission result from submission queue in submission
> +	 * context to poll queue of poll context.
> +	 */
> +	spin_lock(&submit_ctx->sq_lock);
> +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> +	spin_unlock(&submit_ctx->sq_lock);
> +
> +	return blk_bio_poll_and_end_io(q, poll_ctx);
> +}
> +
> +static int blk_bio_poll_io(struct request_queue *q,
> +		struct io_context *submit_ioc,
> +		struct io_context *poll_ioc)
> +{
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> +	int ret;
> +
> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> +		mutex_lock(&poll_ctx->pq_lock);

Why is a mutex used to protect pq here rather than a spinlock? Where would
the polling routine go to sleep?

Besides, how is the concurrent bio_list operation on sq protected between
the producer (submission routine) and the consumer (polling routine)? As far
as I understand, pc->sq_lock is used to prevent concurrent access from
multiple submission processes, while pc->pq_lock is used to prevent
concurrent access from multiple polling processes.


> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +		mutex_unlock(&poll_ctx->pq_lock);
> +	} else {
> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +	}
> +	return ret;
> +}
> +
> +static bool blk_bio_ioc_valid(struct task_struct *t)
> +{
> +	if (!t)
> +		return false;
> +
> +	if (!t->io_context)
> +		return false;
> +
> +	if (!t->io_context->data)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> +{
> +	struct io_context *poll_ioc = current->io_context;
> +	pid_t pid;
> +	struct task_struct *submit_task;
> +	int ret;
> +
> +	pid = (pid_t)cookie;
> +
> +	/* io poll often share io submission context */
> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> +

> +	submit_task = find_get_task_by_vpid(pid);

What if the process to which the returned cookie refers has exited?
find_get_task_by_vpid() will return NULL, and thus blk_poll() won't help
reap anything, while there may still be bios in the poll context waiting
to be reaped.

Maybe we need to flush the poll context when a task detaches from the
io_context.


> +	if (likely(blk_bio_ioc_valid(submit_task)))
> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> +				poll_ioc);

poll_ioc may be invalid in this case, since the previous
blk_create_io_context() in blk_bio_poll() may fail.
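
Maybe an extra check is needed here, e.g. (just a sketch):

	if (likely(blk_bio_ioc_valid(submit_task) &&
		   blk_bio_ioc_valid(current)))
		ret = blk_bio_poll_io(q, submit_task->io_context, poll_ioc);
	else
		ret = 0;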



> +	else
> +		ret = 0;
> +
> +	put_task_struct(submit_task);
> +
> +	return ret;
> +}
> +
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> +	long state;
> +
> +	/* no need to poll */
> +	if (cookie == 0)
> +		return 0;
> +
>  	/*
>  	 * Create poll queue for storing poll bio and its cookie from
>  	 * submission queue
>  	 */
>  	blk_create_io_context(q, true);
>  
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = __blk_bio_poll(q, cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
>  
> @@ -3893,7 +4064,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/block/blk.h b/block/blk.h
> index ae58a706327e..05b9f5eafdd1 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -403,4 +403,13 @@ static inline void blk_create_io_context(struct request_queue *q,
>  		bio_poll_ctx_alloc(ioc);
>  }
>  
> +BIO_LIST_HELPERS(__bio_grp_list, poll);
> +
> +static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
> +{
> +	return bio_list_empty(&grp->list);
> +}
> +
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index a1bcade4bcc3..2d47679bac71 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -235,7 +235,18 @@ struct bio {
>  
>  	struct bvec_iter	bi_iter;
>  
> -	bio_end_io_t		*bi_end_io;
> +	union {
> +		bio_end_io_t		*bi_end_io;
> +		/*
> +		 * bio based io poll need to track bio via bio group list
> +		 * which groups bio by same .bi_end_io, and original
> +		 * .bi_end_io is save into the group head. Will recover
> +		 * .bi_end_io before end this bio really. BIO_END_BY_POLL
> +		 * will make sure that this bio won't be really ended
> +		 * before recovering .bi_end_io.
> +		 */
> +		struct bio		*bi_poll;
> +	};
>  
>  	void			*bi_private;
>  #ifdef CONFIG_BLK_CGROUP
> @@ -304,6 +315,9 @@ enum {
>  	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
>  	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
>  	BIO_REMAPPED,
> +	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
> +	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
> +	BIO_DONE,
>  	BIO_FLAG_LAST
>  };
>  
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-19  8:48       ` [dm-devel] " Ming Lei
@ 2021-03-19  9:47         ` JeffleXu
  -1 siblings, 0 replies; 82+ messages in thread
From: JeffleXu @ 2021-03-19  9:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/19/21 4:48 PM, Ming Lei wrote:
> On Fri, Mar 19, 2021 at 03:59:06PM +0800, JeffleXu wrote:
>>
>>
>> On 3/19/21 12:48 AM, Ming Lei wrote:
>>> Add one req flag REQ_TAG which will be used in the following patch for
>>> supporting bio based IO polling.
>>>
>>> Exactly this flag can help us to do:
>>>
>>> 1) request flag is cloned in bio_fast_clone(), so if we mark one FS bio
>>> as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG.
>>>
>>> 2)create per-task io polling context if the bio based queue supports polling
>>> and the submitted bio is HIPRI. This per-task io polling context will be
>>> created during submit_bio() before marking this HIPRI bio as REQ_TAG. Then
>>> we can avoid to create such io polling context if one cloned bio with REQ_TAG
>>> is submitted from another kernel context.
>>>
>>> 3) for supporting bio based io polling, we need to poll IOs from all
>>> underlying queues of bio device/driver, this way help us to recognize which
>>> IOs need to polled in bio based style, which will be implemented in next
>>> patch.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/blk-core.c          | 29 +++++++++++++++++++++++++++--
>>>  include/linux/blk_types.h |  4 ++++
>>>  2 files changed, 31 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 0b00c21cbefb..efc7a61a84b4 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>>>  static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>  		struct bio *bio)
>>>  {
>>> +	bool mq;
>>> +
>>>  	if (!(bio->bi_opf & REQ_HIPRI))
>>>  		return;
>>>  
>>> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
>>> +	/*
>>> +	 * Can't support bio based IO poll without per-task poll queue
>>> +	 *
>>> +	 * Now we have created per-task io poll context, and mark this
>>> +	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
>>> +	 * submitted from another kernel context, we won't create bio
>>> +	 * poll context for it, so that bio will be completed by IRQ;
>>> +	 * 2) If such bio is submitted from current context, we will
>>> +	 * complete it via blk_poll(); 3) If driver knows that one
>>> +	 * underlying bio allocated from driver is for FS bio, meantime
>>> +	 * it is submitted in current context, driver can mark such bio
>>> +	 * as REQ_TAG manually, so the bio can be completed via blk_poll
>>> +	 * too.
>>> +	 */
>>
>> Sorry I can't understand case 3, could you please further explain it? If
> 
> I meant the driver may allocate a bio and submit it in the current context,
> and this allocated bio is for completing an FS hipri bio too. So far, HIPRI
> won't be set for this bio, but the driver may mark it as HIPRI and TAG, so
> this created bio can be polled.
> 
>> 'driver marks such bio as REQ_TAG manually', then per-task io poll
>> context won't be created, and thus REQ_HIPRI will be cleared, in which
>> case the bio will be completed by IRQ. How could it be completed by
>> blk_poll()?
> 
> The io poll context is created when an FS HIPRI bio on a bio based queue
> (DM) is submitted; at that time, the bio based driver's ->submit_bio isn't
> called yet. So when the driver's ->submit_bio() allocates new bios and
> submits them in the current context, if the driver marks these bios as
> HIPRI and TAG, they can be polled too.

Got it.
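
As a minimal sketch of case 3) (only REQ_HIPRI/REQ_TAG come from these
patches; the helper name and the exact call site are illustrative), a bio
based driver could do something like this from its ->submit_bio():

/*
 * Opt a driver-allocated bio into polling: inherit REQ_HIPRI from the
 * FS bio and set REQ_TAG so the bio is tracked by the already-created
 * per-task poll context, then submit it from the same context so it
 * can be reaped via blk_poll().
 */
static void example_submit_driver_bio(struct bio *fs_bio, struct bio *drv_bio)
{
	if (fs_bio->bi_opf & REQ_HIPRI)
		drv_bio->bi_opf |= REQ_HIPRI | REQ_TAG;

	submit_bio_noacct(drv_bio);
}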


> 
>>
>>
>>> +	mq = queue_is_mq(q);
>>> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>>>  		bio->bi_opf &= ~REQ_HIPRI;
>>
>>
>>
>>
>> If the use cases are mixed, say one kernel context may submit IO both with
>> and without REQ_TAG at the same time (though I don't know if this
>> situation exists), then the above code may not work as we expect.
> 
> Poll context shouldn't be created for kernel context.
> 
> So far, this patch won't cover bios submitted from kernel context, and
> for any bios submitted from kernel context, their HIPRI will be cleared.
> 
>>
>> For example, dm-XXX will return DM_MAPIO_SUBMITTED and actually submits
>> the cloned bio (with REQ_TAG) with internal kernel threads. Besides,
>> dm-XXX will also allocate bio (without REQ_TAG) of itself for metadata
>> logging or something. When submitting bios (without REQ_TAG), per-task
> 
> But HIPRI won't be set for this allocated bio.
> 
>> io poll context will be allocated. Later when submitting cloned bios
>> (with REQ_TAG), the poll context already exists and thus REQ_HIPRI is
>> kept for these bios and they are submitted to polling hw queues.
> 
> Originally I planned to add a new helper, submit_poll_bio(), for current
> HIPRI uses, and only create the poll context in this code path, which can
> decouple REQ_TAG a bit. But it looks like it is enough to re-use REQ_TAG
> for this purpose. If not, it is quite easy to address the issue wrt.
> creating the poll context.
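
A rough sketch of that submit_poll_bio() idea, purely illustrative and not
part of the posted patches; it assumes blk_create_io_context() as introduced
in patch 04 and a caller living in the block layer:

/*
 * Only this entry point would create the per-task poll context, which
 * decouples poll-context creation from REQ_TAG.
 */
static inline blk_qc_t submit_poll_bio(struct bio *bio)
{
	bio->bi_opf |= REQ_HIPRI;
	blk_create_io_context(bio->bi_bdev->bd_disk->queue, true);
	return submit_bio(bio);
}
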
> 
> 
> Thanks, 
> Ming
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-19  9:38     ` [dm-devel] " JeffleXu
@ 2021-03-19 13:46       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-19 13:46 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Fri, Mar 19, 2021 at 05:38:38PM +0800, JeffleXu wrote:
> I'm thinking about how this mechanism could work with *original* bio-based
> devices that aren't built upon mq devices, such as nvdimm. This

A non-mq device needs its driver to implement io polling by itself; the
block layer can't help it, and that can't be this patchset's job.

> mechanism (also including my original design) mainly focuses on virtual
> devices that are built upon mq devices, i.e., md/dm.
> 
> As the original bio-based devices will want to support IO polling in the
> future, they should somehow be distinguished from md/dm.
> 
> 
> On 3/19/21 12:48 AM, Ming Lei wrote:
> > Currently bio based IO poll needs to poll all hw queues blindly, which
> > is very inefficient, and the big reason is that we can't pass the bio
> > submission result to the io poll task.
> > 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of the submission
> > context, and we can find the bios from that submission context. Move bios
> > from the submission queue to the poll queue of the poll context, and keep
> > polling until these bios are ended. Remove a bio from the poll queue once
> > it is ended. Add BIO_DONE and BIO_END_BY_POLL for this purpose.
> > 
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> > So far bio's size is close to 2 cacheline sizes, and adding a new field
> > into bio just to solve the scalability issue by tracking bios via a linked
> > list may not be accepted. So switch to a bio group list for tracking bios:
> > the idea is to reuse .bi_end_io for linking all bios that share the same
> > .bi_end_io (call it a bio group) into a linked list, and .bi_end_io is
> > recovered before the bio is really ended; BIO_END_BY_POLL is added to
> > guarantee this point. Usually .bi_end_io is the same for all bios in the
> > same layer, so it is enough to provide a very limited number of groups,
> > such as 32, for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (bio_grp_list_grp_empty(grp))
> > +			return i;
> > +	}
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j]))
> > +			dst->head[j].grp_data = grp->grp_data;
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> 
> Not sure why it's checked in reverse order (starting from 'nr_grps - 1').

That way, for the bio group list on the submission side, only the first
.nr_grps groups contain bios, so the non-empty groups stay packed at the
front and .nr_grps can simply be decremented after the move.
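
To make that concrete, here is a tiny sketch (not from the patches) of the
submission-side invariant bio_grp_list_move() relies on:

/*
 * On the submission side, non-empty groups are packed into the first
 * .nr_grps slots; moving from the tail keeps them packed, so nr_grps
 * can simply be decremented by the number of moved groups.
 */
static void check_sq_groups_packed(struct bio_grp_list *sq)
{
	unsigned int i;

	for (i = 0; i < sq->nr_grps; i++)
		WARN_ON_ONCE(bio_grp_list_grp_empty(&sq->head[i]));
}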

> 
> 
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_TAG;
> >  }
> >  
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submitting, so simply not support bio based polling for this
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +	}
> > +
> > +	/*
> > +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +	 * and the bio is always completed from the pair poll context.
> > +	 *
> > +	 * One invariant is that if bio isn't completed, blk_poll() will
> > +	 * be called by passing cookie returned from submitting this bio.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +
> > +	return queued;
> > +}
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				blk_bio_poll_post_submit(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> 
> If input @ioc is NULL, it will still return cookie (returned by
> __submit_bio()), which will call into blk_bio_poll(), which is not
> expected. (It can pass blk_queue_poll() check in blk_poll(), e.g., dm
> device itself is marked as QUEUE_FLAG_POLL, but ->io_context of IO
> submitting process failed to allocate the io_context, and thus calls
> __submit_bio_noacct_int() with @ioc is NULL).

Good catch; it looks like the following change is needed:

diff --git a/block/blk-core.c b/block/blk-core.c
index 778d25a7e76c..dba12ba0fa48 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1211,7 +1211,8 @@ static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
 	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
 		return __submit_bio_noacct_poll(bio, ioc);
 
-	return __submit_bio_noacct_int(bio, NULL);
+	 __submit_bio_noacct_int(bio, NULL);
+	return 0;
 }
 
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)

> 
> 
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +
> > +	__submit_bio_noacct_int(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (READ_ONCE(pc->sq->nr_grps))
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return 0;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> 
> 
> > +	return __submit_bio_noacct_int(bio, NULL);
> > +}
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..f26950a51f4a 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > +{
> > +	return bio->bi_iter.bi_private_data;
> > +}
> > +
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	int ret = 0;
> > +	int i;
> > +
> > +	/*
> > +	 * Poll hw queue first.
> > +	 *
> > +	 * TODO: limit max poll times and make sure to not poll same
> > +	 * hw queue one more time.
> > +	 */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> > +			ret += blk_mq_poll_io(bio);
> > +	}
> > +
> > +	/* reap bios */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +		struct bio_list bl;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		bio_list_init(&bl);
> > +
> > +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> > +			if (bio_flagged(bio, BIO_DONE)) {
> > +
> > +				/* now recover original data */
> > +				bio->bi_poll = grp->grp_data;
> > +
> > +				/* clear BIO_END_BY_POLL and end me really */
> > +				bio_clear_flag(bio, BIO_END_BY_POLL);
> > +				bio_endio(bio);
> > +			} else {
> > +				__bio_grp_list_add(&bl, bio);
> > +			}
> > +		}
> > +		__bio_grp_list_merge(&grp->list, &bl);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int __blk_bio_poll_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *submit_ctx,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	/*
> > +	 * Move IO submission result from submission queue in submission
> > +	 * context to poll queue of poll context.
> > +	 */
> > +	spin_lock(&submit_ctx->sq_lock);
> > +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> > +	spin_unlock(&submit_ctx->sq_lock);
> > +
> > +	return blk_bio_poll_and_end_io(q, poll_ctx);
> > +}
> > +
> > +static int blk_bio_poll_io(struct request_queue *q,
> > +		struct io_context *submit_ioc,
> > +		struct io_context *poll_ioc)
> > +{
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> > +	int ret;
> > +
> > +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> > +		mutex_lock(&poll_ctx->pq_lock);
> 
> Why mutex is used to protect pq here rather than spinlock? Where will

spinlock should be fine.

> the polling routine go to sleep?

The current blk_poll() really can go to sleep.

> 
> Besides, how to protect the concurrent bio_list operation to sq between
> producer (submission routine) and consumer (polling routine)? As far as

submit_ctx->sq_lock is held in __blk_bio_poll_io() while moving bios
from sq to pq.

> I understand, pc->sq_lock is used to prevent concurrent access from
> multiple submission processes, while pc->pq_lock is used to prevent
> concurrent access from multiple polling processes.

Usually the poll context doesn't need any lock except for a shared io
context, because blk_bio_poll_ctx->pq is only accessed in the poll
context, and it is still per-task.
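
To restate that rule as code, here is the quoted blk_bio_poll_io() with the
mutex swapped for the spinlock suggested above and the locking rule spelled
out in comments (a sketch, assuming pq_lock becomes a spinlock):

static int blk_bio_poll_io(struct request_queue *q,
		struct io_context *submit_ioc,
		struct io_context *poll_ioc)
{
	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
	int ret;

	/*
	 * pq is normally private to the poll task, so no lock is needed;
	 * only a shared io_context (nr_tasks > 1) requires pq_lock to
	 * serialize pollers.  sq_lock is still taken inside
	 * __blk_bio_poll_io() while bios are moved from sq to pq.
	 */
	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
		spin_lock(&poll_ctx->pq_lock);
		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
		spin_unlock(&poll_ctx->pq_lock);
	} else {
		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
	}
	return ret;
}
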

> 
> 
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +		mutex_unlock(&poll_ctx->pq_lock);
> > +	} else {
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static bool blk_bio_ioc_valid(struct task_struct *t)
> > +{
> > +	if (!t)
> > +		return false;
> > +
> > +	if (!t->io_context)
> > +		return false;
> > +
> > +	if (!t->io_context->data)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> > +{
> > +	struct io_context *poll_ioc = current->io_context;
> > +	pid_t pid;
> > +	struct task_struct *submit_task;
> > +	int ret;
> > +
> > +	pid = (pid_t)cookie;
> > +
> > +	/* io poll often share io submission context */
> > +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> > +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> > +
> 
> > +	submit_task = find_get_task_by_vpid(pid);
> 
> What if the process to which the returned cookie refers has exited?
> find_get_task_by_vpid() will return NULL, thus blk_poll() won't help
> reap anything, while maybe there are still bios in the poll context,
> waiting to be reaped.
> 
> Maybe we need to flush the poll context when a task detaches from the
> io_context.

Yeah, I know about that issue, and just haven't addressed it at the RFC stage.

It can be handled in one of the following ways (a rough sketch of the first
one follows below):

1) drain all bios in the submission context until all of them are completed
before the task exits, since that code won't sleep.

OR

2) schedule a workqueue for completing all submitted bios.

As for the poll context exiting while there are still uncompleted bios, I'm
not sure that can happen, and the current in-tree code has the same issue
anyway.
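
A rough sketch of that first option, purely illustrative: the hook point
(io_context teardown) and every name not in the quoted patch are assumptions.

static bool blk_bio_poll_ctx_busy(struct blk_bio_poll_ctx *pc)
{
	unsigned int i;

	if (READ_ONCE(pc->sq->nr_grps))
		return true;

	for (i = 0; i < pc->pq->max_nr_grps; i++)
		if (!bio_grp_list_grp_empty(&pc->pq->head[i]))
			return true;

	return false;
}

/*
 * Drain the per-task queues when the submitting task detaches from its
 * io_context, so no tracked bio outlives the task.  Reuses
 * blk_bio_poll_io() with the task's own context on both sides; passing
 * a NULL queue leans on the quoted version never dereferencing it,
 * which would need re-checking.
 */
static void blk_bio_poll_drain_on_exit(struct io_context *ioc)
{
	struct blk_bio_poll_ctx *pc = ioc->data;

	if (!pc)
		return;

	while (blk_bio_poll_ctx_busy(pc)) {
		blk_bio_poll_io(NULL, ioc, ioc);
		cpu_relax();
	}
}
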

> 
> 
> > +	if (likely(blk_bio_ioc_valid(submit_task)))
> > +		ret = blk_bio_poll_io(q, submit_task->io_context,
> > +				poll_ioc);
> 
> poll_ioc may be invalid in this case, since the previous
> blk_create_io_context() in blk_bio_poll() may fail.

Yeah, it can be addressed by the above patch; 0 will be returned
in this case.


Thanks, 
Ming


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 01/13] block: add helper of blk_queue_poll
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19 16:52     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 16:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> There has been 3 users, and will be more, so add one such helper.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Not sure if you're collecting Reviewed-by or Acked-by at this point?
Seems you dropped Chaitanya's Reviewed-by to v1:
https://listman.redhat.com/archives/dm-devel/2021-March/msg00166.html

Do you plan to iterate a lot more before you put out a non-RFC?  For
this RFC v2, I'll withhold adding any of my Reviewed-by tags and just
reply where I see things that might need folding into the next
iteration.

Mike


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 04/13] block: create io poll context for submission and poll task
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19 17:05     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 17:05 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Create per-task io poll context for both IO submission and poll task
> if the queue is bio based and supports polling.
> 
> This io polling context includes two queues:
1) submission queue(sq) for storing HIPRI bio submission result(cookie)
   and the bio, written by submission task and read by poll task.
2) polling queue(pq) for holding data moved from sq, only used in poll
   context for running bio polling.
 
(nit, but it just reads a bit clearer to enumerate the 2 queues)

> Following patches will support bio poll.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
>  block/blk-ioc.c           |  1 +
>  block/blk-mq.c            | 14 ++++++++
>  block/blk.h               | 46 +++++++++++++++++++++++++
>  include/linux/iocontext.h |  2 ++
>  5 files changed, 122 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d58f8a0c80de..0b00c21cbefb 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  	return BLK_STS_OK;
>  }
>  
> -static inline void blk_create_io_context(struct request_queue *q)
> +static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
>  {
> -	/*
> -	 * Various block parts want %current->io_context, so allocate it up
> -	 * front rather than dealing with lots of pain to allocate it only
> -	 * where needed. This may fail and the block layer knows how to live
> -	 * with it.
> -	 */
> -	if (unlikely(!current->io_context))
> -		create_task_io_context(current, GFP_ATOMIC, q->node);
> +	struct io_context *ioc = current->io_context;
> +
> +	return ioc ? ioc->data : NULL;
> +}
> +
> +static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> +{
> +	return sizeof(struct bio_grp_list) + nr_grps *
> +		sizeof(struct bio_grp_list_data);
> +}
> +
> +static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> +{
> +	pc->sq = (void *)pc + sizeof(*pc);
> +	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
> +
> +	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
> +	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
> +
> +	spin_lock_init(&pc->sq_lock);
> +	mutex_init(&pc->pq_lock);
> +}
> +
> +void bio_poll_ctx_alloc(struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc;
> +	unsigned int size = sizeof(*pc) +
> +		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
> +		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
> +
> +	pc = kzalloc(size, GFP_ATOMIC);
> +	if (pc) {
> +		bio_poll_ctx_init(pc);
> +		if (cmpxchg(&ioc->data, NULL, (void *)pc))
> +			kfree(pc);
> +	}
> +}
> +
> +static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> +{
> +	return !queue_is_mq(q) && blk_queue_poll(q);
> +}
> +
> +static inline void blk_bio_poll_preprocess(struct request_queue *q,
> +		struct bio *bio)
> +{
> +	if (!(bio->bi_opf & REQ_HIPRI))
> +		return;
> +
> +	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +		bio->bi_opf &= ~REQ_HIPRI;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> -	blk_create_io_context(q);
> +	/*
> +	 * Created per-task io poll queue if we supports bio polling
> +	 * and it is one HIPRI bio.
> +	 */

Create per-task io poll queue if bio polling supported and HIPRI set.


> +	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			(bio->bi_opf & REQ_HIPRI));
>  
> -	if (!blk_queue_poll(q))
> -		bio->bi_opf &= ~REQ_HIPRI;
> +	blk_bio_poll_preprocess(q, bio);
>  
>  	switch (bio_op(bio)) {
>  	case REQ_OP_DISCARD:
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b0cde18c4b8c..5574c398eff6 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
>  
>  static inline void free_io_context(struct io_context *ioc)
>  {
> +	kfree(ioc->data);
>  	kmem_cache_free(iocontext_cachep, ioc);
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 63c81df3b8b5..c832faa52ca0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>  	return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> +	/*
> +	 * Create poll queue for storing poll bio and its cookie from
> +	 * submission queue
> +	 */
> +	blk_create_io_context(q, true);
> +
> +	return 0;
> +}
> +
>  /**
>   * blk_poll - poll for IO completions
>   * @q:  the queue
> @@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	if (current->plug)
>  		blk_flush_plug_list(current->plug, false);
>  
> +	if (!queue_is_mq(q))
> +		return blk_bio_poll(q, cookie, spin);
> +
>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>  
>  	/*
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..ae58a706327e 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -357,4 +357,50 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>  		struct page *page, unsigned int len, unsigned int offset,
>  		unsigned int max_sectors, bool *same_page);
>  
> +/* grouping bios belonging to same group into one list  */

Reads awkwardly, maybe:
/* Grouping bios that share same data into one list */

> +struct bio_grp_list_data {
> +	/* group data */

Don't think this ^ comment is needed (variable name achieves same).

> +	void *grp_data;
> +
> +	/* all bios in this list share same 'grp_data' */
> +	struct bio_list list;
> +};
> +
> +struct bio_grp_list {
> +	unsigned int max_nr_grps, nr_grps;
> +	struct bio_grp_list_data head[0];
> +};
> +
> +struct blk_bio_poll_ctx {
> +	spinlock_t sq_lock;
> +	struct bio_grp_list *sq;
> +
> +	struct mutex pq_lock;
> +	struct bio_grp_list *pq;
> +};
> +
> +#define BLK_BIO_POLL_SQ_SZ		32U
> +#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
> +
> +void bio_poll_ctx_alloc(struct io_context *ioc);
> +
> +static inline void blk_create_io_context(struct request_queue *q,
> +		bool need_poll_ctx)
> +{
> +	struct io_context *ioc;
> +
> +	/*
> +	 * Various block parts want %current->io_context, so allocate it up
> +	 * front rather than dealing with lots of pain to allocate it only
> +	 * where needed. This may fail and the block layer knows how to live
> +	 * with it.
> +	 */
> +	if (unlikely(!current->io_context))
> +		create_task_io_context(current, GFP_ATOMIC, q->node);
> +
> +	ioc = current->io_context;
> +	if (need_poll_ctx && unlikely(ioc && !ioc->data))
> +		bio_poll_ctx_alloc(ioc);
> +}
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 0a9dc40b7be8..f9a467571356 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -110,6 +110,8 @@ struct io_context {
>  	struct io_cq __rcu	*icq_hint;
>  	struct hlist_head	icq_list;
>  
> +	void			*data;
> +
>  	struct work_struct release_work;
>  };
>  
> -- 
> 2.29.2
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19 17:38     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 17:38 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Add one req flag REQ_TAG which will be used in the following patch for
> supporting bio based IO polling.

"REQ_TAG" is so generic yet is used in such a specific way (to mark an
FS bio as having polling context)

I don't have a great suggestion for a better name, just seems "REQ_TAG"
is lacking... (especially given the potential for confusion due to
blk-mq's notion of "tag").

REQ_FS? REQ_FS_CTX? REQ_POLL? REQ_POLL_CTX? REQ_NAMING_IS_HARD :)

Maybe others have better ideas?
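Purely to illustrate the shape of such a rename (picking REQ_POLL_CTX
arbitrarily from the list above), the enum/define hunk quoted further down
would become something like:

	-	__REQ_TAG,
	+	/* marks an FS bio as carrying a per-task polling context */
	+	__REQ_POLL_CTX,
	...
	-#define REQ_TAG			(1ULL << __REQ_TAG)
	+#define REQ_POLL_CTX		(1ULL << __REQ_POLL_CTX)

Not a concrete proposal, just showing the rename is mechanical.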

Mike

> Exactly this flag can help us to do:
> 
> 1) request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_TAG, all bios cloned from this FS bio will be marked as REQ_TAG.
> 
> 2)create per-task io polling context if the bio based queue supports polling
> and the submitted bio is HIPRI. This per-task io polling context will be
> created during submit_bio() before marking this HIPRI bio as REQ_TAG. Then
> we can avoid to create such io polling context if one cloned bio with REQ_TAG
> is submitted from another kernel context.
> 
> 3) for supporting bio based io polling, we need to poll IOs from all
> underlying queues of bio device/driver, this way help us to recognize which
> IOs need to polled in bio based style, which will be implemented in next
> patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c          | 29 +++++++++++++++++++++++++++--
>  include/linux/blk_types.h |  4 ++++
>  2 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0b00c21cbefb..efc7a61a84b4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>  static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		struct bio *bio)
>  {
> +	bool mq;
> +
>  	if (!(bio->bi_opf & REQ_HIPRI))
>  		return;
>  
> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +	/*
> +	 * Can't support bio based IO poll without per-task poll queue
> +	 *
> +	 * Now we have created per-task io poll context, and mark this
> +	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
> +	 * submitted from another kernel context, we won't create bio
> +	 * poll context for it, so that bio will be completed by IRQ;
> +	 * 2) If such bio is submitted from current context, we will
> +	 * complete it via blk_poll(); 3) If driver knows that one
> +	 * underlying bio allocated from driver is for FS bio, meantime
> +	 * it is submitted in current context, driver can mark such bio
> +	 * as REQ_TAG manually, so the bio can be completed via blk_poll
> +	 * too.
> +	 */
> +	mq = queue_is_mq(q);
> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>  		bio->bi_opf &= ~REQ_HIPRI;
> +	else if (!mq)
> +		bio->bi_opf |= REQ_TAG;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -893,9 +912,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  
>  	/*
>  	 * Created per-task io poll queue if we supports bio polling
> -	 * and it is one HIPRI bio.
> +	 * and it is one HIPRI bio, and this HIPRI bio has to be from
> +	 * FS. If REQ_TAG isn't set for HIPRI bio, we think it originated
> +	 * from FS.
> +	 *
> +	 * Driver may allocated bio by itself and REQ_TAG is set, but they
> +	 * won't be marked as HIPRI.
>  	 */
>  	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			!(bio->bi_opf & REQ_TAG) &&
>  			(bio->bi_opf & REQ_HIPRI));
>  
>  	blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..a1bcade4bcc3 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>  
>  	__REQ_HIPRI,
>  
> +	/* for marking IOs originated from same FS bio in same context */
> +	__REQ_TAG,
> +
>  	/* for driver use */
>  	__REQ_DRV,
>  	__REQ_SWAP,		/* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>  
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> +#define REQ_TAG			(1ULL << __REQ_TAG)
>  
>  #define REQ_DRV			(1ULL << __REQ_DRV)
>  #define REQ_SWAP		(1ULL << __REQ_SWAP)
> -- 
> 2.29.2
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter'
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19 17:44     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 17:44 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> There is a hole at the end of 'struct bvec_iter', so put a new field
> here and we can save cookie returned from submit_bio() here for
> supporting bio based polling.
> 
> This way can avoid to extend bio unnecessarily.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/bvec.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index ff832e698efb..61c0f55f7165 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -43,6 +43,15 @@ struct bvec_iter {
>  
>  	unsigned int            bi_bvec_done;	/* number of bytes completed in
>  						   current bvec */
> +
> +	/*
> +	 * There is a hole at the end of bvec_iter, define one filed to

s/filed/field/

> +	 * hold something which isn't relate with 'bvec_iter', so that we can

s/relate/related/
or
s/isn't relate with/doesn't relate to/

> +	 * avoid to extend bio. So far this new field is used for bio based

s/to extend/extending/

> +	 * pooling, we will store returning value of underlying queue's

s/pooling/polling/

> +	 * submit_bio() here.
> +	 */
> +	unsigned int		bi_private_data;
>  };
>  
>  struct bvec_iter_all {
> -- 
> 2.29.2
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-19 18:38     ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 18:38 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.

This is awkward because bio-based IO polling doesn't exist upstream yet,
so this header should be covering your approach as a clean slate, e.g.:

The complexity associated with frequent bio splitting with bio-based
devices makes it difficult to implement IO polling efficiently because
the fan-out of underlying hw queues that need to be polled (as a
side-effect of bios being split) creates a need for more easily mapping
a group of bios to the hw queues that need to be polled.

> In IO submission context, track associated underlying bios by per-task
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving

Maybe be more precise by covering how all bios from that task's
submission context will be moved to the poll queue of the poll context?

> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> In previous version, kfifo is used to implement submission queue, and
> Jeffle Xu found that kfifo can't scale well in case of high queue depth.

Awkward to reference "previous version", maybe instead say:

It was found that kfifo doesn't scale well for a submission queue as
queue depth is increased, so a new mechanism for tracking bios is
needed.

> So far bio's size is close to 2 cacheline size, and it may not be
> accepted to add new field into bio for solving the scalability issue by
> tracking bios via linked list, switch to bio group list for tracking bio,
> the idea is to reuse .bi_end_io for linking bios into a linked list for
> all sharing same .bi_end_io(call it bio group), which is recovered before
> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> provide very limited groups, such as 32 for fixing the scalability issue.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 ++
>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |   9 ++
>  include/linux/blk_types.h |  16 +++-
>  5 files changed, 348 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index efc7a61a84b4..778d25a7e76c 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>  		sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> +	return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == bio_grp_data(bio)) {
> +			__bio_grp_list_add(&grp->list, bio);
> +			return true;
> +		}
> +	}
> +
> +	if (i == list->max_nr_grps)
> +		return false;
> +
> +	/* create a new group */
> +	grp = &list->head[i];
> +	bio_list_init(&grp->list);
> +	grp->grp_data = bio_grp_data(bio);
> +	__bio_grp_list_add(&grp->list, bio);
> +	list->nr_grps++;
> +
> +	return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == grp_data)
> +			return i;
> +	}
> +	for (i = 0; i < list->max_nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (bio_grp_list_grp_empty(grp))
> +			return i;
> +	}
> +	return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> +	int i, j, cnt = 0;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> +		grp = &src->head[i];
> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> +		if (j < 0)
> +			break;
> +		if (bio_grp_list_grp_empty(&dst->head[j]))
> +			dst->head[j].grp_data = grp->grp_data;
> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> +		bio_list_init(&grp->list);
> +		cnt++;
> +	}
> +
> +	src->nr_grps -= cnt;
> +}
> +
>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>  {
>  	pc->sq = (void *)pc + sizeof(*pc);
> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_TAG;
>  }
>  
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/*
> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> +	 * and completion. However, bio crypt may update .bi_end_io during
> +	 * submitting, so simply not support bio based polling for this

s/submitting/submission/
s/not/don't/

> +	 * setting.
> +	 */
> +	if (likely(!bio_has_crypt_ctx(bio))) {
> +		/* track this bio via bio group list */
> +		spin_lock(&pc->sq_lock);
> +		queued = bio_grp_list_add(pc->sq, bio);
> +		spin_unlock(&pc->sq_lock);
> +	} else {
> +		queued = false;
> +	}
> +
> +	/*
> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> +	 * and the bio is always completed from the pair poll context.

This reads awkwardly... "pair poll context"?

> +	 *
> +	 * One invariant is that if bio isn't completed, blk_poll() will
> +	 * be called by passing cookie returned from submitting this bio.
> +	 */
> +	if (!queued)
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> +	else
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +
> +	return queued;
> +}
> +
> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> +{
> +	bio->bi_iter.bi_private_data = cookie;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;

I mentioned this in a previous mail, but what is it you're trying to
convey with _int?

Think we need a better function name here.

> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				blk_bio_poll_post_submit(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	return ret;
>  }
>  
> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> +		struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +
> +	__submit_bio_noacct_int(bio, ioc);
> +
> +	/* bio submissions queued to per-task poll context */
> +	if (READ_ONCE(pc->sq->nr_grps))
> +		return current->pid;
> +
> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> +	return 0;
> +}
> +
> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> +{
> +	struct io_context *ioc = current->io_context;
> +
> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> +		return __submit_bio_noacct_poll(bio, ioc);
> +
> +	return __submit_bio_noacct_int(bio, NULL);
> +}
> +
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  {
>  	struct bio_list bio_list[2] = { };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 03f59915fe2c..f26950a51f4a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>  	return ret;
>  }
>  
> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> +{
> +	return bio->bi_iter.bi_private_data;
> +}
> +
> +static int blk_mq_poll_io(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> +	int ret = 0;
> +
> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> +		struct blk_mq_hw_ctx *hctx =
> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +
> +		ret += blk_mq_poll_hctx(q, hctx);
> +	}
> +	return ret;
> +}
> +
> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * Poll hw queue first.
> +	 *
> +	 * TODO: limit max poll times and make sure to not poll same
> +	 * hw queue one more time.
> +	 */
> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> +		struct bio *bio;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> +			ret += blk_mq_poll_io(bio);
> +	}
> +
> +	/* reap bios */
> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> +		struct bio *bio;
> +		struct bio_list bl;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		bio_list_init(&bl);
> +
> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> +			if (bio_flagged(bio, BIO_DONE)) {
> +

Remove empty newline? ^

> +				/* now recover original data */
> +				bio->bi_poll = grp->grp_data;
> +
> +				/* clear BIO_END_BY_POLL and end me really */
> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> +				bio_endio(bio);
> +			} else {
> +				__bio_grp_list_add(&bl, bio);
> +			}
> +		}
> +		__bio_grp_list_merge(&grp->list, &bl);
> +	}
> +	return ret;
> +}
> +
> +static int __blk_bio_poll_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *submit_ctx,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	/*
> +	 * Move IO submission result from submission queue in submission
> +	 * context to poll queue of poll context.
> +	 */
> +	spin_lock(&submit_ctx->sq_lock);
> +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> +	spin_unlock(&submit_ctx->sq_lock);
> +
> +	return blk_bio_poll_and_end_io(q, poll_ctx);
> +}
> +
> +static int blk_bio_poll_io(struct request_queue *q,
> +		struct io_context *submit_ioc,
> +		struct io_context *poll_ioc)
> +{
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> +	int ret;
> +
> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> +		mutex_lock(&poll_ctx->pq_lock);
> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +		mutex_unlock(&poll_ctx->pq_lock);
> +	} else {
> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +	}
> +	return ret;
> +}
> +
> +static bool blk_bio_ioc_valid(struct task_struct *t)
> +{
> +	if (!t)
> +		return false;
> +
> +	if (!t->io_context)
> +		return false;
> +
> +	if (!t->io_context->data)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> +{
> +	struct io_context *poll_ioc = current->io_context;
> +	pid_t pid;
> +	struct task_struct *submit_task;
> +	int ret;
> +
> +	pid = (pid_t)cookie;
> +
> +	/* io poll often share io submission context */
> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> +
> +	submit_task = find_get_task_by_vpid(pid);
> +	if (likely(blk_bio_ioc_valid(submit_task)))
> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> +				poll_ioc);

Style nit, but I think it's fine to put "poll_ioc);" on the previous line.
Otherwise, best to add braces.
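i.e., a quick sketch of the first option; the joined call still fits in
80 columns:

	if (likely(blk_bio_ioc_valid(submit_task)))
		ret = blk_bio_poll_io(q, submit_task->io_context, poll_ioc);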

> +	else
> +		ret = 0;
> +
> +	put_task_struct(submit_task);
> +
> +	return ret;
> +}
> +
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> +	long state;
> +
> +	/* no need to poll */
> +	if (cookie == 0)
> +		return 0;
> +
>  	/*
>  	 * Create poll queue for storing poll bio and its cookie from
>  	 * submission queue
>  	 */
>  	blk_create_io_context(q, true);
>  
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = __blk_bio_poll(q, cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
>  
> @@ -3893,7 +4064,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/block/blk.h b/block/blk.h
> index ae58a706327e..05b9f5eafdd1 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -403,4 +403,13 @@ static inline void blk_create_io_context(struct request_queue *q,
>  		bio_poll_ctx_alloc(ioc);
>  }
>  
> +BIO_LIST_HELPERS(__bio_grp_list, poll);

Why the leading double underscore?
Especially given the following 2 helpers don't have a leading double underscore?

Whichever you decide, just looking for consistency.

> +
> +static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
> +{
> +	return bio_list_empty(&grp->list);
> +}
> +
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index a1bcade4bcc3..2d47679bac71 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -235,7 +235,18 @@ struct bio {
>  
>  	struct bvec_iter	bi_iter;
>  
> -	bio_end_io_t		*bi_end_io;
> +	union {
> +		bio_end_io_t		*bi_end_io;
> +		/*
> +		 * bio based io poll need to track bio via bio group list

s/poll need/polling needs/

> +		 * which groups bio by same .bi_end_io, and original
> +		 * .bi_end_io is save into the group head. Will recover

s/save/saved/

> +		 * .bi_end_io before end this bio really. BIO_END_BY_POLL

s/before end this bio really/before really ending bio/

> +		 * will make sure that this bio won't be really ended

s/really//

> +		 * before recovering .bi_end_io.
> +		 */
> +		struct bio		*bi_poll;
> +	};
>  
>  	void			*bi_private;
>  #ifdef CONFIG_BLK_CGROUP
> @@ -304,6 +315,9 @@ enum {
>  	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
>  	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
>  	BIO_REMAPPED,
> +	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
> +	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
> +	BIO_DONE,
>  	BIO_FLAG_LAST
>  };
>  
> -- 
> 2.29.2
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 00/13] block: support bio based io polling
  2021-03-18 16:48 ` [dm-devel] " Ming Lei
@ 2021-03-19 18:45   ` Mike Snitzer
  -1 siblings, 0 replies; 82+ messages in thread
From: Mike Snitzer @ 2021-03-19 18:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Thu, Mar 18 2021 at 12:48pm -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> Hi,
> 
> Add per-task io poll context for holding HIPRI blk-mq/underlying bios
> queued from bio based driver's io submission context, and reuse one bio
> padding field for storing 'cookie' returned from submit_bio() for these
> bios. Also explicitly end these bios in poll context by adding two
> new bio flags.
> 
> In this way, we needn't to poll all underlying hw queues any more,
> which is implemented in Jeffle's patches. And we can just poll hw queues
> in which there is HIPRI IO queued.
> 
> Usually io submission and io poll share same context, so the added io
> poll context data is just like one stack variable, and the cost for
> saving bios is cheap.
> 
> Any comments are welcome.

I really like your approach and am very encouraged by the early results
Jeffle has shared.

Please review my various nits for your next iteration of this patchset.
But I think you aren't far from these changes being ready to make the
5.13 merge, which is really pretty awesome.

Outstanding job Ming, thanks so much for taking on this line of work!

Mike


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-19 13:46       ` [dm-devel] " Ming Lei
@ 2021-03-20  5:56         ` JeffleXu
  -1 siblings, 0 replies; 82+ messages in thread
From: JeffleXu @ 2021-03-20  5:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/19/21 9:46 PM, Ming Lei wrote:
> On Fri, Mar 19, 2021 at 05:38:38PM +0800, JeffleXu wrote:
>> I'm thinking how this mechanism could work with *original* bio-based
>> devices that aren't built upon mq devices, such as nvdimm. This
> 
> A non-mq device needs its driver to implement io polling by itself; the
> block layer can't help it, and that can't be this patchset's job.
> 
>> mechanism (also including my original design) mainly focuses on virtual
>> devices that are built upon mq devices, i.e., md/dm.
>>
>> As the original bio-based devices want to support IO polling in the
>> future, they should somehow be distinguished from md/dm.
>>
>>
>> On 3/19/21 12:48 AM, Ming Lei wrote:
>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>> is very inefficient, and the big reason is that we can't pass bio
>>> submission result to io poll task.
>>>
>>> In IO submission context, track associated underlying bios by per-task
>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>> and return current->pid to caller of submit_bio() for any bio based
>>> driver's IO, which is submitted from FS.
>>>
>>> In IO poll context, the passed cookie tells us the PID of submission
>>> context, and we can find the bio from that submission context. Moving
>>> bio from submission queue to poll queue of the poll context, and keep
>>> polling until these bios are ended. Remove bio from poll queue if the
>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>
>>> In previous version, kfifo is used to implement submission queue, and
>>> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
>>> So far bio's size is close to 2 cacheline size, and it may not be
>>> accepted to add new field into bio for solving the scalability issue by
>>> tracking bios via linked list, switch to bio group list for tracking bio,
>>> the idea is to reuse .bi_end_io for linking bios into a linked list for
>>> all sharing same .bi_end_io(call it bio group), which is recovered before
>>> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
>>> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
>>> provide very limited groups, such as 32 for fixing the scalability issue.
>>>
>>> Usually submission shares context with io poll. The per-task poll context
>>> is just like stack variable, and it is cheap to move data between the two
>>> per-task queues.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/bio.c               |   5 ++
>>>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
>>>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
>>>  block/blk.h               |   9 ++
>>>  include/linux/blk_types.h |  16 +++-
>>>  5 files changed, 348 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 26b7f721cda8..04c043dc60fc 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>   **/
>>>  void bio_endio(struct bio *bio)
>>>  {
>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>> +		bio_set_flag(bio, BIO_DONE);
>>> +		return;
>>> +	}
>>>  again:
>>>  	if (!bio_remaining_done(bio))
>>>  		return;
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index efc7a61a84b4..778d25a7e76c 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>>>  		sizeof(struct bio_grp_list_data);
>>>  }
>>>  
>>> +static inline void *bio_grp_data(struct bio *bio)
>>> +{
>>> +	return bio->bi_poll;
>>> +}
>>> +
>>> +/* add bio into bio group list, return true if it is added */
>>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
>>> +{
>>> +	int i;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = 0; i < list->nr_grps; i++) {
>>> +		grp = &list->head[i];
>>> +		if (grp->grp_data == bio_grp_data(bio)) {
>>> +			__bio_grp_list_add(&grp->list, bio);
>>> +			return true;
>>> +		}
>>> +	}
>>> +
>>> +	if (i == list->max_nr_grps)
>>> +		return false;
>>> +
>>> +	/* create a new group */
>>> +	grp = &list->head[i];
>>> +	bio_list_init(&grp->list);
>>> +	grp->grp_data = bio_grp_data(bio);
>>> +	__bio_grp_list_add(&grp->list, bio);
>>> +	list->nr_grps++;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
>>> +{
>>> +	int i;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = 0; i < list->max_nr_grps; i++) {
>>> +		grp = &list->head[i];
>>> +		if (grp->grp_data == grp_data)
>>> +			return i;
>>> +	}
>>> +	for (i = 0; i < list->max_nr_grps; i++) {
>>> +		grp = &list->head[i];
>>> +		if (bio_grp_list_grp_empty(grp))
>>> +			return i;
>>> +	}
>>> +	return -1;
>>> +}
>>> +
>>> +/* Move as many as possible groups from 'src' to 'dst' */
>>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
>>> +{
>>> +	int i, j, cnt = 0;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
>>> +		grp = &src->head[i];
>>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
>>> +		if (j < 0)
>>> +			break;
>>> +		if (bio_grp_list_grp_empty(&dst->head[j]))
>>> +			dst->head[j].grp_data = grp->grp_data;
>>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
>>> +		bio_list_init(&grp->list);
>>> +		cnt++;
>>> +	}
>>> +
>>> +	src->nr_grps -= cnt;
>>> +}
>>
>> Not sure why it's checked in reverse order (starting from 'nr_grps - 1').
> 
> Then for bio group list in submission side, only first .nr_grps groups
> includes bios.
> 
>>
>>
>>> +
>>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>>>  {
>>>  	pc->sq = (void *)pc + sizeof(*pc);
>>> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>  		bio->bi_opf |= REQ_TAG;
>>>  }
>>>  
>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	unsigned int queued;
>>> +
>>> +	/*
>>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
>>> +	 * and completion. However, bio crypt may update .bi_end_io during
>>> +	 * submitting, so simply not support bio based polling for this
>>> +	 * setting.
>>> +	 */
>>> +	if (likely(!bio_has_crypt_ctx(bio))) {
>>> +		/* track this bio via bio group list */
>>> +		spin_lock(&pc->sq_lock);
>>> +		queued = bio_grp_list_add(pc->sq, bio);
>>> +		spin_unlock(&pc->sq_lock);
>>> +	} else {
>>> +		queued = false;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>> +	 * and the bio is always completed from the pair poll context.
>>> +	 *
>>> +	 * One invariant is that if bio isn't completed, blk_poll() will
>>> +	 * be called by passing cookie returned from submitting this bio.
>>> +	 */
>>> +	if (!queued)
>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>> +	else
>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>> +
>>> +	return queued;
>>> +}
>>> +
>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>> +{
>>> +	bio->bi_iter.bi_private_data = cookie;
>>> +}
>>> +
>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>  {
>>>  	struct block_device *bdev = bio->bi_bdev;
>>> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>   */
>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>>>  {
>>>  	struct bio_list bio_list_on_stack[2];
>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>  
>>> -		ret = __submit_bio(bio);
>>> +		if (ioc && queue_is_mq(q) &&
>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>> +
>>> +			ret = __submit_bio(bio);
>>> +			if (queued)
>>> +				blk_bio_poll_post_submit(bio, ret);
>>> +		} else {
>>> +			ret = __submit_bio(bio);
>>> +		}
>>
>> If input @ioc is NULL, it will still return cookie (returned by
>> __submit_bio()), which will call into blk_bio_poll(), which is not
>> expected. (It can pass blk_queue_poll() check in blk_poll(), e.g., dm
>> device itself is marked as QUEUE_FLAG_POLL, but ->io_context of IO
>> submitting process failed to allocate the io_context, and thus calls
>> __submit_bio_noacct_int() with @ioc is NULL).
> 
> Good catch, looks the following change is needed:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 778d25a7e76c..dba12ba0fa48 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1211,7 +1211,8 @@ static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>  		return __submit_bio_noacct_poll(bio, ioc);
>  
> -	return __submit_bio_noacct_int(bio, NULL);
> +	 __submit_bio_noacct_int(bio, NULL);
> +	return 0;
>  }

Looks good, as long as for now no original bio-based device supports IO
polling.


>  
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> 
>>
>>
>>>  
>>>  		/*
>>>  		 * Sort new bios into those for a lower level and those for the
>>> @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  	return ret;
>>>  }
>>>  
>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>> +		struct io_context *ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +
>>> +	__submit_bio_noacct_int(bio, ioc);
>>> +
>>> +	/* bio submissions queued to per-task poll context */
>>> +	if (READ_ONCE(pc->sq->nr_grps))
>>> +		return current->pid;
>>> +
>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>> +	return 0;
>>> +}
>>> +
>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +{
>>> +	struct io_context *ioc = current->io_context;
>>> +
>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>> +
>>
>>
>>> +	return __submit_bio_noacct_int(bio, NULL);
>>> +}
>>> +
>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>  {
>>>  	struct bio_list bio_list[2] = { };
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 03f59915fe2c..f26950a51f4a 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>  	return ret;
>>>  }
>>>  
>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
>>> +{
>>> +	return bio->bi_iter.bi_private_data;
>>> +}
>>> +
>>> +static int blk_mq_poll_io(struct bio *bio)
>>> +{
>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
>>> +	int ret = 0;
>>> +
>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>> +		struct blk_mq_hw_ctx *hctx =
>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>> +
>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>> +{
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Poll hw queue first.
>>> +	 *
>>> +	 * TODO: limit max poll times and make sure to not poll same
>>> +	 * hw queue one more time.
>>> +	 */
>>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
>>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
>>> +		struct bio *bio;
>>> +
>>> +		if (bio_grp_list_grp_empty(grp))
>>> +			continue;
>>> +
>>> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
>>> +			ret += blk_mq_poll_io(bio);
>>> +	}
>>> +
>>> +	/* reap bios */
>>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
>>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
>>> +		struct bio *bio;
>>> +		struct bio_list bl;
>>> +
>>> +		if (bio_grp_list_grp_empty(grp))
>>> +			continue;
>>> +
>>> +		bio_list_init(&bl);
>>> +
>>> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
>>> +			if (bio_flagged(bio, BIO_DONE)) {
>>> +
>>> +				/* now recover original data */
>>> +				bio->bi_poll = grp->grp_data;
>>> +
>>> +				/* clear BIO_END_BY_POLL and end me really */
>>> +				bio_clear_flag(bio, BIO_END_BY_POLL);
>>> +				bio_endio(bio);
>>> +			} else {
>>> +				__bio_grp_list_add(&bl, bio);
>>> +			}
>>> +		}
>>> +		__bio_grp_list_merge(&grp->list, &bl);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static int __blk_bio_poll_io(struct request_queue *q,
>>> +		struct blk_bio_poll_ctx *submit_ctx,
>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>> +{
>>> +	/*
>>> +	 * Move IO submission result from submission queue in submission
>>> +	 * context to poll queue of poll context.
>>> +	 */
>>> +	spin_lock(&submit_ctx->sq_lock);
>>> +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
>>> +	spin_unlock(&submit_ctx->sq_lock);
>>> +
>>> +	return blk_bio_poll_and_end_io(q, poll_ctx);
>>> +}
>>> +
>>> +static int blk_bio_poll_io(struct request_queue *q,
>>> +		struct io_context *submit_ioc,
>>> +		struct io_context *poll_ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
>>> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
>>> +	int ret;
>>> +
>>> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
>>> +		mutex_lock(&poll_ctx->pq_lock);
>>
>> Why mutex is used to protect pq here rather than spinlock? Where will
> 
> spinlock should be fine.
> 
>> the polling routine go to sleep?
> 
> The current blk_poll() can go to sleep really.

I know that hybrid polling can go to sleep. Apart from that, it seems
no other place can sleep?


> 
>>
>> Besides, how to protect the concurrent bio_list operation to sq between
>> producer (submission routine) and consumer (polling routine)? As far as
> 
> submit_ctx->sq_lock is held in __blk_bio_poll_io() for moving bios
> from sq to pq, see __blk_bio_poll_io().
> 
>> I understand, pc->sq_lock is used to prevent concurrent access from
>> multiple submission processes, while pc->pq_lock is used to prevent
>> concurrent access from multiple polling processes.
> 
> Usually poll context needn't any lock except for shared io context,
> because blk_bio_poll_ctx->pq is only accessed in poll context, and
> it is still per-task.
> 
>>
>>
>>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
>>> +		mutex_unlock(&poll_ctx->pq_lock);
>>> +	} else {
>>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static bool blk_bio_ioc_valid(struct task_struct *t)
>>> +{
>>> +	if (!t)
>>> +		return false;
>>> +
>>> +	if (!t->io_context)
>>> +		return false;
>>> +
>>> +	if (!t->io_context->data)
>>> +		return false;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
>>> +{
>>> +	struct io_context *poll_ioc = current->io_context;
>>> +	pid_t pid;
>>> +	struct task_struct *submit_task;
>>> +	int ret;
>>> +
>>> +	pid = (pid_t)cookie;
>>> +
>>> +	/* io poll often share io submission context */
>>> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
>>> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
>>> +
>>
>>> +	submit_task = find_get_task_by_vpid(pid);
>>
>> What if the process to which the returned cookie refers has exited?
>> find_get_task_by_vpid() will return NULL, thus blk_poll() won't help
>> reap anything, while maybe there are still bios in the poll context,
>> waiting to be reaped.
>>
>> Maybe we need to flush the poll context when a task detaches from the
>> io_context.
> 
> Yeah, I know that issue, and just not address it in RFC stage.
> 
> It can be handled in the following way:
> 
> 1) drain all bios in submission context until all bios are completed before
> it exits since the code won't sleep.
> 
> OR
> 
> 2) schedule wq for completing all submitted bios.
> 
> If poll context exits, and there is still bios not completed, not sure
> if it can happen because the current in-tree code has the same issue.
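
If it helps, my understanding of option 1 is roughly the sketch below,
run when the submission task detaches from its io_context. The hook and
bio_grp_list_empty() are made up for illustration, and pq locking for a
shared io_context is ignored here:

/* sketch only: drain the per-task poll context before the task exits */
static bool bio_grp_list_empty(struct bio_grp_list *list)
{
	int i;

	for (i = 0; i < list->max_nr_grps; i++)
		if (!bio_grp_list_grp_empty(&list->head[i]))
			return false;
	return true;
}

static void blk_bio_poll_drain_on_exit(struct io_context *ioc)
{
	struct blk_bio_poll_ctx *pc = ioc->data;

	if (!pc)
		return;

	for (;;) {
		/* the exiting task can't queue new bios to pc->sq any more */
		spin_lock(&pc->sq_lock);
		bio_grp_list_move(pc->pq, pc->sq);
		spin_unlock(&pc->sq_lock);

		/* @q looks unused in blk_bio_poll_and_end_io() in this series */
		blk_bio_poll_and_end_io(NULL, pc);

		if (!READ_ONCE(pc->sq->nr_grps) && bio_grp_list_empty(pc->pq))
			break;
		cpu_relax();
	}
}
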
> 
>>
>>
>>> +	if (likely(blk_bio_ioc_valid(submit_task)))
>>> +		ret = blk_bio_poll_io(q, submit_task->io_context,
>>> +				poll_ioc);
>>
>> poll_ioc may be invalid in this case, since the previous
>> blk_create_io_context() in blk_bio_poll() may fail.
> 
> Yeah, it can be addressed by above patch, 0 will be returned
> for this case.
> 

Not really. I mean that in the 'current->pid != pid' case, poll_ioc, i.e.
current->io_context, could still be invalid: current->io_context may be
NULL, or current->io_context->data may be NULL. Maybe it should be fixed by

+	if (likely(blk_bio_ioc_valid(submit_task) && blk_bio_ioc_valid(current)))
+		ret = blk_bio_poll_io(q, submit_task->io_context,
+				poll_ioc);


I can't see how the patch you mentioned above could fix this.

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-18 16:48   ` [dm-devel] " Ming Lei
@ 2021-03-23  3:46     ` Sagi Grimberg
  -1 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2021-03-23  3:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel


> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> +{
> +	bio->bi_iter.bi_private_data = cookie;
> +}
> +

Hey Ming, thinking about nvme-mpath, it seems this should be an
exported function for failover. nvme-mpath updates bio->bi_bdev when
re-submitting I/Os to an alternate path, so if this function is
exported then nvme-mpath could do as little as the below to allow
polling?

--
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 92adebfaf86f..e562e296153b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
         struct nvme_ns_head *head =
                 container_of(work, struct nvme_ns_head, requeue_work);
         struct bio *bio, *next;
+       blk_qc_t cookie;

         spin_lock_irq(&head->requeue_lock);
         next = bio_list_get(&head->requeue_list);
@@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
                  * path.
                  */
                 bio_set_dev(bio, head->disk->part0);
-               submit_bio_noacct(bio);
+               cookie = submit_bio_noacct(bio);
+               blk_bio_poll_post_submit(bio, cookie);
         }
  }
--
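
On the block side that would then just mean making it non-static and
exporting it, e.g. something like (sketch only, plus a prototype in a
shared header):

void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
{
	bio->bi_iter.bi_private_data = cookie;
}
EXPORT_SYMBOL_GPL(blk_bio_poll_post_submit);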

I/O failover will create a misalignment between the polling context cpu
and the submission cpu (running requeue_work), but I don't see anything
that would break here...

Thoughts?

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 01/13] block: add helper of blk_queue_poll
  2021-03-19 16:52     ` [dm-devel] " Mike Snitzer
@ 2021-03-23 11:17       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:17 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Fri, Mar 19, 2021 at 12:52:42PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > There has been 3 users, and will be more, so add one such helper.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> 
> Not sure if you're collecting Reviewed-by or Acked-by at this point?
> Seems you dropped Chaitanya's Reviewed-by to v1:
> https://listman.redhat.com/archives/dm-devel/2021-March/msg00166.html

Sorry, that must have been an accident.

> 
> Do you plan to iterate a lot more before you put out a non-RFC?  For
> this RFC v2, I'll withhold adding any of my Reviewed-by tags and just
> reply where I see things that might need folding into the next
> iteration.

If no one objects to the basic approach taken in V2, I will drop the
RFC tag in V3.

-- 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 04/13] block: create io poll context for submission and poll task
  2021-03-19 17:05     ` [dm-devel] " Mike Snitzer
@ 2021-03-23 11:23       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Fri, Mar 19, 2021 at 01:05:09PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Create per-task io poll context for both IO submission and poll task
> > if the queue is bio based and supports polling.
> > 
> > This io polling context includes two queues:
> 1) submission queue(sq) for storing HIPRI bio submission result(cookie)
>    and the bio, written by submission task and read by poll task.

BTW, V2 has switched to storing only the bio, and the cookie is
actually stored inside the bio.

> 2) polling queue(pq) for holding data moved from sq, only used in poll
>    context for running bio polling.
>  
> (nit, but it just reads a bit clearer to enumerate the 2 queues)

OK.
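
For reference, the per-task poll context in V2 roughly looks like the
following (paraphrased from the patches, not the exact definition):

struct blk_bio_poll_ctx {
	spinlock_t		sq_lock;
	struct bio_grp_list	*sq;	/* filled from submit_bio() */
	struct mutex		pq_lock; /* only used if io_context is shared */
	struct bio_grp_list	*pq;	/* consumed by the poll task */
};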

-- 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 05/13] block: add req flag of REQ_TAG
  2021-03-19 17:38     ` [dm-devel] " Mike Snitzer
@ 2021-03-23 11:26       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Fri, Mar 19, 2021 at 01:38:13PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Add one req flag REQ_TAG which will be used in the following patch for
> > supporting bio based IO polling.
> 
> "REQ_TAG" is so generic yet is used in such a specific way (to mark an
> FS bio as having polling context)
> 
> I don't have a great suggestion for a better name, just seems "REQ_TAG"
> is lacking... (especially given the potential for confusion due to
> blk-mq's notion of "tag").
> 
> REQ_FS? REQ_FS_CTX? REQ_POLL? REQ_POLL_CTX? REQ_NAMING_IS_HARD :)
> 

Maybe REQ_POLL_CTX is a better name; it is just for marking bios
(sketched below):

1) which need to be polled in this context

2) which can be polled in this context
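
Just to illustrate the rename (not the actual patch):

	/* include/linux/blk_types.h, in enum req_flag_bits */
	__REQ_POLL_CTX,		/* bio is tracked via a per-task poll context */

	#define REQ_POLL_CTX		(1ULL << __REQ_POLL_CTX)

	/* and call sites such as blk_bio_poll_preprocess() become */
	bio->bi_opf |= REQ_POLL_CTX;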

-- 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter'
  2021-03-19 17:44     ` [dm-devel] " Mike Snitzer
@ 2021-03-23 11:29       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:29 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Fri, Mar 19, 2021 at 01:44:22PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > There is a hole at the end of 'struct bvec_iter', so put a new field
> > here and we can save cookie returned from submit_bio() here for
> > supporting bio based polling.
> > 
> > This way can avoid to extend bio unnecessarily.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  include/linux/bvec.h | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index ff832e698efb..61c0f55f7165 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -43,6 +43,15 @@ struct bvec_iter {
> >  
> >  	unsigned int            bi_bvec_done;	/* number of bytes completed in
> >  						   current bvec */
> > +
> > +	/*
> > +	 * There is a hole at the end of bvec_iter, define one filed to
> 
> s/filed/field/
> 
> > +	 * hold something which isn't relate with 'bvec_iter', so that we can
> 
> s/relate/related/
> or
> s/isn't relate with/doesn't relate to/
> 
> > +	 * avoid to extend bio. So far this new field is used for bio based
> 
> s/to extend/extending/
> 
> > +	 * pooling, we will store returning value of underlying queue's
> 
> s/pooling/polling/
> 

Good catch, will fix all in V3.
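
i.e. with your fixes folded in, the quoted part of the comment would
read something like:

	/*
	 * There is a hole at the end of bvec_iter, define one field to
	 * hold something which doesn't relate to 'bvec_iter', so that we
	 * can avoid extending bio. So far this new field is used for bio
	 * based polling, we will store returning value of underlying
	 * queue's ...
	 */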

-- 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-20  5:56         ` [dm-devel] " JeffleXu
@ 2021-03-23 11:39           ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:39 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Sat, Mar 20, 2021 at 01:56:13PM +0800, JeffleXu wrote:
> 
> 
> On 3/19/21 9:46 PM, Ming Lei wrote:
> > On Fri, Mar 19, 2021 at 05:38:38PM +0800, JeffleXu wrote:
> >> I'm thinking how this mechanism could work with *original* bio-based
> >> devices that don't ne built upon mq devices, such as nvdimm. This
> > 
> > non-mq device needs driver to implement io polling by itself, block
> > layer can't help it, and that can't be this patchset's job.
> > 
> >> mechanism (also including my original design) mainly focuses on virtual
> >> devices that built upon mq devices, i.e., md/dm.
> >>
> >> As the original bio-based devices wants to support IO polling in the
> >> future, then they should be somehow distingushed from md/dm.
> >>
> >>
> >> On 3/19/21 12:48 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, track associated underlying bios by per-task
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> In previous version, kfifo is used to implement submission queue, and
> >>> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> >>> So far bio's size is close to 2 cacheline size, and it may not be
> >>> accepted to add new field into bio for solving the scalability issue by
> >>> tracking bios via linked list, switch to bio group list for tracking bio,
> >>> the idea is to reuse .bi_end_io for linking bios into a linked list for
> >>> all sharing same .bi_end_io(call it bio group), which is recovered before
> >>> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> >>> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> >>> provide very limited groups, such as 32 for fixing the scalability issue.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >>>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >>>  block/blk.h               |   9 ++
> >>>  include/linux/blk_types.h |  16 +++-
> >>>  5 files changed, 348 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index 26b7f721cda8..04c043dc60fc 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index efc7a61a84b4..778d25a7e76c 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >>>  		sizeof(struct bio_grp_list_data);
> >>>  }
> >>>  
> >>> +static inline void *bio_grp_data(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_poll;
> >>> +}
> >>> +
> >>> +/* add bio into bio group list, return true if it is added */
> >>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == bio_grp_data(bio)) {
> >>> +			__bio_grp_list_add(&grp->list, bio);
> >>> +			return true;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (i == list->max_nr_grps)
> >>> +		return false;
> >>> +
> >>> +	/* create a new group */
> >>> +	grp = &list->head[i];
> >>> +	bio_list_init(&grp->list);
> >>> +	grp->grp_data = bio_grp_data(bio);
> >>> +	__bio_grp_list_add(&grp->list, bio);
> >>> +	list->nr_grps++;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->max_nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == grp_data)
> >>> +			return i;
> >>> +	}
> >>> +	for (i = 0; i < list->max_nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			return i;
> >>> +	}
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +/* Move as many as possible groups from 'src' to 'dst' */
> >>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> >>> +{
> >>> +	int i, j, cnt = 0;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> >>> +		grp = &src->head[i];
> >>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> >>> +		if (j < 0)
> >>> +			break;
> >>> +		if (bio_grp_list_grp_empty(&dst->head[j]))
> >>> +			dst->head[j].grp_data = grp->grp_data;
> >>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> >>> +		bio_list_init(&grp->list);
> >>> +		cnt++;
> >>> +	}
> >>> +
> >>> +	src->nr_grps -= cnt;
> >>> +}
> >>
> >> Not sure why it's checked in reverse order (starting from 'nr_grps - 1').
> > 
> > Then for bio group list in submission side, only first .nr_grps groups
> > includes bios.
> > 
> >>
> >>
> >>> +
> >>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >>>  {
> >>>  	pc->sq = (void *)pc + sizeof(*pc);
> >>> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/*
> >>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> >>> +	 * and completion. However, bio crypt may update .bi_end_io during
> >>> +	 * submitting, so simply not support bio based polling for this
> >>> +	 * setting.
> >>> +	 */
> >>> +	if (likely(!bio_has_crypt_ctx(bio))) {
> >>> +		/* track this bio via bio group list */
> >>> +		spin_lock(&pc->sq_lock);
> >>> +		queued = bio_grp_list_add(pc->sq, bio);
> >>> +		spin_unlock(&pc->sq_lock);
> >>> +	} else {
> >>> +		queued = false;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * and the bio is always completed from the pair poll context.
> >>> +	 *
> >>> +	 * One invariant is that if bio isn't completed, blk_poll() will
> >>> +	 * be called by passing cookie returned from submitting this bio.
> >>> +	 */
> >>> +	if (!queued)
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +	else
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +
> >>> +	return queued;
> >>> +}
> >>> +
> >>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>> +{
> >>> +	bio->bi_iter.bi_private_data = cookie;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				blk_bio_poll_post_submit(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>
> >> If input @ioc is NULL, it will still return cookie (returned by
> >> __submit_bio()), which will call into blk_bio_poll(), which is not
> >> expected. (It can pass blk_queue_poll() check in blk_poll(), e.g., dm
> >> device itself is marked as QUEUE_FLAG_POLL, but ->io_context of IO
> >> submitting process failed to allocate the io_context, and thus calls
> >> __submit_bio_noacct_int() with @ioc is NULL).
> > 
> > Good catch, looks the following change is needed:
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 778d25a7e76c..dba12ba0fa48 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1211,7 +1211,8 @@ static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >  		return __submit_bio_noacct_poll(bio, ioc);
> >  
> > -	return __submit_bio_noacct_int(bio, NULL);
> > +	 __submit_bio_noacct_int(bio, NULL);
> > +	return 0;
> >  }
> 
> Looks good as far as now no original bio-based device supporting IO polling.
> 
> 
> >  
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > 
> >>
> >>
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +
> >>> +	__submit_bio_noacct_int(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (READ_ONCE(pc->sq->nr_grps))
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>
> >>
> >>> +	return __submit_bio_noacct_int(bio, NULL);
> >>> +}
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 03f59915fe2c..f26950a51f4a 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_iter.bi_private_data;
> >>> +}
> >>> +
> >>> +static int blk_mq_poll_io(struct bio *bio)
> >>> +{
> >>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>> +	int ret = 0;
> >>> +
> >>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>> +		struct blk_mq_hw_ctx *hctx =
> >>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>> +
> >>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	int ret = 0;
> >>> +	int i;
> >>> +
> >>> +	/*
> >>> +	 * Poll hw queue first.
> >>> +	 *
> >>> +	 * TODO: limit max poll times and make sure to not poll same
> >>> +	 * hw queue one more time.
> >>> +	 */
> >>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> >>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> >>> +		struct bio *bio;
> >>> +
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			continue;
> >>> +
> >>> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> >>> +			ret += blk_mq_poll_io(bio);
> >>> +	}
> >>> +
> >>> +	/* reap bios */
> >>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> >>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> >>> +		struct bio *bio;
> >>> +		struct bio_list bl;
> >>> +
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			continue;
> >>> +
> >>> +		bio_list_init(&bl);
> >>> +
> >>> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> >>> +			if (bio_flagged(bio, BIO_DONE)) {
> >>> +
> >>> +				/* now recover original data */
> >>> +				bio->bi_poll = grp->grp_data;
> >>> +
> >>> +				/* clear BIO_END_BY_POLL and end me really */
> >>> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> >>> +				bio_endio(bio);
> >>> +			} else {
> >>> +				__bio_grp_list_add(&bl, bio);
> >>> +			}
> >>> +		}
> >>> +		__bio_grp_list_merge(&grp->list, &bl);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int __blk_bio_poll_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *submit_ctx,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	/*
> >>> +	 * Move IO submission result from submission queue in submission
> >>> +	 * context to poll queue of poll context.
> >>> +	 */
> >>> +	spin_lock(&submit_ctx->sq_lock);
> >>> +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> >>> +	spin_unlock(&submit_ctx->sq_lock);
> >>> +
> >>> +	return blk_bio_poll_and_end_io(q, poll_ctx);
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_io(struct request_queue *q,
> >>> +		struct io_context *submit_ioc,
> >>> +		struct io_context *poll_ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> >>> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> >>> +	int ret;
> >>> +
> >>> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> >>> +		mutex_lock(&poll_ctx->pq_lock);
> >>
> >> Why mutex is used to protect pq here rather than spinlock? Where will
> > 
> > spinlock should be fine.
> > 
> >> the polling routine go to sleep?
> > 
> > The current blk_poll() can go to sleep really.
> 
> I know that hybrid polling can go to sleep. Except that, it seems no
> other place can sleep?
> 
> 
> > 
> >>
> >> Besides, how to protect the concurrent bio_list operation to sq between
> >> producer (submission routine) and consumer (polling routine)? As far as
> > 
> > submit_ctx->sq_lock is held in __blk_bio_poll_io() for moving bios
> > from sq to pq, see __blk_bio_poll_io().
> > 
> >> I understand, pc->sq_lock is used to prevent concurrent access from
> >> multiple submission processes, while pc->pq_lock is used to prevent
> >> concurrent access from multiple polling processes.
> > 
> > Usually poll context needn't any lock except for shared io context,
> > because blk_bio_poll_ctx->pq is only accessed in poll context, and
> > it is still per-task.
> > 
> >>
> >>
> >>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> >>> +		mutex_unlock(&poll_ctx->pq_lock);
> >>> +	} else {
> >>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static bool blk_bio_ioc_valid(struct task_struct *t)
> >>> +{
> >>> +	if (!t)
> >>> +		return false;
> >>> +
> >>> +	if (!t->io_context)
> >>> +		return false;
> >>> +
> >>> +	if (!t->io_context->data)
> >>> +		return false;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> >>> +{
> >>> +	struct io_context *poll_ioc = current->io_context;
> >>> +	pid_t pid;
> >>> +	struct task_struct *submit_task;
> >>> +	int ret;
> >>> +
> >>> +	pid = (pid_t)cookie;
> >>> +
> >>> +	/* io poll often share io submission context */
> >>> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> >>> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> >>> +
> >>
> >>> +	submit_task = find_get_task_by_vpid(pid);
> >>
> >> What if the process to which the returned cookie refers has exited?
> >> find_get_task_by_vpid() will return NULL, thus blk_poll() won't help
> >> reap anything, while maybe there are still bios in the poll context,
> >> waiting to be reaped.
> >>
> >> Maybe we need to flush the poll context when a task detaches from the
> >> io_context.
> > 
> > Yeah, I know that issue, and just not address it in RFC stage.
> > 
> > It can be handled in the following way:
> > 
> > 1) drain all bios in submission context until all bios are completed before
> > it exits since the code won't sleep.
> > 
> > OR
> > 
> > 2) schedule wq for completing all submitted bios.
> > 
> > If poll context exits, and there is still bios not completed, not sure
> > if it can happen because the current in-tree code has the same issue.
> > 
> >>
> >>
> >>> +	if (likely(blk_bio_ioc_valid(submit_task)))
> >>> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> >>> +				poll_ioc);
> >>
> >> poll_ioc may be invalid in this case, since the previous
> >> blk_create_io_context() in blk_bio_poll() may fail.
> > 
> > Yeah, it can be addressed by above patch, 0 will be returned
> > for this case.
> > 
> 
> Not really. I mean in the case of 'current->pid != pid', poll_ioc, i.e.,
> current->io_context could be invalid, i.e., current->io_context is NULL,
> or current->io_context->data is NULL. Maybe it should be fixed by
> 
> +	if (likely(blk_bio_ioc_valid(submit_task) && blk_bio_ioc_valid(current)))
> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> +				poll_ioc);
> 
> 
> Can't understand why the above patch you mentioned could fix this.

I meant the patch above in which 0 is returned from '__submit_bio_noacct()'.

For the submission context, the cookie is either zero or refers to a valid
submission context from which bios can be fetched.

But the poll context could still be invalid; in that case it is fine to poll
on the submission context directly, and I will handle that in V3.
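
For illustration, a rough sketch of that V3 direction could look like the
following (untested, reusing the helpers quoted above; the real V3 code may
well differ):

static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
{
	pid_t pid = (pid_t)cookie;
	struct task_struct *submit_task;
	int ret = 0;

	/* io poll usually shares the io submission context */
	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
		return blk_bio_poll_io(q, current->io_context,
				       current->io_context);

	submit_task = find_get_task_by_vpid(pid);
	if (likely(blk_bio_ioc_valid(submit_task))) {
		/*
		 * If the poller has no usable io_context, fall back to
		 * polling on the submission context directly.
		 */
		struct io_context *poll_ioc = blk_bio_ioc_valid(current) ?
			current->io_context : submit_task->io_context;

		ret = blk_bio_poll_io(q, submit_task->io_context, poll_ioc);
	}
	if (submit_task)
		put_task_struct(submit_task);
	return ret;
}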


Thanks,
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
@ 2021-03-23 11:39           ` Ming Lei
  0 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:39 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer

On Sat, Mar 20, 2021 at 01:56:13PM +0800, JeffleXu wrote:
> 
> 
> On 3/19/21 9:46 PM, Ming Lei wrote:
> > On Fri, Mar 19, 2021 at 05:38:38PM +0800, JeffleXu wrote:
> >> I'm thinking how this mechanism could work with *original* bio-based
> >> devices that don't ne built upon mq devices, such as nvdimm. This
> > 
> > non-mq device needs driver to implement io polling by itself, block
> > layer can't help it, and that can't be this patchset's job.
> > 
> >> mechanism (also including my original design) mainly focuses on virtual
> >> devices that built upon mq devices, i.e., md/dm.
> >>
> >> As the original bio-based devices wants to support IO polling in the
> >> future, then they should be somehow distingushed from md/dm.
> >>
> >>
> >> On 3/19/21 12:48 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, track associated underlying bios by per-task
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> In previous version, kfifo is used to implement submission queue, and
> >>> Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> >>> So far bio's size is close to 2 cacheline size, and it may not be
> >>> accepted to add new field into bio for solving the scalability issue by
> >>> tracking bios via linked list, switch to bio group list for tracking bio,
> >>> the idea is to reuse .bi_end_io for linking bios into a linked list for
> >>> all sharing same .bi_end_io(call it bio group), which is recovered before
> >>> really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> >>> Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> >>> provide very limited groups, such as 32 for fixing the scalability issue.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >>>  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >>>  block/blk.h               |   9 ++
> >>>  include/linux/blk_types.h |  16 +++-
> >>>  5 files changed, 348 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index 26b7f721cda8..04c043dc60fc 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index efc7a61a84b4..778d25a7e76c 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >>>  		sizeof(struct bio_grp_list_data);
> >>>  }
> >>>  
> >>> +static inline void *bio_grp_data(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_poll;
> >>> +}
> >>> +
> >>> +/* add bio into bio group list, return true if it is added */
> >>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == bio_grp_data(bio)) {
> >>> +			__bio_grp_list_add(&grp->list, bio);
> >>> +			return true;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (i == list->max_nr_grps)
> >>> +		return false;
> >>> +
> >>> +	/* create a new group */
> >>> +	grp = &list->head[i];
> >>> +	bio_list_init(&grp->list);
> >>> +	grp->grp_data = bio_grp_data(bio);
> >>> +	__bio_grp_list_add(&grp->list, bio);
> >>> +	list->nr_grps++;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->max_nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == grp_data)
> >>> +			return i;
> >>> +	}
> >>> +	for (i = 0; i < list->max_nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			return i;
> >>> +	}
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +/* Move as many as possible groups from 'src' to 'dst' */
> >>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> >>> +{
> >>> +	int i, j, cnt = 0;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> >>> +		grp = &src->head[i];
> >>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> >>> +		if (j < 0)
> >>> +			break;
> >>> +		if (bio_grp_list_grp_empty(&dst->head[j]))
> >>> +			dst->head[j].grp_data = grp->grp_data;
> >>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> >>> +		bio_list_init(&grp->list);
> >>> +		cnt++;
> >>> +	}
> >>> +
> >>> +	src->nr_grps -= cnt;
> >>> +}
> >>
> >> Not sure why it's checked in reverse order (starting from 'nr_grps - 1').
> > 
> > Then for bio group list in submission side, only first .nr_grps groups
> > includes bios.
> > 
> >>
> >>
> >>> +
> >>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >>>  {
> >>>  	pc->sq = (void *)pc + sizeof(*pc);
> >>> @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/*
> >>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> >>> +	 * and completion. However, bio crypt may update .bi_end_io during
> >>> +	 * submitting, so simply not support bio based polling for this
> >>> +	 * setting.
> >>> +	 */
> >>> +	if (likely(!bio_has_crypt_ctx(bio))) {
> >>> +		/* track this bio via bio group list */
> >>> +		spin_lock(&pc->sq_lock);
> >>> +		queued = bio_grp_list_add(pc->sq, bio);
> >>> +		spin_unlock(&pc->sq_lock);
> >>> +	} else {
> >>> +		queued = false;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * and the bio is always completed from the pair poll context.
> >>> +	 *
> >>> +	 * One invariant is that if bio isn't completed, blk_poll() will
> >>> +	 * be called by passing cookie returned from submitting this bio.
> >>> +	 */
> >>> +	if (!queued)
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +	else
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +
> >>> +	return queued;
> >>> +}
> >>> +
> >>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>> +{
> >>> +	bio->bi_iter.bi_private_data = cookie;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				blk_bio_poll_post_submit(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>
> >> If input @ioc is NULL, it will still return cookie (returned by
> >> __submit_bio()), which will call into blk_bio_poll(), which is not
> >> expected. (It can pass blk_queue_poll() check in blk_poll(), e.g., dm
> >> device itself is marked as QUEUE_FLAG_POLL, but ->io_context of IO
> >> submitting process failed to allocate the io_context, and thus calls
> >> __submit_bio_noacct_int() with @ioc is NULL).
> > 
> > Good catch, looks the following change is needed:
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 778d25a7e76c..dba12ba0fa48 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1211,7 +1211,8 @@ static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >  		return __submit_bio_noacct_poll(bio, ioc);
> >  
> > -	return __submit_bio_noacct_int(bio, NULL);
> > +	 __submit_bio_noacct_int(bio, NULL);
> > +	return 0;
> >  }
> 
> Looks good as far as now no original bio-based device supporting IO polling.
> 
> 
> >  
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > 
> >>
> >>
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +
> >>> +	__submit_bio_noacct_int(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (READ_ONCE(pc->sq->nr_grps))
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>
> >>
> >>> +	return __submit_bio_noacct_int(bio, NULL);
> >>> +}
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 03f59915fe2c..f26950a51f4a 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_iter.bi_private_data;
> >>> +}
> >>> +
> >>> +static int blk_mq_poll_io(struct bio *bio)
> >>> +{
> >>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>> +	int ret = 0;
> >>> +
> >>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>> +		struct blk_mq_hw_ctx *hctx =
> >>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>> +
> >>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	int ret = 0;
> >>> +	int i;
> >>> +
> >>> +	/*
> >>> +	 * Poll hw queue first.
> >>> +	 *
> >>> +	 * TODO: limit max poll times and make sure to not poll same
> >>> +	 * hw queue one more time.
> >>> +	 */
> >>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> >>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> >>> +		struct bio *bio;
> >>> +
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			continue;
> >>> +
> >>> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> >>> +			ret += blk_mq_poll_io(bio);
> >>> +	}
> >>> +
> >>> +	/* reap bios */
> >>> +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> >>> +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> >>> +		struct bio *bio;
> >>> +		struct bio_list bl;
> >>> +
> >>> +		if (bio_grp_list_grp_empty(grp))
> >>> +			continue;
> >>> +
> >>> +		bio_list_init(&bl);
> >>> +
> >>> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> >>> +			if (bio_flagged(bio, BIO_DONE)) {
> >>> +
> >>> +				/* now recover original data */
> >>> +				bio->bi_poll = grp->grp_data;
> >>> +
> >>> +				/* clear BIO_END_BY_POLL and end me really */
> >>> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> >>> +				bio_endio(bio);
> >>> +			} else {
> >>> +				__bio_grp_list_add(&bl, bio);
> >>> +			}
> >>> +		}
> >>> +		__bio_grp_list_merge(&grp->list, &bl);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int __blk_bio_poll_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *submit_ctx,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	/*
> >>> +	 * Move IO submission result from submission queue in submission
> >>> +	 * context to poll queue of poll context.
> >>> +	 */
> >>> +	spin_lock(&submit_ctx->sq_lock);
> >>> +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> >>> +	spin_unlock(&submit_ctx->sq_lock);
> >>> +
> >>> +	return blk_bio_poll_and_end_io(q, poll_ctx);
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_io(struct request_queue *q,
> >>> +		struct io_context *submit_ioc,
> >>> +		struct io_context *poll_ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> >>> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> >>> +	int ret;
> >>> +
> >>> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> >>> +		mutex_lock(&poll_ctx->pq_lock);
> >>
> >> Why mutex is used to protect pq here rather than spinlock? Where will
> > 
> > spinlock should be fine.
> > 
> >> the polling routine go to sleep?
> > 
> > The current blk_poll() can go to sleep really.
> 
> I know that hybrid polling can go to sleep. Except that, it seems no
> other place can sleep?
> 
> 
> > 
> >>
> >> Besides, how to protect the concurrent bio_list operation to sq between
> >> producer (submission routine) and consumer (polling routine)? As far as
> > 
> > submit_ctx->sq_lock is held in __blk_bio_poll_io() for moving bios
> > from sq to pq, see __blk_bio_poll_io().
> > 
> >> I understand, pc->sq_lock is used to prevent concurrent access from
> >> multiple submission processes, while pc->pq_lock is used to prevent
> >> concurrent access from multiple polling processes.
> > 
> > Usually poll context needn't any lock except for shared io context,
> > because blk_bio_poll_ctx->pq is only accessed in poll context, and
> > it is still per-task.
> > 
> >>
> >>
> >>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> >>> +		mutex_unlock(&poll_ctx->pq_lock);
> >>> +	} else {
> >>> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static bool blk_bio_ioc_valid(struct task_struct *t)
> >>> +{
> >>> +	if (!t)
> >>> +		return false;
> >>> +
> >>> +	if (!t->io_context)
> >>> +		return false;
> >>> +
> >>> +	if (!t->io_context->data)
> >>> +		return false;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> >>> +{
> >>> +	struct io_context *poll_ioc = current->io_context;
> >>> +	pid_t pid;
> >>> +	struct task_struct *submit_task;
> >>> +	int ret;
> >>> +
> >>> +	pid = (pid_t)cookie;
> >>> +
> >>> +	/* io poll often share io submission context */
> >>> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> >>> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> >>> +
> >>
> >>> +	submit_task = find_get_task_by_vpid(pid);
> >>
> >> What if the process to which the returned cookie refers has exited?
> >> find_get_task_by_vpid() will return NULL, thus blk_poll() won't help
> >> reap anything, while maybe there are still bios in the poll context,
> >> waiting to be reaped.
> >>
> >> Maybe we need to flush the poll context when a task detaches from the
> >> io_context.
> > 
> > Yeah, I know that issue, and just not address it in RFC stage.
> > 
> > It can be handled in the following way:
> > 
> > 1) drain all bios in submission context until all bios are completed before
> > it exits since the code won't sleep.
> > 
> > OR
> > 
> > 2) schedule wq for completing all submitted bios.
> > 
> > If poll context exits, and there is still bios not completed, not sure
> > if it can happen because the current in-tree code has the same issue.
> > 
> >>
> >>
> >>> +	if (likely(blk_bio_ioc_valid(submit_task)))
> >>> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> >>> +				poll_ioc);
> >>
> >> poll_ioc may be invalid in this case, since the previous
> >> blk_create_io_context() in blk_bio_poll() may fail.
> > 
> > Yeah, it can be addressed by above patch, 0 will be returned
> > for this case.
> > 
> 
> Not really. I mean in the case of 'current->pid != pid', poll_ioc, i.e.,
> current->io_context could be invalid, i.e., current->io_context is NULL,
> or current->io_context->data is NULL. Maybe it should be fixed by
> 
> +	if (likely(blk_bio_ioc_valid(submit_task) && blk_bio_ioc_valid(current)))
> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> +				poll_ioc);
> 
> 
> Can't understand why the above patch you mentioned could fix this.

I meant the patch above in which 0 is returned from '__submit_bio_noacct()'.

For the submission context, the cookie is either zero or refers to a valid
submission context from which bios can be fetched.

But the poll context could still be invalid; in that case it is fine to poll
on the submission context directly, and I will handle that in V3.


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-19 18:38     ` [dm-devel] " Mike Snitzer
@ 2021-03-23 11:55       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu, dm-devel

On Fri, Mar 19, 2021 at 02:38:11PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> 
> This is awkward because bio-based IO polling doesn't exist upstream yet,
> so this header should be covering your approach as a clean slate, e.g.:
> 
> The complexity associated with frequent bio splitting with bio-based
> devices makes it difficult to implement IO polling efficiently because
> the fan-out of underlying hw queues that need to be polled (as a
> side-effect of bios being split) creates a need for more easily mapping
> a group of bios to the hw queues that need to be polled.
> 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> 
> Maybe be more precise by covering how all bios from that task's
> submission context will be moved to poll queue of poll context?

Strictly speaking, adding a poll context isn't a must; it just makes local
polling more efficient now that V2 tracks bios via the bio grouping list.

> 
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> 
> Awkward to reference "previous version", maybe instead say:
> 
> In was found that kfifo doesn't scale well for a submission queue as
> queue depth is increased, so a new mechanism for tracking bios is
> needed.

OK.

> 
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (bio_grp_list_grp_empty(grp))
> > +			return i;
> > +	}
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j]))
> > +			dst->head[j].grp_data = grp->grp_data;
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_TAG;
> >  }
> >  
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submitting, so simply not support bio based polling for this
> 
> s/submitting/submission/
> s/not/don't/
> 
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +	}
> > +
> > +	/*
> > +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +	 * and the bio is always completed from the pair poll context.
> 
> This reads awkwardly.. "pair poll context"?

Actually I meant the poll context in which blk_poll(cookie) is called, where
'cookie' is the exact return value of submit_bio() for the bio-based driver's
bio. The underlying blk-mq bios complete that bio-based driver's bio and
share the same poll context with it.
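
Roughly, the contract between the two sides looks like this (illustrative
only, modeled on the in-tree HIPRI direct-IO pattern; 'q' is the bio-based
device's request_queue and 'done' stands for whatever completion state the
real ->bi_end_io sets):

	blk_qc_t cookie = submit_bio(bio);	/* == this task's pid for bio-based queues */

	while (!READ_ONCE(done))		/* 'done' is a hypothetical flag set by the
						 * bio's real ->bi_end_io */
		blk_poll(q, cookie, true);	/* cookie leads __blk_bio_poll() back to the
						 * submission context that still holds the bio */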

> 
> > +	 *
> > +	 * One invariant is that if bio isn't completed, blk_poll() will
> > +	 * be called by passing cookie returned from submitting this bio.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +
> > +	return queued;
> > +}
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> 
> I mentioned this in previous mail, but what is it you're trying to
> convey with _int?
> 
> Think we need a better function name here.

I will think about a better name; I haven't got one yet.

> 
> > @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				blk_bio_poll_post_submit(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +
> > +	__submit_bio_noacct_int(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (READ_ONCE(pc->sq->nr_grps))
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return 0;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> > +	return __submit_bio_noacct_int(bio, NULL);
> > +}
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..f26950a51f4a 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > +{
> > +	return bio->bi_iter.bi_private_data;
> > +}
> > +
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	int ret = 0;
> > +	int i;
> > +
> > +	/*
> > +	 * Poll hw queue first.
> > +	 *
> > +	 * TODO: limit max poll times and make sure to not poll same
> > +	 * hw queue one more time.
> > +	 */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> > +			ret += blk_mq_poll_io(bio);
> > +	}
> > +
> > +	/* reap bios */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +		struct bio_list bl;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		bio_list_init(&bl);
> > +
> > +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> > +			if (bio_flagged(bio, BIO_DONE)) {
> > +
> 
> Remove empty newline? ^
> 
> > +				/* now recover original data */
> > +				bio->bi_poll = grp->grp_data;
> > +
> > +				/* clear BIO_END_BY_POLL and end me really */
> > +				bio_clear_flag(bio, BIO_END_BY_POLL);
> > +				bio_endio(bio);
> > +			} else {
> > +				__bio_grp_list_add(&bl, bio);
> > +			}
> > +		}
> > +		__bio_grp_list_merge(&grp->list, &bl);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int __blk_bio_poll_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *submit_ctx,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	/*
> > +	 * Move IO submission result from submission queue in submission
> > +	 * context to poll queue of poll context.
> > +	 */
> > +	spin_lock(&submit_ctx->sq_lock);
> > +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> > +	spin_unlock(&submit_ctx->sq_lock);
> > +
> > +	return blk_bio_poll_and_end_io(q, poll_ctx);
> > +}
> > +
> > +static int blk_bio_poll_io(struct request_queue *q,
> > +		struct io_context *submit_ioc,
> > +		struct io_context *poll_ioc)
> > +{
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> > +	int ret;
> > +
> > +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> > +		mutex_lock(&poll_ctx->pq_lock);
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +		mutex_unlock(&poll_ctx->pq_lock);
> > +	} else {
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static bool blk_bio_ioc_valid(struct task_struct *t)
> > +{
> > +	if (!t)
> > +		return false;
> > +
> > +	if (!t->io_context)
> > +		return false;
> > +
> > +	if (!t->io_context->data)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> > +{
> > +	struct io_context *poll_ioc = current->io_context;
> > +	pid_t pid;
> > +	struct task_struct *submit_task;
> > +	int ret;
> > +
> > +	pid = (pid_t)cookie;
> > +
> > +	/* io poll often share io submission context */
> > +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> > +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> > +
> > +	submit_task = find_get_task_by_vpid(pid);
> > +	if (likely(blk_bio_ioc_valid(submit_task)))
> > +		ret = blk_bio_poll_io(q, submit_task->io_context,
> > +				poll_ioc);
> 
> Style nit, but think it fine to put "poll_ioc);" on previous line.
> Otherwise, best to add braces.
> 
> > +	else
> > +		ret = 0;
> > +
> > +	put_task_struct(submit_task);
> > +
> > +	return ret;
> > +}
> > +
> >  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >  {
> > +	long state;
> > +
> > +	/* no need to poll */
> > +	if (cookie == 0)
> > +		return 0;
> > +
> >  	/*
> >  	 * Create poll queue for storing poll bio and its cookie from
> >  	 * submission queue
> >  	 */
> >  	blk_create_io_context(q, true);
> >  
> > +	state = current->state;
> > +	do {
> > +		int ret;
> > +
> > +		ret = __blk_bio_poll(q, cookie);
> > +		if (ret > 0) {
> > +			__set_current_state(TASK_RUNNING);
> > +			return ret;
> > +		}
> > +
> > +		if (signal_pending_state(state, current))
> > +			__set_current_state(TASK_RUNNING);
> > +
> > +		if (current->state == TASK_RUNNING)
> > +			return 1;
> > +		if (ret < 0 || !spin)
> > +			break;
> > +		cpu_relax();
> > +	} while (!need_resched());
> > +
> > +	__set_current_state(TASK_RUNNING);
> >  	return 0;
> >  }
> >  
> > @@ -3893,7 +4064,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >  	struct blk_mq_hw_ctx *hctx;
> >  	long state;
> >  
> > -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> > +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
> >  		return 0;
> >  
> >  	if (current->plug)
> > diff --git a/block/blk.h b/block/blk.h
> > index ae58a706327e..05b9f5eafdd1 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -403,4 +403,13 @@ static inline void blk_create_io_context(struct request_queue *q,
> >  		bio_poll_ctx_alloc(ioc);
> >  }
> >  
> > +BIO_LIST_HELPERS(__bio_grp_list, poll);
> 
> Why the leading double underscore?
> Especially given the following 2 helpers don't have leading double underscore?
> 
> Whichever you decide, just looking for consistency.

The double-underscore helpers are the base helpers used to build the helpers
without the double underscore.
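
For reference, the generated base helper just mirrors bio_list_add() but
chains bios through ->bi_poll instead of ->bi_next; a paraphrased expansion
(assuming BIO_LIST_HELPERS() follows the standard bio_list helpers, not
copied from the patch) would be:

static inline void __bio_grp_list_add(struct bio_list *bl, struct bio *bio)
{
	/* link via ->bi_poll so ->bi_next stays free for normal bio lists */
	bio->bi_poll = NULL;

	if (bl->tail)
		bl->tail->bi_poll = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

bio_grp_list_add() and bio_grp_list_move() in blk-core.c are then built on
top of these primitives.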


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [dm-devel] [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
@ 2021-03-23 11:55       ` Ming Lei
  0 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 11:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu, Christoph Hellwig

On Fri, Mar 19, 2021 at 02:38:11PM -0400, Mike Snitzer wrote:
> On Thu, Mar 18 2021 at 12:48pm -0400,
> Ming Lei <ming.lei@redhat.com> wrote:
> 
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> 
> This is awkward because bio-based IO polling doesn't exist upstream yet,
> so this header should be covering your approach as a clean slate, e.g.:
> 
> The complexity associated with frequent bio splitting with bio-based
> devices makes it difficult to implement IO polling efficiently because
> the fan-out of underlying hw queues that need to be polled (as a
> side-effect of bios being split) creates a need for more easily mapping
> a group of bios to the hw queues that need to be polled.
> 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> 
> Maybe be more precise by covering how all bios from that task's
> submission context will be moved to poll queue of poll context?

Strictly speaking, adding a poll context isn't a must; it just makes local
polling more efficient now that V2 tracks bios via the bio grouping list.

> 
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> > In previous version, kfifo is used to implement submission queue, and
> > Jeffle Xu found that kfifo can't scale well in case of high queue depth.
> 
> Awkward to reference "previous version", maybe instead say:
> 
> In was found that kfifo doesn't scale well for a submission queue as
> queue depth is increased, so a new mechanism for tracking bios is
> needed.

OK.

> 
> > So far bio's size is close to 2 cacheline size, and it may not be
> > accepted to add new field into bio for solving the scalability issue by
> > tracking bios via linked list, switch to bio group list for tracking bio,
> > the idea is to reuse .bi_end_io for linking bios into a linked list for
> > all sharing same .bi_end_io(call it bio group), which is recovered before
> > really end bio, since BIO_END_BY_POLL is added for enhancing this point.
> > Usually .bi_end_bio is same for all bios in same layer, so it is enough to
> > provide very limited groups, such as 32 for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          | 149 +++++++++++++++++++++++++++++++-
> >  block/blk-mq.c            | 173 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |   9 ++
> >  include/linux/blk_types.h |  16 +++-
> >  5 files changed, 348 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index efc7a61a84b4..778d25a7e76c 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,77 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +	for (i = 0; i < list->max_nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (bio_grp_list_grp_empty(grp))
> > +			return i;
> > +	}
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j]))
> > +			dst->head[j].grp_data = grp->grp_data;
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +937,46 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_TAG;
> >  }
> >  
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submitting, so simply not support bio based polling for this
> 
> s/submitting/submission/
> s/not/don't/
> 
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +	}
> > +
> > +	/*
> > +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +	 * and the bio is always completed from the pair poll context.
> 
> This reads awkwardly.. "pair poll context"?

Actually I meant the poll context in which blk_poll(cookie) is called, where
'cookie' is the exact return value of submit_bio() for the bio-based driver's
bio. The underlying blk-mq bios complete that bio-based driver's bio and
share the same poll context with it.

> 
> > +	 *
> > +	 * One invariant is that if bio isn't completed, blk_poll() will
> > +	 * be called by passing cookie returned from submitting this bio.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +
> > +	return queued;
> > +}
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1020,7 +1131,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> 
> I mentioned this in previous mail, but what is it you're trying to
> convey with _int?
> 
> Think we need a better function name here.

I will think about a better name; I haven't got one yet.

> 
> > @@ -1043,7 +1154,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				blk_bio_poll_post_submit(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1069,6 +1189,31 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +
> > +	__submit_bio_noacct_int(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (READ_ONCE(pc->sq->nr_grps))
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return 0;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> > +	return __submit_bio_noacct_int(bio, NULL);
> > +}
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..f26950a51f4a 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,185 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > +{
> > +	return bio->bi_iter.bi_private_data;
> > +}
> > +
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	int ret = 0;
> > +	int i;
> > +
> > +	/*
> > +	 * Poll hw queue first.
> > +	 *
> > +	 * TODO: limit max poll times and make sure to not poll same
> > +	 * hw queue one more time.
> > +	 */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> > +			ret += blk_mq_poll_io(bio);
> > +	}
> > +
> > +	/* reap bios */
> > +	for (i = 0; i < poll_ctx->pq->max_nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &poll_ctx->pq->head[i];
> > +		struct bio *bio;
> > +		struct bio_list bl;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		bio_list_init(&bl);
> > +
> > +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> > +			if (bio_flagged(bio, BIO_DONE)) {
> > +
> 
> Remove empty newline? ^
> 
> > +				/* now recover original data */
> > +				bio->bi_poll = grp->grp_data;
> > +
> > +				/* clear BIO_END_BY_POLL and end me really */
> > +				bio_clear_flag(bio, BIO_END_BY_POLL);
> > +				bio_endio(bio);
> > +			} else {
> > +				__bio_grp_list_add(&bl, bio);
> > +			}
> > +		}
> > +		__bio_grp_list_merge(&grp->list, &bl);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int __blk_bio_poll_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *submit_ctx,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	/*
> > +	 * Move IO submission result from submission queue in submission
> > +	 * context to poll queue of poll context.
> > +	 */
> > +	spin_lock(&submit_ctx->sq_lock);
> > +	bio_grp_list_move(poll_ctx->pq, submit_ctx->sq);
> > +	spin_unlock(&submit_ctx->sq_lock);
> > +
> > +	return blk_bio_poll_and_end_io(q, poll_ctx);
> > +}
> > +
> > +static int blk_bio_poll_io(struct request_queue *q,
> > +		struct io_context *submit_ioc,
> > +		struct io_context *poll_ioc)
> > +{
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> > +	int ret;
> > +
> > +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> > +		mutex_lock(&poll_ctx->pq_lock);
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +		mutex_unlock(&poll_ctx->pq_lock);
> > +	} else {
> > +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static bool blk_bio_ioc_valid(struct task_struct *t)
> > +{
> > +	if (!t)
> > +		return false;
> > +
> > +	if (!t->io_context)
> > +		return false;
> > +
> > +	if (!t->io_context->data)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> > +{
> > +	struct io_context *poll_ioc = current->io_context;
> > +	pid_t pid;
> > +	struct task_struct *submit_task;
> > +	int ret;
> > +
> > +	pid = (pid_t)cookie;
> > +
> > +	/* io poll often share io submission context */
> > +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> > +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> > +
> > +	submit_task = find_get_task_by_vpid(pid);
> > +	if (likely(blk_bio_ioc_valid(submit_task)))
> > +		ret = blk_bio_poll_io(q, submit_task->io_context,
> > +				poll_ioc);
> 
> Style nit, but I think it's fine to put "poll_ioc);" on the previous line.
> Otherwise, best to add braces.
> 
> > +	else
> > +		ret = 0;
> > +
> > +	put_task_struct(submit_task);
> > +
> > +	return ret;
> > +}
> > +
> >  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >  {
> > +	long state;
> > +
> > +	/* no need to poll */
> > +	if (cookie == 0)
> > +		return 0;
> > +
> >  	/*
> >  	 * Create poll queue for storing poll bio and its cookie from
> >  	 * submission queue
> >  	 */
> >  	blk_create_io_context(q, true);
> >  
> > +	state = current->state;
> > +	do {
> > +		int ret;
> > +
> > +		ret = __blk_bio_poll(q, cookie);
> > +		if (ret > 0) {
> > +			__set_current_state(TASK_RUNNING);
> > +			return ret;
> > +		}
> > +
> > +		if (signal_pending_state(state, current))
> > +			__set_current_state(TASK_RUNNING);
> > +
> > +		if (current->state == TASK_RUNNING)
> > +			return 1;
> > +		if (ret < 0 || !spin)
> > +			break;
> > +		cpu_relax();
> > +	} while (!need_resched());
> > +
> > +	__set_current_state(TASK_RUNNING);
> >  	return 0;
> >  }
> >  
> > @@ -3893,7 +4064,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >  	struct blk_mq_hw_ctx *hctx;
> >  	long state;
> >  
> > -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> > +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
> >  		return 0;
> >  
> >  	if (current->plug)
> > diff --git a/block/blk.h b/block/blk.h
> > index ae58a706327e..05b9f5eafdd1 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -403,4 +403,13 @@ static inline void blk_create_io_context(struct request_queue *q,
> >  		bio_poll_ctx_alloc(ioc);
> >  }
> >  
> > +BIO_LIST_HELPERS(__bio_grp_list, poll);
> 
> Why the leading double underscore?
> Especially given the following 2 helpers don't have leading double underscore?
> 
> Whichever you decide, just looking for consistency.

The double-underscore helpers are the base helpers used to build the
helpers without the double underscore.
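
For illustration, here is a minimal sketch of that layering (the
non-underscore wrapper name and body below are assumptions, not code
from this patchset):

/*
 * BIO_LIST_HELPERS(__bio_grp_list, poll) generates the base primitives
 * (__bio_grp_list_add(), __bio_grp_list_pop(), __bio_grp_list_merge(), ...)
 * which link bios via ->bi_poll instead of ->bi_next.
 */
static inline void bio_grp_list_add(struct bio_grp_list_data *grp,
				    struct bio *bio)
{
	/* hypothetical group-level helper built on a generated base helper */
	__bio_grp_list_add(&grp->list, bio);
}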


Thanks, 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-23  3:46     ` [dm-devel] " Sagi Grimberg
@ 2021-03-23 12:01       ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-23 12:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu,
	Mike Snitzer, dm-devel

On Mon, Mar 22, 2021 at 08:46:04PM -0700, Sagi Grimberg wrote:
> 
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> 
> Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
> an exported function for failover. nvme-mpath updates bio.bi_dev
> when re-submitting I/Os to an alternate path, so I'm thinking
> that if this function is exported then nvme-mpath could do as little
> as the below to allow polling?
> 
> --
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 92adebfaf86f..e562e296153b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
>         struct nvme_ns_head *head =
>                 container_of(work, struct nvme_ns_head, requeue_work);
>         struct bio *bio, *next;
> +       blk_qc_t cookie;
> 
>         spin_lock_irq(&head->requeue_lock);
>         next = bio_list_get(&head->requeue_list);
> @@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
>                  * path.
>                  */
>                 bio_set_dev(bio, head->disk->part0);
> -               submit_bio_noacct(bio);
> +               cookie = submit_bio_noacct(bio);
> +               blk_bio_poll_post_submit(bio, cookie);
>         }
>  }
> --
> 
> I/O failover will create misalignment from the polling context cpu and
> the submission cpu (running requeue_work), but I don't see if there is
> something that would break here...

I understand requeue shouldn't be a common event, and I guess it is just
fine to fall back to IRQ-based mode?

This patchset actually doesn't cover such bio submission from kernel context.

-- 
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-23 12:01       ` [dm-devel] " Ming Lei
@ 2021-03-23 16:54         ` Sagi Grimberg
  -1 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2021-03-23 16:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu,
	Mike Snitzer, dm-devel


>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>> +{
>>> +	bio->bi_iter.bi_private_data = cookie;
>>> +}
>>> +
>>
>> Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
>> an exported function for failover. nvme-mpath updates bio.bi_dev
>> when re-submitting I/Os to an alternate path, so I'm thinking
>> that if this function is exported then nvme-mpath could do as little
>> as the below to allow polling?
>>
>> --
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 92adebfaf86f..e562e296153b 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
>>          struct nvme_ns_head *head =
>>                  container_of(work, struct nvme_ns_head, requeue_work);
>>          struct bio *bio, *next;
>> +       blk_qc_t cookie;
>>
>>          spin_lock_irq(&head->requeue_lock);
>>          next = bio_list_get(&head->requeue_list);
>> @@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
>>                   * path.
>>                   */
>>                  bio_set_dev(bio, head->disk->part0);
>> -               submit_bio_noacct(bio);
>> +               cookie = submit_bio_noacct(bio);
>> +               blk_bio_poll_post_submit(bio, cookie);
>>          }
>>   }
>> --
>>
>> I/O failover will create misalignment from the polling context cpu and
>> the submission cpu (running requeue_work), but I don't see if there is
>> something that would break here...
> 
> I understand requeue shouldn't be a common event, and I guess it is just
> fine to fall back to IRQ-based mode?

Well, when it fails over, it will probably be directed to the poll
queues. Maybe I'm missing something...

> This patchset actually doesn't cover such bio submission from kernel context.

What is the difference?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-23 16:54         ` [dm-devel] " Sagi Grimberg
@ 2021-03-24  0:10           ` Ming Lei
  -1 siblings, 0 replies; 82+ messages in thread
From: Ming Lei @ 2021-03-24  0:10 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu,
	Mike Snitzer, dm-devel

On Tue, Mar 23, 2021 at 09:54:36AM -0700, Sagi Grimberg wrote:
> 
> > > > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > > > +{
> > > > +	bio->bi_iter.bi_private_data = cookie;
> > > > +}
> > > > +
> > > 
> > > Hey Ming, thinking about nvme-mpath, I'm thinking that this should be
> > > an exported function for failover. nvme-mpath updates bio.bi_dev
> > > when re-submitting I/Os to an alternate path, so I'm thinking
> > > that if this function is exported then nvme-mpath could do as little
> > > as the below to allow polling?
> > > 
> > > --
> > > diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> > > index 92adebfaf86f..e562e296153b 100644
> > > --- a/drivers/nvme/host/multipath.c
> > > +++ b/drivers/nvme/host/multipath.c
> > > @@ -345,6 +345,7 @@ static void nvme_requeue_work(struct work_struct *work)
> > >          struct nvme_ns_head *head =
> > >                  container_of(work, struct nvme_ns_head, requeue_work);
> > >          struct bio *bio, *next;
> > > +       blk_qc_t cookie;
> > > 
> > >          spin_lock_irq(&head->requeue_lock);
> > >          next = bio_list_get(&head->requeue_list);
> > > @@ -359,7 +360,8 @@ static void nvme_requeue_work(struct work_struct *work)
> > >                   * path.
> > >                   */
> > >                  bio_set_dev(bio, head->disk->part0);
> > > -               submit_bio_noacct(bio);
> > > +               cookie = submit_bio_noacct(bio);
> > > +               blk_bio_poll_post_submit(bio, cookie);
> > >          }
> > >   }
> > > --
> > > 
> > > I/O failover will create misalignment from the polling context cpu and
> > > the submission cpu (running requeue_work), but I don't see if there is
> > > something that would break here...
> > 
> > I understand requeue shouldn't be a common event, and I guess it is just
> > fine to fall back to IRQ-based mode?
> 
> Well, when it fails over, it will probably be directed to the poll
> queues. Maybe I'm missing something...

In this patchset, because the bio isn't submitted directly from the FS,
there is no polling context associated with it, so its HIPRI flag will
be cleared and it falls back to IRQ mode.
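
A minimal sketch of that fallback (the helper name and shape here are
assumptions, not taken from the posted patches):

static void blk_bio_poll_maybe_fallback(struct bio *bio)
{
	struct io_context *ioc = current->io_context;

	/* no per-task bio poll context: strip HIPRI, complete via IRQ */
	if (!ioc || !ioc->data)
		bio->bi_opf &= ~REQ_HIPRI;
}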

> 
> > This patchset actually doesn't cover such bio submission from kernel context.
> 
> What is the difference?

So far the upper layer (io_uring, dio, ...) needs to get the returned cookie,
then pass it to blk_poll().

In this case, the cookie can't be passed back to the FS caller of submit_bio(),
so the bio can't be polled by in-tree code.
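
To illustrate that contract, a sketch of the expected usage (the helper
and the 'done' flag are assumptions; 'done' would be set by the bio's
->bi_end_io on completion):

static void submit_and_poll(struct request_queue *q, struct bio *bio,
			    bool *done)
{
	blk_qc_t cookie = submit_bio(bio);

	/* spin on the cookie until ->bi_end_io marks the bio done */
	while (!READ_ONCE(*done))
		blk_poll(q, cookie, true);
}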



Thanks,
Ming


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll
  2021-03-24  0:10           ` [dm-devel] " Ming Lei
@ 2021-03-24 15:43             ` Sagi Grimberg
  -1 siblings, 0 replies; 82+ messages in thread
From: Sagi Grimberg @ 2021-03-24 15:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jeffle Xu,
	Mike Snitzer, dm-devel


>> Well, when it fails over, it will probably be directed to the poll
>> queues. Maybe I'm missing something...
> 
> In this patchset, because the bio isn't submitted directly from the FS,
> there is no polling context associated with it, so its HIPRI flag will
> be cleared and it falls back to IRQ mode.

I think that's fine for failover I/O...

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2021-03-24 15:43 UTC | newest]

Thread overview: 82+ messages
2021-03-18 16:48 [RFC PATCH V2 00/13] block: support bio based io polling Ming Lei
2021-03-18 16:48 ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 01/13] block: add helper of blk_queue_poll Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-19 16:52   ` Mike Snitzer
2021-03-19 16:52     ` [dm-devel] " Mike Snitzer
2021-03-23 11:17     ` Ming Lei
2021-03-23 11:17       ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 02/13] block: add one helper to free io_context Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 03/13] block: add helper of blk_create_io_context Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 04/13] block: create io poll context for submission and poll task Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-19 17:05   ` Mike Snitzer
2021-03-19 17:05     ` [dm-devel] " Mike Snitzer
2021-03-23 11:23     ` Ming Lei
2021-03-23 11:23       ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 05/13] block: add req flag of REQ_TAG Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-19  7:59   ` JeffleXu
2021-03-19  7:59     ` [dm-devel] " JeffleXu
2021-03-19  8:48     ` Ming Lei
2021-03-19  8:48       ` [dm-devel] " Ming Lei
2021-03-19  9:47       ` JeffleXu
2021-03-19  9:47         ` [dm-devel] " JeffleXu
2021-03-19 17:38   ` Mike Snitzer
2021-03-19 17:38     ` [dm-devel] " Mike Snitzer
2021-03-23 11:26     ` Ming Lei
2021-03-23 11:26       ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 06/13] block: add new field into 'struct bvec_iter' Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-19 17:44   ` Mike Snitzer
2021-03-19 17:44     ` [dm-devel] " Mike Snitzer
2021-03-23 11:29     ` Ming Lei
2021-03-23 11:29       ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 07/13] block/mq: extract one helper function polling hw queue Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 08/13] block: prepare for supporting bio_list via other link Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 09/13] block: use per-task poll context to implement bio based io poll Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 17:26   ` Mike Snitzer
2021-03-18 17:26     ` [dm-devel] " Mike Snitzer
2021-03-18 17:38     ` Mike Snitzer
2021-03-18 17:38       ` Mike Snitzer
2021-03-19  0:30     ` Ming Lei
2021-03-19  0:30       ` [dm-devel] " Ming Lei
2021-03-19  9:38   ` JeffleXu
2021-03-19  9:38     ` [dm-devel] " JeffleXu
2021-03-19 13:46     ` Ming Lei
2021-03-19 13:46       ` [dm-devel] " Ming Lei
2021-03-20  5:56       ` JeffleXu
2021-03-20  5:56         ` [dm-devel] " JeffleXu
2021-03-23 11:39         ` Ming Lei
2021-03-23 11:39           ` [dm-devel] " Ming Lei
2021-03-19 18:38   ` Mike Snitzer
2021-03-19 18:38     ` [dm-devel] " Mike Snitzer
2021-03-23 11:55     ` Ming Lei
2021-03-23 11:55       ` [dm-devel] " Ming Lei
2021-03-23  3:46   ` Sagi Grimberg
2021-03-23  3:46     ` [dm-devel] " Sagi Grimberg
2021-03-23 12:01     ` Ming Lei
2021-03-23 12:01       ` [dm-devel] " Ming Lei
2021-03-23 16:54       ` Sagi Grimberg
2021-03-23 16:54         ` [dm-devel] " Sagi Grimberg
2021-03-24  0:10         ` Ming Lei
2021-03-24  0:10           ` [dm-devel] " Ming Lei
2021-03-24 15:43           ` Sagi Grimberg
2021-03-24 15:43             ` [dm-devel] " Sagi Grimberg
2021-03-18 16:48 ` [RFC PATCH V2 10/13] block: add queue_to_disk() to get gendisk from request_queue Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 11/13] block: add poll_capable method to support bio-based IO polling Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 12/13] dm: support IO polling for bio-based dm device Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-18 16:48 ` [RFC PATCH V2 13/13] blk-mq: limit hw queues to be polled in each blk_poll() Ming Lei
2021-03-18 16:48   ` [dm-devel] " Ming Lei
2021-03-19  5:50 ` [RFC PATCH V2 00/13] block: support bio based io polling JeffleXu
2021-03-19  5:50   ` [dm-devel] " JeffleXu
2021-03-19 18:45 ` Mike Snitzer
2021-03-19 18:45   ` [dm-devel] " Mike Snitzer
