* [PATCH V3 00/13] block: support bio based io polling
@ 2021-03-24 12:19 ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Hi,

Add per-task io poll context for holding HIPRI blk-mq/underlying bios
queued from bio based driver's io submission context, and reuse one bio
padding field for storing 'cookie' returned from submit_bio() for these
bios. Also explicitly end these bios in poll context by adding two
new bio flags.

In this way, we no longer need to blindly poll all underlying hw queues,
as is done in Jeffle's patches; instead we only poll the hw queues that
actually have HIPRI IO queued.

Usually io submission and io poll share the same context, so the added io
poll context data behaves much like a stack variable, and the cost of
saving bios is low.
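
To make the flow concrete, here is a rough sketch (not the exact patch
code; add_bio_to_submission_queue() below is only a placeholder name) of
how a HIPRI bio travels from a bio based driver's submission context to
blk_poll():

/*
 * Rough sketch only.  The submission side parks each HIPRI bio in the
 * per-task submission queue and keeps the underlying blk-mq cookie in
 * the bio itself; the cookie handed back to the FS is simply the
 * submitting task's pid, which blk_poll() later uses to locate that
 * task's poll context.
 */
static blk_qc_t sketch_submit_hipri(struct bio *bio)
{
	struct blk_bio_poll_ctx *pc = blk_get_bio_poll_ctx();
	blk_qc_t cookie = submit_bio_noacct(bio);	/* underlying cookie */

	bio_set_private_data(bio, cookie);		/* stash it in the bio */

	spin_lock(&pc->sq_lock);
	add_bio_to_submission_queue(pc->sq, bio);	/* placeholder helper */
	spin_unlock(&pc->sq_lock);

	return (blk_qc_t)current->pid;		/* cookie seen by blk_poll() */
}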

Any comments are welcome.

V3:
	- fix cookie returned for bio based driver, as suggested by Jeffle
	  Xu
	- draining pending bios when submission context is exiting
	- patch style and comment fix, as suggested by Mike
	- allow poll context data to be NULL by always polling on submission
	  queue
	- remove RFC, and reviewed-by

V2:
	- address the queue depth scalability issue reported by Jeffle via a
	bio group list: reuse .bi_end_io for linking bios which share the same
	.bi_end_io, and support 32 such groups in the submit queue. This
	solves the scalability issue caused by kfifo. Before a bio is really
	ended, its .bi_end_io is restored from the group head.

Jeffle Xu (4):
  block/mq: extract one helper function polling hw queue
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll_capable method to support bio-based IO polling
  dm: support IO polling for bio-based dm device

Ming Lei (9):
  block: add helper of blk_queue_poll
  block: add one helper to free io_context
  block: add helper of blk_create_io_context
  block: create io poll context for submission and poll task
  block: add req flag of REQ_POLL_CTX
  block: add new field into 'struct bvec_iter'
  block: prepare for supporting bio_list via other link
  block: use per-task poll context to implement bio based io polling
  blk-mq: limit hw queues to be polled in each blk_poll()

 block/bio.c                   |   5 +
 block/blk-core.c              | 251 ++++++++++++++++++++++++++--
 block/blk-ioc.c               |  14 +-
 block/blk-mq.c                | 300 +++++++++++++++++++++++++++++++++-
 block/blk-sysfs.c             |  14 +-
 block/blk.h                   |  65 ++++++++
 drivers/md/dm-table.c         |  24 +++
 drivers/md/dm.c               |  14 ++
 drivers/nvme/host/core.c      |   2 +-
 include/linux/bio.h           | 132 +++++++--------
 include/linux/blk_types.h     |  22 ++-
 include/linux/blkdev.h        |   4 +
 include/linux/bvec.h          |   8 +
 include/linux/device-mapper.h |   1 +
 include/linux/iocontext.h     |   2 +
 include/trace/events/kyber.h  |   6 +-
 16 files changed, 770 insertions(+), 94 deletions(-)

-- 
2.29.2


* [PATCH V3 01/13] block: add helper of blk_queue_poll
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei,
	Chaitanya Kulkarni

There are already three users, and more will come, so add such a helper.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 2 +-
 block/blk-mq.c           | 3 +--
 drivers/nvme/host/core.c | 2 +-
 include/linux/blkdev.h   | 1 +
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..a31371d55b9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -836,7 +836,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
 	switch (bio_op(bio)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..63c81df3b8b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3869,8 +3869,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
 		return 0;
 
 	if (current->plug)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 0896e21642be..34b8c78f88e0 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -956,7 +956,7 @@ static void nvme_execute_rq_polled(struct request_queue *q,
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 
-	WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
+	WARN_ON_ONCE(!blk_queue_poll(q));
 
 	rq->cmd_flags |= REQ_HIPRI;
 	rq->end_io_data = &wait;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..89a01850cf12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,6 +665,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.29.2


* [PATCH V3 02/13] block: add one helper to free io_context
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Prepare for putting the bio poll queue into io_context by adding one
helper for freeing io_context.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-ioc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 57299f860d41..b0cde18c4b8c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -17,6 +17,11 @@
  */
 static struct kmem_cache *iocontext_cachep;
 
+static inline void free_io_context(struct io_context *ioc)
+{
+	kmem_cache_free(iocontext_cachep, ioc);
+}
+
 /**
  * get_io_context - increment reference count to io_context
  * @ioc: io_context to get
@@ -129,7 +134,7 @@ static void ioc_release_fn(struct work_struct *work)
 
 	spin_unlock_irq(&ioc->lock);
 
-	kmem_cache_free(iocontext_cachep, ioc);
+	free_io_context(ioc);
 }
 
 /**
@@ -164,7 +169,7 @@ void put_io_context(struct io_context *ioc)
 	}
 
 	if (free_ioc)
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 }
 
 /**
@@ -278,7 +283,7 @@ int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 	    (task == current || !(task->flags & PF_EXITING)))
 		task->io_context = ioc;
 	else
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 
 	ret = task->io_context ? 0 : -EBUSY;
 
-- 
2.29.2


* [PATCH V3 03/13] block: add helper of blk_create_io_context
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Add one helper for creating the io context, preparing for support of
efficient bio based io polling.

Meanwhile, move the io_context creation before the check of the bio's
REQ_HIPRI flag, because a following patch may clear REQ_HIPRI when the
io_context can't be created.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a31371d55b9d..d58f8a0c80de 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,6 +792,18 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+static inline void blk_create_io_context(struct request_queue *q)
+{
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
+	blk_create_io_context(q);
+
 	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
@@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		break;
 	}
 
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
-
 	if (blk_throtl_bio(bio)) {
 		blkcg_bio_issue_init(bio);
 		return false;
-- 
2.29.2


* [PATCH V3 04/13] block: create io poll context for submission and poll task
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Create a per-task io poll context for both the IO submission and poll
tasks if the queue is bio based and supports polling.

This io polling context includes two queues:

1) submission queue (sq) for storing HIPRI bios, written by the
   submission task and read by the poll task.
2) polling queue (pq) for holding data moved from the sq; only used in
   the poll context for running bio polling.

Following patches will support bio based io polling.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
 block/blk-ioc.c           |  1 +
 block/blk-mq.c            | 14 ++++++++
 block/blk.h               | 45 +++++++++++++++++++++++++
 include/linux/iocontext.h |  2 ++
 5 files changed, 121 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d58f8a0c80de..4671bbf31fd3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
-static inline void blk_create_io_context(struct request_queue *q)
+static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
 {
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
+	struct io_context *ioc = current->io_context;
+
+	return ioc ? ioc->data : NULL;
+}
+
+static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
+{
+	return sizeof(struct bio_grp_list) + nr_grps *
+		sizeof(struct bio_grp_list_data);
+}
+
+static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
+{
+	pc->sq = (void *)pc + sizeof(*pc);
+	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
+
+	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
+	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
+
+	spin_lock_init(&pc->sq_lock);
+	spin_lock_init(&pc->pq_lock);
+}
+
+void bio_poll_ctx_alloc(struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc;
+	unsigned int size = sizeof(*pc) +
+		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
+		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
+
+	pc = kzalloc(size, GFP_ATOMIC);
+	if (pc) {
+		bio_poll_ctx_init(pc);
+		if (cmpxchg(&ioc->data, NULL, (void *)pc))
+			kfree(pc);
+	}
+}
+
+static inline bool blk_queue_support_bio_poll(struct request_queue *q)
+{
+	return !queue_is_mq(q) && blk_queue_poll(q);
+}
+
+static inline void blk_bio_poll_preprocess(struct request_queue *q,
+		struct bio *bio)
+{
+	if (!(bio->bi_opf & REQ_HIPRI))
+		return;
+
+	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+		bio->bi_opf &= ~REQ_HIPRI;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	blk_create_io_context(q);
+	/*
+	 * Create per-task io poll ctx if bio polling supported and HIPRI
+	 * set.
+	 */
+	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			(bio->bi_opf & REQ_HIPRI));
 
-	if (!blk_queue_poll(q))
-		bio->bi_opf &= ~REQ_HIPRI;
+	blk_bio_poll_preprocess(q, bio);
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b0cde18c4b8c..5574c398eff6 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
 
 static inline void free_io_context(struct io_context *ioc)
 {
+	kfree(ioc->data);
 	kmem_cache_free(iocontext_cachep, ioc);
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 63c81df3b8b5..c832faa52ca0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	/*
+	 * Create poll queue for storing poll bio and its cookie from
+	 * submission queue
+	 */
+	blk_create_io_context(q, true);
+
+	return 0;
+}
+
 /**
  * blk_poll - poll for IO completions
  * @q:  the queue
@@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	if (current->plug)
 		blk_flush_plug_list(current->plug, false);
 
+	if (!queue_is_mq(q))
+		return blk_bio_poll(q, cookie, spin);
+
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	/*
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..424949f2226d 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -357,4 +357,49 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/* Grouping bios that share same data into one list */
+struct bio_grp_list_data {
+	void *grp_data;
+
+	/* all bios in this list share same 'grp_data' */
+	struct bio_list list;
+};
+
+struct bio_grp_list {
+	unsigned int max_nr_grps, nr_grps;
+	struct bio_grp_list_data head[0];
+};
+
+struct blk_bio_poll_ctx {
+	spinlock_t sq_lock;
+	struct bio_grp_list *sq;
+
+	spinlock_t pq_lock;
+	struct bio_grp_list *pq;
+};
+
+#define BLK_BIO_POLL_SQ_SZ		16U
+#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
+
+void bio_poll_ctx_alloc(struct io_context *ioc);
+
+static inline void blk_create_io_context(struct request_queue *q,
+		bool need_poll_ctx)
+{
+	struct io_context *ioc;
+
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+
+	ioc = current->io_context;
+	if (need_poll_ctx && unlikely(ioc && !ioc->data))
+		bio_poll_ctx_alloc(ioc);
+}
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 0a9dc40b7be8..f9a467571356 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -110,6 +110,8 @@ struct io_context {
 	struct io_cq __rcu	*icq_hint;
 	struct hlist_head	icq_list;
 
+	void			*data;
+
 	struct work_struct release_work;
 };
 
-- 
2.29.2


* [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Add one req flag, REQ_POLL_CTX, which will be used in a following patch
for supporting bio based IO polling.

Specifically, this flag helps us to:

1) propagate the marking to clones: the request flags are copied in
__bio_clone_fast(), so if we mark one FS bio as REQ_POLL_CTX, all bios
cloned from this FS bio will be marked as REQ_POLL_CTX too.

2) create the per-task io polling context only when the bio based queue
supports polling and the submitted bio is HIPRI. The per-task io poll
context is created during submit_bio() before marking this HIPRI bio as
REQ_POLL_CTX, so we can avoid creating such an io polling context when a
cloned bio carrying REQ_POLL_CTX is submitted from another kernel
context.

3) recognize, while polling IOs from all underlying queues of the bio
based device, which IOs need to be polled in bio based style; this will
be applied in a following patch (see the sketch after this list).
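
As a concrete illustration of point 3) and of the comment added to
blk_bio_poll_preprocess() below, a bio based driver that allocates a
brand-new bio (rather than cloning the FS bio) and submits it from the
same task context could opt it into polling by hand; the function name
and allocation details here are purely illustrative:

/*
 * Illustrative sketch only, not part of this series.  A newly
 * allocated bio does not inherit the FS bio's flags, so the driver
 * tags it explicitly; blk_poll() can then reap it as well.
 */
static void example_submit_backing_bio(struct bio *fs_bio,
				       struct block_device *lower_bdev)
{
	struct bio *backing = bio_alloc(GFP_NOIO, 1);

	bio_set_dev(backing, lower_bdev);
	backing->bi_opf = bio_op(fs_bio);

	/* opt into bio based polling for HIPRI FS IO */
	if (fs_bio->bi_opf & REQ_HIPRI)
		backing->bi_opf |= REQ_HIPRI | REQ_POLL_CTX;

	submit_bio(backing);
}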

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 25 ++++++++++++++++++++++++-
 include/linux/blk_types.h |  4 ++++
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4671bbf31fd3..eb07d61cfdc2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
 static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		struct bio *bio)
 {
+	bool mq;
+
 	if (!(bio->bi_opf & REQ_HIPRI))
 		return;
 
-	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+	/*
+	 * Can't support bio based IO polling without per-task poll ctx
+	 *
+	 * We have created per-task io poll context, and mark this
+	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
+	 * submitted from another kernel context, we won't create bio
+	 * poll context for it, and that bio can be completed by IRQ;
+	 * 2) If such bio is submitted from current context, we will
+	 * complete it via blk_poll(); 3) If driver knows that one
+	 * underlying bio allocated from driver is for FS bio, meantime
+	 * it is submitted in current context, driver can mark such bio
+	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
+	 * via blk_poll too.
+	 */
+	mq = queue_is_mq(q);
+	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
 		bio->bi_opf &= ~REQ_HIPRI;
+	else if (!mq)
+		bio->bi_opf |= REQ_POLL_CTX;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 	/*
 	 * Create per-task io poll ctx if bio polling supported and HIPRI
 	 * set.
+	 *
+	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
+	 * from FS and allocate io polling context.
 	 */
 	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			!(bio->bi_opf & REQ_POLL_CTX) &&
 			(bio->bi_opf & REQ_HIPRI));
 
 	blk_bio_poll_preprocess(q, bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..99160d588c2d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -394,6 +394,9 @@ enum req_flag_bits {
 
 	__REQ_HIPRI,
 
+	/* for marking IOs originated from same FS bio in same context */
+	__REQ_POLL_CTX,
+
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
@@ -418,6 +421,7 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
+#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
-- 
2.29.2


* [PATCH V3 06/13] block: add new field into 'struct bvec_iter'
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

There is a hole at the end of 'struct bvec_iter', so put a new field
there and use it to store the cookie returned from submit_bio(), for
supporting bio based polling.

This avoids extending the bio unnecessarily.

Meanwhile, add two helpers to get/set this field.
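
For orientation, a hypothetical usage sketch of the two helpers (the
surrounding submission/poll code is simplified and not part of this
patch):

/* store the blk-mq cookie when the underlying bio is submitted ... */
static void sketch_store_cookie(struct bio *bio)
{
	blk_qc_t cookie = submit_bio_noacct(bio);

	bio_set_private_data(bio, cookie);
}

/* ... and read it back when that bio is polled for completion */
static int sketch_poll_one(struct request_queue *q, struct bio *bio)
{
	blk_qc_t cookie = bio_get_private_data(bio);

	return blk_poll(q, cookie, true);
}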

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk.h          | 10 ++++++++++
 include/linux/bvec.h |  8 ++++++++
 2 files changed, 18 insertions(+)

diff --git a/block/blk.h b/block/blk.h
index 424949f2226d..7e16419904fa 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -402,4 +402,14 @@ static inline void blk_create_io_context(struct request_queue *q,
 		bio_poll_ctx_alloc(ioc);
 }
 
+static inline unsigned int bio_get_private_data(struct bio *bio)
+{
+	return bio->bi_iter.bi_private_data;
+}
+
+static inline void bio_set_private_data(struct bio *bio, unsigned int data)
+{
+	bio->bi_iter.bi_private_data = data;
+}
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ff832e698efb..547ad7526960 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -43,6 +43,14 @@ struct bvec_iter {
 
 	unsigned int            bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
+
+	/*
+	 * There is a hole at the end of bvec_iter, add one new field to hold
+	 * something which isn't related with 'bvec_iter', so that we can
+	 * avoid extending bio. So far this new field is used for bio based
+	 * polling, we will store returning value of submit_bio() here.
+	 */
+	unsigned int		bi_private_data;
 };
 
 struct bvec_iter_all {
-- 
2.29.2


* [PATCH V3 07/13] block/mq: extract one helper function polling hw queue
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Extract the logic of polling one hw queue, together with the related
statistics handling, into a helper function.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c832faa52ca0..03f59915fe2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,19 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static inline int blk_mq_poll_hctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	int ret;
+
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	/*
@@ -3908,11 +3921,8 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	do {
 		int ret;
 
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
+		ret = blk_mq_poll_hctx(q, hctx);
 		if (ret > 0) {
-			hctx->poll_success++;
 			__set_current_state(TASK_RUNNING);
 			return ret;
 		}
-- 
2.29.2


* [PATCH V3 08/13] block: prepare for supporting bio_list via other link
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

So far the bio list helpers always use .bi_next to traverse the list,
but we will support linking bios via another bio field.

Prepare for that by adding a macro so that users can define another set
of helpers which link bios via a different bio field.
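
For example, assuming struct bio had a second link field named bi_poll
(purely hypothetical here), a full second set of helpers could be
generated with a single line:

/*
 * Hypothetical illustration: this would generate bio_poll_list_add(),
 * bio_poll_list_add_head(), bio_poll_list_merge(),
 * bio_poll_list_merge_head() and bio_poll_list_pop(), all of which
 * walk a bio_list through ->bi_poll instead of ->bi_next.
 */
BIO_LIST_HELPERS(bio_poll_list, poll);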

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bio.h | 132 +++++++++++++++++++++++---------------------
 1 file changed, 68 insertions(+), 64 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index d0246c92a6e8..619edd26a6c0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -608,75 +608,11 @@ static inline unsigned bio_list_size(const struct bio_list *bl)
 	return sz;
 }
 
-static inline void bio_list_add(struct bio_list *bl, struct bio *bio)
-{
-	bio->bi_next = NULL;
-
-	if (bl->tail)
-		bl->tail->bi_next = bio;
-	else
-		bl->head = bio;
-
-	bl->tail = bio;
-}
-
-static inline void bio_list_add_head(struct bio_list *bl, struct bio *bio)
-{
-	bio->bi_next = bl->head;
-
-	bl->head = bio;
-
-	if (!bl->tail)
-		bl->tail = bio;
-}
-
-static inline void bio_list_merge(struct bio_list *bl, struct bio_list *bl2)
-{
-	if (!bl2->head)
-		return;
-
-	if (bl->tail)
-		bl->tail->bi_next = bl2->head;
-	else
-		bl->head = bl2->head;
-
-	bl->tail = bl2->tail;
-}
-
-static inline void bio_list_merge_head(struct bio_list *bl,
-				       struct bio_list *bl2)
-{
-	if (!bl2->head)
-		return;
-
-	if (bl->head)
-		bl2->tail->bi_next = bl->head;
-	else
-		bl->tail = bl2->tail;
-
-	bl->head = bl2->head;
-}
-
 static inline struct bio *bio_list_peek(struct bio_list *bl)
 {
 	return bl->head;
 }
 
-static inline struct bio *bio_list_pop(struct bio_list *bl)
-{
-	struct bio *bio = bl->head;
-
-	if (bio) {
-		bl->head = bl->head->bi_next;
-		if (!bl->head)
-			bl->tail = NULL;
-
-		bio->bi_next = NULL;
-	}
-
-	return bio;
-}
-
 static inline struct bio *bio_list_get(struct bio_list *bl)
 {
 	struct bio *bio = bl->head;
@@ -686,6 +622,74 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+#define BIO_LIST_HELPERS(_pre, link)					\
+									\
+static inline void _pre##_add(struct bio_list *bl, struct bio *bio)	\
+{									\
+	bio->bi_##link = NULL;						\
+									\
+	if (bl->tail)							\
+		bl->tail->bi_##link = bio;				\
+	else								\
+		bl->head = bio;						\
+									\
+	bl->tail = bio;							\
+}									\
+									\
+static inline void _pre##_add_head(struct bio_list *bl, struct bio *bio) \
+{									\
+	bio->bi_##link = bl->head;					\
+									\
+	bl->head = bio;							\
+									\
+	if (!bl->tail)							\
+		bl->tail = bio;						\
+}									\
+									\
+static inline void _pre##_merge(struct bio_list *bl, struct bio_list *bl2) \
+{									\
+	if (!bl2->head)							\
+		return;							\
+									\
+	if (bl->tail)							\
+		bl->tail->bi_##link = bl2->head;			\
+	else								\
+		bl->head = bl2->head;					\
+									\
+	bl->tail = bl2->tail;						\
+}									\
+									\
+static inline void _pre##_merge_head(struct bio_list *bl,		\
+				       struct bio_list *bl2)		\
+{									\
+	if (!bl2->head)							\
+		return;							\
+									\
+	if (bl->head)							\
+		bl2->tail->bi_##link = bl->head;			\
+	else								\
+		bl->tail = bl2->tail;					\
+									\
+	bl->head = bl2->head;						\
+}									\
+									\
+static inline struct bio *_pre##_pop(struct bio_list *bl)		\
+{									\
+	struct bio *bio = bl->head;					\
+									\
+	if (bio) {							\
+		bl->head = bl->head->bi_##link;				\
+		if (!bl->head)						\
+			bl->tail = NULL;				\
+									\
+		bio->bi_##link = NULL;					\
+	}								\
+									\
+	return bio;							\
+}									\
+
+BIO_LIST_HELPERS(bio_list, next);
+
 /*
  * Increment chain count for the bio. Make sure the CHAIN flag update
  * is visible before the raised count.
-- 
2.29.2


* [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Currently bio based IO polling needs to poll all hw queues blindly. This
is very inefficient, and one big reason is that we can't pass any bio
submission result to blk_poll().

In the IO submission context, track associated underlying bios via the
per-task submission queue, store the returned 'cookie' in
bio->bi_iter.bi_private_data, and return current->pid to the caller of
submit_bio() for any bio based driver's IO submitted from the FS.

In the IO poll context, the passed cookie tells us the PID of the
submission context, so we can find bios in the per-task io poll context
of that submission context. Move bios from the submission queue to the
poll queue of the poll context, and keep polling until these bios are
ended; a bio is removed from the poll queue once it is ended. Add the
bio flags BIO_DONE and BIO_END_BY_POLL for this purpose.
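
A tiny user-space model may help to see the two levels of cookie; this is
only a sketch with made-up names, not kernel code:

#include <stdio.h>

struct tracked_io {
	unsigned int hw_cookie;		/* returned by the underlying queue */
	struct tracked_io *next;
};

struct submit_ctx {			/* simplified per-task submission queue */
	int pid;
	struct tracked_io *head;
};

static unsigned int submit_one(struct submit_ctx *ctx, struct tracked_io *io,
			       unsigned int hw_cookie)
{
	io->hw_cookie = hw_cookie;	/* keep the low-level cookie with the IO */
	io->next = ctx->head;
	ctx->head = io;
	return (unsigned int)ctx->pid;	/* top-level cookie: whom to poll later */
}

int main(void)
{
	struct submit_ctx ctx = { .pid = 42, .head = NULL };
	struct tracked_io a, b;
	struct tracked_io *io;
	unsigned int cookie;

	cookie = submit_one(&ctx, &a, 3);
	cookie = submit_one(&ctx, &b, 7);

	/* poll side: the cookie names the task; its list names the hw queues */
	for (io = ctx.head; io; io = io->next)
		printf("cookie %u -> poll hw queue behind cookie %u\n",
		       cookie, io->hw_cookie);
	return 0;
}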

It was found in Jeffle Xu's test that kfifo doesn't scale well for a
submission queue as queue depth is increased, so a new mechanism for
tracking bios is needed. The bio's size is already close to two cache
lines, and adding a new field to the bio just to track bios via a linked
list may not be accepted. Instead, switch to a bio group list for
tracking bios: the idea is to reuse .bi_end_io for linking all bios that
share the same .bi_end_io (call it a bio group) into a linked list, and
to recover .bi_end_io before really ending the bio; BIO_END_BY_POLL is
added to guarantee this. Usually .bi_end_io is the same for all bios in
the same layer, so it is enough to provide a very limited number of
groups, such as 16 or less, to fix the scalability issue.
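
The grouping trick itself can be sketched in user space as well (made-up
types, not the kernel structures): the callback pointer doubles as the
list link while the object is tracked, the shared callback is saved once
in the group head, and it is restored right before the object is
completed:

#include <stdio.h>

struct obj;
typedef void (*end_fn)(struct obj *);

struct obj {
	union {
		end_fn end;		/* normal role: completion callback */
		struct obj *link;	/* reused as "next" while grouped */
	};
	int done;
};

struct group {
	end_fn end;			/* the shared callback, saved once */
	struct obj *head;
};

static void grp_add(struct group *g, struct obj *o)
{
	if (!g->head)
		g->end = o->end;	/* remember the shared callback */
	o->link = g->head;
	g->head = o;
}

static void grp_reap(struct group *g)
{
	struct obj *o = g->head, *next;

	g->head = NULL;
	for (; o; o = next) {
		next = o->link;
		if (o->done) {
			o->end = g->end;	/* recover callback, then finish */
			o->end(o);
		} else {
			o->link = g->head;	/* keep tracking unfinished ones */
			g->head = o;
		}
	}
}

static void done_cb(struct obj *o)
{
	printf("completed %p\n", (void *)o);
}

int main(void)
{
	struct group g = { .end = NULL, .head = NULL };
	struct obj a = { .end = done_cb, .done = 0 };
	struct obj b = { .end = done_cb, .done = 0 };

	grp_add(&g, &a);
	grp_add(&g, &b);
	a.done = 1;
	grp_reap(&g);	/* ends 'a', keeps 'b' grouped for later polling */
	return 0;
}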

Usually submission shares its context with io poll. The per-task poll
context is just like a stack variable, and it is cheap to move data
between the two per-task queues.

Also when the submission task is exiting, drain pending IOs in the context
until all are done.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c               |   5 +
 block/blk-core.c          | 154 ++++++++++++++++++++++++-
 block/blk-ioc.c           |   2 +
 block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
 block/blk.h               |  10 ++
 include/linux/blk_types.h |  18 ++-
 6 files changed, 419 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 26b7f721cda8..04c043dc60fc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  **/
 void bio_endio(struct bio *bio)
 {
+	/* BIO_END_BY_POLL has to be set before calling submit_bio */
+	if (bio_flagged(bio, BIO_END_BY_POLL)) {
+		bio_set_flag(bio, BIO_DONE);
+		return;
+	}
 again:
 	if (!bio_remaining_done(bio))
 		return;
diff --git a/block/blk-core.c b/block/blk-core.c
index eb07d61cfdc2..95f7e36c8759 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
 		sizeof(struct bio_grp_list_data);
 }
 
+static inline void *bio_grp_data(struct bio *bio)
+{
+	return bio->bi_poll;
+}
+
+/* add bio into bio group list, return true if it is added */
+static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
+{
+	int i;
+	struct bio_grp_list_data *grp;
+
+	for (i = 0; i < list->nr_grps; i++) {
+		grp = &list->head[i];
+		if (grp->grp_data == bio_grp_data(bio)) {
+			__bio_grp_list_add(&grp->list, bio);
+			return true;
+		}
+	}
+
+	if (i == list->max_nr_grps)
+		return false;
+
+	/* create a new group */
+	grp = &list->head[i];
+	bio_list_init(&grp->list);
+	grp->grp_data = bio_grp_data(bio);
+	__bio_grp_list_add(&grp->list, bio);
+	list->nr_grps++;
+
+	return true;
+}
+
+static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
+{
+	int i;
+	struct bio_grp_list_data *grp;
+
+	for (i = 0; i < list->nr_grps; i++) {
+		grp = &list->head[i];
+		if (grp->grp_data == grp_data)
+			return i;
+	}
+
+	if (i < list->max_nr_grps) {
+		grp = &list->head[i];
+		bio_list_init(&grp->list);
+		return i;
+	}
+
+	return -1;
+}
+
+/* Move as many groups as possible from 'src' to 'dst' */
+void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
+{
+	int i, j, cnt = 0;
+	struct bio_grp_list_data *grp;
+
+	for (i = src->nr_grps - 1; i >= 0; i--) {
+		grp = &src->head[i];
+		j = bio_grp_list_find_grp(dst, grp->grp_data);
+		if (j < 0)
+			break;
+		if (bio_grp_list_grp_empty(&dst->head[j])) {
+			dst->head[j].grp_data = grp->grp_data;
+			dst->nr_grps++;
+		}
+		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
+		bio_list_init(&grp->list);
+		cnt++;
+	}
+
+	src->nr_grps -= cnt;
+}
+
 static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
 {
 	pc->sq = (void *)pc + sizeof(*pc);
@@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		bio->bi_opf |= REQ_POLL_CTX;
 }
 
+static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
+{
+	/*
+	 * The bio has been added to the per-task poll queue, mark it as
+	 * BIO_END_BY_POLL, so that this bio is always completed from
+	 * blk_poll(), which is provided with the cookie from this bio's
+	 * submission.
+	 */
+	if (!queued)
+		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
+	else
+		bio_set_flag(bio, BIO_END_BY_POLL);
+}
+
+static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	unsigned int queued;
+
+	/*
+	 * We rely on immutable .bi_end_io between blk-mq bio submission
+	 * and completion. However, bio crypt may update .bi_end_io during
+	 * submission, so simply don't support bio based polling for this
+	 * setting.
+	 */
+	if (likely(!bio_has_crypt_ctx(bio))) {
+		/* track this bio via bio group list */
+		spin_lock(&pc->sq_lock);
+		queued = bio_grp_list_add(pc->sq, bio);
+		blk_bio_poll_mark_queued(bio, queued);
+		spin_unlock(&pc->sq_lock);
+	} else {
+		queued = false;
+		blk_bio_poll_mark_queued(bio, false);
+	}
+
+	return queued;
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
  * bio_list_on_stack[1] contains bios that were submitted before the current
  *	->submit_bio_bio, but that haven't been processed yet.
  */
-static blk_qc_t __submit_bio_noacct(struct bio *bio)
+static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
@@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		if (ioc && queue_is_mq(q) &&
+				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
+			bool queued = blk_bio_poll_prep_submit(ioc, bio);
+
+			ret = __submit_bio(bio);
+			if (queued)
+				bio_set_private_data(bio, ret);
+		} else {
+			ret = __submit_bio(bio);
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	return ret;
 }
 
+static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
+		struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+
+	__submit_bio_noacct_ctx(bio, ioc);
+
+	/* bio submissions queued to per-task poll context */
+	if (READ_ONCE(pc->sq->nr_grps))
+		return current->pid;
+
+	/* swapper's pid is 0, but it can't submit poll IO for us */
+	return BLK_QC_T_BIO_NONE;
+}
+
+static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
+{
+	struct io_context *ioc = current->io_context;
+
+	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
+		return __submit_bio_noacct_poll(bio, ioc);
+
+	__submit_bio_noacct_ctx(bio, NULL);
+
+	return BLK_QC_T_BIO_NONE;
+}
+
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 {
 	struct bio_list bio_list[2] = { };
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 5574c398eff6..b9a512f066f8 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
 
 static inline void free_io_context(struct io_context *ioc)
 {
+	blk_bio_poll_io_drain(ioc);
+
 	kfree(ioc->data);
 	kmem_cache_free(iocontext_cachep, ioc);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 03f59915fe2c..76a90da83d9c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
 	return ret;
 }
 
+static int blk_mq_poll_io(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	blk_qc_t cookie = bio_get_private_data(bio);
+	int ret = 0;
+
+	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
+		struct blk_mq_hw_ctx *hctx =
+			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+
+		ret += blk_mq_poll_hctx(q, hctx);
+	}
+	return ret;
+}
+
+static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
+{
+	int ret = 0;
+	int i;
+
+	/*
+	 * Poll hw queue first.
+	 *
+	 * TODO: limit max poll times and make sure to not poll same
+	 * hw queue one more time.
+	 */
+	for (i = 0; i < grps->nr_grps; i++) {
+		struct bio_grp_list_data *grp = &grps->head[i];
+		struct bio *bio;
+
+		if (bio_grp_list_grp_empty(grp))
+			continue;
+
+		for (bio = grp->list.head; bio; bio = bio->bi_poll)
+			ret += blk_mq_poll_io(bio);
+	}
+
+	/* reap bios */
+	for (i = 0; i < grps->nr_grps; i++) {
+		struct bio_grp_list_data *grp = &grps->head[i];
+		struct bio *bio;
+		struct bio_list bl;
+
+		if (bio_grp_list_grp_empty(grp))
+			continue;
+
+		bio_list_init(&bl);
+
+		while ((bio = __bio_grp_list_pop(&grp->list))) {
+			if (bio_flagged(bio, BIO_DONE)) {
+				/* now recover original data */
+				bio->bi_poll = grp->grp_data;
+
+				/* clear BIO_END_BY_POLL and end me really */
+				bio_clear_flag(bio, BIO_END_BY_POLL);
+				bio_endio(bio);
+			} else {
+				__bio_grp_list_add(&bl, bio);
+			}
+		}
+		__bio_grp_list_merge(&grp->list, &bl);
+	}
+	return ret;
+}
+
+static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
+{
+	int i, j, k = 0;
+	int cnt = 0;
+
+	for (i = grps->nr_grps - 1; i >= 0; i--) {
+		struct bio_grp_list_data *grp = &grps->head[i];
+		struct bio_grp_list_data *hole = NULL;
+
+		if (bio_grp_list_grp_empty(grp)) {
+			cnt++;
+			continue;
+		}
+
+		for (j = k; j < i; j++) {
+			hole = &grps->head[j];
+			if (bio_grp_list_grp_empty(hole))
+				break;
+		}
+		if (hole == NULL)
+			break;
+		*hole = *grp;
+		cnt++;
+		k = j;
+	}
+
+	grps->nr_grps -= cnt;
+}
+
+#define  MAX_BIO_GRPS_ON_STACK  8
+struct bio_grp_list_stack {
+	unsigned int max_nr_grps, nr_grps;
+	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
+};
+
+static int blk_bio_poll_io(struct io_context *submit_ioc,
+		struct io_context *poll_ioc)
+
+{
+	struct bio_grp_list_stack _bio_grps = {
+		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
+		.nr_grps	= 0
+	};
+	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
+	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
+	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
+		poll_ioc->data : NULL;
+	int ret = 0;
+
+	/*
+	 * Move IO submission result from submission queue in submission
+	 * context to poll queue of poll context.
+	 */
+	spin_lock(&submit_ctx->sq_lock);
+	bio_grp_list_move(bio_grps, submit_ctx->sq);
+	spin_unlock(&submit_ctx->sq_lock);
+
+	/* merge new bios first, then start to poll bios from pq */
+	if (poll_ctx) {
+		spin_lock(&poll_ctx->pq_lock);
+		bio_grp_list_move(poll_ctx->pq, bio_grps);
+		bio_grp_list_move(bio_grps, poll_ctx->pq);
+		spin_unlock(&poll_ctx->pq_lock);
+	}
+
+	do {
+		ret += blk_bio_poll_and_end_io(bio_grps);
+		blk_bio_poll_pack_groups(bio_grps);
+
+		if (bio_grps->nr_grps) {
+			/*
+			 * move back, and keep polling until all can be
+			 * held in either poll queue or submission queue.
+			 */
+			if (poll_ctx) {
+				spin_lock(&poll_ctx->pq_lock);
+				bio_grp_list_move(poll_ctx->pq, bio_grps);
+				spin_unlock(&poll_ctx->pq_lock);
+			} else {
+				spin_lock(&submit_ctx->sq_lock);
+				bio_grp_list_move(submit_ctx->sq, bio_grps);
+				spin_unlock(&submit_ctx->sq_lock);
+			}
+		}
+	} while (bio_grps->nr_grps > 0);
+
+	return ret;
+}
+
+void blk_bio_poll_io_drain(struct io_context *submit_ioc)
+{
+	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
+
+	if (!submit_ctx)
+		return;
+
+	while (submit_ctx->sq->nr_grps > 0) {
+		blk_bio_poll_io(submit_ioc, NULL);
+		cpu_relax();
+	}
+}
+
+static bool blk_bio_ioc_valid(struct task_struct *t)
+{
+	if (!t)
+		return false;
+
+	if (!t->io_context)
+		return false;
+
+	if (!t->io_context->data)
+		return false;
+
+	return true;
+}
+
+static int __blk_bio_poll(blk_qc_t cookie)
+{
+	struct io_context *poll_ioc = current->io_context;
+	pid_t pid;
+	struct task_struct *submit_task;
+	int ret;
+
+	pid = (pid_t)cookie;
+
+	/* io poll often share io submission context */
+	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
+		return blk_bio_poll_io(poll_ioc, poll_ioc);
+
+	submit_task = find_get_task_by_vpid(pid);
+	if (likely(blk_bio_ioc_valid(submit_task)))
+		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
+	else
+		ret = 0;
+
+	if (submit_task)
+		put_task_struct(submit_task);
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
+	long state;
+
+	/* no need to poll */
+	if (cookie == BLK_QC_T_BIO_NONE)
+		return 0;
+
 	/*
 	 * Create poll queue for storing poll bio and its cookie from
 	 * submission queue
 	 */
 	blk_create_io_context(q, true);
 
+	state = current->state;
+	do {
+		int ret;
+
+		ret = __blk_bio_poll(cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
 	return 0;
 }
 
@@ -3893,7 +4125,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
+	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
 		return 0;
 
 	if (current->plug)
diff --git a/block/blk.h b/block/blk.h
index 7e16419904fa..948b7b19ef48 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -381,6 +381,7 @@ struct blk_bio_poll_ctx {
 #define BLK_BIO_POLL_SQ_SZ		16U
 #define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
 
+void blk_bio_poll_io_drain(struct io_context *submit_ioc);
 void bio_poll_ctx_alloc(struct io_context *ioc);
 
 static inline void blk_create_io_context(struct request_queue *q,
@@ -412,4 +413,13 @@ static inline void bio_set_private_data(struct bio *bio, unsigned int data)
 	bio->bi_iter.bi_private_data = data;
 }
 
+BIO_LIST_HELPERS(__bio_grp_list, poll);
+
+static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
+{
+	return bio_list_empty(&grp->list);
+}
+
+void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99160d588c2d..beaeb3729f11 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -235,7 +235,18 @@ struct bio {
 
 	struct bvec_iter	bi_iter;
 
-	bio_end_io_t		*bi_end_io;
+	union {
+		bio_end_io_t		*bi_end_io;
+		/*
+		 * bio based io polling needs to track bios via a bio group
+		 * list which groups bios by their .bi_end_io, and the
+		 * original .bi_end_io is saved into the group head. It is
+		 * recovered before the bio is really ended. BIO_END_BY_POLL
+		 * makes sure that this bio won't be ended before
+		 * .bi_end_io is recovered.
+		 */
+		struct bio		*bi_poll;
+	};
 
 	void			*bi_private;
 #ifdef CONFIG_BLK_CGROUP
@@ -304,6 +315,9 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
+	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
+	BIO_DONE,
 	BIO_FLAG_LAST
 };
 
@@ -513,6 +527,8 @@ typedef unsigned int blk_qc_t;
 #define BLK_QC_T_NONE		-1U
 #define BLK_QC_T_SHIFT		16
 #define BLK_QC_T_INTERNAL	(1U << 31)
+/* only used for bio based submission, has to be defined as 0 */
+#define BLK_QC_T_BIO_NONE	0
 
 static inline bool blk_qc_t_valid(blk_qc_t cookie)
 {
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH V3 10/13] blk-mq: limit hw queues to be polled in each blk_poll()
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

Poll at most 8 hw queues in each blk_poll(), to avoid adding extra
latency when queue depth is high.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 73 ++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 53 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 76a90da83d9c..65fe6a2bad43 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3865,32 +3865,31 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
 	return ret;
 }
 
-static int blk_mq_poll_io(struct bio *bio)
+#define POLL_HCTX_MAX_CNT 8
+
+static bool blk_add_unique_hctx(struct blk_mq_hw_ctx **data, int *cnt,
+		struct blk_mq_hw_ctx *hctx)
 {
-	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
-	blk_qc_t cookie = bio_get_private_data(bio);
-	int ret = 0;
+	int i;
 
-	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
-		struct blk_mq_hw_ctx *hctx =
-			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+	for (i = 0; i < *cnt; i++) {
+		if (data[i] == hctx)
+			goto exit;
+	}
 
-		ret += blk_mq_poll_hctx(q, hctx);
+	if (i < POLL_HCTX_MAX_CNT) {
+		data[i] = hctx;
+		(*cnt)++;
 	}
-	return ret;
+ exit:
+	return *cnt == POLL_HCTX_MAX_CNT;
 }
 
-static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
+static void blk_build_poll_queues(struct bio_grp_list *grps,
+		struct blk_mq_hw_ctx **data, int *cnt)
 {
-	int ret = 0;
 	int i;
 
-	/*
-	 * Poll hw queue first.
-	 *
-	 * TODO: limit max poll times and make sure to not poll same
-	 * hw queue one more time.
-	 */
 	for (i = 0; i < grps->nr_grps; i++) {
 		struct bio_grp_list_data *grp = &grps->head[i];
 		struct bio *bio;
@@ -3898,11 +3897,29 @@ static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
 		if (bio_grp_list_grp_empty(grp))
 			continue;
 
-		for (bio = grp->list.head; bio; bio = bio->bi_poll)
-			ret += blk_mq_poll_io(bio);
+		for (bio = grp->list.head; bio; bio = bio->bi_poll) {
+			blk_qc_t  cookie;
+			struct blk_mq_hw_ctx *hctx;
+			struct request_queue *q;
+
+			if (bio_flagged(bio, BIO_DONE))
+				continue;
+			cookie = bio_get_private_data(bio);
+			if (!blk_qc_t_valid(cookie))
+				continue;
+
+			q = bio->bi_bdev->bd_disk->queue;
+			hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+			if (blk_add_unique_hctx(data, cnt, hctx))
+				return;
+		}
 	}
+}
+
+static void blk_bio_poll_reap_ios(struct bio_grp_list *grps)
+{
+	int i;
 
-	/* reap bios */
 	for (i = 0; i < grps->nr_grps; i++) {
 		struct bio_grp_list_data *grp = &grps->head[i];
 		struct bio *bio;
@@ -3927,6 +3944,22 @@ static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
 		}
 		__bio_grp_list_merge(&grp->list, &bl);
 	}
+}
+
+static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
+{
+	int ret = 0;
+	int i;
+	struct blk_mq_hw_ctx *hctx[POLL_HCTX_MAX_CNT];
+	int cnt = 0;
+
+	blk_build_poll_queues(grps, hctx, &cnt);
+
+	for (i = 0; i < cnt; i++)
+		ret += blk_mq_poll_hctx(hctx[i]->queue, hctx[i]);
+
+	blk_bio_poll_reap_ios(grps);
+
 	return ret;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH V3 11/13] block: add queue_to_disk() to get gendisk from request_queue
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Sometimes we need to get the corresponding gendisk from a request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only the request_queue is given, we need to get its corresponding
gendisk in order to reach the private data stored in that gendisk.
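
As a hedged illustration only (the driver type and field below are made
up), a bio-based driver keeping its state in gendisk->private_data could
resolve that state from a bare request_queue like this:

#include <linux/blkdev.h>	/* queue_to_disk() after this patch */

/* hypothetical driver-private data, reached through the gendisk */
struct foo_dev {
	bool all_paths_pollable;
};

static bool foo_queue_poll_capable(struct request_queue *q)
{
	struct gendisk *disk = queue_to_disk(q);
	struct foo_dev *fd = disk->private_data;

	return fd->all_paths_pollable;
}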

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89a01850cf12..bfab74b45f15 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,6 +686,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH V3 12/13] block: add poll_capable method to support bio-based IO polling
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

This method can be used to check whether a bio-based device supports IO
polling. For mq devices, checking for hw queues in polling mode is
adequate, while the sanity check has to be implementation specific for
bio-based devices. For example, a dm device needs to check whether all
underlying devices are capable of IO polling.

Though a bio-based device may have done this sanity check during the
device initialization phase, caching the result of the check (such as in
the queue_flags) may not work. For dm devices, users could change the
state of the underlying devices through '/sys/block/<dev>/queue/io_poll',
bypassing the dm device above, in which case the cached result of the
initial sanity check could be out of date. Thus the sanity check needs to
be done every time 'io_poll' is to be modified.
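
A sketch under stated assumptions (a hypothetical bio-based stacking
driver 'stk' holding an array of lower bdevs; blk_queue_poll() is the
helper added earlier in this series) would simply re-check the lower
queues on every ->poll_capable call:

#include <linux/blkdev.h>

/* hypothetical driver-private data */
struct stk_dev {
	int nr_lower;
	struct block_device **lower_bdev;
};

static bool stk_poll_capable(struct gendisk *disk)
{
	struct stk_dev *s = disk->private_data;
	int i;

	/* re-evaluate every time: a lower device's io_poll may have changed */
	for (i = 0; i < s->nr_lower; i++)
		if (!blk_queue_poll(bdev_get_queue(s->lower_bdev[i])))
			return false;
	return true;
}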

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-sysfs.c      | 14 +++++++++++---
 include/linux/blkdev.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0f4f0c8a7825..367c1d9a55c6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
-	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
-		return -EINVAL;
+	if (queue_is_mq(q)) {
+		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
+		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+			return -EINVAL;
+	} else {
+		struct gendisk *disk = queue_to_disk(q);
+
+		if (!disk->fops->poll_capable ||
+		    !disk->fops->poll_capable(disk))
+			return -EINVAL;
+	}
 
 	ret = queue_var_store(&poll_on, page, count);
 	if (ret < 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bfab74b45f15..a46f975f2a2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1881,6 +1881,7 @@ struct block_device_operations {
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
+	bool (*poll_capable)(struct gendisk *disk);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH V3 13/13] dm: support IO polling for bio-based dm device
  2021-03-24 12:19 ` [dm-devel] " Ming Lei
@ 2021-03-24 12:19   ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check whether the underlying dm target device
supports IO polling.
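
For completeness, here is a user-space sketch of how the polled path
would be exercised once io_poll is turned on for the dm device (e.g.
'echo 1 > /sys/block/dm-0/queue/io_poll'): polled reads are driven
through preadv2() with RWF_HIPRI on an O_DIRECT file descriptor. The
device path below is a placeholder.

/* build: gcc -O2 -o hipri_read hipri_read.c (run as root) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_HIPRI
#define RWF_HIPRI 0x00000001	/* from include/uapi/linux/fs.h */
#endif

int main(void)
{
	const char *dev = "/dev/dm-0";	/* placeholder device path */
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd;

	fd = open(dev, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT wants an aligned buffer and a block-sized request */
	if (posix_memalign(&buf, 4096, 4096)) {
		close(fd);
		return 1;
	}

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* RWF_HIPRI asks the kernel to complete the read by polling */
	ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (ret < 0)
		perror("preadv2");
	else
		printf("read %zd bytes via polled IO\n", ret);

	free(buf);
	close(fd);
	return 0;
}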

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
 drivers/md/dm.c               | 14 ++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 39 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..a8f3575fb118 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1509,6 +1509,12 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
 	return &t->targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1559,6 +1565,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev,
 	return 0;
 }
 
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2079,6 +2090,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	dm_update_keyslot_manager(q, t);
 	blk_queue_update_readahead(q);
+
+	/*
+	 * The check for request-based devices remains in
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For bio-based devices, only set QUEUE_FLAG_POLL when all underlying
+	 * devices support polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..fe6893b078dc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return ret;
 }
 
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..31bfd6f70013 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,6 +538,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [dm-devel] [PATCH V3 13/13] dm: support IO polling for bio-based dm device
@ 2021-03-24 12:19   ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-24 12:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be build upon another dm device. In this case,
the mapped device will check if the underlying dm target device
supports IO polling.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
 drivers/md/dm.c               | 14 ++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 39 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..a8f3575fb118 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1509,6 +1509,12 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
 	return &t->targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1559,6 +1565,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev,
 	return 0;
 }
 
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2079,6 +2090,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	dm_update_keyslot_manager(q, t);
 	blk_queue_update_readahead(q);
+
+	/*
+	 * The check for a request-based device is left to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For a bio-based device, only set QUEUE_FLAG_POLL when all
+	 * underlying devices support polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..fe6893b078dc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return ret;
 }
 
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..31bfd6f70013 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,6 +538,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
-- 
2.29.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 01/13] block: add helper of blk_queue_poll
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 13:19     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:19 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Chaitanya Kulkarni

On 3/24/21 1:19 PM, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c         | 2 +-
>   block/blk-mq.c           | 3 +--
>   drivers/nvme/host/core.c | 2 +-
>   include/linux/blkdev.h   | 1 +
>   4 files changed, 4 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 01/13] block: add helper of blk_queue_poll
@ 2021-03-24 13:19     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:19 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Jeffle Xu, dm-devel, Chaitanya Kulkarni, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c         | 2 +-
>   block/blk-mq.c           | 3 +--
>   drivers/nvme/host/core.c | 2 +-
>   include/linux/blkdev.h   | 1 +
>   4 files changed, 4 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 02/13] block: add one helper to free io_context
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 13:21     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> Prepare for putting the bio poll queue into the io_context, so add one
> helper for freeing the io_context.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-ioc.c | 11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 02/13] block: add one helper to free io_context
@ 2021-03-24 13:21     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> Prepare for putting the bio poll queue into the io_context, so add one
> helper for freeing the io_context.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-ioc.c | 11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 03/13] block: add helper of blk_create_io_context
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 13:22     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:22 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> Add one helper for creating io context and prepare for supporting
> efficient bio based io poll.
> 
> Meanwhile, move the code that creates the io_context before the check of
> the bio's REQ_HIPRI flag, because the following patch may clear REQ_HIPRI
> if the io_context can't be created.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 23 ++++++++++++++---------
>   1 file changed, 14 insertions(+), 9 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 03/13] block: add helper of blk_create_io_context
@ 2021-03-24 13:22     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:22 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> Add one helper for creating io context and prepare for supporting
> efficient bio based io poll.
> 
> Meanwhile, move the code that creates the io_context before the check of
> the bio's REQ_HIPRI flag, because the following patch may clear REQ_HIPRI
> if the io_context can't be created.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c | 23 ++++++++++++++---------
>   1 file changed, 14 insertions(+), 9 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 04/13] block: create io poll context for submission and poll task
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 13:26     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> Create per-task io poll context for both IO submission and poll task
> if the queue is bio based and supports polling.
> 
> This io polling context includes two queues:
> 
> 1) submission queue(sq) for storing HIPRI bio, written by submission task
>     and read by poll task.
> 2) polling queue(pq) for holding data moved from sq, only used in poll
>     context for running bio polling.
> 
> Following patches will support bio based io polling.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
>   block/blk-ioc.c           |  1 +
>   block/blk-mq.c            | 14 ++++++++
>   block/blk.h               | 45 +++++++++++++++++++++++++
>   include/linux/iocontext.h |  2 ++
>   5 files changed, 121 insertions(+), 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 04/13] block: create io poll context for submission and poll task
@ 2021-03-24 13:26     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 13:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> Create per-task io poll context for both IO submission and poll task
> if the queue is bio based and supports polling.
> 
> This io polling context includes two queues:
> 
> 1) submission queue(sq) for storing HIPRI bio, written by submission task
>     and read by poll task.
> 2) polling queue(pq) for holding data moved from sq, only used in poll
>     context for running bio polling.
> 
> Following patches will support bio based io polling.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
>   block/blk-ioc.c           |  1 +
>   block/blk-mq.c            | 14 ++++++++
>   block/blk.h               | 45 +++++++++++++++++++++++++
>   include/linux/iocontext.h |  2 ++
>   5 files changed, 121 insertions(+), 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 15:32     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 15:32 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> Add one req flag REQ_POLL_CTX which will be used in the following patch for
> supporting bio based IO polling.
> 
> This flag specifically helps us to:
> 
> 1) The request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_POLL_CTX, all bios cloned from this FS bio will be marked as
> REQ_POLL_CTX too.
> 
> 2) Create a per-task io polling context if the bio based queue supports
> polling and the submitted bio is HIPRI. The per-task io poll context will be
> created during submit_bio() before marking this HIPRI bio as REQ_POLL_CTX.
> Then we can avoid creating such an io polling context if one cloned bio with
> REQ_POLL_CTX is submitted from another kernel context.
> 
> 3) For supporting bio based io polling, we need to poll IOs from all
> underlying queues of the bio device; this helps us to recognize which
> IO needs to be polled in bio based style, which will be applied in a
> following patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c          | 25 ++++++++++++++++++++++++-
>   include/linux/blk_types.h |  4 ++++
>   2 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4671bbf31fd3..eb07d61cfdc2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>   static inline void blk_bio_poll_preprocess(struct request_queue *q,
>   		struct bio *bio)
>   {
> +	bool mq;
> +
>   	if (!(bio->bi_opf & REQ_HIPRI))
>   		return;
>   
> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +	/*
> +	 * Can't support bio based IO polling without per-task poll ctx
> +	 *
> +	 * We have created per-task io poll context, and mark this
> +	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
> +	 * submitted from another kernel context, we won't create bio
> +	 * poll context for it, and that bio can be completed by IRQ;
> +	 * 2) If such bio is submitted from current context, we will
> +	 * complete it via blk_poll(); 3) If driver knows that one
> +	 * underlying bio allocated from driver is for FS bio, meantime
> +	 * it is submitted in current context, driver can mark such bio
> +	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
> +	 * via blk_poll too.
> +	 */
> +	mq = queue_is_mq(q);
> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>   		bio->bi_opf &= ~REQ_HIPRI;
> +	else if (!mq)
> +		bio->bi_opf |= REQ_POLL_CTX;
>   }
>   
>   static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>   	/*
>   	 * Create per-task io poll ctx if bio polling supported and HIPRI
>   	 * set.
> +	 *
> +	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
> +	 * from FS and allocate io polling context.
>   	 */
>   	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			!(bio->bi_opf & REQ_POLL_CTX) &&
>   			(bio->bi_opf & REQ_HIPRI));
>   
>   	blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..99160d588c2d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>   
>   	__REQ_HIPRI,
>   
> +	/* for marking IOs originated from same FS bio in same context */
> +	__REQ_POLL_CTX,
> +
>   	/* for driver use */
>   	__REQ_DRV,
>   	__REQ_SWAP,		/* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>   
>   #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>   #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> +#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
>   
>   #define REQ_DRV			(1ULL << __REQ_DRV)
>   #define REQ_SWAP		(1ULL << __REQ_SWAP)
> 
What happens to split bios?
Will they be tracked similarly to cloned bios?
If so, shouldn't you document that here, too?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
@ 2021-03-24 15:32     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 15:32 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> Add one req flag REQ_POLL_CTX which will be used in the following patch for
> supporting bio based IO polling.
> 
> This flag specifically helps us to:
> 
> 1) The request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_POLL_CTX, all bios cloned from this FS bio will be marked as
> REQ_POLL_CTX too.
> 
> 2) Create a per-task io polling context if the bio based queue supports
> polling and the submitted bio is HIPRI. The per-task io poll context will be
> created during submit_bio() before marking this HIPRI bio as REQ_POLL_CTX.
> Then we can avoid creating such an io polling context if one cloned bio with
> REQ_POLL_CTX is submitted from another kernel context.
> 
> 3) For supporting bio based io polling, we need to poll IOs from all
> underlying queues of the bio device; this helps us to recognize which
> IO needs to be polled in bio based style, which will be applied in a
> following patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-core.c          | 25 ++++++++++++++++++++++++-
>   include/linux/blk_types.h |  4 ++++
>   2 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4671bbf31fd3..eb07d61cfdc2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>   static inline void blk_bio_poll_preprocess(struct request_queue *q,
>   		struct bio *bio)
>   {
> +	bool mq;
> +
>   	if (!(bio->bi_opf & REQ_HIPRI))
>   		return;
>   
> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +	/*
> +	 * Can't support bio based IO polling without per-task poll ctx
> +	 *
> +	 * We have created per-task io poll context, and mark this
> +	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
> +	 * submitted from another kernel context, we won't create bio
> +	 * poll context for it, and that bio can be completed by IRQ;
> +	 * 2) If such bio is submitted from current context, we will
> +	 * complete it via blk_poll(); 3) If driver knows that one
> +	 * underlying bio allocated from driver is for FS bio, meantime
> +	 * it is submitted in current context, driver can mark such bio
> +	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
> +	 * via blk_poll too.
> +	 */
> +	mq = queue_is_mq(q);
> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>   		bio->bi_opf &= ~REQ_HIPRI;
> +	else if (!mq)
> +		bio->bi_opf |= REQ_POLL_CTX;
>   }
>   
>   static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>   	/*
>   	 * Create per-task io poll ctx if bio polling supported and HIPRI
>   	 * set.
> +	 *
> +	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
> +	 * from FS and allocate io polling context.
>   	 */
>   	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			!(bio->bi_opf & REQ_POLL_CTX) &&
>   			(bio->bi_opf & REQ_HIPRI));
>   
>   	blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..99160d588c2d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>   
>   	__REQ_HIPRI,
>   
> +	/* for marking IOs originated from same FS bio in same context */
> +	__REQ_POLL_CTX,
> +
>   	/* for driver use */
>   	__REQ_DRV,
>   	__REQ_SWAP,		/* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>   
>   #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>   #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> +#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
>   
>   #define REQ_DRV			(1ULL << __REQ_DRV)
>   #define REQ_SWAP		(1ULL << __REQ_SWAP)
> 
What happens to split bios?
Will they be tracked similarly to cloned bios?
If so, shouldn't you document that here, too?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 06/13] block: add new field into 'struct bvec_iter'
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 15:33     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 15:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> There is a hole at the end of 'struct bvec_iter', so put a new field
> there and save the cookie returned from submit_bio() in it for
> supporting bio based polling.
> 
> This way avoids extending the bio unnecessarily.
> 
> Meantime add two helpers to get/set this field.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk.h          | 10 ++++++++++
>   include/linux/bvec.h |  8 ++++++++
>   2 files changed, 18 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 06/13] block: add new field into 'struct bvec_iter'
@ 2021-03-24 15:33     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-24 15:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> There is a hole at the end of 'struct bvec_iter', so put a new field
> there and save the cookie returned from submit_bio() in it for
> supporting bio based polling.
> 
> This way avoids extending the bio unnecessarily.
> 
> Meantime add two helpers to get/set this field.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk.h          | 10 ++++++++++
>   include/linux/bvec.h |  8 ++++++++
>   2 files changed, 18 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 03/13] block: add helper of blk_create_io_context
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 15:52     ` Keith Busch
  -1 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2021-03-24 15:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> +static inline void blk_create_io_context(struct request_queue *q)
> +{
> +	/*
> +	 * Various block parts want %current->io_context, so allocate it up
> +	 * front rather than dealing with lots of pain to allocate it only
> +	 * where needed. This may fail and the block layer knows how to live
> +	 * with it.
> +	 */

I think this comment would make more sense if it were placed above the
caller rather than within this function. 
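
I.e., roughly like this (sketch of the suggested placement only, untested):

	/*
	 * Various block parts want %current->io_context, so allocate it up
	 * front rather than dealing with lots of pain to allocate it only
	 * where needed. This may fail and the block layer knows how to live
	 * with it.
	 */
	blk_create_io_context(q);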

> +	if (unlikely(!current->io_context))
> +		create_task_io_context(current, GFP_ATOMIC, q->node);
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> +	blk_create_io_context(q);
> +
>  	if (!blk_queue_poll(q))
>  		bio->bi_opf &= ~REQ_HIPRI;
>  
> @@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		break;
>  	}
>  
> -	/*
> -	 * Various block parts want %current->io_context, so allocate it up
> -	 * front rather than dealing with lots of pain to allocate it only
> -	 * where needed. This may fail and the block layer knows how to live
> -	 * with it.
> -	 */
> -	if (unlikely(!current->io_context))
> -		create_task_io_context(current, GFP_ATOMIC, q->node);
> -
>  	if (blk_throtl_bio(bio)) {
>  		blkcg_bio_issue_init(bio);
>  		return false;
> -- 
> 2.29.2

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 03/13] block: add helper of blk_create_io_context
@ 2021-03-24 15:52     ` Keith Busch
  0 siblings, 0 replies; 74+ messages in thread
From: Keith Busch @ 2021-03-24 15:52 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu, Mike Snitzer

On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> +static inline void blk_create_io_context(struct request_queue *q)
> +{
> +	/*
> +	 * Various block parts want %current->io_context, so allocate it up
> +	 * front rather than dealing with lots of pain to allocate it only
> +	 * where needed. This may fail and the block layer knows how to live
> +	 * with it.
> +	 */

I think this comment would make more sense if it were placed above the
caller rather than within this function. 

> +	if (unlikely(!current->io_context))
> +		create_task_io_context(current, GFP_ATOMIC, q->node);
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> +	blk_create_io_context(q);
> +
>  	if (!blk_queue_poll(q))
>  		bio->bi_opf &= ~REQ_HIPRI;
>  
> @@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		break;
>  	}
>  
> -	/*
> -	 * Various block parts want %current->io_context, so allocate it up
> -	 * front rather than dealing with lots of pain to allocate it only
> -	 * where needed. This may fail and the block layer knows how to live
> -	 * with it.
> -	 */
> -	if (unlikely(!current->io_context))
> -		create_task_io_context(current, GFP_ATOMIC, q->node);
> -
>  	if (blk_throtl_bio(bio)) {
>  		blkcg_bio_issue_init(bio);
>  		return false;
> -- 
> 2.29.2

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 01/13] block: add helper of blk_queue_poll
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 17:54     ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2021-03-24 17:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jeffle Xu, Mike Snitzer, dm-devel,
	Chaitanya Kulkarni

On Wed, Mar 24, 2021 at 08:19:15PM +0800, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Can't say I'm a huge fan of these wrappers that obfuscate what
actually is checked here.  But it does fit the style used for
other flags.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 01/13] block: add helper of blk_queue_poll
@ 2021-03-24 17:54     ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2021-03-24 17:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, Chaitanya Kulkarni, linux-block,
	dm-devel, Jeffle Xu

On Wed, Mar 24, 2021 at 08:19:15PM +0800, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Can't say I'm a huge fan of these wrappers that obfuscate what
actually is checked here.  But it does fit the style used for
other flags.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 03/13] block: add helper of blk_create_io_context
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-24 18:17     ` Christoph Hellwig
  -1 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2021-03-24 18:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> Add one helper for creating io context and prepare for supporting
> efficient bio based io poll.

Looking at what gets added later here I do not think this helper is
a good idea.  Having a separate one for creating any needed poll-only
context is a lot more clear.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 03/13] block: add helper of blk_create_io_context
@ 2021-03-24 18:17     ` Christoph Hellwig
  0 siblings, 0 replies; 74+ messages in thread
From: Christoph Hellwig @ 2021-03-24 18:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu, Mike Snitzer

On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> Add one helper for creating io context and prepare for supporting
> efficient bio based io poll.

Looking at what gets added later here I do not think this helper is
a good idea.  Having a separate one for creating any needed poll-only
context is a lot more clear.

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 03/13] block: add helper of blk_create_io_context
  2021-03-24 18:17     ` [dm-devel] " Christoph Hellwig
@ 2021-03-25  0:30       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  0:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On Wed, Mar 24, 2021 at 07:17:02PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> > Add one helper for creating io context and prepare for supporting
> > efficient bio based io poll.
> 
> Looking at what gets added later here I do not think this helper is
> a good idea.  Having a separate one for creating any needed poll-only
> context is a lot more clear.

The poll context actually depends on the io_context; that is why I put them
both into one single helper.
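
Even a separate poll-only helper would still need to create the io_context
first, so it would end up looking roughly like this (hypothetical sketch,
not part of this series; blk_create_bio_poll_context() is a made-up name):

	static inline void blk_create_bio_poll_context(struct request_queue *q)
	{
		struct io_context *ioc;

		/* the poll context hangs off the io_context */
		if (unlikely(!current->io_context))
			create_task_io_context(current, GFP_ATOMIC, q->node);

		ioc = current->io_context;
		if (ioc && unlikely(!ioc->data))
			bio_poll_ctx_alloc(ioc);
	}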

thanks,
Ming


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 03/13] block: add helper of blk_create_io_context
@ 2021-03-25  0:30       ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  0:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu, Mike Snitzer

On Wed, Mar 24, 2021 at 07:17:02PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 24, 2021 at 08:19:17PM +0800, Ming Lei wrote:
> > Add one helper for creating io context and prepare for supporting
> > efficient bio based io poll.
> 
> Looking at what gets added later here I do not think this helper is
> a good idea.  Having a separate one for creating any needed poll-only
> context is a lot more clear.

The poll context actually depends on the io_context; that is why I put them
both into one single helper.

thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
  2021-03-24 15:32     ` [dm-devel] " Hannes Reinecke
@ 2021-03-25  0:32       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  0:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On Wed, Mar 24, 2021 at 04:32:31PM +0100, Hannes Reinecke wrote:
> On 3/24/21 1:19 PM, Ming Lei wrote:
> > Add one req flag REQ_POLL_CTX which will be used in the following patch for
> > supporting bio based IO polling.
> > 
> > This flag specifically helps us to:
> > 
> > 1) The request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> > as REQ_POLL_CTX, all bios cloned from this FS bio will be marked as
> > REQ_POLL_CTX too.
> > 
> > 2) Create a per-task io polling context if the bio based queue supports
> > polling and the submitted bio is HIPRI. The per-task io poll context will be
> > created during submit_bio() before marking this HIPRI bio as REQ_POLL_CTX.
> > Then we can avoid creating such an io polling context if one cloned bio with
> > REQ_POLL_CTX is submitted from another kernel context.
> > 
> > 3) For supporting bio based io polling, we need to poll IOs from all
> > underlying queues of the bio device; this helps us to recognize which
> > IO needs to be polled in bio based style, which will be applied in a
> > following patch.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-core.c          | 25 ++++++++++++++++++++++++-
> >   include/linux/blk_types.h |  4 ++++
> >   2 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 4671bbf31fd3..eb07d61cfdc2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> >   static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >   		struct bio *bio)
> >   {
> > +	bool mq;
> > +
> >   	if (!(bio->bi_opf & REQ_HIPRI))
> >   		return;
> > -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> > +	/*
> > +	 * Can't support bio based IO polling without per-task poll ctx
> > +	 *
> > +	 * We have created per-task io poll context, and mark this
> > +	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
> > +	 * submitted from another kernel context, we won't create bio
> > +	 * poll context for it, and that bio can be completed by IRQ;
> > +	 * 2) If such bio is submitted from current context, we will
> > +	 * complete it via blk_poll(); 3) If driver knows that one
> > +	 * underlying bio allocated from driver is for FS bio, meantime
> > +	 * it is submitted in current context, driver can mark such bio
> > +	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
> > +	 * via blk_poll too.
> > +	 */
> > +	mq = queue_is_mq(q);
> > +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
> >   		bio->bi_opf &= ~REQ_HIPRI;
> > +	else if (!mq)
> > +		bio->bi_opf |= REQ_POLL_CTX;
> >   }
> >   static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > @@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >   	/*
> >   	 * Create per-task io poll ctx if bio polling supported and HIPRI
> >   	 * set.
> > +	 *
> > +	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
> > +	 * from FS and allocate io polling context.
> >   	 */
> >   	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> > +			!(bio->bi_opf & REQ_POLL_CTX) &&
> >   			(bio->bi_opf & REQ_HIPRI));
> >   	blk_bio_poll_preprocess(q, bio);
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index db026b6ec15a..99160d588c2d 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -394,6 +394,9 @@ enum req_flag_bits {
> >   	__REQ_HIPRI,
> > +	/* for marking IOs originated from same FS bio in same context */
> > +	__REQ_POLL_CTX,
> > +
> >   	/* for driver use */
> >   	__REQ_DRV,
> >   	__REQ_SWAP,		/* swapping request. */
> > @@ -418,6 +421,7 @@ enum req_flag_bits {
> >   #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
> >   #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> > +#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
> >   #define REQ_DRV			(1ULL << __REQ_DRV)
> >   #define REQ_SWAP		(1ULL << __REQ_SWAP)
> > 
> What happens to split bios?
> Will they be tracked similarly to cloned bios?
> If so, shouldn't you document that here, too?

split bios are simply cloned bios, please see bio_split().
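
For reference, the relevant part of bio_split() (abridged from block/bio.c of
this kernel, only to illustrate the point):

	struct bio *bio_split(struct bio *bio, int sectors,
			      gfp_t gfp, struct bio_set *bs)
	{
		struct bio *split;

		/*
		 * bio_clone_fast() copies ->bi_opf, so REQ_HIPRI and
		 * REQ_POLL_CTX follow the split bio automatically.
		 */
		split = bio_clone_fast(bio, gfp, bs);
		if (!split)
			return NULL;

		split->bi_iter.bi_size = sectors << 9;
		/* ... remaining split bookkeeping elided ... */

		return split;
	}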


thanks,
Ming


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
@ 2021-03-25  0:32       ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  0:32 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu, Mike Snitzer

On Wed, Mar 24, 2021 at 04:32:31PM +0100, Hannes Reinecke wrote:
> On 3/24/21 1:19 PM, Ming Lei wrote:
> > Add one req flag REQ_POLL_CTX which will be used in the following patch for
> > supporting bio based IO polling.
> > 
> > This flag specifically helps us to:
> > 
> > 1) The request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> > as REQ_POLL_CTX, all bios cloned from this FS bio will be marked as
> > REQ_POLL_CTX too.
> > 
> > 2) Create a per-task io polling context if the bio based queue supports
> > polling and the submitted bio is HIPRI. The per-task io poll context will be
> > created during submit_bio() before marking this HIPRI bio as REQ_POLL_CTX.
> > Then we can avoid creating such an io polling context if one cloned bio with
> > REQ_POLL_CTX is submitted from another kernel context.
> > 
> > 3) For supporting bio based io polling, we need to poll IOs from all
> > underlying queues of the bio device; this helps us to recognize which
> > IO needs to be polled in bio based style, which will be applied in a
> > following patch.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-core.c          | 25 ++++++++++++++++++++++++-
> >   include/linux/blk_types.h |  4 ++++
> >   2 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 4671bbf31fd3..eb07d61cfdc2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> >   static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >   		struct bio *bio)
> >   {
> > +	bool mq;
> > +
> >   	if (!(bio->bi_opf & REQ_HIPRI))
> >   		return;
> > -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> > +	/*
> > +	 * Can't support bio based IO polling without per-task poll ctx
> > +	 *
> > +	 * We have created per-task io poll context, and mark this
> > +	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
> > +	 * submitted from another kernel context, we won't create bio
> > +	 * poll context for it, and that bio can be completed by IRQ;
> > +	 * 2) If such bio is submitted from current context, we will
> > +	 * complete it via blk_poll(); 3) If driver knows that one
> > +	 * underlying bio allocated from driver is for FS bio, meantime
> > +	 * it is submitted in current context, driver can mark such bio
> > +	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
> > +	 * via blk_poll too.
> > +	 */
> > +	mq = queue_is_mq(q);
> > +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
> >   		bio->bi_opf &= ~REQ_HIPRI;
> > +	else if (!mq)
> > +		bio->bi_opf |= REQ_POLL_CTX;
> >   }
> >   static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > @@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >   	/*
> >   	 * Create per-task io poll ctx if bio polling supported and HIPRI
> >   	 * set.
> > +	 *
> > +	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
> > +	 * from FS and allocate io polling context.
> >   	 */
> >   	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> > +			!(bio->bi_opf & REQ_POLL_CTX) &&
> >   			(bio->bi_opf & REQ_HIPRI));
> >   	blk_bio_poll_preprocess(q, bio);
> > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> > index db026b6ec15a..99160d588c2d 100644
> > --- a/include/linux/blk_types.h
> > +++ b/include/linux/blk_types.h
> > @@ -394,6 +394,9 @@ enum req_flag_bits {
> >   	__REQ_HIPRI,
> > +	/* for marking IOs originated from same FS bio in same context */
> > +	__REQ_POLL_CTX,
> > +
> >   	/* for driver use */
> >   	__REQ_DRV,
> >   	__REQ_SWAP,		/* swapping request. */
> > @@ -418,6 +421,7 @@ enum req_flag_bits {
> >   #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
> >   #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> > +#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
> >   #define REQ_DRV			(1ULL << __REQ_DRV)
> >   #define REQ_SWAP		(1ULL << __REQ_SWAP)
> > 
> What happens to split bios?
> Will they be tracked similarly to cloned bios?
> If so, shouldn't you document that here, too?

split bios are simply cloned bios, please see bio_split().


thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 01/13] block: add helper of blk_queue_poll
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25  1:56     ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  1:56 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Mike Snitzer, dm-devel, Chaitanya Kulkarni



On 3/24/21 8:19 PM, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Better to also convert blk-sysfs.c:queue_poll_show().
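
i.e. something along these lines (sketch, assuming the existing
queue_var_show() based implementation):

	static ssize_t queue_poll_show(struct request_queue *q, char *page)
	{
		return queue_var_show(blk_queue_poll(q), page);
	}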

With that fixed,

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>


> ---
>  block/blk-core.c         | 2 +-
>  block/blk-mq.c           | 3 +--
>  drivers/nvme/host/core.c | 2 +-
>  include/linux/blkdev.h   | 1 +
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..a31371d55b9d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -836,7 +836,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> -	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +	if (!blk_queue_poll(q))
>  		bio->bi_opf &= ~REQ_HIPRI;
>  
>  	switch (bio_op(bio)) {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..63c81df3b8b5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3869,8 +3869,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) ||
> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0896e21642be..34b8c78f88e0 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -956,7 +956,7 @@ static void nvme_execute_rq_polled(struct request_queue *q,
>  {
>  	DECLARE_COMPLETION_ONSTACK(wait);
>  
> -	WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
> +	WARN_ON_ONCE(!blk_queue_poll(q));
>  
>  	rq->cmd_flags |= REQ_HIPRI;
>  	rq->end_io_data = &wait;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bc6bc8383b43..89a01850cf12 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,6 +665,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
>  #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
>  #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
> +#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
>  
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 01/13] block: add helper of blk_queue_poll
@ 2021-03-25  1:56     ` JeffleXu
  0 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  1:56 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, dm-devel, Chaitanya Kulkarni, Mike Snitzer



On 3/24/21 8:19 PM, Ming Lei wrote:
> There have been 3 users, and there will be more, so add one such helper.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Better to also convert blk-sysfs.c:queue_poll_show().

With that fixed,

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>


> ---
>  block/blk-core.c         | 2 +-
>  block/blk-mq.c           | 3 +--
>  drivers/nvme/host/core.c | 2 +-
>  include/linux/blkdev.h   | 1 +
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..a31371d55b9d 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -836,7 +836,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> -	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +	if (!blk_queue_poll(q))
>  		bio->bi_opf &= ~REQ_HIPRI;
>  
>  	switch (bio_op(bio)) {
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index d4d7c1caa439..63c81df3b8b5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3869,8 +3869,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) ||
> -	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
> +	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 0896e21642be..34b8c78f88e0 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -956,7 +956,7 @@ static void nvme_execute_rq_polled(struct request_queue *q,
>  {
>  	DECLARE_COMPLETION_ONSTACK(wait);
>  
> -	WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
> +	WARN_ON_ONCE(!blk_queue_poll(q));
>  
>  	rq->cmd_flags |= REQ_HIPRI;
>  	rq->end_io_data = &wait;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bc6bc8383b43..89a01850cf12 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -665,6 +665,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
>  #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
>  #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
>  #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
> +#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
>  
>  extern void blk_set_pm_only(struct request_queue *q);
>  extern void blk_clear_pm_only(struct request_queue *q);
> 

-- 
Thanks,
Jeffle

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 04/13] block: create io poll context for submission and poll task
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25  2:34     ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  2:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Mike Snitzer, dm-devel



On 3/24/21 8:19 PM, Ming Lei wrote:
> Create per-task io poll context for both IO submission and poll task
> if the queue is bio based and supports polling.
> 
> This io polling context includes two queues:
> 
> 1) submission queue(sq) for storing HIPRI bio, written by submission task
>    and read by poll task.
> 2) polling queue(pq) for holding data moved from sq, only used in poll
>    context for running bio polling.
> 
> Following patches will support bio based io polling.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
>  block/blk-ioc.c           |  1 +
>  block/blk-mq.c            | 14 ++++++++
>  block/blk.h               | 45 +++++++++++++++++++++++++
>  include/linux/iocontext.h |  2 ++
>  5 files changed, 121 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d58f8a0c80de..4671bbf31fd3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>  	return BLK_STS_OK;
>  }
>  
> -static inline void blk_create_io_context(struct request_queue *q)
> +static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
>  {
> -	/*
> -	 * Various block parts want %current->io_context, so allocate it up
> -	 * front rather than dealing with lots of pain to allocate it only
> -	 * where needed. This may fail and the block layer knows how to live
> -	 * with it.
> -	 */
> -	if (unlikely(!current->io_context))
> -		create_task_io_context(current, GFP_ATOMIC, q->node);
> +	struct io_context *ioc = current->io_context;
> +
> +	return ioc ? ioc->data : NULL;
> +}
> +
> +static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> +{
> +	return sizeof(struct bio_grp_list) + nr_grps *
> +		sizeof(struct bio_grp_list_data);
> +}
> +
> +static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> +{
> +	pc->sq = (void *)pc + sizeof(*pc);
> +	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
> +
> +	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
> +	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
> +
> +	spin_lock_init(&pc->sq_lock);
> +	spin_lock_init(&pc->pq_lock);
> +}
> +
> +void bio_poll_ctx_alloc(struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc;
> +	unsigned int size = sizeof(*pc) +
> +		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
> +		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
> +
> +	pc = kzalloc(size, GFP_ATOMIC);
> +	if (pc) {
> +		bio_poll_ctx_init(pc);
> +		if (cmpxchg(&ioc->data, NULL, (void *)pc))
> +			kfree(pc);
> +	}

Why not put these in blk-ioc.c?


> +}
> +
> +static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> +{
> +	return !queue_is_mq(q) && blk_queue_poll(q);
> +}
> +
> +static inline void blk_bio_poll_preprocess(struct request_queue *q,
> +		struct bio *bio)
> +{
> +	if (!(bio->bi_opf & REQ_HIPRI))
> +		return;
> +
> +	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +		bio->bi_opf &= ~REQ_HIPRI;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  		}
>  	}
>  
> -	blk_create_io_context(q);
> +	/*
> +	 * Create per-task io poll ctx if bio polling supported and HIPRI
> +	 * set.
> +	 */
> +	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			(bio->bi_opf & REQ_HIPRI));
>  
> -	if (!blk_queue_poll(q))
> -		bio->bi_opf &= ~REQ_HIPRI;
> +	blk_bio_poll_preprocess(q, bio);
>  
>  	switch (bio_op(bio)) {
>  	case REQ_OP_DISCARD:
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index b0cde18c4b8c..5574c398eff6 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
>  
>  static inline void free_io_context(struct io_context *ioc)
>  {
> +	kfree(ioc->data);
>  	kmem_cache_free(iocontext_cachep, ioc);
>  }
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 63c81df3b8b5..c832faa52ca0 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>  	return blk_mq_poll_hybrid_sleep(q, rq);
>  }
>  
> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> +{
> +	/*
> +	 * Create poll queue for storing poll bio and its cookie from
> +	 * submission queue
> +	 */
> +	blk_create_io_context(q, true);
> +
> +	return 0;
> +}
> +
>  /**
>   * blk_poll - poll for IO completions
>   * @q:  the queue
> @@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	if (current->plug)
>  		blk_flush_plug_list(current->plug, false);
>  
> +	if (!queue_is_mq(q))
> +		return blk_bio_poll(q, cookie, spin);
> +
>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>  
>  	/*
> diff --git a/block/blk.h b/block/blk.h
> index 3b53e44b967e..424949f2226d 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -357,4 +357,49 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>  		struct page *page, unsigned int len, unsigned int offset,
>  		unsigned int max_sectors, bool *same_page);
>  
> +/* Grouping bios that share same data into one list */
> +struct bio_grp_list_data {
> +	void *grp_data;
> +
> +	/* all bios in this list share same 'grp_data' */
> +	struct bio_list list;
> +};
> +
> +struct bio_grp_list {
> +	unsigned int max_nr_grps, nr_grps;
> +	struct bio_grp_list_data head[0];
> +};
> +
> +struct blk_bio_poll_ctx {
> +	spinlock_t sq_lock;
> +	struct bio_grp_list *sq;
> +
> +	spinlock_t pq_lock;
> +	struct bio_grp_list *pq;
> +};
> +
> +#define BLK_BIO_POLL_SQ_SZ		16U
> +#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)

And these in iocontext.h?


> +
> +void bio_poll_ctx_alloc(struct io_context *ioc);
> +
> +static inline void blk_create_io_context(struct request_queue *q,
> +		bool need_poll_ctx)
> +{
> +	struct io_context *ioc;
> +
> +	/*
> +	 * Various block parts want %current->io_context, so allocate it up
> +	 * front rather than dealing with lots of pain to allocate it only
> +	 * where needed. This may fail and the block layer knows how to live
> +	 * with it.
> +	 */
> +	if (unlikely(!current->io_context))
> +		create_task_io_context(current, GFP_ATOMIC, q->node);
> +
> +	ioc = current->io_context;
> +	if (need_poll_ctx && unlikely(ioc && !ioc->data))
> +		bio_poll_ctx_alloc(ioc);
> +}
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
> index 0a9dc40b7be8..f9a467571356 100644
> --- a/include/linux/iocontext.h
> +++ b/include/linux/iocontext.h
> @@ -110,6 +110,8 @@ struct io_context {
>  	struct io_cq __rcu	*icq_hint;
>  	struct hlist_head	icq_list;
>  
> +	void			*data;
> +
>  	struct work_struct release_work;
>  };
>  
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 04/13] block: create io poll context for submission and poll task
  2021-03-25  2:34     ` [dm-devel] " JeffleXu
@ 2021-03-25  2:51       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  2:51 UTC (permalink / raw)
  To: JeffleXu; +Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel

On Thu, Mar 25, 2021 at 10:34:02AM +0800, JeffleXu wrote:
> 
> 
> On 3/24/21 8:19 PM, Ming Lei wrote:
> > Create per-task io poll context for both IO submission and poll task
> > if the queue is bio based and supports polling.
> > 
> > This io polling context includes two queues:
> > 
> > 1) submission queue(sq) for storing HIPRI bio, written by submission task
> >    and read by poll task.
> > 2) polling queue(pq) for holding data moved from sq, only used in poll
> >    context for running bio polling.
> > 
> > Following patches will support bio based io polling.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
> >  block/blk-ioc.c           |  1 +
> >  block/blk-mq.c            | 14 ++++++++
> >  block/blk.h               | 45 +++++++++++++++++++++++++
> >  include/linux/iocontext.h |  2 ++
> >  5 files changed, 121 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d58f8a0c80de..4671bbf31fd3 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> >  	return BLK_STS_OK;
> >  }
> >  
> > -static inline void blk_create_io_context(struct request_queue *q)
> > +static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
> >  {
> > -	/*
> > -	 * Various block parts want %current->io_context, so allocate it up
> > -	 * front rather than dealing with lots of pain to allocate it only
> > -	 * where needed. This may fail and the block layer knows how to live
> > -	 * with it.
> > -	 */
> > -	if (unlikely(!current->io_context))
> > -		create_task_io_context(current, GFP_ATOMIC, q->node);
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	return ioc ? ioc->data : NULL;
> > +}
> > +
> > +static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> > +{
> > +	return sizeof(struct bio_grp_list) + nr_grps *
> > +		sizeof(struct bio_grp_list_data);
> > +}
> > +
> > +static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> > +{
> > +	pc->sq = (void *)pc + sizeof(*pc);
> > +	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
> > +
> > +	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
> > +	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
> > +
> > +	spin_lock_init(&pc->sq_lock);
> > +	spin_lock_init(&pc->pq_lock);
> > +}
> > +
> > +void bio_poll_ctx_alloc(struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc;
> > +	unsigned int size = sizeof(*pc) +
> > +		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
> > +		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
> > +
> > +	pc = kzalloc(GFP_ATOMIC, size);
> > +	if (pc) {
> > +		bio_poll_ctx_init(pc);
> > +		if (cmpxchg(&ioc->data, NULL, (void *)pc))
> > +			kfree(pc);
> > +	}
> 
> Why don't put these in blk-ioc.c?

It is for implementing bio polling, so there is no need to move it to
blk-ioc.c.

> 
> 
> > +}
> > +
> > +static inline bool blk_queue_support_bio_poll(struct request_queue *q)
> > +{
> > +	return !queue_is_mq(q) && blk_queue_poll(q);
> > +}
> > +
> > +static inline void blk_bio_poll_preprocess(struct request_queue *q,
> > +		struct bio *bio)
> > +{
> > +	if (!(bio->bi_opf & REQ_HIPRI))
> > +		return;
> > +
> > +	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> > +		bio->bi_opf &= ~REQ_HIPRI;
> >  }
> >  
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > @@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  		}
> >  	}
> >  
> > -	blk_create_io_context(q);
> > +	/*
> > +	 * Create per-task io poll ctx if bio polling supported and HIPRI
> > +	 * set.
> > +	 */
> > +	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> > +			(bio->bi_opf & REQ_HIPRI));
> >  
> > -	if (!blk_queue_poll(q))
> > -		bio->bi_opf &= ~REQ_HIPRI;
> > +	blk_bio_poll_preprocess(q, bio);
> >  
> >  	switch (bio_op(bio)) {
> >  	case REQ_OP_DISCARD:
> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index b0cde18c4b8c..5574c398eff6 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
> >  
> >  static inline void free_io_context(struct io_context *ioc)
> >  {
> > +	kfree(ioc->data);
> >  	kmem_cache_free(iocontext_cachep, ioc);
> >  }
> >  
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 63c81df3b8b5..c832faa52ca0 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
> >  	return blk_mq_poll_hybrid_sleep(q, rq);
> >  }
> >  
> > +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> > +{
> > +	/*
> > +	 * Create poll queue for storing poll bio and its cookie from
> > +	 * submission queue
> > +	 */
> > +	blk_create_io_context(q, true);
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * blk_poll - poll for IO completions
> >   * @q:  the queue
> > @@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
> >  	if (current->plug)
> >  		blk_flush_plug_list(current->plug, false);
> >  
> > +	if (!queue_is_mq(q))
> > +		return blk_bio_poll(q, cookie, spin);
> > +
> >  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >  
> >  	/*
> > diff --git a/block/blk.h b/block/blk.h
> > index 3b53e44b967e..424949f2226d 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -357,4 +357,49 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> >  		struct page *page, unsigned int len, unsigned int offset,
> >  		unsigned int max_sectors, bool *same_page);
> >  
> > +/* Grouping bios that share same data into one list */
> > +struct bio_grp_list_data {
> > +	void *grp_data;
> > +
> > +	/* all bios in this list share same 'grp_data' */
> > +	struct bio_list list;
> > +};
> > +
> > +struct bio_grp_list {
> > +	unsigned int max_nr_grps, nr_grps;
> > +	struct bio_grp_list_data head[0];
> > +};
> > +
> > +struct blk_bio_poll_ctx {
> > +	spinlock_t sq_lock;
> > +	struct bio_grp_list *sq;
> > +
> > +	spinlock_t pq_lock;
> > +	struct bio_grp_list *pq;
> > +};
> > +
> > +#define BLK_BIO_POLL_SQ_SZ		16U
> > +#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
> 
> And these in iocontext.h?

These are all internal definitions for bio polling; there is no need to
put them into a public header.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 04/13] block: create io poll context for submission and poll task
  2021-03-25  2:51       ` [dm-devel] " Ming Lei
@ 2021-03-25  3:01         ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  3:01 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel



On 3/25/21 10:51 AM, Ming Lei wrote:
> On Thu, Mar 25, 2021 at 10:34:02AM +0800, JeffleXu wrote:
>>
>>
>> On 3/24/21 8:19 PM, Ming Lei wrote:
>>> Create per-task io poll context for both IO submission and poll task
>>> if the queue is bio based and supports polling.
>>>
>>> This io polling context includes two queues:
>>>
>>> 1) submission queue(sq) for storing HIPRI bio, written by submission task
>>>    and read by poll task.
>>> 2) polling queue(pq) for holding data moved from sq, only used in poll
>>>    context for running bio polling.
>>>
>>> Following patches will support bio based io polling.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/blk-core.c          | 71 ++++++++++++++++++++++++++++++++-------
>>>  block/blk-ioc.c           |  1 +
>>>  block/blk-mq.c            | 14 ++++++++
>>>  block/blk.h               | 45 +++++++++++++++++++++++++
>>>  include/linux/iocontext.h |  2 ++
>>>  5 files changed, 121 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index d58f8a0c80de..4671bbf31fd3 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -792,16 +792,59 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>>  	return BLK_STS_OK;
>>>  }
>>>  
>>> -static inline void blk_create_io_context(struct request_queue *q)
>>> +static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
>>>  {
>>> -	/*
>>> -	 * Various block parts want %current->io_context, so allocate it up
>>> -	 * front rather than dealing with lots of pain to allocate it only
>>> -	 * where needed. This may fail and the block layer knows how to live
>>> -	 * with it.
>>> -	 */
>>> -	if (unlikely(!current->io_context))
>>> -		create_task_io_context(current, GFP_ATOMIC, q->node);
>>> +	struct io_context *ioc = current->io_context;
>>> +
>>> +	return ioc ? ioc->data : NULL;
>>> +}
>>> +
>>> +static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>>> +{
>>> +	return sizeof(struct bio_grp_list) + nr_grps *
>>> +		sizeof(struct bio_grp_list_data);
>>> +}
>>> +
>>> +static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>>> +{
>>> +	pc->sq = (void *)pc + sizeof(*pc);
>>> +	pc->sq->max_nr_grps = BLK_BIO_POLL_SQ_SZ;
>>> +
>>> +	pc->pq = (void *)pc->sq + bio_grp_list_size(BLK_BIO_POLL_SQ_SZ);
>>> +	pc->pq->max_nr_grps = BLK_BIO_POLL_PQ_SZ;
>>> +
>>> +	spin_lock_init(&pc->sq_lock);
>>> +	spin_lock_init(&pc->pq_lock);
>>> +}
>>> +
>>> +void bio_poll_ctx_alloc(struct io_context *ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc;
>>> +	unsigned int size = sizeof(*pc) +
>>> +		bio_grp_list_size(BLK_BIO_POLL_SQ_SZ) +
>>> +		bio_grp_list_size(BLK_BIO_POLL_PQ_SZ);
>>> +
>>> +	pc = kzalloc(GFP_ATOMIC, size);
>>> +	if (pc) {
>>> +		bio_poll_ctx_init(pc);
>>> +		if (cmpxchg(&ioc->data, NULL, (void *)pc))
>>> +			kfree(pc);
>>> +	}
>>
>> Why don't put these in blk-ioc.c?
> 
> It is for implementing bio polling, so there is no need to move it to
> blk-ioc.c.
> 
>>
>>
>>> +}
>>> +
>>> +static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>>> +{
>>> +	return !queue_is_mq(q) && blk_queue_poll(q);
>>> +}
>>> +
>>> +static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>> +		struct bio *bio)
>>> +{
>>> +	if (!(bio->bi_opf & REQ_HIPRI))
>>> +		return;
>>> +
>>> +	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
>>> +		bio->bi_opf &= ~REQ_HIPRI;
>>>  }
>>>  
>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>> @@ -848,10 +891,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>  		}
>>>  	}
>>>  
>>> -	blk_create_io_context(q);
>>> +	/*
>>> +	 * Create per-task io poll ctx if bio polling supported and HIPRI
>>> +	 * set.
>>> +	 */
>>> +	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
>>> +			(bio->bi_opf & REQ_HIPRI));
>>>  
>>> -	if (!blk_queue_poll(q))
>>> -		bio->bi_opf &= ~REQ_HIPRI;
>>> +	blk_bio_poll_preprocess(q, bio);
>>>  
>>>  	switch (bio_op(bio)) {
>>>  	case REQ_OP_DISCARD:
>>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>>> index b0cde18c4b8c..5574c398eff6 100644
>>> --- a/block/blk-ioc.c
>>> +++ b/block/blk-ioc.c
>>> @@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
>>>  
>>>  static inline void free_io_context(struct io_context *ioc)
>>>  {
>>> +	kfree(ioc->data);
>>>  	kmem_cache_free(iocontext_cachep, ioc);
>>>  }
>>>  
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 63c81df3b8b5..c832faa52ca0 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
>>>  	return blk_mq_poll_hybrid_sleep(q, rq);
>>>  }
>>>  
>>> +static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>> +{
>>> +	/*
>>> +	 * Create poll queue for storing poll bio and its cookie from
>>> +	 * submission queue
>>> +	 */
>>> +	blk_create_io_context(q, true);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>>  /**
>>>   * blk_poll - poll for IO completions
>>>   * @q:  the queue
>>> @@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>>>  	if (current->plug)
>>>  		blk_flush_plug_list(current->plug, false);
>>>  
>>> +	if (!queue_is_mq(q))
>>> +		return blk_bio_poll(q, cookie, spin);
>>> +
>>>  	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>>  
>>>  	/*
>>> diff --git a/block/blk.h b/block/blk.h
>>> index 3b53e44b967e..424949f2226d 100644
>>> --- a/block/blk.h
>>> +++ b/block/blk.h
>>> @@ -357,4 +357,49 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>>>  		struct page *page, unsigned int len, unsigned int offset,
>>>  		unsigned int max_sectors, bool *same_page);
>>>  
>>> +/* Grouping bios that share same data into one list */
>>> +struct bio_grp_list_data {
>>> +	void *grp_data;
>>> +
>>> +	/* all bios in this list share same 'grp_data' */
>>> +	struct bio_list list;
>>> +};
>>> +
>>> +struct bio_grp_list {
>>> +	unsigned int max_nr_grps, nr_grps;
>>> +	struct bio_grp_list_data head[0];
>>> +};
>>> +
>>> +struct blk_bio_poll_ctx {
>>> +	spinlock_t sq_lock;
>>> +	struct bio_grp_list *sq;
>>> +
>>> +	spinlock_t pq_lock;
>>> +	struct bio_grp_list *pq;
>>> +};
>>> +
>>> +#define BLK_BIO_POLL_SQ_SZ		16U
>>> +#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
>>
>> And these in iocontext.h?
> 
> These are all internal definitions for bio polling; there is no need to
> put them into a public header.
> 
Thanks. I missed that blk.h is a private header.

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>


-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25  6:34     ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  6:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Mike Snitzer, dm-devel



On 3/24/21 8:19 PM, Ming Lei wrote:
> Currently bio based IO polling needs to poll all hw queues blindly. This
> way is very inefficient, and one big reason is that we can't pass any
> bio submission result to blk_poll().
> 
> In IO submission context, track associated underlying bios by per-task
> submission queue and store returned 'cookie' in
> bio->bi_iter.bi_private_data, and return current->pid to caller of
> submit_bio() for any bio based driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of the submission
> context, so we can find bios from the per-task io poll context of the
> submission context. Bios are moved from the submission queue to the poll
> queue of the poll context, and polling keeps going until these bios are
> ended. A bio is removed from the poll queue once it is ended. The bio
> flags BIO_DONE and BIO_END_BY_POLL are added for this purpose.
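For illustration only, here is a minimal userspace C sketch of the cookie
idea above (all names are invented for the sketch and are not the kernel
API): submit() queues an item on the submitting task's private queue and
hands back that task's pid as the cookie, and the poller, possibly a
different task, uses the cookie to locate the right queue to reap.

#include <stdio.h>

#define MAX_TASKS 4
#define MAX_ITEMS 8

struct fake_task {
	int pid;			/* plays the role of current->pid */
	int sq[MAX_ITEMS];		/* per-task submission queue */
	int nr;
};

static struct fake_task tasks[MAX_TASKS];

/* queue one io on the submitter's queue; the cookie is the submitter's pid */
static int submit(struct fake_task *t, int io)
{
	t->sq[t->nr++] = io;
	return t->pid;
}

/* the cookie tells the poller whose per-task queue holds the pending ios */
static int poll_cookie(int cookie)
{
	int i, done = 0;

	for (i = 0; i < MAX_TASKS; i++) {
		if (tasks[i].pid != cookie)
			continue;
		done = tasks[i].nr;	/* "complete" everything queued there */
		tasks[i].nr = 0;
		break;
	}
	return done;
}

int main(void)
{
	int cookie;

	tasks[0].pid = 100;
	tasks[1].pid = 200;

	cookie = submit(&tasks[0], 42);
	cookie = submit(&tasks[0], 43);

	/* another task can poll on behalf of pid 100 using the cookie */
	printf("reaped %d ios for pid %d\n", poll_cookie(cookie), cookie);
	return 0;
}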
> 
> It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> submission queue as queue depth is increased, so a new mechanism for
> tracking bios is needed. So far bio's size is close to 2 cachelines,
> and it may not be acceptable to add a new field into bio for solving the
> scalability issue by tracking bios via a linked list, so switch to a bio
> group list for tracking bios. The idea is to reuse .bi_end_io for linking
> bios into a linked list for all bios sharing the same .bi_end_io (call it
> a bio group), which is recovered before really ending the bio;
> BIO_END_BY_POLL is added to guarantee this point. Usually .bi_end_io is
> the same for all bios in the same layer, so it is enough to provide very
> limited groups, such as 16 or less, for fixing the scalability issue.
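To make the .bi_end_io trick concrete, here is a tiny self-contained
userspace model (names invented for the sketch, not the kernel structures):
nodes that share one callback are chained through the callback field
itself, the shared callback is parked once in the group head, and it is
restored on each node right before it is finally invoked, which is what
the recovery step above corresponds to.

#include <stdio.h>

struct node;
typedef void (*end_fn)(struct node *);

struct node {
	int id;
	union {
		end_fn end;		/* the "real" completion callback */
		struct node *next;	/* reused as the link while grouped */
	};
};

struct group {
	end_fn end;			/* callback shared by the whole group */
	struct node *head;
};

/* assumes every node added shares the group's callback, as with .bi_end_io */
static void add_to_group(struct group *g, struct node *n)
{
	if (!g->head)
		g->end = n->end;	/* park the shared callback in the head */
	n->next = g->head;		/* reuse the callback field as a link */
	g->head = n;
}

static void end_group(struct group *g)
{
	struct node *n = g->head;

	while (n) {
		struct node *next = n->next;

		n->end = g->end;	/* restore the callback, then really end */
		n->end(n);
		n = next;
	}
	g->head = NULL;
}

static void my_end(struct node *n)
{
	printf("node %d ended\n", n->id);
}

int main(void)
{
	struct group g = { 0 };
	struct node a = { .id = 1, .end = my_end };
	struct node b = { .id = 2, .end = my_end };

	add_to_group(&g, &a);
	add_to_group(&g, &b);
	end_group(&g);
	return 0;
}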
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Also when the submission task is exiting, drain pending IOs in the context
> until all are done.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 +
>  block/blk-core.c          | 154 ++++++++++++++++++++++++-
>  block/blk-ioc.c           |   2 +
>  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |  10 ++
>  include/linux/blk_types.h |  18 ++-
>  6 files changed, 419 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index eb07d61cfdc2..95f7e36c8759 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>  		sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> +	return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == bio_grp_data(bio)) {
> +			__bio_grp_list_add(&grp->list, bio);
> +			return true;
> +		}
> +	}
> +
> +	if (i == list->max_nr_grps)
> +		return false;
> +
> +	/* create a new group */
> +	grp = &list->head[i];
> +	bio_list_init(&grp->list);
> +	grp->grp_data = bio_grp_data(bio);
> +	__bio_grp_list_add(&grp->list, bio);
> +	list->nr_grps++;
> +
> +	return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == grp_data)
> +			return i;
> +	}
> +
> +	if (i < list->max_nr_grps) {
> +		grp = &list->head[i];
> +		bio_list_init(&grp->list);
> +		return i;
> +	}
> +
> +	return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> +	int i, j, cnt = 0;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> +		grp = &src->head[i];
> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> +		if (j < 0)
> +			break;
> +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> +			dst->head[j].grp_data = grp->grp_data;
> +			dst->nr_grps++;
> +		}
> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> +		bio_list_init(&grp->list);
> +		cnt++;
> +	}
> +
> +	src->nr_grps -= cnt;
> +}
> +
>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>  {
>  	pc->sq = (void *)pc + sizeof(*pc);
> @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_POLL_CTX;
>  }
>  
> +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> +{
> +	/*
> +	 * The bio has been added to per-task poll queue, mark it as
> +	 * END_BY_POLL, so that this bio is always completed from
> +	 * blk_poll() which is provided with cookied from this bio's
> +	 * submission.
> +	 */
> +	if (!queued)
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> +	else
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +}
> +
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/*
> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> +	 * and completion. However, bio crypt may update .bi_end_io during
> +	 * submission, so simply don't support bio based polling for this
> +	 * setting.
> +	 */
> +	if (likely(!bio_has_crypt_ctx(bio))) {
> +		/* track this bio via bio group list */
> +		spin_lock(&pc->sq_lock);
> +		queued = bio_grp_list_add(pc->sq, bio);
> +		blk_bio_poll_mark_queued(bio, queued);
> +		spin_unlock(&pc->sq_lock);
> +	} else {
> +		queued = false;
> +		blk_bio_poll_mark_queued(bio, false);
> +	}
> +
> +	return queued;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {


I see no point in enqueueing the bio into the context->sq when
REQ_HIPRI is cleared while REQ_POLL_CTX is set for the bio, since
BLK_QC_T_NONE is returned in this case. This combination is possible since
commit cc29e1bf0d63 ("block: disable iopoll for split bio").


> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				bio_set_private_data(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	return ret;
>  }
>  
> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> +		struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +
> +	__submit_bio_noacct_ctx(bio, ioc);
> +
> +	/* bio submissions queued to per-task poll context */
> +	if (READ_ONCE(pc->sq->nr_grps))
> +		return current->pid;
> +
> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> +	return BLK_QC_T_BIO_NONE;
> +}
> +
> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> +{
> +	struct io_context *ioc = current->io_context;
> +
> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> +		return __submit_bio_noacct_poll(bio, ioc);
> +
> +	__submit_bio_noacct_ctx(bio, NULL);
> +
> +	return BLK_QC_T_BIO_NONE;
> +}
> +
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  {
>  	struct bio_list bio_list[2] = { };
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 5574c398eff6..b9a512f066f8 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
>  
>  static inline void free_io_context(struct io_context *ioc)
>  {
> +	blk_bio_poll_io_drain(ioc);
> +

When multiple processes share one io_context, there may be a time window
between the moment the IO submission process detaches the io_context and
the moment the io_context's refcount finally drops to zero. I don't know
if it is possible that the other processes sharing the io_context won't
submit any IO, in which case the bios remaining in the io_context won't
be reaped for a long time.

If that case is possible, then is it possible to drain the sq as soon as
the process detaches the io_context?


>  	kfree(ioc->data);
>  	kmem_cache_free(iocontext_cachep, ioc);
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 03f59915fe2c..76a90da83d9c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>  	return ret;
>  }
>  
> +static int blk_mq_poll_io(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	blk_qc_t cookie = bio_get_private_data(bio);
> +	int ret = 0;
> +
> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> +		struct blk_mq_hw_ctx *hctx =
> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +
> +		ret += blk_mq_poll_hctx(q, hctx);
> +	}
> +	return ret;
> +}
> +
> +static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
> +{
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * Poll hw queue first.
> +	 *
> +	 * TODO: limit max poll times and make sure to not poll same
> +	 * hw queue one more time.
> +	 */
> +	for (i = 0; i < grps->nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio *bio;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> +			ret += blk_mq_poll_io(bio);
> +	}
> +
> +	/* reap bios */
> +	for (i = 0; i < grps->nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio *bio;
> +		struct bio_list bl;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		bio_list_init(&bl);
> +
> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> +			if (bio_flagged(bio, BIO_DONE)) {
> +				/* now recover original data */
> +				bio->bi_poll = grp->grp_data;
> +
> +				/* clear BIO_END_BY_POLL and end me really */
> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> +				bio_endio(bio);
> +			} else {
> +				__bio_grp_list_add(&bl, bio);
> +			}
> +		}
> +		__bio_grp_list_merge(&grp->list, &bl);
> +	}
> +	return ret;
> +}
> +
> +static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
> +{
> +	int i, j, k = 0;
> +	int cnt = 0;
> +
> +	for (i = grps->nr_grps - 1; i >= 0; i--) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio_grp_list_data *hole = NULL;
> +
> +		if (bio_grp_list_grp_empty(grp)) {
> +			cnt++;
> +			continue;
> +		}
> +

> +		for (j = k; j < i; j++) {
> +			hole = &grps->head[j];
> +			if (bio_grp_list_grp_empty(hole))
> +				break;
> +		}

Should be

> +		for (j = k; j < i; j++) {
> +			tmp = &grps->head[j];
> +			if (bio_grp_list_grp_empty(tmp)) {
> +				hole = tmp;
> +				break;
> +			}
> +		}

?
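As a tiny standalone illustration of the difference (plain flags standing
in for the group slots, not the kernel code): with the original pattern,
'hole' ends up pointing at an occupied slot when no empty slot exists in
the scanned range, while with the suggested pattern it stays unset.

#include <stdio.h>
#include <stdbool.h>

/* true = group slot is empty, false = occupied */
static bool slot_empty[] = { false, false, false };

int main(void)
{
	int j, k = 0, i = 3;
	int hole;

	/* original pattern: hole is updated on every iteration */
	hole = -1;
	for (j = k; j < i; j++) {
		hole = j;
		if (slot_empty[j])
			break;
	}
	printf("original:  hole = %d (an occupied slot)\n", hole);

	/* suggested pattern: hole is only set when an empty slot is found */
	hole = -1;
	for (j = k; j < i; j++) {
		if (slot_empty[j]) {
			hole = j;
			break;
		}
	}
	printf("suggested: hole = %d (stays -1, no hole)\n", hole);
	return 0;
}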


> +		if (hole == NULL)
> +			break;
> +		*hole = *grp;
> +		cnt++;
> +		k = j;
> +	}
> +
> +	grps->nr_grps -= cnt;
> +}
> +
> +#define  MAX_BIO_GRPS_ON_STACK  8
> +struct bio_grp_list_stack {
> +	unsigned int max_nr_grps, nr_grps;
> +	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
> +};
> +
> +static int blk_bio_poll_io(struct io_context *submit_ioc,
> +		struct io_context *poll_ioc)
> +
> +{
> +	struct bio_grp_list_stack _bio_grps = {
> +		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
> +		.nr_grps	= 0
> +	};
> +	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
> +		poll_ioc->data : NULL;
> +	int ret = 0;
> +
> +	/*
> +	 * Move IO submission result from submission queue in submission
> +	 * context to poll queue of poll context.
> +	 */
> +	spin_lock(&submit_ctx->sq_lock);
> +	bio_grp_list_move(bio_grps, submit_ctx->sq);
> +	spin_unlock(&submit_ctx->sq_lock);
> +
> +	/* merge new bios first, then start to poll bios from pq */
> +	if (poll_ctx) {
> +		spin_lock(&poll_ctx->pq_lock);
> +		bio_grp_list_move(poll_ctx->pq, bio_grps);
> +		bio_grp_list_move(bio_grps, poll_ctx->pq);

What's the purpose of this two-step merge? Is it so that new bios (from
sq) end up at the tail of the bio_list, and thus old bios (from pq) are
polled first?
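One possible reading, shown with a toy model (plain strings standing in
for the bio lists, and assuming the merge helpers append the source to the
tail of the destination, as bio_list-style merging does): merging the new
sq bios into pq first and then taking everything back leaves the old pq
bios at the head, so they get polled first.

#include <stdio.h>
#include <string.h>

/* toy "bio list": a string of tags, merge == append src to dst tail */
static void merge(char *dst, const char *src)
{
	strcat(dst, src);
}

int main(void)
{
	char pq[32] = "OO";	/* two old bios already sitting in pq */
	char sq[32] = "NN";	/* two bios just moved over from sq */

	merge(pq, sq);		/* step 1: pq = "OONN", new bios go behind old */
	sq[0] = '\0';
	merge(sq, pq);		/* step 2: take everything back for polling */
	pq[0] = '\0';

	printf("poll order: %s (old pq bios first)\n", sq);
	return 0;
}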

> +		spin_unlock(&poll_ctx->pq_lock);
> +	}
> +
> +	do {
> +		ret += blk_bio_poll_and_end_io(bio_grps);
> +		blk_bio_poll_pack_groups(bio_grps);
> +
> +		if (bio_grps->nr_grps) {
> +			/*
> +			 * move back, and keep polling until all can be
> +			 * held in either poll queue or submission queue.
> +			 */
> +			if (poll_ctx) {
> +				spin_lock(&poll_ctx->pq_lock);
> +				bio_grp_list_move(poll_ctx->pq, bio_grps);
> +				spin_unlock(&poll_ctx->pq_lock);
> +			} else {
> +				spin_lock(&submit_ctx->sq_lock);
> +				bio_grp_list_move(submit_ctx->sq, bio_grps);
> +				spin_unlock(&submit_ctx->sq_lock);
> +			}
> +		}
> +	} while (bio_grps->nr_grps > 0);
> +
> +	return ret;
> +}
> +
> +void blk_bio_poll_io_drain(struct io_context *submit_ioc)
> +{
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +
> +	if (!submit_ctx)
> +		return;
> +
> +	while (submit_ctx->sq->nr_grps > 0) {
> +		blk_bio_poll_io(submit_ioc, NULL);
> +		cpu_relax();
> +	}
> +}
> +
> +static bool blk_bio_ioc_valid(struct task_struct *t)
> +{
> +	if (!t)
> +		return false;
> +
> +	if (!t->io_context)
> +		return false;
> +
> +	if (!t->io_context->data)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __blk_bio_poll(blk_qc_t cookie)
> +{
> +	struct io_context *poll_ioc = current->io_context;
> +	pid_t pid;
> +	struct task_struct *submit_task;
> +	int ret;
> +
> +	pid = (pid_t)cookie;
> +
> +	/* io poll often share io submission context */
> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> +		return blk_bio_poll_io(poll_ioc, poll_ioc);
> +
> +	submit_task = find_get_task_by_vpid(pid);
> +	if (likely(blk_bio_ioc_valid(submit_task)))
> +		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
> +	else
> +		ret = 0;
> +
> +	put_task_struct(submit_task);

put_task_struct() must not be called when @submit_task is NULL.


> +
> +	return ret;
> +}
> +
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> +	long state;
> +
> +	/* no need to poll */
> +	if (cookie == BLK_QC_T_BIO_NONE)
> +		return 0;
> +
>  	/*
>  	 * Create poll queue for storing poll bio and its cookie from
>  	 * submission queue
>  	 */
>  	blk_create_io_context(q, true);
>  
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = __blk_bio_poll(cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
>  
> @@ -3893,7 +4125,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/block/blk.h b/block/blk.h
> index 7e16419904fa..948b7b19ef48 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -381,6 +381,7 @@ struct blk_bio_poll_ctx {
>  #define BLK_BIO_POLL_SQ_SZ		16U
>  #define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
>  
> +void blk_bio_poll_io_drain(struct io_context *submit_ioc);
>  void bio_poll_ctx_alloc(struct io_context *ioc);
>  
>  static inline void blk_create_io_context(struct request_queue *q,
> @@ -412,4 +413,13 @@ static inline void bio_set_private_data(struct bio *bio, unsigned int data)
>  	bio->bi_iter.bi_private_data = data;
>  }
>  
> +BIO_LIST_HELPERS(__bio_grp_list, poll);
> +
> +static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
> +{
> +	return bio_list_empty(&grp->list);
> +}
> +
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 99160d588c2d..beaeb3729f11 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -235,7 +235,18 @@ struct bio {
>  
>  	struct bvec_iter	bi_iter;
>  
> -	bio_end_io_t		*bi_end_io;
> +	union {
> +		bio_end_io_t		*bi_end_io;
> +		/*
> +		 * bio based io polling needs to track bio via bio group
> +		 * list which groups bios by their .bi_end_io, and original
> +		 * .bi_end_io is saved into the group head. Will recover
> +		 * .bi_end_io before really ending bio. BIO_END_BY_POLL
> +		 * will make sure that this bio won't be ended before
> +		 * recovering .bi_end_io.
> +		 */
> +		struct bio		*bi_poll;
> +	};
>  
>  	void			*bi_private;
>  #ifdef CONFIG_BLK_CGROUP
> @@ -304,6 +315,9 @@ enum {
>  	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
>  	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
>  	BIO_REMAPPED,
> +	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
> +	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
> +	BIO_DONE,
>  	BIO_FLAG_LAST
>  };
>  
> @@ -513,6 +527,8 @@ typedef unsigned int blk_qc_t;
>  #define BLK_QC_T_NONE		-1U
>  #define BLK_QC_T_SHIFT		16
>  #define BLK_QC_T_INTERNAL	(1U << 31)
> +/* only used for bio based submission, has to be defined as 0 */
> +#define BLK_QC_T_BIO_NONE	0
>  
>  static inline bool blk_qc_t_valid(blk_qc_t cookie)
>  {
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
@ 2021-03-25  6:34     ` JeffleXu
  0 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  6:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, dm-devel, Mike Snitzer



On 3/24/21 8:19 PM, Ming Lei wrote:
> Currently bio based IO polling needs to poll all hw queues blindly. This
> way is very inefficient, and one big reason is that we can't pass any
> bio submission result to blk_poll().
> 
> In IO submission context, track associated underlying bios by per-task
> submission queue and store returned 'cookie' in
> bio->bi_iter.bi_private_data, and return current->pid to caller of
> submit_bio() for any bio based driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of the submission
> context, so we can find bios from the per-task io poll context of the
> submission context. Bios are moved from the submission queue to the poll
> queue of the poll context, and polling keeps going until these bios are
> ended. A bio is removed from the poll queue once it is ended. The bio
> flags BIO_DONE and BIO_END_BY_POLL are added for this purpose.
> 
> It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> submission queue as queue depth is increased, so a new mechanism for
> tracking bios is needed. So far bio's size is close to 2 cachelines,
> and it may not be acceptable to add a new field into bio for solving the
> scalability issue by tracking bios via a linked list, so switch to a bio
> group list for tracking bios. The idea is to reuse .bi_end_io for linking
> bios into a linked list for all bios sharing the same .bi_end_io (call it
> a bio group), which is recovered before really ending the bio;
> BIO_END_BY_POLL is added to guarantee this point. Usually .bi_end_io is
> the same for all bios in the same layer, so it is enough to provide very
> limited groups, such as 16 or less, for fixing the scalability issue.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Also when the submission task is exiting, drain pending IOs in the context
> until all are done.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 +
>  block/blk-core.c          | 154 ++++++++++++++++++++++++-
>  block/blk-ioc.c           |   2 +
>  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
>  block/blk.h               |  10 ++
>  include/linux/blk_types.h |  18 ++-
>  6 files changed, 419 insertions(+), 4 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..04c043dc60fc 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index eb07d61cfdc2..95f7e36c8759 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>  		sizeof(struct bio_grp_list_data);
>  }
>  
> +static inline void *bio_grp_data(struct bio *bio)
> +{
> +	return bio->bi_poll;
> +}
> +
> +/* add bio into bio group list, return true if it is added */
> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == bio_grp_data(bio)) {
> +			__bio_grp_list_add(&grp->list, bio);
> +			return true;
> +		}
> +	}
> +
> +	if (i == list->max_nr_grps)
> +		return false;
> +
> +	/* create a new group */
> +	grp = &list->head[i];
> +	bio_list_init(&grp->list);
> +	grp->grp_data = bio_grp_data(bio);
> +	__bio_grp_list_add(&grp->list, bio);
> +	list->nr_grps++;
> +
> +	return true;
> +}
> +
> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> +{
> +	int i;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = 0; i < list->nr_grps; i++) {
> +		grp = &list->head[i];
> +		if (grp->grp_data == grp_data)
> +			return i;
> +	}
> +
> +	if (i < list->max_nr_grps) {
> +		grp = &list->head[i];
> +		bio_list_init(&grp->list);
> +		return i;
> +	}
> +
> +	return -1;
> +}
> +
> +/* Move as many as possible groups from 'src' to 'dst' */
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> +{
> +	int i, j, cnt = 0;
> +	struct bio_grp_list_data *grp;
> +
> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> +		grp = &src->head[i];
> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> +		if (j < 0)
> +			break;
> +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> +			dst->head[j].grp_data = grp->grp_data;
> +			dst->nr_grps++;
> +		}
> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> +		bio_list_init(&grp->list);
> +		cnt++;
> +	}
> +
> +	src->nr_grps -= cnt;
> +}
> +
>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>  {
>  	pc->sq = (void *)pc + sizeof(*pc);
> @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_POLL_CTX;
>  }
>  
> +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> +{
> +	/*
> +	 * The bio has been added to per-task poll queue, mark it as
> +	 * END_BY_POLL, so that this bio is always completed from
> +	 * blk_poll() which is provided with cookied from this bio's
> +	 * submission.
> +	 */
> +	if (!queued)
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> +	else
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +}
> +
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/*
> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> +	 * and completion. However, bio crypt may update .bi_end_io during
> +	 * submission, so simply don't support bio based polling for this
> +	 * setting.
> +	 */
> +	if (likely(!bio_has_crypt_ctx(bio))) {
> +		/* track this bio via bio group list */
> +		spin_lock(&pc->sq_lock);
> +		queued = bio_grp_list_add(pc->sq, bio);
> +		blk_bio_poll_mark_queued(bio, queued);
> +		spin_unlock(&pc->sq_lock);
> +	} else {
> +		queued = false;
> +		blk_bio_poll_mark_queued(bio, false);
> +	}
> +
> +	return queued;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {


I see no point in enqueueing the bio into the context->sq when
REQ_HIPRI is cleared while REQ_POLL_CTX is still set for the bio;
BLK_QC_T_NONE is returned in this case. This combination is possible
since commit cc29e1bf0d63 ("block: disable iopoll for split bio").
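
For reference, a minimal sketch of the narrower check suggested above,
i.e. only tracking bios that still carry REQ_HIPRI (Ming's reply later
in the thread explains why the patch deliberately keeps the wider
REQ_HIPRI | REQ_POLL_CTX test):

		/*
		 * Sketch only: queue the bio to the per-task submission
		 * queue only while REQ_HIPRI is still set; a split bio
		 * whose REQ_HIPRI was cleared by cc29e1bf0d63 then takes
		 * the plain path and completes via IRQ.
		 */
		if (ioc && queue_is_mq(q) && (bio->bi_opf & REQ_HIPRI)) {
			bool queued = blk_bio_poll_prep_submit(ioc, bio);

			ret = __submit_bio(bio);
			if (queued)
				bio_set_private_data(bio, ret);
		} else {
			ret = __submit_bio(bio);
		}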


> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				bio_set_private_data(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	return ret;
>  }
>  
> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> +		struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +
> +	__submit_bio_noacct_ctx(bio, ioc);
> +
> +	/* bio submissions queued to per-task poll context */
> +	if (READ_ONCE(pc->sq->nr_grps))
> +		return current->pid;
> +
> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> +	return BLK_QC_T_BIO_NONE;
> +}
> +
> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> +{
> +	struct io_context *ioc = current->io_context;
> +
> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> +		return __submit_bio_noacct_poll(bio, ioc);
> +
> +	__submit_bio_noacct_ctx(bio, NULL);
> +
> +	return BLK_QC_T_BIO_NONE;
> +}
> +
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  {
>  	struct bio_list bio_list[2] = { };
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 5574c398eff6..b9a512f066f8 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
>  
>  static inline void free_io_context(struct io_context *ioc)
>  {
> +	blk_bio_poll_io_drain(ioc);
> +

When multiple processes share one io_context, there may be a time
window between the moment the IO submission process detaches the
io_context and the moment the io_context's refcount finally drops to
zero. I don't know whether it is possible that the other processes
sharing the io_context never submit any IO, in which case the bios
remaining in the io_context won't be reaped for a long time.

If that case is possible, could the sq be drained as soon as the
process detaches the io_context?
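
As a rough illustration of that idea (sketch only, not part of the
patch; the body of exit_io_context() below is only an approximation of
the stock detach path, the point is merely where a drain call could
sit):

void exit_io_context(struct task_struct *task)
{
	struct io_context *ioc;

	task_lock(task);
	ioc = task->io_context;
	task->io_context = NULL;
	task_unlock(task);

	/* reap bios still parked in this task's submission queue */
	blk_bio_poll_io_drain(ioc);

	atomic_dec(&ioc->nr_tasks);
	put_io_context_active(ioc);
}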


>  	kfree(ioc->data);
>  	kmem_cache_free(iocontext_cachep, ioc);
>  }
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 03f59915fe2c..76a90da83d9c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>  	return ret;
>  }
>  
> +static int blk_mq_poll_io(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	blk_qc_t cookie = bio_get_private_data(bio);
> +	int ret = 0;
> +
> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> +		struct blk_mq_hw_ctx *hctx =
> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +
> +		ret += blk_mq_poll_hctx(q, hctx);
> +	}
> +	return ret;
> +}
> +
> +static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
> +{
> +	int ret = 0;
> +	int i;
> +
> +	/*
> +	 * Poll hw queue first.
> +	 *
> +	 * TODO: limit max poll times and make sure to not poll same
> +	 * hw queue one more time.
> +	 */
> +	for (i = 0; i < grps->nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio *bio;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> +			ret += blk_mq_poll_io(bio);
> +	}
> +
> +	/* reap bios */
> +	for (i = 0; i < grps->nr_grps; i++) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio *bio;
> +		struct bio_list bl;
> +
> +		if (bio_grp_list_grp_empty(grp))
> +			continue;
> +
> +		bio_list_init(&bl);
> +
> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> +			if (bio_flagged(bio, BIO_DONE)) {
> +				/* now recover original data */
> +				bio->bi_poll = grp->grp_data;
> +
> +				/* clear BIO_END_BY_POLL and end me really */
> +				bio_clear_flag(bio, BIO_END_BY_POLL);
> +				bio_endio(bio);
> +			} else {
> +				__bio_grp_list_add(&bl, bio);
> +			}
> +		}
> +		__bio_grp_list_merge(&grp->list, &bl);
> +	}
> +	return ret;
> +}
> +
> +static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
> +{
> +	int i, j, k = 0;
> +	int cnt = 0;
> +
> +	for (i = grps->nr_grps - 1; i >= 0; i--) {
> +		struct bio_grp_list_data *grp = &grps->head[i];
> +		struct bio_grp_list_data *hole = NULL;
> +
> +		if (bio_grp_list_grp_empty(grp)) {
> +			cnt++;
> +			continue;
> +		}
> +

> +		for (j = k; j < i; j++) {
> +			hole = &grps->head[j];
> +			if (bio_grp_list_grp_empty(hole))
> +				break;
> +		}

Should it be

> +		for (j = k; j < i; j++) {
> +			tmp = &grps->head[j];
> +			if (bio_grp_list_grp_empty(tmp)) {
> +                             hole = tmp;
> +				break;
> +                     }
> +		}

?
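
Spelled out as a consolidated sketch (reusing the declarations from the
patch), so that 'hole' stays NULL unless an empty slot is actually
found, which is what the following NULL check relies on:

		struct bio_grp_list_data *tmp;

		for (j = k; j < i; j++) {
			tmp = &grps->head[j];
			if (bio_grp_list_grp_empty(tmp)) {
				hole = tmp;
				break;
			}
		}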


> +		if (hole == NULL)
> +			break;
> +		*hole = *grp;
> +		cnt++;
> +		k = j;
> +	}
> +
> +	grps->nr_grps -= cnt;
> +}
> +
> +#define  MAX_BIO_GRPS_ON_STACK  8
> +struct bio_grp_list_stack {
> +	unsigned int max_nr_grps, nr_grps;
> +	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
> +};
> +
> +static int blk_bio_poll_io(struct io_context *submit_ioc,
> +		struct io_context *poll_ioc)
> +
> +{
> +	struct bio_grp_list_stack _bio_grps = {
> +		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
> +		.nr_grps	= 0
> +	};
> +	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
> +		poll_ioc->data : NULL;
> +	int ret = 0;
> +
> +	/*
> +	 * Move IO submission result from submission queue in submission
> +	 * context to poll queue of poll context.
> +	 */
> +	spin_lock(&submit_ctx->sq_lock);
> +	bio_grp_list_move(bio_grps, submit_ctx->sq);
> +	spin_unlock(&submit_ctx->sq_lock);
> +
> +	/* merge new bios first, then start to poll bios from pq */
> +	if (poll_ctx) {
> +		spin_lock(&poll_ctx->pq_lock);
> +		bio_grp_list_move(poll_ctx->pq, bio_grps);
> +		bio_grp_list_move(bio_grps, poll_ctx->pq);

What's the purpose of this two-step merge? Is it so that new bios
(from sq) end up at the tail of the bio_list, and thus old bios (from
pq) are polled first?

> +		spin_unlock(&poll_ctx->pq_lock);
> +	}
> +
> +	do {
> +		ret += blk_bio_poll_and_end_io(bio_grps);
> +		blk_bio_poll_pack_groups(bio_grps);
> +
> +		if (bio_grps->nr_grps) {
> +			/*
> +			 * move back, and keep polling until all can be
> +			 * held in either poll queue or submission queue.
> +			 */
> +			if (poll_ctx) {
> +				spin_lock(&poll_ctx->pq_lock);
> +				bio_grp_list_move(poll_ctx->pq, bio_grps);
> +				spin_unlock(&poll_ctx->pq_lock);
> +			} else {
> +				spin_lock(&submit_ctx->sq_lock);
> +				bio_grp_list_move(submit_ctx->sq, bio_grps);
> +				spin_unlock(&submit_ctx->sq_lock);
> +			}
> +		}
> +	} while (bio_grps->nr_grps > 0);
> +
> +	return ret;
> +}
> +
> +void blk_bio_poll_io_drain(struct io_context *submit_ioc)
> +{
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +
> +	if (!submit_ctx)
> +		return;
> +
> +	while (submit_ctx->sq->nr_grps > 0) {
> +		blk_bio_poll_io(submit_ioc, NULL);
> +		cpu_relax();
> +	}
> +}
> +
> +static bool blk_bio_ioc_valid(struct task_struct *t)
> +{
> +	if (!t)
> +		return false;
> +
> +	if (!t->io_context)
> +		return false;
> +
> +	if (!t->io_context->data)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __blk_bio_poll(blk_qc_t cookie)
> +{
> +	struct io_context *poll_ioc = current->io_context;
> +	pid_t pid;
> +	struct task_struct *submit_task;
> +	int ret;
> +
> +	pid = (pid_t)cookie;
> +
> +	/* io poll often share io submission context */
> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> +		return blk_bio_poll_io(poll_ioc, poll_ioc);
> +
> +	submit_task = find_get_task_by_vpid(pid);
> +	if (likely(blk_bio_ioc_valid(submit_task)))
> +		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
> +	else
> +		ret = 0;
> +
> +	put_task_struct(submit_task);

put_task_struct() is not needed when @submit_task is NULL.
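
i.e. something along these lines (sketch only):

	submit_task = find_get_task_by_vpid(pid);
	if (likely(blk_bio_ioc_valid(submit_task)))
		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
	else
		ret = 0;

	/* find_get_task_by_vpid() returns NULL if the submitter has exited */
	if (submit_task)
		put_task_struct(submit_task);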


> +
> +	return ret;
> +}
> +
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> +	long state;
> +
> +	/* no need to poll */
> +	if (cookie == BLK_QC_T_BIO_NONE)
> +		return 0;
> +
>  	/*
>  	 * Create poll queue for storing poll bio and its cookie from
>  	 * submission queue
>  	 */
>  	blk_create_io_context(q, true);
>  
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = __blk_bio_poll(cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
>  
> @@ -3893,7 +4125,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/block/blk.h b/block/blk.h
> index 7e16419904fa..948b7b19ef48 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -381,6 +381,7 @@ struct blk_bio_poll_ctx {
>  #define BLK_BIO_POLL_SQ_SZ		16U
>  #define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
>  
> +void blk_bio_poll_io_drain(struct io_context *submit_ioc);
>  void bio_poll_ctx_alloc(struct io_context *ioc);
>  
>  static inline void blk_create_io_context(struct request_queue *q,
> @@ -412,4 +413,13 @@ static inline void bio_set_private_data(struct bio *bio, unsigned int data)
>  	bio->bi_iter.bi_private_data = data;
>  }
>  
> +BIO_LIST_HELPERS(__bio_grp_list, poll);
> +
> +static inline bool bio_grp_list_grp_empty(struct bio_grp_list_data *grp)
> +{
> +	return bio_list_empty(&grp->list);
> +}
> +
> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src);
> +
>  #endif /* BLK_INTERNAL_H */
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 99160d588c2d..beaeb3729f11 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -235,7 +235,18 @@ struct bio {
>  
>  	struct bvec_iter	bi_iter;
>  
> -	bio_end_io_t		*bi_end_io;
> +	union {
> +		bio_end_io_t		*bi_end_io;
> +		/*
> +		 * bio based io polling needs to track bio via bio group
> +		 * list which groups bios by their .bi_end_io, and original
> +		 * .bi_end_io is saved into the group head. Will recover
> +		 * .bi_end_io before really ending bio. BIO_END_BY_POLL
> +		 * will make sure that this bio won't be ended before
> +		 * recovering .bi_end_io.
> +		 */
> +		struct bio		*bi_poll;
> +	};
>  
>  	void			*bi_private;
>  #ifdef CONFIG_BLK_CGROUP
> @@ -304,6 +315,9 @@ enum {
>  	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
>  	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
>  	BIO_REMAPPED,
> +	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
> +	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
> +	BIO_DONE,
>  	BIO_FLAG_LAST
>  };
>  
> @@ -513,6 +527,8 @@ typedef unsigned int blk_qc_t;
>  #define BLK_QC_T_NONE		-1U
>  #define BLK_QC_T_SHIFT		16
>  #define BLK_QC_T_INTERNAL	(1U << 31)
> +/* only used for bio based submission, has to be defined as 0 */
> +#define BLK_QC_T_BIO_NONE	0
>  
>  static inline bool blk_qc_t_valid(blk_qc_t cookie)
>  {
> 

-- 
Thanks,
Jeffle

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 07/13] block/mq: extract one helper function polling hw queue
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25  6:50     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-25  6:50 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> From: Jeffle Xu <jefflexu@linux.alibaba.com>
> 
> Extract the logic of polling one hw queue and related statistics
> handling out as the helper function.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c | 18 ++++++++++++++----
>   1 file changed, 14 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25  6:55     ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  6:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Mike Snitzer, dm-devel



On 3/24/21 8:19 PM, Ming Lei wrote:
> Add one req flag REQ_POLL_CTX which will be used in the following patch for
> supporting bio based IO polling.
> 
> Exactly this flag can help us to do:
> 
> 1) request flag is cloned in bio_fast_clone(), so if we mark one FS bio
> as REQ_POLL_CTX, all bios cloned from this FS bio will be marked as
> REQ_POLL_CTX too.
> 
> 2) create per-task io polling context if the bio based queue supports
> polling and the submitted bio is HIPRI. Per-task io poll context will be
> created during submit_bio() before marking this HIPRI bio as REQ_POLL_CTX.
> Then we can avoid to create such io polling context if one cloned bio with
> REQ_POLL_CTX is submitted from another kernel context.
> 
> 3) for supporting bio based io polling, we need to poll IOs from all
> underlying queues of the bio device, this way help us to recognize which
> IO needs to be polled in bio based style, which will be applied in
> following patch.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>


> ---
>  block/blk-core.c          | 25 ++++++++++++++++++++++++-
>  include/linux/blk_types.h |  4 ++++
>  2 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4671bbf31fd3..eb07d61cfdc2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -840,11 +840,30 @@ static inline bool blk_queue_support_bio_poll(struct request_queue *q)
>  static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		struct bio *bio)
>  {
> +	bool mq;
> +
>  	if (!(bio->bi_opf & REQ_HIPRI))
>  		return;
>  
> -	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
> +	/*
> +	 * Can't support bio based IO polling without per-task poll ctx
> +	 *
> +	 * We have created per-task io poll context, and mark this
> +	 * bio as REQ_POLL_CTX, so: 1) if any cloned bio from this bio is
> +	 * submitted from another kernel context, we won't create bio
> +	 * poll context for it, and that bio can be completed by IRQ;
> +	 * 2) If such bio is submitted from current context, we will
> +	 * complete it via blk_poll(); 3) If driver knows that one
> +	 * underlying bio allocated from driver is for FS bio, meantime
> +	 * it is submitted in current context, driver can mark such bio
> +	 * as REQ_HIPRI & REQ_POLL_CTX manually, so the bio can be completed
> +	 * via blk_poll too.
> +	 */
> +	mq = queue_is_mq(q);
> +	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
>  		bio->bi_opf &= ~REQ_HIPRI;
> +	else if (!mq)
> +		bio->bi_opf |= REQ_POLL_CTX;
>  }
>  
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> @@ -894,8 +913,12 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  	/*
>  	 * Create per-task io poll ctx if bio polling supported and HIPRI
>  	 * set.
> +	 *
> +	 * If REQ_POLL_CTX isn't set for this HIPRI bio, we think it originated
> +	 * from FS and allocate io polling context.
>  	 */
>  	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
> +			!(bio->bi_opf & REQ_POLL_CTX) &&
>  			(bio->bi_opf & REQ_HIPRI));
>  
>  	blk_bio_poll_preprocess(q, bio);
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index db026b6ec15a..99160d588c2d 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -394,6 +394,9 @@ enum req_flag_bits {
>  
>  	__REQ_HIPRI,
>  
> +	/* for marking IOs originated from same FS bio in same context */
> +	__REQ_POLL_CTX,
> +
>  	/* for driver use */
>  	__REQ_DRV,
>  	__REQ_SWAP,		/* swapping request. */
> @@ -418,6 +421,7 @@ enum req_flag_bits {
>  
>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>  #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
> +#define REQ_POLL_CTX			(1ULL << __REQ_POLL_CTX)
>  
>  #define REQ_DRV			(1ULL << __REQ_DRV)
>  #define REQ_SWAP		(1ULL << __REQ_SWAP)
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
  2021-03-25  6:34     ` [dm-devel] " JeffleXu
@ 2021-03-25  8:05       ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  8:05 UTC (permalink / raw)
  To: JeffleXu; +Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel

On Thu, Mar 25, 2021 at 02:34:18PM +0800, JeffleXu wrote:
> 
> 
> On 3/24/21 8:19 PM, Ming Lei wrote:
> > Currently bio based IO polling needs to poll all hw queue blindly, this
> > way is very inefficient, and one big reason is that we can't pass any
> > bio submission result to blk_poll().
> > 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and store returned 'cookie' in
> > bio->bi_iter.bi_private_data, and return current->pid to caller of
> > submit_bio() for any bio based driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, then we can find bios from the per-task io pull context of
> > submission context. Moving bios from submission queue to poll queue of
> > the poll context, and keep polling until these bios are ended. Remove
> > bio from poll queue if the bio is ended. Add bio flags of BIO_DONE and
> > BIO_END_BY_POLL for such purpose.
> > 
> > It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> > submission queue as queue depth is increased, so a new mechanism for
> > tracking bios is needed. So far bio's size is close to 2 cacheline size,
> > and it may not be accepted to add new field into bio for solving the
> > scalability issue by tracking bios via linked list, switch to bio group
> > list for tracking bio, the idea is to reuse .bi_end_io for linking bios
> > into a linked list for all sharing same .bi_end_io(call it bio group),
> > which is recovered before ending bio really, since BIO_END_BY_POLL is
> > added for enhancing this point. Usually .bi_end_io is the same for all
> > bios in same layer, so it is enough to provide very limited groups, such
> > as 16 or less for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Also when the submission task is exiting, drain pending IOs in the context
> > until all are done.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 +
> >  block/blk-core.c          | 154 ++++++++++++++++++++++++-
> >  block/blk-ioc.c           |   2 +
> >  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |  10 ++
> >  include/linux/blk_types.h |  18 ++-
> >  6 files changed, 419 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index eb07d61cfdc2..95f7e36c8759 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +
> > +	if (i < list->max_nr_grps) {
> > +		grp = &list->head[i];
> > +		bio_list_init(&grp->list);
> > +		return i;
> > +	}
> > +
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> > +			dst->head[j].grp_data = grp->grp_data;
> > +			dst->nr_grps++;
> > +		}
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_POLL_CTX;
> >  }
> >  
> > +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> > +{
> > +	/*
> > +	 * The bio has been added to per-task poll queue, mark it as
> > +	 * END_BY_POLL, so that this bio is always completed from
> > +	 * blk_poll() which is provided with cookied from this bio's
> > +	 * submission.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +}
> > +
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submission, so simply don't support bio based polling for this
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		blk_bio_poll_mark_queued(bio, queued);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +		blk_bio_poll_mark_queued(bio, false);
> > +	}
> > +
> > +	return queued;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
> 
> 
> I see no point in enqueueing the bio into the context->sq when
> REQ_HIPRI is cleared while REQ_POLL_CTX is still set for the bio;
> BLK_QC_T_NONE is returned in this case. This combination is possible
> since commit cc29e1bf0d63 ("block: disable iopoll for split bio").

The bio has to be enqueued before submission, and once it is enqueued
it has to be ended by blk_poll(); this actually simplifies the polled
bio's lifetime a lot, no matter whether the bio is really completed via
poll or via irq.

By the time submit_bio() returns BLK_QC_T_NONE, the bio may have been
completed already, and we shouldn't touch that bio any more, otherwise
things can become quite complicated.
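
To restate the lifetime rule this relies on (my summary, not patch
text):

/*
 * Once a bio has been queued to the per-task submission queue:
 *
 *   - blk_bio_poll_mark_queued() sets BIO_END_BY_POLL;
 *   - any completion (IRQ or poll) reaches bio_endio(), which sees
 *     BIO_END_BY_POLL and only marks the bio BIO_DONE;
 *   - the reap loop in blk_bio_poll_and_end_io() later finds BIO_DONE,
 *     restores ->bi_end_io from grp_data, clears BIO_END_BY_POLL and
 *     calls bio_endio() for real.
 *
 * So the final bio_endio() always runs from the poll side that owns
 * the cookie, no matter how the request itself completed.
 */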

> 
> 
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				bio_set_private_data(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +
> > +	__submit_bio_noacct_ctx(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (READ_ONCE(pc->sq->nr_grps))
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return BLK_QC_T_BIO_NONE;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> > +	__submit_bio_noacct_ctx(bio, NULL);
> > +
> > +	return BLK_QC_T_BIO_NONE;
> > +}
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index 5574c398eff6..b9a512f066f8 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
> >  
> >  static inline void free_io_context(struct io_context *ioc)
> >  {
> > +	blk_bio_poll_io_drain(ioc);
> > +
> 
> When multiple processes share one io_context, there may be a time
> window between the moment the IO submission process detaches the
> io_context and the moment the io_context's refcount finally drops to
> zero. I don't know whether it is possible that the other processes
> sharing the io_context never submit any IO, in which case the bios
> remaining in the io_context won't be reaped for a long time.
> 
> If that case is possible, could the sq be drained as soon as the
> process detaches the io_context?

free_io_context() is called only after the ioc's refcount drops to
zero, so any process sharing this ioc must have exited already.

> 
> 
> >  	kfree(ioc->data);
> >  	kmem_cache_free(iocontext_cachep, ioc);
> >  }
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..76a90da83d9c 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_private_data(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
> > +{
> > +	int ret = 0;
> > +	int i;
> > +
> > +	/*
> > +	 * Poll hw queue first.
> > +	 *
> > +	 * TODO: limit max poll times and make sure to not poll same
> > +	 * hw queue one more time.
> > +	 */
> > +	for (i = 0; i < grps->nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio *bio;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> > +			ret += blk_mq_poll_io(bio);
> > +	}
> > +
> > +	/* reap bios */
> > +	for (i = 0; i < grps->nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio *bio;
> > +		struct bio_list bl;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		bio_list_init(&bl);
> > +
> > +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> > +			if (bio_flagged(bio, BIO_DONE)) {
> > +				/* now recover original data */
> > +				bio->bi_poll = grp->grp_data;
> > +
> > +				/* clear BIO_END_BY_POLL and end me really */
> > +				bio_clear_flag(bio, BIO_END_BY_POLL);
> > +				bio_endio(bio);
> > +			} else {
> > +				__bio_grp_list_add(&bl, bio);
> > +			}
> > +		}
> > +		__bio_grp_list_merge(&grp->list, &bl);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
> > +{
> > +	int i, j, k = 0;
> > +	int cnt = 0;
> > +
> > +	for (i = grps->nr_grps - 1; i >= 0; i--) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio_grp_list_data *hole = NULL;
> > +
> > +		if (bio_grp_list_grp_empty(grp)) {
> > +			cnt++;
> > +			continue;
> > +		}
> > +
> 
> > +		for (j = k; j < i; j++) {
> > +			hole = &grps->head[j];
> > +			if (bio_grp_list_grp_empty(hole))
> > +				break;
> > +		}
> 
> Should it be
> 
> > +		for (j = k; j < i; j++) {
> > +			tmp = &grps->head[j];
> > +			if (bio_grp_list_grp_empty(tmp)) {
> > +                             hole = tmp;
> > +				break;
> > +                     }
> > +		}

Good catch!

> 
> > +		if (hole == NULL)
> > +			break;
> > +		*hole = *grp;
> > +		cnt++;
> > +		k = j;
> > +	}
> > +
> > +	grps->nr_grps -= cnt;
> > +}
> > +
> > +#define  MAX_BIO_GRPS_ON_STACK  8
> > +struct bio_grp_list_stack {
> > +	unsigned int max_nr_grps, nr_grps;
> > +	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
> > +};
> > +
> > +static int blk_bio_poll_io(struct io_context *submit_ioc,
> > +		struct io_context *poll_ioc)
> > +
> > +{
> > +	struct bio_grp_list_stack _bio_grps = {
> > +		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
> > +		.nr_grps	= 0
> > +	};
> > +	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
> > +		poll_ioc->data : NULL;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * Move IO submission result from submission queue in submission
> > +	 * context to poll queue of poll context.
> > +	 */
> > +	spin_lock(&submit_ctx->sq_lock);
> > +	bio_grp_list_move(bio_grps, submit_ctx->sq);
> > +	spin_unlock(&submit_ctx->sq_lock);
> > +
> > +	/* merge new bios first, then start to poll bios from pq */
> > +	if (poll_ctx) {
> > +		spin_lock(&poll_ctx->pq_lock);
> > +		bio_grp_list_move(poll_ctx->pq, bio_grps);
> > +		bio_grp_list_move(bio_grps, poll_ctx->pq);
> 
> What's the purpose of this two-step merge? Is it so that new bios
> (from sq) end up at the tail of the bio_list, and thus old bios (from
> pq) are polled first?

Yeah, so we can poll the old bios first. It also lets the subsequent
polling cover bios that have just come from the submission context.
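
A concrete example of the resulting ordering (illustration only):

/*
 * Suppose, on entry to blk_bio_poll_io(), for one bio group:
 *
 *   poll_ctx->pq : [old1, old2]   (left over from an earlier poll)
 *   bio_grps     : [new1]         (just moved from submit_ctx->sq)
 *
 * bio_grp_list_move(poll_ctx->pq, bio_grps):
 *   poll_ctx->pq : [old1, old2, new1]   (new bios appended at the tail)
 *   bio_grps     : []
 *
 * bio_grp_list_move(bio_grps, poll_ctx->pq):
 *   bio_grps     : [old1, old2, new1]
 *   poll_ctx->pq : []
 *
 * so the polling loop walks the older bios first while still covering
 * the bios that were only just submitted.
 */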

> 
> > +		spin_unlock(&poll_ctx->pq_lock);
> > +	}
> > +
> > +	do {
> > +		ret += blk_bio_poll_and_end_io(bio_grps);
> > +		blk_bio_poll_pack_groups(bio_grps);
> > +
> > +		if (bio_grps->nr_grps) {
> > +			/*
> > +			 * move back, and keep polling until all can be
> > +			 * held in either poll queue or submission queue.
> > +			 */
> > +			if (poll_ctx) {
> > +				spin_lock(&poll_ctx->pq_lock);
> > +				bio_grp_list_move(poll_ctx->pq, bio_grps);
> > +				spin_unlock(&poll_ctx->pq_lock);
> > +			} else {
> > +				spin_lock(&submit_ctx->sq_lock);
> > +				bio_grp_list_move(submit_ctx->sq, bio_grps);
> > +				spin_unlock(&submit_ctx->sq_lock);
> > +			}
> > +		}
> > +	} while (bio_grps->nr_grps > 0);
> > +
> > +	return ret;
> > +}
> > +
> > +void blk_bio_poll_io_drain(struct io_context *submit_ioc)
> > +{
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +
> > +	if (!submit_ctx)
> > +		return;
> > +
> > +	while (submit_ctx->sq->nr_grps > 0) {
> > +		blk_bio_poll_io(submit_ioc, NULL);
> > +		cpu_relax();
> > +	}
> > +}
> > +
> > +static bool blk_bio_ioc_valid(struct task_struct *t)
> > +{
> > +	if (!t)
> > +		return false;
> > +
> > +	if (!t->io_context)
> > +		return false;
> > +
> > +	if (!t->io_context->data)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int __blk_bio_poll(blk_qc_t cookie)
> > +{
> > +	struct io_context *poll_ioc = current->io_context;
> > +	pid_t pid;
> > +	struct task_struct *submit_task;
> > +	int ret;
> > +
> > +	pid = (pid_t)cookie;
> > +
> > +	/* io poll often share io submission context */
> > +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> > +		return blk_bio_poll_io(poll_ioc, poll_ioc);
> > +
> > +	submit_task = find_get_task_by_vpid(pid);
> > +	if (likely(blk_bio_ioc_valid(submit_task)))
> > +		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
> > +	else
> > +		ret = 0;
> > +
> > +	put_task_struct(submit_task);
> 
> put_task_struct() is not needed when @submit_task is NULL.

Good catch. Usually submit_task shouldn't be NULL, but the submission
task may have exited already.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
@ 2021-03-25  8:05       ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  8:05 UTC (permalink / raw)
  To: JeffleXu; +Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer

On Thu, Mar 25, 2021 at 02:34:18PM +0800, JeffleXu wrote:
> 
> 
> On 3/24/21 8:19 PM, Ming Lei wrote:
> > Currently bio based IO polling needs to poll all hw queue blindly, this
> > way is very inefficient, and one big reason is that we can't pass any
> > bio submission result to blk_poll().
> > 
> > In IO submission context, track associated underlying bios by per-task
> > submission queue and store returned 'cookie' in
> > bio->bi_iter.bi_private_data, and return current->pid to caller of
> > submit_bio() for any bio based driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, then we can find bios from the per-task io pull context of
> > submission context. Moving bios from submission queue to poll queue of
> > the poll context, and keep polling until these bios are ended. Remove
> > bio from poll queue if the bio is ended. Add bio flags of BIO_DONE and
> > BIO_END_BY_POLL for such purpose.
> > 
> > It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> > submission queue as queue depth is increased, so a new mechanism for
> > tracking bios is needed. So far bio's size is close to 2 cacheline size,
> > and it may not be accepted to add new field into bio for solving the
> > scalability issue by tracking bios via linked list, switch to bio group
> > list for tracking bio, the idea is to reuse .bi_end_io for linking bios
> > into a linked list for all sharing same .bi_end_io(call it bio group),
> > which is recovered before ending bio really, since BIO_END_BY_POLL is
> > added for enhancing this point. Usually .bi_end_io is the same for all
> > bios in same layer, so it is enough to provide very limited groups, such
> > as 16 or less for fixing the scalability issue.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Also when the submission task is exiting, drain pending IOs in the context
> > until all are done.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 +
> >  block/blk-core.c          | 154 ++++++++++++++++++++++++-
> >  block/blk-ioc.c           |   2 +
> >  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
> >  block/blk.h               |  10 ++
> >  include/linux/blk_types.h |  18 ++-
> >  6 files changed, 419 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 26b7f721cda8..04c043dc60fc 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index eb07d61cfdc2..95f7e36c8759 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >  		sizeof(struct bio_grp_list_data);
> >  }
> >  
> > +static inline void *bio_grp_data(struct bio *bio)
> > +{
> > +	return bio->bi_poll;
> > +}
> > +
> > +/* add bio into bio group list, return true if it is added */
> > +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == bio_grp_data(bio)) {
> > +			__bio_grp_list_add(&grp->list, bio);
> > +			return true;
> > +		}
> > +	}
> > +
> > +	if (i == list->max_nr_grps)
> > +		return false;
> > +
> > +	/* create a new group */
> > +	grp = &list->head[i];
> > +	bio_list_init(&grp->list);
> > +	grp->grp_data = bio_grp_data(bio);
> > +	__bio_grp_list_add(&grp->list, bio);
> > +	list->nr_grps++;
> > +
> > +	return true;
> > +}
> > +
> > +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> > +{
> > +	int i;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = 0; i < list->nr_grps; i++) {
> > +		grp = &list->head[i];
> > +		if (grp->grp_data == grp_data)
> > +			return i;
> > +	}
> > +
> > +	if (i < list->max_nr_grps) {
> > +		grp = &list->head[i];
> > +		bio_list_init(&grp->list);
> > +		return i;
> > +	}
> > +
> > +	return -1;
> > +}
> > +
> > +/* Move as many as possible groups from 'src' to 'dst' */
> > +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> > +{
> > +	int i, j, cnt = 0;
> > +	struct bio_grp_list_data *grp;
> > +
> > +	for (i = src->nr_grps - 1; i >= 0; i--) {
> > +		grp = &src->head[i];
> > +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> > +		if (j < 0)
> > +			break;
> > +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> > +			dst->head[j].grp_data = grp->grp_data;
> > +			dst->nr_grps++;
> > +		}
> > +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> > +		bio_list_init(&grp->list);
> > +		cnt++;
> > +	}
> > +
> > +	src->nr_grps -= cnt;
> > +}
> > +
> >  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >  {
> >  	pc->sq = (void *)pc + sizeof(*pc);
> > @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_POLL_CTX;
> >  }
> >  
> > +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> > +{
> > +	/*
> > +	 * The bio has been added to per-task poll queue, mark it as
> > +	 * END_BY_POLL, so that this bio is always completed from
> > +	 * blk_poll() which is provided with cookied from this bio's
> > +	 * submission.
> > +	 */
> > +	if (!queued)
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> > +	else
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +}
> > +
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/*
> > +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> > +	 * and completion. However, bio crypt may update .bi_end_io during
> > +	 * submission, so simply don't support bio based polling for this
> > +	 * setting.
> > +	 */
> > +	if (likely(!bio_has_crypt_ctx(bio))) {
> > +		/* track this bio via bio group list */
> > +		spin_lock(&pc->sq_lock);
> > +		queued = bio_grp_list_add(pc->sq, bio);
> > +		blk_bio_poll_mark_queued(bio, queued);
> > +		spin_unlock(&pc->sq_lock);
> > +	} else {
> > +		queued = false;
> > +		blk_bio_poll_mark_queued(bio, false);
> > +	}
> > +
> > +	return queued;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
> 
> 
> I see no point in enqueueing the bio into the context->sq when
> REQ_HIPRI is cleared while REQ_POLL_CTX is still set for the bio;
> BLK_QC_T_NONE is returned in this case. This combination is possible
> since commit cc29e1bf0d63 ("block: disable iopoll for split bio").

The bio has to be enqueued before submission, and once it is enqueued
it has to be ended by blk_poll(); this actually simplifies the polled
bio's lifetime a lot, no matter whether the bio is really completed via
poll or via irq.

By the time submit_bio() returns BLK_QC_T_NONE, the bio may have been
completed already, and we shouldn't touch that bio any more, otherwise
things can become quite complicated.

> 
> 
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				bio_set_private_data(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +
> > +	__submit_bio_noacct_ctx(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (READ_ONCE(pc->sq->nr_grps))
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return BLK_QC_T_BIO_NONE;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> > +	__submit_bio_noacct_ctx(bio, NULL);
> > +
> > +	return BLK_QC_T_BIO_NONE;
> > +}
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index 5574c398eff6..b9a512f066f8 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
> >  
> >  static inline void free_io_context(struct io_context *ioc)
> >  {
> > +	blk_bio_poll_io_drain(ioc);
> > +
> 
> When multiple processes share one io_context, there may be a time
> window between the moment the IO submission process detaches the
> io_context and the moment the io_context's refcount finally drops to
> zero. I don't know whether it is possible that the other processes
> sharing the io_context never submit any IO, in which case the bios
> remaining in the io_context won't be reaped for a long time.
> 
> If that case is possible, could the sq be drained as soon as the
> process detaches the io_context?

free_io_context() is called only after the ioc's refcount drops to
zero, so any process sharing this ioc must have exited already.

> 
> 
> >  	kfree(ioc->data);
> >  	kmem_cache_free(iocontext_cachep, ioc);
> >  }
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..76a90da83d9c 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_private_data(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
> > +{
> > +	int ret = 0;
> > +	int i;
> > +
> > +	/*
> > +	 * Poll hw queue first.
> > +	 *
> > +	 * TODO: limit max poll times and make sure to not poll same
> > +	 * hw queue one more time.
> > +	 */
> > +	for (i = 0; i < grps->nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio *bio;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
> > +			ret += blk_mq_poll_io(bio);
> > +	}
> > +
> > +	/* reap bios */
> > +	for (i = 0; i < grps->nr_grps; i++) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio *bio;
> > +		struct bio_list bl;
> > +
> > +		if (bio_grp_list_grp_empty(grp))
> > +			continue;
> > +
> > +		bio_list_init(&bl);
> > +
> > +		while ((bio = __bio_grp_list_pop(&grp->list))) {
> > +			if (bio_flagged(bio, BIO_DONE)) {
> > +				/* now recover original data */
> > +				bio->bi_poll = grp->grp_data;
> > +
> > +				/* clear BIO_END_BY_POLL and end me really */
> > +				bio_clear_flag(bio, BIO_END_BY_POLL);
> > +				bio_endio(bio);
> > +			} else {
> > +				__bio_grp_list_add(&bl, bio);
> > +			}
> > +		}
> > +		__bio_grp_list_merge(&grp->list, &bl);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
> > +{
> > +	int i, j, k = 0;
> > +	int cnt = 0;
> > +
> > +	for (i = grps->nr_grps - 1; i >= 0; i--) {
> > +		struct bio_grp_list_data *grp = &grps->head[i];
> > +		struct bio_grp_list_data *hole = NULL;
> > +
> > +		if (bio_grp_list_grp_empty(grp)) {
> > +			cnt++;
> > +			continue;
> > +		}
> > +
> 
> > +		for (j = k; j < i; j++) {
> > +			hole = &grps->head[j];
> > +			if (bio_grp_list_grp_empty(hole))
> > +				break;
> > +		}
> 
> Should be
> 
> > +		for (j = k; j < i; j++) {
> > +			tmp = &grps->head[j];
> > +			if (bio_grp_list_grp_empty(tmp)) {
> > +				hole = tmp;
> > +				break;
> > +			}
> > +		}

Good catch!

> 
> > +		if (hole == NULL)
> > +			break;
> > +		*hole = *grp;
> > +		cnt++;
> > +		k = j;
> > +	}
> > +
> > +	grps->nr_grps -= cnt;
> > +}
> > +
> > +#define  MAX_BIO_GRPS_ON_STACK  8
> > +struct bio_grp_list_stack {
> > +	unsigned int max_nr_grps, nr_grps;
> > +	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
> > +};
> > +
> > +static int blk_bio_poll_io(struct io_context *submit_ioc,
> > +		struct io_context *poll_ioc)
> > +
> > +{
> > +	struct bio_grp_list_stack _bio_grps = {
> > +		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
> > +		.nr_grps	= 0
> > +	};
> > +	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
> > +		poll_ioc->data : NULL;
> > +	int ret = 0;
> > +
> > +	/*
> > +	 * Move IO submission result from submission queue in submission
> > +	 * context to poll queue of poll context.
> > +	 */
> > +	spin_lock(&submit_ctx->sq_lock);
> > +	bio_grp_list_move(bio_grps, submit_ctx->sq);
> > +	spin_unlock(&submit_ctx->sq_lock);
> > +
> > +	/* merge new bios first, then start to poll bios from pq */
> > +	if (poll_ctx) {
> > +		spin_lock(&poll_ctx->pq_lock);
> > +		bio_grp_list_move(poll_ctx->pq, bio_grps);
> > +		bio_grp_list_move(bio_grps, poll_ctx->pq);
> 
> What's the purpose of this two-step merge? Is that for new bios (from
> sq) is at the tail of the bio_list, and thus old bios (from pq) is
> polled first?

Yeah, so we can poll old bios first. Also the following polling pass can
then cover the new bios that just came from the submission context too.
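
To make the effect explicit (illustration only, names as in the patch,
per bio group, and assuming everything fits in the on-stack list):

	/*
	 * before:  bio_grps (moved from sq) = [new1, new2]
	 *          poll_ctx->pq             = [old1, old2]
	 *
	 * step 1:  bio_grp_list_move(poll_ctx->pq, bio_grps)
	 *              -> pq = [old1, old2, new1, new2]
	 * step 2:  bio_grp_list_move(bio_grps, poll_ctx->pq)
	 *              -> bio_grps = [old1, old2, new1, new2]
	 *
	 * so blk_bio_poll_and_end_io() walks the old bios first, while new
	 * bios from this submission context are covered in the same pass.
	 */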

> 
> > +		spin_unlock(&poll_ctx->pq_lock);
> > +	}
> > +
> > +	do {
> > +		ret += blk_bio_poll_and_end_io(bio_grps);
> > +		blk_bio_poll_pack_groups(bio_grps);
> > +
> > +		if (bio_grps->nr_grps) {
> > +			/*
> > +			 * move back, and keep polling until all can be
> > +			 * held in either poll queue or submission queue.
> > +			 */
> > +			if (poll_ctx) {
> > +				spin_lock(&poll_ctx->pq_lock);
> > +				bio_grp_list_move(poll_ctx->pq, bio_grps);
> > +				spin_unlock(&poll_ctx->pq_lock);
> > +			} else {
> > +				spin_lock(&submit_ctx->sq_lock);
> > +				bio_grp_list_move(submit_ctx->sq, bio_grps);
> > +				spin_unlock(&submit_ctx->sq_lock);
> > +			}
> > +		}
> > +	} while (bio_grps->nr_grps > 0);
> > +
> > +	return ret;
> > +}
> > +
> > +void blk_bio_poll_io_drain(struct io_context *submit_ioc)
> > +{
> > +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> > +
> > +	if (!submit_ctx)
> > +		return;
> > +
> > +	while (submit_ctx->sq->nr_grps > 0) {
> > +		blk_bio_poll_io(submit_ioc, NULL);
> > +		cpu_relax();
> > +	}
> > +}
> > +
> > +static bool blk_bio_ioc_valid(struct task_struct *t)
> > +{
> > +	if (!t)
> > +		return false;
> > +
> > +	if (!t->io_context)
> > +		return false;
> > +
> > +	if (!t->io_context->data)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +
> > +static int __blk_bio_poll(blk_qc_t cookie)
> > +{
> > +	struct io_context *poll_ioc = current->io_context;
> > +	pid_t pid;
> > +	struct task_struct *submit_task;
> > +	int ret;
> > +
> > +	pid = (pid_t)cookie;
> > +
> > +	/* io poll often share io submission context */
> > +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> > +		return blk_bio_poll_io(poll_ioc, poll_ioc);
> > +
> > +	submit_task = find_get_task_by_vpid(pid);
> > +	if (likely(blk_bio_ioc_valid(submit_task)))
> > +		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
> > +	else
> > +		ret = 0;
> > +
> > +	put_task_struct(submit_task);
> 
> put_task_struct() is not needed when @submit_task is NULL.

Good catch; usually submit_task shouldn't be NULL, but the task may have
exited already.
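
Something like this should cover it (untested sketch of the tail of
__blk_bio_poll(), just to capture the fix; the final "return ret" is
assumed from the context above):

	submit_task = find_get_task_by_vpid(pid);
	if (likely(blk_bio_ioc_valid(submit_task)))
		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
	else
		ret = 0;

	if (submit_task)
		put_task_struct(submit_task);

	return ret;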


Thanks, 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
  2021-03-25  8:05       ` [dm-devel] " Ming Lei
@ 2021-03-25  9:18         ` JeffleXu
  -1 siblings, 0 replies; 74+ messages in thread
From: JeffleXu @ 2021-03-25  9:18 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel



On 3/25/21 4:05 PM, Ming Lei wrote:
> On Thu, Mar 25, 2021 at 02:34:18PM +0800, JeffleXu wrote:
>>
>>
>> On 3/24/21 8:19 PM, Ming Lei wrote:
>>> Currently bio based IO polling needs to poll all hw queue blindly, this
>>> way is very inefficient, and one big reason is that we can't pass any
>>> bio submission result to blk_poll().
>>>
>>> In IO submission context, track associated underlying bios by per-task
>>> submission queue and store returned 'cookie' in
>>> bio->bi_iter.bi_private_data, and return current->pid to caller of
>>> submit_bio() for any bio based driver's IO, which is submitted from FS.
>>>
>>> In IO poll context, the passed cookie tells us the PID of submission
>>> context, then we can find bios from the per-task io poll context of
>>> submission context. Moving bios from submission queue to poll queue of
>>> the poll context, and keep polling until these bios are ended. Remove
>>> bio from poll queue if the bio is ended. Add bio flags of BIO_DONE and
>>> BIO_END_BY_POLL for such purpose.
>>>
>>> It was found in Jeffle Xu's test that kfifo doesn't scale well for a
>>> submission queue as queue depth is increased, so a new mechanism for
>>> tracking bios is needed. So far bio's size is close to 2 cacheline size,
>>> and it may not be accepted to add new field into bio for solving the
>>> scalability issue by tracking bios via linked list, switch to bio group
>>> list for tracking bio, the idea is to reuse .bi_end_io for linking bios
>>> into a linked list for all sharing same .bi_end_io(call it bio group),
>>> which is recovered before ending bio really, since BIO_END_BY_POLL is
>>> added for enhancing this point. Usually .bi_end_io is the same for all
>>> bios in same layer, so it is enough to provide very limited groups, such
>>> as 16 or less for fixing the scalability issue.
>>>
>>> Usually submission shares context with io poll. The per-task poll context
>>> is just like stack variable, and it is cheap to move data between the two
>>> per-task queues.
>>>
>>> Also when the submission task is exiting, drain pending IOs in the context
>>> until all are done.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/bio.c               |   5 +
>>>  block/blk-core.c          | 154 ++++++++++++++++++++++++-
>>>  block/blk-ioc.c           |   2 +
>>>  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
>>>  block/blk.h               |  10 ++
>>>  include/linux/blk_types.h |  18 ++-
>>>  6 files changed, 419 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index 26b7f721cda8..04c043dc60fc 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>   **/
>>>  void bio_endio(struct bio *bio)
>>>  {
>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>> +		bio_set_flag(bio, BIO_DONE);
>>> +		return;
>>> +	}
>>>  again:
>>>  	if (!bio_remaining_done(bio))
>>>  		return;
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index eb07d61cfdc2..95f7e36c8759 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
>>>  		sizeof(struct bio_grp_list_data);
>>>  }
>>>  
>>> +static inline void *bio_grp_data(struct bio *bio)
>>> +{
>>> +	return bio->bi_poll;
>>> +}
>>> +
>>> +/* add bio into bio group list, return true if it is added */
>>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
>>> +{
>>> +	int i;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = 0; i < list->nr_grps; i++) {
>>> +		grp = &list->head[i];
>>> +		if (grp->grp_data == bio_grp_data(bio)) {
>>> +			__bio_grp_list_add(&grp->list, bio);
>>> +			return true;
>>> +		}
>>> +	}
>>> +
>>> +	if (i == list->max_nr_grps)
>>> +		return false;
>>> +
>>> +	/* create a new group */
>>> +	grp = &list->head[i];
>>> +	bio_list_init(&grp->list);
>>> +	grp->grp_data = bio_grp_data(bio);
>>> +	__bio_grp_list_add(&grp->list, bio);
>>> +	list->nr_grps++;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
>>> +{
>>> +	int i;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = 0; i < list->nr_grps; i++) {
>>> +		grp = &list->head[i];
>>> +		if (grp->grp_data == grp_data)
>>> +			return i;
>>> +	}
>>> +
>>> +	if (i < list->max_nr_grps) {
>>> +		grp = &list->head[i];
>>> +		bio_list_init(&grp->list);
>>> +		return i;
>>> +	}
>>> +
>>> +	return -1;
>>> +}
>>> +
>>> +/* Move as many as possible groups from 'src' to 'dst' */
>>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
>>> +{
>>> +	int i, j, cnt = 0;
>>> +	struct bio_grp_list_data *grp;
>>> +
>>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
>>> +		grp = &src->head[i];
>>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
>>> +		if (j < 0)
>>> +			break;
>>> +		if (bio_grp_list_grp_empty(&dst->head[j])) {
>>> +			dst->head[j].grp_data = grp->grp_data;
>>> +			dst->nr_grps++;
>>> +		}
>>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
>>> +		bio_list_init(&grp->list);
>>> +		cnt++;
>>> +	}
>>> +
>>> +	src->nr_grps -= cnt;
>>> +}
>>> +
>>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
>>>  {
>>>  	pc->sq = (void *)pc + sizeof(*pc);
>>> @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>  		bio->bi_opf |= REQ_POLL_CTX;
>>>  }
>>>  
>>> +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
>>> +{
>>> +	/*
>>> +	 * The bio has been added to per-task poll queue, mark it as
>>> +	 * END_BY_POLL, so that this bio is always completed from
>>> +	 * blk_poll() which is provided with cookied from this bio's
>>> +	 * submission.
>>> +	 */
>>> +	if (!queued)
>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
>>> +	else
>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>> +}
>>> +
>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	unsigned int queued;
>>> +
>>> +	/*
>>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
>>> +	 * and completion. However, bio crypt may update .bi_end_io during
>>> +	 * submission, so simply don't support bio based polling for this
>>> +	 * setting.
>>> +	 */
>>> +	if (likely(!bio_has_crypt_ctx(bio))) {
>>> +		/* track this bio via bio group list */
>>> +		spin_lock(&pc->sq_lock);
>>> +		queued = bio_grp_list_add(pc->sq, bio);
>>> +		blk_bio_poll_mark_queued(bio, queued);
>>> +		spin_unlock(&pc->sq_lock);
>>> +	} else {
>>> +		queued = false;
>>> +		blk_bio_poll_mark_queued(bio, false);
>>> +	}
>>> +
>>> +	return queued;
>>> +}
>>> +
>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>  {
>>>  	struct block_device *bdev = bio->bi_bdev;
>>> @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>   */
>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
>>>  {
>>>  	struct bio_list bio_list_on_stack[2];
>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>> @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>  
>>> -		ret = __submit_bio(bio);
>>> +		if (ioc && queue_is_mq(q) &&
>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
>>
>>
>> I can see no sense to enqueue the bio into the context->sq when
>> REQ_HIPRI is cleared while REQ_POLL_CTX is set for the bio.
>> BLK_QC_T_NONE is returned in this case. This is possible since commit
>> cc29e1bf0d63 ("block: disable iopoll for split bio").
> 
> bio has to be enqueued before submission, and once it is enqueued, it has
> to be ended by blk_poll(), this way actually simplifies polled bio lifetime
> a lot, no matter if this bio is really completed via poll or irq.
> 
> When submit_bio() is returning BLK_QC_T_NONE, this bio may have been
> completed already, and we shouldn't touch that bio any more, otherwise
> things can become quite complicated.
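
(Restating the rule as I understand it, just as an illustration, using
the names from the patch:

	queued = blk_bio_poll_prep_submit(ioc, bio);	/* tracked, BIO_END_BY_POLL set */
	ret = __submit_bio(bio);			/* may complete here via irq, but
							 * bio_endio() only marks BIO_DONE;
							 * it is really ended from blk_poll() */
	if (queued)
		bio_set_private_data(bio, ret);		/* hence still safe to touch */

so a bio that was never queued must not be touched after __submit_bio().)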

I mean, would the following be adequate?

if (ioc && queue_is_mq(q) &&(bio->bi_opf & REQ_HIPRI)) {
    queued = blk_bio_poll_prep_submit(ioc, bio);
    ret = __submit_bio(bio);
    if (queued)
        bio_set_private_data(bio, ret);
}


Only REQ_HIPRI is checked here, thus bios with REQ_POLL_CTX but without
REQ_HIPRI needn't be enqueued into poll_context->sq.

For the original bio (with REQ_HIPRI | REQ_POLL_CTX), it's enqueued into
poll_context->sq, while the following split bios (only with
REQ_POLL_CTX, since REQ_HIPRI is cleared by commit cc29e1bf0d63 ("block:
disable iopoll for split bio")) needn't be enqueued into
poll_context->sq. Since these bios (only with REQ_POLL_CTX) are not
enqueued, you won't touch them anymore and you needn't care about their
lifetime.

Your original code works, and my optimization here could make the code
clearer and faster. However, I'm not sure about the effect of the
optimization, since the scenario addressed by commit cc29e1bf0d63
("block: disable iopoll for split bio") should be rare.


> 
>>
>>
>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>> +
>>> +			ret = __submit_bio(bio);
>>> +			if (queued)
>>> +				bio_set_private_data(bio, ret);
>>> +		} else {
>>> +			ret = __submit_bio(bio);
>>> +		}
>>>  
>>>  		/*
>>>  		 * Sort new bios into those for a lower level and those for the
>>> @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  	return ret;
>>>  }
>>>  
>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>> +		struct io_context *ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +
>>> +	__submit_bio_noacct_ctx(bio, ioc);
>>> +
>>> +	/* bio submissions queued to per-task poll context */
>>> +	if (READ_ONCE(pc->sq->nr_grps))
>>> +		return current->pid;
>>> +
>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>> +	return BLK_QC_T_BIO_NONE;
>>> +}
>>> +
>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +{
>>> +	struct io_context *ioc = current->io_context;
>>> +
>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>> +
>>> +	__submit_bio_noacct_ctx(bio, NULL);
>>> +
>>> +	return BLK_QC_T_BIO_NONE;
>>> +}
>>> +
>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>  {
>>>  	struct bio_list bio_list[2] = { };
>>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
>>> index 5574c398eff6..b9a512f066f8 100644
>>> --- a/block/blk-ioc.c
>>> +++ b/block/blk-ioc.c
>>> @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
>>>  
>>>  static inline void free_io_context(struct io_context *ioc)
>>>  {
>>> +	blk_bio_poll_io_drain(ioc);
>>> +
>>
>> When multiple processes share one io_context, there may be a time
>> window between the moment the IO submission process detaches the
>> io_context and the moment the io_context's refcount finally drops to
>> zero. I don't know if it is possible that the other process sharing
>> the io_context won't submit any IO, in which case the bios remaining
>> in the io_context won't be reaped for a long time.
>>
>> If the above case is possible, then is it possible to drain the sq once
>> the process detaches the io_context?
> 
> free_io_context() is called after the ioc's refcount drops to zero, so
> any process sharing this ioc must have exited already.

Just like I said: if the process to which the returned cookie refers
has exited, while the corresponding io_context still exists for a long
time (it's shared by other processes, so the refcount stays above
zero), then the bios in io_context->sq have no chance to be reaped.
That happens when the following two conditions hold at the same time:

1) the process to which the returned cookie refers has exited
2) the io_context is shared by other processes, while these processes
don't submit HIPRI IO, so blk_poll() won't be called for this
io_context.

Though the case may be rare.

> 
>>
>>
>>>  	kfree(ioc->data);
>>>  	kmem_cache_free(iocontext_cachep, ioc);
>>>  }
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 03f59915fe2c..76a90da83d9c 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3865,14 +3865,246 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>  	return ret;
>>>  }
>>>  
>>> +static int blk_mq_poll_io(struct bio *bio)
>>> +{
>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>> +	blk_qc_t cookie = bio_get_private_data(bio);
>>> +	int ret = 0;
>>> +
>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>> +		struct blk_mq_hw_ctx *hctx =
>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>> +
>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static int blk_bio_poll_and_end_io(struct bio_grp_list *grps)
>>> +{
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Poll hw queue first.
>>> +	 *
>>> +	 * TODO: limit max poll times and make sure to not poll same
>>> +	 * hw queue one more time.
>>> +	 */
>>> +	for (i = 0; i < grps->nr_grps; i++) {
>>> +		struct bio_grp_list_data *grp = &grps->head[i];
>>> +		struct bio *bio;
>>> +
>>> +		if (bio_grp_list_grp_empty(grp))
>>> +			continue;
>>> +
>>> +		for (bio = grp->list.head; bio; bio = bio->bi_poll)
>>> +			ret += blk_mq_poll_io(bio);
>>> +	}
>>> +
>>> +	/* reap bios */
>>> +	for (i = 0; i < grps->nr_grps; i++) {
>>> +		struct bio_grp_list_data *grp = &grps->head[i];
>>> +		struct bio *bio;
>>> +		struct bio_list bl;
>>> +
>>> +		if (bio_grp_list_grp_empty(grp))
>>> +			continue;
>>> +
>>> +		bio_list_init(&bl);
>>> +
>>> +		while ((bio = __bio_grp_list_pop(&grp->list))) {
>>> +			if (bio_flagged(bio, BIO_DONE)) {
>>> +				/* now recover original data */
>>> +				bio->bi_poll = grp->grp_data;
>>> +
>>> +				/* clear BIO_END_BY_POLL and end me really */
>>> +				bio_clear_flag(bio, BIO_END_BY_POLL);
>>> +				bio_endio(bio);
>>> +			} else {
>>> +				__bio_grp_list_add(&bl, bio);
>>> +			}
>>> +		}
>>> +		__bio_grp_list_merge(&grp->list, &bl);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static void blk_bio_poll_pack_groups(struct bio_grp_list *grps)
>>> +{
>>> +	int i, j, k = 0;
>>> +	int cnt = 0;
>>> +
>>> +	for (i = grps->nr_grps - 1; i >= 0; i--) {
>>> +		struct bio_grp_list_data *grp = &grps->head[i];
>>> +		struct bio_grp_list_data *hole = NULL;
>>> +
>>> +		if (bio_grp_list_grp_empty(grp)) {
>>> +			cnt++;
>>> +			continue;
>>> +		}
>>> +
>>
>>> +		for (j = k; j < i; j++) {
>>> +			hole = &grps->head[j];
>>> +			if (bio_grp_list_grp_empty(hole))
>>> +				break;
>>> +		}
>>
>> Should be
>>
>>> +		for (j = k; j < i; j++) {
>>> +			tmp = &grps->head[j];
>>> +			if (bio_grp_list_grp_empty(tmp)) {
>>> +				hole = tmp;
>>> +				break;
>>> +			}
>>> +		}
> 
> Good catch!
> 
>>
>>> +		if (hole == NULL)
>>> +			break;
>>> +		*hole = *grp;
>>> +		cnt++;
>>> +		k = j;
>>> +	}
>>> +
>>> +	grps->nr_grps -= cnt;
>>> +}
>>> +
>>> +#define  MAX_BIO_GRPS_ON_STACK  8
>>> +struct bio_grp_list_stack {
>>> +	unsigned int max_nr_grps, nr_grps;
>>> +	struct bio_grp_list_data head[MAX_BIO_GRPS_ON_STACK];
>>> +};
>>> +
>>> +static int blk_bio_poll_io(struct io_context *submit_ioc,
>>> +		struct io_context *poll_ioc)
>>> +
>>> +{
>>> +	struct bio_grp_list_stack _bio_grps = {
>>> +		.max_nr_grps	= ARRAY_SIZE(_bio_grps.head),
>>> +		.nr_grps	= 0
>>> +	};
>>> +	struct bio_grp_list *bio_grps = (struct bio_grp_list *)&_bio_grps;
>>> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
>>> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc ?
>>> +		poll_ioc->data : NULL;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * Move IO submission result from submission queue in submission
>>> +	 * context to poll queue of poll context.
>>> +	 */
>>> +	spin_lock(&submit_ctx->sq_lock);
>>> +	bio_grp_list_move(bio_grps, submit_ctx->sq);
>>> +	spin_unlock(&submit_ctx->sq_lock);
>>> +
>>> +	/* merge new bios first, then start to poll bios from pq */
>>> +	if (poll_ctx) {
>>> +		spin_lock(&poll_ctx->pq_lock);
>>> +		bio_grp_list_move(poll_ctx->pq, bio_grps);
>>> +		bio_grp_list_move(bio_grps, poll_ctx->pq);
>>
>> What's the purpose of this two-step merge? Is that for new bios (from
>> sq) is at the tail of the bio_list, and thus old bios (from pq) is
>> polled first?
> 
> Yeah, so we can poll old bios first. Also the following polling pass can
> then cover the new bios that just came from the submission context too.
> 
>>
>>> +		spin_unlock(&poll_ctx->pq_lock);
>>> +	}
>>> +
>>> +	do {
>>> +		ret += blk_bio_poll_and_end_io(bio_grps);
>>> +		blk_bio_poll_pack_groups(bio_grps);
>>> +
>>> +		if (bio_grps->nr_grps) {
>>> +			/*
>>> +			 * move back, and keep polling until all can be
>>> +			 * held in either poll queue or submission queue.
>>> +			 */
>>> +			if (poll_ctx) {
>>> +				spin_lock(&poll_ctx->pq_lock);
>>> +				bio_grp_list_move(poll_ctx->pq, bio_grps);
>>> +				spin_unlock(&poll_ctx->pq_lock);
>>> +			} else {
>>> +				spin_lock(&submit_ctx->sq_lock);
>>> +				bio_grp_list_move(submit_ctx->sq, bio_grps);
>>> +				spin_unlock(&submit_ctx->sq_lock);
>>> +			}
>>> +		}
>>> +	} while (bio_grps->nr_grps > 0);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +void blk_bio_poll_io_drain(struct io_context *submit_ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
>>> +
>>> +	if (!submit_ctx)
>>> +		return;
>>> +
>>> +	while (submit_ctx->sq->nr_grps > 0) {
>>> +		blk_bio_poll_io(submit_ioc, NULL);
>>> +		cpu_relax();
>>> +	}
>>> +}
>>> +
>>> +static bool blk_bio_ioc_valid(struct task_struct *t)
>>> +{
>>> +	if (!t)
>>> +		return false;
>>> +
>>> +	if (!t->io_context)
>>> +		return false;
>>> +
>>> +	if (!t->io_context->data)
>>> +		return false;
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static int __blk_bio_poll(blk_qc_t cookie)
>>> +{
>>> +	struct io_context *poll_ioc = current->io_context;
>>> +	pid_t pid;
>>> +	struct task_struct *submit_task;
>>> +	int ret;
>>> +
>>> +	pid = (pid_t)cookie;
>>> +
>>> +	/* io poll often share io submission context */
>>> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
>>> +		return blk_bio_poll_io(poll_ioc, poll_ioc);
>>> +
>>> +	submit_task = find_get_task_by_vpid(pid);
>>> +	if (likely(blk_bio_ioc_valid(submit_task)))
>>> +		ret = blk_bio_poll_io(submit_task->io_context, poll_ioc);
>>> +	else
>>> +		ret = 0;
>>> +
>>> +	put_task_struct(submit_task);
>>
>> put_task_struct() is not needed when @submit_task is NULL.
> 
> Good catch; usually submit_task shouldn't be NULL, but the task may have
> exited already.
> 
> 
> Thanks, 
> Ming
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 74+ messages in thread


* Re: [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
  2021-03-25  9:18         ` [dm-devel] " JeffleXu
@ 2021-03-25  9:56           ` Ming Lei
  -1 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  9:56 UTC (permalink / raw)
  To: JeffleXu; +Cc: Jens Axboe, linux-block, Mike Snitzer, dm-devel

On Thu, Mar 25, 2021 at 05:18:41PM +0800, JeffleXu wrote:
> 
> 
> On 3/25/21 4:05 PM, Ming Lei wrote:
> > On Thu, Mar 25, 2021 at 02:34:18PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 3/24/21 8:19 PM, Ming Lei wrote:
> >>> Currently bio based IO polling needs to poll all hw queue blindly, this
> >>> way is very inefficient, and one big reason is that we can't pass any
> >>> bio submission result to blk_poll().
> >>>
> >>> In IO submission context, track associated underlying bios by per-task
> >>> submission queue and store returned 'cookie' in
> >>> bio->bi_iter.bi_private_data, and return current->pid to caller of
> >>> submit_bio() for any bio based driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, then we can find bios from the per-task io poll context of
> >>> submission context. Moving bios from submission queue to poll queue of
> >>> the poll context, and keep polling until these bios are ended. Remove
> >>> bio from poll queue if the bio is ended. Add bio flags of BIO_DONE and
> >>> BIO_END_BY_POLL for such purpose.
> >>>
> >>> It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> >>> submission queue as queue depth is increased, so a new mechanism for
> >>> tracking bios is needed. So far bio's size is close to 2 cacheline size,
> >>> and it may not be accepted to add new field into bio for solving the
> >>> scalability issue by tracking bios via linked list, switch to bio group
> >>> list for tracking bio, the idea is to reuse .bi_end_io for linking bios
> >>> into a linked list for all sharing same .bi_end_io(call it bio group),
> >>> which is recovered before ending bio really, since BIO_END_BY_POLL is
> >>> added for enhancing this point. Usually .bi_end_io is the same for all
> >>> bios in same layer, so it is enough to provide very limited groups, such
> >>> as 16 or less for fixing the scalability issue.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Also when the submission task is exiting, drain pending IOs in the context
> >>> until all are done.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 +
> >>>  block/blk-core.c          | 154 ++++++++++++++++++++++++-
> >>>  block/blk-ioc.c           |   2 +
> >>>  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
> >>>  block/blk.h               |  10 ++
> >>>  include/linux/blk_types.h |  18 ++-
> >>>  6 files changed, 419 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index 26b7f721cda8..04c043dc60fc 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index eb07d61cfdc2..95f7e36c8759 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >>>  		sizeof(struct bio_grp_list_data);
> >>>  }
> >>>  
> >>> +static inline void *bio_grp_data(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_poll;
> >>> +}
> >>> +
> >>> +/* add bio into bio group list, return true if it is added */
> >>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == bio_grp_data(bio)) {
> >>> +			__bio_grp_list_add(&grp->list, bio);
> >>> +			return true;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (i == list->max_nr_grps)
> >>> +		return false;
> >>> +
> >>> +	/* create a new group */
> >>> +	grp = &list->head[i];
> >>> +	bio_list_init(&grp->list);
> >>> +	grp->grp_data = bio_grp_data(bio);
> >>> +	__bio_grp_list_add(&grp->list, bio);
> >>> +	list->nr_grps++;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == grp_data)
> >>> +			return i;
> >>> +	}
> >>> +
> >>> +	if (i < list->max_nr_grps) {
> >>> +		grp = &list->head[i];
> >>> +		bio_list_init(&grp->list);
> >>> +		return i;
> >>> +	}
> >>> +
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +/* Move as many as possible groups from 'src' to 'dst' */
> >>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> >>> +{
> >>> +	int i, j, cnt = 0;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> >>> +		grp = &src->head[i];
> >>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> >>> +		if (j < 0)
> >>> +			break;
> >>> +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> >>> +			dst->head[j].grp_data = grp->grp_data;
> >>> +			dst->nr_grps++;
> >>> +		}
> >>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> >>> +		bio_list_init(&grp->list);
> >>> +		cnt++;
> >>> +	}
> >>> +
> >>> +	src->nr_grps -= cnt;
> >>> +}
> >>> +
> >>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >>>  {
> >>>  	pc->sq = (void *)pc + sizeof(*pc);
> >>> @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_POLL_CTX;
> >>>  }
> >>>  
> >>> +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> >>> +{
> >>> +	/*
> >>> +	 * The bio has been added to per-task poll queue, mark it as
> >>> +	 * END_BY_POLL, so that this bio is always completed from
> >>> +	 * blk_poll() which is provided with cookied from this bio's
> >>> +	 * submission.
> >>> +	 */
> >>> +	if (!queued)
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> >>> +	else
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +}
> >>> +
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/*
> >>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> >>> +	 * and completion. However, bio crypt may update .bi_end_io during
> >>> +	 * submission, so simply don't support bio based polling for this
> >>> +	 * setting.
> >>> +	 */
> >>> +	if (likely(!bio_has_crypt_ctx(bio))) {
> >>> +		/* track this bio via bio group list */
> >>> +		spin_lock(&pc->sq_lock);
> >>> +		queued = bio_grp_list_add(pc->sq, bio);
> >>> +		blk_bio_poll_mark_queued(bio, queued);
> >>> +		spin_unlock(&pc->sq_lock);
> >>> +	} else {
> >>> +		queued = false;
> >>> +		blk_bio_poll_mark_queued(bio, false);
> >>> +	}
> >>> +
> >>> +	return queued;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
> >>
> >>
> >> I can see no sense to enqueue the bio into the context->sq when
> >> REQ_HIPRI is cleared while REQ_POLL_CTX is set for the bio.
> >> BLK_QC_T_NONE is returned in this case. This is possible since commit
> >> cc29e1bf0d63 ("block: disable iopoll for split bio").
> > 
> > bio has to be enqueued before submission, and once it is enqueued, it has
> > to be ended by blk_poll(), this way actually simplifies polled bio lifetime
> > a lot, no matter if this bio is really completed via poll or irq.
> > 
> > When submit_bio() is returning BLK_QC_T_NONE, this bio may have been
> > completed already, and we shouldn't touch that bio any more, otherwise
> > things can become quite complicated.
> 
> I mean, would the following be adequate?
> 
> if (ioc && queue_is_mq(q) &&(bio->bi_opf & REQ_HIPRI)) {
>     queued = blk_bio_poll_prep_submit(ioc, bio);
>     ret = __submit_bio(bio);
>     if (queued)
>         bio_set_private_data(bio, ret);
> }
> 
> 
> Only REQ_HIPRI is checked here, thus bios with REQ_POLL_CTX but without
> REQ_HIPRI needn't be enqueued into poll_context->sq.
> 
> For the original bio (with REQ_HIPRI | REQ_POLL_CTX), it's enqueued into
> poll_context->sq, while the following split bios (only with
> REQ_POLL_CTX, since REQ_HIPRI is cleared by commit cc29e1bf0d63 ("block:
> disable iopoll for split bio")) needn't be enqueued into
> poll_context->sq. Since these bios (only with REQ_POLL_CTX) are not
> enqueued, you won't touch them anymore and you needn't care about their
> lifetime.
> 
> Your original code works, and my optimization here could make the code
> clearer and faster. However, I'm not sure about the effect of the
> optimization, since the scenario addressed by commit cc29e1bf0d63
> ("block: disable iopoll for split bio") should be rare.

OK, got it, and it should be fine to just check REQ_HIPRI here. But as
you mentioned, that case should be rare, and even if it isn't, the cost
is small.

If it isn't rare, one solution is to extend the current bio based
polling to cover split bios.

> 
> 
> > 
> >>
> >>
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				bio_set_private_data(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +
> >>> +	__submit_bio_noacct_ctx(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (READ_ONCE(pc->sq->nr_grps))
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return BLK_QC_T_BIO_NONE;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>> +	__submit_bio_noacct_ctx(bio, NULL);
> >>> +
> >>> +	return BLK_QC_T_BIO_NONE;
> >>> +}
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> >>> index 5574c398eff6..b9a512f066f8 100644
> >>> --- a/block/blk-ioc.c
> >>> +++ b/block/blk-ioc.c
> >>> @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
> >>>  
> >>>  static inline void free_io_context(struct io_context *ioc)
> >>>  {
> >>> +	blk_bio_poll_io_drain(ioc);
> >>> +
> >>
> >> When multiple processes share one io_context, there may be a time
> >> window between the moment the IO submission process detaches the
> >> io_context and the moment the io_context's refcount finally drops to
> >> zero. I don't know if it is possible that the other process sharing
> >> the io_context won't submit any IO, in which case the bios remaining
> >> in the io_context won't be reaped for a long time.
> >>
> >> If the above case is possible, then is it possible to drain the sq once
> >> the process detaches the io_context?
> > 
> > free_io_context() is called after the ioc's refcount drops to zero, so
> > any process sharing this ioc must have exited already.
> 
> Just like I said, when the process that the returned cookie refers to has
> exited, while the corresponding io_context lives on for a long time
> (since it's shared by other processes and thus the refcount stays
> greater than zero), the bios in io_context->sq have no chance of being
> reaped, if the following two conditions hold at the same time:
> 
> 1) the process that the returned cookie refers to has exited;
> 2) the io_context is shared by other processes, but these processes
> don't submit HIPRI IO, so blk_poll() won't be called for this io_context.
> 
> Though the case may be rare.

OK, it looks like we can deal with that by simply calling blk_bio_poll_io_drain()
at the entry of exit_io_context().
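
A minimal sketch of that, assuming the current shape of exit_io_context()
and the blk_bio_poll_io_drain() helper from this patch (illustrative only,
not the posted code):

/* block/blk-ioc.c */
void exit_io_context(struct task_struct *task)
{
	struct io_context *ioc;

	task_lock(task);
	ioc = task->io_context;
	task->io_context = NULL;
	task_unlock(task);

	/* reap polled bios still queued in this task's poll context */
	blk_bio_poll_io_drain(ioc);

	atomic_dec(&ioc->nr_tasks);
	put_io_context_active(ioc);
}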


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling
@ 2021-03-25  9:56           ` Ming Lei
  0 siblings, 0 replies; 74+ messages in thread
From: Ming Lei @ 2021-03-25  9:56 UTC (permalink / raw)
  To: JeffleXu; +Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer

On Thu, Mar 25, 2021 at 05:18:41PM +0800, JeffleXu wrote:
> 
> 
> On 3/25/21 4:05 PM, Ming Lei wrote:
> > On Thu, Mar 25, 2021 at 02:34:18PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 3/24/21 8:19 PM, Ming Lei wrote:
> >>> Currently bio based IO polling needs to poll all hw queues blindly, which
> >>> is very inefficient, and one big reason is that we can't pass any
> >>> bio submission result to blk_poll().
> >>>
> >>> In IO submission context, track associated underlying bios via a per-task
> >>> submission queue and store the returned 'cookie' in
> >>> bio->bi_iter.bi_private_data, and return current->pid to the caller of
> >>> submit_bio() for any bio based driver's IO submitted from the FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of the submission
> >>> context, so we can find bios in the per-task io poll context of the
> >>> submission context. Move bios from the submission queue to the poll queue
> >>> of the poll context, and keep polling until these bios are ended. Remove
> >>> a bio from the poll queue once it is ended. Add the bio flags BIO_DONE and
> >>> BIO_END_BY_POLL for this purpose.
> >>>
> >>> It was found in Jeffle Xu's test that kfifo doesn't scale well for a
> >>> submission queue as queue depth is increased, so a new mechanism for
> >>> tracking bios is needed. So far a bio's size is close to 2 cachelines,
> >>> and adding a new field to the bio just for tracking bios via a linked
> >>> list may not be an acceptable way to solve the scalability issue.
> >>> Switch to a bio group list for tracking bios instead: the idea is to
> >>> reuse .bi_end_io for linking into one list all bios that share the same
> >>> .bi_end_io (call it a bio group), and to recover .bi_end_io before
> >>> really ending the bio; BIO_END_BY_POLL is added to guarantee this.
> >>> Usually .bi_end_io is the same for all bios in the same layer, so it is
> >>> enough to provide a very limited number of groups, such as 16 or fewer,
> >>> for fixing the scalability issue.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like a stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Also when the submission task is exiting, drain pending IOs in the context
> >>> until all are done.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 +
> >>>  block/blk-core.c          | 154 ++++++++++++++++++++++++-
> >>>  block/blk-ioc.c           |   2 +
> >>>  block/blk-mq.c            | 234 +++++++++++++++++++++++++++++++++++++-
> >>>  block/blk.h               |  10 ++
> >>>  include/linux/blk_types.h |  18 ++-
> >>>  6 files changed, 419 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index 26b7f721cda8..04c043dc60fc 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index eb07d61cfdc2..95f7e36c8759 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -805,6 +805,81 @@ static inline unsigned int bio_grp_list_size(unsigned int nr_grps)
> >>>  		sizeof(struct bio_grp_list_data);
> >>>  }
> >>>  
> >>> +static inline void *bio_grp_data(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_poll;
> >>> +}
> >>> +
> >>> +/* add bio into bio group list, return true if it is added */
> >>> +static bool bio_grp_list_add(struct bio_grp_list *list, struct bio *bio)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == bio_grp_data(bio)) {
> >>> +			__bio_grp_list_add(&grp->list, bio);
> >>> +			return true;
> >>> +		}
> >>> +	}
> >>> +
> >>> +	if (i == list->max_nr_grps)
> >>> +		return false;
> >>> +
> >>> +	/* create a new group */
> >>> +	grp = &list->head[i];
> >>> +	bio_list_init(&grp->list);
> >>> +	grp->grp_data = bio_grp_data(bio);
> >>> +	__bio_grp_list_add(&grp->list, bio);
> >>> +	list->nr_grps++;
> >>> +
> >>> +	return true;
> >>> +}
> >>> +
> >>> +static int bio_grp_list_find_grp(struct bio_grp_list *list, void *grp_data)
> >>> +{
> >>> +	int i;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = 0; i < list->nr_grps; i++) {
> >>> +		grp = &list->head[i];
> >>> +		if (grp->grp_data == grp_data)
> >>> +			return i;
> >>> +	}
> >>> +
> >>> +	if (i < list->max_nr_grps) {
> >>> +		grp = &list->head[i];
> >>> +		bio_list_init(&grp->list);
> >>> +		return i;
> >>> +	}
> >>> +
> >>> +	return -1;
> >>> +}
> >>> +
> >>> +/* Move as many as possible groups from 'src' to 'dst' */
> >>> +void bio_grp_list_move(struct bio_grp_list *dst, struct bio_grp_list *src)
> >>> +{
> >>> +	int i, j, cnt = 0;
> >>> +	struct bio_grp_list_data *grp;
> >>> +
> >>> +	for (i = src->nr_grps - 1; i >= 0; i--) {
> >>> +		grp = &src->head[i];
> >>> +		j = bio_grp_list_find_grp(dst, grp->grp_data);
> >>> +		if (j < 0)
> >>> +			break;
> >>> +		if (bio_grp_list_grp_empty(&dst->head[j])) {
> >>> +			dst->head[j].grp_data = grp->grp_data;
> >>> +			dst->nr_grps++;
> >>> +		}
> >>> +		__bio_grp_list_merge(&dst->head[j].list, &grp->list);
> >>> +		bio_list_init(&grp->list);
> >>> +		cnt++;
> >>> +	}
> >>> +
> >>> +	src->nr_grps -= cnt;
> >>> +}
> >>> +
> >>>  static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
> >>>  {
> >>>  	pc->sq = (void *)pc + sizeof(*pc);
> >>> @@ -866,6 +941,45 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_POLL_CTX;
> >>>  }
> >>>  
> >>> +static inline void blk_bio_poll_mark_queued(struct bio *bio, bool queued)
> >>> +{
> >>> +	/*
> >>> +	 * The bio has been added to per-task poll queue, mark it as
> >>> +	 * END_BY_POLL, so that this bio is always completed from
> >>> +	 * blk_poll() which is provided with cookied from this bio's
> >>> +	 * submission.
> >>> +	 */
> >>> +	if (!queued)
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_POLL_CTX);
> >>> +	else
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +}
> >>> +
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/*
> >>> +	 * We rely on immutable .bi_end_io between blk-mq bio submission
> >>> +	 * and completion. However, bio crypt may update .bi_end_io during
> >>> +	 * submission, so simply don't support bio based polling for this
> >>> +	 * setting.
> >>> +	 */
> >>> +	if (likely(!bio_has_crypt_ctx(bio))) {
> >>> +		/* track this bio via bio group list */
> >>> +		spin_lock(&pc->sq_lock);
> >>> +		queued = bio_grp_list_add(pc->sq, bio);
> >>> +		blk_bio_poll_mark_queued(bio, queued);
> >>> +		spin_unlock(&pc->sq_lock);
> >>> +	} else {
> >>> +		queued = false;
> >>> +		blk_bio_poll_mark_queued(bio, false);
> >>> +	}
> >>> +
> >>> +	return queued;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1018,7 +1132,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_ctx(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1041,7 +1155,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_POLL_CTX))) {
> >>
> >>
> >> I see no point in enqueueing the bio into the context->sq when
> >> REQ_HIPRI is cleared while REQ_POLL_CTX is set for the bio.
> >> BLK_QC_T_NONE is returned in this case. This is possible since commit
> >> cc29e1bf0d63 ("block: disable iopoll for split bio").
> > 
> > The bio has to be enqueued before submission, and once it is enqueued, it
> > has to be ended by blk_poll(); this actually simplifies the polled bio's
> > lifetime a lot, no matter whether the bio is completed via poll or irq.
> > 
> > When submit_bio() returns BLK_QC_T_NONE, the bio may have been
> > completed already, and we shouldn't touch it any more, otherwise
> > things can become quite complicated.
> 
> I mean, would the following be adequate?
> 
> if (ioc && queue_is_mq(q) && (bio->bi_opf & REQ_HIPRI)) {
>     queued = blk_bio_poll_prep_submit(ioc, bio);
>     ret = __submit_bio(bio);
>     if (queued)
>         bio_set_private_data(bio, ret);
> }
> 
> 
> Only REQ_HIPRI is checked here, thus bios with REQ_POLL_CTX but without
> REQ_HIPRI needn't be enqueued into poll_context->sq.
> 
> For the original bio (with REQ_HIPRI | REQ_POLL_CTX), it's enqueued into
> poll_context->sq, while the following split bios (carrying only
> REQ_POLL_CTX, since REQ_HIPRI is cleared by commit cc29e1bf0d63 ("block:
> disable iopoll for split bio")) needn't be enqueued into
> poll_context->sq. Since these bios (with only REQ_POLL_CTX) are never
> enqueued, you won't touch them anymore and needn't care about their lifetime.
> 
> Your original code works, and my optimization here could make the code
> clearer and faster. However, I'm not sure how much the optimization
> helps, since the scenario addressed by commit cc29e1bf0d63 ("block: disable
> iopoll for split bio") should be rare.

OK, got it, and it should be fine to just check REQ_HIPRI here. But as
you mentioned, that case should be rare, and even if it isn't, the cost
is small too.

If it turns out not to be rare, one solution is to extend the current bio
based polling to cover split bios.

> 
> 
> > 
> >>
> >>
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				bio_set_private_data(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1067,6 +1190,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +
> >>> +	__submit_bio_noacct_ctx(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (READ_ONCE(pc->sq->nr_grps))
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return BLK_QC_T_BIO_NONE;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>> +	__submit_bio_noacct_ctx(bio, NULL);
> >>> +
> >>> +	return BLK_QC_T_BIO_NONE;
> >>> +}
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> >>> index 5574c398eff6..b9a512f066f8 100644
> >>> --- a/block/blk-ioc.c
> >>> +++ b/block/blk-ioc.c
> >>> @@ -19,6 +19,8 @@ static struct kmem_cache *iocontext_cachep;
> >>>  
> >>>  static inline void free_io_context(struct io_context *ioc)
> >>>  {
> >>> +	blk_bio_poll_io_drain(ioc);
> >>> +
> >>
> >> There may be a time window between when the IO submission process detaches
> >> the io_context and when the io_context's refcount finally drops to zero,
> >> if there are multiple processes sharing one io_context. I don't know
> >> if it is possible that the other processes sharing the io_context never
> >> submit any IO, in which case the bios remaining in the io_context won't
> >> be reaped for a long time.
> >>
> >> If the above case is possible, then is it possible to drain the sq once
> >> the process detaches the io_context?
> > 
> > free_io_context() is called after the ioc's refcount drops to zero, so
> > any process sharing this ioc must already have exited.
> 
> Just like I said, when the process that the returned cookie refers to has
> exited, while the corresponding io_context lives on for a long time
> (since it's shared by other processes and thus the refcount stays
> greater than zero), the bios in io_context->sq have no chance of being
> reaped, if the following two conditions hold at the same time:
> 
> 1) the process that the returned cookie refers to has exited;
> 2) the io_context is shared by other processes, but these processes
> don't submit HIPRI IO, so blk_poll() won't be called for this io_context.
> 
> Though the case may be rare.

OK, it looks like we can deal with that by simply calling blk_bio_poll_io_drain()
at the entry of exit_io_context().
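
Stepping back to the completion protocol described in the commit message
above, the reap step for one tracked bio can be pictured roughly as below;
this is a sketch of the idea only, not the posted blk_poll() changes, and
where the saved ->bi_end_io actually lives is up to the grouping code:

/*
 * A bio parked in the poll queue was marked BIO_END_BY_POLL at submission,
 * so the driver's bio_endio() only set BIO_DONE; here the original
 * ->bi_end_io (saved aside when the bio was grouped) is restored and the
 * bio is finally ended from the poll context.
 */
static void blk_bio_poll_reap_one(struct bio *bio, bio_end_io_t *saved_end_io)
{
	if (!bio_flagged(bio, BIO_DONE))
		return;				/* still in flight, keep polling */

	bio->bi_end_io = saved_end_io;		/* undo the .bi_end_io reuse */
	bio_clear_flag(bio, BIO_END_BY_POLL);
	bio_endio(bio);				/* now really completes the bio */
}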


Thanks, 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 13/13] dm: support IO polling for bio-based dm device
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-25 18:45     ` Mike Snitzer
  -1 siblings, 0 replies; 74+ messages in thread
From: Mike Snitzer @ 2021-03-25 18:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Jeffle Xu, dm-devel

On Wed, Mar 24 2021 at  8:19am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> From: Jeffle Xu <jefflexu@linux.alibaba.com>
> 
> IO polling is enabled when all underlying target devices are capable
> of IO polling. The sanity check supports the stacked device model, in
> which one dm device may be built upon another dm device. In this case,
> the mapped device will check if the underlying dm target device
> supports IO polling.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
>  drivers/md/dm.c               | 14 ++++++++++++++
>  include/linux/device-mapper.h |  1 +
>  3 files changed, 39 insertions(+)
> 

...

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..fe6893b078dc 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>  	return ret;
>  }
>  
> +static bool dm_bio_poll_capable(struct gendisk *disk)
> +{
> +	int ret, srcu_idx;
> +	struct mapped_device *md = disk->private_data;
> +	struct dm_table *t;
> +
> +	t = dm_get_live_table(md, &srcu_idx);
> +	ret = dm_table_supports_poll(t);
> +	dm_put_live_table(md, srcu_idx);
> +
> +	return ret;
> +}
> +

I know this code will only get called by blk-core for bio-based devices, but
there isn't anything about this method's implementation that is inherently
bio-based only.

So please rename dm_bio_poll_capable to dm_poll_capable.

Other than that:

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

>  /*-----------------------------------------------------------------
>   * An IDR is used to keep track of allocated minor numbers.
>   *---------------------------------------------------------------*/
> @@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
>  };
>  
>  static const struct block_device_operations dm_blk_dops = {
> +	.poll_capable = dm_bio_poll_capable,
>  	.submit_bio = dm_submit_bio,
>  	.open = dm_blk_open,
>  	.release = dm_blk_close,
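
For context, dm_table_supports_poll() used by dm_bio_poll_capable() above is
not visible in the quoted hunks; a minimal sketch of such a helper, assuming
it follows dm-table's usual iterate_devices pattern and the blk_queue_poll()
helper from patch 01 (names illustrative, not the posted code), could look
like:

/* drivers/md/dm-table.c -- sketch only */
static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
				   sector_t start, sector_t len, void *data)
{
	struct request_queue *q = bdev_get_queue(dev->bdev);

	return !blk_queue_poll(q);
}

bool dm_table_supports_poll(struct dm_table *t)
{
	unsigned int i;

	for (i = 0; i < dm_table_get_num_targets(t); i++) {
		struct dm_target *ti = dm_table_get_target(t, i);

		if (!ti->type->iterate_devices ||
		    ti->type->iterate_devices(ti, device_not_poll_capable, NULL))
			return false;
	}

	return true;
}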


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 13/13] dm: support IO polling for bio-based dm device
@ 2021-03-25 18:45     ` Mike Snitzer
  0 siblings, 0 replies; 74+ messages in thread
From: Mike Snitzer @ 2021-03-25 18:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, dm-devel, Jeffle Xu

On Wed, Mar 24 2021 at  8:19am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> From: Jeffle Xu <jefflexu@linux.alibaba.com>
> 
> IO polling is enabled when all underlying target devices are capable
> of IO polling. The sanity check supports the stacked device model, in
> which one dm device may be built upon another dm device. In this case,
> the mapped device will check if the underlying dm target device
> supports IO polling.
> 
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
>  drivers/md/dm.c               | 14 ++++++++++++++
>  include/linux/device-mapper.h |  1 +
>  3 files changed, 39 insertions(+)
> 

...

> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 50b693d776d6..fe6893b078dc 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
>  	return ret;
>  }
>  
> +static bool dm_bio_poll_capable(struct gendisk *disk)
> +{
> +	int ret, srcu_idx;
> +	struct mapped_device *md = disk->private_data;
> +	struct dm_table *t;
> +
> +	t = dm_get_live_table(md, &srcu_idx);
> +	ret = dm_table_supports_poll(t);
> +	dm_put_live_table(md, srcu_idx);
> +
> +	return ret;
> +}
> +

I know this code will only get called by blk-core for bio-based devices, but
there isn't anything about this method's implementation that is inherently
bio-based only.

So please rename dm_bio_poll_capable to dm_poll_capable.

Other than that:

Reviewed-by: Mike Snitzer <snitzer@redhat.com>

>  /*-----------------------------------------------------------------
>   * An IDR is used to keep track of allocated minor numbers.
>   *---------------------------------------------------------------*/
> @@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
>  };
>  
>  static const struct block_device_operations dm_blk_dops = {
> +	.poll_capable = dm_bio_poll_capable,
>  	.submit_bio = dm_submit_bio,
>  	.open = dm_blk_open,
>  	.release = dm_blk_close,

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH V3 08/13] block: prepare for supporting bio_list via other link
  2021-03-24 12:19   ` [dm-devel] " Ming Lei
@ 2021-03-26 15:02     ` Hannes Reinecke
  -1 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-26 15:02 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel

On 3/24/21 1:19 PM, Ming Lei wrote:
> So far the bio list helpers always use .bi_next to traverse the list; we
> will support linking bios by another bio field.
> 
> Prepare for such support by adding a macro so that users can define
> other helpers for linking bios by another bio field.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   include/linux/bio.h | 132 +++++++++++++++++++++++---------------------
>   1 file changed, 68 insertions(+), 64 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
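
The macro approach described in the quoted commit message can be pictured
roughly as below; this is a simplified illustration only, and the posted
patch's actual macro, helper names, and link field differ:

/*
 * Generate bio_list helpers that chain bios through an arbitrary
 * "struct bio *" member instead of hard-coding .bi_next.
 */
#define BIO_LIST_HELPERS(_name, _link)					\
static inline void _name##_add(struct bio_list *bl, struct bio *bio)	\
{									\
	bio->_link = NULL;						\
	if (bl->tail)							\
		bl->tail->_link = bio;					\
	else								\
		bl->head = bio;						\
	bl->tail = bio;							\
}									\
									\
static inline struct bio *_name##_pop(struct bio_list *bl)		\
{									\
	struct bio *bio = bl->head;					\
									\
	if (bio) {							\
		bl->head = bio->_link;					\
		if (!bl->head)						\
			bl->tail = NULL;				\
		bio->_link = NULL;					\
	}								\
	return bio;							\
}

/* the classic helpers keep using .bi_next; later patches instantiate a
 * second set over the field used for the per-task poll queues */
BIO_LIST_HELPERS(bio_list, bi_next)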

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [dm-devel] [PATCH V3 08/13] block: prepare for supporting bio_list via other link
@ 2021-03-26 15:02     ` Hannes Reinecke
  0 siblings, 0 replies; 74+ messages in thread
From: Hannes Reinecke @ 2021-03-26 15:02 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block, Jeffle Xu, dm-devel, Mike Snitzer

On 3/24/21 1:19 PM, Ming Lei wrote:
> So far the bio list helpers always use .bi_next to traverse the list; we
> will support linking bios by another bio field.
> 
> Prepare for such support by adding a macro so that users can define
> other helpers for linking bios by another bio field.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   include/linux/bio.h | 132 +++++++++++++++++++++++---------------------
>   1 file changed, 68 insertions(+), 64 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2021-03-29 11:55 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 12:19 [PATCH V3 00/13] block: support bio based io polling Ming Lei
2021-03-24 12:19 ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 01/13] block: add helper of blk_queue_poll Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 13:19   ` Hannes Reinecke
2021-03-24 13:19     ` [dm-devel] " Hannes Reinecke
2021-03-24 17:54   ` Christoph Hellwig
2021-03-24 17:54     ` [dm-devel] " Christoph Hellwig
2021-03-25  1:56   ` JeffleXu
2021-03-25  1:56     ` [dm-devel] " JeffleXu
2021-03-24 12:19 ` [PATCH V3 02/13] block: add one helper to free io_context Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 13:21   ` Hannes Reinecke
2021-03-24 13:21     ` [dm-devel] " Hannes Reinecke
2021-03-24 12:19 ` [PATCH V3 03/13] block: add helper of blk_create_io_context Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 13:22   ` Hannes Reinecke
2021-03-24 13:22     ` [dm-devel] " Hannes Reinecke
2021-03-24 15:52   ` Keith Busch
2021-03-24 15:52     ` [dm-devel] " Keith Busch
2021-03-24 18:17   ` Christoph Hellwig
2021-03-24 18:17     ` [dm-devel] " Christoph Hellwig
2021-03-25  0:30     ` Ming Lei
2021-03-25  0:30       ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 04/13] block: create io poll context for submission and poll task Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 13:26   ` Hannes Reinecke
2021-03-24 13:26     ` [dm-devel] " Hannes Reinecke
2021-03-25  2:34   ` JeffleXu
2021-03-25  2:34     ` [dm-devel] " JeffleXu
2021-03-25  2:51     ` Ming Lei
2021-03-25  2:51       ` [dm-devel] " Ming Lei
2021-03-25  3:01       ` JeffleXu
2021-03-25  3:01         ` [dm-devel] " JeffleXu
2021-03-24 12:19 ` [PATCH V3 05/13] block: add req flag of REQ_POLL_CTX Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 15:32   ` Hannes Reinecke
2021-03-24 15:32     ` [dm-devel] " Hannes Reinecke
2021-03-25  0:32     ` Ming Lei
2021-03-25  0:32       ` [dm-devel] " Ming Lei
2021-03-25  6:55   ` JeffleXu
2021-03-25  6:55     ` [dm-devel] " JeffleXu
2021-03-24 12:19 ` [PATCH V3 06/13] block: add new field into 'struct bvec_iter' Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 15:33   ` Hannes Reinecke
2021-03-24 15:33     ` [dm-devel] " Hannes Reinecke
2021-03-24 12:19 ` [PATCH V3 07/13] block/mq: extract one helper function polling hw queue Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-25  6:50   ` Hannes Reinecke
2021-03-25  6:50     ` [dm-devel] " Hannes Reinecke
2021-03-24 12:19 ` [PATCH V3 08/13] block: prepare for supporting bio_list via other link Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-26 15:02   ` Hannes Reinecke
2021-03-26 15:02     ` [dm-devel] " Hannes Reinecke
2021-03-24 12:19 ` [PATCH V3 09/13] block: use per-task poll context to implement bio based io polling Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-25  6:34   ` JeffleXu
2021-03-25  6:34     ` [dm-devel] " JeffleXu
2021-03-25  8:05     ` Ming Lei
2021-03-25  8:05       ` [dm-devel] " Ming Lei
2021-03-25  9:18       ` JeffleXu
2021-03-25  9:18         ` [dm-devel] " JeffleXu
2021-03-25  9:56         ` Ming Lei
2021-03-25  9:56           ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 10/13] blk-mq: limit hw queues to be polled in each blk_poll() Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 11/13] block: add queue_to_disk() to get gendisk from request_queue Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 12/13] block: add poll_capable method to support bio-based IO polling Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-24 12:19 ` [PATCH V3 13/13] dm: support IO polling for bio-based dm device Ming Lei
2021-03-24 12:19   ` [dm-devel] " Ming Lei
2021-03-25 18:45   ` Mike Snitzer
2021-03-25 18:45     ` [dm-devel] " Mike Snitzer
