* [RFC PATCH 00/11] block: support bio based io polling
@ 2021-03-16  3:15 ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Hi,

Add a per-task io poll context for holding the HIPRI blk-mq/underlying
bios queued from a bio based driver's io submission context, and reuse
one bio padding field for storing the 'cookie' returned from
submit_bio() for these bios. Also explicitly end these bios in the poll
context by adding two new bio flags.

This way we no longer need to poll all underlying hw queues blindly, as
is done in Jeffle's patches; instead we only poll the hw queues in
which HIPRI IO is actually queued.

Usually io submission and io poll share the same task context, so the
added io poll context data behaves much like a stack variable, and the
cost of saving bios is cheap.
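
For reference, this poll path can be exercised from userspace with a
sync O_DIRECT read plus RWF_HIPRI; the snippet below is only an
illustrative sketch (the dm device path and block size are assumptions,
and polling has to be enabled on the underlying queues):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;
	int fd = open("/dev/dm-0", O_RDONLY | O_DIRECT); /* assumed test device */

	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	iov.iov_base = buf;
	iov.iov_len = 4096;

	/* RWF_HIPRI asks the kernel to complete this IO via blk_poll() */
	ret = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	printf("read %zd bytes\n", ret);

	close(fd);
	free(buf);
	return ret == 4096 ? 0 : 1;
}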

Any comments are welcome.


Jeffle Xu (4):
  block/mq: extract one helper function polling hw queue
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll_capable method to support bio-based IO polling
  dm: support IO polling for bio-based dm device

Ming Lei (7):
  block: add helper of blk_queue_poll
  block: add one helper to free io_context
  block: add helper of blk_create_io_context
  block: create io poll context for submission and poll task
  block: add req flag of REQ_TAG
  block: add new field into 'struct bvec_iter'
  block: use per-task poll context to implement bio based io poll

 block/bio.c                   |   5 +
 block/blk-core.c              | 161 ++++++++++++++++++++++++++---
 block/blk-ioc.c               |  12 ++-
 block/blk-mq.c                | 189 ++++++++++++++++++++++++++++++++--
 block/blk-sysfs.c             |  14 ++-
 block/blk.h                   |  45 ++++++++
 drivers/md/dm-table.c         |  24 +++++
 drivers/md/dm.c               |  14 +++
 drivers/nvme/host/core.c      |   2 +-
 include/linux/blk_types.h     |   7 ++
 include/linux/blkdev.h        |   4 +
 include/linux/bvec.h          |   9 ++
 include/linux/device-mapper.h |   1 +
 include/linux/iocontext.h     |   2 +
 include/trace/events/kyber.h  |   6 +-
 15 files changed, 466 insertions(+), 29 deletions(-)

-- 
2.29.2



* [RFC PATCH 01/11] block: add helper of blk_queue_poll
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

There are already 3 users, and more will come, so add such a helper.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 2 +-
 block/blk-mq.c           | 3 +--
 drivers/nvme/host/core.c | 2 +-
 include/linux/blkdev.h   | 1 +
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fc60ff208497..a31371d55b9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -836,7 +836,7 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
 	switch (bio_op(bio)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d4d7c1caa439..63c81df3b8b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3869,8 +3869,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) ||
-	    !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
+	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
 		return 0;
 
 	if (current->plug)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e68a8c4ac5a6..bb7da34dd967 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -955,7 +955,7 @@ static void nvme_execute_rq_polled(struct request_queue *q,
 {
 	DECLARE_COMPLETION_ONSTACK(wait);
 
-	WARN_ON_ONCE(!test_bit(QUEUE_FLAG_POLL, &q->queue_flags));
+	WARN_ON_ONCE(!blk_queue_poll(q));
 
 	rq->cmd_flags |= REQ_HIPRI;
 	rq->end_io_data = &wait;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc6bc8383b43..89a01850cf12 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -665,6 +665,7 @@ bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_registered(q)	test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)
 #define blk_queue_nowait(q)	test_bit(QUEUE_FLAG_NOWAIT, &(q)->queue_flags)
+#define blk_queue_poll(q)	test_bit(QUEUE_FLAG_POLL, &(q)->queue_flags)
 
 extern void blk_set_pm_only(struct request_queue *q);
 extern void blk_clear_pm_only(struct request_queue *q);
-- 
2.29.2



* [RFC PATCH 02/11] block: add one helper to free io_context
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Prepare for putting the bio poll queue into io_context by adding a
helper for freeing an io_context.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-ioc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 57299f860d41..b0cde18c4b8c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -17,6 +17,11 @@
  */
 static struct kmem_cache *iocontext_cachep;
 
+static inline void free_io_context(struct io_context *ioc)
+{
+	kmem_cache_free(iocontext_cachep, ioc);
+}
+
 /**
  * get_io_context - increment reference count to io_context
  * @ioc: io_context to get
@@ -129,7 +134,7 @@ static void ioc_release_fn(struct work_struct *work)
 
 	spin_unlock_irq(&ioc->lock);
 
-	kmem_cache_free(iocontext_cachep, ioc);
+	free_io_context(ioc);
 }
 
 /**
@@ -164,7 +169,7 @@ void put_io_context(struct io_context *ioc)
 	}
 
 	if (free_ioc)
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 }
 
 /**
@@ -278,7 +283,7 @@ int create_task_io_context(struct task_struct *task, gfp_t gfp_flags, int node)
 	    (task == current || !(task->flags & PF_EXITING)))
 		task->io_context = ioc;
 	else
-		kmem_cache_free(iocontext_cachep, ioc);
+		free_io_context(ioc);
 
 	ret = task->io_context ? 0 : -EBUSY;
 
-- 
2.29.2



* [RFC PATCH 03/11] block: add helper of blk_create_io_context
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Add a helper for creating the io context, preparing for support of
efficient bio based io poll.

Meanwhile, move the creation of the io_context before the check of the
bio's REQ_HIPRI flag, because a following patch may need to clear
REQ_HIPRI if the io_context can't be created.
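
For context, a condensed sketch of where this ordering ends up
mattering once the next patch lands (the check is taken from that
patch, slightly simplified):

	/*
	 * The HIPRI downgrade below needs the io_context (and its poll
	 * context) to exist already, hence create it first.
	 */
	blk_create_io_context(q);

	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
		bio->bi_opf &= ~REQ_HIPRI;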

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a31371d55b9d..d58f8a0c80de 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,6 +792,18 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+static inline void blk_create_io_context(struct request_queue *q)
+{
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -836,6 +848,8 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
+	blk_create_io_context(q);
+
 	if (!blk_queue_poll(q))
 		bio->bi_opf &= ~REQ_HIPRI;
 
@@ -876,15 +890,6 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		break;
 	}
 
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
-
 	if (blk_throtl_bio(bio)) {
 		blkcg_bio_issue_init(bio);
 		return false;
-- 
2.29.2



* [RFC PATCH 04/11] block: create io poll context for submission and poll task
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Create a per-task io poll context for both the IO submission task and
the poll task, if the queue is bio based and supports polling.

This io poll context includes two queues: a submission queue (sq) for
storing the HIPRI bio submission result (cookie) together with the bio
itself, written by the submission task and read by the poll task; and a
polling queue (pq) holding data moved over from the sq, used only in
the poll context for running bio polling.

Following patches will wire up the actual bio poll support.
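
To make the intended use of the two queues a bit more concrete, below
is a rough sketch of the producer/consumer flow that later patches wire
up; the helper names are made up for illustration, while the kfifo and
locking calls match the code added here:

/* submission task: publish one HIPRI bio (its cookie is stored later) */
static bool poll_ctx_queue_bio(struct blk_bio_poll_ctx *pc, struct bio *bio)
{
	struct blk_bio_poll_data data = { .bio = bio, };

	return kfifo_put(&pc->sq, data);	/* fails when sq is full */
}

/* poll task: move published entries from sq into free pq slots */
static void poll_ctx_drain_sq(struct blk_bio_poll_ctx *pc)
{
	int i;

	spin_lock(&pc->lock);
	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
		if (pc->pq[i].bio)
			continue;
		if (!kfifo_get(&pc->sq, &pc->pq[i]))
			break;
	}
	spin_unlock(&pc->lock);
}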

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 59 +++++++++++++++++++++++++++++++--------
 block/blk-ioc.c           |  1 +
 block/blk-mq.c            | 14 ++++++++++
 block/blk.h               | 45 +++++++++++++++++++++++++++++
 include/linux/iocontext.h |  2 ++
 5 files changed, 109 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d58f8a0c80de..7c7b0dba4f5c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -792,16 +792,47 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
-static inline void blk_create_io_context(struct request_queue *q)
+static inline bool blk_queue_support_bio_poll(struct request_queue *q)
 {
-	/*
-	 * Various block parts want %current->io_context, so allocate it up
-	 * front rather than dealing with lots of pain to allocate it only
-	 * where needed. This may fail and the block layer knows how to live
-	 * with it.
-	 */
-	if (unlikely(!current->io_context))
-		create_task_io_context(current, GFP_ATOMIC, q->node);
+	return !queue_is_mq(q) && blk_queue_poll(q);
+}
+
+static inline struct blk_bio_poll_ctx *blk_get_bio_poll_ctx(void)
+{
+	struct io_context *ioc = current->io_context;
+
+	return ioc ? ioc->data : NULL;
+}
+
+static void bio_poll_ctx_init(struct blk_bio_poll_ctx *pc)
+{
+	spin_lock_init(&pc->lock);
+	INIT_KFIFO(pc->sq);
+
+	mutex_init(&pc->pq_lock);
+	memset(pc->pq, 0, sizeof(pc->pq));
+}
+
+void bio_poll_ctx_alloc(struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc;
+
+	pc = kmalloc(sizeof(*pc), GFP_ATOMIC);
+	if (pc) {
+		bio_poll_ctx_init(pc);
+		if (cmpxchg(&ioc->data, NULL, (void *)pc))
+			kfree(pc);
+	}
+}
+
+static inline void blk_bio_poll_preprocess(struct request_queue *q,
+		struct bio *bio)
+{
+	if (!(bio->bi_opf & REQ_HIPRI))
+		return;
+
+	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+		bio->bi_opf &= ~REQ_HIPRI;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -848,10 +879,14 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 		}
 	}
 
-	blk_create_io_context(q);
+	/*
+	 * Created per-task io poll queue if we supports bio polling
+	 * and it is one HIPRI bio.
+	 */
+	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			(bio->bi_opf & REQ_HIPRI));
 
-	if (!blk_queue_poll(q))
-		bio->bi_opf &= ~REQ_HIPRI;
+	blk_bio_poll_preprocess(q, bio);
 
 	switch (bio_op(bio)) {
 	case REQ_OP_DISCARD:
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index b0cde18c4b8c..5574c398eff6 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -19,6 +19,7 @@ static struct kmem_cache *iocontext_cachep;
 
 static inline void free_io_context(struct io_context *ioc)
 {
+	kfree(ioc->data);
 	kmem_cache_free(iocontext_cachep, ioc);
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 63c81df3b8b5..c832faa52ca0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,17 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
+{
+	/*
+	 * Create poll queue for storing poll bio and its cookie from
+	 * submission queue
+	 */
+	blk_create_io_context(q, true);
+
+	return 0;
+}
+
 /**
  * blk_poll - poll for IO completions
  * @q:  the queue
@@ -3875,6 +3886,9 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	if (current->plug)
 		blk_flush_plug_list(current->plug, false);
 
+	if (!queue_is_mq(q))
+		return blk_bio_poll(q, cookie, spin);
+
 	hctx = q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
 
 	/*
diff --git a/block/blk.h b/block/blk.h
index 3b53e44b967e..f7a889bb3720 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -7,6 +7,7 @@
 #include <linux/part_stat.h>
 #include <linux/blk-crypto.h>
 #include <xen/xen.h>
+#include <linux/kfifo.h>
 #include "blk-crypto-internal.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
@@ -357,4 +358,48 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+#define BLK_BIO_POLL_SQ_SZ		64U
+#define BLK_BIO_POLL_PQ_SZ		(BLK_BIO_POLL_SQ_SZ * 2)
+
+/* result of submit_bio */
+struct blk_bio_poll_data {
+	struct bio	*bio;
+};
+
+/* Per-task bio poll queue data and attached to io context */
+struct blk_bio_poll_ctx {
+	spinlock_t	lock;
+	/*
+	 * Submission queue for storing HIPRI bio submission result, written
+	 * by submission task and read by poll task
+	 */
+	DECLARE_KFIFO(sq, struct blk_bio_poll_data, BLK_BIO_POLL_SQ_SZ);
+
+	/* Holding poll data moved from sq, only used in poll task */
+	struct mutex  pq_lock;
+	struct blk_bio_poll_data pq[BLK_BIO_POLL_PQ_SZ];
+};
+
+void bio_poll_ctx_alloc(struct io_context *ioc);
+
+static inline void blk_create_io_context(struct request_queue *q,
+		bool need_poll_ctx)
+{
+	struct io_context *ioc;
+
+	/*
+	 * Various block parts want %current->io_context, so allocate it up
+	 * front rather than dealing with lots of pain to allocate it only
+	 * where needed. This may fail and the block layer knows how to live
+	 * with it.
+	 */
+	if (unlikely(!current->io_context))
+		create_task_io_context(current, GFP_ATOMIC, q->node);
+
+	ioc = current->io_context;
+	if (need_poll_ctx && unlikely(ioc && !ioc->data))
+		bio_poll_ctx_alloc(ioc);
+}
+
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/iocontext.h b/include/linux/iocontext.h
index 0a9dc40b7be8..f9a467571356 100644
--- a/include/linux/iocontext.h
+++ b/include/linux/iocontext.h
@@ -110,6 +110,8 @@ struct io_context {
 	struct io_cq __rcu	*icq_hint;
 	struct hlist_head	icq_list;
 
+	void			*data;
+
 	struct work_struct release_work;
 };
 
-- 
2.29.2



* [RFC PATCH 05/11] block: add req flag of REQ_TAG
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Add a new req flag REQ_TAG which will be used in the following patch
for supporting bio based IO polling.

This flag helps us in the following ways:

1) the request flags are cloned in __bio_clone_fast(), so if we mark
one FS bio as REQ_TAG, all bios cloned from this FS bio will carry
REQ_TAG as well (a driver-side sketch follows this list).

2) create the per-task io polling context only if the bio based queue
supports polling and the submitted bio is HIPRI. This per-task io
polling context is created during submit_bio() before marking the HIPRI
bio as REQ_TAG, so we avoid creating such a context when a cloned bio
carrying REQ_TAG is submitted from another kernel context.

3) for supporting bio based io polling, we would otherwise need to poll
IOs from all underlying queues of the bio device/driver; this flag lets
us recognize which IOs need to be polled in the bio based style, which
will be implemented in the next patch.
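
As a driver-side illustration of point 1) (the helper name below is
hypothetical), a bio based driver that allocates its own bio for a
polled FS bio would simply propagate the relevant flags to the clone so
that the clone can also be completed via blk_poll():

static void drv_clone_poll_flags(struct bio *clone, struct bio *fs_bio)
{
	/* hypothetical driver snippet: let the clone be polled as well */
	clone->bi_opf |= fs_bio->bi_opf & (REQ_HIPRI | REQ_TAG);
}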

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 29 +++++++++++++++++++++++++++--
 include/linux/blk_types.h |  4 ++++
 2 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7c7b0dba4f5c..a082bbc856fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -828,11 +828,30 @@ void bio_poll_ctx_alloc(struct io_context *ioc)
 static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		struct bio *bio)
 {
+	bool mq;
+
 	if (!(bio->bi_opf & REQ_HIPRI))
 		return;
 
-	if (!blk_queue_poll(q) || (!queue_is_mq(q) && !blk_get_bio_poll_ctx()))
+	/*
+	 * Can't support bio based IO poll without per-task poll queue
+	 *
+	 * Now we have created per-task io poll context, and mark this
+	 * bio as REQ_TAG, so: 1) if any cloned bio from this bio is
+	 * submitted from another kernel context, we won't create bio
+	 * poll context for it, so that bio will be completed by IRQ;
+	 * 2) If such bio is submitted from current context, we will
+	 * complete it via blk_poll(); 3) If driver knows that one
+	 * underlying bio allocated from driver is for FS bio, meantime
+	 * it is submitted in current context, driver can mark such bio
+	 * as REQ_TAG manually, so the bio can be completed via blk_poll
+	 * too.
+	 */
+	mq = queue_is_mq(q);
+	if (!blk_queue_poll(q) || (!mq && !blk_get_bio_poll_ctx()))
 		bio->bi_opf &= ~REQ_HIPRI;
+	else if (!mq)
+		bio->bi_opf |= REQ_TAG;
 }
 
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
@@ -881,9 +900,15 @@ static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 
 	/*
 	 * Created per-task io poll queue if we supports bio polling
-	 * and it is one HIPRI bio.
+	 * and it is one HIPRI bio, and this HIPRI bio has to be from
+	 * FS. If REQ_TAG isn't set for HIPRI bio, we think it originated
+	 * from FS.
+	 *
+	 * Driver may allocated bio by itself and REQ_TAG is set, but they
+	 * won't be marked as HIPRI.
 	 */
 	blk_create_io_context(q, blk_queue_support_bio_poll(q) &&
+			!(bio->bi_opf & REQ_TAG) &&
 			(bio->bi_opf & REQ_HIPRI));
 
 	blk_bio_poll_preprocess(q, bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index db026b6ec15a..a1bcade4bcc3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -394,6 +394,9 @@ enum req_flag_bits {
 
 	__REQ_HIPRI,
 
+	/* for marking IOs originated from same FS bio in same context */
+	__REQ_TAG,
+
 	/* for driver use */
 	__REQ_DRV,
 	__REQ_SWAP,		/* swapping request. */
@@ -418,6 +421,7 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_HIPRI		(1ULL << __REQ_HIPRI)
+#define REQ_TAG			(1ULL << __REQ_TAG)
 
 #define REQ_DRV			(1ULL << __REQ_DRV)
 #define REQ_SWAP		(1ULL << __REQ_SWAP)
-- 
2.29.2



* [RFC PATCH 06/11] block: add new field into 'struct bvec_iter'
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

There is a hole at the end of 'struct bvec_iter', so add a new field
there and use it to save the cookie returned from submit_bio(), in
order to support bio based polling.

This avoids extending the bio unnecessarily.
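
The layout claim is easy to sanity check at build time; the sketch
below assumes a 64-bit build, where the 8-byte alignment of sector_t
leaves exactly a 4-byte tail hole, so adding bi_private_data must not
change the struct size:

#include <linux/build_bug.h>
#include <linux/bvec.h>

/* sketch only: guard against accidentally growing struct bvec_iter */
static inline void bvec_iter_layout_check(void)
{
	BUILD_BUG_ON(sizeof(struct bvec_iter) != 3 * sizeof(u64));
}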

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/bvec.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ff832e698efb..61c0f55f7165 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -43,6 +43,15 @@ struct bvec_iter {
 
 	unsigned int            bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
+
+	/*
+	 * There is a hole at the end of bvec_iter, so define one field here
+	 * to hold something which isn't related to 'bvec_iter', so that we
+	 * can avoid extending the bio. So far this new field is used for
+	 * bio based polling: we will store the return value of the
+	 * underlying queue's submit_bio() here.
+	 */
+	unsigned int		bi_private_data;
 };
 
 struct bvec_iter_all {
-- 
2.29.2



* [RFC PATCH 07/11] block/mq: extract one helper function polling hw queue
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Extract the logic of polling one hw queue, together with the related
statistics handling, into a helper function.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c832faa52ca0..03f59915fe2c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3852,6 +3852,19 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
+static inline int blk_mq_poll_hctx(struct request_queue *q,
+				   struct blk_mq_hw_ctx *hctx)
+{
+	int ret;
+
+	hctx->poll_invoked++;
+	ret = q->mq_ops->poll(hctx);
+	if (ret > 0)
+		hctx->poll_success++;
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
 	/*
@@ -3908,11 +3921,8 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	do {
 		int ret;
 
-		hctx->poll_invoked++;
-
-		ret = q->mq_ops->poll(hctx);
+		ret = blk_mq_poll_hctx(q, hctx);
 		if (ret > 0) {
-			hctx->poll_success++;
 			__set_current_state(TASK_RUNNING);
 			return ret;
 		}
-- 
2.29.2



* [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

Currently bio based IO poll has to poll all hw queues blindly, which
is very inefficient; the big reason is that we can't pass the bio
submission result to the io poll task.

In the IO submission context, store the associated underlying bios into
the submission queue, save the 'cookie' poll data in
bio->bi_iter.bi_private_data, and return current->pid to the caller of
submit_bio() for any DM or bio based driver's IO submitted from the FS.

In the IO poll context, the passed cookie tells us the PID of the
submission context, so we can find the bios from that submission
context. Move the bios from the submission queue to the poll queue of
the poll context, and keep polling until these bios are ended. Remove a
bio from the poll queue once it is ended. Add BIO_DONE and
BIO_END_BY_POLL for this purpose.

Usually submission shares its context with io poll. The per-task poll
context behaves much like a stack variable, and moving data between the
two per-task queues is cheap.
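
Since submission and poll usually run in the same task, io_uring with
IORING_SETUP_IOPOLL maps naturally onto this scheme: the task that
submitted the read later re-enters blk_poll() with the pid cookie and
ends up polling its own submission queue. A minimal, illustrative
liburing sketch (the dm device path is an assumption):

#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	void *buf;
	int fd;

	fd = open("/dev/dm-0", O_RDONLY | O_DIRECT);	/* assumed test device */
	if (fd < 0 || posix_memalign(&buf, 4096, 4096))
		return 1;

	/* IOPOLL: completions are reaped by polling, i.e. via blk_poll() */
	if (io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	io_uring_submit(&ring);

	/* same task submits and polls: the pid cookie resolves to ourselves */
	if (io_uring_wait_cqe(&ring, &cqe) == 0) {
		printf("res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	close(fd);
	free(buf);
	return 0;
}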

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c               |   5 ++
 block/blk-core.c          |  74 +++++++++++++++++-
 block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h |   3 +
 4 files changed, 235 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a1c4d2900c7a..bcf5eca0e8e3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  **/
 void bio_endio(struct bio *bio)
 {
+	/* BIO_END_BY_POLL has to be set before calling submit_bio */
+	if (bio_flagged(bio, BIO_END_BY_POLL)) {
+		bio_set_flag(bio, BIO_DONE);
+		return;
+	}
 again:
 	if (!bio_remaining_done(bio))
 		return;
diff --git a/block/blk-core.c b/block/blk-core.c
index a082bbc856fb..970b23fa2e6e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		bio->bi_opf |= REQ_TAG;
 }
 
+static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
+{
+	struct blk_bio_poll_data data = {
+		.bio	=	bio,
+	};
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	unsigned int queued;
+
+	/* lock is required if there is more than one writer */
+	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
+		spin_lock(&pc->lock);
+		queued = kfifo_put(&pc->sq, data);
+		spin_unlock(&pc->lock);
+	} else {
+		queued = kfifo_put(&pc->sq, data);
+	}
+
+	/*
+	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
+	 * so we can save cookie into this bio after submit_bio().
+	 */
+	if (queued)
+		bio_set_flag(bio, BIO_END_BY_POLL);
+	else
+		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
+
+	return queued;
+}
+
+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+	bio->bi_iter.bi_private_data = cookie;
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
  * bio_list_on_stack[1] contains bios that were submitted before the current
  *	->submit_bio_bio, but that haven't been processed yet.
  */
-static blk_qc_t __submit_bio_noacct(struct bio *bio)
+static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
@@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		if (ioc && queue_is_mq(q) &&
+				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
+			bool queued = blk_bio_poll_prep_submit(ioc, bio);
+
+			ret = __submit_bio(bio);
+			if (queued)
+				blk_bio_poll_post_submit(bio, ret);
+		} else {
+			ret = __submit_bio(bio);
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	return ret;
 }
 
+static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
+		struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	int entries = kfifo_len(&pc->sq);
+
+	__submit_bio_noacct_int(bio, ioc);
+
+	/* bio submissions queued to per-task poll context */
+	if (kfifo_len(&pc->sq) > entries)
+		return current->pid;
+
+	/* swapper's pid is 0, but it can't submit poll IO for us */
+	return 0;
+}
+
+static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
+{
+	struct io_context *ioc = current->io_context;
+
+	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
+		return __submit_bio_noacct_poll(bio, ioc);
+
+	return __submit_bio_noacct_int(bio, NULL);
+}
+
+
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 {
 	struct bio_list bio_list[2] = { };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 03f59915fe2c..4e6f1467d303 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
 	return ret;
 }
 
+static blk_qc_t bio_get_poll_cookie(struct bio *bio)
+{
+	return bio->bi_iter.bi_private_data;
+}
+
+static int blk_mq_poll_io(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	blk_qc_t cookie = bio_get_poll_cookie(bio);
+	int ret = 0;
+
+	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
+		struct blk_mq_hw_ctx *hctx =
+			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+
+		ret += blk_mq_poll_hctx(q, hctx);
+	}
+	return ret;
+}
+
+static int blk_bio_poll_and_end_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
+		struct bio *bio = poll_data[i].bio;
+
+		if (!bio)
+			continue;
+
+		ret += blk_mq_poll_io(bio);
+		if (bio_flagged(bio, BIO_DONE)) {
+			poll_data[i].bio = NULL;
+
+			/* clear BIO_END_BY_POLL and end me really */
+			bio_clear_flag(bio, BIO_END_BY_POLL);
+			bio_endio(bio);
+		}
+	}
+	return ret;
+}
+
+static int __blk_bio_poll_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *submit_ctx,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
+	int i;
+
+	/*
+	 * Move IO submission result from submission queue in submission
+	 * context to poll queue of poll context.
+	 *
+	 * There may be more than one readers on poll queue of the same
+	 * submission context, so have to lock here.
+	 */
+	spin_lock(&submit_ctx->lock);
+	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
+		if (poll_data[i].bio == NULL &&
+				!kfifo_get(&submit_ctx->sq, &poll_data[i]))
+			break;
+	}
+	spin_unlock(&submit_ctx->lock);
+
+	return blk_bio_poll_and_end_io(q, poll_ctx);
+}
+
+static int blk_bio_poll_io(struct request_queue *q,
+		struct io_context *submit_ioc,
+		struct io_context *poll_ioc)
+{
+	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
+	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
+	int ret;
+
+	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
+		mutex_lock(&poll_ctx->pq_lock);
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+		mutex_unlock(&poll_ctx->pq_lock);
+	} else {
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+	}
+	return ret;
+}
+
+static bool blk_bio_ioc_valid(struct task_struct *t)
+{
+	if (!t)
+		return false;
+
+	if (!t->io_context)
+		return false;
+
+	if (!t->io_context->data)
+		return false;
+
+	return true;
+}
+
+static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	struct io_context *poll_ioc = current->io_context;
+	pid_t pid;
+	struct task_struct *submit_task;
+	int ret;
+
+	pid = (pid_t)cookie;
+
+	/* io poll often share io submission context */
+	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
+		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
+
+	submit_task = find_get_task_by_vpid(pid);
+	if (likely(blk_bio_ioc_valid(submit_task)))
+		ret = blk_bio_poll_io(q, submit_task->io_context,
+				poll_ioc);
+	else
+		ret = 0;
+
+	put_task_struct(submit_task);
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
+	long state;
+
+	/* no need to poll */
+	if (cookie == 0)
+		return 0;
+
 	/*
 	 * Create poll queue for storing poll bio and its cookie from
 	 * submission queue
 	 */
 	blk_create_io_context(q, true);
 
+	state = current->state;
+	do {
+		int ret;
+
+		ret = __blk_bio_poll(q, cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
 	return 0;
 }
 
@@ -3893,7 +4047,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
+	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
 		return 0;
 
 	if (current->plug)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1bcade4bcc3..53f64eea9652 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -304,6 +304,9 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
+	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
+	BIO_DONE,
 	BIO_FLAG_LAST
 };
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-16  3:15   ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, Ming Lei, linux-block, dm-devel, Jeffle Xu,
	Christoph Hellwig

Currently bio based IO poll needs to poll all hw queues blindly, which
is very inefficient, and the main reason is that we can't pass the bio
submission result to the io poll task.

In the IO submission context, store the associated underlying bios into the
submission queue, save the 'cookie' poll data in bio->bi_iter.bi_private_data,
and return current->pid to the caller of submit_bio() for any DM or bio based
driver's IO submitted from the FS.

In the IO poll context, the passed cookie tells us the PID of the submission
context, so we can find the bios queued by that submission context. Move the
bios from the submission queue to the poll queue of the poll context, and keep
polling until these bios are ended. Remove a bio from the poll queue once it
is ended. Add BIO_DONE and BIO_END_BY_POLL for this purpose.

Usually submission shares its context with io poll. The per-task poll context
is just like a stack variable, and it is cheap to move data between the two
per-task queues.
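
For illustration only (not part of this patch): a HIPRI caller such as the
blkdev/iomap dio path would drive the returned cookie roughly as sketched
below, where 'q', 'bio' and 'dio_done' stand in for the caller's own state:

	blk_qc_t cookie = submit_bio(bio);

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (READ_ONCE(dio_done))
			break;
		/* cookie is the submitting task's pid for bio based drivers */
		if (!blk_poll(q, cookie, true))
			io_schedule();
	}
	__set_current_state(TASK_RUNNING);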

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bio.c               |   5 ++
 block/blk-core.c          |  74 +++++++++++++++++-
 block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
 include/linux/blk_types.h |   3 +
 4 files changed, 235 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index a1c4d2900c7a..bcf5eca0e8e3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
  **/
 void bio_endio(struct bio *bio)
 {
+	/* BIO_END_BY_POLL has to be set before calling submit_bio */
+	if (bio_flagged(bio, BIO_END_BY_POLL)) {
+		bio_set_flag(bio, BIO_DONE);
+		return;
+	}
 again:
 	if (!bio_remaining_done(bio))
 		return;
diff --git a/block/blk-core.c b/block/blk-core.c
index a082bbc856fb..970b23fa2e6e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
 		bio->bi_opf |= REQ_TAG;
 }
 
+static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
+{
+	struct blk_bio_poll_data data = {
+		.bio	=	bio,
+	};
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	unsigned int queued;
+
+	/* lock is required if there is more than one writer */
+	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
+		spin_lock(&pc->lock);
+		queued = kfifo_put(&pc->sq, data);
+		spin_unlock(&pc->lock);
+	} else {
+		queued = kfifo_put(&pc->sq, data);
+	}
+
+	/*
+	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
+	 * so we can save cookie into this bio after submit_bio().
+	 */
+	if (queued)
+		bio_set_flag(bio, BIO_END_BY_POLL);
+	else
+		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
+
+	return queued;
+}
+
+static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
+{
+	bio->bi_iter.bi_private_data = cookie;
+}
+
 static noinline_for_stack bool submit_bio_checks(struct bio *bio)
 {
 	struct block_device *bdev = bio->bi_bdev;
@@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
  * bio_list_on_stack[1] contains bios that were submitted before the current
  *	->submit_bio_bio, but that haven't been processed yet.
  */
-static blk_qc_t __submit_bio_noacct(struct bio *bio)
+static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
 {
 	struct bio_list bio_list_on_stack[2];
 	blk_qc_t ret = BLK_QC_T_NONE;
@@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 		bio_list_on_stack[1] = bio_list_on_stack[0];
 		bio_list_init(&bio_list_on_stack[0]);
 
-		ret = __submit_bio(bio);
+		if (ioc && queue_is_mq(q) &&
+				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
+			bool queued = blk_bio_poll_prep_submit(ioc, bio);
+
+			ret = __submit_bio(bio);
+			if (queued)
+				blk_bio_poll_post_submit(bio, ret);
+		} else {
+			ret = __submit_bio(bio);
+		}
 
 		/*
 		 * Sort new bios into those for a lower level and those for the
@@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 	return ret;
 }
 
+static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
+		struct io_context *ioc)
+{
+	struct blk_bio_poll_ctx *pc = ioc->data;
+	int entries = kfifo_len(&pc->sq);
+
+	__submit_bio_noacct_int(bio, ioc);
+
+	/* bio submissions queued to per-task poll context */
+	if (kfifo_len(&pc->sq) > entries)
+		return current->pid;
+
+	/* swapper's pid is 0, but it can't submit poll IO for us */
+	return 0;
+}
+
+static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
+{
+	struct io_context *ioc = current->io_context;
+
+	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
+		return __submit_bio_noacct_poll(bio, ioc);
+
+	return __submit_bio_noacct_int(bio, NULL);
+}
+
+
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 {
 	struct bio_list bio_list[2] = { };
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 03f59915fe2c..4e6f1467d303 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
 	return ret;
 }
 
+static blk_qc_t bio_get_poll_cookie(struct bio *bio)
+{
+	return bio->bi_iter.bi_private_data;
+}
+
+static int blk_mq_poll_io(struct bio *bio)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	blk_qc_t cookie = bio_get_poll_cookie(bio);
+	int ret = 0;
+
+	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
+		struct blk_mq_hw_ctx *hctx =
+			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
+
+		ret += blk_mq_poll_hctx(q, hctx);
+	}
+	return ret;
+}
+
+static int blk_bio_poll_and_end_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
+	int ret = 0;
+	int i;
+
+	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
+		struct bio *bio = poll_data[i].bio;
+
+		if (!bio)
+			continue;
+
+		ret += blk_mq_poll_io(bio);
+		if (bio_flagged(bio, BIO_DONE)) {
+			poll_data[i].bio = NULL;
+
+			/* clear BIO_END_BY_POLL and end me really */
+			bio_clear_flag(bio, BIO_END_BY_POLL);
+			bio_endio(bio);
+		}
+	}
+	return ret;
+}
+
+static int __blk_bio_poll_io(struct request_queue *q,
+		struct blk_bio_poll_ctx *submit_ctx,
+		struct blk_bio_poll_ctx *poll_ctx)
+{
+	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
+	int i;
+
+	/*
+	 * Move IO submission result from submission queue in submission
+	 * context to poll queue of poll context.
+	 *
+	 * There may be more than one readers on poll queue of the same
+	 * submission context, so have to lock here.
+	 */
+	spin_lock(&submit_ctx->lock);
+	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
+		if (poll_data[i].bio == NULL &&
+				!kfifo_get(&submit_ctx->sq, &poll_data[i]))
+			break;
+	}
+	spin_unlock(&submit_ctx->lock);
+
+	return blk_bio_poll_and_end_io(q, poll_ctx);
+}
+
+static int blk_bio_poll_io(struct request_queue *q,
+		struct io_context *submit_ioc,
+		struct io_context *poll_ioc)
+{
+	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
+	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
+	int ret;
+
+	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
+		mutex_lock(&poll_ctx->pq_lock);
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+		mutex_unlock(&poll_ctx->pq_lock);
+	} else {
+		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
+	}
+	return ret;
+}
+
+static bool blk_bio_ioc_valid(struct task_struct *t)
+{
+	if (!t)
+		return false;
+
+	if (!t->io_context)
+		return false;
+
+	if (!t->io_context->data)
+		return false;
+
+	return true;
+}
+
+static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
+{
+	struct io_context *poll_ioc = current->io_context;
+	pid_t pid;
+	struct task_struct *submit_task;
+	int ret;
+
+	pid = (pid_t)cookie;
+
+	/* io poll often share io submission context */
+	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
+		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
+
+	submit_task = find_get_task_by_vpid(pid);
+	if (likely(blk_bio_ioc_valid(submit_task)))
+		ret = blk_bio_poll_io(q, submit_task->io_context,
+				poll_ioc);
+	else
+		ret = 0;
+
+	put_task_struct(submit_task);
+
+	return ret;
+}
+
 static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 {
+	long state;
+
+	/* no need to poll */
+	if (cookie == 0)
+		return 0;
+
 	/*
 	 * Create poll queue for storing poll bio and its cookie from
 	 * submission queue
 	 */
 	blk_create_io_context(q, true);
 
+	state = current->state;
+	do {
+		int ret;
+
+		ret = __blk_bio_poll(q, cookie);
+		if (ret > 0) {
+			__set_current_state(TASK_RUNNING);
+			return ret;
+		}
+
+		if (signal_pending_state(state, current))
+			__set_current_state(TASK_RUNNING);
+
+		if (current->state == TASK_RUNNING)
+			return 1;
+		if (ret < 0 || !spin)
+			break;
+		cpu_relax();
+	} while (!need_resched());
+
+	__set_current_state(TASK_RUNNING);
 	return 0;
 }
 
@@ -3893,7 +4047,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
 	struct blk_mq_hw_ctx *hctx;
 	long state;
 
-	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
+	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
 		return 0;
 
 	if (current->plug)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a1bcade4bcc3..53f64eea9652 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -304,6 +304,9 @@ enum {
 	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
+	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
+	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
+	BIO_DONE,
 	BIO_FLAG_LAST
 };
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 09/11] block: add queue_to_disk() to get gendisk from request_queue
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

Sometimes we need to get the corresponding gendisk from request_queue.

It is preferred that block drivers store private data in
gendisk->private_data rather than request_queue->queuedata, e.g. see:
commit c4a59c4e5db3 ("dm: stop using ->queuedata").

So if only the request_queue is given, we need to look up its corresponding
gendisk in order to access the private data stored in that gendisk.
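
For illustration only, a helper that is handed just a request_queue could
then recover the driver's private data along these lines (queue_to_md() and
the use of dm's mapped_device here are merely an example, not part of the
patch):

	static struct mapped_device *queue_to_md(struct request_queue *q)
	{
		struct gendisk *disk = queue_to_disk(q);

		return disk->private_data;
	}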

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
---
 include/linux/blkdev.h       | 2 ++
 include/trace/events/kyber.h | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89a01850cf12..bfab74b45f15 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -686,6 +686,8 @@ static inline bool blk_account_rq(struct request *rq)
 	dma_map_page_attrs(dev, (bv)->bv_page, (bv)->bv_offset, (bv)->bv_len, \
 	(dir), (attrs))
 
+#define queue_to_disk(q)	(dev_to_disk(kobj_to_dev((q)->kobj.parent)))
+
 static inline bool queue_is_mq(struct request_queue *q)
 {
 	return q->mq_ops;
diff --git a/include/trace/events/kyber.h b/include/trace/events/kyber.h
index c0e7d24ca256..f9802562edf6 100644
--- a/include/trace/events/kyber.h
+++ b/include/trace/events/kyber.h
@@ -30,7 +30,7 @@ TRACE_EVENT(kyber_latency,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		strlcpy(__entry->type, type, sizeof(__entry->type));
 		__entry->percentile	= percentile;
@@ -59,7 +59,7 @@ TRACE_EVENT(kyber_adjust,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 		__entry->depth		= depth;
 	),
@@ -81,7 +81,7 @@ TRACE_EVENT(kyber_throttled,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= disk_devt(dev_to_disk(kobj_to_dev(q->kobj.parent)));
+		__entry->dev		= disk_devt(queue_to_disk(q));
 		strlcpy(__entry->domain, domain, sizeof(__entry->domain));
 	),
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 10/11] block: add poll_capable method to support bio-based IO polling
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel

From: Jeffle Xu <jefflexu@linux.alibaba.com>

This method can be used to check whether a bio-based device supports IO
polling or not. For mq devices, checking for a hw queue in polling mode is
adequate, while the sanity check shall be implementation specific for
bio-based devices. For example, a dm device needs to check whether all
underlying devices are capable of IO polling.

Though a bio-based device may have done the sanity check during the
device initialization phase, caching the result of this sanity check
(such as by caching it in the queue_flags) may not work. For dm
devices, users could change the state of the underlying devices through
'/sys/block/<dev>/io_poll', bypassing the dm device above. In this case,
the cached result of the initial sanity check could be out-of-date.
Thus the sanity check needs to be done every time 'io_poll' is to be
modified.
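
For illustration only, a bio-based driver other than dm might wire the new
hook up as sketched below; 'struct my_dev', my_dev_poll_capable() and
my_submit_bio() are made-up names (the real dm implementation comes in the
last patch of this series):

	static bool my_poll_capable(struct gendisk *disk)
	{
		struct my_dev *d = disk->private_data;

		/* re-evaluated on every 'io_poll' store, never cached */
		return my_dev_poll_capable(d);
	}

	static const struct block_device_operations my_fops = {
		.owner		= THIS_MODULE,
		.submit_bio	= my_submit_bio,
		.poll_capable	= my_poll_capable,
	};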

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 block/blk-sysfs.c      | 14 +++++++++++---
 include/linux/blkdev.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0f4f0c8a7825..367c1d9a55c6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -426,9 +426,17 @@ static ssize_t queue_poll_store(struct request_queue *q, const char *page,
 	unsigned long poll_on;
 	ssize_t ret;
 
-	if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
-	    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
-		return -EINVAL;
+	if (queue_is_mq(q)) {
+		if (!q->tag_set || q->tag_set->nr_maps <= HCTX_TYPE_POLL ||
+		    !q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
+			return -EINVAL;
+	} else {
+		struct gendisk *disk = queue_to_disk(q);
+
+		if (!disk->fops->poll_capable ||
+		    !disk->fops->poll_capable(disk))
+			return -EINVAL;
+	}
 
 	ret = queue_var_store(&poll_on, page, count);
 	if (ret < 0)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bfab74b45f15..a46f975f2a2f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1881,6 +1881,7 @@ struct block_device_operations {
 	int (*report_zones)(struct gendisk *, sector_t sector,
 			unsigned int nr_zones, report_zones_cb cb, void *data);
 	char *(*devnode)(struct gendisk *disk, umode_t *mode);
+	bool (*poll_capable)(struct gendisk *disk);
 	struct module *owner;
 	const struct pr_ops *pr_ops;
 };
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [RFC PATCH 11/11] dm: support IO polling for bio-based dm device
  2021-03-16  3:15 ` [dm-devel] " Ming Lei
@ 2021-03-16  3:15   ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  3:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer,
	dm-devel, Ming Lei

From: Jeffle Xu <jefflexu@linux.alibaba.com>

IO polling is enabled when all underlying target devices are capable
of IO polling. The sanity check supports the stacked device model, in
which one dm device may be built upon another dm device. In this case,
the mapped device will check whether the underlying dm target device
supports IO polling.
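
For illustration only, the stacked case reduces to reading QUEUE_FLAG_POLL,
which a lower bio-based dm device has already set or cleared through this
very path; 'lower_t', 'lower_q' and 'upper_t' are placeholder names:

	/* lower dm device, when its table is bound */
	if (dm_table_supports_poll(lower_t))
		blk_queue_flag_set(QUEUE_FLAG_POLL, lower_q);
	else
		blk_queue_flag_clear(QUEUE_FLAG_POLL, lower_q);

	/* upper dm device: device_not_poll_capable() then just reads that flag */
	bool upper_ok = !dm_table_any_dev_attr(upper_t, device_not_poll_capable, NULL);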

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/md/dm-table.c         | 24 ++++++++++++++++++++++++
 drivers/md/dm.c               | 14 ++++++++++++++
 include/linux/device-mapper.h |  1 +
 3 files changed, 39 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 95391f78b8d5..a8f3575fb118 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1509,6 +1509,12 @@ struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
 	return &t->targets[(KEYS_PER_NODE * n) + k];
 }
 
+static int device_not_poll_capable(struct dm_target *ti, struct dm_dev *dev,
+				   sector_t start, sector_t len, void *data)
+{
+	return !blk_queue_poll(bdev_get_queue(dev->bdev));
+}
+
 /*
  * type->iterate_devices() should be called when the sanity check needs to
  * iterate and check all underlying data devices. iterate_devices() will
@@ -1559,6 +1565,11 @@ static int count_device(struct dm_target *ti, struct dm_dev *dev,
 	return 0;
 }
 
+int dm_table_supports_poll(struct dm_table *t)
+{
+	return !dm_table_any_dev_attr(t, device_not_poll_capable, NULL);
+}
+
 /*
  * Check whether a table has no data devices attached using each
  * target's iterate_devices method.
@@ -2079,6 +2090,19 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 
 	dm_update_keyslot_manager(q, t);
 	blk_queue_update_readahead(q);
+
+	/*
+	 * The check for request-based devices is left to
+	 * dm_mq_init_request_queue()->blk_mq_init_allocated_queue().
+	 * For bio-based devices, only set QUEUE_FLAG_POLL when all
+	 * underlying devices support polling.
+	 */
+	if (__table_type_bio_based(t->type)) {
+		if (dm_table_supports_poll(t))
+			blk_queue_flag_set(QUEUE_FLAG_POLL, q);
+		else
+			blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
+	}
 }
 
 unsigned int dm_table_get_num_targets(struct dm_table *t)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 50b693d776d6..fe6893b078dc 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1720,6 +1720,19 @@ static blk_qc_t dm_submit_bio(struct bio *bio)
 	return ret;
 }
 
+static bool dm_bio_poll_capable(struct gendisk *disk)
+{
+	int ret, srcu_idx;
+	struct mapped_device *md = disk->private_data;
+	struct dm_table *t;
+
+	t = dm_get_live_table(md, &srcu_idx);
+	ret = dm_table_supports_poll(t);
+	dm_put_live_table(md, srcu_idx);
+
+	return ret;
+}
+
 /*-----------------------------------------------------------------
  * An IDR is used to keep track of allocated minor numbers.
  *---------------------------------------------------------------*/
@@ -3132,6 +3145,7 @@ static const struct pr_ops dm_pr_ops = {
 };
 
 static const struct block_device_operations dm_blk_dops = {
+	.poll_capable = dm_bio_poll_capable,
 	.submit_bio = dm_submit_bio,
 	.open = dm_blk_open,
 	.release = dm_blk_close,
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 7f4ac87c0b32..31bfd6f70013 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,6 +538,7 @@ unsigned int dm_table_get_num_targets(struct dm_table *t);
 fmode_t dm_table_get_mode(struct dm_table *t);
 struct mapped_device *dm_table_get_md(struct dm_table *t);
 const char *dm_table_device_name(struct dm_table *t);
+int dm_table_supports_poll(struct dm_table *t);
 
 /*
  * Trigger an event.
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 01/11] block: add helper of blk_queue_poll
  2021-03-16  3:15   ` [dm-devel] " Ming Lei
@ 2021-03-16  3:26     ` Chaitanya Kulkarni
  -1 siblings, 0 replies; 48+ messages in thread
From: Chaitanya Kulkarni @ 2021-03-16  3:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jeffle Xu, Mike Snitzer, dm-devel

On 3/15/21 20:18, Ming Lei wrote:
> There has been 3 users, and will be more, so add one such helper.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

This looks good to me irrespective of RFC.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  3:15   ` [dm-devel] " Ming Lei
@ 2021-03-16  6:46     ` JeffleXu
  -1 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-16  6:46 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

It is great progress to gather all the split bios that need to be polled
into a per-task queue. Still, some comments below.


On 3/16/21 11:15 AM, Ming Lei wrote:
> Currently bio based IO poll needs to poll all hw queue blindly, this way
> is very inefficient, and the big reason is that we can't pass bio
> submission result to io poll task.
> 
> In IO submission context, store associated underlying bios into the
> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> and return current->pid to caller of submit_bio() for any DM or bio based
> driver's IO, which is submitted from FS.
> 
> In IO poll context, the passed cookie tells us the PID of submission
> context, and we can find the bio from that submission context. Moving
> bio from submission queue to poll queue of the poll context, and keep
> polling until these bios are ended. Remove bio from poll queue if the
> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> 
> Usually submission shares context with io poll. The per-task poll context
> is just like stack variable, and it is cheap to move data between the two
> per-task queues.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/bio.c               |   5 ++
>  block/blk-core.c          |  74 +++++++++++++++++-
>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>  include/linux/blk_types.h |   3 +
>  4 files changed, 235 insertions(+), 3 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index a1c4d2900c7a..bcf5eca0e8e3 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>   **/
>  void bio_endio(struct bio *bio)
>  {
> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> +		bio_set_flag(bio, BIO_DONE);
> +		return;
> +	}
>  again:
>  	if (!bio_remaining_done(bio))
>  		return;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a082bbc856fb..970b23fa2e6e 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>  		bio->bi_opf |= REQ_TAG;
>  }
>  
> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> +{
> +	struct blk_bio_poll_data data = {
> +		.bio	=	bio,
> +	};
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	unsigned int queued;
> +
> +	/* lock is required if there is more than one writer */
> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> +		spin_lock(&pc->lock);
> +		queued = kfifo_put(&pc->sq, data);
> +		spin_unlock(&pc->lock);
> +	} else {
> +		queued = kfifo_put(&pc->sq, data);
> +	}
> +
> +	/*
> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> +	 * so we can save cookie into this bio after submit_bio().
> +	 */
> +	if (queued)
> +		bio_set_flag(bio, BIO_END_BY_POLL);
> +	else
> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> +
> +	return queued;
> +}

The size of the kfifo is limited, and it seems that once the sq kfifo is
full, the REQ_HIPRI flag is cleared and the corresponding bio is actually
enqueued into the default hw queue, which is IRQ driven.


> +
> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> +{
> +	bio->bi_iter.bi_private_data = cookie;
> +}
> +
>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>  {
>  	struct block_device *bdev = bio->bi_bdev;
> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>   * bio_list_on_stack[1] contains bios that were submitted before the current
>   *	->submit_bio_bio, but that haven't been processed yet.
>   */
> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>  {
>  	struct bio_list bio_list_on_stack[2];
>  	blk_qc_t ret = BLK_QC_T_NONE;
> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>  		bio_list_init(&bio_list_on_stack[0]);
>  
> -		ret = __submit_bio(bio);
> +		if (ioc && queue_is_mq(q) &&
> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> +
> +			ret = __submit_bio(bio);
> +			if (queued)
> +				blk_bio_poll_post_submit(bio, ret);
> +		} else {
> +			ret = __submit_bio(bio);
> +		}
>  
>  		/*
>  		 * Sort new bios into those for a lower level and those for the
> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>  	return ret;
>  }
>  
> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> +		struct io_context *ioc)
> +{
> +	struct blk_bio_poll_ctx *pc = ioc->data;
> +	int entries = kfifo_len(&pc->sq);
> +
> +	__submit_bio_noacct_int(bio, ioc);
> +
> +	/* bio submissions queued to per-task poll context */
> +	if (kfifo_len(&pc->sq) > entries)
> +		return current->pid;
> +
> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> +	return 0;
> +}
> +
> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> +{
> +	struct io_context *ioc = current->io_context;
> +
> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> +		return __submit_bio_noacct_poll(bio, ioc);
> +
> +	return __submit_bio_noacct_int(bio, NULL);
> +}
> +
> +
>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>  {
>  	struct bio_list bio_list[2] = { };
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 03f59915fe2c..4e6f1467d303 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>  	return ret;
>  }
>  
> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> +{
> +	return bio->bi_iter.bi_private_data;
> +}
> +
> +static int blk_mq_poll_io(struct bio *bio)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> +	int ret = 0;
> +
> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> +		struct blk_mq_hw_ctx *hctx =
> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> +
> +		ret += blk_mq_poll_hctx(q, hctx);
> +	}
> +	return ret;
> +}
> +
> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> +	int ret = 0;
> +	int i;
> +
> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> +		struct bio *bio = poll_data[i].bio;
> +
> +		if (!bio)
> +			continue;
> +
> +		ret += blk_mq_poll_io(bio);
> +		if (bio_flagged(bio, BIO_DONE)) {
> +			poll_data[i].bio = NULL;
> +
> +			/* clear BIO_END_BY_POLL and end me really */
> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> +			bio_endio(bio);
> +		}
> +	}
> +	return ret;
> +}

When there are multiple threads polling, say thread A and thread B,
there may be one bio which should be polled by thread A (its pid was
passed to thread A) while it is actually completed by thread B. In this
case, when the bio is completed by thread B, the bio is not really
completed, and one extra blk_poll() still needs to be called.



> +
> +static int __blk_bio_poll_io(struct request_queue *q,
> +		struct blk_bio_poll_ctx *submit_ctx,
> +		struct blk_bio_poll_ctx *poll_ctx)
> +{
> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> +	int i;
> +
> +	/*
> +	 * Move IO submission result from submission queue in submission
> +	 * context to poll queue of poll context.
> +	 *
> +	 * There may be more than one readers on poll queue of the same
> +	 * submission context, so have to lock here.
> +	 */
> +	spin_lock(&submit_ctx->lock);
> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> +		if (poll_data[i].bio == NULL &&
> +				!kfifo_get(&submit_ctx->sq, &poll_data[i]))
> +			break;
> +	}
> +	spin_unlock(&submit_ctx->lock);
> +
> +	return blk_bio_poll_and_end_io(q, poll_ctx);
> +}
> +
> +static int blk_bio_poll_io(struct request_queue *q,
> +		struct io_context *submit_ioc,
> +		struct io_context *poll_ioc)
> +{
> +	struct blk_bio_poll_ctx *submit_ctx = submit_ioc->data;
> +	struct blk_bio_poll_ctx *poll_ctx = poll_ioc->data;
> +	int ret;
> +
> +	if (unlikely(atomic_read(&poll_ioc->nr_tasks) > 1)) {
> +		mutex_lock(&poll_ctx->pq_lock);
> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +		mutex_unlock(&poll_ctx->pq_lock);
> +	} else {
> +		ret = __blk_bio_poll_io(q, submit_ctx, poll_ctx);
> +	}
> +	return ret;
> +}
> +
> +static bool blk_bio_ioc_valid(struct task_struct *t)
> +{
> +	if (!t)
> +		return false;
> +
> +	if (!t->io_context)
> +		return false;
> +
> +	if (!t->io_context->data)
> +		return false;
> +
> +	return true;
> +}
> +
> +static int __blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
> +{
> +	struct io_context *poll_ioc = current->io_context;
> +	pid_t pid;
> +	struct task_struct *submit_task;
> +	int ret;
> +
> +	pid = (pid_t)cookie;
> +
> +	/* io poll often share io submission context */
> +	if (likely(current->pid == pid && blk_bio_ioc_valid(current)))
> +		return blk_bio_poll_io(q, poll_ioc, poll_ioc);
> +
> +	submit_task = find_get_task_by_vpid(pid);
> +	if (likely(blk_bio_ioc_valid(submit_task)))
> +		ret = blk_bio_poll_io(q, submit_task->io_context,
> +				poll_ioc);
> +	else
> +		ret = 0;
> +
> +	put_task_struct(submit_task);
> +
> +	return ret;
> +}
> +
>  static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  {
> +	long state;
> +
> +	/* no need to poll */
> +	if (cookie == 0)
> +		return 0;
> +
>  	/*
>  	 * Create poll queue for storing poll bio and its cookie from
>  	 * submission queue
>  	 */
>  	blk_create_io_context(q, true);
>  
> +	state = current->state;
> +	do {
> +		int ret;
> +
> +		ret = __blk_bio_poll(q, cookie);
> +		if (ret > 0) {
> +			__set_current_state(TASK_RUNNING);
> +			return ret;
> +		}
> +
> +		if (signal_pending_state(state, current))
> +			__set_current_state(TASK_RUNNING);
> +
> +		if (current->state == TASK_RUNNING)
> +			return 1;
> +		if (ret < 0 || !spin)
> +			break;
> +		cpu_relax();
> +	} while (!need_resched());
> +
> +	__set_current_state(TASK_RUNNING);
>  	return 0;
>  }
>  
> @@ -3893,7 +4047,7 @@ int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
>  	struct blk_mq_hw_ctx *hctx;
>  	long state;
>  
> -	if (!blk_qc_t_valid(cookie) || !blk_queue_poll(q))
> +	if (!blk_queue_poll(q) || (queue_is_mq(q) && !blk_qc_t_valid(cookie)))
>  		return 0;
>  
>  	if (current->plug)
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index a1bcade4bcc3..53f64eea9652 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -304,6 +304,9 @@ enum {
>  	BIO_CGROUP_ACCT,	/* has been accounted to a cgroup */
>  	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
>  	BIO_REMAPPED,
> +	BIO_END_BY_POLL,	/* end by blk_bio_poll() explicitly */
> +	/* set when bio can be ended, used for bio with BIO_END_BY_POLL */
> +	BIO_DONE,
>  	BIO_FLAG_LAST
>  };
>  
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  6:46     ` [dm-devel] " JeffleXu
@ 2021-03-16  7:17       ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-16  7:17 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> It is a giant progress to gather all split bios that need to be polled
> in a per-task queue. Still some comments below.
> 
> 
> On 3/16/21 11:15 AM, Ming Lei wrote:
> > Currently bio based IO poll needs to poll all hw queue blindly, this way
> > is very inefficient, and the big reason is that we can't pass bio
> > submission result to io poll task.
> > 
> > In IO submission context, store associated underlying bios into the
> > submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > and return current->pid to caller of submit_bio() for any DM or bio based
> > driver's IO, which is submitted from FS.
> > 
> > In IO poll context, the passed cookie tells us the PID of submission
> > context, and we can find the bio from that submission context. Moving
> > bio from submission queue to poll queue of the poll context, and keep
> > polling until these bios are ended. Remove bio from poll queue if the
> > bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > 
> > Usually submission shares context with io poll. The per-task poll context
> > is just like stack variable, and it is cheap to move data between the two
> > per-task queues.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/bio.c               |   5 ++
> >  block/blk-core.c          |  74 +++++++++++++++++-
> >  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >  include/linux/blk_types.h |   3 +
> >  4 files changed, 235 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index a1c4d2900c7a..bcf5eca0e8e3 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >   **/
> >  void bio_endio(struct bio *bio)
> >  {
> > +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > +		bio_set_flag(bio, BIO_DONE);
> > +		return;
> > +	}
> >  again:
> >  	if (!bio_remaining_done(bio))
> >  		return;
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index a082bbc856fb..970b23fa2e6e 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >  		bio->bi_opf |= REQ_TAG;
> >  }
> >  
> > +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > +{
> > +	struct blk_bio_poll_data data = {
> > +		.bio	=	bio,
> > +	};
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	unsigned int queued;
> > +
> > +	/* lock is required if there is more than one writer */
> > +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> > +		spin_lock(&pc->lock);
> > +		queued = kfifo_put(&pc->sq, data);
> > +		spin_unlock(&pc->lock);
> > +	} else {
> > +		queued = kfifo_put(&pc->sq, data);
> > +	}
> > +
> > +	/*
> > +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > +	 * so we can save cookie into this bio after submit_bio().
> > +	 */
> > +	if (queued)
> > +		bio_set_flag(bio, BIO_END_BY_POLL);
> > +	else
> > +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > +
> > +	return queued;
> > +}
> 
> The size of kfifo is limited, and it seems that once the sq of kfifio is
> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> enqueued into the default hw queue, which is IRQ driven.

Yeah, this patch starts with a queue depth of 64, and we can increase it to
128, which should cover most cases.
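
(For reference, a depth of 128 keeps the kfifo at a power-of-two size. The
declaration below is only a sketch of what the per-task submission queue
could look like; the real definition lives in an earlier patch of this
series, so the field layout here is an assumption based on the pc->sq and
pc->lock usage above.)

	#include <linux/kfifo.h>
	#include <linux/spinlock.h>

	struct blk_bio_poll_ctx {
		spinlock_t	lock;
		/* submission queue: kfifo depth has to be a power of two */
		DECLARE_KFIFO(sq, struct blk_bio_poll_data, 128);
		/* poll queue side (pq, pq_lock) omitted in this sketch */
	};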

> 
> 
> > +
> > +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > +{
> > +	bio->bi_iter.bi_private_data = cookie;
> > +}
> > +
> >  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >  {
> >  	struct block_device *bdev = bio->bi_bdev;
> > @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >   * bio_list_on_stack[1] contains bios that were submitted before the current
> >   *	->submit_bio_bio, but that haven't been processed yet.
> >   */
> > -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >  {
> >  	struct bio_list bio_list_on_stack[2];
> >  	blk_qc_t ret = BLK_QC_T_NONE;
> > @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >  		bio_list_init(&bio_list_on_stack[0]);
> >  
> > -		ret = __submit_bio(bio);
> > +		if (ioc && queue_is_mq(q) &&
> > +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > +
> > +			ret = __submit_bio(bio);
> > +			if (queued)
> > +				blk_bio_poll_post_submit(bio, ret);
> > +		} else {
> > +			ret = __submit_bio(bio);
> > +		}
> >  
> >  		/*
> >  		 * Sort new bios into those for a lower level and those for the
> > @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >  	return ret;
> >  }
> >  
> > +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > +		struct io_context *ioc)
> > +{
> > +	struct blk_bio_poll_ctx *pc = ioc->data;
> > +	int entries = kfifo_len(&pc->sq);
> > +
> > +	__submit_bio_noacct_int(bio, ioc);
> > +
> > +	/* bio submissions queued to per-task poll context */
> > +	if (kfifo_len(&pc->sq) > entries)
> > +		return current->pid;
> > +
> > +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > +	return 0;
> > +}
> > +
> > +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > +{
> > +	struct io_context *ioc = current->io_context;
> > +
> > +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > +		return __submit_bio_noacct_poll(bio, ioc);
> > +
> > +	return __submit_bio_noacct_int(bio, NULL);
> > +}
> > +
> > +
> >  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >  {
> >  	struct bio_list bio_list[2] = { };
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 03f59915fe2c..4e6f1467d303 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >  	return ret;
> >  }
> >  
> > +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > +{
> > +	return bio->bi_iter.bi_private_data;
> > +}
> > +
> > +static int blk_mq_poll_io(struct bio *bio)
> > +{
> > +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > +	int ret = 0;
> > +
> > +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > +		struct blk_mq_hw_ctx *hctx =
> > +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > +
> > +		ret += blk_mq_poll_hctx(q, hctx);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > +		struct blk_bio_poll_ctx *poll_ctx)
> > +{
> > +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> > +	int ret = 0;
> > +	int i;
> > +
> > +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> > +		struct bio *bio = poll_data[i].bio;
> > +
> > +		if (!bio)
> > +			continue;
> > +
> > +		ret += blk_mq_poll_io(bio);
> > +		if (bio_flagged(bio, BIO_DONE)) {
> > +			poll_data[i].bio = NULL;
> > +
> > +			/* clear BIO_END_BY_POLL and end me really */
> > +			bio_clear_flag(bio, BIO_END_BY_POLL);
> > +			bio_endio(bio);
> > +		}
> > +	}
> > +	return ret;
> > +}
> 
> When there are multiple threads polling, saying thread A and thread B,
> then there's one bio which should be polled by thread A (the pid is
> passed to thread A), while it's actually completed by thread B. In this
> case, when the bio is completed by thread B, the bio is not really
> completed and one extra blk_poll() still needs to be called.

When this happens, the dm bio can't be completed, and the associated
kiocb can't be completed either; io_uring or any other polling context
will keep calling blk_poll() with thread A's pid until this dm bio is
done, since the dm bio was submitted from thread A.
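
To make that flow concrete, the poll side resolves the pid cookie back
to the submitting task's poll context roughly as below (a simplified
sketch only; lifetime, RCU and multi-reader details are ignored, and it
is not the exact code in this patch):

```
/* sketch only: map the pid cookie back to the submitter's poll context */
static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
{
	struct blk_bio_poll_ctx *pc = NULL;
	struct blk_bio_poll_data data;
	struct task_struct *task;
	int i;

	rcu_read_lock();
	task = pid_task(find_vpid((pid_t)cookie), PIDTYPE_PID);
	if (task && task->io_context)
		pc = task->io_context->data;
	rcu_read_unlock();
	if (!pc)
		return 0;

	/* move newly submitted HIPRI bios from the submission queue ... */
	while (kfifo_get(&pc->sq, &data)) {
		for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
			if (!pc->pq[i].bio) {
				pc->pq[i] = data;
				break;
			}
		}
	}

	/* ... and poll/end them until every moved bio is done */
	return blk_bio_poll_and_end_io(q, pc);
}
```

So regardless of which thread's polling completes the underlying
requests, the dm bio is only really ended once the poll loop that owns
it sees BIO_DONE.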


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  7:17       ` [dm-devel] " Ming Lei
@ 2021-03-16  8:52         ` JeffleXu
  -1 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-16  8:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/16/21 3:17 PM, Ming Lei wrote:
> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>> It is a giant progress to gather all split bios that need to be polled
>> in a per-task queue. Still some comments below.
>>
>>
>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>> is very inefficient, and the big reason is that we can't pass bio
>>> submission result to io poll task.
>>>
>>> In IO submission context, store associated underlying bios into the
>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>> driver's IO, which is submitted from FS.
>>>
>>> In IO poll context, the passed cookie tells us the PID of submission
>>> context, and we can find the bio from that submission context. Moving
>>> bio from submission queue to poll queue of the poll context, and keep
>>> polling until these bios are ended. Remove bio from poll queue if the
>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>
>>> Usually submission shares context with io poll. The per-task poll context
>>> is just like stack variable, and it is cheap to move data between the two
>>> per-task queues.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/bio.c               |   5 ++
>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>  include/linux/blk_types.h |   3 +
>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>   **/
>>>  void bio_endio(struct bio *bio)
>>>  {
>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>> +		bio_set_flag(bio, BIO_DONE);
>>> +		return;
>>> +	}
>>>  again:
>>>  	if (!bio_remaining_done(bio))
>>>  		return;
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index a082bbc856fb..970b23fa2e6e 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>  		bio->bi_opf |= REQ_TAG;
>>>  }
>>>  
>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>> +{
>>> +	struct blk_bio_poll_data data = {
>>> +		.bio	=	bio,
>>> +	};
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	unsigned int queued;
>>> +
>>> +	/* lock is required if there is more than one writer */
>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>> +		spin_lock(&pc->lock);
>>> +		queued = kfifo_put(&pc->sq, data);
>>> +		spin_unlock(&pc->lock);
>>> +	} else {
>>> +		queued = kfifo_put(&pc->sq, data);
>>> +	}
>>> +
>>> +	/*
>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>> +	 * so we can save cookie into this bio after submit_bio().
>>> +	 */
>>> +	if (queued)
>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>> +	else
>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>> +
>>> +	return queued;
>>> +}
>>
>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>> enqueued into the default hw queue, which is IRQ driven.
> 
> Yeah, this patch starts with 64 queue depth, and we can increase it to
> 128, which should cover most of cases.

It seems that the queue depth of the kfifo affects the performance,
based on a quick test I did.



Test Result:

BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
------------------ | ------- | ----
64                 | 128     | 301k (IRQ) -> 340k (iopoll)
64                 | 16      | 304k (IRQ) -> 392k (iopoll)
128                | 128     | 204k (IRQ) -> 317k (iopoll)
256                | 128     | 241k (IRQ) -> 391k (iopoll)

It seems that BLK_BIO_POLL_SQ_SZ needs to be increased accordingly when
the iodepth is quite large. But I don't know why the performance in IRQ
mode decreases when BLK_BIO_POLL_SQ_SZ is increased.
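
To correlate these numbers with how often the sq actually overflows, I
may instrument the fallback path next, e.g. with a local debugging hack
like the one below in blk-core.c (not part of the series; the counter
and helper name are made up, and it would be called right where
blk_bio_poll_prep_submit() clears REQ_HIPRI after a failed kfifo_put()):

```
/* debug hack: count HIPRI bios that fall back to IRQ completion
 * because the per-task submission queue is already full
 */
static atomic_t blk_bio_poll_sq_overflows = ATOMIC_INIT(0);

static void blk_bio_poll_note_overflow(void)
{
	atomic_inc(&blk_bio_poll_sq_overflows);
	pr_debug_ratelimited("bio poll: sq full, IRQ fallback (%d total)\n",
			     atomic_read(&blk_bio_poll_sq_overflows));
}
```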





Test Environment:
nvme.poll_queues = 1,
dmsetup create testdev --table '0 2097152 linear /dev/nvme0n1 0',

```
$cat fio.conf
[global]
name=iouring-sqpoll-iopoll-1
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
#hipri=1
bs=4k
runtime=10
time_based
group_reporting
randrepeat=0
cpus_allowed=44

[device]
filename=/dev/mapper/testdev
```



> 
>>
>>
>>> +
>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>> +{
>>> +	bio->bi_iter.bi_private_data = cookie;
>>> +}
>>> +
>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>  {
>>>  	struct block_device *bdev = bio->bi_bdev;
>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>   */
>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>>>  {
>>>  	struct bio_list bio_list_on_stack[2];
>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>  
>>> -		ret = __submit_bio(bio);
>>> +		if (ioc && queue_is_mq(q) &&
>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>> +
>>> +			ret = __submit_bio(bio);
>>> +			if (queued)
>>> +				blk_bio_poll_post_submit(bio, ret);
>>> +		} else {
>>> +			ret = __submit_bio(bio);
>>> +		}
>>>  
>>>  		/*
>>>  		 * Sort new bios into those for a lower level and those for the
>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  	return ret;
>>>  }
>>>  
>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>> +		struct io_context *ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	int entries = kfifo_len(&pc->sq);
>>> +
>>> +	__submit_bio_noacct_int(bio, ioc);
>>> +
>>> +	/* bio submissions queued to per-task poll context */
>>> +	if (kfifo_len(&pc->sq) > entries)
>>> +		return current->pid;
>>> +
>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>> +	return 0;
>>> +}
>>> +
>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +{
>>> +	struct io_context *ioc = current->io_context;
>>> +
>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>> +
>>> +	return __submit_bio_noacct_int(bio, NULL);
>>> +}
>>> +
>>> +
>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>  {
>>>  	struct bio_list bio_list[2] = { };
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 03f59915fe2c..4e6f1467d303 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>  	return ret;
>>>  }
>>>  
>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
>>> +{
>>> +	return bio->bi_iter.bi_private_data;
>>> +}
>>> +
>>> +static int blk_mq_poll_io(struct bio *bio)
>>> +{
>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
>>> +	int ret = 0;
>>> +
>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>> +		struct blk_mq_hw_ctx *hctx =
>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>> +
>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>> +{
>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
>>> +		struct bio *bio = poll_data[i].bio;
>>> +
>>> +		if (!bio)
>>> +			continue;
>>> +
>>> +		ret += blk_mq_poll_io(bio);
>>> +		if (bio_flagged(bio, BIO_DONE)) {
>>> +			poll_data[i].bio = NULL;
>>> +
>>> +			/* clear BIO_END_BY_POLL and end me really */
>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
>>> +			bio_endio(bio);
>>> +		}
>>> +	}
>>> +	return ret;
>>> +}
>>
>> When there are multiple threads polling, saying thread A and thread B,
>> then there's one bio which should be polled by thread A (the pid is
>> passed to thread A), while it's actually completed by thread B. In this
>> case, when the bio is completed by thread B, the bio is not really
>> completed and one extra blk_poll() still needs to be called.
> 
> When this happens, the dm bio can't be completed, and the associated
> kiocb can't be completed too, io_uring or other poll code context will
> keep calling blk_poll() by passing thread A's pid until this dm bio is
> done, since the dm bio is submitted from thread A.
> 
> 
> Thanks, 
> Ming
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  7:17       ` [dm-devel] " Ming Lei
@ 2021-03-16 11:00         ` JeffleXu
  -1 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-16 11:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/16/21 3:17 PM, Ming Lei wrote:
> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>> It is a giant progress to gather all split bios that need to be polled
>> in a per-task queue. Still some comments below.
>>
>>
>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>> is very inefficient, and the big reason is that we can't pass bio
>>> submission result to io poll task.
>>>
>>> In IO submission context, store associated underlying bios into the
>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>> driver's IO, which is submitted from FS.
>>>
>>> In IO poll context, the passed cookie tells us the PID of submission
>>> context, and we can find the bio from that submission context. Moving
>>> bio from submission queue to poll queue of the poll context, and keep
>>> polling until these bios are ended. Remove bio from poll queue if the
>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>
>>> Usually submission shares context with io poll. The per-task poll context
>>> is just like stack variable, and it is cheap to move data between the two
>>> per-task queues.
>>>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/bio.c               |   5 ++
>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>  include/linux/blk_types.h |   3 +
>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/bio.c b/block/bio.c
>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>> --- a/block/bio.c
>>> +++ b/block/bio.c
>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>   **/
>>>  void bio_endio(struct bio *bio)
>>>  {
>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>> +		bio_set_flag(bio, BIO_DONE);
>>> +		return;
>>> +	}
>>>  again:
>>>  	if (!bio_remaining_done(bio))
>>>  		return;
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index a082bbc856fb..970b23fa2e6e 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>  		bio->bi_opf |= REQ_TAG;
>>>  }
>>>  
>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>> +{
>>> +	struct blk_bio_poll_data data = {
>>> +		.bio	=	bio,
>>> +	};
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	unsigned int queued;
>>> +
>>> +	/* lock is required if there is more than one writer */
>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>> +		spin_lock(&pc->lock);
>>> +		queued = kfifo_put(&pc->sq, data);
>>> +		spin_unlock(&pc->lock);
>>> +	} else {
>>> +		queued = kfifo_put(&pc->sq, data);
>>> +	}
>>> +
>>> +	/*
>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>> +	 * so we can save cookie into this bio after submit_bio().
>>> +	 */
>>> +	if (queued)
>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>> +	else
>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>> +
>>> +	return queued;
>>> +}
>>
>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>> enqueued into the default hw queue, which is IRQ driven.
> 
> Yeah, this patch starts with 64 queue depth, and we can increase it to
> 128, which should cover most of cases.
> 
>>
>>
>>> +
>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>> +{
>>> +	bio->bi_iter.bi_private_data = cookie;
>>> +}
>>> +
>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>  {
>>>  	struct block_device *bdev = bio->bi_bdev;
>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>   */
>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>>>  {
>>>  	struct bio_list bio_list_on_stack[2];
>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>  
>>> -		ret = __submit_bio(bio);
>>> +		if (ioc && queue_is_mq(q) &&
>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>> +
>>> +			ret = __submit_bio(bio);
>>> +			if (queued)
>>> +				blk_bio_poll_post_submit(bio, ret);
>>> +		} else {
>>> +			ret = __submit_bio(bio);
>>> +		}
>>>  
>>>  		/*
>>>  		 * Sort new bios into those for a lower level and those for the
>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>  	return ret;
>>>  }
>>>  
>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>> +		struct io_context *ioc)
>>> +{
>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>> +	int entries = kfifo_len(&pc->sq);
>>> +
>>> +	__submit_bio_noacct_int(bio, ioc);
>>> +
>>> +	/* bio submissions queued to per-task poll context */
>>> +	if (kfifo_len(&pc->sq) > entries)
>>> +		return current->pid;
>>> +
>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>> +	return 0;
>>> +}
>>> +
>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>> +{
>>> +	struct io_context *ioc = current->io_context;
>>> +
>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>> +
>>> +	return __submit_bio_noacct_int(bio, NULL);
>>> +}
>>> +
>>> +
>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>  {
>>>  	struct bio_list bio_list[2] = { };
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 03f59915fe2c..4e6f1467d303 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>  	return ret;
>>>  }
>>>  
>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
>>> +{
>>> +	return bio->bi_iter.bi_private_data;
>>> +}
>>> +
>>> +static int blk_mq_poll_io(struct bio *bio)
>>> +{
>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
>>> +	int ret = 0;
>>> +
>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>> +		struct blk_mq_hw_ctx *hctx =
>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>> +
>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>> +	}
>>> +	return ret;
>>> +}
>>> +
>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>> +{
>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
>>> +	int ret = 0;
>>> +	int i;
>>> +
>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
>>> +		struct bio *bio = poll_data[i].bio;
>>> +
>>> +		if (!bio)
>>> +			continue;
>>> +
>>> +		ret += blk_mq_poll_io(bio);
>>> +		if (bio_flagged(bio, BIO_DONE)) {
>>> +			poll_data[i].bio = NULL;
>>> +
>>> +			/* clear BIO_END_BY_POLL and end me really */
>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
>>> +			bio_endio(bio);
>>> +		}
>>> +	}
>>> +	return ret;
>>> +}
>>
>> When there are multiple threads polling, saying thread A and thread B,
>> then there's one bio which should be polled by thread A (the pid is
>> passed to thread A), while it's actually completed by thread B. In this
>> case, when the bio is completed by thread B, the bio is not really
>> completed and one extra blk_poll() still needs to be called.
> 
> When this happens, the dm bio can't be completed, and the associated
> kiocb can't be completed too, io_uring or other poll code context will
> keep calling blk_poll() by passing thread A's pid until this dm bio is
> done, since the dm bio is submitted from thread A.
> 

This will affect multi-thread polling performance. I tested dm-stripe,
in which every bio is split and enqueued into all underlying devices
(with the 4 KiB chunk size and bs=12k used below, each 12 KiB bio covers
one chunk on each of the three devices), which amplifies the
interference between multiple threads.

Test Result:
IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain


Test Environment:

nvme.poll_queues = 3

BLK_BIO_POLL_SQ_SZ = 128

dmsetup create testdev --table "0 629145600 striped 3 8 /dev/nvme0n1 0
/dev/nvme1n1 0 /dev/nvme4n1 0"


```
$cat fio.conf
[global]
name=iouring-sqpoll-iopoll-1
ioengine=io_uring
iodepth=128
numjobs=1
thread
rw=randread
direct=1
hipri=1
runtime=10
time_based
group_reporting
randrepeat=0
filename=/dev/mapper/testdev
bs=12k

[job-1]
cpus_allowed=14

[job-2]
cpus_allowed=16

[job-3]
cpus_allowed=84
```

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16  8:52         ` [dm-devel] " JeffleXu
@ 2021-03-17  2:54           ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  2:54 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >> It is a giant progress to gather all split bios that need to be polled
> >> in a per-task queue. Still some comments below.
> >>
> >>
> >> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, store associated underlying bios into the
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>  include/linux/blk_types.h |   3 +
> >>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index a082bbc856fb..970b23fa2e6e 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_data data = {
> >>> +		.bio	=	bio,
> >>> +	};
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/* lock is required if there is more than one writer */
> >>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>> +		spin_lock(&pc->lock);
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +		spin_unlock(&pc->lock);
> >>> +	} else {
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * so we can save cookie into this bio after submit_bio().
> >>> +	 */
> >>> +	if (queued)
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +	else
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +
> >>> +	return queued;
> >>> +}
> >>
> >> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >> enqueued into the default hw queue, which is IRQ driven.
> > 
> > Yeah, this patch starts with 64 queue depth, and we can increase it to
> > 128, which should cover most of cases.
> 
> It seems that the queue depth of kfifo will affect the performance as I
> did a fast test.
> 
> 
> 
> Test Result:
> 
> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
> ------------------ | ------- | ----
> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
> 
> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
> iodepth is quite large. But I don't know why the performance in IRQ mode
> decreases when BLK_BIO_POLL_SQ_SZ is increased.

This patchset is not supposed to affect IRQ mode because HIPRI isn't set
in IRQ mode. Or do you mean that '--hipri' and io_uring are set up, but
nvme.poll_queues is set to 0, in your 'IRQ' mode test?
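
For reference, REQ_HIPRI is only set on the FS bio when the kiocb
carries IOCB_HIPRI (i.e. fio's hipri=1), roughly via the existing
bio_set_polled() helper (quoted from memory, so treat it as
illustrative):

```
/* illustrative: how the direct-IO path marks a bio for polling today */
static inline void bio_set_polled(struct bio *bio, struct kiocb *kiocb)
{
	bio->bi_opf |= REQ_HIPRI;
	if (!is_sync_kiocb(kiocb))
		bio->bi_opf |= REQ_NOWAIT;
}
```

So with hipri=0 the submission path should never take the new
poll-context branches at all.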

Thanks for starting to run performance tests; so far I have only run
tests in KVM and haven't started performance testing yet.



thanks,
Ming


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  2:54           ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  2:54 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer

On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >> It is a giant progress to gather all split bios that need to be polled
> >> in a per-task queue. Still some comments below.
> >>
> >>
> >> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, store associated underlying bios into the
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>  include/linux/blk_types.h |   3 +
> >>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index a082bbc856fb..970b23fa2e6e 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_data data = {
> >>> +		.bio	=	bio,
> >>> +	};
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/* lock is required if there is more than one writer */
> >>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>> +		spin_lock(&pc->lock);
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +		spin_unlock(&pc->lock);
> >>> +	} else {
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * so we can save cookie into this bio after submit_bio().
> >>> +	 */
> >>> +	if (queued)
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +	else
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +
> >>> +	return queued;
> >>> +}
> >>
> >> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >> enqueued into the default hw queue, which is IRQ driven.
> > 
> > Yeah, this patch starts with 64 queue depth, and we can increase it to
> > 128, which should cover most of cases.
> 
> It seems that the queue depth of kfifo will affect the performance as I
> did a fast test.
> 
> 
> 
> Test Result:
> 
> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
> ------------------ | ------- | ----
> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
> 
> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
> iodepth is quite large. But I don't know why the performance in IRQ mode
> decreases when BLK_BIO_POLL_SQ_SZ is increased.

This patchset is not supposed to affect IRQ mode, because HIPRI isn't set
in IRQ mode. Or do you mean that io_uring is set up with '--hipri', but
nvme.poll_queues is set to 0 in your 'IRQ' mode test?
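
For reference, this is roughly how I would set up the two cases (just a
sketch; the job file names are placeholders, and it assumes the nvme
module can be reloaded on the test machine):

```
# real IRQ baseline: no poll queues, hipri=0 in the job file
modprobe -r nvme && modprobe nvme poll_queues=0
fio fio-irq.conf

# bio based polling: poll queues allocated, hipri=1 in the job file
modprobe -r nvme && modprobe nvme poll_queues=3
fio fio-poll.conf
```

The in-between combination (hipri=1 in fio but nvme.poll_queues=0)
should still complete via IRQ since the queue has no poll support, so
it matters which of the two setups the 'IRQ' numbers were taken with.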

Thanks for starting to run performance tests; so far I have only run
tests in KVM and haven't started performance testing yet.



thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16 11:00         ` [dm-devel] " JeffleXu
@ 2021-03-17  3:38           ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  3:38 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Tue, Mar 16, 2021 at 07:00:49PM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >> It is a giant progress to gather all split bios that need to be polled
> >> in a per-task queue. Still some comments below.
> >>
> >>
> >> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, store associated underlying bios into the
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>  include/linux/blk_types.h |   3 +
> >>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index a082bbc856fb..970b23fa2e6e 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_data data = {
> >>> +		.bio	=	bio,
> >>> +	};
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/* lock is required if there is more than one writer */
> >>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>> +		spin_lock(&pc->lock);
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +		spin_unlock(&pc->lock);
> >>> +	} else {
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * so we can save cookie into this bio after submit_bio().
> >>> +	 */
> >>> +	if (queued)
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +	else
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +
> >>> +	return queued;
> >>> +}
> >>
> >> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >> enqueued into the default hw queue, which is IRQ driven.
> > 
> > Yeah, this patch starts with 64 queue depth, and we can increase it to
> > 128, which should cover most of cases.
> > 
> >>
> >>
> >>> +
> >>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>> +{
> >>> +	bio->bi_iter.bi_private_data = cookie;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				blk_bio_poll_post_submit(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	int entries = kfifo_len(&pc->sq);
> >>> +
> >>> +	__submit_bio_noacct_int(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (kfifo_len(&pc->sq) > entries)
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>> +	return __submit_bio_noacct_int(bio, NULL);
> >>> +}
> >>> +
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 03f59915fe2c..4e6f1467d303 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_iter.bi_private_data;
> >>> +}
> >>> +
> >>> +static int blk_mq_poll_io(struct bio *bio)
> >>> +{
> >>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>> +	int ret = 0;
> >>> +
> >>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>> +		struct blk_mq_hw_ctx *hctx =
> >>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>> +
> >>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> >>> +	int ret = 0;
> >>> +	int i;
> >>> +
> >>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> >>> +		struct bio *bio = poll_data[i].bio;
> >>> +
> >>> +		if (!bio)
> >>> +			continue;
> >>> +
> >>> +		ret += blk_mq_poll_io(bio);
> >>> +		if (bio_flagged(bio, BIO_DONE)) {
> >>> +			poll_data[i].bio = NULL;
> >>> +
> >>> +			/* clear BIO_END_BY_POLL and end me really */
> >>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> >>> +			bio_endio(bio);
> >>> +		}
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>
> >> When there are multiple threads polling, saying thread A and thread B,
> >> then there's one bio which should be polled by thread A (the pid is
> >> passed to thread A), while it's actually completed by thread B. In this
> >> case, when the bio is completed by thread B, the bio is not really
> >> completed and one extra blk_poll() still needs to be called.
> > 
> > When this happens, the dm bio can't be completed, and the associated
> > kiocb can't be completed too, io_uring or other poll code context will
> > keep calling blk_poll() by passing thread A's pid until this dm bio is
> > done, since the dm bio is submitted from thread A.
> > 
> 
> This will affect the multi-thread polling performance. I tested
> dm-stripe, in which every bio will be split and enqueued into all
> underlying devices, and thus amplify the interference between multiple
> threads.
> 
> Test Result:
> IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> 
> 
> Test Environment:
> 
> nvme.poll_queues = 3
> 
> BLK_BIO_POLL_SQ_SZ = 128
> 
> dmsetup create testdev --table "0 629145600 striped 3 8 /dev/nvme0n1 0
> /dev/nvme1n1 0 /dev/nvme4n1 0"
> 
> 
> ```
> $cat fio.conf
> [global]
> name=iouring-sqpoll-iopoll-1
> ioengine=io_uring
> iodepth=128
> numjobs=1

If numjobs is 1, there can't be any queue interference, because there
is only one submission job.

If numjobs is 3 and nvme.poll_queues is 3, there are two cases:

1) each job is assigned a different hw queue, so there is no queue
interference

2) two jobs share the same hw queue, and there is queue interference

Which case you hit is decided by the scheduler (i.e. which CPUs the
jobs end up running on), since the mapping between CPUs and poll queues
is fixed.

You can compare the above two cases by passing different 'cpus_allowed'
to each job in fio.

> thread
> rw=randread
> direct=1
> hipri=1
> runtime=10
> time_based
> group_reporting
> randrepeat=0
> filename=/dev/mapper/testdev
> bs=12k
> 
> [job-1]
> cpus_allowed=14
> 
> [job-2]
> cpus_allowed=16
> 
> [job-3]
> cpus_allowed=84

It depends on whether CPUs 14, 16 and 84 are mapped to the same hw poll
queue or not.

If all 3 CPUs are mapped to the same hw poll queue, there will be lock
contention in the submission path (see nvme_submit_cmd()), and the hw
queue will also be polled concurrently from 3 CPUs.
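
A quick way to check that mapping is the blk-mq sysfs layout of the
underlying nvme disks, something like the sketch below (the poll hctxs
are typically the highest-numbered ones, after the default/read sets):

```
# which CPUs feed each hardware context of one underlying disk
for h in /sys/block/nvme0n1/mq/*; do
	printf '%s: ' "$(basename "$h")"
	cat "$h/cpu_list"
done
```

If CPUs 14, 16 and 84 all show up in the cpu_list of one poll hctx, the
three jobs end up contending on that single queue.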


Thanks,
Ming


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  3:38           ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  3:38 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer

On Tue, Mar 16, 2021 at 07:00:49PM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >> It is a giant progress to gather all split bios that need to be polled
> >> in a per-task queue. Still some comments below.
> >>
> >>
> >> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>> is very inefficient, and the big reason is that we can't pass bio
> >>> submission result to io poll task.
> >>>
> >>> In IO submission context, store associated underlying bios into the
> >>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>> driver's IO, which is submitted from FS.
> >>>
> >>> In IO poll context, the passed cookie tells us the PID of submission
> >>> context, and we can find the bio from that submission context. Moving
> >>> bio from submission queue to poll queue of the poll context, and keep
> >>> polling until these bios are ended. Remove bio from poll queue if the
> >>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>
> >>> Usually submission shares context with io poll. The per-task poll context
> >>> is just like stack variable, and it is cheap to move data between the two
> >>> per-task queues.
> >>>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/bio.c               |   5 ++
> >>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>  include/linux/blk_types.h |   3 +
> >>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/block/bio.c b/block/bio.c
> >>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>> --- a/block/bio.c
> >>> +++ b/block/bio.c
> >>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>   **/
> >>>  void bio_endio(struct bio *bio)
> >>>  {
> >>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>> +		bio_set_flag(bio, BIO_DONE);
> >>> +		return;
> >>> +	}
> >>>  again:
> >>>  	if (!bio_remaining_done(bio))
> >>>  		return;
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index a082bbc856fb..970b23fa2e6e 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>  		bio->bi_opf |= REQ_TAG;
> >>>  }
> >>>  
> >>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>> +{
> >>> +	struct blk_bio_poll_data data = {
> >>> +		.bio	=	bio,
> >>> +	};
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	unsigned int queued;
> >>> +
> >>> +	/* lock is required if there is more than one writer */
> >>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>> +		spin_lock(&pc->lock);
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +		spin_unlock(&pc->lock);
> >>> +	} else {
> >>> +		queued = kfifo_put(&pc->sq, data);
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>> +	 * so we can save cookie into this bio after submit_bio().
> >>> +	 */
> >>> +	if (queued)
> >>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>> +	else
> >>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>> +
> >>> +	return queued;
> >>> +}
> >>
> >> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >> enqueued into the default hw queue, which is IRQ driven.
> > 
> > Yeah, this patch starts with 64 queue depth, and we can increase it to
> > 128, which should cover most of cases.
> > 
> >>
> >>
> >>> +
> >>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>> +{
> >>> +	bio->bi_iter.bi_private_data = cookie;
> >>> +}
> >>> +
> >>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>  {
> >>>  	struct block_device *bdev = bio->bi_bdev;
> >>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>   */
> >>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>  {
> >>>  	struct bio_list bio_list_on_stack[2];
> >>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>  
> >>> -		ret = __submit_bio(bio);
> >>> +		if (ioc && queue_is_mq(q) &&
> >>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>> +
> >>> +			ret = __submit_bio(bio);
> >>> +			if (queued)
> >>> +				blk_bio_poll_post_submit(bio, ret);
> >>> +		} else {
> >>> +			ret = __submit_bio(bio);
> >>> +		}
> >>>  
> >>>  		/*
> >>>  		 * Sort new bios into those for a lower level and those for the
> >>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>> +		struct io_context *ioc)
> >>> +{
> >>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>> +	int entries = kfifo_len(&pc->sq);
> >>> +
> >>> +	__submit_bio_noacct_int(bio, ioc);
> >>> +
> >>> +	/* bio submissions queued to per-task poll context */
> >>> +	if (kfifo_len(&pc->sq) > entries)
> >>> +		return current->pid;
> >>> +
> >>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>> +	return 0;
> >>> +}
> >>> +
> >>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>> +{
> >>> +	struct io_context *ioc = current->io_context;
> >>> +
> >>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>> +
> >>> +	return __submit_bio_noacct_int(bio, NULL);
> >>> +}
> >>> +
> >>> +
> >>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>  {
> >>>  	struct bio_list bio_list[2] = { };
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 03f59915fe2c..4e6f1467d303 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>  	return ret;
> >>>  }
> >>>  
> >>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>> +{
> >>> +	return bio->bi_iter.bi_private_data;
> >>> +}
> >>> +
> >>> +static int blk_mq_poll_io(struct bio *bio)
> >>> +{
> >>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>> +	int ret = 0;
> >>> +
> >>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>> +		struct blk_mq_hw_ctx *hctx =
> >>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>> +
> >>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>> +
> >>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>> +{
> >>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> >>> +	int ret = 0;
> >>> +	int i;
> >>> +
> >>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> >>> +		struct bio *bio = poll_data[i].bio;
> >>> +
> >>> +		if (!bio)
> >>> +			continue;
> >>> +
> >>> +		ret += blk_mq_poll_io(bio);
> >>> +		if (bio_flagged(bio, BIO_DONE)) {
> >>> +			poll_data[i].bio = NULL;
> >>> +
> >>> +			/* clear BIO_END_BY_POLL and end me really */
> >>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> >>> +			bio_endio(bio);
> >>> +		}
> >>> +	}
> >>> +	return ret;
> >>> +}
> >>
> >> When there are multiple threads polling, saying thread A and thread B,
> >> then there's one bio which should be polled by thread A (the pid is
> >> passed to thread A), while it's actually completed by thread B. In this
> >> case, when the bio is completed by thread B, the bio is not really
> >> completed and one extra blk_poll() still needs to be called.
> > 
> > When this happens, the dm bio can't be completed, and the associated
> > kiocb can't be completed too, io_uring or other poll code context will
> > keep calling blk_poll() by passing thread A's pid until this dm bio is
> > done, since the dm bio is submitted from thread A.
> > 
> 
> This will affect the multi-thread polling performance. I tested
> dm-stripe, in which every bio will be split and enqueued into all
> underlying devices, and thus amplify the interference between multiple
> threads.
> 
> Test Result:
> IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> 
> 
> Test Environment:
> 
> nvme.poll_queues = 3
> 
> BLK_BIO_POLL_SQ_SZ = 128
> 
> dmsetup create testdev --table "0 629145600 striped 3 8 /dev/nvme0n1 0
> /dev/nvme1n1 0 /dev/nvme4n1 0"
> 
> 
> ```
> $cat fio.conf
> [global]
> name=iouring-sqpoll-iopoll-1
> ioengine=io_uring
> iodepth=128
> numjobs=1

If numjobs is 1, there can't be any queue interference, because there
is only one submission job.

If numjobs is 3 and nvme.poll_queues is 3, there are two cases:

1) each job is assigned a different hw queue, so there is no queue
interference

2) two jobs share the same hw queue, and there is queue interference

Which case you hit is decided by the scheduler (i.e. which CPUs the
jobs end up running on), since the mapping between CPUs and poll queues
is fixed.

You can compare the above two cases by passing different 'cpus_allowed'
to each job in fio.

> thread
> rw=randread
> direct=1
> hipri=1
> runtime=10
> time_based
> group_reporting
> randrepeat=0
> filename=/dev/mapper/testdev
> bs=12k
> 
> [job-1]
> cpus_allowed=14
> 
> [job-2]
> cpus_allowed=16
> 
> [job-3]
> cpus_allowed=84

It depends on whether CPUs 14, 16 and 84 are mapped to the same hw poll
queue or not.

If all 3 CPUs are mapped to the same hw poll queue, there will be lock
contention in the submission path (see nvme_submit_cmd()), and the hw
queue will also be polled concurrently from 3 CPUs.
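
A quick way to check that mapping is the blk-mq sysfs layout of the
underlying nvme disks, something like the sketch below (the poll hctxs
are typically the highest-numbered ones, after the default/read sets):

```
# which CPUs feed each hardware context of one underlying disk
for h in /sys/block/nvme0n1/mq/*; do
	printf '%s: ' "$(basename "$h")"
	cat "$h/cpu_list"
done
```

If CPUs 14, 16 and 84 all show up in the cpu_list of one poll hctx, the
three jobs end up contending on that single queue.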


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-16 11:00         ` [dm-devel] " JeffleXu
@ 2021-03-17  3:49           ` JeffleXu
  -1 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-17  3:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/16/21 7:00 PM, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>>> It is a giant progress to gather all split bios that need to be polled
>>> in a per-task queue. Still some comments below.
>>>
>>>
>>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>>> is very inefficient, and the big reason is that we can't pass bio
>>>> submission result to io poll task.
>>>>
>>>> In IO submission context, store associated underlying bios into the
>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>>> driver's IO, which is submitted from FS.
>>>>
>>>> In IO poll context, the passed cookie tells us the PID of submission
>>>> context, and we can find the bio from that submission context. Moving
>>>> bio from submission queue to poll queue of the poll context, and keep
>>>> polling until these bios are ended. Remove bio from poll queue if the
>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>>
>>>> Usually submission shares context with io poll. The per-task poll context
>>>> is just like stack variable, and it is cheap to move data between the two
>>>> per-task queues.
>>>>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> ---
>>>>  block/bio.c               |   5 ++
>>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>>  include/linux/blk_types.h |   3 +
>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>>   **/
>>>>  void bio_endio(struct bio *bio)
>>>>  {
>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>>> +		bio_set_flag(bio, BIO_DONE);
>>>> +		return;
>>>> +	}
>>>>  again:
>>>>  	if (!bio_remaining_done(bio))
>>>>  		return;
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index a082bbc856fb..970b23fa2e6e 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>>  		bio->bi_opf |= REQ_TAG;
>>>>  }
>>>>  
>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>>> +{
>>>> +	struct blk_bio_poll_data data = {
>>>> +		.bio	=	bio,
>>>> +	};
>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>> +	unsigned int queued;
>>>> +
>>>> +	/* lock is required if there is more than one writer */
>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>>> +		spin_lock(&pc->lock);
>>>> +		queued = kfifo_put(&pc->sq, data);
>>>> +		spin_unlock(&pc->lock);
>>>> +	} else {
>>>> +		queued = kfifo_put(&pc->sq, data);
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>>> +	 * so we can save cookie into this bio after submit_bio().
>>>> +	 */
>>>> +	if (queued)
>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>>> +	else
>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>>> +
>>>> +	return queued;
>>>> +}
>>>
>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>>> enqueued into the default hw queue, which is IRQ driven.
>>
>> Yeah, this patch starts with 64 queue depth, and we can increase it to
>> 128, which should cover most of cases.
>>
>>>
>>>
>>>> +
>>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>>> +{
>>>> +	bio->bi_iter.bi_private_data = cookie;
>>>> +}
>>>> +
>>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>>  {
>>>>  	struct block_device *bdev = bio->bi_bdev;
>>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>>   */
>>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>>>>  {
>>>>  	struct bio_list bio_list_on_stack[2];
>>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>>  
>>>> -		ret = __submit_bio(bio);
>>>> +		if (ioc && queue_is_mq(q) &&
>>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
>>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>>> +
>>>> +			ret = __submit_bio(bio);
>>>> +			if (queued)
>>>> +				blk_bio_poll_post_submit(bio, ret);
>>>> +		} else {
>>>> +			ret = __submit_bio(bio);
>>>> +		}
>>>>  
>>>>  		/*
>>>>  		 * Sort new bios into those for a lower level and those for the
>>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>>> +		struct io_context *ioc)
>>>> +{
>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>> +	int entries = kfifo_len(&pc->sq);
>>>> +
>>>> +	__submit_bio_noacct_int(bio, ioc);
>>>> +
>>>> +	/* bio submissions queued to per-task poll context */
>>>> +	if (kfifo_len(&pc->sq) > entries)
>>>> +		return current->pid;
>>>> +
>>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>> +{
>>>> +	struct io_context *ioc = current->io_context;
>>>> +
>>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>>> +
>>>> +	return __submit_bio_noacct_int(bio, NULL);
>>>> +}
>>>> +
>>>> +
>>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>>  {
>>>>  	struct bio_list bio_list[2] = { };
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 03f59915fe2c..4e6f1467d303 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
>>>> +{
>>>> +	return bio->bi_iter.bi_private_data;
>>>> +}
>>>> +
>>>> +static int blk_mq_poll_io(struct bio *bio)
>>>> +{
>>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>>> +		struct blk_mq_hw_ctx *hctx =
>>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>>> +
>>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
>>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>>> +{
>>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
>>>> +	int ret = 0;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
>>>> +		struct bio *bio = poll_data[i].bio;
>>>> +
>>>> +		if (!bio)
>>>> +			continue;
>>>> +
>>>> +		ret += blk_mq_poll_io(bio);
>>>> +		if (bio_flagged(bio, BIO_DONE)) {
>>>> +			poll_data[i].bio = NULL;
>>>> +
>>>> +			/* clear BIO_END_BY_POLL and end me really */
>>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
>>>> +			bio_endio(bio);
>>>> +		}
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>
>>> When there are multiple threads polling, saying thread A and thread B,
>>> then there's one bio which should be polled by thread A (the pid is
>>> passed to thread A), while it's actually completed by thread B. In this
>>> case, when the bio is completed by thread B, the bio is not really
>>> completed and one extra blk_poll() still needs to be called.
>>
>> When this happens, the dm bio can't be completed, and the associated
>> kiocb can't be completed too, io_uring or other poll code context will
>> keep calling blk_poll() by passing thread A's pid until this dm bio is
>> done, since the dm bio is submitted from thread A.
>>
> 
> This will affect the multi-thread polling performance. I tested
> dm-stripe, in which every bio will be split and enqueued into all
> underlying devices, and thus amplify the interference between multiple
> threads.
> 
> Test Result:
> IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain

Sorry, this performance drop is not related to the bio refcount issue
discussed here. It is still due to the limited kfifo size.


I did another thorough test on a different machine (aarch64 with more
nvme disks).

- Unless mentioned otherwise, the configuration is 'iodepth=128,
kfifo queue depth=128'.
- The number before '->' indicates the IOPS in IRQ mode, i.e.,
'hipri=0', while the number after '->' indicates the IOPS in polling
mode, i.e., 'hipri=1'.

```
3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
5.12-rc1: 667
leiming: 674 -> 849
ours 8353c1a: 623 -> 811

3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
5.12-rc1: 321
leiming: 313 -> 349
leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
ours 8353c1a: 310 -> 406


1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
5.12-rc1: 224
leiming:  218 -> 288
ours 8353c1a: 210 -> 280

1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
5.12-rc1: 109
leiming: 107 -> 120
leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
ours 8353c1a: 107 -> 146
```


Some hints:

1. When configured as 'iodepth=128, kfifo queue depth=128', dm-stripe
doesn't perform well in polling mode. That is because the original bio
is more likely to be split into several bios in dm-stripe, and thus the
kfifo is more likely to be used up (see the rough check below, after
hint 2). So the size of the kfifo needs to be tuned according to iodepth
and the IO load, and exporting it as a sysfs entry may be needed in a
following patch.

2. The numbers also indicate a performance drop of my patch in IRQ mode,
compared to the original 5.12-rc1. I suspect it's due to the extra code
mixed into blk-core, such as __submit_bio_noacct()...
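
As a rough sanity check of hint 1 (assuming each 12k bio on the
3-target stripe with its 8-sector chunks is split into three 4k bios):

```
# rough upper bound on split bios queued in the per-task kfifo
splits=3
for iodepth in 128 32; do
	echo "iodepth=$iodepth -> up to $((iodepth * splits)) split bios"
done
# iodepth=128 -> up to 384: overflows a 128-entry kfifo, so part of the
#                IO gets REQ_HIPRI cleared and falls back to IRQ
# iodepth=32  -> up to 96:  fits, which matches the 409K runs above
```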


> 
> 
> Test Environment:
> 
> nvme.poll_queues = 3
> 
> BLK_BIO_POLL_SQ_SZ = 128
> 
> dmsetup create testdev --table "0 629145600 striped 3 8 /dev/nvme0n1 0
> /dev/nvme1n1 0 /dev/nvme4n1 0"
> 
> 
> ```
> $cat fio.conf
> [global]
> name=iouring-sqpoll-iopoll-1
> ioengine=io_uring
> iodepth=128
> numjobs=1
> thread
> rw=randread
> direct=1
> hipri=1
> runtime=10
> time_based
> group_reporting
> randrepeat=0
> filename=/dev/mapper/testdev
> bs=12k
> 
> [job-1]
> cpus_allowed=14
> 
> [job-2]
> cpus_allowed=16
> 
> [job-3]
> cpus_allowed=84
> ```
> 

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  3:49           ` JeffleXu
  0 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-17  3:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer



On 3/16/21 7:00 PM, JeffleXu wrote:
> 
> 
> On 3/16/21 3:17 PM, Ming Lei wrote:
>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>>> It is a giant progress to gather all split bios that need to be polled
>>> in a per-task queue. Still some comments below.
>>>
>>>
>>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>>> is very inefficient, and the big reason is that we can't pass bio
>>>> submission result to io poll task.
>>>>
>>>> In IO submission context, store associated underlying bios into the
>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>>> driver's IO, which is submitted from FS.
>>>>
>>>> In IO poll context, the passed cookie tells us the PID of submission
>>>> context, and we can find the bio from that submission context. Moving
>>>> bio from submission queue to poll queue of the poll context, and keep
>>>> polling until these bios are ended. Remove bio from poll queue if the
>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>>
>>>> Usually submission shares context with io poll. The per-task poll context
>>>> is just like stack variable, and it is cheap to move data between the two
>>>> per-task queues.
>>>>
>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>> ---
>>>>  block/bio.c               |   5 ++
>>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>>  include/linux/blk_types.h |   3 +
>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/bio.c b/block/bio.c
>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>>> --- a/block/bio.c
>>>> +++ b/block/bio.c
>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>>   **/
>>>>  void bio_endio(struct bio *bio)
>>>>  {
>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>>> +		bio_set_flag(bio, BIO_DONE);
>>>> +		return;
>>>> +	}
>>>>  again:
>>>>  	if (!bio_remaining_done(bio))
>>>>  		return;
>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>> index a082bbc856fb..970b23fa2e6e 100644
>>>> --- a/block/blk-core.c
>>>> +++ b/block/blk-core.c
>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>>  		bio->bi_opf |= REQ_TAG;
>>>>  }
>>>>  
>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>>> +{
>>>> +	struct blk_bio_poll_data data = {
>>>> +		.bio	=	bio,
>>>> +	};
>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>> +	unsigned int queued;
>>>> +
>>>> +	/* lock is required if there is more than one writer */
>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>>> +		spin_lock(&pc->lock);
>>>> +		queued = kfifo_put(&pc->sq, data);
>>>> +		spin_unlock(&pc->lock);
>>>> +	} else {
>>>> +		queued = kfifo_put(&pc->sq, data);
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>>> +	 * so we can save cookie into this bio after submit_bio().
>>>> +	 */
>>>> +	if (queued)
>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>>> +	else
>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>>> +
>>>> +	return queued;
>>>> +}
>>>
>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>>> enqueued into the default hw queue, which is IRQ driven.
>>
>> Yeah, this patch starts with 64 queue depth, and we can increase it to
>> 128, which should cover most of cases.
>>
>>>
>>>
>>>> +
>>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
>>>> +{
>>>> +	bio->bi_iter.bi_private_data = cookie;
>>>> +}
>>>> +
>>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
>>>>  {
>>>>  	struct block_device *bdev = bio->bi_bdev;
>>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
>>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
>>>>   *	->submit_bio_bio, but that haven't been processed yet.
>>>>   */
>>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
>>>>  {
>>>>  	struct bio_list bio_list_on_stack[2];
>>>>  	blk_qc_t ret = BLK_QC_T_NONE;
>>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
>>>>  		bio_list_init(&bio_list_on_stack[0]);
>>>>  
>>>> -		ret = __submit_bio(bio);
>>>> +		if (ioc && queue_is_mq(q) &&
>>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
>>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
>>>> +
>>>> +			ret = __submit_bio(bio);
>>>> +			if (queued)
>>>> +				blk_bio_poll_post_submit(bio, ret);
>>>> +		} else {
>>>> +			ret = __submit_bio(bio);
>>>> +		}
>>>>  
>>>>  		/*
>>>>  		 * Sort new bios into those for a lower level and those for the
>>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
>>>> +		struct io_context *ioc)
>>>> +{
>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>> +	int entries = kfifo_len(&pc->sq);
>>>> +
>>>> +	__submit_bio_noacct_int(bio, ioc);
>>>> +
>>>> +	/* bio submissions queued to per-task poll context */
>>>> +	if (kfifo_len(&pc->sq) > entries)
>>>> +		return current->pid;
>>>> +
>>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
>>>> +{
>>>> +	struct io_context *ioc = current->io_context;
>>>> +
>>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
>>>> +		return __submit_bio_noacct_poll(bio, ioc);
>>>> +
>>>> +	return __submit_bio_noacct_int(bio, NULL);
>>>> +}
>>>> +
>>>> +
>>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
>>>>  {
>>>>  	struct bio_list bio_list[2] = { };
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 03f59915fe2c..4e6f1467d303 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
>>>> +{
>>>> +	return bio->bi_iter.bi_private_data;
>>>> +}
>>>> +
>>>> +static int blk_mq_poll_io(struct bio *bio)
>>>> +{
>>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
>>>> +		struct blk_mq_hw_ctx *hctx =
>>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
>>>> +
>>>> +		ret += blk_mq_poll_hctx(q, hctx);
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
>>>> +		struct blk_bio_poll_ctx *poll_ctx)
>>>> +{
>>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
>>>> +	int ret = 0;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
>>>> +		struct bio *bio = poll_data[i].bio;
>>>> +
>>>> +		if (!bio)
>>>> +			continue;
>>>> +
>>>> +		ret += blk_mq_poll_io(bio);
>>>> +		if (bio_flagged(bio, BIO_DONE)) {
>>>> +			poll_data[i].bio = NULL;
>>>> +
>>>> +			/* clear BIO_END_BY_POLL and end me really */
>>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
>>>> +			bio_endio(bio);
>>>> +		}
>>>> +	}
>>>> +	return ret;
>>>> +}
>>>
>>> When there are multiple threads polling, saying thread A and thread B,
>>> then there's one bio which should be polled by thread A (the pid is
>>> passed to thread A), while it's actually completed by thread B. In this
>>> case, when the bio is completed by thread B, the bio is not really
>>> completed and one extra blk_poll() still needs to be called.
>>
>> When this happens, the dm bio can't be completed, and the associated
>> kiocb can't be completed too, io_uring or other poll code context will
>> keep calling blk_poll() by passing thread A's pid until this dm bio is
>> done, since the dm bio is submitted from thread A.
>>
> 
> This will affect the multi-thread polling performance. I tested
> dm-stripe, in which every bio will be split and enqueued into all
> underlying devices, and thus amplify the interference between multiple
> threads.
> 
> Test Result:
> IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain

Sorry, this performance drop is not related to the bio refcount issue
discussed here. It is still due to the limited kfifo size.


I did another thorough test on a different machine (aarch64 with more
nvme disks).

- Unless mentioned otherwise, the configuration is 'iodepth=128,
kfifo queue depth=128'.
- The number before '->' indicates the IOPS in IRQ mode, i.e.,
'hipri=0', while the number after '->' indicates the IOPS in polling
mode, i.e., 'hipri=1'.

```
3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
5.12-rc1: 667
leiming: 674 -> 849
ours 8353c1a: 623 -> 811

3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
5.12-rc1: 321
leiming: 313 -> 349
leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
ours 8353c1a: 310 -> 406


1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
5.12-rc1: 224
leiming:  218 -> 288
ours 8353c1a: 210 -> 280

1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
5.12-rc1: 109
leiming: 107 -> 120
leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
ours 8353c1a: 107 -> 146
```


Some hints:

1. When configured as 'iodepth=128, kfifo queue depth=128', dm-stripe
doesn't perform well in polling mode. That is because the original bio
is more likely to be split into several bios in dm-stripe, and thus the
kfifo is more likely to be used up (see the rough check below, after
hint 2). So the size of the kfifo needs to be tuned according to iodepth
and the IO load, and exporting it as a sysfs entry may be needed in a
following patch.

2. The numbers also indicate a performance drop of my patch in IRQ mode,
compared to the original 5.12-rc1. I suspect it's due to the extra code
mixed into blk-core, such as __submit_bio_noacct()...
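
As a rough sanity check of hint 1 (assuming each 12k bio on the
3-target stripe with its 8-sector chunks is split into three 4k bios):

```
# rough upper bound on split bios queued in the per-task kfifo
splits=3
for iodepth in 128 32; do
	echo "iodepth=$iodepth -> up to $((iodepth * splits)) split bios"
done
# iodepth=128 -> up to 384: overflows a 128-entry kfifo, so part of the
#                IO gets REQ_HIPRI cleared and falls back to IRQ
# iodepth=32  -> up to 96:  fits, which matches the 409K runs above
```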


> 
> 
> Test Environment:
> 
> nvme.poll_queues = 3
> 
> BLK_BIO_POLL_SQ_SZ = 128
> 
> dmsetup create testdev --table "0 629145600 striped 3 8 /dev/nvme0n1 0
> /dev/nvme1n1 0 /dev/nvme4n1 0"
> 
> 
> ```
> $cat fio.conf
> [global]
> name=iouring-sqpoll-iopoll-1
> ioengine=io_uring
> iodepth=128
> numjobs=1
> thread
> rw=randread
> direct=1
> hipri=1
> runtime=10
> time_based
> group_reporting
> randrepeat=0
> filename=/dev/mapper/testdev
> bs=12k
> 
> [job-1]
> cpus_allowed=14
> 
> [job-2]
> cpus_allowed=16
> 
> [job-3]
> cpus_allowed=84
> ```
> 

-- 
Thanks,
Jeffle

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-17  2:54           ` [dm-devel] " Ming Lei
@ 2021-03-17  3:53             ` JeffleXu
  -1 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-17  3:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel



On 3/17/21 10:54 AM, Ming Lei wrote:
> On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
>>
>>
>> On 3/16/21 3:17 PM, Ming Lei wrote:
>>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>>>> It is a giant progress to gather all split bios that need to be polled
>>>> in a per-task queue. Still some comments below.
>>>>
>>>>
>>>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>>>> is very inefficient, and the big reason is that we can't pass bio
>>>>> submission result to io poll task.
>>>>>
>>>>> In IO submission context, store associated underlying bios into the
>>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>>>> driver's IO, which is submitted from FS.
>>>>>
>>>>> In IO poll context, the passed cookie tells us the PID of submission
>>>>> context, and we can find the bio from that submission context. Moving
>>>>> bio from submission queue to poll queue of the poll context, and keep
>>>>> polling until these bios are ended. Remove bio from poll queue if the
>>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>>>
>>>>> Usually submission shares context with io poll. The per-task poll context
>>>>> is just like stack variable, and it is cheap to move data between the two
>>>>> per-task queues.
>>>>>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>>  block/bio.c               |   5 ++
>>>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>>>  include/linux/blk_types.h |   3 +
>>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>>>   **/
>>>>>  void bio_endio(struct bio *bio)
>>>>>  {
>>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>>>> +		bio_set_flag(bio, BIO_DONE);
>>>>> +		return;
>>>>> +	}
>>>>>  again:
>>>>>  	if (!bio_remaining_done(bio))
>>>>>  		return;
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index a082bbc856fb..970b23fa2e6e 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>>>  		bio->bi_opf |= REQ_TAG;
>>>>>  }
>>>>>  
>>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>>>> +{
>>>>> +	struct blk_bio_poll_data data = {
>>>>> +		.bio	=	bio,
>>>>> +	};
>>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>>> +	unsigned int queued;
>>>>> +
>>>>> +	/* lock is required if there is more than one writer */
>>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>>>> +		spin_lock(&pc->lock);
>>>>> +		queued = kfifo_put(&pc->sq, data);
>>>>> +		spin_unlock(&pc->lock);
>>>>> +	} else {
>>>>> +		queued = kfifo_put(&pc->sq, data);
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>>>> +	 * so we can save cookie into this bio after submit_bio().
>>>>> +	 */
>>>>> +	if (queued)
>>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>>>> +	else
>>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>>>> +
>>>>> +	return queued;
>>>>> +}
>>>>
>>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>>>> enqueued into the default hw queue, which is IRQ driven.
>>>
>>> Yeah, this patch starts with 64 queue depth, and we can increase it to
>>> 128, which should cover most of cases.
>>
>> It seems that the queue depth of kfifo will affect the performance as I
>> did a fast test.
>>
>>
>>
>> Test Result:
>>
>> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
>> ------------------ | ------- | ----
>> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
>> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
>> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
>> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
>>
>> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
>> iodepth is quite large. But I don't know why the performance in IRQ mode
>> decreases when BLK_BIO_POLL_SQ_SZ is increased.
> 
> This patchset is supposed to not affect IRQ mode because HIPRI isn't set
> at IRQ mode. Or you mean '--hipri' & io_uring is setup but setting
> nvme.poll_queues as 0 at your 'IRQ' mode test?
> 
> Thanks for starting to run performance test, and so far I just run test
> in KVM, not start performance test yet.
> 

'IRQ' means 'hipri=0' in the fio configuration.

The above performance test was performed on an x86 machine with a
single nvme disk. I repeated the test on another aarch64 machine with
more nvme disks, and this performance drop didn't occur there...

Please see my reply in another thread for detailed test results.

-- 
Thanks,
Jeffle

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  3:53             ` JeffleXu
  0 siblings, 0 replies; 48+ messages in thread
From: JeffleXu @ 2021-03-17  3:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer



On 3/17/21 10:54 AM, Ming Lei wrote:
> On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
>>
>>
>> On 3/16/21 3:17 PM, Ming Lei wrote:
>>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
>>>> It is a giant progress to gather all split bios that need to be polled
>>>> in a per-task queue. Still some comments below.
>>>>
>>>>
>>>> On 3/16/21 11:15 AM, Ming Lei wrote:
>>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
>>>>> is very inefficient, and the big reason is that we can't pass bio
>>>>> submission result to io poll task.
>>>>>
>>>>> In IO submission context, store associated underlying bios into the
>>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
>>>>> and return current->pid to caller of submit_bio() for any DM or bio based
>>>>> driver's IO, which is submitted from FS.
>>>>>
>>>>> In IO poll context, the passed cookie tells us the PID of submission
>>>>> context, and we can find the bio from that submission context. Moving
>>>>> bio from submission queue to poll queue of the poll context, and keep
>>>>> polling until these bios are ended. Remove bio from poll queue if the
>>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
>>>>>
>>>>> Usually submission shares context with io poll. The per-task poll context
>>>>> is just like stack variable, and it is cheap to move data between the two
>>>>> per-task queues.
>>>>>
>>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>>>> ---
>>>>>  block/bio.c               |   5 ++
>>>>>  block/blk-core.c          |  74 +++++++++++++++++-
>>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
>>>>>  include/linux/blk_types.h |   3 +
>>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
>>>>>   **/
>>>>>  void bio_endio(struct bio *bio)
>>>>>  {
>>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
>>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
>>>>> +		bio_set_flag(bio, BIO_DONE);
>>>>> +		return;
>>>>> +	}
>>>>>  again:
>>>>>  	if (!bio_remaining_done(bio))
>>>>>  		return;
>>>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>>>> index a082bbc856fb..970b23fa2e6e 100644
>>>>> --- a/block/blk-core.c
>>>>> +++ b/block/blk-core.c
>>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
>>>>>  		bio->bi_opf |= REQ_TAG;
>>>>>  }
>>>>>  
>>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
>>>>> +{
>>>>> +	struct blk_bio_poll_data data = {
>>>>> +		.bio	=	bio,
>>>>> +	};
>>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
>>>>> +	unsigned int queued;
>>>>> +
>>>>> +	/* lock is required if there is more than one writer */
>>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
>>>>> +		spin_lock(&pc->lock);
>>>>> +		queued = kfifo_put(&pc->sq, data);
>>>>> +		spin_unlock(&pc->lock);
>>>>> +	} else {
>>>>> +		queued = kfifo_put(&pc->sq, data);
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
>>>>> +	 * so we can save cookie into this bio after submit_bio().
>>>>> +	 */
>>>>> +	if (queued)
>>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
>>>>> +	else
>>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
>>>>> +
>>>>> +	return queued;
>>>>> +}
>>>>
>>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
>>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
>>>> enqueued into the default hw queue, which is IRQ driven.
>>>
>>> Yeah, this patch starts with 64 queue depth, and we can increase it to
>>> 128, which should cover most of cases.
>>
>> It seems that the queue depth of kfifo will affect the performance as I
>> did a fast test.
>>
>>
>>
>> Test Result:
>>
>> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
>> ------------------ | ------- | ----
>> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
>> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
>> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
>> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
>>
>> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
>> iodepth is quite large. But I don't know why the performance in IRQ mode
>> decreases when BLK_BIO_POLL_SQ_SZ is increased.
> 
> This patchset is supposed to not affect IRQ mode because HIPRI isn't set
> at IRQ mode. Or you mean '--hipri' & io_uring is setup but setting
> nvme.poll_queues as 0 at your 'IRQ' mode test?
> 
> Thanks for starting to run performance test, and so far I just run test
> in KVM, not start performance test yet.
> 

'IRQ' means 'hipri=0' of fio configuration.
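
For reference, here is an illustrative fio job along these lines; the
exact job file isn't shown here, so the device path, runtime and numjobs
below are assumptions rather than the real configuration:

[hipri-test]
ioengine=io_uring
; hipri=1 gives the polling-mode numbers, hipri=0 the IRQ-mode numbers
hipri=1
direct=1
rw=randread
bs=4k
iodepth=128
numjobs=1
runtime=30
time_based=1
filename=/dev/mapper/dm_test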

The above performance test was performed on one x86 machine with one
single nvme disk. I did the test on another aarch64 machine with more
nvme disks, showing that this performance drop didn't occur there...

Please see my reply in another thread for detailed test results.

-- 
Thanks,
Jeffle

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-17  3:53             ` [dm-devel] " JeffleXu
@ 2021-03-17  6:54               ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  6:54 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Wed, Mar 17, 2021 at 11:53:12AM +0800, JeffleXu wrote:
> 
> 
> On 3/17/21 10:54 AM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 3/16/21 3:17 PM, Ming Lei wrote:
> >>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >>>> It is a giant progress to gather all split bios that need to be polled
> >>>> in a per-task queue. Still some comments below.
> >>>>
> >>>>
> >>>> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>>>> is very inefficient, and the big reason is that we can't pass bio
> >>>>> submission result to io poll task.
> >>>>>
> >>>>> In IO submission context, store associated underlying bios into the
> >>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>>>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>>>> driver's IO, which is submitted from FS.
> >>>>>
> >>>>> In IO poll context, the passed cookie tells us the PID of submission
> >>>>> context, and we can find the bio from that submission context. Moving
> >>>>> bio from submission queue to poll queue of the poll context, and keep
> >>>>> polling until these bios are ended. Remove bio from poll queue if the
> >>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>>>
> >>>>> Usually submission shares context with io poll. The per-task poll context
> >>>>> is just like stack variable, and it is cheap to move data between the two
> >>>>> per-task queues.
> >>>>>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>> ---
> >>>>>  block/bio.c               |   5 ++
> >>>>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>>>  include/linux/blk_types.h |   3 +
> >>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>>>> --- a/block/bio.c
> >>>>> +++ b/block/bio.c
> >>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>>>   **/
> >>>>>  void bio_endio(struct bio *bio)
> >>>>>  {
> >>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>>>> +		bio_set_flag(bio, BIO_DONE);
> >>>>> +		return;
> >>>>> +	}
> >>>>>  again:
> >>>>>  	if (!bio_remaining_done(bio))
> >>>>>  		return;
> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>> index a082bbc856fb..970b23fa2e6e 100644
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>>>  		bio->bi_opf |= REQ_TAG;
> >>>>>  }
> >>>>>  
> >>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>>>> +{
> >>>>> +	struct blk_bio_poll_data data = {
> >>>>> +		.bio	=	bio,
> >>>>> +	};
> >>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>>> +	unsigned int queued;
> >>>>> +
> >>>>> +	/* lock is required if there is more than one writer */
> >>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>>>> +		spin_lock(&pc->lock);
> >>>>> +		queued = kfifo_put(&pc->sq, data);
> >>>>> +		spin_unlock(&pc->lock);
> >>>>> +	} else {
> >>>>> +		queued = kfifo_put(&pc->sq, data);
> >>>>> +	}
> >>>>> +
> >>>>> +	/*
> >>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>>>> +	 * so we can save cookie into this bio after submit_bio().
> >>>>> +	 */
> >>>>> +	if (queued)
> >>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>>>> +	else
> >>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>>>> +
> >>>>> +	return queued;
> >>>>> +}
> >>>>
> >>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >>>> enqueued into the default hw queue, which is IRQ driven.
> >>>
> >>> Yeah, this patch starts with 64 queue depth, and we can increase it to
> >>> 128, which should cover most of cases.
> >>
> >> It seems that the queue depth of kfifo will affect the performance as I
> >> did a fast test.
> >>
> >>
> >>
> >> Test Result:
> >>
> >> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
> >> ------------------ | ------- | ----
> >> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
> >> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
> >> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
> >> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
> >>
> >> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
> >> iodepth is quite large. But I don't know why the performance in IRQ mode
> >> decreases when BLK_BIO_POLL_SQ_SZ is increased.
> > 
> > This patchset is supposed to not affect IRQ mode because HIPRI isn't set
> > at IRQ mode. Or you mean '--hipri' & io_uring is setup but setting
> > nvme.poll_queues as 0 at your 'IRQ' mode test?
> > 
> > Thanks for starting to run performance test, and so far I just run test
> > in KVM, not start performance test yet.
> > 
> 
> 'IRQ' means 'hipri=0' of fio configuration.

'hipri=0' isn't supposed to be affected by this patchset.
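
The gate is the REQ_HIPRI check in __submit_bio_noacct() from the patch
(copied here for reference, with a comment added): without REQ_HIPRI the
bio never enters the per-task poll context at all.

static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
{
	struct io_context *ioc = current->io_context;

	/* hipri=0 IO has no REQ_HIPRI set, so it stays on this path */
	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
		return __submit_bio_noacct_poll(bio, ioc);

	return __submit_bio_noacct_int(bio, NULL);
}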


thanks,
Ming


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  6:54               ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  6:54 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer

On Wed, Mar 17, 2021 at 11:53:12AM +0800, JeffleXu wrote:
> 
> 
> On 3/17/21 10:54 AM, Ming Lei wrote:
> > On Tue, Mar 16, 2021 at 04:52:36PM +0800, JeffleXu wrote:
> >>
> >>
> >> On 3/16/21 3:17 PM, Ming Lei wrote:
> >>> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >>>> It is a giant progress to gather all split bios that need to be polled
> >>>> in a per-task queue. Still some comments below.
> >>>>
> >>>>
> >>>> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>>>> is very inefficient, and the big reason is that we can't pass bio
> >>>>> submission result to io poll task.
> >>>>>
> >>>>> In IO submission context, store associated underlying bios into the
> >>>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>>>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>>>> driver's IO, which is submitted from FS.
> >>>>>
> >>>>> In IO poll context, the passed cookie tells us the PID of submission
> >>>>> context, and we can find the bio from that submission context. Moving
> >>>>> bio from submission queue to poll queue of the poll context, and keep
> >>>>> polling until these bios are ended. Remove bio from poll queue if the
> >>>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>>>
> >>>>> Usually submission shares context with io poll. The per-task poll context
> >>>>> is just like stack variable, and it is cheap to move data between the two
> >>>>> per-task queues.
> >>>>>
> >>>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>>> ---
> >>>>>  block/bio.c               |   5 ++
> >>>>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>>>  include/linux/blk_types.h |   3 +
> >>>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/block/bio.c b/block/bio.c
> >>>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>>>> --- a/block/bio.c
> >>>>> +++ b/block/bio.c
> >>>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>>>   **/
> >>>>>  void bio_endio(struct bio *bio)
> >>>>>  {
> >>>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>>>> +		bio_set_flag(bio, BIO_DONE);
> >>>>> +		return;
> >>>>> +	}
> >>>>>  again:
> >>>>>  	if (!bio_remaining_done(bio))
> >>>>>  		return;
> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>>> index a082bbc856fb..970b23fa2e6e 100644
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>>>  		bio->bi_opf |= REQ_TAG;
> >>>>>  }
> >>>>>  
> >>>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>>>> +{
> >>>>> +	struct blk_bio_poll_data data = {
> >>>>> +		.bio	=	bio,
> >>>>> +	};
> >>>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>>> +	unsigned int queued;
> >>>>> +
> >>>>> +	/* lock is required if there is more than one writer */
> >>>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>>>> +		spin_lock(&pc->lock);
> >>>>> +		queued = kfifo_put(&pc->sq, data);
> >>>>> +		spin_unlock(&pc->lock);
> >>>>> +	} else {
> >>>>> +		queued = kfifo_put(&pc->sq, data);
> >>>>> +	}
> >>>>> +
> >>>>> +	/*
> >>>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>>>> +	 * so we can save cookie into this bio after submit_bio().
> >>>>> +	 */
> >>>>> +	if (queued)
> >>>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>>>> +	else
> >>>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>>>> +
> >>>>> +	return queued;
> >>>>> +}
> >>>>
> >>>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >>>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >>>> enqueued into the default hw queue, which is IRQ driven.
> >>>
> >>> Yeah, this patch starts with 64 queue depth, and we can increase it to
> >>> 128, which should cover most of cases.
> >>
> >> It seems that the queue depth of kfifo will affect the performance as I
> >> did a fast test.
> >>
> >>
> >>
> >> Test Result:
> >>
> >> BLK_BIO_POLL_SQ_SZ | iodepth | IOPS
> >> ------------------ | ------- | ----
> >> 64                 | 128     | 301k (IRQ) -> 340k (iopoll)
> >> 64                 | 16      | 304k (IRQ) -> 392k (iopoll)
> >> 128                | 128     | 204k (IRQ) -> 317k (iopoll)
> >> 256                | 128     | 241k (IRQ) -> 391k (iopoll)
> >>
> >> It seems that BLK_BIO_POLL_SQ_SZ need to be increased accordingly when
> >> iodepth is quite large. But I don't know why the performance in IRQ mode
> >> decreases when BLK_BIO_POLL_SQ_SZ is increased.
> > 
> > This patchset is supposed to not affect IRQ mode because HIPRI isn't set
> > at IRQ mode. Or you mean '--hipri' & io_uring is setup but setting
> > nvme.poll_queues as 0 at your 'IRQ' mode test?
> > 
> > Thanks for starting to run performance test, and so far I just run test
> > in KVM, not start performance test yet.
> > 
> 
> 'IRQ' means 'hipri=0' of fio configuration.

'hipri=0' isn't supposed to be affected by this patchset.
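
The gate is the REQ_HIPRI check in __submit_bio_noacct() from the patch
(copied here for reference, with a comment added): without REQ_HIPRI the
bio never enters the per-task poll context at all.

static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
{
	struct io_context *ioc = current->io_context;

	/* hipri=0 IO has no REQ_HIPRI set, so it stays on this path */
	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
		return __submit_bio_noacct_poll(bio, ioc);

	return __submit_bio_noacct_int(bio, NULL);
}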


thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-17  3:49           ` [dm-devel] " JeffleXu
@ 2021-03-17  7:19             ` Ming Lei
  -1 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  7:19 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Mike Snitzer, dm-devel

On Wed, Mar 17, 2021 at 11:49:00AM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 7:00 PM, JeffleXu wrote:
> > 
> > 
> > On 3/16/21 3:17 PM, Ming Lei wrote:
> >> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >>> It is a giant progress to gather all split bios that need to be polled
> >>> in a per-task queue. Still some comments below.
> >>>
> >>>
> >>> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>>> is very inefficient, and the big reason is that we can't pass bio
> >>>> submission result to io poll task.
> >>>>
> >>>> In IO submission context, store associated underlying bios into the
> >>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>>> driver's IO, which is submitted from FS.
> >>>>
> >>>> In IO poll context, the passed cookie tells us the PID of submission
> >>>> context, and we can find the bio from that submission context. Moving
> >>>> bio from submission queue to poll queue of the poll context, and keep
> >>>> polling until these bios are ended. Remove bio from poll queue if the
> >>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>>
> >>>> Usually submission shares context with io poll. The per-task poll context
> >>>> is just like stack variable, and it is cheap to move data between the two
> >>>> per-task queues.
> >>>>
> >>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>> ---
> >>>>  block/bio.c               |   5 ++
> >>>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>>  include/linux/blk_types.h |   3 +
> >>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/block/bio.c b/block/bio.c
> >>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>>> --- a/block/bio.c
> >>>> +++ b/block/bio.c
> >>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>>   **/
> >>>>  void bio_endio(struct bio *bio)
> >>>>  {
> >>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>>> +		bio_set_flag(bio, BIO_DONE);
> >>>> +		return;
> >>>> +	}
> >>>>  again:
> >>>>  	if (!bio_remaining_done(bio))
> >>>>  		return;
> >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>> index a082bbc856fb..970b23fa2e6e 100644
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>>  		bio->bi_opf |= REQ_TAG;
> >>>>  }
> >>>>  
> >>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>>> +{
> >>>> +	struct blk_bio_poll_data data = {
> >>>> +		.bio	=	bio,
> >>>> +	};
> >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>> +	unsigned int queued;
> >>>> +
> >>>> +	/* lock is required if there is more than one writer */
> >>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>>> +		spin_lock(&pc->lock);
> >>>> +		queued = kfifo_put(&pc->sq, data);
> >>>> +		spin_unlock(&pc->lock);
> >>>> +	} else {
> >>>> +		queued = kfifo_put(&pc->sq, data);
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>>> +	 * so we can save cookie into this bio after submit_bio().
> >>>> +	 */
> >>>> +	if (queued)
> >>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>>> +	else
> >>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>>> +
> >>>> +	return queued;
> >>>> +}
> >>>
> >>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >>> enqueued into the default hw queue, which is IRQ driven.
> >>
> >> Yeah, this patch starts with 64 queue depth, and we can increase it to
> >> 128, which should cover most of cases.
> >>
> >>>
> >>>
> >>>> +
> >>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>>> +{
> >>>> +	bio->bi_iter.bi_private_data = cookie;
> >>>> +}
> >>>> +
> >>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>>  {
> >>>>  	struct block_device *bdev = bio->bi_bdev;
> >>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>>   */
> >>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>>  {
> >>>>  	struct bio_list bio_list_on_stack[2];
> >>>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>>  
> >>>> -		ret = __submit_bio(bio);
> >>>> +		if (ioc && queue_is_mq(q) &&
> >>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>>> +
> >>>> +			ret = __submit_bio(bio);
> >>>> +			if (queued)
> >>>> +				blk_bio_poll_post_submit(bio, ret);
> >>>> +		} else {
> >>>> +			ret = __submit_bio(bio);
> >>>> +		}
> >>>>  
> >>>>  		/*
> >>>>  		 * Sort new bios into those for a lower level and those for the
> >>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>>> +		struct io_context *ioc)
> >>>> +{
> >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>> +	int entries = kfifo_len(&pc->sq);
> >>>> +
> >>>> +	__submit_bio_noacct_int(bio, ioc);
> >>>> +
> >>>> +	/* bio submissions queued to per-task poll context */
> >>>> +	if (kfifo_len(&pc->sq) > entries)
> >>>> +		return current->pid;
> >>>> +
> >>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>> +{
> >>>> +	struct io_context *ioc = current->io_context;
> >>>> +
> >>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>>> +
> >>>> +	return __submit_bio_noacct_int(bio, NULL);
> >>>> +}
> >>>> +
> >>>> +
> >>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>>  {
> >>>>  	struct bio_list bio_list[2] = { };
> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>> index 03f59915fe2c..4e6f1467d303 100644
> >>>> --- a/block/blk-mq.c
> >>>> +++ b/block/blk-mq.c
> >>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>>> +{
> >>>> +	return bio->bi_iter.bi_private_data;
> >>>> +}
> >>>> +
> >>>> +static int blk_mq_poll_io(struct bio *bio)
> >>>> +{
> >>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>>> +		struct blk_mq_hw_ctx *hctx =
> >>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>>> +
> >>>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>>> +{
> >>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> >>>> +	int ret = 0;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> >>>> +		struct bio *bio = poll_data[i].bio;
> >>>> +
> >>>> +		if (!bio)
> >>>> +			continue;
> >>>> +
> >>>> +		ret += blk_mq_poll_io(bio);
> >>>> +		if (bio_flagged(bio, BIO_DONE)) {
> >>>> +			poll_data[i].bio = NULL;
> >>>> +
> >>>> +			/* clear BIO_END_BY_POLL and end me really */
> >>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> >>>> +			bio_endio(bio);
> >>>> +		}
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>>
> >>> When there are multiple threads polling, saying thread A and thread B,
> >>> then there's one bio which should be polled by thread A (the pid is
> >>> passed to thread A), while it's actually completed by thread B. In this
> >>> case, when the bio is completed by thread B, the bio is not really
> >>> completed and one extra blk_poll() still needs to be called.
> >>
> >> When this happens, the dm bio can't be completed, and the associated
> >> kiocb can't be completed too, io_uring or other poll code context will
> >> keep calling blk_poll() by passing thread A's pid until this dm bio is
> >> done, since the dm bio is submitted from thread A.
> >>
> > 
> > This will affect the multi-thread polling performance. I tested
> > dm-stripe, in which every bio will be split and enqueued into all
> > underlying devices, and thus amplify the interference between multiple
> > threads.
> > 
> > Test Result:
> > IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> 
> Sorry this performance drop is not related to this bio refcount issue
> here. Still it's due to the limited kfifo size.
> 
> 
> I did another through test on another machine (aarch64 with more nvme
> disks).
> 
> - Without mentioned specifically, the configuration is 'iodepth=128,
> kfifo queue depth =128'.
> - The number before '->' indicates the IOPS in IRQ mode, i.e.,
> 'hipri=0', while the number after '->' indicates the IOPS in polling
> mode, i.e., 'hipri=1'.
> 
> ```
> 3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
> 5.12-rc1: 667
> leiming: 674 -> 849
> ours 8353c1a: 623 -> 811
> 
> 3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
> 5.12-rc1: 321
> leiming: 313 -> 349
> leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
> leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
> ours 8353c1a: 310 -> 406
> 
> 
> 1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
> 5.12-rc1: 224
> leiming:  218 -> 288
> ours 8353c1a: 210 -> 280
> 
> 1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
> 5.12-rc1: 109
> leiming: 107 -> 120
> leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
> leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
> ours 8353c1a: 107 -> 146
> ```
> 
> 
> Some hints:
> 
> 1. When configured as 'iodepth=128, kfifo queue depth =128', dm-stripe
> doesn't perform well in polling mode. It's because it's more likely that
> the original bio will be split into split bios in dm-stripe, and thus
> kfifo will be more likely used up in this case. So the size of kfifo
> need to be tuned according to iodepth and the IO load. Thus exporting
> the size of kfifo as a sysfs entry may be need in the following patch.

Yeah, I think your analysis is right.

One simple approach to address the scalability issue is to put submitted
bios into a per-task list; however, that needs one new field (8 bytes)
added to the bio, or something like the options below:

1) disable hipri bio merging, so that we can reuse bio->bi_next (see the
sketch after this list)

or

2) track requests instead of bios, since it should be easier to reuse a
field in 'struct request' for this purpose, such as 'ipi_list'.

Option 2) seems possible; I will try it and see whether the approach is
really doable.
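
Just to make 1) concrete, here is a minimal sketch assuming hipri bios
are never merged so that bi_next is free to reuse; the struct layout and
the always-queue behaviour below are illustrative only, not part of this
series:

struct blk_bio_poll_ctx {
	spinlock_t	lock;
	struct bio_list	sq;	/* submitted hipri bios, chained via bi_next */
	struct bio_list	pq;	/* bios currently owned by the polling task */
};

static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
{
	struct blk_bio_poll_ctx *pc = ioc->data;

	/* lock is required if there is more than one writer */
	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
		spin_lock(&pc->lock);
		bio_list_add(&pc->sq, bio);
		spin_unlock(&pc->lock);
	} else {
		bio_list_add(&pc->sq, bio);
	}

	/* the list is unbounded, so we never fall back to IRQ completion */
	bio_set_flag(bio, BIO_END_BY_POLL);
	return true;
}

The poll side would then splice ->sq onto ->pq under the same lock,
instead of copying entries out of the kfifo.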


thanks, 
Ming


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-17  7:19             ` Ming Lei
  0 siblings, 0 replies; 48+ messages in thread
From: Ming Lei @ 2021-03-17  7:19 UTC (permalink / raw)
  To: JeffleXu
  Cc: Jens Axboe, linux-block, dm-devel, Christoph Hellwig, Mike Snitzer

On Wed, Mar 17, 2021 at 11:49:00AM +0800, JeffleXu wrote:
> 
> 
> On 3/16/21 7:00 PM, JeffleXu wrote:
> > 
> > 
> > On 3/16/21 3:17 PM, Ming Lei wrote:
> >> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> >>> It is a giant progress to gather all split bios that need to be polled
> >>> in a per-task queue. Still some comments below.
> >>>
> >>>
> >>> On 3/16/21 11:15 AM, Ming Lei wrote:
> >>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> >>>> is very inefficient, and the big reason is that we can't pass bio
> >>>> submission result to io poll task.
> >>>>
> >>>> In IO submission context, store associated underlying bios into the
> >>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> >>>> and return current->pid to caller of submit_bio() for any DM or bio based
> >>>> driver's IO, which is submitted from FS.
> >>>>
> >>>> In IO poll context, the passed cookie tells us the PID of submission
> >>>> context, and we can find the bio from that submission context. Moving
> >>>> bio from submission queue to poll queue of the poll context, and keep
> >>>> polling until these bios are ended. Remove bio from poll queue if the
> >>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> >>>>
> >>>> Usually submission shares context with io poll. The per-task poll context
> >>>> is just like stack variable, and it is cheap to move data between the two
> >>>> per-task queues.
> >>>>
> >>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>>> ---
> >>>>  block/bio.c               |   5 ++
> >>>>  block/blk-core.c          |  74 +++++++++++++++++-
> >>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> >>>>  include/linux/blk_types.h |   3 +
> >>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/block/bio.c b/block/bio.c
> >>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> >>>> --- a/block/bio.c
> >>>> +++ b/block/bio.c
> >>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> >>>>   **/
> >>>>  void bio_endio(struct bio *bio)
> >>>>  {
> >>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> >>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> >>>> +		bio_set_flag(bio, BIO_DONE);
> >>>> +		return;
> >>>> +	}
> >>>>  again:
> >>>>  	if (!bio_remaining_done(bio))
> >>>>  		return;
> >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>>> index a082bbc856fb..970b23fa2e6e 100644
> >>>> --- a/block/blk-core.c
> >>>> +++ b/block/blk-core.c
> >>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> >>>>  		bio->bi_opf |= REQ_TAG;
> >>>>  }
> >>>>  
> >>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> >>>> +{
> >>>> +	struct blk_bio_poll_data data = {
> >>>> +		.bio	=	bio,
> >>>> +	};
> >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>> +	unsigned int queued;
> >>>> +
> >>>> +	/* lock is required if there is more than one writer */
> >>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> >>>> +		spin_lock(&pc->lock);
> >>>> +		queued = kfifo_put(&pc->sq, data);
> >>>> +		spin_unlock(&pc->lock);
> >>>> +	} else {
> >>>> +		queued = kfifo_put(&pc->sq, data);
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> >>>> +	 * so we can save cookie into this bio after submit_bio().
> >>>> +	 */
> >>>> +	if (queued)
> >>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> >>>> +	else
> >>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> >>>> +
> >>>> +	return queued;
> >>>> +}
> >>>
> >>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> >>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> >>> enqueued into the default hw queue, which is IRQ driven.
> >>
> >> Yeah, this patch starts with 64 queue depth, and we can increase it to
> >> 128, which should cover most of cases.
> >>
> >>>
> >>>
> >>>> +
> >>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> >>>> +{
> >>>> +	bio->bi_iter.bi_private_data = cookie;
> >>>> +}
> >>>> +
> >>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> >>>>  {
> >>>>  	struct block_device *bdev = bio->bi_bdev;
> >>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> >>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> >>>>   *	->submit_bio_bio, but that haven't been processed yet.
> >>>>   */
> >>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> >>>>  {
> >>>>  	struct bio_list bio_list_on_stack[2];
> >>>>  	blk_qc_t ret = BLK_QC_T_NONE;
> >>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> >>>>  		bio_list_init(&bio_list_on_stack[0]);
> >>>>  
> >>>> -		ret = __submit_bio(bio);
> >>>> +		if (ioc && queue_is_mq(q) &&
> >>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> >>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> >>>> +
> >>>> +			ret = __submit_bio(bio);
> >>>> +			if (queued)
> >>>> +				blk_bio_poll_post_submit(bio, ret);
> >>>> +		} else {
> >>>> +			ret = __submit_bio(bio);
> >>>> +		}
> >>>>  
> >>>>  		/*
> >>>>  		 * Sort new bios into those for a lower level and those for the
> >>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> >>>> +		struct io_context *ioc)
> >>>> +{
> >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> >>>> +	int entries = kfifo_len(&pc->sq);
> >>>> +
> >>>> +	__submit_bio_noacct_int(bio, ioc);
> >>>> +
> >>>> +	/* bio submissions queued to per-task poll context */
> >>>> +	if (kfifo_len(&pc->sq) > entries)
> >>>> +		return current->pid;
> >>>> +
> >>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> >>>> +{
> >>>> +	struct io_context *ioc = current->io_context;
> >>>> +
> >>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> >>>> +		return __submit_bio_noacct_poll(bio, ioc);
> >>>> +
> >>>> +	return __submit_bio_noacct_int(bio, NULL);
> >>>> +}
> >>>> +
> >>>> +
> >>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> >>>>  {
> >>>>  	struct bio_list bio_list[2] = { };
> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>> index 03f59915fe2c..4e6f1467d303 100644
> >>>> --- a/block/blk-mq.c
> >>>> +++ b/block/blk-mq.c
> >>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> >>>> +{
> >>>> +	return bio->bi_iter.bi_private_data;
> >>>> +}
> >>>> +
> >>>> +static int blk_mq_poll_io(struct bio *bio)
> >>>> +{
> >>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> >>>> +		struct blk_mq_hw_ctx *hctx =
> >>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> >>>> +
> >>>> +		ret += blk_mq_poll_hctx(q, hctx);
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> >>>> +		struct blk_bio_poll_ctx *poll_ctx)
> >>>> +{
> >>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> >>>> +	int ret = 0;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> >>>> +		struct bio *bio = poll_data[i].bio;
> >>>> +
> >>>> +		if (!bio)
> >>>> +			continue;
> >>>> +
> >>>> +		ret += blk_mq_poll_io(bio);
> >>>> +		if (bio_flagged(bio, BIO_DONE)) {
> >>>> +			poll_data[i].bio = NULL;
> >>>> +
> >>>> +			/* clear BIO_END_BY_POLL and end me really */
> >>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> >>>> +			bio_endio(bio);
> >>>> +		}
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>>
> >>> When there are multiple threads polling, saying thread A and thread B,
> >>> then there's one bio which should be polled by thread A (the pid is
> >>> passed to thread A), while it's actually completed by thread B. In this
> >>> case, when the bio is completed by thread B, the bio is not really
> >>> completed and one extra blk_poll() still needs to be called.
> >>
> >> When this happens, the dm bio can't be completed, and the associated
> >> kiocb can't be completed too, io_uring or other poll code context will
> >> keep calling blk_poll() by passing thread A's pid until this dm bio is
> >> done, since the dm bio is submitted from thread A.
> >>
> > 
> > This will affect the multi-thread polling performance. I tested
> > dm-stripe, in which every bio will be split and enqueued into all
> > underlying devices, and thus amplify the interference between multiple
> > threads.
> > 
> > Test Result:
> > IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> 
> Sorry this performance drop is not related to this bio refcount issue
> here. Still it's due to the limited kfifo size.
> 
> 
> I did another through test on another machine (aarch64 with more nvme
> disks).
> 
> - Without mentioned specifically, the configuration is 'iodepth=128,
> kfifo queue depth =128'.
> - The number before '->' indicates the IOPS in IRQ mode, i.e.,
> 'hipri=0', while the number after '->' indicates the IOPS in polling
> mode, i.e., 'hipri=1'.
> 
> ```
> 3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
> 5.12-rc1: 667
> leiming: 674 -> 849
> ours 8353c1a: 623 -> 811
> 
> 3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
> 5.12-rc1: 321
> leiming: 313 -> 349
> leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
> leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
> ours 8353c1a: 310 -> 406
> 
> 
> 1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
> 5.12-rc1: 224
> leiming:  218 -> 288
> ours 8353c1a: 210 -> 280
> 
> 1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
> 5.12-rc1: 109
> leiming: 107 -> 120
> leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
> leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
> ours 8353c1a: 107 -> 146
> ```
> 
> 
> Some hints:
> 
> 1. When configured as 'iodepth=128, kfifo queue depth =128', dm-stripe
> doesn't perform well in polling mode. It's because it's more likely that
> the original bio will be split into split bios in dm-stripe, and thus
> kfifo will be more likely used up in this case. So the size of kfifo
> need to be tuned according to iodepth and the IO load. Thus exporting
> the size of kfifo as a sysfs entry may be need in the following patch.

Yeah, I think your analysis is right.

One simple approach to address the scalability issue is to put submitted
bios into a per-task list; however, that needs one new field (8 bytes)
added to the bio, or something like the options below:

1) disable hipri bio merging, so that we can reuse bio->bi_next (see the
sketch after this list)

or

2) track requests instead of bios, since it should be easier to reuse a
field in 'struct request' for this purpose, such as 'ipi_list'.

Option 2) seems possible; I will try it and see whether the approach is
really doable.
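
Just to make 1) concrete, here is a minimal sketch assuming hipri bios
are never merged so that bi_next is free to reuse; the struct layout and
the always-queue behaviour below are illustrative only, not part of this
series:

struct blk_bio_poll_ctx {
	spinlock_t	lock;
	struct bio_list	sq;	/* submitted hipri bios, chained via bi_next */
	struct bio_list	pq;	/* bios currently owned by the polling task */
};

static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
{
	struct blk_bio_poll_ctx *pc = ioc->data;

	/* lock is required if there is more than one writer */
	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
		spin_lock(&pc->lock);
		bio_list_add(&pc->sq, bio);
		spin_unlock(&pc->lock);
	} else {
		bio_list_add(&pc->sq, bio);
	}

	/* the list is unbounded, so we never fall back to IRQ completion */
	bio_set_flag(bio, BIO_END_BY_POLL);
	return true;
}

The poll side would then splice ->sq onto ->pq under the same lock,
instead of copying entries out of the kfifo.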


thanks, 
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
  2021-03-17  7:19             ` [dm-devel] " Ming Lei
@ 2021-03-18 14:51               ` Mike Snitzer
  -1 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2021-03-18 14:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: JeffleXu, Jens Axboe, linux-block, Christoph Hellwig, dm-devel

On Wed, Mar 17 2021 at  3:19am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Mar 17, 2021 at 11:49:00AM +0800, JeffleXu wrote:
> > 
> > 
> > On 3/16/21 7:00 PM, JeffleXu wrote:
> > > 
> > > 
> > > On 3/16/21 3:17 PM, Ming Lei wrote:
> > >> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> > >>> It is a giant progress to gather all split bios that need to be polled
> > >>> in a per-task queue. Still some comments below.
> > >>>
> > >>>
> > >>> On 3/16/21 11:15 AM, Ming Lei wrote:
> > >>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> > >>>> is very inefficient, and the big reason is that we can't pass bio
> > >>>> submission result to io poll task.
> > >>>>
> > >>>> In IO submission context, store associated underlying bios into the
> > >>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > >>>> and return current->pid to caller of submit_bio() for any DM or bio based
> > >>>> driver's IO, which is submitted from FS.
> > >>>>
> > >>>> In IO poll context, the passed cookie tells us the PID of submission
> > >>>> context, and we can find the bio from that submission context. Moving
> > >>>> bio from submission queue to poll queue of the poll context, and keep
> > >>>> polling until these bios are ended. Remove bio from poll queue if the
> > >>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > >>>>
> > >>>> Usually submission shares context with io poll. The per-task poll context
> > >>>> is just like stack variable, and it is cheap to move data between the two
> > >>>> per-task queues.
> > >>>>
> > >>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > >>>> ---
> > >>>>  block/bio.c               |   5 ++
> > >>>>  block/blk-core.c          |  74 +++++++++++++++++-
> > >>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> > >>>>  include/linux/blk_types.h |   3 +
> > >>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> > >>>>
> > >>>> diff --git a/block/bio.c b/block/bio.c
> > >>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> > >>>> --- a/block/bio.c
> > >>>> +++ b/block/bio.c
> > >>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> > >>>>   **/
> > >>>>  void bio_endio(struct bio *bio)
> > >>>>  {
> > >>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > >>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > >>>> +		bio_set_flag(bio, BIO_DONE);
> > >>>> +		return;
> > >>>> +	}
> > >>>>  again:
> > >>>>  	if (!bio_remaining_done(bio))
> > >>>>  		return;
> > >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> > >>>> index a082bbc856fb..970b23fa2e6e 100644
> > >>>> --- a/block/blk-core.c
> > >>>> +++ b/block/blk-core.c
> > >>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> > >>>>  		bio->bi_opf |= REQ_TAG;
> > >>>>  }
> > >>>>  
> > >>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_data data = {
> > >>>> +		.bio	=	bio,
> > >>>> +	};
> > >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> > >>>> +	unsigned int queued;
> > >>>> +
> > >>>> +	/* lock is required if there is more than one writer */
> > >>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> > >>>> +		spin_lock(&pc->lock);
> > >>>> +		queued = kfifo_put(&pc->sq, data);
> > >>>> +		spin_unlock(&pc->lock);
> > >>>> +	} else {
> > >>>> +		queued = kfifo_put(&pc->sq, data);
> > >>>> +	}
> > >>>> +
> > >>>> +	/*
> > >>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > >>>> +	 * so we can save cookie into this bio after submit_bio().
> > >>>> +	 */
> > >>>> +	if (queued)
> > >>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> > >>>> +	else
> > >>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > >>>> +
> > >>>> +	return queued;
> > >>>> +}
> > >>>
> > >>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> > >>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> > >>> enqueued into the default hw queue, which is IRQ driven.
> > >>
> > >> Yeah, this patch starts with 64 queue depth, and we can increase it to
> > >> 128, which should cover most of cases.
> > >>
> > >>>
> > >>>
> > >>>> +
> > >>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > >>>> +{
> > >>>> +	bio->bi_iter.bi_private_data = cookie;
> > >>>> +}
> > >>>> +
> > >>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > >>>>  {
> > >>>>  	struct block_device *bdev = bio->bi_bdev;
> > >>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> > >>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> > >>>>   *	->submit_bio_bio, but that haven't been processed yet.
> > >>>>   */
> > >>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> > >>>>  {
> > >>>>  	struct bio_list bio_list_on_stack[2];
> > >>>>  	blk_qc_t ret = BLK_QC_T_NONE;
> > >>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> > >>>>  		bio_list_init(&bio_list_on_stack[0]);
> > >>>>  
> > >>>> -		ret = __submit_bio(bio);
> > >>>> +		if (ioc && queue_is_mq(q) &&
> > >>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > >>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > >>>> +
> > >>>> +			ret = __submit_bio(bio);
> > >>>> +			if (queued)
> > >>>> +				blk_bio_poll_post_submit(bio, ret);
> > >>>> +		} else {
> > >>>> +			ret = __submit_bio(bio);
> > >>>> +		}
> > >>>>  
> > >>>>  		/*
> > >>>>  		 * Sort new bios into those for a lower level and those for the
> > >>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>>  	return ret;
> > >>>>  }
> > >>>>  
> > >>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > >>>> +		struct io_context *ioc)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> > >>>> +	int entries = kfifo_len(&pc->sq);
> > >>>> +
> > >>>> +	__submit_bio_noacct_int(bio, ioc);
> > >>>> +
> > >>>> +	/* bio submissions queued to per-task poll context */
> > >>>> +	if (kfifo_len(&pc->sq) > entries)
> > >>>> +		return current->pid;
> > >>>> +
> > >>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > >>>> +	return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>> +{
> > >>>> +	struct io_context *ioc = current->io_context;
> > >>>> +
> > >>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > >>>> +		return __submit_bio_noacct_poll(bio, ioc);
> > >>>> +
> > >>>> +	return __submit_bio_noacct_int(bio, NULL);
> > >>>> +}
> > >>>> +
> > >>>> +
> > >>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > >>>>  {
> > >>>>  	struct bio_list bio_list[2] = { };
> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > >>>> index 03f59915fe2c..4e6f1467d303 100644
> > >>>> --- a/block/blk-mq.c
> > >>>> +++ b/block/blk-mq.c
> > >>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> > >>>>  	return ret;
> > >>>>  }
> > >>>>  
> > >>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > >>>> +{
> > >>>> +	return bio->bi_iter.bi_private_data;
> > >>>> +}
> > >>>> +
> > >>>> +static int blk_mq_poll_io(struct bio *bio)
> > >>>> +{
> > >>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > >>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > >>>> +	int ret = 0;
> > >>>> +
> > >>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > >>>> +		struct blk_mq_hw_ctx *hctx =
> > >>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > >>>> +
> > >>>> +		ret += blk_mq_poll_hctx(q, hctx);
> > >>>> +	}
> > >>>> +	return ret;
> > >>>> +}
> > >>>> +
> > >>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > >>>> +		struct blk_bio_poll_ctx *poll_ctx)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> > >>>> +	int ret = 0;
> > >>>> +	int i;
> > >>>> +
> > >>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> > >>>> +		struct bio *bio = poll_data[i].bio;
> > >>>> +
> > >>>> +		if (!bio)
> > >>>> +			continue;
> > >>>> +
> > >>>> +		ret += blk_mq_poll_io(bio);
> > >>>> +		if (bio_flagged(bio, BIO_DONE)) {
> > >>>> +			poll_data[i].bio = NULL;
> > >>>> +
> > >>>> +			/* clear BIO_END_BY_POLL and end me really */
> > >>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> > >>>> +			bio_endio(bio);
> > >>>> +		}
> > >>>> +	}
> > >>>> +	return ret;
> > >>>> +}
> > >>>
> > >>> When there are multiple threads polling, saying thread A and thread B,
> > >>> then there's one bio which should be polled by thread A (the pid is
> > >>> passed to thread A), while it's actually completed by thread B. In this
> > >>> case, when the bio is completed by thread B, the bio is not really
> > >>> completed and one extra blk_poll() still needs to be called.
> > >>
> > >> When this happens, the dm bio can't be completed, and the associated
> > >> kiocb can't be completed too, io_uring or other poll code context will
> > >> keep calling blk_poll() by passing thread A's pid until this dm bio is
> > >> done, since the dm bio is submitted from thread A.
> > >>
> > > 
> > > This will affect the multi-thread polling performance. I tested
> > > dm-stripe, in which every bio will be split and enqueued into all
> > > underlying devices, and thus amplify the interference between multiple
> > > threads.
> > > 
> > > Test Result:
> > > IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> > 
> > Sorry this performance drop is not related to this bio refcount issue
> > here. Still it's due to the limited kfifo size.
> > 
> > 
> > I did another through test on another machine (aarch64 with more nvme
> > disks).
> > 
> > - Without mentioned specifically, the configuration is 'iodepth=128,
> > kfifo queue depth =128'.
> > - The number before '->' indicates the IOPS in IRQ mode, i.e.,
> > 'hipri=0', while the number after '->' indicates the IOPS in polling
> > mode, i.e., 'hipri=1'.
> > 
> > ```
> > 3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
> > 5.12-rc1: 667
> > leiming: 674 -> 849
> > ours 8353c1a: 623 -> 811
> > 
> > 3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
> > 5.12-rc1: 321
> > leiming: 313 -> 349
> > leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
> > leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
> > ours 8353c1a: 310 -> 406
> > 
> > 
> > 1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
> > 5.12-rc1: 224
> > leiming:  218 -> 288
> > ours 8353c1a: 210 -> 280
> > 
> > 1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
> > 5.12-rc1: 109
> > leiming: 107 -> 120
> > leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
> > leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
> > ours 8353c1a: 107 -> 146
> > ```
> > 
> > 
> > Some hints:
> > 
> > 1. When configured as 'iodepth=128, kfifo queue depth =128', dm-stripe
> > doesn't perform well in polling mode. It's because it's more likely that
> > the original bio will be split into split bios in dm-stripe, and thus
> > kfifo will be more likely used up in this case. So the size of kfifo
> > need to be tuned according to iodepth and the IO load. Thus exporting
> > the size of kfifo as a sysfs entry may be need in the following patch.
> 
> Yeah, I think your analysis is right.
> 
> On simple approach to address the scalability issue is to put submitted
> bio into a per-task list, however one new field(8bytes) needs to be
> added to bio, or something like below:
> 
> 1) disable hipri bio merge, then we can reuse bio->bi_next
> 
> or
> 
> 2) track request instead of bio, then it should be easier to get one
> field from 'struct request' for such purpose, such as 'ipi_list'.
> 
> Seems 2) is possible, will try it and see if the approach is really doable.

I'm not (yet) seeing how making the tracking (of either requests or bios)
per-task will help.  Though tracking in terms of requests reduces the
amount of polling (thanks to hoped-for merging, at least in the
sequential IO case), it doesn't _really_ make the task cookie ->
polled_object mapping any more efficient for the single-thread test case
Jeffle ran: the fan-out of bio splits for _random_ IO issued to a 3-way
dm-stripe is inherently messy to track.

Basically I'm just wondering where you see your per-task request-based
tracking approach helping.  Multithreaded sequential workloads?

It feels like having the poll cookie be a task id is just extremely
coarse; it doesn't really allow polling to be done more precisely... what
am I missing?

Thanks,
Mike


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dm-devel] [RFC PATCH 08/11] block: use per-task poll context to implement bio based io poll
@ 2021-03-18 14:51               ` Mike Snitzer
  0 siblings, 0 replies; 48+ messages in thread
From: Mike Snitzer @ 2021-03-18 14:51 UTC (permalink / raw)
  To: Ming Lei; +Cc: JeffleXu, Jens Axboe, dm-devel, linux-block, Christoph Hellwig

On Wed, Mar 17 2021 at  3:19am -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> On Wed, Mar 17, 2021 at 11:49:00AM +0800, JeffleXu wrote:
> > 
> > 
> > On 3/16/21 7:00 PM, JeffleXu wrote:
> > > 
> > > 
> > > On 3/16/21 3:17 PM, Ming Lei wrote:
> > >> On Tue, Mar 16, 2021 at 02:46:08PM +0800, JeffleXu wrote:
> > >>> It is a giant progress to gather all split bios that need to be polled
> > >>> in a per-task queue. Still some comments below.
> > >>>
> > >>>
> > >>> On 3/16/21 11:15 AM, Ming Lei wrote:
> > >>>> Currently bio based IO poll needs to poll all hw queue blindly, this way
> > >>>> is very inefficient, and the big reason is that we can't pass bio
> > >>>> submission result to io poll task.
> > >>>>
> > >>>> In IO submission context, store associated underlying bios into the
> > >>>> submission queue and save 'cookie' poll data in bio->bi_iter.bi_private_data,
> > >>>> and return current->pid to caller of submit_bio() for any DM or bio based
> > >>>> driver's IO, which is submitted from FS.
> > >>>>
> > >>>> In IO poll context, the passed cookie tells us the PID of submission
> > >>>> context, and we can find the bio from that submission context. Moving
> > >>>> bio from submission queue to poll queue of the poll context, and keep
> > >>>> polling until these bios are ended. Remove bio from poll queue if the
> > >>>> bio is ended. Add BIO_DONE and BIO_END_BY_POLL for such purpose.
> > >>>>
> > >>>> Usually submission shares context with io poll. The per-task poll context
> > >>>> is just like stack variable, and it is cheap to move data between the two
> > >>>> per-task queues.
> > >>>>
> > >>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > >>>> ---
> > >>>>  block/bio.c               |   5 ++
> > >>>>  block/blk-core.c          |  74 +++++++++++++++++-
> > >>>>  block/blk-mq.c            | 156 +++++++++++++++++++++++++++++++++++++-
> > >>>>  include/linux/blk_types.h |   3 +
> > >>>>  4 files changed, 235 insertions(+), 3 deletions(-)
> > >>>>
> > >>>> diff --git a/block/bio.c b/block/bio.c
> > >>>> index a1c4d2900c7a..bcf5eca0e8e3 100644
> > >>>> --- a/block/bio.c
> > >>>> +++ b/block/bio.c
> > >>>> @@ -1402,6 +1402,11 @@ static inline bool bio_remaining_done(struct bio *bio)
> > >>>>   **/
> > >>>>  void bio_endio(struct bio *bio)
> > >>>>  {
> > >>>> +	/* BIO_END_BY_POLL has to be set before calling submit_bio */
> > >>>> +	if (bio_flagged(bio, BIO_END_BY_POLL)) {
> > >>>> +		bio_set_flag(bio, BIO_DONE);
> > >>>> +		return;
> > >>>> +	}
> > >>>>  again:
> > >>>>  	if (!bio_remaining_done(bio))
> > >>>>  		return;
> > >>>> diff --git a/block/blk-core.c b/block/blk-core.c
> > >>>> index a082bbc856fb..970b23fa2e6e 100644
> > >>>> --- a/block/blk-core.c
> > >>>> +++ b/block/blk-core.c
> > >>>> @@ -854,6 +854,40 @@ static inline void blk_bio_poll_preprocess(struct request_queue *q,
> > >>>>  		bio->bi_opf |= REQ_TAG;
> > >>>>  }
> > >>>>  
> > >>>> +static bool blk_bio_poll_prep_submit(struct io_context *ioc, struct bio *bio)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_data data = {
> > >>>> +		.bio	=	bio,
> > >>>> +	};
> > >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> > >>>> +	unsigned int queued;
> > >>>> +
> > >>>> +	/* lock is required if there is more than one writer */
> > >>>> +	if (unlikely(atomic_read(&ioc->nr_tasks) > 1)) {
> > >>>> +		spin_lock(&pc->lock);
> > >>>> +		queued = kfifo_put(&pc->sq, data);
> > >>>> +		spin_unlock(&pc->lock);
> > >>>> +	} else {
> > >>>> +		queued = kfifo_put(&pc->sq, data);
> > >>>> +	}
> > >>>> +
> > >>>> +	/*
> > >>>> +	 * Now the bio is added per-task fifo, mark it as END_BY_POLL,
> > >>>> +	 * so we can save cookie into this bio after submit_bio().
> > >>>> +	 */
> > >>>> +	if (queued)
> > >>>> +		bio_set_flag(bio, BIO_END_BY_POLL);
> > >>>> +	else
> > >>>> +		bio->bi_opf &= ~(REQ_HIPRI | REQ_TAG);
> > >>>> +
> > >>>> +	return queued;
> > >>>> +}
> > >>>
> > >>> The size of kfifo is limited, and it seems that once the sq of kfifio is
> > >>> full, REQ_HIPRI flag is cleared and the corresponding bio is actually
> > >>> enqueued into the default hw queue, which is IRQ driven.
> > >>
> > >> Yeah, this patch starts with 64 queue depth, and we can increase it to
> > >> 128, which should cover most of cases.
> > >>
> > >>>
> > >>>
> > >>>> +
> > >>>> +static void blk_bio_poll_post_submit(struct bio *bio, blk_qc_t cookie)
> > >>>> +{
> > >>>> +	bio->bi_iter.bi_private_data = cookie;
> > >>>> +}
> > >>>> +
> > >>>>  static noinline_for_stack bool submit_bio_checks(struct bio *bio)
> > >>>>  {
> > >>>>  	struct block_device *bdev = bio->bi_bdev;
> > >>>> @@ -1008,7 +1042,7 @@ static blk_qc_t __submit_bio(struct bio *bio)
> > >>>>   * bio_list_on_stack[1] contains bios that were submitted before the current
> > >>>>   *	->submit_bio_bio, but that haven't been processed yet.
> > >>>>   */
> > >>>> -static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>> +static blk_qc_t __submit_bio_noacct_int(struct bio *bio, struct io_context *ioc)
> > >>>>  {
> > >>>>  	struct bio_list bio_list_on_stack[2];
> > >>>>  	blk_qc_t ret = BLK_QC_T_NONE;
> > >>>> @@ -1031,7 +1065,16 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>>  		bio_list_on_stack[1] = bio_list_on_stack[0];
> > >>>>  		bio_list_init(&bio_list_on_stack[0]);
> > >>>>  
> > >>>> -		ret = __submit_bio(bio);
> > >>>> +		if (ioc && queue_is_mq(q) &&
> > >>>> +				(bio->bi_opf & (REQ_HIPRI | REQ_TAG))) {
> > >>>> +			bool queued = blk_bio_poll_prep_submit(ioc, bio);
> > >>>> +
> > >>>> +			ret = __submit_bio(bio);
> > >>>> +			if (queued)
> > >>>> +				blk_bio_poll_post_submit(bio, ret);
> > >>>> +		} else {
> > >>>> +			ret = __submit_bio(bio);
> > >>>> +		}
> > >>>>  
> > >>>>  		/*
> > >>>>  		 * Sort new bios into those for a lower level and those for the
> > >>>> @@ -1057,6 +1100,33 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>>  	return ret;
> > >>>>  }
> > >>>>  
> > >>>> +static inline blk_qc_t __submit_bio_noacct_poll(struct bio *bio,
> > >>>> +		struct io_context *ioc)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_ctx *pc = ioc->data;
> > >>>> +	int entries = kfifo_len(&pc->sq);
> > >>>> +
> > >>>> +	__submit_bio_noacct_int(bio, ioc);
> > >>>> +
> > >>>> +	/* bio submissions queued to per-task poll context */
> > >>>> +	if (kfifo_len(&pc->sq) > entries)
> > >>>> +		return current->pid;
> > >>>> +
> > >>>> +	/* swapper's pid is 0, but it can't submit poll IO for us */
> > >>>> +	return 0;
> > >>>> +}
> > >>>> +
> > >>>> +static inline blk_qc_t __submit_bio_noacct(struct bio *bio)
> > >>>> +{
> > >>>> +	struct io_context *ioc = current->io_context;
> > >>>> +
> > >>>> +	if (ioc && ioc->data && (bio->bi_opf & REQ_HIPRI))
> > >>>> +		return __submit_bio_noacct_poll(bio, ioc);
> > >>>> +
> > >>>> +	return __submit_bio_noacct_int(bio, NULL);
> > >>>> +}
> > >>>> +
> > >>>> +
> > >>>>  static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
> > >>>>  {
> > >>>>  	struct bio_list bio_list[2] = { };
> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > >>>> index 03f59915fe2c..4e6f1467d303 100644
> > >>>> --- a/block/blk-mq.c
> > >>>> +++ b/block/blk-mq.c
> > >>>> @@ -3865,14 +3865,168 @@ static inline int blk_mq_poll_hctx(struct request_queue *q,
> > >>>>  	return ret;
> > >>>>  }
> > >>>>  
> > >>>> +static blk_qc_t bio_get_poll_cookie(struct bio *bio)
> > >>>> +{
> > >>>> +	return bio->bi_iter.bi_private_data;
> > >>>> +}
> > >>>> +
> > >>>> +static int blk_mq_poll_io(struct bio *bio)
> > >>>> +{
> > >>>> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> > >>>> +	blk_qc_t cookie = bio_get_poll_cookie(bio);
> > >>>> +	int ret = 0;
> > >>>> +
> > >>>> +	if (!bio_flagged(bio, BIO_DONE) && blk_qc_t_valid(cookie)) {
> > >>>> +		struct blk_mq_hw_ctx *hctx =
> > >>>> +			q->queue_hw_ctx[blk_qc_t_to_queue_num(cookie)];
> > >>>> +
> > >>>> +		ret += blk_mq_poll_hctx(q, hctx);
> > >>>> +	}
> > >>>> +	return ret;
> > >>>> +}
> > >>>> +
> > >>>> +static int blk_bio_poll_and_end_io(struct request_queue *q,
> > >>>> +		struct blk_bio_poll_ctx *poll_ctx)
> > >>>> +{
> > >>>> +	struct blk_bio_poll_data *poll_data = &poll_ctx->pq[0];
> > >>>> +	int ret = 0;
> > >>>> +	int i;
> > >>>> +
> > >>>> +	for (i = 0; i < BLK_BIO_POLL_PQ_SZ; i++) {
> > >>>> +		struct bio *bio = poll_data[i].bio;
> > >>>> +
> > >>>> +		if (!bio)
> > >>>> +			continue;
> > >>>> +
> > >>>> +		ret += blk_mq_poll_io(bio);
> > >>>> +		if (bio_flagged(bio, BIO_DONE)) {
> > >>>> +			poll_data[i].bio = NULL;
> > >>>> +
> > >>>> +			/* clear BIO_END_BY_POLL and end me really */
> > >>>> +			bio_clear_flag(bio, BIO_END_BY_POLL);
> > >>>> +			bio_endio(bio);
> > >>>> +		}
> > >>>> +	}
> > >>>> +	return ret;
> > >>>> +}
> > >>>
> > >>> When there are multiple threads polling, say thread A and thread B, there
> > >>> may be a bio that should be polled by thread A (the pid cookie is returned
> > >>> to thread A), while it is actually completed by thread B. In this case,
> > >>> when thread B completes the bio, the bio is not really ended, and one
> > >>> extra blk_poll() still needs to be called.
> > >>
> > >> When this happens, the dm bio can't be completed, and the associated
> > >> kiocb can't be completed either; io_uring or any other polling context
> > >> will keep calling blk_poll() with thread A's pid until this dm bio is
> > >> done, since the dm bio was submitted from thread A.
> > >>
> > > 
> > > This will affect multi-thread polling performance. I tested
> > > dm-stripe, in which every bio is split and enqueued into all
> > > underlying devices, which amplifies the interference between multiple
> > > threads.
> > > 
> > > Test Result:
> > > IOPS: 332k (IRQ) -> 363k (iopoll), aka ~10% performance gain
> > 
> > Sorry, this performance degradation is not related to the bio refcount issue
> > here; it's still due to the limited kfifo size.
> > 
> > 
> > I ran another thorough test on a different machine (aarch64 with more nvme
> > disks).
> > 
> > - Unless specified otherwise, the configuration is 'iodepth=128,
> > kfifo queue depth=128'.
> > - The number before '->' indicates the IOPS in IRQ mode, i.e.,
> > 'hipri=0', while the number after '->' indicates the IOPS in polling
> > mode, i.e., 'hipri=1'.
> > 
> > ```
> > 3-threads  dm-linear-3 targets (4k randread IOPS, unit K)
> > 5.12-rc1: 667
> > leiming: 674 -> 849
> > ours 8353c1a: 623 -> 811
> > 
> > 3-threads  dm-stripe-3 targets  (12k randread IOPS, unit K)
> > 5.12-rc1: 321
> > leiming: 313 -> 349
> > leiming : 313 -> 409 (iodepth=32, kfifo queue depth =128)
> > leiming : 314 -> 409 (iodepth=128, kfifo queue depth =512)
> > ours 8353c1a: 310 -> 406
> > 
> > 
> > 1-thread  dm-linear-3 targets  (4k randread IOPS, unit K)
> > 5.12-rc1: 224
> > leiming:  218 -> 288
> > ours 8353c1a: 210 -> 280
> > 
> > 1-threads  dm-stripe-3 targets (12k randread IOPS, unit K)
> > 5.12-rc1: 109
> > leiming: 107 -> 120
> > leiming : 107 -> 145 (iodepth=32, kfifo queue depth =128)
> > leiming : 108 -> 145 (iodepth=128, kfifo queue depth =512)
> > ours 8353c1a: 107 -> 146
> > ```
> > 
> > 
> > Some hints:
> > 
> > 1. When configured as 'iodepth=128, kfifo queue depth=128', dm-stripe
> > doesn't perform well in polling mode. That's because in dm-stripe the
> > original bio is more likely to be split into multiple bios, so the
> > kfifo is more likely to be used up. The size of the kfifo therefore
> > needs to be tuned according to iodepth and the IO load, and exporting
> > the kfifo size as a sysfs entry may be needed in a following patch
> > (see the sketch below).
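
A minimal sketch of such a knob, in blk-sysfs.c style, assuming a hypothetical
per-queue 'poll_fifo_depth' field that new per-task poll contexts would consult
at allocation time -- the field name, attribute name and rounding policy are
illustrative only, not part of the posted series:

```
static ssize_t queue_poll_fifo_depth_show(struct request_queue *q, char *page)
{
	return queue_var_show(q->poll_fifo_depth, page);
}

static ssize_t queue_poll_fifo_depth_store(struct request_queue *q,
					   const char *page, size_t count)
{
	unsigned long depth;
	ssize_t ret = queue_var_store(&depth, page, count);

	if (ret < 0)
		return ret;

	/* keep it a power of two so the kfifo allocation stays simple */
	q->poll_fifo_depth = roundup_pow_of_two(max_t(unsigned long, depth, 64));
	return ret;
}

QUEUE_RW_ENTRY(queue_poll_fifo_depth, "poll_fifo_depth");
```

The new entry would also need to be added to the queue_attrs[] array so it
shows up under /sys/block/<dev>/queue.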
> 
> Yeah, I think your analysis is right.
> 
> One simple approach to addressing the scalability issue is to put submitted
> bios into a per-task list; however, that requires adding one new field
> (8 bytes) to struct bio, or something like below:
>
> 1) disable hipri bio merging, then we can reuse bio->bi_next (see the
> sketch below)
>
> or
>
> 2) track requests instead of bios, then it should be easier to find a
> field in 'struct request' for this purpose, such as 'ipi_list'.
>
> Option 2) seems possible; I will try it and see if the approach is really doable.
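
For reference, a rough sketch of option 1), assuming HIPRI bio merging is
disabled so ->bi_next is free to reuse; the context layout and helper names
are illustrative rather than taken from the posted series, while
blk_mq_poll_io() and the BIO_END_BY_POLL/BIO_DONE flags are the ones
introduced earlier in this thread:

```
struct blk_bio_poll_ctx {
	spinlock_t	 lock;
	struct bio	*poll_list;	/* linked via bio->bi_next */
};

/* called from the submission path instead of kfifo_put() */
static void blk_bio_poll_add(struct blk_bio_poll_ctx *pc, struct bio *bio)
{
	spin_lock(&pc->lock);
	bio->bi_next = pc->poll_list;
	pc->poll_list = bio;
	spin_unlock(&pc->lock);
}

/* called from the poll path instead of walking pc->pq[] */
static int blk_bio_poll_list(struct blk_bio_poll_ctx *pc)
{
	struct bio *bio, *next;
	int ret = 0;

	/* detach the whole list so submitters don't block while we poll */
	spin_lock(&pc->lock);
	bio = pc->poll_list;
	pc->poll_list = NULL;
	spin_unlock(&pc->lock);

	for (; bio; bio = next) {
		next = bio->bi_next;
		bio->bi_next = NULL;

		ret += blk_mq_poll_io(bio);
		if (bio_flagged(bio, BIO_DONE)) {
			/* clear BIO_END_BY_POLL and end it for real */
			bio_clear_flag(bio, BIO_END_BY_POLL);
			bio_endio(bio);
		} else {
			/* not done yet, put it back for the next poll */
			blk_bio_poll_add(pc, bio);
		}
	}
	return ret;
}
```

Unlike the kfifo, such a list never fills up, so no HIPRI bio is silently
downgraded to the IRQ path; the cost is re-queueing the not-yet-done bios on
every poll pass.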

Not (yet) seeing how making the tracking (of either requests or bios)
per-task will help.  Though tracking in terms of requests reduces the amount
of polling (due to hoped-for merging, at least in the sequential IO case), it
doesn't _really_ make the task cookie -> polled_object mapping any more
efficient for the single-thread test case Jeffle ran: the fan-out of bio
splits for _random_ IO issued to a 3-way dm-stripe is inherently messy to
track.

Basically I'm just wondering where you see your per-task, request-based
tracking approach helping.  Multithreaded sequential workloads?

Feels like the poll cookie being a task id is just extremely coarse.
Doesn't really allow polling to be done more precisely... what am I
missing?
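
To make the question concrete, my reading of the cookie flow is roughly the
following -- a sketch of one way the pid-valued cookie could be consumed on
the poll side, not the posted code; find_get_task_by_vpid() and
put_task_struct() are existing kernel helpers, and blk_bio_poll_and_end_io()
is the helper quoted earlier in this thread:

```
static int blk_bio_poll(struct request_queue *q, blk_qc_t cookie)
{
	/* the cookie is just the submitting task's pid */
	struct task_struct *task = find_get_task_by_vpid((pid_t)cookie);
	struct io_context *ioc;
	int ret = 0;

	if (!task)
		return 0;

	/*
	 * Drain *everything* this task queued at submission time,
	 * regardless of which particular bio the caller is waiting on.
	 */
	ioc = task->io_context;
	if (ioc && ioc->data)
		ret = blk_bio_poll_and_end_io(q, ioc->data);

	put_task_struct(task);
	return ret;
}
```

Whatever the exact plumbing, the cookie only narrows polling down to
"everything this task currently has queued", which is the coarseness I'm
referring to.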

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel

