* [PATCH V5 00/12] block: support bio based io polling
@ 2021-04-01  2:19 Ming Lei
From: Ming Lei @ 2021-04-01  2:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Jeffle Xu, Mike Snitzer, dm-devel, Hannes Reinecke,
	Ming Lei

Hi Jens,

Add a per-task io poll context for holding HIPRI blk-mq/underlying bios
queued from a bio based driver's io submission context, and reuse one bio
padding field for storing the 'cookie' returned from submit_bio() for
these bios. These bios are also ended explicitly in the poll context, via
two new bio flags.

In this way, we no longer need to poll all underlying hw queues, as is
done in Jeffle's patches; instead we can poll just the hw queues in which
HIPRI IO is queued.

Usually io submission and io poll share the same context, so the added io
poll context data is just like a stack variable, and the cost of saving
bios is cheap.
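
To make the shape of this concrete, below is a minimal, hypothetical
sketch of the idea; 'struct blk_bio_poll_ctx', bio_set_poll_cookie() and
poll_one_cookie() are illustrative names, not the actual patch code:

#include <linux/bio.h>

/* Hypothetical per-task poll context (names are illustrative). */
struct blk_bio_poll_ctx {
	struct bio_list sq;		/* HIPRI bios queued by this task */
};

/* Submission side: stash the submit_bio() cookie in a bio padding field. */
static void blk_bio_poll_save(struct blk_bio_poll_ctx *pc,
			      struct bio *bio, blk_qc_t cookie)
{
	bio_set_poll_cookie(bio, cookie);	/* hypothetical helper */
	bio_list_add(&pc->sq, bio);
}

/* Poll side: visit only the hw queues that actually have HIPRI IO queued. */
static int blk_bio_poll_ctx_poll(struct blk_bio_poll_ctx *pc)
{
	struct bio *bio;
	int found = 0;

	bio_list_for_each(bio, &pc->sq)
		found += poll_one_cookie(bio);	/* hypothetical helper */
	return found;
}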

V5:
	- fix one use-after-free issue in case polling is done from another
	  context: add a new cookie, BLK_QC_T_NOT_READY, to prevent this
	  issue in patch 8/12
	- add reviewed-by & tested-by tags

V4:
	- cover one more test_bit(QUEUE_FLAG_POLL, ...) as suggested by
	  Jeffle (01/12)
	- drop the patch 'block: add helper of blk_create_io_context'
	- add new helper blk_create_io_poll_context() (03/12)
	- drain submission queues in exit_io_context(), as suggested by
	  Jeffle (08/13)
	- consider the shared io context case in blk_bio_poll_io_drain()
	  (08/13)
	- fix one issue in blk_bio_poll_pack_groups(), as suggested by
	  Jeffle (08/13)
	- add reviewed-by tag

V3:
	- fix the cookie returned for bio based drivers, as suggested by Jeffle Xu
	- drain pending bios when the submission context is exiting
	- patch style and comment fixes, as suggested by Mike
	- allow poll context data to be NULL by always polling on the submission queue
	- remove RFC, and add reviewed-by tags

V2:
	- address the queue depth scalability issue reported by Jeffle via a
	  bio group list: reuse .bi_end_io for linking bios which share the
	  same .bi_end_io, and support 32 such groups per submit queue. In
	  this way, the scalability issue caused by kfifo is solved. Before
	  really ending a bio, its .bi_end_io is recovered from the group
	  head.
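
A rough sketch of that grouping idea (structure and helper names here
are illustrative, not the patch code):

#include <linux/bio.h>

#define POLL_BIO_GRPS	32	/* groups supported per submit queue */

/* One group of bios that all share the same original ->bi_end_io. */
struct bio_grp_list_data {
	bio_end_io_t	*grp_fn;	/* the shared ->bi_end_io, saved once */
	struct bio_list	list;		/* bios belonging to this group */
};

/* Before really ending a bio, restore ->bi_end_io from the group head. */
static void bio_grp_end(struct bio_grp_list_data *grp, struct bio *bio)
{
	bio->bi_end_io = grp->grp_fn;
	bio_endio(bio);
}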



Jeffle Xu (4):
  block/mq: extract one helper function polling hw queue
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll_capable method to support bio-based IO polling
  dm: support IO polling for bio-based dm device

Ming Lei (8):
  block: add helper of blk_queue_poll
  block: add one helper to free io_context
  block: create io poll context for submission and poll task
  block: add req flag of REQ_POLL_CTX
  block: add new field into 'struct bvec_iter'
  block: prepare for supporting bio_list via other link
  block: use per-task poll context to implement bio based io polling
  blk-mq: limit hw queues to be polled in each blk_poll()

 block/bio.c                   |   5 +
 block/blk-core.c              | 258 ++++++++++++++++++++++++++--
 block/blk-ioc.c               |  15 +-
 block/blk-mq.c                | 308 +++++++++++++++++++++++++++++++++-
 block/blk-sysfs.c             |  16 +-
 block/blk.h                   |  58 +++++++
 drivers/md/dm-table.c         |  24 +++
 drivers/md/dm.c               |  14 ++
 drivers/nvme/host/core.c      |   2 +-
 include/linux/bio.h           | 132 ++++++++-------
 include/linux/blk_types.h     |  30 +++-
 include/linux/blkdev.h        |   4 +
 include/linux/bvec.h          |   8 +
 include/linux/device-mapper.h |   1 +
 include/linux/iocontext.h     |   2 +
 include/trace/events/kyber.h  |   6 +-
 16 files changed, 788 insertions(+), 95 deletions(-)

-- 
2.29.2


* [PATCH v5 00/12] dm: support polling
@ 2021-03-03 11:57 Jeffle Xu
From: Jeffle Xu @ 2021-03-03 11:57 UTC (permalink / raw)
  To: msnitzer, axboe
  Cc: io-uring, dm-devel, linux-block, mpatocka, caspar, joseph.qi

[Changes since v4]
- rebased to 5.12

- refactor patch 10 (fastpath) to fix the issue raised by Mikulas
  Patocka in [1].
When a bio doesn't get split and is submitted to *one* underlying device,
the polling routine will go into the fastpath. The refactored design is
to return the dev_t of the underlying device the bio was finally
submitted to as the returned cookie. The polling routine then calls
blkdev_get_no_open() to get the blkdev by the dev_t stored in the
cookie, and thus fetches the corresponding hw queue (a sketch follows
the links below).
One remaining issue is that blkdev_get_no_open() needs to acquire the
global lock @inode_hash_lock in ilookup(), and this may become a hot
spot as the number of polling threads increases. The RCU version (i.e.,
find_inode_by_ino_rcu()) could perhaps be used here as an optimization.

- call disk->fops->poll() recursively in patch 12 to fix the issue
  described in [2].

- update performance tests in [Performance Tests] chapter.

[1] https://listman.redhat.com/archives/dm-devel/2021-February/msg00255.html
[2] https://listman.redhat.com/archives/dm-devel/2021-February/msg00254.html
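
For illustration, the refactored fastpath lookup could look roughly like
this; dev_t_from_cookie() is a hypothetical decoder, and how the hw
queue is encoded alongside the dev_t is elided:

#include <linux/blkdev.h>

/* Sketch only: the cookie carries the dev_t of the underlying device. */
static int blk_bio_poll_fastpath(blk_qc_t cookie, bool spin)
{
	struct block_device *bdev;
	int ret;

	bdev = blkdev_get_no_open(dev_t_from_cookie(cookie));
	if (!bdev)
		return 0;

	/* Poll the underlying mq device the bio was finally submitted to. */
	ret = blk_poll(bdev->bd_disk->queue, cookie, spin);
	blkdev_put_no_open(bdev);
	return ret;
}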



[Intention]
Bio-based polling (e.g., for dm/md devices) is an indispensable part of
a high performance IO stack. As far as I know, dm (e.g., dm-stripe) is
widely used in databases, splicing several NVMe disks into one whole
disk in the hope of achieving better performance. With this patch set,
io_uring can be used on top of dm devices.


[Performance Tests]
1. fastpath (patch 10)
The polling routine goes into this path when the bio submitted to the dm
device is not split.

In this case, there will be only one bio submitted to only one polling
hw queue of one underlying mq device, and thus we don't need to track
all split bios or iterate through all polling hw queues. The pointer to
the polling hw queue the bio was submitted to is returned as the
cookie. The polling routine then calls mq_ops->poll() directly with the
hw queue converted from the input cookie.
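
In code terms, the fastpath is roughly the sketch below; how the hw
queue pointer is packed into and recovered from the cookie is elided:

#include <linux/blk-mq.h>

/* Sketch only: the cookie decodes straight to the hw queue to poll. */
static int blk_bio_poll_one_hctx(struct blk_mq_hw_ctx *hctx)
{
	/* No iterating over split bios or over all polling hw queues. */
	return hctx->queue->mq_ops->poll(hctx);
}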

- One process reading dm-linear (mapping to three underlying NVMe devices,
with one polling hw queue per NVMe device).

(ioengine=io_uring, iodepth=128, numjobs=1, rw=randread, sqthread_poll=0
direct=1, bs=4k)

              | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
------------- | --------------- | -------------------- | ----
with patchset |            214k |                 282k | ~32%


- Three processes reading dm-linear (mapping to three underlying NVMe
devices, with three polling hw queues per NVMe device), with every
process pinned to one CPU and mapped to one exclusive hw queue.

(ioengine=io_uring, iodepth=128, numjobs=3, rw=randread, sqthread_poll=0
direct=1, bs=4k)

              | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
------------- | --------------- | -------------------- | ----
with patchset |            629k |                 820k | ~30%



2. sub-fastpath (patch 11)

The polling routine goes into this path when the bio submitted to the dm
device gets split and enqueued into multiple hw queues, while the IO
submission process has not been migrated to another CPU.

In this case, the IO submission routine returns the number of the CPU on
which the IO submission happened as the cookie, and the polling routine
only iterates and polls the hw queues that this CPU maps to, instead of
iterating *all* hw queues.

This optimization can dramatically reduce cache ping-pong and thus
improve polling performance, when multiple polling hw queues per device
are reserved for multiple polling processes.
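
A sketch of the per-CPU mapping step, using the internal
blk_mq_map_queue_type() helper; iterating over the underlying devices of
the dm device is elided:

#include "blk-mq.h"	/* internal header for blk_mq_map_queue_type() */

/* Sketch only: poll just the poll hw queue that @submit_cpu maps to. */
static int blk_bio_poll_cpu(struct request_queue *q, unsigned int submit_cpu)
{
	struct blk_mq_hw_ctx *hctx =
		blk_mq_map_queue_type(q, HCTX_TYPE_POLL, submit_cpu);

	return hctx->queue->mq_ops->poll(hctx);
}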

- Three processes reading dm-stripe (mapping to three underlying NVMe
devices, with three polling hw queues per NVMe device), with every
process pinned to one CPU and mapped to one exclusive hw queue.

(ioengine=io_uring, iodepth=128, numjobs=3, rw=randread, sqthread_poll=0
direct=1, bs=12k(4k for every NVMe device))

              | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
------------- | --------------- | -------------------- | ----
with patchset |            311k |                 409k | ~32%



[Changes since v3]
- newly added patch 7 and patch 11 as a new optimization improving the
performance of multiple polling processes. Now multiple polling
processes can be as scalable as a single polling process (~30%).
Refer to the [Performance Tests] chapter above for more details.


[Changes since v2]

Patchset v2 cached all hw queues (in polling mode) of the underlying mq
devices in the dm layer. The polling routine actually iterated through
all these cached hw queues.

However, mq may change the queue mapping at runtime (e.g., on an NVMe
RESET command), so the hw queues cached in the dm layer may become
out-of-date. Thus patchset v3 falls back to the implementation of the
very first RFC version, in which the mq layer exports one interface for
iterating all polling hw queues (patch 5), and the bio-based polling
routine just calls this interface to iterate them.

Besides, several new optimizations are proposed.


- patch 1,2,7
same as v2, untouched

- patch 3
Following advice from Christoph Hellwig, the blk_poll() refactoring
splits the mq and bio-based polling routines from the very beginning.
Now blk_poll() is just a simple entry point. blk_bio_poll() is simply
copied from blk_mq_poll(), though the loop structure is some sort of
duplication.

- patch 4
This patch is newly added to support turning polling on/off dynamically
through '/sys/block/<dev>/queue/io_poll' for bio-based devices.
Patchset v2 implemented this functionality by adding one new queue flag,
which is not preferred since queue flags are quite a scarce resource
nowadays.
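
A hedged sketch of how such a store handler can gate the toggle;
blk_queue_poll_cap() stands in for whatever capability check the series
wires up (the real series uses a poll_capable method rather than a new
queue flag):

/* Sketch only, in the style of block/blk-sysfs.c. */
static ssize_t queue_poll_store(struct request_queue *q, const char *page,
				size_t count)
{
	unsigned long poll_on;
	ssize_t ret;

	ret = queue_var_store(&poll_on, page, count);
	if (ret < 0)
		return ret;

	if (!blk_queue_poll_cap(q))	/* hypothetical capability check */
		return -EINVAL;

	if (poll_on)
		blk_queue_flag_set(QUEUE_FLAG_POLL, q);
	else
		blk_queue_flag_clear(QUEUE_FLAG_POLL, q);

	return ret;
}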

- patch 5
This patch is newly added, preparing for the following bio-based
polling, which will call this helper function and account polling on
the corresponding hw queue.

- patch 6
It's from the very first RFC version, preparing for the following
bio-based polling.

- patch 8
One fix needed by the following bio-based polling. It's actually a v2
of [1]. I had sent that v2 individually in reply to [1], though it may
not have appeared on the mailing list yet, maybe due to a delay.

- patch 9
It's from the very first RFC version.

- patch 10
This patch is newly added. Patchset v2 had proposed one optimization:
skipping the **busy** hw queues during the iteration phase. Back then,
an 'atomic_t' flag was specifically maintained in the dm layer,
representing whether the corresponding hw queue is busy or not. The idea
is inherited, while the implementation changes: now @nvmeq->cq_poll_lock
is used directly, with no need for an extra flag anymore.

This optimization can significantly reduce the competition for one hw
queue between multiple polling instances. The following statistics are
the test results when 3 threads concurrently randread (bs=4k, direct=1)
one dm-linear device built upon 3 nvme devices, with one polling hw
queue per nvme device.

            | IOPS (IRQ mode) | IOPS (iopoll=1 mode) | diff
----------- | --------------- | -------------------- | ----
without opt |            318k |                 256k | ~-20%
with opt    |            314k |                 354k | ~13%
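
The change amounts to the following sketch of nvme_poll() in
drivers/nvme/host/pci.c (abridged; only the trylock conversion matters):

static int nvme_poll(struct blk_mq_hw_ctx *hctx)
{
	struct nvme_queue *nvmeq = hctx->driver_data;
	bool found;

	if (!nvme_cqe_pending(nvmeq))
		return 0;

	/* Queue already being reaped by another poller: treat it as busy. */
	if (!spin_trylock(&nvmeq->cq_poll_lock))
		return 0;
	found = nvme_process_cq(nvmeq);
	spin_unlock(&nvmeq->cq_poll_lock);

	return found;
}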
							

- patch 11
This is another newly added optimization for bio-based polling.

One intuitive insight is that, when the original bio submitted to the dm
device doesn't get split, it gets enqueued into only one hw queue of one
of the underlying mq devices. In this case, we no longer need to track
all split bios, and one cookie (for the only bio) is enough. This is
implemented by returning the pointer to the corresponding hw queue as
the cookie.

It should be safe to directly return the pointer to the hw queue, since
'struct blk_mq_hw_ctx' won't be freed during the whole lifetime of the
'struct request_queue'. Even when the number of hw queues decreases on
an NVMe RESET, the 'struct blk_mq_hw_ctx' structures of the removed hw
queues won't be freed; instead they are buffered in the
&q->unused_hctx_list list.

Though this optimization seems quite intuitive, the performance test
shows that it neither benefits nor harms performance, when 3 threads
concurrently randread (bs=4k, direct=1) one dm-linear device built upon
3 nvme devices, with one polling hw queue per nvme device.

I'm not sure why it doesn't help; maybe the number of devices, or the
depth of the device stack, is too low in my test case?


[Changes since v1]
- patches 1, 2 and 4 are the same as in v1 and have already been reviewed
- patch 3 is refactored a bit on the basis of suggestions from
Mike Snitzer
- patch 5 is newly added and introduces one new queue flag
representing whether the queue is capable of IO polling. This mainly
simplifies the logic in queue_poll_store().
- patch 6 implements the core mechanism supporting IO polling.
The sanity check of whether the dm device supports IO polling is
also folded into this patch, and the queue flag will be cleared if
it doesn't, e.g., in case of table reloading.




Jeffle Xu (12):
  block: move definition of blk_qc_t to types.h
  block: add queue_to_disk() to get gendisk from request_queue
  block: add poll method to support bio-based IO polling
  block: add poll_capable method to support bio-based IO polling
  blk-mq: extract one helper function polling hw queue
  blk-mq: add iterator for polling hw queues
  blk-mq: add one helper function getting hw queue
  dm: always return BLK_QC_T_NONE for bio-based device
  nvme/pci: don't wait for locked polling queue
  block: fastpath for bio-based polling
  block: sub-fastpath for bio-based polling
  dm: support IO polling for bio-based dm device

 block/blk-core.c              | 143 +++++++++++++++++++++++++++++++++-
 block/blk-mq.c                |  37 +++------
 block/blk-sysfs.c             |  14 +++-
 drivers/md/dm-table.c         |  26 +++++++
 drivers/md/dm.c               |  98 +++++++++++++++++++----
 drivers/nvme/host/pci.c       |   4 +-
 include/linux/blk-mq.h        |  23 ++++++
 include/linux/blk_types.h     |  96 ++++++++++++++++++++++-
 include/linux/blkdev.h        |   4 +
 include/linux/device-mapper.h |   1 +
 include/linux/fs.h            |   2 +-
 include/linux/types.h         |   3 +
 include/trace/events/kyber.h  |   6 +-
 13 files changed, 405 insertions(+), 52 deletions(-)

-- 
2.27.0




Thread overview: 47+ messages
2021-04-01  2:19 [PATCH V5 00/12] block: support bio based io polling Ming Lei
2021-04-01  2:19 ` [PATCH V5 01/12] block: add helper of blk_queue_poll Ming Lei
2021-04-01  2:19 ` [PATCH V5 02/12] block: add one helper to free io_context Ming Lei
2021-04-01  2:19 ` [PATCH V5 03/12] block: create io poll context for submission and poll task Ming Lei
2021-04-12 10:19   ` Christoph Hellwig
2021-04-01  2:19 ` [PATCH V5 04/12] block: add req flag of REQ_POLL_CTX Ming Lei
2021-04-01  2:19 ` [PATCH V5 05/12] block: add new field into 'struct bvec_iter' Ming Lei
2021-04-12  9:26   ` Christoph Hellwig
2021-04-13  9:36     ` Ming Lei
2021-04-01  2:19 ` [PATCH V5 06/12] block/mq: extract one helper function polling hw queue Ming Lei
2021-04-12  9:29   ` Christoph Hellwig
2021-04-01  2:19 ` [PATCH V5 07/12] block: prepare for supporting bio_list via other link Ming Lei
2021-04-12 10:18   ` Christoph Hellwig
2021-04-12 11:37     ` Ming Lei
2021-04-01  2:19 ` [PATCH V5 08/12] block: use per-task poll context to implement bio based io polling Ming Lei
2021-04-12  9:54   ` Christoph Hellwig
2021-04-12 10:20     ` Ming Lei
2021-04-12 10:29       ` Christoph Hellwig
2021-04-12 11:42         ` Ming Lei
2021-04-12 10:16   ` Christoph Hellwig
2021-04-12 10:37     ` Ming Lei
2021-04-01  2:19 ` [PATCH V5 09/12] blk-mq: limit hw queues to be polled in each blk_poll() Ming Lei
2021-04-01  2:19 ` [PATCH V5 10/12] block: add queue_to_disk() to get gendisk from request_queue Ming Lei
2021-04-12 12:52   ` Jens Axboe
2021-04-01  2:19 ` [PATCH V5 11/12] block: add poll_capable method to support bio-based IO polling Ming Lei
2021-04-12  9:38   ` Christoph Hellwig
2021-04-14  8:38     ` JeffleXu
2021-04-14 11:24       ` Ming Lei
2021-04-15  1:34         ` JeffleXu
2021-04-15  7:43           ` Ming Lei
2021-04-15  9:21             ` JeffleXu
2021-04-15 10:06               ` Ming Lei
2021-04-15 11:21                 ` JeffleXu
2021-04-15 13:08                   ` Ming Lei
2021-04-16  8:00   ` [PATCH] block: introduce QUEUE_FLAG_POLL_CAP flag Jeffle Xu
2021-04-16  8:42     ` JeffleXu
2021-04-16  9:07     ` Ming Lei
2021-04-16 10:20       ` JeffleXu
2021-04-17 14:06       ` JeffleXu
2021-04-19  2:21         ` Ming Lei
2021-04-19  5:40           ` JeffleXu
2021-04-19 13:36             ` Ming Lei
2021-04-20  7:25               ` JeffleXu
2021-04-01  2:19 ` [PATCH V5 12/12] dm: support IO polling for bio-based dm device Ming Lei
2021-04-09 15:39 ` [PATCH V5 00/12] block: support bio based io polling Ming Lei
2021-04-12  9:46 ` Christoph Hellwig
  -- strict thread matches above, loose matches on Subject: below --
2021-03-03 11:57 [PATCH v5 00/12] dm: support polling Jeffle Xu
2021-03-03 11:57 ` [PATCH v5 12/12] dm: support IO polling for bio-based dm device Jeffle Xu
